Wednesday, June 9, 2010

Design for optimizability

Only optimize when you found a bottleneck by measuring. That's what we all have been taught, right? It's a good rule. But is it really right? What if you're designing a performance critical application? Should you really be ignorant about performance until you measured and found a bottleneck?

I say: don't optimize, but plan for optimizability. I think Dwight D. Eisenhower said it best, though
Plans are nothing; planning is everything.

And now a disclaimer. I am not encouraging over-design in this post. Neither am I proposing that you should analyze every silly detail of your application domain before you start coding. I encourage you to start hacking on you application without worrying about performance, because you know (if you follow the tips that follows and use your brain) that you can optimize it later. Knowing that you can rewrite that slow and memory-hungry Ruby script into a snappy C program (that don't use any heap memory at all) makes it so much easier to sleep at night. Believe me.

Ok, now let's get going!

Caring about planning

Planning for optimizability means that you make sure that:
  • seams are placed to allow optimizing,
  • system-wide concerns are encapsulated,
  • it's possible to do design lowering later on,
  • algorithmic decisions are delayed,
  • intelligence can be moved from data to code to data,
  • it's possible to move run-time aspects to compile-time,
  • you know your tools.
We will now go through the first three of these bullets one by one and discuss them in detail. I will discuss the remaining four in a later post.

Seams and optimization borders

A seams in software design is a some-what vaguely defined concept. If two components of a software design are loosely coupled, we say that there is a seam between them. In other words, seams separate the design into parts that can vary independently of each other. Inheritance and dependency injection is often used to achieve this in object-oriented systems; function pointers play a similar role in procedural programming.

Since planning for optimizability means that we wish to make it easy to replace slow code with faster, we thus should place seams in the design such that they surround potentially slow code. This implies that design seams also are optimization borders; borders over which optimizations cannot easily be done without affecting other parts of the system.

This doesn't mean that it's impossible to optimize over optimization borders, just that other parts of the system will be affected by it. Thus, optimizing over optimization boarders is harder than optimizing within them. Let's take an example.

Assume that we iterate over a a large list in a performance critical part of our application:
void Foo::iterateOverLargeList(std::list<Thing> l) {
for (int i = 0; i < l.size(); i++)
l[i].doEasyCalculation();
}
Where should we place the design seams in this code? Or, more concretely, how do we use virtual functions to enable us to optimize this piece of code as much as possible without affecting the surrounding code? As I see it, we have the following alternatives:
  1. make Thing::doEasyCalculation virtual,
  2. make Foo::iterateOverLargeList virtual, or
  3. encapsulate std::list<Thing> in a new class ThingList, move iterateOverLargeList to ThingList, and make it virtual.
Of these three alternatives 3) gives us the most freedom to increase the performance of the loop, because it allows us to do not only optimize how we iterate, but also change the underlying data structure. For instance, changing from std::list (a linked list) to std::vector (essentially a linear array) will make the access pattern linear, thus the (L1, L2, L3) cache can do a better job.

However, 3) is also the option that is the hardest to refactor to, because we have to change every part of the application that uses the std::list<Thing> list. So, if our application is large, this may involve a lot of refactoring work. Thus, if we choose 3) early in the development we don't have to refactor so much later. In other words, early understanding of the domain, the application's use-cases, and the application design, is important for us when designing for optimizability.

This example touches on another important tool when designing for optimizability: encapsulation. We will now discuss this in greater detail.

Concerning encapsulation

One of the worst kind of optimizations you can do is to optimize something that affect the entire system, e.g., the data flow. A global property like is a system-wide concern and having the system's performance depend on such property is very very bad. We need to encapsulate these global properties to make sure that there are design seams that enable us to optimize them when we need to.

This means that we need to design our code such that if we need to optimize something, that something is encapsulated into a single class (or few classes). This does not mean that we should encapsulate every little petty detail, though. It does mean, however, that we need to put a bit more effort in understanding our application, and then think about its design, data flow, etc. This is actually something we should do anyway even if we're not designing for optimizability, because understand the current design of the application make it easier to improve it.

When we got a firm understanding of the application and how it will be used, then we can identify the system-wide concerns, and only then can we start to encapsulate them to make them possible to optimize in the future.

Avoid memory copying

Reading data from memory is one of the most costly thing a modern CPU can do. If the data is in (L1) cache reading memory is fast (3-5 cycles or so), but if the data needs to be read from main memory it takes 100 times longer. That's right: 500 cycles to read a single bit of data. Writing data to main memory is also very costly operation, thus, copying is extremely costly operation.

With this in mind, it may be tempting to return a pointer to a shared data buffer instead of returning a copy of that data. For example:
class Thingie {
int* stuff;
// constructor that initializes 'stuff'.
int* getStuff() { return stuff; }
};
This may look efficient since there is no memory copying going on. However, this design is horrible with regard to encapsulation. For instance, what if we need to modify the returned data in some other part of the program without affecting the Thing instance? In that case we need to make a copy and modify the copied data. But what if that copied data is returned just like the int* stuff above? And what if it needs to be modified in some other part of the program? Then we need to make another copy!

In situations like this, you need to take a step back and look at the data flow through your application and then try to optimize it to minimize the number of copies. This is not an easy task, so we really wish to avoid doing it, or at least make it easier for us to do. Is there some way to design our application (with little extra effort) from the start to make optimizations like this possible (without huge redesigns)?

In the example above, the system-wide concern is how the int* stuff data is passed around in the application and how to make sure that it is not copied unnecessarily.

Let's rewrite this example a bit into:
class Thingie2 {
MyBuffer stuff;
// constructor that initializes 'stuff'.
MyBuffer getStuff() { return stuff; }
};
Since the data now is encapsulated in MyBuffer (and the entire application uses MyBuffer to refer the the data), we can now implement a lot of optimizations that affect the data-flow of the application. For instance, we can implement copy-on-write semantics and reference counting; this would be extremely hard to do without using MyBuffer to encapsulate the data-flow, which is a system-wide concern.

Design lowering

The term instruction lowering is an optimization technique used in compilers to generate more efficient code. For example, the C code a = 16 * b; can be implemented (in pseudo-assembler) by a multiplying operation A = MULT(16, B) but also by a simpler and faster shift operation A = SHL(4, B).

With the term design lowering I mean a similar concept: reimplementing the design in a simpler and faster language or platform, or using faster language constructs or data structures. Examples of such design lowering include:
  • rewrite Python code in C,
  • replace reflection-heavy Java code with simpler non-reflection code,
  • using a simple array instead of a hash map or a linked list,
  • using the memory layout of C structs as data format instead of XML in files, over network, etc
  • use hardware acceleration like graphics cards.
Design lowering can easily increase the performance 10-100 times or even much more. Lets take an example of this.

A story of left and right

Imagine two small program left and right; left generates data and sends it to right, which sums the received data. Here is the first version of this pair of programs implemented in Python:
--- left.py ---
a = { "first":[1, 123], "middle":[1, 2, 456], "last":[1, 2, 3, 789] }
for i in xrange(0, 1000000):
print a

--- right.py ---
import sys
total = 0
for line in sys.stdin:
a = eval(line)
total = total + sum(a["first"]) + sum(a["middle"]) + sum(a["last"])
print total
Translating this Python code into C++ idioms this becomes:
--- left.cc ---
struct Data {
Data() {
first[0] = 1; first[1] = 123;
middle[0] = 1; middle[1] = 2; middle[2] = 456;
last[0] = 1; last[1] = 2; last[2] = 3; last[3] = 789;
}
int first[2];
int middle[3];
int last[4];
};

int main() {
Data data;
for (size_t a = 0; a < 1000000; a++) {
for (size_t b = 0; b < sizeof(Data); b++)
putchar(reinterpret_cast<char*>(&data)[b]);
}
}

--- right.cc ---
struct Data {
int first[2];
int middle[3];
int last[4];
};

int main() {
Data data;
long total = 0;
for (size_t a = 0; a < 1000000; a++) {
for (size_t b = 0; b < sizeof(Data); b++)
reinterpret_cast<char*>(&data)[b] = getchar();
total += data.first[0] + data.first[1] + data.middle[0] + data.middle[1] +
data.middle[2] + data.last[0] + data.last[1] + data.last[2] + data.last[3];
}
printf("%ld\n", total);
}
And running them we get:
$ time ./left.py | ./right.py
1378000000

real 1m3.284s
user 1m14.057s
sys 0m0.256s
$ g++ -O2 right.cc -o right && g++ -O2 left.cc -o left && time ./left | ./right
1378000000

real 0m0.763s
user 0m1.292s
sys 0m0.088s
$ echo 74.067/1.292 | bc
57
Thus, the C++ version of these little programs are 57 times faster than the equivalent Python code. But this is not the neat thing... Let's get back to discussing design lowering.

Where to do design lowering

The neat thing is that you can implement left and right in Python without worrying about performance because they are easily design lowered to C++. I can say this because left and right comes in pair; if you redesign one side you also redesign the other. In other words, the details of the implementations are free to change in ways to make them execute more efficient. Thus, it's easy to do design lowering.

However, this would not be the case if left sent data to an unknown part via a predefined protocol. If this was so, then reimplementing left in C++ would probably be considerably harder. In other words, design lowering left would be less feasible. It would still be possible to do, though.

In summary, design lowering can be used as a plan for optimizability when the design can vary in ways that make it feasible to implement faster in another language (or data structure, etc). How the design can vary depends on many things, though, thus design lowering is not always easy to do without first refactoring the design.

As before, design lowering is not something you get for free, though: you need to ha good understanding of how/where the application will be used. When you understand your application, you're more likely to make correct assumptions of what parts of that can be design lowered and which parts that can't.

Conclusions

Just like optimizing, designing for optimizability requires us to understand the problem. We cannot escape it: to do something good, we need to know what "good" is, and to know what "good" is we need to understand the domain. Beware of over-analyzing though: concentrate on system-wide concerns that are likely to affect performance, and know how the design can vary to enable design lowering. Also, make sure that design seams don't hinder optimizability.

There are still much more to discuss on this topic, so stay tuned!

No comments: