9 Optimizing memory access
9.1 Caching of code and data
A cache is a proxy for the main memory in a computer. The cache is smaller and closer to the CPU than the main memory and is therefore accessed much faster. There may be two or three levels of cache to provide the fastest possible access to the most frequently used data.
The speed of CPUs is increasing faster than the speed of RAM memory. Efficient caching is therefore becoming more and more important.
9.2 Cache organization
It is useful to know how a cache is organized if you are making programs that have big data structures with non-sequential access and you want to prevent cache contention. You may skip this section if you are satisfied with more heuristic guidelines.
Most caches are organized into lines and sets. Let me explain this with an example. My example is a cache of 8 kB size with a line size of 64 bytes. Each line covers 64 consecutive bytes of memory. One kilobyte is 1024 bytes, so we can calculate that the number of lines is 8*1024/64 = 128. These lines are organized as 32 sets x 4 ways. This means that a particular memory address cannot be loaded into an arbitrary cache line. Only one of the 32 sets can be used, but any of the 4 lines in the set can be used. We can calculate which set of cache lines to use for a particular memory address by the formula: (set) = (memory address) / (line size) % (number of sets). Here, / means integer division with truncation, and % means modulo.

For example, if we want to read from memory address a = 10000, then we have (set) = (10000 / 64) % 32 = 28. This means that a must be read into one of the four cache lines in set number 28. The calculation becomes easier if we use hexadecimal numbers because all the numbers are powers of 2. Using hexadecimal numbers, we have a = 0x2710 and (set) = (0x2710 / 0x40) % 0x20 = 0x1C. Reading or writing a variable from address 0x2710 will cause the cache to load the entire 64 or 0x40 bytes from address 0x2700 to 0x273F into one of the four cache lines from set 0x1C. If the program afterwards reads or writes to any other address in this range then the value is already in the cache so we do not have to wait for another memory access.
Assume that a program reads from address 0x2710 and later reads from addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. These addresses all belong to set number 0x1C. There are only four cache lines in each set. If the cache always chooses the least recently used cache line then the line that covered the address range from 0x2700 to 0x273F will be evicted when we read from 0x4700. Reading again from address 0x2710 will cause a cache miss. But if the program had read from different addresses with different set values then the line containing the address range from 0x2700 to 0x273F would still be in the cache. The problem only occurs because the addresses are spaced a multiple of 0x800 apart. I will call this distance the critical stride. Variables whose distance in memory is a multiple of the critical stride will contend for the same cache lines. The critical stride can be calculated as (critical stride) = (number of sets) x (line size) = (total cache size) / (number of ways).
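The set and critical stride calculations can be made concrete with a small sketch in code, using the example cache parameters above (64-byte lines, 32 sets, 4 ways). The parameters of a real cache depend on the processor:

#include <stdio.h>

int main() {
    const unsigned int lineSize = 64;   // bytes per cache line
    const unsigned int numSets  = 32;
    const unsigned int numWays  = 4;

    // (set) = (memory address) / (line size) % (number of sets)
    unsigned int address = 0x2710;
    unsigned int set = (address / lineSize) % numSets;
    printf("address 0x%X maps to set 0x%X\n", address, set);     // prints 0x1C

    // (critical stride) = (number of sets) * (line size)
    //                   = (total cache size) / (number of ways)
    unsigned int criticalStride = numSets * lineSize;
    printf("critical stride = 0x%X bytes\n", criticalStride);    // prints 0x800
    return 0;
}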
If a program contains many variables and objects that are scattered around in memory then there is a risk that several variables happen to be spaced by a multiple of the critical stride and cause contentions in the data cache. The same can happen in the code cache if there are many functions scattered around in program memory. If several functions that are used in the same part of the program happen to be spaced by a multiple of the critical stride then this can cause contentions in the code cache. The subsequent sections describe various ways to avoid these problems.
More details about how caches work can be found in Wikipedia under CPU cache (en.wikipedia.org/wiki/L2_cache).
The details of cache organization for different processors are covered in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs".
9.3 Functions that are used together should be stored together
The code cache works most efficiently if functions that are used near each other are also stored near each other in the code memory. The functions are usually stored in the order in which they appear in the source code. It is therefore a good idea to collect the functions that are used in the most critical part of the code together near each other in the same source file. Keep often used functions separate from seldom used functions, and put seldom used branches such as error handling at the end of a function or in a separate function.
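The last advice can be illustrated with a small sketch, using hypothetical functions HandleError and DoWork. The rarely executed error path is moved to a separate function so that the frequently executed path stays compact in the code cache:

void HandleError(int x);   // seldom used, can be stored far away
int DoWork(int x);         // often used

int ProcessData(int x) {
    if (x < 0) {           // rare error case
        HandleError(x);    // error handling kept out of the hot path
        return 0;
    }
    return DoWork(x);      // frequently executed path
}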
Sometimes, functions are kept in different source files for the sake of modularity. For example, it may be convenient to have the member functions of a parent class in one source file and the derived class in another source file. If the member functions of parent class and derived class are called from the same critical part of the program then it can be advantageous to keep the two modules contiguous in program memory. This can be done by controlling the order in which the modules are linked together. The link order is usually the order in which the modules appear in the project window or makefile. You can check the order of functions in memory by requesting a map file from the linker. The map file tells the address of each function relative to the beginning of the program. The map file includes the addresses of library functions linked from static libraries (.lib or .a), but not dynamic libraries (.dll or .so). There is no easy way to control the addresses of dynamically linked library functions.
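For example, GCC and Clang can produce a map file with the linker option -Wl,-Map=mapfile.map, and the Microsoft linker can produce one with the /MAP option.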
9.4 Variables that are used together should be stored together
Cache misses are very expensive. A variable can be fetched from the cache in just a few clock cycles, but it can take more than a hundred clock cycles to fetch the variable from RAM memory if it is not in the cache.
The cache works most efficiently if pieces of data that are used together are stored near each other in memory. Variables and objects should preferably be declared in the function in which they are used. Such variables and objects will be stored on the stack, which is very likely to be in the level-1 cache. The different kinds of variable storage are explained on page 26. Avoid global and static variables if possible, and avoid dynamic memory allocation (new and delete).
Object oriented programming can be an efficient way of keeping data together. Data members of a class (also called properties) are always stored together in an object of the class. Data members of a parent class and a derived class are stored together in an object of the derived class (see page 51).
The order in which data are stored can be important if you have big data structures. For example, if a program has two arrays, a and b, and the elements are accessed in the order a[0], b[0], a[1], b[1], ... then you may improve the performance by organizing the data as an array of structures:
// Example 9.1a
int Func(int);
const int size = 1024;
int a[size], b[size], i;
...
for (i = 0; i < size; i++) {
    b[i] = Func(a[i]);
}
The data in this example can be accessed sequentially in memory if organized as follows:
// Example 9.1b
int Func(int);
const int size = 1024;
struct Sab {int a; int b;};
Sab ab[size];
int i;
...
for (i = 0; i < size; i++) {
    ab[i].b = Func(ab[i].a);
}
There will be no extra overhead in the program code for making the structure in example 9.1b. On the contrary, the code becomes simpler because it only needs to calculate element addresses for one array rather than two.
Some compilers will use different memory spaces for different arrays even if they are never used at the same time. Example:
// Example 9.2a
void F1(int x[]);
void F2(float x[]);
void F3(bool y) {
    if (y) {
        int a[1000];
        F1(a);
    }
    else {
        float b[1000];
        F2(b);
    }
}
Here it is possible to use the same memory area for a and b because their live ranges do not overlap. You can save a lot of cache space by joining a and b in a union:
// Example 9.2b
void F3(bool y) {
    union {
        int a[1000];
        float b[1000];
    };
    if (y) {
        F1(a);
    }
    else {
        F2(b);
    }
}
Using a union is not a safe programming practice, of course, because you will get no warning from the compiler if the uses of a and b overlap. You should use this method only for big objects that take a lot of cache space. Putting simple variables into a union is not optimal because it prevents the use of register variables.
9.5 Alignment of data
A variable is accessed most efficiently if it is stored at a memory address which is divisible by the size of the variable. For example, a double takes 8 bytes of storage space. It should therefore preferably be stored at an address divisible by 8. The size should always be a power of 2. Objects bigger than 16 bytes should be stored at an address divisible by 16. You can generally assume that the compiler takes care of this alignment automatically.
The alignment of structure and class members may cause a waste of cache space, as explained in example 7.35 page 52.
You may choose to align large objects and arrays by the cache line size, which is typically 64 bytes. This makes sure that the beginning of the object or array coincides with the beginning of a cache line. Some compilers will align large static arrays automatically but you may as well specify the alignment explicitly by writing:
__declspec(align(64)) int BigArray[1024]; // Windows syntax
or
int BigArray[1024] __attribute__((aligned(64))); // Linux syntax
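In standard C++11 and later, the alignment can also be specified in a portable way with the alignas keyword:

alignas(64) int BigArray[1024]; // C++11 syntax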
See pages 95 and 120 for a discussion of aligning dynamically allocated memory.
9.6 Dynamic memory allocation
Objects and arrays can be allocated dynamically with new and delete, or malloc and free. This can be useful when the amount of memory required is not known at compile time. Four typical uses of dynamic memory allocation can be mentioned here:

- A large array can be allocated dynamically when the size of the array is not known at compile time.
- A variable number of objects can be allocated dynamically when the total number of objects is not known at compile time.
- Text strings and similar objects of variable size can be allocated dynamically.
- Arrays that are too large for the stack can be allocated dynamically.

The advantages of dynamic memory allocation are:

- It gives a more clear program structure in some cases.
- It does not allocate more space than needed. This makes data caching more efficient than when a fixed-size array is made very big in order to cover the worst case.
- It is useful when no reasonable upper limit to the required amount of memory can be given in advance.

The disadvantages of dynamic memory allocation are:

- The process of allocating and deallocating memory takes much more time than other kinds of storage.
- The heap space becomes fragmented when objects of different sizes are allocated and deallocated in random order. This makes data caching less efficient.
- An array that has been resized may need to be moved to a new memory block. All pointers to the old memory block then become invalid.
- It is the responsibility of the programmer to make sure that everything that is allocated is also deallocated exactly once. Failure to do so causes memory leaks or dangling pointers, which are common sources of serious program errors.
It is important to weigh the advantages against the disadvantages when deciding whether to use dynamic memory allocation. There is no reason to use dynamic memory allocation when the size of an array or the number of objects is known at compile time or a reasonable upper limit can be defined.
The cost of dynamic memory allocation is negligible when the number of allocations is limited. Dynamic memory allocation can therefore be advantageous when a program has one or a few arrays of variable size. The alternative solution of making the arrays very big to cover the worst case situation is a waste of cache space. A situation where a program has several large arrays and where the size of each array is a multiple of the critical stride (see above, page 87) is likely to cause contentions in the data cache.
If the number of elements in an array grows during program execution then it is preferable to allocate the final array size right from the beginning rather than allocating more space step by step. In most systems, you cannot increase the size of a memory block that has already been allocated. If the final size cannot be predicted or if the prediction turns out to be too small, then it is necessary to allocate a new bigger memory block and copy the contents of the old memory block into the beginning of the new bigger memory block. This is inefficient, of course, and causes the heap space to become fragmented. An alternative is to keep multiple memory blocks, either in the form of a linked list or with an index of memory blocks. A method with multiple memory blocks makes the access to individual array elements more complicated and time consuming.
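The difference can be sketched with std::vector, which handles the growth automatically but pays the same reallocation and copying costs when the final size is not reserved in advance:

#include <vector>

void FillKnownSize(int n) {
    std::vector<int> v;
    v.reserve(n);           // one allocation covering the final size
    for (int i = 0; i < n; i++) {
        v.push_back(i);     // never reallocates
    }
}

void FillGrowing(int n) {
    std::vector<int> v;     // capacity grows step by step
    for (int i = 0; i < n; i++) {
        v.push_back(i);     // occasionally reallocates and copies all elements
    }
}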
A collection of a variable number of objects is often implemented as a linked list. Each element in a linked list has its own memory block and a pointer to the next block. A linked list is less efficient than a linear array for the following reasons:

- Each object is allocated separately. The allocation and deallocation of the many small blocks takes extra time.
- The objects are not stored contiguously in memory. This makes data caching less efficient.
- Extra memory space is used for the link pointers and for information that the heap manager stores for each allocated block.
- Walking through the list takes extra time because no link pointer can be read until the previous one has been loaded. This makes a critical dependency chain which prevents out-of-order execution.
It is often more efficient to allocate one big block of memory for all the objects (memory pooling) than to allocate a small block for each object.
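A minimal sketch of this idea, with a hypothetical Node type: all the nodes of a linked list are taken from one contiguous block, so there is only one heap allocation and the nodes are adjacent in memory:

#include <vector>

struct Node {
    int value;
    Node * next;
};

void BuildPooledList(int count) {
    if (count <= 0) return;
    std::vector<Node> pool(count);   // one contiguous allocation for all nodes
    for (int i = 0; i < count - 1; i++) {
        pool[i].value = i;
        pool[i].next = &pool[i+1];   // links stay within the same block
    }
    pool[count-1].value = count - 1;
    pool[count-1].next = 0;          // end of list
    // ... use the list; the whole pool is freed in one operation here
}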
A little-known alternative to using new and delete is to allocate variable-size arrays with alloca. This is a function that allocates memory on the stack rather than the heap. The space is automatically deallocated when returning from the function in which alloca was called. There is no need to deallocate the space explicitly when alloca is used. The advantages of alloca over new and delete or malloc and free are:

- There is very little overhead to the allocation process because it only involves adjusting the stack pointer.
- The memory space never becomes fragmented, thanks to the first-in-last-out nature of the stack.
- Deallocation has no cost because it happens automatically when the function returns.
- The allocated memory is contiguous with other objects on the stack, which makes data caching very efficient.
The following example shows how to make a variable-size array with alloca:
// Example 9.3
#include <malloc.h>  // (on Unix systems, alloca is declared in <alloca.h>)
void SomeFunction (int n) {
    if (n > 0) {
        // Make dynamic array of n floats:
        float * DynamicArray = (float *)alloca(n * sizeof(float));
        // (Some compilers use the name _alloca)
        for (int i = 0; i < n; i++) {
            DynamicArray[i] = 0.0f;  // use the array here
        }
        // The space is deallocated automatically when the function returns
    }
}