16 Testing speed
Testing the speed of a program is an important part of the optimization job. You have to check whether your modifications actually improve the speed.
There are various profilers available which are useful for finding the hot spots and measuring the overall performance of a program. The profilers are not always accurate, however, and it may be difficult to measure exactly what you want when the program spends most of its time waiting for user input or reading disk files. See page 16 for a discussion of profiling.
When a hot spot has been identified, then it may be useful to isolate the hot spot and make measurements on this part of the code only. This can be done with the resolution of the CPU clock by using the so-called time stamp counter. This is a counter that measures the number of clock pulses since the CPU was started. The length of a clock cycle is the reciprocal of the clock frequency, as explained on page 16. If you read the value of the time stamp counter before and after executing a critical piece of code then you can get the exact time consumption as the difference between the two clock counts.
The value of the time stamp counter can be obtained with the function ReadTSC listed below in example 16.1. This code works only for compilers that support intrinsic functions. Alternatively, you can use the header file timingtest.h from www.agner.org/optimize/testp.zip or get ReadTSC as a library function from www.agner.org/optimize/asmlib.zip.
// Example 16.1
#include <intrin.h>              // Or #include <ia32intrin.h> etc.

long long ReadTSC() {            // Returns time stamp counter
    int dummy[4];                // For unused returns
    volatile int DontSkip;       // Volatile to prevent optimizing
    long long clock;             // Time
    __cpuid(dummy, 0);           // Serialize
    DontSkip = dummy[0];         // Prevent optimizing away cpuid
    clock = __rdtsc();           // Read time
    return clock;
}
You can use this function to measure the clock count before and after executing the critical code. A test setup may look like this:
// Example 16.2
#include <stdio.h>
#include <asmlib.h>                       // Use ReadTSC() from library asmlib
                                          // or from example 16.1
void CriticalFunction();                  // This is the function we want to measure
...
const int NumberOfTests = 10;             // Number of times to test
int i; long long time1;
long long timediff[NumberOfTests];        // Time difference for each test

for (i = 0; i < NumberOfTests; i++) {     // Repeat NumberOfTests times
    time1 = ReadTSC();                    // Time before test
    CriticalFunction();                   // Critical function to test
    timediff[i] = ReadTSC() - time1;      // (time after) - (time before)
}
printf("\nResults:");                     // Print heading
for (i = 0; i < NumberOfTests; i++) {     // Loop to print out results
    printf("\n%2i %10I64i", i, timediff[i]);
}
The code in example 16.2 calls the critical function ten times and stores the time consumption of each run in an array. The values are then output after the test loop. The time measured in this way includes the overhead of calling the ReadTSC function itself. You can subtract this overhead from the counts; it is measured simply by removing the call to CriticalFunction in example 16.2, as sketched below.
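A minimal sketch of this overhead measurement, using ReadTSC from example 16.1. Taking the smallest observed value is a reasonable choice, since the true overhead cannot be smaller:

long long overhead = -1;                  // Smallest observed empty interval
for (int i = 0; i < NumberOfTests; i++) {
    long long t1 = ReadTSC();             // Time before (nothing in between)
    long long t2 = ReadTSC();             // Time after
    if (overhead < 0 || t2 - t1 < overhead) overhead = t2 - t1;
}
// Subtract overhead from each timediff[i] measured in example 16.2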
The measured time is interpreted in the following way. The first count is usually higher than the subsequent counts. This is the time it takes to execute CriticalFunction when code and data are not cached. The subsequent counts give the execution time when code and data are cached as well as possible. The first count and the subsequent counts represent the "worst case" and "best case" values, respectively. Which of these two values is closest to the truth depends on whether CriticalFunction is called once or multiple times in the final program, and on whether there is other code that uses the cache between the calls to CriticalFunction. If your optimization effort is concentrated on CPU efficiency, then it is the "best case" counts you should look at to see whether a certain modification is profitable. On the other hand, if your optimization effort is concentrated on arranging data in order to improve cache efficiency, then you may also look at the "worst case" counts. In any event, the clock counts should be multiplied by the clock period and by the number of times CriticalFunction is called in a typical application to calculate the time delay that the end user is likely to experience.
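For example, with illustrative numbers: a best-case count of 1200 clock cycles on a CPU running at 3 GHz corresponds to 1200 / (3*10^9 Hz) = 400 ns per call. If CriticalFunction is called 10,000 times in a typical run, the end user experiences a delay of roughly 400 ns * 10,000 = 4 ms.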
Occasionally, the clock counts that you measure are much higher than normal. This happens when a task switch occurs during execution of CriticalFunction. You cannot avoid this in a protected operating system, but you can reduce the problem by increasing the thread priority before the test and setting the priority back to normal afterwards.
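On Windows, for example, this can be done with the SetThreadPriority function. A minimal sketch, with error handling omitted:

#include <windows.h>
...
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_TIME_CRITICAL); // Raise priority
// Run the timing loop from example 16.2 here
SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_NORMAL);        // Restore priority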
The clock counts often fluctuate, and it may be difficult to get reproducible results. This is because modern CPUs can change their clock frequency dynamically depending on the workload. The clock frequency is increased when the workload is high and decreased when the workload is low in order to save power. There are various ways to get more reproducible time measurements:

- Warm up the CPU with some heavy work immediately before the critical code is executed, so that the clock frequency is already at its maximum (a sketch is shown below).
- Disable power-saving options that lower the clock frequency, for example in the BIOS setup.
- Use the core clock cycles performance monitor counter described in the next section, which is almost insensitive to changes in the clock frequency.
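A minimal warm-up sketch. The amount of dummy work is arbitrary; it only needs to keep the CPU busy long enough for the clock frequency to reach its maximum:

volatile double warmup = 0.;                // Volatile so the loop is not optimized away
for (int i = 0; i < 100000000; i++) {       // Arbitrary amount of dummy work
    warmup += 1.0 / (i + 1);
}
// Run the timing loop from example 16.2 immediately after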
16.1 Using performance monitor counters
Many CPUs have a built-in test feature called performance monitor counters. A performance monitor counter is a counter inside the CPU which can be set up to count certain events, such as the number of machine instructions executed, cache misses, branch mispredictions, etc. These counters can be very useful for investigating performance problems. The performance monitor counters are CPU-specific and each CPU model has its own set of performance monitoring options.
CPU vendors offer profiling tools tailored to their own CPUs. Intel's profiler is called VTune; AMD's profiler is called CodeAnalyst. These profilers are useful for identifying hot spots in the code.
For my own research, I have developed a test tool that uses the performance monitor counters. My test tool supports Intel, AMD, and VIA processors, and it is available from www.agner.org/optimize/testp.zip. This tool is not a profiler. It is not intended for finding hot spots, but for studying a piece of code once the hot spots have been identified.
My test tool can be used in two ways. The first way is to insert the piece of code to test into the test program itself and recompile it. I use this for testing single assembly instructions or small sequences of code. The second way is to set up the performance monitor counters before running the program you want to optimize, and read the counters inside your program before and after the piece of code you want to test. You can use the same principle as in example 16.2 above, but read one or more performance monitor counters instead of (or in addition to) the time stamp counter, as sketched below. The test tool can set up and enable one or more performance monitor counters in all the CPU cores and leave them enabled (there is one set of counters in each CPU core). The counters stay on until you turn them off or until the computer is reset or goes into sleep mode. See the manual for my test tool for details (www.agner.org/optimize/testp.zip).
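A minimal sketch of the second way, modeled on example 16.2. It assumes the counters have already been enabled by the test tool and that a function for reading a counter is available; the function name ReadPMC and the counter number used here are assumptions, so check the test tool's manual for the actual interface:

extern "C" long long ReadPMC(int nCounter); // Assumed interface for reading counter n

const int NumberOfTests = 10;               // Number of times to test
long long count1;                           // Counter reading before test
long long countdiff[NumberOfTests];         // Counter difference for each test

for (int i = 0; i < NumberOfTests; i++) {   // Repeat NumberOfTests times
    count1 = ReadPMC(1);                    // Read counter 1 before test (assumed number)
    CriticalFunction();                     // Critical function to test
    countdiff[i] = ReadPMC(1) - count1;     // Events counted during the call
}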
A particularly useful performance monitor counter in Intel processors is called core clock cycles. The core clock cycles counter counts clock cycles at the actual frequency that the CPU core is running at, rather than the external clock. This gives a measure that is almost independent of changes in the clock frequency. The core clock cycles counter is very useful when testing which version of a piece of code is fastest, because you can avoid the problem that the clock frequency goes up and down.
Remember to insert a switch in your program to turn off the reading of the counters when you are not testing. Trying to read the performance monitor counters when they are disabled will crash the program.
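A minimal sketch of such a switch, using a hypothetical compile-time macro that is defined only in test builds:

#ifdef TEST_WITH_PMC                        // Hypothetical macro; define only when testing
    count1 = ReadPMC(1);                    // Counter reads compiled in only in test builds
#endif
    CriticalFunction();                     // Normal code runs in all builds
#ifdef TEST_WITH_PMC
    countdiff[i] = ReadPMC(1) - count1;
#endif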
16.2 The pitfalls of unit-testing
It is common practice in software development to test each function or class separately. This unit-testing is necessary for verifying the functionality of an optimized function, but unfortunately the unit-test does not give full information about the performance of the function in terms of speed.
Assume that you have two different versions of a critical function and you want to find out which one is fastest. The typical way to test this is to make a small test program that calls the critical function many times with a suitable set of test data and measures how long it takes. The version that performs best under this unit-test may have a larger memory footprint than the alternative version. The penalty of cache misses is not seen in the unit-test because the total amount of code and data memory used by the test program is likely to be less than the cache size.
When the critical function is inserted in the final program, it is very likely that code cache and data cache are critical resources. Modern CPUs are so fast that the clock cycles spent on executing instructions are less likely to be a bottleneck than memory access and cache size. If this is the case then the optimal version of the critical function may be the one that takes longer time in the unit-test but has a smaller memory footprint.
If, for example, you want to find out whether it is advantageous to unroll a big loop, then you cannot rely on a unit-test that does not take cache effects into account.
You can calculate how much memory a function uses by looking at a link map or an assembly listing. Use the "generate map file" option for the linker. Both code cache use and data cache use can be critical. The branch target buffer is also a cache that can be critical. Therefore, the number of jumps, calls and branches in a function should also be considered.
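For example, a map file can typically be generated like this (the exact option depends on the tool chain):

link /MAP:program.map program.obj           (Microsoft linker)
gcc -Wl,-Map,program.map program.c          (GNU tool chain)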
A realistic performance test should include not only a single function or hot spot but also the innermost loop that includes the critical functions and hot spots. The test should be performed with a realistic set of data in order to get reliable results for branch mispredictions. The performance measurement should not include any part of the program that waits for user input. The time used for file input and output should be measured separately.
The fallacy of measuring performance by unit-testing is unfortunately very common. Even some of the best optimized function libraries available use excessive loop unrolling so that the memory footprint is unreasonably large.
16.3 Worst-case testing
Most performance tests are done under the best-case conditions. All disturbing influences are removed, all resources are sufficient, and the caching conditions are optimal. Best-case testing is useful because it gives more reliable and reproducible results. If you want to compare the performance of two different implementations of the same algorithm, then you need to remove all disturbing influences in order to make the measurements as accurate and reproducible as possible.
However, there are cases where it is more relevant to test the performance under worst-case conditions. For example, if you want to make sure that the response time to user input never exceeds an acceptable limit, then you should test the response time under worst-case conditions.
Programs that produce streaming audio or video should also be tested under worst-case conditions in order to make sure that they always keep up with the expected real-time speed. Delays or glitches in the output are unacceptable.
Each of the following methods could possibly be relevant when testing worst-case performance: