9.1 Identify the most critical parts of your code
Optimizing software is not just a question of fiddling with the right assembly instructions. Many modern applications spend much more time on loading modules, resource files, databases, interface frameworks, etc. than on the calculations the program is actually made for. Optimizing the calculation time does not help when the program spends 99.9% of its time on something other than calculation. It is important to find out where the biggest time consumers are before you start to optimize anything. Sometimes the solution can be to change from C# to C++, to use a different user interface framework, to organize file input and output differently, to cache network data, to avoid dynamic memory allocation, etc., rather than using assembly language. See manual 1: "Optimizing software in C++" for further discussion.
The use of assembly code for optimizing software is relevant only for highly CPU-intensive programs such as sound and image processing, encryption, sorting, data compression and complicated mathematical calculations.
In CPU-intensive software programs, you will often find that more than 99% of the CPU time is used in the innermost loop. Identifying the most critical part of the software is therefore necessary if you want to improve the speed of computation. Optimizing less critical parts of the code is not only a waste of time, it also makes the code less clear and harder to debug and maintain. Most compiler packages include a profiler that can tell you which part of the code is most critical. If you don't have a profiler, and if it is not obvious which part of the code is most critical, then set up a number of counter variables that are incremented at different places in the code to see which part is executed most times. Use the methods described on page 163 for measuring how much time each part of the code takes.
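A possible sketch of the counter method, assuming count1 and count2 are dword variables in the data segment (the names are chosen for illustration only):

; inserted at the start of one code path:
inc dword ptr [count1]   ; counts executions of this path
; inserted at the start of another code path:
inc dword ptr [count2]   ; counts executions of that path

After a test run, the counter values show which part of the code is executed most often and is therefore the most relevant candidate for optimization.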
It is important to study the algorithm used in the critical part of the code to see if it can be improved. Often you can gain more speed simply by choosing the optimal algorithm than by any other optimization method.
9.2 Out of order execution
All modern x86 processors can execute instructions out of order. Consider this example:
; Example 9.1a, Out-of-order execution
mov eax, [mem1]
imul eax, 6
mov [mem2], eax
mov ebx, [mem3]
add ebx, 2
mov [mem4], ebx
This piece of code is doing two things that have nothing to do with each other: multiplying [mem1] by 6 and adding 2 to [mem3]. If it happens that [mem1] is not in the cache, then the CPU has to wait many clock cycles while this operand is being fetched from main memory. The CPU will look for something else to do in the meantime. It cannot do the second instruction, imul eax,6, because it depends on the output of the first instruction. But the fourth instruction, mov ebx,[mem3], is independent of the preceding instructions, so it is possible to execute mov ebx,[mem3] and add ebx,2 while it is waiting for [mem1].
The CPUs have many features to support efficient out-of-order execution. Most important is, of course, the ability to detect whether an instruction depends on the output of a previous instruction. Another important feature is register renaming. Assume now that we are using the same register for multiplying and adding in example 9.1a because there are no more spare registers:
; Example 9.1b, Out-of-order execution with register renaming
mov eax, [mem1]
imul eax, 6
mov [mem2], eax
mov eax, [mem3]
add eax, 2
mov [mem4], eax
Example 9.1b will work exactly as fast as example 9.1a because the CPU is able to use different physical registers for the same logical register eax. This works in a very elegant way. The CPU assigns a new physical register to hold the value of eax every time eax is written to. This means that the above code is changed inside the CPU to a code that uses four different physical registers for eax. The first register is used for the value loaded from [mem1]. The second register is used for the output of the imul instruction. The third register is used for the value loaded from [mem3]. And the fourth register is used for the output of the add instruction. The use of different physical registers for the same logical register enables the CPU to make the last three instructions in example 9.1b independent of the first three instructions. The CPU must have a lot of physical registers for this mechanism to work efficiently. The number of physical registers is different for different microprocessors, but you can generally assume that the number is sufficient for quite a lot of instruction reordering.
Partial registers
Some CPUs can keep different parts of a register separate, while other CPUs always treat a register as a whole. If we change example 9.1b so that the second part uses 16-bit registers then we have the problem of a false dependence:
; Example 9.1c, False dependence of partial register
mov eax, [mem1] ; 32 bit memory operand
imul eax, 6
mov [mem2], eax
mov ax, [mem3] ; 16 bit memory operand
add ax, 2
mov [mem4], ax
Here the instruction mov ax,[mem3] changes only the lower 16 bits of register eax, while the upper 16 bits retain the value they got from the imul instruction. Some CPUs from Intel, AMD and VIA alike are unable to rename a partial register. The consequence is that the mov ax,[mem3] instruction has to wait for the imul instruction to finish because it needs to combine the 16 lower bits from [mem3] with the 16 upper bits from the imul instruction.
Other CPUs are able to split the register into parts in order to avoid the false dependence, but this has another disadvantage in case the two parts have to be joined together again. Assume, for example, that the code in example 9.1c is followed by PUSH EAX. On some processors, this instruction has to wait for the two parts of EAX to retire in order to join them together, at the cost of 5-6 clock cycles. Other processors will generate an extra µop for joining the two parts of the register together.
These problems are avoided by replacing mov ax,[mem3] with movzx eax,[mem3]. This resets the high bits of eax and breaks the dependence on any previous value of eax. In 64-bit mode, it is sufficient to write to the 32-bit register because this always resets the upper part of a 64-bit register. Thus, movzx eax,[mem3] and movzx rax,[mem3] are doing exactly the same thing. The 32-bit version of the instruction is one byte shorter than the 64-bit version. Any use of the high 8-bit registers AH, BH, CH, DH should be avoided because it can cause false dependences and less efficient code.
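Applied to example 9.1c, the rewrite might look like this (a sketch; only the load changes, the store remains 16 bits):

; Sketch: example 9.1c without the false dependence
mov eax, [mem1]              ; 32 bit memory operand
imul eax, 6
mov [mem2], eax
movzx eax, [mem3]            ; 16 bit operand zero-extended: no dependence on imul
add eax, 2
mov [mem4], ax               ; store the lower 16 bits, as before

Because movzx writes the full eax register, the CPU can rename eax and execute the last three instructions independently of the first three, just as in example 9.1b.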
The flags register can cause similar problems for instructions that modify some of the flag bits and leave other bits unchanged. For example, the INC and DEC instructions leave the carry flag unchanged but modify the zero and sign flags.
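On processors where such partial flag updates are costly, a possible workaround is to use ADD and SUB, which write all the arithmetic flags. As a sketch, replacing

inc ecx                      ; leaves the carry flag unchanged

with

add ecx, 1                   ; writes all the arithmetic flags

avoids a possible false dependence on a previous flag-writing instruction, at the cost of a slightly longer instruction.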
Micro-operations
Another important feature is the splitting of instructions into micro-operations (abbreviated µops or uops). The following example shows the advantage of this:
; Example 9.2, Splitting instructions into uops
push eax
call SomeFunction
The push eax instruction does two things. It subtracts 4 from the stack pointer and stores eax to the address pointed to by the stack pointer. Assume now that eax is the result of a long and time-consuming calculation. This delays the push instruction. The call instruction depends on the value of the stack pointer which is modified by the push instruction. If instructions were not split into µops then the call instruction would have to wait until the push instruction was finished. But the CPU splits the push eax instruction into sub esp,4 followed by mov [esp],eax. The sub esp,4 micro-operation can be executed before eax is ready, so the call instruction will wait only for sub esp,4, not for mov [esp],eax.
Execution units
Out-of-order execution becomes even more efficient when the CPU can do more than one thing at the same time. Many CPUs can do two, three or four things at the same time if the things to do are independent of each other and do not use the same execution units in the CPU. Most CPUs have at least two integer ALUs (Arithmetic Logic Units) so that they can do two or more integer additions per clock cycle. There is usually one floating point add unit and one floating point multiplication unit so that it is possible to do a floating point addition and a multiplication at the same time. There may be one memory read unit and one memory write unit so that it is possible to read and write to memory at the same time. The maximum average number of µops per clock cycle is three or four on many processors so that it is possible, for example, to do an integer operation, a floating point operation, and a memory operation in the same clock cycle. The maximum number of arithmetic operations (i.e. anything other than memory read or write) is limited to two or three µops per clock cycle, depending on the CPU.
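As a sketch, the three independent operations below use different execution units and can, in principle, execute in the same clock cycle (mem1 is a hypothetical variable):

add eax, ebx                 ; integer ALU
mulsd xmm0, xmm1             ; floating point multiplication unit
mov ecx, [mem1]              ; memory read unit

If the three operations instead depended on each other's results, they would have to execute one after the other regardless of how many execution units the CPU has.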
Pipelined instructions
Floating point operations typically take more than one clock cycle, but they are usually pipelined so that e.g. a new floating point addition can start before the previous addition is finished. MMX and XMM instructions use the floating point execution units even for integer instructions on many CPUs. The details of which instructions can be executed simultaneously or pipelined, and how many clock cycles each instruction takes, are CPU specific. The details for each type of CPU are explained in manual 3: "The microarchitecture of Intel, AMD and VIA CPUs" and manual 4: "Instruction tables".
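A common way to exploit pipelining is to use more than one accumulator so that a new addition can start while the previous one is still in the pipeline. The sketch below assumes esi points to an array of doubles with an even number of elements and edi points to its end:

; Sketch: two accumulators keep two chains of additions in flight
xorpd xmm0, xmm0             ; accumulator 1
xorpd xmm1, xmm1             ; accumulator 2
L1: addsd xmm0, [esi]        ; even elements
addsd xmm1, [esi+8]          ; odd elements
add esi, 16
cmp esi, edi
jb L1
addsd xmm0, xmm1             ; combine the two partial sums

With a single accumulator, each addsd would have to wait for the previous one to finish; with two, the latency of one chain is hidden behind the other.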
Summary
The most important things you have to be aware of in order to take maximum advantage of out-of-order execution are:
- Keep logically independent calculations independent in the code, so that the CPU always has something else to do while waiting for a slow operand.
- Avoid long dependency chains where each instruction has to wait for the result of the preceding one.
- Avoid false dependences on partial registers. Write to a full 32-bit register, or use MOVZX, rather than writing to a partial register.
- Be aware that instructions such as INC and DEC, which modify only some of the flag bits, can cause a similar false dependence on the flags register.
- Mix instructions that use different execution units so that integer, floating point and memory operations can execute in the same clock cycle.
9.3 Instruction fetch, decoding and retirement
Instruction fetching can be a bottleneck. Many processors cannot fetch more than 16 bytes of instruction code per clock cycle. It may be necessary to make instructions as short as possible if this limit turns out to be critical. One way of making instructions shorter is to replace memory operands by pointers (see chapter 10 page 73). The address of memory operands can possibly be loaded into pointer registers outside of a loop if instruction fetching is a bottleneck. Large constants can likewise be loaded into registers.
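A sketch of this technique, assuming MyArray is a dword array in the data segment and ecx holds the element count:

; load a pointer and a large constant into registers outside the loop
mov esi, offset MyArray      ; pointer register replaces a 4-byte address
mov eax, 12345678h           ; large constant loaded once
L1: add [esi], eax           ; short instruction: no address or immediate field
add esi, 4
dec ecx
jnz L1

Each instruction inside the loop is now several bytes shorter, which helps if the 16 bytes per clock cycle fetch limit is the bottleneck.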
Instruction fetching is delayed by jumps on most processors. It is important to minimize the number of jumps in critical code. Branches that are not taken and correctly predicted do not delay instruction fetching. It is therefore advantageous to organize if-else branches so that the branch that is followed most commonly is the one where the conditional jump is not taken.
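A sketch of this arrangement, where ErrorHandler is a hypothetical label for the rare case:

test eax, eax
jz ErrorHandler              ; taken only in the rare error case
mov ebx, [mem1]              ; the common case falls through here

Because the conditional jump is not taken in the common case, and is predicted as such, the frequent path suffers no instruction fetch delay.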
Most processors fetch instructions in aligned 16-byte or 32-byte blocks. It can be advantageous to align critical loop entries and subroutine entries by 16 in order to minimize the number of 16-byte boundaries in the code. Alternatively, make sure that there is no 16-byte boundary in the first few instructions after a critical loop entry or subroutine entry.
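A sketch using the ALIGN directive (the assembler fills the gap before the label with padding):

align 16                     ; pad so that the loop entry falls on a 16-byte boundary
InnerLoop:
add eax, [esi]
add esi, 4
dec ecx
jnz InnerLoop

The first instructions of the loop now start at the beginning of a fetch block, so no fetch cycle is wasted on bytes before the entry.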
Instruction decoding is often a bottleneck. The organization of instructions that gives the optimal decoding is processor-specific. Intel PM processors require a 4-1-1 decode pattern. This means that instructions which generate 2, 3 or 4 µops should be interspersed with two single-µop instructions. On Core2 processors the optimal decode pattern is 4-1-1-1. On AMD processors it is preferred to avoid instructions that generate more than 2 µops.
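A sketch of a 4-1-1 pattern (the exact µop counts are CPU specific; see manual 4):

; a multi-µop read-modify-write instruction followed by two
; single-µop instructions fits the 4-1-1 decode pattern
add [mem1], eax              ; read-modify-write: several µops
mov ebx, [mem2]              ; 1 µop
add ecx, edx                 ; 1 µop

If several multi-µop instructions were adjacent instead, only one of them could be decoded per clock cycle on these processors.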
Instructions with multiple prefixes can slow down decoding. The maximum number of prefixes that an instruction can have without slowing down decoding is 1 on 32-bit Intel processors, 2 on Intel P4E processors, 3 on AMD processors, and unlimited on Core2. Avoid address size prefixes. Avoid operand size prefixes on instructions with an immediate operand. For example, it is preferred to replace MOV AX,2 by MOV EAX,2.
Decoding is rarely a bottleneck on processors with a trace cache, but there are specific requirements for optimal use of the trace cache.