Optimizing Subroutines in Assembly Language by Agner Fog

2 Before you start

2.1 Things to decide before you start programming

Before you start to program in assembly, you have to think about why you want to use assembly language, which part of your program you need to make in assembly, and what programming method to use. If you haven't made your development strategy clear, then you will soon find yourself wasting time optimizing the wrong parts of the program, doing things in assembly that could have been done in C++, attempting to optimize things that cannot be optimized further, making spaghetti code that is difficult to maintain, and making code that is full of errors and difficult to debug.

Here is a checklist of things to consider before you start programming:

Never make the whole program in assembly. That is a waste of time. Assembly code should be used only where speed is critical and where a significant improvement in speed can be obtained. Most of the program should be made in C or C++. These are the programming languages that are most easily combined with assembly code.
If the purpose of using assembly is to make system code or use special instructions that are not available in standard C++ then you should isolate the part of the program that needs these instructions in a separate function or class with a well defined functionality. Use intrinsic functions (see p. 34) if possible.
If the purpose of using assembly is to optimize for speed then you have to identify the part of the program that consumes the most CPU time, possibly with the use of a profiler. Check if the bottleneck is file access, memory access, CPU instructions, or something else, as described in manual 1: "Optimizing software in C++". Isolate the critical part of the program into a function or class with a well-defined functionality.
If the purpose of using assembly is to make a function library then you should clearly define the functionality of the library. Decide whether to make a function library or a class library. Decide whether to use static linking (.lib in Windows, .a in Linux) or dynamic linking (.dll in Windows, .so in Linux). Static linking is usually more efficient, but dynamic linking may be necessary if the library is called from languages such as C# and Visual Basic. You may possibly make both a static and a dynamic link version of the library.
If the purpose of using assembly is to optimize an embedded application for size or speed then find a development tool that supports both C/C++ and assembly and make as much as possible in C or C++.
Decide if the code is reusable or application-specific. Spending time on careful optimization is more justified if the code is reusable. Reusable code is most appropriately implemented as a function library or class library.
Decide if the code should support multithreading. A multithreading application can take advantage of microprocessors with multiple cores. Any data that must be preserved from one function call to the next on a per-thread basis should be stored in a C++ class or a per-thread buffer supplied by the calling program.
Decide if portability is important for your application. Should the application work in Windows, Linux and Intel-based Mac OS? Should it work in both 32-bit and 64-bit mode? Should it work on non-x86 platforms? This is important for the choice of compiler, assembler and programming method.
Decide if your application should work on old microprocessors. If so, then you may make one version for microprocessors with, for example, the SSE2 instruction set, and another version which is compatible with old microprocessors. You may even make several versions, each optimized for a particular CPU. It is recommended to make automatic CPU dispatching (see page 139).
There are three assembly programming methods to choose between: (1) Use intrinsic functions and vector classes in a C++ compiler. (2) Use inline assembly in a C++ compiler. (3) Use an assembler. These three methods and their relative advantages and disadvantages are described in chapters 5, 6 and 7 (pages 34, 36 and 45, respectively).
If you are using an assembler then you have to choose between different syntax dialects. It may be preferred to use an assembler that is compatible with the assembly code that your C++ compiler can generate.
Make your code in C++ first and optimize it as much as you can, using the methods described in manual 1: "Optimizing software in C++". Make the compiler translate the code to assembly. Look at the compiler-generated code and see if there are any possibilities for improvement in the code.
Highly optimized code tends to be very difficult to read and understand for others and even for yourself when you get back to it after some time. In order to make it possible to maintain the code, it is important that you organize it into small logical units (procedures or macros) with a well-defined interface and calling convention and appropriate comments. Decide on a consistent strategy for code comments and documentation.
Save the compiler, assembler and all other development tools together with the source code and project files for later maintenance. Compatible tools may not be available in a few years when updates and modifications in the code are needed.
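
As an illustration of the multithreading point in the checklist above, here is a minimal C++ sketch. The class name RunningSum and its members are hypothetical and are not taken from this manual. The sketch only shows the idea that data which must survive from one call to the next is kept in an object owned by the caller, one object per thread, rather than in a static variable.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical example class: the running total is per-object state.
// Each thread creates its own RunningSum, so calls from different
// threads never touch the same data and no locking is needed.
class RunningSum {
public:
    // Adds n values to the total kept in this object; returns the new total.
    std::uint64_t add(const std::uint32_t* data, int n) {
        assert(n >= 0);
        for (int i = 0; i < n; i++) total_ += data[i];
        return total_;
    }
    std::uint64_t total() const { return total_; }

private:
    std::uint64_t total_ = 0;  // preserved from one call to the next
};
```

The loop inside add is the kind of code that might later be replaced by a call to an assembly function which receives a pointer to the state. The design stays thread-safe as long as each thread uses its own object.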

2.2 Make a test strategy

Assembly code is error prone, difficult to debug, difficult to make in a clearly structured way, difficult to read, and difficult to maintain, as I have already mentioned. A consistent test strategy can ameliorate some of these problems and save you a lot of time.

My recommendation is to make the assembly code as an isolated module, function, class or library with a well-defined interface to the calling program. Make it all in C++ first. Then make a test program which can test all aspects of the code you want to optimize. It is easier and safer to use a test program than to test the module in the final application.

The test program has two purposes. The first purpose is to verify that the assembly code works correctly in all situations. The second purpose is to test the speed of the assembly code without invoking the user interface, file access and other parts of the final application program that may make the speed measurements less accurate and less reproducible.

You should use the test program repeatedly after each step in the development process and after each modification of the code.

Make sure the test program works correctly. It is quite common to spend a lot of time looking for an error in the code under test when in fact the error is in the test program.

There are different test methods that can be used for verifying that the code works correctly. A white box test supplies a carefully chosen series of different sets of input data to make sure that all branches, paths and special cases in the code are tested. A black box test supplies a series of random input data and verifies that the output is correct. A very long series of random data from a good random number generator can sometimes find rarely occurring errors that the white box test hasn't found.

The test program may compare the output of the assembly code with the output of a C++ implementation to verify that it is correct. The test should cover all boundary cases and preferably also illegal input data to see if the code generates the correct error responses.
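
The comparison against a C++ implementation can be sketched as follows. All names here (sum_reference, sum_optimized, run_tests) are hypothetical. The function sum_optimized is written in C++ only so that the sketch compiles on its own; in a real project it would be the assembly function under test. The fixed cases play the role of the white box test and the random cases the role of the black box test.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Reference implementation in plain C++, kept as simple as possible.
std::uint32_t sum_reference(const std::uint32_t* a, std::size_t n) {
    std::uint32_t s = 0;
    for (std::size_t i = 0; i < n; i++) s += a[i];
    return s;
}

// Stand-in for the optimized routine (assumption: the real one is assembly).
std::uint32_t sum_optimized(const std::uint32_t* a, std::size_t n) {
    std::uint32_t s0 = 0, s1 = 0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) { s0 += a[i]; s1 += a[i + 1]; }  // unrolled by 2
    if (i < n) s0 += a[i];                                      // odd tail
    return s0 + s1;
}

// Runs fixed and random cases; returns the number of mismatches found.
int run_tests(unsigned seed, int random_cases) {
    assert(random_cases >= 0);
    using Vec = std::vector<std::uint32_t>;
    int errors = 0;

    // White box part: empty input, odd tail, and wrap-around of the sum.
    const std::vector<Vec> fixed = {
        Vec{}, Vec{7u}, Vec{1u, 2u}, Vec{1u, 2u, 3u},
        Vec{0xFFFFFFFFu, 1u}, Vec{0xFFFFFFFFu, 0xFFFFFFFFu, 2u}
    };
    for (const Vec& v : fixed) {
        if (sum_optimized(v.data(), v.size()) != sum_reference(v.data(), v.size())) errors++;
    }

    // Black box part: random lengths and random contents.
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> length(0, 1000);
    for (int c = 0; c < random_cases; c++) {
        Vec v(length(rng));
        for (auto& x : v) x = static_cast<std::uint32_t>(rng());
        if (sum_optimized(v.data(), v.size()) != sum_reference(v.data(), v.size())) errors++;
    }
    return errors;
}
```

A fixed seed makes a failing random case reproducible. A longer run with many seeds can be left for an overnight test.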

The speed test should supply a realistic set of input data. A significant part of the CPU time may be spent on branch mispredictions in code that contains a lot of branches. The amount of branch mispredictions depends on the degree of randomness in the input data. You may experiment with the degree of randomness in the input data to see how much it influences the computation time, and then decide on a realistic degree of randomness that matches a typical real application.
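
One possible way to experiment with the degree of randomness is sketched below. The function names and the parameter randomness are hypothetical, and count_large is only a stand-in for the code under test. The sketch uses std::chrono for timing, which is an assumption made for portability; a real speed test may use a more precise counter.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Builds n test values. Each value is random (0..255) with probability
// `randomness` and 0 otherwise, so 0.0 gives fully predictable data and
// 1.0 gives fully random data.
std::vector<std::uint8_t> make_input(std::size_t n, double randomness, unsigned seed) {
    assert(randomness >= 0.0 && randomness <= 1.0);
    std::mt19937 rng(seed);
    std::bernoulli_distribution pick_random(randomness);
    std::uniform_int_distribution<int> value(0, 255);
    std::vector<std::uint8_t> v(n, 0);
    for (auto& x : v) {
        if (pick_random(rng)) x = static_cast<std::uint8_t>(value(rng));
    }
    return v;
}

// Stand-in for the code under test: one data-dependent branch per element.
std::size_t count_large(const std::vector<std::uint8_t>& v) {
    std::size_t count = 0;
    for (std::uint8_t x : v) {
        if (x >= 128) count++;
    }
    return count;
}

// Times one pass in nanoseconds. The count is written through `result`
// so that the work being timed has a visible use.
long long time_one_pass(const std::vector<std::uint8_t>& v, std::size_t* result) {
    const auto start = std::chrono::steady_clock::now();
    *result = count_large(v);
    const auto stop = std::chrono::steady_clock::now();
    return static_cast<long long>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

A caller might time the same routine with randomness values such as 0.0, 0.5 and 1.0 and repeat each measurement several times, since a single pass is noisy. An optimizing compiler may replace the branch with branch-free code, in which case the timings will differ very little.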

An automatic test program that supplies a long stream of test data will typically find more errors and find them much faster than testing the code in the final application. A good test program will find most errors, but you cannot be sure that it finds all errors. It is possible that some errors show up only in combination with the final application.

2.3 Common coding pitfalls

The following list points out some of the most common programming errors in assembly code.

1. Forgetting to save registers. Some registers have callee-save status, for example EBX. These registers must be saved in the prolog of a function and restored in the epilog if they are modified inside the function. Remember that the order of POP instructions must be the opposite of the order of PUSH instructions. See page 28 for a list of callee-save registers.

2. Unmatched PUSH and POP instructions. The number of PUSH and POP instructions must be equal for all possible paths through a function. Example:

Example 2.1. Unmatched push/pop

    push ebx
    test ecx, ecx
    jz   Finished
    ...
    pop  ebx
Finished:                ; Wrong! Label should be before pop ebx
    ret

Here, the value of EBX that is pushed is not popped again if ECX is zero. The result is that the RET instruction will pop the former value of EBX and jump to a wrong address.

3. Using a register that is reserved for another purpose. Some compilers reserve the use of EBP or EBX for a frame pointer or other purpose. Using these registers for a different purpose in inline assembly can cause errors.

4. Stack-relative addressing after push. When addressing a variable relative to the stack pointer, you must take into account all preceding instructions that modify the stack pointer. Example:

Example 2.2. Stack-relative addressing

    mov  [esp+4], edi
    push ebp
    push ebx
    cmp  esi, [esp+4]    ; Probably wrong!

Here, the programmer probably intended to compare ESI with EDI, but the value of ESP that is used for addressing has been changed by the two PUSH instructions, so that ESI is in fact compared with EBP instead.

5. Confusing value and address of a variable. Example:

Example 2.3. Value versus address (MASM syntax)

.data
MyVariable DD 0                  ; Define variable
.code
    mov eax, MyVariable          ; Gets value of MyVariable
    mov eax, offset MyVariable   ; Gets address of MyVariable
    lea eax, MyVariable          ; Gets address of MyVariable
    mov ebx, [eax]               ; Gets value of MyVariable through pointer
    mov ebx, [100]               ; Gets the constant 100 despite brackets
    mov ebx, ds:[100]            ; Gets value from address 100

6. Ignoring calling conventions. It is important to observe the calling conventions for functions, such as the order of parameters, whether parameters are transferred on the stack or in registers, and whether the stack is cleaned up by the caller or the called function. See page 27.

7. Function name mangling. C++ code that calls an assembly function should declare it with extern "C" to avoid name mangling. Some systems require that an underscore (_) is put in front of the name in the assembly code. See page 30.
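
The C++ side of such a declaration might look like the sketch below. The function name AddTwoNumbers is hypothetical. The definition is given in C++ here only so that the sketch links by itself; it is an assumption standing in for an assembly file that exports the same symbol.

```cpp
#include <cassert>

// What the C++ caller declares. extern "C" disables C++ name mangling, so
// the linker looks for the plain C symbol (with a leading underscore on
// systems that add one).
extern "C" int AddTwoNumbers(int a, int b);

// Stand-in definition so that the sketch links by itself. In a real project
// this function would be defined in a separate assembly file instead.
extern "C" int AddTwoNumbers(int a, int b) {
    return a + b;
}

// Calls the function exactly as ordinary C++ code would.
void check_call() {
    assert(AddTwoNumbers(2, 3) == 5);
}
```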

8. Forgetting return. A function declaration must end with both RET and ENDP. Using one of these is not enough. The execution will continue in the code after the procedure if there is no RET.

9. Forgetting stack alignment. The stack pointer must point to an address divisible by 16 before any call statement, except in 16-bit systems and 32-bit Windows. See page 27.

10. Forgetting shadow space in 64-bit Windows. It is required to reserve 32 bytes of empty stack space before any function call in 64-bit Windows. See page 30.

11. Mixing calling conventions. The calling conventions in 64-bit Windows and 64-bit Linux are different. See page 27.

12. Forgetting to clean up floating point register stack. All floating point stack registers that are used by a function must be cleared, typically with FSTP ST(0), before the function returns, except for ST(0) if it is used for return value. It is necessary to keep track of exactly how many floating point registers are in use. If a function pushes more values on the floating point register stack than it pops, then the register stack will grow each time the function is called. An exception is generated when the stack is full. This exception may occur somewhere else in the program.

13. Forgetting to clear MMX state. A function that uses MMX registers must clear these with the EMMS instruction before any call or return.

14. Forgetting to clear YMM state. A function that uses YMM registers must clear these with the VZEROUPPER or VZEROALL instruction before any call or return.

15. Forgetting to clear direction flag. Any function that sets the direction flag with STD must clear it with CLD before any call or return.

16. Mixing signed and unsigned integers. Unsigned integers are compared using the JB and JA instructions. Signed integers are compared using the JL and JG instructions. Mixing signed and unsigned integers can have unintended consequences.
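
The difference can be seen from C++ as well, where the type of the operands decides which kind of comparison the compiler generates. This is a minimal sketch with hypothetical function names; it assumes a two's complement target, which covers all x86 platforms.

```cpp
#include <cassert>
#include <cstdint>

// Compares two 32-bit patterns as signed numbers (the JL / JG kind of
// comparison). The conversion to int32_t wraps modulo 2^32 on two's
// complement targets (guaranteed from C++20, implementation-defined before).
bool less_signed(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::int32_t>(a) < static_cast<std::int32_t>(b);
}

// Compares the same patterns as unsigned numbers (the JB / JA kind).
bool less_unsigned(std::uint32_t a, std::uint32_t b) {
    return a < b;
}

void check_compare() {
    const std::uint32_t minus_one = 0xFFFFFFFFu;  // -1 when read as signed
    assert(less_signed(minus_one, 1u));     // -1 < 1
    assert(!less_unsigned(minus_one, 1u));  // 4294967295 > 1
}
```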

17. Forgetting to scale array index. An array index must be multiplied by the size of one array element. For example mov eax, MyIntegerArray[ebx*4].
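
A C++ compiler does this scaling implicitly when indexing an array. The sketch below makes the byte arithmetic explicit in order to show what goes wrong without the scale factor. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Reads element `index` the way scaled addressing does:
// byte offset = index * size of one element.
std::int32_t load_scaled(const std::int32_t* base, std::size_t index) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(base);
    std::int32_t value;
    std::memcpy(&value, bytes + index * sizeof(std::int32_t), sizeof(value));
    return value;
}

// The pitfall: using the index directly as a byte offset. For index 1 this
// reads four bytes that straddle elements 0 and 1.
std::int32_t load_unscaled(const std::int32_t* base, std::size_t index) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(base);
    std::int32_t value;
    std::memcpy(&value, bytes + index, sizeof(value));
    return value;
}

void check_scaling() {
    const std::int32_t a[4] = {10, 20, 30, 40};
    assert(load_scaled(a, 2) == 30);    // same as a[2]
    assert(load_unscaled(a, 1) != 20);  // not a[1]
}
```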

18. Exceeding array bounds. An array with n elements is indexed from 0 to n - 1, not from 1 to n. A defective loop writing outside the bounds of an array can cause errors elsewhere in the program that are hard to find.

19. Loop with ECX = 0. A loop that ends with the LOOP instruction will repeat 2^32 times if ECX is zero. Be sure to check if ECX is zero before the loop.
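
The reason is that LOOP decrements the counter before testing it, so a counter of zero wraps around to the largest value. The following C++ model, with hypothetical function names, counts the iterations. The direct simulation uses an 8-bit counter so that it finishes instantly; the 32-bit case is computed by formula.

```cpp
#include <cassert>
#include <cstdint>

// Direct simulation of a loop ending in LOOP, with an 8-bit counter.
unsigned simulate_loop8(std::uint8_t counter) {
    unsigned iterations = 0;
    do {
        iterations++;                                      // loop body
        counter = static_cast<std::uint8_t>(counter - 1);  // decrement ...
    } while (counter != 0);                                // ... jump if not zero
    return iterations;
}

// Iteration count for a 32-bit counter: ecx - 1 wraps to 0xFFFFFFFF when
// ecx is 0, and the body has already run once before the first test.
std::uint64_t loop_iterations32(std::uint32_t ecx) {
    return static_cast<std::uint64_t>(static_cast<std::uint32_t>(ecx - 1u)) + 1u;
}

void check_loop_model() {
    assert(simulate_loop8(3) == 3);
    assert(simulate_loop8(0) == 256);                 // 2^8 with an 8-bit counter
    assert(loop_iterations32(0) == 4294967296ull);    // 2^32
}
```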

20. Reading carry flag after INC or DEC. The INC and DEC instructions do not change the carry flag. Do not use instructions that read the carry flag, such as ADC, SBB, JC, JBE, SETA, etc. after INC or DEC. Use ADD and SUB instead of INC and DEC to avoid this problem.
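
The effect can be illustrated with a simplified C++ model of the carry flag. This is only a model with hypothetical names, not real flag handling: add32 reports a carry the way ADD does, while inc32 passes the previous carry through unchanged the way INC does. Incrementing a 64-bit number held in two 32-bit halves then gives a wrong result when the low half is incremented with INC.

```cpp
#include <cassert>
#include <cstdint>

// Simplified model: a 32-bit result together with the carry flag.
struct Flagged {
    std::uint32_t value;
    bool carry;
};

// Models ADD: the carry flag is set when the 32-bit sum wraps around.
Flagged add32(std::uint32_t a, std::uint32_t b) {
    const std::uint32_t sum = static_cast<std::uint32_t>(a + b);
    return {sum, sum < a};
}

// Models INC: the value is incremented, but the carry flag keeps whatever
// value it had before the instruction.
Flagged inc32(std::uint32_t a, bool carry_before) {
    return {static_cast<std::uint32_t>(a + 1u), carry_before};
}

// Increments a 64-bit number held as two halves: ADD lo, 1 then ADC hi, 0.
std::uint64_t increment64_add(std::uint32_t lo, std::uint32_t hi) {
    const Flagged low = add32(lo, 1u);
    const std::uint32_t high = static_cast<std::uint32_t>(hi + (low.carry ? 1u : 0u));
    return (static_cast<std::uint64_t>(high) << 32) | low.value;
}

// Same, but with INC on the low half: ADC sees a stale carry flag.
std::uint64_t increment64_inc(std::uint32_t lo, std::uint32_t hi, bool stale_carry) {
    const Flagged low = inc32(lo, stale_carry);
    const std::uint32_t high = static_cast<std::uint32_t>(hi + (low.carry ? 1u : 0u));
    return (static_cast<std::uint64_t>(high) << 32) | low.value;
}

void check_carry_model() {
    assert(increment64_add(0xFFFFFFFFu, 0u) == 0x100000000ull);
    assert(increment64_inc(0xFFFFFFFFu, 0u, false) == 0u);  // carry lost
}
```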