Optimizing Subroutines in Assembly Language by Agner Fog

7 Using an assembler

There are certain limitations on what you can do with intrinsic functions and inline assembly. These limitations can be overcome by using an assembler. The principle is to write one or more assembly files containing the most critical functions of a program and to write the less critical parts in C++. The different modules are then linked together into an executable file.

The advantages of using an assembler are:

There are almost no limitations on what you can do.
You have complete control of all details of the final executable code.
All aspects of the code can be optimized, including function prolog and epilog, parameter transfer methods, register usage, data alignment, etc.
It is possible to make functions with multiple entries.
It is possible to make code that is compatible with multiple compilers and multiple operating systems (see page 52).
MASM and some other assemblers have a powerful macro language which opens up possibilities that are absent in most compiled high-level languages (see page 113).

The disadvantages of using an assembler are:

Assembly language is difficult to learn. There are many instructions to remember.
Coding in assembly takes more time than coding in a high-level language.
The assembly language syntax is not fully standardized.
Assembly code tends to become poorly structured and spaghetti-like. It takes a lot of discipline to make assembly code well structured and readable for others.
Assembly code is not portable between different microprocessor architectures.
The programmer must know all details of the calling conventions and obey these conventions in the code.
The assembler provides very little syntax checking. Many programming errors are not detected by the assembler.
There are many things that you can do wrong in assembly code and the errors can have serious consequences.
You may inadvertently mix VEX and non-VEX vector instructions. This incurs a large penalty (see chapter 13.6).
Errors in assembly code can be difficult to trace. For example, the error of not saving a register can cause a completely different part of the program to malfunction.
Assembly language is not suitable for making a complete program. Most of the program has to be made in a different programming language.

The best way to start if you want to make assembly code is to first make the entire program in C or C++. Optimize the program using the methods described in manual 1: "Optimizing software in C++". If any part of the program needs further optimization, then isolate this part in a separate module and translate the critical module from C++ to assembly. There is no need to do this translation manually, because most C++ compilers can produce assembly code. Turn on all relevant optimization options in the compiler when translating the C++ code to assembly. The assembly code thus produced by the compiler is a good starting point for further optimization, and it is sure to have the calling conventions right. (The output produced by 64-bit compilers for Windows is not yet fully compatible with any assembler.)

Inspect the assembly code produced by the compiler to see if there are any possibilities for further optimization. Sometimes compilers are very smart and produce code that is better optimized than what an average assembly programmer can do. In other cases, compilers are incredibly stupid and do things in very awkward and inefficient ways. It is in the latter case that it is justified to spend time on assembly coding.

Most IDEs (Integrated Development Environments) provide a way of including assembly modules in a C++ project. For example, in Microsoft Visual Studio you can define a "custom build step" for an assembly source file. The specification for the custom build step may, for example, look like this:

Command line: ml /c /Cx /Zi /coff $(InputName).asm
Outputs: $(InputName).obj

Alternatively, you may use a makefile (see page 51) or a batch file.

The C++ files that call the functions in the assembly module should include a header file (*.h) containing function prototypes for the assembly functions. It is recommended to add extern "C" to the function prototypes to remove the compiler-specific name mangling codes from the function names.
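As a minimal sketch of such a header, the following declares one assembly function with extern "C". The header name asmfuncs.h and the function SumArray are hypothetical and chosen only for illustration. A C++ stand-in definition is included so that the sketch can be compiled and tested before the assembly module exists; it is an assumption of this sketch, not part of the method described above.

```cpp
// asmfuncs.h (hypothetical header name)
// Prototype for a function implemented in a separate assembly module.
// extern "C" suppresses C++ name mangling, so the assembly module only has
// to export the plain C symbol for SumArray.
extern "C" int SumArray(const int* data, int count);

// C++ stand-in with the same interface (assumption for illustration).
// In a real project this definition is removed and the symbol is supplied
// by the assembled object file instead.
extern "C" int SumArray(const int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += data[i];
    }
    return sum;
}
```

A C++ file that includes the header calls SumArray like any other function. The linker resolves the name to whichever object file defines it.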

Examples of assembly functions for different platforms are provided in paragraph 4.5, page 31ff.

7.1 Static link libraries

It is convenient to collect assembled code from multiple assembly files into a function library. The advantages of organizing assembly modules into function libraries are:

The library can contain many functions and modules. The linker will automatically pick the modules that are needed in a particular project and leave out the rest so that no superfluous code is added to the project.
A function library is easy and convenient to include in a C++ project. All C++ compilers and IDEs support function libraries.
A function library is reusable. The extra time spent on coding and testing a function in assembly language is better justified if the code can be reused in different projects.
Making a reusable function library forces you to make well-tested and well-documented code with a well-defined functionality and a well-defined interface to the calling program.
A reusable function library with a general functionality is easier to maintain and verify than an application-specific assembly code with a less well-defined responsibility.
A function library can be used by other programmers who have no knowledge of assembly language.

A static link function library for Windows is built by using the library manager (e.g. lib.exe) to combine one or more *.obj files into a *.lib file.

A static link function library for Linux is built by using the archive manager (ar) to combine one or more *.o files into an *.a file.

A function library must be supplemented by a header file (*.h) containing function prototypes for the functions in the library. This header file is included in the C++ files that call the library functions (e.g. #include "mylibrary.h").

It is convenient to use a makefile (see page 51) for managing the commands necessary for building and updating a function library.

7.2 Dynamic link libraries

The difference between static linking and dynamic linking is that the static link library is linked into the executable program file so that the executable file contains a copy of the necessary parts of the library. A dynamic link library (*.dll in Windows, *.so in Linux) is distributed as a separate file which is loaded at runtime by the executable file.

The advantages of dynamic link libraries are:

Only one instance of the dynamic link library is loaded into memory when multiple programs running simultaneously use the same library.
The dynamic link library can be updated without modifying the executable file.
A dynamic link library can be called from most programming languages, such as Pascal, C#, and Visual Basic. (Calling from Java is possible but difficult.)

The disadvantages of dynamic link libraries are:

The whole library is loaded into memory even when only a small part of it is needed.
Loading a dynamic link library takes extra time when the executable program file is loaded.
Calling a function in a dynamic link library is less efficient than calling a function in a static library because of extra call overhead and because of less efficient code cache use.
The dynamic link library must be distributed together with the executable file.
Multiple programs installed on the same computer must use the same version of a dynamic link library. This can cause many compatibility problems.

A DLL for Windows is made with the Microsoft linker (link.exe). The linker must be supplied with one or more .obj or .lib files containing the necessary library functions and a DllEntry function, which just returns 1. A module definition file (*.def) is also needed. Note that DLL functions in 32-bit Windows use the __stdcall calling convention, while static link library functions use the __cdecl calling convention by default. Example source code can be found at www.agner.org/random/randoma.zip. Further instructions can be found in the Microsoft compiler documentation and in Iczelion's tutorials at win32asm.cjb.net.

I have no experience in making dynamic link libraries (shared objects) for Linux.

7.3 Libraries in source code form

A problem with subroutine libraries in binary form is that the compiler cannot optimize the function call. This problem can be solved by supplying the library functions as C++ source code.

If the library functions are supplied as C++ source code then the compiler can optimize away the function calling overhead by inlining the function. It can optimize register allocation across the function. It can do constant propagation. It can move invariant code when the function is called inside a loop, etc.

The compiler can only do these optimizations with C++ source code, not with assembly code. The code may contain inline assembly or intrinsic function calls. The compiler can do further optimizations if the code uses intrinsic function calls, but not if it uses inline assembly. Note that different compilers will not optimize the code equally well.

If the compiler uses whole program optimization then the library functions can simply be supplied as a C++ source file. If not, then the library code must be included with #include statements in order to enable optimization across the function calls. A function defined in an included file should be declared static and/or inline in order to avoid clashes between multiple instances of the function.
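The following sketch illustrates the #include approach. The file name mylib_inline.h and the functions ClampToByte and SumClamped are hypothetical names invented for this example. The point is only that the function is declared static inline, so that every source file including it gets its own private copy and the compiler sees the function body at the call site.

```cpp
// mylib_inline.h (hypothetical): library function supplied as source code.
// 'static inline' gives each translation unit that includes this file its
// own private copy, so multiple inclusions do not clash at link time.
static inline int ClampToByte(int x) {
    if (x < 0) return 0;
    if (x > 255) return 255;
    return x;
}

// Caller (normally in a .cpp file that includes the header above).
// Because the definition is visible here, the compiler is free to inline
// ClampToByte into the loop and optimize across the call.
static int SumClamped(const int* data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += ClampToByte(data[i]);
    }
    return sum;
}
```

Whether the call is actually inlined depends on the compiler and its optimization settings.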

Some compilers with whole program optimization features can produce half-compiled object files that allow further optimization at the link stage. Unfortunately, the format of such files is not standardized, not even between different versions of the same compiler. It is possible that future compiler technology will allow a standardized format for half-compiled code. This format should, as a minimum, specify which registers are used for parameter transfer and which registers are modified by each function. It should preferably also allow register allocation at link time, constant propagation, common subexpression elimination across functions, and invariant code motion.

As long as such facilities are not available, we may consider using the alternative strategy of putting the entire innermost loop into an optimized library function rather than calling the library function from inside a C++ loop. This solution is used in Intel's Math Kernel Library (www.intel.com). If, for example, you need to calculate a thousand logarithms, then you can supply an array of a thousand arguments to a vector logarithm function in the library and receive an array of a thousand results back from the library. This has the disadvantage that intermediate results have to be stored in arrays rather than transferred in registers.
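The interface of such a vector function might look like the sketch below. VectorLog is a hypothetical name and is not the interface of any particular library; std::log stands in for what would be an optimized inner loop in a real library.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical vector-style library function: the whole loop lives inside
// the library, so the call overhead is paid once per array rather than
// once per element. std::log is only a stand-in for an optimized loop body.
void VectorLog(const double* in, double* out, std::size_t count) {
    for (std::size_t i = 0; i < count; i++) {
        out[i] = std::log(in[i]);
    }
}
```

The caller fills an input array, makes a single call, and reads the results from the output array. This is where the cost of storing intermediate results in memory comes from.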

7.4 Making classes in assembly

Classes are coded as structures in assembly and member functions are coded as functions that receive a pointer to the class/structure as a parameter.
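Seen from the C++ side, this correspondence can be sketched as follows. The names MyListData and MyListData_Sum are hypothetical; the data layout is modeled on the MyList class used in the examples of this chapter. The static_assert lines check the member offsets that an assembly structure definition would have to reproduce, assuming no padding between the members.

```cpp
#include <cstddef>

// Plain-data view of a class with an int length and an int buffer[100].
// (MyListData and MyListData_Sum are hypothetical names for illustration.)
struct MyListData {
    int length;
    int buffer[100];
};

// The assembly structure definition must reproduce these offsets exactly.
static_assert(offsetof(MyListData, length) == 0, "length is at offset 0");
static_assert(offsetof(MyListData, buffer) == sizeof(int),
              "buffer follows length without padding");

// A member function as the assembly module sees it: an ordinary function
// whose first parameter is the 'this' pointer.
int MyListData_Sum(const MyListData* thisP) {
    int sum = 0;
    for (int i = 0; i < thisP->length; i++) {
        sum += thisP->buffer[i];
    }
    return sum;
}
```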

It is not possible to apply the extern "C" declaration to a member function in C++ because extern "C" refers to the calling conventions of the C language, which does not have classes and member functions. The most logical solution is to use the mangled function name. Returning to examples 6.2a and 6.2b (page 40), we can write the member function int MyList::Sum() with a mangled name as follows:

; Example 7.1a (Example 6.2b translated to stand alone assembly)
; Member function, 32-bit Windows, Microsoft compiler

; Define structure corresponding to class MyList:
MyList  STRUC
length_ DD   ?                      ; length is a reserved name. Use length_
buffer  DD   100 DUP (?)            ; int buffer[100];
MyList  ENDS

; int MyList::Sum()
; Mangled function name compatible with Microsoft compiler (32 bit):
?Sum@MyList@@QAEHXZ PROC near
; Microsoft compiler puts 'this' in ECX
assume  ecx: ptr MyList             ; ecx points to structure MyList
        xor   eax, eax              ; sum = 0
        xor   edx, edx              ; Loop index i = 0
        cmp   [ecx].length_, 0      ; this->length
        je    L9                    ; Skip if length = 0
L1:     add   eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add   edx, 1                ; i++
        cmp   edx, [ecx].length_    ; while (i < length)
        jb    L1                    ; Loop
L9:     ret                         ; Return value is in eax
?Sum@MyList@@QAEHXZ ENDP            ; End of int MyList::Sum()
assume  ecx: nothing                ; ecx no longer points to anything

The mangled function name ?Sum@MyList@@QAEHXZ is compiler specific. Other compilers may have other name-mangling codes. Furthermore, other compilers may put 'this' on the stack rather than in a register. These incompatibilities can be solved by using a friend function rather than a member function. This solves the problem that a member function cannot be declared extern "C". The declaration in the C++ header file must then be changed to the following:

// Example 7.1b. Member function changed to friend function:

// An incomplete class declaration is needed here:
class MyList;

// Function prototype for friend function with 'this' as parameter:
extern "C" int MyList_Sum(MyList * ThisP);

// Class declaration:
class MyList {
protected:
    int length;                     // Data members:
    int buffer[100];
public:
    MyList();                       // Constructor
    void AttItem(int item);         // Add item to list
    // Make MyList_Sum a friend:
    friend int MyList_Sum(MyList * ThisP);
    // Translate Sum to MyList_Sum by inline call:
    int Sum() {return MyList_Sum(this);}
};

The prototype for the friend function must come before the class declaration because some compilers do not allow extern "C" inside the class declaration. An incomplete class declaration is needed because the friend function needs a pointer to the class.
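Before the assembly version is written, the interface in Example 7.1b can be exercised with a C++ stand-in for MyList_Sum, as in the sketch below. The bodies of the constructor and of AttItem, and the stand-in definition of MyList_Sum, are assumptions made here so that the sketch compiles on its own; they are not taken from the original example.

```cpp
class MyList;                                // Incomplete class declaration
extern "C" int MyList_Sum(MyList * ThisP);   // Prototype before the class

class MyList {
protected:
    int length;
    int buffer[100];
public:
    // Assumed constructor body: start with an empty list
    MyList() : length(0) {}
    // Assumed body: append the item if there is room
    void AttItem(int item) {
        if (length < 100) buffer[length++] = item;
    }
    // Friend declaration refers to the extern "C" function declared above
    friend int MyList_Sum(MyList * ThisP);
    // Member function forwards to the friend function
    int Sum() { return MyList_Sum(this); }
};

// C++ stand-in for the assembly implementation (assumption for testing).
// It is removed once the assembly module supplies the symbol.
extern "C" int MyList_Sum(MyList * ThisP) {
    int sum = 0;
    for (int i = 0; i < ThisP->length; i++) {
        sum += ThisP->buffer[i];
    }
    return sum;
}
```

Keeping such a stand-in around gives a reference result to compare against while the assembly version is being debugged.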

The above declarations will make the compiler replace any call to MyList::Sum by a call to MyList_Sum because the latter function is inlined into the former. The assembly implementation of MyList_Sum does not need a mangled name:

; Example 7.1c. Friend function, 32-bit mode

; Define structure corresponding to class MyList:
MyList  STRUC
length_ DD   ?                      ; length is a reserved name. Use length_
buffer  DD   100 DUP (?)            ; int buffer[100];
MyList  ENDS

; extern "C" friend int MyList_Sum()
_MyList_Sum PROC near
; Parameter ThisP is on stack
        mov   ecx, [esp+4]          ; ThisP