There are certain limitations on what you can do with intrinsic functions and inline assembly. These limitations can be overcome by using an assembler. The principle is to write one or more assembly files containing the most critical functions of a program and to write the less critical parts in C++. The different modules are then linked together into an executable file.
The advantages of using an assembler are:
The disadvantages of using an assembler are:
The best way to start if you want to write assembly code is to first write the entire program in C or C++. Optimize the program using the methods described in manual 1: "Optimizing software in C++". If any part of the program needs further optimization then isolate this part in a separate module. Then translate the critical module from C++ to assembly. There is no need to do this translation manually. Most C++ compilers can produce assembly code. Turn on all relevant optimization options in the compiler when translating the C++ code to assembly. The assembly code thus produced by the compiler is a good starting point for further optimization. The compiler-generated assembly code is sure to have the calling conventions right. (The output produced by 64-bit compilers for Windows is not yet fully compatible with any assembler.)
Inspect the assembly code produced by the compiler to see if there are any possibilities for further optimization. Sometimes compilers are very smart and produce code that is better optimized than what an average assembly programmer can do. In other cases, compilers are incredibly stupid and do things in very awkward and inefficient ways. It is in the latter case that it is justified to spend time on assembly coding.
Most IDEs (Integrated Development Environments) provide a way of including assembly modules in a C++ project. For example, in Microsoft Visual Studio you can define a "custom build step" for an assembly source file. The specification for the custom build step may, for example, look like this:
Command line: ml /c /Cx /Zi /coff $(InputName).asm
Outputs: $(InputName).obj
Alternatively, you may use a makefile (see page 51) or a batch file.
The C++ files that call the functions in the assembly module should include a header file (*.h) containing function prototypes for the assembly functions. It is recommended to add extern "C" to the function prototypes to remove the compiler-specific name mangling codes from the function names.
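The header file described above might be sketched as follows. The function name SumArray and its parameters are hypothetical and serve only as illustration. The C++ definition at the end is a stand-in for the assembly module so that the sketch can be compiled and linked on its own; in a real project that definition would be replaced by the assembled object file.

```cpp
// Contents of a hypothetical header file, e.g. "myasm.h".
// extern "C" gives the function an unmangled name that the
// assembly module can define without compiler-specific codes.
extern "C" int SumArray(const int * data, int count);

// Stand-in for the assembly implementation (assumption: the real
// function would live in a separate *.asm file, not in C++).
extern "C" int SumArray(const int * data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += data[i];                // sum += data[i]
    }
    return sum;
}
```

A C++ file that includes the header can then call SumArray(data, count) like any other function.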
Examples of assembly functions for different platforms are provided in paragraph 4.5, page 31ff.
It is convenient to collect assembled code from multiple assembly files into a function library. The advantages of organizing assembly modules into function libraries are:
A static link function library for Windows is built by using the library manager (e.g. lib.exe) to combine one or more *.obj files into a *.lib file.
A static link function library for Linux is built by using the archive manager (ar) to combine one or more *.o files into an *.a file.
A function library must be supplemented by a header file (*.h) containing function prototypes for the functions in the library. This header file is included in the C++ files that call the library functions (e.g. #include "mylibrary.h").
It is convenient to use a makefile (see page 51) for managing the commands necessary for building and updating a function library.
The difference between static linking and dynamic linking is that the static link library is linked into the executable program file so that the executable file contains a copy of the necessary parts of the library. A dynamic link library (*.dll in Windows, *.so in Linux) is distributed as a separate file which is loaded at runtime by the executable file.
The advantages of dynamic link libraries are:
The disadvantages of dynamic link libraries are:
A DLL for Windows is made with the Microsoft linker (link.exe). The linker must be supplied with one or more .obj or .lib files containing the necessary library functions and a DllEntry function, which just returns 1. A module definition file (*.def) is also needed. Note that DLL functions in 32-bit Windows use the __stdcall calling convention, while static link library functions use the __cdecl calling convention by default. Example source code can be found at www.agner.org/random/randoma.zip. Further instructions can be found in the Microsoft compiler documentation and in Iczelion's tutorials at win32asm.cjb.net.
I have no experience in making dynamic link libraries (shared objects) for Linux.
7.3 Libraries in source code form
A problem with subroutine libraries in binary form is that the compiler cannot optimize the function call. This problem can be solved by supplying the library functions as C++ source code.
If the library functions are supplied as C++ source code then the compiler can optimize away the function calling overhead by inlining the function. It can optimize register allocation across the function. It can do constant propagation. It can move invariant code when the function is called inside a loop, etc.
The compiler can only do these optimizations with C++ source code, not with assembly code. The code may contain inline assembly or intrinsic function calls. The compiler can do further optimizations if the code uses intrinsic function calls, but not if it uses inline assembly. Note that different compilers will not optimize the code equally well.
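As a rough illustration of library code written with intrinsic functions, the following sketch adds four 32-bit integers at a time. The function name Add4 and the macro HAVE_SSE2 are hypothetical. The sketch assumes an x86 compiler that provides the SSE2 header emmintrin.h, and falls back to a plain loop otherwise. Because the function is ordinary C++ source, the compiler can inline it and allocate registers across the call, which it cannot do for a separately assembled function.

```cpp
#include <cstdint>

// Assumption: these predefined macros indicate SSE2 support on
// common x86 compilers; other targets use the scalar fallback.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

// r[0..3] = a[0..3] + b[0..3]
static inline void Add4(const std::int32_t * a, const std::int32_t * b, std::int32_t * r) {
#ifdef HAVE_SSE2
    // Unaligned loads, one packed 32-bit add, unaligned store
    __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i *>(a));
    __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i *>(b));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(r), _mm_add_epi32(va, vb));
#else
    for (int i = 0; i < 4; i++) {
        r[i] = a[i] + b[i];
    }
#endif
}
```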
If the compiler uses whole program optimization then the library functions can simply be supplied as a C++ source file. If not, then the library code must be included with #include statements in order to enable optimization across the function calls. A function defined in an included file should be declared static and/or inline in order to avoid clashes between multiple instances of the function.
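A minimal sketch of a library function supplied through an included file is shown here. The names Square and SumOfSquares are hypothetical. The first function stands for the contents of an included library file; the second stands for user code that calls it inside a loop.

```cpp
// Contents of a hypothetical included file, e.g. "mylib_inline.h".
// static inline avoids clashes if several modules include the file.
static inline int Square(int x) {
    return x * x;
}

// User code. The definition of Square is visible here, so the
// compiler is free to inline it and optimize across the call.
int SumOfSquares(const int * data, int count) {
    int sum = 0;
    for (int i = 0; i < count; i++) {
        sum += Square(data[i]);
    }
    return sum;
}
```

Whether the call is actually inlined depends on the compiler and its optimization settings.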
Some compilers with whole program optimization features can produce half-compiled object files that allow further optimization at the link stage. Unfortunately, the format of such files is not standardized - not even between different versions of the same compiler. It is possible that future compiler technology will allow a standardized format for half-compiled code. This format should, as a minimum, specify which registers are used for parameter transfer and which registers are modified by each function. It should preferably also allow register allocation at link time, constant propagation, common subexpression elimination across functions, and invariant code motion.
As long as such facilities are not available, we may consider using the alternative strategy of putting the entire innermost loop into an optimized library function rather than calling the library function from inside a C++ loop. This solution is used in Intel's Math Kernel Library (www.intel.com). If, for example, you need to calculate a thousand logarithms then you can supply an array of a thousand arguments to a vector logarithm function in the library and receive an array of a thousand results back from the library. This has the disadvantage that intermediate results have to be stored in arrays rather than transferred in registers.
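The interface of such a vector function might look like the following sketch. The name VectorLog is hypothetical and is not the interface of Intel's library. The body simply calls std::log in a loop, whereas a real library would place an optimized loop here.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical vector function: out[i] = log(in[i]) for i = 0 .. n-1.
// The whole loop is inside the library function, so the function
// call overhead is paid once per array, not once per element.
void VectorLog(const double * in, double * out, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = std::log(in[i]);
    }
}
```

The caller fills an input array, calls VectorLog once, and reads the results from the output array.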
7.4 Making classes in assembly
Classes are coded as structures in assembly and member functions are coded as functions that receive a pointer to the class/structure as a parameter.
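The correspondence can be sketched in C++ terms as follows. The names MyListData and MyListData_Sum are hypothetical. The sketch assumes a 32-bit int and a class with no virtual functions, so that the data members are laid out in declaration order starting at offset 0, which is the layout that an assembly structure definition has to reproduce.

```cpp
#include <cstddef>

// Plain structure with the same data members as a simple list class
struct MyListData {
    int length;                        // offset 0
    int buffer[100];                   // offset 4
};

// Layout checks (assumption: 4-byte int, no padding between members)
static_assert(sizeof(int) == 4, "sketch assumes 32-bit int");
static_assert(offsetof(MyListData, length) == 0, "length at offset 0");
static_assert(offsetof(MyListData, buffer) == 4, "buffer at offset 4");
static_assert(sizeof(MyListData) == 404, "1 + 100 integers");

// A member function seen as an ordinary function that receives
// the 'this' pointer as an explicit parameter
int MyListData_Sum(const MyListData * p) {
    int sum = 0;
    for (int i = 0; i < p->length; i++) {
        sum += p->buffer[i];
    }
    return sum;
}
```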
It is not possible to apply the extern "C" declaration to a member function in C++ because extern "C" refers to the calling conventions of the C language, which doesn't have classes and member functions. The most logical solution is to use the mangled function name. Returning to examples 6.2a and 6.2b (page 40), we can write the member function int MyList::Sum() with a mangled name as follows:
; Example 7.1a (Example 6.2b translated to stand alone assembly)
; Member function, 32-bit Windows, Microsoft compiler
; Define structure corresponding to class MyList:
MyList  STRUC
length_ DD      ?                        ; length is a reserved name. Use length_
buffer  DD      100 DUP (?)              ; int buffer[100];
MyList  ENDS
; int MyList::Sum()
; Mangled function name compatible with Microsoft compiler (32 bit):
?Sum@MyList@@QAEHXZ PROC near
; Microsoft compiler puts 'this' in ECX
        assume  ecx: ptr MyList          ; ecx points to structure MyList
        xor     eax, eax                 ; sum = 0
        xor     edx, edx                 ; Loop index i = 0
        cmp     [ecx].length_, 0         ; this->length
        je      L9                       ; Skip if length = 0
L1:     add     eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add     edx, 1                   ; i++
        cmp     edx, [ecx].length_       ; while (i < length)
        jb      L1                       ; Loop
L9:     ret                              ; Return value is in eax
?Sum@MyList@@QAEHXZ ENDP                 ; End of int MyList::Sum()
        assume  ecx: nothing             ; ecx no longer points to anything
The mangled function name ?Sum@MyList@@QAEHXZ is compiler specific. Other compilers may have other name-mangling codes. Furthermore, other compilers may put 'this' on the stack rather than in a register. These incompatibilities can be solved by using a friend function rather than a member function. This solves the problem that a member function cannot be declared extern "C". The declaration in the C++ header file must then be changed to the following:
// Example 7.1b. Member function changed to friend function:
// An incomplete class declaration is needed here:
class MyList;
// Function prototype for friend function with 'this' as parameter:
extern "C" int MyList_Sum(MyList * ThisP);
// Class declaration:
class MyList {
protected:
   int length;                          // Data members:
   int buffer[100];
public:
   MyList();                            // Constructor
   void AttItem(int item);              // Add item to list
   // Make MyList_Sum a friend:
   friend int MyList_Sum(MyList * ThisP);
   // Translate Sum to MyList_Sum by inline call:
   int Sum() {return MyList_Sum(this);}
};
The prototype for the friend function must come before the class declaration because some compilers do not allow extern "C" inside the class declaration. An incomplete class declaration is needed because the friend function needs a pointer to the class.
The above declarations will make the compiler replace any call to MyList::Sum by a call to MyList_Sum because MyList::Sum is an inline function that does nothing but call MyList_Sum. The assembly implementation of MyList_Sum does not need a mangled name:
; Example 7.1c. Friend function, 32-bit mode
; Define structure corresponding to class MyList:
MyList  STRUC
length_ DD      ?                        ; length is a reserved name. Use length_
buffer  DD      100 DUP (?)              ; int buffer[100];
MyList  ENDS
; extern "C" friend int MyList_Sum()
_MyList_Sum PROC near
; Parameter ThisP is on stack
        mov     ecx, [esp+4]             ; ThisP
        assume  ecx: ptr MyList          ; ecx points to structure MyList
        xor     eax, eax                 ; sum = 0
        xor     edx, edx                 ; Loop index i = 0
        cmp     [ecx].length_, 0         ; this->length
        je      L9                       ; Skip if length = 0
L1:     add     eax, [ecx].buffer[edx*4] ; sum += buffer[i]
        add     edx, 1                   ; i++
        cmp     edx, [ecx].length_       ; while (i < length)
        jb      L1                       ; Loop
L9:     ret                              ; Return value is in eax
_MyList_Sum ENDP                         ; End of int MyList_Sum()
        assume  ecx: nothing             ; ecx no longer points to anything
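The friend function arrangement can be tried out without an assembler by temporarily supplying MyList_Sum in C++, as in the sketch below. The bodies of the constructor and of AttItem are assumptions, since only their declarations are shown in the header. The C++ definition of MyList_Sum is a stand-in that would be removed once the assembly module is linked in.

```cpp
class MyList;                                  // Incomplete class declaration
extern "C" int MyList_Sum(MyList * ThisP);     // Prototype with C linkage

class MyList {
protected:
    int length;                                // Number of items in list
    int buffer[100];                           // Stored items
public:
    MyList() : length(0) {}                    // Assumed: start with empty list
    void AttItem(int item) {                   // Assumed: append if room left
        if (length < 100) buffer[length++] = item;
    }
    friend int MyList_Sum(MyList * ThisP);     // Friend may read protected data
    int Sum() { return MyList_Sum(this); }     // Inline forwarding call
};

// C++ stand-in for the assembly function (same logic as the loop
// in the assembly listing: sum buffer[0 .. length-1])
extern "C" int MyList_Sum(MyList * ThisP) {
    int sum = 0;
    for (int i = 0; i < ThisP->length; i++) {
        sum += ThisP->buffer[i];
    }
    return sum;
}
```

Code that uses the class calls list.Sum() as usual and does not need to know that the work is done by an extern "C" function.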
A thread-safe or reentrant function is a function that works correctly when it is called simultaneously from more than one thread. Multithreading is used for taking advantage of computer