CHAPTER
NUMEXPR ON NUMPY VECTORS
numexpr is a wonderfully simple library - you wrap your numpy expression in numexpr.evaluate("<your code>") and often it will simply run faster. In the example below I've commented out the numpy vector code from the section above and replaced it with the numexpr variant:
import numexpr
...

def calculate_z_numpy(q, maxiter, z):
    output = np.resize(np.array(0,), q.shape)
    for iteration in range(maxiter):
        #z = z*z + q
        z = numexpr.evaluate("z*z+q")
        #done = np.greater(abs(z), 2.0)
        done = numexpr.evaluate("abs(z).real > 2.0")
        #q = np.where(done, 0+0j, q)
        q = numexpr.evaluate("where(done, 0+0j, q)")
        #z = np.where(done, 0+0j, z)
        z = numexpr.evaluate("where(done, 0+0j, z)")
        #output = np.where(done, iteration, output)
        output = numexpr.evaluate("where(done, iteration, output)")
    return output
I've replaced np.greater with the > operator; np.greater was just another way of achieving the same comparison earlier, but numexpr doesn't let us refer to numpy functions, only to the functions it provides itself.
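To see that the two forms really are interchangeable on the numpy side, here is a minimal check (the sample values are illustrative only) that np.greater and the > operator produce identical boolean masks:

```python
import numpy as np

a = np.array([0.5, 1.5, 2.5, 3.5])

mask_fn = np.greater(a, 2.0)  # functional form (numpy only)
mask_op = a > 2.0             # operator form, also valid inside numexpr strings

# Both produce the same boolean mask
assert (mask_fn == mask_op).all()
```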
You can only use numexpr on numpy code, and it only makes sense to use it on vector operations. Behind the scenes numexpr breaks each operation down into smaller segments that fit into the CPU's cache, and it will also auto-vectorise across the available maths units on the CPU if possible.
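As a quick sanity check (assuming numexpr is installed; the small arrays here are illustrative only), one iteration of the z = z*z + q update gives identical results whether computed with plain numpy or via numexpr.evaluate:

```python
import numpy as np
import numexpr

# Tiny complex vectors standing in for the full-size q and z arrays
q = np.array([0.1 + 0.2j, -0.3 + 0.4j, 0.0 + 0.0j])
z = np.array([0.0 + 0.0j, 0.5 - 0.5j, 1.0 + 1.0j])

z_numpy = z * z + q                     # plain numpy
z_numexpr = numexpr.evaluate("z*z + q") # numexpr picks up z and q from the local frame

# Same numbers either way - only the execution strategy differs
assert np.allclose(z_numpy, z_numexpr)
```

The speed difference only becomes visible on much larger vectors, where numexpr's cache-sized chunking pays off.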
On my dual-core MacBook I see a 2-3× speed-up. If I had an Intel MKL version of numexpr (warning - this needs a commercial licence from Intel or Enthought) then I might see an even greater speed-up.
numexpr can give us some useful system information:
>>> numexpr.print_versions()
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Numexpr version: 1.4.2
NumPy version: 1.5.1
Python version: 2.7.1 (r271:86882M, Nov 30 2010, 09:39:13)
[GCC 4.0.1 (Apple Inc. build 5494)]
Platform: darwin-i386
AMD/Intel CPU? False
VML available? False
Detected cores: 2
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
It can also give us some very low-level information about our CPU:
>>> numexpr.cpu.info
{'arch': 'i386',
 'machine': 'i486',
 'sysctl_hw': {'hw.availcpu': '2',
               'hw.busfrequency': '1064000000',
               'hw.byteorder': '1234',
               'hw.cachelinesize': '64',
               'hw.cpufrequency': '2000000000',
               'hw.epoch': '0',
               'hw.l1dcachesize': '32768',
               'hw.l1icachesize': '32768',
               'hw.l2cachesize': '3145728',
               'hw.l2settings': '1',
               'hw.machine': 'i386',
               'hw.memsize': '4294967296',
               'hw.model': 'MacBook5,2',
               'hw.ncpu': '2',
               'hw.pagesize': '4096',
               'hw.physmem': '2147483648',
               'hw.tbfrequency': '1000000000',
               'hw.usermem': '1841561600',
               'hw.vectorunit': '1'}}
We can also use it to pre-compile expressions (so they don't have to be compiled dynamically in each loop - this can save time if you have a very fast loop) and then look at the disassembly (though I doubt you'd do anything with the disassembled output):
>>> expr = numexpr.NumExpr('avector > 2.0')  # pre-compile an expression
>>> numexpr.disassemble(expr)
[('gt_bdd', 'r0', 'r1[output]', 'c2[2.0]')]
>>> somenbrs = np.arange(10)  # -> array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> expr.run(somenbrs)
array([False, False, False, True, True, True, True, True, True, True], dtype=bool)
You might choose to pre-compile an expression in a fast loop if the overhead of compiling (as reported by kernprof.py) reduces the benefit of the speed-ups achieved.
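A sketch of that pattern (assuming numexpr is installed; the variable name avector and the loop body are illustrative only) - compile once outside the loop, then call run() repeatedly, avoiding the per-call parsing that numexpr.evaluate would otherwise do:

```python
import numpy as np
import numexpr

# Compile the expression once, outside the hot loop
expr = numexpr.NumExpr('avector > 2.0')

for _ in range(3):  # stands in for a very fast loop over fresh data
    avector = np.arange(10.0)
    done = expr.run(avector)  # no re-parsing/re-compiling on each call

# Values 3.0 .. 9.0 exceed 2.0, so seven elements are True
assert done.sum() == 7
```

Whether this wins in practice depends on how small the per-iteration work is relative to the compile overhead, which is exactly what the profiler output tells you.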