High Performance Python (from Training at EuroPython 2011) by Ian Ozsvald - HTML preview

CHAPTER FIFTEEN

SHEDSKIN

ShedSkin automatically annotates your Python module and compiles it down to C++. It works in a more restricted set of circumstances than Cython, but when it works it Just Works and requires very little effort on your part. One of the included examples is a Commodore 64 emulator that jumps from a few frames per second under CPython when demoing a game to over 50 FPS once the main emulation is compiled by ShedSkin and used as an extension module alongside pyGTK running in CPython.

Its main limitations are:

  • prefers short modules (less than 3,000 lines of code - this is still rather a lot for a bottleneck routine!)
  • only uses built-in modules (e.g. you can’t import numpy or PIL into a ShedSkin module)

The release announcement for v0.8 includes a scalability graph (http://shed-skin.blogspot.com/2011/06/shed-skin-08-programming-language.html) showing compile times for longer Python modules. ShedSkin can output either a compiled executable or an importable module.

You run it using shedskin your_module.py. In our case, move pure_python_2.py into a new directory (shedskin_pure_python\shedskin_pure_python.py). We could make a new module (as we did for the Cython example) but for now we’ll just use the one Python file.

shedskin shedskin_pure_python.py

make

After this you’ll have shedskin_pure_python which is an executable. Try it and see what sort of speed-up you get.
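One way to measure the speed-up is to time both versions from a small driver script. The following is a minimal sketch rather than part of the original material: the helper names time_command and speedup are made up for illustration, and it assumes that the executable and the original script sit in the current directory and take no arguments (add any arguments your script expects).

```python
import os
import subprocess
import sys
import time


def time_command(cmd, repeats=3):
    """Return the best wall-clock time (seconds) over `repeats` runs of cmd."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # Discard the program's output so printing does not clutter the timing run
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


def speedup(baseline_seconds, new_seconds):
    """How many times faster the new run is than the baseline."""
    return baseline_seconds / new_seconds


if __name__ == "__main__":
    # Assumed file names, taken from the steps above
    script = "shedskin_pure_python.py"
    exe = os.path.join(".", "shedskin_pure_python")
    if os.path.exists(script) and os.path.exists(exe):
        t_python = time_command([sys.executable, script])
        t_shedskin = time_command([exe])
        print("CPython:  %.2fs" % t_python)
        print("ShedSkin: %.2fs" % t_shedskin)
        print("Speed-up: %.1fx" % speedup(t_python, t_shedskin))
    else:
        print("Build the executable with shedskin and make first")
```

Taking the best of several runs reduces noise from other processes. Note that both timings include process start-up, which matters more for short runs.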

ShedSkin has its own C++ implementations of the parts of the core Python library that it supports; it can only import modules that someone has implemented for ShedSkin. For this reason we can’t use numpy in a ShedSkin executable or module. You can pass a Python list across (and numpy lets you make a Python list from an array type), but that comes with a speed hit.
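The list hand-off might look like the sketch below on the CPython side. It is illustrative only: to_plain_list and call_compiled are hypothetical helper names, and the demo uses the standard library array type and a stand-in function so that it runs without numpy or a compiled module. numpy arrays provide the same tolist() method, so they would take the same path.

```python
from array import array


def to_plain_list(values):
    """Convert an array-like object into a plain Python list.

    Objects with a tolist() method (numpy arrays, array.array) use it;
    anything else falls back to list().
    """
    tolist = getattr(values, "tolist", None)
    return tolist() if callable(tolist) else list(values)


def call_compiled(func, *array_args):
    """Call func with every argument converted to a plain list first.

    func stands in for a function imported from a ShedSkin extension module.
    """
    return func(*[to_plain_list(a) for a in array_args])


if __name__ == "__main__":
    # Stand-in for a compiled function: it only ever sees Python lists
    def total(values):
        return sum(values)

    data = array("d", [0.5, 1.5, 2.0])
    print(call_compiled(total, data))
```

The conversion copies every element, which is where the speed hit comes from, so it is best done once outside any hot loop.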

The complex datatype has been implemented in a way that isn’t as efficient as it could be (ShedSkin’s author Mark Dufour has stated that it could be made much more efficient if there’s demand). If we expand the math using some algebra in exactly the same way that we did for the Cython example we get another huge jump in performance:

def calculate_z_serial_purepython(q, maxiter, z):
    output = [0] * len(q)
    for i in range(len(q)):
        zx, zy = z[i].real, z[i].imag
        qx, qy = q[i].real, q[i].imag
        for iteration in range(maxiter):
            # expand complex numbers to floats, do raw float arithmetic
            # as the shedskin variant isn't so fast
            # I believe MD said that complex numbers are allocated on the heap
            # and this could easily be improved for the next shedskin
            zx_new = (zx * zx - zy * zy) + qx
            # note that zx(old) is used so we make zx_new on previous line
            zy_new = (2 * (zx * zy)) + qy
            zx = zx_new
            zy = zy_new
            # remove need for abs and just square the numbers
            if zx * zx + zy * zy > 4.0:
                output[i] = iteration
                break
    return output
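The expansion relies on two identities. Writing z = zx + i*zy and q = qx + i*qy, z*z + q has real part zx*zx - zy*zy + qx and imaginary part 2*zx*zy + qy, and abs(z) > 2 is equivalent to zx*zx + zy*zy > 4. The following sketch checks both against Python’s built-in complex type. The helper names are made up for illustration and the sample values are arbitrary.

```python
def step_complex(z, q):
    """One update z -> z*z + q using the built-in complex type."""
    return z * z + q


def step_floats(zx, zy, qx, qy):
    """The same update on separate real and imaginary floats."""
    zx_new = (zx * zx - zy * zy) + qx
    zy_new = (2 * (zx * zy)) + qy  # uses the old zx, not zx_new
    return zx_new, zy_new


def escaped_complex(z):
    """Escape test using abs(), which needs a square root."""
    return abs(z) > 2.0


def escaped_floats(zx, zy):
    """Equivalent escape test on the squared magnitude, no square root."""
    return zx * zx + zy * zy > 4.0


if __name__ == "__main__":
    z = complex(0.3, -0.4)    # arbitrary sample point
    q = complex(-0.5, 0.25)   # arbitrary sample constant
    zx, zy = z.real, z.imag
    for _ in range(5):
        z = step_complex(z, q)
        zx, zy = step_floats(zx, zy, q.real, q.imag)
        print(z, (zx, zy), escaped_complex(z), escaped_floats(zx, zy))
```

The two versions should agree to within floating point rounding. Any small differences come from the order of operations, not from the algebra.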

When debugging it is helpful to know what types the code analysis has detected. Use:

shedskin -a your_module.py

and you’ll have annotated .cpp and .hpp files which tie the generated C++ to the original Python.

15.1 Profiling

I’ve never tried profiling ShedSkin but several options (using Valgrind and gprof) were presented in the Google Group: http://groups.google.com/group/shedskin-discuss/browse_thread/thread/fd39b6bb38cfb6d1

15.2 Faster code

You can disable bounds-checking with the -b flag; generally this gives a small speed improvement. Wrap-around checking can be disabled with -w. Neither optimisation improved the run-time for this problem. For int64 long integer support add -l. For other flags see the documentation.

The author made some notes in the ShedSkin Google Group (http://groups.google.com/group/shedskin-discuss/browse_thread/thread/c5bf965a80292a43) on speeding up the code by editing the generated Makefile:

  • adding -ffast-math to FLAGS seems to reduce run-time by about 10%
  • compiling first with -fprofile-generate then -fprofile-use saves about 7%
  • using libgc 7.2alpha6 instead of the common libgc 6.8 helps about 3% (you may already use this one)

It is possible that automatic vectorisation (e.g. with gcc http://gcc.gnu.org/projects/tree-ssa/vectorization.html) will help. I don’t have an up-to-date gcc (e.g. 4.6) on my MacBook so I’ve yet to experiment with this.