Google Analytics

Valued Lessons

I've learned. I'll share.

January 14, 2009

Popcount in Python (with benchmarks)

Purpose

A slightly uncommon bit-twiddling operation is "popcount", which counts the number high bits. While uncommon, it's crucial for implementing Array Mapped Tries, which is a way to implement immutable maps and vectors. I'm trying to implement immutable maps and vectors, so I needed to implement popcount.

I found A TON of ways, but had no idea which was fastest. So, I implemented them all and tested them. The short answer is to do this:

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
And here's the long answer:

Results

I ran popcount on 30,000 random ints between 0 and 231-1. Here are the results from my Linux Desktop with a 3.2Ghz Core 2 Quad:

Name TimeMax BitsURL
c_bagwell 0.003032http://lamp.epfl.ch/papers/idealhashtrees.pdf
c_bsdmacro 0.003032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
c_parallel 0.003132http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
c_ops64 0.003632http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
table16 0.009832http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
c_table8 0.011032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
table+kern 0.0132 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.016332http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.019032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
parallel 0.019932http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.020732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
bagwell 0.024232http://lamp.epfl.ch/papers/idealhashtrees.pdf
freebsd 0.025732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
timpeters3 0.027732http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters2 0.058032http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters 0.0724 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.0745 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
elklund 0.0889 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.1519 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.2447 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

And here are the results for my MacBook with a 2.0Ghz Core 2 Duo:

Name TimeMax BitsURL
table16 0.027932http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table+kern 0.0372 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.041732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.048132http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
bagwell 0.059632http://lamp.epfl.ch/papers/idealhashtrees.pdf
timpeters3 0.074432http://mail.python.org/pipermail/python-list/1999-July/007696.html
parallel 0.080732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.129032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
timpeters2 0.143932http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.1527 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
timpeters 0.1668 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
freebsd 0.1772 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
elklund 0.1913 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.2889 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.6106 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

Observations

  1. Implementations written in C (using Pyrex) are 5-6 times faster than Python.
  2. My 3.2 Ghz Linux Desktop is 200% faster than my 2.0 Ghz MacBook even though it only has a 60% clockspeed advantage. I'm not sure what to make of that. Python doesn't use multiple cores well, so I'm sure it's not that.
  3. If you want to run popcount on 32-bit values in pure Python, table16 is the fastest by far, and only 3 times slower than implementations in C.
  4. If you need to run popcount on arbitrarily large integers, kernighan is the best, but doing a hybrid of table16 and kernighan is probably better.

Conclusion

If you don't mind using about 64KB of memory, here's is the fastest popcount in pure python:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
If you do mind the memory usage, here's a slighly slower version:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])
And if you need to handle values of more than 32 bits, one of these two are the best:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

POPCOUNT32_LIMIT = 2**32-1
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount_hybrid(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c

If it needs to be faster than that, write in C!

Test For Yourself

If you want to test these yourself, here's some code you can run.
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
def popcount_naive(v):
    c = 0
    while v:
        c += (v & 1)
        v >>= 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])

POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

POPCOUNT32_LIMIT = 2**32-1
def popcount_table16_kernighan(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c


#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
def popcount32_ops64(v):
    return ((((v & 0xfff) * 0x1001001001001 & 0x84210842108421)            % 0x1f) +
            ((((v & 0xfff000) >> 12) * 0x1001001001001 & 0x84210842108421) % 0x1f) +
            (((v >> 24) * 0x1001001001001 & 0x84210842108421)              % 0x1f))
   

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
#also http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_parallel(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    #v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    #v = v                + ((v >> 4) & 0x0F0F0F0F)
    v = (v + (v >> 4)) & 0x0F0F0F0F
    v = (v * 0x1010101) >> 24
    return v % 256 #I added %256. I'm not sure why it's needed. It's probably because of signed ints in Python

def popcount32_bagwell(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    v = v                +  (v >> 8)
    v = (v + (v >> 16)) & 0x3F
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount_elklund(v):
    c = 0
    while v:
        v ^= v & -v
        c += 1
    return c

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_freebsd(v):
    v = (v & 0x55555555) + ((v & 0xaaaaaaaa) >> 1);
    v = (v & 0x33333333) + ((v & 0xcccccccc) >> 2);
    v = (v & 0x0f0f0f0f) + ((v & 0xf0f0f0f0) >> 4);
    v = (v & 0x00ff00ff) + ((v & 0xff00ff00) >> 8);
    v = (v & 0x0000ffff) + ((v & 0xffff0000) >> 16);
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_bsdmacro(v):
    #define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
    #define BX_(x) ((x) - (((x)>>1)&) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111))
    #
    #def bx(v): (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    #def bc(v): ((bx(v) + (bx(v) >> 4)) & 0x0F0F0F0F) % 255

    v = (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    v = ((v + (v >> 4)) & 0x0F0F0F0F)
    return v % 255



#http://mail.python.org/pipermail/python-list/1999-July/007696.html
import marshal, array, struct, string
POPCOUNT8_TRANS_TABLE = "".join(map(chr, POPCOUNT_TABLE8))

#changed by me to match new dumps() and use sum()
def popcount_timpeters(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    # overwrite type code
    counts[0] = 0
    return sum(counts)

#changed by me to add loop unrolling and not setting digitcounts[0]
def popcount32_timpeters2(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    return counts[1] + counts[2] + counts[3] + counts[4]

#improved by me: no need to translate type char
def popcount32_timpeters3(v):
    dumped = marshal.dumps(v)
    return POPCOUNT_TABLE8[ord(dumped[1])] + POPCOUNT_TABLE8[ord(dumped[2])] + POPCOUNT_TABLE8[ord(dumped[3])] + POPCOUNT_TABLE8[ord(dumped[4])]

#http://mail.python.org/pipermail/python-list/1999-July/007696.html
_run2mask = {1: 0x5555555555555555L,
             2: 0x3333333333333333L,
             4: 0x0F0F0F0F0F0F0F0FL,
             8: 0x00FF00FF00FF00FFL}

def buildmask2(run, n):
    run2 = run + run
    k = (n + run2 - 1) / run2
    n = k * run2
    try:
        answer = _run2mask[run]
        k2 = 64 / run2
    except KeyError:
        answer = (1L << run) - 1
        k2 = 1
    while k > k2:
        k2 = k2 + k2
        if k >= k2:
            answer = answer | (answer << (run * k2))
        else:
            answer = answer | (answer << (run2 * (k - k2/2)))
            k2 = k
    if k2 > k:
        answer = answer >> (run2 * (k2 - k))
    return answer, n

def nbits(v):
    return 32 #???

def popcount_seistrup(v):
    lomask, n = buildmask2(1, nbits(v))
    v = v - ((v >> 1) & lomask)
    target = 2
    while n > target:
        lomask, n = buildmask2(target, n)
        v = (v & lomask) + ((v >> target) & lomask)
        target = target + target
        for i in range(1, target/2):
            if n <= target:
                break
            n = n >> 1
            n = (n + target - 1) / target * target
            v = (v & ((1L <<>> n)
    return int(v)
   

if __name__ == "__main__":
    import time, random

    def time_func(func):
        before = time.time()
        result = func()
        after  = time.time()
        span   = after - before
        return result, span

    popcounts = [popcount_naive,       
                 popcount32_table8,    
                 popcount32_table16,   
                 popcount_table16_kernighan,
                 popcount_kernighan,   
                 popcount32_ops64,     
                 popcount32_parallel,  
                 popcount32_bagwell,   
                 popcount_elklund,     
                 popcount32_freebsd,   
                 popcount32_bsdmacro,  
                 popcount_seistrup,    
                 popcount_timpeters,   
                 popcount32_timpeters2,
                 popcount32_timpeters3,
                 ]

    test_count      = 30000
    max_int         = 2**31 - 2
    ints            = range(0, 257) + [random.randint(0, max_int) for _ in xrange(test_count)] + range(max_int - 100, max_int+1)
    expected_counts = map(popcount_naive, ints)
    for popcount in popcounts:
        counts, timespan = time_func(lambda : map(popcount, ints))
        print "%5.5f %s" % ((timespan if (counts == expected_counts) else -1), popcount.__name__) #-1 means failure

November 11, 2008

Easy Python Decorators with the decorator decorator

Decorators in python are very powerful, but are often a pain to get right, especially when you want to pass arguments to the decorator. Typically, it involves defining a function in a function in a function, etc, up to 4 layers deep. Can we make it easier? What if we even made a decorator to make decorators?

Perhaps we could use it catch and log errors:

@decorator
def log_error(func, *args, **kargs):
   try:
       return func(*args, **kargs)
   except Exception, error:
       print "error occurred: %r" % error

@log_error
def blowup():
   raise Exception("blew up")

blowup() # this gets caught and prints the error

Or maybe we'd like to synchronize a function to make it thread-safe:

@decorator
def synchronized(lock, func, *args, **kargs):
    lock.acquire()
    try:
        return func(self, *args, **kargs)
    finally:
        lock.release()

missle_lock = thread.RLock()

@synchronized(missle_lock)
def launch_missiles():
    pass

Or we could even use arguments to do do some object __init__ trickery (something I've had to do when working with wxpython):

@decorator
def inherits_format(method, *args, **kargs):
    self, parent = args[:2]
    self.format = parent.format
    return method(*args, **kargs)

class Child:
    @inherits_format
    def __init__(self, parent):
        pass

class Parent:
    def __init__(self, format):
        self.format = format

format = object()
child  = Child(Parent(format))
assert child.format is format

If you've never worked python decorators, these are few examples of how powerful they are. But you'll probably never want to write your own because it can be a pain. If that's the case, then the decorator decorator is for you! Here's the code for it. It's a little tricky, but all you have to do is import it, slap @decorator before your decorator, make sure you call func(*args, **kargs), and you're all set. This is as easy as decorators can get.

Update: Dave left a comment and notified me that there is another, much more complete, implementation of the same idea at http://www.phyast.pitt.edu/~micheles/python/documentation.html. So, if you want a very complete module, go there. If you want something small and simple to cut and paste into your own code, use the following.

def decorator(decorator_func):
    decorator_expected_arg_count = decorator_func.func_code.co_argcount - 1
    if decorator_expected_arg_count:
        def decorator_maker(*decorator_args):
            assert len(decorator_args) == decorator_expected_arg_count, "%s expects %d args" % (decorator_func.__name__, decorator_expected_arg_count)

            def _decorator(func):
                assert callable(func), "Decorator not given a function.  Did you forget to give %r arguments?" % (decorator_func.__name__)

                def decorated_func(*args, **kargs):
                    full_args = decorator_args + (func,) + args
                    return decorator_func(*full_args, **kargs)

                decorated_func.__name__ = func.__name__
                return decorated_func
            return _decorator
        return decorator_maker
    else:
        def _decorator(func):
            def decorated_func(*args, **kargs):
                return decorator_func(func, *args, **kargs)

            decorated_func.__name__ = func.__name__
            return decorated_func
        return _decorator
    

October 14, 2008

The manycore era is already upon us (and python isn't keeping up)

For many months, perhaps years, we as programmers have realized our trouble ahead coming with the "manycore era". The attitude seems to be "someday we're really going to have to figure this concurrency thing out". But this month I have been hit with a terrible realization: the manycore era is already upon us. Technically, it's the "multicore", not "manycore", but I think that's a small distinction.

Why the sudden epiphany? A few weeks ago, I built my self a new computer and put in an Intel quad-core CPU. My computer is much faster now, and I'm very happy with it, but after a few weeks I have realized something: I never use more than 1/4 of my CPU. I have a little graph on my screen at all times showing me the CPU usage. It goes up to 25% all of the time, but I have never seen it go higher. Ever.

On one hand this is a good thing. The new computer is so fast that everything I do is instant, and only uses a tiny bit of power. If it isn't instant, it's usually disk or network bound, not CPU bound.

On the other hand, even my own software isn't using more than 25%, and it is CPU bound at times. I'd like to fix it, but it's written in python, and the short version of a long, boring story is that python has a thing called The GIL which makes it so python as currently implemented cannot use more than 25% of the CPU except under very rare circumstances.

It seems that programming languages takes a long time to be adopted, but I think concurrency is a big enough rule changer to shake up which programming languages are dominant. In my specific case, if python doesn't fix it's concurrency problems soon, I'm going to have to stop considering it because I'll never be able to get it to use more than 1 little piece of the CPU. Right now, that's 25%, but in a few years, it will be only 3%. Either python is going to have to change, or I'm going to have to change programming languages (or use something like Jython or IronPython, I suppose).

A few weeks ago, sitting behind my single core computer, I was in the "someday, we'll have to tackle concurrency" camp. Now, sitting in front of my quad core machine, never using more than 25% of its power, everything has changed. Concurrency is no longer a question of if or when, it's here right now. If you don't believe me, get yourself a quad-core machine and watch the CPU usage graph. I think you'll be surprised.

How to DTrace Python in OSX

DTrace is an incredible tool. It basically lets you do profiling of a live application with no performance penatly. I'm writing a Python that needed some profiling, and I found the "normal" techniques like the profile/cProfile module very lacking. Luckily, Mac OSX comes with DTrace and it even works with Python. The only snag is that it's hard to find how to use the darn thing. I finally figured it out, so I figured it pass on the knowledge.

So, here's how you use dtrace on your python application in Mac OSX:

  1. Get DTraceToolkit.
  2. Edit Python/py_cputime.d by replacing "function-entry" with "entry" and "function-return" with "exit".
  3. Call "sudo dtrace -s Python/py_cputime.d"
  4. Let it sit there a while and hit ctrl-c.
  5. Enjoy the results

I can only assume you have to edit the file because of some difference between Solaris and OSX. You can try files other than py_cputime.d, but you might have to edit them too. Not all of them work, but most do.

The last thing to know is that you have to use the python that comes with OSX. A custom-built python doesn't seem to work.

Hope that helps!

Python Memory Usage: What values are taking up so much memory?

Python seems to use a lot of memory. So what exactly is the overhead of each type of value? Short answer:

int 24
float 24
tuple 63
list 101
dict 298
old-style class 345
new-style class 336
subclassed tuple 79
Record 79
Record with old class mixin 79
Record with new class mixin 79
Measured in bytes using Python 2.5 in 64-bit Ubuntu Linux

I measured these by running a simple program that loaded up 1,000,000 values in a list and then did time.sleep(1000). I ran that for different value types and then ran "top" to see how much memory was being used. I took that value, substracted the memory usage for a list of all the same value (14 bytes each), subtracted the value of a child value (usually an int, 24 bytes each), and then divided by 1,000,000. I'll include the code I ran at the end if you want to cut and paste. So what lessons do we learn from this?

  • Python objects are very expensive at over 300 bytes each.
  • Tuples have 1/5 as much overhead.
  • Records are almost as good as tuples, even when a mixin is added.

So, if you want to have lots of values in memory without using lots of memory, use Record.

If you want to run the test for yourself, here's the code. Just comment out the "make_val" that you want to test.

import time
from Record import Record

class TupleClass(tuple):
    pass

class RecordClass(Record("val")):
    pass

class OldClass:
    def __init__(self, val):
        self.val = val

    def method(self):
        pass

class NewClass(object):
    def __init__(self, val):
        self.val = val

    def method(self):
        pass

class RecordWithOldClass(Record("val"), OldClass):
    pass

class RecordWithNewClass(Record("val"), NewClass):
    pass

make_val = lambda i : 1 #nothing (base overhead)
#make_val = lambda i : i    
#make_val = float           
#make_val = lambda i : (i,) 
#make_val = lambda i : [i]  
#make_val = lambda i : {i:i}
#make_val = TupleClass
#make_val = RecordClass
#make_val = OldClass
#make_val = NewClass
#make_val = RecordWithOldClass
#make_val = RecordWithNewClass

count = 1000000
lst = [make_val(i) for i in xrange(count)]
time.sleep(100000)

June 27, 2008

Message Passing Conccurrency (Actor Model) in Python

As I wrote previously, I'm a fan of message-passing concurrency. But to use it in python, I had to write my own infrastructure. It's far from perfect, but it's working for me. I've simplified it a bit and prepared some code to share with you.

The basic idea of my framework is to be and simple and easy to use as possible. There's no Erlang-style pattern matching or process linking. There isn't even any built-in error handling. But, there's a lot of support for the typical "react loop" style of message handling.

The architecture is very simple. There's one thread per actor. An actor is just an object you can send a message to. The actor receives the messages in the order they were sent. You always know the sender of the message; it's built right in. There's also "Bridge" that allows you to interact with actors from non-actor code.

The only tricky thing you'll see is a "handles" decorator. This is used to help you identify how you want an actor to handle a given message type. It helps you avoid re-creating a message loop and dispatcher yourself for every actor.

I've included a little example of how to use the framework. It obviously quite simple, but keep in mind that I used little more for the foundation of a rather complex file synchronization application. A little bit goes a long way.

So, here it is. Cut, paste, and run.

from threading import Thread, Event
from Queue     import Queue, Empty

class IActor:
    pass
    # send(message, sender)

class HandlerRegistry(dict):
    # should be used as a decorator
    def __call__(self, typ):
        def register(func):
            self[typ] = func
            return func
        return register

OtherMessage = "Other Message"

class Stop(Exception):
    def __repr__(self):
        return "Stop()"

class Stopped:
    def __repr__(self):
        return "Stopped()"

class ActorNotStartedError(Exception):
    def __init__(self):
        Exception.__init__(self, "actor not started")

class Actor(IActor):
    @classmethod
    def spawn(cls, *args, **kargs):
        self = cls(*args, **kargs)
        self.mailbox = Mailbox()
        start_thread(target = self.act, as_daemon = True)
        return self

    def send(self, message, sender):
        if self.mailbox is None:
            raise ActorNotStartedError()
        else:
            self.mailbox.send(message, sender)

    def receive(self):
        if self.mailbox is None:
            raise ActorNotStartedError()
        else:
            return self.mailbox.receive()

    # override if necessary
    def act(self):
        self.handleMessages()



    handles = HandlerRegistry()

    @classmethod
    def makeHandles(*classes):
        return HandlerRegistry((typ, handler) for cls in classes for (typ, handler) in cls.handles.iteritems())

    def handleMessages(self):
        try:
            while True:
                message, sender = self.receive()
                self.handleMessageWithRegistry(message, sender)
        except Stop:
            pass

    def handleMessageWithRegistry(self, message, sender):
        registry = self.__class__.handles
        handler  = registry.get(message.__class__) or registry.get(OtherMessage)
        if handler is not None:
            handler(self, message, sender)

    @handles(OtherMessage)
    def onOther(self, message, sender):
        pass

    @handles(Stop)
    def onStop(self, message, sender):
        sender.send(Stopped(), self)
        raise message

def start_thread(target, as_daemon, name = None):
    thread = Thread(target = target)
    if name:
        thread.setName(name)
    thread.setDaemon(as_daemon) 
    thread.start()
    return thread

class Mailbox:
    def __init__(self):
        self.mailbox = Queue()

    def send(self, message, sender):
        self.mailbox.put((message, sender), block = False)

    def receive(self, timeout = None):
        return self.mailbox.get(block = True, timeout = timeout)

class Bridge(IActor):
    def __init__(self):
        self.mailbox = Mailbox()

    def send(self, message, sender):
        self.mailbox.send(message, sender)

    def call(self, target, request, timeout, default = None):
        self.sendRequest(target, request)
        return self.receiveResponse(timeout, default)

    # targeted_requests can be an iterator
    def multiCall(self, targeted_requests, timeout, default = None):
        count = 0
        for target, request in targeted_requests:
            self.sendRequest(target, request)
            count += 1
            
        for _ in xrange(count):
            yield self.receiveResponse(timeout, default)

    def stop(self, actors, timeout):
        stop = Stop()
        return list(self.multiCall(((actor, stop) for actor in actors), timeout, default = None))

    def sendRequest(self, target, request):
        target.send(request, self)

    def receiveResponse(self, timeout, default):
        try:
            message, sender = self.mailbox.receive(timeout = timeout)
            return message
        except Empty:
            return default
        

if __name__ == "__main__":
    import time

    class GetInventory:
        pass

    class Task:
        def __init__(self, input, destination):
            self.input       = input
            self.destination = destination

    class Worker(Actor):
        handles = Actor.makeHandles()

        def __init__(self, skill):
            self.skill = skill

        @handles(Task)
        def onTask(self, task, sender):
            output = self.skill(task.input)
            task.destination.send(output, self)

    class Warehouse(Actor):
        handles = Actor.makeHandles()

        def __init__(self):
            self.inventory = []

        @handles(GetInventory)
        def onGetInventory(self, message, sender):
            # copy the inventory to avoid anyone mutating it
            sender.send(list(self.inventory), self)
            
        @handles(OtherMessage)
        def onTaskResult(self, result, sender):
            self.inventory.append(result)

    worker    = Worker.spawn(lambda x : x * 2)
    positives = Warehouse.spawn()
    negatives = Warehouse.spawn()
    bridge    = Bridge()

    for val in [1, 2, 3, -2, -4, -6]:
        warehouse = positives if val >= 0 else negatives
        worker.send(Task(val, warehouse), sender = None)

    print bridge.call(positives, GetInventory(), 1.0) #should be [ 2,  4,   6]
    print bridge.call(negatives, GetInventory(), 1.0) #should be [-4, -8, -12]
    print bridge.stop([worker, positives, negatives], 1.0) #should be [Stopped(), Stopped(), Stopped()]

    
    class Start:
        def __init__(self, target):
            self.target = target

    class Ping:
        def __repr__(self):
            return "Ping()"

    class Pong:
        def __repr__(self):
            return "Pong()"

    class Pinger(Actor):
        handles = Actor.makeHandles()

        @handles(Start)
        def onStart(self, start, sender):
            start.target.send(Ping(), self)

        @handles(Pong)
        def onPong(self, pong, sender):
            print "-",
            sender.send(Ping(), self)
            
    class Ponger(Actor):
        handles = Actor.makeHandles()

        @handles(Ping)
        def onPing(self, ping, sender):
            print "+",
            sender.send(Pong(), self)

    # should print lots of +-+-+-
    pinger = Pinger.spawn()
    ponger = Ponger.spawn()
    pinger.send(Start(ponger), sender = None)
    time.sleep(0.1)
    bridge.stop([pinger, ponger], 1.0)

My Experience with Message Passing Concurrency

I'm working on a peer-to-peer file synchronization program. It's really concurrent and distributed, and it's forced me to learn a thing or two about the concurrency models that we use as programmers. Over the past few years, I've tried both the shared-state model common to C, C++, Java, C#, Python, etc, and the message-passing model unique to Erlang and Scala. In my experience, the message-passing model (aka Actor Model) is far superior to the shared-state model for writing distributed applications. After having used both extensively, I'd go so far as to say that for distributed programming, shared-state concurrency is upside down and backwards.

Here's my informal proof that message-passing concurrency is necessary in a distributed system:

  1. Since the system in distributed, real shared state is impossible.
  2. The only way for the distributed components of an application to communicate is by sending a message from one to another.
  3. Everything else is an abstraction on top of message passing.
HTTP is an abstraction on top of message-passing. AJAX is an abstraction on top of HTTP on top of message-passing. XML-RPC is an abstraction on top of HTTP on top of message-passing. SOAP is an abstraction on top of an abstraction on top of an abstraction on top of HTTP on top of message-passing. Any RPC mechanism is an abstraction on top of message-passing. SQL queries to a remote database are an abstraction on top of message-passing. CORBA is a nasty, tangled mess on top of a foundation of message-passing.

See a pattern?

Not only are all of these abstractions, but they are leaky abstractions. Just about all RPC (Remote Procedure Call) frameworks try to pretend that remote objects or local. But the facade is impossible to keep from leaking. Calls to a remote object might fail or take arbitrarily long. If you want to make the same call to two different remote nodes, those calls must be made synchronously and sequentially; in order to call them in parallel on the remote nodes, they most be called in parallel on the local node. If a call to a remote node triggers a call back to the local node, which may trigger a call to a third node, you end up with a huge spaghetti mess of calls and threads.

I've been down that road. It wasn't pretty. There is a better way.

I think AJAX has opened our eyes a bit. It's a lot closer to message-passing than it is to RPC. In a REST architecture, you "send" messages by POSTing to a URL and you "receive" messages by GETing a URL. All messages are clearly local copies that were serialized and deserialized from the remote data. There isn't any leaky abstraction of data or classes pretending to be in two places at once. Data, timeouts, and failures are in your face and you have to deal with them. So, AJAX is a lot less leaky.

But AJAX is far from perfect. I can't help but think of the Greenspun's Tenth Rule of Programming with a twist on distributed programming: "Any sufficiently complicated distributed platform contains an ad hoc, informally-specified, bug-ridden, slow implementation of half of a message-passing system (Erlang)". Once you get message-passing in your head, you can't help but think of AJAX in that way.

I've been down a better road. It was much more pleasant.

About six months ago, I rewrote major portions of our application with message-passing concurrency. I took a long time. It was tricky. It hurt my head. But it worked. It's a success. It's capable of things that the shared-state system could never do.

Having done it, I can emphatically say that if I could start all over, I would go with message-passing. Most of the work was building the infrastructure and wrapping my head around a new way of thinking. But now that I've done both of those, I've paid the costs and I'm reaping the rewards. We're moving the application in directions that would have been nearly in possible with the old concurrency model.

In my experience, message-passing concurrency is the best way to write distributed applications. But it isn't an easy road to go down. Support for it just isn't there in most programming languages and environments. So far, Erlang has been the pioneer. I would humbly agree that the way I've implemented message-passing in Python, compared to Erlang, looks like an ad-hoc, bug-ridden half-implementation. But for various reasons, I can't use Erlang. I'm hoping for someone to create Erlang++ or E# or Eython or something that combines the concurrency model of Erlang with a modern programming language. Until then, I'll just keep on cobbling together what I can onto the programming language I happen to be using.

More on that in another article.

Blog Archive

Labels