I've learned. I'll share.

January 14, 2009

Popcount in Python (with benchmarks)

Purpose

A slightly uncommon bit-twiddling operation is "popcount", which counts the number high bits. While uncommon, it's crucial for implementing Array Mapped Tries, which is a way to implement immutable maps and vectors. I'm trying to implement immutable maps and vectors, so I needed to implement popcount.

I found A TON of ways, but had no idea which was fastest. So, I implemented them all and tested them. The short answer is to do this:

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
And here's the long answer:

Results

I ran popcount on 30,000 random ints between 0 and 231-1. Here are the results from my Linux Desktop with a 3.2Ghz Core 2 Quad:

Name TimeMax BitsURL
c_bagwell 0.003032http://lamp.epfl.ch/papers/idealhashtrees.pdf
c_bsdmacro 0.003032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
c_parallel 0.003132http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
c_ops64 0.003632http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
table16 0.009832http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
c_table8 0.011032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
table+kern 0.0132 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.016332http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.019032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
parallel 0.019932http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.020732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
bagwell 0.024232http://lamp.epfl.ch/papers/idealhashtrees.pdf
freebsd 0.025732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
timpeters3 0.027732http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters2 0.058032http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters 0.0724 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.0745 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
elklund 0.0889 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.1519 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.2447 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

And here are the results for my MacBook with a 2.0Ghz Core 2 Duo:

Name TimeMax BitsURL
table16 0.027932http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table+kern 0.0372 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.041732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.048132http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
bagwell 0.059632http://lamp.epfl.ch/papers/idealhashtrees.pdf
timpeters3 0.074432http://mail.python.org/pipermail/python-list/1999-July/007696.html
parallel 0.080732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.129032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
timpeters2 0.143932http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.1527 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
timpeters 0.1668 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
freebsd 0.1772 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
elklund 0.1913 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.2889 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.6106 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

Observations

  1. Implementations written in C (using Pyrex) are 5-6 times faster than Python.
  2. My 3.2 Ghz Linux Desktop is 200% faster than my 2.0 Ghz MacBook even though it only has a 60% clockspeed advantage. I'm not sure what to make of that. Python doesn't use multiple cores well, so I'm sure it's not that.
  3. If you want to run popcount on 32-bit values in pure Python, table16 is the fastest by far, and only 3 times slower than implementations in C.
  4. If you need to run popcount on arbitrarily large integers, kernighan is the best, but doing a hybrid of table16 and kernighan is probably better.

Conclusion

If you don't mind using about 64KB of memory, here's is the fastest popcount in pure python:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
If you do mind the memory usage, here's a slighly slower version:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])
And if you need to handle values of more than 32 bits, one of these two are the best:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

POPCOUNT32_LIMIT = 2**32-1
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount_hybrid(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c

If it needs to be faster than that, write in C!

Test For Yourself

If you want to test these yourself, here's some code you can run.
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
def popcount_naive(v):
    c = 0
    while v:
        c += (v & 1)
        v >>= 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])

POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

POPCOUNT32_LIMIT = 2**32-1
def popcount_table16_kernighan(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c


#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
def popcount32_ops64(v):
    return ((((v & 0xfff) * 0x1001001001001 & 0x84210842108421)            % 0x1f) +
            ((((v & 0xfff000) >> 12) * 0x1001001001001 & 0x84210842108421) % 0x1f) +
            (((v >> 24) * 0x1001001001001 & 0x84210842108421)              % 0x1f))
   

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
#also http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_parallel(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    #v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    #v = v                + ((v >> 4) & 0x0F0F0F0F)
    v = (v + (v >> 4)) & 0x0F0F0F0F
    v = (v * 0x1010101) >> 24
    return v % 256 #I added %256. I'm not sure why it's needed. It's probably because of signed ints in Python

def popcount32_bagwell(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    v = v                +  (v >> 8)
    v = (v + (v >> 16)) & 0x3F
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount_elklund(v):
    c = 0
    while v:
        v ^= v & -v
        c += 1
    return c

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_freebsd(v):
    v = (v & 0x55555555) + ((v & 0xaaaaaaaa) >> 1);
    v = (v & 0x33333333) + ((v & 0xcccccccc) >> 2);
    v = (v & 0x0f0f0f0f) + ((v & 0xf0f0f0f0) >> 4);
    v = (v & 0x00ff00ff) + ((v & 0xff00ff00) >> 8);
    v = (v & 0x0000ffff) + ((v & 0xffff0000) >> 16);
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_bsdmacro(v):
    #define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
    #define BX_(x) ((x) - (((x)>>1)&) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111))
    #
    #def bx(v): (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    #def bc(v): ((bx(v) + (bx(v) >> 4)) & 0x0F0F0F0F) % 255

    v = (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    v = ((v + (v >> 4)) & 0x0F0F0F0F)
    return v % 255



#http://mail.python.org/pipermail/python-list/1999-July/007696.html
import marshal, array, struct, string
POPCOUNT8_TRANS_TABLE = "".join(map(chr, POPCOUNT_TABLE8))

#changed by me to match new dumps() and use sum()
def popcount_timpeters(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    # overwrite type code
    counts[0] = 0
    return sum(counts)

#changed by me to add loop unrolling and not setting digitcounts[0]
def popcount32_timpeters2(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    return counts[1] + counts[2] + counts[3] + counts[4]

#improved by me: no need to translate type char
def popcount32_timpeters3(v):
    dumped = marshal.dumps(v)
    return POPCOUNT_TABLE8[ord(dumped[1])] + POPCOUNT_TABLE8[ord(dumped[2])] + POPCOUNT_TABLE8[ord(dumped[3])] + POPCOUNT_TABLE8[ord(dumped[4])]

#http://mail.python.org/pipermail/python-list/1999-July/007696.html
_run2mask = {1: 0x5555555555555555L,
             2: 0x3333333333333333L,
             4: 0x0F0F0F0F0F0F0F0FL,
             8: 0x00FF00FF00FF00FFL}

def buildmask2(run, n):
    run2 = run + run
    k = (n + run2 - 1) / run2
    n = k * run2
    try:
        answer = _run2mask[run]
        k2 = 64 / run2
    except KeyError:
        answer = (1L << run) - 1
        k2 = 1
    while k > k2:
        k2 = k2 + k2
        if k >= k2:
            answer = answer | (answer << (run * k2))
        else:
            answer = answer | (answer << (run2 * (k - k2/2)))
            k2 = k
    if k2 > k:
        answer = answer >> (run2 * (k2 - k))
    return answer, n

def nbits(v):
    return 32 #???

def popcount_seistrup(v):
    lomask, n = buildmask2(1, nbits(v))
    v = v - ((v >> 1) & lomask)
    target = 2
    while n > target:
        lomask, n = buildmask2(target, n)
        v = (v & lomask) + ((v >> target) & lomask)
        target = target + target
        for i in range(1, target/2):
            if n <= target:
                break
            n = n >> 1
            n = (n + target - 1) / target * target
            v = (v & ((1L <<>> n)
    return int(v)
   

if __name__ == "__main__":
    import time, random

    def time_func(func):
        before = time.time()
        result = func()
        after  = time.time()
        span   = after - before
        return result, span

    popcounts = [popcount_naive,       
                 popcount32_table8,    
                 popcount32_table16,   
                 popcount_table16_kernighan,
                 popcount_kernighan,   
                 popcount32_ops64,     
                 popcount32_parallel,  
                 popcount32_bagwell,   
                 popcount_elklund,     
                 popcount32_freebsd,   
                 popcount32_bsdmacro,  
                 popcount_seistrup,    
                 popcount_timpeters,   
                 popcount32_timpeters2,
                 popcount32_timpeters3,
                 ]

    test_count      = 30000
    max_int         = 2**31 - 2
    ints            = range(0, 257) + [random.randint(0, max_int) for _ in xrange(test_count)] + range(max_int - 100, max_int+1)
    expected_counts = map(popcount_naive, ints)
    for popcount in popcounts:
        counts, timespan = time_func(lambda : map(popcount, ints))
        print "%5.5f %s" % ((timespan if (counts == expected_counts) else -1), popcount.__name__) #-1 means failure

2 comments:

  1. In python 3 I've found (to my disappointment) that a simple bin(num).count('1') is faster than the table16 algorithm you've posted. As I was trying to replace the gmpy version of popcount I was using, this was not good enough. Thankfully, I was interested in a special case of the problem, namely I was doing a test for

    popcount(a) > 1

    so, inspired by the posted algorithm, I was able to replace that with:

    a not in onebit

    where onebit is a tuple constructed as follows:

    onebit = [0]
    for i in range(32): onebit.append(2**i)
    onebit = tuple(onebit)

    another benefit is that you get to set the size you are interested in just by changing the range of the loop. (in my case, 17 was sufficient.

    ReplyDelete
  2. Interesting results, thanks for compiling this.

    One option that has been over-looked, perhaps, is an 11-bit look-up table:

    POPCOUNT_TABLE11 = [0] * (1 << 11)
    for index in xrange(len(POPCOUNT_TABLE11)):
        POPCOUNT_TABLE11[index] = (index & 1) + POPCOUNT_TABLE11[index >> 1]

    def popcount32_table11(v):
        return (POPCOUNT_TABLE11[ v        & 0x7ff] +
                POPCOUNT_TABLE11[(v >> 11) & 0x7ff] +
                POPCOUNT_TABLE11[ v >> 22         ])


    This actually runs the fastest on my system, I imagine due to caching better than the 16-bit table, being 32x smaller.

    ReplyDelete

Blog Archive

Google Analytics