Purpose
A slightly uncommon bit-twiddling operation is "popcount", which counts the number high bits. While uncommon, it's crucial for implementing Array Mapped Tries, which is a way to implement immutable maps and vectors. I'm trying to implement immutable maps and vectors, so I needed to implement popcount.I found A TON of ways, but had no idea which was fastest. So, I implemented them all and tested them. The short answer is to do this:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable POPCOUNT_TABLE16 = [0] * 2**16 for index in xrange(len(POPCOUNT_TABLE16)): POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1] def popcount32_table16(v): return (POPCOUNT_TABLE16[ v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff])And here's the long answer:
Results
I ran popcount on 30,000 random ints between 0 and 2
Name | Time | Max Bits | URL |
c_bagwell | 0.0030 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
c_bsdmacro | 0.0030 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
c_parallel | 0.0031 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel |
c_ops64 | 0.0036 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
table16 | 0.0098 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
c_table8 | 0.0110 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
table+kern | 0.0132 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
table8 | 0.0163 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
bsdmacro | 0.0190 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
parallel | 0.0199 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
ops64 | 0.0207 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
bagwell | 0.0242 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
freebsd | 0.0257 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
timpeters3 | 0.0277 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
timpeters2 | 0.0580 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
timpeters | 0.0724 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
kernighan | 0.0745 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan |
elklund | 0.0889 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
naive | 0.1519 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
seistrup | 0.2447 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
And here are the results for my MacBook with a 2.0Ghz Core 2 Duo:
Name | Time | Max Bits | URL |
table16 | 0.0279 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
table+kern | 0.0372 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
table8 | 0.0417 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
bsdmacro | 0.0481 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
bagwell | 0.0596 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
timpeters3 | 0.0744 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
parallel | 0.0807 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
ops64 | 0.1290 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
timpeters2 | 0.1439 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
kernighan | 0.1527 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan |
timpeters | 0.1668 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
freebsd | 0.1772 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
elklund | 0.1913 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
naive | 0.2889 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
seistrup | 0.6106 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
Observations
- Implementations written in C (using Pyrex) are 5-6 times faster than Python.
- My 3.2 Ghz Linux Desktop is 200% faster than my 2.0 Ghz MacBook even though it only has a 60% clockspeed advantage. I'm not sure what to make of that. Python doesn't use multiple cores well, so I'm sure it's not that.
- If you want to run popcount on 32-bit values in pure Python, table16 is the fastest by far, and only 3 times slower than implementations in C.
- If you need to run popcount on arbitrarily large integers, kernighan is the best, but doing a hybrid of table16 and kernighan is probably better.
Conclusion
If you don't mind using about 64KB of memory, here's is the fastest popcount in pure python:#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable POPCOUNT_TABLE16 = [0] * 2**16 for index in xrange(len(POPCOUNT_TABLE16)): POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1] def popcount32_table16(v): return (POPCOUNT_TABLE16[ v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff])If you do mind the memory usage, here's a slighly slower version:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable POPCOUNT_TABLE8 = [0] * 2**8 for index in xrange(len(POPCOUNT_TABLE8)): POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1] def popcount32_table8(v): return (POPCOUNT_TABLE8[ v & 0xff] + POPCOUNT_TABLE8[(v >> 8) & 0xff] + POPCOUNT_TABLE8[(v >> 16) & 0xff] + POPCOUNT_TABLE8[ v >> 24 ])And if you need to handle values of more than 32 bits, one of these two are the best:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan def popcount_kernighan(v): c = 0 while v: v &= v - 1 c += 1 return c POPCOUNT32_LIMIT = 2**32-1 POPCOUNT_TABLE8 = [0] * 2**8 for index in xrange(len(POPCOUNT_TABLE8)): POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1] def popcount_hybrid(v): if v <= POPCOUNT32_LIMIT: return (POPCOUNT_TABLE16[ v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff]) else: c = 0 while v: v &= v - 1 c += 1 return cIf it needs to be faster than that, write in C!
Test For Yourself
If you want to test these yourself, here's some code you can run.#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive def popcount_naive(v): c = 0 while v: c += (v & 1) v >>= 1 return c #http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan def popcount_kernighan(v): c = 0 while v: v &= v - 1 c += 1 return c #http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable POPCOUNT_TABLE8 = [0] * 2**8 for index in xrange(len(POPCOUNT_TABLE8)): POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1] def popcount32_table8(v): return (POPCOUNT_TABLE8[ v & 0xff] + POPCOUNT_TABLE8[(v >> 8) & 0xff] + POPCOUNT_TABLE8[(v >> 16) & 0xff] + POPCOUNT_TABLE8[ v >> 24 ]) POPCOUNT_TABLE16 = [0] * 2**16 for index in xrange(len(POPCOUNT_TABLE16)): POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1] def popcount32_table16(v): return (POPCOUNT_TABLE16[ v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff]) POPCOUNT32_LIMIT = 2**32-1 def popcount_table16_kernighan(v): if v <= POPCOUNT32_LIMIT: return (POPCOUNT_TABLE16[ v & 0xffff] + POPCOUNT_TABLE16[(v >> 16) & 0xffff]) else: c = 0 while v: v &= v - 1 c += 1 return c #http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 def popcount32_ops64(v): return ((((v & 0xfff) * 0x1001001001001 & 0x84210842108421) % 0x1f) + ((((v & 0xfff000) >> 12) * 0x1001001001001 & 0x84210842108421) % 0x1f) + (((v >> 24) * 0x1001001001001 & 0x84210842108421) % 0x1f)) #http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel #also http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html def popcount32_parallel(v): v = v - ((v >> 1) & 0x55555555) v = (v & 0x33333333) + ((v >> 2) & 0x33333333) #v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F) #v = v + ((v >> 4) & 0x0F0F0F0F) v = (v + (v >> 4)) & 0x0F0F0F0F v = (v * 0x1010101) >> 24 return v % 256 #I added %256. I'm not sure why it's needed. It's probably because of signed ints in Python def popcount32_bagwell(v): v = v - ((v >> 1) & 0x55555555) v = (v & 0x33333333) + ((v >> 2) & 0x33333333) v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F) v = v + (v >> 8) v = (v + (v >> 16)) & 0x3F return v #http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html def popcount_elklund(v): c = 0 while v: v ^= v & -v c += 1 return c #http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html def popcount32_freebsd(v): v = (v & 0x55555555) + ((v & 0xaaaaaaaa) >> 1); v = (v & 0x33333333) + ((v & 0xcccccccc) >> 2); v = (v & 0x0f0f0f0f) + ((v & 0xf0f0f0f0) >> 4); v = (v & 0x00ff00ff) + ((v & 0xff00ff00) >> 8); v = (v & 0x0000ffff) + ((v & 0xffff0000) >> 16); return v #http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html def popcount32_bsdmacro(v): #define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255) #define BX_(x) ((x) - (((x)>>1)&) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111)) # #def bx(v): (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111)) #def bc(v): ((bx(v) + (bx(v) >> 4)) & 0x0F0F0F0F) % 255 v = (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111)) v = ((v + (v >> 4)) & 0x0F0F0F0F) return v % 255 #http://mail.python.org/pipermail/python-list/1999-July/007696.html import marshal, array, struct, string POPCOUNT8_TRANS_TABLE = "".join(map(chr, POPCOUNT_TABLE8)) #changed by me to match new dumps() and use sum() def popcount_timpeters(v): counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE)) # overwrite type code counts[0] = 0 return sum(counts) #changed by me to add loop unrolling and not setting digitcounts[0] def popcount32_timpeters2(v): counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE)) return counts[1] + counts[2] + counts[3] + counts[4] #improved by me: no need to translate type char def popcount32_timpeters3(v): dumped = marshal.dumps(v) return POPCOUNT_TABLE8[ord(dumped[1])] + POPCOUNT_TABLE8[ord(dumped[2])] + POPCOUNT_TABLE8[ord(dumped[3])] + POPCOUNT_TABLE8[ord(dumped[4])] #http://mail.python.org/pipermail/python-list/1999-July/007696.html _run2mask = {1: 0x5555555555555555L, 2: 0x3333333333333333L, 4: 0x0F0F0F0F0F0F0F0FL, 8: 0x00FF00FF00FF00FFL} def buildmask2(run, n): run2 = run + run k = (n + run2 - 1) / run2 n = k * run2 try: answer = _run2mask[run] k2 = 64 / run2 except KeyError: answer = (1L << run) - 1 k2 = 1 while k > k2: k2 = k2 + k2 if k >= k2: answer = answer | (answer << (run * k2)) else: answer = answer | (answer << (run2 * (k - k2/2))) k2 = k if k2 > k: answer = answer >> (run2 * (k2 - k)) return answer, n def nbits(v): return 32 #??? def popcount_seistrup(v): lomask, n = buildmask2(1, nbits(v)) v = v - ((v >> 1) & lomask) target = 2 while n > target: lomask, n = buildmask2(target, n) v = (v & lomask) + ((v >> target) & lomask) target = target + target for i in range(1, target/2): if n <= target: break n = n >> 1 n = (n + target - 1) / target * target v = (v & ((1L <<>> n) return int(v) if __name__ == "__main__": import time, random def time_func(func): before = time.time() result = func() after = time.time() span = after - before return result, span popcounts = [popcount_naive, popcount32_table8, popcount32_table16, popcount_table16_kernighan, popcount_kernighan, popcount32_ops64, popcount32_parallel, popcount32_bagwell, popcount_elklund, popcount32_freebsd, popcount32_bsdmacro, popcount_seistrup, popcount_timpeters, popcount32_timpeters2, popcount32_timpeters3, ] test_count = 30000 max_int = 2**31 - 2 ints = range(0, 257) + [random.randint(0, max_int) for _ in xrange(test_count)] + range(max_int - 100, max_int+1) expected_counts = map(popcount_naive, ints) for popcount in popcounts: counts, timespan = time_func(lambda : map(popcount, ints)) print "%5.5f %s" % ((timespan if (counts == expected_counts) else -1), popcount.__name__) #-1 means failure