Purpose
A slightly uncommon bit-twiddling operation is "popcount", which counts the number high bits. While uncommon, it's crucial for implementing Array Mapped Tries, which is a way to implement immutable maps and vectors. I'm trying to implement immutable maps and vectors, so I needed to implement popcount.I found A TON of ways, but had no idea which was fastest. So, I implemented them all and tested them. The short answer is to do this:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]
def popcount32_table16(v):
return (POPCOUNT_TABLE16[ v & 0xffff] +
POPCOUNT_TABLE16[(v >> 16) & 0xffff])
And here's the long answer:
Results
I ran popcount on 30,000 random ints between 0 and 2
| Name | Time | Max Bits | URL |
| c_bagwell | 0.0030 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
| c_bsdmacro | 0.0030 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| c_parallel | 0.0031 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel |
| c_ops64 | 0.0036 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
| table16 | 0.0098 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| c_table8 | 0.0110 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
| table+kern | 0.0132 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| table8 | 0.0163 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| bsdmacro | 0.0190 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| parallel | 0.0199 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| ops64 | 0.0207 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
| bagwell | 0.0242 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
| freebsd | 0.0257 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| timpeters3 | 0.0277 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| timpeters2 | 0.0580 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| timpeters | 0.0724 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| kernighan | 0.0745 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan |
| elklund | 0.0889 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| naive | 0.1519 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
| seistrup | 0.2447 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
And here are the results for my MacBook with a 2.0Ghz Core 2 Duo:
| Name | Time | Max Bits | URL |
| table16 | 0.0279 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| table+kern | 0.0372 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| table8 | 0.0417 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable |
| bsdmacro | 0.0481 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| bagwell | 0.0596 | 32 | http://lamp.epfl.ch/papers/idealhashtrees.pdf |
| timpeters3 | 0.0744 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| parallel | 0.0807 | 32 | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| ops64 | 0.1290 | 32 | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64 |
| timpeters2 | 0.1439 | 32 | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| kernighan | 0.1527 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan |
| timpeters | 0.1668 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
| freebsd | 0.1772 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| elklund | 0.1913 | - | http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html |
| naive | 0.2889 | - | http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive |
| seistrup | 0.6106 | ? | http://mail.python.org/pipermail/python-list/1999-July/007696.html |
Observations
- Implementations written in C (using Pyrex) are 5-6 times faster than Python.
- My 3.2 Ghz Linux Desktop is 200% faster than my 2.0 Ghz MacBook even though it only has a 60% clockspeed advantage. I'm not sure what to make of that. Python doesn't use multiple cores well, so I'm sure it's not that.
- If you want to run popcount on 32-bit values in pure Python, table16 is the fastest by far, and only 3 times slower than implementations in C.
- If you need to run popcount on arbitrarily large integers, kernighan is the best, but doing a hybrid of table16 and kernighan is probably better.
Conclusion
If you don't mind using about 64KB of memory, here's is the fastest popcount in pure python:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]
def popcount32_table16(v):
return (POPCOUNT_TABLE16[ v & 0xffff] +
POPCOUNT_TABLE16[(v >> 16) & 0xffff])
If you do mind the memory usage, here's a slighly slower version:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]
def popcount32_table8(v):
return (POPCOUNT_TABLE8[ v & 0xff] +
POPCOUNT_TABLE8[(v >> 8) & 0xff] +
POPCOUNT_TABLE8[(v >> 16) & 0xff] +
POPCOUNT_TABLE8[ v >> 24 ])
And if you need to handle values of more than 32 bits, one of these two are the best:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
c = 0
while v:
v &= v - 1
c += 1
return c
POPCOUNT32_LIMIT = 2**32-1
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]
def popcount_hybrid(v):
if v <= POPCOUNT32_LIMIT:
return (POPCOUNT_TABLE16[ v & 0xffff] +
POPCOUNT_TABLE16[(v >> 16) & 0xffff])
else:
c = 0
while v:
v &= v - 1
c += 1
return c
If it needs to be faster than that, write in C!
Test For Yourself
If you want to test these yourself, here's some code you can run.
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
def popcount_naive(v):
c = 0
while v:
c += (v & 1)
v >>= 1
return c
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
c = 0
while v:
v &= v - 1
c += 1
return c
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]
def popcount32_table8(v):
return (POPCOUNT_TABLE8[ v & 0xff] +
POPCOUNT_TABLE8[(v >> 8) & 0xff] +
POPCOUNT_TABLE8[(v >> 16) & 0xff] +
POPCOUNT_TABLE8[ v >> 24 ])
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]
def popcount32_table16(v):
return (POPCOUNT_TABLE16[ v & 0xffff] +
POPCOUNT_TABLE16[(v >> 16) & 0xffff])
POPCOUNT32_LIMIT = 2**32-1
def popcount_table16_kernighan(v):
if v <= POPCOUNT32_LIMIT:
return (POPCOUNT_TABLE16[ v & 0xffff] +
POPCOUNT_TABLE16[(v >> 16) & 0xffff])
else:
c = 0
while v:
v &= v - 1
c += 1
return c
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
def popcount32_ops64(v):
return ((((v & 0xfff) * 0x1001001001001 & 0x84210842108421) % 0x1f) +
((((v & 0xfff000) >> 12) * 0x1001001001001 & 0x84210842108421) % 0x1f) +
(((v >> 24) * 0x1001001001001 & 0x84210842108421) % 0x1f))
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
#also http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_parallel(v):
v = v - ((v >> 1) & 0x55555555)
v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
#v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
#v = v + ((v >> 4) & 0x0F0F0F0F)
v = (v + (v >> 4)) & 0x0F0F0F0F
v = (v * 0x1010101) >> 24
return v % 256 #I added %256. I'm not sure why it's needed. It's probably because of signed ints in Python
def popcount32_bagwell(v):
v = v - ((v >> 1) & 0x55555555)
v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
v = v + (v >> 8)
v = (v + (v >> 16)) & 0x3F
return v
#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount_elklund(v):
c = 0
while v:
v ^= v & -v
c += 1
return c
#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_freebsd(v):
v = (v & 0x55555555) + ((v & 0xaaaaaaaa) >> 1);
v = (v & 0x33333333) + ((v & 0xcccccccc) >> 2);
v = (v & 0x0f0f0f0f) + ((v & 0xf0f0f0f0) >> 4);
v = (v & 0x00ff00ff) + ((v & 0xff00ff00) >> 8);
v = (v & 0x0000ffff) + ((v & 0xffff0000) >> 16);
return v
#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_bsdmacro(v):
#define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
#define BX_(x) ((x) - (((x)>>1)&) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111))
#
#def bx(v): (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
#def bc(v): ((bx(v) + (bx(v) >> 4)) & 0x0F0F0F0F) % 255
v = (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
v = ((v + (v >> 4)) & 0x0F0F0F0F)
return v % 255
#http://mail.python.org/pipermail/python-list/1999-July/007696.html
import marshal, array, struct, string
POPCOUNT8_TRANS_TABLE = "".join(map(chr, POPCOUNT_TABLE8))
#changed by me to match new dumps() and use sum()
def popcount_timpeters(v):
counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
# overwrite type code
counts[0] = 0
return sum(counts)
#changed by me to add loop unrolling and not setting digitcounts[0]
def popcount32_timpeters2(v):
counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
return counts[1] + counts[2] + counts[3] + counts[4]
#improved by me: no need to translate type char
def popcount32_timpeters3(v):
dumped = marshal.dumps(v)
return POPCOUNT_TABLE8[ord(dumped[1])] + POPCOUNT_TABLE8[ord(dumped[2])] + POPCOUNT_TABLE8[ord(dumped[3])] + POPCOUNT_TABLE8[ord(dumped[4])]
#http://mail.python.org/pipermail/python-list/1999-July/007696.html
_run2mask = {1: 0x5555555555555555L,
2: 0x3333333333333333L,
4: 0x0F0F0F0F0F0F0F0FL,
8: 0x00FF00FF00FF00FFL}
def buildmask2(run, n):
run2 = run + run
k = (n + run2 - 1) / run2
n = k * run2
try:
answer = _run2mask[run]
k2 = 64 / run2
except KeyError:
answer = (1L << run) - 1
k2 = 1
while k > k2:
k2 = k2 + k2
if k >= k2:
answer = answer | (answer << (run * k2))
else:
answer = answer | (answer << (run2 * (k - k2/2)))
k2 = k
if k2 > k:
answer = answer >> (run2 * (k2 - k))
return answer, n
def nbits(v):
return 32 #???
def popcount_seistrup(v):
lomask, n = buildmask2(1, nbits(v))
v = v - ((v >> 1) & lomask)
target = 2
while n > target:
lomask, n = buildmask2(target, n)
v = (v & lomask) + ((v >> target) & lomask)
target = target + target
for i in range(1, target/2):
if n <= target:
break
n = n >> 1
n = (n + target - 1) / target * target
v = (v & ((1L <<>> n)
return int(v)
if __name__ == "__main__":
import time, random
def time_func(func):
before = time.time()
result = func()
after = time.time()
span = after - before
return result, span
popcounts = [popcount_naive,
popcount32_table8,
popcount32_table16,
popcount_table16_kernighan,
popcount_kernighan,
popcount32_ops64,
popcount32_parallel,
popcount32_bagwell,
popcount_elklund,
popcount32_freebsd,
popcount32_bsdmacro,
popcount_seistrup,
popcount_timpeters,
popcount32_timpeters2,
popcount32_timpeters3,
]
test_count = 30000
max_int = 2**31 - 2
ints = range(0, 257) + [random.randint(0, max_int) for _ in xrange(test_count)] + range(max_int - 100, max_int+1)
expected_counts = map(popcount_naive, ints)
for popcount in popcounts:
counts, timespan = time_func(lambda : map(popcount, ints))
print "%5.5f %s" % ((timespan if (counts == expected_counts) else -1), popcount.__name__) #-1 means failure