I've learned. I'll share.

January 14, 2009

Popcount in Python (with benchmarks)

Purpose

A slightly uncommon bit-twiddling operation is "popcount", which counts the number high bits. While uncommon, it's crucial for implementing Array Mapped Tries, which is a way to implement immutable maps and vectors. I'm trying to implement immutable maps and vectors, so I needed to implement popcount.

I found A TON of ways, but had no idea which was fastest. So, I implemented them all and tested them. The short answer is to do this:

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
And here's the long answer:

Results

I ran popcount on 30,000 random ints between 0 and 231-1. Here are the results from my Linux Desktop with a 3.2Ghz Core 2 Quad:

Name TimeMax BitsURL
c_bagwell 0.003032http://lamp.epfl.ch/papers/idealhashtrees.pdf
c_bsdmacro 0.003032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
c_parallel 0.003132http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
c_ops64 0.003632http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
table16 0.009832http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
c_table8 0.011032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
table+kern 0.0132 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.016332http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.019032http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
parallel 0.019932http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.020732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
bagwell 0.024232http://lamp.epfl.ch/papers/idealhashtrees.pdf
freebsd 0.025732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
timpeters3 0.027732http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters2 0.058032http://mail.python.org/pipermail/python-list/1999-July/007696.html
timpeters 0.0724 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.0745 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
elklund 0.0889 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.1519 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.2447 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

And here are the results for my MacBook with a 2.0Ghz Core 2 Duo:

Name TimeMax BitsURL
table16 0.027932http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table+kern 0.0372 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
table8 0.041732http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
bsdmacro 0.048132http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
bagwell 0.059632http://lamp.epfl.ch/papers/idealhashtrees.pdf
timpeters3 0.074432http://mail.python.org/pipermail/python-list/1999-July/007696.html
parallel 0.080732http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
ops64 0.129032http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
timpeters2 0.143932http://mail.python.org/pipermail/python-list/1999-July/007696.html
kernighan 0.1527 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
timpeters 0.1668 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html
freebsd 0.1772 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
elklund 0.1913 -http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
naive 0.2889 -http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
seistrup 0.6106 ?http://mail.python.org/pipermail/python-list/1999-July/007696.html

Observations

  1. Implementations written in C (using Pyrex) are 5-6 times faster than Python.
  2. My 3.2 Ghz Linux Desktop is 200% faster than my 2.0 Ghz MacBook even though it only has a 60% clockspeed advantage. I'm not sure what to make of that. Python doesn't use multiple cores well, so I'm sure it's not that.
  3. If you want to run popcount on 32-bit values in pure Python, table16 is the fastest by far, and only 3 times slower than implementations in C.
  4. If you need to run popcount on arbitrarily large integers, kernighan is the best, but doing a hybrid of table16 and kernighan is probably better.

Conclusion

If you don't mind using about 64KB of memory, here's is the fastest popcount in pure python:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])
If you do mind the memory usage, here's a slighly slower version:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])
And if you need to handle values of more than 32 bits, one of these two are the best:
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

POPCOUNT32_LIMIT = 2**32-1
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount_hybrid(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c

If it needs to be faster than that, write in C!

Test For Yourself

If you want to test these yourself, here's some code you can run.
#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetNaive
def popcount_naive(v):
    c = 0
    while v:
        c += (v & 1)
        v >>= 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetKernighan
def popcount_kernighan(v):
    c = 0
    while v:
        v &= v - 1
        c += 1
    return c

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetTable
POPCOUNT_TABLE8 = [0] * 2**8
for index in xrange(len(POPCOUNT_TABLE8)):
    POPCOUNT_TABLE8[index] = (index & 1) + POPCOUNT_TABLE8[index >> 1]

def popcount32_table8(v):
    return (POPCOUNT_TABLE8[ v        & 0xff] +
            POPCOUNT_TABLE8[(v >> 8)  & 0xff] +
            POPCOUNT_TABLE8[(v >> 16) & 0xff] +
            POPCOUNT_TABLE8[ v >> 24        ])

POPCOUNT_TABLE16 = [0] * 2**16
for index in xrange(len(POPCOUNT_TABLE16)):
    POPCOUNT_TABLE16[index] = (index & 1) + POPCOUNT_TABLE16[index >> 1]

def popcount32_table16(v):
    return (POPCOUNT_TABLE16[ v        & 0xffff] +
            POPCOUNT_TABLE16[(v >> 16) & 0xffff])

POPCOUNT32_LIMIT = 2**32-1
def popcount_table16_kernighan(v):
    if v <= POPCOUNT32_LIMIT:
        return (POPCOUNT_TABLE16[ v        & 0xffff] +
                POPCOUNT_TABLE16[(v >> 16) & 0xffff])
    else:
        c = 0
        while v:
            v &= v - 1
            c += 1
        return c


#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSet64
def popcount32_ops64(v):
    return ((((v & 0xfff) * 0x1001001001001 & 0x84210842108421)            % 0x1f) +
            ((((v & 0xfff000) >> 12) * 0x1001001001001 & 0x84210842108421) % 0x1f) +
            (((v >> 24) * 0x1001001001001 & 0x84210842108421)              % 0x1f))
   

#http://graphics.stanford.edu/~seander/bithacks.html#CountBitsSetParallel
#also http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_parallel(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    #v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    #v = v                + ((v >> 4) & 0x0F0F0F0F)
    v = (v + (v >> 4)) & 0x0F0F0F0F
    v = (v * 0x1010101) >> 24
    return v % 256 #I added %256. I'm not sure why it's needed. It's probably because of signed ints in Python

def popcount32_bagwell(v):
    v = v                - ((v >> 1) & 0x55555555)
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333)
    v = (v & 0x0F0F0F0F) + ((v >> 4) & 0x0F0F0F0F)
    v = v                +  (v >> 8)
    v = (v + (v >> 16)) & 0x3F
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount_elklund(v):
    c = 0
    while v:
        v ^= v & -v
        c += 1
    return c

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_freebsd(v):
    v = (v & 0x55555555) + ((v & 0xaaaaaaaa) >> 1);
    v = (v & 0x33333333) + ((v & 0xcccccccc) >> 2);
    v = (v & 0x0f0f0f0f) + ((v & 0xf0f0f0f0) >> 4);
    v = (v & 0x00ff00ff) + ((v & 0xff00ff00) >> 8);
    v = (v & 0x0000ffff) + ((v & 0xffff0000) >> 16);
    return v

#http://resnet.uoregon.edu/~gurney_j/jmpc/bitwise.html
def popcount32_bsdmacro(v):
    #define BITCOUNT(x) (((BX_(x)+(BX_(x)>>4)) & 0x0F0F0F0F) % 255)
    #define BX_(x) ((x) - (((x)>>1)&) - (((x)>>2)&0x33333333) - (((x)>>3)&0x11111111))
    #
    #def bx(v): (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    #def bc(v): ((bx(v) + (bx(v) >> 4)) & 0x0F0F0F0F) % 255

    v = (v - ((v >> 1) & 0x77777777) - ((v >> 2) & 0x33333333) - ((v >> 3) & 0x11111111))
    v = ((v + (v >> 4)) & 0x0F0F0F0F)
    return v % 255



#http://mail.python.org/pipermail/python-list/1999-July/007696.html
import marshal, array, struct, string
POPCOUNT8_TRANS_TABLE = "".join(map(chr, POPCOUNT_TABLE8))

#changed by me to match new dumps() and use sum()
def popcount_timpeters(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    # overwrite type code
    counts[0] = 0
    return sum(counts)

#changed by me to add loop unrolling and not setting digitcounts[0]
def popcount32_timpeters2(v):
    counts = array.array("B", string.translate(marshal.dumps(v), POPCOUNT8_TRANS_TABLE))
    return counts[1] + counts[2] + counts[3] + counts[4]

#improved by me: no need to translate type char
def popcount32_timpeters3(v):
    dumped = marshal.dumps(v)
    return POPCOUNT_TABLE8[ord(dumped[1])] + POPCOUNT_TABLE8[ord(dumped[2])] + POPCOUNT_TABLE8[ord(dumped[3])] + POPCOUNT_TABLE8[ord(dumped[4])]

#http://mail.python.org/pipermail/python-list/1999-July/007696.html
_run2mask = {1: 0x5555555555555555L,
             2: 0x3333333333333333L,
             4: 0x0F0F0F0F0F0F0F0FL,
             8: 0x00FF00FF00FF00FFL}

def buildmask2(run, n):
    run2 = run + run
    k = (n + run2 - 1) / run2
    n = k * run2
    try:
        answer = _run2mask[run]
        k2 = 64 / run2
    except KeyError:
        answer = (1L << run) - 1
        k2 = 1
    while k > k2:
        k2 = k2 + k2
        if k >= k2:
            answer = answer | (answer << (run * k2))
        else:
            answer = answer | (answer << (run2 * (k - k2/2)))
            k2 = k
    if k2 > k:
        answer = answer >> (run2 * (k2 - k))
    return answer, n

def nbits(v):
    return 32 #???

def popcount_seistrup(v):
    lomask, n = buildmask2(1, nbits(v))
    v = v - ((v >> 1) & lomask)
    target = 2
    while n > target:
        lomask, n = buildmask2(target, n)
        v = (v & lomask) + ((v >> target) & lomask)
        target = target + target
        for i in range(1, target/2):
            if n <= target:
                break
            n = n >> 1
            n = (n + target - 1) / target * target
            v = (v & ((1L <<>> n)
    return int(v)
   

if __name__ == "__main__":
    import time, random

    def time_func(func):
        before = time.time()
        result = func()
        after  = time.time()
        span   = after - before
        return result, span

    popcounts = [popcount_naive,       
                 popcount32_table8,    
                 popcount32_table16,   
                 popcount_table16_kernighan,
                 popcount_kernighan,   
                 popcount32_ops64,     
                 popcount32_parallel,  
                 popcount32_bagwell,   
                 popcount_elklund,     
                 popcount32_freebsd,   
                 popcount32_bsdmacro,  
                 popcount_seistrup,    
                 popcount_timpeters,   
                 popcount32_timpeters2,
                 popcount32_timpeters3,
                 ]

    test_count      = 30000
    max_int         = 2**31 - 2
    ints            = range(0, 257) + [random.randint(0, max_int) for _ in xrange(test_count)] + range(max_int - 100, max_int+1)
    expected_counts = map(popcount_naive, ints)
    for popcount in popcounts:
        counts, timespan = time_func(lambda : map(popcount, ints))
        print "%5.5f %s" % ((timespan if (counts == expected_counts) else -1), popcount.__name__) #-1 means failure

39 comments:

  1. In python 3 I've found (to my disappointment) that a simple bin(num).count('1') is faster than the table16 algorithm you've posted. As I was trying to replace the gmpy version of popcount I was using, this was not good enough. Thankfully, I was interested in a special case of the problem, namely I was doing a test for

    popcount(a) > 1

    so, inspired by the posted algorithm, I was able to replace that with:

    a not in onebit

    where onebit is a tuple constructed as follows:

    onebit = [0]
    for i in range(32): onebit.append(2**i)
    onebit = tuple(onebit)

    another benefit is that you get to set the size you are interested in just by changing the range of the loop. (in my case, 17 was sufficient.

    ReplyDelete
  2. Interesting results, thanks for compiling this.

    One option that has been over-looked, perhaps, is an 11-bit look-up table:

    POPCOUNT_TABLE11 = [0] * (1 << 11)
    for index in xrange(len(POPCOUNT_TABLE11)):
        POPCOUNT_TABLE11[index] = (index & 1) + POPCOUNT_TABLE11[index >> 1]

    def popcount32_table11(v):
        return (POPCOUNT_TABLE11[ v        & 0x7ff] +
                POPCOUNT_TABLE11[(v >> 11) & 0x7ff] +
                POPCOUNT_TABLE11[ v >> 22         ])


    This actually runs the fastest on my system, I imagine due to caching better than the 16-bit table, being 32x smaller.

    ReplyDelete
  3. The problem is that your benchmarks assume your aren't doing anything OTHER than popcount. On newer x86 processors, the 64K table will fit in (and fill) the L1 data cache, making it very fast, but only as long as it can hog the entire L1 cache! In real code, you'll likely be doing other things that require L1 cache as well, so this will run effectively 3x slower (L2 cache speed)...

    ReplyDelete
  4. I am read your article and it’s helped me a lot, thank you so much for share this wonderful article, I really like to read this type of blog, keep sharing this type of useful information, I am suggest to my all contact to visit this blog and collect this helpful information, If any one searching website designing & development company in Delhi please visit my website, we are do for you any think like website build PPC service and many more so don’t waste your time and come on my website and grow more your business, Thank you so much for read my comment. Ecommerce Website Designing Company in Delhii

    ReplyDelete
  5. This was be great I will read your blog properly, thank you so much for share this valuable information, I am enjoy to read this, it’s provided me lot of information, if any one grow up looking so please visit me website and I am share for you some tips so you will go and grow up your looks, thank you so much for read my comment.
    Lifestyle Magazine India

    ReplyDelete

  6. Thanks for sharing this informative content.,
    Leanpitch provides online training in Scrum Master Certification during this lockdown period everyone can use it wisely.
    Join Leanpitch 2 Days CSM Certification Workshop in different cities.
    CSM training online

    Scrum master training online

    ReplyDelete
  7. Thanks for sharing this.,
    Leanpitch provides online training in Scrum Master during this lockdown period everyone can use it wisely.
    Join Leanpitch 2 Days CSM Certification Workshop in different cities.

    Scrum master certification

    csm certification

    certified scrum master certification

    certified scrum master

    agile scrum master certification

    scrum master certification cost

    csm certification cost

    Scrum master Training

    Scrum master

    Best Scrum master certification

    ReplyDelete
  8. Very informative post and really appreciable. Keep doing great work like this.

    My web site - 부산오피

    (freaky)

    ReplyDelete
  9. You provide such a wonderful article, the way of expressing the word is excellent. Fix the Common Verizon Email Problem like Change Verizon Password, Verizon email is not Working, Forgot Verizon Password etc.

    ReplyDelete
  10. After I originally left a comment I appear to have clicked on the -Notify me when new comments are added- checkbox and from now on every time a comment is added I receive 4 emails with the exact same comment. Perhaps there is an easy method you are able to remove me from that service? Thank you!

    대딸방

    ReplyDelete
  11. Thanks , I have just been looking for information approximately this subject for a while and yours is the greatest I've found out till now. However, what concerning the conclusion? Are you sure about the source?

    타이마사지

    ReplyDelete
  12. Pretty section of content. I just stumbled upon your web blog and in accession capital to assert that I get actually enjoyed account your blog posts. Any way I will be subscribing to your feeds and even I achievement you access consistently quickly.

    안마

    ReplyDelete
  13. Hi there to every one, the contents existing at this web site are truly amazing for people knowledge, well, keep up the good work fellows.
    성인웹툰

    ReplyDelete
  14. Say no to paying tons of cash for overpriced Facebook advertising! Let me show you a platform that charges only a very small payment and provides an almost endless volume of web traffic to your website 안전놀이터

    ReplyDelete
  15. There is something wrong with the following line in function popcount_seistrup in your "Try For Yourself" code :

    v = (v & ((1L <<>> n)

    Python reports syntax error on this line! Seems there is something missing?

    Care to fix it?

    ReplyDelete
  16. Everything is very open with a clear clarification of the issues. It was truly informative. Your site is useful. Thank you for sharing!data analytics course in jodhpur

    ReplyDelete
  17. I always like finding a site that gives you very good ideas because I like learning new stuff. Happy that I found your site because I greatly liked it and I anticipate your following post. A fantastic blog and i’ll come back again for more useful content. ufabet thai vip

    ReplyDelete
  18. 스포츠중계
    토토사이트
    토토

    You make so many great points here that I read your article a couple of times. Your views are in accordance with my own for the most part. This is great content for your readers.

    ReplyDelete
  19. 토토사이트
    배트맨토토

    Just pure classic stuff from you here. I have never seen such a brilliantly written article in a long time. I am thankful to you that you produced this!

    ReplyDelete
  20. Thanks for sharing this article, what you posted is very useful for us and everyone, for about movies and web series provide more information...

    Upcoming Bollywood Movie

    ReplyDelete
  21. Can I attach this article? I want to tell my friends. My blog is pretty interesting too.
    먹튀온라인

    ReplyDelete


  22. will be praised anywhere. I am a columnist This post is really the best on this valuable topic 검증카지노

    ReplyDelete
  23. Great Post
    MyAssignmenthelp.sg provides Dissertation Help Singapore for university students. Our experienced writers provide quality dissertation help so that you can get good grades. Our team of experts has years of experience in writing dissertations and has the ability to draft a perfect dissertation for you. We provide personalized service for dissertation help in Singapore to make sure that you get the best out of it. We also provide editing and proofreading services for dissertation help in Singapore. We are available 24X7 for you to help you with your dissertation queries.

    ReplyDelete
  24. Nice Post
    Need top-quality assignment help in Malta? Look no further than maltaassignmenthelp.com! Our expert team of writers can assist with any assignment, big or small.

    ReplyDelete
  25. I am really thankful to the valuable information for us. Keep continue to share such an amazing blog. Visit this site to know more about this file uncontested divorce virginia

    ReplyDelete
  26. Great blog
    Greeceassignmenthelp.com understands that students in Greece have busy schedules and often struggle to balance their coursework with other responsibilities. That's why we offer a convenient and affordable way to buy Greece assignment services online.

    ReplyDelete
  27. convert youtube to mp3 online high quality
    https www.youtube.coyoutube to mp3
    download youtube videos to mp3 on your phone
    compare youtube to mp3 online converter
    how to convert a youtube to mp3
    https://yttomp3.pro/
    audio file converter to mp3 youtube
    mp3 skull youtube to mp3 converter
    use youtube to mp3 on ipad
    youtube to mp3 converter program online

    ReplyDelete
  28. Great Post
    "Elevate your home decor with the alluring charm of resin clocks from ClassyArtz.com. Discover a symphony of colors and textures in our online collection, where resin takes center stage in the art of timekeeping. Buy resin clocks online and create a striking focal point that reflects your sophisticated taste."

    ReplyDelete
  29. Capture the beauty and love of your wedding day by preserving your garland in resin. Our dedicated artisans use high-quality resin to encapsulate the delicate flowers, securing their charm and colors. The resin forms a protective layer, preserving the garland's essence for years. Choose our Wedding Garland Preservation In Resin to relive the magic of your special day for a lifetime.

    ReplyDelete
  30. I'm continually amazed by the excellence of your blog. Your dedication to delivering top-notch content truly shines.Ley de Divorcio de Nueva York Propiedad Matrimonial

    ReplyDelete
  31. You guys are following impressive trends. I like the look of your website too. Winter Sale Jackets

    ReplyDelete
  32. "Welcome to Gupta Homeo Clinic, your trusted destination for Vitiligo treatment Clinic in Jaipur. Our clinic stands as a beacon of hope, employing advanced homeopathic methodologies to treat Vitiligo. With a dedicated team and personalized protocols, we strive to re-pigment affected areas naturally. Gupta Homeo Clinic in Jaipur is committed to delivering comprehensive and reliable care, ensuring patients receive the best possible treatment for Vitiligo."

    ReplyDelete

Blog Archive

Google Analytics