I've learned. I'll share.

April 1, 2008

You Could Have Invented Monadic Parsing

Even though Pysec (and functional programming in python in general) turned out to be 2-3x slower, monadic parsing is still great for certain tasks. After I wrote about Pysec the first time, I received comments regarding pyparsing, which is a great parsing library that looks very similar. Today, I'd like to show you a very simple example where monadic parsing really shines above all other parsing techniques, including pyparsing. This is a real-world need that I had, not just an academic thought-experiment.

The situation is that I'm writing a peer-to-peer file synchronizer and I need to serialize binary data in an efficient way. I've choosen to use what I call "sized binary", which is basically netstrings without the comma or bencode's byte string. The byte-string "ABC", for example, would be encoded as "3:ABC", or "Hello World!" would be "12:Hello World!". This is very efficient because "ABC" or "Hello World!" can be any bytes \x00-\xFF, without any sort of encoding or decoding like uuencode.

Stop and think for a minute how you would define a grammer to parse "3:ABC". It's so simple, right? Well, try and write a grammar for it. Try using BNF, EBNF, yacc/bison, pyparsing, SimpleParse, or any parsing libarary you know of, in any programming language. If you manage, please write me an email and let me know how you did it, because I'm not so sure it's even possible. Maybe it is in the Perl6 grammars; they're throwing the kitchen sink into that thing :). The best I've seen is that the creator of pyparsing hacked a grammar object to change how it parses the "ABC" part after a different object parsed the "3" part. This basically uses the grammar object as a global variable. It works in a single-core world, but it won't work in the many-core world we're headed for.

But now you're getting bored with me because it's silly ringing my hands over such a simple format. I bet you could parse this format in just 4 lines of code:

def parse_sized_binary(stream):
    size_bytes, stream = stream.readUntil(":")
    size               = int(size_bytes)
    bytes, stream      = stream.read(size)
    return bytes, stream
You're right, it's really simple, at least until you have to embed this little "sized binary language" inside of a bigger language. Perhaps, like me, you need to embed binary data inside of a bigger data structure, and so you need to define a serialization for things like ints, floats, lists, and hash tables. For that stuff, you can define a grammer for a language like JSON or YAML and hand it over to a grammar-based parser library. But now you need a way to tell the library, "OK, when you see my binary data, hand control over to my four lines of code, and then I'll hand control back". You might even say "You know, why don't you just let me hand you any function of the form stream -> (value, stream) and let me control the parsing for a while".

Congratulations, you just invented monadic parsing. The function you wrote, parse_sized_binary, of the form stream -> (bytes, stream), is a Parser Monad. Monadic parsing is just combining lots of these little parsers together. And so, it's incredibly easy to insert arbitrary parsing code, like parse_sized_binary, into any parser.

As a complete example, using Pysec, here's a full parser for a language just like bencode but with support for floats and unicode. Notice that our parse_size_binary is renamed to bencode_sized and is given a more functional definition.

from Pysec import Stub, until, lift, read, many_until, branch

bencode_any     = Stub()
bencode_sized   = until(":") >> lift(int) >> read
bencode_unsized = until("e")
bencode_list    = many_until(bencode_any, "e")
bencode_any.set(branch({"s" : bencode_sized,
                        "u" : bencode_sized >> lift(lambda bytes : bytes.decode("utf8")),
                        "i" : bencode_unsized >> lift(int),
                        "f" : bencode_unsized >> lift(float),
                        "l" : bencode_list,
                        "d" : bencode_list >> lift(dict)}))
I think this is sort of like Greenspun's Tenth Rule. Any implementation of netstrings/sized-binary-data probably has an ad-hoc implementation of monadic parsing in it. Even my own non-monadic version of this parser (which I use for performance reasons) has an ad-hoc implementation of monadic parsing. I'm convinced that if you write a parser for a similar format, you'll implement/invent monadic parsing as well, even if you don't know it.

8 comments:

  1. Using this bencode example and the javascript example, combined with the rest of the source you posted earlier (the Pysec/Parsec in Python post), I tried to create something that can parse something that could otherwise be decoded as follows:

    def tlv(data):
    # data is a string for this example
      import struct
    # two shorts in network byte order
      type, len = struct.unpack('!HH', data[:4])
      value, data = data[4:len+4], data[len+4:]
      return (type,value), data

    I succeeded in using your monad classes and parser decorator to create something to read a short, but passing the length into the next parser (I tried to use read, like in the bencode example) didn't work. It seems likely that the source provided for the javascript example isn't exactly what it is inside the Pysec module you import from in the bencode example.

    Any tips?

    ReplyDelete
  2. Pysec has changed a bit since I posted about it the first time, mostly for performance reasons. If you'd like, I can post somewhere a more complete example that you can actually run. I'm never sure how popular a post will be, so I don't always go to the trouble to make a full, runnable example.

    ReplyDelete
  3. That's understandable. Posting all the code over and over again when the changes are minor could clutter up the posts pretty quickly. I don't know what your plans are for releasing Pysec are but I was wondering, is the code publically accessible anywhere? Those of us keeping up with your blog can keep up with the code too =)

    ReplyDelete
  4. Sorry about not getting to you sooner.

    My code isn't anywhere publicly, other than this blog. Do you know of a good place to host code informally? I guess Google Code looks pretty good, but what do I call the project? "Peter's Stuff"? I might want to release thing other than Pysec.

    Also, I find it hard enough making regular blog entries. I'm not sure how much time I'd have to dedicate to any kind of "project" right now. But perhaps if there's sufficient interest for it, I could find a little time.

    ReplyDelete
  5. Well, a formal "project" isn't really necessary, I was thinking just a pub directory somewhere ;-)

    Or, if you have a revision control system and a webhost, you could have a source browser (such as trac, or viewcvs).

    Just a suggestion, though. Thanks for the full source on the most recent post, by the way =)

    ReplyDelete
  6. Now that you have discovered github, please post post your work on this parser there. I love the simplicity and ease for defining a grammar.

    ReplyDelete
  7. Bencode isn’t context-free, so of course this isn’t possible. One of the reasons I think it’s a terrible format. If you restrict the length of Bytestrings you could probably describe it with a gigantic grammar, but I didn’t feel like figuring it out.

    ReplyDelete
  8. There's also an issue with the parser you posted; viz. it allows you to have things like this 00000000000000000005:hello which would be bad. :-) I found a lot of buffer overflows this way... most people aren't subtle enough in writing these parsers. I ended up writing one by hand for Bencode, but it took a long time. — What a total crap format.

    ReplyDelete

Blog Archive

Google Analytics