Even though Pysec (and functional programming in python in general) turned out to be 2-3x slower, monadic parsing is still great for certain tasks. After I wrote about Pysec the first time, I received comments regarding pyparsing, which is a great parsing library that looks very similar. Today, I'd like to show you a very simple example where monadic parsing really shines above all other parsing techniques, including pyparsing. This is a real-world need that I had, not just an academic thought-experiment.
The situation is that I'm writing a peer-to-peer file synchronizer and I need to serialize binary data in an efficient way. I've choosen to use what I call "sized binary", which is basically netstrings without the comma or bencode's byte string. The byte-string "ABC", for example, would be encoded as "3:ABC", or "Hello World!" would be "12:Hello World!". This is very efficient because "ABC" or "Hello World!" can be any bytes \x00-\xFF, without any sort of encoding or decoding like uuencode.
Stop and think for a minute how you would define a grammer to parse "3:ABC". It's so simple, right? Well, try and write a grammar for it. Try using BNF, EBNF, yacc/bison, pyparsing, SimpleParse, or any parsing libarary you know of, in any programming language. If you manage, please write me an email and let me know how you did it, because I'm not so sure it's even possible. Maybe it is in the Perl6 grammars; they're throwing the kitchen sink into that thing :). The best I've seen is that the creator of pyparsing hacked a grammar object to change how it parses the "ABC" part after a different object parsed the "3" part. This basically uses the grammar object as a global variable. It works in a single-core world, but it won't work in the many-core world we're headed for.
But now you're getting bored with me because it's silly ringing my hands over such a simple format. I bet you could parse this format in just 4 lines of code:
def parse_sized_binary(stream):
size_bytes, stream = stream.readUntil(":")
size = int(size_bytes)
bytes, stream = stream.read(size)
return bytes, stream
You're right, it's really simple, at least until you have to embed this little "sized binary language" inside of a bigger language. Perhaps, like me, you need to embed binary data inside of a bigger data structure, and so you need to define a serialization for things like ints, floats, lists, and hash tables. For that stuff, you can define a grammer for a language like JSON or YAML and hand it over to a grammar-based parser library. But now you need a way to tell the library, "OK, when you see my binary data, hand control over to my four lines of code, and then I'll hand control back". You might even say "You know, why don't you just let me hand you any function of the form stream -> (value, stream) and let me control the parsing for a while".
Congratulations, you just invented monadic parsing. The function you wrote, parse_sized_binary, of the form stream -> (bytes, stream), is a Parser Monad. Monadic parsing is just combining lots of these little parsers together. And so, it's incredibly easy to insert arbitrary parsing code, like parse_sized_binary, into any parser.
As a complete example, using Pysec, here's a full parser for a language just like bencode but with support for floats and unicode. Notice that our parse_size_binary is renamed to bencode_sized and is given a more functional definition.
from Pysec import Stub, until, lift, read, many_until, branch
bencode_any = Stub()
bencode_sized = until(":") >> lift(int) >> read
bencode_unsized = until("e")
bencode_list = many_until(bencode_any, "e")
bencode_any.set(branch({"s" : bencode_sized,
"u" : bencode_sized >> lift(lambda bytes : bytes.decode("utf8")),
"i" : bencode_unsized >> lift(int),
"f" : bencode_unsized >> lift(float),
"l" : bencode_list,
"d" : bencode_list >> lift(dict)}))
I think this is sort of like Greenspun's Tenth Rule. Any implementation of netstrings/sized-binary-data probably has an ad-hoc implementation of monadic parsing in it. Even my own non-monadic version of this parser (which I use for performance reasons) has an ad-hoc implementation of monadic parsing. I'm convinced that if you write a parser for a similar format, you'll implement/invent monadic parsing as well, even if you don't know it.
6 comments: