Valued Lessons: Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)

February 12, 2008

Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)

Update: It turns out that there's a similar technique being used by pyparsing. I hadn't seen it before and when I first saw it I thought had reinvted the wheel and wasted my time. But upon further inspection, Pysec does a few things a much better than pyparsing, which happen to be the exact things that I need. There's no coincedence, of course, that Pysec matches my needs well. I'll be covering this in more detail in a future article.

Update 2: I got @do syntax to work! Again, stay tuned for another article on how.

I've talked about monads in the past, but I never really covered what purpose they serve. I covered the "how" in Python and Ruby, but I doubt I'll ever full cover the "why" because it's simply too big of a subject. But today, I'd like to share with you one example of how monads are useful. In fact, it's the example that motivated me to do all of the monad stuff in the first place: parsing.

Parsing is a topic that's been around pretty much forever in computer science and most people thing it's pretty much "solved". My experience is that we've still got a long way to go. Specifically, I'm writing an application with lots of distributed concurrency, which requires lots of data serialization and deserialization (aka parsing). There are very few good serialization libraries out there, and I've been through three or four versions of various techniques. Finally, I think I have found a parsing technique that works well: monadic combinatoric parsing. And it's in Python.

What the heck does that mean? "monadic" means we're using monads. "combinatoric" means we can take monad parsers and combine them to make new monad parsers, which is extremly powerful. I call it Pysec. The design is a copy of Parsec brought to Python. Notice how I said "design"; I didn't bother looking at any of their code; The design described on their web page was good enough guidance for me. But, I'm sure that their implementation is WAY better than mine. If you want to see real monadic parsing, look at Parsec. If you're interested in monadic parsing for Python, keep reading.

Here's an example of Pysec for parsing a subset of JSON:

from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection

# json_choices is a hack to get around mutual recursion 
# a json is value is one of text, number, mapping, and collection
# text is any characters between quotes
# a number is like the regular expression -?[0-9]+(\.[0-9]+)?
# "parser >> Parser.lift(func)" means "pass the parsed value into func and return a new Parser"
# quoted_collection(start, space, inner, joiner, end)
#   means "a list of inner separated by joiner surrounded by start and end"
# we have to put a lot of "spaces" in since JSON allows lot of optional whitespace

json_choices = []
json         = choice(json_choices)
text         = quoted_chars("'", "'")
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)
json_choices.extend([text, number, mapping, collection])

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")

Like most monadic or functional code, it's pretty dense, so don't feel bad if you go cross-eyed looking at it the first time. Realize that most of the code is building a Parser monad called "json", which parses the test string at the end. I tried to comment each individual part to explain what's going on.

You may be thinking "why would I want to write code like this?". One response is to look at the code: it's the bulk of JSON parsing in 15 lines that look like a grammar definition! Another response I can give is a challange: go write a parser to parse that string and then compare your code to this code. Which is shorter? Which is easier to read? Which is more elegant? While in college, I had a team project to write a compiler for a subset of Pascal. We were smart enough to use Python, but dumb enough to use Yacc and Flex. I'm sure the parser portion was pretty fast, but it was incredibly painful to get it right. Once we did, we dared not touch it for fear of breaking it. I really wish I had Parse/Pysec back then (ok, Parsec was around back then, but I hadn't even heard of Haskell or monads).

But monadic combinatoric parsing isn't just about making your code look like a grammar definition. It makes it possible to combine parsers in incredibly flexible ways. For example, let's say that on a differnt project, you wrote a simplified CSV parser for numbers like this one:

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

print csv(number).parseString("1,2,3\n4,5,6")

And now you realize you'd really like to put whole JSON values in your simplified CSV. In other words, you want to combine the CVS and JSON parsers. I think that you'll find that doing so really isn't as easy as it sounds. Imagine trying to combine two Yacc grammars. It hurts just thinking about that. Luckily, monadic combinatoric parsers make this incredibly easy:

print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")

While this is a slighly contrived example, you must understand that with this technique you can combine any two parsers in a similar fasion. I don't know about you, but that really feels right to me. I haven't seen any parsing techniques as elegant as that.

Everything it's perfect of course, especially since making this work in Python is a bit of a hack. Here are some downsides to this approach:

It's hard to debug when you do something wrong.
It's annoying to import so many things from Pysec.
You have to hack around mutual recursion of values in a strict language like python.
There's probably a performance hit.

Most of these can be either fixed or worked around, though, so I think long-term monadic parsing is good bet. I'd like to see what you can do with it or to make it better.

I know how much you all like a working example, so here's some code that you can just cut, paste, and run (in Python 2.5; older versions of Python may require some tweaking). Please ignore the top half. It's mostly stuff I've covered in other articles, like Immutable Records and monad base classes. The meat starts where it says "Parser Monad".

I'd like to talk about it in more detail, but I'll save that for another article. For now, please play around with it and see what you think.

##### Base Libraries included here for convenience ###########

def Record(*props):
    class cls(RecordBase):
        pass

    cls.setProps(props)

    return cls

class RecordBase(tuple):
    PROPS = ()

    def __new__(cls, *values):
        if cls.prepare != RecordBase.prepare:
            values = cls.prepare(*values)
        return cls.fromValues(values)

    @classmethod
    def fromValues(cls, values):
        return tuple.__new__(cls, values)

    def __repr__(self):
        return self.__class__.__name__ + tuple.__repr__(self)

    ## overridable
    @classmethod
    def prepare(cls, *args):
        return args

    ## setting up getters and setters
    @classmethod
    def setProps(cls, props):
        for index, prop in enumerate(props):
            cls.setProp(index, prop)
        cls.PROPS = props

    @classmethod
    def setProp(cls, index, prop):
        getter_name = prop
        setter_name = "set" + prop[0].upper() + prop[1:]

        setattr(cls, getter_name, cls.makeGetter(index, prop))
        setattr(cls, setter_name, cls.makeSetter(index, prop))

    @classmethod
    def makeGetter(cls, index, prop):
        return property(fget = lambda self : self[index])

    @classmethod
    def makeSetter(cls, index, prop):
        def setter(self, value):
            values = (value if current_index == index
                            else current_value
                      for current_index, current_value
                      in enumerate(self))
            return self.fromValues(values)
        return setter

class ByteStream(Record("bytes", "index")):
    @classmethod
    def prepare(cls, bytes, index = 0):
        return (bytes, index)

    def get(self, count):
        start = self.index
        end   = start + count
        bytes = self.bytes[start : end]
        return bytes, (self.setIndex(end) if bytes else self)

def make_decorator(func, *dec_args):
    def decorator(undecorated):
        def decorated(*args, **kargs):
            return func(undecorated, args, kargs, *dec_args) 
        
        decorated.__name__ = undecorated.__name__
        return decorated
    
    decorator.__name__ = func.__name__
    return decorator

decorator = make_decorator

class Monad:
    ## Must be overridden
    def bind(self, func):
        raise NotImplementedError

    @classmethod
    def unit(cls, val):
        raise NotImplementedError

    @classmethod
    def lift(cls, func):
        return (lambda val : cls.unit(func(val)))

    ## useful defaults that should probably NOT be overridden
    def __rshift__(self, bindee):
        return self.bind(bindee)

    def __and__(self, monad):
        return self.shove(monad)
        
    ## could be overridden if useful or if more efficient
    def shove(self, monad):
        return self.bind(lambda _ : monad)

class StateChanger(Record("changer", "bindees"), Monad):
    @classmethod
    def prepare(cls, changer, bindees = ()):
        return (changer, bindees)

    # binding can be slow since it happens at bind time rather than at run time
    def bind(self, bindee):
        return self.setBindees(self.bindees + (bindee,))

    def __call__(self, state):
        return self.run(state)

    def run(self, state0):
        value, state = self.changer(state0) if callable(self.changer) else self.changer
        state        = state0 if state is None else state

        for bindee in self.bindees:
            value, state = bindee(value).run(state)
        return (value, state)

    @classmethod
    def unit(cls, value):
        return cls((value, None))

######## Parser Monad ###########

class ParserState(Record("stream", "position")):
    @classmethod
    def prepare(cls, stream, position = 0):
        return (stream, position)

    def read(self, count):
        collection, stream = self.stream.get(count)
        return collection, self.fromValues((stream, self.position + count))

class Parser(StateChanger):
    def parseString(self, bytes):
        return self.parseStream(ByteStream(bytes))
        
    def parseStream(self, stream):
        state = ParserState(stream)
        value, state = self.run(state)
        return value

class ParseFailed(Exception):
    def __init__(self, message, state):
        self.message = message
        self.state   = state
        Exception.__init__(self, message)

@decorator
def parser(func, func_args, func_kargs):
    def changer(state):
        return func(state, *func_args, **func_kargs)
    changer.__name__ = func.__name__
    return Parser(changer)

##### combinatoric functions #########

@parser
def tokens(state0, count, process):
    tokens, state1 = state0.read(count)

    passed, value = process(tokens)
    if passed:
        return (value, state1)
    else:
        raise ParseFailed(value, state0)
    
def read(count):
    return tokens(count, lambda values : (True, values))

@parser
def skip(state0, parser):
    value, state1 = parser(state0)
    return (None, state1)

@parser
def option(state, default_value, parser):
    try:
        return parser(state)
    except ParseFailed, failure:
        if failure.state == state:
            return (default_value, state)
        else:
            raise
        
@parser
def choice(state, parsers):
    for parser in parsers:
        try:
            return parser(state)
        except ParseFailed, failure:
            if failure.state != state:
                raise failure
    raise ParseFailed("no choices were found", state)

@parser
def match(state0, expected):
    actual, state1 = read(len(expected))(state0)
    if actual == expected:
        return actual, state1
    else:
        raise ParseFailed("expected %r" % (expected,), state0)

def between(before, inner, after):
    return before & inner >> (lambda value : after & Parser.unit(value))

def quoted(before, inner, after):
    return between(match(before), inner, match(after))

def quoted_collection(start, space, inner, joiner, end):
    return quoted(start, space & sep_end_by(inner, joiner), end)

@parser
def many(state, parser, min_count = 0):
    values = []

    try:
        while True:
            value, state = parser(state)
            values.append(value)
    except ParseFailed:
        if len(values) < min_count:
            raise

    return values, state
    
@parser
def group(state, parsers):
    values = []

    for parser in parsers:
        value, state = parser(state)
        values.append(value)

    return values, state

def pair(parser1, parser2):
    # return group((parser1, parser2))
    return parser1 >> (lambda value1 : parser2 >> (lambda value2 : Parser.unit((value1, value2))))

@parser
def skip_many(state, parser):
    try:
        while True:
            value, state = parser(state)
    except ParseFailed:
        return (None, state)

def skip_before(before, parser):
    return skip(before) & parser

@parser
def skip_after(state0, parser, after):
    value, state1 = parser(state0)
    _,     state2 = after(state1)
    return value, state2

@parser
def option_many(state0, first, repeated, min_count = 0):
    try:
        first_value, state1 = first(state0)
    except ParseFailed:
        if min_count > 0:
            raise
        else:
            return [], state0
    else:
        values, state2 = many(repeated, min_count-1)(state1)
        values.insert(0, first_value)
        return values, state2

# parser separated and ended by sep
def end_by(parser, sep_parser, min_count = 0):
    return many(skip_after(parser, sep_parser), min_count)

# parser separated by sep
def sep_by(parser, sep_parser, min_count = 0):
    return option_many(parser, skip_before(sep_parser, parser), min_count)
    
# parser separated and optionally ended by sep
def sep_end_by(parser, sep_parser, min_count = 0):
    return skip_after(sep_by(parser, sep_parser, min_count), option(None, sep_parser))

##### char-specific parsing ###########

def satisfy(name, passes):
    return tokens(1, lambda char : (True, char) if passes(char) else (False, "not " + name))

def one_of(chars):
    char_set = frozenset(chars)
    return satisfy("one of %r" % chars, lambda char : char in char_set)

def none_of(chars):
    char_set = frozenset(chars)
    return satisfy("not one of %r" % chars, lambda char : char and char not in char_set)

def maybe_match_parser(parser):
    return match(parser) if isinstance(parser, str) else parser

def maybe_match_parsers(parsers):
    return tuple(maybe_match_parser(parser) for parser in parsers)

def many_chars(parser, min_count = 0):
    return join_chars(many(parser, min_count))

def option_chars(parsers):
    return option("", group_chars(parsers))

def group_chars(parsers):
    return join_chars(group(maybe_match_parsers(parsers)))
    #return join_chars(group(parsers))

def join_chars(parser):
    return parser >> Parser.lift("".join)

def while_one_of(chars, min_count = 0):
    return many_chars(one_of(chars), min_count)

def until_one_of(chars, min_count = 0):
    return many_chars(none_of(chars), min_count)

def char_range(begin, end):
    return "".join(chr(num) for num in xrange(ord(begin), ord(end)))

def quoted_chars(start, end):
    assert len(end) == 1, "end string must be exactly 1 character"
    return quoted(start, many_chars(none_of(end)), end)

digit  = one_of(char_range("0", "9"))
digits = many_chars(digit, min_count = 1)
space  = one_of(" \v\f\t\r\n")
spaces = skip_many(space)


############# simplified JSON ########################

#from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection, sep_end_by

#HACK: json_choices is used to get around mutual recursion 
#a json is value is one of text, number, mapping, and collection, which we define later 
json_choices = []
json         = choice(json_choices)

#text is any characters between quotes
text         = quoted_chars("'", "'")

#sort of like the regular expression -?[0-9]+(\.[0-9]+)?
#in case you're unfamiliar with monads, "parser >> Parser.lift(func)" means "pass the parsed value into func but give me a new Parser back"
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)

#quoted_collection(start, space, inner, joiner, end) means "a list of inner separated by joiner surrounded by start and end"
#also, we have to put a lot of spaces in there since JSON allows lot of optional whitespace
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)

#HACK: finish the work around mutual recursion
json_choices.extend([text, number, mapping, collection])


############# simplified CSV ########################

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

############# testing ####################

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")
print csv(number).parseString("1,2,3\n4,5,6")
print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")