I've learned. I'll share.

February 12, 2008

Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)

Update: It turns out that there's a similar technique being used by pyparsing. I hadn't seen it before and when I first saw it I thought had reinvted the wheel and wasted my time. But upon further inspection, Pysec does a few things a much better than pyparsing, which happen to be the exact things that I need. There's no coincedence, of course, that Pysec matches my needs well. I'll be covering this in more detail in a future article.

Update 2: I got @do syntax to work! Again, stay tuned for another article on how.

I've talked about monads in the past, but I never really covered what purpose they serve. I covered the "how" in Python and Ruby, but I doubt I'll ever full cover the "why" because it's simply too big of a subject. But today, I'd like to share with you one example of how monads are useful. In fact, it's the example that motivated me to do all of the monad stuff in the first place: parsing.

Parsing is a topic that's been around pretty much forever in computer science and most people thing it's pretty much "solved". My experience is that we've still got a long way to go. Specifically, I'm writing an application with lots of distributed concurrency, which requires lots of data serialization and deserialization (aka parsing). There are very few good serialization libraries out there, and I've been through three or four versions of various techniques. Finally, I think I have found a parsing technique that works well: monadic combinatoric parsing. And it's in Python.

What the heck does that mean? "monadic" means we're using monads. "combinatoric" means we can take monad parsers and combine them to make new monad parsers, which is extremly powerful. I call it Pysec. The design is a copy of Parsec brought to Python. Notice how I said "design"; I didn't bother looking at any of their code; The design described on their web page was good enough guidance for me. But, I'm sure that their implementation is WAY better than mine. If you want to see real monadic parsing, look at Parsec. If you're interested in monadic parsing for Python, keep reading.

Here's an example of Pysec for parsing a subset of JSON:

from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection

# json_choices is a hack to get around mutual recursion 
# a json is value is one of text, number, mapping, and collection
# text is any characters between quotes
# a number is like the regular expression -?[0-9]+(\.[0-9]+)?
# "parser >> Parser.lift(func)" means "pass the parsed value into func and return a new Parser"
# quoted_collection(start, space, inner, joiner, end)
#   means "a list of inner separated by joiner surrounded by start and end"
# we have to put a lot of "spaces" in since JSON allows lot of optional whitespace

json_choices = []
json         = choice(json_choices)
text         = quoted_chars("'", "'")
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)
json_choices.extend([text, number, mapping, collection])

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")

Like most monadic or functional code, it's pretty dense, so don't feel bad if you go cross-eyed looking at it the first time. Realize that most of the code is building a Parser monad called "json", which parses the test string at the end. I tried to comment each individual part to explain what's going on.

You may be thinking "why would I want to write code like this?". One response is to look at the code: it's the bulk of JSON parsing in 15 lines that look like a grammar definition! Another response I can give is a challange: go write a parser to parse that string and then compare your code to this code. Which is shorter? Which is easier to read? Which is more elegant? While in college, I had a team project to write a compiler for a subset of Pascal. We were smart enough to use Python, but dumb enough to use Yacc and Flex. I'm sure the parser portion was pretty fast, but it was incredibly painful to get it right. Once we did, we dared not touch it for fear of breaking it. I really wish I had Parse/Pysec back then (ok, Parsec was around back then, but I hadn't even heard of Haskell or monads).

But monadic combinatoric parsing isn't just about making your code look like a grammar definition. It makes it possible to combine parsers in incredibly flexible ways. For example, let's say that on a differnt project, you wrote a simplified CSV parser for numbers like this one:

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

print csv(number).parseString("1,2,3\n4,5,6")

And now you realize you'd really like to put whole JSON values in your simplified CSV. In other words, you want to combine the CVS and JSON parsers. I think that you'll find that doing so really isn't as easy as it sounds. Imagine trying to combine two Yacc grammars. It hurts just thinking about that. Luckily, monadic combinatoric parsers make this incredibly easy:

print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")

While this is a slighly contrived example, you must understand that with this technique you can combine any two parsers in a similar fasion. I don't know about you, but that really feels right to me. I haven't seen any parsing techniques as elegant as that.

Everything it's perfect of course, especially since making this work in Python is a bit of a hack. Here are some downsides to this approach:

  • It's hard to debug when you do something wrong.
  • It's annoying to import so many things from Pysec.
  • You have to hack around mutual recursion of values in a strict language like python.
  • There's probably a performance hit.

Most of these can be either fixed or worked around, though, so I think long-term monadic parsing is good bet. I'd like to see what you can do with it or to make it better.

I know how much you all like a working example, so here's some code that you can just cut, paste, and run (in Python 2.5; older versions of Python may require some tweaking). Please ignore the top half. It's mostly stuff I've covered in other articles, like Immutable Records and monad base classes. The meat starts where it says "Parser Monad".

I'd like to talk about it in more detail, but I'll save that for another article. For now, please play around with it and see what you think.

##### Base Libraries included here for convenience ###########

def Record(*props):
    class cls(RecordBase):
        pass

    cls.setProps(props)

    return cls

class RecordBase(tuple):
    PROPS = ()

    def __new__(cls, *values):
        if cls.prepare != RecordBase.prepare:
            values = cls.prepare(*values)
        return cls.fromValues(values)

    @classmethod
    def fromValues(cls, values):
        return tuple.__new__(cls, values)

    def __repr__(self):
        return self.__class__.__name__ + tuple.__repr__(self)

    ## overridable
    @classmethod
    def prepare(cls, *args):
        return args

    ## setting up getters and setters
    @classmethod
    def setProps(cls, props):
        for index, prop in enumerate(props):
            cls.setProp(index, prop)
        cls.PROPS = props

    @classmethod
    def setProp(cls, index, prop):
        getter_name = prop
        setter_name = "set" + prop[0].upper() + prop[1:]

        setattr(cls, getter_name, cls.makeGetter(index, prop))
        setattr(cls, setter_name, cls.makeSetter(index, prop))

    @classmethod
    def makeGetter(cls, index, prop):
        return property(fget = lambda self : self[index])

    @classmethod
    def makeSetter(cls, index, prop):
        def setter(self, value):
            values = (value if current_index == index
                            else current_value
                      for current_index, current_value
                      in enumerate(self))
            return self.fromValues(values)
        return setter

class ByteStream(Record("bytes", "index")):
    @classmethod
    def prepare(cls, bytes, index = 0):
        return (bytes, index)

    def get(self, count):
        start = self.index
        end   = start + count
        bytes = self.bytes[start : end]
        return bytes, (self.setIndex(end) if bytes else self)

def make_decorator(func, *dec_args):
    def decorator(undecorated):
        def decorated(*args, **kargs):
            return func(undecorated, args, kargs, *dec_args) 
        
        decorated.__name__ = undecorated.__name__
        return decorated
    
    decorator.__name__ = func.__name__
    return decorator

decorator = make_decorator

class Monad:
    ## Must be overridden
    def bind(self, func):
        raise NotImplementedError

    @classmethod
    def unit(cls, val):
        raise NotImplementedError

    @classmethod
    def lift(cls, func):
        return (lambda val : cls.unit(func(val)))

    ## useful defaults that should probably NOT be overridden
    def __rshift__(self, bindee):
        return self.bind(bindee)

    def __and__(self, monad):
        return self.shove(monad)
        
    ## could be overridden if useful or if more efficient
    def shove(self, monad):
        return self.bind(lambda _ : monad)

class StateChanger(Record("changer", "bindees"), Monad):
    @classmethod
    def prepare(cls, changer, bindees = ()):
        return (changer, bindees)

    # binding can be slow since it happens at bind time rather than at run time
    def bind(self, bindee):
        return self.setBindees(self.bindees + (bindee,))

    def __call__(self, state):
        return self.run(state)

    def run(self, state0):
        value, state = self.changer(state0) if callable(self.changer) else self.changer
        state        = state0 if state is None else state

        for bindee in self.bindees:
            value, state = bindee(value).run(state)
        return (value, state)

    @classmethod
    def unit(cls, value):
        return cls((value, None))

######## Parser Monad ###########

class ParserState(Record("stream", "position")):
    @classmethod
    def prepare(cls, stream, position = 0):
        return (stream, position)

    def read(self, count):
        collection, stream = self.stream.get(count)
        return collection, self.fromValues((stream, self.position + count))

class Parser(StateChanger):
    def parseString(self, bytes):
        return self.parseStream(ByteStream(bytes))
        
    def parseStream(self, stream):
        state = ParserState(stream)
        value, state = self.run(state)
        return value

class ParseFailed(Exception):
    def __init__(self, message, state):
        self.message = message
        self.state   = state
        Exception.__init__(self, message)

@decorator
def parser(func, func_args, func_kargs):
    def changer(state):
        return func(state, *func_args, **func_kargs)
    changer.__name__ = func.__name__
    return Parser(changer)

##### combinatoric functions #########

@parser
def tokens(state0, count, process):
    tokens, state1 = state0.read(count)

    passed, value = process(tokens)
    if passed:
        return (value, state1)
    else:
        raise ParseFailed(value, state0)
    
def read(count):
    return tokens(count, lambda values : (True, values))

@parser
def skip(state0, parser):
    value, state1 = parser(state0)
    return (None, state1)

@parser
def option(state, default_value, parser):
    try:
        return parser(state)
    except ParseFailed, failure:
        if failure.state == state:
            return (default_value, state)
        else:
            raise
        
@parser
def choice(state, parsers):
    for parser in parsers:
        try:
            return parser(state)
        except ParseFailed, failure:
            if failure.state != state:
                raise failure
    raise ParseFailed("no choices were found", state)

@parser
def match(state0, expected):
    actual, state1 = read(len(expected))(state0)
    if actual == expected:
        return actual, state1
    else:
        raise ParseFailed("expected %r" % (expected,), state0)

def between(before, inner, after):
    return before & inner >> (lambda value : after & Parser.unit(value))

def quoted(before, inner, after):
    return between(match(before), inner, match(after))

def quoted_collection(start, space, inner, joiner, end):
    return quoted(start, space & sep_end_by(inner, joiner), end)

@parser
def many(state, parser, min_count = 0):
    values = []

    try:
        while True:
            value, state = parser(state)
            values.append(value)
    except ParseFailed:
        if len(values) < min_count:
            raise

    return values, state
    
@parser
def group(state, parsers):
    values = []

    for parser in parsers:
        value, state = parser(state)
        values.append(value)

    return values, state

def pair(parser1, parser2):
    # return group((parser1, parser2))
    return parser1 >> (lambda value1 : parser2 >> (lambda value2 : Parser.unit((value1, value2))))

@parser
def skip_many(state, parser):
    try:
        while True:
            value, state = parser(state)
    except ParseFailed:
        return (None, state)

def skip_before(before, parser):
    return skip(before) & parser

@parser
def skip_after(state0, parser, after):
    value, state1 = parser(state0)
    _,     state2 = after(state1)
    return value, state2

@parser
def option_many(state0, first, repeated, min_count = 0):
    try:
        first_value, state1 = first(state0)
    except ParseFailed:
        if min_count > 0:
            raise
        else:
            return [], state0
    else:
        values, state2 = many(repeated, min_count-1)(state1)
        values.insert(0, first_value)
        return values, state2

# parser separated and ended by sep
def end_by(parser, sep_parser, min_count = 0):
    return many(skip_after(parser, sep_parser), min_count)

# parser separated by sep
def sep_by(parser, sep_parser, min_count = 0):
    return option_many(parser, skip_before(sep_parser, parser), min_count)
    
# parser separated and optionally ended by sep
def sep_end_by(parser, sep_parser, min_count = 0):
    return skip_after(sep_by(parser, sep_parser, min_count), option(None, sep_parser))

##### char-specific parsing ###########

def satisfy(name, passes):
    return tokens(1, lambda char : (True, char) if passes(char) else (False, "not " + name))

def one_of(chars):
    char_set = frozenset(chars)
    return satisfy("one of %r" % chars, lambda char : char in char_set)

def none_of(chars):
    char_set = frozenset(chars)
    return satisfy("not one of %r" % chars, lambda char : char and char not in char_set)

def maybe_match_parser(parser):
    return match(parser) if isinstance(parser, str) else parser

def maybe_match_parsers(parsers):
    return tuple(maybe_match_parser(parser) for parser in parsers)

def many_chars(parser, min_count = 0):
    return join_chars(many(parser, min_count))

def option_chars(parsers):
    return option("", group_chars(parsers))

def group_chars(parsers):
    return join_chars(group(maybe_match_parsers(parsers)))
    #return join_chars(group(parsers))

def join_chars(parser):
    return parser >> Parser.lift("".join)

def while_one_of(chars, min_count = 0):
    return many_chars(one_of(chars), min_count)

def until_one_of(chars, min_count = 0):
    return many_chars(none_of(chars), min_count)

def char_range(begin, end):
    return "".join(chr(num) for num in xrange(ord(begin), ord(end)))

def quoted_chars(start, end):
    assert len(end) == 1, "end string must be exactly 1 character"
    return quoted(start, many_chars(none_of(end)), end)

digit  = one_of(char_range("0", "9"))
digits = many_chars(digit, min_count = 1)
space  = one_of(" \v\f\t\r\n")
spaces = skip_many(space)


############# simplified JSON ########################

#from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection, sep_end_by

#HACK: json_choices is used to get around mutual recursion 
#a json is value is one of text, number, mapping, and collection, which we define later 
json_choices = []
json         = choice(json_choices)

#text is any characters between quotes
text         = quoted_chars("'", "'")

#sort of like the regular expression -?[0-9]+(\.[0-9]+)?
#in case you're unfamiliar with monads, "parser >> Parser.lift(func)" means "pass the parsed value into func but give me a new Parser back"
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)

#quoted_collection(start, space, inner, joiner, end) means "a list of inner separated by joiner surrounded by start and end"
#also, we have to put a lot of spaces in there since JSON allows lot of optional whitespace
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)

#HACK: finish the work around mutual recursion
json_choices.extend([text, number, mapping, collection])


############# simplified CSV ########################

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

############# testing ####################

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")
print csv(number).parseString("1,2,3\n4,5,6")
print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")

13 comments:

  1. that's really good!! lets kill regular expression :)

    ReplyDelete
  2. How's this different from pyparsing?

    ReplyDelete
  3. Congratulations on a nice little parsing library. For comparison, here is a pyparsing JSON parser (http://pyparsing.wikispaces.com/space/showimage/jsonParser.py). It should look very familiar to you! :)

    Come visit the pyparsing wiki, the concepts are very similar. One of the big differences is that your monads are implemented in functions, and pyparsing uses class instances. By using objects, it is easier to attach additional behavior (such as parse actions and field names) to the parser expression monads. Also, pyparsing creates a very rich return type, the ParseResults class, instead of just returning nested lists of tokens.

    ReplyDelete
  4. Paul, I had no idea pyparsing even existed. I hate reinventing the wheel and searched far and wide for parsing tools, but pyparsing never came up. Now when I Google for "python parsing", I see that pyparsing is the last result on the first page, but I guess my eyes missed it when I looked before. Shame on me.

    I congratulate you on your work. It looks great. It's almost what I was trying to accomplish, but certainly more mature. Had I known about it, I may have used it instead, although I'm glad that I could explore monadic parsing as well.

    But, I have some peculiar needs in a parser that I wrote using Pysec. I'd like to see if pyparsing can do it also, but there's not much documentation that I can find on the wiki. Would you mind answering a few questions? You can respond here or direct me to a better place to have a discussion.

    Here's what I'm wondering if pyparsing is capable of:

    1. I need to read size-prefixed binary data, much like what's found in the bittorrent format called bencode. It looks like this:

    b7:XXXXXXX

    Where each X is a byte. The parser has to read the 7, parse it, and then read 7 bytes, no matter what they are, and then continue parsing from there. Can pyparsing do this? In Pysec, it looks a bit like this:

    (match("b") & integer) >> (lambda size : (match(":") & read(size)))

    2. I need something a lot like macro expansion, except that rather then expand the reference into the text of the anchor, I need it to expand into the parsed value of the anchor. So, for example, I need to parse something that looks like this:

    [&a [1, 2, 3], @a, @a]

    Into what you would get if you did this in python:

    a = [1, 2, 3]
    [a, a, a]

    It requires keeping a table of parsed values, but that's easy because a Pysec Parser is really a StateTransformer monad, so we just carry an inner state in around with the ParserState.

    Can pyparsing do this?

    Thanks in advance Paul.

    ReplyDelete
  5. Peter -

    For an implementation of the size-prefixed binary data, see:
    http://pyparsing.pastebin.com/m72db9ab6

    As for the second question, I don't quite understand what you are describing, but there is a macroProcessor example on the pyparsing wiki Examples page.

    Pyparsing has epydoc-generated class docs, UML diagrams, and two presentations I made at Pycon06, that are included with the source or docs release on SourceForge. If you go to the Documentation page of the wiki, there are some links to articles, blog entries, and if you want to spring for $10, a ShortCut on O'Reilly.

    Post a comment on the wiki home page Discussion tab, if you want to pursue item #2 further.

    -- Paul

    ReplyDelete
  6. Paul, thanks for your response. It helped me compare pyparsing and Pysec. I'll have to write another article comparing the two, but here's what I found initially:

    1. It can parse b7:XXXXXXX, but it's a hack. I'm impressed you figured out such a hack so quickly, but it essentially requires modifying a global variable to work.

    2. It can parse [&a [1, 2, 3] @a @a]. A tweaking of the macro expansion on the wiki is sufficient. But I'm calling this one a hack too because it also requires modifying a global value (the macro table).

    So, it works for my purposes, but I don't really like how. Admittedly, the format I'm parsing isn't definable as a grammar, and so a parsing library built around the idea of a grammar doesn't match too well.

    ReplyDelete
  7. Don't be too impressed. The only reason I was able to crank this out so quickly was because it was mostly a replication of a helper function already defined in pyparsing, called countedArray. Now that you've see the other implementation, look at the implementation of countedArray (here: http://pyparsing.pastebin.com/m278ec49a). It also uses a "global" value (the variable arrayExpr), but at least it hides it inside the countedArray method.

    My experience with macro expansion is that it is not a parser-friendly process in general. For instance, unless your parser engine allows you to back up to reparse generated text, you have to perform multiple passes on the input source - one to do macro expansion (potentially recursively, which should still be doable in a single pass), and then a second pass to parse the macro-expanded code. I don't really feel too bad about this, C compilers have to do much the same thing.

    -- Paul

    ReplyDelete
  8. Perhaps I can characterize Pysec/Parsec as a way to build parsers which are macro-expansion-friendly.
    I think the general case of this is that they allow parsers to change behavior based on what as already been parsed, which, in a way, is viewing the parsed text less as a language of a grammar and more as processing instructions.

    ReplyDelete
  9. Hi Peter!

    I've just come across Pysec - a great parser library, I love it!

    I've known about Haskell's Parsec for some time (and that's very impressive too), but it is good to see a similar library in a somewhat-more-familiar language. Haskell has a bit of a learning-curve to it.... ;)

    I'm very keen to use this library in the near-future. I was wondering about the license - is it "public domain"? ( I always give credit anyway... )
    Many thanks for this post (and for a great library!
    - Andy

    ReplyDelete
  10. Andy,

    I'm glad you like it. Unfortunately, for my main use, Pysec too slow (I made a blog entry about this called "why are my monads so slow?"), and so I haven't worked on it at all.

    Hopefully it will be fast enough and work well enough for your needs. You're free to do whatever you want with it. Consider it "public domain", I guess.

    ReplyDelete
  11. Hi Peter -
    Great! Thanks very much for your reply!
    Pysec is a great little app when you're getting into Python and are also keen on FP (as I am).
    As a token of thanks, here's a link to another *very cool* Python parsing app. It's called "yeanpypa" (YEt ANother PYthon PArser lib). Very nicely done - it is inspired by the C++ Boost::Spirit parser library (which I have a fair bit of experience with, and which rocks!).
    Yeanpypa doesn't use an FP approach, but it's quite a nice "BNF"-y style parser.
    http://freshmeat.net/projects/yeanpypa/

    ReplyDelete
  12. how do you tell whether the whole input stream was parsed?

    ReplyDelete
  13. ah this is how:

    value, state = expr.run(ParserState(ByteStream(inputline)))
    if state.position != len(inputline):
    raise ParseError("couldn't parse entire line %s, left over: %s" % (
    inputline, inputline[state.position:]))

    ReplyDelete

Blog Archive

Google Analytics