I've learned. I'll share.

February 12, 2008

Pysec: Monadic Combinatoric Parsing in Python (aka Parsec in Python)

Update: It turns out that there's a similar technique being used by pyparsing. I hadn't seen it before and when I first saw it I thought had reinvted the wheel and wasted my time. But upon further inspection, Pysec does a few things a much better than pyparsing, which happen to be the exact things that I need. There's no coincedence, of course, that Pysec matches my needs well. I'll be covering this in more detail in a future article.

Update 2: I got @do syntax to work! Again, stay tuned for another article on how.

I've talked about monads in the past, but I never really covered what purpose they serve. I covered the "how" in Python and Ruby, but I doubt I'll ever full cover the "why" because it's simply too big of a subject. But today, I'd like to share with you one example of how monads are useful. In fact, it's the example that motivated me to do all of the monad stuff in the first place: parsing.

Parsing is a topic that's been around pretty much forever in computer science and most people thing it's pretty much "solved". My experience is that we've still got a long way to go. Specifically, I'm writing an application with lots of distributed concurrency, which requires lots of data serialization and deserialization (aka parsing). There are very few good serialization libraries out there, and I've been through three or four versions of various techniques. Finally, I think I have found a parsing technique that works well: monadic combinatoric parsing. And it's in Python.

What the heck does that mean? "monadic" means we're using monads. "combinatoric" means we can take monad parsers and combine them to make new monad parsers, which is extremly powerful. I call it Pysec. The design is a copy of Parsec brought to Python. Notice how I said "design"; I didn't bother looking at any of their code; The design described on their web page was good enough guidance for me. But, I'm sure that their implementation is WAY better than mine. If you want to see real monadic parsing, look at Parsec. If you're interested in monadic parsing for Python, keep reading.

Here's an example of Pysec for parsing a subset of JSON:

from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection

# json_choices is a hack to get around mutual recursion 
# a json is value is one of text, number, mapping, and collection
# text is any characters between quotes
# a number is like the regular expression -?[0-9]+(\.[0-9]+)?
# "parser >> Parser.lift(func)" means "pass the parsed value into func and return a new Parser"
# quoted_collection(start, space, inner, joiner, end)
#   means "a list of inner separated by joiner surrounded by start and end"
# we have to put a lot of "spaces" in since JSON allows lot of optional whitespace

json_choices = []
json         = choice(json_choices)
text         = quoted_chars("'", "'")
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)
json_choices.extend([text, number, mapping, collection])

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")

Like most monadic or functional code, it's pretty dense, so don't feel bad if you go cross-eyed looking at it the first time. Realize that most of the code is building a Parser monad called "json", which parses the test string at the end. I tried to comment each individual part to explain what's going on.

You may be thinking "why would I want to write code like this?". One response is to look at the code: it's the bulk of JSON parsing in 15 lines that look like a grammar definition! Another response I can give is a challange: go write a parser to parse that string and then compare your code to this code. Which is shorter? Which is easier to read? Which is more elegant? While in college, I had a team project to write a compiler for a subset of Pascal. We were smart enough to use Python, but dumb enough to use Yacc and Flex. I'm sure the parser portion was pretty fast, but it was incredibly painful to get it right. Once we did, we dared not touch it for fear of breaking it. I really wish I had Parse/Pysec back then (ok, Parsec was around back then, but I hadn't even heard of Haskell or monads).

But monadic combinatoric parsing isn't just about making your code look like a grammar definition. It makes it possible to combine parsers in incredibly flexible ways. For example, let's say that on a differnt project, you wrote a simplified CSV parser for numbers like this one:

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

print csv(number).parseString("1,2,3\n4,5,6")

And now you realize you'd really like to put whole JSON values in your simplified CSV. In other words, you want to combine the CVS and JSON parsers. I think that you'll find that doing so really isn't as easy as it sounds. Imagine trying to combine two Yacc grammars. It hurts just thinking about that. Luckily, monadic combinatoric parsers make this incredibly easy:

print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")

While this is a slighly contrived example, you must understand that with this technique you can combine any two parsers in a similar fasion. I don't know about you, but that really feels right to me. I haven't seen any parsing techniques as elegant as that.

Everything it's perfect of course, especially since making this work in Python is a bit of a hack. Here are some downsides to this approach:

  • It's hard to debug when you do something wrong.
  • It's annoying to import so many things from Pysec.
  • You have to hack around mutual recursion of values in a strict language like python.
  • There's probably a performance hit.

Most of these can be either fixed or worked around, though, so I think long-term monadic parsing is good bet. I'd like to see what you can do with it or to make it better.

I know how much you all like a working example, so here's some code that you can just cut, paste, and run (in Python 2.5; older versions of Python may require some tweaking). Please ignore the top half. It's mostly stuff I've covered in other articles, like Immutable Records and monad base classes. The meat starts where it says "Parser Monad".

I'd like to talk about it in more detail, but I'll save that for another article. For now, please play around with it and see what you think.

##### Base Libraries included here for convenience ###########

def Record(*props):
    class cls(RecordBase):


    return cls

class RecordBase(tuple):
    PROPS = ()

    def __new__(cls, *values):
        if cls.prepare != RecordBase.prepare:
            values = cls.prepare(*values)
        return cls.fromValues(values)

    def fromValues(cls, values):
        return tuple.__new__(cls, values)

    def __repr__(self):
        return self.__class__.__name__ + tuple.__repr__(self)

    ## overridable
    def prepare(cls, *args):
        return args

    ## setting up getters and setters
    def setProps(cls, props):
        for index, prop in enumerate(props):
            cls.setProp(index, prop)
        cls.PROPS = props

    def setProp(cls, index, prop):
        getter_name = prop
        setter_name = "set" + prop[0].upper() + prop[1:]

        setattr(cls, getter_name, cls.makeGetter(index, prop))
        setattr(cls, setter_name, cls.makeSetter(index, prop))

    def makeGetter(cls, index, prop):
        return property(fget = lambda self : self[index])

    def makeSetter(cls, index, prop):
        def setter(self, value):
            values = (value if current_index == index
                            else current_value
                      for current_index, current_value
                      in enumerate(self))
            return self.fromValues(values)
        return setter

class ByteStream(Record("bytes", "index")):
    def prepare(cls, bytes, index = 0):
        return (bytes, index)

    def get(self, count):
        start = self.index
        end   = start + count
        bytes = self.bytes[start : end]
        return bytes, (self.setIndex(end) if bytes else self)

def make_decorator(func, *dec_args):
    def decorator(undecorated):
        def decorated(*args, **kargs):
            return func(undecorated, args, kargs, *dec_args) 
        decorated.__name__ = undecorated.__name__
        return decorated
    decorator.__name__ = func.__name__
    return decorator

decorator = make_decorator

class Monad:
    ## Must be overridden
    def bind(self, func):
        raise NotImplementedError

    def unit(cls, val):
        raise NotImplementedError

    def lift(cls, func):
        return (lambda val : cls.unit(func(val)))

    ## useful defaults that should probably NOT be overridden
    def __rshift__(self, bindee):
        return self.bind(bindee)

    def __and__(self, monad):
        return self.shove(monad)
    ## could be overridden if useful or if more efficient
    def shove(self, monad):
        return self.bind(lambda _ : monad)

class StateChanger(Record("changer", "bindees"), Monad):
    def prepare(cls, changer, bindees = ()):
        return (changer, bindees)

    # binding can be slow since it happens at bind time rather than at run time
    def bind(self, bindee):
        return self.setBindees(self.bindees + (bindee,))

    def __call__(self, state):
        return self.run(state)

    def run(self, state0):
        value, state = self.changer(state0) if callable(self.changer) else self.changer
        state        = state0 if state is None else state

        for bindee in self.bindees:
            value, state = bindee(value).run(state)
        return (value, state)

    def unit(cls, value):
        return cls((value, None))

######## Parser Monad ###########

class ParserState(Record("stream", "position")):
    def prepare(cls, stream, position = 0):
        return (stream, position)

    def read(self, count):
        collection, stream = self.stream.get(count)
        return collection, self.fromValues((stream, self.position + count))

class Parser(StateChanger):
    def parseString(self, bytes):
        return self.parseStream(ByteStream(bytes))
    def parseStream(self, stream):
        state = ParserState(stream)
        value, state = self.run(state)
        return value

class ParseFailed(Exception):
    def __init__(self, message, state):
        self.message = message
        self.state   = state
        Exception.__init__(self, message)

def parser(func, func_args, func_kargs):
    def changer(state):
        return func(state, *func_args, **func_kargs)
    changer.__name__ = func.__name__
    return Parser(changer)

##### combinatoric functions #########

def tokens(state0, count, process):
    tokens, state1 = state0.read(count)

    passed, value = process(tokens)
    if passed:
        return (value, state1)
        raise ParseFailed(value, state0)
def read(count):
    return tokens(count, lambda values : (True, values))

def skip(state0, parser):
    value, state1 = parser(state0)
    return (None, state1)

def option(state, default_value, parser):
        return parser(state)
    except ParseFailed, failure:
        if failure.state == state:
            return (default_value, state)
def choice(state, parsers):
    for parser in parsers:
            return parser(state)
        except ParseFailed, failure:
            if failure.state != state:
                raise failure
    raise ParseFailed("no choices were found", state)

def match(state0, expected):
    actual, state1 = read(len(expected))(state0)
    if actual == expected:
        return actual, state1
        raise ParseFailed("expected %r" % (expected,), state0)

def between(before, inner, after):
    return before & inner >> (lambda value : after & Parser.unit(value))

def quoted(before, inner, after):
    return between(match(before), inner, match(after))

def quoted_collection(start, space, inner, joiner, end):
    return quoted(start, space & sep_end_by(inner, joiner), end)

def many(state, parser, min_count = 0):
    values = []

        while True:
            value, state = parser(state)
    except ParseFailed:
        if len(values) < min_count:

    return values, state
def group(state, parsers):
    values = []

    for parser in parsers:
        value, state = parser(state)

    return values, state

def pair(parser1, parser2):
    # return group((parser1, parser2))
    return parser1 >> (lambda value1 : parser2 >> (lambda value2 : Parser.unit((value1, value2))))

def skip_many(state, parser):
        while True:
            value, state = parser(state)
    except ParseFailed:
        return (None, state)

def skip_before(before, parser):
    return skip(before) & parser

def skip_after(state0, parser, after):
    value, state1 = parser(state0)
    _,     state2 = after(state1)
    return value, state2

def option_many(state0, first, repeated, min_count = 0):
        first_value, state1 = first(state0)
    except ParseFailed:
        if min_count > 0:
            return [], state0
        values, state2 = many(repeated, min_count-1)(state1)
        values.insert(0, first_value)
        return values, state2

# parser separated and ended by sep
def end_by(parser, sep_parser, min_count = 0):
    return many(skip_after(parser, sep_parser), min_count)

# parser separated by sep
def sep_by(parser, sep_parser, min_count = 0):
    return option_many(parser, skip_before(sep_parser, parser), min_count)
# parser separated and optionally ended by sep
def sep_end_by(parser, sep_parser, min_count = 0):
    return skip_after(sep_by(parser, sep_parser, min_count), option(None, sep_parser))

##### char-specific parsing ###########

def satisfy(name, passes):
    return tokens(1, lambda char : (True, char) if passes(char) else (False, "not " + name))

def one_of(chars):
    char_set = frozenset(chars)
    return satisfy("one of %r" % chars, lambda char : char in char_set)

def none_of(chars):
    char_set = frozenset(chars)
    return satisfy("not one of %r" % chars, lambda char : char and char not in char_set)

def maybe_match_parser(parser):
    return match(parser) if isinstance(parser, str) else parser

def maybe_match_parsers(parsers):
    return tuple(maybe_match_parser(parser) for parser in parsers)

def many_chars(parser, min_count = 0):
    return join_chars(many(parser, min_count))

def option_chars(parsers):
    return option("", group_chars(parsers))

def group_chars(parsers):
    return join_chars(group(maybe_match_parsers(parsers)))
    #return join_chars(group(parsers))

def join_chars(parser):
    return parser >> Parser.lift("".join)

def while_one_of(chars, min_count = 0):
    return many_chars(one_of(chars), min_count)

def until_one_of(chars, min_count = 0):
    return many_chars(none_of(chars), min_count)

def char_range(begin, end):
    return "".join(chr(num) for num in xrange(ord(begin), ord(end)))

def quoted_chars(start, end):
    assert len(end) == 1, "end string must be exactly 1 character"
    return quoted(start, many_chars(none_of(end)), end)

digit  = one_of(char_range("0", "9"))
digits = many_chars(digit, min_count = 1)
space  = one_of(" \v\f\t\r\n")
spaces = skip_many(space)

############# simplified JSON ########################

#from Pysec import Parser, choice, quoted_chars, group_chars, option_chars, digits, between, pair, spaces, match, quoted_collection, sep_end_by

#HACK: json_choices is used to get around mutual recursion 
#a json is value is one of text, number, mapping, and collection, which we define later 
json_choices = []
json         = choice(json_choices)

#text is any characters between quotes
text         = quoted_chars("'", "'")

#sort of like the regular expression -?[0-9]+(\.[0-9]+)?
#in case you're unfamiliar with monads, "parser >> Parser.lift(func)" means "pass the parsed value into func but give me a new Parser back"
number       = group_chars([option_chars(["-"]), digits, option_chars([".", digits])]) >> Parser.lift(float)

#quoted_collection(start, space, inner, joiner, end) means "a list of inner separated by joiner surrounded by start and end"
#also, we have to put a lot of spaces in there since JSON allows lot of optional whitespace
joiner       = between(spaces, match(","), spaces)
mapping_pair = pair(text, spaces & match(":") & spaces & json)
collection   = quoted_collection("[", spaces, json,         joiner, "]") >> Parser.lift(list)
mapping      = quoted_collection("{", spaces, mapping_pair, joiner, "}") >> Parser.lift(dict)

#HACK: finish the work around mutual recursion
json_choices.extend([text, number, mapping, collection])

############# simplified CSV ########################

def line(cell):
    return sep_end_by(cell, match(","))

def csv(cell):
    return sep_end_by(line(cell), match("\n"))

############# testing ####################

print json.parseString("{'a' : -1.0, 'b' : 2.0, 'z' : {'c' : [1.0, [2.0, [3.0]]]}}")
print csv(number).parseString("1,2,3\n4,5,6")
print csv(json).parseString("{'a' : 'A'},[1, 2, 3],'zzz'\n-1.0,2.0,-3.0")

February 1, 2008

The Easy Way to Print Flash Cards (with Python and ImageMagick)

Here's a valued lessons that I learned the long and hard way: how to easily print flash cards. Don't waste any time doing it the hard way like I did. Follow my easy route and save yourself some time.

The other day, my wife had a very simple computer need: print and cut pictures into little uniform cards, like flash cards with pictures. It sounds like a simple problem, but there's a simple way and an easy way, and I'd like to share with you the easy way.

The simple way is what she did: use some software like Scribus (in her case) or Microsoft Word (if she were almost anyone else), copy the pictures in, manually align them, and print. That works great ... until you try and do 700 pictures (about 80 pages of 9 pictures per page). Scribus ate all of the computer's memory, the hard drive thrashed like mad, everything slowed down, and my poor wife spent hours putting it all together. I doubt Word would have fared much better.

But it still wasn't over. When she tried to print, the printer printed out one page with "error" on it. So, we exported to a PDF, and tried printing that: "error" again. So, we tried to print one page at a time: after about 5 minutes, it printed. All I could figure was that something bad was going on between the computer and the printer and involving sending these images uncompressed.

At this point, I decided that it was time to call it quits on manual work and use automated work: write a program to do this for me. I'm not about to manually tell 78 pages to individually print. So, with some quick bash command line work, I told it to print each page by itself. That worked for a few pages, but some were still too big and we got the dreaded "error".

Finally, I decdied to really role up my sleeves and write a program like we should have in the first place: in go pictures, out comes a PDF, and your print the PDF. Additionally, I wanted like to control the DPI of the images used to control the data flow to the printer. I did so, made the PDF, printed it and (after a while, still), it printed. Hurray! Even better, now if we wanted to change pictures or change sizes, we'll be able to do it with very little work.

So here's the program. Put some pictures in a directory, run the program, and print the pdf of "flash cards" or "image thumbnails" or "reading lessons" or whatever you want to call them. You could even make this into an Automator task in Mac OS X. It uses ImageMagick, though, so make sure you have that installed first.

from subprocess import call

MONTAGE = ["montage", "-bordercolor", "white"]
CONVERT = ["convert", "-bordercolor", "white"]

def montage_files_of_paths(paths, name,
                           columns =  3, rows    = 3,
                           width   = 10, height  = 8, xborder = .33, yborder = .33, density = 200,
                           frame   = 5,  spacing = 10,                                            
                           intermediate_ext = "png", final_ext = "pdf"):

    width, height, xborder, yborder = (int(val * density) for val in (width, height, xborder, yborder))
    thumbnail_width  = int((width  - (xborder * 2)) / columns)
    thumbnail_height = int((height - (yborder * 2)) / rows)  

    call(MONTAGE + ["-tile",     "%rx%r" % (rows, columns),
                    "-geometry", "%rx%r+%r+%r" % (thumbnail_width, thumbnail_height, spacing, spacing),
                    "-frame",    str(frame),
                    "-density",  str(density)]
                 + [path + "[0]" for path in paths] #[0] to ignore animated GIFS
                 + [name + "." + intermediate_ext])
    call(CONVERT + ["-border",   "%rx%r" % (xborder, yborder),
                    "-density",  str(density),
                    "-annotate", "0x0+%r+%r" % (xborder, yborder), name,
                    name + "*." + intermediate_ext,
                    name + "."  + final_ext])

import sys

if __name__ == "__main__":
    if len(sys.argv) < 2:
        print "usage: %s image_directory, image_directory, image_directory ..." % (sys.argv[0],)
        for dirname in sys.argv[1:]:
            montage_files_of_paths([dirname+"/*"], dirname)

Blog Archive

Google Analytics