A simple interpreter from scratch in Python (part 2)

Published on 2011-03-29
Edited on 2014-04-12
Tagged: compilers imp parsers python

In the previous article in the series, we covered the IMP language and the general structure of the interpreter we are building for it. We also covered the lexer in depth. In this article, we will write a small parser combinator library. This library will be used to create the IMP parser, which extracts an abstract syntax tree (AST) from the list of tokens generated by the lexer. The combinator library is language agnostic, so you could use it to write a parser for any language.

What are parser combinators?

There are many, many ways to build a parser. Combinators are probably the easiest and fastest way to get a parser up and running.

You can think of a parser as a function. It accepts a stream of tokens as input. If successful, the parser will consume some tokens from the stream. It will return part of the final AST, along with the remaining tokens. A combinator is a function which produces a parser as its output, usually after taking one or more parsers as input, hence the name "combinator". You can use combinators to create a complete parser for a language like IMP by creating lots of smaller parsers for parts of the language, then using combinators to build the final parser.

A minimal combinator library

Parser combinators are fairly generic, and can be used with any language. As we did with the lexer, we'll start by writing a language agnostic library of combinators, then use that to write our IMP parser.

First, let's define a Result class. Every parser will return a Result object on success or None on failure. A Result comprises a value (part of the AST), and a position (the index of the next token in the stream).

class Result:
    def __init__(self, value, pos):
        self.value = value
        self.pos = pos

    def __repr__(self):
        return 'Result(%s, %d)' % (self.value, self.pos)

Next, we'll define a base class, Parser. Earlier, I said parsers are functions which take a stream of tokens as input. We will actually be defining parsers as objects with a __call__ method. This means that a parser object will behave as if it were a function, but we can also provide additional functionality by defining some operators.

class Parser:
    def __call__(self, tokens, pos):
        return None  # subclasses will override this

    def __add__(self, other):
        return Concat(self, other)

    def __mul__(self, other):
        return Exp(self, other)

    def __or__(self, other):
        return Alternate(self, other)

    def __xor__(self, function):
        return Process(self, function)

The method that actually does the parsing is __call__. The input is the full list of tokens (returned by the lexer) and an index into the list indicating the next token. The default implementation always returns None (failure). Subclasses of Parser will provide their own __call__ implementation.

The other methods, __add__, __mul__, __or__, and __xor__ define the +, *, |, and ^ operators, respectively. Each operator provides a shortcut for calling a different combinator. We'll cover each of these combinators shortly.

Next, we'll look at the simplest combinator, Reserved. Reserved will be used to parse reserved words and operators; it will accept tokens with a specific value and tag. Remember, tokens are just value-tag pairs. token[0] is the value, token[1] is the tag.

class Reserved(Parser):
    def __init__(self, value, tag):
        self.value = value
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and \
           tokens[pos][0] == self.value and \
           tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None

At this point, you might stop and say, "I thought combinators were going to be functions returning parsers. This subclass doesn't look like a function." It is like a function though, if you think of a constructor as a function which returns an object (which in this case also happens to be callable). Subclassing is an easy way to define a new combinator since all we need to do is provide a constructor and a __call__ method, and we still retain other functionality (like the overloaded operators).

Moving on, the Tag combinator is very similar to Reserved. It matches any token which has a particular tag. The value can be anything.

class Tag(Parser):
    def __init__(self, tag):
        self.tag = tag

    def __call__(self, tokens, pos):
        if pos < len(tokens) and tokens[pos][1] is self.tag:
            return Result(tokens[pos][0], pos + 1)
        else:
            return None

The Tag and Reserved combinators are our primitives. All combinators will be built out of them at the most basic level.

The Concat combinator will take two parsers as input (left and right). When applied, a Concat parser will apply the left parser, followed by the right parser. If both are successful, the result value will be a pair containing the left and right results. If either parser is unsuccessful, None is returned.

class Concat(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            right_result = self.right(tokens, left_result.pos)
            if right_result:
                combined_value = (left_result.value, right_result.value)
                return Result(combined_value, right_result.pos)
        return None

Concat is useful for parsing specific sequences of tokens. For instance, to parse 1 + 2, you could write:

parser = Concat(Concat(Tag(INT), Reserved('+', RESERVED)), Tag(INT))

or more concisely using the + operator shorthand:

parser = Tag(INT) + Reserved('+', RESERVED) + Tag(INT)

The Alternate combinator is similar. It also takes left and right parsers. It starts by applying the left parser. If successful, that result is returned. If unsuccessful, it applies the right parser and returns its result.

class Alternate(Parser):
    def __init__(self, left, right):
        self.left = left
        self.right = right

    def __call__(self, tokens, pos):
        left_result = self.left(tokens, pos)
        if left_result:
            return left_result
        else:
            right_result = self.right(tokens, pos)
            return right_result

Alternate is useful for choosing among several possible parsers. For instance, if we wanted to parse any binary operator:

parser = Reserved('+', RESERVED) |
         Reserved('-', RESERVED) |
         Reserved('*', RESERVED) |
         Reserved('/', RESERVED)

Opt is useful for optional text, such as the else-clause of an if-statement. It takes one parser as input. If that parser is successful when applied, the result is returned normally. If it fails, a successful result is still returned, but the value of that result is None. No tokens are consumed in the failure case; the result position is the same as the input position.

class Opt(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            return result
        else:
            return Result(None, pos)

Rep applies its input parser repeatedly until it fails. This is useful for generating lists of things. Note that Rep will successfully match an empty list and consume no tokens if its parser fails the first time it's applied.

class Rep(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        results = []
        result = self.parser(tokens, pos)
        while result:
            results.append(result.value)
            pos = result.pos
            result = self.parser(tokens, pos)
        return Result(results, pos)

Process is a useful combinator which allows us to manipulate result values. Its input is a parser and a function. When the parser is applied successfully, the result value is passed to the function, and the return value from the function is returned instead of the original value. We will use Process to actually build the AST nodes out of the pairs and lists that Concat and Rep return.

class Process(Parser):
    def __init__(self, parser, function):
        self.parser = parser
        self.function = function

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result:
            result.value = self.function(result.value)
            return result

As an example, consider the parser we built with Concat. When it parses 1 + 2, the result value we actually get back is (('1', '+'), '2'), which is not very useful. With Process we can change that result. For example, the following parser would sum the parsed expression.

def process_func(parsed):
    ((l, _), r) = parsed
    return int(l) + int(r)

better_parser = parser ^ process_func

Lazy is a less obviously useful combinator. Instead of taking an input parser, it takes a zero-argument function which returns a parser. Lazy will not call the function to get the parser until it's applied. This is needed to build recursive parsers (like how arithmetic expressions can include arithmetic expressions). Since such a parser refers to itself, we can't just define it by referencing it directly; at the time the parser's defining expression is evaluated, the parser is not defined yet. We would not need this in a language with lazy evaluation like Haskell or Scala, but Python is anything but lazy.

class Lazy(Parser):
    def __init__(self, parser_func):
        self.parser = None
        self.parser_func = parser_func

    def __call__(self, tokens, pos):
        if not self.parser:
            self.parser = self.parser_func()
        return self.parser(tokens, pos)

The next combinator, Phrase, takes a single input parser, applies it, and returns its result normally. The only catch is that it will fail if its input parser did not consume all of the remaining tokens. The top level parser for IMP will be a Phrase parser. This prevents us from partially matching a program which has garbage at the end.

class Phrase(Parser):
    def __init__(self, parser):
        self.parser = parser

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)
        if result and result.pos == len(tokens):
            return result
        else:
            return None

The last combinator is unfortunately the most complicated. Exp is fairly specialized; it's used to match an expression which consists of a list of elements separated by something. Here's an example with compound statements:

a := 10;
b := 20;
c := 30

In this case, we have a list of statements separated by semicolons. You might think that we don't need Exp since we can do the same thing with the other combinators. We could just write a parser for the compound statement like this:

def compound_stmt():
    return stmt() + Reserved(';', RESERVED) + stmt()

Think about how stmt might be defined though.

def stmt():
    return Lazy(compound_stmt) | assign_stmt()

So stmt starts by calling compound_stmt, which starts by calling stmt. These parsers will call each other over and over without getting anything done until we get a stack overflow. This problem is not limited to compound statements; arithmetic and Boolean expressions have the same problem (think about operators like + or and as the separators). This problem is called left recursion, and parser combinators are bad at it.

Fortunately, Exp provides a way to work around left recursion, just by matching a list, similar to the way Rep does. Exp takes two parsers as input. The first parser matches the actual elements of the list. The second matches the separators. On success, the separator parser must return a function which combines elements parsed on the left and right into a single value. This value is accumulated for the whole list, left to right, and is ultimately returned.

Let's look at the code:

class Exp(Parser):
    def __init__(self, parser, separator):
        self.parser = parser
        self.separator = separator

    def __call__(self, tokens, pos):
        result = self.parser(tokens, pos)

        def process_next(parsed):
            (sepfunc, right) = parsed
            return sepfunc(result.value, right)
        next_parser = self.separator + self.parser ^ process_next

        next_result = result
        while next_result:
            next_result = next_parser(tokens, result.pos)
            if next_result:
                result = next_result
        return result            

result will always contain everything that's been parsed so far. process_next is a function that can be used with Process. next_parser will apply separator followed by parser to get the next element of the list. process_next will create a new result by applying the separator function to the current result and the newly parsed element. next_parser is applied in a loop until it can't match any more elements.

Let's see how Exp can be used to solve our compound_stmt problem.

def assign_stmt():
    ...

def compound_sep():
    def process_sep(parsed):
        return lambda l, r: CompoundStmt(l, r)
    return Reserved(';', RESERVED) ^ process_sep

def compound_stmt():
    return Exp(assign_stmt(), compound_sep())

We could also write this as:

def compound_stmt():
    return assign_stmt() * compound_sep()

We'll go over this in much more detail when we cover parsing of arithmetic expressions in the next article.

To be continued...

In This article, we built a minimal parser combinator library. This library can be used to write a parser for pretty much any computer language.

In the next article, we'll define the data structures that make up the abstract syntax tree for IMP, and we'll define a parser using this library.

If you're curious to try the IMP interpreter right now, or if you want to see the full source code, you can download it now.