A simple interpreter from scratch in Python (part 3)
This article is part of the series "A simple interpreter from scratch in Python".
So far, we've written a lexer and a parser combinator library for our interpreter. In this part, we'll create the abstract syntax tree (AST) data structures, and we'll write a parser using our combinator library which will convert the list of tokens returned by the lexer into an AST. Once we have parsed the AST, executing the program will be very easy.
Defining the AST
Before we can actually start writing our parsers, we need to define the data structures that will be the output of our parser. We will do this with classes. Every syntactic element of IMP will have a corresponding class. Objects of these classes will represent nodes in the AST.
There are three kinds of structures in IMP: arithmetic expressions (used to compute numbers), Boolean expressions (used to compute conditions for if- and while-statements), and statements. We will start with arithmetic expressions, since the other two elements depend on them.
An arithmetic expression can take one of three forms:
- Literal integer constants, such as
- Variables, such as
- Binary operations, such as
x + 42. These are made out of other arithmetic expressions.
We can also group expressions together with parenthesis (like
(x + 2) * 3. This isn't a different kind of expression so much as a different way to parse the expression.
We will define three classes for these forms, plus a base class for arithmetic expressions in general. For now, the classes won't do much except contain data. We will include a
__repr__ method, so we can print out the AST for debugging. All AST classes will subclass
Equality so we can check if two AST objects are the same. This helps with testing.
from equality import * class Aexp(Equality): pass class IntAexp(Aexp): def __init__(self, i): self.i = i def __repr__(self): return 'IntAexp(%d)' % self.i class VarAexp(Aexp): def __init__(self, name): self.name = name def __repr__(self): return 'VarAexp(%s)' % self.name class BinopAexp(Aexp): def __init__(self, op, left, right): self.op = op self.left = left self.right = right def __repr__(self): return 'BinopAexp(%s, %s, %s)' % (self.op, self.left, self.right)
Boolean expressions are a little more complicated. There are four kinds of Boolean expressions.
- Relational expressions (such as
x < 10)
- And expressions (such as
x < 10 and y > 20)
- Or expressions
- Not expressions
The left and right sides of a relational expression are arithmetic expressions. The left and right sides of an "and", "or", or "not" expression are Boolean expressions. Restricting the type like this helps us avoid nonsensical expressions like "x < 10 and 30".
class Bexp(Equality): pass class RelopBexp(Bexp): def __init__(self, op, left, right): ... class AndBexp(Bexp): def __init__(self, left, right): ... class OrBexp(Bexp): def __init__(self, left, right): ... class NotBexp(Bexp): def __init__(self, exp): ...
Statements can contain both arithmetic and Boolean expressions. There are four kinds of statements: assignment, compound, conditional, and loop.
class Statement(Equality): pass class AssignStatement(Statement): def __init__(self, name, aexp): ... class CompoundStatement(Statement): def __init__(self, first, second): ... class IfStatement(Statement): def __init__(self, condition, true_stmt, false_stmt): ... class WhileStatement(Statement): def __init__(self, condition, body): ...
Now that we have our AST classes and a convenient set of parser combinators, it's time to write our parser. When writing a parser, it's usually easiest to start with the most basic structures of the language and work our way up.
The first parser we'll look at is the
keyword parser. This is just a specialized version of the
Reserved combinator using the
RESERVED tag that all keyword tokens are tagged with. Remember,
Reserved matches a single token where both the text and tag are the same as the ones given.
def keyword(kw): return Reserved(kw, RESERVED)
keyword is actually a combinator because it is a function that returns a parser. We will use it directly in other parsers though.
id parser is used to match variable names. It uses the
Tag combinator, which matches a token with the specified tag.
id = Tag(ID)
num parser is used to match integers. It works just like
id, except we use the
Process combinator (actually the
^ operator, which calls
Process) to convert the token into an actual integer value.
num = Tag(INT) ^ (lambda i: int(i))
Parsing arithmetic expressions
Parsing arithmetic expressions is not particularly simple, but since we need to parse arithmetic expressions in order to parse Boolean expressions or statements, we will start there.
We'll first define the
aexp_value parser, which will convert the values returned by
id into actual expressions.
def aexp_value(): return (num ^ (lambda i: IntAexp(i))) | \ (id ^ (lambda v: VarAexp(v)))
We're using the
| operator here, which is a shorthand for the
Alternate combinator. So this will attempt to parse an integer expression first. If that fails, it will try to parse a variable expression.
You'll notice that we defined
aexp_value as a zero-argument function instead of a global value, like we did with
num. We'll do the same thing with all the other parsers. The reason is that we don't want the code for each parser to be evaluated right away. If we defined every parser as a global, each parser would not be able to reference parsers that come after it in the same source file, since they would not be defined yet. This would make recursive parsers impossible to define, and the source code would be awkward to read.
Next, we'd like to support grouping with parenthesis in our arithmetic expressions. Although grouped expressions don't require their own AST class, they do require another parser to handle them.
def process_group(parsed): ((_, p), _) = parsed return p def aexp_group(): return keyword('(') + Lazy(aexp) + keyword(')') ^ process_group
process_group is a function we use with the
Process combinator (
^ operator). It just discards the parenthesis tokens and returns the expression in between.
aexp_group is the actual parser. Remember, the
+ operator is shorthand for the
Concat combinator. So this will parse
'(', followed by an arithmetic expression (parsed by
aexp, which we will define soon), followed by
')'. We need to avoid calling
aexp directly since
aexp will call
aexp_group, which would result in infinite recursion. To do this, we use the
Lazy combinator, which defers the call to
aexp until the parser is actually applied to some input.
Next, we combine
aexp_term expression is any basic, self-contained expression where we don't have to worry about operator precedence with respect to other expressions.
def aexp_term(): return aexp_value() | aexp_group()
Now we come to the tricky part: operators and precedence. It would be easy for us just to define another kind of
aexp parser and throw it together with
aexp_term. This would result in a simple expression like:
1 + 2 * 3
being parsed incorrectly as:
(1 + 2) * 3
The parser needs to be aware of operator precedence, and it needs to group together operations with higher precedence.
We will define a few helper functions in order to make this work.
def process_binop(op): return lambda l, r: BinopAexp(op, l, r)
process_binop is what actually constructs
BinopAexp objects. It takes any arithmetic operator and returns a function that combines a pair of expressions using that operator.
process_binop will be used with the
Exp combinator (
Exp parses a list of expressions with a separator between each pair of expressions. The left operand of
Exp is a parser that will match individual elements of the list (arithmetic expressions in our case). The right operand is a parser that will match the separators (operators). No matter what separator is matched, the right parser will return a function which, given the matched separator, returns a combining function. The combining function takes the parsed expressions to the left and right of the separator and returns a single, combined expression. Confused yet? We'll go through the usage of
process_binop is actually what will be returned by the right parser.
Next, we'll define our precedence levels and a combinator to help us deal with them.
def any_operator_in_list(ops): op_parsers = [keyword(op) for op in ops] parser = reduce(lambda l, r: l | r, op_parsers) return parser aexp_precedence_levels = [ ['*', '/'], ['+', '-'], ]
any_operator_in_list takes a list of keyword strings and returns a parser that will match any of them. We will call this on
aexp_precedence_levels, which contains a list of operators for each precedence level (highest precedence first).
def precedence(value_parser, precedence_levels, combine): def op_parser(precedence_level): return any_operator_in_list(precedence_level) ^ combine parser = value_parser * op_parser(precedence_levels) for precedence_level in precedence_levels[1:]: parser = parser * op_parser(precedence_level) return parser
precedence is the real meat of the operation. Its first argument,
value_parser is a parser than can read the basic parts of an expression: numbers, variables, and groups. That will be
precedence_levels is a list of lists of operators, one list for each level. We'll use
aexp_precedence_levels for this.
combine will take a function which, given an operator, returns a function to build a larger expression out of two smaller expressions. That's
precedence, we first define
op_parser which, for a given precedence level, reads any operator in that level and returns a function which combines two expressions.
op_parser can be used as the right-hand argument of
Exp. We start by calling
op_parser for the highest precedence level, since these operations need to be grouped together first. We then use the resulting parser as the element parser (
Exp's left argument) at the next level. After the loop finishes, the resulting parser will be able to correctly parse any arithmetic expression.
How does this work in practice? Let's work it through.
E0 = value_parser E1 = E0 * op_parser(precedence_levels) E2 = E1 * op_parser(precedence_levels)
E0 is the same as
value_parser. It can parse numbers, variables, and groups, but no operators. E1 can parse expressions containing anything E0 can match, separated by operators in the first precedence level. So
E1 could match
a * b / c, but it would raise an error as soon as it encountered a
E2 can match expressions
E2 can match, separated by operators in the next precedence level. Since we only have two precedence levels,
E2 can match any arithmetic expression we support.
Let's do an example. We'll take a complicated arithmetic expression, and gradually replace each part by what matches it.
4 * a + b / 2 - (6 + c) E0(4) * E0(a) + E0(b) / E0(2) - E0(6+c) E1(4*a) + E1(b/2) - E1(6+c) E2((4*a)+(b/2)-(6+c))
precedence to directly define
def aexp(): return precedence(aexp_term(), aexp_precedence_levels, process_binop)
We probably could have defined precedence in a less abstract way, but its strength is that it applies to any situation where operator precedence is a problem. We will use it again to parse Boolean expressions.
Parsing Boolean expressions
With arithmetic expressions out of the way, we can move on to Boolean expressions. Boolean expressions are actually simpler than arithmetic expressions, and we won't need any new tools to parse them. We'll start with the most basic Boolean expression, the relation:
def process_relop(parsed): ((left, op), right) = parsed return RelopBexp(op, left, right) def bexp_relop(): relops = ['<', '<=', '>', '>=', '=', '!='] return aexp() + any_operator_in_list(relops) + aexp() ^ process_relop
process_relop is just a function that we use with the
Process combinator. It just takes three concatenated values and creates a
RelopBexp out of them. In
bexp_relop, we parse two arithmetic expressions (
aexp), separated by a relational operator. We use our old friend
any_operator_in_list so we don't have to write a case for every operator. There is no need to use combinators like
precedence since relational expressions can't be chained together in IMP like they can in other languages.
Next, we define the
not is a unary operation with high precedence. This makes it much easier to parse than
def bexp_not(): return keyword('not') + Lazy(bexp_term) ^ (lambda parsed: NotBexp(parsed))
Here, we just concatenate the keyword
not with a Boolean expression term (which we will define next). Since
bexp_not will be used to define
bexp_term, we need to use the
Lazy combinator to avoid infinite recursion.
def bexp_group(): return keyword('(') + Lazy(bexp) + keyword(')') ^ process_group def bexp_term(): return bexp_not() | \ bexp_relop() | \ bexp_group()
bexp_term pretty much the same way we did for the arithmetic equivalents. Nothing new here.
Next, we need to define expressions which include the
or operators. These operators actually work the same way arithmetic operators do; both are evaluated left to right, and
and has higher precedence.
bexp_precedence_levels = [ ['and'], ['or'], ] def process_logic(op): if op == 'and': return lambda l, r: AndBexp(l, r) elif op == 'or': return lambda l, r: OrBexp(l, r) else: raise RuntimeError('unknown logic operator: ' + op) def bexp(): return precedence(bexp_term(), bexp_precedence_levels, process_logic)
process_logic is intended to be used with the
Exp combinator. It takes an operator and returns a function which combines two sub-expressions into an expression using that operator. We pass this along to
precedence, just like we did with
aexp. Writing generic code pays off here, since we don't have to repeat the complicated expression processing code.
bexp finished, we can start parsing IMP statements. We'll start with the humble assignment statement.
def assign_stmt(): def process(parsed): ((name, _), exp) = parsed return AssignStatement(name, exp) return id + keyword(':=') + aexp() ^ process
Nothing particularly interesting for this one. Next, we'll look at
stmt_list. This will parse a series of statements separated by semicolons.
def stmt_list(): separator = keyword(';') ^ (lambda x: lambda l, r: CompoundStatement(l, r)) return Exp(stmt(), separator)
Remember, we need to use the
Exp combinator here instead of doing something simple like
stmt() + keyword(';') + stmt() to avoid left recursion.
Next up is the
def if_stmt(): def process(parsed): (((((_, condition), _), true_stmt), false_parsed), _) = parsed if false_parsed: (_, false_stmt) = false_parsed else: false_stmt = None return IfStatement(condition, true_stmt, false_stmt) return keyword('if') + bexp() + \ keyword('then') + Lazy(stmt_list) + \ Opt(keyword('else') + Lazy(stmt_list)) + \ keyword('end') ^ process
The only complexity here is the fact that the
else-clause is optional. This makes the
process function a little more complicated.
Finally, we have the
def while_stmt(): def process(parsed): ((((_, condition), _), body), _) = parsed return WhileStatement(condition, body) return keyword('while') + bexp() + \ keyword('do') + Lazy(stmt_list) + \ keyword('end') ^ process
We'll wrap it all up with
def stmt(): return assign_stmt() | \ if_stmt() | \ while_stmt()
You'll notice that both the
while statements use
stmt_list for their bodies rather than
stmt_list is actually our top level definition. We can't have
stmt depend directly on
stmt_list, since this would result in a left recursive parser. And since we want
while statements to be able to have compound statements for their bodies, we use
Putting it all together
We now have a parser for every part of the language. We just need to make a couple top level definitions:
def parser(): return Phrase(stmt_list())
parser will parse an entire program. A program is just a list of statements, but the
Phrase combinator ensures that we use every token in the file, rather than ending prematurely if there is are garbage tokens at the end.
def imp_parse(tokens): ast = parser()(tokens, 0) return ast
imp_parse is the function that clients will call to parse a program. It takes the full list of tokens, calls
parser, starting with the first token, and returns the resulting AST.
To test out our parsers (in addition to unit tests), here's a simple driver program:
import sys from imp_parser import * if __name__ == '__main__': if len(sys.argv) != 3: sys.stderr.write('usage: %s filename parsername' % sys.argv) sys.exit(1) filename = sys.argv file = open(filename) characters = file.read() file.close() tokens = imp_lex(characters) parser = globals()[sys.argv]() result = parser(tokens, 0) print result
This program will read a file (first argument) and parse it with one of the parsers in imp_parse.py (second argument). For example:
$ echo '1 + 2 * 3' >foo.imp $ python imp_parser_driver.py foo.imp aexp Result(BinopAexp(+, IntAexp(1), BinopAexp(*, IntAexp(2), IntAexp(3))), 5)
This should provide a good sandbox for experimentation.
In this article, we built a parser combinator library from scratch and used it to write a parser for IMP. In the next and final article in the series, we'll write an evaluator for our parsed AST.
Once again, all the source code for the interpreter is available for download.