How to build a parser by hand

Published on 2014-02-13
Tagged: parsers python web


When I first started learning about compiler development, I heard about hand-written parsers for complicated languages like C++. It sounded scary. Maybe they were ignoring parser generators like Yacc, Bison, and ANTLR so they could squeeze out every last drop of performance. It reminded me of the Story of Mel from hacker folklore.

It turns out that building a hand-written parser is actually not much harder than using a tool. You can easily build something simple, efficient, and flexible, but perhaps not that elegant. You'll probably end up with more code than you would with a parser generator (why else would those tools exist?), but a hand-written parser is easy to modify and debug, and you will be able to do things you can't do when using those tools. This is the biggest advantage of hand-writing a parser: it allows you to parse non-context-free grammars which are difficult to parse using other methods. It's also the biggest danger: non-context-free languages can get very complicated as they grow. Incidentally, this is why C++ compilers use hand-written parsers: the syntax of C++ is inextricably tangled with symbols and semantics, so you can't really separate the front-end into stages; you have to parse and compile all at once.

In this article, I'll show how I built a parser for a very simple template language I use to generate HTML for this blog.

The Grammar

Let's talk about the language we'll be parsing. I basically wanted to write a normal HTML document, but be able to inject dynamic content into it with directives like if, for, include, and call. Here's a snippet from the template that lists all posts.

{!for post in posts}
  <p>
    <a href="/posts/{post.id}/{!call post.uriName}">{post.title}</a><br>
    {!if post.description}
      <span class="description">{post.description}</span><br>
    {!endif}
    {!if post.tags}Tagged:
      {!for tag in post.tags}
        <a class="tag" href="/tags/{tag}">{tag}</a>
      {!endfor}
      <br>
    {!endif}
  </p>
{!endfor}

Each template directive begins with a tag which starts with {! and ends with }. Some directives have a corresponding closing tag, like {!endif}.

for-directives loop over a block of text, emitting it for each item in a collection. They have an optional else-block, which is emitted instead if the collection is empty. if-directives emit a block of text if a condition is true. They also have an optional else-block emitted when the condition is false. call-directives call a method on a value and emit the resulting string. include-directives (not shown in the example above) read and emit a template from a named file. There are also simple text substitutions which begin with { and end with }.

Here's the context-free grammar for our language. Terminal symbols (generated by the lexer) are upper-case. Non-terminal symbols (built by the parser out of other symbols) are lower-case.

template ::= TEXT
          |  INCLUDE
          |  CALL
          |  if-template
          |  for-template
          |  template template

if-template ::= IF template ENDIF
             |  IF template ELSE template ENDIF

for-template ::= FOR template ENDFOR
              |  FOR template ELSE template ENDFOR

Pretty simple, right? We don't make the parser handle text substitutions or the internal structure of each tag. These can easily be handled with regular expressions later. To avoid making this post too long, I won't get into the details of that.
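To give a flavor of that later pass: the text substitutions can be handled with a single regular expression applied to the plain-text pieces. This is only a minimal sketch; the flat context dict is an assumption for illustration (the real engine resolves dotted names like post.id on objects).

```python
import re

def substitute(text, context):
    # Replace {name} with the corresponding value from context.
    # Directive tags like {!if ...} are left alone because "!" is not
    # a legal first character of a substitution name.
    subRex = re.compile(r"\{([A-Za-z_][A-Za-z0-9_.]*)\}")
    return subRex.sub(lambda m: str(context[m.group(1)]), text)

# Hypothetical context dict for illustration.
print(substitute('<a class="tag" href="/tags/{tag}">{tag}</a>', {"tag": "python"}))
# -> <a class="tag" href="/tags/python">python</a>
```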

Can't we do all of this with regex?

No! The one whose name cannot be expressed in the basic multilingual plane would not approve.

Our template language is not a regular language. Think about the if directive and how you would find the corresponding closing endif using regular expressions. It's not necessarily the next one, since there may be nested if directives in the body. It's not necessarily the last one either. You need to count opening and closing tags, but since regular expressions are finite state machines, there is no way to count an arbitrary number of nestings.
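A quick experiment makes the problem concrete. A non-greedy regex pairs the first if with the first endif it sees, which is the wrong one as soon as directives nest (a greedy match fails differently, overshooting to the last endif):

```python
import re

nested = "{!if a}{!if b}inner{!endif}outer{!endif}"
naive = re.search(r"\{!if.*?\{!endif\}", nested)
# The match stops at the *inner* endif, cutting the outer if in half.
print(naive.group(0))  # -> {!if a}{!if b}inner{!endif}
```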

We need something more sophisticated. We will use regular expressions to build our lexer, and our parser will be built on top of that.

The Lexer

The lexer for this language is pretty straightforward. I've covered lexers in detail before, so I'm not going to go into a great amount of detail here.

Basically, our input will be a string containing the whole template to be parsed. Our output will be a list of tokens. Each token contains a string and a tag, which says which terminal symbol it is.

import re
from collections import namedtuple

Token = namedtuple("Token", ["tag", "text"])

def lexTemplate(text):
  tagRex = re.compile(r"\{!([a-z]+)[^}]*\}")
  tokens = []
  pos = 0
  while pos < len(text):
    m = tagRex.search(text, pos)
    if m is None:
      # No more directives found.
      tokens.append(Token("text", text[pos:]))
      break
    else:
      # There might be text between here and the next directive.
      if m.start() > pos:
        tokens.append(Token("text", text[pos:m.start()]))
      tokens.append(Token(m.group(1), m.group(0)))
      pos = m.end()
  return tokens
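Here is what the lexer produces for a small input. The definitions are repeated so the snippet runs on its own; note that substitution braces like {post.id} contain no "!" and therefore stay inside text tokens.

```python
import re
from collections import namedtuple

Token = namedtuple("Token", ["tag", "text"])

def lexTemplate(text):
    tagRex = re.compile(r"\{!([a-z]+)[^}]*\}")
    tokens = []
    pos = 0
    while pos < len(text):
        m = tagRex.search(text, pos)
        if m is None:
            # No more directives found.
            tokens.append(Token("text", text[pos:]))
            break
        # There might be text between here and the next directive.
        if m.start() > pos:
            tokens.append(Token("text", text[pos:m.start()]))
        tokens.append(Token(m.group(1), m.group(0)))
        pos = m.end()
    return tokens

tokens = lexTemplate("Hi {!if x}yes{!endif}")
print([t.tag for t in tokens])  # -> ['text', 'if', 'text', 'endif']
```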

Infrastructure

To start out, we'll create a simple class, ParseState, which keeps track of where we are in the token list.

class ParseState(object):
  def __init__(self, tokens, pos=0):
    self.tokens = tokens
    self.pos = pos

  ...

We'll also define some utility methods for this class. Objects of this class will be immutable, so when a token is successfully consumed, we'll return a ParseResult containing a new ParseState and a value. The value may simply be a token, or it may be an AST fragment built by the parser. On failure, we'll return None.

from collections import namedtuple

ParseResult = namedtuple("ParseResult", ["state", "value"])

class ParseState(object):
  ...
  def atEnd(self):
    return self.pos == len(self.tokens)

  def take(self):
    if self.atEnd():
      return None
    next = ParseState(self.tokens, self.pos + 1)
    token = self.tokens[self.pos]
    return ParseResult(next, token)

  def expect(self, expectedTag):
    result = self.take()
    if result is None or result.value.tag != expectedTag:
      return None
    return result

  def peek(self):
    return self.tokens[self.pos].tag if not self.atEnd() else None
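Immutability is what makes backtracking cheap: a failed attempt returns None and we simply keep using the state we already had. A quick demonstration, with the class repeated so the snippet runs standalone:

```python
from collections import namedtuple

Token = namedtuple("Token", ["tag", "text"])
ParseResult = namedtuple("ParseResult", ["state", "value"])

class ParseState(object):
    def __init__(self, tokens, pos=0):
        self.tokens = tokens
        self.pos = pos

    def atEnd(self):
        return self.pos == len(self.tokens)

    def take(self):
        if self.atEnd():
            return None
        next = ParseState(self.tokens, self.pos + 1)
        token = self.tokens[self.pos]
        return ParseResult(next, token)

    def expect(self, expectedTag):
        result = self.take()
        if result is None or result.value.tag != expectedTag:
            return None
        return result

    def peek(self):
        return self.tokens[self.pos].tag if not self.atEnd() else None

tokens = [Token("if", "{!if x}"), Token("text", "yes"), Token("endif", "{!endif}")]
state = ParseState(tokens)
res = state.expect("if")             # success: a new state one token further on
assert state.pos == 0 and res.state.pos == 1
assert state.expect("endif") is None # failure: the old state is untouched
assert res.state.peek() == "text"    # peek looks ahead without consuming
```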

How to build the parser

Our next step is to translate the rules of the grammar into code which parses it. We will define a function for each rule. These functions can be written somewhat mechanically: for each symbol in a production of the rule, call expect for a terminal symbol or the corresponding parsing function for a non-terminal. If there are multiple productions for the same rule, try the first one; if it fails, try the next one using the original parse state, and so on.

To start out, here is the function which parses the call directive. The function for include is the same, so I won't show it here. Both of these consist of a single token, so they are very easy to parse: we just call expect and build an AST node from the token. As before, we return a ParseResult on success, and None on failure.

def parseCall(state):
  res = state.expect("call")
  if res is None:
    return None
  value = buildCall(res.value.text)
  return ParseResult(res.state, value)

Now for something more complicated. Here is the function which parses the if directive. If we encounter an unrecoverable failure in these cases, we'll raise an exception. The function for the for directive is almost exactly the same, so again, I won't show it here.

class TemplateParseError(Exception):
  pass

def parseIf(state):
  res = state.expect("if")
  if res is None:
    return None
  ifText = res.value.text
  res = parseTemplate(res.state)
  if res is None:
    raise TemplateParseError("expected body for if-directive")
  trueBody = res.value
  if res.state.peek() == "else":
    res = res.state.take()
    res = parseTemplate(res.state)
    if res is None:
      raise TemplateParseError("expected else-body for if-directive")
    falseBody = res.value
  else:
    falseBody = None
  res = res.state.expect("endif")
  if res is None:
    raise TemplateParseError("expected endif")
  value = buildIf(ifText, trueBody, falseBody)
  return ParseResult(res.state, value)

Before we tie all this together with the top-level template parsing function, we have to watch out for one particular production:

template ::= ...
          |  template template

This one says we can have a sequence of templates of any length. The problem with this rule is that it's left recursive: if we handled it naively, parseTemplate would call itself immediately and we would end up with infinite recursion. We can handle this formally by transforming the grammar into a non-left-recursive form. Or we can just handle it informally by moving the "simple" non-left-recursive productions into their own function and calling that in a loop in the top-level function.
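For the record, the formal transformation produces exactly the shape the informal approach implements. In the grammar notation used above, it would look something like this:

```
template ::= simple-template
          |  simple-template template

simple-template ::= TEXT
                 |  INCLUDE
                 |  CALL
                 |  if-template
                 |  for-template
```

The recursion is now on the right, so each recursive step consumes at least one token first, and the loop below is just this rule unrolled iteratively.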

def parseTemplate(state):
  value = None
  done = False
  while not done:
    res = parseSimpleTemplate(state)
    if res is None:
      done = True
    else:
      value = buildSequence(value, res.value)
      state = res.state
  if value is None:
    return None
  else:
    return ParseResult(state, value)
        

def parseSimpleTemplate(state):
  for parse in (parseText, parseInclude, parseCall, parseIf, parseFor):
    res = parse(state)
    if res is not None:
      return res
  return None

One last detail: when we parse a template file, we don't want any junk at the end of the file that can't be parsed. So when we invoke parseTemplate, we need to check that all tokens were consumed.

def parseTemplateFile(fileName):
  with open(fileName) as templateFile:
    templateText = templateFile.read()
  tokens = lexTemplate(templateText)
  state = ParseState(tokens)
  res = parseTemplate(state)
  if res is None:
    raise TemplateParseError("could not parse template")
  if not res.state.atEnd():
    raise TemplateParseError("garbage at end of template")
  return res.value

Conclusion

There are some drawbacks to writing a parser by hand. First and foremost, it is very easy to deviate from a context-free grammar. Because there isn't really any structure to follow, you are free to invent all kinds of crazy rules, and your language could end up as an unparseable mess as it grows *cough*C++*cough*. Second, hand-written parsers tend to be fairly large in terms of code size. This leaves a lot of space for bugs to hide. It can also be a performance problem. In V8, I have seen high instruction cache miss rates on some benchmarks because the parser is too large. I suspect automatically generated table-driven parsers would do better in this regard.

All that said, it's surprisingly easy to hand-write a parser for a small language in a short amount of time. Writing one this way may be a good idea when your language is simple and you don't want to add a parser generator tool or library as a dependency. It may be necessary when your target language is not context-free to begin with. In any case, it's good to know how to do it!