Preserving comments when parsing and formatting code

How Gazelle keeps comments when modifying BUILD files

Published on 2023-11-02
Tagged: bazel compilers go parsers

View All Posts

I've been tinkering with a new toy language this week, and I wanted to replicate a feature I like from some tools in the Bazel ecosystem. Gazelle, buildozer, and buildifier have the ability to preserve comments when formatting a BUILD file, even after significant programmatic modifications. Gazelle can also easily read comments when walking the syntax tree, which is important for processing # keep comments and directives.

In this article, I'll compare Go's parser and formatter with the library used by the Bazel tools. Both libraries can preserve comments, but the Bazel library does it better.

How go/ast preserves comments

Of course any library that can format parsed code can preserve comments. No one would use a formatter that didn't. There are different ways to implement it though. Let's start by looking at the Go standard library package representing the Go syntax tree.

Here's one of the abstract syntax tree (AST) node types, AssignStmt. I chose this type arbitrarily; most of them look similar.

type AssignStmt struct {
	Lhs    []Expr
	TokPos token.Pos   // position of Tok
	Tok    token.Token // assignment token, DEFINE
	Rhs    []Expr
}

This type has no place to record comments. Where are they? Here's the top-level type, File:

type File struct {
	Doc     *CommentGroup // associated documentation; or nil
	Package token.Pos     // position of "package" keyword
	Name    *Ident        // package name
	Decls   []Decl        // top-level declarations; or nil

	FileStart, FileEnd token.Pos       // start and end of entire file
	Scope              *Scope          // package scope (this file only)
	Imports            []*ImportSpec   // imports in this file
	Unresolved         []*Ident        // unresolved identifiers in this file
	Comments           []*CommentGroup // list of all comments in the source file
	GoVersion          string          // minimum Go version required by //go:build or // +build directives
}

This has a Comments field, which is a []*CommentGroup. What is that?

type CommentGroup struct {
	List []*Comment // len(List) > 0
}

type Comment struct {
	Slash token.Pos // position of "/" starting the comment
	Text  string    // comment text (excluding '\n' for //-style comments)
}

All the comments are stored in a list attached to the root node of the AST. Each Comment records its original position in the file. This approach causes two significant problems for refactoring tools.

First, when walking the AST, it's difficult to find the comments related to any particular node in the tree. Some node types like FuncDecl do have comments, which is helpful for documentation generation, but most node types lack comments.

Second, when the formatter converts the AST into text, it preserves comments by iterating over the AST and the comment list together. Both AST nodes and comments have positions, so before the formatter writes an AST node, it writes comments with earlier positions. This makes difficult to build a refactoring tool that modifies the AST, adding something in the middle of a file. It throws off all the written offsets. It's also hard to programmatically add comments for the same reason.

How the Bazel tools preserve comments

The Bazel tools I mentioned all use the same library to parse and format Starlark files: github.com/bazelbuild/buildtools/build. This package takes a very different approach. Let's look again at one of the AST node types, AssignExpr:

type AssignExpr struct {
	Comments
	LHS       Expr
	OpPos     Position
	Op        string
	LineBreak bool // insert line break between Op and RHS
	RHS       Expr
}

type Comments struct {
	Before []Comment // whole-line comments before this expression
	Suffix []Comment // end-of-line comments after this expression

	// For top-level expressions only, After lists whole-line
	// comments following the expression.
	After []Comment
}

type Comment struct {
	Start Position
	Token string // without trailing newline
}

Every AST node type embeds the Comments struct type to store comments. This allows programs like Gazelle to walk the AST and read comments in the context they were written instead of searching through a separate list. This is particularly important for Gazelle because it needs to recognize # keep comments that tell it not to modify specific expressions. Each comment has a position, but that's not particularly important when formatting a file: the formatter can print comments around associated AST nodes easily. This is wonderful if you're adding comments or editing the AST, which is Gazelle's main purpose.

So how does the parser actually read the comments and attach them to the right AST nodes? The important details are in input.assignComments.

Like most other parsers, the parser in build is written in two phases: lexical analysis (splitting input text into tokens), and syntax analysis (building an AST from those tokens). The lexer identifies comments and saves them in two lists: line comments that appear by themselves on a line, and suffix comments that appear after other tokens. Comments are usually not included in the token stream, but the lexer does emit _COMMENT tokens for top-level comments at the beginning of the file. These are handled by the parser as if they were part of the grammar (even though they are technically not).

In most cases though, comments are not part of token stream, so the parser can build the AST without considering them. Comments are attached to the correct AST nodes only after the whole tree is built. This is the clever part. The parser builds two lists of all AST nodes by walking the tree: one from a pre-order traversal (earlier, outer-most nodes come first), the other from a post-order traversal (later, outer-most nodes come last). The parser iterates forward over the pre-order list and attaches line comments to the outer-most nodes that they appear before. Afterward, it iterates backward over the post-order list and attaches suffix comments to the outer-most nodes that they appear after.

To understand this, let's work out a somewhat contrived example:

foo(
    # line
    a + b) # suffix

The AST looks like this:

   call
   / \
foo   +
     / \
    a   b

When we traverse the tree, we get these two lists:

pre-order:  call, foo, +, a, b
post-order: foo, a, b, +, call

Each node in these lists has a known start and end position, as do the comments, so we can easily figure out where to attach the comments. For the # line comment, the first node in the pre-order list with a later start position is the +, so that's where we attach it. That's what we'd intuitively expect. For the # suffix comment, the last node with a later end position is the call node, which is also what we'd expect.

Let's modify the example a little bit:

foo(
    # line
    a + b, # suffix
)

The # suffix comment now starts before the call AST node ends, which means it gets attached to the + AST node, again, just like we'd expect.

I hope this article is useful to someone. I thought this was clever. It makes refactoring tools a bit easier to write, which is a win in my book.