ANTLR 4: A Case Study

19 Jan 2014

Over the last few months, I’ve worked extensively with ANTLR 4, the fourth version of the ANTLR (ANother Tool for Language Recognition) parser generator. ANTLR is maintained by Terence Parr, a professor at the University of San Francisco.

(Several sentences extolling the virtues of ANTLR 4 used to exist here before I deleted them for the sake of brevity. Suffice it to say that Terence Parr is the man. ANTLR is some powerful, educational, well-written, free, and open source software. If you find an excuse to work with it at home or in the workplace, you’re guaranteed to have a good time.)

In this post, I provide a case study of ANTLR 4, with details on how I’ve used it and a handful of things I’ve found useful and/or interesting. It is not meant to be a step-by-step guide to getting started with ANTLR; for that you will need to pick up the excellent ANTLR 4 documentation. Although ANTLR 4 is (financially and RMS-approved) free, the documentation is not - it is available in the form of a book called The Definitive ANTLR 4 Reference, written by none other than Terence Parr himself.

Although I wrote this post mainly to convey my experiences with ANTLR, substantial details about PeopleCode, the language I am lexing/parsing with ANTLR, are included. Most readers will undoubtedly give zero cares about PeopleCode given its obscurity and irrelevance outside the PeopleTools runtime, but several language nuances (and one glaring language ambiguity) have been included to show just what ANTLR is capable of.

Overview

When you combine the powerful simplicity of JDBC and Java’s automatic memory management with the capabilities of ANTLR, the barrier to entry for creating database-intensive language applications is drastically lowered. My use of ANTLR centers on the development of a PeopleCode interpreter, which in turn is incorporated into the alternative PeopleTools runtime engine I’ve been working on. The codebase for the system is written in Java, which allows me to incorporate the ANTLR runtime directly into the engine via a JAR file in the classpath. Why did I choose Java/ANTLR rather than C/C++/lex/yacc? Because I knew that the engine was going to be executing thousands of SQL statements, a task better left to JDBC rather than Oracle’s raw OCI library (unless one is feeling especially 1337, but reverse engineering PeopleTools has proven to be wild enough without sadistically adding OCI nuances to the mix).

The Grammar File

Everything you do with ANTLR flows from a grammar file (file suffix is typically “.g4”). If you want to parse an existing language, check this GitHub repo; it may have the language you’re looking for. If you can’t find a grammar file for an existing language, or you want to create your own language, you’ll need to develop a grammar file from scratch. This is what I did for the PeopleCode language, and I’ll be discussing parts of the resultant grammar file in this section. You can find the file in its entirety here.

The first thing to note about my grammar file is that I did not embed any actions in it. Embedded actions are pieces of Java code attached to various rules of a grammar; when these rules are matched by the parser, the associated Java code will run. Embedded actions are a great way to get started with ANTLR, but I don’t recommend using them for anything but simple applications and experiments, as you’ll likely want to keep your language definition separate from application logic. More on this later.

The next thing to note are the two main parts of the file: the first part contains parser rules, the second contains lexer rules. A rule is written in the (very simplistic) form:

<rule-id> : <alternatives>

Rules starting with a lowercase letter are used by the parser. Rules starting with an uppercase letter are used by the lexer. As you can see, the PeopleCode language isn’t all that complex (at least compared to something like C++): you have the usual statement constructs (while, if, for, break, etc.), along with the usual expressions (expr * expr, expr >= expr, etc.).

Now take a step back and look at the big picture. A program is made up of a stmtList, which in turn is made up of one or more stmts. When ANTLR is given this grammar and a PeopleCode program, it will first lex the program, turning it into a stream of tokens, as defined by the lexer rules in the grammar file. Then it will parse the program: it will attempt to organize that stream of tokens into a tree, based on the format of the alternatives for each parser rule and the order in which the alternatives appear. If ANTLR completes successfully, it was able to successfully build a parse tree for the input; otherwise, you’ll get an error.

Grammar and language ambiguity are the bane of all language applications. If your grammar is ambiguous, the parser will successfully build a parse tree for a syntactically valid input, but it may not be the parse tree you consider to be correct/valid. If you look at the stmt and expr rules in the PeopleCode grammar file, you’ll find two identical alternatives: expr = expr. This is an ambiguity - there are two possible ways to parse a single input. Unfortunately, this ambiguity cannot be fixed with a modification to the grammer file because it’s rooted in the PeopleCode language itself. When writing PeopleCode programs, programmers must use the = operator for both comparison and assignment; in standalone statements, the = will result in an assignment, but in if expressions, function arguments, and any other expression context, the = will result in a Boolean equality comparison.

To compensate, I listed the expr '=' expr alternative for the stmt rule before the expr alternative. When ANTLR detects an ambiguity, it will choose the first valid alternative. So, when ANTLR sees SSR_STDNT_TERM0.EMPLID = "AA0001";, it will consider that to be an assignment stmt rather than an expr involving a Boolean comparison. All other uses of = will be (correctly) considered by ANTLR to be comparison expressions. If you’re working with a language that contains syntactic ambiguities like this, you’ll need to carefully order your alternatives in this manner, with higher precedence alternatives preceding identical alternatives of lower precedence.

If you’re designing a new language, you will vastly simplify your grammar by taking care to avoid ambiguities in your language. For the love of all that is holy, don’t use = to represent both comparison and assignment.

Giving Your Grammar to ANTLR

ANTLR 4 consists of a single JAR file; in addition to containing classes that can be incorporated into your application at runtime, this JAR can be called at the command line to transform your grammar file into a set of Java files that can then parse input according to the rules in the grammar. Note that at this time, ANTLR 4 can only generate parsers in Java; there are plans to support additional languages down the road.

For the PeopleCode.g4 grammar file, generating a lexer and parser is done as follows (all on one line):

java -jar antlr-4.1-complete.jar -package com.mattjquinn.antlr4.frontend grammars/PeopleCode.g4

After running this, PeopleCodeLexer.java and PeopleCodeParser.java can be found in my current directory. Note the -package option: if this is set, ANTLR will include a package <pkg-name>; line at the top of both the lexer and parser Java files. In the event that you incorporate these files into a broader system, you’ll likely want to set this value accordingly in your build script.

At this point, you now have a lexer and parser, but no main class with which to call them. ANTLR comes with a test rig, which means you don’t have to write a boilerplate class with a main entry point. The test rig can be called up using java org.antlr.v4.runtime.misc.TestRig (the ANTLR JAR file must be in your classpath). Terence Parr suggests aliasing this to grun at the command line for quicker execution; see The Definitive ANTLR 4 Reference for more.

Although the test rig is useful, it isn’t necessary if you’ve already got a broader system within which you’re incorporating the lexer and parser; the next section has more info if that’s the case.

Language Applications: Visitors vs Listeners

In all likelihood, you’re considering using ANTLR to execute custom logic based on the contents of well-formed input adhering to the syntax in a grammar file. If your logic is simple enough, and your grammar will only be used for a single language application, you might consider embedding the logic within the grammar file itself.

If your logic is complex, and/or your grammar will be used for multiple language applications, you should avoid embedded actions and stick with visitors and listeners. Within my alternative PeopleTools runtime, I have two language applications: one interprets PeopleCode programs, while the other pre-processes programs before they are interpreted. I implemented the former as a visitor and the latter as a listener; here are the details and trade-offs for each:

Visitors

Visitors are used to selectively visit nodes in an ANTLR parse tree. If you have a language application that doesn’t need to visit every node in a parse tree, or if the visits are dependent on context, you should implement your logic within a visitor. When you give your grammar to ANTLR for lexer and parser generation, include the -visitor argument to tell ANTLR that you plan on writing one or more visitors. ANTLR will generate <NameOfGrammar>Visitor.java and <NameOfGrammar>BaseVisitor.java alongside the lexer and parser files. To create a visitor, create a Java class that extends the <NameOfGrammar>BaseVisitor class. Then, to specify code that should be run upon visiting a rule in the grammar, override the appropriate method found in the <NameOfGrammar>BaseVisitor class.

As an example, I’ve created a class in my system called InterpreterVisitor (available here) that extends PeopleCodeBaseVisitor<Void> - this visitor implements the logic required to interpret PeopleCode programs. Note that the base visitor class is genericized; if you want to be able to return values from the visitor methods that you override, change Void to the appropriate type. However, my experience has been that using return values is too restrictive to be of use, so I just use Void and return null; at the end of each method I override. Regardless of the return scheme you use, you’ll most likely want to define instance variables to annotate parse tree nodes with various data. Aside from the usual standard library data structures, there’s a useful ANTLR-specific data structure called ParseTreeProperty. The class is genericized and acts like a hash map; if, for instance, you want to annotate parse tree nodes with String values, instantiate a new ParseTreeProperty<String>() and call .put(parseTreeNode, "<string-val>").

In my InterpreterVisitor class, I’ve overriden many of the base methods. For instance, to implement variable assignment for PeopleCode statements, I overrode the visitStmtAssign method found in my base visitor:

@Override
public Void visitStmtAssign(PeopleCodeParser.StmtAssignContext ctx) {
	...
}

Note that the default implementation for each method in <NameOfGrammar>BaseVisitor.java simply visits the children of the parse tree node provided as the argument. For each of these methods that you override, you must explicitly call visit(<child-node>) for any and all children of the provided parse tree node that you wish to visit. This is why visitors are well-suited for interpreters; in a conditional construct like an if/else statement, you don’t need to visit the body of the else if the conditional is true, and vice versa.

Listeners

Listeners are good for executing code every time your parser encounters a particular rule. Although the same effect can be achieved with a visitor, using a listener does not require you to manually visit child parse tree nodes, unlike use of a visitor. By default, all nodes in a parse tree are visited by a listener; in other words, the entire parse tree is walked. When you give your grammar to ANTLR for lexer and parser generation, include the -listener argument to tell ANTLR that you plan on writing one or more listeners. ANTLR will generate <NameOfGrammar>Listener.java and <NameOfGrammar>BaseListener.java alongside the lexer and parser files. To create a listener, create a Java class that extends the <NameOfGrammar>BaseListener class. Then, to specify code that should be run upon visiting a rule in the grammar, override the appropriate method found in the <NameOfGrammar>BaseListner class.

As an example, I’ve created a class in my system called ProgLoadListener (available here) that extends PeopleCodeBaseListener. I wrote this listener in order to pre-process programs before they are interpreted by my InterpreterVisitor visitor. I chose to use a listener here because I found that my pre-processing logic exhibited a pattern: “Every time an import statement appears, attach some metadata to the program”, “Every time an object is created, save a reference to the underlying program definition”, and so on.

In my ProgLoadListener class, I’ve overriden several of the base methods. For instance, to capture the first parse tree node in a method declaration within a PeopleCode program (and thus allowing me to visit the node again if that method is called during interpretation), I overrode the enterMethodImpl method found in my base listener:

@Override
public void enterMethodImpl(PeopleCodeParser.MethodImplContext ctx) {
	...
}

Listeners allow you to override up to two methods per rule in your grammar: enter<RuleId> and exit<RuleId> (you can also override enterEveryRule and exitEveryRule). Their purposes should be self-explanatory: when ANTLR reaches the first token matched to a rule, it will call the “enter” method for that rule, and when it reaches the last token matched to the rule, it will call the corresponding “exit” method. You should take care to place your logic accordingly; if you plan on annotating parse tree nodes with data using a listener, any logic that depends on data attached to child nodes should be placed in an “exit” rule. In an “enter” rule, you can access the text and other basic properties of child nodes, but ANTLR will not have visited them yet, and thus your annotations will not yet be available.

Advanced ANTLR : Accessing Hidden Tokens on a Separate Lexer Channel

One of the coolest ANTLR 4 features I’ve made use of is the ability to emit tokens on separate lexer channels. By default, the parser generated by ANTLR tunes to only one channel. All tokens are emitted by default to this primary channel. But what if you don’t care about whitespace, or comments in a program? One option is to have the lexer throw them out using the skip token attribute, as this rule from the PeopleCode.g4 grammar does with block comments in PeopleCode programs:

COMMENT_2 : '<*' .*? '*>' -> skip;

However, there may be tokens you don’t want to throw out, yet emitting them on the default parser channel would be unnecessary or make little sense. In that case, you can have ANTLR emit these tokens on a different lexer channel. Then, when you write your language applications, you can check for and access different channels to get the hidden tokens when you need them.

In my alternative PeopleTools runtime engine, I’m doing this for both whitespace and PeopleCode program references, which I’ll explain below. In my grammar file, I’ve written the following:

Excerpt from PeopleCode.g4

...

@lexer::members {
    public static final int REFERENCES = 1;
    public static final int WHITESPACE = 2;
}

...

WS :   [ \t\r\n]+ -> channel(WHITESPACE);
ENT_REF_IDX :   '#ENTREF{' IntegerLiteral '}' -> channel(REFERENCES);

My reason for deciding not to simply throw out whitespace is a common one: in my language applications, I wanted the ability to get the text of various parts of PeopleCode programs as it appeared in the original program, whitespace and all. The syntactical validity of PeopleCode programs does not depend on whitespace (some languages, i.e. Python, do in fact have this dependency), so emitting whitespace on the primary token channel would have been unnecessary.

In order to understand why I created a REFERENCES channel, some details about PeopleCode are in order. In a PeopleSoft database, a table called PSPCMPROG holds the metadata and bytecode for every PeopleCode program in the system. When my alternative PeopleTools runtime engine runs, it assembles the bytecode into its equivalent text form using logic found in the Decode PeopleCode open source project, developed by Erik H (erikh3@users.sourceforge.net). It’s in this text form that PeopleCode programs are provided to the ANTLR-generated parser.

Using ANTLR to parse the text of PeopleCode programs, rather than interpreting the bytecode directly, allows for greater clarity and ease-of-debugging, albeit with a performance hit. The downside is that there are certain bytecode instructions that lack a text equivalent. Two of these instructions (0x21 and 0x4A) signal a reference to a PeopleTools object definition, such as a record or field. Each program has a list of referenced objects in the PSPCMNAME table; the number following the 0x21 and 0x4A instructions keys into this reference list. The interpreter (implemented as an ANTLR visitor) doesn’t care about these references, but the pre-processor (implemented as an ANTLR listener) does - it needs to pre-fetch metadata for these objects to ensure that the interpreter has access to them at runtime.

This is a perfect use case for hidden tokens: a situation where one or more, but not all, language applications for a grammar require access to a token that has no business being emitted on the primary lexer channel. I conjured up a textual representation for the 0x21 and 0x4A reference instructions: “#ENTREF{id}”, where “id” is the integer identifying the reference. I modified the logic that assembles bytecode into text to ensure that whenever a reference instruction is encountered, the appropriate “#ENTREF{id}” is inserted. I then added the ENT_REF_IDX rule to PeopleCode.g4 (see above), with the channel attribute set to the REFERENCES channel.

With those changes in place, the PeopleCode parser generated by ANTLR could successfully take a PeopleCode program annotated with object references, like this…

Excerpt from the Record:FUNCLIB_CS.Field:SRVC_IND_NEG.Event:FieldFormula program.

Function SI_SecurityNeg(&EMPLID)

   &bolNegDisplay = False;
   &intCountNeg = 0;

   &rsPerson = CreateRowset(#ENTREF{2}Record.SRVC_IND_DATA);
   &rsPerson.Fill(" WHERE EMPLID = :1 AND POS_SRVC_INDICATOR= 'N' ", &EMPLID);
   &obj_si_scrty = create SCC_MANAGE_SI_SECURITY:SCC_DISPLAY:SrvcIndScrty();
   For &i = 1 To &rsPerson.ActiveRowCount
      &str_mode = &obj_si_scrty.getScrty(
			&rsPerson.GetRow(&i).#ENTREF{3}SRVC_IND_DATA.#ENTREF{4}INSTITUTION.Value,
			&rsPerson.GetRow(&i).#ENTREF{3}SRVC_IND_DATA.#ENTREF{5}SRVC_IND_CD.Value,
			&rsPerson.GetRow(&i).#ENTREF{3}SRVC_IND_DATA.#ENTREF{6}SRVC_IND_REASON.Value);
      If &str_mode = "NONE" Then
         &intCountNeg = &intCountNeg + 1;
      End-If;
   End-For;
   If &intCountNeg = &rsPerson.ActiveRowCount Then
      &bolNegDisplay = True;
   End-If;
End-Function;

… and build a parse tree for the program. The parse tree doesn’t actually contain the “#ENTREF{id}” tokens - you have to pull them from the token stream generated by the lexer. In my ProgLoadListener, I do just that:

Excerpt from ProgLoadListener.java

public class ProgLoadListener extends PeopleCodeBaseListener {
	...
	private BufferedTokenStream tokens;

	public ProgLoadListener(..., BufferedTokenStream tokens) {
		...
        this.tokens = tokens;
    }

    ...

    @Override
    public void enterEveryRule(ParserRuleContext ctx) {
    	...
    	int tokPos = ctx.getStart().getTokenIndex();
        List<Token> refChannel = tokens.getHiddenTokensToLeft(tokPos,
            PeopleCodeLexer.REFERENCES);
        Token refTok;

        if(refChannel != null && (refTok = refChannel.get(0)) != null) {
            String text = refTok.getText();
            int refIdx = Integer.parseInt(text.substring(8, text.length() - 1));
            ...
        }
    }
    ...
}

In English: for each rule that is preceded by an “#ENTREF{id}”, ProgLoadListener can easily get the integer reference id and use that to key into the list of references for the PeopleCode program represented by the parse tree being walked. The code is pretty simple, a testament to the kinds of things you can do with ANTLR without writing much code at all.

Conclusion

This post covers a mere fraction of the capabilities offered by ANTLR 4. Hopefully it gives you a taste of what’s possible. If you’re considering writing your own language applications, ANTLR 4 is a great choice for a parser generator, and the included runtime facilities are excellent. Even if you are planning to use C/C++/lex/yacc, you might consider prototyping your application in Java/ANTLR 4 first; you’ll likely enjoy being able to rapidly prototype first in order to iron any surface bugs out of your application.