Parsing Expression Grammars (PEGs) have been an interest of mine for some time now. Last summer (2009) I wrote a parser generator in C whose grammar language has the same matching power as PEGs but adds some more expressive (but basic) tree-building constructs. Unfortunately, parsers generated with my parser generator sometimes ran into trouble when grammars used left recursion; specifically, indirect left recursion.
Recently a post came up on the PEG mailing list that made me remember the work I had done last summer and it re-ignited my interest in using PEGs with left-recursive grammars. While reading that post, a fairly simple formulation of how to handle both direct and indirect left recursion occurred to me. The formulation itself is similar to that of Warth et al. in that it detects left recursion and re-executes any rule that is the root of left recursion. The difference is that the cache / memo table does not need to be treated differently for left recursive invocations of the grammar's productions.
For the sake of simplicity, I will deal with a subset of the PEG formalism that is less powerful than full PEGs because it lacks the followed-by and not-followed-by operators (& and !, respectively). With this in mind, define a grammar for a language as follows:
- A set of variables. A variable is a name given to a set of structures within a language. For example: imagine that we want to describe the language M of mathematical expressions over the integers and that we want to assign the variable Summation to all expressions formed by either adding or subtracting two expressions in M. If A and B are two valid expressions in M then an expression that is structured as A + B or A - B is identified by the Summation variable.
- A set of terminals. Words in a language are sequences of terminals. If our language is the set of English words then the terminals of the word hello are the letters h, e, l, l, and o. If our language is C then the terminals include return, goto, int, etc.
- A set of symbols. The set of symbols is the union of the set of variables and terminals. We will assume that the set of terminals and the set of variables are disjoint.
- A set of productions. Each production is a sequence of zero or more symbols. A production represents a pattern or structure to match. The variables within the production represent sub-structures to match. For convenience, we define a function symbols[G] mapping a production Pi to its sequence of zero or more symbols.
- A function productions[G] (for some grammar G) mapping variables to a sequence of zero or more productions. A variable that is mapped to an empty sequence of productions will always match. We will use |productions[G][V]| to represent the number of productions for some grammar G and some variable V.
- A parser is a procedure that determines whether or not a string of terminals can be generated / accepted by a grammar. Parsers usually do more than this: they often build parse trees or interpret the terminals directly. For the purpose of this article, the function of a parser will be to return the number of terminals of a string that can be generated by a grammar.
Define a string as a sequence of zero or more terminals.
The First Parser
We can begin by defining a simple parser for a language described by an (ordered) grammar by the following relatively straightforward pseudocode procedure:
    parse(grammar: G, variable: var, string: str, int: offset=0):
        int: start_offset = offset
        bool: last_production_failed = false
        for production: P in productions[G][var]:
            last_production_failed = false
            for symbol: S in symbols[G][P]:
                if S is a terminal:
                    if terminal(S) is terminal(str[offset]):
                        offset = offset + 1
                    else:
                        go to label try_next_production
                else:
                    variable: prod_var = variable(S)
                    try:
                        offset = parse(G, prod_var, str, offset)
                    catch Error:
                        go to label try_next_production
            go to label done
            label try_next_production:
                last_production_failed = true
                offset = start_offset
        if last_production_failed:
            raise Error
        label done:
            return offset
The behaviour of this parser is fairly straightforward. Given a grammar, a starting variable, and a string, it will work its way through the string and either return an integer representing how much of the string was parsed or raise an Error. If the sequence of symbols for a given production is empty then the inner loop will not execute and the procedure will return the offset passed into it, effectively matching nothing. If the sequence is non-empty then matching terminal symbols will cause the parser to move forward and matching non-terminal symbols will result in a recursive call to the parse procedure that updates the string offset to the terminal immediately following the last terminal matched by the call to parse. If matching a terminal or non-terminal fails then the offset is reset to the offset that was passed into the current invocation of parse and stored in start_offset, and the next production is tried.
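For readers who want to run the idea, the pseudocode can be approximated in Python roughly as follows. This is a sketch, not a transcription: the dict-of-tuples grammar encoding, the ParseError name, the single-character terminals and the end-of-string check are my assumptions, and exceptions stand in for the labels and jumps.

```python
class ParseError(Exception):
    """Raised when no production of a variable matches."""


def parse(grammar, var, text, offset=0):
    """Return the offset reached after applying var to text at offset."""
    if not grammar[var]:
        return offset  # a variable with no productions always matches
    for production in grammar[var]:
        pos = offset  # every production restarts from the original offset
        try:
            for symbol in production:
                if symbol in grammar:  # variable: recurse
                    pos = parse(grammar, symbol, text, pos)
                elif text[pos:pos + 1] == symbol:  # terminal: consume it
                    pos += 1
                else:
                    raise ParseError(f"expected {symbol!r} at offset {pos}")
            return pos  # first production that matches wins (ordered choice)
        except ParseError:
            continue  # try the next production
    raise ParseError(f"{var} does not match at offset {offset}")


# Hypothetical grammar: S0 -> a S, S1 -> b
GRAMMAR = {"S": [("a", "S"), ("b",)]}
print(parse(GRAMMAR, "S", "aab"))  # prints 3
```

Slicing with text[pos:pos + 1] yields an empty string at the end of the input, so a terminal comparison there simply fails instead of indexing out of range.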
Unfortunately, this parser still suffers one obvious flaw: it risks repeating a lot of work in the event that the execution of some productions starting at any particular point in the string share common prefixes. Consider the following BNF-like grammar, but with ordered choice:
    START0 → A x
    START1 → A b
    A0 → a A a
    A1 →

When this grammar is executed using the above parse procedure on the string aaaab, START0 successfully matches A at offset 0 but then fails to match x. START1 is then tried and matches A (again) at offset 0, repeating all of the work already done for A, before going on to try b.
The Second Parser
We can solve the problem of repeating work by remembering work that we have already done. This is how PEG parsers parse strings in time that is linear in the number of terminals in the string and the number of variables in the grammar. We can use a two-dimensional table to represent our memory. The first dimension will be accessed by a variable name. The second dimension will be accessed by the offset of a terminal in a string. This table will be referenced by the variable cache in the following code:

    (variable × int) → (int + Error): cache

    parse(grammar: G, variable: var, string: str, int: offset=0):
        int: start_offset = offset
        bool: last_production_failed = false
        for production: P in productions[G][var]:
            last_production_failed = false
            for symbol: S in symbols[G][P]:
                if S is a terminal:
                    if terminal(S) is terminal(str[offset]):
                        offset = offset + 1
                    else:
                        go to label try_next_production
                else:
                    variable: prod_var = variable(S)
                    if (prod_var, offset) in cache:
                        if cache[prod_var][offset] is Error:
                            go to label try_next_production
                        else:
                            offset = cache[prod_var][offset]
                    else:
                        try:
                            offset = parse(G, prod_var, str, offset)
                        catch Error:
                            go to label try_next_production
            go to label done
            label try_next_production:
                last_production_failed = true
                offset = start_offset
        if last_production_failed:
            cache[var][start_offset] = Error
            raise Error
        label done:
            cache[var][start_offset] = offset
            return offset
The modifications to the original parsing procedure are few but powerful. Before executing a recursive call to the parse procedure, a check is done to see if the parse procedure has already been called at the desired offset for the specific variable. If the result of the previous call is an Error then the next production is tried. If the result is an integer then we assign that result to offset and move on.
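A runnable approximation of the memoized parser might look like the Python sketch below. The FAIL sentinel, the dict keyed by (variable, offset) and the optional trace list are my own assumptions; the lookup is done on entry to parse rather than just before the recursive call, which I believe has the same effect.

```python
class ParseError(Exception):
    """Raised when no production of a variable matches."""


FAIL = object()  # cache sentinel for a failed application


def parse(grammar, var, text, offset=0, cache=None, trace=None):
    """Memoized parse; trace (optional) records every real evaluation."""
    cache = {} if cache is None else cache
    key = (var, offset)
    if key in cache:  # reuse remembered work
        if cache[key] is FAIL:
            raise ParseError(f"{var} fails at offset {offset}")
        return cache[key]
    if trace is not None:
        trace.append(key)
    result = FAIL if grammar[var] else offset  # no productions: always match
    for production in grammar[var]:
        pos = offset
        try:
            for symbol in production:
                if symbol in grammar:
                    pos = parse(grammar, symbol, text, pos, cache, trace)
                elif text[pos:pos + 1] == symbol:
                    pos += 1
                else:
                    raise ParseError(f"expected {symbol!r} at offset {pos}")
            result = pos
            break
        except ParseError:
            continue
    cache[key] = result  # remember success and failure alike
    if result is FAIL:
        raise ParseError(f"{var} does not match at offset {offset}")
    return result


# The grammar with the shared prefix A, in the assumed dict encoding.
GRAMMAR = {"START": [("A", "x"), ("A", "b")], "A": [("a", "A", "a"), ()]}
trace = []
print(parse(GRAMMAR, "START", "aab", trace=trace))  # prints 3
print(trace.count(("A", 0)))                        # prints 1
```

The second production of START finds A at offset 0 already in the cache, so A is evaluated only once at that offset even though two productions ask for it.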
Still, there remains a problem: the parse procedure will not terminate if the grammar contains left recursion. For example, the following grammar should be able to parse the string aaa; however, because of the property of ordered choice, the parser never stops calling parse with START0 at offset zero.
    START0 → START a
    START1 →
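The runaway recursion is easy to observe. The sketch below reuses a compact Python rendering of the first parser (the dict encoding and the names are my assumptions) and relies on Python's recursion limit to turn the endless descent into a RecursionError.

```python
class ParseError(Exception):
    """Raised when no production of a variable matches."""


def parse(grammar, var, text, offset=0):
    """Plain recursive-descent parse with ordered choice and no cache."""
    if not grammar[var]:
        return offset
    for production in grammar[var]:
        pos = offset
        try:
            for symbol in production:
                if symbol in grammar:
                    pos = parse(grammar, symbol, text, pos)
                elif text[pos:pos + 1] == symbol:
                    pos += 1
                else:
                    raise ParseError(symbol)
            return pos
        except ParseError:
            continue
    raise ParseError(var)


# START0 -> START a, START1 -> (empty)
LEFT_RECURSIVE = {"START": [("START", "a"), ()]}


def hits_recursion_limit(grammar, var, text):
    """True if parsing recurses until Python gives up."""
    try:
        parse(grammar, var, text)
    except RecursionError:
        return True
    except ParseError:
        return False
    return False


print(hits_recursion_limit(LEFT_RECURSIVE, "START", "aaa"))  # prints True
```

The first production immediately applies START again at offset 0 without consuming anything, so no invocation ever finishes and the cache in the second parser is never filled in for that entry.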
The Final Parser
How can we prevent this behaviour when the grammar contains left recursion? One surprisingly simple way to handle left recursion is to fail! Suppose that we could detect a (direct or indirect) left recursive production application on some variable V. If the previous application of V was the production Vi then for the current application of V, we can pretend to fail Vi and apply Vi+1 if it exists. If Vi+1 does not exist then we simply fail to apply V and record the failure in cache. The intuition is that we expect that a left recursive application should eventually match something, i.e. it should have at least one base case. If a base case matches then we can imagine collapsing the parser stack (easier with an explicit stack) until we hit the root of the current left recursive invocation, and then we can substitute the base case in as the result of applying the variable left-recursively, and continue on past it. The trick is to continue growing this left recursive root by substituting in the results of previous invocations until no forward progress is made. Warth et al. describe this process in their paper as growing the left recursive seed.
How can we detect left recursion? It is surprisingly simple, in fact. If the last application of a particular variable is at the same terminal offset in the string as the current application of the variable then we are applying the variable left recursively. If we maintain a stack of production applications and their terminal offsets for each variable then we can figure out whether we are applying a production left recursively by peeking at the top of the stack.
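To illustrate what growing a seed means, here is a deliberately simplified Python sketch. It is not the scheme described above: instead of stacks it plants a failure in the cache before evaluating a variable, so a left recursive re-entry fails and a base case is forced, and then it re-evaluates the variable for as long as the match gets longer. It assumes direct left recursion only, and all names (FAIL, apply_productions) are hypothetical.

```python
class ParseError(Exception):
    """Raised when a variable cannot be applied at an offset."""


FAIL = object()  # cache sentinel meaning "this application fails"


def apply_productions(grammar, var, text, offset, cache):
    """Try each production in order; return the end offset or FAIL."""
    if not grammar[var]:
        return offset
    for production in grammar[var]:
        pos = offset
        try:
            for symbol in production:
                if symbol in grammar:
                    pos = parse(grammar, symbol, text, pos, cache)
                elif text[pos:pos + 1] == symbol:
                    pos += 1
                else:
                    raise ParseError(symbol)
            return pos
        except ParseError:
            continue
    return FAIL


def parse(grammar, var, text, offset=0, cache=None):
    cache = {} if cache is None else cache
    key = (var, offset)
    if key in cache:
        if cache[key] is FAIL:
            raise ParseError(var)
        return cache[key]
    cache[key] = FAIL  # a left recursive re-entry fails, forcing a base case
    best = FAIL
    while True:
        result = apply_productions(grammar, var, text, offset, cache)
        if result is FAIL or (best is not FAIL and result <= best):
            break  # no forward progress: stop growing
        best = result
        cache[key] = best  # the longer match becomes the new seed
    if best is FAIL:
        raise ParseError(var)
    return best


print(parse({"START": [("START", "a"), ()]}, "START", "aaa"))  # prints 3
```

Every variable is evaluated at least twice under this sketch, once to find a seed and once to discover that it cannot grow. Indirect left recursion is not handled correctly here, because results cached for intermediate variables while the seed was still small are never recomputed.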
What is more important, however, is that the stack allows us to identify the root of a left recursive invocation. Suppose that for each variable we maintain a stack of type production × bool × bool × int representing the production being applied, whether or not the production is left recursive, whether or not the production is the root of left recursion, and the terminal offset to which the production was applied, respectively. We can detect left recursion and the left recursive root as follows:
    is_left_recursive(Stack[production × bool × bool × int]: stack, int: offset):
        match top(stack):
            case _, True, _, offset:
                return True
            case _, False, _, offset:
                top(stack).is_left_recursive = True
                top(stack).is_left_recursive_root = True
                return True
        return False
The above procedure can be used to determine whether or not a production that is about to be applied is left recursive given the stack of previous applications of this production's variable and the offset at which the production is being applied. The procedure might also update the element on the top of the stack to be left-recursive and be the root of left recursion.
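In Python the same check could be sketched as below. The Application record and its field names are hypothetical, and treating an empty stack as "not left recursive" is my assumption, since the pseudocode does not say what happens in that case.

```python
from dataclasses import dataclass


@dataclass
class Application:
    """One entry on a variable's stack of production applications."""
    production: int                # index of the production being applied
    is_left_recursive: bool
    is_left_recursive_root: bool
    offset: int                    # terminal offset of the application


def is_left_recursive(stack, offset):
    """Report whether applying the variable at offset is left recursive."""
    if not stack:
        return False  # assumption: nothing in progress for this variable
    top = stack[-1]
    if top.offset != offset:
        return False  # the previous application started somewhere else
    if not top.is_left_recursive:
        # First time the recursion is noticed: the entry on top of the
        # stack becomes the root of the left recursion.
        top.is_left_recursive = True
        top.is_left_recursive_root = True
    return True


stack = [Application(production=0, is_left_recursive=False,
                     is_left_recursive_root=False, offset=0)]
print(is_left_recursive(stack, 0))      # prints True
print(stack[-1].is_left_recursive_root)  # prints True
print(is_left_recursive(stack, 1))      # prints False
```

Only the first detection marks the top entry as the root; later calls at the same offset find the flag already set and leave the root marker untouched.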
Making a parse procedure work correctly according to the idea of recognizing and growing left recursion can be subtle, and I think is easiest to do by managing the parsing stack explicitly. Instead of including pseudo code, I defer to my toy C++ implementation of the above ideas.
I think that using explicit stacks of production application information for each variable in the grammar is the easiest way to recognize left recursive invocations and the roots of those invocations, as evidenced by is_left_recursive.