RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, C++, Functional Programming, Applications,

The Small Core of Parsing Expression Grammars (PEGs)

Informal Definition of PEGs

Per the Wikipedia definition, a PEG is similar to a Context-free Grammar (CFG) insofar as both describe the structure of a language recursively by means of rewrite rules. PEGs, however, are also quite different from CFGs.

PEGs introduce the idea of ordered choice when rewriting a variable that appears in a rule. CFGs on the other hand have no such concept of ordered choice when rewriting a variable and so all rules must be evaluated at once, or at least tested. For this reason, non-deterministic pushdown automata (PDA) are ideal machines for recognizing the languages accepted by a given CFG. More formally, let GP be some PEG and GC be some CFG and have both share the same alphabet, Σ, and the same set of variables, V.

Let RP be the relation of variables to rewrite rules for GP and RC be the relation of variables to rewrite rules for GC. Let P be the set of all possible rewrite rules, (V ∪ Σ)*. Then RC: V → ℘(P) and RP: V → P*. Clearly, RC relates each variable to a set of sequences (unordered), whereas RP relates each variable to a sequence of sequences (ordered).

Ordered choice also hints at one of the key differences between the interpretation of PEGs and CFGs: PEGs describe how to parse a string and thus check if said string belongs to the language accepted by the PEG, whereas CFGs describe how to generate every string in the language accepted by the PDA equivalent to the CFG.

Finally, PEGs are very expressive and have two particularly interesting unary predicates that CFGs do not: followed-by (&) and not-followed-by (!). You can find a description of all of the operators defined for PEGs in the above linked Wikipedia article.

Top-Down Parsing Language (TDPL)

The TDPL is a strict subset of a PEG and was invented before PEGs. Everything that can be expressed by a TDPL grammar can thus be expressed by a PEG; however, the converse is not true: PEGs feature the aforementioned predicate operators and those cannot be expressed in the TDPL.

Prolog's Cut Operator

Suppose we introduce Prolog's cut operator into the TDPL. We can formally describe the Prolog's cut as a parsing operator as follows: Let ν ∈ V be some variable of P and RP(v) → (S1, S2, …, Sj, …, Sn) be the rewrite rules for ν. Suppose we are currently parsing according to the rule Sj. The continuation of the parser for ν should Sj fail at the current position in the string is (Sj+1, …, Sn). If we reach a cut operator in Sj then we replace the current parser continuation from Sj with the empty sequence, (), and move to the next symbol in Sj. Another interpretation of the cut operator is that it disables the parsers ability to backtrack and retry parsing with the rule Sj+1 if Sj fails at any point after a cut has been reached.

We will extend the TDPL with the cut operator and call this language !TDPL (for the sake of distinguishing it from the TDPL in this article). We want to prove the equivalence in matching power of the !TDPL and PEGs. Let T be some !TDPL grammar, P be some PEG, and s be some string. We will prove by construction the equivalence in matching power by defining each of the cut operator and the (not-)followed-by operators in terms of the other. The intention is to show that the !TDPL is the "small core" of a PEG.

If: s ∈ L(T) ⇒ s ∈ L(P)

For this half of the proof, we need only define the followed-by and not-followed-by predicates in terms of the cut operator. Supposed the following rules exist in P:

  • B → !b

  • A → &a

We can define equivalent rules in T as follows. Notice that the followed-by predicate is equivalent to the negation of the followed-by, or rather: not-followed-by ◦ not-followed-by. Thus, it is necessary only to prove that the cut is just as expressive as the not-followed-by operator in order to prove !TDPL and PEG equivalence. Interestingly, this implies that the TDPL with the addition of a not-followed-by predicate is as small of a core as the !TDPL.

  • B → b ! ƒ / ε

  • A → C ! ƒ / ε
  • C → a ! ƒ / ε

Only-If: s ∈ L(T) ⇐ s ∈ L(P)

Suppose the following rule exists in T:

  • A → B ! C / D

We can define an equivalent rule in P as follows:

  • A → !B D / B C

Generalized TDPL (GTDPL)

Interestingly, all of this has been shown before but not with the cut operator. PEGs have an alternate simple core as proven by Brian Ford. It is the GTDPL (information can be found on the GTDPL in the TDPL Wikipedia entry), another simple extension of the TDPL.

The GTDPL defines an if-then-else mechanism using the following syntax:

  • A → B[C, D]

Which is equivalent to the !TDPL rule in the previous section.

Conclusion

All of this has, in essence, been an exercise in tedium. It was not necessary to re-express the GTDPL in another way; however, it was fun. While making my PEG parser generator earlier this summer (it was in interest project, the program is not meant for serious use) I ended up "discovering" that I could have all of the expressive power of PEGs using the simple-to-implement cut operator as opposed to implementing all of the PEG operators. This, coupled with some tree-building operators allowed me to generate the same parse trees that the equivalent PEG would. Had I done more research at the time, I would have realized that I had not really discovered a more succinct core to the PEG language but instead re-expressed a well-established one using a construct from a language that was familiar to me: Prolog. Nevertheless, the process was fun and rewarding!


Comments

There are no comments, but you look like you have something on your mind.


Comment