<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" 
    xmlns:dc="http://purl.org/dc/elements/1.1/" 
    xmlns:media="http://search.yahoo.com/mrss/" 
    xmlns:atom="http://www.w3.org/2005/Atom"
    >
<channel>
    <title>I/O Reader</title>
    <atom:link href="http://www.ioreader.com/feed" rel="self" type="application/rss+xml" />
    <link>http://www.ioreader.com/</link>
    <description>Peter Goodman's blog about computer programming.</description>
    <pubDate>Wed, 09 May 2012 16:13:04 GMT</pubDate>
    <language>en</language>
        <item>
                <title><![CDATA[Traditional Parsing Methods]]></title>
        <link>http://www.ioreader.com/2012/05/09/traditional-parsing-methods</link>
        <comments>http://www.ioreader.com/2012/05/09/traditional-parsing-methods#comments</comments>
        <pubDate>Wed, 09 May 2012 16:13:04 GMT</pubDate>
        <dc:creator>Peter Goodman</dc:creator>
                        <category><![CDATA[Parsing Theory]]></category>
                                <category><![CDATA[TDOP]]></category>
                        <guid isPermaLink="false">2r</guid>
        <description><![CDATA[<p>
    One parsing technique that I sometimes use is <a href="http://dl.acm.org/citation.cfm?id=512931" title="Top Down Operator Precedence Parsing">Top Down Operator Precedence Parsing</a> (<abbr title="Top Down Operator Precedence Parsing">TDOP</abbr>). TDOP parsers have been discussed <a href="http://journal.stuffwithstuff.com/2011/03/19/pratt-parsers-expression-parsing-made-easy/">in</a> <a href="http://effbot.org/zone/tdop-index.htm">many</a> <a href="http://eli.thegreenplace.net/2010/01/02/top-down-operator-precedence-parsing/">other</a> <a href="http://javascript.crockford.com/tdop/tdop.html">places</a> as well. Unfortunately, I have not seen TDOP described in terms of left-corner parsing (except for a passing comment in <a href="http://publications.csail.mit.edu/lcs/specpub.php?id=715">this thesis</a>). 
</p>
<p>
    The purpose of this post is to set the stage for a later discussion about TDOP parsing. This post will introduce top-down and bottom-up parsing, then combine the two methods to introduce left-corner parsing. Also, the top-down parsing language (TDPL) will be briefly mentioned as its semantics relate to TDOP.
</p>

<h3>Traditional Parsing Methods</h3>
<p>
    Before getting into TDOP, it&apos;s important to have at least some background in non-TDOP parsing methods. This is because TDOP can be understood as a combination of several different parsing methods.
</p>
<p>
    Parsing is a language acceptance problem. That is, a parser is a function that accepts or rejects a string. If a parser accepts a string then we say that string is in some language. The opposite is said of rejection. A string in this case means a sequence of zero or more symbols. In the English language, symbols are Latin/alphabetic characters. In the <a href="http://en.wikipedia.org/wiki/C_(programming_language)">C programming language</a>, symbols are reserved words, variables, literals, and punctuation (e.g. <tt>void</tt>, <tt>&quot;foo&quot;</tt>, <tt>&gt;</tt>, etc.).
</p>
<p>
    Typically, a parser accepts the language generated by a <a href="http://en.wikipedia.org/wiki/Context-free_grammar" title="Context-free grammar">context-free grammar</a> (<abbr title="Context-free grammar">CFG</abbr>). CFGs are a formalism for describing <a href="http://en.wikipedia.org/wiki/Context-free_language">some</a> languages. The following is an example CFG that generates simple arithmetic expressions:
</p>
<pre class="code">
E &rarr; &quot;(&quot; E &quot;)&quot;
E &rarr; A

A &rarr; M &quot;+&quot; A
A &rarr; M &quot;-&quot; A
A &rarr; &quot;-&quot; A
A &rarr; M

M &rarr; N &quot;&times;&quot; M
M &rarr; N &quot;&divide;&quot; M
M &rarr; N

N &rarr; &quot;0&quot;
N &rarr; &quot;1&quot;
  &#8942;
N &rarr; &quot;10&quot;
</pre>
<p>
    <small style="font-style:italic;">
        Note: ignore the unusual placement of the parentheses and the <a href="http://en.wikipedia.org/wiki/Associative_property">right-associativity</a> of the operators described by the grammar.
    </small>
</p>
<p>
    The name to the left of the <tt>&rarr;</tt> is called a variable or a non-terminal. Something in quotes is called a token, or terminal. Both terminals and non-terminals are considered symbols. Terminals can be thought of as the letters of one&apos;s language.
</p>
<p>
    The <tt>&rarr;</tt> itself is a relation which says that the non-terminal on the left-hand side can generate the string of symbols on the right-hand side. This combination is called a production.
</p>
<p>
    <em>Note</em>: the rest of this article will focus on parsing strings from left-to-right. The following examples detailing various parsing methods assume that our parsers always guess correctly. Finally, we assume that our grammars are &epsilon;-free. That is, the right-hand side of a production is never empty (with one exception).
</p>
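<p>
    As a rough illustration, the expression grammar above can be written down as plain data. This Python sketch is only one possible encoding: the names <tt>GRAMMAR</tt>, <tt>NONTERMINALS</tt>, <tt>TERMINALS</tt>, and <tt>is_epsilon_free</tt> are made up for this example, and <tt>*</tt> and <tt>/</tt> are ASCII stand-ins for <tt>&times;</tt> and <tt>&divide;</tt>.
</p>

```python
# Each production is a (left-hand side, right-hand side) pair.
# "*" and "/" are ASCII stand-ins for the multiplication and division signs.
GRAMMAR = [
    ("E", ("(", "E", ")")),
    ("E", ("A",)),
    ("A", ("M", "+", "A")),
    ("A", ("M", "-", "A")),
    ("A", ("-", "A")),
    ("A", ("M",)),
    ("M", ("N", "*", "M")),
    ("M", ("N", "/", "M")),
    ("M", ("N",)),
] + [("N", (str(n),)) for n in range(11)]  # N generates "0" through "10"

# A symbol that appears on some left-hand side is a non-terminal;
# every other symbol used on a right-hand side is a terminal.
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}
TERMINALS = {sym for _, rhs in GRAMMAR for sym in rhs} - NONTERMINALS


def is_epsilon_free(grammar):
    """Return True when no production has an empty right-hand side."""
    return all(len(rhs) != 0 for _, rhs in grammar)
```

<p>
    Under this encoding a production is just a pair, so asking which right-hand sides a non-terminal can generate is a simple scan over the list.
</p>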
<h3>Top-Down Parsing</h3>
<p>
    As its name implies, <a href="http://en.wikipedia.org/wiki/Top-down_parsing" title="Top-down parsing">top-down parsing</a> proceeds top-down. In the case of the above expression grammar, the &quot;top&quot; starts off as <tt>E</tt>. The action of going &quot;down&quot; involves one of two things:
</p>
<ol>
    <li>Replacing a non-terminal with something that it is related to (the right-hand side of <tt>&rarr;</tt>).</li>
    <li>Consuming a terminal.</li>
</ol>
<p>
    Right-hand sides of productions contain both terminals and non-terminals. In replacing a non-terminal with one of its right-hand sides, we set up expectations about the structure of later parts of the string. For example, suppose we want to parse &quot;<tt>(2 &times; 3)</tt>&quot;. Parsing will proceed as follows:
</p>
<div align="center">
    <script type="text/javascript" src="http://www.ioreader.com/js/image-slide.js"></script>
    <table cellpadding="0" cellspacing="0" border="1">
        <thead>
            <tr>
                <th>Step</th>
                <th colspan="2">Action</th>
                <th>Expectations</th>
                <th>Remainder of string</th>
            </tr>
        </thead>
        <tbody>
        <tr>
            <td>1</td>
            <td colspan="2">start</td>
            <td><tt>E</tt></td>
            <td><tt>(2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>2</td>
            <td>replace</td>
            <td><tt>E &rarr; &quot;(&quot; E &quot;)&quot;</tt></td>
            <td><tt>&quot;(&quot; E &quot;)&quot;</tt></td>
            <td><tt>(2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>3</td>
            <td>consume</td>
            <td><tt>&quot;(&quot;</tt></td>
            <td><tt>E &quot;)&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>4</td>
            <td>replace</td>
            <td><tt>E &rarr; A</tt></td>
            <td><tt><font color="blue">A</font> &quot;)&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>5</td>
            <td>replace</td>
            <td><tt>A &rarr; M</tt></td>
            <td><tt><font color="blue">M</font> &quot;)&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>6</td>
            <td>replace</td>
            <td><tt>M &rarr; N &quot;&times;&quot; M</tt></td>
            <td><tt><font color="blue">N &quot;&times;&quot; M</font> &quot;)&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>7</td>
            <td>replace</td>
            <td><tt>N &rarr; &quot;2&quot;</tt></td>
            <td><tt><font color="blue">&quot;2&quot;</font> &quot;&times;&quot; M &quot;)&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>8</td>
            <td>consume</td>
            <td><tt>&quot;2&quot;</tt></td>
            <td><tt>&quot;&times;&quot; M &quot;)&quot;</tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        <tr>
            <td>9</td>
            <td>consume</td>
            <td><tt>&quot;&times;&quot;</tt></td>
            <td><tt>M &quot;)&quot;</tt></td>
            <td><tt>3)</tt></td>
        </tr>
        <tr>
            <td>10</td>
            <td>replace</td>
            <td><tt>M &rarr; N</tt></td>
            <td><tt><font color="blue">N</font> &quot;)&quot;</tt></td>
            <td><tt>3)</tt></td>
        </tr>
        <tr>
            <td>11</td>
            <td>replace</td>
            <td><tt>N &rarr; &quot;3&quot;</tt></td>
            <td><tt><font color="blue">&quot;3&quot;</font> &quot;)&quot;</tt></td>
            <td><tt>3)</tt></td>
        </tr>
        <tr>
            <td>12</td>
            <td>consume</td>
            <td><tt>&quot;3&quot;</tt></td>
            <td><tt>&quot;)&quot;</tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td rowspan="2">13</td>
            <td>consume</td>
            <td><tt>&quot;)&quot;</tt></td>
            <td></td>
            <td></td>
        </tr>
        <tr>
            <td colspan="2">accept</td>
            <td colspan="2"></td>
        </tr>
        </tbody>
    </table>
</div>
<p>
    If&mdash;as a side-effect of parsing a string&mdash;one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:
</p>
<div align="center">
    <table border="1" cellpadding="0" cellspacing="0">
        <tr>
            <td colspan="3">
                <img src="http://www.ioreader.com/images/tdop/top-down-step0.png" id="tdop_top_down">
            </td>
        </tr>
        <tr>
            <td align="center"><button id="tdop_prev_top_down">prev</button></td>
            <td align="center"><button id="tdop_reset_top_down">reset</button></td>
            <td align="center"><button id="tdop_next_top_down">next</button></td>
        </tr>
    </table>
    <script>
    $("#tdop_top_down").imageSlider(11, "tdop_prev_top_down", "tdop_reset_top_down", "tdop_next_top_down");
    </script>
</div>
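<p>
    The replace and consume actions traced above can be sketched as a small recognizer. A real parser cannot always guess correctly, so this sketch simulates the guessing by trying every production and backtracking on failure. The names <tt>top_down</tt> and <tt>accepts</tt>, the space-separated token format, and the use of <tt>*</tt> and <tt>/</tt> in place of <tt>&times;</tt> and <tt>&divide;</tt> are assumptions made for the example.
</p>

```python
# Productions in text order; "*" and "/" are ASCII stand-ins for the
# multiplication and division signs used in the grammar above.
GRAMMAR = [
    ("E", ("(", "E", ")")), ("E", ("A",)),
    ("A", ("M", "+", "A")), ("A", ("M", "-", "A")),
    ("A", ("-", "A")), ("A", ("M",)),
    ("M", ("N", "*", "M")), ("M", ("N", "/", "M")), ("M", ("N",)),
] + [("N", (str(n),)) for n in range(11)]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}


def top_down(expectations, remainder):
    """Return True if `remainder` can be derived from `expectations`."""
    if not expectations:
        return not remainder  # accept only when the input is used up too
    head, rest = expectations[0], expectations[1:]
    if head in NONTERMINALS:
        # Replace: try each right-hand side of `head` as the "guess".
        return any(top_down(rhs + rest, remainder)
                   for lhs, rhs in GRAMMAR if lhs == head)
    # Consume: the expected terminal must match the next input token.
    return (bool(remainder) and remainder[0] == head
            and top_down(rest, remainder[1:]))


def accepts(text):
    """Recognize a space-separated token string, starting from E."""
    return top_down(("E",), tuple(text.split()))
```

<p>
    Calling <tt>accepts("( 2 * 3 )")</tt> follows the same replace and consume steps as the table, plus the dead ends that a correctly guessing parser would never visit. The search terminates on this grammar because it has no left recursion; that property is not true of grammars in general.
</p>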
<h4>Top-Down Parsing Language</h4>
<p>
    Brief mention needs to be given to the <a href="http://en.wikipedia.org/wiki/Top-down_parsing_language" title="Top-Down Parsing Language">top-down parsing language</a> (TDPL). The TDPL formalizes the behavior of many top-down parsers. A key difference between a TDPL grammar and a CFG is that productions are totally ordered in a TDPL grammar.
</p>
<p>
    For example, if the productions of the above CFG were totally ordered according to their text order, then a parser cannot try the second production (<tt>E &rarr; A</tt>) without first failing to parse according to the first production (<tt>E &rarr; &quot;(&quot; E &quot;)&quot;</tt>).
</p>
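<p>
    To make the effect of ordering concrete, here is a minimal sketch of ordered choice in the spirit of the TDPL. It does not implement the formal TDPL rule forms; it only tries productions in text order and commits to the first one that succeeds. The function names and the tiny grammars <tt>SHORT_FIRST</tt> and <tt>LONG_FIRST</tt> are hypothetical.
</p>

```python
def parse(grammar, symbol, tokens, pos):
    """Return the position after matching `symbol` at `pos`, or None."""
    alternatives = [rhs for lhs, rhs in grammar if lhs == symbol]
    if not alternatives:
        # `symbol` has no productions, so treat it as a terminal.
        if pos != len(tokens) and tokens[pos] == symbol:
            return pos + 1
        return None
    for rhs in alternatives:  # productions are tried in text order
        end = pos  # every alternative restarts from the same position
        for sym in rhs:
            end = parse(grammar, sym, tokens, end)
            if end is None:
                break  # this alternative failed; fall through to the next
        else:
            return end  # commit to the first alternative that succeeds
    return None


def accepts(grammar, start, tokens):
    """Accept only if `start` matches the whole token list."""
    return parse(grammar, start, tokens, 0) == len(tokens)


# Two orderings of the same pair of productions.
SHORT_FIRST = [("S", ("a",)), ("S", ("a", "b"))]
LONG_FIRST = [("S", ("a", "b")), ("S", ("a",))]
```

<p>
    Read as CFGs, <tt>SHORT_FIRST</tt> and <tt>LONG_FIRST</tt> generate the same language. Under ordered choice they differ: with <tt>SHORT_FIRST</tt> the input <tt>a b</tt> is rejected, because the first production succeeds after one token and the second is never tried.
</p>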
<h3>Bottom-Up Parsing</h3>
<p>
    We can characterize top-down parsers as making &quot;global&quot; decisions. Their expectations about the future structure of the as-yet-unseen parts of the string are evidence of this. On the other hand, <a href="http://en.wikipedia.org/wiki/Bottom-up_parsing" title="Bottom-up parsing">bottom-up parsers</a> operate &quot;locally&quot;. That is, they make decisions based only on the structure of the part of the string that they have already seen.
</p>
<p>
    The consequence of local decision making is that bottom-up parsers discover sub-structures of the parsed string before they discover super-structures. In theory, a bottom-up parser has no expectations about the remainder of the string to be parsed. In practice, <a href="http://en.wikipedia.org/wiki/LALR_parser" title="LALR parser">common</a> bottom-up parsers implicitly make use of top-down information.
</p>
<p>
    Bottom-up parsers typically perform two main actions: shift and reduce. 
</p>
<ol>
    <li>
        Shifting is similar to consuming to the extent that our cursor into the string being parsed moves forward by one symbol. This is equivalent to removing the first symbol of the input string.
        <br><br>
        Unlike top-down parsers, bottom-up parsers do not maintain a sequence of expectations. Instead, they operate on a partially parsed substring of the input string.
        <br><br>
        Shifting involves taking the first symbol from the remainder of the input string and appending it to the end of the partially parsed string.
    </li>
    <li>
        Reducing operates on a suffix of the partially parsed string. A reduction involves taking a suffix of the partially parsed string, matching it against the right-hand side of a production, and then replacing it with the left-hand side (a non-terminal) of that production.
    </li>
</ol>
<p>
    For example, suppose we want to parse &quot;<tt>(2 &times; 3)</tt>&quot;. Parsing will proceed as follows:
</p>
<div align="center">
    <table border="1" cellpadding="0" cellspacing="0">
        <thead>
            <tr>
                <th>Step</th>
                <th colspan="2">Action</th>
                <th>Partial parse</th>
                <th>Remainder of string</th>
            </tr>
        </thead>
        <tbody>
        <tr>
            <td>1</td>
            <td colspan="2">start</td>
            <td><tt></tt></td>
            <td><tt>(2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>2</td>
            <td>shift</td>
            <td><tt>&quot;(&quot;</tt></td>
            <td><tt>&quot;(&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        <tr>
            <td>3</td>
            <td>shift</td>
            <td><tt>&quot;2&quot;</tt></td>
            <td><tt>&quot;(&quot; &quot;2&quot;</tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        <tr>
            <td>4</td>
            <td>reduce</td>
            <td><tt>N &rarr; &quot;2&quot;</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">N</font></tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        <tr>
            <td>5</td>
            <td>shift</td>
            <td><tt>&quot;&times;&quot;</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot;</tt></td>
            <td><tt>3)</tt></td>
        </tr>
        <tr>
            <td>6</td>
            <td>shift</td>
            <td><tt>&quot;3&quot;</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; &quot;3&quot;</tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>7</td>
            <td>reduce</td>
            <td><tt>N &rarr; &quot;3&quot;</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">N</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>8</td>
            <td>reduce</td>
            <td><tt>M &rarr; N</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">M</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>9</td>
            <td>reduce</td>
            <td><tt>M &rarr; N &quot;&times;&quot; M</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">M</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>10</td>
            <td>reduce</td>
            <td><tt>A &rarr; M</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">A</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>11</td>
            <td>reduce</td>
            <td><tt>E &rarr; A</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">E</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        <tr>
            <td>12</td>
            <td>shift</td>
            <td><tt>&quot;)&quot;</tt></td>
            <td><tt>&quot;(&quot; E &quot;)&quot;</tt></td>
            <td></td>
        </tr>
        <tr>
            <td rowspan="2">13</td>
            <td>reduce</td>
            <td><tt>E &rarr; &quot;(&quot; E &quot;)&quot;</tt></td>
            <td><tt><font color="blue">E</font></tt></td>
            <td></td>
        </tr>
        <tr>
            <td colspan="2">accept</td>
            <td colspan="2"></td>
        </tr>
        </tbody>
    </table>
</div>
<p>
    If&mdash;as a side-effect of parsing a string&mdash;one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:
</p>
<div align="center">
    <table border="1" cellpadding="0" cellspacing="0">
        <tr>
            <td colspan="3">
                <img src="http://www.ioreader.com/images/tdop/bottom-up-step0.png" id="tdop_bottom_up">
            </td>
        </tr>
        <tr>
            <td align="center"><button id="tdop_prev_bottom_up">prev</button></td>
            <td align="center"><button id="tdop_reset_bottom_up">reset</button></td>
            <td align="center"><button id="tdop_next_bottom_up">next</button></td>
        </tr>
    </table>
    <script>
    $("#tdop_bottom_up").imageSlider(11, "tdop_prev_bottom_up", "tdop_reset_bottom_up", "tdop_next_bottom_up");
    </script>
</div>
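<p>
    The shift and reduce actions traced above can be sketched in the same style. Again, correct guessing is simulated by searching: at every step the sketch tries each applicable reduction and then a shift, and backtracks on failure. The names <tt>bottom_up</tt> and <tt>accepts</tt> and the token format are assumptions for the example; practical bottom-up parsers use tables rather than this kind of search.
</p>

```python
# Productions in text order; "*" and "/" are ASCII stand-ins for the
# multiplication and division signs used in the grammar above.
GRAMMAR = [
    ("E", ("(", "E", ")")), ("E", ("A",)),
    ("A", ("M", "+", "A")), ("A", ("M", "-", "A")),
    ("A", ("-", "A")), ("A", ("M",)),
    ("M", ("N", "*", "M")), ("M", ("N", "/", "M")), ("M", ("N",)),
] + [("N", (str(n),)) for n in range(11)]


def bottom_up(partial, remainder, start="E"):
    """Return True if `partial` plus `remainder` can be reduced to `start`."""
    if partial == (start,) and not remainder:
        return True  # the whole string has been reduced to the start symbol
    for lhs, rhs in GRAMMAR:
        # Reduce: a suffix of the partial parse matches a right-hand side.
        # Right-hand sides are never empty, so the slice below is safe.
        if partial[-len(rhs):] == rhs:
            if bottom_up(partial[:-len(rhs)] + (lhs,), remainder, start):
                return True
    # Shift: move the next input token onto the end of the partial parse.
    return (bool(remainder)
            and bottom_up(partial + (remainder[0],), remainder[1:], start))


def accepts(text):
    """Recognize a space-separated token string."""
    return bottom_up((), tuple(text.split()))
```

<p>
    Note that nothing in <tt>bottom_up</tt> looks at the remainder of the string when choosing a reduction. Each decision is based only on the suffix of the partial parse, which is the &quot;local&quot; behavior described above.
</p>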
<h3>Left-Corner Parsing</h3>
<p>
    <a href="http://cs.union.edu/~striegnk/courses/nlp-with-prolog/html/node53.html" title="Left-corner parsing">Left-corner parsing</a> (LC) is a parsing technique that makes decisions based on top-down and bottom-up information.
</p>
<p>
    In the case of the bottom-up parser above, it appears that we were lucky that the sequence of shifts and reductions ended up reducing the entire string to an <tt>E</tt>. Strictly speaking, the goal of the above bottom-up parser was exactly that: reduce a string to <tt>E</tt>. If our expression were very long, then it wouldn&apos;t be clear until near the end of a bottom-up parse that our parser might have a chance of reaching its goal of <tt>E</tt>. 
</p>
<p>
    An LC parser attempts to satisfy multiple goals, including the end goal of reducing the string to <tt>E</tt>. An LC parser predicts sub-structures present in the remainder of the string, and attempts to parse those sub-structures bottom-up. But the prediction step sets up expectations about the structure of unseen parts of the string, which is a top-down approach.
</p>
<p>
    In fact, LC parsers alternate between bottom-up and top-down parsing. Alternation is possible because an LC parser maintains a list of goals (analogous to our top-down expectations), a list of predictions, and a partial parse of the input string (as in a bottom-up parser). An LC parser operates on its input string and these three lists in the following way:
</p>
<ol>
    <li>
        Repeat:
        <ol>
            <li>
                If the head of the goal list is a terminal, then <strong>consume</strong> the terminal and shift the first symbol of the remainder of the input string onto the end of the partial parse. If the goal terminal does not match the first symbol of the string then reject.
                <br><br>
                If the head of the goal list is a non-terminal, then attempt to <strong>reduce</strong> a suffix of the partial parse to the goal non-terminal. If such a reduction is possible, then remove the non-terminal from the head of the goal list and update the partial parse accordingly.
                <br><br>
                This step is repeated until the goal list remains unchanged.
            </li>
            <li>
                If <em>&beta;</em> is the last symbol of the partial parse, then find a production of the form &quot;<em>&alpha; &rarr; &beta; &gamma;</em>&quot; where <em>&gamma;</em> is a string of zero-or-more symbols. <em>&beta;</em> is said to be a <strong>left corner</strong> of <em>&alpha;</em>. Left corners can be both terminals and non-terminals. If we weren&apos;t restricting ourselves to &epsilon;-free CFGs, then left corners would not necessarily appear immediately following the &quot;&rarr;&quot;!
                <br><br>
                Place <em>&gamma;</em> and <em>&alpha;</em> on the head of the goal list, so that the first symbol (if any) of <em>&gamma;</em> is our next goal.
                <br><br>
                If the goals list is changed then return to step 1.1.
            </li>
            <li>
                If neither of the previous two steps changed the goals list, then <strong>shift</strong> a symbol from the remainder of the input string onto the end of the partial parse.
                <br><br>
                If no such symbol can be shifted, then reject the string. Otherwise, return to step 1.2.
            </li>
        </ol>
    </li>
    <li>
        Stop when the goal list is empty.
    </li>
</ol>
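<p>
    For comparison, here is a minimal sketch of a left-corner recognizer. It is not a transcription of the list-based procedure above. Instead it uses the recursive formulation commonly found in textbooks, where the goal list is implicit in the call stack, and it simulates correct guessing with backtracking. The function names are hypothetical.
</p>

```python
# Productions in text order; "*" and "/" are ASCII stand-ins for the
# multiplication and division signs used in the grammar above.
GRAMMAR = [
    ("E", ("(", "E", ")")), ("E", ("A",)),
    ("A", ("M", "+", "A")), ("A", ("M", "-", "A")),
    ("A", ("-", "A")), ("A", ("M",)),
    ("M", ("N", "*", "M")), ("M", ("N", "/", "M")), ("M", ("N",)),
] + [("N", (str(n),)) for n in range(11)]


def parse(goal, tokens, pos):
    """Yield every position at which `goal` can end when started at `pos`."""
    if pos == len(tokens):
        return
    # Bottom-up step: shift one token and treat it as a left corner.
    yield from complete(tokens[pos], goal, tokens, pos + 1)


def complete(corner, goal, tokens, pos):
    """Grow a recognized `corner` upward until it becomes `goal`."""
    if corner == goal:
        yield pos
    for lhs, rhs in GRAMMAR:
        if rhs[0] == corner:
            # Top-down step: the rest of the right-hand side becomes goals.
            for end in parse_sequence(rhs[1:], tokens, pos):
                yield from complete(lhs, goal, tokens, end)


def parse_sequence(symbols, tokens, pos):
    """Yield every position reachable by parsing `symbols` in order."""
    if not symbols:
        yield pos
        return
    for mid in parse(symbols[0], tokens, pos):
        yield from parse_sequence(symbols[1:], tokens, mid)


def accepts(text):
    """Recognize a space-separated token string with goal E."""
    tokens = text.split()
    return any(end == len(tokens) for end in parse("E", tokens, 0))
```

<p>
    In this sketch <tt>complete</tt> plays the role of the corner step: it finds productions whose right-hand side begins with the symbol just recognized. The calls to <tt>parse_sequence</tt> correspond to placing <em>&gamma;</em> on the goal list.
</p>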
<p>
    For example, suppose we want to parse &quot;<tt>(2 &times; 3)</tt>&quot;. Parsing will proceed as follows:
</p>
<div align="center">
    <table border="1" cellpadding="0" cellspacing="0">
        <thead>
            <tr>
                <th>Step</th>
                <th colspan="2">Action</th>
                <th>Goals</th>
                <th>Partial parse</th>
                <th>Remainder of string</th>
            </tr>
        </thead>
        <tbody>
        <tr>
            <td>1</td>
            <td colspan="2">start</td>
            <td><tt></tt></td>
            <td></td>
            <td><tt>(2 &times; 3)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">2</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list<br>
                    (1.2) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            
            <td colspan="2">shift</td>
            <td><tt></tt></td>
            <td><tt>&quot;(&quot;</tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">3</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            
            <td>corner</td>
            <td><tt><font color="red">E</font> &rarr; <font color="blue">&quot;(&quot;</font> <font color="green">E &quot;)&quot;</font></tt></td>
            <td><tt><font color="green">E &quot;)&quot;</font> <font color="red">E</font> </tt></td>
            <td><tt><font color="blue">&quot;(&quot;</font></tt></td>
            <td><tt>2 &times; 3)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">4</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list<br>
                    (1.2) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            
            <td>shift</td>
            <td><tt><font color="blue">&quot;2&quot;</font></tt></td>
            <td><tt>E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">&quot;2&quot;</font></tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">5</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            
            <td>corner</td>
            <td><tt><font color="red">N</font> &rarr; <font color="blue">&quot;2&quot;</font></tt></td>
            <td><tt><font color="red">N</font> E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">&quot;2&quot;</font></tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        
        <tr>
            <td>6</td>
            <td>reduce</td>
            <td><tt><font color="blue">N</font> &rarr; &quot;2&quot;</tt></td>
            <td><tt>E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">N</font></tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">7</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            <td>corner</td>
            <td><tt><font color="red">M</font> &rarr; <font color="blue">N</font> <font color="green">&quot;&times;&quot; M</font></tt></td>
            <td><nobr><tt><font color="green">&quot;&times;&quot; M</font> <font color="red">M</font> E &quot;)&quot; E</tt></nobr></td>
            <td><tt>&quot;(&quot; <font color="blue">N</font></tt></td>
            <td><tt>&times; 3)</tt></td>
        </tr>
        
        <tr>
            <td>8</td>
            <td>consume</td>
            <td><tt><font color="blue">&quot;&times;&quot;</font></tt></td>
            <td><tt>M M E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; N <font color="blue">&quot;&times;&quot;</font></tt></td>
            <td><tt>3)</tt></td>
        </tr>

        <tr>
            <td rowspan="2">9</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list<br>
                    (1.2) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            <td>shift</td>
            <td><tt><font color="blue">&quot;3&quot;</font></tt></td>
            <td><tt>M M E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">&quot;3&quot;</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        </tbody>
        <thead>
            <tr>
                <th>Step</th>
                <th colspan="2">Action</th>
                <th>Goals</th>
                <th>Partial parse</th>
                <th>Remainder of string</th>
            </tr>
        </thead>
        <tbody>
        <tr>
            <td rowspan="2">10</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            <td>corner</td>
            <td><tt><font color="red">N</font> &rarr; <font color="blue">&quot;3&quot;</font></tt></td>
            <td><tt><font color="red">N</font> M M E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">&quot;3&quot;</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td>11</td>
            <td>reduce</td>
            <td><tt><font color="blue">N</font> &rarr; &quot;3&quot;</tt></td>
            <td><tt>M M E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">N</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td>12</td>
            <td>reduce</td>
            <td><tt><font color="blue">M</font> &rarr; N</tt></td>
            <td><tt>M E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; N &quot;&times;&quot; <font color="blue">M</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td>13</td>
            <td>reduce</td>
            <td><tt><font color="blue">M</font> &rarr; N &quot;&times;&quot; M</tt></td>
            <td><tt> E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">M</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td rowspan="2">14</td>
            <td colspan="5">
                <small>
                    (1.1) no change to goals list
                </small>
            </td>
        </tr>
        <tr>
            <td>corner</td>
            <td><tt><font color="red">A</font> &rarr; <font color="blue">M</font></tt></td>
            <td><tt><font color="red">A</font> E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">M</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td>15</td>
            <td>reduce</td>
            <td><tt><font color="blue">A</font> &rarr; M</tt></td>
            <td><tt>E &quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">A</font></tt></td>
            <td><tt>)</tt></td>
        </tr>
        
        <tr>
            <td>16</td>
            <td>reduce</td>
            <td><tt><font color="blue">E</font> &rarr; A</tt></td>
            <td><tt>&quot;)&quot; E</tt></td>
            <td><tt>&quot;(&quot; <font color="blue">E</font></tt></td>
            <td><tt>)</tt></td>
        </tr>

        <tr>
            <td>17</td>
            <td>consume</td>
            <td><tt><font color="blue">&quot;)&quot;</font></tt></td>
            <td><tt>E</tt></td>
            <td><tt>&quot;(&quot; E <font color="blue">&quot;)&quot;</font></tt></td>
            <td></td>
        </tr>

        <tr>
            <td rowspan="2">18</td>
            <td>reduce</td>
            <td><nobr><tt><font color="blue">E</font> &rarr; &quot;(&quot; E &quot;)&quot;</tt></nobr></td>
            <td></td>
            <td><tt><font color="blue">E</font></tt></td>
            <td></td>
        </tr>
        <tr>
            <td colspan="5">accept</td>
        </tr>
        </tbody>
    </table>
</div>
<p>
    If&mdash;as a side-effect of parsing a string&mdash;one wanted to build a parse tree, then the order of constructing nodes in the parse tree would be as follows:
</p>
<div align="center">
    <table border="1" cellpadding="0" cellspacing="0">
        <tr>
            <td colspan="3">
                <img src="http://www.ioreader.com/images/tdop/left-corner-step0.png" id="tdop_lc">
            </td>
        </tr>
        <tr>
            <td align="center"><button id="tdop_prev_lc">prev</button></td>
            <td align="center"><button id="tdop_reset_lc">reset</button></td>
            <td align="center"><button id="tdop_next_lc">next</button></td>
        </tr>
    </table>
    <script>
    $("#tdop_lc").imageSlider(17, "tdop_prev_lc", "tdop_reset_lc", "tdop_next_lc");
    </script>
</div>
<p>
    Compared to the other two methods, this seems like a lot of work for nothing! Also, there is some amount of magic happening: recall that we are operating under the assumption that every action taken will be the correct one. In practice, one constructs a table and &quot;cheats&quot; when deciding which actions to take.
</p>
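<p>
    One table commonly used by left-corner parsers is the left-corner relation itself: for each non-terminal, the set of symbols that can appear as its left corner, directly or through a chain of productions. Whether this is the table meant above is an assumption. The sketch below computes it for the expression grammar, with the hypothetical name <tt>left_corner_table</tt>.
</p>

```python
# Productions in text order; "*" and "/" are ASCII stand-ins for the
# multiplication and division signs used in the grammar above.
GRAMMAR = [
    ("E", ("(", "E", ")")), ("E", ("A",)),
    ("A", ("M", "+", "A")), ("A", ("M", "-", "A")),
    ("A", ("-", "A")), ("A", ("M",)),
    ("M", ("N", "*", "M")), ("M", ("N", "/", "M")), ("M", ("N",)),
] + [("N", (str(n),)) for n in range(11)]


def left_corner_table(grammar):
    """Map each non-terminal to every symbol that can be its left corner."""
    table = {}
    for lhs, rhs in grammar:
        # Each non-terminal is trivially a left corner of itself, and the
        # first symbol of each right-hand side is a direct left corner.
        table.setdefault(lhs, {lhs}).add(rhs[0])
    changed = True
    while changed:  # repeat until no set grows (a fixed point)
        changed = False
        for corners in table.values():
            for sym in list(corners):
                # Left corners of a left corner are left corners too.
                extra = table.get(sym, set()) - corners
                if extra:
                    corners |= extra
                    changed = True
    return table
```

<p>
    With such a table, a parser whose current goal is <tt>M</tt> can skip any corner production whose left-hand side is <tt>A</tt> or <tt>E</tt>, because neither symbol is in the entry for <tt>M</tt>.
</p>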
<h3>Summary</h3>
<p>
    Top-down and bottom-up parsing were covered to set the stage for left-corner parsing and the TDPL, which provide context for the behavior of TDOP parsers. My next post will go into TDOP and how it relates to left-corner parsing and the TDPL.
</p>]]></description>
    </item>
        <item>
                <title><![CDATA[Symbolic Interpretation]]></title>
        <link>http://www.ioreader.com/2012/04/07/symbolic-interpretation</link>
        <comments>http://www.ioreader.com/2012/04/07/symbolic-interpretation#comments</comments>
        <pubDate>Sat, 07 Apr 2012 23:09:05 GMT</pubDate>
        <dc:creator>Peter Goodman</dc:creator>
                        <category><![CDATA[Compilers]]></category>
                                <category><![CDATA[Interpreters]]></category>
                                <category><![CDATA[Optimization]]></category>
                        <guid isPermaLink="false">2q</guid>
        <description><![CDATA[<p>
Recently I worked on a project for my <a href="http://www.eecg.toronto.edu/~tsa/homepage/TeachingPage.htm">Optimizing Compilers course</a>. The purpose of this project was to implement <a href="http://en.wikipedia.org/wiki/Loop-invariant_code_motion">Loop-invariant Code Motion</a> and any other compiler optimizations of our choosing. The project is competitive because one's mark is based on how much one's compiler improves the mean execution time on a small set of static, pre-determined test cases. Given that the test cases do not change, it is natural to specialize one's optimizations to the code being tested. Realistically, this might not be the best approach as code tends to change and compiler optimizations are not always transparent.
</p>

<h3>Optimizations</h3>
<p>
So far I have implemented the following optimizations. This post will focus on the last optimization, symbolic interpretation (labeled EVAL).
</p>
<dl>
	<dt>CP</dt>
	<dd><a href="http://en.wikipedia.org/wiki/Copy_propagation" title="Copy propagation compiler optimization">Copy propagation</a></dd>

	<dt>CF</dt>
	<dd><a href="http://en.wikipedia.org/wiki/Constant_folding" title="Constant folding compiler optimization">Constant folding</a> (with local constant propagation)</dd>

	<dt>LICM</dt>
	<dd><a href="http://en.wikipedia.org/wiki/Loop-invariant_code_motion" title="Loop-invariant code motion compiler optimization">Loop-invariant code motion</a></dd>

	<dt>DCE</dt>
	<dd><a href="http://en.wikipedia.org/wiki/Dead_code_elimination" title="Dead code elimination compiler optimization">Dead code elimination</a> (with unreachable code elimination, block merging, and local constant de-duplication)</dd>

	<dt>CSE</dt>
	<dd><a href="http://en.wikipedia.org/wiki/Common_subexpression_elimination" title="Common subexpression elimination compiler optimization">Common subexpression elimination</a></dd>

	<dt>EVAL</dt>
	<dd>Symbolic interpretation (based on <a href="http://en.wikipedia.org/wiki/Abstract_interpretation" title="Abstract interpretation">abstract interpretation</a>)</dd>
</dl>
<p>
These optimizations were arranged into the following pipeline, where dashed edges are followed when a pass changes something and solid edges are followed when no changes are made:
</p>
<p align="center">
<img src="http://www.ioreader.com/images/optimizer-pipeline.png" alt="Pipeline of optimization passes" />
</p>
<h3>SimpleSUIF</h3>
<p>
This project uses Stanford's <a href="http://suif.stanford.edu/suif/suif1/" title="The SUIF 1.x Compiler System">SimpleSUIF</a> compiler infrastructure. SimpleSUIF's intermediate representation (<abbr title="Intermediate Representation">IR</abbr>) is a linked list of instructions, including such things as basic arithmetic, bitwise operators, memory/constant load/store, and calling/branching operations. The IR is register based, with three register classes: machine, pseudo, and temporary. For our purposes, machine registers are never used. Temporary registers represent single-definition and single-use registers, where both the definition and use (if any) must reside in the same <a href="http://en.wikipedia.org/wiki/Basic_block" title="Basic block">basic block</a>. Temporary registers often hold loaded constants. Pseudo registers behave like general purpose registers. Finally, all registers are typed.
</p>
<p>
One quirk of how we use SimpleSUIF is that there is no apparent way to access the IR for an arbitrary function within the same compilation unit. As such, <a href="http://en.wikipedia.org/wiki/Interprocedural_optimization" title="Interprocedural compiler optimization">interprocedural optimizations</a> such as function inlining and compile-time execution are not possible. This was unfortunate as there was one particular test case that would have benefitted from interprocedural optimization.
</p>
<h3>Test case</h3>
<p>
Below is one of the functions in the test case of interest. Two lines are struck out because the dead code elimination optimization pass regards them as useless.
</p>
<pre class="code">
float f1(float b, float c){
   int i;
   float j, k;

   <strike>j = c;</strike>
   for(i = 0; i &lt; 2; i++) {
      k = b * i;
      <strike>j += k;</strike>
   }

   return k;
}
</pre>
<p>
Looking closely at this example, it is clear that only the initialization of <tt>i</tt> to <tt>0</tt>, the last iteration of the loop, and the value of <tt>b</tt> are important to the output of <tt>f1</tt>. However, this is difficult to tell from the perspective of the IR without running through the program. With more information (e.g. about loop induction variables or loop dependencies), we might be able to make smarter decisions, but only in some really restricted cases. Unfortunately, it's not clear <em>how</em> one should go about &quot;executing&quot; this program in the absence of a particular value for <tt>b</tt>. This is where symbolic interpretation comes in.
</p>
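<p>
As a quick sanity check of that claim, here is a Python transliteration of <tt>f1</tt> next to the reduced form it collapses to. This is only an illustration at the source level, not the IR the optimizer sees, and the name <tt>f1_reduced</tt> is my own.
</p>

```python
def f1(b, c):
    """Direct transliteration of the C function above."""
    j = c              # dead: j never reaches the return value
    for i in range(2):
        k = b * i
        j += k         # dead for the same reason
    return k


def f1_reduced(b):
    """Only the last iteration (i == 1) and the value of b matter."""
    return b * 1
```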
<h3>Symbolic interpretation</h3>
<p>
Symbolic interpretation is similar to <a href="http://en.wikipedia.org/wiki/Global_value_numbering" title="Local value numbering">local value numbering</a> in that we operate on concrete and symbolic values. For simplicity, I restricted this optimization pass to a subset of the provably <a href="http://en.wikipedia.org/wiki/Pure_function" title="Pure functions">pure functions</a>. Because information about other functions was absent, I considered a pure function to be any function that does not:
</p>
<ul>
<li>Load from or store to a memory location.</li>
<li>Call any functions. Note: this constraint can be relaxed in the case of a recursive function call. The test cases I focused on did not include recursive function calls; however, this method can easily be extended to apply to that case.</li>
<li>Copy from one memory location to another memory location.</li>
</ul>
<p>
Thus, a function is considered pure if it depends only on constants, local variables, and function arguments, and performs no operation that could generate a side-effect.
</p>
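<p>
A minimal sketch of that purity test, assuming the IR is available as a list of tuples whose first element is the opcode. The mnemonics <tt>lod</tt>, <tt>str</tt>, <tt>mcpy</tt>, and <tt>cal</tt> are placeholders for the load, store, memory copy, and call operations; the real SimpleSUIF names may differ.
</p>

```python
# Hypothetical mnemonics for load, store, memory copy, and call.
IMPURE_OPCODES = frozenset({"lod", "str", "mcpy", "cal"})


def is_pure(instructions):
    """Return True if no instruction loads, stores, copies memory, or calls."""
    return all(opcode not in IMPURE_OPCODES for opcode, *_ in instructions)


# A fragment in the style of f1, and the same fragment with a call added.
f1_like = [
    ("ldc", "t6", 0),
    ("cpy", "r3", "t6"),
    ("mul", "t8", "r1", "r3"),
    ("ret", "t8"),
]
with_call = f1_like[:-1] + [("cal", "t9", "g"), ("ret", "t9")]
```

<p>
Under these assumptions, <tt>is_pure(f1_like)</tt> holds and <tt>is_pure(with_call)</tt> does not. Relaxing the rule for recursive calls would mean checking the callee of each call rather than rejecting every call outright.
</p>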
<p>
The following control-flow graph (which omits some edges because I am lazy with SVG) is an interactive symbolic executor of the SimpleSUIF-like IR representing the above function. Below I describe how each step of the evaluator is performed.
</p>
<div align="center" style="display:none;" id="si_table">
  <table border="1">
    <tr>
      <td colspan="2" align="center">
        <button id="si_next">Next</button>
        <button id="si_reset">Reset</button>
      </td>
    </tr>
    <tr>
      <td id="si_state" width="110"></td>
      <td height="400" width="240"><div id="si_blocks"></div></td>
    </tr>
  </table>
<script type="text/javascript" src="http://www.ioreader.com/js/d3.v2.min.js"></script>
<script type="text/javascript" src="http://www.ioreader.com/js/si.js"></script>
</div>
<noscript>
<p>There is supposed to be a cool symbolic interpreter simulator here but javascript is required to see it.</p>
</noscript>
<p>
The symbolic interpreter behaves similarly to something that performs a combination of constant folding and constant propagation, with the exception that when an operation is performed on an expression containing a symbol, a new symbol is generated.
</p>
<p>For example, if one performs an <tt>ldc</tt> operation to load the constant <tt>0</tt> into register <tt>t6</tt>, then we can assign to <tt>t6</tt> the value <tt>0</tt>. If a copy (<tt>cpy</tt>) operation is performed, then the value of the right-hand register is assigned to be the new value of the left-hand register. For example, <tt>cpy r3 = t6</tt> assigns to <tt>r3</tt> the value <tt>0</tt>.
</p>
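<p>
The two rules above can be sketched as updates to a register environment. The tuple encoding of instructions and the name <tt>step</tt> are assumptions made for illustration, not part of SimpleSUIF.
</p>

```python
def step(env, instr):
    """Apply one ldc or cpy instruction to the register environment."""
    opcode, dest, src = instr
    if opcode == "ldc":
        env[dest] = src        # src is the literal constant being loaded
    elif opcode == "cpy":
        env[dest] = env[src]   # src is the right-hand register
    else:
        raise ValueError("unsupported opcode: " + opcode)
    return env


env = {}
for instr in [("ldc", "t6", 0), ("cpy", "r3", "t6")]:
    step(env, instr)
```

<p>
After both instructions run, <tt>env</tt> maps <tt>t6</tt> and <tt>r3</tt> to <tt>0</tt>.
</p>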
<p>
Sometimes a register is used before it is defined. For example, <tt>r1</tt> in <tt>mul t8 = r1, r3</tt> is never defined in the above code. This is because <tt>r1</tt> represents one of the arguments to the function. In this case, <tt>r1</tt> is given a new symbolic value that is distinct from every other symbolic value. In the above simulator, the symbolic value assigned to <tt>r1</tt> is named <tt>r1</tt>. Being able to identify the &quot;origin&quot; of a symbolic value will be useful for code generation.
</p>
<p>
When a symbolic value participates in an expression, as in <tt>mul t8 = r1, r3</tt>, a new and unique symbolic value is generated that represents the expression. If any of the components of the expression are constants (known at compile time) then we want to store those constants as part of the symbolic expression. For example, in the first iteration of the loop, <tt>t8</tt> is assigned the symbolic expression <tt>r1 * 0</tt>. In the second iteration of the loop, <tt>t8</tt> is assigned the symbolic expression <tt>r1 * 1</tt>.
</p>
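<p>
Here is a rough sketch of how undefined registers and <tt>mul</tt> might be handled. The class names <tt>Sym</tt> and <tt>Expr</tt> and the helpers <tt>read</tt> and <tt>mul</tt> are hypothetical. As a simplification, expressions here compare structurally rather than each being given a unique identity.
</p>

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sym:
    """A symbolic value named after the register it originated from."""
    name: str


@dataclass(frozen=True)
class Expr:
    """A symbolic expression; operands are Sym, Expr, or constants."""
    op: str
    left: object
    right: object


def read(env, reg):
    """Read a register, giving it a symbol named after itself if undefined."""
    if reg not in env:
        env[reg] = Sym(reg)
    return env[reg]


def mul(env, dest, a, b):
    """Evaluate mul dest = a, b: fold constants, else build an expression."""
    x, y = read(env, a), read(env, b)
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        env[dest] = x * y
    else:
        env[dest] = Expr("*", x, y)   # known constants stay in the expression
    return env[dest]


env = {"r3": 0}                        # first iteration: i is 0
first = mul(env, "t8", "r1", "r3")
env["r3"] = 1                          # second iteration: i is 1
second = mul(env, "t8", "r1", "r3")
```

<p>
With these definitions, <tt>first</tt> stands for <tt>r1 * 0</tt> and <tt>second</tt> for <tt>r1 * 1</tt>, matching the two iterations described above.
</p>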
<p>
Something not touched on in this example is a branch that depends on a symbolic value. In this case, we cannot follow the branch as we don't know in which direction it will go at runtime. We are concerned with cases in which we can <em>statically</em> determine the direction of the branch.
</p>
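<p>
One plausible way to express that check, assuming the branch condition lives in a register and that a non-zero constant means the branch is taken. The function name and the convention of returning <tt>None</tt> to signal "give up" are assumptions for this sketch.
</p>

```python
def branch_direction(env, cond_reg):
    """Return True or False if the condition is a known constant, else None."""
    value = env.get(cond_reg)
    if isinstance(value, (int, float)):
        return value != 0
    return None  # symbolic or undefined: the direction is unknown statically


env = {"t7": 1, "t9": ("*", "r1", 1)}
known = branch_direction(env, "t7")
unknown = branch_direction(env, "t9")
```

<p>
A caller would keep interpreting along the chosen edge when the result is a boolean, and abandon the optimization for this function when it is <tt>None</tt>.
</p>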
<h3>Code Generation</h3>
<p>
The focus of symbolic evaluation has been to end up with some symbolic or constant expression for each register. In fact, for this optimization, only the returned register (<tt>r5</tt>) ends up being useful. If the returned register contains a constant value then the function is necessarily constant, and so the function's code can be replaced with a <tt>ldc</tt> followed by a <tt>ret</tt>.
</p>
<p>
In the case that the returned register is a symbolic expression, we can walk the expression tree and output for each subexpression the instructions needed to compute that subexpression. The leaves of the expression tree will be symbolic register values (named according to their register) or constants.
</p>
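<p>
A small sketch of such a tree walk, assuming expressions are encoded as nested tuples of the form <tt>(opcode, left, right)</tt>, symbolic leaves as register-name strings, and constants as plain numbers. The names <tt>emit</tt> and <tt>generate</tt>, and the numbering of fresh temporaries from <tt>t1</tt>, are choices made for this example.
</p>

```python
import itertools


def emit(expr, code, fresh):
    """Post-order walk: append instructions for expr and return its register."""
    if isinstance(expr, str):             # symbolic leaf: reuse its register
        return expr
    if isinstance(expr, (int, float)):    # constant leaf: load into a temporary
        reg = "t%d" % next(fresh)
        code.append("ldc %s = %s" % (reg, expr))
        return reg
    op, left, right = expr                # interior node
    a = emit(left, code, fresh)
    b = emit(right, code, fresh)
    reg = "t%d" % next(fresh)
    code.append("%s %s = %s, %s" % (op, reg, a, b))
    return reg


def generate(expr):
    """Return the instruction sequence that computes and returns expr."""
    code = []
    result = emit(expr, code, itertools.count(1))
    code.append("ret " + result)
    return code


symbolic_case = generate(("mul", "r1", 1))
constant_case = generate(3)
```

<p>
The constant case degenerates to a <tt>ldc</tt> followed by a <tt>ret</tt>, as described earlier. A real implementation would also need to give each new register a type, which this sketch ignores.
</p>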
<p>
Using the above expression tree walking strategy, the symbolic expression of <tt>r5</tt> can be converted to the following sequence of instructions:
</p>
<pre class="code">
ldc t1 = 1
mul t2 = r1, t1
ret t2
</pre>
<p>
Here we have generated new registers to hold temporaries, but left symbolic registers alone. This new sequence of instructions takes the place of the old, larger sequence of instructions.
</p>]]></description>
    </item>
        <item>
                <title><![CDATA[Dr. Sheng Yu]]></title>
        <link>http://www.ioreader.com/2012/01/26/dr-sheng-yu</link>
        <comments>http://www.ioreader.com/2012/01/26/dr-sheng-yu#comments</comments>
        <pubDate>Fri, 27 Jan 2012 05:52:46 GMT</pubDate>
        <dc:creator>Peter Goodman</dc:creator>
                                <guid isPermaLink="false">2p</guid>
        <description><![CDATA[<p>
It is with great sadness that I report the passing of my friend, colleague, and mentor: <a href="http://www.csd.uwo.ca/People/sheng_yu.html">Dr. Sheng Yu</a>. I knew Sheng for the last three and a half years of his life. Sheng was twice my professor, twice my employer, and my undergraduate thesis supervisor.
</p>
<p>
Often I would pop in to Sheng's office on the third floor of Middlesex College at The University of Western Ontario. On his desks were towers of books and papers; it baffled me that they never fell. In his office, we would talk--sometimes for hours--about his past students and what they were up to, about parsing techniques, finite automata, regular languages and their operations, and object-oriented programming.
</p>
<p>
I prefaced each of our e-mail correspondences with the far too formal "Prof. Sheng Yu". Goodbye Prof. Sheng Yu; you will be missed.
</p>]]></description>
    </item>
        <item>
                <title><![CDATA[Comment System Now Working Again]]></title>
        <link>http://www.ioreader.com/2011/03/30/comment-system-now-working-again</link>
        <comments>http://www.ioreader.com/2011/03/30/comment-system-now-working-again#comments</comments>
        <pubDate>Thu, 31 Mar 2011 01:10:29 GMT</pubDate>
        <dc:creator>Peter Goodman</dc:creator>
                                <guid isPermaLink="false">2o</guid>
        <description><![CDATA[<p>
    It turns out that the comment system hasn't been working since I allowed for HTML in comments. I have now fixed the commenting system.
</p>]]></description>
    </item>
        <item>
                <title><![CDATA[Undergraduate Thesis Report Finished!]]></title>
        <link>http://www.ioreader.com/2011/03/30/undergraduate-thesis-report-finished</link>
        <comments>http://www.ioreader.com/2011/03/30/undergraduate-thesis-report-finished#comments</comments>
        <pubDate>Wed, 30 Mar 2011 17:27:01 GMT</pubDate>
        <dc:creator>Peter Goodman</dc:creator>
                        <category><![CDATA[Parsing Theory]]></category>
                                <category><![CDATA[Finite Automata]]></category>
                                <category><![CDATA[C++]]></category>
                                <category><![CDATA[Grail+]]></category>
                                <category><![CDATA[FLTL]]></category>
                        <guid isPermaLink="false">2n</guid>
        <description><![CDATA[<p>
    <strong>Update:</strong> Grail+ is now on <a href="http://www.github.org">GitHub</a> at <a href="https://github.com/pgoodman/Grail-Plus" title="Grail+ on GitHub">https://github.com/pgoodman/Grail-Plus</a>.
</p>
<p>
    Well, I've finally submitted my undergraduate thesis project's final report. My project was to develop the
    newest version of <a href="http://www.grailplus.org" title="Grail+">Grail+</a>. <em>Grail+</em> is a set of command line tools for manipulating non-deterministic finite automata (&epsilon;-NFAs), non-deterministic pushdown automata (&epsilon;-NPDAs), and context-free grammars (CFGs). <em>Grail+</em> is built on top of the Formal Language Template Library (FLTL), a library that I developed for representing and symbolically manipulating CFGs, &epsilon;-NFAs, and &epsilon;-NPDAs. Over the past several months I've worked hard and built Grail+ and the FLTL from the ground up. Together, they represent around <a href="https://www.ohloh.net/p/grailplus/analyses/latest">12,000</a> lines of C++.
</p>
<p>
    My report can be found <a href="http://www.petergoodman.me/docs/goodman-undergrad-thesis.pdf" title="Peter Goodman's Undergraduate Thesis - Grail+">here</a>. The report is 49 pages long. For anyone reading this blog, the most interesting part of the report is the implementation discussion. Unfortunately, I had to leave a lot out of the report as it is already quite long. As such, the API described in the report is incomplete and some of the interesting discussions were cut short.
</p>
<p>
    I have licensed Grail+ under the <a href="http://www.opensource.org/licenses/mit-license.php" title="MIT Open Source License">MIT License</a>. I am interested in collaborating with others to continue the development of the project. The source code of Grail+ can be found <a href="https://github.com/pgoodman/Grail-Plus" title="Grail+ Git Repository">here</a>.
</p>]]></description>
    </item>
    </channel>
</rss>

