RSS Feed

Peter Goodman's blog about PHP, Parsing Theory, C++, Functional Programming, Applications,

Using a Stack to Parse HTML with PHP5

Use Cases

Before we get into any of the code we should think about how a HTML / Template parser can possibly be useful. Well, with it, we can do several things: make a custom template engine, make a HTML code cleaner for such things as posts/comments on blogs/forums/etc, make a HTML Tidy script that will properly indent our html, and more.

Strategy

Off the bat it should be obvious that we're going to need to use one of PHP's perl compatible regular expression (PCRE) functions--these are denoted by the prefix "preg_". Thinking this through, we want to be able to capture every single tag in the document/text (hereafter: buffer). Lets assume we use a preg_match_all or a preg_replace. These will successfully find all of the tags but it won't get us the text in-between. You might be wondering: why is the text in between the tags important? Well, HTML and XML follow a tree structure, such as:

<root>
    <sub>
        <sub>
            text
        </sub>
    </sub>
</root>

If we only find the tags through a preg_match_all, then we will have no way of building a new fixed/compiled version of the buffer because we won't have any of the available text. If we use a preg_replace/preg_replace_callback then we can replace all of the tags properly but then making sure that all of the tags open and close in their proper order would be ugly and very involved.

There is another great PCRE function available to use that many people often don't use. This function is preg_split. This function will allow us to split up the buffer into text and markup nodes. Perfect! preg_split has a flags parameter that allows us to change the results that we recieve from it. The option that we will choose is "PREG_SPLIT_DELIM_CAPTURE", which the PHP Manual defines as

If this flag is set, parenthesized expression in the delimiter pattern will be captured and returned as well.
If that seems like jibberish, it simply means that the array returned from preg split will include both the things in-between the split regex pattern and what the pattern matched as well.

Alright, so now we know how we're going to find the tags and the text. Lets think about how we're actually going to parse them. We know that the markup will be in a tree-like format and most tags will open and close (there can also be non-closing tags). Lets look at this as a challenge. We have an array of opening and closing tags (as XML disperesed among array elements that have normal text. We could make a regular expressions to match opening and closing <sub> nodes but chances are that in a more complex tree the regular expression will not match the proper closing tags. Given that we are on the first <sub> tag and we want to find its </sub> and not the <sub> that's within it, we can parse it top-down as a stack. let me put this as a series of events:

  • found <sub>, push the first <sub> tag onto the stack
  • found <sub>, push the second <sub> tag onto the stack
  • found text
  • found </sub>, pop the second <sub> off of the stack
  • found </sub>, pop the first <sub> off of the stack

This clearly shows us that the last array_pop of the stack will represent the closing tag to our first <sub> node! Think this through a bit if it seems awkward. Now that we know how we want to parse the markup, lets get to it--with code!

Splitting up the Buffer

The regular expression we will use is quite simple: ~<(/?)([^>]*)>~ What is says is look for <, then see if there's a "/" after it (a closing tag), then keep going until it finds the first > but doesn't include it, then finish off the pattern with the >. If we run this preg split on the above XML, we get an array like:

$matches = preg_split("~<(/?)([^>]*)>~",$buffer,-1,PREG_SPLIT_DELIM_CAPTURE);
print_r($matches);
//output:
Array
(
    [0] => 
    [1] => 
    [2] => root

    [3] => 
    [4] => 
    [5] => sub

    [6] =>     
    [7] => 
    [8] => sub

    [9] => text
    [10] => /
    [11] => sub

    [12] => 
    [13] => /
    [14] => sub

    [15] => 
    [16] => /
    [17] => root

    [18] => 
)

Now I've split up the results of the print_r into groups of three so that we can notice a trend in the split (all preg_splits will return this type of thing). Every first index in these groups will be text. Every second will be either nothing or a / meaning that we are on a closing tag. Finally, every third element will be the inners of the tag itself. This means that in a for loop that increments $i from 0 to less than count($matches), $i modulo 3 will give us the following results..

$i % 3 == 0 // text node
$i % 3 == 1 // / or nothing tells us if this is a closing node
$i % 3 == 2 // everything in between < and >, exluding the / if present

Applying What we've Figured Out to the Split Up Buffer

This is extremely useful because now we have a way to set up our stack! Here's how we will do it:

$stack = array();
for($i = 0; $i &lt; count($matches); $i++)
{
    $key = $i % 3;

    // the $key is equal to 0, therefore this is text
    if($key == 0)
    {
        // *** do something with text here
    }

    // the $key must be greater than 0, so we're now dealing
    // dealing with parts of the tag itself.
    else
    {
        // lets make a boolean variable to keep track of whether or
        // the tag we are on in the array is a closing one or not.
        $closing = FALSE;

        // we've found a /, therefore this is a closing node        
        if($matches[$i] == &quot;/&quot;)
        {
            $closing = TRUE;

            // pop the node off of the stack.
            $node = array_pop($stack);

            // *** we've found the closing node, do something with it.
        }

        // this is not a closing node, therefore it must be an opening one
        if(!$closing)
        {
            // get the tag info from $matches[$i+1], we use the +1 because
            // the info for the tag is in the next array key and knowing this,
            // we can easily reduce the number of loops that this for loop has
            // to do.
            $tag = $matches[$i+1];

            // *** figure out if it's a normal opening node or a non-closing
            // *** node then push it onto the stack if we expect it to close.
        }

        // skip to the next text array key because in here
        // we will have dealt with $key == 1 and 2
        $i++;
    }
}

We've got the basics of the of how the parser will go over the split up buffer and interact with a stack. We know where nodes are going to be pushed onto the stack and where they will be popped off, which means that it's time to actually make a stack that can handle this.

Making a Stack

Here we will make a simple class to deal with adding nodes to a stack (push), taking nodes off a stack (pop), and figuring out what the last node added (i.e. the parent node) is. This functionality could have be easily done without the use of a class, but I felt like using it anyway.

Note: PHP already has a lot of functions built in to work with arrays as stacks so in reality we don't actually need the following class.

// a stack to keep track of non-text nodes
class NodeStack
{
    // the array that is the stack
    protected $stack = array();

    // keep track of what the current key of
    // the last element on the node stack is.
    // this starts at -1 because when we push
    // the first element on, it will become 0.
    protected $key = -1;

    // push an element on to the node stack
    final public function push(ClosingNode $node)
    {
        $this->stack[] = $node;
        $this->key++;
    }

    // pop the last element off of the node stack
    final public function pop()
    {       
        $ret = FALSE;
        $this->key--;
        return array_pop($this-&gt;stack);
    }

    // return the last element of the node stack
    final public function last()
    {
        return $this->stack[$this->key];
    }
}

When I say stack, what I mean is first-in-last-out. This means that the first node pushed on to the stack will be the last node to be popped off of the stack. As I went into earlier, this follows the tree structure of XML/HTML. In the push() function, you can see the that parameter $node trails ClosingNode. This is a way to ensure that the value passed to $node is an instance of ClosingNode. The last() function will simply return the last node on the stack, without popping it off the stack.

Dealing with the Nodes

We know that we will we running into three types of nodes: text nodes, closing nodes, and non-closing nodes. We can streamline this by simply storing the text nodes as text within their parent nodes. We now have closing and non-closing nodes left. We need to think about how we can represent a container for these. We will need to store the text nodes within them, i.e. a buffer, we will need to be able to parse the opening tags and also the closing tags. For non-closing nodes, we don't require a buffer and we also don't need to parse any closing tags. We can thus represent these two cases with abstract classes that our own custom tag handling classes will later extend.

// a class to handle all nodes.
abstract class ClosingNode
{
    protected $buffer = '';
    protected $name = '';

    // constructor
    public function __construct($name = '')
    {
        $this->name = $name;
    }

    // add text to our buffer
    final public function addTextToBuffer($buffer = '')
    {
        $this->buffer .= $buffer;
    }

    // parse this node
    final public function parseBuffer()
    {
        return $this->parseOpen($this->arguments) . $this->buffer . $this->parseClose();
    }

    // get this nodes tag name
    final public function getName()
    {
        return $this->name;
    }

    // abstract methods parseOpen and parseClose that
    // will be overwritten by extending classes
    abstract public function parseOpen();
    abstract public function parseClose();
}

// non-closing tags only require parseOpen functions,
// so we will define the parseClose method and make it
// final so that anything extending NonClosingNode doesn't
// need to define a parseClose method.
//
// another thing to note is that this class doesn't directly
// have any abstract methods, but needs to be defined as an
// abstract class because it extends ClosingNode and inherits
// the abstract parseOpen function.
abstract class NonClosingNode extends ClosingNode
{
    final public function parseClose()
    {
        return '';
    }
}

Although there is a lot going on here, all we really need to concern ourselves with are how these classes will be used. Each node that parse will be able to be parsed by its own class that extends either ClosingNode or NonClosingNode. When we run into text, we will use the addTextToBuffer function to add it into the parent nodes buffer. We will use this same method to add the results of parseBuffer to any parent nodes. This requires one simple thing: we always need a parent node! Fortunately, this is easy to accomplish and also has some benefits that I will get into later. Here are two node classes, one for unknown nodes and one for out root node.

// the root node
class RootNode extends ClosingNode
{
    public function parseOpen()
    {
        return '';
    }
    public function parseClose()
    {   
        return '';
    }
}

// an unknown node, execute as-if these are normal
// functions
class UnknownNode extends ClosingNode
{
    public function parseOpen()
    {
        return '';
    }
    public function parseClose()
    {   
        return '';
    }
}

Dealing with nodes was enough. Even if you don't immediately understand all the abstract, final, protected, and other keywords in there, it doesn't really matter. As long as you can recognize the basic class structure and that the only functions that any classes that extend Non/ClosingNode will have are parseOpen and possible parseClose.

Putting it Together

It's time to put the above classes to work together and fill in the blanks in the loop that we made to go over the parsed buffer. We will look at the code section-by-section to make sure that nothing is left out. Here we're going to start things off. The following code has to deal with splitting up the buffer and setting up our stack.

// split up the buffer into tags and text
$parts = preg_split("~<(/?)([^>]*)>~", $buffer, -1, PREG_SPLIT_DELIM_CAPTURE);

// create a node stack, this will keep track of
// non-text nodes, and push our root node onto
// it as the first node.
$stack = new NodeStack;
$stack->push(new RootNode(''));

// keep track of whether we are looping over a closing
// tag or not
$closing = FALSE;

The reason why we immediately push the RootNode onto our stack is because each time we loop over the parts of the buffer that were split up using preg_split, we will need to know what the last node added--which will be our parent node--is. You might also note the boolean $closing variable. It is there so that we can keep track of whether or not the current iteration is a closing tag (TRUE) or not (FALSE).

// loop through the split up parts of the buffer
for($i = 0; $i < count($parts); $i++)
{
	// get the last node added to the stack, the first
	// iteration will always have this as the root node
	$parent = $stack->last();
			
	// the results of this will tell us what type of 
	// node we are on
	$key = $i % 3;

We've now started to loop over the split up parts of the buffer. On the every iteration until we find a node, $parent will be an instance of RootNode.

// text nodes
if($key == 0)
{
    // add this text to the parent nodes buffer
    $parent->addTextToBuffer($parts[$i]);
    
    // reset closing to false so that future tags
    // don't get popped out for no reason.
    $closing = FALSE;
}

$key is equal to zero therefore this is just some text. We know what the parent node is, so we will add it to its buffer. This is essentially saying that when we call the parent node's parseBuffer() function, whatever has been added to its buffer will be put in between the return values of its parseOpen() and parseClose() function. We also set $closing to FALSE by default because given the case that the previous iteration was a closing node, this new iteration isn't necessarily a closing node yet so we need to change it.

else
{
    // split the tag up into its arguments
    $tag_parts = preg_split("~([^W]*)(=("|')?([a-z0-9s.-_]*)("|')?)~i", $parts[$i+1], -1, PREG_SPLIT_DELIM_CAPTURE);
    
    // try to get the tag name
    $tag_name = trim(strtolower($tag_parts[0]));

    // a closing tag has been found, pop it out of 
    // the stack
    if($parts[$i] == '/')
    {
        // tell the next iteration that this tag was
        // a closing tag
        $closing = TRUE;
                
        // need to pop the current node off the stack
        // otherwise we will end up adding text to itself
        $node = $stack->pop();
        
        // the new parent is the new last node in the stack
        $parent = $stack->last();
        
        // whoops, we have a finishing tag without a starting
        // one or a malformed one!
        if($node->getName() != $tag_name)
        {
            // reset the stack to where it was
            $stack->push($node);
            $parent = $stack->last();
            
            // add in a HTML error
            $parent->addTextToBuffer('<!-- BAD CLOSING TAG FOR ['. $this->tag_prefix . $tag_name .'] -->');
        }
        else {
            // parse the tag and add it to the parent node's buffer
            $parent->addTextToBuffer($node->parseBuffer());
        }
    }

There's a lot going on in here. First we split up the parts when $key == 2 (see splitting up the buffer), and that's why we use $parts[$i+1] because we will just handle $parts[$i] and $parts[$i+1] in the same loop when $key == 1 and 2. The tag parts will find the tag name (as $tag_parts[0]) and any tag arguments with their values. In the final version, there is code to deal with these arguments; however, they will not be included in this post. Getting the tag name is important because every time that we pop a node onto the stack (as you will see in the next code block), we pass the tag name, so that we can make sure that when we pop a node off the stack, it corresponds to the closing node that we've found. That's what the $node->getName() != $tag_name is for. $node is technically the parent node, because we've just popped it off of the stack, and the new parent node becomes $stack->last().

If we've found the wrong closing node, we will push $node back onto the stack so it again becomes the parent node, and with it, we will add text to the buffer saying that we've found the wrong closing node. In the case that $node-<getName() is equal to $tag_name, we will add the parsed version of this node as text to the parent node's buffer. This is the most important part. Please try to visualize this if it's not clear. We've found the closing tag for a node and we've already added text into its buffer, so if we take all of this in a ball and add it to the parent nodes buffer as text, we will be keeping the original order of the markup. This is the nature of parsing using a stack.

    // an opening tag, push it onto the stack
    if(!$closing)
    {
        // get the tag handler class name given the tag name
        $class_name = $this->getTagHandler($tag_name);
        
        // instanciate the appropriate node class
        $node = new $class_name($tag_name);
        
        // is this a non-closing tag? Parse it without ever
        // putting it on the stack                  
        if(trim($tag_parts[count($tag_parts)-1]) == '/')
            $parent->addTextToBuffer($node->parseBuffer());
        
        // normal opening tag
        else
            $stack->push($node);
    }
    
    // increment again to skip the next iteration
    $i++;
}

In the last code block, we looked for a / to find out if we are dealing with a closing node. If that was the case, $closing was set to TRUE and this code block won't be executed. However, if we've found an opening node, this is what will happen. First, we're going to get the class name of the tag handler for this node. I didn't show you the getTagHandler() function nor the mechanism to add tag handler, but rest assured they are quite simple. When we register a tag handler, we just pass it the tag name and it looks to see if a node class exists for that. If that's not the case, it will default to UnknownNode.

Remember how we split up the parts of the tag? Well, we want to see if the last element in that array, meaning the last part split off is a /. If it is, then we've found a non-closing node. Unlike closing nodes, we won't add this node to the stack because it doesn't encompass anything. We will simply parse it and add it to the parent nodes buffer. If this isn't a non-closing node, then we will add it to the stack. $i is incremented because in this one loop, we've dealt with $key as 1 and 2. This block of code concludes the for loop.

Earlier I alluded to an alternate benefit to having a RootNode. This benefit is the result of finding incorrect closing nodes. We will see that if the markup was originally malformed, then we might finish the loop with nodes other than the root node still on the stack! Obviously we need to deal with these extra nodes, and this next block of code shows how.

$temp_buffer = '';
while($node = $stack->pop())

    // is this the root node?
    if($node instanceof RootNode)
    {
        // add any text back into it from malformed
        // tags then parse it.
        $node->addTextToBuffer($temp_buffer);
        $parsed_buffer .= $node->parseBuffer();
        
        break;
    }
    
    // extract the buffers from the malformed tags
    // and put them back into the root node later.
    else
        $temp_buffer .= $node->extractBuffer();

// we're done! return the final parsed and put back together buffer     
return $parsed_buffer;

We will loop over the remaining nodes on the stack. If we pass the RootNode, then $stack->pop() will return FALSE and the loop will stop executing. First we check if the node popped off the stack is an instance of RootNode. If it is, then we've found what we're looking for and we're done. Otherwise, we will extract their buffers (a function that returns their buffers, this doesn't include the return values of parseOpen or parseClose) and add them to a temporary buffer. This temporary buffer is passed to the addTextToBuffer() function for the RootNode.

What's Next?

So we've gone through all of the important code. At the start I discussed some use cases and now that the code has been thoroughly examined, I'll explain how to accomplish some of those cases.

Code Cleaner

There are several ways we can accomplish this. We can use UnknownNode to clean all tags, or make specific node handler classes to deal with the ones we accept and then use UnknownNode to remove the rest. In the final code, there are functions to handle tag arguments and typecast their values.

Template Engine

This is the original use case for the parser. We can take a different approach to this by making a slight modification to the parsers code (this is in the final code) and allow us to recognize a prefix on all tags. To be simple, we can make all template tags namespaces XML tags, such as <fa:foo> and then have individual node classes to handle each of those. These node classes would return PHP that would either be eval'd or saved to a file for later inclusion.

Get the Code

You can get the final code to the template/HTML parser at http://ioreader.com/code/Template/parser.phps. There is also some other related code in the directory for making a template engine.


Comments

There are no comments, but you look like you have something on your mind.


Comment