Balancing Tags in HTML and XHTML Excerpts

It is fairly common to want to take an HTML source of variable length and display an excerpt. Although some formats, such as Atom and RSS, anticipate this and create a separate summary element, we don’t always have the luxury of using such a data source.

Creating an excerpt introduces a problem, though: if we create an excerpt based on a number of words or characters, we may end up with unbalanced HTML or even broken tags.
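
To see the problem concretely, here is a small sketch of naive character-based truncation. The sample string and the cut-off positions are made up for illustration; they are not taken from any real data source.

```javascript
// Hypothetical sample input; any well-formed XHTML snippet would do.
var source = "<p>Hello, <strong>world</strong></p>";

// Cutting at 14 characters lands in the middle of a tag name: a broken tag.
var brokenTag = source.substring(0, 14);   // "<p>Hello, <str"

// Cutting at 20 characters leaves only complete tags,
// but <p> and <strong> are never closed: unbalanced tags.
var unbalanced = source.substring(0, 20);  // "<p>Hello, <strong>wo"
```

Both kinds of damage need to be repaired before the excerpt can be displayed safely.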

One solution is to discard all tags and display plain text, but this is often unsatisfactory.

Here is my method of balancing tags. It assumes that the input is an excerpt of a valid XHTML snippet. The reason for this requirement has to do with self-closing tags, which I hope will be apparent from the description:

  1. Fix Broken Tags
    First we need to address any tags that are incomplete, e.g. <stro. Such a tag would appear only at the end of the excerpt, so we will go backwards through the excerpt, one character at a time. If we find a > before a <, then do nothing: there are no broken tags. If we find a < before a >, then we remove all characters from the < to the end of the excerpt.
  2. Fix Unbalanced Tags
    To do this, we make use of a stack. At a high level, we go through the excerpt from left to right. Every time we find an element start tag, we push it onto the stack. Every time we find an element end tag, we pop the top item off the stack. (Since we are assuming the input is well-balanced XHTML, the end tag should always match the popped start tag.)

    At the end of the input, the remaining items on the stack, if any, are the unbalanced tags. We can add the end tags for those at the end of our excerpt.

What about self-closing tags?
There are quite a few tags that don’t require end tags: br, img, hr, and more. We could handle those too by creating a list of such tags, but for my purposes I am going to require that they are closed within the tag, e.g. <br />.
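
For readers who do want the list-based alternative, here is a minimal sketch of how it might look. The names VOID_ELEMENTS, tagName, and needsNoEndTag are hypothetical and are not part of the function in this post. The list is my understanding of the void elements in the current HTML standard, so treat it as an assumption to check against the spec.

```javascript
// Assumed list of HTML void elements (elements that never take an end tag).
var VOID_ELEMENTS = ["area", "base", "br", "col", "embed", "hr", "img",
                     "input", "link", "meta", "source", "track", "wbr"];

// Hypothetical helper: extracts the lowercased element name from a tag
// such as '<img src="a.png">' or '</p>'. Returns "" if nothing matches.
function tagName(tag) {
  var match = /^<\/?([A-Za-z][A-Za-z0-9]*)/.exec(tag);
  return match ? match[1].toLowerCase() : "";
}

// Hypothetical helper: true if the tag needs no end tag, either because it
// is written in self-closing form (ends with "/>") or because its element
// name appears in the assumed list above.
function needsNoEndTag(tag) {
  return /\/>$/.test(tag) || VOID_ELEMENTS.indexOf(tagName(tag)) !== -1;
}
```

With a check like this, a start tag for which needsNoEndTag returns true would simply never be pushed onto the stack, so an unclosed <br> would not produce a stray </br> at the end of the excerpt.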

Here’s the function as written in JavaScript:

// balance:
// - takes an excerpted or truncated XHTML string
// - returns a well-balanced XHTML string
function balance(string) {
  // Check for broken tags, e.g. <stro
  // Check for a < after the last >, indicating a broken tag
  if (string.lastIndexOf("<") > string.lastIndexOf(">")) {
    // Truncate broken tag
    string = string.substring(0,string.lastIndexOf("<"));
  }

  // Check for broken elements, e.g. <strong>Hello, w
  // Get an array of all tags (start, end, and self-closing)
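  // match() returns null when there are no tags, so fall back to an empty array
  var tags = string.match(/<[^>]+>/g) || [];
  var stack = [];
  for (var i = 0; i < tags.length; i++) {
    var tag = tags[i];
    if (tag.charAt(1) === "/") {
      // end tag (starts with "</") -- pop off of the stack
      stack.pop();
    } else if (tag.charAt(tag.length - 2) === "/") {
      // self-closing tag (ends with "/>") -- do nothing
    } else {
      // start tag -- push onto the stack
      // (a "/" inside an attribute value, e.g. a URL, no longer confuses the check)
      stack.push(tag);
    }
  }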

  // stack should now contain only the start tags of the broken elements,
  // the most deeply-nested start tag at the top
  while (stack.length > 0) {
    // pop the unmatched tag off the stack
    var endTag = stack.pop();
    // get just the tag name
    endTag = endTag.substring(1,endTag.search(/[ >]/));
    // append the end tag
    string += "</" + endTag + ">";
  }

  // Return the well-balanced XHTML string
  return string;
}

Demo the balanceTags function.

I recently saw a call for submissions to CFLib.org, a ColdFusion code library, so I submitted a version of balanceTags for ColdFusion as well.
