Simple XML Parsing with SAX and DOM


XML has arrived. Configuration files, application file formats, even database access layers make use of XML-based documents. Fortunately, several high-quality implementations of the standard APIs for handling XML are available. Unfortunately, these APIs are large and therefore provide a formidable hurdle for the beginner.

In this article, I would like to offer an accessible introduction to the two most widely used APIs: SAX and DOM. For each API, I will show a sample application that reads an XML document and turns it into a set of Java objects representing the data in the document, a process known as XML "unmarshalling."

First, a word on style. For instructional purposes, I have kept the code as simple as possible. In order to focus on the basic usage of SAX and DOM, I completely omitted error handling and handling of XML namespaces, among other things. Furthermore, the code has not been tuned for flexibility or elegance; it may be dull, but hopefully it is also obvious.

The 60-Second XML Skinny

For those completely new to XML, I would like to review the most important terms and concepts used with XML data.

Each XML document starts with a prologue, followed by the actual document content. The prologue begins with an XML declaration, such as:

<?xml version="1.0" standalone="yes" ?>

The declaration must be at the very beginning of the document -- not even whitespace may precede it! It is followed by the document type declaration, which in the present case only names the root element (catalog), but in a real-world application would also provide a link to a constraint as provided by a Document Type Definition (DTD) or XML Schema document:

<!DOCTYPE catalog>

This concludes the prologue. The following body of the XML document is made up of elements, which take the role of (and look like) familiar HTML tags. Every element has a name, and may have an arbitrary number of attributes:

<catalog version="1.0">...</catalog>

Here catalog is the name of the element, which has one attribute named version, with value 1.0. In contrast to HTML, XML element names are case sensitive, and every element must be closed with a matching closing tag. Note that there must be no space between the opening angle bracket and the element name. If the element contains neither text nor other elements, the closing tag may be merged with the start tag (a so-called empty tag):

<catalog version="1.0" />

An element may include either text, or other elements, or a combination of both. Text may include entity references, similar to those in HTML. In short, an entity reference is a placeholder for another piece of data. Entity references are often used to include special characters, such as the angle brackets < and >, which are written as &lt; and &gt;. An entity reference consists of an ampersand, followed by the entity name and a semicolon:

&entityname;

XML elements have to be properly nested; in particular, the opening and closing tags of different elements must not overlap. In other words, an element's opening and end tags must reside in the same parent. This establishes a clear parent/child relationship among all elements of an XML document. Finally, the outermost element (the one following the prologue) is called the root element.

An element name may be qualified by an XML namespace prefix, yielding a qualified name, or qName. The prefix is a short name that is bound to a namespace, which in turn is identified by a Uniform Resource Identifier (URI). The prefix is followed by a colon and the local name:

prefix:localname

A document following these rules is syntactically well-formed. This is to be distinguished from its validity, which refers to adherence to the constraint laid out in the DTD or XML Schema document. Note that for a document that does not specify a constraint (such as the example document below), the concept of validity makes no sense.
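Well-formedness is what any non-validating parser checks, so it can be tested by parsing the text and watching for a fatal error. The following is a minimal sketch, not part of the example application: it assumes the JAXP factory classes bundled with the JDK (javax.xml.parsers), which this article does not otherwise use, and the class and method names are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, for illustration only.
class WellFormedCheck {

    // Returns true if the parser reads the text without a fatal error.
    static boolean isWellFormed( String xml ) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            try {
                // DefaultHandler ignores all content, but rethrows fatal errors
                parser.parse( new InputSource( new StringReader( xml ) ),
                              new DefaultHandler() );
                return true;
            } catch( SAXParseException exc ) {
                return false;
            }
        } catch( Exception exc ) {
            // configuration or I/O trouble, unrelated to the document itself
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.out.println( isWellFormed( "<a><b></b></a>" ) );  // properly nested
        System.out.println( isWellFormed( "<a><b></a></b>" ) );  // overlapping tags
        System.out.println( isWellFormed( "<A></a>" ) );         // names differ in case
    }
}
```

Note that this says nothing about validity: a document can pass this check and still violate its DTD or XML Schema.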

The XML Document and Data Objects

The document to read describes the catalog of a library. The catalog may contain an arbitrary number of books and magazines. Each book has a title and exactly one author. Each magazine has a name and may contain an arbitrary number of articles. Finally, each article has a headline and a starting page.


<?xml version="1.0"?>

<catalog library="somewhere">

  <book>
    <author>Author 1</author>
    <title>Title 1</title>
  </book>

  <book>
    <author>Author 2</author>
    <title>His One Book</title>
  </book>

  <magazine>
    <name>Mag Title 1</name>

    <article page="5">
      <headline>Some Headline</headline>
    </article>

    <article page="9">
      <headline>Another Headline</headline>
    </article>
  </magazine>

  <book>
    <author>Author 2</author>
    <title>His Other Book</title>
  </book>

  <magazine>
    <name>Mag Title 2</name>

    <article page="17">
      <headline>Second Headline</headline>
    </article>
  </magazine>

</catalog>

Note that the starting page is encoded as an attribute of the article element. This is done primarily to demonstrate the use of attributes, although it can be argued that this design decision is actually semantically justified, since the starting page of an article is information about the article, but not part of the article itself.

In the example text, the following elements (called "complex elements" for the purpose of this article) may contain other elements:

  • <catalog>
  • <book>
  • <magazine>
  • <article>

The "simple" elements are those that contain only text:

  • <author>
  • <title>
  • <name>
  • <headline>

There are no elements that contain both text and child elements simultaneously.

The complex elements are represented in the application code by classes, whereas the simple elements are java.lang.String member variables of these classes. Since the sole purpose of these classes is to bundle the data read from the document, their interface has been kept minimal: they can be instantiated, their data members can be set, and finally, they override the toString() method, so as to allow access to the data inside.


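// Needed by Catalog and Magazine. If all classes share one source file,
// every import statement must stand at the top of that file.
import java.util.Vector;

class Catalog {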
    private Vector books;
    private Vector magazines;

    public Catalog() {
	books = new Vector();
	magazines = new Vector();
    }

    public void addBook( Book rhs ) {
	books.addElement( rhs );
    }
    public void addMagazine( Magazine rhs ) {
	magazines.addElement( rhs );
    }

    public String toString() {
	String newline = System.getProperty( "line.separator" );
	StringBuffer buf = new StringBuffer();
	
	buf.append( "--- Books ---" ).append( newline );
	for( int i=0; i<books.size(); i++ ){
	    buf.append( books.elementAt(i) ).append( newline );
	}
	
	buf.append( "--- Magazines ---" ).append( newline );
	for( int i=0; i<magazines.size(); i++ ){
	    buf.append( magazines.elementAt(i) ).append( newline );
	}

	return buf.toString();
    }
}

// --------------------------------------------------------------

class Book {
    private String author;
    private String title;

    public Book() {}
    
    public void setAuthor( String rhs ) { author = rhs; }
    public void setTitle(  String rhs ) { title  = rhs; }

    public String toString() {
	return "Book: Author='" + author + "' Title='" + title + "'";
    }
}

// --------------------------------------------------------------

class Magazine {
    private String name;
    private Vector articles;

    public Magazine() {
	articles = new Vector();
    }

    public void setName( String rhs ) { name = rhs; }

    public void addArticle( Article a ) {
	articles.addElement( a );
    }

    public String toString() {
	StringBuffer buf = new StringBuffer( "Magazine: Name='" + name + "' ");
	for( int i=0; i<articles.size(); i++ ){
	    buf.append( articles.elementAt(i).toString() );
	}
	return buf.toString();
    }
}

// --------------------------------------------------------------

class Article {
    private String headline;
    private String page;

    public Article() {}

    public void setHeadline( String rhs ) { headline = rhs; }
    public void setPage(     String rhs ) { page     = rhs; }

    public String toString() {
	return "Article: Headline='" + headline + "' on page='" + page + "' ";
    }
}

The classes have not been declared public, therefore they have package visibility. The primary consequence of this is that all of them can be defined in the same source file. (To remove possible confusion: the variable name rhs used in the setter methods stands for right-hand-side -- a very convenient naming convention for assignments!)

SAX

SAX, the Simple API for XML, is a traditional, event-driven parser. It reads the XML document incrementally, calling certain callback functions in the application code whenever it recognizes a token. Callback events are generated for the beginning and the end of a document, the beginning and end of an element, etc. They are defined in the interface org.xml.sax.ContentHandler, which every SAX-based document handler class must implement. It is the responsibility of the application programmer to implement these callback functions. Often, the application may not care about certain events reported by the SAX parser. For these cases, there exists a convenience class, org.xml.sax.helpers.DefaultHandler, which provides empty implementations for all functions defined in ContentHandler; custom classes simply extend DefaultHandler and need only override those callbacks in which they are specifically interested. This is done in the code below.
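To see the DefaultHandler convenience in isolation, here is a small sketch of a handler that overrides a single callback and leaves all others at their empty defaults. It is not part of the example application; the class name is invented, and the parser is obtained through the JDK's JAXP factory (an assumption made here to keep the sketch self-contained).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: counts the elements in a document.
class ElementCounter extends DefaultHandler {
    private int count = 0;

    // The only event of interest; every other callback keeps its empty default.
    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        count++;
    }

    static int countElements( String xml ) {
        try {
            ElementCounter handler = new ElementCounter();
            SAXParserFactory.newInstance().newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler.count;
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        String xml = "<catalog><book><title>T</title></book><magazine/></catalog>";
        System.out.println( countElements( xml ) );   // prints 4
    }
}
```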



 

At the heart of a program (or class) utilizing the SAX parser typically lies a stack. Whenever an element is started, a new data object of the appropriate type is pushed onto the stack. Later, when the element is closed, the topmost object on the stack has been finished and can be popped. Unless it has been the root element (in which case the stack will be empty after it has been popped), the most recently popped element will have been a child element of the object that now occupies the top position of the stack, and can be inserted into its parent object. This process corresponds to the shift-reduce cycle of bottom-up parsers. Note how the requirement that XML elements must not overlap is crucial for the proper functioning of this idiom.

Example 1. Unmarshalling with SAX.


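// Imports used by this listing (to be placed at the top of the source file).
import java.util.Stack;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

class SaxCatalogUnmarshaller extends DefaultHandler {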
    private Catalog catalog;

    private Stack stack;
    private boolean isStackReadyForText;

    private Locator locator;

    // ----- 

    public SaxCatalogUnmarshaller() {
	stack = new Stack();
	isStackReadyForText = false;
    }

    public Catalog getCatalog() { return catalog; }

    // ----- callbacks: -----

    public void setDocumentLocator( Locator rhs ) { locator = rhs; }

    // ----- 

    public void startElement( String uri, String localName, String qName,
			      Attributes attribs ) {

	isStackReadyForText = false;

	// if next element is complex, push a new instance on the stack
	// if element has attributes, set them in the new instance
	if( localName.equals( "catalog" ) ) {
	    stack.push( new Catalog() );

	}else if( localName.equals( "book" ) ) {
	    stack.push( new Book() );

	}else if( localName.equals( "magazine" ) ) {
	    stack.push( new Magazine() );

	}else if( localName.equals( "article" ) ) {
	    stack.push( new Article() );
	    String tmp = resolveAttrib( uri, "page", attribs, "unknown" );
	    ((Article)stack.peek()).setPage( tmp );
	}
	// if next element is simple, push StringBuffer 
	// this makes the stack ready to accept character text
	else if( localName.equals( "title" ) || localName.equals( "author" ) ||
		 localName.equals( "name"  ) || localName.equals( "headline" ) ) {
	    stack.push( new StringBuffer() );
	    isStackReadyForText = true;
	}
	// if none of the above, it is an unexpected element		 
	else{
	    // do nothing
	}		 
    }

    // ----- 

    public void endElement( String uri, String localName, String qName ) {

	// recognized text is always content of an element
	// when the element closes, no more text should be expected
	isStackReadyForText = false;

	// pop stack and add to 'parent' element, which is next on the stack
	// important to pop stack first, then peek at top element!
	Object tmp = stack.pop();
	
	if( localName.equals( "catalog" ) ) {
	    catalog = (Catalog)tmp;
	
	}else if( localName.equals( "book" ) ) {
	    ((Catalog)stack.peek()).addBook( (Book)tmp );

	}else if( localName.equals( "magazine" ) ) {
	    ((Catalog)stack.peek()).addMagazine( (Magazine)tmp );
	    
	}else if( localName.equals( "article" ) ) {
	    ((Magazine)stack.peek()).addArticle( (Article)tmp );
	}
	// for simple elements, pop StringBuffer and convert to String
	else if( localName.equals( "title" ) ) {
	    ((Book)stack.peek()).setTitle( tmp.toString() );

	}else if( localName.equals( "author" ) ) {
	    ((Book)stack.peek()).setAuthor( tmp.toString() );

	}else if( localName.equals( "name" ) ) {
	    ((Magazine)stack.peek()).setName( tmp.toString() );

	}else if( localName.equals( "headline" ) ) {
	    ((Article)stack.peek()).setHeadline( tmp.toString() );
	}
	// if none of the above, it is an unexpected element:
	// necessary to push popped element back!
	else{
	    stack.push( tmp );
	}
    }

    // -----
    
    public void characters( char[] data, int start, int length ) {

	// if stack is not ready, data is not content of recognized element
	if( isStackReadyForText == true ) {
	    ((StringBuffer)stack.peek()).append( data, start, length );
	}else{
	    // read data which is not part of recognized element
	}
    }
    
    // -----
    
    private String resolveAttrib( String uri, String localName, 
			          Attributes attribs, String defaultValue ) {
	
	String tmp = attribs.getValue( uri, localName );
	return (tmp!=null)?(tmp):(defaultValue);
    }
}

Of the various callback methods declared in the ContentHandler interface, only four are implemented here. In unmarshalling a document, we are primarily interested in the contents that are encoded in it. Therefore, the relevant events are the beginning and end of an element, and the occurrence of raw character data inside an element. We also implement the setDocumentLocator() method. Although not used in the application code, it can be very helpful in debugging. The org.xml.sax.Locator interface acts like a cursor, pointing to the position in the XML document where the last event occurred. It provides useful methods such as getLineNumber() and getColumnNumber().
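As an illustration of how the Locator might be put to use, the sketch below records the line number reported at each startElement() event. It is an assumption-laden example rather than part of the application: the class and method names are invented, the parser comes from the JDK's JAXP factory, and a parser is not strictly required to supply a Locator at all (hence the null check).

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: notes where in the document each element starts.
class LineReporter extends DefaultHandler {
    private Locator locator;   // stays null if the parser supplies none
    private final StringBuilder lines = new StringBuilder();

    public void setDocumentLocator( Locator rhs ) { locator = rhs; }

    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        // the locator points just past the text that triggered this event
        int line = (locator!=null) ? locator.getLineNumber() : -1;
        lines.append( qName ).append( "@line " ).append( line ).append( '\n' );
    }

    static String linesOf( String xml ) {
        try {
            LineReporter handler = new LineReporter();
            SAXParserFactory.newInstance().newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler.lines.toString();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.out.print( linesOf( "<catalog>\n  <book/>\n</catalog>" ) );
    }
}
```

The same two lines (querying getLineNumber() and getColumnNumber()) can be dropped into any of the "unexpected element" branches of the unmarshaller to produce more useful diagnostics.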

Start of Element

When the startElement() function is called, the SAX parser passes it a number of arguments. The first three are (in order): the namespace URI, the local name, and the fully qualified name of the element. By default, only the URI and the local name need to be supplied, while the qualified name is optional. Since the catalog document does not introduce any XML namespaces, we only use the local name in the present application.
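For a document that does use namespaces, the three name arguments differ from one another. The following sketch prints them for a small made-up document; the namespace URI http://example.org/lib, the prefix lib, and the class name are all invented for illustration. One assumption to be aware of: the sketch obtains its parser from the JDK's JAXP factory, which ignores namespaces unless setNamespaceAware(true) is called.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records the name parts passed to startElement().
class NameParts extends DefaultHandler {
    private final StringBuilder seen = new StringBuilder();

    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        seen.append( '{' ).append( uri ).append( '}' ).append( localName )
            .append( " qName=" ).append( qName ).append( '\n' );
    }

    static String describe( String xml ) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware( true );   // off by default in JAXP
            NameParts handler = new NameParts();
            factory.newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler.seen.toString();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.out.print( describe(
            "<lib:catalog xmlns:lib='http://example.org/lib'><book/></lib:catalog>" ) );
    }
}
```

The unqualified book element is reported with an empty namespace URI, which is also what the catalog example in this article receives for every element.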

The last argument holds the attributes of the present element (if any) in a specific container, which allows retrieval of the attributes by their names, as well as iteration over all attributes using an integer index.
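The example application only retrieves an attribute by name. Iteration by integer index might look like the sketch below, which copies all attributes of a document's elements into a sorted map. The class name is invented, the lang attribute is added purely for demonstration, and the parser is again taken from the JDK's JAXP factory.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects attribute names and values.
class AttributeLister extends DefaultHandler {
    // a sorted map, because the order of attributes is not guaranteed
    private final Map<String,String> found = new TreeMap<String,String>();

    public void startElement( String uri, String localName, String qName,
                              Attributes attribs ) {
        for( int i=0; i<attribs.getLength(); i++ ) {
            found.put( attribs.getQName( i ), attribs.getValue( i ) );
        }
    }

    static Map<String,String> list( String xml ) {
        try {
            AttributeLister handler = new AttributeLister();
            SAXParserFactory.newInstance().newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler.found;
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.out.println( list( "<article page='5' lang='en'/>" ) );
    }
}
```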

Elements are recognized by their local names. If the current element is a complex element, an object of the appropriate type is instantiated and pushed onto the stack. If the current element is simple, a new StringBuffer is pushed onto the stack instead, ready to accept character data.

Finally, the <article> element has an attribute, which is read from the attribs argument and inserted into the newly created article object on top of the stack. The attribute is extracted using the convenience function resolveAttrib(), which returns the attribute value or a default text, if the attribute is missing.

End of Element

The endElement() function is called with essentially the same arguments as the startElement() function; only the list of attributes is missing. In any case, the topmost element on the stack is popped, converted to the proper type, and inserted into its parent, which now occupies the top of the stack. Only the root element, which has no parent, is treated differently.

Raw Text

Finally, the callback function named characters() is called when the parser encounters raw text. It is passed a char array, containing the actual data, as well as a position at which to start reading and the length of data to be read from the array. Of course, it is illegal to access the data array outside of those boundaries. The implementation of the callback method inserts the data into the StringBuffer on the stack.

The way the characters() function is called by the underlying SAX parser often leads to some initial confusion, for two reasons. Firstly, there is no guarantee that a stretch of contiguous data results in only a single call to characters() -- it would be perfectly legal for the parser to invoke the callback function for each individual character of text! Although this is certainly an extreme scenario, it is quite common for text with embedded entity references to result in several calls to characters(): one for the text before the reference, a separate call for the entity itself, and finally, one for the remaining text. This is the reason that a StringBuffer is pushed on the stack if a simple element is encountered when reading the example document. (In fact, using a StringBuffer with the characters() callback function is a common idiom when using the SAX API.)
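The accumulate-then-read idiom can be tried out on its own with the following sketch. It counts how often characters() is called and appends each slice to a buffer. The class name is invented, the parser comes from the JDK's JAXP factory, and java.lang.StringBuilder is used here as the unsynchronized sibling of StringBuffer. The number of calls depends on the parser, so the sketch prints it rather than assuming a value.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: gathers all character data of a document.
class TextCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    private int calls = 0;

    public void characters( char[] data, int start, int length ) {
        calls++;                              // one more chunk delivered
        text.append( data, start, length );   // use only the valid slice
    }

    String getText() { return text.toString(); }
    int getCalls()   { return calls; }

    static TextCollector collect( String xml ) {
        try {
            TextCollector handler = new TextCollector();
            SAXParserFactory.newInstance().newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler;
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        TextCollector c = collect( "<title>Tom &amp; Jerry &lt;1&gt;</title>" );
        System.out.println( c.getCalls() + " call(s): " + c.getText() );
    }
}
```

However many chunks the parser chooses to deliver, the buffer ends up holding the complete text with the entity references already resolved.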

The second reason that characters() can lead to confusion results from the fact that it is called for all text characters encountered by the parser, including whitespace, even the whitespace between element tags (such as newlines and tabs). This is surprising, since ContentHandler defines a special callback method ignorableWhitespace(), taking the same arguments as characters(). However, without a DTD or XML Schema, this method is never called, since there is no way for the parser to distinguish whether some whitespace is ignorable or not. In the present example program, the boolean flag isStackReadyForText serves to distinguish between the two. The stack only becomes ready to accept text when a simple element has started and before it has ended.
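This behavior is easy to observe. The sketch below (invented class name, parser obtained from the JDK's JAXP factory) parses a document that has no DTD and contains no text at all, only indentation, and counts what arrives through each of the two callbacks.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: shows where inter-element whitespace ends up.
class WhitespaceProbe extends DefaultHandler {
    int whitespaceChars = 0;   // whitespace delivered through characters()
    int ignorableCalls  = 0;   // calls to ignorableWhitespace()

    public void characters( char[] data, int start, int length ) {
        for( int i=start; i<start+length; i++ ) {
            if( Character.isWhitespace( data[i] ) ) whitespaceChars++;
        }
    }

    public void ignorableWhitespace( char[] data, int start, int length ) {
        ignorableCalls++;
    }

    static WhitespaceProbe probe( String xml ) {
        try {
            WhitespaceProbe handler = new WhitespaceProbe();
            SAXParserFactory.newInstance().newSAXParser()
                .parse( new InputSource( new StringReader( xml ) ), handler );
            return handler;
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        WhitespaceProbe p = probe( "<catalog>\n  <book/>\n</catalog>" );
        System.out.println( "via characters(): " + p.whitespaceChars );
        System.out.println( "ignorableWhitespace() calls: " + p.ignorableCalls );
    }
}
```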



DOM

The Document Object Model (DOM) describes an XML document as a tree-like structure, with every XML element being a node in the tree. A DOM-based parser reads the entire document, and (at least in principle) forms the corresponding document tree in memory. The DOM tree is formed from classes that all implement the org.w3c.dom.Node interface. This interface provides functions to walk or modify the tree (such as getChildNodes(), or appendChild() and removeChild()), and, of course, methods to query each node for its name and value.
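Walking such a tree needs nothing beyond getChildNodes() and recursion. As a minimal sketch (with an invented class name, and with the parser obtained through the JDK's JAXP DocumentBuilderFactory, which is an assumption of this sketch), the following code prints an indented outline of every node below the root element:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: prints the names of all nodes in a DOM tree.
class DomOutline {

    // Appends the name of the node and, recursively, of all its descendants.
    static void outline( Node node, int depth, StringBuilder buf ) {
        for( int i=0; i<depth; i++ ) buf.append( "  " );
        buf.append( node.getNodeName() ).append( '\n' );

        NodeList children = node.getChildNodes();
        for( int i=0; i<children.getLength(); i++ ) {
            outline( children.item( i ), depth+1, buf );
        }
    }

    static String outline( String xml ) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse( new InputSource( new StringReader( xml ) ) );
            StringBuilder buf = new StringBuilder();
            outline( doc.getDocumentElement(), 0, buf );
            return buf.toString();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.out.print( outline( "<catalog><book><title>T</title></book></catalog>" ) );
    }
}
```

For the compact document in main(), the outline lists catalog, book, title, and finally a #text node, each indented one level deeper than its parent.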



The present unmarshalling code does not need to modify the DOM tree. The tree traversal itself is essentially recursive: the root node is unmarshalled, then each of its child nodes (which are either of type book or magazine), and, in the case of the magazine, its children (article). Whenever a child node has been unmarshalled, the resulting object representation of that node is inserted into the parent object.

Example 2. Unmarshalling with DOM.


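// Imports used by this listing (to be placed at the top of the source file).
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class DomCatalogUnmarshaller {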

    public DomCatalogUnmarshaller() { }

    // -----

    public Catalog unmarshallCatalog( Node rootNode ) {
	Catalog c = new Catalog();

	Node n;
	NodeList nodes = rootNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "book" ) ) {
		    c.addBook( unmarshallBook( n ) );
		    
		}else if( n.getNodeName().equals( "magazine" ) ){
		    c.addMagazine( unmarshallMagazine( n ) );
		    
		}else{
		    // unexpected element in Catalog
		}
	    }else{
		// unexpected node-type in Catalog
	    }
	}
	return c;
    }

    // -----

    private Book unmarshallBook( Node bookNode ) {
	Book b = new Book();

	Node n;
	NodeList nodes = bookNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "author" ) ){
		    b.setAuthor( unmarshallText( n ) );

		}else if( n.getNodeName().equals( "title" ) ){
		    b.setTitle( unmarshallText( n ) );

		}else{
		    // unexpected element in Book
		}
	    }else{
		// unexpected node-type in Book
	    }
	}
	return b;
    }

    // -----

    private Magazine unmarshallMagazine( Node magazineNode ) {
	Magazine m = new Magazine();

	Node n;
	NodeList nodes = magazineNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "name" ) ) {
		    m.setName( unmarshallText( n ) );

		}else if( n.getNodeName().equals( "article" ) ) {
		    m.addArticle( unmarshallArticle( n ) );

		}else{
		    // unexpected element in Magazine
		}
	    }else{
		// unexpected node-type in Magazine
	    }
	}
	return m;
    }

    // -----

    private Article unmarshallArticle( Node articleNode ) {
	Article a = new Article();

	if( articleNode.hasAttributes() == true ) {
	    a.setPage( unmarshallAttribute( articleNode, "page", "unknown" ) );
	}
	
	Node n;
	NodeList nodes = articleNode.getChildNodes();
	
	for( int i=0 ; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.ELEMENT_NODE ){

		if( n.getNodeName().equals( "headline" ) ) {
		    a.setHeadline( unmarshallText( n ) );

		}else{
		    // unexpected element in Article
		}
	    }else{
		// unexpected node-type in Article
	    }
	}
	return a;
    }
    
    // -----

    private String unmarshallText( Node textNode ) {
	StringBuffer buf = new StringBuffer();

	Node n;
	NodeList nodes = textNode.getChildNodes();

	for( int i=0; i<nodes.getLength(); i++ ){
	    n = nodes.item( i );

	    if( n.getNodeType() == Node.TEXT_NODE ) {
		buf.append( n.getNodeValue() );
	    }else{
		// expected a text-only node!
	    }
	}
	return buf.toString();
    }

    // -----

    private String unmarshallAttribute( Node node, 
    	String name, String defaultValue ){
	Node n = node.getAttributes().getNamedItem( name );
	return (n!=null)?(n.getNodeValue()):(defaultValue);
    }
}

There are subtypes of the Node interface representing elements, text, comments, entities, and many others. The tree model, by which each part of the document is represented as a Node, is followed very consistently. Character data, for instance, is considered a child of its enclosing Element and is represented by its own Text instance, which has to be queried using getNodeValue() to find the actual string.

The Node supertype offers getNodeName(), getNodeValue(), and getAttributes() to provide access to information about a Node instance without having to downcast it.

Not all three of these methods make sense for every node type, however. For instance, only an Element can have attributes; for all other Node subtypes the corresponding function returns null. For Element nodes, getNodeName() returns the tag name, but getNodeValue() returns null. In contrast, for a Text node, getNodeValue() returns the character data, while getNodeName() returns the fixed string "#text". The W3C DOM specification contains a table detailing the behavior of all three functions for every possible node type.
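The difference between the node types can be checked directly. The following sketch (invented class name, JAXP parser assumed, lang attribute added only for demonstration) reports the result of all three functions for an element and for the text node it contains:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: shows what the three Node accessors return.
class NodeFacts {

    static String describe( Node n ) {
        return n.getNodeName() + " / " + n.getNodeValue() + " / "
            + ( n.getAttributes()!=null ? "has attribute map" : "no attribute map" );
    }

    static Node parseRoot( String xml ) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        Node title = parseRoot( "<title lang='en'>T</title>" );
        System.out.println( describe( title ) );                  // the element
        System.out.println( describe( title.getFirstChild() ) );  // its text child
    }
}
```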

In the present program, we are only interested in three kinds of nodes: those representing elements, text, and attributes. All of the unmarshalling functions are very similar to each other. They accept the topmost node of the subtree they are to unmarshall as an argument. Then they create an object representing the current node and iterate over its child nodes, unmarshalling each in turn. If a child node describes a complex element, the node is passed on to the appropriate unmarshalling function, depending on the element name. A child node of type TEXT_NODE holds the content of a simple element, and its node value is simply the character data.

Nodes describing attributes are a bit different, since attributes are not really part of the document's tree structure: attributes are not proper children of the elements in which they are contained. They can therefore not be reached by tree-walking operations; instead, the Node class provides a getAttributes() function, which returns a collection of key/value-pairs, containing the attributes. Again, we provide a convenience function that returns a default value in case no attribute can be found for the given name.
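The following sketch makes the distinction visible: the article element in its sample document has one child node (the headline), while its page attribute shows up only in the NamedNodeMap returned by getAttributes(). The class and method names are invented, and the parser is obtained through JAXP, which is an assumption of this sketch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: lists the attributes of an element node.
class AttributeAccess {

    // Expects an element node; other node types have no attribute map.
    static String listAttributes( Node element ) {
        StringBuilder buf = new StringBuilder();
        NamedNodeMap attribs = element.getAttributes();
        for( int i=0; i<attribs.getLength(); i++ ) {
            Node a = attribs.item( i );
            buf.append( a.getNodeName() ).append( '=' )
               .append( a.getNodeValue() ).append( ' ' );
        }
        return buf.toString().trim();
    }

    static Node parseRoot( String xml ) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        Node article = parseRoot( "<article page='5'><headline>H</headline></article>" );
        System.out.println( "children:   " + article.getChildNodes().getLength() );
        System.out.println( "attributes: " + listAttributes( article ) );
    }
}
```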

The Driver

Finally, we need a driver class, containing static void main(). The main() function reads the API to use (SAX or DOM) and the name of the XML file from the command line. It creates an org.xml.sax.InputSource from the filename. This class is acceptable to both SAX and DOM as an encapsulation of an XML document. Then it creates instances of the appropriate parser and unmarshaller classes and passes the input file to them. Finally, it prints the contents of the created objects to standard output.

Example 3. Driver class.


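// Imports used by this listing (to be placed at the top of the source file).
import java.io.File;
import java.io.FileInputStream;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;

public class Driver {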
    
    public static void main( String[] args ) {
	Catalog catalog = null;

	try {
	    File file = new File( args[1] );
	    InputSource src = new InputSource( new FileInputStream( file ) );
	
	    if( args[0].equals( "SAX" ) ) {
		System.out.println( "--- SAX ---" );

		SaxCatalogUnmarshaller saxUms = new SaxCatalogUnmarshaller();

		XMLReader rdr = XMLReaderFactory.
		    createXMLReader( "org.apache.xerces.parsers.SAXParser" );
		rdr.setContentHandler( saxUms );
		rdr.parse( src );

		catalog = saxUms.getCatalog();

	    }else if( args[0].equals( "DOM" ) ) {
		System.out.println( "--- DOM ---" );

		DomCatalogUnmarshaller domUms = new DomCatalogUnmarshaller();

		org.apache.xerces.parsers.DOMParser prsr = 
		    new org.apache.xerces.parsers.DOMParser();
		prsr.parse( src );
		Document doc = prsr.getDocument();
		
		catalog = domUms.unmarshallCatalog( doc.getDocumentElement() );

	    }else{
                System.out.println( "Usage: SAX|DOM filename" );
                System.exit(0);
            }

	    System.out.println( catalog.toString() );

	}catch( Exception exc ) {
	    System.out.println( "Usage: SAX|DOM filename" );
	    System.err.println( "Exception: " + exc );
	}
    }
}

SAX and DOM are interface specifications. Implementations of these interfaces are available from various sources (both commercial and free), and it is part of the driver's responsibility to load the specific parser class. The code above uses the Apache Xerces implementations of the SAX and DOM specifications; these are freely available, open source, high-quality implementations. Be sure that the corresponding classes are included in your CLASSPATH.

The SAX specification contains a factory class that can be used to select which SAX parser implementation will be used. After obtaining an XMLReader instance from the factory, we need to register our SAX unmarshaller with it as the application-specific content handler. Finally, we can retrieve the unmarshalled objects from the unmarshaller instance.

As opposed to SAX, the DOM specification covers only the tree representation of the XML document. Instantiating and using the parser is not actually covered by DOM itself, and the specific implementation must be named directly in the application code. After the input document has been parsed, the resulting DOM tree can be retrieved from the parser using the getDocument function, which returns a Document instance. The Document interface extends the Node interface and represents the root node of the document. It is then used with the appropriate unmarshaller class, similar to the SAX case.
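An alternative that avoids naming an implementation class in the application code is JAXP (the javax.xml.parsers package bundled with the JDK), which provides factories for both kinds of parser. JAXP is not used in the driver above, so the following is only a sketch of how the two parsers might be obtained that way; the class and method names are invented. Each method simply returns the name of the root element.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: obtains SAX and DOM parsers through JAXP.
class JaxpFactories {

    static String rootViaDom( String xml ) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse( new InputSource( new StringReader( xml ) ) );
            return doc.getDocumentElement().getNodeName();
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    static String rootViaSax( String xml ) {
        final String[] root = new String[1];   // filled in by the handler
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse( new InputSource( new StringReader( xml ) ),
                new DefaultHandler() {
                    public void startElement( String uri, String localName,
                                              String qName, Attributes attribs ) {
                        if( root[0]==null ) root[0] = qName;   // first element only
                    }
                } );
            return root[0];
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        String xml = "<catalog><book/></catalog>";
        System.out.println( rootViaDom( xml ) + " " + rootViaSax( xml ) );
    }
}
```

Which parser implementation the factories return depends on the runtime environment, so the application code stays the same when the implementation is swapped.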

Conclusion

It bears repeating that the code above is for instructional purposes only. It ignores many XML structures (such as namespaces, entities, and, of course, constraints), as well as more advanced features of the parser classes (such as additional SAX callback handlers, or more powerful ways to walk and modify a DOM tree). But the most immediate omission concerns the handling of unexpected elements and similar errors. The locations in the code where these conditions should be handled are clearly marked. It can be enlightening to insert some logging code and then observe the behavior of the program after some "errors" (such as unexpected elements) have been introduced into the XML document. Finally, the document structure has been hard-coded into the program. A real-world application would need greater flexibility, or at least better diagnostics.
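As one possible starting point for such an experiment, the sketch below collects the names of all child elements that a parent element is not expected to contain. It is a stand-alone illustration rather than a patch to the unmarshaller: the class and method names are invented, the dvd element is a deliberately introduced "error", and the parser is obtained through JAXP.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reports child elements that should not be there.
class UnexpectedElements {

    static List<String> find( Node parent, List<String> expected ) {
        List<String> unexpected = new ArrayList<String>();
        NodeList nodes = parent.getChildNodes();
        for( int i=0; i<nodes.getLength(); i++ ) {
            Node n = nodes.item( i );
            if( n.getNodeType() == Node.ELEMENT_NODE
                && !expected.contains( n.getNodeName() ) ) {
                unexpected.add( n.getNodeName() );
            }
        }
        return unexpected;
    }

    // A catalog is expected to contain only books and magazines.
    static List<String> findInCatalog( String xml ) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse( new InputSource( new StringReader( xml ) ) );
            return find( doc.getDocumentElement(),
                         Arrays.asList( "book", "magazine" ) );
        } catch( Exception exc ) {
            throw new IllegalStateException( exc );
        }
    }

    public static void main( String[] args ) {
        System.err.println( "unexpected: "
            + findInCatalog( "<catalog><book/><dvd/><magazine/></catalog>" ) );
    }
}
```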

I hope to have demonstrated how to use either API to parse a simple XML document and turn its data into a set of Java objects. The example application is simple, but it should be enough to get you started. The references contain additional resources.

References

Books

  • Brett McLaughlin: Java & XML, 2nd edition, O'Reilly (2001)
  • Erik T. Ray: Learning XML, 1st edition, O'Reilly (2001)
  • David Brownell: SAX2, 1st edition, O'Reilly (2002)

Philipp K. Janert is a software project consultant, server programmer, and architect.
