The eXtensible Markup Language (XML)

SAX 2

SAX offers an alternative approach to DOM for parsing XML documents. The approach of DOM is tree-based; the approach of SAX is event-based. There are tradeoffs between the approaches.

SAX applications require little memory overhead and they are easy to program, but they do not offer much flexibility for random access or document traversal. SAX works well for many simple operations such as retrieving specific parts of a document or creating a subset document.

DOM applications have a large memory overhead of holding the entire document in memory, but this allows great flexibility for random access and document traversal. DOM works well for tasks involving more complicated analysis or document modification. DOM may be unusable for very large documents.

Parser Invocation

There are a variety of SAX parsers including Apache's Xerces, MSXML, Saxon, Sun's JAXP, and Oracle's XML Parser. The SAX parser can be specified directly, e.g.:
    import org.apache.xerces.parsers.SAXParser;
    import org.xml.sax.Parser;
    import org.xml.sax.XMLReader;
allows:
    Parser parser = new SAXParser();
for a SAX 1.0 parser and
    XMLReader reader = new SAXParser();
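for a SAX 2.0 parser. The SAX parser can also be specified indirectly, so that the actual parser is chosen by a "factory" instead of being hard-coded in the application. SAX 2 provides its own factory, org.xml.sax.helpers.XMLReaderFactory, and JAXP provides SAXParserFactory. For example, with JAXP:
    import javax.xml.parsers.SAXParser;
    import javax.xml.parsers.SAXParserFactory;
    import org.xml.sax.XMLReader;
    SAXParserFactory factory = SAXParserFactory.newInstance();
    SAXParser parser = factory.newSAXParser();
    XMLReader reader = parser.getXMLReader();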

Then, the parser or reader is ready to parse the file and generate events. The application must be written to handle the events as specified in the API.
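
The whole invocation can be sketched end to end as follows. This is only an illustration, assuming the JAXP factory bundled with the JDK; the class name ParseDemo and the method countElements() are hypothetical names chosen for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: counts the start-tag events in a document.
class ParseDemo {
    static int countElements(String xml) throws Exception {
        // The factory decides which parser implementation is actually used.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final int[] count = {0};
        // Register a handler; the reader calls it back while it parses.
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                count[0]++;
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return count[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countElements("<play><act/><act/></play>"));
    }
}
```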

SAX Handlers

In addition to the XMLReader interface, SAX 2 has a number of handler interfaces:
    ContentHandler
    ErrorHandler
    DTDHandler
    EntityResolver
and helpers
    XMLReaderAdapter - for adapting SAX2 events for SAX1
    ParserAdapter - for adapting SAX1 events for SAX2
    DefaultHandler - base implementation of the four handlers
        by default, all handlers do nothing
There is also an XMLFilter interface described later.

Content Handlers

The ContentHandler interface methods are:
    startDocument()
    endDocument()
    startElement()
    endElement()
    characters()
    ignorableWhitespace()
    skippedEntity()
    processingInstruction()
    startPrefixMapping()
    endPrefixMapping()
    setDocumentLocator()

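The startDocument() and endDocument() methods are called only once each, at the start and end of parsing the document, respectively. The SAX 2 documentation states that endDocument() is invoked even when parsing is abandoned because of an unrecoverable error, but it also notes that this conflicts with the description of fatalError(), so applications should not rely on endDocument() being called after a fatal error.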

The startElement() and endElement() methods are called respectively when the start tag and end tag of elements are encountered during sequential processing of the document. The startElement() method has parameters for the namespace URI (if present), localName (without prefix), qName (the qualified name, with prefix), and attributes. The endElement() method has the same parameters except without attributes.
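
The order of the document and element events can be seen with a small logging handler. This is a sketch, assuming a namespace-aware JAXP parser; ElementLogger and log() are hypothetical names for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records document and element events in order.
class ElementLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // uri is the empty string when the element has no namespace.
        events.add("start {" + uri + "}" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("end {" + uri + "}" + localName);
    }

    static List<String> log(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed for uri and localName to be filled in
        ElementLogger handler = new ElementLogger();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        log("<p:a xmlns:p=\"urn:x\"><b/></p:a>").forEach(System.out::println);
    }
}
```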

The characters() method is called when there is text data within an element. Its parameters are a character array, the index at which the text starts, and the length of the text. There is some flexibility in how the method is invoked: a parser may deliver a single block of text in one call or split it across several calls.
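
Because a block of text may arrive in several pieces, a handler usually appends each piece to a buffer rather than treating one call as the complete text. The sketch below does that for a whole document; TextCollector and textOf() are hypothetical names, and how many calls the parser makes is implementation-dependent.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: accumulates character data across characters() calls.
class TextCollector extends DefaultHandler {
    private final StringBuilder buffer = new StringBuilder();
    int calls = 0; // how many times the parser invoked characters()

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only the slice [start, start + length) belongs to this event.
        buffer.append(ch, start, length);
        calls++;
    }

    static String textOf(String xml) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.buffer.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(textOf("<t>a &amp; b</t>"));
    }
}
```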

The ignorableWhitespace() method is called by validating parsers to report each chunk of ignorable whitespace. The parameters are the same as for the characters() method. Non-validating parsers can return whitespace in either the characters() method or the ignorableWhitespace() method.

Non-validating parsers can skip entity references (except the five built-in entities). When the parser skips an entity reference, the skippedEntity() method is called with a string parameter giving the name of the skipped entity.

The processingInstruction() method is called when the parser encounters a processing instruction. The parameters are the target and the data that follows it.
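
As a small illustration, the handler below records the target and data of each processing instruction. PiLogger and log() are hypothetical names for this sketch.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records "target|data" for every processing instruction.
class PiLogger extends DefaultHandler {
    final List<String> instructions = new ArrayList<>();

    @Override
    public void processingInstruction(String target, String data) {
        instructions.add(target + "|" + data);
    }

    static List<String> log(String xml) throws Exception {
        PiLogger handler = new PiLogger();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.instructions;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(log("<?xml-stylesheet href=\"a.css\"?><doc/>"));
    }
}
```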

The startPrefixMapping() and endPrefixMapping() methods handle events related to namespace contexts. The startPrefixMapping() method is invoked when an element with an xmlns attribute is encountered, before the startElement() event for that element. The parameters are the prefix and the URI. The endPrefixMapping() method is invoked after the endElement() event for the corresponding element. The parameter is the prefix.
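
How the prefix-mapping events bracket the element events can be sketched as follows, assuming a namespace-aware JAXP parser. PrefixLogger and log() are hypothetical names.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records prefix-mapping and element events in order.
class PrefixLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startPrefixMapping(String prefix, String uri) {
        events.add("startPrefixMapping " + prefix + "=" + uri);
    }

    @Override
    public void endPrefixMapping(String prefix) {
        events.add("endPrefixMapping " + prefix);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement " + localName);
    }

    static List<String> log(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // prefix mappings are only reported with namespaces on
        PrefixLogger handler = new PrefixLogger();
        factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
        return handler.events;
    }

    public static void main(String[] args) throws Exception {
        log("<p:a xmlns:p=\"urn:x\"><p:b/></p:a>").forEach(System.out::println);
    }
}
```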

The setDocumentLocator() method is invoked before any other events to supply a Locator object whose methods report where in the document the current parser event occurred. Two Locator methods are getLineNumber() and getColumnNumber().
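
A handler typically saves the Locator and consults it inside later events. The sketch below records the line number reported at each start tag; LineTracker and linesOf() are hypothetical names, and a parser is not required to supply a Locator, hence the null check.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: maps each element name to the line of its start tag.
class LineTracker extends DefaultHandler {
    private Locator locator;
    final Map<String, Integer> lines = new LinkedHashMap<>();

    @Override
    public void setDocumentLocator(Locator locator) {
        this.locator = locator; // keep it; it is updated as parsing proceeds
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (locator != null) {
            lines.put(qName, locator.getLineNumber());
        }
    }

    static Map<String, Integer> linesOf(String xml) throws Exception {
        LineTracker handler = new LineTracker();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.lines;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(linesOf("<a>\n  <b/>\n</a>"));
    }
}
```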

Error Handlers

There are three categories of SAX errors and a method to handle each type: warning(), error(), and fatalError(). The parameter for all three handlers is a SAXParseException. Various methods are available for the SAXParseException class, including getLineNumber(), getColumnNumber(), and getMessage().
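
The three callbacks can be sketched with a handler that logs each report and rethrows fatal errors. ErrorDemo and parse() are hypothetical names; the exact messages and positions reported depend on the parser.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

// Hypothetical demo class: logs every error report made during a parse.
class ErrorDemo {
    static List<String> parse(String xml) throws Exception {
        final List<String> log = new ArrayList<>();
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setErrorHandler(new ErrorHandler() {
            @Override
            public void warning(SAXParseException e) {
                log.add("warning at line " + e.getLineNumber() + ": " + e.getMessage());
            }

            @Override
            public void error(SAXParseException e) {
                log.add("error at line " + e.getLineNumber() + ": " + e.getMessage());
            }

            @Override
            public void fatalError(SAXParseException e) throws SAXException {
                log.add("fatalError at line " + e.getLineNumber() + ": " + e.getMessage());
                throw e; // parsing cannot usefully continue after a fatal error
            }
        });
        try {
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (SAXParseException e) {
            // Already recorded by fatalError(); nothing more to do here.
        }
        return log;
    }

    public static void main(String[] args) throws Exception {
        // The <b> element is never closed, so the document is not well formed.
        parse("<a>\n<b>\n</a>").forEach(System.out::println);
    }
}
```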

DTD Handlers

Two methods of the SAX 2 DTDHandler interface help support parsing of DTDs. The notationDecl() method reports a notation declaration. The unparsedEntityDecl() method reports declarations of unparsed entities, that is, entities that the parser should not parse.
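
Both declarations can be observed with a document that carries an internal DTD subset. This is a sketch only; DtdLogger, log(), and the example.invalid URL are hypothetical, and the unparsed entity itself is never fetched.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records notation and unparsed-entity declarations.
class DtdLogger extends DefaultHandler {
    final List<String> decls = new ArrayList<>();

    @Override
    public void notationDecl(String name, String publicId, String systemId) {
        decls.add("notation " + name);
    }

    @Override
    public void unparsedEntityDecl(String name, String publicId, String systemId,
                                   String notationName) {
        decls.add("unparsed entity " + name + " (notation " + notationName + ")");
    }

    static List<String> log(String xml) throws Exception {
        DtdLogger handler = new DtdLogger();
        // SAXParser.parse() registers a DefaultHandler as the DTD handler too.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.decls;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [\n"
                + "  <!ELEMENT doc EMPTY>\n"
                + "  <!NOTATION gif SYSTEM \"image/gif\">\n"
                + "  <!ENTITY logo SYSTEM \"http://example.invalid/logo.gif\" NDATA gif>\n"
                + "]>\n<doc/>";
        log(xml).forEach(System.out::println);
    }
}
```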

Entity Resolver

The EntityResolver interface has a single method, resolveEntity(), that allows the application to resolve external entities itself. The two parameters are publicId, the public identifier of the external entity, and systemId, the system identifier of the external entity.
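
One common use is to supply the content of an external entity locally instead of letting the parser fetch it. The sketch below answers every request with a fixed string; ResolverDemo, parseWithResolver(), and the example.invalid URL are hypothetical, and a real resolver would normally choose its answer from publicId or systemId.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: resolves every external entity to a fixed string.
class ResolverDemo {
    static String parseWithResolver(String xml, String replacement) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        final StringBuilder text = new StringBuilder();
        // The resolver returns an in-memory source, so nothing is downloaded.
        reader.setEntityResolver((publicId, systemId) ->
                new InputSource(new StringReader(replacement)));
        reader.setContentHandler(new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        reader.parse(new InputSource(new StringReader(xml)));
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<!DOCTYPE doc [<!ENTITY ext SYSTEM \"http://example.invalid/ext.xml\">]>"
                + "<doc>&ext;</doc>";
        System.out.println(parseWithResolver(xml, "hello"));
    }
}
```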

Programming Notes

Several issues can cause problems unless the programmer plans for them. As described above, a single block of text may be delivered through multiple characters() events. The order of an element's attributes may vary with the implementation, so attributes should be looked up by name rather than by position. Finally, SAX processes data in sequential order, without lookahead, so any context a handler needs later must be saved by the application as the events arrive.
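
Looking attributes up by name avoids any dependence on their order. A minimal sketch, in which AttributeLookup, idOf(), and the attribute name "id" are hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: reads the "id" attribute of the first element that has one.
class AttributeLookup {
    static String idOf(String xml) throws Exception {
        final String[] id = {null};
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName, String qName,
                                             Attributes atts) {
                        if (id[0] == null) {
                            // Lookup by name; returns null if the attribute is absent.
                            id[0] = atts.getValue("id");
                        }
                    }
                });
        return id[0];
    }

    public static void main(String[] args) throws Exception {
        System.out.println(idOf("<item name=\"x\" id=\"7\"/>"));
    }
}
```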

Professional XML has two example programs: one for retrieving data and one for word counting. The first example uses an XML file of Shakespeare's Hamlet, which conveniently is in the public domain, from Jon Bosak. The example Java program uses the SAX interface to create a list of PERSONAE (the roles) from the file. The second example Java program just counts the words within elements. It also allows specifying the element names in which words are counted.

Filters

Filters allow for such operations as removing unwanted elements, modifying elements or attributes, and normalizing data values. SAX2 formalizes the design technique common to filters in an XMLFilter interface and an XMLFilterImpl default implementation.
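
Removing an unwanted element can be sketched by extending XMLFilterImpl and forwarding only the events that should survive. DropElementFilter, FilterDemo, and keptElements() are hypothetical names; the filter matches on qName, which assumes a parser that is not namespace-aware.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: drops one named element together with its whole subtree.
class DropElementFilter extends XMLFilterImpl {
    private final String unwanted;
    private int depth = 0; // greater than 0 while inside a dropped subtree

    DropElementFilter(XMLReader parent, String unwanted) {
        super(parent);
        this.unwanted = unwanted;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if (depth > 0 || qName.equals(unwanted)) {
            depth++; // swallow the event
            return;
        }
        super.startElement(uri, localName, qName, atts); // pass it downstream
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if (depth > 0) {
            depth--;
            return;
        }
        super.endElement(uri, localName, qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (depth == 0) {
            super.characters(ch, start, length);
        }
    }
}

// Hypothetical demo class: lists the element names that pass through the filter.
class FilterDemo {
    static List<String> keptElements(String xml, String unwanted) throws Exception {
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        DropElementFilter filter = new DropElementFilter(reader, unwanted);
        final List<String> names = new ArrayList<>();
        // The application handler sits downstream of the filter.
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName,
                                     Attributes atts) {
                names.add(qName);
            }
        });
        filter.parse(new InputSource(new StringReader(xml)));
        return names;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(keptElements("<a><secret><x/></secret><b/></a>", "secret"));
    }
}
```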