The eXtensible Markup Language (XML)

XML Syntax Preliminaries

XML syntax is similar to HTML.

XML (unlike HTML) is case sensitive.

XML is supposed to use 16-bit Unicode characters (ISO/IEC 10646) or UTF-16, which facilitate languages requiring many symbols. However, most current systems store characters in ASCII (or UTF-8), which works efficiently for English. Converting 8-bit ASCII to 16-bit Unicode requires only adding a zero high-order byte. Legal characters include the printable ASCII characters x0020 (space) through x007E (tilde, '~'), x0009 (horizontal tab), x000A (line feed), and x000D (carriage return).

Characters can be referenced by "&#" followed by a decimal number and a semicolon, or by "&#x" followed by a hexadecimal number and a semicolon. For example, the copyright symbol '©' can be written "&#169;" or "&#xA9;".
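As a quick check of the two notations, the following sketch parses both forms with the JDK's built-in DOM parser and compares the results. The class name CharRefDemo and the element name "c" are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class; not part of the textbook examples.
public class CharRefDemo {
    // Parse a small XML string and return the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse: " + xml, e);
        }
    }

    public static void main(String[] args) {
        String decimal = rootText("<c>&#169;</c>");
        String hex = rootText("<c>&#xA9;</c>");
        // Both references denote the same character, U+00A9.
        System.out.println(decimal.equals(hex));     // prints "true"
        System.out.println((int) decimal.charAt(0)); // prints "169"
    }
}
```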

There are five special characters that must be replaced in XML content with an entity reference:

Symbol	Character	Replacement
<	left angle bracket or less than	&lt;
>	right angle bracket or greater than	&gt;
&	ampersand	&amp;
'	apostrophe or single quote	&apos;
"	double quote	&quot;

An entity reference consists of an ampersand ('&') followed by a legal XML name string, followed by a semi-colon (';'). Entity references other than for the five special characters (e.g., "&disclaimer;") must be defined prior to their use.
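The replacement of the five special characters can be sketched as a small helper. The class and method names (XmlEscape, escape) are hypothetical; real applications would normally rely on an XML library to do this when writing documents.

```java
// Hypothetical helper, shown only to illustrate the five predefined entities.
public class XmlEscape {
    // Replace each special character with its predefined entity reference.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: Tom&apos;s &lt;b&gt; &amp; &quot;No&quot;
        System.out.println(escape("Tom's <b> & \"No\""));
    }
}
```

Because each character is examined exactly once, an ampersand that is produced by a replacement is never escaped a second time.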

Simple names begin with a letter (in English, [a-zA-Z]), underscore ('_'), or colon (':'). Subsequent characters may be any of these characters, a digit ([0-9]), a hyphen ('-'), or a period ('.').
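The naming rule can be approximated with a regular expression. This sketch covers only the English (ASCII) letters mentioned above, not the full Unicode ranges that XML allows; the class name SimpleName is hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper; ASCII-only approximation of the XML name rule.
public class SimpleName {
    // First character: letter, underscore, or colon.
    // Later characters: those plus digits, hyphen, and period.
    private static final Pattern NAME =
        Pattern.compile("[A-Za-z_:][A-Za-z0-9_:.\\-]*");

    static boolean isSimpleName(String s) {
        return NAME.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(isSimpleName("xml-stylesheet")); // prints "true"
        System.out.println(isSimpleName("2fast"));          // prints "false"
    }
}
```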

String literals are used for the values of attributes, internal entities, and external identifiers. In XML, string literals are delimited by either a pair of single quotes, e.g., 'a string literal', or a pair of double quotes, e.g., "a string literal". Remember to escape single or double quotes within the literal using the ampersand notation, e.g., 'Tom&apos;s thumb' and "Response: &quot;No&quot;".

There are four whitespace characters in XML: the horizontal tab, line feed, carriage return, and space. All whitespace characters (except the carriage return) in the content of the document are preserved by the parser and passed unmodified to the application, whereas whitespace characters within element tags (including attribute values) may be removed. All three common ASCII end-of-line strings (CR-LF, LF, and CR) are converted to a single line feed.
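The end-of-line conversion can be observed directly. The sketch below feeds a string containing all three end-of-line forms to the JDK's DOM parser and returns the text that the application sees; the class name LineEndDemo is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class for end-of-line normalization.
public class LineEndDemo {
    // Parse a small XML string and return the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        // Input uses CR-LF, then CR, then LF as line separators.
        String text = rootText("<a>one\r\ntwo\rthree\nfour</a>");
        // After parsing, only line feeds remain.
        System.out.println(text.equals("one\ntwo\nthree\nfour")); // prints "true"
        System.out.println(text.indexOf('\r'));                   // prints "-1"
    }
}
```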

Elements

The element is the basic building block in XML. Elements may contain other elements, character data, character references, entity references, comments, processing instructions, and/or CDATA.

Elements are delimited with tags. If an element has no content, it can have a single empty-element tag. Otherwise, it must have a start-tag and end-tag pair (unlike HTML, which has elements with only a start-tag). An element contains (i.e., its contents are) everything between the start-tag and end-tag.

A tag begins with a '<' (left angle bracket) and ends with a '>' (right angle bracket).

Inside the tag delimiters, start-tags consist of the element type name and perhaps element attributes; end-tags consist of a '/' (forward slash) followed by the element type name; and empty-tags begin with the element type name and end with a '/' (forward slash).

The example in the textbook illustrates all three types of tags: start-tag "<HEAD>", end-tag "</HEAD>", and empty-tag "<BR/>".

XML requires proper nesting. Thus, an element cannot contain only part of another element unless it contains all of the other element. For example, the following code is interpretable in HTML, but is not allowed in XML: "<B>bold text<I>bold italic text</B>italic text</I>".
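A parser reports improper nesting as a fatal error. The following sketch uses the JDK's DOM parser to test a string for well-formedness; the class and method names (NestingCheck, isWellFormed) are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper that reports whether a string is well-formed XML.
public class NestingCheck {
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser rejected the input
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Overlapping elements: rejected.
        System.out.println(isWellFormed(
            "<B>bold text<I>bold italic text</B>italic text</I>")); // prints "false"
        // Properly nested elements: accepted.
        System.out.println(isWellFormed(
            "<B>bold text<I>bold italic text</I></B>"));            // prints "true"
    }
}
```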

Attributes describe elements much as adjectives describe nouns. An attribute is placed in the start-tag after the element type name and consists of the attribute name, an equal sign ('='), and the attribute value. Only one instance of an attribute is allowed in an element (unlike HTML). The attribute values must be delimited strings. For example:

<Cost currency="USD">12.95</Cost>

Attributes can be used with empty-tags. For example:

<Cost currency='USD' amount='12.95' />

The decision whether to use contained elements or attributes (as with the amount in these examples) is a point of consideration and disagreement.
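From the application's point of view, the two forms differ only in where the amount is looked up. This sketch reads both variants of the Cost element with the DOM API; the class and method names (CostDemo, describe) are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical demo: the amount may be element content or an attribute.
public class CostDemo {
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Element cost = doc.getDocumentElement();
            // Prefer the attribute form; fall back to the element content.
            String amount = cost.hasAttribute("amount")
                ? cost.getAttribute("amount")
                : cost.getTextContent().trim();
            return cost.getAttribute("currency") + " " + amount;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<Cost currency=\"USD\">12.95</Cost>"));       // prints "USD 12.95"
        System.out.println(describe("<Cost currency='USD' amount='12.95' />"));    // prints "USD 12.95"
    }
}
```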

Special Markup Elements

Comments, processing instructions (PIs), and CDATA sections warrant special mention.

Comments allow the inclusion of text in the XML file that is not intended to be read as part of the document. Comments begin with "<!--" and end with "-->", e.g., "<!-- This is a comment. -->". The comment between these delimiters should not contain the character sequence of two hyphens "--" and should not end with a hyphen '-' (producing "--->"). Comments may not be used within element tags and cannot be nested.

Processing instructions allow the passing of instructions or hints to the application processing the document. Processing instructions begin with "<?", have a target and optional instructions, and end with "?>", e.g., "<?xml-stylesheet href='unltheses.css' type='text/css'?>". The target must be a valid XML name identifying the application for the processing instruction. The instructions may not include the character sequence "?>".
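A SAX parser hands each processing instruction to the application as a target and a data string. The sketch below collects them from a small document; the class name PiDemo and the element name "doc" are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: list the processing instructions found in a document.
public class PiDemo {
    static List<String> instructions(String xml) {
        List<String> found = new ArrayList<>();
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override
                public void processingInstruction(String target, String data) {
                    // Record the target and the rest of the instruction.
                    found.add(target + " | " + data);
                }
            });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return found;
    }

    public static void main(String[] args) {
        String xml = "<?xml-stylesheet href='unltheses.css' type='text/css'?><doc/>";
        // prints: xml-stylesheet | href='unltheses.css' type='text/css'
        System.out.println(instructions(xml).get(0));
    }
}
```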

CDATA sections allow the inclusion of text that contains characters that would normally be interpreted as markup. CDATA sections have the basic syntax: "<![CDATA[...]]>" where "..." is the included text. So, for example, to present the actual text:

<Legalese>&copy;&copyinfo;</Legalese>

instead of

&lt;Legalese&gt;&amp;copy;&amp;copyinfo;&lt;/Legalese&gt;

one could write:

<![CDATA[<Legalese>&copy;&copyinfo;</Legalese>]]>
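The effect of a CDATA section can be confirmed with a parser: text that looks like markup inside the section reaches the application as plain characters. In this sketch the class name CdataDemo and the wrapping element "doc" are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: CDATA content is not interpreted as markup.
public class CdataDemo {
    // Parse a small XML string and return the text content of its root element.
    static String rootText(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<doc><![CDATA[<Legalese>&copy;&copyinfo;</Legalese>]]></doc>";
        // prints: <Legalese>&copy;&copyinfo;</Legalese>
        System.out.println(rootText(xml));
    }
}
```

Note that the undeclared entities inside the section cause no error, since the parser does not try to resolve them.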

XML Document Structure

An XML document consists of three parts:

An optional prolog.
The body, consisting of one or more elements in a hierarchical tree.
An optional epilog.

The example from the textbook has a prolog with an XML declaration and one processing instruction, and a body with a single root element, but no epilog.

The structure of the XML document can be drawn as a tree with a document root, an optional child for the prolog, a child for the body, and an optional child for the epilog.

All XML documents should begin with an XML declaration (as in the example) beginning with "<?xml " (six characters with the ending space). The parameters are version (required, currently must be "1.0"); encoding (recommended, default is "UTF-8" or "UTF-16" as indicated by the initial "<?xml " string); and standalone ("yes" or "no", indicating whether all required entity declarations are contained within the document). The rest of the prolog also may contain comments and processing instructions. The document type declaration (as in the example) also commonly appears in the prolog, although strictly speaking it is a declaration rather than a processing instruction.
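The declaration's parameters are available to an application after parsing. This sketch reads the version and standalone values through the DOM Document interface; the class name DeclDemo and the element name "doc" are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo: inspect the XML declaration of a parsed document.
public class DeclDemo {
    static String describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            // getXmlStandalone() is false when the parameter is absent.
            return doc.getXmlVersion() + " " + doc.getXmlStandalone();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("<?xml version=\"1.0\" standalone=\"yes\"?><doc/>")); // prints "1.0 true"
        System.out.println(describe("<?xml version=\"1.0\"?><doc/>"));                    // prints "1.0 false"
    }
}
```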

The body contains a tree with a single root node, which may contain child elements as well as comments and processing instructions.

The role of the epilog is rather vague at this point. Many parsers may stop after reaching the end of the body.

Parsing

A well-formed XML document adheres to the syntactic rules of XML. XML parsers are expected to ensure the data is well-formed.

Additional structural rules may be imposed on the document (e.g., via a Document Type Definition (DTD) or XML Schema). A valid XML document meets both the rules of XML syntax and additional structural rules. If a parser also checks adherence to additional structural rules, it is said to be a validating parser.
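The difference between well-formed and valid can be sketched with a validating DOM parser and a small DTD placed in the document itself. The class name ValidityCheck and the two-element DTD are assumptions made for this illustration; both test documents are well-formed, but only one follows the DTD.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo: a validating parser reports DTD violations as errors.
public class ValidityCheck {
    // Illustrative DTD: an article contains exactly one headline.
    static final String DTD =
        "<!DOCTYPE article ["
        + "<!ELEMENT article (headline)>"
        + "<!ELEMENT headline (#PCDATA)>"
        + "]>";

    static boolean isValid(String xml) {
        final boolean[] valid = {true};
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // check the document against its DTD
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new DefaultHandler() {
                @Override
                public void error(SAXParseException e) {
                    valid[0] = false; // validity errors are recoverable
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException(e); // not well-formed, or setup failed
        }
        return valid[0];
    }

    public static void main(String[] args) {
        System.out.println(isValid(DTD
            + "<article><headline>SAX 2.0 Released</headline></article>")); // prints "true"
        System.out.println(isValid(DTD
            + "<article><author>F. Bar</author></article>"));               // prints "false"
    }
}
```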

There are two approaches to parsing: event-driven and tree-based. The event-driven approach processes XML data sequentially, handling components one at a time. The tree-based approach constructs a tree representation of the entire document. The main advantage of the event-driven approach is simplicity. The standard API for event-driven parsing is SAX. The main advantage of the tree-based approach is its support for complex operations such as searching and editing. The Document Object Model (DOM) is the standard tree-based API.
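For contrast with event-driven parsing, here is a minimal tree-based sketch. It builds a DOM tree for a small article document and then walks the children of the root. The class name DomDemo is hypothetical, and the document omits a DOCTYPE so that no external DTD file is needed.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical demo: navigate a parsed document as a tree.
public class DomDemo {
    // Return the names of the root's child elements, separated by commas.
    static String childElementNames(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            NodeList children = doc.getDocumentElement().getChildNodes();
            List<String> names = new ArrayList<>();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Skip text nodes such as the whitespace between elements.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    names.add(child.getNodeName());
                }
            }
            return String.join(",", names);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<article date=\"02-Dec-2000\">\n"
            + "  <headline>SAX 2.0 Released</headline>\n"
            + "  <author>F. Bar</author>\n"
            + "  <body>SAX 2.0 has been released into the public domain</body>\n"
            + "</article>";
        System.out.println(childElementNames(xml)); // prints "headline,author,body"
    }
}
```

Because the whole tree is in memory, the application can revisit or modify any node after parsing, at the cost of holding the entire document at once.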

Here is an example of using the API for SAX. It consists of a main program that sets up the parser and input file and a handler class that defines the SAX handler interface. These simple examples are from Professional Java XML by K. Ahmed et al. (Wrox Press, 2001), but nearly identical introductory programs can be found in any book on Java XML programming.

Here is the main program in XML/SAX/SAXMain.java:

import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import wrox.sax.SAXParserHandler;

public class SAXMain {
    public static void main(String[] args) throws Exception {
        // Obtain a validating SAX parser from the factory
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);
        SAXParser parser = factory.newSAXParser();
        // Parse the input file, reporting each event to the handler
        parser.parse("SAX2.0.xml", new SAXParserHandler());
    }
}

The SAXParserFactory helper class uses reflection to instantiate the class named by the system property javax.xml.parsers.SAXParserFactory. This is an example of a factory design pattern for creating related objects.

The input here is hard-coded as "SAX2.0.xml".

<?xml version="1.0" ?>
<!DOCTYPE article SYSTEM "article.dtd">
<article date="02-Dec-2000">
<headline>SAX 2.0 Released</headline>
<author>F. Bar</author>
<body>SAX 2.0 has been released into the public domain</body>
</article>

The SAXParserHandler in XML/SAX/wrox/sax/SAXParserHandler.java is also very simple.

package wrox.sax;

import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Create a SAX handler to parse through a document.
// Each callback simply prints the event it receives.
public class SAXParserHandler extends DefaultHandler
{

    // Supplied by the parser; can report the position of the current event
    private Locator locator = null;

    public void startDocument() throws SAXException {
        System.out.println("startDocument");
    }

    public void endDocument() throws SAXException {
        System.out.println("endDocument");
    }

    public void setDocumentLocator(Locator locator) {
        this.locator = locator;
    }

    public void characters(char[] ch, int start, int length)
            throws SAXException
    {
        String charString = new String(ch, start, length);
        System.out.println("characters: " + charString);
    }

    public void startElement(String namespaceURI, String localName,
                             String qName, Attributes atts)
            throws SAXException
    {
        System.out.println("startElement: " + qName);

        // list out the attributes and their values
        for (int i = 0; i < atts.getLength(); i++) {
            System.out.println("Attribute: " + atts.getLocalName(i));
            System.out.println("\tValue: " + atts.getValue(i));
        }
    }

    public void endElement(String namespaceURI, String localName,
                           String qName)
            throws SAXException
    {
        System.out.println("endElement: " + qName);
    }

    public void ignorableWhitespace(char[] ch, int start, int length)
            throws SAXException
    {
        System.out.println(length + " characters of ignorable whitespace");
    }

    public void startPrefixMapping(String prefix, String uri)
            throws SAXException
    {
        System.out.println("Begin namespace prefix: " + prefix);
    }

    public void endPrefixMapping(String prefix) throws SAXException {
        System.out.println("End namespace prefix: " + prefix);
    }

    public void processingInstruction(String instruction, String data)
            throws SAXException
    {
        System.out.println("Instruction: " + instruction + ", data: " + data);
    }

    public void skippedEntity(String name) throws SAXException {
        System.out.println("Skipped entity: " + name);
    }

}

The output from executing "java SAXMain" is:

startDocument
startElement: article
Attribute: date
Value: 02-Dec-2000
0 characters of ignorable whitespace
1 characters of ignorable whitespace
2 characters of ignorable whitespace
startElement: headline
characters: SAX 2.0 Released
endElement: headline
0 characters of ignorable whitespace
1 characters of ignorable whitespace
2 characters of ignorable whitespace
startElement: author
characters: F. Bar
endElement: author
0 characters of ignorable whitespace
1 characters of ignorable whitespace
2 characters of ignorable whitespace
startElement: body
characters: SAX 2.0 has been released into the public domain
endElement: body
0 characters of ignorable whitespace
1 characters of ignorable whitespace
0 characters of ignorable whitespace
endElement: article
endDocument

The rules of attribute value normalization are described in the textbooks and are not presented here.

Special Attributes

The xml:lang attribute allows the association of a human language (such as English) with XML data. For example,

<song xml:lang="de">
<title>Sag mir wo die Blumen sind</title>
</song>

The attribute has a scope of the element and can be selectively overridden.
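Scoping and overriding can be sketched by walking from an element toward the root until an xml:lang attribute is found. The class name LangDemo and the "note" element are hypothetical additions for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical demo: find the language in effect for an element.
public class LangDemo {
    // The nearest enclosing xml:lang attribute wins.
    static String effectiveLang(Element element) {
        for (Node n = element; n instanceof Element; n = n.getParentNode()) {
            Element e = (Element) n;
            if (e.hasAttribute("xml:lang")) {
                return e.getAttribute("xml:lang");
            }
        }
        return ""; // no language declared anywhere in scope
    }

    // Parse the document and report the language of the first tagName element.
    static String langOf(String xml, String tagName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            return effectiveLang((Element) doc.getElementsByTagName(tagName).item(0));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<song xml:lang=\"de\">"
            + "<title>Sag mir wo die Blumen sind</title>"
            + "<note xml:lang=\"en\">folk song</note>"
            + "</song>";
        System.out.println(langOf(xml, "title")); // prints "de" (inherited)
        System.out.println(langOf(xml, "note"));  // prints "en" (overridden)
    }
}
```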

The xml:space attribute allows the XML data to inform applications processing the data that all whitespace should be preserved. However, the application governs its own behavior in this regard.