Internet Systems and Programming - XML

Like HTML, the eXtensible Markup Language (XML) is used to store information in a structured format for online applications, and like HTML it is derived from the Standard Generalized Markup Language (SGML). Unlike HTML, whose set of tags is fixed, XML lets authors define elements and structures for their own purposes.

For general information about XML, see:

    http://www.w3.org/XML/
    http://java.sun.com/xml/
    http://msdn.microsoft.com/xml/
    http://xml.coverpages.org/

Most of these examples are from the textbooks, Internet and the World Wide Web: How to Program, Deitel, Deitel, and Nieto [Prentice Hall, 2000], and Programming the World Wide Web, Sebesta [Addison-Wesley, 2003].

Simple XML Examples

In Deitel et al.'s article.xml, the root element, article, contains all other elements, such as title and author, as sub-elements. Container elements, e.g., author, in turn contain child elements that specify further content items.

XML documents must be well-formed (with proper nesting), and empty elements, e.g., flag in letter.xml, close with '/>', just as in XHTML.
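The nesting rule can be tried out with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the two letter fragments are made up for illustration and are not the contents of Deitel's letter.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <flag/> is an empty element closed with '/>'.
WELL_FORMED = "<letter><flag/><closing>Sincerely</closing></letter>"
# Same content with the last two end tags swapped, so the nesting is broken.
BADLY_NESTED = "<letter><flag/><closing>Sincerely</letter></closing>"


def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(WELL_FORMED))   # True
print(is_well_formed(BADLY_NESTED))  # False
```

This checks well-formedness only; ElementTree does not validate a document against a DTD.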

In Sebesta's planes.xml, the root element, planes_for_sale, contains all other elements as sub-elements. Container elements, e.g., ad and location, contain child elements specifying the content.

The document type is specified with the !DOCTYPE declaration. In this declaration, the SYSTEM keyword gives the location of the external Document Type Definition (DTD) file, which defines and describes the XML elements.

In Deitel et al.'s letter.xml, the root element is letter. The other elements specify the content of the letter.

Document Type Definitions

The DTD allows a document to be checked for validity of structure, in addition to well-formedness. Specifications are written in Extended Backus-Naur Form (EBNF).

This letter.dtd DTD describes the elements used in the simple business letter example, including attributes of two of the elements.

The !ELEMENT declaration defines an element and lists its potential sub-elements. Sub-elements without an enumerator (occurrence indicator) must appear exactly once within the element. The '*' enumerator indicates 0 or more; the '+' enumerator indicates 1 or more; and the '?' enumerator indicates 0 or 1. The #PCDATA specification indicates the element contains parsed character data (i.e., text). The EMPTY specification indicates the element has no content.

The !ATTLIST declaration defines attributes of an element, including the name and data type. The CDATA specification indicates the attribute data type is character data (i.e., a string). The #IMPLIED specification indicates the attribute is optional and has no default value; if it is omitted, the application decides how to proceed.

This planes.dtd DTD describes the planes_for_sale document.

The planes_for_sale element consists of one or more ads.
Each ad consists of one year element, one make element, one model element, one color element, one description element, zero or one price element, one seller element, and one location element.
All of the elements of an ad are character data except the location, which consists of one city element and one state element, each of which is character data.
The seller element may have two attributes. One attribute, with a character data value, is required; the email attribute is optional, so its character data value may or may not be given.
Three entities, or shorthand notations, are defined. Firefox 1.0 will not load external entity definitions in DTD files, so planes.xml, which uses these entities, will not be displayed properly.
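The enumerators behave much like the same symbols in regular expressions. As a rough sketch (not a DTD validator, and not how a real parser is implemented), the ad content model described above can be turned into a regular expression over the sequence of child tag names. The AD_MODEL list, the helper names, and the sample element values are assumptions made for this illustration; attributes and nested content such as city and state are not checked.

```python
import re
import xml.etree.ElementTree as ET

# Content model for ad as described above: every child exactly once,
# except price, which may appear zero or one time ('?').
AD_MODEL = ["year", "make", "model", "color", "description",
            "price?", "seller", "location"]


def model_to_regex(model):
    """Translate a flat sequence with optional '*', '+', '?' suffixes into a regex."""
    parts = []
    for item in model:
        name, indicator = item, ""
        if item[-1] in "*+?":
            name, indicator = item[:-1], item[-1]
        # Each tag name is followed by a comma so names cannot run together.
        parts.append("(?:%s,)%s" % (re.escape(name), indicator))
    return re.compile("".join(parts))


def matches_model(element, model):
    """Check only the order and count of the direct children of element."""
    children = "".join(child.tag + "," for child in element)
    return model_to_regex(model).fullmatch(children) is not None


# Hypothetical ad without the optional price element.
AD_NO_PRICE = ET.fromstring(
    "<ad><year>1977</year><make>Example</make><model>Example</model>"
    "<color>white</color><description>Example</description>"
    "<seller>Example</seller><location/></ad>")

print(matches_model(AD_NO_PRICE, AD_MODEL))  # True
```

An ad with two price elements, or with the year missing, would fail the same check.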

The DTD can be placed inside an XML document or in an external file.

There are many decisions in the design of an information structure such as is expressed by a DTD. We will look at this important problem after covering more of the basics of XML.

XML Namespaces

The development of data models and vocabularies is of fundamental importance in XML applications. XML supports reuse and sharing of data models and vocabularies.

In order to reuse data models and vocabularies, it is important that XML support segmenting problems into logical and manageable domains. Then, applications must be able to reference and use models and vocabularies as required. Two issues immediately present themselves:

What naming convention should be used for referencing these models and vocabularies?

How are collisions between multiple models and vocabularies prevented or handled?

Both the problem of naming and the problem of collisions are addressed by the Namespaces Recommendation (January 1999).

An XML Namespace is a collection or group of names, usually sharing a context, that is identified by a globally unique name. The naming convention uses Uniform Resource Identifiers (URIs), but indirectly through an alias (a prefix), so as to adhere to the XML name syntax.

For example,

In the first case, the Chapter namespace attribute establishes a default namespace. In the second case, the attributes establish two namespace prefixes that can be used in the following way anywhere within the scope of the Toysco element:

<Per:Name>...</Per:Name>
<Inv:Name>...</Inv:Name>

The default namespace or a previously established namespace can be overridden by another assignment in a sub-element.
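A parser resolves each prefix to its URI, which is what keeps the two Name elements distinct. Below is a minimal sketch using Python's xml.etree.ElementTree, which reports every qualified name as "{URI}local-name". The URIs, the element text, and the Title and Note sub-elements are hypothetical; only the Per and Inv prefixes and the Toysco, Chapter, and Name element names come from the discussion above.

```python
import xml.etree.ElementTree as ET

# Two prefixes declared on Toysco; both URIs are made up for illustration.
TOYSCO = """
<Toysco xmlns:Per="http://example.com/personnel"
        xmlns:Inv="http://example.com/inventory">
  <Per:Name>Pat Example</Per:Name>
  <Inv:Name>Toy train</Inv:Name>
</Toysco>
"""

# A default namespace on Chapter, overridden inside the Note sub-element.
CHAPTER = """
<Chapter xmlns="http://example.com/book">
  <Title>Outer</Title>
  <Note xmlns="http://example.com/notes">
    <Title>Inner</Title>
  </Note>
</Chapter>
"""


def all_tags(text):
    """List the expanded names of all elements in document order."""
    return [element.tag for element in ET.fromstring(text).iter()]


for tag in all_tags(TOYSCO) + all_tags(CHAPTER):
    print(tag)
```

The two Title elements in the second document end up in different namespaces, because the inner assignment overrides the default within the scope of Note.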

Deitel's examples namespace.xml and defaultnamespace.xml illustrate namespaces.

XML Schema

A schema uses an XML-based language to define document types as an alternative to DTD. The W3C is developing standards for XML Schema. The most recent recommendations are for XML Schema 1.0, 2nd edition (October 2004), including Part 0: Primer, Part 1: Structures, and Part 2: Datatypes.

Prominent XML Examples

There are several prominent markup languages defined using XML, for example, the Mathematical Markup Language (MathML), Chemical Markup Language (CML), Speech Markup Language (SpeechML), and others.

The Mathematical Markup Language (MathML) is a low-level markup specification for mathematics. MS IE, Mozilla Firefox, and W3C Amaya browsers render MathML. Click here for an example. (This example doesn't seem to work properly with Firefox 1.0.)

The Geography Markup Language (GML) is an XML vocabulary for describing geographic information. For examples and other links, visit the Open Geospatial Consortium (OGC) and GML4J (a Java API for GML).

An example of the Chemical Markup Language is ammonia.xml. (This example requires a helper application.)

XML and HTML

An XML document can be embedded in HTML as an xml element between an <xml> open tag and an </xml> close tag. Such a document is called a data island. Here is an example using the MS ActiveX Tabular Data Control (TDC). In this example, the local data island is declared with the ID set to "xmlDoc". (This is a MS technology and does not work with Firefox 1.0.) That ID is referenced with the DATASRC attribute of the table element. The SPAN element binds the table data item elements to the information in the xml document.

XML parsers provide more extensive access to, and use of, XML data, and XSLT provides for transformation of XML data.

XML Parsers and XML Processing

An XML parser processes the XML document and DTD or Schema files (if available) to determine the content and structure. Parser APIs allow a program access to the document. Parsers also determine whether a document is well-formed and/or valid. An XML document with no syntactic errors is well-formed. An XML document that is well-formed and consistent with the DTD or Schema is valid.

There are two approaches to parsing: event-driven and tree-based.

The event-driven approach processes XML data sequentially, handling components one at a time. Its main advantage is simplicity; its main disadvantage is that there is no structure for general access to the document. The standard API for event-driven parsing is SAX.

The tree-based approach constructs a tree representation of the entire document. Its main advantage is general access to the document, which supports complex operations such as searching and editing; its main disadvantage is memory overhead, which may be several times the size of the document. The DOM is a tree-based structure.
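The difference between the two approaches is easiest to see side by side. This is a minimal sketch using the SAX and DOM modules in Python's standard library (xml.sax and xml.dom.minidom). The tiny document reuses element names from the planes example, but its values and the helper names are made up for illustration.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical document with two ads.
DOC = ("<planes_for_sale>"
       "<ad><year>1977</year></ad>"
       "<ad><year>1965</year></ad>"
       "</planes_for_sale>")


class AdCounter(xml.sax.ContentHandler):
    """Event-driven: sees one start tag at a time and keeps only a counter."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        if name == "ad":
            self.count += 1


def count_ads_sax(text):
    """Count ad elements without ever building a tree."""
    handler = AdCounter()
    xml.sax.parseString(text.encode("utf-8"), handler)
    return handler.count


def years_dom(text):
    """Tree-based: the whole document is in memory, so any node can be visited."""
    document = minidom.parseString(text)
    return [node.firstChild.data
            for node in document.getElementsByTagName("year")]


print(count_ads_sax(DOC))  # 2
print(years_dom(DOC))      # ['1977', '1965']
```

The SAX handler never holds more than a counter, while the DOM version keeps the full tree in memory and could just as easily edit or search it.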

An external XML file can be processed for inclusion in an HTML document, as in this example (from the previous edition of Deitel's book) that uses HTML and JavaScript to animate a chess game described in an XML file. Example chess.html contains an HTML file with JavaScript to access the XML file ScholarMate.xml, which describes the sequence of moves of a chess game. (This example does not work properly in Firefox 1.0.)

Microsoft parser support is included in its XML Core Services. Services include XSD validation with either SAX or DOM, support for the Schema Object Model (SOM) for access to the Schema, and support for XSLT.

The Java API for XML Processing (JAXP), developed through the Java Community Process, supports DOM, SAX, and XSLT. JAXP 1.3 is incorporated into J2SE 5.0.

The Apache XML Project has released the Xerces XML Processor for both Java and C++ (with wrappers for Perl and COM). The COM wrapper provides compatibility with MSXML. We will look more closely at parsers when we study server-side programming.

An online validating XML parser is at http://www.stg.brown.edu/pub/xmlvalid/. An online XML schema validator is at http://www.w3.org/2001/03/webdata/xsv.

Extensible Stylesheet Language (XSL)

You may be wondering how we render XML data. The Sebesta textbook has an example using CSS to display XML, but the eXtensible Stylesheet Language (XSL) is a more powerful language for defining the layout or presentation of XML documents. (Note the color differences between pages rendered by MS IE 6.0 and Firefox 1.0.) An XSL stylesheet specifies how documents of a given XML document type are to be presented. See http://www.w3.org/TR/xsl/.

XSL Transformations (XSLT) allow specification of the translation rules from one document type to another document type, for example XML to HTML. See http://www.w3.org/TR/xslt.

The Sebesta textbook has two examples using XSLT: xslplane.xml with xslplane.xsl and xslplanes.xml with xslplanes.xsl.

The Deitel text has an example that formats an XML document games.xml as HTML using XSLT in games.xsl. The example doesn't work right on the browser on the NT Server, but does work correctly on more recent browsers. Note that these are new and still changing technologies.

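There are several XSL elements to sort and filter the data. The xsl:for-each and xsl:value-of elements, with the order-by and select attributes, allow manipulation of the order and the creation of a new XML document. (The order-by attribute comes from the early working-draft XSL supported by MS IE 5; in XSLT 1.0, sorting is done with the xsl:sort element.)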
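The iterate, sort, and select pattern can be illustrated outside XSLT as well. The sketch below is not XSLT (Python's standard library has no XSLT processor); it is a plain Python analogue of what a for-each over elements, a sort key, and a value-of selection accomplish when producing HTML. The document, its values, and the ads_to_html helper are hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical input using element names from the planes example.
DOC = """<planes_for_sale>
  <ad><year>1977</year><make>Cessna</make></ad>
  <ad><year>1965</year><make>Piper</make></ad>
</planes_for_sale>"""


def ads_to_html(text):
    """Produce an HTML list of the ads, ordered by year."""
    root = ET.fromstring(text)
    # Analogue of for-each over ad elements with a numeric sort on year.
    ads = sorted(root.findall("ad"), key=lambda ad: int(ad.findtext("year")))
    # Analogue of value-of: pull out the text of the selected children.
    items = ["<li>%s %s</li>" % (escape(ad.findtext("year")),
                                 escape(ad.findtext("make")))
             for ad in ads]
    return "<ul>" + "".join(items) + "</ul>"


print(ads_to_html(DOC))
# <ul><li>1965 Piper</li><li>1977 Cessna</li></ul>
```

In a real stylesheet the same rules would be declared in the .xsl file and applied by an XSLT processor rather than written as program code.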

Working Examples

Here is an extensive example of converting XML to HTML with both an XML file and XSLT file. The output (produced via XMLSpy) is an HTML file.

Here is an example that is only 135 lines long with about 50 lines of HTML, 5 lines of XML, about 50 lines of JavaScript using DOM, and about 30 lines of XSLT.