The eXtensible Markup Language (XML)

Markup Languages

The purpose of markup is to communicate metadata for a document, i.e., data about the data in the document.

Markup languages typically use tags to delimit and describe pieces of a document.

The Generalized Markup Language (GML) was developed at IBM in 1969. GML was a self-referential meta-language for arbitrary data, i.e., it could describe languages, grammars, and vocabularies for markup.

The Standard Generalized Markup Language (SGML) developed from GML and was adopted as a standard (ISO 8879) in 1986. Here is a simple example (from the textbook) of an SGML document that first describes the structure of the document and then specifies the content.

<!DOCTYPE email [
<!ELEMENT email O O ((to & from & date & subject?), text) >
<!ELEMENT text - O (para+) >
<!ELEMENT para O O (#PCDATA) >
<!ELEMENT (to, from, date, subject) - O (#PCDATA) >
]>
<date>10/12/99
<to>you@yours.com
<from>me@mine.com
<text>I just mailed to say...

The Hypertext Markup Language (HTML) is a simple SGML-based language widely used for creating documents for the World Wide Web (WWW). HTML is not a meta-language: it cannot be used to describe other languages for markup. Here is a simple example (from the textbook) of an HTML document, first providing a metadata title and then structured information.

The eXtensible Markup Language (XML) has been under development since 1996, with the aim of combining the power and flexibility of SGML with the simplicity and popularity of HTML. XML is license-free, platform-independent, and the basis for products from many sources. Here are a few useful links.

International Organization for Standardization (ISO) http://www.iso.ch
World Wide Web Consortium (W3C)  http://www.w3.org
Organization for the Advancement of Structured Information Standards (OASIS) http://www.oasis-open.org
OASIS XML Cover Pages http://xml.coverpages.org
XML Industry Portal http://www.xml.org


XML Design

XML is designed to be simple, extensible, international, and compatible with SGML and the Internet, and to separate structure (semantics) from presentation.

The W3C standardization process, overseen by a working group, has five steps: Working Draft, Last Call Working Draft, Candidate Recommendation, Proposed Recommendation, and Recommendation.  The XML 1.0 Recommendation was issued in 1998.  The second edition of the XML 1.0 Recommendation was released in 2000.

Related core components include the Namespaces Recommendation (January 1999) and the Associating Stylesheets Recommendation (June 1999). The XML Core Working Group is also responsible for the InfoSet Candidate Recommendation (May 2001), the XInclude Working Draft (May 2001), and the XFragment Candidate Recommendation (February 2001). Other working groups include the XML Schema Working Group, which released the XML Schema Working Draft (March 2001); the XML Query Working Group, which released several Working Drafts in February and June 2001; and the XML Linking Working Group, which released the XPointer Last Call Working Draft (January 2001), the XLink Recommendation (June 2001), and the XML Base Recommendation (June 2001).

XML 1.0 Overview

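XML is based on existing standards. It separates structure from presentation and allows self-describing data. The following example from the textbook illustrates this: the XML names each item by what it means, while the HTML table only says how to lay out the same values. Compare the following XML and HTML.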

<Person>
  <Name>
    <First>Thomas</First>
    <Last>Atkins</Last>
  </Name>
  <Age>30</Age>
</Person>

<TABLE>
  <TR>
    <TD>Thomas</TD><TD>Atkins</TD>
  </TR>
  <TR>
    <TD>age:</TD><TD>30</TD>
  </TR>
</TABLE>
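As a rough illustration of what "self-describing" buys a program, the sketch below reads the Person document above with Python's standard xml.etree.ElementTree module and asks for values by element name. The function name describe_person is made up for this example; it is not from the textbook.

```python
import xml.etree.ElementTree as ET

# The Person document from the example above.
PERSON_XML = """
<Person>
  <Name>
    <First>Thomas</First>
    <Last>Atkins</Last>
  </Name>
  <Age>30</Age>
</Person>
"""

def describe_person(xml_text):
    """Return (first, last, age), looking each value up by element name."""
    person = ET.fromstring(xml_text)
    first = person.findtext("Name/First")
    last = person.findtext("Name/Last")
    age = int(person.findtext("Age"))
    return first, last, age

print(describe_person(PERSON_XML))  # ('Thomas', 'Atkins', 30)
```

Doing the same with the HTML table would mean relying on cell positions (second row, second cell), since nothing in the markup says which cell holds the age.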

When an XML document is parsed, it can be checked to ensure that it conforms to the XML 1.0 syntax specification. If a document conforms, it is said to be well-formed. If a document is well-formed and also conforms to specifications on its content (e.g., as contained in a Document Type Definition (DTD)), it is said to be valid. A parser that checks validity is said to be a validating parser, while a parser that checks only for well-formed documents is said to be a non-validating parser.
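The well-formedness check can be sketched with a non-validating parser. The example below assumes Python's standard xml.etree.ElementTree module, which checks syntax only and does not validate against a DTD. The helper name is_well_formed is an illustration, not a standard function.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as well-formed XML, else False.

    This only checks XML 1.0 syntax. It says nothing about validity,
    because this parser does not check content against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

print(is_well_formed("<Age>30</Age>"))         # True
print(is_well_formed("<Age>30</age>"))         # False: names are case-sensitive
print(is_well_formed("<TD>Thomas<TD>Atkins"))  # False: elements never closed
```

The last case shows a habit that HTML browsers tolerate but XML parsers must reject.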

One of the most important keys to an XML application is the development of an appropriate data model and an effective vocabulary.

Vocabularies of data elements can be shared.  Many standardized vocabularies have been and are being developed, including MathML and the Chemical Markup Language (CML). XML Namespaces facilitate sharing by providing a means of naming and distinguishing shared vocabularies.
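To make the role of namespaces concrete, here is a small sketch in which two vocabularies both define an element called title. The namespace URIs under example.org and the document itself are hypothetical, invented only for this illustration.

```python
import xml.etree.ElementTree as ET

# Two made-up vocabularies that both use the element name "title".
CATALOG_XML = """
<catalog xmlns:bk="http://example.org/books"
         xmlns:pr="http://example.org/people">
  <bk:title>XML Basics</bk:title>
  <pr:title>Dr.</pr:title>
</catalog>
"""

# Prefix-to-URI map for lookups. The prefixes here are local to this
# code; only the URIs have to match the document.
NS = {"bk": "http://example.org/books", "pr": "http://example.org/people"}

root = ET.fromstring(CATALOG_XML)
book_title = root.findtext("bk:title", namespaces=NS)
person_title = root.findtext("pr:title", namespaces=NS)

print(book_title)    # XML Basics
print(person_title)  # Dr.

# Internally the parser stores the full name as {namespace-URI}local-name.
print(root.find("bk:title", NS).tag)  # {http://example.org/books}title
```

The two title elements never collide because each is identified by its namespace URI plus its local name, not by the local name alone.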

The XML Information Set (Infoset) facilitates sharing by providing consistent definitions of basic information items, including:

Document
Element
Attribute
Processing Instruction
Unexpanded Entity Reference
Character
Comment
Unparsed Entity
Notation
Namespace

These information items will be discussed in more detail beginning with the next lecture on XML syntax.

An important aspect that XML shares with HTML is navigation on the WWW. Various XML technologies are being developed to support navigation, including:

XPath - a language for addressing parts of XML data
XPointer - a language (based on XPath) used for fragment identification
XLink - a language to create and describe links between resources
XInclude - syntax for including (merging) XML data
XFragment - a standard for communicating fragmentary XML data
XQuery - related standards for queries of XML data
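Of the technologies listed, XPath is the easiest to try out. The sketch below uses the limited XPath subset supported by Python's standard xml.etree.ElementTree module, which is not a full XPath implementation. The People wrapper element, the id attributes, and the second person are hypothetical additions to the textbook's Person example.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the textbook's Person plus one invented record.
PEOPLE_XML = """
<People>
  <Person id="p1">
    <Name><First>Thomas</First><Last>Atkins</Last></Name>
    <Age>30</Age>
  </Person>
  <Person id="p2">
    <Name><First>Mary</First><Last>Jones</Last></Name>
    <Age>41</Age>
  </Person>
</People>
"""

root = ET.fromstring(PEOPLE_XML)

# A path expression: every Last element under Person/Name.
last_names = [e.text for e in root.findall("./Person/Name/Last")]

# A predicate on an attribute value.
p2 = root.find(".//Person[@id='p2']")
first_of_p2 = p2.findtext("Name/First")

# A predicate on the text of a child element.
thirty = [p.get("id") for p in root.findall(".//Person[Age='30']")]

print(last_names)   # ['Atkins', 'Jones']
print(first_of_p2)  # Mary
print(thirty)       # ['p1']
```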

XML data wouldn't be very useful if it couldn't be presented visually. Various technologies support styling and transforming XML data, including:

Cascading Style Sheets (CSS) - a language for style sheets for HTML or XML
eXtensible Stylesheet Language (XSL) - a language for expressing stylesheets
XSL Transformations (XSLT) - a language for transforming XML

XML data can be presented in different forms depending on the context, including the user-agent.
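The idea of one document with several presentations can be sketched without any stylesheet language. The example below is not XSLT or CSS. It is a plain Python stand-in, with made-up function names, that renders the same Person data two ways, much as a stylesheet might for two different user-agents.

```python
import xml.etree.ElementTree as ET
from html import escape

PERSON_XML = (
    "<Person><Name><First>Thomas</First><Last>Atkins</Last></Name>"
    "<Age>30</Age></Person>"
)

def to_html_table(person):
    """Render the person as an HTML table, e.g. for a browser."""
    first = escape(person.findtext("Name/First"))
    last = escape(person.findtext("Name/Last"))
    age = escape(person.findtext("Age"))
    return (
        "<TABLE>\n"
        f"  <TR><TD>{first}</TD><TD>{last}</TD></TR>\n"
        f"  <TR><TD>age:</TD><TD>{age}</TD></TR>\n"
        "</TABLE>"
    )

def to_plain_text(person):
    """Render the person as one line of text, e.g. for a terminal."""
    first = person.findtext("Name/First")
    last = person.findtext("Name/Last")
    age = person.findtext("Age")
    return f"{first} {last}, age {age}"

person = ET.fromstring(PERSON_XML)
print(to_html_table(person))
print(to_plain_text(person))  # Thomas Atkins, age 30
```

The data is written once. Only the rendering function changes with the context, which is the separation of structure and presentation that XSL and CSS provide declaratively.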

XML is the basis for many of the major initiatives for Internet applications. Several programming models with tools are available for processing XML, including the Document Object Model (DOM) and the Simple API for XML (SAX). Communications protocols support distributed applications. The Simple Object Access Protocol (SOAP) uses XML for messages. Efforts are beginning to use semantic information more dynamically for data warehouses and other applications. Some of the work in this area includes Topic Maps and Universal Description, Discovery, and Integration (UDDI).
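The difference between the two programming models can be sketched with Python's standard library, which ships a small DOM implementation (xml.dom.minidom) and a SAX interface (xml.sax). The function and class names below are illustrative only. Both versions extract the Last element from the Person example.

```python
import xml.sax
from xml.dom import minidom

PERSON_XML = (
    "<Person><Name><First>Thomas</First><Last>Atkins</Last></Name>"
    "<Age>30</Age></Person>"
)

def last_name_dom(xml_text):
    """DOM: build the whole tree in memory, then navigate it."""
    doc = minidom.parseString(xml_text)
    last = doc.getElementsByTagName("Last")[0]
    return last.firstChild.data

class LastNameHandler(xml.sax.ContentHandler):
    """SAX: react to parsing events as they stream past; no tree is built."""

    def __init__(self):
        super().__init__()
        self.in_last = False
        self.chunks = []

    def startElement(self, name, attrs):
        if name == "Last":
            self.in_last = True

    def endElement(self, name):
        if name == "Last":
            self.in_last = False

    def characters(self, content):
        # Text may arrive in several pieces, so collect and join later.
        if self.in_last:
            self.chunks.append(content)

def last_name_sax(xml_text):
    handler = LastNameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return "".join(handler.chunks)

print(last_name_dom(PERSON_XML))  # Atkins
print(last_name_sax(PERSON_XML))  # Atkins
```

DOM is convenient when the program needs random access to the document. SAX tends to suit large documents, since it never holds the whole tree in memory.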