The eXtensible Markup Language (XML)

Data Modeling

The textbook chapter on data modeling is organized in three sections:
    Information modeling
    Document design
    Creating schema

Information Modeling

A model is a description of the operational objects in a system and their relationships with each other in the operation of the system.  A vocabulary, that is, a list of well-defined terms, is an important part of a model.  The vocabulary should include terms used as nouns, verbs, adjectives, and adverbs.

The development of an information model should focus on the problem domain and the applications required and should be independent of the technology to be used in building the system.

The sorts of questions that one asks about the problem domain and the applications are:
    How do those working in the domain think and talk about it?
    What do the words in the domain mean?
    Who "owns" data and information?  Who generates it?  Who uses it?
    What is the "life-cycle" of information?

Two objectives are:

Developing a precise and unambiguous vocabulary and model.

Developing a vocabulary and model that facilitate effective communication.

Often, the process of modeling uncovers conflicting views of the domain and different definitions of important terms.

Static and Dynamic Information Models

There are two main types of information models:  static and dynamic.

Static information models focus on states - that is how things are - and focus on subject nouns and adjectives.  For example, "a customer can have one or more accounts", "a refund is a transaction", etc.
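As an illustration only (this is not from the textbook), the two example statements above could be sketched in Python.  The class and field names are assumptions chosen for the sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Account:
    number: str  # hypothetical identifier for the sketch


@dataclass
class Customer:
    name: str
    accounts: List[Account] = field(default_factory=list)

    def is_valid(self) -> bool:
        # "A customer can have one or more accounts."
        return len(self.accounts) >= 1


@dataclass
class Transaction:
    amount: float


@dataclass
class Refund(Transaction):
    # "A refund is a transaction": expressed here as a subclass.
    pass


customer = Customer("Pat", [Account("A-001")])
refund = Refund(amount=25.0)
```

Note that the sketch records only what is true of the objects (their state and kinds), not what happens to them.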

Dynamic information models focus on processes - that is what happens - and focus on subject-noun, verb, object-noun relationships and adverbs.  For example, "an agent opens an account by completing Form 11B and forwarding it to the Accounts Department".  In terms of systems approaches, static models are immediately relevant to database design and dynamic models are immediately relevant to message-passing.

In the end, both states and processes must be accommodated and the line between the approaches is not always sharp, so it is ultimately beneficial to consider both types of models.  Beginning with a static model allows an initial focus on "things" and the terminology for them, and it is often true that the basic objects are more permanent even as processes change, but the modeler may choose how to begin based on preference and initial knowledge.

Another important early step is to work to define the goals and context of the project.  What is the scope and purpose of the system being designed?  Who are the decision-makers and stakeholders in the project?  What are the pre-existing processes, systems, documents, etc.?  What new processes, systems, documents, etc. are expected?  Are there resources (e.g., guidelines) outside of the organization (such as from a non-profit group) that can be used?

The field of information modeling is large and there is increasing interest in it.  The field provides tools that can be used in this process.  The Unified Modeling Language (UML) is a graphical language for expressing program design.  It is becoming an industry standard (see the Object Management Group, www.omg.org).  Rational Software is home to the initial UML developers and its Rational Rose is one of the most popular UML products.

Static Information Modeling

A four-step process to static information modeling is presented in the textbook:

  1. Names - identifying, naming, and defining objects
  2. Taxonomy - organizing objects in a class hierarchy
  3. Relationships - identifying and specifying relationships among objects and classes
  4. Properties - detailing objects, classes, and relationships

Identifying, Naming, and Defining Objects

It is important to know and understand the things in the domain.  Identifying and naming these things is typically easier than defining them precisely.  Questions such as:
    Is X an instance of this name?
    Are X and Y the same thing?
    If X changes in this way, is it still an instance of this name?
are useful in developing separate and well-defined types of objects.

For example, an "order" might be an important thing in a sales application.  In that domain, if a caller to a phone order system orders two items, pays with one charge to a charge-card, but has the items shipped to two different destinations, is that one order or two?

Sometimes, unique identifiers already exist, such as ISBN numbers for books and Social Security numbers for employees, that are helpful in "naming" things.

This is the first of four steps, so it is likely that the resulting vocabulary of objects and definitions will be revised with subsequent steps.  This is not a simple step-by-step process - preliminary elements of the model will be refined during later work.

In UML, objects are represented with rectangular boxes containing the name of the object.

Organizing Objects in a Class Hierarchy

A taxonomy is an orderly system of classification, typically in a hierarchy related to the objects' characteristics.  The term ontology is also used for organizing principles of classification.

The key phrase is "is a kind of".  So, for example, an invoice is a kind of communication, a tire is a kind of product, etc.  Note that this is different from "is a specific instance of", such as "Tom is a person".

Organizing principles also distinguish between types of objects.  For example, services and products are distinct - that is, something is one or the other (or neither), but not both.

In UML, an arrow points from a sub-type to its super-type.
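The difference between "is a kind of" and "is a specific instance of" can be sketched in Python, where the first corresponds to a subclass relationship and the second to an object of a class.  The class names below are taken from the examples above; the code itself is only an illustrative assumption.

```python
class Product:
    pass


class Tire(Product):
    # "A tire is a kind of product": sub-type to super-type.
    pass


class Person:
    pass


# "Tom is a person": a specific instance, not a sub-type.
tom = Person()

tire_is_kind_of_product = issubclass(Tire, Product)
tom_is_instance_of_person = isinstance(tom, Person)
```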

Identifying and Specifying Relationships

Objects have relationships with other objects.  For example, there might be a relationship between customers and a scheduled flight.  This relationship could be expressed in either direction:  customers can hold a ticket on zero or more flights and a flight would have zero or more customers holding tickets.

Cardinality is an important aspect of many kinds of relationships.  UML uses a range notation at each end of a link between related objects.  For example:
    1..1    means one and only one object in the relationship
    0..n    means zero to n objects in the relationship
    1..n    means one to n objects in the relationship
These notations can define one-to-one, one-to-many, and many-to-many relationships.
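A range such as 1..n can be read as a lower and an upper bound on the number of related objects.  The following Python sketch checks a count against such a range; the function name and the use of None for an unbounded "n" are assumptions made for the example.

```python
from typing import Optional


def satisfies(count: int, lower: int, upper: Optional[int]) -> bool:
    """Return True if count lies in lower..upper; upper=None means unbounded 'n'."""
    if count < lower:
        return False
    return upper is None or count <= upper


# 1..1: exactly one object is required.
one_to_one_ok = satisfies(1, 1, 1)
# 1..n: zero objects violates the lower bound.
one_to_many_violated = not satisfies(0, 1, None)
# 0..n: zero objects is acceptable.
zero_to_many_ok = satisfies(0, 0, None)
```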

There are general patterns for relationships, including two types of containment: aggregation and composition.  For example, a special offer might be defined as an aggregation of products, each of which has its own separable identity.  A hotel is composed of rooms, each of which exists only as part of the hotel.  UML uses a diamond at the container end of the relationship.
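One way to picture the two kinds of containment is the following Python sketch, which is an assumption for illustration and not the textbook's notation.  In the aggregation, the products are created outside the offer and merely referenced by it; in the composition, the rooms are created by the hotel and each room refers back to the single hotel that owns it.

```python
from typing import List


class Product:
    def __init__(self, name: str) -> None:
        self.name = name


class SpecialOffer:
    """Aggregation: refers to products that exist independently of the offer."""

    def __init__(self, products: List[Product]) -> None:
        self.products = list(products)


class Room:
    def __init__(self, hotel: "Hotel", number: int) -> None:
        self.hotel = hotel
        self.number = number


class Hotel:
    """Composition: rooms are created by, and belong to, exactly one hotel."""

    def __init__(self, name: str, room_numbers: List[int]) -> None:
        self.name = name
        self.rooms = [Room(self, n) for n in room_numbers]


tire = Product("tire")           # exists before and apart from any offer
offer = SpecialOffer([tire])
hotel = Hotel("Grand", [101, 102])
```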

It also is possible to use relationship words to describe a relationship more precisely.  For example, a line-item "is contained on" an invoice and an invoice "is composed with" line-items (among other things).  These relationship words can be placed at either end of the link between related objects.

Other notations can be used in diagrams, such as names of unique identifiers associated with relationships, but a graph with too much information can be difficult to read.

Detailing Properties of Objects, Classes, and Relationships

Properties of objects are needed to capture other important information.  For example, an airline flight would have properties including time and date of departure, a part might have a serial number, etc.

In UML, the properties are listed with an indication as to the type of information (e.g., time-date).

Dynamic Modeling

There are a variety of approaches for developing models based on events and actions, including:
    Process and workflow models
    Data flow models
    Object models, interaction diagrams, and life histories
    Use cases

Some of the important questions in dynamic modeling are:
    How is data created?
    How is data passed from one agent to another?
    How are operations on data completed?

For example, in an inventory system, some of the events are:
    Ordering
    Receipt
    Payment
    Disposition
    Sale
    Conversion

Process and Workflow Models

Flowcharts are a popular example from computer programming and other applications for capturing the sequences, loops, and tests involved in activities.  Microsoft's Visio is a popular diagramming tool.  Rational Rose and other development tools also have dynamic modeling tools.

Data Flow Models

Data flow models are similar to workflow models, but the emphasis is on how the data changes in the process rather than on the actions themselves.  Data flow models require good definitions of the data, whereas workflow models can describe processes with only general references to the data itself.

Object Models, Interaction Diagrams, and Life Histories

Object models can have dynamic as well as static descriptions.  Dynamic elements of object models represent the methods and calls that can be made on the object, including what the object can be called upon to do and what can be done to it.  One problem with object modeling of dynamic elements is that events and actions may involve multiple objects.

Object interaction diagrams are used to analyze the exchange of messages between objects.  Object interaction diagrams are similar to data flow models, but object-orientation provides a somewhat different, and perhaps finer-grained, perspective.  Message-passing considerations, such as are made in object interaction diagrams, may provide insights for distributed systems.

Object life histories trace objects from their creation, through important "life events", to their termination.  This is similar to data-flow modeling but with data conceptualized in objects.  This process can be useful for examining the completeness of a data model - to test whether the model accurately and completely captures processes.
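An object life history can be checked mechanically if the allowed "life events" are written down as transitions.  The Python sketch below assumes a hypothetical inventory item whose states (ordered, received, paid, sold, disposed) are invented for the example; a real model would take its states from the domain.

```python
from typing import Dict, Sequence, Set

# Hypothetical life history of an inventory item: state -> allowed next states.
ALLOWED: Dict[str, Set[str]] = {
    "ordered": {"received"},
    "received": {"paid"},
    "paid": {"sold", "disposed"},
    "sold": set(),      # terminal state
    "disposed": set(),  # terminal state
}


def is_valid_history(events: Sequence[str]) -> bool:
    """Check that a history starts at creation and follows allowed transitions."""
    if not events or events[0] != "ordered":
        return False
    return all(
        nxt in ALLOWED.get(cur, set())
        for cur, nxt in zip(events, events[1:])
    )


complete = is_valid_history(["ordered", "received", "paid", "sold"])
skipped_step = is_valid_history(["ordered", "sold"])
```

A history that cannot be expressed with the transitions in the table suggests that the data model is missing a state or an event.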

Choosing a Dynamic Modeling Approach

Dynamic modeling approaches complement static approaches and complement each other.  Different approaches can be combined.

Modeling isn't a clearly defined linear path.  Developing data models involves both progress and revision.  There are clearly elements of both design and engineering, both creativity and careful practice.

If one can work at a large-granularity in the domain to get good abstraction, modularity, and layering, then design and implementation can be iteratively pursued, e.g., with prototypes and increasingly refined versions of the model.

Designing XML Documents

Once there is a data model, documents or datasets can be based on it.  The line between data, documents, and programs is blurring, even to the point that we sometimes speak of XML documents and XML data interchangeably.  XML can be used to give structure and format for both persistent data and messages.  Persistent data would be stored in files for long-term reference.  Messages would be used for intermediate representations or communications between programs.  Static models may be more directly applicable to persistent data and dynamic models may be more directly applicable to messages.  Some of the document design issues are the same for either type of data and some of the design issues are different.

There can be significant advantages to adopting standard documents as defined or in a slightly modified form.  Some of the sites for standardization efforts based on XML include:
    http://www.biztalk.org/
    http://www.oasis-open.org/
    http://www.xml.org/
    http://www.ontology.org/
    http://www.rosettanet.org/
    http://www.commercenet.org/
    http://www.omg.org/
    http://www.opengis.org/
    http://www.ebxml.org

Regardless of the document type, the document design should focus on the information content and not the intended use.  People and organizations will quickly evolve new processes for valuable data.

There often are advantages to segmenting documents and data into relatively separate modular components.  Documents, like the underlying data, may be hierarchical.

The document or data design should anticipate change.  Consider using version numbers and document control tools.

Make meaningfulness a high priority.  For example, avoid codes where full names will work and use pre-existing identifiers rather than creating new ones.

Message Data

It is useful to have a message wrapper that is defined separate from the body of the message and that contains information about the sender, intended recipient(s), date/time, unique ID, and version number.
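A message wrapper of this kind might be built as in the following Python sketch, which uses the standard xml.etree.ElementTree module.  The element names (Envelope, Header, Sender, Recipient, Timestamp, MessageId, Body), the addresses, and the function name are assumptions for illustration and do not follow any particular standard.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from typing import List


def wrap_message(body: ET.Element, sender: str, recipients: List[str],
                 version: str = "1.0") -> ET.Element:
    """Wrap a message body in a separately defined envelope (hypothetical layout)."""
    envelope = ET.Element("Envelope", {"version": version})
    header = ET.SubElement(envelope, "Header")
    ET.SubElement(header, "Sender").text = sender
    for recipient in recipients:
        ET.SubElement(header, "Recipient").text = recipient
    ET.SubElement(header, "Timestamp").text = datetime.now(timezone.utc).isoformat()
    ET.SubElement(header, "MessageId").text = str(uuid.uuid4())
    # The body is kept apart from the header so either can change independently.
    ET.SubElement(envelope, "Body").append(body)
    return envelope


order = ET.Element("Order")
envelope = wrap_message(order, "sales@example.com",
                        ["warehouse@example.com", "billing@example.com"])
```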

In a few applications, there may be serious constraints on the size of messages, but this is not true of most applications.  A more common question is whether to have one large message or several smaller messages.

Persistent Data

Persistent data may be stored using the file system, a relational or object-oriented database system, or directory services.

File systems are familiar and simple and have mechanisms for shared access and protection, but have limited support for searching, merging, and otherwise processing file contents.

Relational databases such as SQL-based software are widely popular and have better mechanisms for searching, merging, and processing data.  Database vendors are adding support for XML.  Using databases with XML is discussed later in the course.

Linking and indexing of documents is important.  Also, directory services are an emerging area and there are many developing XML approaches to distributed information including Directory Services Markup Language (DSML), RDF, and UDDI.  Some of these will be discussed later in the course.

In retrieving information from document stores, there is a two-step problem:  first, finding the document and then finding the desired information in the document.  An issue with any storage system for persistent data is the size of the data.  In some applications, a big document with all data might be preferable (to avoid needing to retrieve many small pieces).  In other applications, small documents might be preferable (to allow retrieval of just what is needed).  In some ways, XML and the Internet in general are better suited to smaller units of data.

Mapping Data and Documents to XML

There are a number of issues in mapping data and documents to the particular technology of XML.

One decision is how to compose names of elements and properties.  An XML implementation can use longer or shorter names.  One can use more descriptive or more general names, such as BillingAddress or Address.  Even the combination of characters presents choices, for example billingAddress, BillingAddress, or billing.address.

Although entities and groups are designed more as a tool for physical preparation and not so much for logical purposes, they also pose document design issues, particularly if they are imported from an external source.

The choice of making something an attribute or a sub-element can sometimes be an issue in mapping data to XML.  It is an issue that may seem vexing even when either decision may work fine.  For example, one could make deposit a sub-element of payment or have a payment type attribute and have deposit as a possible value.  In general, making things attribute types instead of sub-element types is more restrictive and limiting.  Sometimes restrictions and limits are good and sometimes they aren't.
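The two alternatives for the deposit example can be compared side by side.  The Python sketch below builds both forms with the standard xml.etree.ElementTree module; the amount and the attribute name "type" are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Option 1: deposit as a sub-element of payment.
as_element = ET.Element("payment")
ET.SubElement(as_element, "deposit").text = "100.00"

# Option 2: a payment "type" attribute with "deposit" as one possible value.
as_attribute = ET.Element("payment", {"type": "deposit"})
as_attribute.text = "100.00"

element_form = ET.tostring(as_element, encoding="unicode")
# <payment><deposit>100.00</deposit></payment>
attribute_form = ET.tostring(as_attribute, encoding="unicode")
# <payment type="deposit">100.00</payment>
```

The sub-element form leaves room for deposit to gain its own children or attributes later, while the attribute form limits it to a single string value.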

Such decisions are often very domain and application specific, so general guidelines can be difficult to promulgate.  The textbook offers some guiding questions in this.
    Is the data flat or hierarchical?
        In general, if it is flat, then use attributes and otherwise use elements.
    Is there a need to order child information items?
        If there is, then use elements because attribute order is arbitrary.
    Is the data information about content or content itself?
        For content, use elements; for metadata use attributes.
    Are changes expected?
        Elements more easily accommodate significant changes.
    Is the data for humans or programs?
        Elements may be easier for humans.
        Attributes may be easier for applications (and require less space).
    Does the type have an enumeration of values?
        Enumerated attribute values can be convenient.
    Does the data have multiple values?
        One can have multiple instances of an element, but not attributes.
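The last guideline follows from XML's well-formedness rules: an element may be repeated, but an attribute name may appear only once on a given element.  The Python sketch below demonstrates this with the standard parser; the customer and phone names and the phone numbers are assumptions for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Multiple values as repeated elements: accepted, and document order is kept.
repeated = ET.fromstring(
    "<customer><phone>555-0100</phone><phone>555-0101</phone></customer>"
)
phones = [p.text for p in repeated.findall("phone")]

# Multiple values as a repeated attribute: rejected by the parser.
duplicate_attribute_ok = is_well_formed(
    '<customer phone="555-0100" phone="555-0101"/>'
)
```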

Declaring elements and attributes in general scope can improve reusability.  This is discussed further later in this lecture.

In writing DTDs and Schema, one should consider the value of version numbers (e.g., in the root element and even in other elements that are subject to change) and version control.

Namespaces are a particularly valuable feature for modularity, reusability, and data sharing.
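As a small illustration of how namespaces support modularity, the Python sketch below reads a document in which two vocabularies each define an Address element.  The namespace URIs, prefixes, and element content are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs for two independently designed vocabularies.
NS = {
    "bill": "http://example.com/billing",
    "ship": "http://example.com/shipping",
}

doc = ET.fromstring(
    '<order xmlns:bill="http://example.com/billing" '
    'xmlns:ship="http://example.com/shipping">'
    "<bill:Address>1 Main St</bill:Address>"
    "<ship:Address>9 Dock Rd</ship:Address>"
    "</order>"
)

# The same local name, Address, is unambiguous once qualified by a namespace.
billing = doc.find("bill:Address", NS).text
shipping = doc.find("ship:Address", NS).text
```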

In defining property values, there are some developed standards (e.g., many by the ISO).

In representing relationships, there are various tools.  Some of these have been introduced, such as ID, IDREF, IDREFS, Key, and KeyRef.  We will look later at XPointer and other XML technologies.
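The idea behind ID and IDREF can be sketched in Python as a lookup from identifier to element.  Note the assumptions: the standard xml.etree.ElementTree module does not validate against a DTD, so the attributes named id and owner below only play the roles of ID and IDREF by convention, and the bank document is invented for the example.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<bank>"
    '<customer id="c1"><name>Pat</name></customer>'
    '<account owner="c1" number="A-001"/>'
    '<account owner="c1" number="A-002"/>'
    "</bank>"
)

# Index the elements that carry an identifier (the ID role).
by_id = {c.get("id"): c for c in doc.iter("customer")}


def owner_name(account: ET.Element) -> str:
    """Follow the account's owner reference (the IDREF role) to the customer."""
    return by_id[account.get("owner")].find("name").text


owners = [owner_name(a) for a in doc.findall("account")]
```

This represents the one-to-many relationship between a customer and accounts without nesting the accounts inside the customer element.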

Binary data is not supported beyond simple notations.  Base64 encoding can be used to put binary data in the XML data, but binary data is often kept in a separate file or message.
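A minimal Python sketch of the Base64 approach follows.  The element name attachment and its encoding attribute are assumptions for the example; the sample bytes deliberately include values that could not appear directly in XML text.

```python
import base64
import xml.etree.ElementTree as ET

# Sample binary data, including a zero byte and the byte values of '<' and '&'.
raw = bytes([0, 255, 16, 60, 38])

# Encode the bytes as Base64 text and store them as element content.
elem = ET.Element("attachment", {"encoding": "base64"})
elem.text = base64.b64encode(raw).decode("ascii")
xml_text = ET.tostring(elem, encoding="unicode")

# A receiver parses the XML and decodes the text back to the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

Base64 makes the encoded data roughly one third larger than the original, which is one reason binary data is often kept in a separate file or message.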

Local or Global