Data Modeling
The textbook chapter on data modeling is organized in three sections:
Information modeling
Document design
Creating schema
Information Modeling
A model is a description of the operational objects in a system and their relationships with each other in the operation of the system. A vocabulary, that is, a list of well-defined terms, is an important part of a model. The vocabulary should include terms used as nouns, verbs, adjectives, and adverbs.
The development of an information model should focus on the problem domain and the applications required and should be independent of the technology to be used in building the system.
The sorts of questions that one asks about the problem domain and the applications are:
How do those working in the domain think and talk about it?
What do the words in the domain mean?
Who "owns" data and information? Who generates it? Who uses it?
What is the "life-cycle" of information?
Two objectives are:
Developing a precise and unambiguous vocabulary and model. Often, the process of modeling uncovers conflicting views of the domain and different definitions of important terms.
Developing a vocabulary and model that facilitates effective communication.
Static and Dynamic Information Models
There are two main types of information models: static and dynamic.
Static information models focus on states - that is how things are - and focus on subject nouns and adjectives. For example, "a customer can have one or more accounts", "a refund is a transaction", etc.
Dynamic information models focus on processes - that is what happens - and focus on subject-noun, verb, object-noun relationships and adverbs. For example, "an agent opens an account by completing Form 11B and forwarding it to the Accounts Department". In terms of systems approaches, static models are immediately relevant to database design and dynamic models are immediately relevant to message-passing.
In the end, both states and processes must be accommodated and the line between the approaches is not always sharp, so it is ultimately beneficial to consider both types of models. Beginning with a static model allows an initial focus on "things" and the terminology for them, and it is often true that the basic objects are more permanent even as processes change, but the modeler may choose how to begin based on preference and initial knowledge.
Another important early step is to work to define the goals and context of the project. What is the scope and purpose of the system being designed? Who are the decision-makers and stakeholders in the project? What are the pre-existing processes, systems, documents, etc.? What new processes, systems, documents, etc. are expected? Are there resources (e.g., guidelines) outside of the organization (such as from a non-profit group) that can be used?
The field of information modeling is large and there is increasing interest in it. The field provides tools that can be used in this process. The Unified Modeling Language (UML) is a graphical language for expressing program design. It is becoming an industry standard (see the Object Management Group, www.omg.org). Rational Software is home to the initial UML developers, and its Rational Rose is one of the most popular UML products.
Static Information Modeling
A four-step process to static information modeling is presented in the textbook:
It is important to know and understand the things in the domain. Identifying and naming these things is typically easier than defining them precisely. Questions such as:
Is X an instance of this name?
Are X and Y the same thing?
If X changes in this way, is it still an instance of this name?
are useful in developing separate and well-defined types of objects.
For example, an "order" might be
an important thing in a sales application. In that domain, if a caller
to a phone order system orders two items, pays with one charge to a charge-card,
but has the items shipped to two different destinations, is that one order
or two?
Sometimes unique identifiers already exist, such as ISBNs for books and social-security numbers for employees, that are helpful in "naming" things.
This is the first of four steps, so it is likely that the resulting vocabulary of objects and definitions will be revised with subsequent steps. This is not a simple step-by-step process - preliminary elements of the model will be refined during later work.
In UML, objects are represented with rectangular boxes containing the name of the object.
Organizing Objects in a Class Hierarchy
A taxonomy is an orderly system of classification, typically in a hierarchy related to the objects' characteristics. The term ontology is also used for organizing principles of classification.
The key phrase is "is a kind of". So, for example, an invoice is a kind of communication, a tire is a kind of product, etc. Note that this is different from "is a specific instance of", such as "Tom is a person".
Organizing principles also distinguish between types of objects. For example, services and products are distinct - that is, something is one or the other (or neither), but not both.
In UML, an arrow points from a sub-type to its super-type.
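The difference between "is a kind of" and "is a specific instance of" corresponds roughly to subclassing versus instantiation in an object-oriented language. The following is a minimal Python sketch; the class names come from the examples above, and the name attribute is an assumption made for illustration.

```python
class Product:
    """Super-type: anything sold as a product."""


class Tire(Product):
    """A tire "is a kind of" product, so Tire is a sub-type of Product."""


class Person:
    def __init__(self, name):
        self.name = name  # hypothetical property, for illustration only


# "Tom is a person" is an instance relationship, not a sub-type relationship.
tom = Person("Tom")

print(issubclass(Tire, Product))  # kind-of relationship: True
print(isinstance(tom, Person))    # instance-of relationship: True
print(issubclass(Tire, Person))   # unrelated types: False
```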
Identifying and Specifying Relationships
Objects have relationships with other objects. For example, there might be a relationship between customers and a scheduled flight. This relationship could be expressed in either direction: customers can hold a ticket on zero or more flights and a flight would have zero or more customers holding tickets.
Cardinality is an important aspect
of many kinds of relationships. UML
uses a range notation at each end
of a link between related objects. For example:
1..1
means one and only one object in the relationship
0..n
means zero to n objects in the relationship
1..n
means one to n objects in the relationship
These notations can define one-to-one,
one-to-many, and many-to-many relationships.
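To make the range notation concrete, the sketch below checks whether a given number of related objects fits a range. It is only an illustration: the function names are hypothetical, and "n" is treated as "no upper limit", which is an assumption about how the notation is meant.

```python
def parse_range(notation):
    """Split a range such as "1..n" into (minimum, maximum).

    A maximum of None stands for "n", treated here as unbounded (an assumption).
    """
    low, high = notation.split("..")
    return int(low), (None if high == "n" else int(high))


def satisfies(notation, count):
    """Return True if `count` related objects fit within the range."""
    minimum, maximum = parse_range(notation)
    return count >= minimum and (maximum is None or count <= maximum)


print(satisfies("1..1", 1))  # exactly one object: True
print(satisfies("1..n", 0))  # at least one is required: False
print(satisfies("0..n", 0))  # zero is allowed: True
```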
There are general patterns for relationships, including two types of containment: aggregation and composition. For example, a special offer might be defined as an aggregation of products, each of which has its own separable identity. A hotel is composed of rooms, each of which exists only as part of the hotel. UML uses a diamond at the container end of the relationship.
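One way to picture the two kinds of containment is through object ownership. In the Python sketch below (all class and attribute names are hypothetical), a hotel creates its own rooms, while a special offer only refers to products that exist independently of it.

```python
class Room:
    def __init__(self, hotel, number):
        self.hotel = hotel    # a room always belongs to exactly one hotel
        self.number = number


class Hotel:
    """Composition: rooms are created by, and exist only as part of, the hotel."""

    def __init__(self, name, room_count):
        self.name = name
        self.rooms = [Room(self, n) for n in range(1, room_count + 1)]


class Product:
    def __init__(self, name):
        self.name = name


class SpecialOffer:
    """Aggregation: the offer groups products that have their own identity."""

    def __init__(self, products):
        self.products = list(products)


tire = Product("tire")
spring_offer = SpecialOffer([tire])
summer_offer = SpecialOffer([tire])   # the same product can appear in two offers
hotel = Hotel("Example Hotel", 3)     # the rooms come into being with the hotel
```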
It is also possible to use relationship words to describe a relationship more precisely. For example, a line-item "is contained on" an invoice and an invoice "is composed of" line-items (among other things). These relationship words can be placed at either end of the link between related objects.
Other notations can be used in diagrams, such as names of unique identifiers associated with relationships, but a graph with too much information can be difficult to read.
Detailing Properties of Objects, Classes, and Relationships
Properties of objects are needed to capture other important information. For example, an airline flight would have properties including time and date of departure, a part might have a serial number, etc.
In UML, the properties are listed with an indication as to the type of information (e.g., time-date).
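As a rough analogue of listing properties with their types, a Python dataclass declares each property together with a type annotation. The field names and values below are assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Flight:
    flight_number: str    # hypothetical identifier
    departure: datetime   # time and date of departure


flight = Flight("XX100", datetime(2001, 3, 15, 9, 30))
print(flight.departure.isoformat())  # 2001-03-15T09:30:00
```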
Dynamic Modeling
There are a variety of approaches for developing models based on events and actions, including:
Process and workflow models
Data flow models
Object models, interaction diagrams, and life histories
Use cases
Some of the important questions in dynamic modeling are:
How is data created?
How is data passed from one agent to another?
How are operations on data completed?
For example, in an inventory system,
some of the events are:
Ordering
Receipt
Payment
Disposition
Sale
Conversion
Process and Workflow Models
Flowcharts are a popular example from computer programming and other applications for capturing the sequences, loops, and tests involved in activities. Microsoft's Visio is a popular diagramming tool. Rational Rose and other development tools also have dynamic modeling tools.
Data Flow Models
Data flow models are similar to
workflow models, but the emphasis is on how the data changes in the process
rather than on the actions themselves.
Data flow models require good definitions
of the data, whereas workflow models can describe processes with only general
references to the data itself.
Object Models, Interaction Diagrams, and Life Histories
Object models can have dynamic as well as static descriptions. Dynamic elements of object models represent the methods and calls that can be made on the object, including what the object can be called upon to do and what can be done to it. One problem with object modeling of dynamic elements is that events and actions may involve multiple objects.
Object interaction diagrams are used to analyze the exchange of messages between objects. Object interaction diagrams are similar to data flow models, but object-orientation provides a somewhat different, and perhaps finer-grained, perspective. Message-passing considerations, such as are made in object interaction diagrams, may provide insights for distributed systems.
Object life histories trace objects from their creation, through important "life events", to their termination. This is similar to data-flow modeling but with data conceptualized in objects. This process can be useful for examining the completeness of a data model - to test whether the model accurately and completely captures processes.
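An object life history can be sketched as a small table of allowed state transitions, which also gives a simple way to test whether a recorded history is complete and consistent. The states below describe a hypothetical order and are not taken from the textbook.

```python
# Hypothetical life history of an order: each state maps to the states
# that may follow it. Terminal states map to an empty set.
TRANSITIONS = {
    "created": {"paid", "cancelled"},
    "paid": {"shipped", "cancelled"},
    "shipped": {"delivered"},
    "delivered": set(),
    "cancelled": set(),
}


def is_valid_history(states):
    """Check that a history starts at creation and follows allowed transitions."""
    if not states or states[0] != "created":
        return False
    return all(
        nxt in TRANSITIONS.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


print(is_valid_history(["created", "paid", "shipped", "delivered"]))  # True
print(is_valid_history(["created", "shipped"]))                       # False
```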
Choosing a Dynamic Modeling Approach
Dynamic modeling approaches complement
static approaches and complement each other. Different approaches
can be combined.
Modeling isn't a clearly defined
linear path. Developing data models involves both progress and revision.
There are clearly elements of both design and engineering, both creativity
and careful practice.
If one can work at a large-granularity in the domain to get good abstraction, modularity, and layering, then design and implementation can be iteratively pursued, e.g., with prototypes and increasingly refined versions of the model.
Designing XML Documents
Once there is a data model, documents or datasets can be based on it. The line between data, documents, and programs is blurring, even to the point that we sometimes speak of XML documents and XML data interchangeably. XML can be used to give structure and format for both persistent data and messages. Persistent data would be stored in files for long-term reference. Messages would be used for intermediate representations or communications between programs. Static models may be more directly applicable to persistent data and dynamic models may be more directly applicable to messages. Some of the document design issues are the same for either type of data and some of the design issues are different.
There can be significant advantages
to adopting standard documents as defined or in a slightly modified form.
Some of the sites for standardization efforts based on XML include:
http://www.biztalk.org/
http://www.oasis-open.org/
http://www.xml.org/
http://www.ontology.org/
http://www.rosettanet.org/
http://www.commercenet.org/
http://www.omg.org/
http://www.opengis.org/
http://www.ebxml.org
Regardless of the document type, the document design should focus on the information content and not the intended use. People and organizations will quickly evolve new processes for valuable data.
There often are advantages to segmenting documents and data into relatively separate modular components. Documents, like the underlying data, may be hierarchical.
The document or data design should anticipate change. Consider using version numbers and document control tools.
Make meaningfulness a high priority. For example, avoid codes where full names will work and use pre-existing identifiers rather than creating new ones.
Message Data
It is useful to have a message wrapper that is defined separate from the body of the message and that contains information about the sender, intended recipient(s), date/time, unique ID, and version number.
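A message wrapper of this kind might look like the following sketch, written with Python's standard xml.etree.ElementTree module. The element names (Envelope, Header, Body, and so on), the addresses, and the helper function are all assumptions chosen for illustration, not a standard format.

```python
import uuid
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def wrap_message(sender, recipients, body, version="1.0"):
    """Wrap a body element in a hypothetical envelope kept separate from it."""
    envelope = ET.Element("Envelope", {"version": version})
    header = ET.SubElement(envelope, "Header")
    ET.SubElement(header, "MessageId").text = str(uuid.uuid4())  # unique ID
    ET.SubElement(header, "Sender").text = sender
    for recipient in recipients:
        ET.SubElement(header, "Recipient").text = recipient
    ET.SubElement(header, "Sent").text = datetime.now(timezone.utc).isoformat()
    ET.SubElement(envelope, "Body").append(body)
    return envelope


order = ET.Element("Order")
ET.SubElement(order, "Item").text = "tire"
message = wrap_message("sales@example.com", ["warehouse@example.com"], order)
print(ET.tostring(message, encoding="unicode"))
```

Because the wrapper is defined apart from the body, the same envelope can carry any kind of body document.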
Of course, in some (but very few)
applications, there may be serious constraints on the size of messages,
but this is not true of most applications. A more common question
is whether to have one large
message or several smaller messages.
Persistent Data
Persistent data may be stored using
the file system, a relational or
object-oriented database system,
or directory services.
File systems are familiar and simple
and have mechanisms for shared access and protection, but have limited
support for searching, merging,
and otherwise processing file contents.
Relational databases such as SQL-based software are widely popular and have better mechanisms for searching, merging, and processing data. Database vendors are adding support for XML. Using databases with XML is discussed later in the course.
Linking and indexing of documents is important. Also, directory services are an emerging area and there are many developing XML approaches to distributed information including Directory Services Markup Language (DSML), RDF, and UDDI. Some of these will be discussed later in the course.
In retrieving information from document stores, there is a two-step problem: first, finding the document and then finding the desired information in the document. An issue with any storage system for persistent data is the size of the data. In some applications, a big document with all data might be preferable (to avoid needing to retrieve many small pieces). In other applications, small documents might be preferable (to allow retrieval of just what is needed). In some ways, XML and the Internet in general are better suited to smaller units of data.
Mapping Data and Documents to XML
There are a number of issues in mapping data and documents to the particular technology of XML.
One decision is how to compose names of elements and properties. An XML implementation can use longer or shorter names. One can use more descriptive or more general names, such as BillingAddress or Address. Even the combination of characters presents choices, for example billingAddress, BillingAddress, or billing.address.
Although entities and groups are designed more as a tool for physical preparation and not so much for logical purposes, they also pose document design issues, particularly if they are imported from an external source.
The choice of making something an
attribute or a sub-element can sometimes be an issue in mapping data to
XML. It is an issue that may seem vexing even when either decision
may work fine. For example, one could make deposit a sub-element
of payment or have a payment type attribute and have deposit as a possible
value. In general, making things
attribute types instead of sub-element
types is more restrictive and limiting. Sometimes restrictions and
limits are good and sometimes they aren't.
Such decisions are often very domain
and application specific, so general guidelines can be difficult to promulgate.
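The deposit example can be written out both ways. The sketch below parses each form with Python's standard xml.etree.ElementTree module; the element and attribute names follow the example above, while the amounts and the helper function are assumptions.

```python
import xml.etree.ElementTree as ET

# Option 1: deposit is a sub-element of payment.
as_element = ET.fromstring("<payment><deposit>100.00</deposit></payment>")

# Option 2: payment has a type attribute with deposit as a possible value.
as_attribute = ET.fromstring('<payment type="deposit">100.00</payment>')


def is_deposit(payment):
    """Recognize a deposit in either representation."""
    return payment.find("deposit") is not None or payment.get("type") == "deposit"


print(is_deposit(as_element), is_deposit(as_attribute))  # True True
```

The sub-element form leaves room for further structure inside deposit, whereas the attribute form can only hold a single string value.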
The textbook offers some guiding questions for this decision:
Is the data flat or hierarchical?
In general, if it is flat, then use attributes and otherwise use elements.
Is there a need to order child information items?
If there is, then use elements because attribute order is arbitrary.
Is the data information about content or content itself?
For content, use elements; for metadata, use attributes.
Are changes expected?
Elements more easily accommodate significant changes.
Is the data for humans or programs?
Elements may be easier for humans.
Attributes may be easier for applications (and require less space).
Does the type have an enumeration of values?
Enumerated attribute values can be convenient.
Does the data have multiple values?
One can have multiple instances of an element, but an attribute can appear only once on a given element.
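Two of these guidelines, ordering and multiple values, can be checked directly with an XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the order and item names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Repeated child elements are allowed, and their document order is preserved.
order = ET.fromstring("<order><item>tire</item><item>wheel</item></order>")
items = [item.text for item in order.findall("item")]
print(items)  # ['tire', 'wheel']

# Repeating an attribute name on one element is not well-formed XML,
# so the parser rejects the document.
try:
    ET.fromstring('<order item="tire" item="wheel"/>')
    duplicate_allowed = True
except ET.ParseError:
    duplicate_allowed = False
print(duplicate_allowed)  # False
```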
Declaring elements and attributes in general scope can improve reusability. There is more on this later in this lecture.
In writing DTDs and Schema, one should consider the value of version numbers (e.g., in the root element and even in other elements that are subject to change) and version control.
Namespaces are a particularly valuable feature for modularity, reusability, and data sharing.
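As an illustration of how namespaces support modularity, the sketch below mixes two vocabularies that both define an Address element and keeps them distinct. The namespace URIs, prefixes, and addresses are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URIs, used only for illustration.
NS = {
    "bill": "http://example.com/billing",
    "ship": "http://example.com/shipping",
}

doc = ET.fromstring(
    '<order xmlns:bill="http://example.com/billing" '
    'xmlns:ship="http://example.com/shipping">'
    "<bill:Address>1 Main St</bill:Address>"
    "<ship:Address>9 Dock Rd</ship:Address>"
    "</order>"
)

# The same local name, Address, is resolved separately in each namespace.
billing = doc.find("bill:Address", NS).text
shipping = doc.find("ship:Address", NS).text
print(billing, "|", shipping)  # 1 Main St | 9 Dock Rd
```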
In defining property values, there are some developed standards (e.g., many by the ISO).
In representing relationships, there are various tools. Some of these have been introduced, such as ID, IDREF, IDREFS, Key, and KeyRef. We will look later at XPointer and other XML technologies.
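The idea behind ID and IDREF can be sketched by resolving a reference by hand. In the Python example below, the id and owner attributes only play the roles of ID and IDREF: no DTD or schema is attached, so the parser itself does not enforce uniqueness or check the reference. All element and attribute names are hypothetical.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<bank>"
    '<customer id="c1"><name>Tom</name></customer>'
    '<account owner="c1"><balance>250.00</balance></account>'
    "</bank>"
)

# Index every element carrying an id attribute (the role of ID).
by_id = {el.get("id"): el for el in doc.iter() if el.get("id") is not None}

# Follow the owner attribute (the role of IDREF) to the element it names.
account = doc.find("account")
owner = by_id[account.get("owner")]
print(owner.find("name").text)  # Tom
```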
Binary data is not supported beyond simple notations. Base64 encoding can be used to put binary data in the XML data, but binary data is often kept in a separate file or message.
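A minimal sketch of the Base64 approach, using Python's standard base64 and xml.etree.ElementTree modules, is shown below. The attachment element and its encoding attribute are hypothetical names.

```python
import base64
import xml.etree.ElementTree as ET

raw = bytes([0x00, 0xFF, 0x10, 0x80])  # arbitrary binary data

# Encode the bytes as Base64 text so they can be carried as element content.
attachment = ET.Element("attachment", {"encoding": "base64"})
attachment.text = base64.b64encode(raw).decode("ascii")
xml_text = ET.tostring(attachment, encoding="unicode")

# The receiver parses the XML and decodes the text back to the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
print(decoded == raw)  # True
```

Base64 text is roughly one third larger than the binary data it encodes, which is one reason large binary data is often kept in a separate file or message.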
Local or Global