lect14b-ch06-dtd
DTD
ch06

Originally based on XML Visual Quickstart Guide by Kevin Howard Goldberg, and notes therefrom by Jack Davis (jcdavis@radford.edu).

Defining a tag set (that is, an XML grammar) formally is extremely important for keeping XML documents consistent. Generally, these definition documents are referred to as “schemas”. You can compare any XML document to its corresponding schema to validate whether it conforms to the rules specified there. If an XML document is deemed valid, then its data is in the proper form as specified by the schema. (Of course, just as a syntax-checker can validate a Java program's syntax but not its meaning, a schema won't validate an XML document's logical content. This isn't as much of a problem for XML documents (especially small ones) as it is for programs. For databases implemented as large XML files, people might build other ad hoc tools to run sanity checks on the meaning of the contents: check that a wonder's year-destroyed isn't less than its year-built, that links actually resolve, etc.)

There are two principal systems for writing schemas: DTD and XML Schema. A DTD (Document Type Definition) is written in an older but widely used system with a peculiar and limited syntax. However, DTDs are compact and easily comprehended with a little study. Since they are relatively simple and still widely used, studying them is a good first step in understanding XML tag set definition. A DTD is itself a plain-text document, not an XML document, and therefore does not need to begin with the standard XML declaration.

Pros of using DTD's
- They are compact and easily comprehended with a little study.
- They can be defined inline (internal DTD's) for quick development.
- They can define entities.
- They are likely the most widely accepted and are supported by most XML parsers.
Cons of using DTD's
- They are not written using XML syntax, and require parsers to support additional language.
- They do not support Namespaces.
- They do not have data typing (requiring data to be an integer, a string, or a date, etc.), decreasing the strength of validation.
- They have limited capacity to define how many child elements can nest within a given parent element.
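
To make the "inline" and "entities" points concrete, here is a minimal sketch of an XML document that carries its own internal DTD. The element names (wonder, name, location) and the entity name giza are made up for illustration; they are not taken from the book's examples.

```xml
<?xml version="1.0"?>
<!-- Internal DTD: the declarations sit inside the DOCTYPE, in the same file. -->
<!DOCTYPE wonder [
  <!ELEMENT wonder (name, location)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT location (#PCDATA)>
  <!-- A general entity: &giza; expands to the quoted text wherever it is used. -->
  <!ENTITY giza "Giza, Egypt">
]>
<wonder>
  <name>Great Pyramid</name>
  <location>&giza;</location>
</wonder>
```

Any validating parser should accept this document. With libxml2 installed, for instance, saving it under a name of your choosing (say wonder.xml) and running xmllint --valid --noout wonder.xml should report no errors.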

The three things a DTD specifies

- elements (tags)
- attributes for tags
- entities

A common question, when designing an XML language, is when to use a (nested) tag, vs. an attribute on the tag. The rule-of-thumb is “data as tags; metadata as attributes”. For example, in XHTML, the img tag has the filename as an attribute, because the image is the data; the filename is information about how to find the (real) data. Guidelines:

- Does the additional info itself contain sub-information? Then you must use a tag element.
  Most common case: markup. For example, if a caption may contain emphasized text, you can't do that with an attribute.
- Does the info's order matter, relative to its sibling information? Can there be two such pieces of info? If so, you must use tag elements. (Attributes must be unique and order doesn't matter.)
  For example, if a book can have multiple authors (and order matters), you can't do that with an attribute.
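
The two guidelines can be sketched in one small document. Everything here is hypothetical (the book, its authors, and the edition attribute are placeholders): the ordered, repeatable authors and the caption with markup have to be elements, while a single unordered piece of metadata fits in an attribute.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!-- author may repeat, and the order of the author elements is preserved. -->
  <!ELEMENT book (title, author+, caption)>
  <!-- edition is metadata: one value, no internal structure, no ordering. -->
  <!ATTLIST book edition CDATA #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
  <!-- The caption may mix text with emphasized text, so it cannot be an attribute. -->
  <!ELEMENT caption (#PCDATA | em)*>
  <!ELEMENT em (#PCDATA)>
]>
<book edition="2">
  <title>An Example Book</title>
  <author>First Author</author>
  <author>Second Author</author>
  <caption>A <em>very</em> short caption.</caption>
</book>
```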

(All book examples)

Defining Elements
Elements are the foundational units of an XML document. They can contain values, have attributes, and they can contain other elements. A DTD for a given custom markup language (tag set) will define a list of elements and any child elements that each element can have. It will define any attributes that each element can have, and it will define whether these elements and attributes are optional or required.
- Defining an element that only contains text:
  <!ELEMENT tag (#PCDATA) > -- where tag is the element name
  PCDATA stands for “parsed character data”, and it refers to the text value of an element; “parsed” meaning that it can contain entities.
  An element that is defined to contain PCDATA can't contain any other element.
  Example:
  <!ELEMENT name (#PCDATA) >
  <!ELEMENT history (#PCDATA) >
- Defining an empty element:
  An empty element is an XML element that does not have any body of its own. (It may use its attributes to store data, though.)
  <!ELEMENT tag EMPTY > -- where tag is the element name
  
  Example:
  <!ELEMENT br EMPTY >
- Defining an element that contains a child element:
  1. Tags with required child nodes: <!ELEMENT tag (child1) > -- where child1 is an element name
    <!ELEMENT tag (child1, child2) > -- where child1 and child2 are element names
    Example: <!ELEMENT html (head,body) >
  2. Optional and repeatable child nodes: <!ELEMENT tag (child1*, child2+, child3?) >
    You can define how many times a child element may appear: * means 0 or more, + means 1 or more, ? means 0 or 1.
    
    For example:
    <!ELEMENT pet (name+, nickname*, vet?) >
    saying that a tag pet can have one or more proper names, any number of nicknames, and (possibly) one current vet (all in that order).
  3. You can also apply a quantifier to a parenthesized sequence of tags, so that the whole sequence repeats:
    <!ELEMENT tag (child1, child2)+ >
    There is no special way to require a specific number of occurrences of an element (for example, exactly 3). The one rather awkward way to do it is to list the child that many times:
    <!ELEMENT tag (child1, child1, child1) >
- Defining choices for child elements:
  <!ELEMENT tag (child1 | child2 | child3) >
  The above declaration says the tag element contains exactly one child, which must be a child1, a child2, or a child3 element.
- You can combine parentheses, choices, and quantifiers to describe groups of subelements. One common pattern is the "list of approved tags", any number of which may appear in any order:
  <!ELEMENT p (em | strong | ul | ol | br)* >
  (With commas instead of bars, the group would instead require all five tags in exactly that sequence.)
- Defining a tag that may contain anything:
  While not ideal for creating a structured set of rules, in a DTD, you can define an element to contain anything, meaning it can contain any combination of elements and text. As with mixed content, this is useful if you are creating a DTD to support XML documents from different sources. It may be the only way to define elements you know and allow for element structures you can't anticipate.
  
  <!ELEMENT tag ANY >
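
The element declarations above can be seen working together in a single sketch. It reuses the pet example and adds some made-up elements (pets, cat, dog, notes, misc) so that each kind of declaration appears once: text-only, empty, sequence with quantifiers, choice, a list of approved tags mixed with text, and ANY.

```xml
<?xml version="1.0"?>
<!DOCTYPE pets [
  <!-- Repeatable child: the list holds one or more pet elements. -->
  <!ELEMENT pets (pet+)>
  <!-- A sequence with quantifiers, then a choice of exactly one species tag. -->
  <!ELEMENT pet (name+, nickname*, vet?, (cat | dog), notes, misc?)>
  <!-- Text-only elements. -->
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT nickname (#PCDATA)>
  <!ELEMENT vet (#PCDATA)>
  <!-- Empty elements have no body. -->
  <!ELEMENT cat EMPTY>
  <!ELEMENT dog EMPTY>
  <!ELEMENT br EMPTY>
  <!-- List of approved tags: text mixed with any number of em or br tags. -->
  <!ELEMENT notes (#PCDATA | em | br)*>
  <!ELEMENT em (#PCDATA)>
  <!-- ANY: text and any declared elements, in any combination. -->
  <!ELEMENT misc ANY>
]>
<pets>
  <pet>
    <name>Rex</name>
    <nickname>Rexy</nickname>
    <nickname>Mr. R</nickname>
    <dog/>
    <notes>Likes <em>long</em> walks.<br/>Dislikes baths.</notes>
  </pet>
  <pet>
    <name>Tom</name>
    <vet>Dr. Example</vet>
    <cat/>
    <notes></notes>
    <misc>Anything goes here, including <em>declared</em> tags.</misc>
  </pet>
</pets>
```

Swapping the order of name and nickname in either pet, or giving a pet both a cat and a dog child, should make a validating parser reject the document.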

Digression: tags vs. attributes

Which of the following do you like/dislike? Why?

  <wonder height="37 feet">…</wonder>

  <height>
    <measure units="feet">37</measure>
    <measure units="meters">11.8</measure>
  </height>

  <height>37 feet</height>

  <height>
    <measure>37</measure>
    <units>feet</units>
  </height>

  <height units="feet">
    <measure>37</measure>
  </height>

Defining Attributes
Attributes are useful for providing additional data about an element. Information contained in attributes tends to be about the content of the XML document, as opposed to being the content itself. General best practices suggest that elements are better used for information you want to display, and attributes for information about information. Some of the reasons: attributes cannot describe data relationships the way child elements can, their values are not as easily validated by a DTD, and they cannot contain multiple values whereas child elements can. Attributes are often used with empty elements, where they describe information about the element. For example, they are often used to store IDs, since an ID is not the data itself but information about the data.

Defining a standard element attribute:
Syntax: <!ATTLIST tag attrName CDATA #REQUIRED >
(tag is the name of the element this attribute occurs in; attrName is the attribute's name; CDATA indicates that the value is ordinary character data, i.e. any string; the last word is either #REQUIRED, meaning the attribute must be present, or #IMPLIED, meaning it is optional.)
You may define multiple attributes for one element in a single definition statement.
<!ATTLIST tag attr1 CDATA #REQUIRED attr2 CDATA #IMPLIED>
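
As a small sketch of #REQUIRED versus #IMPLIED, the following document declares one attribute of each kind. The names wonders, wonder, country, and language are assumptions chosen for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonders [
  <!ELEMENT wonders (wonder+)>
  <!ELEMENT wonder (#PCDATA)>
  <!-- country must always be present; language may be omitted. -->
  <!ATTLIST wonder country  CDATA #REQUIRED
                   language CDATA #IMPLIED>
]>
<wonders>
  <wonder country="Egypt" language="English">Great Pyramid of Giza</wonder>
  <wonder country="Greece">Colossus of Rhodes</wonder>
</wonders>
```

Removing country from either wonder should produce a validity error, while leaving out language does not.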

Defining Default Attribute Values:

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units CDATA "feet">
  -- default value of feet for attribute units; a document could still give a
  -- different value, as in units="meters"

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units CDATA #FIXED "feet">
  -- fixed value of feet for attribute units; a document cannot give a
  -- different value such as units="meters"
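
Since one element can't declare the same attribute twice, this sketch puts the default on height and the fixed value on a second, made-up element, width, so both behaviours show up in one valid document.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (height+, width)>
  <!ELEMENT height (#PCDATA)>
  <!ELEMENT width (#PCDATA)>
  <!-- Default: units is "feet" unless the document says otherwise. -->
  <!ATTLIST height units CDATA "feet">
  <!-- Fixed: units is always "feet"; any other value is a validity error. -->
  <!ATTLIST width units CDATA #FIXED "feet">
]>
<wonder>
  <!-- No units given, so the parser supplies units="feet". -->
  <height>37</height>
  <!-- A default may be overridden. -->
  <height units="meters">11.3</height>
  <!-- A fixed attribute may be omitted or repeated, but never changed. -->
  <width units="feet">20</width>
</wonder>
```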

Defining Attributes with Choices

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units (inches | feet) #REQUIRED> -- value of units must be either inches or feet
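
A minimal document that exercises this declaration might look like the following sketch; the only thing added to the declarations above is the DOCTYPE wrapper.

```xml
<?xml version="1.0"?>
<!DOCTYPE height [
  <!ELEMENT height (#PCDATA)>
  <!-- units must be present, and its value must be one of the listed choices. -->
  <!ATTLIST height units (inches | feet) #REQUIRED>
]>
<!-- units="meters" would be rejected, because meters is not in the list. -->
<height units="feet">37</height>
```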
Defining Attributes that refer to another tag's ID attribute:
Suppose you want to restrict an attribute's value so that it must reference (the ID attribute of) another element in the document (vaguely like a foreign key):
<!ELEMENT special_site (title,url)>
<!ATTLIST special_site wonder_focus IDREF #REQUIRED >
<special_site wonder_focus="w_143">
Use “IDREF” to define an attribute whose value must match some existing ID attribute's value. Use “IDREFS” to define an attribute that can contain several white-space-separated values, each of which matches an existing ID attribute's value. There may be several IDREF attributes that refer to the same ID. But of course, the ID itself must be unique to one element.
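
An IDREF only makes sense if some element declares an attribute of type ID for it to point at. The sketch below assumes a wonder element with an id attribute (both names are hypothetical) and placeholder example.com URLs, and shows two special_site elements referring to the same wonder.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonders [
  <!ELEMENT wonders (wonder+, special_site*)>
  <!ELEMENT wonder (#PCDATA)>
  <!-- Each id value must be unique within the document. -->
  <!ATTLIST wonder id ID #REQUIRED>
  <!ELEMENT special_site (title, url)>
  <!-- wonder_focus must match the ID of some element in this document. -->
  <!ATTLIST special_site wonder_focus IDREF #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT url (#PCDATA)>
]>
<wonders>
  <wonder id="w_143">Great Pyramid of Giza</wonder>
  <wonder id="w_144">Colossus of Rhodes</wonder>
  <special_site wonder_focus="w_143">
    <title>Pyramid notes</title>
    <url>http://example.com/pyramid-notes</url>
  </special_site>
  <special_site wonder_focus="w_143">
    <title>Pyramid photos</title>
    <url>http://example.com/pyramid-photos</url>
  </special_site>
</wonders>
```

Changing a wonder_focus to a value such as w_999, which no element uses as its ID, should make validation fail, much as a dangling foreign key would.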
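Restricting Attributes to XML Name Tokens:
DTD's don't allow for much data typing, but there is one restriction that you can apply to attributes. The value of an attribute defined as the NMTOKEN type must be a name token: one or more XML name characters (letters, digits, periods, hyphens, underscores, colons) with no white space. (Unlike a full XML name, a name token may begin with a digit.)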

<!ELEMENT w_visit EMPTY>
<!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>

To keep the primary_keyword attribute to just one word (with no white space) it can be defined to be the NMTOKEN type.
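
Here is a short sketch of the NMTOKEN restriction in use. The wrapper element visits and the keyword values are made up for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE visits [
  <!ELEMENT visits (w_visit+)>
  <!ELEMENT w_visit EMPTY>
  <!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>
]>
<visits>
  <w_visit primary_keyword="pyramid"/>
  <!-- A name token may start with a digit. -->
  <w_visit primary_keyword="7wonders"/>
  <!-- primary_keyword="great pyramid" would be invalid: it contains white space. -->
</visits>
```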

lect14b-ch06-census.xml


©2011, Ian Barland, Radford University
Last modified 2011.Dec.09 (Fri) Please mail any suggestions
(incl. typos, broken links)
to ibarlandradford.edu
