DTD
ch06

Originally based on XML Visual Quickstart Guide by Kevin Howard Goldberg, and notes therefrom by Jack Davis (jcdavis@radford.edu).

We've seen quick examples of some standardized file-formats using their own variant of XML: iTunes collections, .svg files, Word documents. We also saw a made-up set of tags about children, as well as the book's made-up set of tags about ancient wonders. Actually, even the “standardized” file formats are just the work of somebody who originally made up tags to describe the hierarchical information they wanted to represent. But if the tags are made-up, who's to declare if people are using them correctly?

When defining a new XML language, you must specify its grammar — what tags are allowed, what child-tags they may (or, must) contain. Similarly, for attributes — what attributes are required on certain tags (like “img” tags must have a “src” attribute and otherwise be empty), and what the allowed values are for attributes.

In fact, you can compare any XML document to its corresponding grammar (its schema) to validate whether it conforms to the rules specified there. If an XML document is deemed valid, then its data is in the proper form as specified by the schema. (Of course, just as a syntax-checker can validate a Java program's syntax but not its meaning, a schema won't validate an XML document's logical content.) This isn't as much of a problem for XML documents (especially small ones) as for programs. For databases implemented as large XML files, people might build other ad hoc tools to run sanity checks on an XML document's meaning: checking that a wonder's year-destroyed isn't less than its year-built, that links actually resolve, etc.

There are two common formats for specifying XML grammars: DTDs and XML Schema. A DTD, “Document Type Definition”, is an older but widely used system with a peculiar and limited syntax. However, DTDs are lightweight: compact and easily comprehended with a little study. Since they are relatively simple and still widely used, studying them is a good first step in understanding XML tag set definition. A DTD is a plain-text file but is not itself an XML document, so it does not need to begin with the standard XML declaration.

Pros of using DTDs
- They are lightweight (compact, easily comprehended, but limited expressiveness).
- They can be defined inline (internal DTDs) for quick development.
- They can define entities.
- They are perhaps the most widely accepted and are supported by most XML parsers.
Cons of using DTDs
- They are not written using XML syntax, and require parsers to support an additional language.
- They do not support namespaces.
- They do not have data typing (requiring data to be an integer, a string, or a date, etc.), decreasing the strength of validation.
- They have limited capacity to define how many child elements can nest within a given parent element.

The three things a DTD specifies

elements (tags)
attributes for tags
entities

An example file: the textbook's ch06-wonders.dtd, as referenced in ch06-wonders.xml (do a view-source, line 3)

Where to put the DTD file

You can have the DTD as an external file, or in-document (similar to providing css-files).

In-document, inside the same file as the XML:

<!DOCTYPE ancient_wonders [
  <!ELEMENT ancient_wonders (wonder*)>
  <!ELEMENT wonder (name+, …)>
  ⋮
]>

This is the lightest-weight solution, and is suitable for the homework assignment.
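
As a minimal sketch of a complete file with an in-document DTD: the content model for wonder here is deliberately simplified, and the location element is an assumption made up for illustration, not taken from the textbook's file.

```xml
<?xml version="1.0"?>
<!-- The name after DOCTYPE must match the root element. -->
<!DOCTYPE ancient_wonders [
  <!-- Root element: zero or more wonder children. -->
  <!ELEMENT ancient_wonders (wonder*)>
  <!-- Hypothetical, simplified content model for illustration. -->
  <!ELEMENT wonder (name+, location?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT location (#PCDATA)>
]>
<ancient_wonders>
  <wonder>
    <name>Colossus of Rhodes</name>
    <location>Rhodes, Greece</location>
  </wonder>
  <wonder>
    <name>Great Pyramid of Giza</name>
  </wonder>
</ancient_wonders>
```

If you happen to have libxml2's xmllint installed, something like xmllint --valid --noout yourfile.xml should report whether the document conforms to its DTD (the file name is a placeholder).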

For an external document file on your own computer, you can use SYSTEM: start your XML file with a DOCTYPE specifying where the DTD is found, e.g. <!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">. This is what the above ch06-wonders.xml does (remember to view-source, line 3).
An external document on the interwebs, using PUBLIC: <!DOCTYPE ancient_wonders PUBLIC "-//ibarland//DTD Archaeological Wonders 1.0//EN" "https://php.radford.edu/~itec325/2016spring-ibarland/Lectures/wonders.dtd">. See here for the syntax of the word after “PUBLIC”. (Please do not use this approach for your homework; I want to be able to grade it using only what is submitted on D2L.)
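
To make the SYSTEM case concrete, here is a rough sketch of the XML side, with the assumed contents of the external file described in a comment. The simplified declarations are illustrative assumptions, not the textbook's actual wonders DTD.

```xml
<?xml version="1.0"?>
<!-- wonders.xml: the DTD lives in a separate file, wonders.dtd, in the same folder.
     That file would hold only declarations (no DOCTYPE wrapper), for example:
       <!ELEMENT ancient_wonders (wonder*)>
       <!ELEMENT wonder (name+)>
       <!ELEMENT name (#PCDATA)>
-->
<!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">
<ancient_wonders>
  <wonder><name>Lighthouse of Alexandria</name></wonder>
</ancient_wonders>
```

The quoted string after SYSTEM is resolved relative to the XML file's own location, so a bare file name means "next to this document".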

Defining your XML: Elements, Attributes, and Entities

Defining Elements (tags)
Elements are the foundational units of an XML document. They can contain values, have attributes, and they can contain other elements. A DTD for a given custom markup language (tag set) will define a list of elements and any child elements that each element can have. It will define any attributes that each element can have, and it will define whether these elements and attributes are optional or required.
- Defining an element that only contains text:
  <!ELEMENT tag (#PCDATA) > — where tag is the element name
  PCDATA stands for “parsed character data”, and it refers to the text value of an element; “parsed” meaning that it can contain entities.
  An element that is defined to contain PCDATA can't contain any other element (no sub-tags).
  Example:
  <!ELEMENT name (#PCDATA) >
  <!ELEMENT history (#PCDATA) >
- Defining an empty element:
  An empty element is an XML element that does not have any body of its own. (It may use its attributes to store data, though.)
  <!ELEMENT tag EMPTY > — where tag is the element name
  
  Example:
  <!ELEMENT br EMPTY >
- Defining an element that contains a child element:
  1. Tags with required child nodes: <!ELEMENT tag (child1) > — where child1 is another element-name
    <!ELEMENT tag (child1, child2) > — where child1 and child2 are element names
    Example: <!ELEMENT html (head,body) >
  2. Optional and repeatable child nodes: <!ELEMENT tag (child1*, child2+, child3?) >
    You can define the number of times a child element may appear in an XML document: (* : 0 or more), (+ : 1 or more), (? : 0 or 1).
    
    For example:
    <!ELEMENT pet (name+, nickname*, vet?) >
    This says that a pet tag can have one or more proper names, any number of nicknames, and (possibly) one current vet (all in that order).
  3. You can also use quantifiers to define the number of occurrences for a sequence of tags, by putting the quantifier after a parenthesized group, as in (child1, child2)*.
    There is no special way to define a specific quantity of an element (for example, exactly 3 occurrences). The one rather awkward way to do it is to repeat the element name:
    <!ELEMENT tag (child1, child1, child1) >
- Defining choices for child elements:
  <!ELEMENT tag (child1 | child2 | child3) >
  The above declaration says the tag element contains exactly one child, and it must be a child1, child2, or child3 element.
- You can use nested parentheses to provide optional content among groups of subelements. One common pattern is the “list of approved tags” mixed with text; in such mixed content, #PCDATA must be listed first and the group must end with *:
  <!ELEMENT p (#PCDATA|em|strong|ul|ol|br)* >
- Defining a tag that may contain anything:
  While not ideal for creating a structured set of rules, in a DTD, you can define an element to contain anything, meaning it can contain any combination of elements and text. As with mixed content, this is useful if you are creating a DTD to support XML documents from different sources. It may be the only way to define elements you know and allow for element structures you can't anticipate.
  
  <!ELEMENT tag ANY >
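
The element declarations above can be pulled together into one small document. This is only a sketch: pet, name, nickname and vet come from the example above, while pets, dog, cat, visit, outcome, note and all the data are hypothetical names invented for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE pets [
  <!-- One or more pet children. -->
  <!ELEMENT pets (pet+)>
  <!-- name+ : 1 or more; nickname* : 0 or more; vet? : 0 or 1;
       then exactly one of dog or cat (a nested choice);
       then any number of (visit, outcome) pairs (a quantified group);
       then an optional note. -->
  <!ELEMENT pet (name+, nickname*, vet?, (dog | cat), (visit, outcome)*, note?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT nickname (#PCDATA)>
  <!ELEMENT vet (#PCDATA)>
  <!-- Empty elements: no body at all. -->
  <!ELEMENT dog EMPTY>
  <!ELEMENT cat EMPTY>
  <!ELEMENT visit (#PCDATA)>
  <!ELEMENT outcome (#PCDATA)>
  <!-- Mixed content: text interleaved with em; #PCDATA comes first. -->
  <!ELEMENT note (#PCDATA | em)*>
  <!ELEMENT em (#PCDATA)>
]>
<pets>
  <pet>
    <name>Rex</name>
    <nickname>Rexy</nickname>
    <nickname>T-Rex</nickname>
    <dog/>
    <note>Afraid of <em>thunder</em>.</note>
  </pet>
  <pet>
    <name>Tom</name>
    <vet>Dr. Hypothetical</vet>
    <cat/>
    <visit>2016-03-01</visit>
    <outcome>healthy</outcome>
  </pet>
</pets>
```

Swapping the order of two children (say, putting nickname before name) would leave the file well-formed but no longer valid, since a comma-separated content model fixes the order.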

Defining Attributes

Attributes are useful to provide additional data about an element. Information contained in attributes tends to be about the content of the XML document, as opposed to being the content itself. General best practices suggest that elements are better used for information you want to display. Attributes are better used for information about information. Some of the reasons are: attributes cannot describe data relationships like child elements can, their values are not as easily validated by a DTD, and they cannot contain multiple values whereas child elements can. Attributes are often used with empty elements where they describe information about the element. For example, they are often used to store IDs, as attributes are not the data, but information about the data.

Defining a standard element attribute:

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units CDATA "feet">
<!-- The default value of the units attribute is "feet"; a document can still
     give a different value, as in units="meters". -->

Syntax: <!ATTLIST tag attrName CDATA default-or-status >
Here tag is the name of the element this attribute occurs in; attrName is the attribute's name; CDATA indicates that the value is (unparsed) character data; default-or-status is one of:

A default value (string) for that attribute;
the token #REQUIRED to indicate that it's required (but there is no default);
the token #IMPLIED, meaning "optional" (and no default).
(uncommon:) #FIXED "the-only-allowed-value", to force the author to use only one possible value. This can also be achieved via the more flexible "enumerated values", next; I'm not sure what the intent of this particular approach is.
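
A small sketch showing all four forms side by side. The attribute names measured_by, note and version, and the wrapping wonder element, are made up for this illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (height)>
  <!ELEMENT height (#PCDATA)>
  <!-- Several attributes can be declared in a single ATTLIST. -->
  <!ATTLIST height
            units       CDATA "feet"
            measured_by CDATA #REQUIRED
            note        CDATA #IMPLIED
            version     CDATA #FIXED "1.0">
]>
<wonder>
  <!-- Valid: measured_by is present; units and version are omitted,
       so the parser supplies units="feet" and version="1.0". -->
  <height measured_by="survey">37</height>
</wonder>
```

Writing version="2.0" in the document would make it invalid; in that sense #FIXED behaves like a default that the document is not allowed to override.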

Enumerating the attribute's choices:
Instead of allowing any “CDATA” as an attribute-value, you can restrict it to an enumerated list of possible values:

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units (inches | feet) #REQUIRED>
<!-- the value of units must be either inches or feet -->

Defining attributes that refer to another tag's ID attribute:
Suppose you want to restrict an attribute's value to have to reference (the ID attribute of) another element in the document (vaguely like a foreign key):

<!ELEMENT special_site (title,url)>
<!ATTLIST special_site wonder_focus IDREF #REQUIRED >

A document could then contain:

<special_site wonder_focus="w_143">

Use “IDREF” to define an attribute that can contain a value matching any existing ID attribute's value. Use “IDREFS” to define an attribute that can contain several white-space-separated values which match any existing ID attribute's value. There may be several IDREF attributes that refer to the same ID. But of course, the ID itself must be unique to one element.
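
For the reference to resolve, some element has to declare an attribute of type ID. A minimal sketch, assuming an id attribute on wonder and a hypothetical related attribute of type IDREFS (neither name, nor the example URL, is from the textbook):

```xml
<?xml version="1.0"?>
<!DOCTYPE ancient_wonders [
  <!ELEMENT ancient_wonders (wonder+, special_site*)>
  <!ELEMENT wonder (name)>
  <!-- An ID attribute: each value must be unique in the whole document. -->
  <!ATTLIST wonder id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT special_site (title, url)>
  <!-- IDREF: one existing ID; IDREFS: several, separated by white space. -->
  <!ATTLIST special_site
            wonder_focus IDREF  #REQUIRED
            related      IDREFS #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT url (#PCDATA)>
]>
<ancient_wonders>
  <wonder id="w_143"><name>Colossus of Rhodes</name></wonder>
  <wonder id="w_144"><name>Great Pyramid of Giza</name></wonder>
  <special_site wonder_focus="w_143" related="w_143 w_144">
    <title>A hypothetical site about the Colossus</title>
    <url>http://example.com/colossus</url>
  </special_site>
</ancient_wonders>
```

ID values must themselves be valid XML names, which is why the example uses w_143 rather than a bare 143.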
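Restricting Attributes to Name Tokens:
DTDs don't allow for much data typing, but there is one more restriction that you can apply to attributes. The value of an attribute defined as the NMTOKEN type must be a single XML name token: one or more name characters (letters, digits, and the punctuation '.', '-', '_', ':'), with no white space. Unlike an element name, a name token may start with a digit.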

<!ELEMENT w_visit EMPTY>
<!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>

To keep the primary_keyword attribute to just one word (with no white space) it can be defined to be the NMTOKEN type.
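
A quick sketch of what this accepts and rejects (the keyword values are invented for illustration):

```xml
<?xml version="1.0"?>
<!DOCTYPE w_visit [
  <!ELEMENT w_visit EMPTY>
  <!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>
]>
<!-- Valid: a single token with no white space. -->
<w_visit primary_keyword="great-pyramid"/>
<!-- Invalid as NMTOKEN: "great pyramid" (contains a space). -->
```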

Defining Entities
- In the DTD:
  <!ENTITY entName "content">  <!-- general form -->
  <!ENTITY wow "Wonders o’ <em>the</em> World">  <!-- example -->
- Using General Entities
  
  <story> The oldest of the &wow;, the Great Pyramid, is … </story>
  will render as “The oldest of the Wonders o’ the World, the Great Pyramid, is…”.
For some further info, see an older lecture.
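
Putting the entity declaration and its use into one file gives roughly the following sketch. The content model for story is an assumption: since the entity's replacement text contains an em element, story has to allow em for the document to be valid.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE story [
  <!-- story must permit em, because the wow entity expands to markup. -->
  <!ELEMENT story (#PCDATA | em)*>
  <!ELEMENT em (#PCDATA)>
  <!ENTITY wow "Wonders o’ <em>the</em> World">
]>
<story>The oldest of the &wow; is the Great Pyramid.</story>
```

The parser expands &wow; before the application sees the content, so the story element ends up with text, an em child, and more text.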

Design Issues

Digression: tags vs. attributes

When specifying a height, which of the following do you like/dislike? Why?

  <wonder height = "37 feet">…</wonder>

  <wonder>
    <height>37 feet</height>
    …
  </wonder>

  <height>
    <measure units="meters">11.8</measure>
    <measure units="feet">37</measure>
  </height>

  <height>
    <feet>37</feet>
    <meters>11.8</meters>
  </height>

  <height measure="37">
    <units>feet</units>
  </height>

  <height>
    37
    <units>feet</units>
  </height>

  <height>
    <measure>37</measure>
    <units>feet</units>
  </height>

  <height units="feet">
    <measure>37</measure>
  </height>

  <height units="feet">
    37
  </height>

A common question, when designing an XML language, is when to use a (nested) tag, vs. an attribute on the tag. The rule-of-thumb is “data as tags; metadata as attributes”. For example, in XHTML, the img tag has the filename as an attribute, because the image is the data; the filename is information about how to find the (real) data. A couple of guidelines:

Does the additional info itself contain sub-information? Then you must use a tag element.
Most common case: markup. For example, if a caption may contain emphasized text, you can't do that with an attribute.
Does the info's order matter, relative to its sibling information? Can there be two such pieces of info? If so, you must use tag elements. (Attributes must be unique, and order doesn't matter.)
For example, if a book can have multiple authors (and order matters), you can't do that with an attribute.
Does this represent data, or meta-data (i.e. data about data)? E.g. the fact that the data 37 is being measured in feet is meta-data.
Put meta-information in attributes, where possible.
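
One way these guidelines might play out, as a sketch only: the book vocabulary below (including the catalog_id attribute and all the data) is hypothetical, chosen to echo the multiple-authors and emphasized-text cases above.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author+)>
  <!-- Information about the record rather than the book's content: an attribute. -->
  <!ATTLIST book catalog_id NMTOKEN #IMPLIED>
  <!-- May contain markup (emphasis), so it has to be an element. -->
  <!ELEMENT title (#PCDATA | em)*>
  <!ELEMENT em (#PCDATA)>
  <!-- Repeatable, and order matters, so it has to be an element. -->
  <!ELEMENT author (#PCDATA)>
]>
<book catalog_id="b-001">
  <title>A <em>Hypothetical</em> Guide</title>
  <author>First Author</author>
  <author>Second Author</author>
</book>
```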

Design practice

Sample exercise:

(a) create a reasonable DTD for census records so that the following file would be legal: ch06-census.xml
(b) Critique any strengths and weaknesses of how that file represents information — what changes would you make to represent census records?


©2015, Ian Barland, Radford University
Last modified 2016.Apr.21 (Thu) Please mail any suggestions
(incl. typos, broken links)
to ibarlandradford.edu