DTDs
tags and attributes

These notes are influenced by XML Visual Quickstart Guide by Kevin Howard Goldberg, and by notes therefrom by Jack Davis (jcdavis@radford.edu).

We've seen quick examples of some standardized file-formats using their own variant of XML: iTunes collections, .svg files, Word documents. We also saw a made-up set of tags about children, as well as the book's made-up set of tags about ancient wonders. Actually, even the “standardized” file formats are just the result of somebody originally making up tags to describe the hierarchical information they wanted to represent. But if the tags are made-up, who's to declare whether people are using them correctly?

video: DTD: Introduction; Elements (22m59s)

When defining a new XML language, you must specify its grammar — what tags are allowed, what child-tags they may (or, must) contain. Similarly, for attributes — what attributes are required on certain tags (like “img” tags must have a “src” attribute and otherwise be empty), and what the allowed values are for attributes.

In fact, you can compare any XML document to its corresponding grammar (its schema) to validate whether it conforms to the rules specified there. If an XML document is deemed valid, then its data is in the proper form as specified by the schema. (Of course, just as a syntax-checker can validate a Java program's syntax but not its meaning, a schema won't validate an XML document's logical content.) This is less of a problem for XML documents (especially small ones) than for programs. For databases implemented as large XML files, people might build other ad hoc tools to run sanity checks on the meaning of an XML document's contents: check that a wonder's year-destroyed isn't less than its year-built, that links actually resolve, etc.
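
A minimal sketch of such an ad hoc sanity check, using Python's standard library. The tag names year_built and year_destroyed, and the two sample wonders, are invented for illustration; they are not taken from the textbook's files.

```python
import xml.etree.ElementTree as ET

def find_date_problems(root):
    """Return the names of wonders whose year_destroyed precedes year_built."""
    problems = []
    for wonder in root.iter("wonder"):
        built = wonder.findtext("year_built")          # hypothetical tag name
        destroyed = wonder.findtext("year_destroyed")  # hypothetical tag name
        if built is None or destroyed is None:
            continue  # nothing to compare for this wonder
        if int(destroyed) < int(built):
            problems.append(wonder.findtext("name"))
    return problems

# Invented sample data: a grammar could accept both wonders, but the second
# one makes no sense logically.
SAMPLE = """
<ancient_wonders>
  <wonder><name>Example A</name><year_built>100</year_built><year_destroyed>900</year_destroyed></wonder>
  <wonder><name>Example B</name><year_built>500</year_built><year_destroyed>200</year_destroyed></wonder>
</ancient_wonders>
"""
print(find_date_problems(ET.fromstring(SAMPLE)))
```

A check like this complements validation rather than replacing it: the grammar says which tags may appear where, and the script checks what the values mean.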

There are two common formats for specifying XML grammars: DTDs and XML Schema. A DTD, “Document Type Definition”, is an older but widely used system with a peculiar and limited syntax. However, DTDs are lightweight: compact and easily comprehended with a little study. Since they are relatively simple and still widely used, studying them is a good first step in understanding XML tag set definition. A DTD is a plain-text document but is not itself an XML document, so it need not begin with the standard XML declaration.

Pros of using DTD's
- lightweight (compact, easily comprehended, but limited expressiveness).
- They can be defined inline (internal DTD's) for quick development.
- They can define entities.
- They are perhaps the most widely accepted and are supported by most XML parsers.
Cons of using DTD's
- They are not written using XML syntax, and require parsers to support an additional language.
- They do not support Namespaces.
- They do not have data typing (requiring data to be an integer, a string, or a date, etc.), decreasing the strength of validation.
- They have limited capacity to define how many child elements can nest within a given parent element.

The three things a DTD specifies

elements (tags)
attributes for tags
entities

An example file: the textbook's ch06-wonders.dtd, as referenced in ch06-wonders.xml (do a view-source, line 3)

Where to put the DTD file

You can have the DTD as an external file, or in-document (similar to providing css-files).

In-document, inside the same file as the XML:

<!DOCTYPE ancient_wonders [
    <!ELEMENT ancient_wonders (wonder*)>
    <!ELEMENT wonder (name+, …)>
    ⋮
]>

<ancient_wonders>
    <wonder>
        <name>Great Pyramid of Giza</name>
        ⋮
    </wonder>
</ancient_wonders>

This is the lightest-weight solution, and is suitable for the homework assignment.

For an external DTD file on your own computer, you can use SYSTEM: start your XML file with a DOCTYPE specifying where the DTD is found, e.g. <!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">. This is what the above ch06-wonders.xml does (remember to view-source, line 3).
For an external DTD on the interwebs, you can use PUBLIC: <!DOCTYPE ancient_wonders PUBLIC "-//ibarland//DTD Archaeological Wonders 1.0//EN" "https://itec-php01.radford.edu/~itec325/2016spring-ibarland/Lectures/wonders.dtd">. The quoted string right after “PUBLIC” is a public identifier with its own syntax; the second quoted string is the URL where the DTD is found. (And: on your homework, prefer the SYSTEM technique to this one.)
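
To see how a parser reads the DOCTYPE line, here is a small sketch using Python's standard xml.parsers.expat module. The helper name read_doctype is made up for this example. Note that expat only reports the identifiers here; in this configuration it does not fetch or read the external DTD file.

```python
import xml.parsers.expat

def read_doctype(xml_text):
    """Return what the parser saw in the document's <!DOCTYPE ...> declaration."""
    info = {}

    def on_doctype(name, system_id, public_id, has_internal_subset):
        info["root"] = name
        info["system"] = system_id    # None if no system identifier was given
        info["public"] = public_id    # None unless PUBLIC was used
        info["internal"] = bool(has_internal_subset)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(xml_text, True)
    return info

external = read_doctype('<!DOCTYPE ancient_wonders SYSTEM "wonders.dtd"><ancient_wonders/>')
inline = read_doctype('<!DOCTYPE ancient_wonders [<!ELEMENT ancient_wonders EMPTY>]><ancient_wonders/>')
print(external)
print(inline)
```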

Defining your XML: Elements, Attributes, and Entities

Defining Elements (tags)
Elements are the foundational units of an XML document. They can contain values, have attributes, and they can contain other elements. A DTD for a given custom markup language (tag set) will define a list of elements and any child elements that each element can have. It will define any attributes that each element can have, and it will define whether these elements and attributes are optional or required.

see also: w3schools.com
- Defining an element that only contains text:
  <!ELEMENT tag (#PCDATA) > — where tag is the element name
  PCDATA stands for “parsed character data”, and it refers to the text value of an element; “parsed” meaning that it can contain entities.¹
  Example:
  <!ELEMENT name (#PCDATA) >
  <!ELEMENT history (#PCDATA) >
- Defining an empty element:
  An empty element is an XML element that does not have any body of its own. (It may use its attributes to store data, though.)
  <!ELEMENT tag EMPTY > — where tag is the element name
  
  Example:
  <!ELEMENT br EMPTY >
- Defining an element that contains a child element:
  1. Tags with required child nodes: <!ELEMENT tag (child1) > — where child1 is another element-name
    <!ELEMENT tag (child1, child2) > — where child1 and child2 are element names
    Example: <!ELEMENT html (head,body) >
  2. Optional and repeatable child nodes: <!ELEMENT tag (child1*, child2+, child3?) >
    You can define how many times a child element may appear at that position: (* : 0 or more), (+ : 1 or more), (? : 0 or 1).
    
    For example:
    <!ELEMENT pet (name+, nickname*, vet?) >
    saying that a tag pet can have one or more proper names, any number of nicknames, and (possibly) one current vet (all in that order).
  3. You can also apply quantifiers to a whole parenthesized sequence of tags, e.g. (child1, child2)* for zero or more child1-then-child2 pairs.
    There is no special way to define a specific quantity of an element (for example, exactly 3 occurrences). The one rather awkward way to do it is:
    <!ELEMENT tag (child1, child1, child1) >
- Defining choices for child elements:
  <!ELEMENT tag (child1 | child2 | child3) >
  The above declaration says the tag element contains exactly one child, which must be a child1, a child2, or a child3 element.
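- You can combine choices and quantifiers, with nested parentheses where needed, to provide optional content among groups of subelements. One common pattern is the “list of approved tags” mixed with text. In such mixed content, #PCDATA must be listed first and the group must end with *:
  <!ELEMENT p (#PCDATA|em|strong|ul|ol|br)* >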
- Defining a tag that may contain anything:
  While not ideal for creating a structured set of rules, in a DTD, you can define an element to contain anything, meaning it can contain any combination of elements and text. As with mixed content, this is useful if you are creating a DTD to support XML documents from different sources. It may be the only way to define elements you know and allow for element structures you can't anticipate.
  
  <!ELEMENT tag ANY >
  Using ANY is probably punting on the actual "defining what is/isn't allowed", so it should not be used routinely.
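
The occurrence markers and separators described above (the comma, |, *, + and ?) behave like a regular expression over the sequence of child tags. The following sketch makes that concrete. It is not how a real validating parser is implemented, and it only handles models made of element names (no #PCDATA, EMPTY, or ANY). The function names are invented for this example.

```python
import re
import xml.etree.ElementTree as ET

def content_model_to_regex(model):
    """Turn a model like "(name+, nickname*, vet?)" into a compiled regex.

    The regex runs over the child tag names written one after another,
    each followed by a single space.
    """
    def translate(match):
        token = match.group()
        if token == "," or token.isspace():
            return ""  # "a, b" means a then b: plain concatenation in a regex
        return "(?:" + re.escape(token) + " )"  # one child with that tag name

    # Parentheses, |, *, + and ? are left alone: they mean the same in a regex.
    return re.compile(re.sub(r"[A-Za-z_][\w.-]*|,|\s+", translate, model))

def children_match(element, model):
    """Check the element's direct children against the content model."""
    child_tags = "".join(child.tag + " " for child in element)
    return content_model_to_regex(model).fullmatch(child_tags) is not None

PET_MODEL = "(name+, nickname*, vet?)"
good_pet = ET.fromstring("<pet><name>Rex</name><nickname>Rexy</nickname><vet>Dr. Lee</vet></pet>")
bad_pet = ET.fromstring("<pet><vet>Dr. Lee</vet><name>Rex</name></pet>")  # wrong order
print(children_match(good_pet, PET_MODEL))
print(children_match(bad_pet, PET_MODEL))
```

The first pet matches the model; the second does not, because the model requires every name to come before the vet.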

video: DTD: attlist, entities; design choices (14m48s)

Defining Attributes

Attributes are useful to provide additional data about an element. Information contained in attributes tends to be about the content of the XML document, as opposed to being the content itself. General best practices suggest that elements are better used for information you want to display, and attributes for information about information. Some of the reasons: attributes cannot describe data relationships like child elements can, their values are not as easily validated by a DTD, and they cannot contain multiple values whereas child elements can. Attributes are often used with empty elements, where they describe information about the element. For example, they are often used to store IDs, since an ID is not the data itself but information about the data.

Defining a standard element attribute:

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units CDATA "feet">
<!-- default value of "feet" for the attribute units; a document could still
     give a different value, as in units="meters" -->

Syntax: <!ATTLIST tag attrName CDATA default-or-status >
(tag is the name of the element this attribute occurs in; attrName is the attribute's name; CDATA indicates that the value is character data ² (as opposed to other allowed types, like ID or NMTOKEN; see below), and is overwhelmingly the most common type for attributes.) default-or-status must be present, and is one of:

- A default value (a quoted string) for that attribute;
- the token #REQUIRED, to indicate that the attribute must be given (and there is no default value);
- the token #IMPLIED, meaning "optional" (and no default: its value is somehow “implied” from other context);
- (uncommon:) #FIXED "the-only-allowed-value", to force the author to use only one possible value. This is similar to a one-item list of "enumerated values" (next), except that the author may also leave the attribute out, in which case the parser supplies the fixed value.
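
A short sketch of the default-value case, assuming Python's standard xml.etree.ElementTree (built on the expat parser), which reads ATTLIST defaults from an in-document DTD even though it does not validate. The heights 455 and 139 are just sample values.

```python
import xml.etree.ElementTree as ET

# In-document DTD giving the units attribute a default value of "feet".
DTD = """<!DOCTYPE height [
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height units CDATA "feet">
]>
"""

defaulted = ET.fromstring(DTD + "<height>455</height>")                # no units given
explicit = ET.fromstring(DTD + '<height units="meters">139</height>')  # units given

print(defaulted.get("units"))  # the parser fills in the declared default
print(explicit.get("units"))   # an explicit value overrides the default
```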

Enumerating an attribute's choices:
Instead of allowing any “CDATA” as an attribute-value, you can restrict it to an enumerated list of possible values:

<!ELEMENT height (#PCDATA)>
<!ATTLIST height units (inches | feet) #REQUIRED>
<!-- value of units must be either inches or feet -->

Defining attributes that refer to another tag's ID attribute:
Suppose you want to restrict an attribute's value so that it must reference (the ID attribute of) another element in the document (vaguely like a foreign key):

<!ELEMENT special_site (title,url)>
<!ATTLIST special_site wonder_focus IDREF #REQUIRED >

and then, in the XML document:

<special_site wonder_focus="w_143">

Use “IDREF” to define an attribute whose value must match the value of some existing ID attribute. Use “IDREFS” to define an attribute that can contain several white-space-separated values, each of which matches some existing ID attribute's value. There may be several IDREF attributes that refer to the same ID. But of course, the ID itself must be unique to one element.
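
The IDREF rule can be pictured as a foreign-key check. A validating parser does this for you; the sketch below does the same check by hand with Python's standard library, assuming (hypothetically) that the ID attribute on wonder elements is named id.

```python
import xml.etree.ElementTree as ET

def dangling_idrefs(root, id_attr, idref_attr):
    """Return idref_attr values that match no element's id_attr value."""
    known_ids = {el.get(id_attr) for el in root.iter() if el.get(id_attr) is not None}
    return [
        el.get(idref_attr)
        for el in root.iter()
        if el.get(idref_attr) is not None and el.get(idref_attr) not in known_ids
    ]

# Invented sample: the first special_site refers to a real wonder, the second does not.
SAMPLE = """
<ancient_wonders>
  <wonder id="w_143"/>
  <special_site wonder_focus="w_143"/>
  <special_site wonder_focus="w_999"/>
</ancient_wonders>
"""
print(dangling_idrefs(ET.fromstring(SAMPLE), "id", "wonder_focus"))
```
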
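Restricting Attributes to Valid XML Name Tokens:
DTD's don't allow for much data typing, but there is one restriction that you can apply to attributes. The value of an attribute declared with the NMTOKEN type must be a valid XML name token: made up only of name characters (letters, digits, and the characters . - _ :), with no white space. (Unlike a full XML name, a name token may start with a digit.)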

<!ELEMENT w_visit EMPTY>
<!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>

To keep the primary_keyword attribute to just one word (with no white space) it can be defined to be the NMTOKEN type.
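
As a rough illustration of what NMTOKEN accepts, here is an ASCII-only approximation in Python. Real name tokens also allow many non-ASCII letters, so treat this as a sketch of the idea, not the full rule.

```python
import re

# ASCII-only approximation: letters, digits, and the characters . _ : -
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:\-]+")

def is_nmtoken(value):
    """True if value looks like a single name token (ASCII approximation)."""
    return NMTOKEN_ASCII.fullmatch(value) is not None

for candidate in ["pyramid", "7wonders", "two words", ""]:
    print(repr(candidate), is_nmtoken(candidate))
```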

Defining Entities
- In the DTD:
  <!ENTITY entName "content"> -- the general form
  <!ENTITY wow "Wonders o’ <em>the</em> World"> -- an example
- Using General Entities
  
  <story> The oldest of the &wow;, the Great Pyramid, is … </story>
  will render as “The oldest of the Wonders o’ the World, the Great Pyramid, is…”.
DTDs actually allow fancier entity handling, but the concepts introduced aren't as fundamental as the general idea of formally defining our data, which is the reason we have looked at DTDs.
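
A small sketch of general-entity expansion, assuming Python's standard xml.etree.ElementTree, which expands entities declared in an in-document DTD. The story text is shortened from the example above, and a plain apostrophe is used in the entity.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE story [
  <!ENTITY wow "Wonders o' <em>the</em> World">
]>
<story>The oldest of the &wow;, the Great Pyramid, is ...</story>"""

story = ET.fromstring(DOC)

# The entity's replacement text is parsed as XML, so <em> becomes a real child element.
print(story.find("em").text)
# Joining all the text pieces shows what a reader would see.
print("".join(story.itertext()))
```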

Security issue: There are many ways to trick computers into serving up resources they shouldn't; here is one related to a program which purportedly processes a docx document, where the attack instead exploits the document's XML format and its DTD: click 1 Candy Cane » show solution and just look at the DTD file at the bottom of the solution.

¹ However, an element that is defined to contain only #PCDATA can't contain any other element (no sub-tags). (Thus an element declared as (#PCDATA) cannot contain a raw < character, because that would indicate a sub-tag, despite the fact that the term “parsed” seems to misleadingly suggest that sub-tags would be processed.) ↩

² The details are easy to confuse: DTDs allow #PCDATA for elements and CDATA for attributes (note the # or lack of it), but there is no PCDATA nor #CDATA.

Furthermore, XML already allows <![CDATA[…]]>, a CDATA section, meaning (unparsed) character data that might contain characters like literal < and &. But that's not part of DTD. ↩

This page licensed CC-BY 4.0 Ian Barland
Please mail any suggestions (incl. typos, broken links) to ibarland@radford.edu.
