home—lects—exams—hws
breeze (snow day)
lect14b-ch06-dtd
DTD
ch06
Originally based on
XML Visual Quickstart Guide
by Kevin Howard Goldberg,
and notes therefrom by Jack Davis (jcdavis@radford.edu).
Defining a tag set (that is, an XML grammar) formally is extremely important for keeping
XML documents consistent.
Generally, these definition documents are referred to as “schemas”.
In fact, you can compare any XML document to its corresponding schema to validate whether it
conforms to the rules specified in the schema.
If an XML document is deemed valid, then its data is in the proper form as specified by the schema.
(Of course, just as a syntax checker can validate a Java program's syntax
but not its meaning,
a schema won't validate an XML document's logical content.
This is less of a problem for XML documents (especially small ones) than for programs.
For databases implemented as large XML files,
people might build other ad hoc tools to run sanity checks on the contents,
for example checking that a wonder's year-destroyed
isn't less than its year-built,
that links actually resolve, etc.)
There are two principal systems for writing schemas: DTD and XML
Schema. DTD (Document Type Definition) is an older but
widely used system with a peculiar and limited syntax. However, DTDs
are compact and easily comprehended with a little study. Since they
are relatively simple and still widely used, studying them is a
good first step in understanding XML tag set definition. A DTD is
itself a plain-text document, but it is not written in XML syntax,
so it does not need to begin with the standard XML declaration.
- Pros of using DTDs
- They are compact and easily comprehended with a little study.
- They can be defined inline (internal DTDs) for quick development.
- They can define entities.
- They are likely the most widely accepted schema system and are supported
by most XML parsers.
- Cons of using DTDs
- They are not written in XML syntax, so parsers must
support an additional language.
- They do not support namespaces.
- They do not have data typing (requiring data to be
an integer, a string, or a date, etc.), decreasing the
strength of validation.
- They have limited capacity to define how many
child elements can nest within a given parent element.
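As a rough sketch of what "inline" means here: an internal DTD sits inside the document's own DOCTYPE declaration. The element name note and the file name note.dtd are made up for illustration.

```xml
<?xml version="1.0"?>
<!-- Internal DTD: the declarations sit between [ and ] in the DOCTYPE. -->
<!-- External alternative: replace the DOCTYPE below with
     <!DOCTYPE note SYSTEM "note.dtd">
     and move the ELEMENT declaration into that (hypothetical) file. -->
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
]>
<note>Feed the cat.</note>
```

The internal form is convenient while experimenting. The external form lets many documents share one DTD.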
The three things a DTD specifies
- elements (tags)
- attributes for tags
- entities
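A minimal sketch showing one declaration of each kind in a single internal DTD. The names wonder, name, location, language, and rhodes are assumptions chosen for illustration, not part of any standard tag set.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!-- element declarations -->
  <!ELEMENT wonder (name, location)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT location (#PCDATA)>
  <!-- attribute declaration -->
  <!ATTLIST name language CDATA #REQUIRED>
  <!-- entity declaration: a named shortcut for replacement text -->
  <!ENTITY rhodes "Rhodes, Greece">
]>
<wonder>
  <name language="English">Colossus of Rhodes</name>
  <location>&rhodes;</location>
</wonder>
```

When the document is parsed, the reference &rhodes; is replaced by the text given in the ENTITY declaration.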
A common question, when designing an XML language,
is when to use a (nested) tag, vs. an attribute on the tag.
The rule-of-thumb is “data as tags; metadata as attributes”.
For example, in XHTML, the img tag has the filename as an attribute,
because the image is the data; the filename is information about how to find the (real) data.
Guidelines:
-
Does the additional info itself contain sub-information?
Then you must use a tag element.
Most common case: markup.
For example, if a caption may contain emphasized text, you can't do that with an attribute.
-
Does the info's order matter, relative to its sibling information?
Can there be two such pieces of info?
If so, you must use tag elements.
(Attributes must be unique and order doesn't matter.)
For example, if a book can have multiple authors (and order matters),
you can't do that with an attribute.
(All book examples)
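The two guidelines can be sketched with a hypothetical book record (every name and value below is a placeholder). The repeated, ordered authors and the caption containing markup have to be elements, while the identifier works as an attribute.

```xml
<?xml version="1.0"?>
<!-- isbn is a single, unordered piece of metadata, so an attribute works -->
<book isbn="0-000-00000-0">
  <title>An Example Title</title>
  <!-- two authors, and their order matters: must be elements -->
  <author>First Author</author>
  <author>Second Author</author>
  <!-- the caption contains markup: must be an element -->
  <caption>A caption with <em>emphasized</em> text.</caption>
</book>
```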
Defining Elements
Elements are the foundational units of an XML document. They
can contain values, have attributes, and they can contain other
elements. A DTD for a given custom markup language (tag set)
will define a list of elements and any child elements that each
element can have. It will define any attributes that each
element can have, and it will define whether these elements and
attributes are optional or required.
- Defining an element that only contains text:
<!ELEMENT tag (#PCDATA) > -- where tag is the element name
PCDATA stands for “parsed character data”,
and it refers to the text value of an element;
“parsed” meaning that it can contain entities.
An element that is defined to contain PCDATA can't contain any other element.
Example:
<!ELEMENT name (#PCDATA) >
<!ELEMENT history (#PCDATA) >
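A small sketch of what a text-only declaration accepts and rejects, assuming a document whose root element is name:

```xml
<?xml version="1.0"?>
<!DOCTYPE name [
  <!ELEMENT name (#PCDATA)>
]>
<!-- Valid: plain text plus a built-in entity reference -->
<name>Lighthouse &amp; harbor</name>
<!-- Invalid against this DTD: <name>Light<em>house</em></name>,
     because (#PCDATA) allows no child elements -->
```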
- Defining an empty element:
An empty element is an XML element that does not have any
body of its own.
(It may use its attributes to store data, though.)
<!ELEMENT tag EMPTY > -- where tag is the element name
Example:
<!ELEMENT br EMPTY >
- Defining an element that contains a child element:
- Tags with required child nodes:
<!ELEMENT tag (child1) > -- where child1 is an element name
<!ELEMENT tag (child1, child2) > -- where child1 and child2 are element names
Example:
<!ELEMENT html (head,body) >
- Optional and repeatable child nodes:
<!ELEMENT tag (child1*, child2+, child3?) >
You can specify how many times a child element may occur
by adding a quantifier: * (0 or more), + (1 or more), ? (0 or 1).
For example:
<!ELEMENT pet (name+, nickname*, vet?) >
This says that a pet element must have one or more proper names,
may have any number of nicknames,
and may have one current vet (all in that order).
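As an illustration, here is a complete document that should validate against the pet declaration. The text-only declarations for name, nickname, and vet, and the pet's names, are assumptions added to make the example self-contained.

```xml
<?xml version="1.0"?>
<!DOCTYPE pet [
  <!ELEMENT pet (name+, nickname*, vet?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT nickname (#PCDATA)>
  <!ELEMENT vet (#PCDATA)>
]>
<!-- Valid: one name, two nicknames, no vet, in the declared order -->
<pet>
  <name>Sir Fluffington</name>
  <nickname>Fluff</nickname>
  <nickname>Fluffy</nickname>
</pet>
```

A pet with only nickname children, or with a nickname placed before the name, would be rejected.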
-
You can also use quantifiers to define the number of occurrences for
a sequence of tags.
There is no special way to define a specific quantity of an element (for
example, 3 occurrences). The one rather awkward way to do it is:
<!ELEMENT tag (child1, child1, child1) >
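A sketch of a quantifier applied to a whole sequence rather than a single tag, using a hypothetical glossary made of repeated (term, definition) pairs:

```xml
<?xml version="1.0"?>
<!DOCTYPE glossary [
  <!-- one or more (term, definition) pairs, always in that order -->
  <!ELEMENT glossary (term, definition)+>
  <!ELEMENT term (#PCDATA)>
  <!ELEMENT definition (#PCDATA)>
]>
<glossary>
  <term>DTD</term>
  <definition>Document Type Definition</definition>
  <term>PCDATA</term>
  <definition>parsed character data</definition>
</glossary>
```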
- Defining choices for child elements:
<!ELEMENT tag (child1 | child2 | child3) >
The above declaration says the tag element contains
exactly one child, and it must be a
child1, a child2,
or a child3 sub-element.
-
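You can use nested parentheses to group sub-elements, and then apply
a choice or a quantifier to the whole group.
One common pattern is the "list of approved tags":
<!ELEMENT p (em | strong | ul | ol | br)* >
(Written with commas instead of |, the group would instead require all five tags, in that order, each time it repeats.)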
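A sketch combining a nested group with an "approved tags" list that also allows text between the tags (mixed content). The article, title, summary, and abstract names are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE article [
  <!-- nested group: an optional choice between summary and abstract -->
  <!ELEMENT article (title, (summary | abstract)?, p+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT summary (#PCDATA)>
  <!ELEMENT abstract (#PCDATA)>
  <!-- mixed content: #PCDATA must come first and the group must end in * -->
  <!ELEMENT p (#PCDATA | em | strong)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT strong (#PCDATA)>
]>
<article>
  <title>On Lighthouses</title>
  <abstract>A short note.</abstract>
  <p>Plain text, <em>emphasized</em> text, and <strong>strong</strong> text.</p>
</article>
```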
- Defining a tag that may contain anything:
While not ideal for creating a structured set of rules, in a DTD,
you can define an element to contain anything,
meaning it can contain any combination of elements and text.
As with mixed content,
this is useful if you are creating a DTD to support XML documents
from different sources. It may be the only way to define elements
you know and allow for element structures you can't anticipate.
<!ELEMENT tag ANY >
- Digression: tags vs. attributes
Which of the following do you like/dislike? Why?
-
<wonder height="37 feet">…</wonder>
-
<height>
<measure units="feet">37</measure>
<measure units="meters">11.8</measure>
</height>
-
<height>
<measure>37</measure>
<units>feet</units>
</height>
-
<height units="feet">
<measure>37</measure>
</height>
- Defining Attributes
Attributes are useful for providing additional data about an element.
Information contained in attributes tends to be about the content of the
XML document, as opposed to being the content itself. General best practices
suggest that elements are better used for information you want to display,
and attributes for information about information. Some of the
reasons: attributes cannot describe data relationships the way child
elements can, their values are not as easily validated by a DTD, and they
cannot contain multiple values, whereas child elements can. Attributes
are often used with empty elements, where they describe information about
the element. For example, they are often used to store IDs, since an ID
is not the data itself but information about the data.
- Defining a standard element attribute:
Syntax: <!ATTLIST tag attrName CDATA #REQUIRED >
(tag is the name of the element this attribute occurs in;
attrName is the attribute's name;
CDATA indicates that the value is plain character data (any text string);
the last word is either #REQUIRED, meaning the attribute must be present,
or #IMPLIED, meaning it is optional.)
You may define multiple attributes for one element in a single
definition statement.
<!ATTLIST tag
attr1 CDATA #REQUIRED
attr2 CDATA #IMPLIED>
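A sketch of a required and an optional attribute on one element. The attribute names city and builder are invented for this example.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (#PCDATA)>
  <!ATTLIST wonder
    city    CDATA #REQUIRED
    builder CDATA #IMPLIED>
]>
<!-- city must be present; builder may be omitted -->
<wonder city="Alexandria">Lighthouse of Alexandria</wonder>
```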
- Defining Default Attribute Values:
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
units CDATA "feet">
-- default value of feet for attribute units; a document can still give a
-- different value, as in units="meters"

<!ELEMENT height (#PCDATA)>
<!ATTLIST height
units CDATA #FIXED "feet">
-- fixed value of feet for attribute units; a document cannot give a
-- different value such as units="meters"
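To see a plain default and a #FIXED value side by side, here is a sketch that assumes a second, hypothetical width element alongside height:

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (height, width)>
  <!ELEMENT height (#PCDATA)>
  <!ELEMENT width (#PCDATA)>
  <!ATTLIST height units CDATA "feet">
  <!ATTLIST width  units CDATA #FIXED "feet">
]>
<wonder>
  <!-- no units given: a parser that reads the DTD supplies units="feet" -->
  <height>37</height>
  <!-- units="feet" matches the fixed value; units="meters" would be invalid -->
  <width units="feet">12</width>
</wonder>
```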
- Defining Attributes with Choices
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
units (inches | feet) #REQUIRED>
-- value of units must be either inches or feet
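A short sketch of a document using the declaration above:

```xml
<?xml version="1.0"?>
<!DOCTYPE height [
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height units (inches | feet) #REQUIRED>
]>
<!-- Valid: "feet" is one of the listed choices; units="meters" would be rejected -->
<height units="feet">37</height>
```

A list of choices can also be combined with a default, for example units (inches | feet) "feet", in which case the attribute may be omitted.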
- Defining an attribute that refers to another tag's ID attribute:
Suppose you want to restrict an attribute's value to have to
reference (the ID attribute of) another element in the document
(vaguely like a foreign key):
<!ELEMENT special_site (title,url)>
<!ATTLIST special_site wonder_focus IDREF #REQUIRED >
<special_site wonder_focus="w_143">
Use “IDREF” to define an
attribute that can contain a value
matching any existing ID attribute's value.
Use “IDREFS” to
define an attribute that can contain several white-space-separated
values which match any existing ID attribute's value.
There may be several IDREF attributes that refer to the same ID.
But of course, the ID itself must be unique to one element.
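The notes above show only the IDREF side. This sketch adds the matching ID declaration so the reference has something to resolve to. The root element wonders, the attribute name id, and the title and url contents are assumptions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonders [
  <!ELEMENT wonders (wonder+, special_site*)>
  <!ELEMENT wonder (#PCDATA)>
  <!-- ID values must be unique within the document -->
  <!ATTLIST wonder id ID #REQUIRED>
  <!ELEMENT special_site (title, url)>
  <!-- wonder_focus must match the value of some ID attribute -->
  <!ATTLIST special_site wonder_focus IDREF #REQUIRED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT url (#PCDATA)>
]>
<wonders>
  <wonder id="w_143">Colossus of Rhodes</wonder>
  <special_site wonder_focus="w_143">
    <title>Example site</title>
    <url>http://www.example.com/</url>
  </special_site>
</wonders>
```

If no element carried id="w_143", a validating parser would report the dangling reference.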
- Restricting Attributes to Valid XML Name Tokens:
DTDs don't allow for much data typing, but there
is one restriction that you can apply to attributes.
The value of an attribute defined as the NMTOKEN type must
be a valid XML name token: one or more name characters (letters, digits,
and a few punctuation marks such as . - _ :), with no white space.
<!ELEMENT w_visit EMPTY>
<!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>
To keep the primary_keyword attribute to just one word
(with no white space), it can be defined to be the NMTOKEN type.
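A sketch of a document using this declaration, with a made-up keyword value:

```xml
<?xml version="1.0"?>
<!DOCTYPE w_visit [
  <!ELEMENT w_visit EMPTY>
  <!ATTLIST w_visit primary_keyword NMTOKEN #REQUIRED>
]>
<!-- Valid: a single token with no white space.
     primary_keyword="ancient wonders" would be rejected. -->
<w_visit primary_keyword="lighthouse"/>
```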
-
lect14b-ch06-census.xml