These notes are influenced by
XML Visual Quickstart Guide
by Kevin Howard Goldberg,
and notes therefrom by Jack Davis (jcdavis@radford.edu).
We've seen quick examples of some standardized file-formats using their own variant of XML:
iTunes collections, .svg files, Word documents.
We also saw a made-up set of tags about children,
as well as the book's made-up set of tags about ancient wonders.
Actually, even the “standardized” file formats are just somebody
who originally made up tags, to describe the hierarchical information they wanted to represent.
But if the tags are made-up, who's to declare if people are using them correctly?
When defining a new XML language, you must specify its grammar —
what tags are allowed,
what child-tags they may (or, must) contain.
Similarly, for attributes — what attributes are required on certain tags
(like “img” tags must have a “src”
attribute
and otherwise be empty), and what the allowed values are for attributes.
In fact, you can compare any XML document to its corresponding grammar to validate whether it
conforms to the rules specified in the schema.
If an XML document is deemed valid, then its data is in the proper form as specified by the schema.
(Of course, just as a syntax-checker can validate a Java program's syntax,
it won't validate its meaning;
similarly, schemas won't validate an XML document's logical content.)
This isn't as much a problem for XML documents (especially small ones) as for programs.
For databases implemented as large XML files,
people might build other ad hoc tools to run sanity checks on the contents
of an XML document's meaning — check that a wonder's year-destroyed
isn't less than its year-built,
that links actually resolve, etc.
There are two common formats for specifying XML grammars:
DTDs, and
XML Schema.
A DTD, “Document Type Definition”, is an older but
widely used system with a peculiar and limited syntax.
However, DTDs are lightweight: compact and easily comprehended with a little study.
Since they are relatively simple and still widely used, studying them is a
good first step in understanding XML tag set definition.
A DTD is itself a plain-text document, but not an XML document, and therefore does not begin with the
standard XML declaration.
Pros of using DTDs
They are lightweight (compact and easily comprehended, though limited in expressiveness).
They can be defined inline (internal DTDs) for quick development.
They can define entities.
They are perhaps the most widely accepted and are supported
by most XML parsers.
Cons of using DTDs
They are not written using XML syntax, and require parsers
to support an additional language.
They do not support Namespaces.
They do not have data typing (requiring data to be
an integer, a string, or a date, etc.), decreasing the
strength of validation.
They have limited capacity to define how many
child elements can nest within a given parent element.
This is the lightest-weight solution, and is suitable for the homework assignment.
For an external DTD file on your own computer,
you can use SYSTEM:
Start your XML file with a DOCTYPE specifying where the DTD is found,
e.g. <!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">.
This is what the above
ch06-wonders.xml
does (remember to view-source, line 3).
An external document on the interwebs, using PUBLIC:
<!DOCTYPE ancient_wonders PUBLIC "-//ibarland//DTD Archaeological Wonders 1.0//EN" "https://php.radford.edu/~itec325/2016spring-ibarland/Lectures/wonders.dtd">.
See here
for the syntax of the word after “PUBLIC”.
(And: on your homework, prefer the SYSTEM technique to this one.)
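For quick experiments, the declarations can also live inside the document itself (an internal DTD, as mentioned in the pros above). Here is a minimal sketch. The cut-down tag set (just wonder and name) is an assumption for illustration; the real wonders.dtd is larger.

```xml
<?xml version="1.0"?>
<!-- Internal DTD: the declarations sit inside the DOCTYPE's square brackets. -->
<!DOCTYPE ancient_wonders [
  <!ELEMENT ancient_wonders (wonder*)>
  <!ELEMENT wonder (name)>
  <!ELEMENT name (#PCDATA)>
]>
<ancient_wonders>
  <wonder>
    <name>Colossus of Rhodes</name>
  </wonder>
</ancient_wonders>
```

If you happen to have libxml2's xmllint installed, `xmllint --valid --noout yourfile.xml` is one way to check a file against its DTD; it prints nothing when the document is valid.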
Defining your XML: Elements, Attributes, and Entities
Defining Elements (tags)
Elements are the foundational units of an XML document. They
can contain values, have attributes, and they can contain other
elements. A DTD for a given custom markup language (tag set)
will define a list of elements and any child elements that each
element can have. It will define any attributes that each
element can have, and it will define whether these elements and
attributes are optional or required.
Defining an element that only contains text:
<!ELEMENT tag (#PCDATA) >
where tag is the element name.
PCDATA stands for “parsed character data”,
and it refers to the text value of an element;
“parsed” meaning that it can contain entities.
An element that is defined to contain PCDATA can't contain any other element (no sub-tags).
Example:
<!ELEMENT name (#PCDATA) >
<!ELEMENT history (#PCDATA) >
Defining an empty element:
An empty element is an XML element that does not have any
body of its own.
(It may use its attributes to store data, though.)
<!ELEMENT tag EMPTY >
where tag is the element name.
Example: <!ELEMENT br EMPTY >
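To see text-only and empty elements side by side, here is a small sketch. The wrapper element note and the sample text are made up for illustration; the wrapper just gives the three elements somewhere to live.

```xml
<?xml version="1.0"?>
<!DOCTYPE note [
  <!-- Hypothetical wrapper so the three elements have a parent. -->
  <!ELEMENT note (name, history, br)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT history (#PCDATA)>
  <!ELEMENT br EMPTY>
]>
<note>
  <name>Lighthouse of Alexandria</name>
  <!-- PCDATA is parsed, so an entity such as &amp; is allowed here. -->
  <history>Guided ships &amp; sailors into the harbor.</history>
  <!-- An EMPTY element has no body; <br>text</br> would fail validation. -->
  <br/>
</note>
```

Putting a sub-tag inside name or history would also fail validation, since those are declared as text only.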
Defining an element that contains a child element:
Tags with required child nodes:
<!ELEMENT tag (child1) >
where child1 is another element name.
<!ELEMENT tag (child1, child2) >
where child1 and child2 are element names.
Example:
<!ELEMENT html (head,body) >
Optional and repeatable child nodes:
<!ELEMENT tag (child1*, child2+, child3?) >
You can define how many times a child element may appear
in an XML document: (* : 0 or more), (+ : 1 or more), (? : 0 or 1).
For example: <!ELEMENT pet (name+, nickname*, vet?) >
saying that a tag pet can have one or more proper names,
any number of nicknames,
and (possibly) one current vet (all in that order).
You can also apply a quantifier to a whole parenthesized
sequence of tags, e.g. (child1, child2)* for zero or more child1/child2 pairs.
There is no special way to define a specific quantity of an element (for
example, 3 occurrences). The one rather awkward way to do it is: <!ELEMENT tag (child1, child1, child1) >
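The quantifiers are easiest to absorb by checking a few instances by hand. The sketch below reuses the pet declaration from above. The pets root, the owner element, and all the names are assumptions added so that the example is a complete document.

```xml
<?xml version="1.0"?>
<!DOCTYPE pets [
  <!-- A quantifier on a whole sequence: one or more (owner, pet) pairs. -->
  <!ELEMENT pets (owner, pet)+>
  <!ELEMENT owner (#PCDATA)>
  <!ELEMENT pet (name+, nickname*, vet?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT nickname (#PCDATA)>
  <!ELEMENT vet (#PCDATA)>
]>
<pets>
  <owner>Alice</owner>
  <!-- Valid: one name, two nicknames, no vet. -->
  <pet>
    <name>Bartholomew</name>
    <nickname>Bart</nickname>
    <nickname>Barty</nickname>
  </pet>
  <owner>Bob</owner>
  <!-- Valid: two names, no nicknames, one vet. -->
  <pet>
    <name>Princess</name>
    <name>Fluffy III</name>
    <vet>Dr. Lee</vet>
  </pet>
  <!-- Would be invalid: a pet holding only a nickname (no name),
       or a pet whose vet comes before its name (wrong order). -->
</pets>
```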
Defining choices for child elements: <!ELEMENT tag (child1 | child2 | child3) >
The above declaration says the tag element contains
exactly one child, which must be a
child1, a child2,
or a child3 element.
You can use nested parentheses to provide optional content among groups of
subelements.
One common pattern is the “list of approved tags”:
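<!ELEMENT p (#PCDATA|em|strong|ul|ol|br)* >
(When text is mixed with tags like this, #PCDATA must be listed first, and the only quantifier allowed on the group is *.)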
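Here is a sketch combining a choice, nested parentheses, and the “list of approved tags” pattern in one document. The contact tag set is entirely made up for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE contact [
  <!-- A name, then EITHER a phone OR an (email, optional homepage) group,
       then an optional bio. -->
  <!ELEMENT contact (name, (phone | (email, homepage?)), bio?)>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT phone (#PCDATA)>
  <!ELEMENT email (#PCDATA)>
  <!ELEMENT homepage (#PCDATA)>
  <!-- Mixed content: text plus approved tags, with #PCDATA listed first. -->
  <!ELEMENT bio (#PCDATA | em | strong)*>
  <!ELEMENT em (#PCDATA)>
  <!ELEMENT strong (#PCDATA)>
]>
<contact>
  <name>Ada</name>
  <email>ada@example.com</email>
  <bio>Wrote the <em>first</em> program.</bio>
</contact>
```

A contact with both a phone and an email would fail validation, since the choice allows only one branch.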
Defining a tag that may contain anything:
<!ELEMENT tag ANY >
While not ideal for creating a structured set of rules, in a DTD,
you can define an element to contain anything,
meaning it can contain any combination of elements and text.
As with mixed content,
this is useful if you are creating a DTD to support XML documents
from different sources. It may be the only way to define elements
you know and allow for element structures you can't anticipate.
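A small sketch of ANY in use; the feed, item, title, and note names are made up. One caveat worth knowing: for the document to be valid, every child element that shows up inside an ANY element must still be declared somewhere in the DTD.

```xml
<?xml version="1.0"?>
<!DOCTYPE feed [
  <!ELEMENT feed (item*)>
  <!-- ANY: text and any mix of declared elements, in any order. -->
  <!ELEMENT item ANY>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT note (#PCDATA)>
]>
<feed>
  <item>Plain text only.</item>
  <item><note>first</note> some text <title>then a title</title></item>
  <item/>
</feed>
```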
Defining Attributes
Attributes are useful to provide additional data about an element.
Information contained in attributes tends to be about the content of the
XML document, as opposed to being the content itself. General best practices
suggest that elements are better used for information you want to display.
Attributes are better used for information about information. Some of the
reasons are: attributes cannot describe data relationships like child
elements can, their values are not as easily validated by a DTD, and they
cannot contain multiple values whereas child elements can. Attributes
are often used with empty elements where they describe information about
the element. For example, they are often used to store ID's, as attributes
are not the data, but information about the data.
Defining a standard element attribute:
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
units CDATA "feet">
-- default value of feet for attribute units, could still establish a
-- different value for units as in units="meters"
Syntax:
<!ATTLIST tag attrName CDATA default-or-status >
tag: the name of the element this attribute occurs in;
attrName: the attribute's name;
CDATA indicates that the value is character data (any string);
default-or-status is one of:
A default value (a quoted string) for that attribute;
the token #REQUIRED, to indicate that it's required
(but there is no default);
the token #IMPLIED, meaning "optional"
(and no default);
(uncommon:)
#FIXED "the-only-allowed-value",
to force the author to use only one possible value.
This can also be achieved via the more flexible
"enumerated values", next;
I'm not sure what the intent of this particular approach is.
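The four default-or-status options can be compared in one declaration. This is only a sketch: the attribute names source, note, and scale and their values are made up, and only units comes from the example above.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (height)>
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height
    units   CDATA  "feet"
    source  CDATA  #REQUIRED
    note    CDATA  #IMPLIED
    scale   CDATA  #FIXED "imperial">
]>
<wonder>
  <!-- units is omitted, so a validating parser supplies "feet".
       source is required, so leaving it out would fail validation.
       note is optional and simply absent here.
       scale is omitted, so the parser supplies "imperial";
       writing scale="metric" would fail validation. -->
  <height source="estimate">37</height>
</wonder>
```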
Enumerating an attribute's choices:
Instead of allowing any “CDATA”
as an attribute-value,
you can restrict it to an enumerated list of possible values:
<!ELEMENT height (#PCDATA)>
<!ATTLIST height
units (inches | feet) #REQUIRED>
-- value of units must be either inches or feet
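As a complete (if tiny) document, the enumerated declaration looks like this; the value 37 is just sample data.

```xml
<?xml version="1.0"?>
<!DOCTYPE height [
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height units (inches | feet) #REQUIRED>
]>
<!-- Valid: "feet" is one of the listed choices.
     units="meters", or leaving units out, would fail validation. -->
<height units="feet">37</height>
```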
Defining Attributes that refer to another tag's ID attribute:
Suppose you want to restrict an attribute's value to have to
reference (the ID attribute of) another element in the document
(vaguely like a foreign key):
<!ELEMENT special_site (title,url)>
<!ATTLIST special_site wonder_focus IDREF #REQUIRED >
<special_site wonder_focus="w_143">
Use “IDREF” to define an
attribute that can contain a value
matching any existing ID attribute's value.
Use “IDREFS” to
define an attribute that can contain several white-space-separated
values which match any existing ID attribute's value.
There may be several IDREF attributes that refer to the same ID.
But of course, the ID itself must be unique to one element.
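For the reference to resolve, some element has to declare an attribute of type ID. Here is a sketch of both halves together. The id attribute on wonder, the related attribute, the second wonder, and the example.com URL are assumptions for illustration.

```xml
<?xml version="1.0"?>
<!DOCTYPE ancient_wonders [
  <!ELEMENT ancient_wonders (wonder+, special_site*)>
  <!ELEMENT wonder (name)>
  <!-- ID values must be unique in the document and must be XML names,
       so they cannot start with a digit (hence w_143, not 143). -->
  <!ATTLIST wonder id ID #REQUIRED>
  <!ELEMENT name (#PCDATA)>
  <!ELEMENT special_site (title, url)>
  <!ATTLIST special_site
    wonder_focus IDREF  #REQUIRED
    related      IDREFS #IMPLIED>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT url (#PCDATA)>
]>
<ancient_wonders>
  <wonder id="w_143"><name>Colossus of Rhodes</name></wonder>
  <wonder id="w_144"><name>Hanging Gardens of Babylon</name></wonder>
  <!-- wonder_focus="w_999" would fail validation: no element has that ID. -->
  <special_site wonder_focus="w_143" related="w_143 w_144">
    <title>About the Colossus</title>
    <url>http://example.com/colossus</url>
  </special_site>
</ancient_wonders>
```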
Restricting Attributes to Valid XML Names:
DTD's don't allow for much data typing, but there
is one restriction that you can apply to attributes.
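The value of an attribute defined as the NMTOKEN type must
be made up only of XML name characters (such as letters, digits, period, hyphen, underscore, and colon), with no whitespace.
Unlike a full XML name, it may begin with a digit.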
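A short sketch of NMTOKEN in use. The attribute names code and tags and their values are made up; NMTOKENS is the plural form, taking a whitespace-separated list of tokens.

```xml
<?xml version="1.0"?>
<!DOCTYPE wonder [
  <!ELEMENT wonder (#PCDATA)>
  <!ATTLIST wonder
    code  NMTOKEN  #REQUIRED
    tags  NMTOKENS #IMPLIED>
]>
<!-- Valid: each token uses only name characters.
     code="rhodes statue" would fail validation (a single NMTOKEN
     cannot contain a space), as would code="rhodes/statue". -->
<wonder code="rhodes-statue" tags="statue bronze greek">Colossus of Rhodes</wonder>
```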
A common question, when designing an XML language,
is when to use a (nested) tag, vs. an attribute on the tag.
The rule-of-thumb is “data as tags; metadata as attributes”.
For example, in XHTML, the img tag has the filename as an attribute,
because the image is the data; the filename is information about how to find the (real) data.
A couple of guidelines:
Does the additional info itself contain sub-information?
Then you must use a tag element.
Most common case: markup.
For example, if a caption may contain emphasized text, you can't do that with an attribute.
Does the info's order matter, relative to its sibling information?
Can there be two such pieces of info?
If so, you must use tag elements.
(A tag can't repeat an attribute name, and the order of attributes isn't significant.)
For example, if a book can have multiple authors (and order matters),
you can't do that with an attribute.
Does this represent data, or meta-data (i.e. data about data)?
E.g. the fact that the data 37 is being measured in feet is meta-data.
Put meta-information in attributes, where possible.
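The three guidelines can be seen together in one small design. This is a sketch only; the book tag set and its contents are hypothetical.

```xml
<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author+, height)>
  <!-- Guideline 1: the title may contain markup, so it must be a tag. -->
  <!ELEMENT title (#PCDATA | em)*>
  <!ELEMENT em (#PCDATA)>
  <!-- Guideline 2: several authors, in a meaningful order, so tags again. -->
  <!ELEMENT author (#PCDATA)>
  <!-- Guideline 3: the number is the data; its units are meta-data. -->
  <!ELEMENT height (#PCDATA)>
  <!ATTLIST height units (inches | feet) #REQUIRED>
]>
<book>
  <title>XML: a <em>visual</em> guide</title>
  <author>First Author</author>
  <author>Second Author</author>
  <height units="inches">9</height>
</book>
```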
Design practice
Sample exercise:
(a) create a reasonable DTD for census records so that the following file would be legal:
ch06-census.xml
(b) Critique any strengths and weaknesses of how that file represents information —
what changes would you make to represent census records?
Sample exercise:
Come up with a DTD for flow charts, such as
imgur.com/ECkYukd#.