These notes are influenced by XML Visual Quickstart Guide by Kevin Howard Goldberg, and by notes therefrom by Jack Davis (jcdavis@radford.edu).
We've seen quick examples of some standardized file-formats using their own variant of XML: iTunes collections, .svg files, Word documents. We also saw a made-up set of tags about children, as well as the book's made-up set of tags about ancient wonders. Actually, even the “standardized” file formats are just somebody who originally made up tags, to describe the hierarchical information they wanted to represent. But if the tags are made-up, who's to declare whether people are using them correctly?
video: DTD: Introduction; Elements (22m59s)
When defining a new XML language, you must specify its grammar: what tags are allowed, and what child-tags they may (or must) contain. Similarly for attributes: what attributes are required on certain tags (like “img” tags must have a “src” attribute and otherwise be empty), and what the allowed values are for attributes.
In fact, you can compare any XML document to its corresponding grammar (its schema) to validate whether it conforms to the rules specified there. If an XML document is deemed valid, then its data is in the proper form as specified by the schema. (Of course, just as a syntax-checker can validate a Java program's syntax but not its meaning, schemas won't validate an XML document's logical content.) This isn't as much a problem for XML documents (especially small ones) as for programs. For databases implemented as large XML files, people might build other ad hoc tools to run sanity checks on the meaning of an XML document's contents: check that a wonder's year-destroyed isn't less than its year-built, that links actually resolve, etc.
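As a rough illustration of such an ad hoc sanity check, here is a minimal Python sketch using only the standard library. The element names year_built and year_destroyed, the sample data, and the function name find_date_problems are all made up for this example; they are not taken from the textbook's files.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data: the tag names and the years are invented
# purely to exercise the check below.
SAMPLE = """<ancient_wonders>
  <wonder>
    <name>Example A</name>
    <year_built>100</year_built>
    <year_destroyed>900</year_destroyed>
  </wonder>
  <wonder>
    <name>Example B</name>
    <year_built>500</year_built>
    <year_destroyed>200</year_destroyed>
  </wonder>
</ancient_wonders>"""


def find_date_problems(xml_text):
    """Return the names of wonders whose year_destroyed precedes year_built."""
    root = ET.fromstring(xml_text)
    problems = []
    for wonder in root.iter("wonder"):
        built = wonder.findtext("year_built")
        destroyed = wonder.findtext("year_destroyed")
        if built is None or destroyed is None:
            continue  # nothing to compare for this wonder
        if int(destroyed) < int(built):
            problems.append(wonder.findtext("name"))
    return problems


print(find_date_problems(SAMPLE))
```

A grammar can insist that both year tags are present, but only a check like this one looks at what the numbers mean.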
There are two common formats for specifying XML grammars: DTDs and XML Schema. A DTD, “Document Type Definition”, is an older but widely used system with a peculiar and limited syntax. However, DTDs are lightweight: compact and easily comprehended with a little study. Since they are relatively simple and still widely used, studying them is a good first step in understanding XML tag set definition. A DTD is a plain-text document but is not itself an XML document, so it does not need to begin with the standard XML declaration.
An example file: the textbook's ch06-wonders.dtd, as referenced in ch06-wonders.xml (do a view-source, line 3)
You can have the DTD as an external file, or in-document (similar to providing css-files).
<!DOCTYPE ancient_wonders [
  <!ELEMENT ancient_wonders (wonder*)>
  <!ELEMENT wonder (name+, …)>
  ⋮
]>
<ancient_wonders>
  <wonder>
    <name>Great Pyramid of Giza</name>
    ⋮
  </wonder>
</ancient_wonders>
Defining Elements (tags)
Elements are the foundational units of an XML document. They
can contain values, have attributes, and they can contain other
elements. A DTD for a given custom markup language (tag set)
will define a list of elements and any child elements that each
element can have. It will define any attributes that each
element can have, and it will define whether these elements and
attributes are optional or required.
see also: w3schools.com
Optional and repeatable child nodes:
<!ELEMENT tag (child1*, child2+, child3?) >
You can specify how many times a child element may occur
in an XML document: * means 0 or more, + means 1 or more, and ? means 0 or 1.
For example:
<!ELEMENT pet (name+, nickname*, vet?) >
saying that a tag pet can have one or more proper names,
any number of nicknames,
and (possibly) one current vet (all in that order).
You can also apply a quantifier to a parenthesized sequence of tags,
so that the whole sequence may be optional or repeated.
There is no special way to define a specific quantity of an element (for
example, 3 occurrences). The one rather awkward way to do it is:
<!ELEMENT tag (child1, child1, child1) >
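One way to get a feel for these quantifiers is to notice that a content model behaves much like a regular expression over the list of child tags. The Python sketch below makes that concrete. It is only an illustration, not how a real validator is implemented: the helper names content_model_to_regex and children_match are invented here, and the translation handles only element-only models (no #PCDATA, no EMPTY or ANY).

```python
import re


def content_model_to_regex(model):
    """Translate a simple DTD content model into a regex over child-tag names.

    Each tag name becomes a group matching that name plus a trailing space,
    so a quantifier written after the name applies to the whole name.
    """
    compact = re.sub(r"\s+", "", model)
    grouped = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        compact,
    )
    # Commas only express "then", which is plain concatenation in a regex.
    return grouped.replace(",", "")


def children_match(model, child_names):
    """Check whether a list of child-tag names fits the content model."""
    regex = content_model_to_regex(model)
    sequence = "".join(name + " " for name in child_names)
    return re.fullmatch(regex, sequence) is not None


PET = "(name+, nickname*, vet?)"
print(children_match(PET, ["name", "nickname", "nickname", "vet"]))
print(children_match(PET, ["nickname"]))                 # no proper name
print(children_match(PET, ["name", "vet", "nickname"]))  # wrong order
print(children_match("(child1, child1, child1)", ["child1"] * 3))
```

A quantifier on a parenthesized sequence works the same way: with the made-up model "((first, last)+, vet?)", the list first, last, first, last fits, while first, first does not.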
Attributes are useful to provide additional data about an element. Information contained in attributes tends to be about the content of the XML document, as opposed to being the content itself. General best practices suggest that elements are better used for information you want to display, while attributes are better used for information about information. Some of the reasons are: attributes cannot describe data relationships like child elements can, their values are not as easily validated by a DTD, and they cannot contain multiple values whereas child elements can. Attributes are often used with empty elements, where they describe information about the element. For example, they are often used to store IDs, since an ID is not the data itself, but information about the data.
<!ELEMENT height (#PCDATA)> <!ATTLIST height |
<story> The oldest of the &wow;, the Great Pyramid, is … </story>
will render as “The oldest of the Wonders o’ the World, the Great Pyramid, is…”.
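To watch this substitution happen, here is a small Python sketch using the standard library's parser. The ENTITY declaration for wow is an assumption, reconstructed from the rendered text quoted above rather than copied from the textbook's DTD; the elided “…” is kept as written.

```python
import xml.etree.ElementTree as ET

# Assumed declaration of the wow entity (inferred from the rendered text),
# placed in an in-document DTD so the example is self-contained.
DOC = """<!DOCTYPE story [
  <!ENTITY wow "Wonders o’ the World">
]>
<story> The oldest of the &wow;, the Great Pyramid, is … </story>"""

story = ET.fromstring(DOC)

# The parser has already replaced &wow; by the time we see the text.
print(story.text.strip())
```

The parser expands the entity reference before handing text to the program, so code that reads story.text never sees &wow; itself.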
DTDs actually allow fancier entity handling, but the concepts introduced aren't as fundamental as the general idea of formally defining our data, which is the reason we have looked at DTDs.
Security issue: There are many ways to trick computers into serving up resources they shouldn't; here is one related to a program which purportedly processes a docx document, but instead exploits its XML format and its DTD: click “1 Candy Cane” » “show solution”, and just look at the DTD file at the bottom of the solution.
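One common defensive pattern, when a program accepts XML from strangers, is simply to refuse any document that brings its own DTD. The Python sketch below shows the idea with the standard library's expat bindings. It is not taken from the linked solution; the names DoctypeForbidden and check_no_doctype, the file path, and the sample documents are all made up for illustration.

```python
import xml.parsers.expat


class DoctypeForbidden(ValueError):
    """Raised when an untrusted document declares its own DTD."""


def check_no_doctype(xml_text):
    """Parse xml_text, raising DoctypeForbidden if it contains a DOCTYPE."""
    parser = xml.parsers.expat.ParserCreate()

    def reject(name, system_id, public_id, has_internal_subset):
        # Called as soon as the parser reaches <!DOCTYPE ...>, before any
        # entity declared inside it can be used.
        raise DoctypeForbidden("document declares a DOCTYPE: " + name)

    parser.StartDoctypeDeclHandler = reject
    parser.Parse(xml_text, True)  # True marks this as the final chunk
    return True


# Hypothetical input: an entity that points at a local file.
RISKY = """<!DOCTYPE pet [
  <!ENTITY secret SYSTEM "file:///hypothetical/path">
]>
<pet><name>&secret;</name></pet>"""

print(check_no_doctype("<pet><name>Rex</name></pet>"))
try:
    check_no_doctype(RISKY)
except DoctypeForbidden as problem:
    print("rejected:", problem)
```

Refusing every DOCTYPE is blunt, since it also rejects harmless documents, but it is easy to reason about when the program never needs DTDs from its input.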
< character because that would indicate a sub-tag, despite the fact that the term
“parsed” seems to misleadingly suggest that sub-tags would be processed. ↩
The details are easy to confuse: DTDs allow #PCDATA for elements and CDATA for attributes (note the # or lack of it), but there is no PCDATA nor #CDATA. Furthermore, XML already allows <![CDATA[…]]>, a CDATA section, meaning (unparsed) character-data that might contain literal < and & characters. But that's not part of DTD.
This page licensed CC-BY 4.0 Ian Barland. Please mail any suggestions (incl. typos, broken links) to ibarlandradford.edu