ITEC 325
2021spring
flo

DTDs
tags and attributes

These notes influenced by XML Visual Quickstart Guide by Kevin Howard Goldberg, and notes therefrom by Jack Davis (jcdavis@radford.edu).

We've seen quick examples of some standardized file-formats using their own variant of XML: iTunes collections, .svg files, Word documents. We also saw a made-up set of tags about children, as well as the book's made-up set of tags about ancient wonders. Actually, even the “standardized” file formats began as somebody making up tags to describe the hierarchical information they wanted to represent. But if the tags are made-up, who's to say whether people are using them correctly?

video: DTD: Introduction; Elements (22m59s)

When defining a new XML language, you must specify its grammar — what tags are allowed, what child-tags they may (or, must) contain. Similarly, for attributes — what attributes are required on certain tags (like “img” tags must have a “src” attribute and otherwise be empty), and what the allowed values are for attributes.
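
As a rough sketch of what such a rule looks like in DTD syntax (the format these notes go on to describe), the “img” requirement might be written as follows. The optional alt attribute is an assumption added only for contrast with the required src.

    <!-- img has no content at all: no text, no child tags -->
    <!ELEMENT img EMPTY>
    <!-- src must be present; alt (an illustrative extra) may be omitted -->
    <!ATTLIST img
        src CDATA #REQUIRED
        alt CDATA #IMPLIED>

Under these declarations <img src="pyramid.jpg"/> conforms, while <img/> (no src) and <img src="pyramid.jpg">some text</img> (not empty) do not.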

In fact, you can compare any XML document to its corresponding grammar (its schema) to validate whether it conforms to the rules specified there. If an XML document is deemed valid, then its data is in the proper form as specified by the schema. (Of course, just as a syntax-checker can validate a Java program's syntax but not its meaning, schemas won't validate an XML document's logical content.) This isn't as much a problem for XML documents (especially small ones) as for programs. For databases implemented as large XML files, people might build other ad hoc tools to run sanity checks on an XML document's meaning: checking that a wonder's year-destroyed isn't less than its year-built, that links actually resolve, etc.
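
To illustrate the gap between “valid” and “meaningful”, here is a small sketch of a document that a validator would accept even though its content is nonsense. The tag names year_built and year_destroyed, and all the values, are made up for this example.

    <!DOCTYPE wonder [
        <!ELEMENT wonder (name, year_built, year_destroyed)>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT year_built (#PCDATA)>
        <!ELEMENT year_destroyed (#PCDATA)>
    ]>
    <wonder>
        <name>Some Made-Up Wonder</name>
        <year_built>1200</year_built>
        <!-- destroyed a century before it was built, yet still valid -->
        <year_destroyed>1100</year_destroyed>
    </wonder>

The grammar only says that each year is text inside the right tag, in the right order. It cannot say that the text is a number, let alone that one number must not be smaller than the other.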

There are two common formats for specifying XML grammars: DTDs, and XML Schema. A DTD, “Document Type Definition”, is an older but widely used system with a peculiar and limited syntax. However, DTDs are lightweight: compact and easily comprehended with a little study. Since they are relatively simple and still widely used, studying them is a good first step in understanding XML tag set definition. A DTD is a plain-text document but is not itself written in XML, and therefore it does not begin with the standard XML declaration.

The three things a DTD specifies

An example file: the textbook's ch06-wonders.dtd, as referenced in ch06-wonders.xml (do a view-source, line 3)

Where to put the DTD file

You can have the DTD as an external file, or in-document (similar to providing css-files).

  1. In-document, inside the same file as the XML:
    <!DOCTYPE ancient_wonders [
        <!ELEMENT ancient_wonders (wonder*)>
        <!ELEMENT wonder (name+, )>
        
    ]>
    
    <ancient_wonders>
        <wonder>
            <name>Great Pyramid of Giza</name>
            
        </wonder>
    </ancient_wonders>    
                  
    This is the lightest-weight solution, and is suitable for the homework assignment.
  2. For an external document file on your own computer, you can use SYSTEM: start your XML file with a DOCTYPE specifying where the DTD is found, e.g. <!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">. This is what the above ch06-wonders.xml does (remember to view-source, line 3).
  3. An external document on the interwebs, using PUBLIC: <!DOCTYPE ancient_wonders PUBLIC "-//ibarland//DTD Archaeological Wonders 1.0//EN" "https://itec-php01.radford.edu/~itec325/2016spring-ibarland/Lectures/wonders.dtd">. See here for the syntax of the word after “PUBLIC”. (And: on your homework, prefer the SYSTEM technique to this one.)
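
To make option 2 concrete, here is a minimal sketch of the two files side by side. It assumes, purely for brevity, that a wonder contains nothing but one or more names; the textbook's real DTD declares more than this.

  The XML file, wonders.xml:

    <?xml version="1.0"?>
    <!DOCTYPE ancient_wonders SYSTEM "wonders.dtd">
    <ancient_wonders>
        <wonder>
            <name>Great Pyramid of Giza</name>
        </wonder>
    </ancient_wonders>

  The DTD file, wonders.dtd, in the same directory:

    <!-- simplified for illustration: a wonder holds only names -->
    <!ELEMENT ancient_wonders (wonder*)>
    <!ELEMENT wonder (name+)>
    <!ELEMENT name (#PCDATA)>

Note that the external file holds only the declarations: no <!DOCTYPE ... [ ]> wrapper and no XML declaration. If you happen to have the xmllint command-line tool installed (an assumption; it is not required for this course), running xmllint --valid --noout wonders.xml should report nothing for a valid file and print errors otherwise.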

Defining your XML: Elements, Attributes, and Entities

Security issue: There are many ways to trick computers into serving up resources they shouldn't; here is one related to a program which purportedly processes a docx document, but instead exploits its XML format and its DTD: click 1 Candy Cane » show solution and just look at the DTD file at the bottom of the solution.

1 However, an element that is defined to contain PCDATA can't contain any other element (no sub-tags). (Thus an element allowed to be #PCDATA cannot contain a raw < character because that would indicate a sub-tag, despite the fact that the term “parsed” misleadingly suggests that sub-tags would be processed.)
2 The details are easy to confuse: DTDs allow #PCDATA for elements and CDATA for attributes (note the # or lack of it), but there is no PCDATA nor #CDATA.

Furthermore, XML already allows <![CDATA[ ]]>, a CDATA section (not a processing instruction), meaning unparsed character data that may contain literal < and & characters. But that is part of XML itself, not part of the DTD syntax.
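
As a small sketch that puts the three look-alikes next to each other (the language attribute on name is an assumption made for this example):

  In the DTD:

    <!-- #PCDATA: text content of an element, with no sub-tags -->
    <!ELEMENT name (#PCDATA)>
    <!-- CDATA (no #): an attribute whose value is an arbitrary string -->
    <!ATTLIST name language CDATA #IMPLIED>

  In the XML document:

    <!-- a raw < is not allowed in the text, so it is escaped -->
    <name language="English">Temple of Artemis &lt;at Ephesus&gt;</name>
    <!-- a CDATA section: the same text, with < and > written literally -->
    <name><![CDATA[Temple of Artemis <at Ephesus>]]></name>

Both name elements carry the same text once parsed, and both are acceptable for an element declared as (#PCDATA); the CDATA section is just a different way of writing the characters in the document.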

     

This page licensed CC-BY 4.0 Ian Barland