XML and related standards

This lecture presents a very brief, very selective introduction to XML and related standards.

Table of contents

  • XML basics: XML, DTDs, XML schema languages, XML Namespaces, JSON
  • Document Object Model
  • XML parsing
  • XML document querying and transformation: XPath and XSLT
  • XML database querying and transformation: XQuery (omitted in 2009)
  • General references

    XML basics

    XML stands for eXtensible Markup Language.

    "XML is a simple [textual] notation for describing trees. Each internal node in the tree is an element, and leaf nodes are either attributes or text. An XML document that is properly nested (so that it really describes a tree) is called well-formed. In addition, one can give a DTD (Document Type Definition) or Schema, which specifies what nodes might appear in the tree. For each element type, one can list what attributes might appear with that element, and give a regular expression specifying what elements can appear within that element. A document that satisfies the associated DTD or Schema is called valid. Each XML dialect is specified by giving its DTD or Schema. Some XML systems work with any well-formed document, while others also check that the document is valid." (Wadler)

    XML looks like HTML but differs (a) in using an open-ended set of elements and attributes, (b) in describing content and structure only (omitting all presentation) and (c) in requiring stricter formatting (e.g. all tags must be closed).
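
    The stricter formatting in point (c) can be observed with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the helper name is_well_formed is an assumption made for this illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# HTML-style markup with an unclosed <br> tag is rejected by an XML parser ...
print(is_well_formed("<p>line one<br>line two</p>"))
# ... while the same content with the empty element closed is accepted.
print(is_well_formed("<p>line one<br/>line two</p>"))
```

    Note that this checks only well-formedness, not validity against a DTD or schema.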

    Here is a possible (large) XML document:

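    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE bank SYSTEM "bank.dtd">
    <bank>
      <customers>
        <customer type="personal">
          ...
          <address> <street>10 Main Street</street> ... </address>
          <phone>07 3123 4567</phone>
        </customer>
        <customer type="business"> ... </customer>
      </customers>
      <accounts>
        <account> ... <name>John &amp; Mary Smith</name> ... </account>
        <account> ... </account>
      </accounts>
    </bank>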

    Clearly, this represents a tree with root bank and children customers and accounts. The first child customers has several customer children, and so on. Note there is no presentation information. Many more features, including (internal) links, are possible but not shown in this example.
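
    To make the tree structure concrete, here is a small sketch that parses a heavily abridged bank document and prints its element outline. The abridged document and the function name outline are assumptions made for this illustration, not part of the lecture's files.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily abridged bank document (most children are omitted).
BANK_XML = """
<bank>
  <customers>
    <customer type="personal"><phone>07 3123 4567</phone></customer>
    <customer type="business"/>
  </customers>
  <accounts>
    <account/>
  </accounts>
</bank>
"""

def outline(element, depth=0):
    """Return one indented line per element, showing the shape of the tree."""
    lines = ["  " * depth + element.tag]
    for child in element:          # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(BANK_XML)
print("\n".join(outline(root)))
```

    The outline starts at the root bank, followed by its children customers and accounts, each with their own subtrees.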

    XML is increasingly becoming the preferred means of storing and transferring information on the Web. Groups of users in a single domain can agree on the structure and elements of documents they wish to exchange, allowing them to effectively process documents they receive. The fact that documents are in text form aids interoperability.

    But how do users in a single domain define the structure and elements of documents they wish to exchange?


    A document type definition (DTD) is a grammar that defines the structure of a valid document using regular expressions (remember them?). For example, the above document might be valid with respect to the following DTD:

    <!ELEMENT bank ( customers, accounts, ... )>
    <!ELEMENT customers ( customer+ )>
    <!ATTLIST customer type CDATA #REQUIRED>
    <!ELEMENT customer ( id, name, address, phone?)>
    <!ELEMENT name ( family, given+ )>
    <!ELEMENT family ( #PCDATA )>
    <!ELEMENT given ( #PCDATA )>
    <!ELEMENT address ( street, suburb, state, postcode )>

    Here, the attribute type CDATA denotes character data (i.e., an arbitrary text string) and #PCDATA denotes parsed character data (i.e., text in which markup is recognised and defined entities such as &amp; and &lt; are expanded). Clearly, the design of a DTD is analogous to the design of a relational database schema. A new design decision is whether to use attributes or elements. Another decision is whether to use "foreign keys" such as customer in account or explicit links. Other decisions are whether to store the whole bank in a single file, or to store customers in one file and accounts in another, or to store each customer and account in a separate file.

    The above bank document refers to an external DTD (bank.dtd) with root bank. If we wish to include the DTD in the document itself (as an embedded DTD), the form is:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE bank [
    <!ELEMENT bank ( customers, accounts, ... )>
    ...
    ]>
    <bank> ... </bank>

    File abook.dtd is a possible DTD for a guest book database; file abook.xml is an XML file containing some guest book messages.

    File bib.dtd is a DTD for a bibliography database; file bib.xml is an XML file containing some bibliographic data. Note the use of alternatives ("|") in the DTD.

    All of these XML documents are simple ones, in that they do not contain links between documents, do not refer to "namespaces", do not contain "processing instructions", etc.

    Note that XML documents may be nested arbitrarily deeply if their DTDs are recursive. This is appropriate for example with family trees, books containing arbitrarily deeply nested sections and subsections, and other hierarchical structures. A recursive DTD might contain rules such as:

    <!ELEMENT person ( name, gender, children )>
    <!ELEMENT children ( person* )>
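
    Recursive document structures are naturally processed by recursive functions. The sketch below assumes a small family tree in which a person without children has an empty children element; the names and the two helper functions are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical family tree following the person/children rules above.
FAMILY_XML = """
<person>
  <name>Ann</name> <gender>F</gender>
  <children>
    <person>
      <name>Bob</name> <gender>M</gender>
      <children>
        <person><name>Cy</name> <gender>M</gender> <children/></person>
      </children>
    </person>
    <person><name>Di</name> <gender>F</gender> <children/></person>
  </children>
</person>
"""

def generations(person):
    """Number of generations in the family tree rooted at this person."""
    kids = person.findall("children/person")
    return 1 + max((generations(k) for k in kids), default=0)

def descendant_names(person):
    """Names of all descendants of this person, in document order."""
    names = []
    for kid in person.findall("children/person"):
        names.append(kid.findtext("name"))
        names.extend(descendant_names(kid))
    return names

root = ET.fromstring(FAMILY_XML)
print(generations(root))        # 3
print(descendant_names(root))   # ['Bob', 'Cy', 'Di']
```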

    Generally, the theory of designing a "good" XML DTD (specifically, one with no redundancy) is less well understood than that of designing a good relational database schema, but research (and practice) is progressing. (See recent papers in ACM PODS, TODS and SIGACT, for example.)

    XML is thus seen to be a metalanguage, a notation in which different languages can be defined for different purposes.

    XML Schema

    However, DTDs are limited; they only define the "syntax" of XML documents. XML Schema (or XSD) is an XML language that extends DTDs by providing more (database-like) types, by allowing you to define your own types, by allowing you to specify the type, range and precision of attributes, by allowing you to indicate that elements may occur in any order (sets), by allowing you to specify the cardinality (number of occurrences) of one element in another, by allowing you to indicate keys for elements, and so on. Moreover, an XML schema, unlike a DTD, is itself an XML document. XML validators can then check that given documents are consistent with given schemas. So, in practice, members of a given domain prefer to define their data as XML schemas rather than as DTDs.

    You can validate an XML file with respect to a DTD or XML Schema online using the STG validator or the W3Schools.com validator. In each case, you need to include the DTD with the XML document or put both the DTD and the XML on a Web server reachable by the service (which excludes dwarf).

    Alternative schema languages

    Many developers have criticised XML Schema as being both too complex and too restrictive for its intended applications. One arguably better schema language that is becoming increasingly widely used is RELAX NG. See the RELAX NG Tutorial and the RELAX NG Compact Syntax Tutorial for more information. Here is the RELAX NG compact syntax corresponding to the guestbook DTD above and here is the compact syntax corresponding to the bibliography DTD above.

    XML Namespaces

    Namespaces are used to resolve possible name conflicts, e.g., we may distinguish between table elements that occur in different XML documents by prefixing each occurrence of the name with a namespace identifier and by providing a distinct URL for each namespace. For example, compare

    <h:table xmlns:h="http://www.w3.org/TR/html4/">
       <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
    </h:table>

    with

    <f:table xmlns:f="http://www.w3schools.com/furniture/">
       <f:name>African Coffee Table</f:name>
       ...
    </f:table>

    We can avoid the need for all these prefixes by defining a default namespace:

    <table xmlns="http://www.w3.org/TR/html4/">
       <tr> <td>Apples</td> <td>Bananas</td> </tr>
    </table>

    Note that the URLs used in namespaces need not contain information (though they may do so); they are primarily used to distinguish different namespaces.
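
    As an illustration of how a program sees namespaces, the following sketch parses both kinds of table element with Python's xml.etree.ElementTree. The enclosing root element and the prefixes html and furn used in the queries are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

# Both table elements placed in one (hypothetical) enclosing document.
DOC = """
<root>
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr>
  </h:table>
  <f:table xmlns:f="http://www.w3schools.com/furniture/">
    <f:name>African Coffee Table</f:name>
  </f:table>
</root>
"""

root = ET.fromstring(DOC)

# ElementTree expands each prefixed name to "{namespace-URL}local-name",
# so the two table elements have different tags.
tags = [child.tag for child in root]
print(tags)

# A prefix map lets queries use short prefixes. The prefixes chosen here
# need not match those in the document; only the URLs matter.
ns = {"html": "http://www.w3.org/TR/html4/",
      "furn": "http://www.w3schools.com/furniture/"}
cells = [td.text for td in root.findall("html:table/html:tr/html:td", ns)]
names = [n.text for n in root.findall("furn:table/furn:name", ns)]
print(cells, names)
```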

    Alternative metalanguages for the Web

    Some developers have criticised XML as being too complex for most of its intended applications. One simpler metalanguage that is gaining increasing acceptance is JSON (JavaScript Object Notation). JSON assumes all data can be represented by composing numbers and strings using just lists and associative arrays (or maps). For example:

    person = {
      "name": "Simon Willison",
      "age": 25,
      "height": 1.68,
      "urls": [
        ...
      ]
    }

    JSON omits many features such as links and namespaces that complicate XML. Libraries for encoding/decoding data into/from JSON exist for every modern programming language, and JSON is increasingly being used for applications that previously used XML, including Ajax.
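
    As a minimal sketch of such a library, Python's standard json module encodes and decodes data in one call each. The entries of the "urls" list below are placeholders invented for this illustration.

```python
import json

# Hypothetical record; the "urls" entries are placeholders.
person = {
    "name": "Simon Willison",
    "age": 25,
    "height": 1.68,
    "urls": ["http://example.org/a", "http://example.org/b"],
}

text = json.dumps(person)     # encode: Python dict -> JSON text
decoded = json.loads(text)    # decode: JSON text -> Python dict

print(text)
print(decoded["urls"][0])
```

    Maps become dictionaries, lists become lists, and numbers and strings keep their types, so no schema or tree API is needed for simple data.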


    XHTML 1.0 is a particular XML language corresponding to HTML 4.01. Until very recently, the W3C had been promoting XHTML as a replacement for HTML so that systems for XML processing could be applied to (X)HTML documents.

    Like HTML 4, XHTML 1.0 comes in three varieties: Strict, Transitional and Frames, each defined by a separate DTD.

    There is a working draft of a (significantly different) XHTML 2.0 specification, but see below.

    Here is a minimal (strict) XHTML 1.0 document:

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
    <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
    <head> <title>Required title</title> </head>
    <body> <p>Document content...</p> </body>
    </html>

    The header is required to inform the browser how to process the document.

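    A well-formed XHTML document must satisfy the following conditions:

      • The document has a single root element (html).
      • Elements are properly nested.
      • Every element is closed, including empty elements (e.g. <br />).
      • Element and attribute names are written in lower case.
      • Attribute values are always quoted.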

    XHTML documents, like HTML documents, may be validated by the W3C Markup Validation Service.

    The W3C has been recommending that new applications should use XHTML instead of HTML. At the very least, new HTML documents should satisfy the above conditions to facilitate future conversion to XHTML.

    The HTML vs XHTML debate

    The W3C has not revised HTML 4.01 since 1999. Instead, it has been encouraging developers to use XHTML 1.1 now and XHTML 2.0 in the future. But developers have resisted this encouragement, partly because Internet Explorer, the dominant browser, does not support XHTML 1.1 properly, partly because XHTML 1.1 offers no real functional advantages over HTML 4.01, and certainly because no browser properly implements XHTML 2.0. Meanwhile, HTML 4.01 fails to provide many features, especially form features and multimedia features, that developers require.

    In response to this situation, WHATWG (Web Hypertext Application Technology Working Group) has proposed its own version of HTML, called HTML5, which significantly simplifies and extends HTML 4.01. Some browsers already render HTML5.

    Belatedly, the W3C has (re)started its own HTML Working Group with the charter to "revisit the standard [HTML 4.01] and ... meet ... community needs". The Chairs of this Working Group have promised to take the WHATWG specifications into account and possibly to build upon them. The future is unclear and progress will probably be slow (unfortunately). However, in February 2008, the W3C published a working draft of HTML 5 based on the WHATWG specification. All major browser developers have committed to supporting this new standard and have already started implementing it. The future for Web site developers is promising. (Except for the massive size of the HTML 5 specification.)

    Document object model

    (This section is adapted from www.w3schools.com. See What is the Document Object Model? for another good short introduction. See the W3C's DOM site for full specifications.)

    The document object model (DOM) specifies how to access and transform XML (and HTML) documents from within languages such as JavaScript, Java, and C++. As an XML (or HTML) document is really a tree, the DOM provides an API for tree construction, traversal and update, using XML (or HTML) concepts.

    The documentElement is the top level of the tree. This element has one or many childNodes that represent the subtrees of the tree. Then, the childNodes property of the documentElement can be accessed with a 'foreach' construct to enumerate the subtrees of the tree.

    The following table lists the most common node types in the DOM:

    Node type                Example
    Document type            <!DOCTYPE food SYSTEM "food.dtd">
    Processing instruction   <?xml version="1.0"?>
    Element                  <drink type="beer">Fosters</drink>
    Attribute                type="beer"
    Text                     Fosters

    The DOM is object-oriented and, e.g., a documentElement is a subtype of an element.
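
    The access pattern described above (take the documentElement, then enumerate its childNodes) can be sketched with Python's xml.dom.minidom, a small DOM implementation in the standard library. The food document below is a hypothetical one built around the drink example in the table.

```python
from xml.dom import minidom

# Hypothetical document built around the <drink> example.
doc = minidom.parseString(
    '<food><drink type="beer">Fosters</drink><snack>Chips</snack></food>'
)

root = doc.documentElement             # top-level element of the tree
summary = []
for node in root.childNodes:           # enumerate the subtrees
    if node.nodeType == node.ELEMENT_NODE:
        text = node.firstChild.data    # the Text node inside the element
        # getAttribute returns "" when the attribute is absent
        summary.append((node.tagName, node.getAttribute("type"), text))

print(root.tagName)
print(summary)
```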

    In the Java API for DOM (which is a component API of the JAXP, the Java API for XML Processing), each node of a document is an instance of a particular type such as element, attribute, text, and so on. For each type, a set of methods is defined. For example, class element has methods to return the list of attributes of the element, to return the list of subelements (subtrees) of the element, to return the parent of the element, to add a new subelement (subtree) to the element, and so on. See the examples below.

    The Java API for DOM is actually rather complicated, so a simpler, Java-specific API called JDOM, which does not follow the language-neutral W3C DOM specification, has been developed outside Sun. A good introduction to this JDOM API is provided here and in the JAXP documentation. XOM is another proposal. PHP also has a DOM API close to the W3C DOM specification, which is fairly straightforward to use.

    There is also a JavaScript API for DOM, which allows JavaScript to process either XML documents or HTML documents. The HTML DOM is oriented towards HTML and browser-related concepts such as windows, frames, menubars, status lines, etc.

    XML documents may be created directly by repeated calls of the DOM methods for constructing nodes and adding them to other nodes. More commonly, an XML parser is used to read an XML document from a text file and store it as a tree in main memory. How this is done is described in the next section.
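
    As a minimal sketch of the first approach, the following builds a document by repeated calls of DOM methods, again using Python's xml.dom.minidom. The food and drink element names are taken from the node type table; everything else is an assumption of this example.

```python
from xml.dom import minidom

impl = minidom.getDOMImplementation()
# Arguments: namespace URI, name of the root element, document type.
doc = impl.createDocument(None, "food", None)
root = doc.documentElement

drink = doc.createElement("drink")                  # construct an element node
drink.setAttribute("type", "beer")                  # add an attribute to it
drink.appendChild(doc.createTextNode("Fosters"))    # add a text child
root.appendChild(drink)                             # attach the subtree to the root

xml_text = root.toxml()
print(xml_text)   # <food><drink type="beer">Fosters</drink></food>
```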

    The PHP DOM API is used in the XML-backed guestbook example.

    XML parsing

    There are two main approaches to parsing XML documents: SAX and DOM. Java classes (JAXP) for using either approach have been included since J2SE 1.4; for earlier versions they must be downloaded separately, e.g., by downloading the parser Xerces Java 2 from xml.apache.org. See also Python and XML through Safari; XML parsers are included with the standard Python and PHP implementations.

    SAX parsers

    SAX stands for Simple API for XML. This was the first API used for parsing XML documents and is still widely used. Full information about SAX is available at the official website for SAX. Implementations of the API are available for Java, Python, PHP and other languages.

    The main idea of SAX is that a program invokes a parser on an XML document, and every time a start/end of document, start/end of element, attribute, whitespace, or character sequence is read, a user-defined method is called. As such, SAX is called an event-based API. An important advantage of such event-based APIs is that they make it possible to search a large XML document without having to store it in memory first.

    A good simple introduction to the actual construction of a SAX parser can be found at the Quickstart page of the SAX website. See also the sample programs in the PHP documentation and the Java EE Tutorial.
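
    The event-based idea can be sketched with Python's standard xml.sax module. The handler class ElementCounter and the tiny catalog document are assumptions made for this illustration; the handler keeps only counters, never the whole document.

```python
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """User-defined callbacks invoked by the parser as it reads the input."""

    def __init__(self):
        super().__init__()
        self.counts = {}      # element name -> number of occurrences
        self.depth = 0        # current nesting depth
        self.max_depth = 0    # deepest nesting seen so far

    def startElement(self, name, attrs):
        # Called at each start tag; attributes arrive with this event.
        self.counts[name] = self.counts.get(name, 0) + 1
        self.depth += 1
        self.max_depth = max(self.max_depth, self.depth)

    def endElement(self, name):
        # Called at each end tag.
        self.depth -= 1

DOC = b"<catalog><cd><title>A</title></cd><cd><title>B</title></cd></catalog>"

handler = ElementCounter()
xml.sax.parseString(DOC, handler)
print(handler.counts, handler.max_depth)
```

    For a large file, xml.sax.parse can be given a file name instead, and memory use stays bounded by what the handler chooses to keep.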

    DOM parsers

    Parsing using DOM is significantly simpler than using SAX as it is not necessary to define methods for each document component. It basically suffices to call a parsing method on an XML (text) document; the parsing method returns the corresponding in-memory tree. Instead, the user has to define methods to traverse and transform the resulting tree. It is worth creating an in-memory tree if repeated traversals of the document are required or if the document is to be transformed. Transformations are generally easier with the abstract DOM representation than with the concrete textual representation.

    Again, see the sample programs in the PHP documentation and the Java EE Tutorial. Note in particular the PHP functions DOMDocument->load() for loading an XML document from a file and DOMDocument->save() for saving an XML document to a file.
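
    The same parse, traverse, transform and save cycle can be sketched in Python with xml.dom.minidom. The guestbook element names below are hypothetical and are not taken from the lecture's guestbook files.

```python
from xml.dom import minidom

# One call parses the text and returns the in-memory tree.
doc = minidom.parseString(
    "<guestbook>"
    "<entry><name>Ann</name><message>Hello</message></entry>"
    "<entry><name>Bob</name><message>Spam!</message></entry>"
    "</guestbook>"
)

# First traversal: read the names.
names = [n.firstChild.data for n in doc.getElementsByTagName("name")]

# Second traversal: transform the tree by deleting unwanted entries.
# list(...) takes a snapshot so that removal does not disturb the iteration.
for entry in list(doc.getElementsByTagName("entry")):
    message = entry.getElementsByTagName("message")[0].firstChild.data
    if "Spam" in message:
        entry.parentNode.removeChild(entry)

# Serialise the transformed tree back to text.
result = doc.documentElement.toxml()
print(names)
print(result)
```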

    The PHP DOM parser is used in the XML-backed guestbook example. See the start of method getEntries().

    XML document querying and transformation

    (This section is adapted from Chapters 4 to 7 of Sun's Java EE Tutorial, from www.w3schools.com, and from the Zvon XPath and XSLT tutorials.)

    Because XML documents contain user-defined tags, browsers can't know how to display them, and hence must rely on user-defined style sheets. XSL (eXtensible Stylesheet Language) is used for this purpose. XSL is to XML as CSS is to HTML (actually it's much more powerful). XSL has three components: XPath (a language for selecting parts of an XML document), XSLT (a language for transforming XML documents), and XSL FO (a language for formatting objects). Now, as XSL FO is not widely implemented and as XSLT can transform any XML document into an XHTML document, which browsers can display, we focus on XPath and XSLT in this course.

    Stylesheets should be declared as follows:

    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    Here's a first example of how to use XSLT to transform an XML document into XHTML. Consider document cdcatalog.xml (and DTD cdcatalog.dtd) below:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE catalog SYSTEM "cdcatalog.dtd">
    <?xml-stylesheet type="text/xsl" href="cdcatalog.xsl"?>
    <catalog>
      <cd>
        <title>Empire Burlesque</title>
        <artist>Bob Dylan</artist>
        ...
      </cd> ...
    </catalog>

    The third line specifies the stylesheet to be used to transform this document.

    A possible definition of the stylesheet cdcatalog.xsl is the following. This stylesheet selects all CDs whose artist is "Bob Dylan" and displays them in an HTML table.

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <html> <body>
        <h2>My CD Collection</h2>
        <table border="1">
        <tr bgcolor="cyan">
          <th align="left">Title</th>
          <th align="left">Artist</th>
        </tr>
        <xsl:for-each select="catalog/cd[artist='Bob Dylan']">
        <tr>
          <td><xsl:value-of select="title"/></td>
          <td><xsl:value-of select="artist"/></td>
        </tr>
        </xsl:for-each>
        </table>
      </body> </html>
    </xsl:template>
    </xsl:stylesheet>

    Here, the string values of select attributes are XPath expressions, rules for selecting components of the XML document.

    We can also sort the selected components of the document. For example, if we modified the above stylesheet as follows, it would display all CDs ordered by element artist.

        <xsl:for-each select="catalog/cd">
        <xsl:sort select="artist"/>
        <tr>
          <td><xsl:value-of select="title"/></td>
          <td><xsl:value-of select="artist"/></td>
        </tr>
        </xsl:for-each>

    An XSL-compliant browser will now transform cdcatalog.xml into XHTML and render it as requested.
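
    To see what such a transformation computes, here is the same selection, sorting and table construction written directly in Python with xml.etree.ElementTree. This is not XSLT and not how a browser implements it; it is only a sketch of the effect. The second CD and the function name to_table are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Abridged catalog; the second CD is a made-up entry.
CATALOG = """
<catalog>
  <cd><title>Some Other Album</title><artist>Zed Artist</artist></cd>
  <cd><title>Empire Burlesque</title><artist>Bob Dylan</artist></cd>
</catalog>
"""

def to_table(catalog, artist=None):
    """Build an HTML table element from the cd children of catalog."""
    cds = catalog.findall("cd")
    if artist is not None:                           # like cd[artist='...']
        cds = [cd for cd in cds if cd.findtext("artist") == artist]
    cds.sort(key=lambda cd: cd.findtext("artist"))   # like <xsl:sort select="artist"/>
    table = ET.Element("table", border="1")
    header = ET.SubElement(table, "tr")
    for label in ("Title", "Artist"):
        ET.SubElement(header, "th").text = label
    for cd in cds:                                   # like <xsl:for-each>
        row = ET.SubElement(table, "tr")
        ET.SubElement(row, "td").text = cd.findtext("title")   # like <xsl:value-of>
        ET.SubElement(row, "td").text = cd.findtext("artist")
    return table

catalog = ET.fromstring(CATALOG)
html = ET.tostring(to_table(catalog, artist="Bob Dylan"), encoding="unicode")
print(html)
```

    Calling to_table(catalog) without an artist corresponds to the sorted variant: all CDs, ordered by artist.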

    You can see that XSLT stylesheets are also somewhat analogous to Smarty templates.

    Smarty templates and XSLT stylesheets are used in the two different versions of the XML-backed guestbook example.

    More on XPath

    To repeat, XPath is a query language for selecting parts of an XML document based on document structure, element name, attribute value and textual content. (Other components of an XML document may also be used but are ignored here.) We illustrate its use using bib.xml and bib.dtd as a basis. (We assume some additional elements and attributes where required.)

    Selection starts from the root of the document. For example, the path /bib/book/title returns all <title>...</title> elements (hereafter, title elements) in book elements in bib elements in the document (cf. the Unix file system).

    The path //author/last returns all last elements of author elements in the document wherever the author elements occur in the document (e.g., an author may occur in a book or in a chapter of a collection).

    The path /bib/book/author/* returns all elements, possibly of different types, located by this path.

    The path /bib/book/author[last = "Budd"] returns all author elements whose child element last has value "Budd".

    The path /bib/book/author[last = "Budd"]/first returns all first elements of author elements whose last element has value "Budd".

    There are many functions that may be applied to elements in such conditions, e.g., the path /bib/book/author/first[contains(.,"Z")]/.. returns all author elements that contain "Z" in their first element. Here, . refers to the current element and .. to the parent of the current element (cf. Unix file system).

    Conditions may also use attributes. The path //book[@year >= 2000] returns all book elements with an attribute year whose value is at least 2000. Standard numeric and string comparison operators and functions, and boolean operators may be used in conditions on attribute and text values.

    The path //book/author[1] returns the first author of each book element (indexing starts from 1!).

    The path //title/text() returns all string contents of title elements in the document. I.e., it returns a list of strings, whereas //title returns a list of title elements.

    The path //ISBN[normalize-space(@value) = "123456"] removes leading and trailing whitespace from the value attribute of the ISBN element (and collapses internal runs of whitespace) before performing the comparison. Many other functions may be used in conditions.

    The path //last | //first returns all last and first elements in the document.

    By default, each step of a path moves from a node to its children (along the child axis, which is the default for a step written after "/"). Paths may also move to all descendants, to all following (or preceding) siblings, to the parent, to all ancestors, and so on. Each of these navigation directions is called an axis. For example, the path //author/descendant::* uses the descendant axis and returns all elements that are descendants of any author element in the document. Similarly, the path //author/descendant-or-self::* returns all elements that are descendants of or equal to any author element in the document. (A step along this axis is abbreviated "//".) The path //last[contains(.,"Z")]/parent::* returns all parents (author or editor elements) of all last elements containing "Z" in the document. (The parent axis is abbreviated "..".) XPath 1.0 defines 13 such axes. There are many functions that may be applied to elements and attributes.

    As noted above, selections based on element or attribute values can be made at any position of a path, e.g., /bib/book[@year>=2000]/title.
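
    Several of these paths can be tried with Python's xml.etree.ElementTree, which supports only a limited subset of XPath (no contains(), no numeric comparisons, no text() step). The bibliography below is a hypothetical stand-in for bib.xml; queries are written relative to the root element bib.

```python
import xml.etree.ElementTree as ET

# Hypothetical bibliography; titles and most names are invented.
BIB = """
<bib>
  <book year="1994">
    <title>Book A</title>
    <author><last>Budd</last><first>Tim</first></author>
    <author><last>Zed</last><first>Zoe</first></author>
  </book>
  <book year="2003">
    <title>Book B</title>
    <author><last>Lee</last><first>Kim</first></author>
  </book>
</bib>
"""
bib = ET.fromstring(BIB)

# /bib/book/title
titles = [t.text for t in bib.findall("book/title")]

# //author/last
lasts = [e.text for e in bib.findall(".//author/last")]

# /bib/book/author[last = "Budd"]/first
budd_first = [e.text for e in bib.findall("book/author[last='Budd']/first")]

# //book/author[1]
first_authors = [a.findtext("last") for a in bib.findall(".//book/author[1]")]

# /bib/book[@year>=2000]/title: the comparison is outside ElementTree's
# XPath subset, so the condition is applied in Python instead.
recent = [b.findtext("title") for b in bib.findall("book")
          if int(b.get("year")) >= 2000]

print(titles, lasts, budd_first, first_authors, recent)
```

    A full XPath processor evaluates all of the paths in this section directly; the Python filter for the year condition is a workaround specific to this library.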

    XPath is quite a powerful query language for XML documents.

    See this XPath tutorial for more information on XPath.

    Laboratory 10 contains software and exercises for testing your understanding of XPath.

    More on XSLT

    XSLT is a stylesheet language for transforming XML documents into other XML documents, HTML documents or plain text. It is based on templates. A typical transformation will use several templates. (The example above uses a single inline template.) Each template applies to those items selected by an XPath expression. Values used may be defined by XPath expressions. Templates may be applied recursively. Important control structures are xsl:for-each (for iteration), xsl:if (for selection), xsl:choose, xsl:when, xsl:otherwise (also for selection). These are easily understood from examples.

    More important is the use of templates. A well-written stylesheet consists of a set of templates. The xsl:apply-templates element applies a template rule to the current element or to the current element's children. The xsl:call-template and xsl:with-param elements are used to call a named template. Templates may be defined in XSLT or in a language such as Java using the DOM API.

    Consider the following example from www.w3schools.com which transforms the CD catalog to a list of red title / green artist paragraphs using a set of independent templates.

    <?xml version="1.0" encoding="UTF-8"?>
    <xsl:stylesheet version="2.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    <xsl:template match="/">
      <html> <body>
      <h2>My CD Collection</h2>
      <xsl:apply-templates/>
      </body> </html>
    </xsl:template>
    <xsl:template match="cd">
      <p>
      <xsl:apply-templates select="title"/>
      <xsl:apply-templates select="artist"/>
      </p>
    </xsl:template>
    <xsl:template match="title">
      Title: <span style="color: red"> <xsl:value-of select="."/> </span> <br />
    </xsl:template>
    <xsl:template match="artist">
      Artist: <span style="color: green"> <xsl:value-of select="."/> </span> <br />
    </xsl:template>
    </xsl:stylesheet>

    Here is how the same XML catalog is displayed using this revised XSLT style sheet.
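
    The way independent template rules cooperate can be imitated in Python by a table of functions keyed on element name. This is only an analogy for the rule-dispatch mechanism, not an XSLT processor; the function names and the one-CD catalog are assumptions of this sketch, and text is not escaped.

```python
import xml.etree.ElementTree as ET

CATALOG = ET.fromstring(
    "<catalog>"
    "<cd><title>Empire Burlesque</title><artist>Bob Dylan</artist></cd>"
    "</catalog>"
)

def apply_templates(nodes):
    """Apply the rule matching each node's name and join the output."""
    return "".join(TEMPLATES[node.tag](node) for node in nodes)

def catalog_rule(node):
    return "<h2>My CD Collection</h2>" + apply_templates(node.findall("cd"))

def cd_rule(node):
    # Like the cd template: apply rules to title, then to artist.
    return ("<p>" + apply_templates(node.findall("title"))
            + apply_templates(node.findall("artist")) + "</p>")

def title_rule(node):
    return f'Title: <span style="color: red">{node.text}</span><br />'

def artist_rule(node):
    return f'Artist: <span style="color: green">{node.text}</span><br />'

# One rule per element name, playing the role of match="...".
TEMPLATES = {"catalog": catalog_rule, "cd": cd_rule,
             "title": title_rule, "artist": artist_rule}

page = apply_templates([CATALOG])
print(page)
```

    Each rule knows only about its own element, so a rule can be changed or added without touching the others, which is the main attraction of the template style.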

    An XSLT stylesheet is used in one version of the XML-backed guestbook example.

    The Zvon XPath and XSLT tutorials provide a systematic introduction to XPath and XSLT, and an interactive interpreter to evaluate different XPath queries and XSLT style sheets.

    XML database querying and transformation

    (Omitted in 2009.)

    (This section is based on An early look at XQuery by A. Eisenberg and J. Melton in SIGMOD Record 31, 4 (Dec.2002), pp.113-120.)

    As data is increasingly being stored in the form of XML documents, and as there is a need to search and transform this data, there is a need for a human readable/writable (and machine processable) query language for XML document collections. Such a language is slowly emerging: a working draft has been published by the W3C, and prototype implementations are available. The language is called XML Query or XQuery. It is closely related to XPath. It comes in a human-readable form analogous to SQL and in an XML form (XQueryX).

    The most important construct in XQuery is the FLWR (for-let-where-return) expression. Here are some examples of its use. Suppose that file employees.xml is defined using the DTD

    <!ELEMENT employee ( name, address, dept, HSYears, UnivYears, ... )>
    <!ELEMENT name ( #PCDATA )>

    and file departments.xml is defined using the DTD

    <!ELEMENT department ( name, location, employee+, ... )>

    Return within a single element those employees who have more than 8 years of post-primary education:

      { for $emp in document('employees.xml')
        where $emp/HSYears + $emp/UnivYears gt 8
        return $emp/name


      <name>Chris McLean</name>
      <name>Rodney Topor</name>

    Return as a sequence of strings the names of such employees:

    { for $emp in document('employees.xml')
      where $emp/HSYears + $emp/UnivYears gt 8
      return data($emp/name)


    Chris McLean
    Rodney Topor
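
    For readers more familiar with programming languages than query languages, the for-where-return pattern of these two queries corresponds closely to a list comprehension. The sketch below uses Python's xml.etree.ElementTree; the enclosing employees element, the dept values and the year counts are invented for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical employees document; the numbers are made up.
EMPLOYEES = ET.fromstring("""
<employees>
  <employee><name>Chris McLean</name><dept>research</dept>
            <HSYears>5</HSYears><UnivYears>4</UnivYears></employee>
  <employee><name>Albert Jones</name><dept>accounting</dept>
            <HSYears>5</HSYears><UnivYears>3</UnivYears></employee>
</employees>
""")

def years(emp):
    """Total years of post-primary education of an employee element."""
    return int(emp.findtext("HSYears")) + int(emp.findtext("UnivYears"))

# for $emp ... where HSYears + UnivYears gt 8 return $emp/name
name_elements = [emp.find("name")
                 for emp in EMPLOYEES.iter("employee")
                 if years(emp) > 8]

# ... return data($emp/name)
name_strings = [element.text for element in name_elements]

print([element.tag for element in name_elements])
print(name_strings)
```

    The first list holds name elements and the second holds their string contents, mirroring the difference between returning $emp/name and data($emp/name).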

    Return an organisational structure with employees contained within their departments:

    let $dept := document('departments.xml')
    let $emp := document('employees.xml')
      for $d in $dept//department
      return
        <department name='{$d/name}'>
          { for $e in $emp//employee
            where $e/dept eq $d/name
            return
              <employee>
                { data($e/name) }
              </employee>
          }
        </department>


      <department name="accounting">
        <employee>Albert Jones</employee>

    Return a sequence of employees with their names and the names of their departments as attributes:

    let $dept := document('departments.xml')
    let $emp := document('employees.xml')
    for $e in $emp//employee
    for $d in $dept//department
    where $e/dept eq $d/name
    return
        <employee name='{data($e/name)}'
                  dept='{data($d/name)}' />


    <employee name='Albert Jones'
              dept='accounting' />
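
    This last query is a join of two documents. As a rough sketch of what it computes, here is the equivalent nested loop in Python with xml.etree.ElementTree. The two small documents, their enclosing elements and the function name join_employees are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abridged documents.
EMPLOYEES = ET.fromstring(
    "<employees>"
    "<employee><name>Albert Jones</name><dept>accounting</dept></employee>"
    "<employee><name>Chris McLean</name><dept>research</dept></employee>"
    "</employees>"
)
DEPARTMENTS = ET.fromstring(
    "<departments>"
    "<department><name>accounting</name></department>"
    "<department><name>research</name></department>"
    "</departments>"
)

def join_employees(employees, departments):
    """Nested-loop join: for $e ... for $d ... where $e/dept eq $d/name."""
    results = []
    for e in employees.iter("employee"):
        for d in departments.iter("department"):
            if e.findtext("dept") == d.findtext("name"):
                # Construct <employee name="..." dept="..."/>
                results.append(ET.Element("employee", {
                    "name": e.findtext("name"),
                    "dept": d.findtext("name"),
                }))
    return results

joined = join_employees(EMPLOYEES, DEPARTMENTS)
for element in joined:
    print(ET.tostring(element, encoding="unicode"))
```

    A real XQuery processor is free to evaluate the join more cleverly (e.g. by indexing departments by name); the nested loop only defines the result.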

    Two free, high quality, XQuery systems are Galax and Saxon. Galax has an interactive query evaluator to experiment with. More are described at the W3C XQuery page.

    Note that the design of DTDs and XML schemas to avoid redundancy, based on functional (and other) dependencies, is also important, but is less well-understood and is a subject of current research.

    General references

    See the given resources, starting with Web Design in a Nutshell (third edition) by Niederst Robbins, An Introduction to XML and Web Technologies by Møller and Schwartzbach, XML in a Nutshell (second edition) by Harold and Means, Learning XML (second edition) by Ray, W3Schools XML Tutorials, Zvon's XSL Tutorials, and Phil Wadler's guide to XML.

    Last updated: $Date: 2010/02/08 23:47:44 $, by Rodney Topor