XML applications


This lecture presents a very brief, very selective introduction to XML applications.

"Traditional" Web applications

Data can be stored in XML documents instead of SQL databases and used similarly in dynamic Web applications.

Here is another implementation of the guest book application, this time storing the messages in an XML document gb.xml. This implementation uses the XML and DOM libraries in PHP. You can download the entire application as a zipped archive.

There exist similar, extensive XML and DOM libraries for Python, Java, C++ and many other languages. The ones for Java are rather complicated. Consequently, an alternative JDOM API has been proposed (independently of the Java Community Process).

Currently, this style of application is not widely used because it is expensive to read a large XML document into main memory in order to query it, and it is expensive to read, update and write a large document in order to update it.
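The cost described above is visible in a minimal sketch of the read-update-write cycle. The sketch uses Python's standard xml.dom.minidom rather than the PHP libraries of the guest book application, and the element names guestbook, message, author and text are assumptions, since the structure of gb.xml is not shown here.

```python
from xml.dom import minidom

# Hypothetical guest book content; the real gb.xml may be structured differently.
GB_XML = "<guestbook><message><author>Ann</author><text>Hello</text></message></guestbook>"


def add_message(xml_text, author, text):
    """Parse the whole document, append one message, and serialise it all again."""
    doc = minidom.parseString(xml_text)          # entire document is loaded into memory
    message = doc.createElement("message")
    for tag, value in (("author", author), ("text", text)):
        child = doc.createElement(tag)
        child.appendChild(doc.createTextNode(value))
        message.appendChild(child)
    doc.documentElement.appendChild(message)
    return doc.documentElement.toxml()           # entire document is written out again


def list_authors(xml_text):
    """Query the document: again, the whole tree must be built first."""
    doc = minidom.parseString(xml_text)
    return [a.firstChild.data for a in doc.getElementsByTagName("author")]


updated = add_message(GB_XML, "Bob", "Nice site")
print(list_authors(updated))
```

Even a one-message update parses and re-serialises every existing message, which is why this style scales poorly for large documents.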

It is more common to store large, static information collections in XML documents and to use the very good XML processing tools that now exist for one-off extraction and transformation tasks. For example, USQ's Department of Computing and Mathematics stores its entire course catalog as an XML document, and extracts and transforms it in different ways for different purposes, e.g., university catalogs, student course handouts, and so on.

In the future, database management systems will be able to store XML data efficiently. It will be possible to read only part of a document and to use indexes to query XML data efficiently with XQuery, to express XML document updates using the emerging XQuery!, and to perform XML document updates without reading and writing the whole document.

Currently the most popular open source native XML database system appears to be eXist. Another open source native XML database system is Oracle's Berkeley DB XML. More mature commercial products are IBM's DB2 9 and Mark Logic. Other native XML database systems are described at Wikipedia and XML and Databases.

Web feeds (RSS, Atom, AtomPub)

References

Description

RSS is an XML format that news sources use as an alternative to Web sites for distributing information to readers. Web sites may contain a link to an RSS feed (also called a stream or channel). Selecting this link delivers an XML document in RSS format. To display RSS feeds, users run a "news reader". Examples of news readers are NetNewsWire (for Mac OS X), BlogBridge (all platforms) and Google Reader and Bloglines (both Web-based). Some browsers such as Opera and Safari can be used as news readers.

The form of an RSS feed is indicated by the following informal DTD:

<!ELEMENT rss ( channel )>
<!ELEMENT channel ( title, link, description, language, docs, ..., item+ )>
<!ELEMENT title ( #PCDATA )>
...
<!ELEMENT item ( title, link, description, pubDate, ... )>
...

Here is a toy example of an RSS 2.0 feed (from the comparison cited below):

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
    <channel>

	<title>Example Feed</title>
	<description>Insert witty or insightful remark here</description>
	<link>http://example.org/</link>
	<lastBuildDate>Sat, 13 Dec 2003 18:30:02 GMT</lastBuildDate>
	<managingEditor>johndoe@example.com (John Doe)</managingEditor>

	<item>
	    <title>Atom-Powered Robots Run Amok</title>
	    <link>http://example.org/2003/12/13/atom03</link>
	    <guid isPermaLink="false">urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</guid>
	    <pubDate>Sat, 13 Dec 2003 18:30:02 GMT</pubDate>
	    <description>Some text.</description>
	</item>

    </channel>
</rss>

Examples of RSS feeds may be found at ABC News, Mac News Network, and everywhere else.

There are several versions of RSS: RSS 2.0 (which originated from RSS 0.93) seems to be the most used one; RSS 1.0 is a quite independent specification by a different group of developers.

To address the resulting inconsistencies and to provide an improved future platform, a competing specification called Atom 1.0 has been proposed. Here is the formal Atom 1.0 specification.

Here is a brief, single-entry example of an Atom 1.0 feed document (from the Atom 1.0 specification):

<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

    <title>Example Feed</title>
    <subtitle>Insert witty or insightful remark here</subtitle>
    <link href="http://example.org/"/>
    <updated>2003-12-13T18:30:02Z</updated>
    <author>
	<name>John Doe</name>
	<email>johndoe@example.com</email>
    </author>
    <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

    <entry>
	<title>Atom-Powered Robots Run Amok</title>
	<link href="http://example.org/2003/12/13/atom03"/>
	<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
	<updated>2003-12-13T18:30:02Z</updated>
	<summary>Some text.</summary>
    </entry>

</feed>

Here is a more extensive, single-entry Atom 1.0 Feed Document from the same specification:

<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom">
  <title type="text">dive into mark</title>
  <subtitle type="html">
    A &lt;em&gt;lot&lt;/em&gt; of effort went into making this effortless.
  </subtitle>

  <updated>2005-07-31T12:29:29Z</updated>
  <id>tag:example.org,2003:3</id>
  <link rel="alternate" type="text/html" 
   hreflang="en" href="http://example.org/"/>
  <link rel="self" type="application/atom+xml" 
   href="http://example.org/feed.atom"/>
  <rights>Copyright (c) 2003, Mark Pilgrim</rights>

  <generator uri="http://www.example.com/" version="1.0">
    Example Toolkit
  </generator>

  <entry>
    <title>Atom draft-07 snapshot</title>
    <link rel="alternate" type="text/html" 
     href="http://example.org/2005/04/02/atom"/>

    <link rel="enclosure" type="audio/mpeg" length="1337"
     href="http://example.org/audio/ph34r_my_podcast.mp3"/>
    <id>tag:example.org,2003:3.2397</id>
    <updated>2005-07-31T12:29:29Z</updated>
    <published>2003-12-13T08:29:29-04:00</published>

    <author>
      <name>Mark Pilgrim</name>
      <uri>http://example.org/</uri>
      <email>f8dy@example.com</email>
    </author>
    <contributor>
      <name>Sam Ruby</name>
    </contributor>
    <contributor>
      <name>Joe Gregorio</name>
    </contributor>

    <content type="xhtml" xml:lang="en" 
     xml:base="http://diveintomark.org/">
      <div xmlns="http://www.w3.org/1999/xhtml">
	<p><i>[Update: The Atom draft is finished.]</i></p>

      </div>
    </content>
  </entry>
</feed>

Here is a comparison between RSS 2.0 and Atom 1.0.

Many news readers support both specifications; some support only one. Feed publishers can use either RSS or Atom with confidence that most news readers will accept their feeds. Both RSS and Atom feeds may be validated at the independent Feed Validator or the W3C Feed Validation Service. It is not clear which specification will eventually dominate.

News readers may automatically reload each selected feed, indicate when it has changed, and allow users to browse the items in each feed. Feeds are specified by their URLs. RSS-enabled browsers allow users to retrieve the most recent content from a feed by selecting the corresponding link (to an RSS feed).

News aggregators present items from multiple news feeds in a uniform way. As each news feed is just an XML document, the implementation of a news reader is an exercise in XML programming, actually DOM programming.
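A minimal sketch of such an aggregator follows. It uses Python's standard xml.etree.ElementTree rather than a DOM library, and the two sample feeds are cut-down versions of the toy RSS 2.0 and Atom 1.0 examples above. The function names read_items and aggregate are hypothetical.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

RSS_SAMPLE = """<rss version="2.0"><channel>
  <title>Example Feed</title>
  <link>http://example.org/</link>
  <description>Insert witty or insightful remark here</description>
  <item>
    <title>Atom-Powered Robots Run Amok</title>
    <link>http://example.org/2003/12/13/atom03</link>
  </item>
</channel></rss>"""

ATOM_SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example Feed</title>
  <entry>
    <title>Atom-Powered Robots Run Amok</title>
    <link href="http://example.org/2003/12/13/atom03"/>
  </entry>
</feed>"""


def read_items(feed_text):
    """Return (title, link) pairs from either an RSS 2.0 or an Atom 1.0 document."""
    root = ET.fromstring(feed_text)
    if root.tag == "rss":
        # RSS 2.0: items live under rss/channel and the link is element text.
        return [(item.findtext("title"), item.findtext("link"))
                for item in root.findall("channel/item")]
    if root.tag == ATOM_NS + "feed":
        # Atom 1.0: elements are namespaced and the link is an href attribute.
        items = []
        for entry in root.findall(ATOM_NS + "entry"):
            link = entry.find(ATOM_NS + "link")
            items.append((entry.findtext(ATOM_NS + "title"),
                          link.get("href") if link is not None else None))
        return items
    raise ValueError("unrecognised feed format: " + root.tag)


def aggregate(feeds):
    """Merge the items of several feeds into one uniform list."""
    return [item for feed in feeds for item in read_items(feed)]


for title, link in aggregate([RSS_SAMPLE, ATOM_SAMPLE]):
    print(title, link)
```

The two branches show the main practical difference between the formats for a reader: Atom elements are in a namespace and carry links as attributes, while RSS 2.0 elements are unqualified and carry links as text.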

Feeds are widely used by bloggers.

Tools are available to simplify the provision of Web feeds, e.g., ROME, the generation of Web feeds, e.g., PHP Universal Feed Generator, and Web feed parsing, e.g., SimplePie.

It seems that Web feeds are becoming increasingly widely used by information providers and news sources (but not by learning@griffith), and that users can benefit by using news readers and aggregators to optimise their information gathering and news reading activities.

The Atom Publishing Protocol (AtomPub)

AtomPub is a set of conventions for creating, editing and deleting feeds and feed entries.

References

Summary

An (AtomPub) collection is a resource whose representation is an Atom feed. AtomPub defines a collection's response to GET and POST requests, and also allows PUT and DELETE requests.

A collection is a set of (AtomPub) members. A member may be an entry in an Atom feed, a weblog entry, a news article, a bookmark or, indeed, a multimedia object such as an image, an audio file or a video file.

(AtomPub) clients create members inside a collection by POSTing a representation of the member (e.g., an Atom entry) to the collection URL. The server adds the new member to the collection and returns an HTTP 201 ("Created") code together with a Location header containing the URL of the new member.
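The creation step can be sketched with Python's standard urllib.request. The collection URL and the entry content are hypothetical, and the request is only built here, not sent. The function create_member shows how a client might send it and read the Location header, assuming a server that follows the behaviour just described.

```python
import urllib.request

# Hypothetical collection URL; a real server would advertise its own.
COLLECTION_URL = "http://example.org/blog/entries"

ENTRY = """<?xml version="1.0" encoding="utf-8"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Atom-Powered Robots Run Amok</title>
  <id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>
  <updated>2003-12-13T18:30:02Z</updated>
  <author><name>John Doe</name></author>
  <content>Some text.</content>
</entry>"""


def build_create_request(collection_url, entry_xml):
    """Build (but do not send) the POST that asks a collection to create a member."""
    return urllib.request.Request(
        collection_url,
        data=entry_xml.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/atom+xml;type=entry"},
    )


def create_member(collection_url, entry_xml):
    """Send the POST and return the new member's URL from the Location header."""
    request = build_create_request(collection_url, entry_xml)
    with urllib.request.urlopen(request) as response:
        if response.status != 201:
            raise RuntimeError("expected 201 Created, got %d" % response.status)
        return response.headers["Location"]


request = build_create_request(COLLECTION_URL, ENTRY)
print(request.get_method(), request.full_url)
```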

Each collection may contain accept tags that describe what kind of members may be POSTed to the collection.

Summary of the AtomPub

Resource   | Method | Representation | Description
Member     | GET    | Atom Entry     | Retrieve the Atom representation of the entry.
Member     | PUT    | Atom Entry     | Update the member resource with the Atom entry representation.
Member     | DELETE | Atom Entry     | Delete the member resource.
Collection | GET    | Atom Feed      | Retrieve a list of the members in the collection. May be a subset.
Collection | POST   | Atom Entry     | Create a new member resource with the given Atom Entry.

In their default configuration, many standard Web servers do not act on the HTTP requests PUT and DELETE, so an AtomPub server is required as a "front-end" to the Web server.

Many AtomPub clients and servers have been written and deployed. (Links to follow.)

AtomPub is not just for news feeds and blogs; it has wider application.

Web services

In traditional dynamic Web applications, users (i.e., people) interact with remote programs and associated data from a Web browser. Web services allow programs to interact with such remote programs, providing new types of distributed computing applications.

Simple examples include services that invoke Google Web searches or retrieve stock exchange prices. More complex services might track the status of the goods you have ordered or retrieve your mechanic's or doctor's timetable so your agent can make appointments automatically.

Traditional distributed computing applications (see 3510CIT for details) use sockets, Java remote method invocation or CORBA remote function calls for communication. Web services are characterised by the use of a defined client API, HTTP for communication, XML for queries and results, SOAP (or XML-RPC) for transferring queries and results, and remote procedure calls (using SOAP or XML-RPC). A system supporting Web services requires facilities for describing available services (WSDL) and for locating available services (UDDI). Because SOAP is simpler than CORBA (and other distributed computing environments), and because organisations already have Web infrastructure (HTTP, XML) in place, Web services provide an easier way to implement distributed applications than previous approaches.

"Web services" may be defined in different ways.

"A Web service is a software system designed to support interoperable machine-to-machine interaction over a network. It has an interface described in a machine-processable format (specifically WSDL). Other systems interact with the Web service in a manner prescribed by its description using SOAP messages, typically conveyed using HTTP with an XML serialization in conjunction with other Web-related standards." -- W3C Web Services Architecture Note (WSAN)

"Web services, in the general meaning of the term, are services offered by one application to other applications via the World Wide Web. Clients of these services can aggregate them to form an end-user application, enable business transactions, or create new Web services. In a typical Web services scenario, a business application sends a request to a service at a given URL using the SOAP protocol over HTTP. The service receives the request, processes it, and returns a response." -- Java Web Services Tutorial (Introduction to Web Services)

See also this flash animation from Microsoft.

The following figure from WSAN presents a simplified message-oriented view of Web services:

[Figure: message-oriented view]

Here, the originating agent is a program that replaces the standard Web browser, the processing agent is a program like a Web server, the message body is XML, the message headers are SOAP (or XML-RPC), and the message transport is HTTP.

In contrast, this figure from WSAN presents a simplified service-oriented view:

[Figure: service-oriented view]

Here, the metadata is a WSDL file stored in a UDDI registry.

The technologies used in Web services are shown in the following figure from WSAN:

We discuss (some of) these technologies below.

Google Web APIs

First, let's look at a simple concrete example of a Web service. Google allows users to search for Web sites based on keywords and other document properties. Google Web APIs allow users to write programs that call Google to search the Web for them (or do spelling correction for them).

Such programs use a defined API. A Java implementation of the APIs is freely distributed. This implementation comes with a demonstration program with a command-line interface. The command-line interface calls the APIs to create a GoogleSearch instance, send the instance to Google, receive a GoogleSearchResult, and call the APIs to display it. Here is an example of a possible (SOAP) search request and here is a corresponding search response.

Now, let's consider the technologies involved in a bit more detail. XML Web Services Basics is a good starting point.

SOAP

SOAP stands for Simple Object Access Protocol (or Service-Oriented Architecture Protocol, or... something). It is a communication mechanism. It defines the XML format of messages sent between clients and servers. It specifies how XML documents can be wrapped in messages. It specifies how a message can be interpreted as a remote procedure call consisting of the procedure to be called, the arguments and the results. It specifies the form of an HTTP message that contains a SOAP message.

SOAP messages are normally constructed and interpreted by software so that users don't need to deal with the details.

SOAP messages consist of an envelope that contains a header that describes the message data and a body which contains the message data. Here are some examples:

Note that the second example demonstrates the start of an extended "conversation" between two agents.

The following example illustrates a SOAP message that represents a remote procedure call to procedure chargeReservation with arguments a reservation and a creditCard.
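A message of this kind can be sketched with Python's standard xml.etree.ElementTree. This is a simplified illustration, not the W3C example itself: the application namespace, the flat string arguments and their values are assumptions. Only the SOAP 1.2 envelope namespace is standard.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://www.w3.org/2003/05/soap-envelope"   # SOAP 1.2 envelope namespace
APP_NS = "http://example.org/reservations"            # hypothetical application namespace


def build_rpc_envelope(procedure, arguments):
    """Wrap a procedure name and its named arguments in a SOAP envelope."""
    envelope = ET.Element("{%s}Envelope" % SOAP_NS)
    ET.SubElement(envelope, "{%s}Header" % SOAP_NS)   # empty header; optional in SOAP
    body = ET.SubElement(envelope, "{%s}Body" % SOAP_NS)
    call = ET.SubElement(body, "{%s}%s" % (APP_NS, procedure))
    for name, value in arguments.items():
        ET.SubElement(call, "{%s}%s" % (APP_NS, name)).text = value
    return envelope


envelope = build_rpc_envelope(
    "chargeReservation",
    {"reservation": "R-0001", "creditCard": "0000-0000"},   # placeholder values
)
print(ET.tostring(envelope, encoding="unicode"))
```

The procedure name becomes the single child element of the body, and each argument becomes a child of that element. In a realistic message the arguments would themselves be structured XML rather than plain strings.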

There is much more...

Good introductions to SOAP can be found at W3 Schools' SOAP Tutorial, Microsoft's Understanding SOAP and W3C's SOAP Primer.

XML-RPC is an older, simpler predecessor of SOAP that is still in use.
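To give a feel for how small an XML-RPC call is, the following sketch uses Python's standard xmlrpc.client module to serialise a call without sending it. The procedure name sample.add and its arguments are hypothetical.

```python
import xmlrpc.client

# "sample.add" is a hypothetical remote procedure name.
request_xml = xmlrpc.client.dumps((2, 3), methodname="sample.add")
print(request_xml)

# Decoding the payload recovers the arguments and the procedure name.
params, method = xmlrpc.client.loads(request_xml)
print(method, params)
```

The whole request is a single methodCall element holding the method name and a list of typed parameters. There is no envelope or header, unlike SOAP.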

WSDL

"WSDL (Web Services Description Language) is an XML-based language for describing Web services and how to access them." -- W3 Schools

Each WSDL document is a simple XML document that contains a set of definitions to define a Web service. The definition specifies the operations performed by the service, the messages used by the service, the data types used by the service and the communication protocol used by the service.

For example, this simplified WSDL document (from the W3 Schools tutorial) specifies that getTerm is an operation with input message getTermRequest and output message getTermResponse, and that the two messages have parts term and value (resp.) each of type string.
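The structure just described can be made concrete with a short sketch that parses a hand-written WSDL fragment and lists its operations. The fragment is reconstructed from the description above rather than copied from the tutorial. The portType name glossaryTerms, the target namespace and the function name describe_operations are assumptions.

```python
import xml.etree.ElementTree as ET

WSDL_NS = {"wsdl": "http://schemas.xmlsoap.org/wsdl/"}   # WSDL 1.1 namespace

# Hand-written fragment matching the description above; "glossaryTerms" and
# the target namespace are assumptions.
WSDL_TEXT = """<definitions xmlns="http://schemas.xmlsoap.org/wsdl/"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="http://example.org/glossary">
  <message name="getTermRequest">
    <part name="term" type="xs:string"/>
  </message>
  <message name="getTermResponse">
    <part name="value" type="xs:string"/>
  </message>
  <portType name="glossaryTerms">
    <operation name="getTerm">
      <input message="getTermRequest"/>
      <output message="getTermResponse"/>
    </operation>
  </portType>
</definitions>"""


def describe_operations(wsdl_text):
    """Map each operation name to its input and output messages and their parts."""
    root = ET.fromstring(wsdl_text)
    parts = {
        message.get("name"): [(p.get("name"), p.get("type"))
                              for p in message.findall("wsdl:part", WSDL_NS)]
        for message in root.findall("wsdl:message", WSDL_NS)
    }
    operations = {}
    for op in root.findall("wsdl:portType/wsdl:operation", WSDL_NS):
        request = op.find("wsdl:input", WSDL_NS).get("message")
        response = op.find("wsdl:output", WSDL_NS).get("message")
        operations[op.get("name")] = {
            "input": (request, parts[request]),
            "output": (response, parts[response]),
        }
    return operations


print(describe_operations(WSDL_TEXT))
```

A client toolkit performs essentially this lookup when it generates stub code from a WSDL document: operations point to messages, and messages list their typed parts.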

This more complicated document (from the same source) also specifies the SOAP binding to be used.

A SOAP request to this service then has this form and a SOAP response has this form.

Good introductions to WSDL can be found at W3 Schools and Microsoft.

UDDI

Web services are useless unless they can be found. UDDI (Universal Description, Discovery and Integration) is an industry-based directory service that allows organisations to register and search for Web services. It can be considered a dynamic, searchable database of WSDL documents. Like WSDL, it's based on XML, SOAP and other Web standards. Details are complex.

[Figure: UDDI usage]

See W3 Schools, Microsoft, and uddi.org for more information.

RESTful Web Services

Many independent developers think that (Big) Web Services (based on SOAP, WSDL and UDDI) are too complex and (unnecessarily) require the use of (complex, expensive, unsatisfactory) commercial products.

They argue that REST (Representational State Transfer) is a much simpler alternative.

REST is an alternative Web Service architecture based directly on HTTP (described by Roy Fielding in his 2000 doctoral dissertation). State is transferred between client and server through URLs. The client can use all the HTTP verbs - POST, GET, PUT, DELETE, HEAD - to create, retrieve, update, delete and summarise server-side resources (identified by URLs). (Services can still be described with WSDL and registered with UDDI.)

References

Unfortunately, some of these tutorials lack examples.

The following presentation follows RESTful Web Services.

Atom Publishing Protocol

AtomPub (described above) is a canonical RESTful protocol.

Amazon's Simple Storage Service (S3)

(This presentation follows RWS, Chapter 3.)

Amazon provides several Web Services. Amazon's Simple Storage Service (S3) is a practical RESTful Web Service described in detail in RESTful Web Services.

S3 allows users to store data on Amazon's servers either privately or publicly (for a small fee). S3 can be used either as a backup server or as a data server (as an alternative to an ISP). S3 is not accessible to human users through a Web site, only to programs through a programmable interface. Amazon provides sample libraries in many modern languages to make it easy to write clients that access S3.

S3 is based on two concepts: S3 "buckets" and S3 "objects". An object is a named piece of data with associated metadata. A bucket is a named container for objects. That's all. (Buckets and objects correspond roughly to tables and rows in a relational database.)

Each object has four components:

For example, to host O'Reilly's web site on S3 would require a bucket called "oreilly.com" filled with objects whose names are "", "catalog", "catalog/9780596529269" and so on. These objects correspond to the URLs http://oreilly.com/, http://oreilly.com/catalog, and so on. The objects' values would be the HTML contents of O'Reilly's Web pages. These S3 objects would have their Content-Type metadata value set to text/html.

The S3 service provides three types of resources:

Resources are accessed and updated by HTTP requests:

Resource                       | GET                                 | HEAD                      | PUT                                 | DELETE
The bucket list (/)            | List your buckets                   | -                         | -                                   | -
A bucket (/{bucket})           | List the bucket's objects           | -                         | Create the bucket                   | Delete the bucket
An object (/{bucket}/{object}) | Get the object's value and metadata | Get the object's metadata | Set the object's value and metadata | Delete the object

Note that S3 uses PUT differently from AtomPub.

Each request returns a response code, metadata and a representation of a bucket-list, a bucket or an object. Possible response codes include 200 (OK), 301 (moved permanently), 400 (bad request), 403 (forbidden), 404 (not found), 409 (conflict), 410 (gone), 500 (internal server error), and so on.
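The resource table and response codes above can be mimicked by a small in-memory model. This is only a sketch of the uniform interface, not Amazon's implementation: the class name ToyS3 is hypothetical, authentication and XML representations are omitted, the choice of status codes is simplified (405 for an unsupported method is an assumption), and objects with an empty name are not handled.

```python
class ToyS3:
    """In-memory stand-in for the resource table above (no authentication)."""

    def __init__(self):
        self.buckets = {}   # bucket name -> {object name: (value, metadata)}

    def handle(self, method, path, value=None, metadata=None):
        """Dispatch on (resource type, HTTP method); return (status, body)."""
        bucket, _, key = path.lstrip("/").partition("/")
        if not bucket:                                   # the bucket list: /
            if method == "GET":
                return 200, sorted(self.buckets)
            return 405, None
        if not key:                                      # a bucket: /{bucket}
            if method == "PUT":
                self.buckets.setdefault(bucket, {})
                return 200, None
            if bucket not in self.buckets:
                return 404, None
            if method == "GET":
                return 200, sorted(self.buckets[bucket])
            if method == "DELETE":
                if self.buckets[bucket]:
                    return 409, None                     # conflict: bucket not empty
                del self.buckets[bucket]
                return 200, None
            return 405, None
        if bucket not in self.buckets:                   # an object: /{bucket}/{object}
            return 404, None
        objects = self.buckets[bucket]
        if method == "PUT":
            objects[key] = (value, metadata or {})
            return 200, None
        if key not in objects:
            return 404, None
        if method == "GET":
            return 200, objects[key]                     # value and metadata
        if method == "HEAD":
            return 200, objects[key][1]                  # metadata only
        if method == "DELETE":
            del objects[key]
            return 200, None
        return 405, None


s3 = ToyS3()
s3.handle("PUT", "/crummy.com")
s3.handle("PUT", "/crummy.com/mydocument", "some text", {"Content-Type": "text/plain"})
print(s3.handle("GET", "/"))
print(s3.handle("HEAD", "/crummy.com/mydocument"))
print(s3.handle("DELETE", "/crummy.com"))
```

Every operation is expressed as one of four verbs applied to one of three kinds of URL, so the client needs no service-specific method names.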

For example, a request to get your bucket list could return the following XML document:

<?xml version='1.0' encoding='UTF-8'?>
<ListAllMyBucketsResult xmlns='http://s3.amazonaws.com/doc/2006-03-01'>
 <Owner>
  <ID>c0363572......</ID>
  <DisplayName>leonardrichardson</DisplayName>
 </Owner>
 <Buckets>
  <Bucket>
   <Name>crummy.com</Name>
   <CreationDate>2006-10-26T18:47:45.000Z</CreationDate>
  </Bucket>
 </Buckets>
</ListAllMyBucketsResult>

A client might then apply the XPath query //Bucket/Name to this document to get the name of each bucket (and not its other properties).
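A minimal sketch of that extraction, using Python's standard xml.etree.ElementTree, follows. Because the response declares a default namespace, the path has to name it explicitly. The prefix s3 and the function name bucket_names are arbitrary choices made here, and the owner ID is left elided as in the response above.

```python
import xml.etree.ElementTree as ET

# Namespace URI copied from the sample response; the "s3" prefix is arbitrary.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01"}

RESPONSE = """<?xml version='1.0' encoding='UTF-8'?>
<ListAllMyBucketsResult xmlns='http://s3.amazonaws.com/doc/2006-03-01'>
 <Owner>
  <ID>c0363572......</ID>
  <DisplayName>leonardrichardson</DisplayName>
 </Owner>
 <Buckets>
  <Bucket>
   <Name>crummy.com</Name>
   <CreationDate>2006-10-26T18:47:45.000Z</CreationDate>
  </Bucket>
 </Buckets>
</ListAllMyBucketsResult>"""


def bucket_names(response_text):
    """Return the text of every Bucket/Name element in a bucket-list response."""
    root = ET.fromstring(response_text.encode("utf-8"))
    # Equivalent of //Bucket/Name, with the default namespace made explicit.
    return [name.text for name in root.findall(".//s3:Bucket/s3:Name", S3_NS)]


print(bucket_names(RESPONSE))
```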

A request to get a particular bucket could return the following XML document:

<?xml version='1.0' encoding='UTF-8'?>
<ListBucketResult xmlns='http://s3.amazonaws.com/doc/2006-03-01'>
 <Name>crummy.com</Name>
 <Prefix></Prefix>
 ...
 <IsTruncated>False</IsTruncated>
 <Contents>
  <Key>mydocument</Key>
  <LastModified>...</LastModified>
  <ETag>...</ETag>
  <Size>22</Size>
  <Owner>
   <ID>c0363572......</ID>
   <DisplayName>leonardrichardson</DisplayName>
  </Owner>
  <StorageClass>STANDARD</StorageClass>
 </Contents>
</ListBucketResult>

A request to return an object returns an analogous XML document.

Authentication and authorisation are important to ensure that each person can only get their own resources or resources that they have permission to get. Details are omitted.

The Resource-Oriented Architecture (ROA)

(This presentation follows RWS, Chapter 4.)

RESTful Web Services use a Resource-Oriented Architecture (ROA). The ROA consists of just four concepts:

  1. Resources
  2. Their names (URLs)
  3. Their representations (e.g., XML/XHTML documents)
  4. The links between them

and four properties:

  1. Addressability
  2. Statelessness
  3. Connectedness
  4. A uniform interface

Resources

A resource is anything important enough to be referenced as a thing in itself, normally stored on a computer, and having a name and address (a URL), e.g., version 1.0.3 of a software release, the latest version of the software release, the first weblog entry for 25 October 2006, a road map of Sunnybank, Brisbane, a directory of resources relating to squids, the next prime number after 1024, a list of the open bugs in the bug database.

URLs

URLs should be descriptive. (Compare with script names in a Web application.) Different URLs may denote the same data, e.g., version 1.0.3 of a software release and the current version of the software release, but should be used differently.

Representations

A representation (e.g., an HTML document, an XHTML document, an XML document, a JSON document) is some data about the current state of a resource.

Information about the nature of a resource may be stored in URLs or headers; URLs are preferable.

Links

The value of a Web service depends on the extent to which representations contain links to other representations.

Addressability

Resources used in a Web Service must be addressable (by URLs).

Statelessness

Pass state in representations of resources and in URLs.

Distinguish between application state (page 10 of results to my query) and resource state (the new name of the picture in the database).
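A minimal sketch of keeping application state in the URL follows, using Python's standard urllib.parse. The search endpoint, the parameter names q and page, and the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Hypothetical search endpoint.
BASE_URL = "http://example.org/search"


def results_url(query, page=1):
    """Encode the application state (query and page number) in the URL itself."""
    return BASE_URL + "?" + urlencode({"q": query, "page": page})


def next_page_url(url):
    """Derive the link to the following page from the current URL alone."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    params["page"] = int(params.get("page", 1)) + 1
    return urlunsplit(parts._replace(query=urlencode(params)))


url = results_url("atom robots", page=10)
print(url)
print(next_page_url(url))
```

Because everything needed to produce page 11 is in the URL for page 10, the server does not have to remember which page a given client was viewing.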

Connectedness

See links above.

Uniform interface

The most important uniform interface is the set of HTTP requests based on the verbs GET, PUT, POST, DELETE, HEAD and OPTIONS.

Note that HTML does not have a uniform interface: it uses POST (and GET) in many different ways.

RESTful Web Service Design

(Draft.)

  1. Figure out the data set.
  2. Design the resources.
  3. Design the representations accepted from the client.
  4. Design the representations served to the client.
  5. Connect resources to each other.
  6. Decide what's supposed to happen in response to each request.
  7. Figure out what may go wrong and report it.

A world map service

See RWS, Chapters 5 and 6.

Del.icio.us social bookmarking service

Del.icio.us is a social bookmarking service also described in detail in RWS, Chapter 7.

See RWS, pp.180 and 181 for the core functionality.

RESTful Web Service Implementations

The client-side (POST, GET, PUT, DELETE) requests to the server can be generated with the XMLHttpRequest API provided by most JavaScript implementations. Also, forms in the emerging HTML 5 standard will allow PUT, DELETE and other methods, in addition to the current GET and POST methods.

Related techniques

Even when writing traditional Web applications, it is desirable to maintain state in documents and URLs, not in hidden locations on the server, whenever possible. See:

JSON

JSON is a light-weight data-interchange format that can act as a replacement for SOAP in Web Services and RESTful applications. See above.
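As a small illustration, the entry from the toy Atom feed above can be carried as a JSON object instead of XML. The sketch uses Python's standard json module. Mapping each Atom element name to a JSON key is an assumption made here, not part of any specification.

```python
import json

# The Atom entry from the feed example, re-expressed as a JSON object.
# The key names mirror the Atom element names; this mapping is an assumption.
entry = {
    "title": "Atom-Powered Robots Run Amok",
    "link": "http://example.org/2003/12/13/atom03",
    "id": "urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a",
    "updated": "2003-12-13T18:30:02Z",
    "summary": "Some text.",
}

text = json.dumps(entry, indent=2)      # serialise for transfer
print(text)
restored = json.loads(text)             # parse on the receiving side
```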

Ajax

See a previous lecture.


Last updated: $Date: 2010/05/26 00:14:21 $, by Rodney Topor