MSc-IT Study Material
June 2010 Edition

Computer Science Department, University of Cape Town

Chapter 9. XML

Table of Contents

Introduction to Markup Languages
SGML
HTML
XML
Relationship
XML Primer
Validity and Well-Formedness
XML Declaration
Encoding: Unicode
Document Type Definition (DTD)
Elements / Tags
Entities
Creating your own ML based on XML
Parsing and Processing XML
SAX
DOM
SAX vs DOM
XML Namespaces
Default Namespaces
Explicit Namespaces
XML Schema
Schema Structure
Sequences
Nested Elements
Extensions
Attributes
Named Types
Other Content Models
Schema Namespaces
Schema Example
Data and Metadata
Metadata Standards
Metadata Transformation
XPath
XPath Syntax
XSL
XSLT
XSLT Templates
XSLT Special Tags
XSLT Language
XSLT Example
Answers
Answer to Activity 1

XML (eXtensible Markup Language) is a markup language for documents that contain structured information. Markup refers to auxiliary information interspersed with text to indicate structure and semantics. The term "documents" covers not only traditional text-based documents but also a wide variety of other XML data formats, including graphics, mathematical equations, financial transactions over a network, and many other classes of information.

Introduction to Markup Languages

Examples of markup languages include LaTeX, which uses markup to specify formatting (e.g. \emph), and HTML, which uses markup to specify structure (e.g. <p>). A markup language specifies the syntax and semantics of the markup tags.

Here is a comparison between plain text and marked up text:

Plain text:

The quick brown fox jumped over the lazy dog.

Marked up text:

*paragraphstart*The *subjectstart*quick brown fox*subjectend*
      *verbstart*jumped*verbend* over the
      *objectstart*lazy dog*objectend*.*paragraphend*

Marked up text aids semantic understanding, since more information is associated with the sentence than just the text itself. It also makes it possible to translate the text to other formats automatically (i.e. by computer).
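The automatic translation mentioned above can be illustrated with a minimal Python sketch. It assumes the toy markup always uses the *namestart* / *nameend* convention of the example; the function name to_angle_tags is hypothetical, and the sample sentence is condensed onto one line for convenience.

```python
import re
import xml.etree.ElementTree as ET


def to_angle_tags(marked_up: str) -> str:
    """Rewrite *namestart* / *nameend* markers as <name> / </name> tags."""
    # Collapse line breaks and runs of spaces left over from the layout.
    text = " ".join(marked_up.split())
    # *subjectstart* becomes <subject>, *subjectend* becomes </subject>.
    text = re.sub(r"\*(\w+?)start\*", r"<\1>", text)
    text = re.sub(r"\*(\w+?)end\*", r"</\1>", text)
    return text


sample = (
    "*paragraphstart*The *subjectstart*quick brown fox*subjectend* "
    "*verbstart*jumped*verbend* over the *objectstart*lazy dog*objectend*."
    "*paragraphend*"
)

converted = to_angle_tags(sample)
print(converted)

# Because the result uses paired tags, a standard XML parser can read it
# and a program can ask for the verb directly.
paragraph = ET.fromstring(converted)
verb = paragraph.find("verb").text
print(verb)
```

Once the structure is explicit, questions such as "what is the verb of this sentence?" need no natural-language analysis: the program simply looks up the marked element.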

SGML

SGML (Standard Generalised Markup Language) specifies a standard format for text markup. All SGML documents follow a Document Type Definition (DTD) that specifies the document's structure. Here is an example:

<!DOCTYPE uct PUBLIC "-//UCT//DTD SGML//EN">
<title>test SGML document
<author email='pat@cs.uct.ac.za' office=410 lecturer>Pat Pukram
<version>
   <number>1.0
</version>
      

To Do: SGML

Can you see why SGML does not require end tags? Find out more about SGML on the Internet. As a starting point, look at the SGML resources page on the W3 Consortium website. Also search for SGML on Google.

HTML

HTML (HyperText Markup Language) specifies standard structures and formatting for linked documents on the World Wide Web. HTML is an application of SGML. In other words, SGML defines a general framework, while HTML defines semantics for a specific application.

<html><head><title>test HTML document</title></head>
<body>
<h1>Author</h1>
<p>Pat Pukram
<br>Lecturer
<br>Email: pat@cs.uct.ac.za
<br>Office: 410
</p>
<h1>Version</h1>
<p>1.0</p>
</body>
</html>
      

To Do: HTML

HTML is used to specify both the structure and the formatting of Web documents. Examine the list of HTML tags that you have learnt so far and decide which group each tag belongs to. Read more in the HTML section of the W3 page.

XML

XML, a subset of SGML, was introduced to ease adoption of structured documents on the Web. While SGML has been the standard format for maintaining such documents, its suitability for the Web was poor (for various technical reasons, some of which are discussed later and some of which are beyond the scope of this course). Because XML conforms to SGML, XML documents can be read by any SGML system. The advantage of XML is that XML documents do not require a system capable of understanding the full SGML language.

Neither HTML nor SGML was considered suitable for the uses to which XML was put. HTML specifies the semantics of a document (which in HTML's case denotes formatting), but does not provide arbitrary structure. SGML, however, does provide arbitrary structure, but is too complex to implement in a Web browser. XML was not designed to replace SGML. As a result, many companies use an SGML-to-XML filter for their content.

<uct>
<title>test XML document</title>
<author email="pat@cs.uct.ac.za" office="410" 
type="lecturer">Pat Pukram</author>
<version>
   <number>1.0</number>
</version>
</uct>
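
To illustrate that XML does not need a full SGML system, the following minimal sketch reads the document above with the small XML parser in Python's standard library (xml.etree.ElementTree). The variable names are illustrative only. The last part feeds the parser a fragment with an omitted end tag, in the style of the SGML example, to show that an XML parser rejects it.

```python
import xml.etree.ElementTree as ET

# The XML example from the text, held in a string.
document = """<uct>
<title>test XML document</title>
<author email="pat@cs.uct.ac.za" office="410"
type="lecturer">Pat Pukram</author>
<version>
   <number>1.0</number>
</version>
</uct>"""

root = ET.fromstring(document)

# Elements are reached by name; attributes behave like a dictionary.
title = root.findtext("title")
author = root.find("author")
version = root.findtext("version/number")

print(title)
print(author.text, author.get("email"), author.get("office"), author.get("type"))
print(version)

# SGML-style markup with an omitted end tag is not acceptable to an XML parser.
try:
    ET.fromstring("<uct><title>test SGML document</uct>")
    accepted = True
except ET.ParseError:
    accepted = False

print("fragment without end tag accepted:", accepted)
```

Note that every attribute value arrives as a string, including office="410"; XML requires attribute values to be quoted and named, unlike the office=410 and bare lecturer forms in the SGML example.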
      

Relationship

The figure below illustrates the relationship between SGML, HTML, XML, and XHTML. XHTML is discussed later in the chapter.