MSc-IT Study Material, June 2010 Edition. Computer Science Department, University of Cape Town
XML (eXtensible Markup Language) is a markup language for documents that contain structured information. Markup refers to auxiliary information interspersed with text to indicate structure and semantics. The term "documents" refers not only to traditional text-based documents, but also to a wide variety of other XML data formats, including graphics, mathematical equations, financial transactions over a network, and many other classes of information.
Examples of markup languages include LaTeX, which uses markup to specify formatting (e.g. \emph), and HTML, which uses markup to specify structure (e.g. <p>). A markup language specifies the syntax and semantics of the markup tags.
Here is a comparison between plain text and marked up text:
Plain text:
The quick brown fox jumped over the lazy dog.
Marked up text:
*paragraphstart*The *subjectstart*quick brown fox *subjectend* *verbstart*jumped*verbend* over the *objectstart* lazy dog*objectend* .*paragraphend*
Marked-up text aids semantic understanding, since more information is associated with the sentence than just the text itself. This also makes it possible to translate the text automatically (i.e. by computer) into other formats.
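To illustrate automatic translation, here is a minimal Python sketch that converts the marked-up sentence into two other formats: angle-bracket tags and plain text. It assumes the marker convention of the example (`*namestart*` opens a span, `*nameend*` closes it). The function names `to_angle_tags` and `strip_markup` are hypothetical and chosen only for this illustration.

```python
import re

# Assumed marker syntax, taken from the example sentence:
# *namestart* opens a span and *nameend* closes it.
MARKER = re.compile(r"\*(\w+?)(start|end)\*")

def to_angle_tags(marked_up: str) -> str:
    """Rewrite *namestart* / *nameend* markers as <name> / </name> tags."""
    def replace(match: re.Match) -> str:
        name, kind = match.group(1), match.group(2)
        return f"<{name}>" if kind == "start" else f"</{name}>"
    return MARKER.sub(replace, marked_up)

def strip_markup(marked_up: str) -> str:
    """Remove every marker and tidy the whitespace to recover plain text."""
    text = MARKER.sub("", marked_up)
    text = re.sub(r"\s+", " ", text).strip()
    # Remove the space that the markers left in front of punctuation.
    return re.sub(r"\s+([.,])", r"\1", text)

sample = (
    "*paragraphstart*The *subjectstart*quick brown fox *subjectend* "
    "*verbstart*jumped*verbend* over the *objectstart* lazy dog*objectend* "
    ".*paragraphend*"
)

print(to_angle_tags(sample))
print(strip_markup(sample))  # The quick brown fox jumped over the lazy dog.
```

Because the markers follow a fixed pattern, a program can find them without understanding the sentence. The same idea underlies the conversion of SGML, HTML and XML documents into other formats.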
SGML (Standard Generalised Markup Language) specifies a standard format for text markup. All SGML documents follow a Document Type Definition (DTD) that specifies the document's structure. Here is an example:
<!DOCTYPE uct PUBLIC "-//UCT//DTD SGML//EN">
<title>test SGML document
<author email='pat@cs.uct.ac.za' office=410 lecturer>Pat Pukram
<version>
  <number>1.0
</version>
Can you see why SGML does not require end tags? Find out more about SGML on the Internet. As a starting point, look at the SGML resources page on the World Wide Web Consortium (W3C) website. Also search for SGML on Google.
HTML (HyperText Markup Language) specifies standard structures and formatting for linked documents on the World Wide Web. HTML is an application of SGML. In other words, SGML defines a general framework, while HTML defines semantics for a specific application.
<html>
<head><title>test HTML document</title></head>
<body>
  <h1>Author</h1>
  <p>Pat Pukram
  <br>Lecturer
  <br>Email: pat@cs.uct.ac.za
  <br>Office: 410
  </p>
  <h1>Version</h1>
  <p>1.0</p>
</body>
</html>
HTML is used to specify both the structure and the formatting of Web documents. Examine the list of HTML tags that you have learnt so far and decide which group each tag belongs to. Read more in the HTML section of the W3C website.
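As a starting point for this exercise, the following Python sketch lists the distinct tags used in the HTML example above, using the standard library's `html.parser` module. The class name `TagCollector` is hypothetical. Deciding whether each tag describes structure or formatting is left to you.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Record each distinct start tag in the order it first appears."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case.
        if tag not in self.tags:
            self.tags.append(tag)

# A copy of the HTML example from the text.
html_doc = """<html><head><title>test HTML document</title></head>
<body>
<h1>Author</h1>
<p>Pat Pukram <br>Lecturer <br>Email: pat@cs.uct.ac.za <br>Office: 410 </p>
<h1>Version</h1>
<p>1.0</p>
</body>
</html>"""

collector = TagCollector()
collector.feed(html_doc)
collector.close()
print(collector.tags)
# ['html', 'head', 'title', 'body', 'h1', 'p', 'br']
```

Note that the parser accepts `<br>` without an end tag. HTML inherits this tolerance from SGML.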
XML, a subset of SGML, was introduced to ease the adoption of structured documents on the Web. While SGML has been the standard format for maintaining such documents, it was poorly suited to the Web (for various technical reasons, some of which are discussed later and some of which are beyond the scope of this course). SGML conformity means that XML documents can be read by any SGML system. The advantage of XML, however, is that XML documents do not require a system capable of understanding the full SGML language.
Both HTML and SGML were considered unsuitable for the use to which XML was put. HTML specifies the semantics of a document (which in HTML's case denotes formatting), but does not provide arbitrary structure. SGML, however, does provide arbitrary structure, but is too complex to implement in a Web browser. XML was not designed to replace SGML. As a result, many companies continue to maintain their content in SGML and use an SGML-to-XML filter to publish it.
<uct>
  <title>test XML document</title>
  <author email="pat@cs.uct.ac.za" office="410" type="lecturer">Pat Pukram</author>
  <version>
    <number>1.0</number>
  </version>
</uct>
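The following Python sketch shows how a program might read the XML example above using the standard library's `xml.etree.ElementTree` module. It also checks what happens when SGML-style shortcuts (an unquoted attribute value and omitted end tags) are given to an XML parser. The variable names are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

# A copy of the XML example from the text.
xml_doc = """<uct>
  <title>test XML document</title>
  <author email="pat@cs.uct.ac.za" office="410" type="lecturer">Pat Pukram</author>
  <version>
    <number>1.0</number>
  </version>
</uct>"""

root = ET.fromstring(xml_doc)
author = root.find("author")

print(root.tag)                          # uct
print(root.findtext("title"))            # test XML document
print(author.text, author.get("email"))  # Pat Pukram pat@cs.uct.ac.za
print(root.findtext("version/number"))   # 1.0

# SGML-style shortcuts are not well-formed XML, so the parser rejects them.
sgml_style = "<uct><title>test SGML document<author office=410>Pat Pukram</uct>"
try:
    ET.fromstring(sgml_style)
    well_formed = True
except ET.ParseError:
    well_formed = False

print(well_formed)  # False
```

Because every XML element is explicitly closed and every attribute value is quoted, a parser can build the document tree without consulting a DTD. This is one reason an XML parser is much simpler to implement than a full SGML system.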