MSc-IT Study Material, June 2010 Edition, Computer Science Department, University of Cape Town
Data refers to digital objects that contain useful information for information seekers. Metadata refers to descriptions of these objects. Many systems manipulate metadata records, which contain pointers to the actual data.
To promote interoperability among systems, several popular metadata standards describe objects both semantically and syntactically:
Dublin Core: uses fifteen simple elements to describe every object.
MARC: a comprehensive system devised to describe items in a (physical) library.
RFC 1807: a bibliographic record format for computer science publications.
IMS Metadata Specification: courseware object description.
VRA-Core: multimedia (especially image) description.
EAD (Encoded Archival Description): finding aids for locating archived items.
Dublin Core is one of the most popular (and simplest) metadata formats. It contains fifteen elements, each with recommended semantics. All the elements are optional and repeatable. They are:
Title | Creator | Subject |
Description | Publisher | Contributor |
Date | Type | Format |
Identifier | Source | Language |
Relation | Coverage | Rights |
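Because every element is optional and repeatable, one natural in-memory representation is a mapping from element name to a list of values. The Python sketch below is illustrative only: `make_record` is a hypothetical helper rather than part of any Dublin Core library, and the two subject strings are placeholders.

```python
# The fifteen Dublin Core elements, as listed in the table above.
DC_ELEMENTS = (
    "title", "creator", "subject",
    "description", "publisher", "contributor",
    "date", "type", "format",
    "identifier", "source", "language",
    "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a record as {element: [values]} (hypothetical helper)."""
    record = {}
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        # Accept one string or several, since elements are repeatable.
        record[name] = [values] if isinstance(values, str) else list(values)
    return record


# Any element may be left out, and any element may appear more than once.
record = make_record(
    title="02uct1",
    creator="Hussein Suleman",
    subject=["placeholder subject A", "placeholder subject B"],
)
for name, values in record.items():
    for value in values:
        print(f"{name}: {value}")
```

Storing a list for every element, even when there is only one value, keeps the code that later writes the record out free of special cases.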
Below is an example of a Dublin Core record in XML:
    <oaidc:dc xmlns="http://purl.org/dc/elements/1.1/"
              xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
        <title>02uct1</title>
        <creator>Hussein Suleman</creator>
        <subject>Visit to UCT</subject>
        <description>the view that greets you as you emerge from the tunnel under the freeway - WOW - and, no, the mountain isnt that close - it just looks that way in 2-D</description>
        <publisher>Hussein Suleman</publisher>
        <date>2002-11-27</date>
        <type>image</type>
        <format>image/jpeg</format>
        <identifier>http://www.husseinsspace.com/pictures/200230uct/02uct1.jpg</identifier>
        <language>en-us</language>
        <relation>http://www.husseinsspace.com</relation>
        <rights>unrestricted</rights>
    </oaidc:dc>
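To show how a program might read such a record, here is a minimal Python sketch using the standard library's `xml.etree.ElementTree`. It works on a shortened copy of the record above, with the schema attributes omitted. The function name `dc_fields` is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core elements (the default namespace in the record).
DC_NS = "http://purl.org/dc/elements/1.1/"

# Shortened copy of the record above.
RECORD = """<oaidc:dc xmlns="http://purl.org/dc/elements/1.1/"
    xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
  <title>02uct1</title>
  <creator>Hussein Suleman</creator>
  <date>2002-11-27</date>
  <format>image/jpeg</format>
</oaidc:dc>"""


def dc_fields(xml_text):
    """Return (element name, text) pairs for the Dublin Core children."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC_NS + "}"
    pairs = []
    for child in root:
        # ElementTree spells namespaced tags as "{namespace}localname".
        if child.tag.startswith(prefix):
            name = child.tag[len(prefix):]
            pairs.append((name, (child.text or "").strip()))
    return pairs


for name, value in dc_fields(RECORD):
    print(f"{name}: {value}")
```

The container element `oaidc:dc` lives in the OAI namespace, while its children live in the Dublin Core namespace, which is why the sketch filters on the namespace rather than on the bare tag name.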
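To convert metadata from one format to another (for example, from a local format to Dublin Core), take the following steps:
Use an XML parser to parse the source data.
Use SAX or DOM to extract the individual elements and generate the new format.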
The following Perl code converts a record in the UCT format to Dublin Core (don't worry if you do not understand it):
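    use strict;
    use warnings;
    use XML::DOM;    # assumes the CPAN XML::DOM module is installed

    # Parse the source record and fetch its root element.
    my $parser   = XML::DOM::Parser->new;
    my $document = $parser->parsefile('uct.xml')->getDocumentElement;

    # Each <title> becomes a Dublin Core <title>.
    foreach my $title ($document->getElementsByTagName('title')) {
        print "<title>" . $title->getFirstChild->getData . "</title>\n";
    }
    # Each <author> becomes a <creator>.
    foreach my $author ($document->getElementsByTagName('author')) {
        print "<creator>" . $author->getFirstChild->getData . "</creator>\n";
    }
    # The publisher is the same for every record.
    print "<publisher>UCT</publisher>\n";
    # Each <number> inside a <version> becomes an <identifier>.
    foreach my $version ($document->getElementsByTagName('version')) {
        foreach my $number ($version->getElementsByTagName('number')) {
            print "<identifier>" . $number->getFirstChild->getData . "</identifier>\n";
        }
    }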
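The code above takes the DOM route. For comparison, the same mapping can be sketched with SAX, here in Python using the standard library's `xml.sax`. The UCT format itself is not shown in this unit, so the sample record is hypothetical: its structure is inferred from the tag names used above, and its values are placeholders. The names `UctHandler` and `convert` are likewise assumptions.

```python
import xml.sax
from xml.sax.saxutils import escape

# Hypothetical source record; structure inferred from the tag names above.
UCT_XML = """<uct>
  <title>Example Title</title>
  <author>A. Author</author>
  <version><number>1.0</number></version>
</uct>"""

# Source tag -> Dublin Core element.
MAPPING = {"title": "title", "author": "creator", "number": "identifier"}


class UctHandler(xml.sax.ContentHandler):
    """Collect mapped values while streaming through the source record."""

    def __init__(self):
        super().__init__()
        self.fields = {"title": [], "creator": [], "identifier": []}
        self.open_tags = []  # stack of elements currently open
        self.text = []       # character data of the innermost element

    def startElement(self, name, attrs):
        self.open_tags.append(name)
        self.text = []

    def characters(self, content):
        self.text.append(content)

    def endElement(self, name):
        self.open_tags.pop()
        # <number> only counts when it sits inside a <version>.
        wanted = name in MAPPING and (
            name != "number" or "version" in self.open_tags
        )
        if wanted:
            self.fields[MAPPING[name]].append("".join(self.text).strip())
        self.text = []


def convert(xml_text):
    """Return the Dublin Core lines for one source record."""
    handler = UctHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    lines = []
    for element in ("title", "creator"):
        for value in handler.fields[element]:
            lines.append(f"<{element}>{escape(value)}</{element}>")
    lines.append("<publisher>UCT</publisher>")  # constant, as above
    for value in handler.fields["identifier"]:
        lines.append(f"<identifier>{escape(value)}</identifier>")
    return lines


print("\n".join(convert(UCT_XML)))
```

SAX reports elements as a stream of events instead of building a tree, so the handler has to track which elements are open. That costs a little bookkeeping but avoids holding the whole document in memory.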
As you will see later in this unit, there is an easier way to achieve this.