MSc-IT Study Material
June 2010 Edition

Computer Science Department, University of Cape Town

Data and Metadata

Data refers to digital objects that contain useful information for information seekers. Metadata refers to descriptions of these objects. Many systems manipulate metadata records, which contain pointers to the actual data.

Metadata Standards

To promote interoperability among systems, several popular metadata standards define how objects are described, both semantically and syntactically:

  • Dublin Core: uses fifteen simple elements to describe every object.

  • MARC: a comprehensive system devised to describe items in a (physical) library.

  • RFC 1807: a bibliographic record format used for computer science publications.

  • IMS Metadata Specification: courseware object description.

  • VRA-Core: multimedia (especially image) description.

  • EAD (Encoded Archival Description): finding aids used to locate archived items.

Dublin Core Example

Dublin Core is one of the most popular (and simplest) metadata formats. It contains fifteen elements, each with recommended semantics. All the elements are optional and repeatable. They are:

Title        Creator     Subject
Description  Publisher   Contributor
Date         Type        Format
Identifier   Source      Language
Relation     Coverage    Rights
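Because every element is optional and repeatable, a record can be thought of as a mapping from element names to lists of values. The following Python sketch illustrates the idea; the helper make_record and the second subject value are hypothetical, introduced only for this example.

```python
# The fifteen element names of simple Dublin Core, in lowercase as used in XML.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def make_record(**fields):
    """Build a record as a dict mapping element name -> list of values.

    Every element is optional (it may be absent) and repeatable
    (its list may hold more than one value).
    """
    record = {}
    for name, values in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        # Accept either a single string or a list of strings.
        record[name] = [values] if isinstance(values, str) else list(values)
    return record


record = make_record(
    title="02uct1",
    creator="Hussein Suleman",
    subject=["Visit to UCT", "Photographs"],  # second subject is hypothetical
)
print(record["subject"])
```

Elements that are left out, such as rights here, simply do not appear in the mapping.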

Below is an example of a Dublin Core record in XML:

<oaidc:dc xmlns="http://purl.org/dc/elements/1.1/" xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
   <title>02uct1</title>
   <creator>Hussein Suleman</creator>
   <subject>Visit to UCT </subject>
   <description>the view that greets you as you emerge from the tunnel under the freeway - WOW - and, no, the mountain isnt that close - it just looks that way in 2-D</description>
   <publisher>Hussein Suleman</publisher>
   <date>2002-11-27</date>
   <type>image</type>
   <format>image/jpeg</format>
   <identifier>http://www.husseinsspace.com/pictures/200230uct/02uct1.jpg</identifier>
   <language>en-us</language>
   <relation>http://www.husseinsspace.com</relation>
   <rights>unrestricted</rights>
</oaidc:dc>
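
A record like the one above can be read with any namespace-aware XML parser. As a minimal sketch, assuming Python's standard xml.etree.ElementTree module, the function below collects the Dublin Core elements into a dictionary. The name read_dublin_core is hypothetical, and RECORD is a shortened copy of the example with the schema attributes omitted.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# A shortened copy of the record above (schema attributes omitted).
RECORD = """<oaidc:dc xmlns="http://purl.org/dc/elements/1.1/"
    xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">
   <title>02uct1</title>
   <creator>Hussein Suleman</creator>
   <date>2002-11-27</date>
   <format>image/jpeg</format>
</oaidc:dc>"""


def read_dublin_core(xml_text):
    """Return a dict mapping element name -> list of values."""
    root = ET.fromstring(xml_text)
    fields = {}
    for child in root:
        # ElementTree writes namespaced tags as "{namespace}localname".
        if not child.tag.startswith("{" + DC_NS + "}"):
            continue
        name = child.tag.split("}", 1)[1]
        # Strip surrounding whitespace such as stray line breaks.
        fields.setdefault(name, []).append((child.text or "").strip())
    return fields


fields = read_dublin_core(RECORD)
print(fields["title"], fields["date"])
```

Values are stored in lists because any Dublin Core element may be repeated.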
       

Metadata Transformation

To transform a metadata record from one format into another, take the following steps:

  1. Use an XML parser to parse the source record.

  2. Use SAX/DOM to extract individual elements and generate the new format.
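
The two steps above can be sketched with a SAX (event-based) parser. The example below is only an illustration in Python, using the standard xml.sax module; the source record, its element names, and the MAPPING table are all hypothetical.

```python
import xml.sax
from xml.sax.saxutils import escape

# Hypothetical source record; the element names are assumptions.
SOURCE = (b"<record><title>Sample report</title>"
          b"<author>A. Author</author><author>B. Author</author></record>")

# Hypothetical mapping from source element names to Dublin Core names.
MAPPING = {"title": "title", "author": "creator"}


class ToDublinCore(xml.sax.ContentHandler):
    """Collect text from mapped elements and emit Dublin Core lines."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._target = None  # Dublin Core name of the element being read
        self._text = []

    def startElement(self, name, attrs):
        if name in MAPPING:
            self._target = MAPPING[name]
            self._text = []

    def characters(self, content):
        # SAX may deliver the text of one element in several chunks.
        if self._target is not None:
            self._text.append(content)

    def endElement(self, name):
        if self._target is not None and name in MAPPING:
            value = escape("".join(self._text).strip())
            self.lines.append(f"<{self._target}>{value}</{self._target}>")
            self._target = None


handler = ToDublinCore()
xml.sax.parseString(SOURCE, handler)
print("\n".join(handler.lines))
```

With SAX the record is never held in memory as a tree: the handler reacts to each element as the parser reaches it. A DOM parser instead builds the whole tree first and lets the program query it.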

The following Perl code converts a record in the UCT format to Dublin Core (don't worry if you do not understand it):

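use strict;
use warnings;
use XML::DOM;   # assumption: the DOM parser comes from the XML::DOM module

# Step 1: parse the source record and take its root element.
my $parser = XML::DOM::Parser->new;
my $document = $parser->parsefile ('uct.xml')->getDocumentElement;

# Step 2: copy each source element into its Dublin Core equivalent.
foreach my $title ($document->getElementsByTagName ('title'))
{
   print "<title>".$title->getFirstChild->getData."</title>\n";
}
foreach my $author ($document->getElementsByTagName ('author'))
{
   print "<creator>".$author->getFirstChild->getData."</creator>\n";
}
# The publisher is a fixed value for every record.
print "<publisher>UCT</publisher>\n";
# Each version number becomes an identifier.
foreach my $version ($document->getElementsByTagName ('version'))
{
   foreach my $number ($version->getElementsByTagName ('number'))
   {
      print "<identifier>".
            $number->getFirstChild->getData."</identifier>\n";
   }
}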
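
For comparison, here is a rough Python counterpart of the code above, assuming the standard xml.dom.minidom module. It differs in two ways: it wraps the output in a root element so that the result is a well-formed document, and it escapes special characters in the values. The input record and its root element name are hypothetical; only the element names title, author, version and number come from the code above.

```python
from xml.dom import minidom
from xml.sax.saxutils import escape

# Hypothetical input: only the inner element names come from the code above.
UCT_RECORD = """<uct>
  <title>Sample thesis</title>
  <author>A. Student</author>
  <version><number>1.0</number></version>
</uct>"""


def text_of(node):
    """Concatenate the text nodes directly under an element."""
    return "".join(
        child.data for child in node.childNodes
        if child.nodeType == child.TEXT_NODE
    ).strip()


def uct_to_dublin_core(xml_text):
    """Return a Dublin Core document for a record in the assumed layout."""
    root = minidom.parseString(xml_text).documentElement
    lines = ['<oaidc:dc xmlns="http://purl.org/dc/elements/1.1/" '
             'xmlns:oaidc="http://www.openarchives.org/OAI/2.0/oai_dc/">']
    for title in root.getElementsByTagName("title"):
        lines.append(f"   <title>{escape(text_of(title))}</title>")
    for author in root.getElementsByTagName("author"):
        lines.append(f"   <creator>{escape(text_of(author))}</creator>")
    # The publisher is a fixed value, as in the code above.
    lines.append("   <publisher>UCT</publisher>")
    for version in root.getElementsByTagName("version"):
        for number in version.getElementsByTagName("number"):
            lines.append(
                f"   <identifier>{escape(text_of(number))}</identifier>")
    lines.append("</oaidc:dc>")
    return "\n".join(lines)


dublin_core = uct_to_dublin_core(UCT_RECORD)
print(dublin_core)
```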
      

As you will see later in this unit, there is an easier way to achieve this.