Research Interests: Hussein Suleman
[ For details on current and past projects, see the website of the Digital Libraries Laboratory. ]
My primary research area is digital libraries, with current foci, firstly, on the architecture of highly distributed, interoperable and scalable Internet-based information systems and, secondly, on digital preservation, especially of cultural heritage. Digital libraries is a relatively new research area, at the intersection of Computer Science, computer networking and information sciences. From a Computer Science perspective, there are various technical issues that need to be resolved to support the ultimate aim of enabling simpler access to more information of a higher quality for all users of online and electronic systems. I have worked closely with the Open Archives Initiative (http://www.openarchives.org) and with the Sivulile (http://www.sivulile.org/) group promoting Open Access in South(ern) Africa, and I currently work with the Networked Digital Library of Theses and Dissertations (http://www.ndltd.org), thus collaborating with institutions and individual researchers on a wide and distributed scale. I have active and ongoing collaborations with UCT's Fine Arts, Archaeology and Geomatics departments related to the preservation of Bushman and other African heritage.
I am interested in working with motivated postgraduate students who share my somewhat idealistic passion for improving the lives of people by removing barriers to information and computing. The ideas listed below are therefore far from exhaustive and all wild and wacky ideas are welcome and encouraged.
General Areas
Digital Library Architecture In attempting to move closer to the goal of making information readily available to users, managed and flexible information systems must be placed within the grasp of all institutions and archivists. As such, the architecture of digital libraries needs to be simple but flexible. Ongoing research in this area, at UCT and with various international collaborators, is producing component models, frameworks, visual interfaces and specification languages for the construction of custom digital libraries without the need for custom software development. There is still much scope for additional work on these aspects, as well as on methodologies for component packaging and user interface workflow definition that are relevant not only to digital libraries but to all online systems.
High Performance Computing in Developing Countries HPC techniques have become increasingly popular for solving computationally-intensive problems. However, many of these solutions are not applicable in the African context because of limited computational, storage or network resources. A general aim of many ongoing efforts is to adapt scalable solutions to local conditions, thereby making HPC more practical for those without supercomputers or massive bandwidth.
Digital Preservation of Cultural Heritage South Africa has many important collections of information such as the Bleek and Lloyd Collection documenting the Bushman languages, the DISA project documenting the struggle for liberation and the District Six Museum. A recurring problem with such projects is the difficulty of managing the digitisation process and of creating and managing data and metadata electronically. There is much scope for improvement in usability and in the creation of tools specifically aimed at heritage collection (by scanning, oral recordings, etc.) and preservation. Projects related to this are not about algorithms but about innovative interventions to safeguard history.
Specific Projects (open)
High Performance Digital Libraries on-Demand The aim of this research is to develop techniques for building scalable digital information management systems based on efficient and on-demand use of generic grid-based technologies. Specifically, the following questions are of interest:
- Can we migrate a typical DL architecture to a Grid system such that remote resources are transparently brought into service when needed?
- How do we allocate, distribute and schedule resources for maximal efficiency of data transfer protocols such as OAI-PMH?
- Can we layer a typical DL architecture over a volunteer computing paradigm such as BOINC, just as SETI@home has done?
In summary, this research aims to look at various aspects that will affect the adoption of grid technology for digital archives in resource-constrained environments, as typically found in developing countries. The expected outcome of this work is a set of proven guidelines and experimental tools to move the digital library community closer to an ideal of simple, flexible, scalable and robust digital library architectures, building on an underlying Grid fabric.
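One way to make the scheduling question concrete is to split an OAI-PMH harvest into independent date windows that could be handed to different workers. The sketch below is illustrative only: the base URL is a placeholder, the window size is arbitrary, and `date_windows` and `harvest_url` are hypothetical helper names. Only the `verb`, `metadataPrefix`, `from` and `until` request arguments are taken from the OAI-PMH specification.

```python
from datetime import date, timedelta
from urllib.parse import urlencode

# Placeholder endpoint (assumption), not a real repository.
BASE_URL = "http://repository.example.org/oai"


def date_windows(start, end, days):
    """Split the inclusive range [start, end] into consecutive windows of at most `days` days."""
    windows = []
    cursor = start
    while cursor <= end:
        last = min(cursor + timedelta(days=days - 1), end)
        windows.append((cursor, last))
        cursor = last + timedelta(days=1)
    return windows


def harvest_url(base_url, window, metadata_prefix="oai_dc"):
    """Build a ListRecords request restricted to one date window."""
    first, last = window
    query = urlencode({
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": first.isoformat(),
        "until": last.isoformat(),
    })
    return f"{base_url}?{query}"


if __name__ == "__main__":
    # Each printed URL could be assigned to a separate worker or grid node.
    for window in date_windows(date(2012, 1, 1), date(2012, 3, 31), days=30):
        print(harvest_url(BASE_URL, window))
```

Each window yields a request that does not depend on the others, which is what makes distribution possible. A single window may still be returned in several pages, so a real harvester would also need to follow the protocol's `resumptionToken` within each window.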
Digital Libraries as Platform Facebook has become a phenomenal success largely because of its clean API, simple toolkits and reasonable model for adding third-party applications to a core system. This plug-in approach has not been as successful in other Web-based systems but Facebook appears to have hit on the best compromise between capability and control. In particular, digital library systems (such as the ACM digital library) could possibly offer many interesting services (e.g., recommendations, local copies) to users but these systems are notoriously difficult to extend. This project will look into how a digital library system can be decomposed into a platform with services so that extensions work in a manner similar to Facebook. The big question is: can the technology of Facebook Applications (or other systems like Google Gadgets) be generalised to provide services to users in arbitrary content management systems?
Web-based Component Testing With the rapid acceptance of Web Services and Web-based technology, there is a growing proliferation of services that can be accessed remotely through well-defined interfaces. Past experience in protocol development has shown that well-defined interface specifications are not sufficient to ensure compliance with a standard and this usually results in multiple non-conformant interpretations and, generally, problems for human and machine users of the services. The incompatibilities among Web browsers are possibly the best contemporary example of why standards-compliance and compliance-testing are crucial in networked environments. In the digital library community, I have worked with the Open Archives Initiative in developing protocol testing tools such as the Repository Explorer (a local mirror is at http://re.cs.uct.ac.za) and this has greatly influenced the success of the standard it tests. This is, however, a first generation testing tool. Much work remains to be done in generalising the testing framework so that testing tools can be automatically generated or driven by specifications. In an ideal environment, any Web-based protocol should be specified formally, in order to generate testing tools and test cases automatically. This work can have a major impact on the success of emerging digital library protocols and standards based within the Web Services initiative in general.
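As a rough illustration of what specification-driven testing could look like, the sketch below describes three OAI-PMH verbs as data and derives both a request checker and a set of invalid test requests from that description. It is not how the Repository Explorer works. The `SPEC` layout and the function names are assumptions made for this example, and the description is deliberately simplified (three of the six verbs, no exclusive `resumptionToken` argument).

```python
# Simplified, illustrative description of three OAI-PMH verbs (assumed layout).
SPEC = {
    "Identify": {"required": set(), "optional": set()},
    "GetRecord": {"required": {"identifier", "metadataPrefix"}, "optional": set()},
    "ListRecords": {"required": {"metadataPrefix"}, "optional": {"from", "until", "set"}},
}


def check_request(spec, params):
    """Return a list of problems found in a request's arguments (empty if valid)."""
    verb = params.get("verb")
    if verb not in spec:
        return [f"illegal verb: {verb!r}"]
    rule = spec[verb]
    given = set(params) - {"verb"}
    problems = [f"missing argument: {name}" for name in sorted(rule["required"] - given)]
    allowed = rule["required"] | rule["optional"]
    problems += [f"illegal argument: {name}" for name in sorted(given - allowed)]
    return problems


def generate_negative_tests(spec):
    """Derive one invalid request per required argument by leaving that argument out."""
    tests = []
    for verb in sorted(spec):
        required = sorted(spec[verb]["required"])
        for omitted in required:
            params = {"verb": verb}
            params.update({name: "x" for name in required if name != omitted})
            tests.append(params)
    return tests


if __name__ == "__main__":
    for request in generate_negative_tests(SPEC):
        print(request, "->", check_request(SPEC, request))
```

Because both functions read only from `SPEC`, extending the description to another verb or another protocol would extend the checker and the generated tests without new code, which is the property the paragraph above argues for.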
Innovative Document Management Currently, there are a number of digital repository software toolkits to support centralised archiving of electronic resources. However, all of these tools require user intervention, where users are required to explicitly submit items with associated descriptions. This has long been recognised as the bottleneck in acquiring and archiving material. Innovative techniques are required to support users and incorporate archiving (and sharing when appropriate) into their routine tasks by integrating document management into desktop software and other systems. An example of such a system would be one that transparently and efficiently archives all versions of a word processor document at the level of the filesystem. An example from a different extreme would be a system to replace photocopying for archival purposes with a scanner and software to automatically tag, organise and manage short-term and long-term duplicate copies of documents. Personal archiving is very relevant in an age where we produce a growing number of digital artefacts such as email messages, digital photos, PDA schedule entries and electronic documents. How do we effectively manage such fluid information in a connected world where digital photos may be shared on one website, research documents on others and everything related to an individual must be periodically archived?
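To give a feel for the version-archiving example, the following is a minimal sketch of a content-addressed store that keeps every distinct version of a file. It runs at application level rather than inside the filesystem, so it only approximates the transparent system described above, and `archive_version` and the store layout are assumptions made for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def archive_version(document, store):
    """Copy `document` into `store` under its SHA-256 digest and return the stored path.

    Identical content is stored only once, so calling this after every save
    keeps each distinct version without duplicating unchanged ones.
    """
    document = Path(document)
    store = Path(store)
    digest = hashlib.sha256(document.read_bytes()).hexdigest()
    target = store / f"{digest}{document.suffix}"
    if not target.exists():
        store.mkdir(parents=True, exist_ok=True)
        shutil.copy2(document, target)  # copy2 also preserves file timestamps
    return target
```

A caller would invoke `archive_version` whenever a document is saved. Recording which digest belongs to which document and in what order is left out here and would be needed to reconstruct a version history.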
Specific Current and Past Projects
simplyCT Systems to manage can arguably be simplified by the use of preservation-directed principle-oriented data stores. While much work on engineering better archives has focused on high level Web-based APIs (such as FEDORA), simplyCT is an attempt to define the lowest level of data storage with an emphasis on simplicity to enable the use, adaptation and extension of such systems, especially in low-resource environments. The simplyCT project attempts to replicate the success of Project Gutenberg - just about the only successful text archival project in the last 30 years - in other contemporary projects on a wide scale.
There is currently one MSc student working on early aspects of this project. However, there is scope for many follow-on projects in the next 3-4 years:
- Curation and management systems
- Packaging, deployment and scalability of system instances (One MSc student is currently doing this with DSpace on a private cloud).
- High performance preservable data stores
- Institutional repositories based on simplyCT
Large Scale Information Management Systems / Terascale IR Current research in information management systems has progressed to the point where many large repositories of information are publicly accessible. However, high quality services based on this information are still rare. This is partly due to the complexity of building search engines and other information services based on massive quantities of constantly evolving data. There is a dire need for mechanisms to deal with the problems of
- how to build efficient dynamic indexing mechanisms
- how to parallelise algorithms for information management and
- how to increase the availability of popular services.
Solutions may lie at the intersection of data warehousing, grid computing, agent technology and cluster computing, but little research has been done in this area to date. In the new information arena where rapidly changing data collections are no longer part of the “hidden Web”, can we discover/locate such data easily, and can such solutions eventually be applied to personal and community memory libraries as they emerge in the future?
Very specifically, can we build information retrieval systems to deal with terabytes of information? How do we design index/query algorithms differently to cater for ever-expanding collections of data? How do we parallelise such algorithms? How do we deal with the incremental update problem? Google cannot be the only organisation with a terascale IR facility – as more data collections emerge, we need general tools and techniques to deal with this problem.
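The incremental update problem can be illustrated at toy scale with an inverted index that adds, replaces and removes single documents without rebuilding. This is a minimal in-memory sketch under obvious simplifying assumptions (whitespace tokenisation, no ranking, no persistence), and the `InvertedIndex` class is hypothetical rather than taken from any of the projects mentioned.

```python
from collections import defaultdict


class InvertedIndex:
    """Toy in-memory inverted index supporting incremental add, replace and remove."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of document ids
        self.doc_terms = {}               # document id -> terms currently indexed

    def add(self, doc_id, text):
        """Index a document, replacing any earlier version with the same id."""
        self.remove(doc_id)
        terms = set(text.lower().split())
        self.doc_terms[doc_id] = terms
        for term in terms:
            self.postings[term].add(doc_id)

    def remove(self, doc_id):
        """Drop a document from the index without rebuilding it."""
        for term in self.doc_terms.pop(doc_id, set()):
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query):
        """Return the ids of documents that contain every query term."""
        terms = query.lower().split()
        if not terms:
            return set()
        result = None
        for term in terms:
            docs = self.postings.get(term, set())
            result = set(docs) if result is None else result & docs
        return result
```

Keeping the per-document term list is what makes an update cost proportional to the size of one document rather than the size of the collection. One common route to parallelism, not shown here, is to partition documents across several such indexes and merge the results of each query.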
(An MSc project was done in this area in 2007-2009; an Honours project related to this is underway in 2012.)
Open Source Usability Open Source Software has unique usability problems that are created because of this “openness”. For example, when a software package is installed, the installer can obtain and install dependencies automatically since the dependencies are themselves free. Also, since there are no restrictions on seat numbers, a tool could have multiple instances operating simultaneously and owned by multiple users (think Web servers on different ports). Some of these concerns are being addressed by package managers and OSS software design. There is still scope to look into the applicability of generic OSS package management to popular digital library tools (e.g., DSpace) – in fact any discipline where OSS is used by non-IT specialists could benefit enormously from simpler management of software. A larger problem is that of the design of core OS tools – is it possible to port OSS system software to a clean class/instance/registry component model? Ultimately, can we create network servers and clients as easily as documents but with better management?
(This project was investigated in 2007-9 but further work is needed.)

