Use of descriptive metadata as a knowledgebase for analyzing data in large textual collections. Proc. IS&T Archiving 2013, Washington, D.C.
Descriptive metadata, such as an article’s title, authors, institutes, keywords, and publication date, collected manually or generated automatically from document contents, is often used to search and retrieve relevant documents in an archived collection. For a large text corpus in particular, this metadata may encapsulate valuable information, such as patterns and trends, which is usually revealed by writing specialized data analysis software designed to answer specific questions. A more useful, generalized approach is to repurpose this metadata as a knowledgebase that can answer semantic queries about the dataset. This is especially valuable for biomedical collections, where information is sought not only on important discoveries about drugs and diseases, but also on various facts related to such discoveries.
At the US National Library of Medicine (NLM), we recently acquired a large biomedical collection comprising the proceedings of annual conferences on cholera research held between 1960 and 2011 under the “US-Japan Cooperative Medical Science Program” (CMSP), which was established to address health problems in Southeast Asia and other developing countries. In addition to preserving the collection, an important objective of archiving the dataset was to gain insight into the program itself, including relevant facts about its research community, Study Section reviewers, and participating countries over the program’s lifespan. An R&D information management system developed at NLM, called the “System for the Preservation of Electronic Resources” (SPER), was used to meet these goals cost-effectively. SPER used machine learning models to extract relevant metadata from the contents of the articles, conference attendee and panelist rosters, and associated documents. This metadata was used to create a DSpace-based archive at NLM for standard search and retrieval of the articles, and further to develop special-purpose data analysis software. In addition, a prototype knowledgebase was created from this metadata to enable the retrieval of information on various aspects of the program through semantic queries.
In this paper, we present the automated extraction of different types of descriptive metadata from the CMSP documents, and the generation of the CMSP knowledgebase from this metadata. Specifically, we describe the ontology model developed to represent the CMSP Program, with its conferences, publications, and personnel, as a set of OWL-based hierarchical concepts and relationships. We discuss the pipeline process, created using open source tools and in-house software, that converts the original metadata tables from SQL format into an intermediate dataset of RDF triples or graphs, and the use of these RDF graphs, along with the CMSP ontology model, to populate the CMSP conference knowledgebase. We then show the results of applying this methodology to a subset of the CMSP dataset. We further discuss the scalability issues encountered in extending it to the entire dataset, along with our continuing work to overcome these limitations. Finally, we discuss how this approach could be customized for other large collections, such as the one from the Food and Drug Administration previously archived by SPER.
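To illustrate the SQL-to-RDF step of such a pipeline, the following is a minimal sketch in Python. The table name, column names, sample row, and namespace URI are hypothetical stand-ins for illustration only; the actual CMSP metadata schema and ontology namespaces differ. Each metadata row is mapped to a small set of N-Triples-style statements that a downstream triple store could load.

```python
import sqlite3

# Illustrative namespace; the real CMSP ontology uses its own URIs.
CMSP = "http://example.org/cmsp#"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

def rows_to_triples(conn):
    """Map each row of a (hypothetical) article metadata table to
    N-Triples-style statements: one rdf:type triple plus one triple
    per descriptive field."""
    triples = []
    for art_id, title, year in conn.execute(
            "SELECT id, title, year FROM article"):
        subj = f"<{CMSP}article/{art_id}>"
        triples.append(f"{subj} {RDF_TYPE} <{CMSP}Publication> .")
        triples.append(f'{subj} <{CMSP}title> "{title}" .')
        triples.append(f'{subj} <{CMSP}year> "{year}" .')
    return triples

# Tiny in-memory stand-in for the original SQL metadata tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE article (id INTEGER, title TEXT, year INTEGER)")
conn.execute("INSERT INTO article VALUES (1, 'Cholera toxin study', 1978)")
triples = rows_to_triples(conn)
print(len(triples))  # 3 triples for the single sample row
```

In a full pipeline, the emitted triples would be serialized and loaded, together with the OWL ontology, into a triple store that answers SPARQL-style semantic queries over the collection.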