The Digital Preservation Research (DPR) project addresses an important mandate for libraries and archives: to retain electronic files for posterity and to retrieve information from preserved documents through semantic search. To preserve digitized documents, researchers have built and deployed a System for Preservation of Electronic Resources (SPER). SPER builds on open-source systems and standards (e.g., DSpace and RDF) while incorporating modules developed in-house that implement key preservation functions: ingest, automated metadata extraction, and knowledge discovery.

NLM curators are using SPER to preserve more than 60,000 court documents from a historic medico-legal collection acquired from the FDA. SPER is also being used to preserve another important collection, from NIAID, comprising conference proceedings of the “US-Japan Cooperative Medical Science Program on Cholera,” a program conducted over a 50-year period from 1960 to 2010. Our work on this initiative includes building a full repository for the collection, with more than 10,000 documents, 2,500 research articles, and the names and affiliations of 6,000 investigators working on cholera. We extracted metadata from the document contents using automated metadata extraction (AME) techniques, and then built a portal for research articles, authors, investigators, and institutions. The AME processes include: (a) analyzing layout to recognize the different types of information within a document set; (b) evaluating the effectiveness of models such as Support Vector Machines and Hidden Markov Models for different metadata layouts; and (c) capturing relationships among the various entities in the collection from the extracted metadata.
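As a rough illustration of step (c), the sketch below indexes relationships among articles, investigators, and institutions from already-extracted metadata. The record shape (`article_id`, `investigators`, `name`, `affiliation`) and the sample values are hypothetical and chosen only for illustration; they are not SPER's actual schema or data.

```python
from collections import defaultdict


def build_relationships(records):
    """Index entity relationships from extracted metadata records.

    Each record is assumed (hypothetically) to look like:
    {"article_id": str, "investigators": [{"name": str, "affiliation": str}]}
    """
    articles_by_investigator = defaultdict(set)
    investigators_by_institution = defaultdict(set)
    for record in records:
        for person in record["investigators"]:
            # Link the investigator to the article they appear in.
            articles_by_investigator[person["name"]].add(record["article_id"])
            # Link the institution to the investigator affiliated with it.
            investigators_by_institution[person["affiliation"]].add(person["name"])
    return articles_by_investigator, investigators_by_institution


# Made-up sample records, not drawn from the actual collection.
SAMPLE_RECORDS = [
    {
        "article_id": "article-001",
        "investigators": [
            {"name": "Jane Doe", "affiliation": "Example University"},
            {"name": "Taro Yamada", "affiliation": "Example Institute"},
        ],
    },
    {
        "article_id": "article-002",
        "investigators": [
            {"name": "Jane Doe", "affiliation": "Example University"},
        ],
    },
]

if __name__ == "__main__":
    by_person, by_institution = build_relationships(SAMPLE_RECORDS)
    for name in sorted(by_person):
        print(name, sorted(by_person[name]))
```

Sets are used so that an investigator who appears in several proceedings is recorded once per institution; a production system would also need name disambiguation, which this sketch does not attempt.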

Investigators are conducting research on knowledge discovery from the information preserved in this repository by (a) developing a domain-specific vocabulary, (b) generating RDF graphs, or triples, from the preserved information using this vocabulary and natural language processing techniques, and (c) building a knowledge base accessible over the Web.
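To make step (b) concrete, here is a minimal sketch of turning one extracted metadata record into RDF triples serialized as N-Triples. The namespaces (`http://example.org/...`), the property names `title`, `hasAuthor`, and `name`, and the sample record are all assumptions made for illustration; the project's actual domain vocabulary is not described here, and the natural language processing step is omitted.

```python
from urllib.parse import quote

# Hypothetical namespaces; the project's real vocabulary is not shown in the text.
VOCAB = "http://example.org/vocab#"
BASE = "http://example.org/resource/"


def iri(local_name):
    """Build an IRI term for a resource from a human-readable name."""
    return f"<{BASE}{quote(local_name.replace(' ', '_'))}>"


def literal(text):
    """Build a plain string literal, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def record_to_triples(record):
    """Map one metadata record to (subject, predicate, object) triples."""
    article = iri(record["id"])
    triples = [(article, f"<{VOCAB}title>", literal(record["title"]))]
    for name in record["authors"]:
        person = iri(name)
        triples.append((article, f"<{VOCAB}hasAuthor>", person))
        triples.append((person, f"<{VOCAB}name>", literal(name)))
    return triples


def to_ntriples(triples):
    """Serialize triples one per line in N-Triples form."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up sample record, not drawn from the actual collection.
SAMPLE = {
    "id": "article-001",
    "title": "A hypothetical cholera study",
    "authors": ["Jane Doe"],
}

if __name__ == "__main__":
    print(to_ntriples(record_to_triples(SAMPLE)))
```

The sketch uses only the standard library to stay self-contained; in practice an RDF toolkit would typically handle term construction and serialization.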

