Digital Preservation Research

Project information

LHNCBC is no longer conducting active research on this project. Information is presented here for historical purposes.

The Digital Preservation Research (DPR) project addresses an important mandate for libraries and archives: to retain electronic files for posterity and to retrieve information from preserved documents through semantic search. To preserve digitized documents, researchers have built and deployed a System for Preservation of Electronic Resources (SPER). SPER builds on open source systems and standards (such as DSpace and RDF) while incorporating modules developed in-house that implement key preservation functions: ingest, automated metadata extraction, and knowledge discovery.

NLM curators are using SPER to preserve more than 60,000 court documents from a historic medico-legal collection acquired from the FDA. In addition, SPER is being used to preserve another important collection, from NIAID, comprising conference proceedings of the “US-Japan Cooperative Medical Science Program on Cholera,” a program conducted over a 50-year period from 1960 to 2010. Our activities toward this initiative include building a full repository for this collection, which holds more than 10,000 documents, 2,500 research articles, and the names and affiliations of 6,000 investigators working on cholera. We extracted metadata from the document contents using automated metadata extraction (AME) techniques, and then built a portal for research articles, authors, investigators, and institutions. The AME processes include: (a) analyzing layout to recognize different types of information within a document set; (b) evaluating the effectiveness of models such as Support Vector Machines and Hidden Markov Models for different metadata layouts; and (c) capturing relationships among various entities in the collection from the extracted metadata.
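As a rough illustration of step (c), the sketch below derives two kinds of relationships from already-extracted metadata: which institutions each investigator is affiliated with, and which investigators co-authored an article. The record layout, the field names, and the function name `capture_relationships` are assumptions made for this example; the text does not describe how SPER represents these entities.

```python
from collections import defaultdict
from itertools import combinations


def capture_relationships(records):
    """Derive simple entity relationships from extracted metadata records.

    Each record is assumed (hypothetically) to look like:
        {"title": str, "authors": [{"name": str, "affiliation": str}, ...]}
    """
    affiliations = defaultdict(set)  # investigator name -> institutions
    coauthors = defaultdict(set)     # investigator name -> co-author names

    for record in records:
        # Link every author on the article to the stated affiliation.
        for author in record["authors"]:
            affiliations[author["name"]].add(author["affiliation"])

        # Link every pair of distinct authors on the same article.
        names = sorted({author["name"] for author in record["authors"]})
        for first, second in combinations(names, 2):
            coauthors[first].add(second)
            coauthors[second].add(first)

    return affiliations, coauthors
```

Sets are used so that an investigator who appears on many articles with the same institution or the same co-author is recorded only once per relationship.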

Investigators are conducting research toward knowledge discovery from information preserved in this repository by (a) developing a domain-specific vocabulary, (b) generating RDF graphs or triples from the preserved information using this vocabulary and natural language processing techniques, and (c) building a knowledge base accessible over the Web.
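To make step (b) more concrete, here is a minimal sketch of turning one extracted metadata record into RDF triples, serialized as N-Triples lines. The namespaces, the property names `title` and `hasAuthor`, and the record fields are all hypothetical; the project's actual vocabulary and tooling are not given here, and the natural language processing step is omitted.

```python
from urllib.parse import quote

# Hypothetical namespaces; the project's real vocabulary is not stated in the text.
VOCAB = "http://example.org/dpr/vocab#"
BASE = "http://example.org/dpr/resource/"


def iri(local_name: str) -> str:
    """Build an N-Triples IRI term in the illustrative BASE namespace."""
    return f"<{BASE}{quote(local_name, safe='')}>"


def literal(value: str) -> str:
    """Build an N-Triples plain literal, escaping backslashes, quotes, and newlines."""
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def article_to_triples(article: dict) -> list[str]:
    """Turn one metadata record into N-Triples lines (subject, predicate, object)."""
    subject = iri(article["id"])
    lines = [f"{subject} <{VOCAB}title> {literal(article['title'])} ."]
    for name in article["authors"]:
        # Each author becomes a resource linked to the article.
        lines.append(f"{subject} <{VOCAB}hasAuthor> {iri(name)} .")
    return lines
```

Calling `article_to_triples({"id": "article-1", "title": "Sample title", "authors": ["Doe J"]})` yields one title triple and one `hasAuthor` triple. A production system would more likely rely on an RDF library and mint stable identifiers for people rather than deriving them from name strings.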

Demner-Fushman D, Humphrey SM, Ide NC, Loane RF, Ruch P, Ruiz ME, Smith LH, Tanabe LK, Wilbur WJ, Aronson AR. Finding Relevant Passages in Scientific Articles: Fusion of Automatic Approaches vs. an Interactive Team Effort. Proc TREC 2006, 569-76.
Le DX, Thoma GR. Automatically Creating Biomedical Bibliographic Records from Printed Volumes of Old Indexes. In: Callaos N, Lesso W, editors. SCI 2005. Proc 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 3, Computer Science and Engineering. Orlando (FL): International Institute of Informatics and Systemics; c2005. 267-74.
Pearson G. Methods to Store Metadata within Motion JPEG 2000 Files. Technical Report Preprint. May 2005.
Mao S, Misra D, Seamans J, Thoma GR. Design Strategies for a Prototype Electronic Preservation System for Biomedical Documents. IS&T Archiving 2005 Conference, April 2005; 48-53.
Thoma GR, Mao S, Misra D. Automated Metadata Extraction to Preserve the Digital Contents of Biomedical Collections. Proc VIIP 2005. September 2005. Benidorm, Spain; 214-19.
Walker FL, Thoma GR. A Web-Based Paradigm for File Migration. Proc. of IS&T's Archiving Conference. 2004 April.
Hersh WJ, Velterop J, McCray AT, Eynsenbach G, Boguski M. Overcoming Impediments to Effective Health and Biomedical Digital Libraries. JCDL. 2002: 360.
Ray J, Dale R, Moore R, Reich V, Underwood W, McCray AT. Panel on Digital Preservation. JCDL. 2002: 365-367.
Lingappa G, Thoma GR, Antani SK. Web Interface: MyMorph