Digital Preservation Research

Project information
Research Area: 
Researchers: 

The Digital Preservation Research (DPR) project addresses an important mandate for libraries and archives: to retain electronic files for posterity and to retrieve information from preserved documents through semantic search. To preserve digitized documents, researchers have built and deployed a System for Preservation of Electronic Resources (SPER). SPER builds on open source systems and standards (e.g., DSpace and RDF) while incorporating modules developed in-house that implement key preservation functions: ingest, automated metadata extraction, and knowledge discovery.

NLM curators are using SPER to preserve more than 60,000 court documents from a historic medico-legal collection acquired from the FDA. SPER is also being used to preserve another important collection, from NIAID, comprising conference proceedings of the “US-Japan Cooperative Medical Science Program on Cholera,” a program conducted over a 50-year period from 1960 to 2010. Our work on this initiative includes building a full repository for this collection, with more than 10,000 documents, 2,500 research articles, and the names and affiliations of 6,000 investigators working on cholera. We extracted metadata from the document contents using automated metadata extraction (AME) techniques, and then built a portal for research articles, authors, investigators and institutions. The AME processes include: (a) analyzing layout to recognize different types of information within a document set; (b) evaluating the effectiveness of models such as Support Vector Machines and Hidden Markov Models for different metadata layouts; and (c) capturing relationships among the entities in the collection from the extracted metadata.
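To make the Hidden Markov Model step more concrete, the following is a minimal sketch of how such a model could label the lines of a document header as title, author, or affiliation. It is not SPER's implementation: the states, the coarse line features, the keyword list, and every probability below are hand-set assumptions chosen for illustration, where a real system would estimate them from labeled pages.

```python
import math
import re

# Hypothetical label set and hand-set probabilities (illustrative only,
# not trained on any real collection).
STATES = ("title", "author", "affiliation")
START = {"title": 0.8, "author": 0.1, "affiliation": 0.1}
TRANS = {
    "title": {"title": 0.3, "author": 0.6, "affiliation": 0.1},
    "author": {"title": 0.05, "author": 0.35, "affiliation": 0.6},
    "affiliation": {"title": 0.1, "author": 0.3, "affiliation": 0.6},
}
EMIT = {
    "title": {"plain": 0.8, "names": 0.1, "institution": 0.1},
    "author": {"plain": 0.1, "names": 0.8, "institution": 0.1},
    "affiliation": {"plain": 0.1, "names": 0.1, "institution": 0.8},
}
INSTITUTION_WORDS = ("university", "institute", "department", "hospital")


def line_feature(line):
    """Reduce a text line to one coarse observation symbol."""
    lowered = line.lower()
    if any(word in lowered for word in INSTITUTION_WORDS):
        return "institution"
    if re.search(r"\b[A-Z]\.", line):  # an initial such as "A."
        return "names"
    return "plain"


def viterbi(observations):
    """Return the most probable state sequence for the observations."""
    if not observations:
        return []
    first = observations[0]
    scores = [{s: math.log(START[s]) + math.log(EMIT[s][first]) for s in STATES}]
    back = []
    for obs in observations[1:]:
        prev = scores[-1]
        current, pointers = {}, {}
        for s in STATES:
            # Best predecessor for state s, scored in log space.
            best = max(STATES, key=lambda p: prev[p] + math.log(TRANS[p][s]))
            current[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][obs])
            pointers[s] = best
        scores.append(current)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: scores[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


def label_lines(lines):
    """Label each header line with its most probable metadata field."""
    return viterbi([line_feature(line) for line in lines])


if __name__ == "__main__":
    # Made-up header lines, not taken from any preserved document.
    header = [
        "Studies on an example topic",
        "A. Example and B. Sample",
        "Department of Examples, Example University",
    ]
    for line, label in zip(header, label_lines(header)):
        print(f"{label:12s} {line}")
```

The transition table is what distinguishes this from classifying each line in isolation: it encodes the expectation that authors tend to follow a title and affiliations tend to follow authors, which is the kind of layout regularity a sequence model can exploit.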

Investigators are conducting research toward knowledge discovery from information preserved in this repository by (a) developing a domain-specific vocabulary, (b) generating RDF graphs or triples from the preserved information using this vocabulary and natural language processing techniques, and (c) building a knowledgebase accessible over the Web.
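As a rough illustration of step (b), the sketch below turns one extracted metadata record into RDF triples and serializes them as N-Triples. The `example.org` namespaces, the vocabulary terms (`ResearchArticle`, `hasAuthor`, `affiliatedWith`, and so on), the record layout, and the sample record are all hypothetical; the project's actual domain-specific vocabulary is not described here, and the natural language processing step is omitted.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical namespaces, used only for this illustration.
VOCAB = "http://example.org/vocab#"
DATA = "http://example.org/resource/"
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"

Triple = Tuple[str, str, str]


def iri(base: str, local: str) -> str:
    """Wrap a namespace plus local name as an N-Triples IRI."""
    return f"<{base}{local}>"


def literal(text: str) -> str:
    """Quote a plain string literal, escaping backslash, quote, newline."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def slug(name: str) -> str:
    """Build a simple IRI-safe identifier from a display name."""
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def article_triples(record: Dict) -> List[Triple]:
    """Describe one article, its authors, and their institutions."""
    article = iri(DATA, "article/" + record["id"])
    triples = [
        (article, RDF_TYPE, iri(VOCAB, "ResearchArticle")),
        (article, iri(VOCAB, "title"), literal(record["title"])),
    ]
    for author in record["authors"]:
        person = iri(DATA, "person/" + slug(author["name"]))
        org = iri(DATA, "org/" + slug(author["affiliation"]))
        triples += [
            (article, iri(VOCAB, "hasAuthor"), person),
            (person, iri(VOCAB, "name"), literal(author["name"])),
            (person, iri(VOCAB, "affiliatedWith"), org),
            (org, iri(VOCAB, "name"), literal(author["affiliation"])),
        ]
    return triples


def to_ntriples(triples: List[Triple]) -> str:
    """Serialize triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# Made-up record, not taken from the preserved collection.
SAMPLE_RECORD = {
    "id": "a001",
    "title": "An example article title",
    "authors": [{"name": "A. Example", "affiliation": "Example University"}],
}

if __name__ == "__main__":
    print(to_ntriples(article_triples(SAMPLE_RECORD)))
```

Because people and institutions receive their own identifiers rather than being stored as strings on the article, the same investigator or institution can be linked from many articles, which is what makes the resulting graph queryable as a knowledgebase.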

Publications/Tools: 
Misra D, Thoma GR. Use of descriptive metadata as a knowledgebase for analyzing data in large textual collections. Proc. IS&T Archiving 2013. Washington D.C. pg 193-199.
Pearson G, Gill MJ. An Evaluation of Motion JPEG 2000 for Video Archiving. Proc. Archiving 2005. Washington, D.C. April 2005:237-43.
Misra D, Hall RH, Payne SM, Thoma GR. Digital preservation and knowledge discovery based on documents from an international health science program. Proc. 12th ACM/IEEE-CS JCDL, pg 23-26 (2012). doi: 10.1145/2232817.2232823.
Chen S, Misra D, Thoma GR. Efficient Automatic OCR Word Validation Using Word Partial Format Derivation and Language Model. Document Recognition and Retrieval XVII. Proceedings of the SPIE. San Jose, CA. January 2010;7534:75340O-75340O-8.
Misra D, Seamans J, Thoma GR. Testing the Scalability of a DSpace-based Archive. Proc. IS&T Archiving 2008. Bern, Switzerland. June 2008:36-40.
Hsu W, Long LR, Antani SK. SPIRS: A Framework for Content-based Image Retrieval from Large Biomedical Databases. Stud Health Technol Inform. 2007;129(Pt 1):188-92.
Bennett A, Liu J, Van Ryk D, Bliss D, Arthos J, Henderson RM, Subramaniam S. Cryoelectron Tomographic Analysis of an HIV-neutralizing Protein and Its Complex with Native Viral gp120. J Biol Chem. 2007 Sep 21;282(38):27754-9. Epub 2007 Jun 28.
Misra D, Mao S, Rees J, Thoma GR. Archiving a Historic Medico-legal Collection: Automation and Workflow Customization. Proc IS&T Archiving 2007. Arlington, Virginia, May 2007; 157-61.
Demner-Fushman D, Lin J. Answering Clinical Questions with Knowledge-based and Statistical Techniques. Computational Linguistics. 2007 Jan;33(1):63-103.
Thoma GR, Mao S, Misra D, Rees J. Design of a Digital Library for Early 20th Century Medico-legal Documents. Proc ECDL 2006. Eds: Gonzalo J et al. Berlin: Springer-Verlag; LNCS 4172: 147-57.
