
Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract these data in a pipeline: segmenting page images into zones; assigning labels to the zones that signify their contents (title, author names, abstract, etc.); pattern matching to identify these entities; and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly flagged as errors, which increases operator productivity.
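The last pipeline step, lexicon-based correction, can be illustrated with a minimal Python sketch. It is not the MARS implementation: the toy lexicon, the function name review_word, and the similarity cutoff are all assumptions made for illustration, and the standard-library difflib matcher stands in for whatever matching the production system uses. The sketch shows both effects described above: a flagged word that is in the lexicon has its error flag cleared, and a flagged word close to a lexicon entry is corrected.

```python
from difflib import get_close_matches

# Hypothetical toy lexicon; a production lexicon would be far larger.
LEXICON = {"hepatitis", "carcinoma", "randomized", "trial", "patients"}


def review_word(word, flagged, lexicon=LEXICON, cutoff=0.8):
    """Return (text, needs_operator_review) for one OCR output word.

    `flagged` is True when the OCR engine marked the word as a likely error.
    """
    key = word.lower()
    if key in lexicon:
        # Known word: clear a possibly spurious error flag.
        return word, False
    if flagged:
        # Unknown and flagged: try to correct against the lexicon.
        matches = get_close_matches(key, sorted(lexicon), n=1, cutoff=cutoff)
        if matches:
            return matches[0], False
        # No plausible correction: leave it for the operator.
        return word, True
    # Unknown but not flagged: keep the OCR text as-is.
    return word, False


if __name__ == "__main__":
    for text, flag in [("hepatitis", True), ("hepatit1s", True), ("xqzv", True)]:
        print(text, "->", review_word(text, flag))
```

In this sketch, only words that are both flagged and uncorrectable reach the operator, which is the sense in which such a step could raise operator productivity.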

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, Investigator Names, and Commented-on Article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and to correct their errors. The automated steps to fill in missing data and to correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.
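As a rough illustration of how candidate grant numbers and databank accession numbers might be located in article text, the following Python sketch uses regular expressions. It is an assumption-laden example, not the PDRS method: the function name find_candidates is hypothetical, the grant pattern covers only the common form of an activity code (a letter and two digits), a two-letter institute code, and a six-digit serial number, and ClinicalTrials.gov identifiers (NCT followed by eight digits) stand in for databank accession numbers generally.

```python
import re

# Simplified NIH grant pattern (assumption): e.g. "R01 CA123456".
# Activity codes with other shapes (e.g. "UL1") are not covered.
GRANT_RE = re.compile(r"\b([A-Z]\d{2})\s?([A-Z]{2})\s?(\d{6})\b")

# ClinicalTrials.gov identifier: "NCT" followed by eight digits.
NCT_RE = re.compile(r"\bNCT\d{8}\b")


def find_candidates(text):
    """Return candidate grant numbers and trial identifiers found in text."""
    grants = [
        f"{m.group(1)} {m.group(2)}{m.group(3)}"  # normalize spacing
        for m in GRANT_RE.finditer(text)
    ]
    trials = NCT_RE.findall(text)
    return {"grants": grants, "trials": trials}


if __name__ == "__main__":
    sample = "Supported by NIH grant R01 CA123456; registered as NCT01234567."
    print(find_candidates(sample))
```

Candidates found this way would still need validation before being written into a citation; the sketch covers only the lookup step that spares an operator from reading the whole article.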

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers are incomplete. WebMARS is a software tool that operators can use to automatically create missing citations from these problematic issues. This eliminates the current manual labor on the part of the operators to type, copy, and paste data from online articles, a very time-consuming step.

The MARS, PDRS and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Publications/Tools: 
Mao S, Nie L, Thoma GR. Unsupervised Style Classification of Document Page Images. Proc. IEEE International Conference on Image Processing, September 2005, Genova, Italy; Vol. II: 510-13.
Kim J, Le DX, Thoma GR. Automated Labeling Of Biomedical Online Journal Articles. In: Callaos N, Lesso W, editors. SCI 2005. Proc. 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 406-11.
Kim I, Le DX, Thoma GR. Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals. In: Callaos N, Lesso W, editors. SCI 2005. Proc. 9th World Multiconference on Systemics, Cybernetics and Informatics; 2005 Jul 10-13; Vol. 4; Orlando (FL): International Institute of Informatics and Systemics; c2005. 401-5.
Mao S, Kim J, Thoma G. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Proc. International Workshop on Document Image Analysis for Libraries (DIAL2004). 2004 Jan;: 225-32.
Mao S, Kim J, Thoma G. Style-Independent Document Labeling: Design and Performance Evaluation. Proc. SPIE - Document Recognition and Retrieval. 2004 Jan;: 14-22.
Le DX, Straughan SR, Thoma GR. Greek Alphabet Recognition Technique for Biomedical Documents. Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics, eds: Callaos N, et al. 2002 July;III: 86-91.
Thoma GR, Ford G. Automated Data Entry System: Performance Issues. Proc. SPIE: Document Recognition and Retrieval IX. 2002 Jan;4670: 181-90.
Mao S, Kim J, Le DX, Thoma GR. Generating Robust Features for Style-Independent Labeling of Bibliographic Fields in Medical Journal Articles. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;III: 53-6.
Mao S, Kanungo T. Automatic Training of Page Segmentation Algorithms: An Optimization Approach. International Conference on Pattern Recognition. 2000 Sept.;: 531-534.
Le DX, Thoma GR. Automated Document Labeling for Web-Based Online Medical Journals. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 July;II: 411-15.
