Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline process: segmenting page images into zones, assigning labels to the zones signifying their contents (title, author names, abstract, etc.), pattern matching to identify these entities, and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly flagged as errors, which increases operator productivity.
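The lexicon-based correction step can be illustrated with a minimal sketch. This is not the MARS implementation: the lexicon contents, the confidence threshold, the similarity cutoff, and the function name `correct_word` are all assumptions made for illustration, and Python's standard `difflib` stands in for whatever matching algorithm the production system uses.

```python
import difflib

# Hypothetical lexicon; a real system would load a large domain vocabulary.
LEXICON = ["university", "department", "medicine", "hospital", "institute"]


def correct_word(word, confidence, lexicon=LEXICON,
                 min_confidence=0.9, cutoff=0.8):
    """Return a lexicon replacement for a low-confidence OCR word.

    Words the OCR engine is confident about, and words already present
    in the lexicon, are returned unchanged. The thresholds are
    illustrative assumptions, not values taken from MARS.
    """
    if confidence >= min_confidence:
        return word  # trusted OCR output, leave as is
    lowered = word.lower()
    if lowered in lexicon:
        return word  # valid word that was wrongly flagged as an error
    # Approximate matching: pick the closest lexicon entry, if any.
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word
```

In this sketch a low-confidence word such as `Univers1ty` is replaced by the lexicon entry `university`, while a low-confidence word that is already in the lexicon is left alone, so the operator is not asked to review it.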

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, investigator names, and commented-on article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and corrects their errors. The automated steps that fill in missing data and correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create the missing citations for these problematic issues. This eliminates the manual labor currently required of the operators, who must type, copy, and paste data from online articles, a very time-consuming step.

The MARS, PDRS, and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Tran LQ, Moon CW, Le DX, Thoma GR. Web Page Downloading and Classification. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001 Jul:321-6.
Mao S, Rosenfeld A, Kanungo T. Document Structure Analysis Algorithms: A Literature Survey. Proc. SPIE Electronic Imaging. 2003 Jan;5010:197-207.
Mao S, Kanungo T. Empirical Performance Evaluation Methodology and its Application to Page Segmentation Algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2001 Mar;23(3):242-256.
Kanungo T, Mao S. Stochastic Language Model for Style-Directed Physical Layout Analysis of Documents. IEEE Transactions on Image Processing. 2003 May;12(5):583-596.
Ford G, Hauser SE, Le DX, Thoma GR. Pattern Matching Techniques for Correcting Low Confidence OCR Words in a Known Context. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:241-9.
Kim J, Le DX, Thoma GR. Automated Labeling Algorithms for Biomedical Document Images. Proc. 7th World Multiconference on Systemics, Cybernetics and Informatics. 2003 Jul;V:352-57.
Kim J, Le DX, Thoma GR. Automated Labeling in Document Images. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:111-22.
Lasko TA, Hauser SE. Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.
Le DX, Tran LQ, Chow J, Kim J, Hauser SE, Moon CW, Thoma GR. Automated Medical Citation Records Creation for Web-Based Online Journals. Proc. 14th IEEE Symposium on Computer-Based Medical Systems: IEEE Computer Society. 2001.
Mao S, Kanungo T. Software Architecture of PSET: A Page Segmentation Evaluation Toolkit. International Journal on Document Analysis and Recognition. 2002 Mar;4(3):205-217.