
Medical Article Records System

Project information
Research Area: 
Researchers: 

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline process: segmenting page images into zones; assigning labels to the zones signifying their contents (title, author names, abstract, etc.); pattern matching to identify these entities; and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly labeled as errors, which increases operator productivity.
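The lexicon-based correction step can be illustrated with a minimal Python sketch. It is not the MARS implementation: the tiny lexicon, the function name correct_word, and the 0.8 similarity cutoff are all assumptions made for illustration, and the standard-library difflib matcher stands in for whatever matching the production system uses. A token found in the lexicon is accepted, a token close to a lexicon entry is corrected, and anything else is flagged for the operator.

```python
import difflib

# Hypothetical mini-lexicon; a real system would load a large biomedical word list.
LEXICON = ["medicine", "clinical", "patient", "therapy", "carcinoma"]


def correct_word(word, lexicon=LEXICON, cutoff=0.8):
    """Return (word_or_correction, flagged_for_review) for one OCR token."""
    lower = word.lower()
    if lower in lexicon:
        # Known word: accept it, so it is not wrongly reported as an error.
        return word, False
    # Look for the single closest lexicon entry above the similarity cutoff.
    matches = difflib.get_close_matches(lower, lexicon, n=1, cutoff=cutoff)
    if matches:
        # Close match: treat the token as an OCR error and correct it.
        return matches[0], False
    # No plausible match: leave the token unchanged and flag it for the operator.
    return word, True


def correct_line(text):
    """Apply correct_word to every whitespace-separated token in a line."""
    return [correct_word(token) for token in text.split()]
```

For example, correct_word("c1inical") yields the lexicon entry "clinical" without flagging it, while a token with no close lexicon entry is returned unchanged and flagged.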

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, Investigator Names, and Commented-on Article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and to correct their errors. The automated steps to fill in missing data and to correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.
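As a rough illustration of one such automated step, the Python sketch below pulls candidate NIH grant numbers out of article text with a regular expression. It is an assumption-laden example, not the PDRS method: the pattern covers only the common form of an activity code (one letter plus two digits), a two-letter institute code, and a six-digit serial number, and the names GRANT_RE and find_grant_numbers are hypothetical.

```python
import re

# Simplified, assumed grant-number shape: activity code (letter + 2 digits),
# institute code (2 letters), 6-digit serial, with optional space or hyphen separators.
GRANT_RE = re.compile(r"\b([A-Z]\d{2})[\s-]?([A-Z]{2})[\s-]?(\d{6})\b")


def find_grant_numbers(text):
    """Return candidate grant numbers found in text, normalized to 'R01 CA123456' form."""
    return ["{} {}{}".format(*match.groups()) for match in GRANT_RE.finditer(text)]
```

Calling find_grant_numbers on an acknowledgment sentence returns the normalized candidates in order of appearance; an operator would still confirm them, since a pattern this simple misses other activity-code forms and can match unrelated strings.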

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create the missing citations for these problematic issues. This eliminates the operators' current manual labor of typing, copying, and pasting data from online articles, a very time-consuming step.

The MARS, PDRS and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Publications/Tools: 
Kim I, Le DX, Thoma GR. Automated identification of biomedical article type using support vector machines. Proc. 18th SPIE Document Recognition and Retrieval, 7874:787403 (1-9), San Francisco, January 2011.
Zhang X, Zou J, Le DX, Thoma GR. Investigator Name Recognition From Medical Journal Articles: A Comparative Study of SVM and Structural SVM. International Workshop on Document Analysis Systems. June 2010:121-8.
Zou J, Le DX, Thoma GR. Locating and parsing bibliographic references in HTML medical articles. Int J Doc Anal Recognit. 2010 Jun 1;13(2):107-119.
Kim J, Le DX, Thoma GR. Naive Bayes and SVM Classifiers For Classifying Databank Accession Number Sentences From Online Biomedical Articles. IS&T/SPIE's 22nd Annual Symposium on Electronic Imaging. San Jose, CA. January 2010;7534:75340U-1 - 8.
Zhang X, Zou J, Le DX, Thoma GR. A Stacked Sequential Learning Method For Investigator Name Recognition From Web-based Medical Articles. 17th Document Recognition and Retrieval Conference (SPIE-DR&R). San Jose, CA. January 2010;7534:753404-7.
Kim J, Le DX, Thoma GR. Inferring Grant Support Types From Online Biomedical Articles. 22nd IEEE ISCBMS. Albuquerque, NM. August 2009.
Zhang X, Zou J, Le DX, Thoma GR. A Semi-supervised Learning Method to Classify Grant Support Zone in Web-based Medical Articles. Proc SPIE Electronic Imaging Science and Technology, Document Recognition and Retrieval. January 2009;7247:7247 OW(1-8).
Kim J, Le DX, Thoma GR. Naive Bayes Classifier for Extracting Bibliographic Information From Biomedical Online Articles. Proc 2008 International Conference on Data Mining. Las Vegas, Nevada, USA. July 2008;II:373-8.
Thoma GR, Le DX, Kim I, Kim JW, Moon C, Tran L, Zou J. Automation to Accelerate the Production of MEDLINE. April 2008. Technical Report to the LHNCBC Board of Scientific Counselors.
Kim IC, Le DX, Thoma GR. Hybrid approach combining contextual and statistical information for identifying MEDLINE citation terms. Proc. SPIE-IS/T Electronic Imaging. San Jose, CA. January 2008;6815:68150P(1-9).
