Medical Article Records System

Project information

The Medical Article Records System (MARS) project develops automated systems to extract bibliographic text from journal articles in both paper and electronic forms. For the approximately 1000 journal titles that arrive at NLM in paper form, a production MARS system combines document scanning, optical character recognition (OCR), and rule-based and machine learning algorithms to yield citation data that NLM’s indexers use to complete bibliographic records for MEDLINE. Our algorithms extract this data in a pipeline process: segmenting page images into zones; assigning labels to the zones that signify their contents (title, author names, abstract, etc.); pattern matching to identify these entities; and lexicon-based pattern matching to correct OCR errors and to reduce the number of words incorrectly flagged as errors, which increases operator productivity.
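
The last pipeline stage, lexicon-based correction, can be illustrated with a minimal sketch. It is not the MARS implementation: the tiny lexicon, the function name review_flagged_words, the (text, flagged) word representation, and the similarity cutoff are all assumptions made for illustration, and fuzzy matching from Python's standard difflib module stands in for whatever matching the production system uses.

```python
import difflib

# Hypothetical mini-lexicon; a production lexicon would be far larger.
LEXICON = {"medical", "article", "records", "system",
           "citation", "abstract", "author"}


def review_flagged_words(words, lexicon=LEXICON, cutoff=0.8):
    """Review OCR words against a lexicon.

    `words` is a list of (text, flagged) pairs, where `flagged` means the
    OCR engine marked the word as a possible error (assumed input format).
    Returns a new list of (text, flagged) pairs.
    """
    vocabulary = sorted(lexicon)
    reviewed = []
    for text, flagged in words:
        if not flagged:
            reviewed.append((text, False))
            continue
        key = text.lower()
        if key in lexicon:
            # Known word: the OCR flag was a false alarm, so clear it.
            reviewed.append((text, False))
            continue
        # Otherwise look for a single close lexicon entry to substitute.
        matches = difflib.get_close_matches(key, vocabulary, n=1, cutoff=cutoff)
        if matches:
            # Note: the substituted word is the lowercase lexicon form.
            reviewed.append((matches[0], False))
        else:
            # No plausible correction: leave the word flagged for an operator.
            reviewed.append((text, True))
    return reviewed


sample = [("citation", True), ("abstrac1", True), ("xqzv", True), ("author", False)]
result = review_flagged_words(sample)
```

In this sketch a corrected word has its flag cleared; a real system might instead keep corrected words visible so that an operator can verify them.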

A recently developed system, the Publisher Data Review System (PDRS), is designed to provide data missing from the XML citations received from publishers, such as databank accession numbers, NIH grant numbers, grant support categories, investigator names, and commented-on article information. By providing these missing data, PDRS reduces the manual effort needed to complete the citations sent in by publishers and to correct their errors. The automated steps that fill in missing data and correct wrong data substantially reduce the load on the operators, eliminating the need to look through an entire article to find this information and then key it in.
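
As an illustration of how one of these fields might be located automatically, the sketch below finds NIH grant numbers in article text with a regular expression. It is not the PDRS method. It assumes the common grant number layout of a three-character activity code, a two-letter institute code, and a six-digit serial number (for example "R01 CA123456"); the names GRANT_PATTERN and find_grant_numbers are hypothetical.

```python
import re

# Assumed layout: activity code (e.g. R01, U54, DP2), two-letter institute
# code, six-digit serial number, optionally separated by a space or hyphen.
GRANT_PATTERN = re.compile(
    r"\b(?P<activity>[A-Z]\d{2}|[A-Z]{2}\d)\s?-?"
    r"(?P<institute>[A-Z]{2})\s?-?"
    r"(?P<serial>\d{6})\b"
)


def find_grant_numbers(text):
    """Return grant numbers found in `text`, normalized to 'R01 CA123456' form."""
    return [
        f"{m['activity']} {m['institute']}{m['serial']}"
        for m in GRANT_PATTERN.finditer(text)
    ]


acknowledgment = "Supported by NIH grants R01 CA123456 and U54-GM-098765."
grants = find_grant_numbers(acknowledgment)
```

A pattern this simple will miss variants such as grant numbers with a leading application-type digit or a trailing support-year suffix, so candidates found this way would still need operator review.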

A third system, WebMARS, addresses cases where NLM is missing a journal issue or where citation data from publishers is incomplete. WebMARS is a software tool that operators can use to automatically create the missing citations for these problematic issues. This eliminates the manual labor currently required of operators to type, copy, and paste data from online articles, a very time-consuming step.

The MARS, PDRS, and WebMARS systems rely on underlying research in image analysis, which also enables the creation of new initiatives in which these techniques find application.

Rae A, Kim J, Le DX, Thoma GR. Main Content Detection in HTML Journal Articles. DocEng ’18: ACM Symposium on Document Engineering 2018, August 28–31, 2018, Halifax, NS, Canada. ACM, New York, NY, USA, 4 pages.
Zou J, Antani SK, Thoma GR. Localizing and Recognizing Labels for Multi-Panel Figures in Biomedical Journals. Proceedings of International Conference on Document Analysis and Recognition, November 13, 2017.
Kim I, Thoma GR. Machine Learning with Selective Word Statistics for Automated Classification of Citation Subjectivity in Online Biomedical Articles. Proc. Int’l Conf. Artificial Intelligence (ICAI’17), pp. 201-207, Las Vegas, July 2017.
Kim J, Hong S, Thoma GR. Labeling Author Affiliations in Biomedical Articles Using Markov Model Classifiers. The 13th International Conference on Data Mining (DMIN2017), pp. 105-110, Las Vegas, USA, July 2017.
Kim J, Thoma GR. Named Entity Recognition in Affiliations of Biomedical Articles Using Statistics and HMM Classifiers. The 2016 International Conference on Data Mining (DMIN2016), Las Vegas, USA, pp. 236-241, July, 2016.
Kim J, Lobuglio PS, Thoma GR. Visualization of Statistics from MEDLINE. 2016 IEEE 29th International Symposium on Computer-Based Medical Systems (CBMS 2016), Dublin and Belfast, Ireland, pp. 290-291, June, 2016.
Kim I, Thoma GR. Automated Classification of Author’s Sentiments in Citation Using Machine Learning Techniques: A Preliminary Study. Proc. the 2015 IEEE Conf. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB 2015), Niagara Falls, Canada, Aug. 12-15, 2015.
Kim I, Le DX, Thoma GR. Automated method for extracting "citation sentences" from online biomedical articles using SVM-based text summarization technique. Proc. the 2014 IEEE Int'l Conf. on Systems, Man, and Cybernetics (SMC 2014), pp. 2006-2011, San Diego, October, 2014.
Kim J, Le DX, Thoma GR. Identification of Investigator Name Zones Using SVM Classifiers and Heuristic Rules. 12th International Conference on Document Analysis and Recognition (ICDAR). Washington D.C., August 2013.
Kim I, Le DX, Thoma GR. Identifying “comment-on” citation data in online biomedical articles using SVM-based text summarization technique. Proc. Int’l Conf. Artificial Intelligence (ICAI’12), vol. 1, pp. 431-437, Las Vegas, July 2012.