You are here
Correcting OCR Text by Association with Historic Datasets
The Medical Article Records System (MARS) developed by the Lister Hill National Center for Biomedical Communications uses scanning, OCR and automated recognition and reformatting algorithms to generate electronic bibliographic citation data from paper biomedical journal articles. The multi-engine OCR server incorporated in MARS performs well in general, but fares less well with text printed in small or italic fonts. Affiliations are often printed in small italic fonts in the journals processed by MARS. Consequently, although the automatic processes generate much of the citation data correctly, the affiliation field frequently contains incorrect data, which must be manually corrected by verification operators. In contrast, author names are usually printed in large, normal fonts that are correctly converted to text by the OCR server.
The National Library of Medicine's MEDLINE® database contains 11 million indexed citations for biomedical journal articles. This paper documents our effort to use the historical author, affiliation relationships from this large dataset to find potential correct affiliations for MARS articles based on the author and the affiliation in the OCR output. Preliminary tests using a table of about 400,000 author/affiliation pairs extracted from the corrected data from MARS indicated that about 44% of the author/affiliation pairs were repeats and that about 47% of newly converted author names would be found in this set. A text-matching algorithm was developed to determine the likelihood that an affiliation found in the table corresponding to the OCR text of the first author was the current, correct affiliation. This matching algorithm compares an affiliation found in the author/affiliation table (found with the OCR text of the first author) to the OCR output affiliation, and calculates a score indicating the similarity of the affiliation found in the table to the OCR affiliation. Using a ground truth set of 519 OCR author/OCR affiliation/correct affiliation triples, the matching algorithm is able to select a correct affiliation for the author 43% of the time with a false positive rate of 6%, a true negative rate of 44% and a false negative rate of 7%.
MEDLINE citations with United States affiliations typically include the zip code. In addition to using author names as clues to correct affiliations, we are investigating the value of the OCR text of zip codes as clues to correct USA affiliations. Current work includes generation of an author/affiliation/zipcode table from the entire MEDLINE database and development of a daemon module to implement affiliation selection and matching for the MARS system using both author names and zip codes. Preliminary results from the initial version of the daemon module and the partially filled author/affiliation/zipcode table are encouraging.