
Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction

Lasko TA, Hauser SE
Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.
Abstract: 

Five methods for matching words mistranslated by optical character recognition to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods were implemented and their accuracy rates compared: an adaptation of the cross correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores that have a useful and practical interpretation.
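
As a rough illustration of the generic edit distance approach named above, the following Python sketch scores an OCR output word against each entry of a small reference dictionary and returns the closest one. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `edit_distance` and `best_match`, the unit cost for every edit, and the sample word list are all hypothetical.

```python
from typing import Iterable


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption; the paper's
    probabilistic substitution matrix is not reproduced here)."""
    # prev[j] holds the distance between the first i-1 chars of a
    # and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def best_match(word: str, dictionary: Iterable[str]) -> str:
    """Return the dictionary entry with the smallest edit distance to word.
    Ties go to whichever entry appears first."""
    return min(dictionary, key=lambda entry: edit_distance(word, entry))


# Hypothetical sample vocabulary, for illustration only.
vocabulary = ["medicine", "medical", "library"]
print(best_match("rnedicine", vocabulary))
```

Here the OCR-style confusion of "m" with "rn" costs two unit edits, so "medicine" is still the nearest entry. A substitution matrix, as in the third method, would replace the fixed `cost` with a value that depends on the character pair.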

Lasko TA, Hauser SE. Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction. Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.