Spell checker for consumer language (CSpell).
Lu C, Aronson AR, Shooshan SE, Demner-Fushman D
Journal of the American Medical Informatics Association, Volume 26, Issue 3, 1 March 2019, Pages 211-218, https://doi.org/10.1093/jamia/ocy171. 21 January 2019 (Editor's Choice).
Abstract:
Objective
Automated understanding of consumer health inquiries might be hindered by
misspellings. To detect and correct various types of spelling errors in
consumer health questions, we developed a distributable spell-checking
tool, CSpell, that handles nonword errors, real-word errors, word
boundary infractions, punctuation errors, and combinations of the above.
Methods
We developed a novel approach of using dual embedding within Word2vec
for context-dependent corrections. This technique was used in
combination with dictionary-based corrections in a 2-stage ranking
system. We also developed various splitters and handlers to correct word
boundary infractions. All correction approaches are integrated to
handle errors in consumer health questions.
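As a rough illustration of context-dependent ranking with two embeddings, the sketch below scores each correction candidate by taking the dot product of the averaged input vectors of the surrounding words with the candidate's output vector. It assumes that "dual embedding" refers to using both the input and the output matrix of a Word2vec model; the abstract does not spell this out. The toy vectors and the names `INPUT_VECS`, `OUTPUT_VECS`, `context_score`, and `rank_by_context` are hypothetical and are not taken from the CSpell software.

```python
# Toy 2-dimensional vectors (hypothetical); real Word2vec matrices are
# learned from a corpus and have hundreds of dimensions.
# INPUT_VECS stands in for the input matrix, used here for context words.
# OUTPUT_VECS stands in for the output matrix, used here for candidates.
INPUT_VECS = {
    "blood": [0.9, 0.1],
    "pressure": [0.8, 0.2],
}
OUTPUT_VECS = {
    "high": [1.0, 0.0],
    "hi": [0.0, 1.0],
}


def context_score(candidate, context):
    """Dot product of the mean context input vector and the candidate output vector."""
    known = [INPUT_VECS[w] for w in context if w in INPUT_VECS]
    if not known or candidate not in OUTPUT_VECS:
        return 0.0  # no evidence either way
    dim = len(known[0])
    hidden = [sum(v[i] for v in known) / len(known) for i in range(dim)]
    return sum(h * o for h, o in zip(hidden, OUTPUT_VECS[candidate]))


def rank_by_context(candidates, context):
    """Order candidates from best to worst context score."""
    return sorted(candidates, key=lambda c: context_score(c, context), reverse=True)


ranking = rank_by_context(["hi", "high"], ["blood", "pressure"])
```

With these toy vectors, `ranking` places "high" ahead of "hi" for the context "blood pressure", because the candidate's output vector aligns better with the averaged context vector.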
Results
Our approach achieves F1 scores of 80.93% for spelling error detection
and 69.17% for spelling error correction.
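For readers unfamiliar with the metric, F1 is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the counts in the usage line are made up for illustration and are not the study's data.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 errors correctly flagged, 2 false alarms, 2 missed.
example = f1_score(8, 2, 2)
```

Here precision and recall are both 0.8, so `example` is approximately 0.8.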
Discussion
The dual-embedding model shows a significant improvement (9.13%) in F1
score compared with the general practice of using cosine similarity with
word vectors in Word2vec for context ranking. Our 2-stage ranking
system shows a 4.94% improvement in F1 score compared with the best
1-stage ranking system.
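One possible reading of a 2-stage ranking system is sketched below: a first stage shortlists dictionary words by edit distance to the misspelled token, and a second stage reorders only that shortlist by a context score. This is an assumption made for illustration; the abstract does not describe which scores CSpell uses in each stage. The function names, the shortlist size, and the context scores are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def two_stage_rank(token, dictionary, context_scores, shortlist_size=2):
    """Stage 1: shortlist by edit distance. Stage 2: rerank the shortlist by context."""
    stage1 = sorted(dictionary, key=lambda w: (edit_distance(token, w), w))
    shortlist = stage1[:shortlist_size]
    return sorted(shortlist, key=lambda w: context_scores.get(w, 0.0), reverse=True)


# Hypothetical context scores, e.g. produced by an embedding-based scorer.
scores = {"pressure": 0.9, "pleasure": 0.2, "measure": 0.95}
ranked = two_stage_rank("presure", ["pressure", "pleasure", "treasure", "measure"], scores)
```

In this toy run, "measure" has the highest context score but never reaches the second stage because it is too far from "presure" in spelling. A 1-stage system that ranked on context alone would have chosen it, which illustrates one reason staged ranking can help.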
Conclusion
CSpell improves over the state of the art and provides near real-time
automatic misspelling detection and correction in consumer health
questions. The software and the CSpell test set are available at
https://lsg3.nlm.nih.gov/LexSysGroup/Projects/cSpell/current/web/index.html.