CSpell

Real-word Correction

This page describes the algorithm for real-word correction. In general, CSpell detects and corrects real-word errors on the fly, based on a context score, a word frequency score, and other heuristic rules. No confusion set is used, and no assumption is made about the number of real-word errors.
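The on-the-fly decision described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy counts, the function names (context_score, frequency_score, correct_token), and the margin rule are hypothetical and are not taken from CSpell, whose actual scores and heuristics are not specified on this page.

```python
from collections import Counter

# Toy statistics (hypothetical); a real system would derive these from a corpus.
UNIGRAMS = Counter({"small": 900, "smell": 300, "bad": 500, "size": 400})
BIGRAMS = Counter({("bad", "smell"): 40, ("small", "size"): 60, ("bad", "small"): 1})


def context_score(prev, word, nxt):
    """Hypothetical context score: bigram counts with the two neighbours."""
    return BIGRAMS[(prev, word)] + BIGRAMS[(word, nxt)]


def frequency_score(word):
    """Hypothetical frequency score: relative unigram frequency."""
    return UNIGRAMS[word] / sum(UNIGRAMS.values())


def correct_token(prev, word, nxt, candidates, margin=5.0):
    """Keep `word` unless a candidate fits the context `margin` times better."""
    if not candidates:
        return word
    # Rank candidates by context score, breaking ties by word frequency.
    best = max(candidates, key=lambda c: (context_score(prev, c, nxt), frequency_score(c)))
    original = context_score(prev, word, nxt)
    if context_score(prev, best, nxt) > margin * max(original, 1):
        return best
    return word


print(correct_token("bad", "small", "", ["smell"]))   # smell
print(correct_token("", "small", "size", ["smell"]))  # small
```

The margin makes the sketch conservative: a valid word is replaced only when the context clearly favours the candidate, which is one simple way to avoid relying on a fixed confusion set.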

I. Functions

II. Results on the Training Set

Different methods were tested on the gold standard from the training set, with real-word errors included.

Methods                                       Raw data       Performance
Ensemble (Use Non-Word on Real-Word)          556|825|964    0.6739|0.5768|0.6216
Ensemble (Real-Word)                          517|718|964    0.7201|0.5363|0.6147
CSpell: NW                                    609|731|964    0.8331|0.6317|0.7186
CSpell: NW + RW_Merge                         619|742|964    0.8342|0.6421|0.7257
CSpell: NW + RW_Split                         611|737|964    0.8290|0.6338|0.7184
CSpell: NW + RW_1To1                          614|740|964    0.8297|0.6369|0.7207
CSpell: NW + RW_Merge + RW_Split              621|747|964    0.8313|0.6442|0.7259
CSpell: NW + RW_Merge + RW_Split + RW_1To1    626|756|964    0.8280|0.6494|0.7279

  • RW_M and RW_S: ~1 min.
  • RW_1: ~4 min.
  • RW_M_S: ~1 min.
  • RW_A: ~4.5 min.
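The three performance values in each row of the results table are consistent with precision, recall, and F1 computed from the three raw-data counts. The sketch below checks this arithmetic for the first row. Reading the counts as (correct fixes, fixes returned, errors in the gold standard) is an interpretation inferred from the numbers, and the function name prf is hypothetical.

```python
def prf(true_positive, returned, relevant):
    """Precision, recall, and F1 from raw counts (assumed meaning of the triple)."""
    precision = true_positive / returned
    recall = true_positive / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# First row of the table: 556|825|964
p, r, f = prf(556, 825, 964)
print(f"{p:.4f}|{r:.4f}|{f:.4f}")  # 0.6739|0.5768|0.6216
```

The same computation on the last row (626|756|964) gives 0.8280|0.6494|0.7279, matching the table.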

III. Examples

  • Merge:

    ID   Input                            Output                           Notes
    M-1  on set                           on set                           No merge
    M-2  based on set criteria            based on set criteria            No merge
    M-3  early on set                     early onset                      Merged
    M-4  on set dementia                  onset dementia                   Merged
    M-5  dianosed early on set deminita   diagnosed early onset dementia   Merged with other NW corrections
    • "on set" is merged to "onset" depending on the context. In example M-5, "dianosed" and "deminita" are also corrected to "diagnosed" and "dementia", respectively, by the non-word functions before the real-word merge is applied.

  • Split:

    ID   Input                  Output                Notes
    S-1  along                  along                 No split
    S-2  for along time         for a long time       Split
    S-3  He is along            He is along           No split
    S-4  He is a long with me   He is along with me   No split - Merge
    • Google does not correct S-2 and S-4!!

  • Spelling (1-to-1):

    ID    Input                  Output
    1-1   foul small             foul smell
    1-2   bad small              bad smell
    1-3   small an odor          smell an odor
    1-4   sense of small         sense of smell
    1-5   taste and small        taste and smell
    1-6   smell size             small size
    1-7   smell amount           small amount
    1-8   a smell sip of water   a small sip of water
    1-9   smell intestine        small intestine
    1-10  very smell             very small
    1-11  relatively smell       relatively small
    • Google does not correct 1-3, 1-5, 1-10 and 1-11!!
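The merge and split examples above suggest a candidate-generation step that runs before any context-based decision. The following is a minimal sketch of such a step, assuming a plain dictionary lookup. DICTIONARY, merge_candidates, and split_candidates are hypothetical names and do not reflect how CSpell generates its candidates.

```python
# Toy dictionary (hypothetical); only the words needed for the examples.
DICTIONARY = {"on", "set", "onset", "a", "long", "along", "early", "for", "time"}


def merge_candidates(tokens, i):
    """Candidate sentence with tokens[i] and tokens[i+1] merged, if the result is a word."""
    if i + 1 < len(tokens):
        merged = tokens[i] + tokens[i + 1]
        if merged in DICTIONARY:
            return [tokens[:i] + [merged] + tokens[i + 2:]]
    return []


def split_candidates(tokens, i):
    """Candidate sentences with tokens[i] split into two dictionary words."""
    word = tokens[i]
    candidates = []
    for k in range(1, len(word)):
        left, right = word[:k], word[k:]
        if left in DICTIONARY and right in DICTIONARY:
            candidates.append(tokens[:i] + [left, right] + tokens[i + 1:])
    return candidates


print(merge_candidates(["early", "on", "set"], 1))    # [['early', 'onset']]
print(split_candidates(["for", "along", "time"], 1))  # [['for', 'a', 'long', 'time']]
```

Generating a candidate does not mean accepting it: as examples M-1, M-2, S-1, and S-3 show, a candidate should replace the input only when the context supports it.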