CSpell

Non-word Correction

This page describes the algorithm for non-word correction.

I. Functions

II. Results on Training Set

CSpell ranking mode was tested on the development set for non-word correction with different function modes:

Function Mode                          Raw data (TP|retrieved|relevant)    Performance (precision|recall|F1)
ESpell                                 230|1180|774                        0.1949|0.2972|0.2354
Jazzy (ASpell)                         186|393|774                         0.4733|0.2403|0.3188
Ensemble                               552|825|774                         0.6691|0.7132|0.6904
CSpell, non-dictionary-based
non-dictionary-based                   340|373|774                         0.9115|0.4393|0.5929
CSpell, non-word, Single Function
1-to-1                                 588|699|774                         0.8412|0.7597|0.7984
Split                                  365|469|774                         0.7783|0.4716|0.5873
Merge                                  343|382|774                         0.8979|0.4432|0.5934
CSpell, non-word, Combined Functions
1-to-1 + Split                         603|724|774                         0.8329|0.7791|0.8051
1-to-1 + Split + Merge                 606|731|774                         0.8290|0.7829|0.8053

From the results:

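  • F1 improves by 11.5% (absolute) over the baseline, which matches the difference between the Ensemble row (0.6904) and the 1-to-1 + Split + Merge row (0.8053).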
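
The performance figures in the table can be re-derived from the raw counts. The sketch below assumes the three raw numbers are correct corrections, total corrections returned, and total corrections expected (this reading is not stated on the page, but it reproduces the reported values), and that "Baseline" refers to the Ensemble row. The function name `prf` is illustrative only.

```python
def prf(correct, retrieved, relevant):
    """Return (precision, recall, F1) from raw counts."""
    precision = correct / retrieved
    recall = correct / relevant
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Raw counts copied from the results table; the column meaning
# (correct | retrieved | relevant) is an assumption.
rows = {
    "Ensemble": (552, 825, 774),
    "1-to-1 + Split + Merge": (606, 731, 774),
}

scores = {name: prf(*counts) for name, counts in rows.items()}
for name, (p, r, f1) in scores.items():
    print(f"{name}: P={p:.4f} R={r:.4f} F1={f1:.4f}")

# Absolute F1 gain of the combined functions over the assumed baseline.
gain = scores["1-to-1 + Split + Merge"][2] - scores["Ensemble"][2]
print(f"F1 gain: {gain * 100:.1f} points")
```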

III. Examples

  • ND (non-dictionary-based):

    ID     Input                  Output                  Notes
    ND-1   "Good"                 "Good"                  Xml/Html handler
    ND-2   pls                    please                  Informal Expression handler
    ND-3   20years                20 years                Leading Digit Splitter
    ND-4   from2007               from 2007               Ending Digit Splitter
    ND-5   volunteers(healthy)    volunteers (healthy)    Leading Punctuation Splitter
    ND-6   pain.help!             pain. help!             Ending Punctuation Splitter
    ND-7   pain.pls help!         pain. please help!      Combo
    ND-8   visit at pain.com!     visit at pain.com!      No correction!
    • Splitters and handlers are used in a Java-8 stream operation for non-dictionary-based corrections.
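
The chaining of splitters and handlers can be sketched as follows. This is a Python illustration, not CSpell's Java code: the function names, the regular expressions, the `INFORMAL` map, and the URL-like exemption that keeps "pain.com!" intact are all assumptions chosen to reproduce examples ND-2 through ND-8. The Xml/Html handler (ND-1) is left out.

```python
import re

INFORMAL = {"pls": "please"}  # hypothetical informal-expression map
# Assumed exemption: tokens that look like a domain name are not split.
URL_LIKE = re.compile(r"\.(com|org|net|gov|edu)\b", re.IGNORECASE)

def split_leading_digit(tok):
    # "20years" -> "20 years"
    return re.sub(r"^(\d+)([A-Za-z]+)$", r"\1 \2", tok)

def split_ending_digit(tok):
    # "from2007" -> "from 2007"
    return re.sub(r"^([A-Za-z]+)(\d+)$", r"\1 \2", tok)

def split_leading_punct(tok):
    # "volunteers(healthy)" -> "volunteers (healthy)"
    return re.sub(r"(?<=[A-Za-z0-9])([(\[{])", r" \1", tok)

def split_ending_punct(tok):
    # "pain.help!" -> "pain. help!", but "pain.com!" is left alone
    if URL_LIKE.search(tok):
        return tok
    return re.sub(r"([.,;:!?)\]}])(?=[A-Za-z])", r"\1 ", tok)

def expand_informal(tok):
    # "pls" -> "please"
    return INFORMAL.get(tok.lower(), tok)

SPLITTERS = [split_leading_digit, split_ending_digit,
             split_leading_punct, split_ending_punct]

def correct_nd(text):
    """Apply every splitter to every token, then the informal-expression handler."""
    out = []
    for tok in text.split():
        for splitter in SPLITTERS:
            # A splitter may turn one token into several; re-split and continue.
            tok = " ".join(splitter(part) for part in tok.split())
        out.extend(expand_informal(part) for part in tok.split())
    return " ".join(out)

for sample in ["pls", "20years", "from2007", "volunteers(healthy)",
               "pain.help!", "pain.pls help!", "visit at pain.com!"]:
    print(repr(sample), "->", repr(correct_nd(sample)))
```

Running the splitters before the handler is what lets the combined case (ND-7) work: "pain.pls" is first split into "pain. pls", and only then is "pls" expanded.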

  • NW, Merge:

    ID    Input           Output          Notes
    M-1   dur ing         during          Merge
    M-2   non drug        nondrug         Merge
    M-3   non protein     non-protein     Merge with hyphen
    M-4   non surgical    non surgical    No merge
    • Examples M-2, M-3, and M-4: whether the tokens are merged, kept apart with a space, or joined with a hyphen depends on the spVars, the context, and the frequency.
    • "non" is an element-non-word; it is used in the non-word merge operation.
    • Most element words are valid single words. However, a few of them are invalid single words, such as "non", "se", "pre", "vitro", "vivo", and "intra". These are element-non-words and exist only within multiwords:
      Multiword                Element-non-word
      non surgical             non
      in vitro                 vitro
      in vivo grown            vivo
      intra articular route    intra
      per se                   se
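
The merge decision in examples M-1 to M-4 can be illustrated with a toy model. The frequency table, the word list, and the function `merge_pair` below are hypothetical stand-ins for the dictionary, spVars, and corpus frequencies. Context scoring is omitted, so this is only a sketch of the "merge, hyphenate, or keep the space" choice, not CSpell's algorithm.

```python
# Hypothetical frequencies; the numbers are invented for illustration.
FREQ = {
    "during": 500,
    "nondrug": 12,
    "non-drug": 3,
    "nonprotein": 2,
    "non-protein": 9,
    "nonsurgical": 6,
    "non surgical": 20,  # known multiword: the spaced form wins
}

# Hypothetical list of valid single words ("non", "dur", "ing" are not in it).
SINGLE_WORDS = {"drug", "protein", "surgical", "good", "pain"}

def merge_pair(left, right):
    """Pick the most frequent of: merged, hyphenated, or spaced form."""
    spaced = f"{left} {right}"
    if left in SINGLE_WORDS and right in SINGLE_WORDS:
        return spaced  # both tokens are valid words: not a non-word merge case
    candidates = [left + right, f"{left}-{right}", spaced]
    known = [c for c in candidates if c in FREQ]
    if not known:
        return spaced  # nothing known to merge into
    return max(known, key=FREQ.get)

for pair in [("dur", "ing"), ("non", "drug"), ("non", "protein"), ("non", "surgical")]:
    print(pair, "->", merge_pair(*pair))
```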

  • NW, 1To1:

    ID    Input                   Output
    1-1   good diagnosised        good diagnosis
    1-2   was diagnosised with    was diagnosed with
    • "diagnosised" is corrected to "diagnosis" (best orthographic score) in Example 1-1. However, with context, it is corrected to "diagnosed" in Example 1-2. These two examples suggest that the unsupervised context score model captures certain syntactic and semantic regularities.
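
The interplay between the orthographic score and the context score can be sketched with a toy ranker. Everything here is an assumption made for illustration: the orthographic score is approximated by edit distance with shared-prefix length as a tie-breaker, the context score by a small hand-made co-occurrence table, and context is taken to override orthography whenever it gives any signal. CSpell's actual scores are not reproduced.

```python
def edit_distance(a, b):
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def common_prefix_len(a, b):
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def orthographic_key(nonword, cand):
    # Smaller edit distance is better; a longer shared prefix breaks ties.
    return (-edit_distance(nonword, cand), common_prefix_len(nonword, cand))

# Hypothetical co-occurrence counts standing in for a context score model.
CONTEXT = {("was", "diagnosed"): 40, ("diagnosed", "with"): 55, ("diagnosis", "with"): 4}

def context_score(prev, cand, nxt):
    return CONTEXT.get((prev, cand), 0) + CONTEXT.get((cand, nxt), 0)

def correct_one_to_one(tokens, i, candidates):
    prev = tokens[i - 1] if i > 0 else None
    nxt = tokens[i + 1] if i + 1 < len(tokens) else None
    ctx = {c: context_score(prev, c, nxt) for c in candidates}
    if max(ctx.values()) > 0:
        return max(candidates, key=ctx.get)  # context gives a signal
    return max(candidates, key=lambda c: orthographic_key(tokens[i], c))

CANDIDATES = ["diagnosis", "diagnosed"]
print(correct_one_to_one(["good", "diagnosised"], 1, CANDIDATES))
print(correct_one_to_one(["was", "diagnosised", "with"], 1, CANDIDATES))
```

Both candidates are two edits away from "diagnosised", which is why the sketch needs a tie-breaker for the orthographic fallback and why context is what separates them in Example 1-2.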

  • NW, Split:

    ID    Input                Output
    S-1   thankyou             thank you
    S-2   shuntfrom2007.how    shunt from 2007. how
    • Example S-2 shows a combined correction from the ND and NW splitters.
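
A dictionary-based non-word split can be sketched as below. The tiny `DICTIONARY` and the function `split_nonword` are hypothetical, and the sketch returns the first split whose two parts are both dictionary words, whereas a real system would presumably rank competing splits by score.

```python
DICTIONARY = {"thank", "you", "shunt", "from", "how"}  # hypothetical tiny dictionary

def split_nonword(token, dictionary=DICTIONARY):
    """Split a non-word into two dictionary words, if such a split exists."""
    if token in dictionary:
        return token  # already a valid word
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in dictionary and right in dictionary:
            return f"{left} {right}"
    return token  # no valid split found

print(split_nonword("thankyou"))

# For S-2, assume the ND splitters have already produced "shuntfrom 2007. how";
# the NW splitter then handles the remaining non-word "shuntfrom".
after_nd = "shuntfrom 2007. how"
print(" ".join(split_nonword(t) for t in after_nd.split()))
```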