CSpell

Non-word Spelling (1-To-1)

I. Introduction

This page describes the processes for non-word spelling (1-to-1) detection and correction.

II. Processes

  • Detector:
    NonWordDetector.java
    • non-word: an invalid word, that is, a word not in checkDic (checkDic includes EW, NUM, etc.)
    • Not one of the exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
  • Candidates:
    OneToOneCandidates.java
    • max. length of word <= 25 (configurable: CS_CAN_NW_1TO1_WORD_MAX_LENGTH)
      Longer non-words generate too many candidates, which slows processing. This limit addresses that issue. Recall might decrease if the value is set too small.

    • Edit Dist <= 2
    • candidate is in the suggDic (valid word)
  • Ranker:
    RankNonWordByMode.java,
    uses the top ranked candidate in the two-stage ranking system for correction:
    • Stage-1:
      • Orthographic score
        • Edit Distance Similarity score
        • Phonetic Similarity score (Double Metaphone)
        • Overlap Similarity score
      • Find the top orthographic score
      • Stage 1 Range factor for qualifying candidate = 0.08 (configurable: CS_RANKER_NW_S1_RANK_RANGE_FAC)
        All candidates whose orthographic score is within 0.08 of the top orthographic score qualify for stage-2 final ranking. That is, a candidate qualifies if its orthographic score is at least 92% of the top candidate's score.
      • The ranks by orthographic score in this stage are disregarded in stage-2
    • Stage-2:
      Use chain comparators in a sequential order of the following scores:
  • Corrector:
    OneToOneCorrector.java
    • Update the focus token with the top-ranked candidate
    • Update the process history to non-word-1-to-1
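
The candidate rules above (word length <= 25, edit distance <= 2, candidate in suggDic) can be sketched in Java as follows. This is a minimal illustration, not the code in OneToOneCandidates.java: the class name, method names, and the use of plain Levenshtein distance are assumptions, and the sketch scans a small in-memory dictionary for brevity rather than generating edits from the non-word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the actual OneToOneCandidates.java.
class OneToOneCandidateSketch {
    // Mirrors the documented default of CS_CAN_NW_1TO1_WORD_MAX_LENGTH.
    static final int WORD_MAX_LENGTH = 25;
    // Documented candidate limit: Edit Dist <= 2.
    static final int MAX_EDIT_DIST = 2;

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    // Assumption: the real edit-distance definition may differ.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns dictionary words within MAX_EDIT_DIST of the non-word.
    // Non-words longer than WORD_MAX_LENGTH get no candidates at all.
    static List<String> candidates(String nonWord, Set<String> suggDic) {
        List<String> out = new ArrayList<>();
        if (nonWord.length() > WORD_MAX_LENGTH) {
            return out;
        }
        for (String word : suggDic) {
            if (editDistance(nonWord, word) <= MAX_EDIT_DIST) {
                out.add(word);
            }
        }
        return out;
    }
}
```

Under this sketch, "knoledge" is one edit from "knowledge" and so would be proposed, while "clancy" is three edits from "clumsy", which puts "clumsy" outside the candidate set.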

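The stage-1 qualification step of the ranker can be sketched as below. This is an illustration only, not the code in RankNonWordByMode.java: the class and method names are hypothetical, the orthographic scores are taken as given (how the three similarity scores are combined is not shown here), and the sketch assumes the relative reading of the range factor, where a candidate qualifies if its score is at least (1 - 0.08) times the top score. Under the absolute reading, the threshold would instead be the top score minus 0.08.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; not the actual RankNonWordByMode.java.
class Stage1QualifierSketch {
    // Mirrors the documented default of CS_RANKER_NW_S1_RANK_RANGE_FAC.
    static final double RANGE_FACTOR = 0.08;

    // Returns the candidates whose orthographic score is at least
    // (1 - RANGE_FACTOR) of the top score, in the map's iteration order.
    // Assumption: relative threshold; stage-1 order is not preserved as a
    // rank, since stage-2 disregards stage-1 ranks.
    static List<String> qualify(Map<String, Double> orthoScores) {
        List<String> qualified = new ArrayList<>();
        if (orthoScores.isEmpty()) {
            return qualified;
        }
        double top = Double.NEGATIVE_INFINITY;
        for (double score : orthoScores.values()) {
            top = Math.max(top, score);
        }
        double threshold = top * (1.0 - RANGE_FACTOR);
        for (Map.Entry<String, Double> entry : orthoScores.entrySet()) {
            if (entry.getValue() >= threshold) {
                qualified.add(entry.getKey());
            }
        }
        return qualified;
    }
}
```

The returned list is deliberately unranked: only membership matters at this point, because the final order is decided by the stage-2 comparators.
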
III. Development Test

    • True-Positive non-word 1-to-1:
      Id    Source  Original Word  Corrected Word
      TP-1  10023   knoledge       knowledge
      TP-2  10040   truely         truly
      TP-3  10475   diagnost       diagnosed
      TP-4  6       diagnosised    diagnosed
      ...   ...     ...            ...
      • TP-3, 4: the correction depends on the surrounding context:
        • diagnost -> diagnosis
        • was diagnost -> was diagnosed
        • diagnost with -> diagnosed with
        • was diagnost with -> was diagnosed with

        • diagnosised -> diagnosis
        • was diagnosised with -> was diagnosed with
    • False-Positive non-word 1-to-1:
      Id    Source  Original Word  Corrected Word  Correct Word
      FP-1  10058   B              be              B
      FP-2  10084   i.e.           ice.            i.e.
      FP-3  11144   clancy         chancy          clumsy
      FP-4  11588   baging         bagging         begging
      ...   ...     ...            ...             ...
      • FP-1, 2: could be improved by taking word length and case into account
      • FP-3: the correct word is too far away in edit distance
    • False-Negative non-word 1-to-1:
      Id    Source  Original Word  Corrected Word  Correct Word
      FN-1  10285   hitiala        hitiala         hiatal
      FN-2  10714   havy           have            heavy
      FN-3  10      ewings         ewings          ewing's
      FN-4  11144   traumatologo   traumatologo    traumatologist
      FN-5  11186   segmens        segment         segments
      • FN-3: possessive
      • FN-4: the correct word is too far away in edit distance
      • FN-5: inflectional variants