
CSpell

Real-word Spelling (1-To-1)

This page describes the processes for real-word spelling (1-to-1) detection and correction.

I. Processes

  • Detector:
    RealWord1To1Detector.java
    • Not corrected previously in the CSpell pipeline.
    • real-word: valid word (in checkDic)
    • Not exceptions: digit, punctuation, digit/punctuation, url, email, empty string, measurement, properNoun, abbreviation/acronym
    • word has context score
    • word WC >= 65 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_WC)
    • word has length >= 2 (configurable: CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH)
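The detector conditions above can be read as one conjunction. The following is a minimal sketch under that reading, not the actual RealWord1To1Detector.java: the class `RealWordDetectorSketch`, the holder `TokenInfo`, and its field names are hypothetical, and the dictionary, exception, and context-score lookups are assumed to have been done elsewhere and passed in as booleans.

```java
/** Hypothetical sketch of the detector conditions; not the actual RealWord1To1Detector.java. */
final class RealWordDetectorSketch {
    // Default thresholds quoted from the list above.
    static final int MIN_WC = 65;     // CS_DETECTOR_RW_1TO1_WORD_MIN_WC
    static final int MIN_LENGTH = 2;  // CS_DETECTOR_RW_1TO1_WORD_MIN_LENGTH

    /** Hypothetical holder for the facts the detector needs about one token. */
    static final class TokenInfo {
        final String word;
        final boolean correctedEarlier; // already corrected earlier in the pipeline
        final boolean inCheckDic;       // valid word (in checkDic)
        final boolean isException;      // digit, punctuation, url, email, ...
        final boolean hasContextScore;  // word2vec information is available
        final long wordCount;           // WC

        TokenInfo(String word, boolean correctedEarlier, boolean inCheckDic,
                  boolean isException, boolean hasContextScore, long wordCount) {
            this.word = word;
            this.correctedEarlier = correctedEarlier;
            this.inCheckDic = inCheckDic;
            this.isException = isException;
            this.hasContextScore = hasContextScore;
            this.wordCount = wordCount;
        }
    }

    /** True only if every detector condition holds (assumed to be ANDed). */
    static boolean isDetected(TokenInfo t) {
        return !t.correctedEarlier
            && t.inCheckDic
            && !t.isException
            && t.hasContextScore
            && t.wordCount >= MIN_WC
            && t.word.length() >= MIN_LENGTH;
    }

    public static void main(String[] args) {
        TokenInfo t = new TokenInfo("weather", false, true, false, true, 1000L);
        System.out.println(isDetected(t));
    }
}
```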
  • Candidates:
    RealWord1To1Candidates.java
    • Max. length of real-word <= 10 (configurable: CS_CAN_RW_1TO1_WORD_MAX_LENGTH)
      Only generate real-word 1-to-1 candidates for words whose length does not exceed this value, to prevent over-generation and slow performance. If this number is too small, recall decreases (but speed improves).
    • Generate all possible candidates, as in non-word correction
    • Filter out invalid candidates (IsValid1To1Cand)
      => Ideally, we only correct a real-word with candidates that are very similar to the inWord, that is, they look (orthographic) and sound (phonetic) alike. If we loosen this restriction, real-word correction relies mainly on the context score (word2vec). In this version, our corpus for word2vec is relatively small, so it generates too much noise [FP] and results in low precision and F1. The looks-and-sounds-alike restriction also helps run time performance a little (fewer context score calculations in ranking).
      • in suggDic (valid word)
      • has context score (word2Vec)
      • WC >= 1 (has word count, configurable: CS_CAN_RW_1TO1_CAND_MIN_WC)
      • length >= 2 (configurable: CS_CAN_RW_1TO1_CAND_MIN_LENGTH)
      • candidate is not an inflectional variant of inWord
        In this version, we do not correct grammar, so inflectional variants (such as plural nouns, third-person singular verbs, etc.) are not corrected.
      • Heuristic rules of looks and sounds alike:
        • sounds alike: both phonetic codes of double metaphone and refined soundex must be the same
          • same double metaphone code (pmDist = 0)
          • same refined soundex code (prDist = 0)
        • look alike: small edit distance with similar sounds
          • leadDist + endDist + lengthDist + pmDist + prDist < 3
          • editDist + pmDist + prDist < 4
          • phonetic codes for double metaphone (pmDist = 0)
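The looks-and-sounds-alike heuristics can be sketched as two predicates over distances that have already been computed. This is a minimal illustration, not the actual IsValid1To1Cand code. It makes two assumptions: the three "look alike" conditions are ANDed, and a candidate passes if it satisfies either the "sounds alike" rule or the "look alike" rule (the list above does not say how the two rules combine). The class and method names are hypothetical.

```java
/** Hypothetical sketch of the looks/sounds-alike heuristics; all distances are computed elsewhere. */
final class LooksSoundsAlikeSketch {

    /** Sounds alike: same double metaphone code and same refined soundex code. */
    static boolean soundsAlike(int pmDist, int prDist) {
        return pmDist == 0 && prDist == 0;
    }

    /** Look alike: small edit distance with similar sounds (conditions assumed to be ANDed). */
    static boolean looksAlike(int leadDist, int endDist, int lengthDist,
                              int editDist, int pmDist, int prDist) {
        return (leadDist + endDist + lengthDist + pmDist + prDist) < 3
            && (editDist + pmDist + prDist) < 4
            && pmDist == 0;
    }

    /** Assumption: either rule is sufficient for a candidate to be kept. */
    static boolean isSimilar(int leadDist, int endDist, int lengthDist,
                             int editDist, int pmDist, int prDist) {
        return soundsAlike(pmDist, prDist)
            || looksAlike(leadDist, endDist, lengthDist, editDist, pmDist, prDist);
    }

    public static void main(String[] args) {
        // Same double metaphone code, refined soundex differs by 1, one edit near the end.
        System.out.println(isSimilar(0, 1, 0, 1, 0, 1));
    }
}
```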

    • Max. key size of the HashMap that stores real-word 1-To-1 candidates in memory: 1,000,000,000 (configurable: CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE)
      Generating all possible candidates on the fly is slow because there are too many real-words and candidates. To resolve this, generated candidates (values) are saved with their real-word (key) in memory (in a HashMap). Our test showed the elapsed time improved from 25+ min. to 3.5 min. on the training set. This is because:
      • many real-words are repeated
      • the candidates of a given real-word are always the same
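The caching idea described above is plain memoization keyed by the real-word. The sketch below illustrates it under stated assumptions: the class `CandidateCacheSketch` and the pluggable `generator` function are hypothetical, and what CSpell does once the key limit is reached is not stated in the text, so this sketch simply stops adding new keys.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

/** Hypothetical memoization sketch: real-word (key) -> generated candidates (value). */
final class CandidateCacheSketch {
    private final Map<String, Set<String>> cache = new HashMap<>();
    private final long maxKeySize;                          // cf. CS_CAN_RW_1TO1_CAND_MAX_KEY_SIZE
    private final Function<String, Set<String>> generator;  // stand-in for the slow candidate generation
    private int generatorCalls = 0;

    CandidateCacheSketch(long maxKeySize, Function<String, Set<String>> generator) {
        this.maxKeySize = maxKeySize;
        this.generator = generator;
    }

    Set<String> getCandidates(String realWord) {
        Set<String> cached = cache.get(realWord);
        if (cached != null) {
            return cached; // repeated real-word: no regeneration needed
        }
        generatorCalls++;
        Set<String> candidates = generator.apply(realWord);
        // Assumption: once the key limit is reached, new results are not cached.
        if (cache.size() < maxKeySize) {
            cache.put(realWord, candidates);
        }
        return candidates;
    }

    int generatorCalls() {
        return generatorCalls;
    }

    public static void main(String[] args) {
        CandidateCacheSketch c =
            new CandidateCacheSketch(1_000_000_000L, w -> Collections.singleton(w + "?"));
        c.getCandidates("small");
        c.getCandidates("small");
        System.out.println(c.generatorCalls()); // the generator ran once for two lookups
    }
}
```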
  • Ranker:
    RankRealWord1To1ByCSpell.java
    • Find the top-ranked candidate
      Sort the candidates by CSpellScoreRw1To1Comparator.java:
      • OrthographicScoreComparator
        The top-ranked candidate (highest orthographic score) must also have the highest score in the candidate list for each of the following:
        • FrequencyScore
        • EditDistSimilarityScore
        • PhoneticSimilarityScore
        • OverlapSimilarityScore
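The top-candidate rule above can be illustrated as follows: pick the candidate with the highest orthographic score, then accept it only if no other candidate beats it on any of the four other scores. This is a sketch under assumptions, not the actual CSpellScoreRw1To1Comparator.java: the `Scored` holder and its fields are hypothetical, ties are treated as acceptable, and returning `null` stands for "no top candidate".

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of selecting the top-ranked real-word 1-to-1 candidate. */
final class TopCandidateSketch {

    /** Hypothetical holder for one candidate and its five scores. */
    static final class Scored {
        final String word;
        final double orthographic, frequency, editDistSim, phoneticSim, overlapSim;

        Scored(String word, double orthographic, double frequency,
               double editDistSim, double phoneticSim, double overlapSim) {
            this.word = word;
            this.orthographic = orthographic;
            this.frequency = frequency;
            this.editDistSim = editDistSim;
            this.phoneticSim = phoneticSim;
            this.overlapSim = overlapSim;
        }
    }

    /** Returns the top candidate, or null if it is not also highest on the other four scores. */
    static Scored findTop(List<Scored> cands) {
        if (cands.isEmpty()) {
            return null;
        }
        Scored top = cands.get(0);
        for (Scored c : cands) {
            if (c.orthographic > top.orthographic) {
                top = c;
            }
        }
        for (Scored c : cands) {
            if (c.frequency > top.frequency || c.editDistSim > top.editDistSim
                    || c.phoneticSim > top.phoneticSim || c.overlapSim > top.overlapSim) {
                return null; // another candidate beats the top on one of the four scores
            }
        }
        return top;
    }

    public static void main(String[] args) {
        // Illustrative scores only; they are not taken from CSpell output.
        List<Scored> cands = Arrays.asList(
            new Scored("smell", 0.9, 0.5, 0.8, 0.7, 0.6),
            new Scored("smile", 0.4, 0.3, 0.2, 0.1, 0.1));
        Scored top = findTop(cands);
        System.out.println(top == null ? "none" : top.word);
    }
}
```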
    • Validate the top ranked candidate
      Use context score to validate the top ranked candidate (IsTopCandValid):
      • context radius = 2 (configurable, CS_RW_1TO1_CONTEXT_RADIUS)
      • Set the RealWord_1To1_Confidence_Factor = 0.0 (configurable: CS_RANKER_RW_1TO1_C_FAC) as a stricter restriction to avoid false-positive candidates
      • orgScore < 0
        • & topScore > 0
          • Context Score Check (on min., distance, and ratio)
            • Min: topScore > rw1To1CandMinCs (0.00, configurable: CS_RANKER_RW_1TO1_CAND_MIN_CS)
            • Dist: topScore - orgScore > rw1To1CandCsDist (0.085, configurable: CS_RANKER_RW_1TO1_CAND_CS_DIST)
            • Ratio: (topScore/-orgScore) > rw1To1CandCsFactor (0.1, configurable: CS_RANKER_RW_1TO1_CAND_CS_FAC)

            • Min: orgScore > rw1To1WordMinCs (-0.085, configurable: CS_RANKER_RW_1TO1_WORD_MIN_CS)
          • Frequency Score Check (on min., distance, and ratio)
            • Min: topFScore > rw1To1CandMinFs (0.0006, configurable: CS_RANKER_RW_1TO1_CAND_MIN_FS)
            • Dist: topFScore > orgFScore or (orgFScore - topFScore) < rw1To1CandFsDist (0.02, configurable: CS_RANKER_RW_1TO1_CAND_FS_DIST)
            • Ratio: (topFScore/orgFScore) > rw1To1CandFsFactor (0.035, configurable: CS_RANKER_RW_1TO1_CAND_FS_FAC)
        • & topScore < 0 & topScore * RealWord1To1CFactor > orgScore
      • orgScore > 0
        • & topScore * RealWord_1To1_Confidence_Factor > orgScore
          => Never happens because RealWord_1To1_Confidence_Factor is 0.0
      • orgScore = 0
        • No real-word 1-to-1 correction, because such words are excluded by the detector (no word2Vec information on the inspected word)
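For the main validation branch (orgScore < 0 and topScore > 0), the context score check and the frequency score check can be sketched as two predicates using the default thresholds listed above. This is an illustration, not the actual IsTopCandValid code. It assumes the sub-conditions inside each check are ANDed, and it leaves open how the two checks are combined, since the text does not say. The class and method names are hypothetical, and the other branches are not covered.

```java
/** Hypothetical sketch of the two checks used when orgScore < 0 and topScore > 0. */
final class TopCandChecksSketch {
    // Default thresholds quoted from the list above.
    static final double CAND_MIN_CS  = 0.00;    // CS_RANKER_RW_1TO1_CAND_MIN_CS
    static final double CAND_CS_DIST = 0.085;   // CS_RANKER_RW_1TO1_CAND_CS_DIST
    static final double CAND_CS_FAC  = 0.1;     // CS_RANKER_RW_1TO1_CAND_CS_FAC
    static final double WORD_MIN_CS  = -0.085;  // CS_RANKER_RW_1TO1_WORD_MIN_CS
    static final double CAND_MIN_FS  = 0.0006;  // CS_RANKER_RW_1TO1_CAND_MIN_FS
    static final double CAND_FS_DIST = 0.02;    // CS_RANKER_RW_1TO1_CAND_FS_DIST
    static final double CAND_FS_FAC  = 0.035;   // CS_RANKER_RW_1TO1_CAND_FS_FAC

    /** Context score check on min., distance, and ratio (conditions assumed to be ANDed). */
    static boolean contextScoreCheck(double orgScore, double topScore) {
        return orgScore < 0 && topScore > 0                 // branch precondition
            && topScore > CAND_MIN_CS                       // Min (candidate)
            && (topScore - orgScore) > CAND_CS_DIST         // Dist
            && (topScore / -orgScore) > CAND_CS_FAC         // Ratio
            && orgScore > WORD_MIN_CS;                      // Min (inspected word)
    }

    /** Frequency score check on min., distance, and ratio (conditions assumed to be ANDed). */
    static boolean frequencyScoreCheck(double orgFScore, double topFScore) {
        return topFScore > CAND_MIN_FS
            && (topFScore > orgFScore || (orgFScore - topFScore) < CAND_FS_DIST)
            && (topFScore / orgFScore) > CAND_FS_FAC;
    }

    public static void main(String[] args) {
        // Illustrative scores only; they are not taken from CSpell output.
        System.out.println(contextScoreCheck(-0.05, 0.10));
        System.out.println(frequencyScoreCheck(0.01, 0.005));
    }
}
```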
  • Corrector:
    OneToOneCorrector.java
    • Update the focused (inspected) token with the top ranked candidate.
    • Update process history to real-word-1To1

II. Development Tests

Tested different real-word 1-to-1 factors on the revised gold standard (which includes real-word errors) from the training set. Each test takes about 3-5 min. (depending on the computer and memory size).

  • Detector (check on focus token):
    Function      Min. Length  Min. WC  Raw data     Performance (precision|recall|F1)
    NW (All)      N/A          N/A      607|777|964  0.7812|0.6297|0.6973
    NW + RW_1To1  1            65       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            65       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  3            65       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  4            65       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  5            65       611|783|964  0.7803|0.6338|0.6995
    NW + RW_1To1  6            65       609|781|964  0.7798|0.6317|0.6980
    NW + RW_1To1  7            65       608|778|964  0.7815|0.6307|0.6980
    NW + RW_1To1  8            65       607|777|964  0.7812|0.6297|0.6973
    NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            10       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            65       612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            100      611|785|964  0.7783|0.6338|0.6987
    NW + RW_1To1  2            500      610|784|964  0.7781|0.6328|0.6979
    NW + RW_1To1  2            1000     610|782|964  0.7801|0.6328|0.6987
    NW + RW_1To1  2            10000    608|778|964  0.7815|0.6307|0.6980

    • Test on Min. length:
      • Increasing it gives better precision and worse recall.
      • With a small number, precision does not increase.
      • The TPs start to drop after 5. This might result in a better or worse F1.
      • No TPs by RW-1To1 when it is 8 (>= 8), because all corrections in the development set are shorter than 8.
      • Choose 2 for more recall with the same F1 and precision. This means that a target word of length 1 is not a valid real-word for 1-To-1 correction.
    • Test on Min. WC (word count):
      • Increasing it gives better precision, worse recall, and faster run time.
      • With a small number, precision does not increase.
      • Choose 1 for more recall with the same F1 and precision.
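The Performance values in the table above can be reproduced from the Raw data triples. The sketch below assumes the triple is (true positives | corrections made | corrections in the gold standard); that reading is an inference, but the arithmetic matches the reported numbers. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Hypothetical helper that turns a raw data triple into precision|recall|F1. */
final class PerformanceSketch {

    static double precision(int tp, int retrieved) {
        return (double) tp / retrieved;
    }

    static double recall(int tp, int relevant) {
        return (double) tp / relevant;
    }

    static double f1(double p, double r) {
        return 2 * p * r / (p + r);
    }

    /** Formats the three values to four decimals, in the same layout as the tables. */
    static String format(int tp, int retrieved, int relevant) {
        double p = precision(tp, retrieved);
        double r = recall(tp, relevant);
        return String.format(Locale.ROOT, "%.4f|%.4f|%.4f", p, r, f1(p, r));
    }

    public static void main(String[] args) {
        System.out.println(format(607, 777, 964)); // prints 0.7812|0.6297|0.6973
        System.out.println(format(612, 786, 964)); // prints 0.7786|0.6349|0.6994
    }
}
```

The two printed rows correspond to the NW (All) baseline and the NW + RW_1To1 default setting.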

  • Candidates (check on candidates):
    Function      Min. Length  Min. WC  Raw data     Performance (precision|recall|F1)
    NW (All)      N/A          N/A      607|777|964  0.7812|0.6297|0.6973
    NW + RW_1To1  1            1        612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  3            1        612|787|964  0.7776|0.6349|0.6990
    NW + RW_1To1  4            1        612|785|964  0.7796|0.6349|0.6998
    NW + RW_1To1  5            1        612|785|964  0.7796|0.6349|0.6998
    NW + RW_1To1  6            1        609|779|964  0.7818|0.6317|0.6988
    NW + RW_1To1  7            1        608|778|964  0.7815|0.6307|0.6980
    NW + RW_1To1  2            1        612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  2            10       612|787|964  0.7776|0.6349|0.6990
    NW + RW_1To1  2            100      612|791|964  0.7737|0.6349|0.6974
    NW + RW_1To1  2            1000     611|791|964  0.7724|0.6338|0.6963
    NW + RW_1To1  2            10000    608|782|964  0.7775|0.6307|0.6964
    • Candidate Min. length:
      • Increasing it gives better precision and worse recall.
      • Once it passes a threshold, both recall and precision drop.
      • Best F1 when it is 4-5, because all TPs have length >= 4 (see example below).
      • This number must be coordinated with the min. focus length.
      • Choose 2 (a candidate of length 1 is not a valid candidate).
    • Candidate Min. WC:
      • Increasing it gives better precision and worse recall.
      • Choose 1 (corrections might have a small WC).

  • Rankers - confidence factor for selecting and validating the top candidate:
    Function      C Factor  C Score                 F Score            Raw data     Performance (precision|recall|F1)
    NW (All)      N/A       N/A                     N/A                607|777|964  0.7812|0.6297|0.6973
    NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  0.01      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|789|964  0.7757|0.6349|0.6982
    NW + RW_1To1  0.10      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|813|964  0.7528|0.6349|0.6888
    NW + RW_1To1  0.50      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|998|964  0.6132|0.6349|0.6239
    NW + RW_1To1  0.00      0.01|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
    NW + RW_1To1  0.00      0.10|0.00|0.085|-0.085  0.035|0.0006|0.02  612|786|964  0.7786|0.6349|0.6994
    ... TBD ...
    • Confidence Factor:
      • A very strict restriction on the confidence factor is needed to eliminate FPs.
      • Set the C factor to 0.00 (the top candidate is only valid when the focus token has a negative score and the top candidate has a positive score).

III. Observations from Development test set (F1 = 0.6994)

  • [TP] real-word 1-To-1 corrections:
    ID    Source  Detected Words  Corrected Word  Text                                                                        Notes
    TP-1  11225   weather         whether         from one Person to another. Weather it can happen or
    TP-2  11597   bowl            bowel           irregular bowl movements.
    TP-3  12748   effect          affect          what is TSD/Clubfoot, and how does it effect a baby
    TP-4  13922   their           there           in the Chicago area hospitals is their a surgeon familiar with the shoudice
    TP-5  17713   small           smell           lost ability to taste and small, and who is profoundly depressed            smell size
    Example: smell vs. small

    • taste and small, foul small, bad small, small an odor, sense of small
    • smell size, smell amounts, a smell sip of water, smeller amounts, smell intestine

  • [FP] real-word 1-To-1:
    ID    Source  Detected Word  Corrected Word  Text
    FP-1  10349   please         place           ...give me good advice please
    FP-3  18855   head           had             ... backalso inner head pain.com
    FP-4  2       causes         cases           What are some causes of anorexia
    • FP-3: Corpus has more "also and had" than "inner head"
    • FP-4: With "are" added in front of "some causes of anorexia", "causes" is corrected to "cases". However, "What are some causes of pain" and "What are causes of anorexia" are OK.

  • [FN] real-word 1-To-1:
    ID     Source  Focus Words  Corrected Word  Text
    TP-1   32      then         than
    TP-2   51      thing        think
    TP-3   10138   know         now
    TP-4   10375   tried        tired
    TP-5   10934   specially    especially
    TP-6   11186   repot        report
    TP-7   11378   then         than            Is Radioiodine treatment better then surgery for me?
    TP-8   16734   weather      whether         I was particularly interested in learning weather parents should be worried about cribs death
    TP-9   12286   lesson       lessen          What can I do to lesson the severity of the adema
    TP-10  12757   pregnancy    pregnant
    TP-11  12788   leave        live
    TP-12  15759   tent         tend
    TP-13  16256   access       excess
    TP-14  16297   loosing      losing
    • TP-9: "lesson" is not in the word2Vec corpus.
      => Only "lessons" is in the corpus. Maybe use inflVars for detection.
      => Need a much bigger corpus for word2Vec.
      => word2vec is very good on precision. However, the training corpus has to include the relevant information (words and their contexts).