CSpell

Non-word Split

I. Introduction

This page describes the processes for non-word split detection and correction.

II. Processes

  • Detector:
    NonWordDetector.java
    • non-word: an invalid word, i.e. a token that is not in checkDic (checkDic includes EW, NUM, etc.)
    • not an exception. Exceptions are: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
  • Candidates:
    SplitCandidates.java
    • SplitNo <= 5 (configurable: CS_CAN_NW_MAX_SPLIT_NO)
    • is a multiword (in mwDic)
    • each word (unigram) in the candidate is in splitDic; splitDic does not include pure aA, such as "er"
    • no unigram is a digit, unit, etc. (these are already split by the ND splitter)
  • Ranker:
    RankNonWordByMode.java
    The top-ranked candidate from a two-stage ranking system is used for correction:
    • Stage-1:
      • Orthographic score
        • Edit Distance Similarity
        • Phonetic Similarity (Double Metaphone)
        • Overlap Similarity
      • Find the top orthographic score
      • All candidates whose orthographic score is within 0.08 of the top orthographic score are selected as qualified candidates and go to stage-2 for final ranking
      • The ranking by orthographic score from this stage is disregarded in stage-2
    • Stage-2:
      Apply chained comparators to the following scores, in this order:
      • Context Score (Dual embedding Word2Vec)
        • context radius = 2 (configurable, CS_NW_SPLIT_CONTEXT_RADIUS)
          This value is not used/implemented in CSpell because CSpell combines the non-word split and 1-to-1 correction modules.
        • topScore != 0
      • Noisy Channel Score
  • Corrector:
    SplitCorrector.java
    • Update the focus token with the top-ranked split candidate
    • FlatMap the split words into inTokenlist
    • Update process history to non-word-split
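
The two-stage ranking described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the code in RankNonWordByMode.java: the class TwoStageRankSketch, its Candidate type, and all method names are hypothetical; the three scores are assumed to be precomputed, with higher values being better; and the topScore != 0 condition on the context score is not modeled. The sample scores in main are made up for the demonstration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of a two-stage ranker; not the CSpell source code.
class TwoStageRankSketch {

    // A split candidate with precomputed scores (assumption: higher is better).
    static class Candidate {
        final String text;
        final double orthographic;
        final double context;
        final double noisyChannel;

        Candidate(String text, double orthographic, double context, double noisyChannel) {
            this.text = text;
            this.orthographic = orthographic;
            this.context = context;
            this.noisyChannel = noisyChannel;
        }
    }

    // Width of the stage-1 window below the top orthographic score (0.08 in the text above).
    static final double ORTHO_WINDOW = 0.08;

    // Stage-1: keep every candidate whose orthographic score is within the window of the top score.
    static List<Candidate> stage1(List<Candidate> candidates) {
        double top = Double.NEGATIVE_INFINITY;
        for (Candidate c : candidates) {
            top = Math.max(top, c.orthographic);
        }
        List<Candidate> qualified = new ArrayList<>();
        for (Candidate c : candidates) {
            if (top - c.orthographic <= ORTHO_WINDOW) {
                qualified.add(c);
            }
        }
        return qualified;
    }

    // Stage-2: chained comparators, context score first, then noisy channel score.
    // The orthographic score is deliberately ignored here.
    static Candidate stage2(List<Candidate> qualified) {
        Comparator<Candidate> chain = Comparator
            .comparingDouble((Candidate c) -> c.context)
            .thenComparingDouble(c -> c.noisyChannel);
        return qualified.stream().max(chain).orElse(null);
    }

    // Full ranking: returns the top candidate, or null when there are no candidates.
    static Candidate rank(List<Candidate> candidates) {
        return stage2(stage1(candidates));
    }

    public static void main(String[] args) {
        // Made-up scores, for illustration only.
        List<Candidate> candidates = Arrays.asList(
            new Candidate("A", 0.95, 0.10, 0.50),
            new Candidate("B", 0.90, 0.40, 0.20),
            new Candidate("C", 0.80, 0.90, 0.90));
        System.out.println(rank(candidates).text); // prints "B"
    }
}
```

In this example, candidate C has the best context score but is dropped in stage-1 because its orthographic score is more than 0.08 below the top score. B then beats A in stage-2 on context score, even though A has the higher orthographic score.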

III. Development Test

  • True-Positive Non-word Split:
    Id     Source  Original Word      Split Word
    TP-1   10225   aftercareemail     aftercare email
    TP-2   10225   facebookshare      facebook share
    TP-3   10225   friendshare        friend share
    TP-4   12616   leftside           left side
    TP-5   13090   viceversa          vice versa
    TP-6   13509   inthis             in this
    TP-7   14849   shuntfrom2007.How  shunt from 2007. How
    TP-8   14849   oftendo            often do
    TP-9   14      knowabout          know about
    TP-10  16928   thankyou           thank you
    TP-11  17942   everytime          every time
    TP-12  18175   ofcourse           of course
    TP-13  18611   aquestion          a question
    TP-14  18855   backalso           back also
    TP-15  26      diseaseany         disease any
    TP-16  7       saythis            say this
    TP-17  88      ilost              i lost
    • TP-7 involved splitter operations from both ND and NW:
      • Input: shuntfrom2007.How
      • ND: shuntfrom 2007. How
      • NW: shunt from 2007. How
  • False-Positive Non-word Split:
    Id     Source  Original Word      Split Word          Correct Words
    FP-1   12235   counterindicative  counter indicative  contraindicated
    FP-2   12271   earthmovers        earth movers        earthmovers
    FP-3   13014   orthopaedician     orthopaedic ian     orthopaedician
    FP-4   13165   iam                i am                iam (error?)
    FP-5   13922   shoudice           shou dice           shouldice
    FP-6   1       nonething          none thing          nothing
    FP-7   4       disear             dis ear             disease
    FP-8   61      metoptic           met optic           metopic
    FP-9   7       chromezone         chrome zone         chromosome
    FP-10  12574   biletan            bile tan            biletan
    • FP-6, 7: too far away
    • FP-4: error in the goldStd set
    • FP-2, 3, 5, 10: need more coverage in the corpus and dictionary
  • False-Negative Non-word Split:
    Id     Source  Original Word       Corrected Word       Correct Word
    FN-1   10025   u-creatinine        creatinine           urine creatinine
    FN-2   11186   tbinthe             tbinthe              tb in the
    FN-3   11243   menimgtisneef       menimgtisneef        meningitis needs
    FN-4   12271   area!unfortionatly  area! unfortionatly  area! unfortunately
    FN-5   12616   camedown            came down            came down
    FN-6   14514   ihave               have                 i have
    FN-7   14      alot                alot                 a lot
    FN-8   16519   eye-doctor          eye-doctor           eye doctor
    FN-9   18203   pthrpeptide         pthrpeptide          pthr peptide
    FN-10  88      polipsremoved       polipsremoved        polyps removed
    • FN-1, 3, 4, 9, 10: multiple operations involved (not in the design scope)
    • FN-2: TB was not in the split dictionary
    • FN-5, 6, 7: need further investigation. It may help to separate Split and 1-To-1 into two classes in NW.
    • FN-8: spVars