Because of a lapse in government funding, the information on this website may not be up to date, transactions submitted via the website may not be processed, and the agency may not be able to respond to inquiries until appropriations are enacted. The NIH Clinical Center (the research hospital of NIH) is open. For more details about its operating status, please visit cc.nih.gov. Updates regarding government operating status and resumption of normal operations can be found at OPM.gov

CSpell

Ensemble Algorithm

The high level algorithm of ensemble method for spelling correction are described as follows.

I. Source code:
LinearWeightedEnsembleSpellCorrection.java

II. Algorithm

  • Input text: read in text of the whole question
  • Converted to List<Span> processSpans: remove header, such as SUBJECT:, EMAIL:, etc.
  • Converted to fixed: preProcessed text to handle contractions, informational expression, puntuaction, split digits, etc.

  • Converted to List<CoreMap> sentences: use CoreNLP for annotation, treat the whole text as 1 sentence
  • Converted to List<CoreLabel> tokenAnns: Token separated by space and punctuation (NLPCore)
  • ProcessTokens to get:
    • List<String> origTokens: Separated by space and period (end of sentences) only.
    • List<String> modTokens: Tag [MUM] and others
    • List&Integer> begins: the beginning position of modToken in the origTokens list
    • List&Integer> positions: the index of modToken in the origTokens list
    • List&Integer> origPositions: the beginning position of origToken in the origTokens list

  • correct to get corrected text:
    Go through each origToken

    • skip token if it is "clinicaltrials", [no letter token], [leading number], [ProNoun]

    • Get Candidates:
      • LinkedHashSet<String> suggestions: single word suggestions
      • Map<String,String> mergeSuggestions: merge suggestions, key: merge suggestion, value: before merge tokens

      • Real-Word:
        token in the dictionary & selected real-word option
        • token.length( ) > 3
        • mergeSuggestion: higher than frequency and use word Vector
        • candidates with higher frequency threshold and merge suggestion
      • Non-Word:
        token not in the dictionary & selected real-word option
        • token.length( ) > 2
        • mergeSuggestion: no frequency check and not use word Vector
        • candidates with no frequency check and not use word Vector

    • Rank Candidates:
      Find the candidate with highest score
      • Real-Word:
        Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.6
      • Non-Word:
        Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.2 * (overlapScore + phoneticScore + edScore)

      Where:

      ScoreSource CodeNotes
      edScoreDictionaryBasedSpellChecker.getEditSimScore( )
      • Get cost from Jazzy Edit Distance:
        • Delete: 95
        • Insert: 95
        • Substitute: 100
        • Transpose: 90
        • Case: 10
      • Add split penalty: 95
      • Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
      phoneticScoreDictionaryBasedSpellChecker.getPhoneticSimScore( )
      • Get Metaphone codes of src/tar from Jazzy double Metaphone
      • Get cost of codes from Jazzy edit distance
      • Get penalty cost of src/tar
      • Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
      overlapScoreOverLapUtil.leadTrailOverlap( )
      corpusScoreCorpusFrequencyCounts.getUnigramScore( )
      w2vScoreWord2Vector.getSimScore( )