CSpell

Ensemble Algorithm

The high level algorithm of ensemble method for spelling correction are described as follows.

I. Source code:
LinearWeightedEnsembleSpellCorrection.java

II. Algorithm

Input text: read in text of the whole question
Converted to List<Span> processSpans: remove header, such as SUBJECT:, EMAIL:, etc.
Converted to fixed: preProcessed text to handle contractions, informational expression, puntuaction, split digits, etc.
Converted to List<CoreMap> sentences: use CoreNLP for annotation, treat the whole text as 1 sentence
Converted to List<CoreLabel> tokenAnns: Token separated by space and punctuation (NLPCore)
ProcessTokens to get:
- List<String> origTokens: Separated by space and period (end of sentences) only.
- List<String> modTokens: Tag [MUM] and others
- List&Integer> begins: the beginning position of modToken in the origTokens list
- List&Integer> positions: the index of modToken in the origTokens list
- List&Integer> origPositions: the beginning position of origToken in the origTokens list

correct to get corrected text:
Go through each origToken

skip token if it is "clinicaltrials", [no letter token], [leading number], [ProNoun]
Get Candidates:
- LinkedHashSet<String> suggestions: single word suggestions
- Map<String,String> mergeSuggestions: merge suggestions, key: merge suggestion, value: before merge tokens
- Real-Word:
  token in the dictionary & selected real-word option
  - token.length( ) > 3
  - mergeSuggestion: higher than frequency and use word Vector
  - candidates with higher frequency threshold and merge suggestion
- Non-Word:
  token not in the dictionary & selected real-word option
  - token.length( ) > 2
  - mergeSuggestion: no frequency check and not use word Vector
  - candidates with no frequency check and not use word Vector

Rank Candidates:
Find the candidate with highest score

Real-Word:
Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.6
Non-Word:
Score = 0.15 * w2vScore + 0.25 * corpusScore + 0.2 * (overlapScore + phoneticScore + edScore)

Where:

Score	Source Code	Notes
edScore	`DictionaryBasedSpellChecker.getEditSimScore( )`	Get cost from Jazzy Edit Distance: Delete: 95 Insert: 95 Substitute: 100 Transpose: 90 Case: 10 Add split penalty: 95 Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
phoneticScore	`DictionaryBasedSpellChecker.getPhoneticSimScore( )`	Get Metaphone codes of src/tar from Jazzy double Metaphone Get cost of codes from Jazzy edit distance Get penalty cost of src/tar Normalized score: 0.0 ~ 1.0 (= (1000-cost)/1000)
overlapScore	`OverLapUtil.leadTrailOverlap( )`
corpusScore	`CorpusFrequencyCounts.getUnigramScore( )`
w2vScore	`Word2Vector.getSimScore( )`