Ensemble Algorithm
The high-level algorithm of the ensemble method for spelling correction is described as follows.
I. Source code:
LinearWeightedEnsembleSpellCorrection.java
II. Algorithm
text
: read in text of the whole question
List<Span> processSpans
: remove header, such as SUBJECT:, EMAIL:, etc.
fixed
: preprocessed text, handling contractions, informal expressions, punctuation, split digits, etc.
List<CoreMap> sentences
: use CoreNLP for annotation, treat the whole text as 1 sentence
List<CoreLabel> tokenAnns
: Tokens separated by space and punctuation (CoreNLP)
ProcessTokens
to get:
List<String> origTokens
: Separated by space and period (end of sentences) only.
List<String> modTokens
: Tag [MUM] and others
List<Integer> begins
: the beginning position of modToken in the origTokens list
List<Integer> positions
: the index of modToken in the origTokens list
List<Integer> origPositions
: the beginning position of origToken in the origTokens list
correct
to get corrected text:
LinkedHashSet<String> suggestions
: single word suggestions
Map<String,String> mergeSuggestions
: merge suggestions, key: merge suggestion, value: before merge tokens
The component scores come from the following sources:
Score | Source Code | Notes |
---|---|---|
edScore | DictionaryBasedSpellChecker.getEditSimScore( ) | |
phoneticScore | DictionaryBasedSpellChecker.getPhoneticSimScore( ) | |
overlapScore | OverLapUtil.leadTrailOverlap( ) | |
corpusScore | CorpusFrequencyCounts.getUnigramScore( ) | |
w2vScore | Word2Vector.getSimScore( ) | |
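The source file name, LinearWeightedEnsembleSpellCorrection.java, suggests that the five scores in the table are combined as a weighted sum and that candidates are ranked by the result. The following is a minimal sketch of that idea only, not the actual implementation. The class name, the method names, the weight values, and the example scores are all hypothetical; this page does not state the real weights or the score ranges.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LinearWeightedScoreSketch {

    // Hypothetical weights, in the order:
    // edScore, phoneticScore, overlapScore, corpusScore, w2vScore.
    // The real values are not given in this document.
    public static final double[] ASSUMED_WEIGHTS = {0.25, 0.25, 0.25, 0.125, 0.125};

    /** Linear weighted combination: sum of weights[i] * scores[i]. */
    public static double combine(double[] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("scores and weights must have the same length");
        }
        double total = 0.0;
        for (int i = 0; i < scores.length; i++) {
            total += weights[i] * scores[i];
        }
        return total;
    }

    /** Returns the candidate with the highest combined score, or null if there are none. */
    public static String best(Map<String, double[]> candidates, double[] weights) {
        String bestWord = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, double[]> entry : candidates.entrySet()) {
            double score = combine(entry.getValue(), weights);
            if (score > bestScore) {
                bestScore = score;
                bestWord = entry.getKey();
            }
        }
        return bestWord;
    }

    public static void main(String[] args) {
        // Made-up component scores for two candidate corrections of one misspelled token.
        Map<String, double[]> candidates = new LinkedHashMap<>();
        candidates.put("diabetes", new double[] {0.9, 1.0, 0.8, 0.6, 0.7});
        candidates.put("diabetic", new double[] {0.7, 0.6, 0.8, 0.4, 0.5});
        System.out.println(best(candidates, ASSUMED_WEIGHTS));
    }
}
```

In this sketch, ties keep the first candidate in insertion order, which is why a LinkedHashMap is used. Whether the real code breaks ties this way is not documented here.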