Context Score
Introduction
This page describes the ranking algorithm that uses context to choose the correct word from the candidates suggested for a misspelled word. There are two major approaches:
In CSpell, we chose the Continuous Bag of Words (CBOW) model in word2vec to rank candidates because CBOW is designed to predict a word from a surrounding context.
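As a rough illustration of how a CBOW-style context score could be computed, the sketch below averages the input-matrix vectors of the context words and compares that average with a candidate's output-matrix vector using cosine similarity. The function names, the toy two-dimensional vectors, and the handling of out-of-vocabulary words are assumptions made for this example; it is not CSpell's actual code.

```python
import math


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def context_score(candidate, context_words, input_matrix, output_matrix):
    """Score a candidate against its context (hypothetical helper).

    input_matrix and output_matrix map a word to its vector.
    Assumption: words missing from a matrix are skipped, and a candidate
    with no usable vectors scores 0.0.
    """
    context_vectors = [input_matrix[w] for w in context_words if w in input_matrix]
    if not context_vectors or candidate not in output_matrix:
        return 0.0
    return cosine(average(context_vectors), output_matrix[candidate])


# Toy vectors, invented for illustration only.
input_matrix = {"blood": [1.0, 0.0], "pressure": [0.8, 0.2]}
output_matrix = {"high": [0.9, 0.1], "sigh": [-0.5, 0.5]}

context = ["blood", "pressure"]
score_high = context_score("high", context, input_matrix, output_matrix)
score_sigh = context_score("sigh", context, input_matrix, output_matrix)
score_unknown = context_score("hgih", context, input_matrix, output_matrix)
```

With these toy vectors, "high" scores higher than "sigh" for the context "blood pressure", so it would be preferred as the correction.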
Components
${PRE_PROCESS}/RunCorpus
3
4
6
(Best)
shell> ${DEV}/DL/word2vec/word2vec/word2vec2 -train ${IN_FILE} -outsyn0 ${SYN_0_FILE} -outsyn1 ${SYN_1_FILE} -outsyn1neg ${SYN_1N_FILE} -size 200 -window 5 -cbow 1 -hs 1 -threads 12
Source Code:
Tests:
Test Case | Software | Data (Word Vec) | Score Methods | Performance (counts; precision / recall / F1) | Notes |
---|---|---|---|---|---|
Baseline | Baseline | | Cosine | 358 / 807 / 774; 0.4436 / 0.4625 / 0.4529 | Baseline |
2-1.c.cos.b | CSpell | Baseline | Cosine: [IM] | 484 / 771 / 774; 0.6278 / 0.6253 / 0.6265 | |
2-2.c.cos.0 | CSpell | Health Corpora | Cosine: [IM] | 443 / 770 / 774; 0.5753 / 0.5724 / 0.5738 | Baseline of new corpus |
2-3.c.cbow.0-1 | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1; use only positive scores | 406 / 678 / 774; 0.5988 / 0.5245 / 0.5592 | Not used; use syn1neg instead |
2-4.c.cbow.0-1n.+0- | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; use only positive (+) scores | 429 / 524 / 774; 0.8187 / 0.5543 / 0.6610 | |
2-5.c.cbow.0-1n.+-0!= | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; rank by +, -, 0 | 505 / 748 / 774; 0.6751 / 0.6525 / 0.6636 | |
2-6.c.cbow.0-1n.+0-!= | CSpell | Health Corpora | CBOW: [IM] & [OM], syn1neg; use +, - (only if no +) scores | 445 / 554 / 774; 0.8032 / 0.5749 / 0.6702 | |
2-9.c.cbow.0-1n.+0-!=.cos | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 446 / 554 / 774; 0.8051 / 0.5762 / 0.6717 | Best (10% improvement) |
2-10.c.cbow.0-1n.+0-!=.cos + fixed LC on W2V | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 457 / 562 / 774; 0.8231 / 0.5904 / 0.6841 | Best (11% improvement) |
Final | CSpell | Health Corpora | CBOW cos: [IM] & [OM], syn1neg; *use +, - (only if no +) scores | 458 / 564 / 774; 0.8121 / 0.5917 / 0.6846 | Best (11% improvement) |
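The Performance figures can be reproduced with a short calculation. The sketch below assumes the three counts in each row are, in order, the number of correct corrections, the number of corrections returned, and the number of corrections in the gold standard; that reading is inferred from the ratios in the table rather than stated on this page, and the function name `prf` is hypothetical.

```python
def prf(correct, returned, relevant):
    """Return (precision, recall, F1) from three counts.

    Assumed meaning of the counts: correct = true positives,
    returned = all corrections made, relevant = gold-standard corrections.
    """
    precision = correct / returned if returned else 0.0
    recall = correct / relevant if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Counts from the "Final" row of the table.
p, r, f1 = prf(458, 564, 774)
print(f"{p:.4f} {r:.4f} {f1:.4f}")  # prints "0.8121 0.5917 0.6846"
```

The same function applied to the Baseline counts (358, 807, 774) gives 0.4436, 0.4625, and 0.4529, matching the first row.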
* Word2Vec Score Algorithm:
Word2VecScore.java: uses the cosine similarity score
ContextScoreComparator.java: sorts the context scores
RankByContext.java:
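The rule "use +, - (only if no +) scores" from the table can be sketched as follows. This is an interpretation rather than the logic in RankByContext.java: the function name is hypothetical, and the treatment of zero scores (dropped here as carrying no evidence) is an assumption, since this page does not say how they are handled.

```python
def rank_by_context(scores):
    """Rank candidates by context score, highest first.

    scores maps each candidate word to its context score.
    Positive scores are used when any exist; negative scores are used
    only as a fallback. Assumption: zero scores are ignored.
    """
    positives = {c: s for c, s in scores.items() if s > 0}
    negatives = {c: s for c, s in scores.items() if s < 0}
    pool = positives or negatives  # fall back to negatives only if no positives
    return sorted(pool, key=lambda c: pool[c], reverse=True)


# Invented scores for illustration.
mixed = rank_by_context({"high": 0.9, "sigh": -0.4, "nigh": 0.2})
only_negative = rank_by_context({"sigh": -0.4, "thigh": -0.1})
```

Here `mixed` is `["high", "nigh"]` because the negative candidate is dropped once a positive one exists, while `only_negative` is `["thigh", "sigh"]` because the least negative score ranks first in the fallback.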