CSpell

Non-word Spelling (1-To-1)

I. Introduction

This page describes the processes for non-word spelling (1-to-1) detection and correction.

II. Processes

Detector:
NonWordDetector.java
- non-word: invalid word, not in checkDic (checkDic includes EW, NUM, etc.)
- Not exceptions: digit, punctuation, digit/punctuation, email, url, empty string, upperCase, 1Char, measurement
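
The detection rule above can be sketched as follows. This is a minimal illustration, not the code of NonWordDetector.java: the class name, method names, and regular expressions are hypothetical, the measurement exception is omitted, and lowercasing the token before the dictionary lookup is an assumption.

```java
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Illustrative sketch only; this is not the code of NonWordDetector.java.
final class NonWordDetectorSketch {
    // Assumed, simplified patterns for some of the listed exception types.
    private static final Pattern DIGIT_PUNCT = Pattern.compile("[\\p{Punct}\\d]+");
    private static final Pattern EMAIL = Pattern.compile("[\\w.+-]+@[\\w-]+(\\.[\\w-]+)+");
    private static final Pattern URL = Pattern.compile("(?i)(https?://|www\\.)\\S+");

    // True if the token falls under an exception type and is never flagged.
    static boolean isException(String token) {
        return token.isEmpty()                                // empty string
            || token.length() == 1                            // 1Char
            || DIGIT_PUNCT.matcher(token).matches()           // digit, punctuation, digit/punctuation
            || EMAIL.matcher(token).matches()                 // email
            || URL.matcher(token).matches()                   // url
            || token.equals(token.toUpperCase(Locale.ROOT));  // upperCase
    }

    // A non-word is a token that is not an exception and is not in checkDic.
    // Lowercasing before the lookup is an assumption of this sketch.
    static boolean isNonWord(String token, Set<String> checkDic) {
        if (isException(token)) {
            return false;
        }
        return !checkDic.contains(token.toLowerCase(Locale.ROOT));
    }
}
```

With a checkDic that contains "knowledge", the sketch flags "knoledge" as a non-word and leaves "B", "NIH", and "12.5" alone because they match an exception type.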
Candidates:
OneToOneCandidates.java
- max. length of word <= 25 (configurable: CS_CAN_NW_1TO1_WORD_MAX_LENGTH)
  Longer non-words generate too many candidates, which slows processing. This limit keeps candidate generation fast. Recall might decrease if the value is set too small.
- Edit Dist <= 2
- candidate is in the suggDic (valid word)
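
The three candidate conditions above can be sketched as a filter over the suggestion dictionary. This is a simplified illustration under stated assumptions, not the code of OneToOneCandidates.java: it scans the dictionary and uses plain Levenshtein distance (a transposition counts as two edits), which may differ from how CSpell generates and measures edits. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Illustrative sketch only; this is not the code of OneToOneCandidates.java.
final class OneToOneCandidatesSketch {
    static final int MAX_WORD_LENGTH = 25; // mirrors CS_CAN_NW_1TO1_WORD_MAX_LENGTH
    static final int MAX_EDIT_DIST = 2;

    // Plain Levenshtein distance (insert, delete, substitute), two-row DP.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns suggDic words within MAX_EDIT_DIST of the non-word.
    // Non-words longer than MAX_WORD_LENGTH get no candidates.
    static List<String> candidates(String nonWord, Collection<String> suggDic) {
        List<String> out = new ArrayList<>();
        if (nonWord.length() > MAX_WORD_LENGTH) {
            return out;
        }
        for (String word : suggDic) {
            // A length gap above the limit already implies too many edits.
            if (Math.abs(word.length() - nonWord.length()) > MAX_EDIT_DIST) {
                continue;
            }
            if (editDistance(nonWord, word) <= MAX_EDIT_DIST) {
                out.add(word);
            }
        }
        return out;
    }
}
```

For example, "knoledge" is one edit from "knowledge", and "diagnost" is two edits from "diagnosed", so both corrections fall inside the candidate set when those words are in the suggDic.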

Ranker:
RankNonWordByMode.java,
uses the top-ranked candidate from the two-stage ranking system for correction:

Stage-1:
- Orthographic score
  - Edit Distance Similarity score
  - Phonetic Similarity score (Double Metaphone)
  - Overlap Similarity score
- Find the top orthographic score
- Stage-1 range factor for qualifying candidates = 0.08 (configurable: CS_RANKER_NW_S1_RANK_RANGE_FAC)
  All candidates whose orthographic score is within 0.08 of the top orthographic score are selected as qualified candidates and go to stage-2 for final ranking. That is, candidates scoring at least 92% of the highest candidate's orthographic score qualify for stage-2 ranking.
- The ranks by orthographic score in this stage are disregarded in stage-2
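
The stage-1 qualification step can be sketched as below. The sketch assumes the range factor is applied relative to the top score (cutoff = top score x (1 - 0.08)), following the "92%" reading of the description; the actual RankNonWordByMode.java may compute the range differently. The class and method names are hypothetical, and the scores in the usage note are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this is not the code of RankNonWordByMode.java.
final class StageOneFilterSketch {
    static final double RANK_RANGE_FAC = 0.08; // mirrors CS_RANKER_NW_S1_RANK_RANGE_FAC

    // Keeps every candidate whose orthographic score reaches the cutoff.
    // Assumption: the cutoff is relative, topScore * (1 - rangeFac).
    // The result carries no order from stage-1; stage-2 re-ranks it.
    static List<String> qualify(Map<String, Double> orthoScores, double rangeFac) {
        double top = Double.NEGATIVE_INFINITY;
        for (double score : orthoScores.values()) {
            top = Math.max(top, score);
        }
        double cutoff = top * (1.0 - rangeFac);
        List<String> qualified = new ArrayList<>();
        for (Map.Entry<String, Double> e : orthoScores.entrySet()) {
            if (e.getValue() >= cutoff) {
                qualified.add(e.getKey());
            }
        }
        return qualified;
    }
}
```

With invented scores of 2.90, 2.80, and 2.50, the cutoff is 2.668, so the first two candidates go to stage-2 and the third is dropped.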
Stage-2:
Use chained comparators, applied sequentially to the following scores:
- Context Score (Dual embedding Word2Vec CBOW)
  - context radius = 2 (configurable, CS_NW_1TO1_CONTEXT_RADIUS)
  - topScore != 0
- Noisy Channel Score
- Select the top-ranked candidate from the stage-2 ranking
  The orthographic score of the top candidate must be >= 2.70 (configurable: CS_RANKER_NW_S1_MIN_OSCORE)
- See cSpell ranker for the details
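
The stage-2 comparator chain can be sketched with java.util.Comparator. The Candidate class, its field names, and the method names are hypothetical, and the scores are assumed to be precomputed. The sketch reads "topScore != 0" as "use the context score only when the best context score among the candidates is nonzero", which is an assumption; see the cSpell ranker for the actual rules.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Illustrative sketch only; this is not the code of RankNonWordByMode.java.
final class StageTwoRankerSketch {
    static final double MIN_OSCORE = 2.70; // mirrors CS_RANKER_NW_S1_MIN_OSCORE

    // Hypothetical holder for a candidate and its precomputed scores.
    static final class Candidate {
        final String word;
        final double orthoScore;
        final double contextScore;
        final double noisyChannelScore;

        Candidate(String word, double orthoScore, double contextScore,
                  double noisyChannelScore) {
            this.word = word;
            this.orthoScore = orthoScore;
            this.contextScore = contextScore;
            this.noisyChannelScore = noisyChannelScore;
        }
    }

    // Returns the stage-2 winner, or empty if there is no candidate or the
    // winner's orthographic score is below MIN_OSCORE.
    static Optional<Candidate> best(List<Candidate> qualified) {
        if (qualified.isEmpty()) {
            return Optional.empty();
        }
        Comparator<Candidate> byContext =
            Comparator.comparingDouble((Candidate c) -> c.contextScore);
        Comparator<Candidate> byNoisyChannel =
            Comparator.comparingDouble((Candidate c) -> c.noisyChannelScore);

        // Assumption: skip the context comparator when the top context score is 0.
        double topContext = qualified.stream()
            .mapToDouble(c -> c.contextScore).max().orElse(0.0);
        Comparator<Candidate> chain = (topContext != 0.0)
            ? byContext.thenComparing(byNoisyChannel)
            : byNoisyChannel;

        Candidate top = qualified.stream().max(chain).get();
        return top.orthoScore >= MIN_OSCORE ? Optional.of(top) : Optional.empty();
    }
}
```

Chaining with thenComparing keeps each score in its own comparator, so a later score only breaks ties left by the earlier one.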
Corrector:
OneToOneCorrector.java
- Update the focus token with the top-ranked candidate
- Update process history to non-word-1-to-1

III. Development Test

True-Positive non-word 1-to-1:

Id	Source	Original Word	Corrected Word
TP-1	10023	knoledge	knowledge
TP-2	10040	truely	truly
TP-3	10475	diagnost	diagnosed
TP-4	6	diagnosised	diagnosed
...	...	...	...

False-Positive non-word 1-to-1:

Id	Source	Original Word	Corrected Word	Correct Word
FP-1	10058	B	be	B
FP-2	10084	i.e.	ice.	i.e.
FP-3	11144	clancy	chancy	clumsy
FP-4	11588	baging	bagging	begging
...	...	...	...	...

False-Negative non-word 1-to-1:

Id	Source	Original Word	Corrected Word	Correct Word
FN-1	10285	hitiala	hitiala	hiatal
FN-2	10714	havy	have	heavy
FN-3	10	ewings	ewings	ewing's
FN-4	11144	traumatologo	traumatologo	traumatologist
FN-5	11186	segmens	segment	segments