CSpell

Dictionary Functions - Check Proper Noun

I. Introduction

Proper nouns should be checked separately from other words when detecting spelling errors, to improve spell-checking performance. Proper nouns can appear in several case patterns, as shown in the table below.

Capitalized	Aachen, Beyer, Colgate
Mixed Cases	zur Hausen, ABC Medical Center, al-Tawil
lower case	amicon, coll, dang
upper case	BCDE, BSMMU, CINAHL
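
The four case patterns in the table can be told apart with a few string tests. The following is a minimal sketch, not CSpell's own code: the function name `case_pattern` and the exact rules (for example, treating "Capitalized" as one upper-case letter followed only by lower-case letters) are assumptions made for illustration.

```python
def case_pattern(term: str) -> str:
    """Classify a term into one of the four case patterns (assumed rules)."""
    if term.islower():
        return "lower case"
    if term.isupper():
        return "upper case"
    # Assumption: "Capitalized" means an upper-case first letter
    # and no other upper-case letters.
    if term[:1].isupper() and term[1:].islower():
        return "Capitalized"
    return "Mixed Cases"


for term in ("Aachen", "zur Hausen", "al-Tawil", "amicon", "CINAHL"):
    print(term, "->", case_pattern(term))
```

Under these rules, multiword terms such as "ABC Medical Center" fall into the mixed-case group because they combine upper-case and lower-case letters beyond the first character.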

II. Approaches

Three approaches are compared as follows:

By Algorithm:
- As implemented in the baseline, proper nouns are detected by algorithm:
  - Capitalized case
By Data - case-sensitive:
- Use proper nouns from the Lexicon
- Use a case-sensitive dictionary
By Data - case-insensitive:
- Use proper nouns from the Lexicon
- Use a case-insensitive dictionary
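
The three approaches can be sketched as three small predicates. This is an illustrative sketch only: the function names, the tiny sample set standing in for the Lexicon's proper nouns, and the precise definition of the baseline "Capitalized case" rule are all assumptions, not the CSpell implementation.

```python
# Hypothetical sample; the real proper noun list comes from the Lexicon.
LEXICON_PROPER_NOUNS = {"Aachen", "Prego", "Thier", "Veracruz", "Gujarat", "CINAHL"}
# Lower-cased copy used for case-insensitive lookup.
LEXICON_PROPER_NOUNS_LOWER = {noun.lower() for noun in LEXICON_PROPER_NOUNS}


def by_algorithm(token: str) -> bool:
    """Assumed baseline rule: upper-case first letter, lower-case remainder."""
    return len(token) > 1 and token[0].isupper() and token[1:].islower()


def by_data_case_sensitive(token: str) -> bool:
    """Proper noun only if the token matches a Lexicon entry exactly."""
    return token in LEXICON_PROPER_NOUNS


def by_data_case_insensitive(token: str) -> bool:
    """Proper noun if the token matches a Lexicon entry, ignoring case."""
    return token.lower() in LEXICON_PROPER_NOUNS_LOWER


for token in ("Veracruz", "veracruz", "CINAHL", "thier"):
    print(token,
          by_algorithm(token),
          by_data_case_sensitive(token),
          by_data_case_insensitive(token))
```

In this sketch, a lower-case input such as "veracruz" is recognized only by the case-insensitive lookup, while an all-upper-case entry such as "CINAHL" is missed by the capitalization rule but found by both data lookups.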

III. Results

Test results using Single-Word, English-Word as the dictionary:

Approach	TP|Ret|Rel	Precision	Recall	F1
Algorithm	521|710|814	0.7338	0.6400	0.6837
Data-Case	537|755|814	0.7113	0.6597	0.6845
Data-No Case	537|751|814	0.7150	0.6597	0.6863
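
The precision, recall and F1 columns follow from the TP|Ret|Rel counts using the standard definitions: precision = TP/Ret, recall = TP/Rel, and F1 is the harmonic mean of the two. The short sketch below recomputes them from the counts in the table; the helper name `prf` is an assumption introduced here for illustration.

```python
def prf(tp: int, ret: int, rel: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) from true positives, retrieved, relevant."""
    precision = tp / ret
    recall = tp / rel
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the TP|Ret|Rel column of the table.
counts = {
    "Algorithm": (521, 710, 814),
    "Data-Case": (537, 755, 814),
    "Data-No Case": (537, 751, 814),
}

for approach, (tp, ret, rel) in counts.items():
    p, r, f1 = prf(tp, ret, rel)
    print(f"{approach}\t{p:.4f}\t{r:.4f}\t{f1:.4f}")
```

Because both data approaches share the same TP and Rel counts, their recall is identical; only the retrieved count, and therefore precision and F1, differs.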

With the data approaches, F1 and recall increase, while precision decreases.
The [TP] count is the same for the two data approaches (537); the difference in the retrieved count (755 vs. 751) consists of 4 [FP]:
- 14276 prego preg => Prego, no case is not right
- 16167 thier ther => Thier, no case is not right
- 17055 veracruz vera cruz => Veracruz, no case is good
- 17991 gujarat gujar at => Gujarat, no case is good
=> The case-sensitive approach is correct for only about 50% of these cases (2 of 4), and it results in worse precision and F1 than the case-insensitive approach (a 50% rate is below the current precision of roughly 70%). Thus, the data, case-insensitive approach is implemented. One of the main reasons for using the case-insensitive approach is that users (consumers) might type proper nouns in lower case, upper case, or mixed case, so the chance is 50/50.
Using the data, case-sensitive approach could increase recall (by finding more spelling errors), but it would rely on the ranking algorithm to find the correct word in order to improve precision.