CSpell

Dictionary Functions - Check Proper Noun

I. Introduction

Proper nouns should be checked for spelling errors separately from other words to improve spell-checking performance. Proper nouns can appear in several case patterns, including mixed case, as shown in the table below.

Case          Examples
Capitalized   Aachen, Beyer, Colgate
Mixed case    zur Hausen, ABC Medical Center, al-Tawil
Lower case    amicon, coll, dang
Upper case    BCDE, BSMMU, CINAHL
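
As an illustration of the four case patterns in the table, the sketch below labels a term by inspecting its letters. The function name `case_pattern` and the exact rules (for example, treating "capitalized" as one leading upper-case letter followed only by lower-case letters) are assumptions made for this example, not CSpell's actual implementation.

```python
def case_pattern(term: str) -> str:
    """Return the case pattern of a term, using the four labels in the table."""
    letters = [ch for ch in term if ch.isalpha()]
    if not letters:
        return "no letters"
    if all(ch.islower() for ch in letters):
        return "lower case"
    if all(ch.isupper() for ch in letters):
        return "upper case"
    # Assumed rule: one leading upper-case letter, every other letter lower case.
    if term[0].isupper() and all(ch.islower() for ch in letters[1:]):
        return "capitalized"
    return "mixed case"


for term in ["Aachen", "zur Hausen", "ABC Medical Center", "al-Tawil", "amicon", "CINAHL"]:
    print(f"{term}: {case_pattern(term)}")
```

Under these assumed rules, multi-word names such as "ABC Medical Center" fall into the mixed-case group because they contain more than one upper-case letter.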

II. Approaches

Three approaches are compared as follows:

  • By Algorithm:
    • As implemented in baseline, proper nouns are detected by algorithm:
      • Capitalized case
  • By Data - case sensitive:
    • Use proper nouns from Lexicon
    • Use case sensitive dictionary
  • By Data - case insensitive:
    • Use proper nouns from Lexicon
    • Use case-insensitive dictionary
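
The three approaches can be sketched roughly as follows. This is a minimal illustration, not CSpell's code: the function names, the tiny `LEXICON_PROPER_NOUNS` sample, and the capitalization rule used for the algorithm approach are all assumptions.

```python
# Hypothetical lexicon sample; a real run would load proper nouns from the Lexicon.
LEXICON_PROPER_NOUNS = {"Prego", "Thier", "Veracruz", "Gujarat"}
LEXICON_LOWER = {pn.lower() for pn in LEXICON_PROPER_NOUNS}


def is_proper_noun_by_algorithm(token: str) -> bool:
    """Assumed rule: first character is upper case and no later character is."""
    return token[:1].isupper() and not any(ch.isupper() for ch in token[1:])


def is_proper_noun_case_sensitive(token: str) -> bool:
    """Data approach: exact, case-sensitive lookup in the proper-noun list."""
    return token in LEXICON_PROPER_NOUNS


def is_proper_noun_case_insensitive(token: str) -> bool:
    """Data approach: lookup after lower-casing both the token and the list."""
    return token.lower() in LEXICON_LOWER


for token in ["Veracruz", "veracruz", "prego"]:
    print(
        token,
        is_proper_noun_by_algorithm(token),
        is_proper_noun_case_sensitive(token),
        is_proper_noun_case_insensitive(token),
    )
```

In this sketch, a lower-case input such as "veracruz" is accepted as a proper noun only by the case-insensitive lookup; the other two approaches would pass it on to the spelling checker as a possible error.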

III. Results

Test results with Single-Word, English-Word as the dictionary:

Approach       TP|Ret|Rel     Precision   Recall   F1
Algorithm      521|710|814    0.7338      0.6400   0.6837
Data-Case      537|755|814    0.7113      0.6597   0.6845
Data-No Case   537|751|814    0.7150      0.6597   0.6863
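
The scores in the table can be recomputed from the TP|Ret|Rel counts. The sketch below assumes the standard definitions (precision = TP/Ret, recall = TP/Rel, F1 = harmonic mean of the two); the function name `prf` is made up for this example.

```python
def prf(tp: int, ret: int, rel: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from true positives, retrieved, and relevant counts."""
    precision = tp / ret
    recall = tp / rel
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts taken from the results table: (TP, Ret, Rel) per approach.
counts = {
    "Algorithm": (521, 710, 814),
    "Data-Case": (537, 755, 814),
    "Data-No Case": (537, 751, 814),
}

for approach, (tp, ret, rel) in counts.items():
    p, r, f1 = prf(tp, ret, rel)
    print(f"{approach}: P={p:.4f} R={r:.4f} F1={f1:.4f}")
```

Recomputing from the counts gives a recall of 537/814 ≈ 0.6597 for both data approaches, which is consistent with the reported F1 values of 0.6845 and 0.6863.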

  • With the data approaches, F1 and recall increase and precision decreases, compared with the algorithm approach.
  • [TP] is the same for the two data approaches; the difference in retrieved instances is 4 extra [FP] from the case-sensitive approach:
    • 14276 prego preg => Prego: case-insensitive matching is not right
    • 16167 thier ther => Thier: case-insensitive matching is not right
    • 17055 veracruz vera cruz => Veracruz: case-insensitive matching is good
    • 17991 gujarat gujar at => Gujarat: case-insensitive matching is good

    => In these four cases the case-sensitive approach is right only about 50% of the time (2 of 4), and it results in lower precision and F1 than the case-insensitive approach (precision is already above 70%, so retrievals that are only about 50% correct pull it down). Thus, the case-insensitive data approach is implemented. One of the main reasons for using case-insensitive matching is that users (consumers) might type proper nouns in lower case, upper case, or mixed case, so the case of the input is not a reliable signal (roughly a 50/50 chance).
  • Using the case-sensitive data approach could increase recall (by finding more spelling errors), but it would rely on the ranking algorithm to find the correct word in order to improve precision.