CSpell

Performance Tests on Phonetic Similarity Score

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell (Lexicon-based)
  • Corpus: none
  • Ranking: Orthographic ranking

II. Test Results

  • Tests on various phonetic coding systems within the orthographic similarity score ranking.

    ID  Phonetic          Precision  Recall  F1
    11  Double Metaphone  0.7490     0.7519  0.7505
    12  Refined Soundex   0.7332     0.7370  0.7351
    13  Caverphone-2      0.7172     0.7209  0.7191
    14  Metaphone         0.7487     0.7506  0.7497
    15  Metaphone-3       0.7452     0.7481  0.7466
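
The F1 column can be cross-checked against the precision and recall columns. The sketch below assumes F1 is the standard harmonic mean of precision and recall, which the page does not state explicitly; the function name `f1_score` and the `results` dictionary are illustrative only. Because the published precision and recall are already rounded to four decimals, a recomputed value may differ from the table in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) as reported for tests 11-15 in the table above.
results = {
    "Double Metaphone": (0.7490, 0.7519),
    "Refined Soundex": (0.7332, 0.7370),
    "Caverphone-2": (0.7172, 0.7209),
    "Metaphone": (0.7487, 0.7506),
    "Metaphone-3": (0.7452, 0.7481),
}

# Recompute F1 for every phonetic coding system.
recomputed = {name: f1_score(p, r) for name, (p, r) in results.items()}

# The system with the highest recomputed F1.
best = max(recomputed, key=recomputed.get)

for name, value in recomputed.items():
    print(f"{name}: F1 = {value:.4f}")
print("Best:", best)
```

Under this assumption the ranking matches the table, with Double Metaphone ahead of Metaphone by a small margin.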

  • Tests on various weighting factors (WF) for the costs of the edit distance operations (delete, insert, substitute, and transpose), using Double Metaphone (Metaphone 2) in the orthographic similarity score.

    ID    Delete  Insert  Substitute  Transpose  Precision  Recall  F1      Notes
    1     0.95    0.95    0.95        0.95       0.7490     0.7519  0.7505  Same ratio of WF
    2     1.00    0.95    0.95        0.95       0.7349     0.7377  0.7363  Increasing 1 WF
    3     0.95    1.00    0.95        0.95       0.7275     0.7313  0.7294
    4     0.95    0.95    1.00        0.95       0.7413     0.7442  0.7427
    5     0.95    0.95    0.95        1.00       0.7490     0.7519  0.7505
    6     0.90    0.95    0.95        0.95       0.7275     0.7313  0.7294  Decreasing 1 WF
    7     0.95    0.90    0.95        0.95       0.7439     0.7468  0.7453
    8     0.95    0.95    0.90        0.95       0.7172     0.7209  0.7191
    9     0.95    0.95    0.95        0.90       0.7439     0.7468  0.7453
    10    0.95    0.90    0.95        1.00       0.7439     0.7468  0.7453  Trial and error to find the WF of cost and phonetic
    99-1  0.95    0.95    0.95        0.90       0.7375     0.7403  0.7389
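
To make the role of the four weighting factors concrete, here is a minimal sketch of an edit distance with a separate cost per operation. It is not CSpell's implementation: the page does not say which edit-distance variant is used or how the weights enter the similarity score. The sketch assumes the optimal string alignment variant (Damerau-Levenshtein with adjacent transpositions), with each weighting factor used directly as the cost of one operation. The function name `weighted_edit_distance` and its parameters are hypothetical.

```python
def weighted_edit_distance(
    src: str,
    tgt: str,
    w_del: float = 0.95,
    w_ins: float = 0.95,
    w_sub: float = 0.95,
    w_trans: float = 0.95,
) -> float:
    """Cost of turning src into tgt with per-operation weights (assumed model)."""
    n, m = len(src), len(tgt)
    # d[i][j] = cheapest cost to turn src[:i] into tgt[:j].
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + w_del  # delete every source character
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + w_ins  # insert every target character

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Match costs nothing; a mismatch costs one substitution.
            diag = d[i - 1][j - 1] + (0.0 if src[i - 1] == tgt[j - 1] else w_sub)
            best = min(diag, d[i - 1][j] + w_del, d[i][j - 1] + w_ins)
            # Swap of two adjacent characters, e.g. "teh" -> "the".
            if (
                i > 1
                and j > 1
                and src[i - 1] == tgt[j - 2]
                and src[i - 2] == tgt[j - 1]
            ):
                best = min(best, d[i - 2][j - 2] + w_trans)
            d[i][j] = best
    return d[n][m]


# Weights from test 10: cheaper insert, more expensive transpose.
print(weighted_edit_distance("teh", "the", w_ins=0.90, w_trans=1.00))
print(weighted_edit_distance("cat", "cats", w_ins=0.90, w_trans=1.00))
```

With unequal weights, candidates reached by a cheaper operation receive a smaller distance than candidates reached by a more expensive one, which is one way such weights could change the order of suggestions.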

  • Tests on various weighting factors (WF) on costs of the edit distance (delete, insert, substitute, and transpose). The WF for orthographic is 1.0, 1.0, 1.0.

III. Discussion

  • From the results of tests 11-15, we chose Double Metaphone, which had the highest F1 score (0.7505), as the phonetic system in the orthographic similarity score.

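  • From the results of tests 2-5, we observed that raising the weighting factor of the transpose cost gave the best F1 score (0.7505, unchanged from test 1), while raising any of the other three lowered it.
  • From the results of tests 6-9, we observed that lowering the weighting factor of the insert cost gave the best F1 score of the four (0.7453, tied with lowering the transpose cost in test 9).
  • We searched for the best F1 score by trial and error in tests 10 through 99-1, that is, by lowering the cost of insert and raising the cost of transpose.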

  • Use test 13 for the weighting factors for costs of delete, insert, substitute and transpose.