CSpell

Dictionary Analysis

I. Introduction

Four dictionaries (Jazzy, Ensemble, Medline, and Lexicon) are compared. Here is the summary:

                 Jazzy      Ensemble   Medline    Lexicon
Size             159,345    459,038    496,387    558,353
Files            1 + 10     1 + 10     1          1
Preserved Case   No (LC)    No (LC)    No (LC)    Yes
Verified         No         No         No         Yes
General English  Yes        Yes        Yes        Yes
Biomedical       No         Yes        Yes        Yes
Coded            No         No         No         Yes, with extra information available:
  • POS
  • Multiword
  • Single word
  • Abbreviation/Acronym
  • Proper noun
  • Trade mark
  • Unicode
  • Number

II. Analysis and Tests

Analysis and performance tests were conducted on various dictionaries to guide the generation of a better dictionary. Please see the following URL for details:

From the above results, we observe:

  • The verified dictionary from Lexicon is the best for spelling error detection
  • The Ensemble dictionary is the best for spelling error suggestion/correction
  • Multiwords are useful in split cases
  • Abbreviations and acronyms are useful in split cases (to avoid invalid splits)
  • The extra dictionaries from Lexicon (numbers and units) are useful

  • The combination of Ensemble and Lexicon seems to reach the best performance
  • The Lexicon dictionary lacks terms for drugs, problems, ..

III. Overlap and Contain Check

Target (Tar): Lexicon (lexicon.ewLc.dic, 534,330)

Source (Src)                        Src+Tar    Src-Tar    Tar-Src
Ensemble (medical.dic, 299,670)      71,212    228,458    463,118
Medline (medline.dic, 496,387)      212,961    283,426    321,369
Jazzy (eng_com.dic, 150,843)        104,853     45,990    429,477
Jazzy (spVar10File.dic, 8,502)        6,198      2,304    528,132
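
The Src+Tar, Src-Tar, and Tar-Src counts can be read as the sizes of a set intersection and two set differences between a source dictionary and the target Lexicon dictionary. The following is a minimal sketch of such a check, not the tool actually used to produce the numbers above. It assumes each dictionary file holds one entry per line and that entries are compared in lowercase; the function names load_words and overlap_counts are hypothetical.

```python
def load_words(path, lowercase=True):
    """Read a dictionary file into a set (assumes one entry per line)."""
    words = set()
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            word = line.strip()
            if word:
                words.add(word.lower() if lowercase else word)
    return words


def overlap_counts(src_words, tar_words):
    """Return (Src+Tar, Src-Tar, Tar-Src) as entry counts.

    Src+Tar: entries found in both dictionaries
    Src-Tar: entries found only in the source
    Tar-Src: entries found only in the target
    """
    src, tar = set(src_words), set(tar_words)
    return len(src & tar), len(src - tar), len(tar - src)


# Toy data standing in for real dictionary files
src = ["aspirin", "fever", "cough"]
tar = ["fever", "cough", "x-ray", "ibuprofen"]
print(overlap_counts(src, tar))  # (2, 1, 2)
```

With real files, the two sets would come from calls such as load_words("medical.dic") and load_words("lexicon.ewLc.dic"). Each row of the table is consistent with this reading: Src+Tar plus Src-Tar equals the source size (for Ensemble, 71,212 + 228,458 = 299,670), and Src+Tar plus Tar-Src equals the target size (71,212 + 463,118 = 534,330).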