CSpell

Performance Tests on Corpora

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell (Lexicon-based)
  • Corpora:
    • Tested two different corpora (MEDLINE and the consumer health corpus) for the word frequency score and the noisy channel score
    • Used the consumer health corpus to train word2vec
  • Ranking: CSpell

II. Test Results

Corpus                  Size     Precision  Recall  F1
MEDLINE                 496,388  0.8085     0.7907  0.7995
Consumer Health Corpus  109,818  0.8407     0.7842  0.8115
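
The F1 values reported above are consistent with the usual definition of F1 as the harmonic mean of precision and recall. The page does not state which formula was used, so the sketch below assumes that standard definition; the function name `f1_score` and the `results` dictionary are illustrative and not part of CSpell.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the results table above.
results = {
    "MEDLINE": (0.8085, 0.7907),
    "Consumer Health Corpus": (0.8407, 0.7842),
}

# Recompute F1 for each corpus, rounded to four decimals like the table.
f1_by_corpus = {name: round(f1_score(p, r), 4) for name, (p, r) in results.items()}

for name, f1 in f1_by_corpus.items():
    print(f"{name}: F1 = {f1:.4f}")
```

Under this assumption, the recomputed values agree with the table to four decimal places.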

III. Discussion

  • The F1 score dropped by 1.2 percentage points (from 0.8115 to 0.7995) when the corpus was changed from the consumer health corpus to the MEDLINE corpus.
  • The MEDLINE corpus is 4.52 times the size of the consumer health corpus.
  • A smaller, relevant corpus outperforms a larger general collection that is not necessarily related to consumer health data.
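
The figures in the discussion can be reproduced from the results table with a few lines of arithmetic. This is a minimal sketch: the variable names are illustrative, and the relative drop is an additional derived figure that the page itself does not report.

```python
# Size and F1 values copied from the results table.
medline = {"size": 496_388, "f1": 0.7995}
consumer = {"size": 109_818, "f1": 0.8115}

# Absolute F1 difference, expressed in percentage points.
f1_drop_points = (consumer["f1"] - medline["f1"]) * 100

# The same difference relative to the consumer health F1 (derived, not from the page).
f1_drop_relative = (consumer["f1"] - medline["f1"]) / consumer["f1"] * 100

# How many times larger the MEDLINE corpus is.
size_ratio = medline["size"] / consumer["size"]

print(f"F1 drop: {f1_drop_points:.2f} percentage points")  # 1.20
print(f"Relative F1 drop: {f1_drop_relative:.2f}%")        # 1.48
print(f"Size ratio: {size_ratio:.2f}x")                    # 4.52
```

The absolute and relative figures differ slightly (1.2 points versus about 1.5%), which is why it helps to state which one is meant.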