CSpell

Dictionary in Ensemble

I. Introduction

The dictionary (eng_medical.dic) in the Ensemble method includes:

  • General English (eng_com.dic)
  • Medical Terms (medical.dic - from Halil):
    • medical terms from UMLS (consumer health related medical terms)
      • English strings
      • unigram
      • semantic type
        • Interventions: topp, lbrp, diap
        • Problem: cgab, acab, inpo, patf, dsyn, anab, neop, mobd, sosy, bact
        • Drugs: drdd, clnd, antb, phsu, nsba, strd, vita, aapp
      • lower case

      • File name: ${PRE_PROCESS}/data/Umls/${RELEASE}/outData/umls.dic
    • some manually added data (Gopher + problem list).

    • 4 files from Dina's consumer data:
      • umls_anatomy_merged.txt
      • umls_interventions_merged.txt
      • umls_population_merged.txt
      • umls_problem_merged.txt

    • Retrieved the 1st field from the above 4 files
    • Retrieved unigrams from the above terms
    • Excluded words in Jazzy (mistakes: but not yse.dic and yze.dic)
  • Total: 450K tokens (only unigrams)
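
The extraction steps listed above (take the 1st field, split into unigrams, lowercase, exclude known words) can be sketched roughly as follows. This is an illustrative sketch only, not the actual CSpell pre-process code: the "|" field delimiter, the whitespace tokenization, all function names, and the sample lines are assumptions made for the example.

```python
from typing import Iterable, List, Set


def first_field(line: str, delimiter: str = "|") -> str:
    """Return the first field of a line. The '|' delimiter is an assumption."""
    return line.rstrip("\n").split(delimiter, 1)[0]


def to_unigrams(term: str) -> List[str]:
    """Lowercase a term and split it on whitespace (a simplifying assumption)."""
    return term.lower().split()


def build_unigram_set(
    lines: Iterable[str], excluded: Set[str], delimiter: str = "|"
) -> Set[str]:
    """Collect unique lowercase unigrams from the first field of each line,
    skipping any word found in the excluded set."""
    words: Set[str] = set()
    for line in lines:
        for word in to_unigrams(first_field(line, delimiter)):
            if word not in excluded:
                words.add(word)
    return words


if __name__ == "__main__":
    # Invented sample lines; the real file layout may differ.
    sample = ["Heart Attack|x", "attack of asthma|y"]
    # Hypothetical exclusion list standing in for the excluded Jazzy words.
    for word in sorted(build_unigram_set(sample, excluded={"of"})):
        print(word)
```

Using a set keeps the output free of duplicates, which matches the unigram-only nature of the dictionary. A real run would read the four merged files and write the sorted words one per line.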

II. Format

word (lower cased unigrams)
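
A quick way to check a file against this format is sketched below. It assumes one entry per line, which is not stated explicitly above; the function names are hypothetical and chosen for illustration.

```python
from typing import Iterable, List, Tuple


def is_valid_entry(line: str) -> bool:
    """True if the line holds exactly one lowercase word with no surrounding spaces."""
    word = line.rstrip("\n")
    return (
        bool(word)
        and word == word.strip()
        and len(word.split()) == 1
        and word == word.lower()
    )


def find_invalid(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, text) for every line that breaks the format."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if not is_valid_entry(line)
    ]


if __name__ == "__main__":
    # Invented sample entries: the second and third violate the format.
    for number, text in find_invalid(["heart\n", "Heart\n", "heart attack\n"]):
        print(number, repr(text))
```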

III. Re-generate the Dictionary

We tried to reproduce the dictionary used in the Ensemble method:

  • File name: ${PRE_PROCESS}/data/Baseline/outData/baseline.dic
  • Format: lowercase word
  • Differences:
    • The generated medical dictionary (medDic.data) vs. Halil's file (medical.dic):
      • Almost identical
      • The only difference is in non-ASCII Unicode characters (due to the file encoding format)
    • Comparison of:
      • A: Halil's eng_medical.dic
      • B: eng_com.dic + medical.dic

      the results are: