CSpell

Dictionary in Ensemble

I. Introduction

The dictionary (eng_medical.dic) in the Ensemble method includes:

General English (eng_com.dic):
- Jazzy spell checker
Medical Terms (medical.dic - from Halil):
- medical terms from UMLS (consumer health related medical terms)
  - English strings
  - unigram
  - semantic type
    - Interventions: topp, lbrp, diap
    - Problem: cgab, acab, inpo, patf, dsyn, anab, neop, mobd, sosy, bact
    - Drugs: drdd, clnd, antb, phsu, nsba, strd, vita, aapp
  - lower case
  - File name: ${PRE_PROCESS}/data/Umls/${RELEASE}/outData/umls.dic
- some manually added data (Gopher + problem list).
- 4 files from Dina's consumer data:
  - umls_anatomy_merged.txt
  - umls_interventions_merged.txt
  - umls_population_merged.txt
  - umls_problem_merged.txt
- Retrieved the 1st field from the above 4 files
- Retrieved unigrams from the above terms
- Excluded words in Jazzy (mistakes: but not yse.dic and yze.dic)
Total: 450K tokens (only unigrams)
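
The extraction steps above (take the first field, split into unigrams, lowercase) can be sketched as follows. This is a minimal illustration, not the actual pre-process code: the function names and the "|" field delimiter are assumptions, since the source does not specify the file layout. The optional exclusion step reads "Excluded words in Jazzy" as removing words already present in the general English list, which is also an interpretation.

```python
from typing import Iterable, Set


def unigrams_from_terms(lines: Iterable[str], delimiter: str = "|") -> Set[str]:
    """Collect lowercase unigrams from the first field of each line.

    The delimiter is an assumption; adjust it to the real file layout.
    """
    words: Set[str] = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        term = line.split(delimiter, 1)[0]  # 1st field holds the term
        for token in term.split():  # whitespace split yields unigrams
            words.add(token.lower())
    return words


def build_medical_words(
    term_lines: Iterable[str],
    general_words: Iterable[str],
    delimiter: str = "|",
) -> Set[str]:
    """Hypothetical helper: drop unigrams already in the general word list."""
    return unigrams_from_terms(term_lines, delimiter) - set(general_words)
```

For example, calling `unigrams_from_terms` on the lines of one of the four merged files would return the set of lowercase single words found in its term field.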

II. Format

word (lower cased unigrams)
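
A quick way to check that a file follows this format is to test each line for being a single lowercase token. The sketch below is only an illustration; `is_valid_entry` is a hypothetical helper name, not part of CSpell.

```python
def is_valid_entry(line: str) -> bool:
    """Return True if the line is one non-empty, lowercase, whitespace-free word."""
    word = line.rstrip("\r\n")  # drop the line terminator only
    if not word:
        return False
    if any(ch.isspace() for ch in word):
        return False  # more than one token, so not a unigram
    return word == word.lower()
```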

III. Re-generate the Dictionary

We tried to reproduce the dictionary used in the Ensemble method:

File name: ${PRE_PROCESS}/data/Baseline/outData/baseline.dic
Format: lowercase word
Differences:
- The generated medical dictionary (medDic.data) vs. Halil's file (medical.dic):
  - Almost identical
  - The only difference is non-ASCII Unicode characters (from the file encoding format)
- Comparison of:
  - A: Halil's eng_medical.dic
  - B: eng_com.dic + medical.dic
  The results are:
  - A and B: 450,504
  - A - B: 22
  - B - A: 8
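
The comparison above is a plain set comparison of two word lists. A minimal sketch of how such counts could be computed is shown below; the function names are illustrative assumptions, and the second helper lists entries containing non-ASCII characters, which is one way to inspect encoding-related differences.

```python
from typing import Dict, Iterable, List


def compare_word_sets(a: Iterable[str], b: Iterable[str]) -> Dict[str, int]:
    """Count shared words and words unique to each side."""
    set_a, set_b = set(a), set(b)
    return {
        "A and B": len(set_a & set_b),  # words in both lists
        "A - B": len(set_a - set_b),    # words only in A
        "B - A": len(set_b - set_a),    # words only in B
    }


def non_ascii_words(words: Iterable[str]) -> List[str]:
    """Return the words that contain at least one non-ASCII character."""
    return sorted(w for w in words if not w.isascii())
```

Running `non_ascii_words` on the words in A - B and B - A would show whether those leftover differences come from non-ASCII characters.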