Dictionary in Ensemble
I. Introduction
The dictionary (eng_medical.dic) in the Ensemble method includes:
- General English (eng_com.dic):
- Medical Terms (medical.dic - from Halil):
- medical terms from UMLS (consumer health related medical terms)
- English strings
- unigram
- semantic type
- Interventions: topp, lbrp, diap
- Problem: cgab, acab, inpo, patf, dsyn, anab, neop, mobd, sosy, bact
- drugs: drdd, clnd, antb, phsu, nsba, strd, vita, aapp
- lower case
- File name: ${PRE_PROCESS}/data/Umls/${RELEASE}/outData/umls.dic
- some manually added data (Gopher + problem list).
- 4 files from Dina's consumer's data:
- umls_anatomy_merged.txt
- umls_interventions_merged.txt
- umls_population_merged.txt
- umls_problem_merged.txt
- Retrieved the 1st field from above 4 files
- Retrieved unigrams from above terms
- Excluded words in Jazzy (mistakes: but not yse.dic and yze.dic)
- Total: 450K tokens (only unigrams)
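The extraction steps listed above (take the first field, split into unigrams, lowercase, drop excluded words) can be sketched roughly as follows. This is only an illustration, not the project's actual script: the `|` field separator, the function names, and the sample lines are all assumptions, since the layout of the source files is not described here.

```python
def first_field(line, sep="|"):
    """Return the first field of a delimited line.

    The separator is an assumption; adjust it to the real file layout.
    """
    return line.split(sep, 1)[0].strip()


def unigrams(term):
    """Split a term on whitespace and lowercase each token."""
    return [tok.lower() for tok in term.split()]


def build_unigram_set(lines, excluded=frozenset(), sep="|"):
    """Collect lowercased unigrams from the first field of each line,
    skipping any word found in the exclusion set."""
    words = set()
    for line in lines:
        for tok in unigrams(first_field(line, sep)):
            if tok not in excluded:
                words.add(tok)
    return words


# Hypothetical sample lines standing in for the merged term files.
sample = ["Heart Attack|problem", "chest pain|problem", "Aspirin|intervention"]
result = build_unigram_set(sample, excluded={"pain"})
print(sorted(result))
```

Using a set removes duplicate unigrams that occur in more than one term, which matters when counting the total number of tokens.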
II. Format
word (lowercased unigrams)
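A minimal sketch of reading a dictionary in this format and checking that each entry is a lowercased unigram. It assumes one entry per line, which is not stated explicitly here; the function names and the in-memory sample are hypothetical.

```python
import io


def load_dictionary(stream):
    """Read one entry per line into a set, skipping blank lines."""
    return {line.strip() for line in stream if line.strip()}


def is_valid_entry(word):
    """True if the entry is a single token with no uppercase characters."""
    return len(word.split()) == 1 and word == word.lower()


# In-memory stand-in for a dictionary file; a real file would be opened
# with open(path, encoding="utf-8") instead.
demo = io.StringIO("aspirin\nheart\n\nHeart attack\n")
words = load_dictionary(demo)
bad = sorted(w for w in words if not is_valid_entry(w))
print(bad)
```

A check like this reports any entry that is multi-word or contains uppercase characters, so formatting problems can be caught before the dictionary is used.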
III. Re-generate the Dictionary
We tried to reproduce the dictionary used in the Ensemble method:
- File name: ${PRE_PROCESS}/data/Baseline/outData/baseline.dic
- Format: lowercase word
- Differences:
- The generated medical Dictionary (medDic.data) and Halil's file (medical.dic):
- Almost identical
- The only difference is in non-ASCII Unicode characters (caused by the file encoding)
- Compare
- A: Halil's Eng_medical.dic
- B: Eng_com.dic + medical.dic
The results are: