Distilled MEDLINE N-Gram Set
I. Introduction
The MEDLINE n-gram set includes many invalid LMWs that most NLP research does not need. LSG developed a set of exclusive filters to remove them. Applied to the MEDLINE n-gram set release, these filters remove about 2/3 of the n-grams. The resulting filtered set is called the distilled MEDLINE n-gram set.
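The "about 2/3" figure can be checked from two line counts. The following is a minimal sketch, not part of the LSG tools: `removed_percent` is a hypothetical helper, and the counts in the example are made up for illustration.

```shell
# Hypothetical helper: percentage of n-grams removed by the filters.
# Arguments: total n-gram count, distilled (kept) n-gram count.
removed_percent() {
    awk -v total="$1" -v kept="$2" 'BEGIN {
        if (total <= 0) exit 1          # avoid division by zero
        printf "%.1f\n", (total - kept) * 100 / total
    }'
}

# Made-up counts: 3,000,000 n-grams in, 1,000,000 kept.
removed_percent 3000000 1000000   # prints 66.7
```

In practice the two arguments would come from `wc -l` on the MEDLINE n-gram file and on the distilled n-gram file.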
II. Precision and Recall
The distilled MEDLINE n-gram set has higher precision and a similar recall rate for valid multiwords. LSG tests the accuracy of every exclusive filter by applying it to the Lexicon, whose LMWs are all valid. The minimum passing rate is 99.99%. In other words, the filters remove invalid LMWs while removing almost no valid LMWs. A simple calculation is described below:
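As a hedged illustration of the 99.99% passing-rate test only (this is not LSG's own calculation), the check can be written as an integer comparison. `passes_accuracy_test` is a hypothetical name, and the counts in the example are invented.

```shell
# Hypothetical check: does a filter keep at least 99.99% of valid LMWs?
# Arguments: number of Lexicon LMWs tested, number that passed the filter.
# Returns 0 if the passing rate is >= 99.99%, 1 if not, 2 on bad input.
passes_accuracy_test() {
    awk -v tested="$1" -v passed="$2" 'BEGIN {
        if (tested <= 0) exit 2
        # passed / tested >= 0.9999, rewritten without division
        exit !(passed * 10000 >= tested * 9999)
    }'
}

# Invented counts: 99,995 of 100,000 valid LMWs pass the filter.
if passes_accuracy_test 100000 99995; then
    echo "filter accepted"
fi
```

Comparing `passed * 10000` with `tested * 9999` avoids floating-point rounding at the threshold.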
III. Conclusion
The distilled MEDLINE n-gram Set vs. MEDLINE n-gram Set
IV. Release Processes
shell>cd ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/nGrams
shell>wc -l nGramSet.${YEAR}.30
Year | nGram Number |
---|---|
2014 | 17,023,819 |
2015 | 18,148,692 |
2016 | 19,325,338 |
2017 | 21,963,037 |
2018 | 23,171,133 |
2019 | 24,666,816 |
2020 | 26,310,808 |
2021 | 28,103,252 |
2022 | 30,090,771 |
2023 | 32,107,061 |
2024 | 34,160,908 |
2025 | 36,370,468 |
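The counts in the table come from `wc -l`, which also prints the file name. A small sketch of a wrapper that prints only the number, for use in scripts; `count_ngrams` is a hypothetical name, not an LSG tool.

```shell
# Hypothetical wrapper around the wc -l step: prints only the line count.
count_ngrams() {
    # Reading from stdin keeps the file name out of wc's output;
    # tr strips the padding that some wc implementations add.
    wc -l < "$1" | tr -d '[:space:]'
}

# Usage (path as in the steps above):
#   count_ngrams nGramSet.${YEAR}.30
```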
shell>mkdir ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/05.ApplyFilters
shell>ln -sf ../02.NGram/nGrams/nGramSet.${YEAR}.30 nGram.${YEAR}
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
shell>ln -sf nfsvol/lex/Lu/Backup/Releases/UMLS/${YEAR}_AA_release/LEX/NUMBERS/NRVAR NRVAR
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/inData
shell>cp -p ../../${PREV_YEAR}/inData/stopWords.data.${PREV_YEAR} stopWords.data.${YEAR}
shell>ln -sf ./stopWords.data.${YEAR} stopWords.data
shell>cp -p ../../${PREV_YEAR}/inData/unit.data.${PREV_YEAR} unit.data.${YEAR}
shell>ln -sf ./unit.data.${YEAR} unit.data
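The copy-then-link steps for stopWords.data and unit.data follow one pattern. A sketch of that pattern as a function, assuming it is run from the `inData` directory of the current year. `carry_forward` is a hypothetical helper, not an LSG script.

```shell
# Hypothetical helper: copy last year's data file to this year's name and
# point the unversioned name at the new copy.
# Arguments: base file name, previous year, current year.
# Assumes the current directory is .../data/<year>/inData.
carry_forward() (
    # The body runs in a subshell, so these variables stay local.
    name=$1; prev=$2; year=$3
    src="../../${prev}/inData/${name}.${prev}"
    if [ ! -f "$src" ]; then
        echo "missing: $src" >&2
        exit 1
    fi
    cp -p "$src" "${name}.${year}" &&
        ln -sf "./${name}.${year}" "$name"
)

# Usage:
#   carry_forward stopWords.data "${PREV_YEAR}" "${YEAR}"
#   carry_forward unit.data "${PREV_YEAR}" "${YEAR}"
```

The function stops before copying if last year's file is absent, so a stale link is never left pointing at a missing file.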
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cat invalidLeadTerms.data invalidLeadTerms.data.append > invalidLeadTerms.data.${YEAR}
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadTerms.data.${PREV_YEAR} invalidLeadTerms.data.${YEAR}
shell>ln -sf ./invalidLeadTerms.data.${YEAR} invalidLeadTerms.data.abs
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>mv invalidEndTerms.data invalidEndTerms.data.${YEAR}
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidEndTerms.data.${PREV_YEAR} invalidEndTerms.data.${YEAR}
shell>ln -sf ./invalidEndTerms.data.${YEAR} invalidEndTerms.data.abs
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/invalidLeadEndTermCandidates.data .
03.LeadEndTerm
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validLeadTerms.data.pat.${PREV_YEAR} validLeadTerms.data.pat.${YEAR}
shell>ln -sf ./validLeadTerms.data.pat.${YEAR} validLeadTerms.data.pat
shell>cd ${MULTIWORD_DIR}/data/${YEAR}/outData/03.LeadEndTerm
shell>cp -p ../../../${PREV_YEAR}/outData/03.LeadEndTerm/validEndTerms.data.pat.${PREV_YEAR} validEndTerms.data.pat.${YEAR}
shell>ln -sf ./validEndTerms.data.pat.${YEAR} validEndTerms.data.pat
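Before the filters are run, it can help to confirm that every linked input resolves. A minimal sketch of such a check; `check_inputs` is a hypothetical helper, and the file list in the usage note is taken from the link steps above.

```shell
# Hypothetical pre-flight check: report any listed path that is missing
# or is a symbolic link whose target does not exist.
check_inputs() {
    rc=0
    for f in "$@"; do
        # -e follows symbolic links, so a dangling link also fails here.
        if [ ! -e "$f" ]; then
            echo "missing or dangling: $f" >&2
            rc=1
        fi
    done
    return "$rc"
}

# Usage, from 03.LeadEndTerm:
#   check_inputs invalidLeadTerms.data.abs invalidEndTerms.data.abs \
#                validLeadTerms.data.pat validEndTerms.data.pat
```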
shell>cd ${MULTIWORDS}/bin
shell>05.ApplyFilters ${YEAR}
1
10-14
20-25
30-34
40
or
shell>cd 05.ApplyFiltersAll
shell>runApplyFilersAll ${YEAR}
shell>cp -p ApplyFilters.rpt ApplyFilters.rpt.${YEAR}
shell>cp -p nGram.${YEAR}.34.invEndTermPat ../02.NGram/nGrams/distilledNGram.${YEAR}
shell>gtar -czvf distilledNGram.${YEAR}.tgz distilledNGram.${YEAR}
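After packaging, the archive can be listed to confirm that it contains the distilled file. A sketch under one assumption: it calls `tar` rather than `gtar`, which gives the same result where `tar` is GNU tar or otherwise supports `-z`. `archive_and_verify` is a hypothetical name.

```shell
# Hypothetical helper: create <file>.tgz and confirm the file is listed in it.
# Argument: name of a file in the current directory.
archive_and_verify() {
    file=$1
    tar -czf "${file}.tgz" "$file" &&
        # -x matches the whole line, -F treats the name as a fixed string.
        tar -tzf "${file}.tgz" | grep -qxF "$file"
}

# Usage:
#   archive_and_verify distilledNGram.${YEAR}
```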
shell>cd ${MULTIWORDS}/bin
shell>06.NGramUtil ${YEAR}
20 (for the MEDLINE nGram set)
21 (for the distilled nGram set)
22 (for nGrams, N = 3 ~ 5)
3
4
5
V. Release Logs
VI. Run the Test data on the Lexicon