The SPECIALIST Lexicon

Antonym Generation for CC Model

shell> cd ${ANTONYM_DIR}/bin
shell> GetAntonyms ${YEAR}

CC model: co-occurrence in a corpus

Uses the latest MEDLINE N-gram Set, the Lexicon, and STMT.

Option | Description | Input | Output | Notes | Option
70
  • Get antonyms from MEDLINE 3-grams by a specified middle keyword (and/or):
  • Medline.GetAntCandFrom3GramPatMid.java
  • ${ML_DIR}/input/3-gram.${ML_YEAR}.30.core
  • ${META_DIR}/input/normTermCui.data
  • ${META_DIR}/input/MRSTY.RRF
  • ${LEX_DIR}/input/inflVars.data
  • ${LEX_DIR}/input/synonym.data
  • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
  • ${ANT_DIR}/input/domain.data
  • ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties
  • ./output/PreCand/antCandPatMid.andOr.data
  • This step is not part of the annual process, but it is used to debug a single keyword in Step 71.
  • This step pre-runs Step 71 with a single middle word in 3-grams to get collocates for antonyms. Run it to make sure everything is OK before running Step 71.
  • If running for the first time:
    • shell> mkdir ./output/PreCand
    • make sure all input files are set up correctly
  • Different data versions are used because the data sets have different release dates:
    • Lexicon Antonym release: ${YEAR}
    • META-thesaurus: ${PREV_YEAR}AA
    • MEDLINE: ${PREV_YEAR}
    • LVG: ${PREV_YEAR}
  • This program sets the default keyword to "and/or".
70
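The first-run checks for Step 70 could be scripted roughly as follows. This is only a sketch: `check_inputs` is a hypothetical helper, not part of the Lexicon scripts, and it assumes that "set up correctly" can be approximated by "the file exists and is not empty".

```shell
# Hypothetical helper (not part of the Lexicon scripts).
# Prints one "MISSING: <path>" line for every file that is absent or empty,
# and returns 1 if at least one file failed the check.
check_inputs() {
  local status=0 f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "MISSING: $f"
      status=1
    fi
  done
  return "$status"
}

# Example call with some of the Step 70 inputs listed above:
#   mkdir -p ./output/PreCand
#   check_inputs \
#     "${ML_DIR}/input/3-gram.${ML_YEAR}.30.core" \
#     "${META_DIR}/input/normTermCui.data" \
#     "${META_DIR}/input/MRSTY.RRF" \
#     "${LEX_DIR}/input/inflVars.data" \
#     "${LEX_DIR}/input/synonym.data"
```

A non-zero return value means at least one input still needs attention before the pre-run.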
71
  • Get antonyms from MEDLINE 3-grams by specified middle keywords
  • Medline.GetAntCandFrom3GramPatMid.java
  • ${ML_DIR}/input/3-gram.${YEAR}.30.core
  • ${META_DIR}/input/normTermCui.data
  • ${META_DIR}/input/MRSTY.RRF
  • ${LEX_DIR}/input/inflVars.data
  • ${LEX_DIR}/input/synonym.data
  • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
  • ${ANT_DIR}/input/domain.data
  • ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties
  • ./output/PreCand/antCandPatMid.${KEY_WORD}.data
  • Currently, this program includes the 9 highest-frequency keywords: [and], [or], [to], [versus], [than], [vs], [from], [nor], [and|or], as defined in the scripts.
  • The latest data are used, with different versions because the data sets have different release dates:
    • Lexicon Antonym release: ${YEAR}
    • Lexicon: ${YEAR}
    • META-thesaurus: ${PREV_YEAR}AA
    • MEDLINE: ${PREV_YEAR}
    • LVG: ${PREV_YEAR}
71
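As a quick way to see which output files Step 71 should leave in ./output/PreCand, the keyword list can be mapped to file names. The sketch below is illustrative only: `list_keywords` and `keyword_file` are hypothetical names, and the mapping of [and|or] to "andOr" is an assumption based on the Step 70 output file name.

```shell
# Keywords listed for Step 71, one per line.
list_keywords() {
  printf '%s\n' and or to versus than vs from nor 'and|or'
}

# Expected output file for one keyword.
# Assumption: "and|or" is written as "andOr", as in the Step 70 output name.
keyword_file() {
  local kw="$1"
  if [ "$kw" = "and|or" ]; then kw="andOr"; fi
  printf './output/PreCand/antCandPatMid.%s.data\n' "$kw"
}

# Usage: print the expected output file for every keyword.
#   list_keywords | while IFS= read -r kw; do keyword_file "$kw"; done
```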
72
  • Get antonyms from MEDLINE 5-grams by specified middle keywords
  • Medline.GetAntCandFrom5GramPatMid.java
  • ${ML_DIR}/input/5-gram.${YEAR}.30.core
  • ${META_DIR}/input/normTermCui.data
  • ${META_DIR}/input/MRSTY.RRF
  • ${LEX_DIR}/input/inflVars.data
  • ${LEX_DIR}/input/synonym.data
  • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
  • ${ANT_DIR}/input/domain.data
  • ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties
  • ./output/PreCand/antCandPatMid.${KEY_WORD}.data
  • Currently, this program includes one keyword: "as well as", as defined in the scripts.
  • The latest data are used, with different versions because the data sets have different release dates:
    • Lexicon Antonym release: ${YEAR}
    • Lexicon: ${YEAR}
    • META-thesaurus: ${PREV_YEAR}AA
    • MEDLINE: ${PREV_YEAR}
    • LVG: ${PREV_YEAR}
72
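The middle-keyword pattern used in Steps 71 and 72 can be illustrated with a few lines of awk. This sketch covers only the pattern match ("X keyword Y" yields the pair X, Y). It does not reproduce whatever the Java programs do with the other input files, and it assumes the n-gram text is the last '|'-separated field of each line. `extract_mid_pairs` is a hypothetical name.

```shell
# Print "left|right" for every n-gram whose middle words equal the keyword.
# $1 = keyword (one or more words); n-gram lines are read from stdin.
# Assumption: the n-gram text is the last '|'-separated field of each line.
extract_mid_pairs() {
  awk -F'|' -v kw="$1" '
    {
      n = split($NF, w, " ")        # words of the n-gram
      k = split(kw, kwWords, " ")   # number of words in the keyword
      if (n != k + 2) next          # need exactly: left + keyword + right
      mid = w[2]
      for (i = 3; i < n; i++) mid = mid " " w[i]
      if (mid == kw) print w[1] "|" w[n]
    }'
}

# Usage (file names as listed for Steps 71 and 72):
#   extract_mid_pairs 'and'        < "${ML_DIR}/input/3-gram.${YEAR}.30.core"
#   extract_mid_pairs 'as well as' < "${ML_DIR}/input/5-gram.${YEAR}.30.core"
```

With a one-word keyword the match needs a 3-gram, and with "as well as" it needs a 5-gram, which is consistent with the split between the two steps.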
75
  • Get antonym candidates (antCand) by combining the results of Steps 71 and 72
  • Medline.CombineAntCandFrom3GramPatMid.java
  • Medline.CombineAntCandFrom5GramPatMid.java
  • ./output/PreCand/antCandPatMid.${KEY_WORD}.data.wc
  • ./output/PreCand/keyWords.data
  • ./output/PreCand/antCandPatMid.cand.data.raw
    => includes raw co-occurrences that occur once in 1 of the 10 keywords
  • ./output/PreCand/antCandPatMid.cand.data.filtered
    Heuristic filter rules:
    => includes filtered co-occurrences: those that occur in 3 of 9 keywords, do not include "other|E0044444", and are not self-aPairs
    => is the sum of the tag and tbd files
  • ./output/Cand/antCandPatMid.cand.data.tag
  • ./output/candTagged/antCandPatMid.cand.data.tag.CC
  • ./output/candTagged/antCandPatMid.cand.data.tag.tagged
  • ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd
  • If running for the first time:
    • shell> mkdir Cand
    • shell> mkdir candTagged
    • copy ${PreCand}/keyWords.data from ${PREV_YEAR}
  • The number of TBD candidates should be 0
  • If not, copy ./Cand/antCandPatMid.cand.data.tbd to antCandPatMid.cand.data.tbd.${YEAR}.${NO}
  • Send the candidate file ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO} to linguists for tagging
  • Put the tagged file at ./Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged
75
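The heuristic filter rules of Step 75 can be sketched in awk. The input layout here is hypothetical (tab-separated lines of keyword, left term, right term, with each term written as "term|EUI"), and reading "3 of 9 keywords" as "at least 3 distinct keywords" is an assumption. The real combine programs may work differently.

```shell
# Keep candidate pairs that co-occur with at least $1 distinct keywords,
# dropping self-pairs and any pair that contains "other|E0044444".
# Hypothetical input on stdin: keyword <TAB> left <TAB> right
filter_candidates() {
  awk -F'\t' -v min="$1" '
    $2 == $3 { next }                                           # self-pair
    $2 == "other|E0044444" || $3 == "other|E0044444" { next }   # excluded term
    {
      pair = $2 "\t" $3
      # count each keyword only once per pair
      if (!((pair, $1) in seen)) { seen[pair, $1] = 1; count[pair]++ }
    }
    END { for (p in count) if (count[p] >= min) print p }
  ' | sort
}

# Usage, with a hypothetical input file name:
#   filter_candidates 3 < pairs.tsv
```

The threshold is passed as a parameter so the same sketch can show the raw rule (threshold 1) and the filtered rule (threshold 3).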
76
  • Validate and fix tags of antonym candidates (CC)
  • Antonym.ValidateTaggedCand.java
  • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged
  • ${ANT_DIR}/input/domain.data
  • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.fixed
  • Prepare/add tagged candidates to ./candTagged/tagged.data.tag.tagged
    • copy ./Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged to ./candTagged/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged
    • convert tagged candidate file to standard format:
      shell> flds 3,4,5,6,7,8,9,10,11,12 antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged > antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12
    • append antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12 to antCandPatMid.data.tag.tagged.${YEAR}.${NO}
    • shell> sort -u antCandPatMid.data.tag.tagged.${YEAR}.${NO} > antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort
    • shell> cp -p antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort antCandPatMid.data.tag.tagged
  • Run this step (76) until the tagged and fixed files are the same
    • The fixed file contains the automatic fixes that change [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE], respectively.
    • Manually copy the fixed file to the tagged file, then run the step again until they are the same
  • Manually copy antCandPatMid.data.tag.tagged to antCandPatMid.data.tag.tagged.${YEAR}
76
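The file preparation and the "run until the same" check in Step 76 can be sketched with standard tools. The helper names below are hypothetical. `cut` is used as a stand-in for the `flds` command on the assumption that the tagged files are '|'-delimited.

```shell
# Keep fields 3-12 of a '|'-delimited tagged file (stand-in for flds).
to_standard_format() {
  cut -d'|' -f3-12 "$1"
}

# Combine the accumulated tagged file ($1) with newly converted
# candidates ($2) and remove duplicate lines; result goes to stdout.
merge_tagged() {
  cat "$1" "$2" | sort -u
}

# Return 0 when the tagged file ($1) and the fixed file ($2) are identical,
# which is the stop condition for re-running Step 76.
is_converged() {
  cmp -s "$1" "$2"
}

# Usage, run inside ./candTagged after Step 76 has written the fixed file:
#   if is_converged antCandPatMid.data.tag.tagged antCandPatMid.data.tag.fixed; then
#     echo "tagged and fixed files match"
#   else
#     cp -p antCandPatMid.data.tag.fixed antCandPatMid.data.tag.tagged   # then re-run Step 76
#   fi
```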
77
  • Update the release antonym tagged file from CC
  • Antonym.UpdateAllTaggedFile.java
  • ${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged.${YEAR}
  • ${ANT_DIR}/input/antCand.data.tag.${YEAR}
  • ${ANT_DIR}/input/domain.data
  • ${ANT_DIR}/input/antCand.data.tag.updated
  • This step automatically updates the tag file of all antonym candidates
  • Manually copy antCand.data.tag.updated to antCand.data.tag.updated.CC
  • Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}
  • The output file is used to generate antonym and negation files for the release.
  • Re-run Steps 75-77 until all steps pass
  • Re-run Steps 75-77 to generate the latest aPair candidate list for linguists
77
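The two manual copies at the end of Step 77 could be wrapped in a small helper so they are always done together. `promote_updated` is a hypothetical name, and the guard against a missing or empty updated file is an added safety check rather than something the procedure requires.

```shell
# Copy antCand.data.tag.updated to its .CC snapshot and to the year file.
# $1 = directory holding antCand.data.tag.updated (for example ${ANT_DIR}/input)
# $2 = release year
promote_updated() {
  local dir="$1" year="$2"
  if [ ! -s "$dir/antCand.data.tag.updated" ]; then
    echo "no updated file in $dir" >&2
    return 1
  fi
  cp -p "$dir/antCand.data.tag.updated" "$dir/antCand.data.tag.updated.CC"
  cp -p "$dir/antCand.data.tag.updated" "$dir/antCand.data.tag.$year"
}

# Usage:
#   promote_updated "${ANT_DIR}/input" "${YEAR}"
```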