SPECIALIST Lexicon

Antonym Generation for CC Model

shell> cd ${ANTONYM_DIR}/bin
shell> GetAntonyms ${YEAR}

CC model: co-occurrence in a corpus

Uses the latest MEDLINE N-gram Set, the Lexicon, and STMT.
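
As a rough illustration of the co-occurrence idea behind the CC model, the Python sketch below collects word pairs that surround a middle keyword in 3-gram records. It is not the Java implementation used by the project. The record layout (`docCount|wordCount|ngram`), the function name `collocates_from_3grams`, the decision to treat pairs as unordered, and the sample records are all assumptions made for this example.

```python
from collections import Counter


def collocates_from_3grams(records, keyword):
    """Sum word counts of unordered (left, right) pairs around `keyword`."""
    pairs = Counter()
    for record in records:
        fields = record.rstrip("\n").split("|")
        if len(fields) != 3:
            continue  # assumed layout: docCount|wordCount|ngram
        words = fields[2].split()
        if len(words) != 3 or words[1] != keyword:
            continue  # keep only 3-grams whose middle word is the keyword
        left, right = words[0], words[2]
        if left == right:
            continue  # skip self-pairs such as "pain and pain"
        # Assumption: "a and b" and "b and a" count as the same pair.
        pairs[tuple(sorted((left, right)))] += int(fields[1])
    return pairs


# Made-up sample records, not taken from the MEDLINE N-gram Set.
sample = [
    "120|150|benign and malignant",
    "80|95|malignant and benign",
    "40|42|acute versus chronic",
    "10|11|pain and pain",
    "9|9|history of pain",
]

and_pairs = collocates_from_3grams(sample, "and")
versus_pairs = collocates_from_3grams(sample, "versus")
print(and_pairs)
print(versus_pairs)
```

With this sample, the two `and` records for benign/malignant merge into a single pair with a combined count of 245, and the self-pair is dropped. Running the function once per keyword gives one collocate set per keyword, which is the kind of per-keyword result that later steps combine.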

Option	Description	Input	Output	Notes	Option
70	Get antonyms from MEDLINE 3-grams by a specified middle keyword (and/or): Medline.GetAntCandFrom3GramPatMid.java	${ML_DIR}/input/3-gram.${ML_YEAR}.30.core ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF ${LEX_DIR}/input/inflVars.data ${LEX_DIR}/input/synonym.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties	./output/PreCand/antCandPatMid.andOr.data	This step is not part of the annual process; it is used to debug a single keyword from Step 71. It pre-runs Step 71 with one middle word in the 3-grams to get collocates for antonyms. Run it to make sure everything works before running Step 71. On the first run: shell> mkdir ./output/PreCand, and make sure all input files are set up correctly. Different data versions are used because the data sets are released on different dates: Lexicon Antonym release: ${YEAR}; META-thesaurus: ${PREV_YEAR}AA; MEDLINE: ${PREV_YEAR}; LVG: ${PREV_YEAR}. The program sets the default keyword to "and/or".	70
71	Get antonyms from MEDLINE 3-grams by specified middle keywords: Medline.GetAntCandFrom3GramPatMid.java	${ML_DIR}/input/3-gram.${YEAR}.30.core ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF ${LEX_DIR}/input/inflVars.data ${LEX_DIR}/input/synonym.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties	./output/PreCand/antCandPatMid.${KEY_WORD}.data	Currently, this program includes the 9 highest-frequency keywords: [and], [or], [to], [versus], [than], [vs], [from], [nor], [and|or], as defined in the scripts. The latest data are used, in different versions because the data sets are released on different dates: Lexicon Antonym release: ${YEAR}; Lexicon: ${YEAR}; META-thesaurus: ${PREV_YEAR}AA; MEDLINE: ${PREV_YEAR}; LVG: ${PREV_YEAR}.	71
72	Get antonyms from MEDLINE 5-grams by specified middle keywords: Medline.GetAntCandFrom5GramPatMid.java	${ML_DIR}/input/5-gram.${YEAR}.30.core ${META_DIR}/input/normTermCui.data ${META_DIR}/input/MRSTY.RRF ${LEX_DIR}/input/inflVars.data ${LEX_DIR}/input/synonym.data ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data ${PROJECT_DIR}/LVG/lvg${LVG_YEAR}/data/config/lvg.properties	./output/PreCand/antCandPatMid.${KEY_WORD}.data	Currently, this program includes one keyword: "as well as", as defined in the scripts. The latest data are used, in different versions because the data sets are released on different dates: Lexicon Antonym release: ${YEAR}; Lexicon: ${YEAR}; META-thesaurus: ${PREV_YEAR}AA; MEDLINE: ${PREV_YEAR}; LVG: ${PREV_YEAR}.	72

75	Get antCand by combining the results from Steps 71 and 72: Medline.CombineAntCandFrom3GramPatMid.java Medline.CombineAntCandFrom5GramPatMid.java	./output/PreCand/antCandPatMid.${KEY_WORD}.data.wc ./output/PreCand/keyWords.data	./output/PreCand/antCandPatMid.cand.data.raw => includes raw co-occurrences that occur once in 1 of 10 keywords; ./output/PreCand/antCandPatMid.cand.data.filtered => includes filtered co-occurrences (heuristic filter rules: occur in 3 of 9 keywords, do not include "other|E0044444", and are not self-aPairs) => is the sum of files: tag + tbd; ./output/Cand/antCandPatMid.cand.data.tag ./output/candTagged/antCandPatMid.cand.data.tag.CC ./output/candTagged/antCandPatMid.cand.data.tag.tagged ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd	On the first run: shell> mkdir Cand; shell> mkdir candTagged; copy ${PreCand}/keyWords.data from ${PREV_YEAR}. TBD should be 0. If it is not: copy ./Cand/antCandPatMid.cand.data.tbd to antCandPatMid.cand.data.tbd.${YEAR}.${NO}; send the candidates in ${ML_DIR}/output/Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO} to linguists to tag; put the tagged file at ./Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged.	75
76	Validate and fix tags of antonym candidates (CC): Antonym.ValidateTaggedCand.java	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged ${ANT_DIR}/input/domain.data	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.fixed	Prepare and add tagged candidates to ./candTagged/tagged.data.tag.tagged: copy ./Cand/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged to ./candTagged/antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged; convert the tagged candidate file to the standard format: `shell> flds 3,4,5,6,7,8,9,10,11,12 antCandPatMid.cand.data.tbd.${YEAR}.${NO}.tagged > antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12`; append antCandPatMid.data.data.tbd.${YEAR}.${NO}.tagged.3-12 to antCandPatMid.data.tag.tagged.${YEAR}.${NO}; sort -u antCandPatMid.data.tag.tagged.${YEAR}.${NO} > antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort; `shell> cp -p antCandPatMid.data.tag.tagged.${YEAR}.${NO}.uSort antCandPatMid.data.tag.tagged`. Run this step (76) until the tagged and fixed files are the same. The fixed file contains the auto-fixes of [TYPE_TBD] and [DOMAIN_TBD] to [NA] and [DOMAIN_NONE]. Manually copy the fixed file to the tagged file, then run the step again until they are the same. Finally, manually copy antCandPatMid.data.tag.tagged to antCandPatMid.data.tag.tagged.${YEAR}.	76
77	Update the release antonym tagged file from CC: Antonym.UpdateAllTaggedFile.java	${CC_DIR}/output/candTagged/antCandPatMid.data.tag.tagged.${YEAR} ${ANT_DIR}/input/antCand.data.tag.${YEAR} ${ANT_DIR}/input/domain.data	${ANT_DIR}/input/antCand.data.tag.updated	This step auto-updates the tag file of all antonym candidates. Manually copy antCand.data.tag.updated to antCand.data.tag.updated.CC. Manually copy antCand.data.tag.updated to antCand.data.tag.${YEAR}. The output file is used to generate the antonym and negation files for the release. Re-run Steps 75-77 until all steps pass. Re-run Steps 75-77 to generate the latest aPair candidate list for linguists.	77
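
The heuristic filter rules listed for antCandPatMid.cand.data.filtered in Step 75 can be sketched roughly as follows. This is an illustration only, not the logic of Medline.CombineAntCandFrom3GramPatMid.java. It assumes terms are written as `word|EUI`, reads "3 of 9 keywords" as "at least 3 distinct keywords", and treats pairs as unordered; the function name `filter_candidates` and every EUI other than E0044444 are made up for the example.

```python
from collections import defaultdict

EXCLUDED_TERM = "other|E0044444"  # term excluded by the filter rules
MIN_KEYWORDS = 3                  # assumption: "3 of 9" means at least 3


def filter_candidates(pairs_by_keyword):
    """Return sorted pairs that pass the three heuristic filter rules."""
    keywords_for_pair = defaultdict(set)
    for keyword, pairs in pairs_by_keyword.items():
        for term_a, term_b in pairs:
            # Assumption: (a, b) and (b, a) are the same candidate pair.
            keywords_for_pair[tuple(sorted((term_a, term_b)))].add(keyword)

    kept = []
    for (term_a, term_b), keywords in sorted(keywords_for_pair.items()):
        if term_a == term_b:
            continue  # rule: no self-aPairs
        if EXCLUDED_TERM in (term_a, term_b):
            continue  # rule: no pair that includes "other|E0044444"
        if len(keywords) < MIN_KEYWORDS:
            continue  # rule: must co-occur under enough keywords
        kept.append((term_a, term_b))
    return kept


# Hypothetical per-keyword collocate lists (EUIs are placeholders).
benign, malignant = "benign|E0000001", "malignant|E0000002"
same, pain = "same|E0000003", "pain|E0000004"
pairs_by_keyword = {
    "and": [(benign, malignant), (EXCLUDED_TERM, same), (pain, pain)],
    "or": [(malignant, benign), (EXCLUDED_TERM, same), (pain, pain)],
    "versus": [(benign, malignant), (EXCLUDED_TERM, same), (pain, pain),
               ("acute|E0000005", "chronic|E0000006")],
}

filtered = filter_candidates(pairs_by_keyword)
print(filtered)
```

In this sample only the benign/malignant pair survives: the pair with "other|E0044444" and the self-pair are rejected by rule, and acute/chronic appears under a single keyword.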

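The "run until the tagged and fixed files are the same" loop in Step 76 is a fixed-point iteration. The sketch below shows that idea under stated assumptions: records are pipe-delimited, the tags appear as bare field values (the brackets in the notes are taken as notation only), and the only fixes are the two named above. The names `fix_record` and `validate_until_stable` are hypothetical, and the real Antonym.ValidateTaggedCand.java may perform further validation.

```python
# Auto-fixes named in the Step 76 notes.
FIXES = {"TYPE_TBD": "NA", "DOMAIN_TBD": "DOMAIN_NONE"}


def fix_record(record):
    """Replace TBD tags in one pipe-delimited record (assumed layout)."""
    return "|".join(FIXES.get(field, field) for field in record.split("|"))


def validate_until_stable(tagged, max_rounds=10):
    """Re-run the fix pass until the fixed output equals its input."""
    for rounds in range(1, max_rounds + 1):
        fixed = [fix_record(record) for record in tagged]
        if fixed == tagged:
            return fixed, rounds  # tagged and fixed are the same: done
        tagged = fixed  # mirrors "copy the fixed file to the tagged file"
    raise RuntimeError("tagged and fixed did not converge")


# Made-up tagged candidates for illustration.
tagged_records = [
    "benign|malignant|TYPE_TBD|DOMAIN_TBD",
    "acute|chronic|NA|DOMAIN_NONE",
]

stable, rounds_needed = validate_until_stable(tagged_records)
print(stable)
print(rounds_needed)
```

Here the first pass rewrites the TBD tags and the second pass confirms that nothing changes, so the loop stops after two rounds.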