SPECIALIST Lexicon

LMW Candidate Post-Processes

The post-processes analyze and aggregate LMW candidate lists, which include LMW candidates from various models. The reported numbers reflect the data at the time of the run, so this program must be re-run to get the latest numbers when any of the following changes:

Lexicon (./data/Candidates/0.LexiconInflVars/inflVars.data.current):
=> must link to the latest Lexicon and inflVars
Not-Base/LMW files:
=> must be updated in LexCheck
A candidate list is completed (to calculate performance)

I. Program:

Root directory: ${MULTIWORDS}/data/Candidates
Command: ${MULTIWORDS}/bin/00.CandidateList
1
2
3
4

II. Functionality:

Combine all previously completed candidate lists.
Use the latest Lexicon (inflVars.data from LexBuild) to auto-tag valid/invalid LMWs.
Calculate the performance (precision) for each candidate list, each model, and overall.
Remove candidates in a raw generated LMW candidate list that are already in the Lexicon (inflVars.data).
Auto-tag (|AUTO_N) or remove candidates in a raw generated LMW candidate list that were previously tagged as invalid LMWs (notBaseForm.data and notLmw.data).
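
The last two functions above can be sketched as follows. This is only a minimal illustration, not the actual 00.CandidateList code: it assumes one candidate term per line and case-insensitive matching on the whole term, and the function name filter_and_tag, its remove_invalid switch, and the sample terms are all hypothetical.

```python
def filter_and_tag(candidates, lexicon_terms, invalid_terms, remove_invalid=False):
    """Drop candidates already in the Lexicon; tag or drop known-invalid ones.

    Assumptions (not from the original tool): terms are compared
    case-insensitively, and each candidate is a plain term string.
    """
    lexicon = {t.lower() for t in lexicon_terms}
    invalid = {t.lower() for t in invalid_terms}
    result = []
    for cand in candidates:
        term = cand.strip()
        key = term.lower()
        if not key or key in lexicon:
            continue  # blank line, or already a Lexicon entry: nothing to review
        if key in invalid:
            if remove_invalid:
                continue  # previously tagged invalid: drop it
            result.append(term + "|AUTO_N")  # previously tagged invalid: auto-tag
        else:
            result.append(term)  # still needs manual tagging
    return result


# Hypothetical sample data standing in for inflVars.data and the invalid LMW files.
lexicon_terms = ["heart attack", "blood pressure"]
invalid_terms = ["in the heart"]
raw = ["Heart attack", "in the heart", "atrial flutter ablation"]

tagged = filter_and_tag(raw, lexicon_terms, invalid_terms)
removed = filter_and_tag(raw, lexicon_terms, invalid_terms, remove_invalid=True)
print(tagged)
print(removed)
```

Keeping the known-invalid terms with an |AUTO_N tag, rather than always dropping them, leaves a visible record of why a candidate was rejected.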

III. Input Files:

./0.LexiconInflVars/inflVars.data.current
=> Link to the latest InflVars.data from LexBuild daily backup
./1.LexiconAbbAcrExpansion/
./2.MNSMatcherParAcr/
./3.DMNSMatcherCuiEndWord/
./4.DMNSMatcherSpVarWc/
./8.WordNet/
=> Use all completed candidate lists.

IV. Output Files:

prevCand.lmw.tag
prevCand.lmw.yes
prevCand.lmw.no
prevCand.lmw.rpt
=> The result table shown below is based on this report; results may differ slightly over time as the Lexicon is updated.
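
The precision figures in such a report can be sketched as below, taking precision as valid / (valid + invalid). This is only an illustration under stated assumptions: the function names, the (model, list, valid, invalid) tuple layout, and the counts are hypothetical and are not taken from prevCand.lmw.rpt.

```python
from collections import defaultdict


def precision(valid, invalid):
    """Fraction of tagged candidates that are valid LMWs."""
    total = valid + invalid
    return valid / total if total else 0.0


def summarize(counts):
    """counts: iterable of (model, list_name, n_valid, n_invalid) tuples.

    Returns precision per candidate list, per model, and overall.
    """
    per_list = {}
    per_model = defaultdict(lambda: [0, 0])
    overall = [0, 0]
    for model, name, yes, no in counts:
        per_list[(model, name)] = precision(yes, no)
        per_model[model][0] += yes
        per_model[model][1] += no
        overall[0] += yes
        overall[1] += no
    by_model = {m: precision(y, n) for m, (y, n) in per_model.items()}
    return per_list, by_model, precision(*overall)


# Made-up counts, for illustration only.
counts = [
    ("ModelA", "list.1", 80, 20),
    ("ModelA", "list.2", 60, 40),
    ("ModelB", "list.1", 45, 5),
]
per_list, by_model, overall = summarize(counts)
print(per_list, by_model, overall)
```

Per-model and overall precision are computed from pooled counts rather than by averaging the per-list values, so larger lists carry more weight.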

V. Detail Process

Step	Description	Input	Output	Notes
1	Aggregate and analyze all previous LMW candidate files. This program analyzes the precision of each candidate list (the fraction of candidates that are valid LMWs).	0.LexiconInflVars/inflVars.data.current; 1.LexiconAbbAcrExpansion/newEuis.a[bc][br].tagged.txt.y.20NN; 2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.20NN; 3.DMNSMatcherCuiEndWor/disNGram.Core.endword.new.out.gsp.20NN; 4.DMNSMatcherSpVarWc/*; 8.WordNet/* (not including *.tbd)	prevCand.data; prevCand.data.no (invalid LMWs); prevCand.data.yes (valid LMWs); prevCand.data.rpt (detailed stats report)	Must update: candidate list, if completed; inflVars (link to the latest inflVars from LexBuild). Check the latest valid vs. invalid ratio.
2	Get not-BaseForm/LMW terms from the LexCheck files. This program analyzes the precision of invalid LMWs from the LexCheck files notBaseForm.data and notLmw.data.	5.LexCheckNotBaseFor/notBaseForm.data.${YEAR}; 6.LexCheckNotLmw/notLmw.data.${YEAR}	notBaseLmw.data; notBaseLmw.data.no (invalid LMWs); notBaseLmw.data.yes (valid LMWs); notBaseLmw.data.rpt (detailed stats report)	Must update: notBaseForm.data.${YEAR}; notLmw.data.${YEAR}; inflVars (link to the latest inflVars from LexBuild). Check the latest valid vs. invalid ratio.
3	Combine the output files from steps 1 and 2 to get the total data set.	./prevCand.data; ./notBaseLmw.data	./totalData.data; ./totalData.data.yes; ./totalData.data.no	Must run steps 1 and 2. Check the latest valid vs. invalid ratio. Can be used as tagged data for a machine learning model.
4	Copy the result files from steps 1-3 to ./DataLog	./notBaseLmw..; ./prevCand..; ./totalTerms.1_2..	./DataLog/${YEAR}/${YEAR}_${MM}_${DD}/: ./notBaseLmw..; ./prevCand..; ./totalTerms.1_2..	Must run steps 1-3. Check the latest valid vs. invalid ratio. Can be used as tagged data for a machine learning model.

10	Filter and tag valid/invalid LMWs for a raw candidate file	./0.LexiconInflVars/inflVars.data.current (valid LMW file); ./totalData.data.no (invalid LMW file); specify: inFile.data, outFile.data	outFile.data	Must complete/update steps 1-3. Input the new candidate file (or link it to ./inFile.data). All generated raw candidate files should go through this step.

20	Generate DL TtSet from valid/invalid LMWs candidate files
21	Generate DL TtSet from inflVars (valid) and invalid LMWs in n-grams ..
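
Steps 3 and 4 in the table above (combine the step 1 and 2 outputs, then copy the results into a dated ./DataLog directory) might look like the sketch below. It is a hedged illustration, not the tool's implementation: the helper names combine, datalog_dir and archive are hypothetical, dropping duplicate lines while combining is an assumption, the "term|tag" line format in the demo is invented, and the glob patterns stand in for the elided file names in step 4.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def combine(inputs, output):
    """Concatenate input files into output, keeping first occurrence of each line.

    De-duplication is an assumption; the original step 3 may simply concatenate.
    """
    seen = set()
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")


def datalog_dir(root, day):
    """Build ./DataLog/${YEAR}/${YEAR}_${MM}_${DD}/ under root for the given day."""
    return Path(root) / "DataLog" / f"{day:%Y}" / f"{day:%Y_%m_%d}"


def archive(root, patterns, day):
    """Copy files matching the glob patterns from root into the dated directory."""
    target = datalog_dir(root, day)
    target.mkdir(parents=True, exist_ok=True)
    for pattern in patterns:
        for path in sorted(Path(root).glob(pattern)):
            if path.is_file():
                shutil.copy2(path, target / path.name)
    return target


# Demo in a temporary directory with invented file contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "prevCand.data").write_text("term a|y\nterm b|n\n", encoding="utf-8")
    (root / "notBaseLmw.data").write_text("term b|n\nterm c|n\n", encoding="utf-8")
    combine([root / "prevCand.data", root / "notBaseLmw.data"], root / "totalData.data")
    combined = (root / "totalData.data").read_text(encoding="utf-8").splitlines()
    target = archive(root, ["prevCand.*", "notBaseLmw.*", "totalData.*"], date(2020, 1, 15))
    archived = sorted(p.name for p in target.iterdir())
    relative = target.relative_to(root).as_posix()

print(combined, archived, relative)
```

Passing the date explicitly, instead of calling date.today() inside archive, keeps the directory name reproducible when a run has to be re-archived.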
