The SPECIALIST Lexicon

LMW Candidate Post-Processes

The post-processes analyze and aggregate LMW candidate lists, which include LMW candidates from various models. The numbers are based on real-time data; in other words, this program must be re-run to get the latest numbers when:

  • The Lexicon is updated (i.e., the latest Lexicon and inflVars must be used)
  • A candidate list is completed (to calculate performance)
  • Not-Base/LMW files in LexCheck are updated

I. Program:
  • Root directory: ${MULTIWORDS}/data/Candidates
  • Command: ${MULTIWORDS}/bin/00.CandidateList
    1
    2
    3
    4

II. Functionality:

  • Combine all previously completed candidate lists
  • Use the latest Lexicon (inflVars.data from LexBuild) to auto-tag valid/invalid LMWs
  • Calculate the performance (precision) for each candidate list, each model, and overall
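
The precision calculation can be sketched as follows. This is a minimal illustration, not the program's actual code: it assumes precision means valid / (valid + invalid) among tagged candidates, and the function names and the (model, list name, is_valid) row layout are hypothetical.

```python
from collections import defaultdict


def precision(valid: int, invalid: int) -> float:
    """Assumed definition: share of tagged candidates that are valid LMWs."""
    total = valid + invalid
    return valid / total if total else 0.0


def summarize(tagged_rows):
    """Return precision per candidate list, per model, and overall.

    tagged_rows is an iterable of (model, list_name, is_valid) tuples;
    this layout is hypothetical and chosen only for the example.
    """
    per_list = defaultdict(lambda: [0, 0])   # [valid count, invalid count]
    per_model = defaultdict(lambda: [0, 0])
    overall = [0, 0]
    for model, list_name, is_valid in tagged_rows:
        idx = 0 if is_valid else 1
        per_list[(model, list_name)][idx] += 1
        per_model[model][idx] += 1
        overall[idx] += 1
    return (
        {key: precision(*counts) for key, counts in per_list.items()},
        {key: precision(*counts) for key, counts in per_model.items()},
        precision(*overall),
    )
```

Keeping the three levels in one pass means every candidate is counted once and the per-list, per-model, and overall figures always agree with each other.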

  • Remove candidates from the raw generated LMW candidate list that are already in the Lexicon (inflVars.data)
  • Auto-tag (|AUTO_N) or remove candidates in the raw generated LMW candidate list that were previously tagged as invalid LMWs (notBaseForm.data and notLmw.data)
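
The filtering and auto-tagging described above can be sketched as follows. This is a hedged illustration under stated assumptions: the files are taken to be pipe-delimited with the term in the first field, matching is case-insensitive, and the function names and the drop_invalid flag are hypothetical. Only the |AUTO_N tag comes from the description above.

```python
def load_terms(lines):
    """Collect the first pipe-delimited field of each non-empty line.

    Assumption: the term is the first field and matching is case-insensitive.
    """
    terms = set()
    for line in lines:
        line = line.strip()
        if line:
            terms.add(line.split("|", 1)[0].strip().lower())
    return terms


def filter_candidates(candidates, lexicon_terms, invalid_terms, drop_invalid=False):
    """Remove candidates already in the Lexicon; tag or drop known invalid ones."""
    kept = []
    for cand in candidates:
        term = cand.strip()
        if not term:
            continue
        key = term.split("|", 1)[0].strip().lower()
        if key in lexicon_terms:
            continue  # already in the Lexicon, nothing left to review
        if key in invalid_terms:
            if not drop_invalid:
                kept.append(f"{term}|AUTO_N")  # previously tagged as invalid
            continue
        kept.append(term)  # still needs manual tagging
    return kept
```

For example, with a Lexicon containing "heart attack" and an invalid list containing "of the heart", the candidates "Heart attack", "of the heart", and "stress test" reduce to "of the heart|AUTO_N" and "stress test".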

III. Input Files:

  • ./0.LexiconInflVars/inflVars.data.current
    => Link to the latest InflVars.data from LexBuild daily backup

  • ./1.LexiconAbbAcrExpansion/
  • ./2.MNSMatcherParAcr/
  • ./3.DMNSMatcherCuiEndWord/
  • ./4.DMNSMatcherSpVarWc/

  • ./8.WordNet/
    => Use all completed candidate lists.

IV. Output Files:

  • prevCand.lmw.tag
  • prevCand.lmw.yes
  • prevCand.lmw.no

  • prevCand.lmw.rpt
    => The result table shown below is based on this report; results may differ slightly over time as the Lexicon is updated

V. Detail Process

Each step is listed with its description, input, output, and notes.

Step 1: Aggregate and analyze all previous LMW candidate files
  => This program analyzes the precision of each candidate list (the share of candidates that are valid LMWs)

  Input:
  • 0.LexiconInflVars/inflVars.data.current
  • 1.LexiconAbbAcrExpansion/newEuis.a[bc][br].tagged.txt.y.20NN
  • 2.MNSMatcherParAcr/acronymExp.tag.data.tag.final.tbd.20NN
  • 3.DMNSMatcherCuiEndWor/disNGram.Core.endword.new.out.gsp.20NN
  • 4.DMNSMatcherSpVarWc/*
  • 8.WordNet/* (not including *.tbd)

  Output:
  • prevCand.data
  • prevCand.data.no (invalid LMWs)
  • prevCand.data.yes (valid LMWs)
  • prevCand.data.rpt (detailed stats report)

  Notes:
  Must update:
  • candidate list if completed
  • inflVars (link to the latest inflVars from LexBuild)
  • Check the latest valid vs. invalid ratio

Step 2: Get not-BaseForm/LMW from LexCheck files
  => This program analyzes the precision of the invalid LMWs from the LexCheck files notBaseForm.data and notLmw.data

  Input:
  • 5.LexCheckNotBaseFor/notBaseForm.data.${YEAR}
  • 6.LexCheckNotLmw/notLmw.data.${YEAR}

  Output:
  • notBaseLmw.data
  • notBaseLmw.data.no (invalid LMWs)
  • notBaseLmw.data.yes (valid LMWs)
  • notBaseLmw.data.rpt (detailed stats report)

  Notes:
  Must update:
  • notBaseForm.data.${YEAR}
  • notLmw.data.${YEAR}
  • inflVars (link to the latest inflVars from LexBuild)
  • Check the latest valid vs. invalid ratio

Step 3: Combine output files from steps 1 and 2 to get the total data set

  Input:
  • ./prevCand.data
  • ./notBaseLmw.data

  Output:
  • ./totalData.data
  • ./totalData.data.yes
  • ./totalData.data.no

  Notes:
  Must run steps 1 and 2
  • Check the latest valid vs. invalid ratio
  • Can be used as tagged data for machine learning model

Step 4: Copy result files in steps 1 - 3 to ./DataLog

  Input:
  • ./notBaseLmw.*.*
  • ./prevCand.*.*
  • ./totalTerms.1_2.*.*

  Output:
  ./DataLog/${YEAR}/${YEAR}_${MM}_${DD}/
  • ./notBaseLmw.*.*
  • ./prevCand.*.*
  • ./totalTerms.1_2.*.*

  Notes:
  Must run steps 1 - 3
  • Check the latest valid vs. invalid ratio
  • Can be used as tagged data for machine learning model

Step 10: Filter and tag valid/invalid LMWs for a raw candidate file

  Input:
  • ./0.LexiconInflVars/inflVars.data.current (valid LMW file)
  • ./totalData.data.no (invalid LMW file)
  Specify:
  • inFile.data
  • outFile.data

  Output:
  • outFile.data

  Notes:
  • Must complete/update steps 1 - 3
  • Input the new candidate file (or link to ./inFile.data)
  • All generated raw candidate files should run this step

Step 20: Generate DL TtSet from valid/invalid LMWs candidate files

Step 21: Generate DL TtSet from inflVars (valid) and invalid LMWs in n-grams ..
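
Steps 3 and 4 (combining the step 1 and step 2 outputs, then copying results into a dated DataLog directory) can be sketched as follows. This is a rough illustration, not the program's actual code: the function names are hypothetical, dropping duplicate lines during the merge is an assumption, and the dated path simply follows the ./DataLog/${YEAR}/${YEAR}_${MM}_${DD}/ layout given in step 4.

```python
import shutil
from datetime import date
from pathlib import Path


def combine(inputs, output):
    """Concatenate input files into output, keeping each distinct line once.

    Assumption: duplicates across the two inputs should be dropped.
    """
    seen = set()
    with open(output, "w", encoding="utf-8") as out:
        for path in inputs:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\n")
                    if line and line not in seen:
                        seen.add(line)
                        out.write(line + "\n")
    return len(seen)


def archive(work_dir, patterns, today=None):
    """Copy files matching patterns into DataLog/YYYY/YYYY_MM_DD/ under work_dir."""
    today = today or date.today()
    dest = Path(work_dir) / "DataLog" / f"{today:%Y}" / f"{today:%Y_%m_%d}"
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for src in sorted(Path(work_dir).glob(pattern)):
            if src.is_file():
                shutil.copy2(src, dest / src.name)  # preserves timestamps
                copied.append(src.name)
    return dest, copied
```

A call such as combine(["prevCand.data", "notBaseLmw.data"], "totalData.data") followed by archive(".", ["notBaseLmw.*.*", "prevCand.*.*"]) mirrors the order required by the notes: step 4 only makes sense after steps 1 - 3 have produced their files.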