The SPECIALIST Lexicon

Not Base Forms/LMWs Files

In addition to using inflVars.data from the Lexicon to find valid LMWs, we also use various files to find invalid LMWs. They are:

  • Invalid LMWs from previous candidate lists
    • We found that fewer than 2% of invalid LMWs become valid LMWs, either because of new usage (such as mellitus) or because of errors.
    • The latest file can be found at ${CANDIDATES}/prevCand.data.no
      (program: ${MULTIWORD}/bin/12.CandidateList)
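
The filtering idea above can be sketched as follows. This is a minimal illustration, not the actual 12.CandidateList program: it assumes a known-invalid file holds one term per line and that matching is case-insensitive. The function names and the sample terms are hypothetical.

```python
def load_terms(lines):
    """Return a set of normalized terms, skipping blank lines.

    Assumption: one term per line (the real file layout may differ).
    """
    return {line.strip().lower() for line in lines if line.strip()}


def split_candidates(candidates, invalid_terms):
    """Split candidates into (kept, dropped) using a set of known-invalid terms."""
    kept, dropped = [], []
    for term in candidates:
        if term.strip().lower() in invalid_terms:
            dropped.append(term)   # previously judged invalid
        else:
            kept.append(term)      # still needs evaluation
    return kept, dropped


# Placeholder data for illustration only; real entries would be read
# from a file such as prevCand.data.no.
prev_invalid = load_terms(["foo bar\n", "baz qux\n", "\n"])
kept, dropped = split_candidates(["new term", "Foo Bar", "baz qux"], prev_invalid)
```

Because a small share of previously invalid terms later become valid, the dropped list is better treated as "previously tagged invalid" than as a permanent exclusion.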

The LexCheck releases include files listing terms that are not base forms (invalid LMWs and inflections of LMWs) and terms that are not valid LMWs. These files are derived from the expansion of abbreviations or acronyms in LEXICON. They can be used to retrieve invalid LMWs. This page is a snapshot based on the data as of 01/2019.

  • Program: ${MULTIWORDS}/bin/12.CandidateList
    1
  • Data directory: ${MULTIWORDS}/data/Candidate/
  • In Files:
    • ./5.LexCheckNotBaseForm/
    • ./6.LexCheckNotLmw/
    • ./7.CandNotLmw/
  • Out Files:
    • notBaseLmw.data
    • notBaseLmw.data.yes
    • notBaseLmw.data.no
    • notBaseLmw.data.rpt
    • 5.LexCheckNotBaseForm
      • terms that are not base forms
      • can still be a valid term if it is an inflectional variant (inflVars)

      Year     Total    Valid           Invalid
      2015     6661     196 (2.94%)     6465 (97.06%)
      2016     8418     269 (3.20%)     8149 (96.80%)
      2017     8688     280 (3.22%)     8408 (96.78%)
      2018     9196     292 (3.18%)     8904 (96.82%)
      2019     9335     293 (3.14%)     9042 (96.86%)
      2020     9395     336 (3.58%)     9059 (96.42%)
      2021     9426     337 (3.58%)     9089 (96.42%)
      Accu.    9426     337 (3.58%)     9089 (96.42%)

      * These files are cumulative, so the Accu. row must match the latest release.

    • 6.LexCheckNotLmw
      • terms that are not valid LMWs
      • can become a valid LMW because of tagging errors or changes in linguistic usage

        Year     Total    Valid          Invalid
        2017     407      23 (5.65%)     384 (94.35%)
        2018     777      24 (3.09%)     753 (96.91%)
        2019     916      24 (2.62%)     892 (97.38%)
        Accu.    916      24 (2.62%)     892 (97.38%)

        * These files are cumulative, so the Accu. row must match the latest release.

    • 7.CandNotLmw
      • terms that are not valid LMWs from previous candidate list (with auto-tagged AUTO_N)

        Model               Year     Total    Valid         Invalid
        MNSMatcherParAcr    2018     778      0 (0.00%)     778 (100.00%)
        All                 Accu.    778      0 (0.00%)     778 (100.00%)

    • Out Tagged Not Base/LMW Files:
      • terms from all of the above sources that were previously evaluated

        Total    Valid           Invalid
        10083    293 (2.91%)     9790 (97.09%)
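
The figures in the tables above can be sanity-checked with a few lines of arithmetic. The sketch below copies the 5.LexCheckNotBaseForm rows by hand and checks that valid + invalid equals the total, that the Accu. row equals the latest release, and how the percentages are obtained. The helper name pct and the tuple layout are illustrative assumptions, not part of the LexCheck tooling.

```python
def pct(count, total):
    """Percentage of total, rounded to two decimals as in the tables."""
    return round(100.0 * count / total, 2)


# (year, total, valid, invalid), copied from the 5.LexCheckNotBaseForm table
rows = [
    ("2015", 6661, 196, 6465),
    ("2016", 8418, 269, 8149),
    ("2017", 8688, 280, 8408),
    ("2018", 9196, 292, 8904),
    ("2019", 9335, 293, 9042),
    ("2020", 9395, 336, 9059),
    ("2021", 9426, 337, 9089),
]
accu = ("Accu.", 9426, 337, 9089)

# Every row should partition the total into valid and invalid terms.
counts_consistent = all(v + i == t for _, t, v, i in rows + [accu])

# The files are cumulative, so Accu. should equal the latest release.
accu_matches_latest = accu[1:] == rows[-1][1:]

valid_pct_2021 = pct(337, 9426)     # 3.58
invalid_pct_2021 = pct(9089, 9426)  # 96.42
```

The same check applies to the 6.LexCheckNotLmw and 7.CandNotLmw tables by swapping in their rows.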