The SPECIALIST Lexicon

LMW Candidate Post-Processes - Results

The performance results of previously tagged LMW candidate files are written to the output file ${MULTIWORDS}/data/Candidates/DataLog/${YEAR}/${YYYY}_${MM}_${DD}/prevCand.lmw.rpt. The results shown below are a snapshot taken on completion of the latest candidate list and are based on that report. Results might differ slightly over time due to updates to the Lexicon (when valid words become invalid words and vice versa).

  • 1.LexiconAbbAcrExpansion
    • candidates derived from the expansion of abbreviations/acronyms in Lexicon release
    • includes both valid and invalid words
    • From 2020 onward, this candidate list is generated in the preprocess (${MULTIWORD}/12.LexAbbAcrCand/)

    Year   Acronym Expansions                     Abbreviation Expansions
           Total  Valid          Invalid          Total  Valid          Invalid
    2015   908    881 (97.03%)   27 (2.97%)       62     40 (64.52%)    22 (35.48%)
    2016   59     59 (100.00%)   0 (0.00%)        183    180 (98.36%)   3 (1.64%)
    2017   39     39 (100.00%)   0 (0.00%)        22     19 (86.36%)    3 (13.64%)
    2018   17     16 (94.12%)    1 (5.88%)        28     26 (92.86%)    2 (7.14%)
    2019   151    142 (94.04%)   9 (5.96%)        13     12 (92.31%)    1 (7.69%)

    Year   Total  Valid          Invalid
    2020   148    112 (75.68%)   36 (24.32%)
    2021   158    129 (81.65%)   29 (18.35%)
    2022   94     53 (56.38%)    41 (43.62%)
    2023   2      2 (100.00%)    0 (0.00%)
    2024   2      1 (50.00%)     1 (50.00%)
    2025   7      3 (42.86%)     4 (57.14%)

    Accu.  Total: 1816   Valid: 1640 (90.31%)   Invalid: 176 (9.69%)

    * Some of the terms might be duplicated among years
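
The Valid and Invalid percentages in these tables are each count divided by the row total. A minimal sketch of that arithmetic follows; the function name `summarize` and the output layout are illustrative assumptions, not part of the Lexicon tools.

```python
def summarize(valid: int, invalid: int) -> str:
    """Format counts as Total, Valid (%), Invalid (%), two decimals each."""
    total = valid + invalid
    if total == 0:
        return "Total: 0"
    valid_pct = 100.0 * valid / total
    invalid_pct = 100.0 * invalid / total
    return (f"Total: {total}  Valid: {valid} ({valid_pct:.2f}%)  "
            f"Invalid: {invalid} ({invalid_pct:.2f}%)")

# Accumulated counts reported in the table above
print(summarize(1640, 176))
# Total: 1816  Valid: 1640 (90.31%)  Invalid: 176 (9.69%)
```

Because each percentage is rounded on its own, a pair of values can occasionally sum to 99.99% or 100.01%.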

  • 2.MNSMatcherParAcr
    • candidates derived from the (ACR) matcher in MNS (07.MatcherParAcr)
    • includes both valid and invalid words
    • acronymExp.tag.data.tag.final.tbd.${YEAR}
      => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesNo: candidates only, does not include AUTO_N
      => CandList/acronymExp.tag.data.tag.final.tbd.${YEAR}.used.rmYesTagNo: includes AUTO_N (sent to Linguists)

      Year   Total  Valid           Invalid         Notes
      2015   4994   3681 (73.71%)   1313 (26.29%)
      2016   360    200 (55.56%)    160 (44.44%)
      2017   1855   1317 (71.00%)   538 (29.00%)    Completed: 2018-11-15
      2018   808    604 (74.75%)    204 (25.25%)    AUTO_N is not included (see details below); Completed: 2019-01-03
      2019   1081   663 (61.33%)    418 (38.67%)    AUTO_N is not included (see details below); Completed: 2019-10-16
      2020   1061   787 (74.18%)    274 (25.82%)    AUTO_N is not included (see details below); Completed: 2020-08-18
      Accu.  9816   7060 (71.92%)   2756 (28.08%)

      * Some of the terms might be duplicated among years

    • acronymExp.tag.data.tag.final.tbd.${YEAR}.rmYesTagNo: includes AUTO_N
      => AUTO_N: monitor and calculate how many AUTO_N candidates become valid LMWs. This feature shows the consistency of tagging.

      Year   Total  Valid           Invalid          Notes
      2018   557    39 (7.00%)      518 (93.00%)     7.00% became valid
      2019   2533   236 (9.32%)     2297 (90.68%)    9.32% became valid
      2020   2771   58 (2.09%)      2713 (97.91%)    2.09% became valid; consistent: small percentage.

  • 3.DMNSMatcherCuiEndWord
    • candidates derived from the CUI and Endword matchers in DMNS
    • includes both valid and invalid words
    • Use the precision of the last file (> 80%) and the number of candidates in the current file (36....rmYesNo: ~1000) to decide the number of top endWords
    • 36.disNGram.Core.endword.out.rmYesNo.gsp.${YEAR}

      Year   Total  Valid            Invalid         Notes
      2016   6370   5725 (89.87%)    645 (10.13%)    Top 33 endwords
      2017   1945   1764 (90.69%)    181 (9.31%)     Top 43 endwords; AUTO_N is not included (detailed below); Completed: 2019-05-20
      2018   819    703 (85.84%)     116 (14.16%)    Top 51 endwords; AUTO_N is not included (detailed below); Completed: 2019-08-02
      2019   2918   2588 (88.69%)    330 (11.31%)    Top 57 endwords; AUTO_N is not included (detailed below); Completed: 2020-06-12
      2020   2846   2489 (87.46%)    357 (12.54%)    Top 80 endwords; AUTO_N is not included (detailed below); Completed: 2021-03-01
      Accu.  14898  13269 (89.07%)   1629 (10.93%)

      * Some of the terms might be duplicated among years
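
The rule for choosing the number of top endWords (precision of the last file above 80%, about 1000 candidates in the current file) is stated only briefly. The sketch below shows one possible reading of it, not the actual DMNS procedure. The record type `EndWordStat`, its field names, the function `pick_top_endwords`, and the sample endwords and counts are all hypothetical.

```python
from typing import List, NamedTuple

class EndWordStat(NamedTuple):
    # Hypothetical record; field names are assumptions, not from the source.
    endword: str
    prev_valid: int   # valid candidates with this endword in the last tagged file
    prev_total: int   # all tagged candidates with this endword in the last file
    cur_count: int    # candidates with this endword in the current file

def pick_top_endwords(ranked: List[EndWordStat],
                      min_precision: float = 0.80,
                      target_size: int = 1000) -> int:
    """Return how many top-ranked endwords to keep.

    Walk the ranked list and stop before the accumulated precision on the
    last file falls to min_precision or below, or as soon as the current
    file supplies about target_size candidates.
    """
    valid = total = size = kept = 0
    for stat in ranked:
        valid += stat.prev_valid
        total += stat.prev_total
        if total and valid / total <= min_precision:
            break
        kept += 1
        size += stat.cur_count
        if size >= target_size:
            break
    return kept

# Made-up numbers, for illustration only
ranked = [
    EndWordStat("syndrome", 95, 100, 400),
    EndWordStat("disease", 90, 100, 350),
    EndWordStat("test", 60, 100, 300),
    EndWordStat("group", 20, 100, 500),
]
print(pick_top_endwords(ranked))  # 3
```

With these made-up counts the third endword keeps the accumulated precision above 80% and pushes the candidate count past 1000, so three endwords are kept.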

    • 36.disNGram.Core.endword.out.rmYesTagNo.gsp.${YEAR}
      => AUTO_N: monitor and calculate how many AUTO_N candidates become valid LMWs

      Year   Total  Valid           Invalid          Notes
      2017   1034   393 (38.01%)    641 (61.99%)     38.01% become valid; Main reason is some candidates were not tagged
      2018   953    133 (13.96%)    820 (86.04%)     13.96% become valid; Clean up
      2019   984    50 (5.08%)      934 (94.92%)     5.08% become valid; consistent: small percentage
      2020   1291   24 (1.86%)      1267 (98.14%)    1.86% become valid; consistent: small percentage

  • 4.DMNSMatcherSpVarWc
    • candidates derived from the SpVar and Frequency matchers in DMNS
    • includes both valid and invalid words

      Year   Word Count  Total  Valid           Invalid         Accu. P
      2015   1000000     3368   2397 (71.17%)   971 (28.83%)    71.17%
             100000      2218   1520 (68.53%)   698 (31.47%)    70.12%
             10000       895    605 (67.60%)    290 (32.40%)    69.77%
             1000        588    249 (42.35%)    339 (57.65%)    67.49%
             100         538    119 (22.12%)    419 (77.88%)    64.33%
      Accu.  Accu.       7602   4890 (64.33%)   2712 (35.67%)   64.33%

      * This model is not run because it is time-consuming and resources are limited
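
The Accu. P column appears to be a running precision: valid candidates accumulated so far divided by all candidates accumulated so far. The sketch below checks that reading against the first four rows of the table; the function name `accumulated_precision` is an assumption introduced here.

```python
from itertools import accumulate

def accumulated_precision(rows):
    """Return the running precision (%) after each (valid, invalid) row."""
    cum_valid = accumulate(valid for valid, _ in rows)
    cum_total = accumulate(valid + invalid for valid, invalid in rows)
    return [round(100 * v / t, 2) for v, t in zip(cum_valid, cum_total)]

# (valid, invalid) counts for word counts 1000000, 100000, 10000 and 1000,
# copied from the 2015 table above
rows = [(2397, 971), (1520, 698), (605, 290), (249, 339)]
print(accumulated_precision(rows))  # [71.17, 70.12, 69.77, 67.49]
```

The word count 100 row is left out on purpose. Adding its counts brings the running total to 7607, while the Accu. row reports 7602, so this sketch gives 64.28% there rather than the reported 64.33%.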

  • 8.WordNet
    • candidates derived from the derivations, synonyms, antonyms in WordNet 3.0
    • includes both valid and invalid words
    • unique lowercased terms are used (the input has both cases), so the total number is smaller than the actual number of input terms

      Models          Total  Valid           Invalid         Notes
      zeroD, CUI      322    322 (100.00%)   0 (0.00%)       WordNetCand.ZD.cui.2021
      zeroD, no CUI   626    601 (96.01%)    25 (3.99%)      WordNetCand.ZD.noCui.2021
      aPairs          1912   1413 (73.90%)   499 (26.10%)    WordNetCand.AP.2021
      suffixD         3654   3428 (93.81%)   226 (6.19%)     WordNetCand.SD.2021
      Accu.           6508   5758 (88.48%)   750 (11.52%)
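
The note about unique lowercase terms explains why the totals are smaller than the raw input size. A minimal sketch of that kind of case-insensitive deduplication follows; the function name `unique_lowercase` and the sample terms are made up for illustration.

```python
def unique_lowercase(terms):
    """Lowercase every term and drop duplicates, keeping first-seen order."""
    seen = set()
    unique = []
    for term in terms:
        key = term.lower()
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique

# Made-up input: three raw terms collapse to two unique lowercase terms
terms = ["Blood test", "blood test", "Heart rate"]
print(len(terms), len(unique_lowercase(terms)))  # 3 2
```
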
  • History Logs of prevCand.lmw.rpt

    Date        Total  Valid            Invalid         Notes - completed candList
    2018-11-15  21955  16096 (73.31%)   5859 (26.69%)   2.MNSMatcherParAcr, 2017
    2019-01-03  22763  16687 (73.31%)   6076 (26.69%)   2.MNSMatcherParAcr, 2018
    2019-07-19  24856  18915 (76.10%)   5941 (23.90%)   1.LexiconAbbAcrExpansion, 2020
    2019-08-02  25675  19608 (76.37%)   6067 (23.63%)   3.DMNSMatcherCuiEndWord, 2018
    2019-10-16  26756  20429 (76.35%)   6327 (23.65%)   2.MNSMatcherParAcr, 2019
    2020-06-12  29674  23041 (77.65%)   6633 (22.35%)   3.DMNSMatcherCuiEndWord, 2019
    2020-07-17  29832  23192 (77.74%)   6640 (22.26%)   1.LexiconAbbAcrExpansion, 2021
    2020-08-18  30892  23999 (77.69%)   6893 (22.32%)   2.MNSMatcherParAcr, 2020
    2021-03-01  33737  26512 (78.58%)   7225 (21.42%)   3.DMNSMatcherCuiEndWord, 2020
    2021-07-13  33831  26571 (78.54%)   7260 (21.46%)   1.LexiconAbbAcrExpansion, 2022
    2022-01-10  34128  26868 (78.73%)   7260 (21.27%)   8.WordNetCand.ZD.cui.2021
    2022-01-10  34754  27466 (79.03%)   7288 (20.97%)   8.WordNetCand.ZD.noCui.2021
    2022-07-06  34756  27471 (79.04%)   7285 (20.96%)   1.LexiconAbbAcrExpansion, 2023
    2022-09-27  36649  28865 (78.76%)   7784 (21.24%)   8.WordNetCand.AP.2021
    2024-07-10  40307  32298 (80.13%)   8009 (19.87%)   8.WordNetCand.SD.2021
                                                        1.LexiconAbbAcrExpansion, 2024 & 2025
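
The history log can be read as a series of cumulative snapshots. A minimal sketch, assuming the Total column counts all tagged candidates up to each date, computes how much each snapshot grew over the previous one. The function name `growth` is an assumption introduced here.

```python
# (date, total) pairs copied from the first four rows of the history log above
snapshots = [
    ("2018-11-15", 21955),
    ("2019-01-03", 22763),
    ("2019-07-19", 24856),
    ("2019-08-02", 25675),
]

def growth(snaps):
    """Return (date, increase in Total since the previous snapshot)."""
    return [(cur[0], cur[1] - prev[1]) for prev, cur in zip(snaps, snaps[1:])]

for date, added in growth(snapshots):
    print(date, added)
# 2019-01-03 808
# 2019-07-19 2093
# 2019-08-02 819
```

The increases of 808 and 819 equal the totals reported above for 2.MNSMatcherParAcr 2018 and 3.DMNSMatcherCuiEndWord 2018. An increase can also cover more than one candidate list when several were completed between two snapshots.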