SPECIALIST Lexicon

LMW Candidate Post-Processes Overview

I. Objectives:

This page describes the design concept of the post-processes on LMW candidate post-processing. The post-processes are conducted when:

To calcualte the performance candidate files: executed when a LMW candidate file is competed.
To retrieve invalid LMWs: during the Lexicon annual release processes.

II. Design:

The design concept is shown in the shcematic diagram below. They are summaried as folows:

No tags [Y|N] are needed on candidate files.
Use the latest Lexicon to auto-tag valid and invalid words from a completed candidate list to decide the performance on precision
Theoretically, all valid words are in the Lexicon. That is valid LMWs are identified if they are inflectional variants from the latest Lexicon.
Candidates are not in the Lexicon are invalid words.
Invlaid LMWs includes two sources:
- From previous candidate files while they are not in the Lexicon (prevcand.data.no)
- From known (legacy) invalid base forms (notBaseLmw.data.no)
Use those invalid word to update invalid base/LMW files
Use the Invlaid base/LMW file to auto-tag [AUTO_NO] on a raw generated LMW candidates before sent to linguists.

III. Algorithm Details:

After 2018+, a systematic post-process was implemented to filtered valid and invalid LMW from candidate list:

use the latest Lexicon to tag valid|invalid LMWs
- Valid LMWs:
  - Collect all terms from the latest Lexicon (inflVars). These are valid LMWs.
  - Generate inflVars from LexBuild - postPorcess
- Invalid LMWs:
  - Get all terms from previous candidate list (without |ATUO_N tag)
  - Remove valid LMWs from above
  - The rest are invalid LMWs
  - Tag invalid LMWs from the new candidate list as [AUTO_N]
    =>These can be removed if all of them are [n] after several running through several candidate list.
Auto-tag raw LMW candidate list before sent to Linguists
- The generated raw candidate lists are auto-tagged using post-processes (00.CandidateList, step 10) to auto-tag:
  - valid LMWs (terms in the latest Lexicon inflVars.data)
  - invalid LMWs (known invalid LMWs, Total.data.no)
  - CandList.rmYesNo are new terms are not auto-tagged. They are used as candidate list (to calculate precision), as shown at the end of the process in the above diagram.
  - CandList.rmYesTagNo is the actual file sent to linguists so they can have another chance to review terms with AUTO_TAG_NO to ensure they are invalid LMWs.
  - The precision of all models for the latest Lexicon are in the prevCand.data.rpt.
Auto-tag completed LMW candidate list
- Once candidate list are completed (submitted and approved) by the linguists, the list need to go through processes in 00.CandidateList (step 1-3) to update invalid LMWs (prevCand.data.no)
  - Add completed candidate list to appropriated directory (${LMW_DIR}/data/Candidates/..)
  - Run 00.CandidateList step 1.
  - candidates are in the linked latest inflVars.data are valid LMWs (prevCand.data.yes)
  - candidates are not in the latest inflVars.data are invalid LMWs (prevCand.data.no)
    - Invalid LMWs in the candidate list are automatic updated, prevCand.data.no in the above diagram
    - All skipped candidates are considered as invalid LMWs in this process (that is why the program provides a 2nd chance for linguist to review invalid LMWs - ATUO_TAG_NO in the final tagging process .
  - CandList.rmYesNo should become empty by runing through the above process with latest Lexicon and invalid LMWs when it is completed.
    - Valid candidates is added to the Lexicon throguh LexBuild DB.
    - Invalid candidates is added to the invalid LMWs file.
- Other invalid terms (notBaseLmw.data.no) are static legacy data used before 2019-. (no update after 2020+)
- The file Total.data.no (= prevCand.data.no + notBaseLmw.data.no) are used as the latest invalid LMW collections. This file should be used in LexAccess.Files after 2020+.

IV. Processes before 2018- (not used anymore)

The process used before 2018- are described below for reference purpose.

Tagged Invalid LMW Candidates
All the invalid terms from tagged candidate list should be retrieved and save to "invalidLmwList.out". These invalid LMW terms and words in the Lexicon should be filtered out from the candidate list.
Calculate precision
- Total precision: (newYes + AutoYes)/total candidates
- New precision: (newYes)/new candidates
- total candidate = autoYes + autoNo + newYes + newNo
- autoYes (terms in Lexicon) and autoNo (terms in Invalid LMW candidates)
- newYes and newNo are manually tagged by linguist

The SPECIALIST Lexicon