Procedures for preparing Lexical Tools files
Scripts and programs generate the Lexical Tools data files automatically. They are detailed as follows:
I. Location:
II. Inputs:
III. Outputs:
V. Detailed procedures:
This script copies the original source files to the dataOrg directory ($LVG_COMPONENTS/PreDataBase/data/${YEAR}/dataOrg/).
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy inflection variables file | $Lexicon/${YEAR}/tables/inflVars.data | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data |
2 | Copy & Modify acronyms file | $Lexicon/${YEAR}/tables/LRABR | $PreDataBase/data/${YEAR}/dataOrg/acronyms, $PreDataBase/data/${YEAR}/dataOrg/acr_exp |
3 | Copy & modify proper file | $Lexicon/${YEAR}/tables/LRPRP | $PreDataBase/data/${YEAR}/dataOrg/proper |
4 | Copy nominalization file | $Lexicon/${YEAR}/tables/LRNOM | $PreDataBase/data/${YEAR}/dataOrg/LRNOM |
5 | Copy synonyms file | None (since 2017, synonyms.data is generated by its own script) | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data |
6 | Copy derivation file | None (since 2013, derivation.data is generated by its own script) | None |
7 | Copy antonym file | None (since 2022, antonyms.data is generated by its own script) | None |
8 | Run above 7 steps | see above | see above |
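The plain copy steps in the table above (steps 1 and 4) can be sketched as follows. This is a minimal illustration, not the original script: the function names `copy_table` and `copy_to_data_org` and the argument order are hypothetical, and the "copy & modify" steps are left out because the source does not describe the modifications.

```shell
#!/bin/bash
# Hypothetical helper: copy one file, failing loudly if the source is missing.
copy_table() {
    local src="$1" dst="$2"
    if [ ! -f "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" && cp -p "$src" "$dst"
}

# Hypothetical wrapper for the plain copies (steps 1 and 4 only).
# Usage: copy_to_data_org LEXICON_DIR PRE_DATABASE_DIR YEAR
copy_to_data_org() {
    local lexicon="$1" predb="$2" year="$3"
    local tables="$lexicon/$year/tables"
    local org="$predb/data/$year/dataOrg"
    copy_table "$tables/inflVars.data" "$org/inflVars.data" || return 1
    copy_table "$tables/LRNOM" "$org/LRNOM" || return 1
}
```

Stopping at the first missing source file keeps a half-populated dataOrg directory from going unnoticed in the later steps.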
This script generates the final lvg files in the data directory from the dataOrg directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy inflection variables file | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data | $PreDataBase/data/${YEAR}/data/infl.data |
2 | Copy & Modify acronyms file | $PreDataBase/data/${YEAR}/dataOrg/acronyms | $PreDataBase/data/${YEAR}/data/acronym.data |
3 | Copy proper file | $PreDataBase/data/${YEAR}/dataOrg/proper | $PreDataBase/data/${YEAR}/data/properNoun.data |
4 | Copy nominalization file | $PreDataBase/data/${YEAR}/dataOrg/LRNOM | $PreDataBase/data/${YEAR}/data/nominalization.data |
5 | Copy synonyms file | $Synonyms/data/${YEAR}/outData/Results/synonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/synonyms.data |
6 | Copy derivation files | $Derivation/5.All/data/${YEAR}/data/derivation.data | $PreDataBase/data/${YEAR}/data/derivation.data |
7 | Copy antonyms file | $Antonym/data/0.Antonym/${YEAR}/output/antonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/antonyms.data |
8 | Run above 7 steps | see above | see above |
This script copies/moves the final lvg files from the data directory to the ${LVG_DIR}/data/tables directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy infl.data file | $PreDataBase/data/${YEAR}/data/infl.data | ${LVG_DIR}/data/tables/infl.data |
2 | Copy acronym.data file | $PreDataBase/data/${YEAR}/data/acronym.data | ${LVG_DIR}/data/tables/acronym.data |
3 | Copy properNoun.data file | $PreDataBase/data/${YEAR}/data/properNoun.data | ${LVG_DIR}/data/tables/properNoun.data |
4 | Copy nominalization.data file | $PreDataBase/data/${YEAR}/data/nominalization.data | ${LVG_DIR}/data/tables/nominalization.data |
5 | Copy synonyms.data file | $PreDataBase/data/${YEAR}/data/synonyms.data | ${LVG_DIR}/data/tables/synonyms.data |
6 | Copy derivation.data files | $PreDataBase/data/${YEAR}/data/derivation.data | ${LVG_DIR}/data/tables/derivation.data |
7 | Copy antonyms.data files | $PreDataBase/data/${YEAR}/data/antonyms.data | ${LVG_DIR}/data/tables/antonyms.data |
8 | Run above 7 steps | see above | see above |
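After the copy, it can be useful to confirm that all seven files from the table above arrived and are non-empty. The sketch below is an assumption about how such a check might look; `check_tables` is a hypothetical name and is not part of the original scripts.

```shell
#!/bin/bash
# Hypothetical check: report any of the seven expected files that is
# missing or empty in the given tables directory.
# Usage: check_tables TABLES_DIR   (e.g. "${LVG_DIR}/data/tables")
check_tables() {
    local dir="$1" missing=0 f
    for f in infl.data acronym.data properNoun.data nominalization.data \
             synonyms.data derivation.data antonyms.data; do
        if [ ! -s "$dir/$f" ]; then
            echo "missing or empty: $f"
            missing=1
        fi
    done
    return "$missing"
}
```

The function returns a non-zero status when any file is reported, so it can gate the next step in a wrapper script.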
Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.
Steps | Notes | Source | Table |
---|---|---|---|
1 | AnalyzeInflection | ${LVG_DIR}/data/tables/infl.data | Inflection |
2 | AnalyzeAcronym | ${LVG_DIR}/data/tables/acronym.data | Acronym |
3 | AnalyzeProperNoun | ${LVG_DIR}/data/tables/properNoun.data | ProperNoun |
4 | AnalyzeNominalization | ${LVG_DIR}/data/tables/nominalization.data | Nominalization |
5 | AnalyzeSynonym | ${LVG_DIR}/data/tables/synonyms.data | LexSynonym |
6 | AnalyzeDerivation | ${LVG_DIR}/data/tables/derivation.data | Derivation |
7 | AnalyzeAntonym | ${LVG_DIR}/data/tables/antonyms.data | LexAntonym |
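The idea behind the analysis programs above, finding the maximum length of each field, can be approximated with awk. This is only a sketch and not the Analyze* programs themselves: it assumes the data files are pipe-delimited, and `max_field_lengths` is a hypothetical name. Whether `length` counts bytes or characters depends on the awk implementation and locale, so results for non-ASCII data should be compared with care.

```shell
#!/bin/bash
# Hypothetical sketch: print "<field number>|<max length>" for each field
# of a pipe-delimited file (the delimiter is an assumption).
# Usage: max_field_lengths FILE
max_field_lengths() {
    awk -F'|' '
        {
            # Track the longest value seen so far in each column.
            for (i = 1; i <= NF; i++)
                if (length($i) > max[i]) max[i] = length($i)
            if (NF > n) n = NF
        }
        END {
            for (i = 1; i <= n; i++) printf "%d|%d\n", i, max[i]
        }
    ' "$1"
}
```

The printed maxima can then be compared by hand with the column sizes declared for the corresponding database table.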
shell> cd ${META_DIR}/bin
shell> 2.GetAtoms
${PREV_YEAR}AA
1
2
3
shell> cd ${LHC_GIT}/lvg_canongenerator/bin
shell> 0.ModifyAtoms ${YEAR}
Steps | Notes | Source | Target |
---|---|---|---|
1 | Prepare directories and files | $META/data/${PRE_YEAR}AA/outputs/atoms.data | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org |
2 | Get ENG entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG |
3 | Get SPA entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA |
4 | Generate atoms.data file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG | $CANON_GEN/data/${YEAR}/dataOrg/atoms.data |
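Steps 2 and 3 above select entries by language. The source does not describe the layout of atoms.org, so the sketch below assumes a pipe-delimited file and takes the position of the language field as a parameter; `filter_by_language` is a hypothetical name.

```shell
#!/bin/bash
# Hypothetical sketch: keep only the lines whose language field equals LANG.
# The pipe delimiter and the field position are assumptions, not facts
# about the real atoms.org file.
# Usage: filter_by_language FILE LANG FIELD_NUMBER
filter_by_language() {
    local file="$1" lang="$2" field="$3"
    awk -F'|' -v lang="$lang" -v col="$field" '$col == lang' "$file"
}
```

For example, `filter_by_language atoms.org ENG 2 > atoms.org.ENG` would write the English entries if the language code were in the second field.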
--------------------------------------
Which Program ?
--------------------------------------
1) Generate terms list
2) Generate words list
3) Generate unique words list
4) Generate base forms list
5) Generate unique base forms list
6) Generate canoncal forms
7) Check non-ASCII canon
8) All (default)
9) Generate canoncal forms from test
----------
8
Steps | Notes | Source | Target |
---|---|---|---|
0 | Prepare directories and files | | |
1 | Get terms list | | $CANON_GEN/data/${YEAR}/data/termList.data |
2 | Get words list | $CANON_GEN/data/${YEAR}/data/termList.data | $CANON_GEN/data/${YEAR}/data/wordList.data |
3 | Sort and unify words list | $CANON_GEN/data/${YEAR}/data/wordList.data | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data |
4 | Get base forms of unique words list | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data | $CANON_GEN/data/${YEAR}/data/baseList.data |
5 | Generate unique base forms list | $CANON_GEN/data/${YEAR}/data/baseList.data | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data |
6 | Generate canonical forms. Use ${LVG_DIR}/lib/jdbcDrivers/HSqlDb/hsqldb.jar. Make sure the size of "varchar(110)" is big enough; if not, it will show as SQLException: data exception ..., then modify the source code to fit. | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data | $CANON_GEN/data/${YEAR}/data/canonical.data |
7 | Check/modify non-ASCII in canonical forms | $CANON_GEN/data/${YEAR}/data/canonical.data | |
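Steps 2 and 3 in the table above (get the words list, then sort and unify it) can be illustrated with standard tools. This is a sketch under an assumption: words are taken to be whitespace-separated tokens, which may differ from the tokenization the real program uses. The function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch of step 2: read terms on stdin, write one word per line.
# Splitting on whitespace is an assumption about the tokenization.
words_from_terms() {
    tr -s '[:space:]' '\n' | sed '/^$/d'
}

# Hypothetical sketch of step 3: sort the words and drop duplicates.
# LC_ALL=C gives a byte-wise ordering that is stable across machines.
unique_sorted() {
    LC_ALL=C sort -u
}
```

A pipeline such as `words_from_terms < termList.data | unique_sorted > uniqueWordList.data` would chain the two steps.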
This must run on lexdev, which has a large amount of memory. Other machines (lexdev01) take more than 1 day, which is too slow.
Lvg Release | Processes | Computer | HSqlDb version | Run-time | Canonical size |
---|---|---|---|---|---|
2010 | Step-6 | lexdev | HSqlDb.2.0.0.0 | ~60 min. | 1,173,712 |
2012 | Step-6 | lexdev | HSqlDb.2.2.5 | ~140 min. | 1,395,720 |
2015 | Step-6 | lexdev | HSqlDb.2.3.2 | ~40 min. | 1,744,398 |
2017 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~30 min. | 1,921,878 |
2018 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~38 min. | 2,044,325 |
2019 | Step-6 | lexdev | HSqlDb.2.4.1 | ~45 min. | 2,163,739 |
2020 | Step-6 | lexdev | HSqlDb.2.5.0 | ~50 min. | 2,241,434 |
2021 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,332,096 |
2022 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,410,325 |
2023 | Step-6 | lexdev | HSqlDb.2.7.0 | ~50 min. | 2,476,202 |
2024 | Step-6 | lexdev | HSqlDb.2.7.2 | ~40 min. | 1,897,973 |
2025 | Step-6 | lexdev | HSqlDb.2.7.3 | ~30 min. | 1,931,674 |
Update the variable ${LVG_DIR} in ${LVG_DEV_DIR}/data/config/lvg.properties to AUTO_MODE.
Generate lvg files from lvg
The lvg used is in ${LHC_GIT}/lvg-p
Make sure the variable ${LVG_DIR} in the lvg config file, lvg.properties, uses the full path of lvg (not AUTO_MODE).
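The LVG_DIR setting can be checked before the generation script is run. The sketch below assumes that lvg.properties contains a line of the form `LVG_DIR=<value>`; the function names `lvg_dir_setting` and `require_full_path` are hypothetical.

```shell
#!/bin/bash
# Hypothetical sketch: print the value assigned to LVG_DIR in a properties
# file (assumes a line of the form LVG_DIR=<value>; the last one wins).
lvg_dir_setting() {
    sed -n 's/^LVG_DIR=//p' "$1" | tail -n 1
}

# Hypothetical check: succeed only if LVG_DIR is an absolute path,
# which rules out AUTO_MODE.
require_full_path() {
    case "$(lvg_dir_setting "$1")" in
        /*) return 0 ;;
        *)  echo "LVG_DIR is not a full path in $1" >&2
            return 1 ;;
    esac
}
```

Calling `require_full_path` on the lvg.properties file in use stops early when the setting is still AUTO_MODE.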
shell> cd ${LVG_PRE_DATABASE}/bin
shell> 5.Generate2Files <year>
Steps | Notes | Source | Target | Run Time |
---|---|---|---|---|
1 | Generate fruitful variants | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/fruitful.data | 2 hr. |
2 | Generate AntiNorm | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/antiNorm.data | 1 hr. |
3 | Copy canonical data | $CanonGenerator/data/${YEAR}/data/canonical.data | $PreDataBase/data/${YEAR}/data/canonical.data | 2 hr. |
Note: GenerateAntiNorm must be recompiled with the new lvg${YEAR}dist.jar.
This script copies/moves the final lvg-generated files from the data directory to the ${LVG_DIR}/data/tables directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy fruitful.data file | $PreDataBase/data/${YEAR}/data/fruitful.data | ${LVG_DIR}/data/tables/fruitful.data |
2 | Copy antiNorm.data file | $PreDataBase/data/${YEAR}/data/antiNorm.data | ${LVG_DIR}/data/tables/antiNorm.data |
3 | Copy canonical.data file | $PreDataBase/data/${YEAR}/data/canonical.data | ${LVG_DIR}/data/tables/canonical.data |
Analyze the files to find the maximum length of each field, then check the results against the database design for each field of each table.
Steps | Notes | Source | Table |
---|---|---|---|
1 | AnalyzeFruitful | ${LVG_DIR}/data/tables/fruitful.data | Fruitful |
2 | AnalyzeAntiNorm | ${LVG_DIR}/data/tables/antiNorm.data | AntiNorm |
3 | AnalyzeCanon | ${LVG_DIR}/data/tables/canonical.data | Canonical |