Procedures for preparing Lexical Tools files
Scripts and programs generate the Lexical Tools data files automatically. The details are as follows:
I. Location:
II. Inputs:
III. Outputs:
V. Detailed procedures:
This script copies the original source files to the dataOrg directory ($LVG_COMPONENTS/PreDataBase/data/${YEAR}/dataOrg/).
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy inflection variables file | $Lexicon/${YEAR}/tables/inflVars.data | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data |
2 | Copy & Modify acronyms file | $Lexicon/${YEAR}/tables/LRABR | $PreDataBase/data/${YEAR}/dataOrg/acronyms and $PreDataBase/data/${YEAR}/dataOrg/acr_exp |
3 | Copy & modify proper file | $Lexicon/${YEAR}/tables/LRPRP | $PreDataBase/data/${YEAR}/dataOrg/proper |
4 | Copy nominalization file | $Lexicon/${YEAR}/tables/LRNOM | $PreDataBase/data/${YEAR}/dataOrg/LRNOM |
5 | Copy synonyms file | None (since 2017, synonym.data is generated by its own script) | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data |
6 | Copy derivation file | None (since 2013, derivation.data is generated by its own script) | None |
7 | Copy antonym file | None (since 2022, antonym.data is generated by its own script) | None |
8 | Run above 7 steps | see above | see above |
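Each copy step in the table above follows the same pattern: check the source, make sure the target directory exists, then copy. The following is only a minimal sketch of that pattern, not the actual script; the helper name `copy_step` and the scratch directory layout are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not part of the Lexical Tools scripts): copy one
# source file to its target, creating the target directory if needed and
# failing loudly when the source file is missing.
copy_step() {
    src=$1
    dst=$2
    if [ ! -f "$src" ]; then
        echo "missing source: $src" >&2
        return 1
    fi
    mkdir -p "$(dirname "$dst")" && cp "$src" "$dst"
}

# Demo in a scratch directory that stands in for $Lexicon and $PreDataBase.
work=$(mktemp -d)
mkdir -p "$work/Lexicon/2025/tables"
printf 'sample line\n' > "$work/Lexicon/2025/tables/inflVars.data"
copy_step "$work/Lexicon/2025/tables/inflVars.data" \
          "$work/PreDataBase/data/2025/dataOrg/inflVars.data"
```

Returning a non-zero status on a missing source lets a driver script stop before later steps run on incomplete data.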
This script generates the final lvg files in the data directory from the files in the dataOrg directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy inflection variables file | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data | $PreDataBase/data/${YEAR}/data/infl.data |
2 | Copy & Modify acronyms file | $PreDataBase/data/${YEAR}/dataOrg/acronyms | $PreDataBase/data/${YEAR}/data/acronym.data |
3 | Copy proper file | $PreDataBase/data/${YEAR}/dataOrg/proper | $PreDataBase/data/${YEAR}/data/properNoun.data |
4 | Copy nominalization file | $PreDataBase/data/${YEAR}/dataOrg/LRNOM | $PreDataBase/data/${YEAR}/data/nominalization.data |
5 | Copy synonyms file | $Synonyms/data/${YEAR}/outData/Results/synonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/synonyms.data |
6 | Copy derivation files | $Derivation/5.All/data/${YEAR}/data/derivation.data | $PreDataBase/data/${YEAR}/data/derivation.data |
7 | Copy antonyms file | $Antonym/data/0.Antonym/${YEAR}/output/antonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/antonyms.data |
8 | Run above 7 steps | see above | see above |
This script copies/moves the final lvg files from the data directory to the ${LVG}/data/tables directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy infl.data file | $PreDataBase/data/${YEAR}/data/infl.data | ${LVG_DIR}/data/tables/infl.data |
2 | Copy acronym.data file | $PreDataBase/data/${YEAR}/data/acronym.data | ${LVG_DIR}/data/tables/acronym.data |
3 | Copy properNoun.data file | $PreDataBase/data/${YEAR}/data/properNoun.data | ${LVG_DIR}/data/tables/properNoun.data |
4 | Copy nominalization.data file | $PreDataBase/data/${YEAR}/data/nominalization.data | ${LVG_DIR}/data/tables/nominalization.data |
5 | Copy synonyms.data file | $PreDataBase/data/${YEAR}/data/synonyms.data | ${LVG_DIR}/data/tables/synonyms.data |
6 | Copy derivation.data files | $PreDataBase/data/${YEAR}/data/derivation.data | ${LVG_DIR}/data/tables/derivation.data |
7 | Copy antonyms.data files | $PreDataBase/data/${YEAR}/data/antonyms.data | ${LVG_DIR}/data/tables/antonyms.data |
8 | Run above 7 steps | see above | see above |
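After the seven files are copied, it can be useful to confirm that each target is identical to its source. The sketch below does this with `cmp`; the function name `verify_copies` is hypothetical, and only the seven file names are taken from the table above.

```shell
#!/bin/sh
# Hypothetical check (not part of the Lexical Tools scripts): compare each of
# the seven table files in the source directory with its copy in the target
# directory, and report any file that is missing or differs.
verify_copies() {
    src_dir=$1
    dst_dir=$2
    status=0
    for f in infl.data acronym.data properNoun.data nominalization.data \
             synonyms.data derivation.data antonyms.data; do
        if ! cmp -s "$src_dir/$f" "$dst_dir/$f"; then
            echo "mismatch or missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Example call, assuming the variables used in the table are set:
# verify_copies "$PreDataBase/data/$YEAR/data" "$LVG_DIR/data/tables"
```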
Analyze the files to find the maximum length of each field, then check these lengths against the field sizes in the database design for each table.
Steps | Notes | Source | Table |
---|---|---|---|
1 | AnalyzeInflection | ${LVG_DIR}/data/tables/infl.data | Inflection |
2 | AnalyzeAcronym | ${LVG_DIR}/data/tables/acronym.data | Acronym |
3 | AnalyzeProperNoun | ${LVG_DIR}/data/tables/properNoun.data | ProperNoun |
4 | AnalyzeNominalization | ${LVG_DIR}/data/tables/nominalization.data | Nominalization |
5 | AnalyzeSynonym | ${LVG_DIR}/data/tables/synonyms.data | LexSynonym |
6 | AnalyzeDerivation | ${LVG_DIR}/data/tables/derivation.data | Derivation |
7 | AnalyzeAntonym | ${LVG_DIR}/data/tables/antonyms.data | LexAntonym |
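The Analyze programs themselves are not shown here. As a rough illustration of the idea, the sketch below reports the maximum length of each field in a data file, assuming the fields are separated by `|`; the function name `max_field_lengths` and the sample rows are made up for the example.

```shell
#!/bin/sh
# Hypothetical sketch: print the maximum length of each '|'-separated field.
# Lengths are counted by awk, so non-ASCII characters may be counted as
# bytes or characters depending on the awk implementation and locale.
max_field_lengths() {
    awk -F'|' '
        {
            for (i = 1; i <= NF; i++)
                if (length($i) > max[i]) max[i] = length($i)
            if (NF > n) n = NF
        }
        END { for (i = 1; i <= n; i++) printf "field %d: %d\n", i, max[i] }
    ' "$1"
}

# Demo with two made-up rows.
sample=$(mktemp)
printf 'saw|noun|1|saw\nsawing|verb|16|saw\n' > "$sample"
max_field_lengths "$sample"
# field 1: 6
# field 2: 4
# field 3: 2
# field 4: 3
```

Each reported length can then be compared by hand with the column size declared for the matching table.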
shell> cd ${META_DIR}/bin
shell> 2.GetAtoms
${PREV_YEAR}AA
1
2
3
shell> cd ${LHC_GIT}/lvg_canongenerator/bin
shell> 0.ModifyAtoms ${YEAR}
Steps | Notes | Source | Target |
---|---|---|---|
1 | Prepare directories and files | $META/data/${PRE_YEAR}AA/outputs/atoms.data | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org |
2 | Get ENG entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG |
3 | Get SPA entry from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA |
4 | Generate atoms.data file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG | $CANON_GEN/data/${YEAR}/dataOrg/atoms.data |
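Steps 2 and 3 split atoms.org by language. The layout of atoms.org is not described here, so the sketch below simply assumes `|`-separated rows with the language code in a known column; the function name `filter_lang`, the column position, and the sample rows are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: keep only the rows whose language column equals the
# requested code. Arguments: file, language code, 1-based column number.
filter_lang() {
    awk -F'|' -v lang="$2" -v col="$3" '$col == lang' "$1"
}

# Demo with made-up rows; the language code is assumed to be in column 2.
atoms=$(mktemp)
printf 'A1|ENG|heart attack\nA2|SPA|ataque cardiaco\nA3|ENG|fever\n' > "$atoms"
filter_lang "$atoms" ENG 2 > "$atoms.ENG"
filter_lang "$atoms" SPA 2 > "$atoms.SPA"
```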
--------------------------------------
Which Program ?
--------------------------------------
1) Generate terms list
2) Generate words list
3) Generate unique words list
4) Generate base forms list
5) Generate unique base forms list
6) Generate canoncal forms
7) Check non-ASCII canon
8) All (default)
9) Generate canoncal forms from test
----------
8
Steps | Notes | Source | Target |
---|---|---|---|
0 | Prepare directories and files | | |
1 | Get terms list | | $CANON_GEN/data/${YEAR}/data/termList.data |
2 | Get words list | $CANON_GEN/data/${YEAR}/data/termList.data | $CANON_GEN/data/${YEAR}/data/wordList.data |
3 | Sort and unify words list | $CANON_GEN/data/${YEAR}/data/wordList.data | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data |
4 | Get base forms of unique words list | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data | $CANON_GEN/data/${YEAR}/data/baseList.data |
5 | Generate unique base forms list | $CANON_GEN/data/${YEAR}/data/baseList.data | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data |
6 | Generate canonical forms. Use ${LVG_DIR}/lib/jdbcDrivers/HSqlDb/hsqldb.jar. Make sure the size of "varchar(110)" is big enough; if it is not, the run fails with "SQLException: data exception ...", and the source code must be modified to fit. | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data | $CANON_GEN/data/${YEAR}/data/canonical.data |
7 | Check/modify non-ASCII in canonical forms | $CANON_GEN/data/${YEAR}/data/canonical.data | |
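Steps 2 and 3 above turn a list of terms into a sorted list of unique words. The sketch below illustrates that idea under a simplifying assumption: words are split on whitespace only, which may differ from the tokenization the real program uses. The function name `unique_words` and the sample terms are made up.

```shell
#!/bin/sh
# Hypothetical sketch: split each term into whitespace-separated words,
# drop empty lines, then sort and de-duplicate the result.
unique_words() {
    tr -s '[:space:]' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# Demo with two made-up terms.
terms=$(mktemp)
printf 'heart attack\nheart failure\n' > "$terms"
unique_words "$terms"
# attack
# failure
# heart
```

`LC_ALL=C` keeps the sort order independent of the machine's locale, so repeated runs give the same file.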
This must be run on lexdev, which has a large amount of memory. Other machines (lexdev01) take more than 1 day, which is too slow.
Lvg Release | Processes | Computer | HSqlDb version | Run-time | Canonical size |
---|---|---|---|---|---|
2010 | Step-6 | lexdev | HSqlDb.2.0.0.0 | ~60 min. | 1,173,712 |
2012 | Step-6 | lexdev | HSqlDb.2.2.5 | ~140 min. | 1,395,720 |
2015 | Step-6 | lexdev | HSqlDb.2.3.2 | ~40 min. | 1,744,398 |
2017 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~30 min. | 1,921,878 |
2018 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~38 min. | 2,044,325 |
2019 | Step-6 | lexdev | HSqlDb.2.4.1 | ~45 min. | 2,163,739 |
2020 | Step-6 | lexdev | HSqlDb.2.5.0 | ~50 min. | 2,241,434 |
2021 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,332,096 |
2022 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,410,325 |
2023 | Step-6 | lexdev | HSqlDb.2.7.0 | ~50 min. | 2,476,202 |
2024 | Step-6 | lexdev | HSqlDb.2.7.2 | ~40 min. | 1,897,973 |
2025 | Step-6 | lexdev | HSqlDb.2.7.3 | ~30 min. | 1,931,674 |
Set the variable ${LVG_DIR} in ${LVG_DEV_DIR}/data/config/lvg.properties to AUTO_MODE.
Generate lvg files from lvg
The lvg used is in the ${LHC_GIT}/lvg-p
Make sure the variable ${LVG_DIR} in the lvg config file, lvg.properties, is set to the full path of lvg (not AUTO_MODE).
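The LVG_DIR setting can be checked before the generation script is started. The sketch below assumes that lvg.properties holds the setting on a line of the form `LVG_DIR=value`; the function name `check_lvg_dir` and the demo path are hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed only when LVG_DIR in the given properties file
# is an absolute path (so neither AUTO_MODE nor an empty value passes).
check_lvg_dir() {
    value=$(sed -n 's/^LVG_DIR=//p' "$1" | tail -n 1)
    case $value in
        /*) echo "ok: LVG_DIR=$value" ;;
        *)  echo "LVG_DIR must be a full path, found: '$value'" >&2
            return 1 ;;
    esac
}

# Demo with a made-up properties file.
props=$(mktemp)
printf 'LVG_DIR=/export/home/lvg2025/\n' > "$props"
check_lvg_dir "$props"
```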
shell> cd ${LVG_PRE_DATABASE}/bin
shell> 5.Generate2Files <year>
Steps | Notes | Source | Target | Run Time |
---|---|---|---|---|
1 | Generate fruitful variants | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/fruitful.data | 2 hr. |
2 | Generate AntiNorm | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/antiNorm.data | 1 hr. |
3 | Copy canonical data | $CanonGenerator/data/${YEAR}/data/canonical.data | $PreDataBase/data/${YEAR}/data/canonical.data | 2 hr. |
Note: GenerateAntiNorm must be recompiled with the new lvg${YEAR}dist.jar.
This script copies/moves the final lvg-generated files from the data directory to the ${LVG_DIR}/data/tables directory.
Steps | Notes | Source | Target |
---|---|---|---|
1 | Copy fruitful.data file | $PreDataBase/data/${YEAR}/data/fruitful.data | ${LVG_DIR}/data/tables/fruitful.data |
2 | Copy antiNorm.data file | $PreDataBase/data/${YEAR}/data/antiNorm.data | ${LVG_DIR}/data/tables/antiNorm.data |
3 | Copy canonical.data file | $PreDataBase/data/${YEAR}/data/canonical.data | ${LVG_DIR}/data/tables/canonical.data |
Analyze the files to find the maximum length of each field, then check these lengths against the field sizes in the database design for each table.
Steps | Notes | Source | Table |
---|---|---|---|
1 | AnalyzeFruitful | ${LVG_DIR}/data/tables/fruitful.data | Fruitful |
2 | AnalyzeAntiNorm | ${LVG_DIR}/data/tables/antiNorm.data | AntiNorm |
3 | AnalyzeCanon | ${LVG_DIR}/data/tables/canonical.data | Canonical |