
Lexical Tools

Procedures for preparing Lexical Tools files

Scripts and programs generate the Lexical Tools data files automatically. The procedure is detailed as follows:

I. Location:

  • "${LVG_COMPONENTS}/PreDataBase/bin/"

II. Inputs:

  • "${LEXICON}/data/${YEAR}/tables/"
  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/

III. Outputs:

  • "${LVG_COMPONENTS}/PreDataBase/data/${YEAR}/data/"

IV. Detailed procedures:

  • shell> 1.LoadLexiconFiles ${YEAR}

    This script copies the initial original files into the dataOrg directory ($LVG_COMPONENTS/PreDataBase/data/${YEAR}/dataOrg/). A sketch of the plain copy steps follows the table below.

    Step | Notes | Source | Target
    1 | Copy inflection variables file | $Lexicon/${YEAR}/tables/inflVars.data | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data
    2 | Copy & modify acronyms file | $Lexicon/${YEAR}/tables/LRABR | $PreDataBase/data/${YEAR}/dataOrg/acronyms, $PreDataBase/data/${YEAR}/dataOrg/acr_exp
    3 | Copy & modify proper file | $Lexicon/${YEAR}/tables/LRPRP | $PreDataBase/data/${YEAR}/dataOrg/proper
    4 | Copy nominalization file | $Lexicon/${YEAR}/tables/LRNOM | $PreDataBase/data/${YEAR}/dataOrg/LRNOM
    5 | Copy synonyms file | None (synonym.data has its own script to generate after 2017+) | $PreDataBase/data/${YEAR}/dataOrg/synonyms.data
    6 | Copy derivation file | None (derivation.data has its own script to generate after 2013+) | None
    7 | Copy antonym file | None (antonym.data has its own script to generate after 2022+) | None
    8 | Run above 7 steps | see above | see above
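
    For orientation only (not the actual script), the straight-copy steps amount to something like the following, assuming the $Lexicon and $PreDataBase variables above are set in the environment:

      # illustrative sketch of steps 1 and 4; steps 2 and 3 also modify the copied files
      ORG_DIR=$PreDataBase/data/${YEAR}/dataOrg
      cp $Lexicon/${YEAR}/tables/inflVars.data $ORG_DIR/inflVars.data
      cp $Lexicon/${YEAR}/tables/LRNOM         $ORG_DIR/LRNOM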

  • shell> 2.GenerateLexiconFiles ${YEAR}

    This script generates the final lvg files in the data directory from the files in the dataOrg directory.

    Step | Notes | Source | Target
    1 | Copy inflection variables file | $PreDataBase/data/${YEAR}/dataOrg/inflVars.data | $PreDataBase/data/${YEAR}/data/infl.data
    2 | Copy & modify acronyms file | $PreDataBase/data/${YEAR}/dataOrg/acronyms | $PreDataBase/data/${YEAR}/data/acronym.data
    3 | Copy proper file | $PreDataBase/data/${YEAR}/dataOrg/proper | $PreDataBase/data/${YEAR}/data/properNoun.data
    4 | Copy nominalization file | $PreDataBase/data/${YEAR}/dataOrg/LRNOM | $PreDataBase/data/${YEAR}/data/nominalization.data
    5 | Copy synonyms file | $Synonyms/data/${YEAR}/outData/Results/synonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/synonyms.data
    6 | Copy derivation file | $Derivation/5.All/data/${YEAR}/data/derivation.data | $PreDataBase/data/${YEAR}/data/derivation.data
    7 | Copy antonyms file | $Antonym/data/0.Antonym/${YEAR}/output/antonyms.data.${YEAR}.release | $PreDataBase/data/${YEAR}/data/antonyms.data
    8 | Run above 7 steps | see above | see above

  • shell> 3.MoveLexiconFiles ${YEAR}

    This script copies/moves the final lvg files from the data directory to the ${LVG_DIR}/data/tables directory. A sketch of the copy loop follows the table below.

    Step | Notes | Source | Target
    1 | Copy infl.data file | $PreDataBase/data/${YEAR}/data/infl.data | ${LVG_DIR}/data/tables/infl.data
    2 | Copy acronym.data file | $PreDataBase/data/${YEAR}/data/acronym.data | ${LVG_DIR}/data/tables/acronym.data
    3 | Copy properNoun.data file | $PreDataBase/data/${YEAR}/data/properNoun.data | ${LVG_DIR}/data/tables/properNoun.data
    4 | Copy nominalization.data file | $PreDataBase/data/${YEAR}/data/nominalization.data | ${LVG_DIR}/data/tables/nominalization.data
    5 | Copy synonyms.data file | $PreDataBase/data/${YEAR}/data/synonyms.data | ${LVG_DIR}/data/tables/synonyms.data
    6 | Copy derivation.data file | $PreDataBase/data/${YEAR}/data/derivation.data | ${LVG_DIR}/data/tables/derivation.data
    7 | Copy antonyms.data file | $PreDataBase/data/${YEAR}/data/antonyms.data | ${LVG_DIR}/data/tables/antonyms.data
    8 | Run above 7 steps | see above | see above
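
    A minimal sketch of the copy, assuming the seven files above already exist under $PreDataBase/data/${YEAR}/data/ (the real script may add its own checks or backups):

      # illustrative only: copy each table file and confirm the line counts match
      for f in infl acronym properNoun nominalization synonyms derivation antonyms
      do
          cp $PreDataBase/data/${YEAR}/data/${f}.data ${LVG_DIR}/data/tables/${f}.data
          wc -l $PreDataBase/data/${YEAR}/data/${f}.data ${LVG_DIR}/data/tables/${f}.data
      done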

  • shell> 4.AnalyzeLvgFiles ${YEAR}

    Analyze the files to find the maximum length of each field, then check it against the database design for each field of each table.

    Step | Notes | Source | Table
    1 | AnalyzeInflection | ${LVG_DIR}/data/tables/infl.data | Inflection
    2 | AnalyzeAcronym | ${LVG_DIR}/data/tables/acronym.data | Acronym
    3 | AnalyzeProperNoun | ${LVG_DIR}/data/tables/properNoun.data | ProperNoun
    4 | AnalyzeNominalization | ${LVG_DIR}/data/tables/nominalization.data | Nominalization
    5 | AnalyzeSynonym | ${LVG_DIR}/data/tables/synonyms.data | LexSynonym
    6 | AnalyzeDerivation | ${LVG_DIR}/data/tables/derivation.data | Derivation
    7 | AnalyzeAntonym | ${LVG_DIR}/data/tables/antonyms.data | LexAntonym

    • Check the maximum field length; if it exceeds the database design, change the source code in ${LVG_DIR}/loadDb/ to fit (a quick manual check is sketched below)
    • Also, recompile if the source code is changed
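
    A quick manual spot-check of the field lengths can also be done from the shell; a sketch, assuming the table files are bar-delimited ("|"):

      # hypothetical one-liner: report the longest value seen in each |-separated field
      awk -F'|' '{ if (NF > nf) nf = NF
                   for (i = 1; i <= NF; i++) if (length($i) > max[i]) max[i] = length($i) }
             END { for (i = 1; i <= nf; i++) print "field " i ": max length " max[i] }' \
          ${LVG_DIR}/data/tables/infl.data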

  • Load data from the Lexicon files into the Lvg database
    Load these data into the HSqlDb database (the readonly toggle is sketched after this list)
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> 2.LoadDb ${YEAR}
    • choose Db (HSqlDb)
      PS. make sure the property value "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • choose tables option 11) to load Lexicon tables (1 ~ 7)
    • Change back the property value "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
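
    The readonly toggle can also be scripted; a sketch, assuming the property sits on its own "readonly=..." line (GNU sed):

      # set writable before loading, then restore read-only afterwards (editing by hand works too)
      sed -i 's/^readonly=true/readonly=false/' ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
      # ... run 2.LoadDb ${YEAR} and load the Lexicon tables (option 11) ...
      sed -i 's/^readonly=false/readonly=true/' ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties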

  • Generate canonical data
    Generate canonical data for luiNorm
    • Make sure to reload the above files into the Db on ${LVG_DEV}
    • Make sure to recompile (ant dist) on ${LVG_DEV}
      => so that the following data will be generated by the latest lvg

    • Generate atoms.data (before 2013, it was obtained from OCCS)
      shell> cd ${META_DIR}/bin
      shell> 2.GetAtoms
      ${PREV_YEAR}AA
      1
      2
      3

      => Total difference number (must be 0): 0

    • shell> cd ${LHC_GIT}/lvg_canongenerator/bin
    • shell> 0.ModifyAtoms ${YEAR}

      Step | Notes | Source | Target
      1 | Prepare directories and files (commands below) | $META/data/${PRE_YEAR}AA/outputs/atoms.data | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org
      2 | Get ENG entries from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG
      3 | Get SPA entries from atoms.org file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA
      4 | Generate atoms.data file | $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG | $CANON_GEN/data/${YEAR}/dataOrg/atoms.data

      Commands for step 1:
      shell> cd ${CANON_GEN}/data/
      shell> mkdir ${YEAR}
      shell> cd ${YEAR}
      shell> mkdir dataOrg
      shell> mkdir data
      shell> mkdir output
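
      Steps 2 and 3 are essentially per-language filters; a rough sketch, assuming the language code (ENG/SPA) appears as its own bar-delimited field in atoms.org (the exact field position is not fixed here):

        # illustrative only: keep lines whose |-delimited fields contain ENG (or SPA) exactly
        awk -F'|' '{ for (i = 1; i <= NF; i++) if ($i == "ENG") { print; next } }' \
            $CANON_GEN/data/${YEAR}/dataOrg/atoms.org > $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.ENG
        awk -F'|' '{ for (i = 1; i <= NF; i++) if ($i == "SPA") { print; next } }' \
            $CANON_GEN/data/${YEAR}/dataOrg/atoms.org > $CANON_GEN/data/${YEAR}/dataOrg/atoms.org.SPA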

    • Update variable ${LVG_DIR} in ${LVG_DEV_DIR}/data/config/lvg.properties (can't be AUTO_MODE)
      => Must use development LVG because of the updated DB for infl.data...
    • shell> cd ${CANON_GEN}/data/
    • shell> rm -rf HSqlDb
      => It uses HSqlDb to save the base|inflVars|canonical forms
      => Move HSqlDb to HSqlDb.${YEAR} after the run is done

    • shell> 1.RunCanonAll ${YEAR}

      --------------------------------------
      Which Program ?
      --------------------------------------
      1) Generate terms list
      2) Generate words list
      3) Generate unique words list
      4) Generate base forms list
      5) Generate unique base forms list
      6) Generate canonical forms
      7) Check non-ASCII canon
      8) All (default)
      9) Generate canonical forms from test
      ----------
      8

      Step | Notes | Source | Target
      0 | Prepare directories and files | - | -
      1 | Get terms list | ${LVG_DIR}/data/tables/infl.data, $CANON_GEN/data/${YEAR}/dataOrg/atoms.data | $CANON_GEN/data/${YEAR}/data/termList.data
      2 | Get words list | $CANON_GEN/data/${YEAR}/data/termList.data | $CANON_GEN/data/${YEAR}/data/wordList.data
      3 | Sort and unify words list | $CANON_GEN/data/${YEAR}/data/wordList.data | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data
      4 | Get base forms of unique words list | $CANON_GEN/data/${YEAR}/data/uniqueWordList.data | $CANON_GEN/data/${YEAR}/data/baseList.data
      5 | Combine bases (spelling variants) from infl.vars with baseList.data; normalize non-ASCII characters; sort and unify the bases list | $CANON_GEN/data/${YEAR}/data/baseList.data | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data
      6 | Generate canonical forms (see notes below) | $CANON_GEN/data/${YEAR}/data/uniqueBaseList.data | $CANON_GEN/data/${YEAR}/data/canonical.data
      7 | Check/modify non-ASCII in canonical forms | $CANON_GEN/data/${YEAR}/data/canonical.data | $CANON_GEN/data/${YEAR}/data/notKnownUnicode.data, $CANON_GEN/data/${YEAR}/data/nonAscii.data

      Notes for step 6:
      • Use ${LVG_DIR}/lib/jdbcDrivers/HSqlDb/hsqldb.jar
      • Make sure the size of "varchar(110)" is big enough; if not, it shows "SQLException: data exception ...", and the source code must be modified to fit:
        • "base varchar(110)" in CanonDbBaseForms.CreateBaseTable( )
        • "base varchar(110)" in CanonDbCanon.CreateCanonTable( )
        • "inflection varchar(110)" in CanonDbInflection.CreateInflectionTable( )
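
      Steps 2 and 3 (terms -> words -> unique words) are essentially a split-and-sort; a minimal sketch, assuming one term per line in termList.data (the real program applies its own tokenization rules):

        # illustrative only: split terms into words, drop empty lines, then sort and unify
        # wordList.sketch and uniqueWordList.sketch are hypothetical names used only here
        tr ' ' '\n' < $CANON_GEN/data/${YEAR}/data/termList.data | sed '/^$/d' > wordList.sketch
        LC_ALL=C sort -u wordList.sketch > uniqueWordList.sketch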

    This must be run on lexdev (which has a large amount of memory). Other machines (lexdev01) take more than 1 day (too slow).

    Lvg Release | Processes | Computer | HSqlDb version | Run-time | Canonical size
    2010 | Step-6 | lexdev | HSqlDb.2.0.0.0 | ~60 min. | 1,173,712
    2012 | Step-6 | lexdev | HSqlDb.2.2.5 | ~140 min. | 1,395,720
    2015 | Step-6 | lexdev | HSqlDb.2.3.2 | ~40 min. | 1,744,398
    2017 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~30 min. | 1,921,878
    2018 | Step-6 | lexdev1 | HSqlDb.2.3.4 | ~38 min. | 2,044,325
    2019 | Step-6 | lexdev | HSqlDb.2.4.1 | ~45 min. | 2,163,739
    2020 | Step-6 | lexdev | HSqlDb.2.5.0 | ~50 min. | 2,241,434
    2021 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,332,096
    2022 | Step-6 | lexdev | HSqlDb.2.5.1 | ~45 min. | 2,410,325
    2023 | Step-6 | lexdev | HSqlDb.2.7.0 | ~50 min. | 2,476,202
    2024 | Step-6 | lexdev | HSqlDb.2.7.2 | ~40 min. | 1,897,973
    2025 | Step-6 | lexdev | HSqlDb.2.7.3 | ~30 min. | 1,931,674

    Update the variable ${LVG_DIR} in ${LVG_DEV_DIR}/data/config/lvg.properties back to AUTO_MODE

  • shell> 5.Generate2Files ${YEAR}

    Generate lvg data files by running lvg itself.
    The lvg used is in ${LHC_GIT}/lvg-p.
    Make sure the variable ${LVG_DIR} in the lvg config file (lvg.properties) uses the full path of lvg (not AUTO_MODE).
    shell> cd ${LVG_PRE_DATABASE}/bin
    shell> 5.Generate2Files <year>

    Step | Notes | Source | Target | Run Time
    1 | Generate fruitful variants | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/fruitful.data | 2 hr.
    2 | Generate AntiNorm | ${LVG_DIR}/data/tables/infl.data | $PreDataBase/data/${YEAR}/data/antiNorm.data | 1 hr.
    3 | Copy canonical data | $CanonGenerator/data/${YEAR}/data/canonical.data | $PreDataBase/data/${YEAR}/data/canonical.data | 2 hr.

    PS. GenerateAntiNorm requires recompiling with the new lvg${YEAR}dist.jar
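
    Since the steps above run for hours, it can help to run the script unattended with a log; a generic sketch (the invocation simply mirrors the step above and should be adjusted to how the script is normally launched):

      # run in the background so a dropped terminal session does not kill the job
      cd ${LVG_PRE_DATABASE}/bin
      nohup ./5.Generate2Files ${YEAR} > generate2files.${YEAR}.log 2>&1 &
      tail -f generate2files.${YEAR}.log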

  • shell> 6.Move2Files ${YEAR}

    This script copies/moves the final lvg-generated files from the data directory to the ${LVG_DIR}/data/tables directory.

    Step | Notes | Source | Target
    1 | Copy fruitful.data file | $PreDataBase/data/${YEAR}/data/fruitful.data | ${LVG_DIR}/data/tables/fruitful.data
    2 | Copy antiNorm.data file | $PreDataBase/data/${YEAR}/data/antiNorm.data | ${LVG_DIR}/data/tables/antiNorm.data
    3 | Copy canonical.data file | $PreDataBase/data/${YEAR}/data/canonical.data | ${LVG_DIR}/data/tables/canonical.data

  • shell> 7.Analyze2Files ${YEAR}

    Analyze the files to find the maximum length of each field, then check it against the database design for each field of each table.

    Step | Notes | Source | Table
    1 | AnalyzeFruitful | ${LVG_DIR}/data/tables/fruitful.data | Fruitful
    2 | AnalyzeAntiNorm | ${LVG_DIR}/data/tables/antiNorm.data | AntiNorm
    3 | AnalyzeCanon | ${LVG_DIR}/data/tables/canonical.data | Canonical

  • Load data from the 2 files into the Lvg database
    Load these data into the HSqlDb database
    • shell> cd ${LVG_DIR}/loadDb/bin
    • shell> LoadDb ${YEAR}
    • choose Db (HSqlDb & MySql)
      PS. make sure the property value "readonly=false" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties
    • choose tables option 12) to load 2 tables

    • After it is done, change "readonly=true" in ${LVG_DIR}/data/HSqlDb/lvg${YEAR}.properties