Lexical Tools

Optimizing 2024 SD-Rule Set - Baseline

I. Get the stats (yes|no) from the current year's data

  • DIR: ${SUFFIXD_DIR}
  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSuffixD ${YEAR}
    11
    ALL
  • Outputs:
    • sdRules.stats.rpt.* (sdRules.stats.rpt.pipe is used in this analysis)

II. Establish the baseline: remove all CHILD SD-Rules and use the remaining rules as the baseline

  • Create a new directory: ${SUFFIXD_DIR}/data/${YEAR}/dateR/SdRulesOptimum/00.baseline
  • shell> cd ${SUFFIXD_DIR}/data/${YEAR}/dateR/SdRulesOptimum/00.baseline
  • shell> cp -p ../../../data/sdRules.stats.rpt.pipe sdRules.stats.in.${YEAR}
  • shell> cp -p sdRules.stats.in.${YEAR} sdRules.stats.in.${YEAR}.removeChild
  • Manually comment out (#) all 24 CHILD rules, then check the number of CHILD lines:
    shell> fgrep "|CHILD" sdRules.stats.in.${YEAR}.removeChild | wc -l
  • shell> ln -sf ./sdRules.stats.in.${YEAR}.removeChild sdRules.stats.in
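
The manual commenting step above could also be scripted. The following is a minimal sketch, not part of the SuffixD tools: the function name `comment_out_child_rules` is hypothetical, and it assumes that every CHILD rule line contains the literal string `|CHILD` (the same pattern the fgrep check uses) and that a leading `#` marks a comment.

```shell
# Hypothetical helper (not part of the SuffixD tools): print the given stats
# file with "#" prepended to every line that contains "|CHILD" and is not
# already commented out. All other lines are printed unchanged.
comment_out_child_rules() {
  awk 'index($0, "|CHILD") > 0 && substr($0, 1, 1) != "#" { print "#" $0; next }
       { print }' "$1"
}

# Possible usage, following the file names in the steps above:
#   comment_out_child_rules sdRules.stats.in.${YEAR} > sdRules.stats.in.${YEAR}.removeChild
#   grep -c '^#.*|CHILD' sdRules.stats.in.${YEAR}.removeChild
```

The second command in the usage comment counts only commented CHILD lines, so it can be compared against the expected number of CHILD rules.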

III. Get the Optimal Set

  • Algorithm:
    • Let the program select the optimal set automatically, that is, the set that covers a minimum of 95% precision across all root parent rules.
    • The F1 of this set may not be the best achievable, but 95% precision is the objective.
    • Then evaluate the child rules of these parent rules to find the best child rules for the optimal set.

  • Program:
    shell> cd ${SUFFIXD_DIR}/bin
    shell> GetSdRule ${YEAR}
    1
    others
    00.baseline
    0
  • Outputs:
    • ${SUFFIXD_DIR}/data/${YEAR}/dateR/SdRulesOptimum/00.baseline/sdRules.stats.out.*
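
The selection idea in the algorithm above can be illustrated with a small sketch. This is not the GetSdRule implementation, whose internals are not described here. It assumes a simplified, hypothetical input format of `ruleName|relevantRetrieved|retrieved` per line, sorts rules by descending precision (ties broken by descending retrieved count), and keeps the longest prefix whose accumulated precision stays at or above the given minimum. The function name `select_optimal_rules` is also hypothetical.

```shell
# Hypothetical sketch of a precision-first selection (not the GetSdRule code).
# $1 = input file with lines "ruleName|relevantRetrieved|retrieved"
# $2 = minimum accumulated precision, e.g. 0.95
select_optimal_rules() {
  # Prefix each non-comment rule with its own precision so it can be sorted.
  awk -F'|' '!/^#/ && $3 > 0 { printf "%.6f|%s\n", $2 / $3, $0 }' "$1" |
    LC_ALL=C sort -t'|' -k1,1nr -k4,4nr |
    awk -F'|' -v min="$2" '{
      hit += $3; ret += $4              # accumulated counts over the prefix
      if (hit / ret < min) exit         # stop once accumulated precision drops below min
      printf "%s|%.4f\n", $2, hit / ret # rule name and accumulated precision so far
    }'
}
```

Because the rules are visited in descending order of precision, the accumulated precision can only stay level or fall as rules are added, so stopping at the first drop below the minimum yields the largest qualifying prefix.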

IV. Results

The result of this baseline set of SD-Rules includes 162 unique PARENT/SELF SD-Rules (no CHILD rules). They are sorted in descending order of precision (= relevant, retrieved No. / retrieved No.) and then by retrieved No. rate. The top 102 SD-Rules are used as the optimized SD-Rule set, which covers 95.26% system (accumulated) precision and an 87.24% system (accumulated) recall rate, with a system performance (F1) of 1.8251. The total number of valid instances (relevant, retrieved) is 59,911 (from the last column in ./sdRules.stats.out).

-- Total line no: 189
-- Total comment no: 27
-- Total Sd-Rule no: 162
---------------------------------------
-- Optimum SD-Rules: 102|73.13%|67|49|18|0|$|verb|per$|noun|2024|WORDNET|SELF|95.26%|87.24%|1.8251|52268|54867
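
As a sanity check, the reported figures can be recomputed from the summary line. The sketch below rests on assumptions that the report itself does not state: that the last two fields of the summary line are the accumulated relevant, retrieved count (52268) and the accumulated retrieved count (54867), and that 59,911 is the denominator used for recall. The function name `sd_summary_metrics` is hypothetical.

```shell
# Hypothetical check (field meanings are assumptions, see the text above).
# $1 = "Optimum SD-Rules" summary line, $2 = total used as the recall denominator
sd_summary_metrics() {
  printf '%s\n' "$1" | awk -F'|' -v total="$2" '{
    hit = $(NF - 1)        # assumed: accumulated relevant, retrieved count
    retrieved = $NF        # assumed: accumulated retrieved count
    p = hit / retrieved    # accumulated precision
    r = hit / total        # accumulated recall
    printf "precision=%.2f%% recall=%.2f%% sum=%.4f\n", 100 * p, 100 * r, p + r
  }'
}

summary='-- Optimum SD-Rules: 102|73.13%|67|49|18|0|$|verb|per$|noun|2024|WORDNET|SELF|95.26%|87.24%|1.8251|52268|54867'
sd_summary_metrics "$summary" 59911
# prints: precision=95.26% recall=87.24% sum=1.8251
```

Under these assumptions the recomputed precision and recall match the report, and the reported performance value of 1.8251 coincides with the sum of precision and recall rather than with a harmonic mean.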