The SPECIALIST Lexicon

Synonym Candidates

I. Setup

  • program: GetSynonymCandidates.java
  • Inputs:
    • MRCONSO.RRF
    • inflVars.data
    • cuiPreferredTerm.data
    • MRSTY.RRF
    • SemGroups.filter.txt
    • LRABR.f1.uSort
    • LRNOM
  • Outputs:
    • synonymCan.data.*

II. Algorithm

Go through all lines in MRCONSO.RRF to generate sClasses (synonym classes). An sClass includes:

  • key: CUI and preferred term
    • In UMLS, preferred terms may be capitalized, may be in plural form (not base form), and may not be in the Lexicon
    • The preferred term associated with the CUI is used as the reference for tagging synonym candidates in the sClass
  • values: candidate terms that have the same CUI and meet the requirements listed in the steps below
    • format: POS|EUI|coreTerm.lc
    • EUI is used to uniquely identify the term because some terms with different EUIs have the same spelling and POS
  • Example of a synonym class (sClass):
    #SYNONYM_CLASS|C0000715|Abattoirs
    128|E0203495|abattoir|
    128|E0205229|slaughterhouse|
    
    #SYNONYM_CLASS|C0000744|Abetalipoproteinemia
    128|E0006481|abetalipoproteinemia|
    128|E0217186|acanthocytosis|
    128|E0430334|Bassen-Kornzweig syndrome|
    128|E0441749|Bassen-Kornzweig disease|
    
  • Please note that candidates are case-sensitive, to preserve the original base form as it appears in the Lexicon
  • Candidates are sent to linguists, who tag each one [y|n] as a valid or invalid synonym of the CUI|PT
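The sClass layout above can be read back into a small data structure. The following is a minimal Python sketch for illustration only (GetSynonymCandidates.java itself is not shown on this page); the names `SynonymClass` and `parse_sclasses` are hypothetical, and the sample text is copied from the example above.

```python
from dataclasses import dataclass, field


@dataclass
class SynonymClass:
    cui: str
    preferred_term: str
    # Each candidate is a (pos, eui, term) tuple.
    candidates: list = field(default_factory=list)


def parse_sclasses(text):
    """Parse header lines (#...|CUI|PT) and candidate lines (POS|EUI|term|)."""
    classes = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # blank lines separate classes
        fields = line.split("|")
        if line.startswith("#"):
            current = SynonymClass(cui=fields[1], preferred_term=fields[2])
            classes.append(current)
        elif current is not None:
            # The trailing field after the term is empty in the examples.
            current.candidates.append((int(fields[0]), fields[1], fields[2]))
    return classes


SAMPLE = """\
#SYNONYM_CLASS|C0000715|Abattoirs
128|E0203495|abattoir|
128|E0205229|slaughterhouse|

#SYNONYM_CLASS|C0000744|Abetalipoproteinemia
128|E0006481|abetalipoproteinemia|
128|E0217186|acanthocytosis|
128|E0430334|Bassen-Kornzweig syndrome|
128|E0441749|Bassen-Kornzweig disease|
"""

for sclass in parse_sclasses(SAMPLE):
    print(sclass.cui, sclass.preferred_term, len(sclass.candidates))
```

Keeping the EUI in each candidate tuple matters because, as noted above, spelling and POS alone do not uniquely identify a term.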

The numbers below are based on UMLS.2014 (11,936,143 terms in total)

Descriptions and Output Logs (each step description is followed by the name of its output log)
Retrieve all English terms in the Lexicon with the same CUI
  • English term: Field-2, LAT = ENG (English only)
  • non-English: 2,913,250, English: 9,022,893
  • same CUI (definition of synonym, same concept)
  • Normalized to coreTermLc (strip initial and final punctuation, then lowercase), used as the key for lexRecord lookup (inflVars.data)
  • Known to Lexicon
  • not in Lexicon: 5,704,964, Lexicon: 450,234
  • Inflection of a base form, with POS of noun, verb, or adj.
  • Disallowed Lexicon POS: 620,085, Good lexicon POS: 453,261
  • Output format:
    #SYNONYM_CLASS|CUI|Preferred Term
    POS-1|EUI-1|Base-1
    POS-2|EUI-2|Base-2
    ...
    
Output log: SynonymCan.data.1.all
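The language filter and the coreTermLc normalization in the step above can be sketched in Python as follows. This is an illustration under assumptions, not the actual program: "punctuation" is taken to mean ASCII punctuation (`string.punctuation`), surrounding whitespace is also stripped, and the sample MRCONSO.RRF line uses made-up identifiers rather than a real UMLS record. The function names are hypothetical.

```python
import string


def is_english(mrconso_line):
    """Field 2 (1-based) of a pipe-delimited MRCONSO.RRF line is LAT."""
    return mrconso_line.split("|")[1] == "ENG"


def core_term_lc(term):
    """Strip leading and trailing punctuation (and spaces), then lowercase.

    Assumption: ASCII punctuation only; the Lexicon tools may use a
    different definition of punctuation.
    """
    return term.strip(string.punctuation + string.whitespace).lower()


# Made-up line in MRCONSO.RRF column order (STR is the 15th field).
SAMPLE_LINE = "C0000715|ENG|P|L0000001|PF|S0000001|Y|A0000001||||MSH|MH|D000003|Abattoirs|0|N||"

if is_english(SAMPLE_LINE):
    term = SAMPLE_LINE.split("|")[14]
    print(core_term_lc(term))
```

Only the ends of the string are stripped, so internal punctuation such as the hyphen in "Bassen-Kornzweig syndrome" is preserved in the lookup key.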
Exclude terms with a disallowed STI, such as Chemicals and Drugs
  • CuiStiMap: uses ./inData/MRSTY.RRF to map CUI to STI
  • disallowedStiSet: ./inData/SemGroups.filter.txt specifies the disallowed STIs (tagged by linguists), such as those whose SemGroup is CHEM
  • disallowed: 2,867,695, allowed: 6,155,198
  • Example-1: The following synonym class is removed because of a disallowed STI
    #SYNONYM_CUI|C0000098|1-Methyl-4-phenylpyridinium
    128|E0020400|cyperquat|
    128|E0319735|mpp|
Output log: SynonymCan.data.2.disallow
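The STI exclusion above amounts to a set lookup per CUI. The Python sketch below is illustrative only: it assumes a CUI is dropped when any of its STIs is in the disallowed set (the page does not say whether "any" or "all" is required), it works on in-memory maps instead of parsing MRSTY.RRF and SemGroups.filter.txt, and the STI values `STI_CHEM` and `STI_OTHER` are placeholders, not real identifiers.

```python
def remove_disallowed_sti(sclasses, cui_to_stis, disallowed_stis):
    """Keep only the classes whose CUI has no disallowed STI.

    sclasses: dict mapping CUI -> list of (pos, eui, term) candidates
    cui_to_stis: dict mapping CUI -> set of STIs (the CuiStiMap)
    disallowed_stis: set of STIs to exclude (the disallowedStiSet)
    """
    kept = {}
    for cui, candidates in sclasses.items():
        # Assumption: one disallowed STI is enough to drop the class.
        if cui_to_stis.get(cui, set()) & disallowed_stis:
            continue
        kept[cui] = candidates
    return kept


# Candidates are taken from the examples on this page; STIs are placeholders.
SCLASSES = {
    "C0000098": [(128, "E0020400", "cyperquat"), (128, "E0319735", "mpp")],
    "C0000715": [(128, "E0203495", "abattoir"), (128, "E0205229", "slaughterhouse")],
}
CUI_TO_STIS = {"C0000098": {"STI_CHEM"}, "C0000715": {"STI_OTHER"}}
DISALLOWED = {"STI_CHEM"}

print(sorted(remove_disallowed_sti(SCLASSES, CUI_TO_STIS, DISALLOWED)))
```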
Exclude terms that are acronyms or abbreviations because they reduce precision too much.
  • An abbreviation can have too many expansions; for example, "AA" has 39 expansions in the Lexicon.
  • Preprocess:
    shell> flds 1 LRABR | sort -u > LRABR.f1.uSort
  • Use LRABR.f1.uSort to check whether a term is an abbreviation or acronym.
  • AcrAbb: 26,596, NotAcrAbb: 426,665
  • Example-2: lines with abbreviations are removed
    128|E0006443|abdomen|
    128|E0554771|abdominal|
    128|E0689526|abd|
    128|E0689531|abd|
    1|E0006444|abdominal|
    1|E0692924|abd|

  • Example-3: After acad is removed, this synonym class has only one candidate, so the whole class is removed.
    #SYNONYM_CUI|C0000876|Academies
    128|E0006659|academy|
    128|E0417973|acad|
    128|E0722828|acad|
Output log: SynonymCan.data.3.abb
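The abbreviation filter above can be sketched in Python as a membership test against the sorted unique list. This sketch assumes that field 1 of LRABR identifies the abbreviation by its EUI, so candidates are matched on EUI; if that field holds the spelling instead, the comparison would be on the term. The EUI set below is taken from the lines that Example-2 and Example-3 describe as removed, and the function name is hypothetical.

```python
def remove_abbreviations(candidates, abbreviation_euis):
    """Drop candidates whose EUI is listed as an acronym or abbreviation."""
    return [c for c in candidates if c[1] not in abbreviation_euis]


# EUIs of the abd and acad lines from Example-2 and Example-3.
ABBREVIATION_EUIS = {"E0689526", "E0689531", "E0692924", "E0417973", "E0722828"}

EXAMPLE_2 = [
    (128, "E0006443", "abdomen"),
    (128, "E0554771", "abdominal"),
    (128, "E0689526", "abd"),
    (128, "E0689531", "abd"),
    (1, "E0006444", "abdominal"),
    (1, "E0692924", "abd"),
]
EXAMPLE_3 = [
    (128, "E0006659", "academy"),
    (128, "E0417973", "acad"),
    (128, "E0722828", "acad"),
]

print(remove_abbreviations(EXAMPLE_2, ABBREVIATION_EUIS))
print(remove_abbreviations(EXAMPLE_3, ABBREVIATION_EUIS))
```

For Example-3 only the academy line survives, which is why that class is later dropped for having a single candidate.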
Remove spVars (spelling variants) to reduce manual tagging efforts.
  • If a term has a synonym A, all spVars of that term are synonyms of A.
  • Do not add a term to the sClass if its EUI already exists in the sClass (spVars share an EUI)
  • Use the EUI in inflVars.data
  • Use any base form for terms that have spVars (same EUI).
  • spVarNo: 274,469, after removing spVars: 152,196
  • SpVars should be added back in the post-process
  • Example-4: lines that are spVars are removed
    #SYNONYM_CUI|C0000934|Acclimatization
    128|E0006730|acclimation|
    128|E0006731|acclimatisation|
    128|E0006731|acclimatization|
    128|E0007239|adaptation|
    128|E0422110|adaption|

    In the post-process, the deleted spVars will be added back in (if the tag of acclimatisation is [y]), so the record will become (assuming all tags are [y]):
    #SYNONYM_CUI|C0000934|Acclimatization
    128|E0006730|acclimation|
    128|E0006731|acclimatisation|
    128|E0006731|acclimatization|
    128|E0007239|adaptation|
    128|E0422110|adaption|

  • Example-5: After the spVar is removed, this synonym class has only one candidate, so the whole class is removed.
    #SYNONYM_CUI|C0000880|Acanthamoeba Keratitis
    128|E0429790|acanthameba keratitis|
    128|E0429790|acanthamoeba keratitis|
    => In the post-process, no synonyms will be generated for this sClass.
Output log: SynonymCan.data.4.spVar
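Because spVars share an EUI, the spVar removal above can be sketched in Python as de-duplication by EUI. This is a minimal illustration with a hypothetical function name; it keeps the first line seen for each EUI, which is one way to satisfy "use any base form".

```python
def remove_spvars(candidates):
    """Keep one candidate per EUI; later lines with the same EUI are spVars."""
    seen_euis = set()
    kept = []
    for pos, eui, term in candidates:
        if eui in seen_euis:
            continue  # same EUI already in the sClass: a spelling variant
        seen_euis.add(eui)
        kept.append((pos, eui, term))
    return kept


EXAMPLE_4 = [
    (128, "E0006730", "acclimation"),
    (128, "E0006731", "acclimatisation"),
    (128, "E0006731", "acclimatization"),
    (128, "E0007239", "adaptation"),
    (128, "E0422110", "adaption"),
]
EXAMPLE_5 = [
    (128, "E0429790", "acanthameba keratitis"),
    (128, "E0429790", "acanthamoeba keratitis"),
]

print(remove_spvars(EXAMPLE_4))
print(remove_spvars(EXAMPLE_5))
```

Example-5 collapses to a single line, which matches the statement that no synonyms are generated for that sClass.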
Remove nominalizations of a term.
  • If a term has a synonym A, all nominalizations of that term are synonyms of A.
  • Sort sClass by CUI (key)
  • Use nomMap: ./inData/LRNOM, key: EUI of the noun, value: the set of EUIs of its nominalizations (adj and verb).
  • For implementation, keep the noun and remove its adj and verb nominalizations
  • nomNo: 819, passNomNo: 151,377
  • All nominalizations are synonyms (use LRNOM).
  • Example-6: lines that are nominalizations of a noun are removed
    #SYNONYM_CUI|C0001807|Aggressive behavior
    128|E0007791|aggression|
    128|E0007793|aggressiveness|
    128|E0528674|aggressive|
    1|E0007792|aggressive|
    => In the post-process, the nominalizations of all lines are added as follows:
    #SYNONYM_CUI|C0001807|Aggressive behavior
    128|E0007791|aggression|
    128|E0007793|aggressiveness|
    128|E0528674|aggressive|
    1|E0007792|aggressive|
    1024|E02212219|aggress|
    1|E0007792|aggressive|
Output log: SynonymCan.data.5.nom
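The nominalization step above can be sketched in Python with nomMap as a dictionary from a noun EUI to the EUIs of its adj and verb counterparts. This is an illustration only: the function name is hypothetical, and the single nomMap entry below (linking the noun aggressiveness to the adjective aggressive) is an assumed value for demonstration, not a record read from LRNOM.

```python
def remove_nominalizations(candidates, nom_map):
    """Keep nouns; drop adj/verb lines that nom_map links to a noun in the class.

    nom_map: dict mapping noun EUI -> set of EUIs of its adj/verb counterparts
    """
    euis_in_class = {eui for _, eui, _ in candidates}
    to_remove = set()
    for eui in euis_in_class:
        to_remove |= nom_map.get(eui, set())
    return [c for c in candidates if c[1] not in to_remove]


EXAMPLE_6 = [
    (128, "E0007791", "aggression"),
    (128, "E0007793", "aggressiveness"),
    (128, "E0528674", "aggressive"),
    (1, "E0007792", "aggressive"),
]
# Assumed entry for illustration; real entries come from LRNOM.
NOM_MAP = {"E0007793": {"E0007792"}}

print(remove_nominalizations(EXAMPLE_6, NOM_MAP))
```

Matching on EUI rather than spelling keeps the noun aggressive (E0528674) while removing the adjective with the same spelling (E0007792).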
Print sClasses with multiple candidates (an sClass must have more than 1 term)
  • notMultiCanNo: 96,455, multiCanNo: 54,922
Output log: SynonymCan.data
  • sClassNo: 21,655
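The final multi-candidate filter and the counts logged above can be sketched and cross-checked in Python. In each pair of logged counts the two numbers add up to a count kept by an earlier step; reading this as a filtering funnel is an inference from the arithmetic, not something the page states. The names `keep_multi_candidate`, `FUNNEL` and `funnel_mismatches` are hypothetical.

```python
def keep_multi_candidate(sclasses):
    """Keep only the classes that still have more than one candidate."""
    return {cui: cands for cui, cands in sclasses.items() if len(cands) > 1}


# (step, input count, removed count, kept count), all copied from the logs above.
FUNNEL = [
    ("English only", 11_936_143, 2_913_250, 9_022_893),
    ("allowed STI", 9_022_893, 2_867_695, 6_155_198),
    ("known to Lexicon", 6_155_198, 5_704_964, 450_234),
    ("not AcrAbb", 453_261, 26_596, 426_665),
    ("spVar removed", 426_665, 274_469, 152_196),
    ("nominalization removed", 152_196, 819, 151_377),
    ("multiple candidates", 151_377, 96_455, 54_922),
]


def funnel_mismatches(funnel):
    """Return the names of steps where removed + kept != input."""
    return [name for name, total, removed, kept in funnel if removed + kept != total]


print(funnel_mismatches(FUNNEL))
print(keep_multi_candidate({
    "C0000876": [(128, "E0006659", "academy")],
    "C0000715": [(128, "E0203495", "abattoir"), (128, "E0205229", "slaughterhouse")],
}))
```

The 453,261 input to the abbreviation step is the "Good lexicon POS" count, which is counted over Lexicon records rather than over the 450,234 terms known to the Lexicon.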