The SPECIALIST Lexicon

Multiword Candidates Generation Processes:
SpVar Matcher with Frequency in the Distilled Medline N-gram Set

N-grams that match spelling variant (SpVar) patterns are a good source of multiword candidates. More than 10 SpVar types were developed to identify spVars in a given corpus.

  • For example, suppose the following terms all appear in a corpus:

    • bloodpressure
    • blood pressure
    • blood-pressure
    • tradeoff
    • trade off
    • trade-off
    These terms match the spVar types (SVT_SPACE|SVT_PUNC_DASH) in the spVar model, so they are good candidates for LMWs.
  • A frequency filter (WC) is added to this list for frequency analysis:
  • Matcher SpVar: Steps 60-61A (08.MatcherSpVar)
  • Some candidates are automatically tagged [AUTO_YES|AUTO_NO]
  • The highest-frequency strategy should be applied
  • Not as productive as expected; not used from 2016 onward.
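
The matching idea above can be sketched in a few lines of Python. This is only an illustration, not the Lexicon's actual matcher: the function names, the normalization rule (dropping spaces and dashes to approximate SVT_SPACE|SVT_PUNC_DASH), the sample counts, and the reading of "highest-frequency strategy" as "rank each group by WC, most frequent variant first" are all assumptions.

```python
from __future__ import annotations

from collections import defaultdict


def spvar_key(term: str) -> str:
    """Collapse spaces and dashes so space/dash variants share one key (assumed rule)."""
    return term.lower().replace("-", "").replace(" ", "")


def group_spvars(ngram_counts: dict[str, int]) -> list[list[tuple[str, int]]]:
    """Group n-grams that differ only by spaces or dashes.

    Returns one list per group, sorted by descending frequency (WC),
    so the first entry is the most frequent variant.
    """
    groups: dict[str, list[tuple[str, int]]] = defaultdict(list)
    for term, wc in ngram_counts.items():
        groups[spvar_key(term)].append((term, wc))

    result = []
    for variants in groups.values():
        # A term with no other variant in the corpus is not a spVar match.
        if len(variants) > 1:
            result.append(sorted(variants, key=lambda tw: (-tw[1], tw[0])))
    return result


# Hypothetical word counts, for illustration only.
counts = {
    "blood pressure": 900,
    "blood-pressure": 40,
    "bloodpressure": 5,
    "trade-off": 300,
    "tradeoff": 120,
    "trade off": 80,
    "heart rate": 500,
}

groups = group_spvars(counts)
for variants in groups:
    print(variants)
```

With these made-up counts, "heart rate" is dropped because no variant of it appears, and each remaining group lists its most frequent spelling first, which is the form a reviewer would look at before tagging.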

  • Generated files:
    Distilled MEDLINE nGram Set    Candidate Files    Status    Notes
    2015                                              Done      Tag [Y|N]
    2016+                                             N/A       Postponed due to limited resources