CSpell

Leading Punctuation Splitter

  • Description:
    This splitter inserts a space before leading punctuation when a token contains it, so that the punctuation starts a new token. Leading punctuation includes: & ( [ {

  • Features:
    Split a token in front of leading punctuation.

  • Examples:

    File Name    Input          Output
    12134.txt    doppler(       doppler (
    12271.txt    1-plug&        1-plug &
    12353.txt    epilepsy(      epilepsy (
    12353.txt    volunteers(    volunteers (
    12706.txt    dr.[           dr. [
    18186.txt    test(          test (
    18341.txt    vain(          vain (
    2.txt        one(           one (
    30.txt       folitrax(      folitrax (
    50.txt       ,[             , [
    78.txt       genes[         genes [

  • Implementation Logic:
    • Recursively perform the following process until no further split occurs:
    • Convert the input word to a coreTerm by stripping off leading punctuation, spaces, and digits.
    • Check whether the coreTerm contains leading punctuation. If yes:
      • Find the first leading punctuation.
      • Check whether the coreTerm matches the exceptions for that leading punctuation. If not:
        • Add a space before the leading punctuation.
    • Check whether the prefix contains leading punctuation. If yes:
      • Find the first leading punctuation.
      • Check whether the prefix matches the exceptions for that leading punctuation. If not:
        • Add a space before the leading punctuation.
    • Check whether the suffix starts with leading punctuation. If yes:
      • Add a space before the leading punctuation.
    • Convert the updated coreTerm back to the output term if a split happened in the coreTerm, prefix, or suffix.
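
The recursive idea in the steps above can be sketched in a few lines of Java. This is a simplified illustration, not the CSpell source: it works on the whole token instead of decomposing it into prefix, coreTerm, and suffix, and the class name, method names, and the `isException` hook are all hypothetical stand-ins for the per-punctuation filters.

```java
import java.util.function.Predicate;

class LeadingPunctuationSplitSketch {
    // Leading punctuation characters named in the description: & ( [ {
    private static final String LEADING_PUNCTUATION = "&([{";

    // Index of the first leading punctuation at or after 'from', or -1 if none.
    static int firstLeadingPunctuation(String text, int from) {
        for (int i = from; i < text.length(); i++) {
            if (LEADING_PUNCTUATION.indexOf(text.charAt(i)) >= 0) {
                return i;
            }
        }
        return -1;
    }

    // Inserts a space before each leading punctuation, recursing on the
    // remainder until no further split is possible. 'isException' is a
    // hypothetical hook standing in for the per-punctuation filters.
    static String split(String token, Predicate<String> isException) {
        // Start at index 1: punctuation at position 0 has nothing before it.
        int pos = firstLeadingPunctuation(token, 1);
        if (pos < 0 || isException.test(token)) {
            return token;
        }
        return token.substring(0, pos) + " " + split(token.substring(pos), isException);
    }

    public static void main(String[] args) {
        Predicate<String> noExceptions = t -> false;
        System.out.println(split("doppler(", noExceptions)); // doppler (
        System.out.println(split(",[", noExceptions));       // , [
        // An abbreviation check in the spirit of the ampersand exception.
        System.out.println(split("AT&T", t -> t.matches("[A-Z]+&[A-Z]+"))); // AT&T
    }
}
```

Passing the exception check as a predicate keeps the recursion separate from the filter rules, which mirrors the note that exceptions are implemented separately for each leading punctuation.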

  • Notes:
    • Baseline source code: PreProcSplit.java
    • Enhancement:
      • Does not use a dictionary
      • Adds leading punctuation [&]
      • Removes leading punctuation [/] and [-] to increase precision
      • Implements exceptions separately for each leading punctuation
      • Uses coreTermObj to split a token into prefix, coreTerm, and suffix
      • Recursively splits until no further split occurs
    • The punctuation characters @ and * might qualify as leading punctuation; this needs further analysis.
    • Action: Redesigned and implemented
    • Applies the non-dictionary splitter model with matchers and filters, using a regular expression for each leading punctuation. They are described in the following tables:
      Broader Generic Matchers (Qualifiers)
      Matcher                         Regular Expression       Examples
      Contains Leading Punctuation    ^.*[&\\(\\[\\{].*$

      Filters (Specific Exceptions for Each Leading Punctuation)

      Ampersand [&]
        1. Abbreviations
           Filter (Exception):  [A-Z]+&[A-Z]+
           Regular Expression:  ^[A-Z]+&[A-Z]+$
           Examples:            AT&T, R&D

      Left Parenthesis [(]
        1. contains digits or plus sign
           Filter (Exception):  [non-space]*([digit]+\+?)[non-space]*
           Regular Expression:  ((\\S)*\\([\\d]+(\\+)?\\)(\\S)*)
           Examples:            RS(3)PE, δ(18)O, Ca(2+), Ca(2+)-ATPase
        2. max or min
           Filter (Exception):  [non-space]*(max|min)[non-space]*
           Regular Expression:  ((\\S)*\\((max|min)\\))
           Examples:            V(max), C(min)
        3. contains a single char or plus
           Filter (Exception):  [non-space]*(+char)[non-space]*
           Regular Expression:  ((\\S)*\\([+\\w]\\)(\\S)*)
           Examples:            D(+)HUS, GABA(A), apolipoprotein(a), beta(1)s, homocyst(e)ine
        4. parenthetic plural forms
           Filter (Exception):  [word]+((s|es)|(y(ies)))
           Regular Expression:  ([\\w]+((s\\(es\\))|(y\\(ies\\))))
           Examples:            finger(s), fetus(es), extremity(ies)
        5. after a hyphen
           Filter (Exception):  [non-space]*-([non-space]*)
           Regular Expression:  ((\\S)*-\\((\\S)*)
           Examples:            poly-(ethylene, poly-(ADP-ribose), C-(17:0), I-(alpha)

      Left Square Bracket [[]
        1. [ [lower] ]
           Filter (Exception):  [non-space]*[[lower]][non-space]*
           Regular Expression:  (\\S*\\[[a-z]\\]\\S*)
           Examples:            benzo[a]pyrene, B[e]P
        2. leads with tilde or hyphen
           Filter (Exception):  (tilde|hyphen)[
           Regular Expression:  ([~\\-]\\[\\S*)
           Examples:            -[NAME], ~[NAME]

      Left Curly Brace [{]
        1. No exceptions found
           Regular Expression:  $^
           Examples:            None
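
To show how the qualifier and the filters might fit together, here is a hedged Java sketch that compiles the regular expressions from the tables above and tests a token against them. The class and method names are hypothetical, and whole-token matching via `Matcher.matches()` is an assumption; the tables do not say whether the original code matches the full token or searches within it.

```java
import java.util.List;
import java.util.regex.Pattern;

class LeadingPunctuationFilterSketch {
    // Broader generic matcher: the token contains any leading punctuation.
    static final Pattern CONTAINS_LEADING_PUNCTUATION =
        Pattern.compile("^.*[&\\(\\[\\{].*$");

    // Exception filters, copied from the tables above.
    static final List<Pattern> AMPERSAND = List.of(
        Pattern.compile("^[A-Z]+&[A-Z]+$"));
    static final List<Pattern> LEFT_PARENTHESIS = List.of(
        Pattern.compile("((\\S)*\\([\\d]+(\\+)?\\)(\\S)*)"),
        Pattern.compile("((\\S)*\\((max|min)\\))"),
        Pattern.compile("((\\S)*\\([+\\w]\\)(\\S)*)"),
        Pattern.compile("([\\w]+((s\\(es\\))|(y\\(ies\\))))"),
        Pattern.compile("((\\S)*-\\((\\S)*)"));
    static final List<Pattern> LEFT_SQUARE_BRACKET = List.of(
        Pattern.compile("(\\S*\\[[a-z]\\]\\S*)"),
        Pattern.compile("([~\\-]\\[\\S*)"));
    static final List<Pattern> LEFT_CURLY_BRACE = List.of(
        Pattern.compile("$^")); // no exceptions: matches no non-empty token

    // Selects the filter list for one leading punctuation character.
    static List<Pattern> filtersFor(char punctuation) {
        switch (punctuation) {
            case '&': return AMPERSAND;
            case '(': return LEFT_PARENTHESIS;
            case '[': return LEFT_SQUARE_BRACKET;
            case '{': return LEFT_CURLY_BRACE;
            default:  return List.of();
        }
    }

    // True if the whole token matches any exception for this punctuation.
    static boolean isException(String token, char punctuation) {
        for (Pattern filter : filtersFor(punctuation)) {
            if (filter.matcher(token).matches()) {
                return true;
            }
        }
        return false;
    }

    // A token is a split candidate if it qualifies and no filter excludes it.
    static boolean shouldSplit(String token, char punctuation) {
        return CONTAINS_LEADING_PUNCTUATION.matcher(token).matches()
            && !isException(token, punctuation);
    }

    public static void main(String[] args) {
        System.out.println(shouldSplit("doppler(", '('));      // true
        System.out.println(shouldSplit("Ca(2+)-ATPase", '(')); // false
        System.out.println(shouldSplit("AT&T", '&'));          // false
    }
}
```

Under this assumption, finger(s) is excluded by the single-character filter (3) and not by the plural-form filter (4), whose regular expression as written only covers the s(es) and y(ies) endings.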

  • Source Code: