
CSpell

Leading Digit Splitter

  • Description:
    This splitter inserts a space after the leading digits of a token that begins with digits.

  • Features:
    Splits a token at the end of its leading digits.

  • Examples:

    File Name    Input            Output
    73.txt       4miscarriages    4 miscarriages
    10349.txt    20years          20 years
    11579.txt    29yrs            29 yrs
    10349.txt    1.5years         1.5 years
    13082.txt    3weeks           3 weeks
    13175.txt    50mg             50 mg

  • Implementation Logic:
    • Convert the input word to a coreTerm by stripping leading and trailing punctuation and spaces.
    • Check whether the coreTerm leads with digits. If yes:
      • Check whether the coreTerm matches any of the exceptions. If not:
        • Add a space after the leading digits.
    • Convert the updated coreTerm back to the output term.
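
    The steps above can be sketched in Java as follows. This is a minimal illustration, not the code in LeadingDigitSplitter.java: the class name, the split method, and the CORE pattern used to strip punctuation are assumptions made for this example. The leading-digit pattern and the ordinal-number pattern are copied from the matcher and filter tables under Notes, and only the ordinal-number exception is checked, to keep the sketch short.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LeadingDigitSplitSketch {
    // Hypothetical core-term pattern: leading punctuation/spaces, core term, trailing punctuation/spaces.
    private static final Pattern CORE =
        Pattern.compile("^([\\p{Punct}\\s]*)(.*?)([\\p{Punct}\\s]*)$");
    // Matcher from the Matchers table: leading digits followed by at least two letters.
    private static final Pattern LEAD_DIGIT =
        Pattern.compile("^(\\d*\\.?\\d+)([a-zA-Z]{2,})(.*)$");
    // Only the ordinal-number exception is applied in this sketch.
    private static final Pattern ORDINAL =
        Pattern.compile("^((\\d*)(1st|2nd|3rd))|((\\d+)(th))$");

    static String split(String inWord) {
        // Step 1: separate the core term from surrounding punctuation and spaces.
        Matcher core = CORE.matcher(inWord);
        if (!core.matches()) {
            return inWord;
        }
        String prefix = core.group(1);
        String coreTerm = core.group(2);
        String suffix = core.group(3);

        // Steps 2 and 3: split only if the term leads with digits and is not an exception.
        Matcher lead = LEAD_DIGIT.matcher(coreTerm);
        boolean isException = ORDINAL.matcher(coreTerm).matches();
        if (lead.matches() && !isException) {
            coreTerm = lead.group(1) + " " + lead.group(2) + lead.group(3);
        }

        // Step 4: put the stripped punctuation back around the updated core term.
        return prefix + coreTerm + suffix;
    }

    public static void main(String[] args) {
        System.out.println(split("20years"));  // 20 years
        System.out.println(split("(50mg),"));  // (50 mg),
        System.out.println(split("42nd"));     // 42nd
    }
}
```

    Keeping the stripped prefix and suffix separate means the space is inserted only inside the core term, so surrounding punctuation is returned unchanged.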

  • Notes:
    • Baseline source code: PreProcSplit.java
    • Enhancement:
      • No dictionary is used.
      • In addition to handling ordinal numbers (e.g. 1st, 2nd, 3rd, 4th), more exception patterns were extracted from the Lexicon and consumer data to increase precision (see details below).
    • Action: Redesigned and implemented
    • Applies the non-dictionary splitter model with matchers and filters implemented as regular expressions. They are described in the following tables:

      Matchers
      Matcher                Regular Expression                    Examples
      Leads with digit(s)    ^(\\d*\\.?\\d+)([a-zA-Z]{2,})(.*)$    21year, 1.5months, 5mg, 5and
      Filters (Exceptions)
      Filter (Exception)                                           Regular Expression                          Examples
      1. Ordinal number                                            ^((\\d*)(1st|2nd|3rd))|((\\d+)(th))$        1st, 42nd, 3rd, 435th
      2. [single char] after the leading digit                     ^(\\d+)([a-zA-Z])$                          31D, 9L, 5q
      3. [Upper], [Upper or digit]* after leading digit            ^(\\d+)([A-Z]+)([A-Z0-9]*)$                 67LR, 3Y1, 7PA2, 5FU
      4. [Upper, lower]+, [-], [word]* after leading digit         ^(\\d+)([a-zA-Z]+)-(\\w*)$                  111In-Cl, 5q-syndrome, 38C-13
      5. [Upper, lower], [punc, digit]* after leading digit        ^(\\d+)([a-zA-Z])([\\p{Punct}\\d]*)$        16P-13.11, 16P-13, 1q21.1.
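
      The five filters can be checked together with a short Java sketch. The class name and the isException helper are hypothetical and are not taken from LeadingDigitSplitter.java; the patterns are copied from the filter table. The sketch assumes whole-term matching via Matcher.matches(), so each alternative of the ordinal pattern has to cover the entire term.

```java
import java.util.regex.Pattern;

class LeadingDigitFilterSketch {
    // Exception patterns copied from the Filters table, in the same order.
    private static final Pattern[] EXCEPTIONS = {
        Pattern.compile("^((\\d*)(1st|2nd|3rd))|((\\d+)(th))$"),   // 1. ordinal number
        Pattern.compile("^(\\d+)([a-zA-Z])$"),                     // 2. single char after the digits
        Pattern.compile("^(\\d+)([A-Z]+)([A-Z0-9]*)$"),            // 3. upper case, then upper case or digits
        Pattern.compile("^(\\d+)([a-zA-Z]+)-(\\w*)$"),             // 4. letters, hyphen, word characters
        Pattern.compile("^(\\d+)([a-zA-Z])([\\p{Punct}\\d]*)$")    // 5. one letter, then punctuation or digits
    };

    // True when the core term matches any exception and should be left unsplit.
    static boolean isException(String coreTerm) {
        for (Pattern p : EXCEPTIONS) {
            if (p.matcher(coreTerm).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isException("435th"));    // true
        System.out.println(isException("16P-13.11")); // true
        System.out.println(isException("20years"));   // false
    }
}
```

      Under this sketch a term is split only when the leading-digit matcher accepts it and isException returns false.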

  • Source Code: LeadingDigitSplitter.java