CSpell

Performance Tests on Context Window Size

I. Test Setup

  • Data: Training Set
  • Gold Standard: non-word only
  • Dictionary: CSpell (Lexicon-based)
  • Corpus: Consumer health corpus
  • Ranking: Context score and CSpell ranking

II. Test Results

  • Tests on various context window sizes in context score ranking

    Context Radius   Precision   Recall   F1
    1                0.7780      0.6111   0.6845
    2                0.8035      0.5917   0.6815
    3                0.8044      0.5685   0.6662
    4                0.8156      0.5543   0.6600
    5                0.8252      0.5491   0.6594
    6                0.8281      0.5413   0.6547
    7                0.8240      0.5323   0.6468
    8                0.8320      0.5310   0.6483
    9                0.8443      0.5323   0.6529
    10               0.8374      0.5258   0.6460
    25               0.8433      0.5078   0.6339
    50               0.8442      0.5039   0.6311
    100              0.8442      0.5039   0.6311

  • Tests on various context window sizes in CSpell score ranking

    Context Radius   Precision   Recall   F1
    1                0.8380      0.7817   0.8088
    2                0.8407      0.7842   0.8115
    3                0.8366      0.7804   0.8075
    4                0.8352      0.7791   0.8061
    5                0.8352      0.7791   0.8061
    6                0.8296      0.7739   0.8008
    7                0.8310      0.7752   0.8021
    8                0.8310      0.7752   0.8021
    9                0.8310      0.7752   0.8021
    10               0.8296      0.7739   0.8008
    25               0.8283      0.7726   0.7995
    50               0.8283      0.7726   0.7995
    100              0.8283      0.7726   0.7995
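
The F1 values reported above are consistent with the standard harmonic mean of precision and recall, although this page does not state the formula explicitly. The following Python sketch, under that assumption, recomputes F1 for the first five radii in CSpell score ranking and picks the radius with the highest value. The function name `f1_score` and the `rows` list are illustrative only and are not part of CSpell.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (context radius, precision, recall) copied from the CSpell score ranking
# results for radii 1 through 5.
rows = [
    (1, 0.8380, 0.7817),
    (2, 0.8407, 0.7842),
    (3, 0.8366, 0.7804),
    (4, 0.8352, 0.7791),
    (5, 0.8352, 0.7791),
]

# Select the row whose recomputed F1 is highest.
best = max(rows, key=lambda row: f1_score(row[1], row[2]))
best_radius = best[0]
```

Because the published precision and recall are rounded to four decimal places, the recomputed F1 can differ from the reported value in the last digit.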

III. Discussion

  • Nearby (local) context matters more than distant (global) context: in both rankings the best F1 occurs at a radius of 1 or 2 and generally declines as the radius grows.
  • Distant (global) context contributes little to the context score: the results for radii of 50 and 100 are identical.
  • The context radius should correspond to the window size used in training: training window size = (2 * context radius + 1).
  • A radius of 2 (total window size of 5) was chosen because it gives the best F1 score (0.8115) in CSpell score ranking.
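
The relation between context radius and window size can be illustrated with a short Python sketch. It assumes the window is symmetric around the target token, includes the target itself, and is truncated at the start or end of the text. The helpers `window_size` and `context_window` and the sample sentence are hypothetical and are not CSpell's API.

```python
from typing import List


def window_size(radius: int) -> int:
    """Total window size for a given context radius: 2 * radius + 1."""
    return 2 * radius + 1


def context_window(tokens: List[str], index: int, radius: int) -> List[str]:
    """Return the target token plus up to `radius` tokens on each side.

    The window is shortened when the target is near either end of the text.
    """
    start = max(0, index - radius)
    end = min(len(tokens), index + radius + 1)
    return tokens[start:end]


# Hypothetical consumer-health style sentence with one misspelled word.
tokens = "i have been taking my medicaton for two weeks".split()
target = tokens.index("medicaton")

# Radius 2 gives a total window of 5 tokens around the target.
window = context_window(tokens, target, radius=2)
```

With a radius of 2, the window around "medicaton" holds the two tokens before it and the two tokens after it, for a total of five tokens.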