CSpell

CSpell Pipeline Design

I. Introduction

Different types of errors have different characteristics and require specific correction strategies. CSpell therefore implements a multi-layer design consisting of models for non-dictionary-based and dictionary-based corrections. It integrates several stand-alone spelling correction models, combined in sequential order as shown in the following figure.

II. Non-dictionary-based correction

The non-dictionary correction model includes handlers and splitters.

  • handlers: handle HTML/XML tags and informal expressions
  • splitters: split run-on text (missing spaces) at punctuation and numbers.

    Splitters use the Lexicon to derive generic patterns for the matchers and filters that drive split operations on run-ons at digits and punctuation. These patterns are implemented as regular expressions and split algorithms, outlined briefly in the following diagram.

The handlers and splitters are arranged as a chain of intermediate operators that handle HTML/XML tags introduced by the software consumers use to ask questions, informal expressions, and missing spaces next to punctuation or digits.
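
The chain of handlers and splitters described above can be illustrated with a minimal Python sketch. It is not CSpell's actual code: the function names, the informal-expression map, and the two exception sets (standing in for the Lexicon-derived filters) are all hypothetical, and the regular expressions are simplified assumptions about what such matchers might look like.

```python
import html
import re

# Hypothetical informal-expression map; the real list is not given in the text.
INFORMAL = {"pls": "please", "u": "you", "thx": "thanks"}

# Hypothetical stand-ins for Lexicon-derived filters: tokens that must not be split.
PUNCT_EXCEPTIONS = {"e.g.", "i.e."}
DIGIT_EXCEPTIONS = {"b12", "h1n1", "1st", "2nd", "3rd"}


def strip_tags(text: str) -> str:
    """Handler: replace HTML/XML tags with a space, then decode entities."""
    return html.unescape(re.sub(r"</?[A-Za-z][^>]*>", " ", text))


def expand_informal(text: str) -> str:
    """Handler: replace whole-token informal expressions."""
    return " ".join(INFORMAL.get(tok.lower(), tok) for tok in text.split())


def split_on_punctuation(token: str) -> str:
    """Splitter: add the missing space after punctuation between two letters."""
    if token.lower().rstrip(",;:") in PUNCT_EXCEPTIONS:
        return token
    return re.sub(r"(?<=[A-Za-z])([.,;:!?])(?=[A-Za-z])", r"\1 ", token)


def split_on_digits(token: str) -> str:
    """Splitter: add the missing space at every letter/digit boundary."""
    if token.lower() in DIGIT_EXCEPTIONS:
        return token
    return re.sub(r"(?<=[A-Za-z])(?=\d)|(?<=\d)(?=[A-Za-z])", " ", token)


def token_wise(fn):
    """Lift a per-token operator to a whole-text operator."""
    return lambda text: " ".join(fn(tok) for tok in text.split())


# The operators run in sequence; each one consumes the previous one's output.
CHAIN = [
    strip_tags,
    expand_informal,
    token_wise(split_on_punctuation),
    token_wise(split_on_digits),
]


def preprocess(text: str) -> str:
    for operator in CHAIN:
        text = operator(text)
    return text


print(preprocess("<p>pls take2tablets daily.Then rest</p>"))
```

In this sketch the order of the chain matters: tags are removed first so that later operators see only text, and the punctuation splitter runs before the digit splitter so that each newly separated token is checked against the digit exceptions on its own.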

III. Dictionary-based correction

The dictionary-based correction model includes four modules:

  • detector: to detect errors
  • candidate generator: to generate correction candidates
  • ranker: to rank candidates and find the best correction
  • corrector: to replace the detected error with the best correction. The corrector must handle both single-token (spelling and split) and multi-token (merge) corrections.
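
To make the division of labor among the four modules concrete, here is a small Python sketch under stated assumptions. The toy dictionary and its frequencies are invented, candidates are limited to one edit or one split point, and the ranker scores by frequency alone. None of these choices is taken from CSpell itself, and all function names are hypothetical.

```python
# Toy dictionary with made-up frequencies (an assumption, not CSpell data).
FREQ = {"i": 90, "know": 40, "about": 60, "diabetes": 50, "have": 70, "type": 35}
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def detect(token: str) -> bool:
    """Detector: flag alphabetic tokens that are not in the dictionary."""
    return token.isalpha() and token.lower() not in FREQ


def edits1(word: str) -> set:
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def generate(word: str) -> list:
    """Candidate generator: spelling candidates plus two-word split candidates."""
    spelling = sorted(w for w in edits1(word) if w in FREQ)
    split = [
        f"{word[:i]} {word[i:]}"
        for i in range(1, len(word))
        if word[:i] in FREQ and word[i:] in FREQ
    ]
    return spelling + split


def rank(candidates: list):
    """Ranker: pick the candidate whose least frequent word is most frequent."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: min(FREQ[w] for w in c.split()))


def correct(tokens: list) -> list:
    """Corrector: apply single-token (spelling, split) and multi-token (merge) fixes."""
    out, i = [], 0
    while i < len(tokens):
        tok = tokens[i].lower()
        if not detect(tok):
            out.append(tokens[i])
            i += 1
            continue
        # Merge: this sketch only tries joining the error with the next token.
        if i + 1 < len(tokens) and tok + tokens[i + 1].lower() in FREQ:
            out.append(tok + tokens[i + 1].lower())
            i += 2  # two input tokens become one output token
            continue
        best = rank(generate(tok))
        # A split correction turns one input token into two output tokens.
        out.extend(best.split() if best else [tokens[i]])
        i += 1
    return out


print(correct("i knowabout diabetis".split()))
print(correct("i have dia betes".split()))
```

The corrector is the only module that changes the token count: a split yields more tokens than it consumed and a merge yields fewer, which is why it has to treat single-token and multi-token corrections differently.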