Lazy Tokenizer
I. Introduction
This page describes the lazy implementation of the tokenizer. "Lazy" means that a step is deferred until its result is actually needed, which improves speed. In CSpell, the input text is tokenized into words and the words are processed sequentially. Tokenization on punctuation is implemented lazily (it is delayed until the last moment) and is combined with the coreTerm class, so that no computation is spent on splitting off punctuation and reassembling it when that is not required. This implementation saves time, is easier to maintain, and fits well with Java 8 stream operations.
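The idea above can be illustrated with a minimal sketch. It is not the CSpell source: the class LazyTokenSketch, the nested Token class, and the methods tokenize, coreTerm, withCoreTerm, and correct are hypothetical names chosen for illustration, and "punctuation" is assumed here to mean any leading or trailing character that is not a letter or digit. The text is split on whitespace only; the punctuation boundaries of a token are computed the first time its core term is requested, and the original token is kept so nothing has to be reassembled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch only; names are hypothetical, not the CSpell API.
class LazyTokenSketch {

    /** A whitespace-delimited token whose punctuation is split off on demand. */
    static final class Token {
        private final String text;
        private int start = -1; // index of first core character, -1 = not computed yet
        private int end = -1;   // index after last core character

        Token(String text) { this.text = text; }

        String text() { return text; }

        boolean isSplit() { return start >= 0; }

        // Find the core term boundaries once, and only when first needed.
        private void split() {
            if (isSplit()) return;
            int s = 0;
            int e = text.length();
            while (s < e && !Character.isLetterOrDigit(text.charAt(s))) s++;
            while (e > s && !Character.isLetterOrDigit(text.charAt(e - 1))) e--;
            start = s;
            end = e;
        }

        /** The token without leading and trailing punctuation. */
        String coreTerm() {
            split();
            return text.substring(start, end);
        }

        /** The original token with only its core term replaced. */
        String withCoreTerm(String replacement) {
            split();
            return text.substring(0, start) + replacement + text.substring(end);
        }
    }

    /** Split on whitespace only; punctuation stays attached to each token. */
    static List<Token> tokenize(String text) {
        return Arrays.stream(text.trim().split("\\s+"))
                .filter(s -> !s.isEmpty())
                .map(Token::new)
                .collect(Collectors.toList());
    }

    /** Replace one core term and rebuild the text, keeping punctuation in place. */
    static String correct(List<Token> tokens, String from, String to) {
        return tokens.stream()
                .map(t -> t.coreTerm().equals(from) ? t.withCoreTerm(to) : t.text())
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        List<Token> tokens = tokenize("I saw (teh), cat.");
        System.out.println(correct(tokens, "teh", "the"));
    }
}
```

In this sketch a token such as "(teh)," is never broken into separate punctuation tokens. A processing step asks for its core term, and a corrected token is rebuilt from the stored boundaries, so the surrounding punctuation does not need a separate assembly pass.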
II. Source code
TokenObj.java
TokenUtil.java
TextObj.java
TermUtil.java
III. Design and Algorithm