Training Set
II. Description
We used both the training set and the test set from the Ensemble method as our training set to develop CSpell. The training set is summarized as follows:
Item | Value |
---|---|
Consumer health questions | 471* |
Tokens | 24,837 |
Annotation tags | 1,008 |
Instances of non-word corrections | 774 |
Instances of real-word corrections | 964 |
Word count per question | 5 - 328 |
Average word count per question | 52.49 |
Error per question | 0 - 27 |
Average error per question | 2.14 |
Error rate (error per token) | 0.04 (= 964/24,837) |
*One question (11199.txt) was removed from the Ensemble method data because it contains too many non-English words.
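The derived figures in the summary can be rechecked from the raw counts. The following is a minimal Python sketch, not part of CSpell. The variable names are illustrative. It assumes the average error per question is the number of annotation tags divided by the number of questions, which reproduces the reported 2.14. The error-rate numerator (964) is taken exactly as printed in the table.

```python
# Counts copied from the training-set summary above.
questions = 471
tokens = 24_837
annotation_tags = 1_008
error_rate_numerator = 964  # numerator as printed in the error-rate row

# Assumption: average error per question = annotation tags / questions.
avg_errors_per_question = annotation_tags / questions

# Error rate as printed in the summary: 964 / 24,837.
error_rate = error_rate_numerator / tokens

print(f"Average error per question: {avg_errors_per_question:.2f}")
print(f"Error rate (error per token): {error_rate:.2f}")
```

Both values round to the figures given in the summary (2.14 and 0.04).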
III. Distribution of Errors in the Training Set
Count | Minimum | Maximum | Average |
---|---|---|---|
Character | 34 | 1985 | 296.37 |
Word | 5 | 328 | 52.49 |
Error Tag | 0 | 27 | 2.14 |
Correction needed | non-word | real-word | ND | Multiple | Total |
---|---|---|---|---|---|
Spelling | 348 | 153 | 113 | N/A | 614 |
Merge | 10 | 38 | 0 | N/A | 48 |
Split | 24 | 10 | 281 | N/A | 315 |
Multiple | N/A | N/A | N/A | 31 | 31 |
Total | 382 | 201 | 394 | 31 | 1008 |
Percentage | 37.90% | 19.94% | 39.09% | 3.08% | 100.00% |
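The Total and Percentage rows follow directly from the cell counts. The sketch below recomputes them in Python as a consistency check. The data layout and variable names are assumptions made for illustration, and `None` stands in for the N/A cells.

```python
# Counts copied from the error-distribution table above; None marks N/A cells.
columns = ["non-word", "real-word", "ND", "Multiple"]
rows = {
    "Spelling": [348, 153, 113, None],
    "Merge": [10, 38, 0, None],
    "Split": [24, 10, 281, None],
    "Multiple": [None, None, None, 31],
}

# Column totals, skipping N/A cells.
col_totals = [
    sum(r[i] for r in rows.values() if r[i] is not None)
    for i in range(len(columns))
]
grand_total = sum(col_totals)

# Share of each column in the grand total, as a percentage.
percentages = [round(100 * t / grand_total, 2) for t in col_totals]

for name, total, pct in zip(columns, col_totals, percentages):
    print(f"{name}: {total} ({pct:.2f}%)")
print(f"Total: {grand_total}")
```

The recomputed column totals (382, 201, 394, 31) sum to 1,008, matching the number of annotation tags in the summary.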
IV. Other Components
V. Performance Tests