The MEDLINE N-gram Set 2019: by Split, Group, Filter, and Combine Algorithm
The MEDLINE n-gram set (generated by the split, group, filter, and combine algorithm) is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, split into sentences, and then tokenized into words (using spaces as word boundaries). Finally, the n-gram set is produced by filtering out n-grams that are longer than 50 characters or whose total word count is less than 30. The specifications for generating these n-grams are listed as follows:
Document count | Word count | N-gram |
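
As a rough illustration of reading the downloaded files, the sketch below assumes that the line above gives the per-line field layout (document count, word count, n-gram, separated by `|`) and that the archives contain UTF-8 text files. Neither assumption is confirmed on this page, and the names `parse_ngram_line` and `iter_ngrams` are hypothetical helpers, not part of any NLM tool.

```python
import io
import tarfile


def parse_ngram_line(line):
    """Split one line into (document_count, word_count, ngram).

    Assumes three '|'-separated fields; the n-gram is the last field,
    so any further '|' characters are kept as part of the n-gram.
    """
    doc_count, word_count, ngram = line.rstrip("\n").split("|", 2)
    return int(doc_count), int(word_count), ngram.strip()


def iter_ngrams(tgz_path):
    """Yield parsed records from every regular file in a .tgz archive."""
    with tarfile.open(tgz_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            stream = archive.extractfile(member)
            # UTF-8 is an assumption about the file encoding.
            for raw in io.TextIOWrapper(stream, encoding="utf-8"):
                if raw.strip():
                    yield parse_ngram_line(raw)
```

Streaming the archive member by member avoids unpacking the larger files to disk, for example `for doc_count, word_count, ngram in iter_ngrams("2-gram.2019.tgz"): ...`.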
N-grams | File | Zip Size | Actual Size | No. of n-grams |
---|---|---|---|---|
Unigrams | 1-gram.2019.tgz | 7.2 MB | 18 MB | 1,075,227 |
Bigrams | 2-gram.2019.tgz | 46 MB | 135 MB | 6,336,698 |
Trigrams | 3-gram.2019.tgz | 71 MB | 233 MB | 9,078,536 |
Four-grams | 4-gram.2019.tgz | 50 MB | 174 MB | 5,729,590 |
Five-grams | 5-gram.2019.tgz | 24 MB | 86 MB | 2,446,765 |
N-gram Set | nGramSet.2019.30.tgz | 196 MB | 644 MB | 24,666,816 |
Distilled N-gram Set | distilledNGram.2019.tgz | 77 MB | 250 MB | 9,595,606 |
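
The generation steps described at the top of this page (combine title and abstract, split into sentences, tokenize on spaces, build n-grams, then filter) can be sketched in a few lines. This is a simplified in-memory illustration, not the split, group, filter, and combine implementation itself: the regex sentence splitter, the function names, and the absence of case normalization are all assumptions made for the example. Only the two thresholds (50 characters, word count of 30) come from the text.

```python
import re
from collections import Counter

MAX_CHARS = 50       # n-grams longer than this are dropped
MIN_WORD_COUNT = 30  # n-grams occurring fewer times than this are dropped


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    """Return all contiguous n-token sequences, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(records, max_n=5):
    """Count n-grams (n = 1..max_n) over (title, abstract) pairs.

    Returns two Counters: documents containing each n-gram, and
    total occurrences of each n-gram.
    """
    doc_count, word_count = Counter(), Counter()
    for title, abstract in records:
        seen = set()
        for sentence in split_sentences(f"{title} {abstract}"):
            tokens = sentence.split()  # spaces are the word boundary
            for n in range(1, max_n + 1):
                for gram in ngrams(tokens, n):
                    word_count[gram] += 1
                    seen.add(gram)
        doc_count.update(seen)  # each document counted once per n-gram
    return doc_count, word_count


def filter_ngrams(doc_count, word_count,
                  max_chars=MAX_CHARS, min_word_count=MIN_WORD_COUNT):
    """Keep n-grams within the length limit and at or above the count limit."""
    return {
        gram: (doc_count[gram], wc)
        for gram, wc in word_count.items()
        if len(gram) <= max_chars and wc >= min_word_count
    }
```

Because tokens are split on spaces only, punctuation stays attached to words in this sketch (for example `failure.` and `failure` are counted separately).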