MEDLINE N-Gram Set

The MEDLINE N-gram Set 2025: by Split, Group, Filter, Combine and Sort Algorithm

The MEDLINE n-gram set, generated by the split, group, filter, combine and sort (SGFCS) algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, tokenized into sentences, and then tokenized into tokens (words, using space as the word boundary). Finally, n-grams are generated, and terms with more than 50 characters or a total word count of less than 30 are filtered out. The specifications for generating these n-grams are as follows:

MEDLINE: 2025 - TI and AB (from MEDLINE Baseline Repository - MBR, pubmed25nXXXX.xml -> PmidTiAbS25nXXXX.txt: 1 ~ 1274)
Method: Split, Group, Filter, Combine and Sort Algorithm
Max. Character Size: 50
Min. word count: 30
Min. document count: 1
Total document count: 38,201,553
Total sentence count: 270,098,242
Total token count: 5,676,864,905
N-gram files
- File format - 3 fields:
  
  Document count Word count N-gram
- Each n-gram file is sorted by document count, then word count, then alphabetic order of the n-grams. The combined N-gram Set is not sorted; it can be sorted with the nGramUtil package.
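
The pipeline described above can be illustrated with a minimal Python sketch. It is not the SGFCS implementation. It only shows how title and abstract are combined, split into sentences and space-delimited tokens, counted, filtered by the listed thresholds, and sorted. The sentence splitter, the function names, and the sort direction (counts descending) are assumptions made for illustration. Here "document count" is taken to mean the number of records containing an n-gram and "word count" its total number of occurrences.

```python
import re
from collections import Counter

MAX_CHARS = 50       # "Max. Character Size" from the specification
MIN_WORD_COUNT = 30  # "Min. word count" from the specification
MIN_DOC_COUNT = 1    # "Min. document count" from the specification


def split_sentences(text):
    # Hypothetical splitter: break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams_in_document(title, abstract, n):
    """Return the n-grams found in one record, one entry per occurrence."""
    found = []
    for sentence in split_sentences(f"{title} {abstract}"):
        tokens = sentence.split()  # whitespace as the word boundary
        for i in range(len(tokens) - n + 1):
            found.append(" ".join(tokens[i:i + n]))
    return found


def build_ngram_table(records, n):
    """records: iterable of (title, abstract) pairs. Returns sorted rows."""
    word_count = Counter()  # total occurrences across all documents
    doc_count = Counter()   # number of documents containing the n-gram
    for title, abstract in records:
        grams = ngrams_in_document(title, abstract, n)
        word_count.update(grams)
        doc_count.update(set(grams))
    rows = [
        (doc_count[g], word_count[g], g)
        for g in word_count
        if len(g) <= MAX_CHARS
        and word_count[g] >= MIN_WORD_COUNT
        and doc_count[g] >= MIN_DOC_COUNT
    ]
    # Assumed order: counts descending, then n-gram in alphabetic order.
    rows.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return rows
```

Each returned row has the three fields of the file format (document count, word count, n-gram). A single in-memory pass like this would not scale to the full MEDLINE baseline; it is only meant to make the counting and filtering rules concrete.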

Download:

N-grams	File	Zip Size	Actual Size	No. of n-grams
Unigrams	1-gram.2025.tgz	9.6 MB	24 MB	1,441,038
Bigrams	2-gram.2025.tgz	64 MB	189 MB	8,825,402
Trigrams	3-gram.2025.tgz	104 MB	344 MB	13,303,488
Four-grams	4-gram.2025.tgz	77 MB	270 MB	8,817,816
Five-grams	5-gram.2025.tgz	40 MB	142 MB	3,982,724

N-gram Set	nGramSet.2025.30.tgz	293 MB	967 MB	36,370,468
Distilled N-gram Set	distilledNGram.2025.tgz	119 MB	392 MB	14,722,972
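
After downloading and extracting one of the files above, its three-field records could be read roughly as follows. This is a sketch under assumptions: the field delimiter is not stated on this page, so `sep="|"` is a guess exposed as a parameter, and `NGramRecord` and `parse_ngram_lines` are hypothetical names, not part of nGramUtil.

```python
from typing import Iterable, Iterator, NamedTuple


class NGramRecord(NamedTuple):
    doc_count: int
    word_count: int
    ngram: str


def parse_ngram_lines(lines: Iterable[str], sep: str = "|") -> Iterator[NGramRecord]:
    """Yield one record per non-empty line; `sep` is an assumed delimiter."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Split on the first two separators only, so the n-gram itself
        # may contain the separator character.
        doc_count, word_count, ngram = line.split(sep, 2)
        yield NGramRecord(int(doc_count), int(word_count), ngram)


# Made-up sample lines, for illustration only (not taken from the real files).
sample = ["120|150|heart failure\n", "3|40|failure of\n"]
frequent = [r.ngram for r in parse_ngram_lines(sample) if r.doc_count >= 100]
```

Because the function accepts any iterable of lines, an open text file object can be passed in directly, which avoids loading a multi-hundred-megabyte file into memory at once.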