MEDLINE N-Gram Set

The MEDLINE N-gram Set 2025: by Split, Group, Filter, Combine and Sort Algorithm

The MEDLINE n-gram set, generated by the split, group, filter, combine and sort (SGFCS) algorithm, is listed below. For each MEDLINE record, the title and abstract are used as the source of n-grams. They are combined, split into sentences, and then tokenized into words, using spaces as word boundaries. Finally, n-grams are generated, filtering out terms that have more than 50 characters or a total word count of less than 30. The specifications for generating these n-grams are as follows:

  • MEDLINE: 2025 - TI and AB (from MEDLINE Baseline Repository - MBR, pubmed25nXXXX.xml -> PmidTiAbS25nXXXX.txt: 1 ~ 1274)
  • Method: Split, Group, Filter, Combine and Sort Algorithm
  • Max. Character Size: 50
  • Min. word count: 30
  • Min. document count: 1
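
The steps and thresholds above can be illustrated with a minimal sketch. This is not the SGFCS algorithm itself, only a naive in-memory count that applies the same limits. The function names, the regex sentence splitter, the descending sort direction and the absence of case folding are all assumptions made for illustration.

```python
import re
from collections import defaultdict

MAX_CHARS = 50       # "Max. Character Size" from the specification
MIN_WORD_COUNT = 30  # "Min. word count" from the specification
MIN_DOC_COUNT = 1    # "Min. document count" from the specification


def split_sentences(text):
    # Assumption: a naive splitter that breaks after ., ! or ? plus whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def ngrams(tokens, n):
    # All contiguous runs of n tokens, joined by a single space.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(records, n, max_chars=MAX_CHARS,
                 min_wc=MIN_WORD_COUNT, min_dc=MIN_DOC_COUNT):
    """records: iterable of (title, abstract) pairs (hypothetical input shape)."""
    word_count = defaultdict(int)  # total occurrences of each n-gram
    doc_count = defaultdict(int)   # number of records containing each n-gram
    for title, abstract in records:
        seen = set()
        # Title and abstract are combined, then split into sentences.
        for sentence in split_sentences(f"{title} {abstract}"):
            tokens = sentence.split()  # space as word boundary
            for gram in ngrams(tokens, n):
                if len(gram) > max_chars:
                    continue  # drop terms longer than the character limit
                word_count[gram] += 1
                seen.add(gram)
        for gram in seen:
            doc_count[gram] += 1
    rows = [(doc_count[g], word_count[g], g) for g in word_count
            if word_count[g] >= min_wc and doc_count[g] >= min_dc]
    # Assumption: counts sort in descending order, ties broken alphabetically.
    rows.sort(key=lambda r: (-r[0], -r[1], r[2]))
    return rows
```

Because tokens are split on spaces only, punctuation stays attached to words ("failure." and "failure" are different tokens). At the scale listed here (about 5.7 billion tokens) an in-memory dictionary like this would not be practical, which is presumably why the actual method splits and groups the data first.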

  • Total document count: 38,201,553
  • Total sentence count: 270,098,242
  • Total token count: 5,676,864,905

  • N-gram files
    • File format - 3 fields:
      Document count | Word count | N-gram
    • The individual n-gram files are sorted by document count, word count, then alphabetical order of n-grams. The combined N-gram set is not sorted; it can be sorted with the nGramUtil package.

  • Download:
    N-grams               File                     Zip Size  Actual Size  No. of n-grams
    Unigrams              1-gram.2025.tgz          9.6 MB    24 MB        1,441,038
    Bigrams               2-gram.2025.tgz          64 MB     189 MB       8,825,402
    Trigrams              3-gram.2025.tgz          104 MB    344 MB       13,303,488
    Four-grams            4-gram.2025.tgz          77 MB     270 MB       8,817,816
    Five-grams            5-gram.2025.tgz          40 MB     142 MB       3,982,724
    N-gram Set            nGramSet.2025.30.tgz     293 MB    967 MB       36,370,468
    Distilled N-gram Set  distilledNGram.2025.tgz  119 MB    392 MB       14,722,972
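
As a rough sketch of how the three-field records might be read after download, the snippet below parses lines into typed rows and checks that the unigram through five-gram counts add up to the N-gram Set total. The "|" field separator, the `NGramRow` type and the `parse_rows` helper are assumptions for illustration; check the separator against an extracted file before relying on it.

```python
from typing import Iterable, Iterator, NamedTuple


class NGramRow(NamedTuple):
    doc_count: int   # number of documents containing the n-gram
    word_count: int  # total occurrences of the n-gram
    ngram: str


def parse_rows(lines: Iterable[str], sep: str = "|") -> Iterator[NGramRow]:
    """Parse 'document count, word count, n-gram' lines.

    Assumption: fields are separated by `sep`; pass a different
    separator if the downloaded files use another one.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split at most twice so a separator inside the n-gram is preserved.
        dc, wc, gram = line.split(sep, 2)
        yield NGramRow(int(dc), int(wc), gram)


# Counts taken from the download table for the 1-gram to 5-gram files.
PER_SIZE_COUNTS = {
    1: 1_441_038,
    2: 8_825_402,
    3: 13_303_488,
    4: 8_817_816,
    5: 3_982_724,
}
NGRAM_SET_COUNT = 36_370_468

# The five per-size files together account for the full N-gram Set.
assert sum(PER_SIZE_COUNTS.values()) == NGRAM_SET_COUNT
```

With an extracted file, the parser can be fed an open text file handle directly, since it accepts any iterable of lines.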