The SPECIALIST Lexicon

The MEDLINE.2024 N-gram Set

This page describes how the n-grams (n = 1-5) of the 2024 MEDLINE n-gram set release were generated from MEDLINE using the split, combine, and filter algorithm. Make sure that all n-grams are generated correctly in step 11 (group) with the correct setup.
The data in the tables below come from the following log and data files:

I. Log and data files

Description | Location | Notes
Input options for runGen${N}GramAll | ${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram
  • Parameters for options 10-13
runGen${N}GramAll log | ${MULTIWORDS}/bin/02.NGramGenAll/log.${N}
  • Run time
Log file for options 10-13 | ${MULTIWORDS}/bin/02.NGramGenAll/logData/${YEAR}/${N}-gram
  • Detailed log for options 10-13
  • Not used in the table below
1.Split log | ${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50
  • Total document, sentence, and token counts
  • Unique unigram count
  • Non-unique split n-gram count
N-gram out files | ${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*
  • Result files from Split, Group, and FilterCombine
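
The location templates above can be expanded for a concrete run. The following is a minimal sketch, not part of the original tooling: the function name `resolve_locations`, the dictionary keys, and the example root directory are hypothetical, and only the path templates themselves are taken from the table.

```python
from string import Template

# Location templates copied from the table above; the dictionary keys are
# hypothetical labels chosen for this sketch.
LOCATIONS = {
    "input_options": "${MULTIWORDS}/bin/02.NGramGenAll/inData/${YEAR}/${N}-gram",
    "run_log": "${MULTIWORDS}/bin/02.NGramGenAll/log.${N}",
    "option_logs": "${MULTIWORDS}/bin/02.NGramGenAll/logData/${YEAR}/${N}-gram",
    "split_log": "${MULTIWORDS}/bin/Log.${YEAR}/02.NGramGen/log.heap.${N}.50",
    "out_files": "${MULTIWORDS}/data/${YEAR}/outData/02.NGram/${N}-gram/*",
}


def resolve_locations(multiwords, year, n):
    """Substitute ${MULTIWORDS}, ${YEAR}, and ${N} in every template."""
    values = {"MULTIWORDS": multiwords, "YEAR": str(year), "N": str(n)}
    return {name: Template(text).substitute(values)
            for name, text in LOCATIONS.items()}


if __name__ == "__main__":
    # "/path/to/multiwords" is a placeholder, not the real install directory.
    for name, path in resolve_locations("/path/to/multiwords", 2024, 2).items():
        print(f"{name}: {path}")
```

`string.Template` is used because its `${NAME}` placeholder syntax matches the notation in the table, so the templates can be pasted in unchanged.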

II. Detail logs

Program | N | Approx. Time (Hr.)
Option 1.1
  • GenPmidTiAbSentenceFromXmls
  • pubmed{YY}n{DDDD}.xml
  • PmidTiAbSentences{YY}n{DDDD}.txt
Option 10
  • Gen split n-gram
  • GetNGramFromSentenceFiles
  • MAX_CL = 50

  • 1.Split:
Option 11
  • Group split n-grams by alphabetic characters
  • GroupSpliteNGrams

  • 2.Group:
Option 12
  • Filter by WC and combine alphabetic n-grams
  • FilterWcCombineNGrams
  • WC = 30

  • 3.FilterCombine:
Option 13
  • Sort n-grams by dwt, tdw
  • NGramFilter

  • 3.FilterCombine:
Preprocess 4.0
  • ~5.0 hr (~300 files/hr).
  • PmidTiAbS24n: 0001-1219
    
unigrams | n=1 | < 1.0 hr.
(from ./bin/02.NGramGenAll/logData/${YEAR}/N-gram/*.log)
 
  • param: 10,1, (150000000)
  • 50 min.


    from ./bin/Log.${YEAR}/02-NGramGen/log.heap.1.50:

  • Documents: 36,555,430
  • Sentences: 253,923,392
  • Tokens: 5,326,576,788

  • split: 1, no split
  • 1-grams (not unique, from log.heap.1.50): 43,517,942
    (this count is already unique because there is no split; verify with wc -l)

  • Files:
    • nGram.out.1.heap.50.s01.0001-1219 (715 MB, use ls -alh)
  • param:
    • 11,1,01,NO,NO
  • 2 min.

  • Group Alphabetically
  • 1-gram (unique): 43,517,942

  • Files:
    • ${NGram}.g01.NO-NO (715 MB|43MB, from ./logData/${YEAR}/1-gram/11-1.log)
  • param: 12, 1, 30
  • 1 min

  • 1-gram (WC >= 30): 1,374,878

  • File:
    • 1-gram.${YEAR}.30 (23 MB)
  • param: 13, 1, 30
  • 1 min.

  • 1-gram (sorted): 1,374,878

  • File:
    • 1-gram.${YEAR}.30.dwt (23 MB)
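
The steps in the unigram row above (generate n-grams, filter by WC, sort) can be illustrated on a toy corpus that fits in memory. This is only a sketch under several assumptions that the page does not confirm: tokens are split on whitespace, MAX_CL = 50 is read as a maximum character length per n-gram, WC is read as the total word count of an n-gram, and the sorted output is ordered by document count, then word count, then term. The grouping step (option 11) is omitted because nothing needs to be split here. The function names are hypothetical.

```python
from collections import Counter


def count_ngrams(docs, n, max_len=50):
    """Count n-grams over {doc_id: [sentence, ...]}.

    Returns (word_count, doc_count): total occurrences per n-gram and the
    number of distinct documents each n-gram appears in.
    """
    word_count = Counter()
    doc_count = Counter()
    for sentences in docs.values():
        seen_in_doc = set()
        for sentence in sentences:
            tokens = sentence.split()  # assumption: whitespace tokenization
            for i in range(len(tokens) - n + 1):
                gram = " ".join(tokens[i:i + n])
                if len(gram) > max_len:  # assumption: MAX_CL is a length cap
                    continue
                word_count[gram] += 1
                seen_in_doc.add(gram)
        doc_count.update(seen_in_doc)  # one increment per document
    return word_count, doc_count


def filter_and_sort(word_count, doc_count, min_wc=30):
    """Keep n-grams with WC >= min_wc, sorted by doc count, WC, then term."""
    rows = [(doc_count[g], wc, g) for g, wc in word_count.items() if wc >= min_wc]
    rows.sort(key=lambda row: (-row[0], -row[1], row[2]))
    return rows


if __name__ == "__main__":
    toy_docs = {
        "1": ["heart attack risk", "heart attack"],
        "2": ["heart attack"],
    }
    wc, dc = count_ngrams(toy_docs, n=2)
    # A threshold of 2 is used only because the toy corpus is tiny.
    for doc_n, word_n, gram in filter_and_sort(wc, dc, min_wc=2):
        print(f"{doc_n}|{word_n}|{gram}")
```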
bigrams | n=2 | 7.1 hr.
  • param: 10,2, (150000000)
  • 3.0 hr.

  • split: 4
  • 2-gram (not unique from log.heap.2.50): 515,200,567

  • Files:
    • s01.0001-0583 (3.1 GB)
    • s02.0584-0890 (3.0 GB)
    • s03.0891-1139 (3.0 GB)
    • s04.1140-1219 (1.3 GB)
  • param: see file names below
    • 11,2,01,NO,M
    • 11,2,02,M,k
    • 11,2,03,k,NO
  • 1.0 hr.

  • Group Alphabetically
  • 2-gram (unique, use wc -l): 385,792,790

  • Files:
    • ${NGram}.g01.NO-M (2.2GB|113MB)
    • ${NGram}.g02.M-k (3.2GB|149MB)
    • ${NGram}.g03.k-NO (2.6GB|122MB)
  • param: 12, 2, 30
  • 3.0 hr.

  • 2-gram (WC >= 30): 8,369,463

  • File:
    • 2-gram.${YEAR}.30 (179 MB)
  • param: 13, 2, 30
  • 2 min.

  • 2-gram (sorted): 8,369,463

  • File:
    • 2-gram.${YEAR}.30.dwt (179 MB)
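
The group file names in the bigram row (NO-M, M-k, k-NO) suggest that n-grams are partitioned by boundary strings in an order where uppercase letters sort before lowercase ones. The sketch below shows one way such a partition could be computed. It assumes that "NO" means "no bound", that the lower bound is inclusive and the upper bound exclusive, and that ordering follows Python's default code-point comparison; none of these details is confirmed by the page, and the function names are hypothetical.

```python
from bisect import bisect_right


def group_labels(bounds):
    """Build labels such as g01.NO-M from a sorted list of boundary strings."""
    edges = ["NO"] + list(bounds) + ["NO"]
    return [f"g{i + 1:02d}.{edges[i]}-{edges[i + 1]}"
            for i in range(len(edges) - 1)]


def assign_group(ngram, bounds):
    """Return the 0-based group index for an n-gram.

    bounds must be sorted in code-point order (digits < uppercase < lowercase).
    """
    return bisect_right(bounds, ngram)


if __name__ == "__main__":
    bounds = ["M", "k"]  # boundaries taken from the bigram group file names
    labels = group_labels(bounds)
    for gram in ["10 mg", "Heart rate", "Mice were", "kidney disease", "of the"]:
        print(f"{gram!r} -> {labels[assign_group(gram, bounds)]}")
```

A binary search over the boundary list keeps each assignment at O(log g) for g groups, which matters little for three groups but scales to the longer boundary lists used for larger n.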
trigrams | n=3 | 14.0 hr.
  • param: 10,3, (150000000)
  • 4.5 hr.

  • split: 14
  • 3-gram (not unique - from log.heap.3.50): 2,052,218,934

  • Files:
    • s01.0001-0151 (3.5 GB)
    • s02.0152-0309 (3.5 GB)
    • s03.0310-0400 (3.5 GB)
    • s04.0401-0532 (3.5 GB)
    • s05.0533-0614 (3.5 GB)
    • s06.0615-0692 (3.5 GB)
    • s07.0693-0760 (3.5 GB)
    • s08.0761-0826 (3.5 GB)
    • s09.0827-0890 (3.5 GB)
    • s10.0891-0954 (3.6 GB)
    • s11.0955-1014 (3.5 GB)
    • s12.1015-1073 (3.5 GB)
    • s13.1074-1131 (3.5 GB)
    • s14.1132-1189 (3.5 GB)
    • s15.1190-1219 (2.0 GB)
  • param: see file names below
  • 9.0 hr.

  • Group Alphabetically
  • 3-gram (unique wc -l): 1,331,672,919

  • Files:
    • g01.NO-E (4.4GB|187MB)
    • g02.E-Z (3.8GB|154MB)
    • g03.Z-c (4.1GB|159MB)
    • g04.c-f (4.2GB|154MB)
    • g05.f-j (3.9GB|148MB)
    • g06.j-o (2.6GB|98MB)
    • g07.o-r (3.6GB|140MB)
    • g08.r-th (3.4GB|126MB)
    • g09.th-NO (4.0GB|161MB)
  • param: 12, 3, 30
  • 0.5 hr.

  • 3-gram (WC >= 30): 12,511,710

  • File:
    • 3-gram.${YEAR}.30 (323 MB)
  • param: 13, 3, 30
  • 3 min.

  • 3-gram (sorted): 12,511,710

  • File:
    • 3-gram.${YEAR}.30.dwt (323 MB)
fourgrams | n=4 | 31.5 hr.
  • param: 10,4, (130000000)
  • 5.5 hr.

  • split: 25
  • 4-gram (not unique - from log.heap.4.50): 3,366,472,669

  • Files:
    • s01.0001-0077 (4.0 GB)
    • s02.0078-0204 (3.9 GB)
    • s03.0205-0272 (3.9 GB)
    • s04.0273-0319 (3.9 GB)
    • s05.0320-0372 (4.0 GB)
    • s06.0373-0419 (4.0 GB)
    • s07.0420-0517 (4.0 GB)
    • s08.0518-0559 (3.9 GB)
    • s09.0560-0605 (3.9 GB)
    • s10.0606-0649 (4.0 GB)
    • s11.0650-0696 (4.0 GB)
    • s12.0697-0735 (4.0 GB)
    • s13.0736-0773 (4.0 GB)
    • s14.0774-0811 (4.0 GB)
    • s15.0812-0848 (4.0 GB)
    • s16.0849-0885 (4.0 GB)
    • s17.0886-0921 (4.0 GB)
    • s18.0922-0959 (4.1 GB)
    • s19.0960-0995 (4.1 GB)
    • s20.0996-1030 (4.1 GB)
    • s21.1031-1065 (4.1 GB)
    • s22.1066-1099 (4.0 GB)
    • s23.1100-1132 (4.0 GB)
    • s24.1132-1166 (4.1 GB)
    • s25.1167-1200 (4.1 GB)
    • s26.1201-1219 (2.3 GB)
  • param: see file names below
  • 25 hr.

  • Group Alphabetically
  • 4-gram (unique): 2,643,278,820

  • Files:
    • g01.NO-8 (4.3GB|154MB)
    • g02.8-H (4.1GB|135MB)
    • g03.H-S (3.9GB|126MB)
    • g04.S-ad (4.1GB|135MB)
    • g05.ad-anl (4.3GB|142MB)
    • g06.anl-c (4.1GB|132MB)
    • g07.c-d (4.6GB|139MB)
    • g08.d-es (4.1GB|126MB)
    • g09.es-gm (4.3GB|137MB)
    • g10.gm-ine (4.4GB|143MB)
    • g11.ine-m (3.5GB|109MB)
    • g12.m-o (4.2GB|131MB)
    • g13.o-p (4.6GB|158MB)
    • g14.p-r (4.4GB|134MB)
    • g15.r-sh (3.8GB|120MB)
    • g16.sh-th (4.1GB|126MB)
    • g17.th-to (4.1GB|137MB)
    • g18.to-w (3.9GB|127MB)
    • g19.w-NO (2.9GB|95MB)
  • param: 12, 4, 30
  • 1.0 hr.

  • 4-gram (WC >= 30): 8,226,169

  • File:
    • 4-gram.${YEAR}.30 (251 MB)
  • param: 13, 4, 30
  • 3 min.

  • 4-gram (sorted): 8,226,169

  • File:
    • 4-gram.${YEAR}.30.dwt (251 MB)
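
Because each split file name encodes the range of source files it covers, a listing like the ones above can be checked mechanically for gaps, overlaps, and skipped split indexes. The following is a minimal sketch of such a check, not part of the original tooling; the function name and the message wording are hypothetical. The sample names are the last five entries of the 4-gram split list.

```python
import re

# Matches names such as "s01.0001-0583"; case is ignored.
NAME_RE = re.compile(r"s(\d+)\.(\d{4})-(\d{4})$", re.IGNORECASE)


def check_split_ranges(names, first=1, last=1219):
    """Return a list of problems found in an ordered list of split names."""
    problems = []
    expected_start = first
    prev_index = None
    for name in names:
        match = NAME_RE.search(name)
        if match is None:
            problems.append(f"{name}: unrecognized name")
            continue
        index, start, end = (int(part) for part in match.groups())
        if prev_index is not None and index != prev_index + 1:
            problems.append(f"{name}: expected split index {prev_index + 1:02d}")
        if start < expected_start:
            problems.append(f"{name}: overlaps previous range")
        elif start > expected_start:
            problems.append(f"{name}: gap before {start:04d}")
        prev_index = index
        expected_start = end + 1
    if expected_start != last + 1:
        problems.append(
            f"last range ends at {expected_start - 1:04d}, expected {last:04d}")
    return problems


if __name__ == "__main__":
    tail = ["s22.1066-1099", "s23.1100-1132", "s24.1132-1166",
            "s25.1167-1200", "s26.1201-1219"]
    for problem in check_split_ranges(tail, first=1066):
        print(problem)
```

On this sample the check reports that the s24 range starts at 1132, the same file number at which s23 ends.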
fivegrams | n=5 | 44.7 hr.
  • param: 10,5, (120000000)
  • 6.0 hr.

  • split: 30
  • 5-gram (not unique): 3,856,685,759

  • Files:

    • s01.0001-0064 (4.3 GB)
    • s02.0065-0112 (4.3 GB)
    • s03.0113-0233 (4.3 GB)
    • s04.0234-0279 (4.3 GB)
    • s05.0280-0316 (4.4 GB)
    • s06.0317-0360 (4.4 GB)
    • s07.0361-0398 (4.4 GB)
    • s08.0399-0482 (4.4 GB)
    • s09.0483-0524 (4.4 GB)
    • s10.0525-0558 (4.4 GB)
    • s11.0559-0597 (4.4 GB)
    • s12.0598-0633 (4.4 GB)
    • s13.0634-0671 (4.4 GB)
    • s14.0672-0705 (4.4 GB)
    • s15.0706-0736 (4.4 GB)
    • s16.0737-0767 (4.4 GB)
    • s17.0768-0798 (4.4 GB)
    • s18.0799-0828 (4.4 GB)
    • s19.0829-0858 (4.4 GB)
    • s20.0859-0889 (4.5 GB)
    • s21.0890-0918 (4.4 GB)
    • s22.0919-0950 (4.5 GB)
    • s23.0951-0979 (4.4 GB)
    • s24.0980-1007 (4.4 GB)
    • s25.1008-1035 (4.4 GB)
    • s26.1036-1063 (4.5 GB)
    • s27.1064-1091 (4.5 GB)
    • s28.1092-1118 (4.4 GB)
    • s29.1119-1145 (4.4 GB)
    • s30.1146-1173 (4.6 GB)
    • s31.1174-1200 (4.4 GB)
    • s32.1201-1219 (3.1 GB)
  • param: see file names below
  • 40.5 hr.

  • Group Alphabetically
  • 5-gram (unique): 3,300,384,957

  • Files:
    • g01.NO-2 (3.9GB|113MB)
    • g02.2-C (4.1GB|119MB)
    • g03.C-I (4.0GB|108MB)
    • g04.I-R (4.3GB|116MB)
    • g05.R-a (4.7GB|128MB)
    • g06.a-an (4.8GB|131MB)
    • g07.an-ane (4.8GB|135MB)
    • g08.ane-b (3.3GB|87MB)
    • g09.b-c (3.5GB|96MB)
    • g10.c-com (3.4GB|90MB)
    • g11.com-d (3.5GB|88MB)
    • g12.d-ef (4.6GB|119MB)
    • g13.ef-fol (4.8GB|124MB)
    • g14.fol-h (4.1GB|113MB)
    • g15.h-in (3.4GB|87MB)
    • g16.in-int (4.2GB|116MB)
    • g17.int-m (4.5GB|122MB)
    • g18.m-n (4.4GB|116MB)
    • g19.n-of (2.3GB|61MB)
    • g20.of-ofa (4.9GB|144MB)
    • g21.ofa-pl (5.0GB|134MB)
    • g22.pl-re (4.7GB|121MB)
    • g23.re-s (3.6GB|95MB)
    • g24.s-st (4.0GB|105MB)
    • g25.st-the (4.8GB|129MB)
    • g26.the-thea (4.7GB|134MB)
    • g27.thea-toa (3.6GB|101MB)
    • g28.toa-w (4.2GB|112MB)
    • g29.w-NO (5.0GB|141MB)
  • param: 12, 5, 30
  • 1.0 hr.

  • 5-gram (WC >= 30): 3,678,688

  • File:
    • 5-gram.${YEAR}.30 (131 MB)
  • param: 13, 5, 30
  • 2 min.

  • 5-gram (sorted): 3,678,688

  • File:
    • 5-gram.${YEAR}.30.dwt (131 MB)
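
The effect of the WC >= 30 filter can be summarized from the counts reported above by comparing, for each n, the number of unique n-grams after grouping with the number kept after filtering. The snippet below is a small arithmetic sketch; the counts are copied from this page, while the variable and function names are hypothetical.

```python
# (unique n-grams after grouping, n-grams with WC >= 30), copied from this page.
COUNTS = {
    1: (43_517_942, 1_374_878),
    2: (385_792_790, 8_369_463),
    3: (1_331_672_919, 12_511_710),
    4: (2_643_278_820, 8_226_169),
    5: (3_300_384_957, 3_678_688),
}


def retention(unique, kept):
    """Fraction of unique n-grams that survive the WC filter."""
    return kept / unique


if __name__ == "__main__":
    for n, (unique, kept) in COUNTS.items():
        ratio = retention(unique, kept)
        print(f"{n}-gram: {kept:,} of {unique:,} kept ({ratio:.3%})")
```

The retained fraction shrinks as n grows, since longer n-grams repeat less often and fewer of them reach a word count of 30.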