
Text Categorization

MEDLINE Tokenizer

  • Description:

    Reads a MEDLINE file and retrieves the specified fields. The output is used as the input for JDI (both phrases and MeSHs).

  • Usage:
    > mlt -h
    
    Synopsis:
      Mlt [options]
    
    Description:
      Mlt is a program to tokenize MEDLINE citations by specifying field tags
    
    Options:
      -ci       Show configuration information
      -h        Print program help information (this is it)
      -i:STR    Specify input file (must specify)
      -pmid     Preserve PMID in the first field
      -s        Sort output by PMID
      -o:STR    Specify output file (must specify)
      -t:STR    Specify MEDLINE field tag:TI|AB|TIAB|MHs|TIABMHs|ALL|S_ALL (must specify)
                or any MEDLINE field tag
      -v        Print the current version of Mlt
      -x:STR    Specify an alternate configuration file
     
    

  • Sample Inputs:
    • 9801.2004.baseline.sorted
    • 9801.2005.baseline.sorted

  • Algorithm:
    • Read in file and save MEDLINE records into Java objects, CitationObjs.
    • Sort citations by PMID if the -s sorting flag is chosen
    • Print out field data by specified field tags:
      • MHs: starred MeSHs (MH and SH are separated by '|');
      • TIABMHs: combination of TI, AB, and MHs
      • ALL: the original format
      • TIAB: combination of TI and AB
      • Field tag: any legal field tag in MEDLINE, such as TI, AB, etc. Please refer to the MEDLINE field tags for details.
    • Re-format MHs:
      • Read in the MEDLINE file and generate MH (MeSH) entries
      • Handle MH entries that span multiple lines
      • Filter out MH entries without a star (*)
      • Tokenize the MeSH main heading
      • Tokenize the MeSH sub headings and keep only those with *
        => Indexing rules do not allow * on both MH and SH. However, our code is able to handle this situation.
      • The MeSH main heading is always unique
      • Deduplicate MeSH sub headings (a sub heading may be duplicated)
      • Each sub heading is changed to its abbreviated form
      • Print out the main heading and sub headings, using the separator "|"
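
The MH re-formatting steps above can be sketched in Java as follows. This is a minimal illustration, not the Mlt source: the class name `MhReformatSketch`, its methods, and the three-entry abbreviation table are hypothetical. It assumes multi-line MH values have already been joined into one string, and that the output is the main heading followed by each starred sub heading abbreviation, joined with "|". The exact Mlt output layout may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MhReformatSketch {
    // Illustrative subset of MeSH sub heading abbreviations (assumption:
    // the real tool would use the complete table).
    private static final Map<String, String> SH_ABBREVIATIONS = Map.of(
        "drug therapy", "DT",
        "genetics", "GE",
        "therapeutic use", "TU");

    /** Returns the reformatted term, or null if nothing in it is starred. */
    static String reformat(String mhValue) {
        if (!mhValue.contains("*")) {
            return null; // filter out MH entries without a star
        }
        String[] tokens = mhValue.trim().split("/");
        // The first token is the main heading; drop its star if present.
        String mainHeading = tokens[0].replace("*", "").trim();
        // LinkedHashSet removes duplicate sub headings and keeps their order.
        Set<String> subHeadings = new LinkedHashSet<>();
        for (int i = 1; i < tokens.length; i++) {
            String sh = tokens[i].trim();
            if (sh.startsWith("*")) { // keep only starred sub headings
                String name = sh.substring(1).trim();
                // Fall back to the full name when no abbreviation is known.
                subHeadings.add(SH_ABBREVIATIONS.getOrDefault(name, name));
            }
        }
        StringBuilder out = new StringBuilder(mainHeading);
        for (String sh : subHeadings) {
            out.append('|').append(sh);
        }
        return out.toString();
    }

    /** Reformats every MH value and drops the unstarred ones. */
    static List<String> reformatAll(List<String> mhValues) {
        List<String> result = new ArrayList<>();
        for (String value : mhValues) {
            String formatted = reformat(value);
            if (formatted != null) {
                result.add(formatted);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> mhValues = List.of(
            "Humans",
            "*Neoplasms/drug therapy/*genetics",
            "Anti-Bacterial Agents/*therapeutic use");
        // Prints "Neoplasms|GE" and "Anti-Bacterial Agents|TU", one per line.
        reformatAll(mhValues).forEach(System.out::println);
    }
}
```

The second sample value carries a star on both the main heading and a sub heading, which is the case the note above says the indexing rules forbid but the code tolerates.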

  • Sample commands:
    > mlt -t:TIAB -i:9801.2004.baseline.sorted -o:9801.2004.TIAB
    
    => Read in file 9801.2004.baseline.sorted, retrieve the TI and AB fields, and write the results to file 9801.2004.TIAB
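
What a TIAB extraction does for a single citation can be sketched roughly as below. The sketch is not the Mlt implementation: the class `TiabSketch` and its methods are hypothetical. It assumes the usual MEDLINE display layout (a tag padded to four characters, then "- ", with continuation lines indented by six spaces) and joins the title and abstract with a single space, which may not match the real output format.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TiabSketch {
    /** Parses one MEDLINE record into tag -> values, joining continuation lines. */
    static Map<String, List<String>> parseRecord(List<String> lines) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        String currentTag = null;
        for (String line : lines) {
            boolean startsField = line.length() >= 6
                && line.charAt(4) == '-'
                && !line.startsWith(" ");
            if (startsField) {
                // New field: the tag is in columns 1-4, the value starts at column 7.
                currentTag = line.substring(0, 4).trim();
                fields.computeIfAbsent(currentTag, k -> new ArrayList<>())
                      .add(line.substring(6).trim());
            } else if (currentTag != null && line.startsWith("      ")) {
                // Continuation line: append it to the most recent value.
                List<String> values = fields.get(currentTag);
                int last = values.size() - 1;
                values.set(last, values.get(last) + " " + line.trim());
            }
        }
        return fields;
    }

    /** Combines the TI and AB values of a parsed record into one string. */
    static String tiab(Map<String, List<String>> fields) {
        List<String> parts = new ArrayList<>();
        parts.addAll(fields.getOrDefault("TI", List.of()));
        parts.addAll(fields.getOrDefault("AB", List.of()));
        return String.join(" ", parts);
    }

    public static void main(String[] args) {
        // A made-up record used only to exercise the parser.
        List<String> record = List.of(
            "PMID- 1",
            "TI  - A title that",
            "      wraps.",
            "AB  - An abstract.");
        // Prints "A title that wraps. An abstract."
        System.out.println(tiab(parseRecord(record)));
    }
}
```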

  • Sample outputs:
    • 9801.2004.TI
    • 9801.2004.AB
    • 9801.2004.TIAB
    • 9801.2004.MH
    • 9801.2004.TIABMH