Text Categorization

STI: Text


  • Description:

    Read the input text and perform ST indexing based on:

    • word frequency count
    • document count for word
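
As an illustration of the two statistics named above, the following Python sketch counts how often each word occurs in a small collection (word frequency count) and in how many documents it appears (document count). The tokenizer, the function names, and the sample documents are assumptions made for illustration; they are not taken from the STI tool itself.

```python
import re
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase runs of letters and digits.
    return re.findall(r"[a-z0-9]+", text.lower())


def count_words(documents):
    word_freq = Counter()  # total occurrences of each word across all documents
    doc_count = Counter()  # number of documents in which each word appears
    for doc in documents:
        tokens = tokenize(doc)
        word_freq.update(tokens)
        doc_count.update(set(tokens))  # count each word once per document
    return word_freq, doc_count


# Made-up sample documents.
docs = ["Heart disease and heart failure", "Kidney disease"]
word_freq, doc_count = count_words(docs)
```

Here "heart" occurs twice but in only one document, while "disease" occurs twice across two documents, which is the distinction the two counts are meant to capture.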

  • Inputs:
    • a phrase, such as title or abstract
    • a file, such as 9801.2004.TI.in
    • a file, such as 9801.2004.AB.in
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply Word Extraction Filter
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
      • Get ST scores for each (legal) word in the text from DB: WORD_ST_SCORES table
      • Calculate the average ST scores for the text
    • Post-process (Output Filter):
      • Print out Input text (term)
      • Output filter details
      • Scores Entries display number
      • No output message
      • Cluster option
      • ST candidates
      • Use alphabetical order for STs that have the same score (such as chem and hcpp)
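
The three stages listed above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the tool's implementation: the in-memory dictionary stands in for the WORD_ST_SCORES table, a word is treated as "legal" when it has an entry in that dictionary, the average is taken over the legal words (an ST missing for a word contributes 0), and all scores are made up. The function names and the `display_num` parameter are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WORD_ST_SCORES table: word -> {ST: score}.
# The scores are invented for illustration only.
WORD_ST_SCORES = {
    "aspirin": {"chem": 0.5, "hcpp": 0.5, "dsyn": 0.25},
    "headache": {"dsyn": 0.75, "chem": 0.25, "hcpp": 0.25},
}


def pre_process(text, unique=True):
    """Tokenize, keep legal words, and optionally drop duplicates."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: a word is "legal" if it has an entry in the score table.
    legal = [w for w in words if w in WORD_ST_SCORES]
    if unique:
        legal = list(dict.fromkeys(legal))  # remove duplicates, keep order
    return legal


def process(words):
    """Average the ST scores over all legal words of the text."""
    if not words:
        return {}
    totals = defaultdict(float)
    for word in words:
        for st, score in WORD_ST_SCORES[word].items():
            totals[st] += score
    return {st: total / len(words) for st, total in totals.items()}


def post_process(avg_scores, display_num=None):
    """Rank STs by descending score; ties are broken alphabetically."""
    ranked = sorted(avg_scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked if display_num is None else ranked[:display_num]


words = pre_process("Aspirin for headache, aspirin")
ranked = post_process(process(words))
for st, score in ranked:
    print(f"{st}\t{score}")
```

With these made-up scores, chem and hcpp end up with the same average, so the sort key `(-score, st)` places chem before hcpp, mirroring the alphabetical tie-breaking rule in the output filter.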

  • Sample commands:
    > sti -p
    => index a text from standard input with prompt
    
    > sti -i:9801.2004.TI.in -o:9801.2004.TI.out
    => index text from file, 9801.2004.TI.in, and send the results to a file, 9801.2004.TI.out
    

  • Sample Outputs:
    • a file, such as 9801.2004.TI.out
    • a file, such as 9801.2004.AB.out
    • a file, such as 9801.2004.TIAB.out