Text Categorization

STI: Text


  • Description:

    Read in the input text and perform ST indexing based on:

    • word frequency count
    • document count for word

  • Inputs:
    • a phrase, such as title or abstract
    • a file, such as 9801.2004.TI.in
    • a file, such as 9801.2004.AB.in
    • a file, such as 9801.2004.TIAB.in

  • Algorithm:
    • Pre-Process (Input Filter):
      • Tokenize all words of the input term
      • Apply Word Extraction Filter
      • Apply acronym filter (TBD)
      • Filter out words that are not legal
      • Filter out duplicate words if the unique flag is true
      • Assign the final words for processing
    • Process:
      • Get the ST scores for each (legal) word in the text from the DB table WORD_ST_SCORES
      • Calculate the average ST scores for the text
    • Post-process (Output Filter):
      • Print out Input text (term)
      • Output filter details
      • Scores Entries display number
      • No output message
      • Cluster option
      • ST candidates
      • Use alphabetical order for STs that have the same score (such as chem and hcpp)
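
    The three algorithm stages above can be sketched roughly as follows.
    This is an illustration only, not the tool's actual code. The in-memory
    dictionary stands in for the WORD_ST_SCORES table, the score values and
    the "aapp" entry are invented for the example, and treating a word as
    "legal" when it has an entry in that table is an assumption. The
    function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WORD_ST_SCORES table: word -> {ST: score}.
# All values are invented for illustration.
WORD_ST_SCORES = {
    "protein": {"chem": 0.5, "hcpp": 0.5, "aapp": 1.0},
    "binding": {"chem": 0.25, "hcpp": 0.25, "aapp": 0.5},
}


def pre_process(text, unique=True):
    """Tokenize the text, keep legal words, optionally drop duplicates."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    # Assumption: a word is "legal" if it has an entry in the score table.
    legal = [w for w in words if w in WORD_ST_SCORES]
    if unique:
        legal = list(dict.fromkeys(legal))  # keeps first-seen order
    return legal


def average_st_scores(words):
    """Average each ST's score over all words in the list."""
    if not words:
        return {}
    totals = defaultdict(float)
    for word in words:
        for st, score in WORD_ST_SCORES[word].items():
            totals[st] += score
    return {st: total / len(words) for st, total in totals.items()}


def rank(scores):
    """Sort by score (highest first); ties fall back to alphabetical order."""
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    words = pre_process("Protein binding of the protein")
    for st, score in rank(average_st_scores(words)):
        print(st, score)
```

    With these made-up scores, chem and hcpp tie and are therefore listed
    in alphabetical order, as the last output-filter rule describes.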

  • Sample commands:
    > sti -p
    => index a text from standard input with prompt
    
    > sti -i:9801.2004.TI.in -o:9801.2004.TI.out
    => index text from the file 9801.2004.TI.in and send the results to the file 9801.2004.TI.out
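
    The sample commands use options of the form -name and -name:value. A
    wrapper script could split such arguments as sketched below. The
    function parse_sti_args is hypothetical and only mirrors the option
    shape shown above; it is not the tool's own parser and does not know
    which options sti actually accepts.

```python
def parse_sti_args(argv):
    """Split arguments shaped like -name or -name:value into a dict."""
    options = {}
    for arg in argv:
        if not arg.startswith("-"):
            raise ValueError(f"unexpected argument: {arg}")
        # Split on the first ':' only, so the value may itself contain ':'.
        name, sep, value = arg[1:].partition(":")
        # A bare flag such as -p carries no value; record it as True.
        options[name] = value if sep else True
    return options


if __name__ == "__main__":
    print(parse_sti_args(["-i:9801.2004.TI.in", "-o:9801.2004.TI.out"]))
    print(parse_sti_args(["-p"]))
```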
    

  • Sample Outputs:
    • a file, such as 9801.2004.TI.out
    • a file, such as 9801.2004.AB.out
    • a file, such as 9801.2004.TIAB.out