Text Categorization

STI: Text

Description:
Read the input text and perform ST indexing based on:
- word frequency count
- document count for word
Inputs:
- a phrase, such as a title or an abstract
- a file, such as 9801.2004.TI.in
- a file, such as 9801.2004.AB.in
- a file, such as 9801.2004.TIAB.in
Algorithm:
- Pre-Process (Input Filter):
  - Tokenize all words of the input term
  - Apply Word Extraction Filter
  - Apply acronym filter (TBD)
  - Filter out words that are not legal
  - Filter out duplicate words if the unique flag is true
  - Assign the final words for processing
- Process:
  - Get ST scores for each (legal) word in the text from DB: WORD_ST_SCORES table
  - Calculate the average ST scores for the text
- Post-process (Output Filter):
  - Print out Input text (term)
  - Output filter details
  - Number of score entries to display
  - No output message
  - Cluster option
  - ST candidates
  - Use alphabetical order for STs that have the same score (such as chem and hcpp)
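
The three stages above can be sketched in Python as follows. This is only a rough illustration under stated assumptions, not the tool's actual implementation: the WORD_ST_SCORES table is modeled as an in-memory dictionary with made-up scores, a word is treated as "legal" when it has an entry in that table, "othr" is a made-up ST abbreviation, and the function names (pre_process, average_st_scores, rank_sts, index_text) are hypothetical. The word extraction and acronym filters are not modeled.

```python
import re
from collections import defaultdict

# Hypothetical stand-in for the WORD_ST_SCORES table: word -> {ST: score}.
# All scores are invented for illustration; "othr" is a made-up ST name.
WORD_ST_SCORES = {
    "protein": {"chem": 0.5, "hcpp": 0.5, "othr": 0.25},
    "binding": {"chem": 0.25, "hcpp": 0.25, "othr": 0.75},
}


def pre_process(text, unique=False):
    """Tokenize the text, keep legal words, optionally drop duplicates."""
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9]+", text)]
    # Assumption: a word is "legal" if it has an entry in the score table.
    words = [w for w in words if w in WORD_ST_SCORES]
    if unique:
        # dict.fromkeys removes duplicates while preserving first-seen order.
        words = list(dict.fromkeys(words))
    return words


def average_st_scores(words):
    """Average each ST's score over all legal words in the text."""
    if not words:
        return {}
    totals = defaultdict(float)
    for word in words:
        for st, score in WORD_ST_SCORES[word].items():
            totals[st] += score
    # Assumption: an ST missing for a word contributes 0 to the average.
    return {st: total / len(words) for st, total in totals.items()}


def rank_sts(scores, limit=None):
    """Sort by descending score; ties are broken alphabetically by ST."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked if limit is None else ranked[:limit]


def index_text(text, unique=False, limit=None):
    """Run pre-process, process, and post-process ordering on one text."""
    return rank_sts(average_st_scores(pre_process(text, unique)), limit)
```

With the unique flag set, index_text("Protein binding of the protein", unique=True) keeps the words "protein" and "binding" once each, and chem and hcpp tie on score, so chem is listed before hcpp. A text with no legal words yields an empty list, which corresponds to the case where a "no output" message would be shown.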

Sample commands:

> sti -p
=> index a text from standard input with prompt

> sti -i:9801.2004.TI.in -o:9801.2004.TI.out
=> index text from file, 9801.2004.TI.in, and send the results to a file, 9801.2004.TI.out

Sample Outputs:
- a file, such as 9801.2004.TI.out
- a file, such as 9801.2004.AB.out
- a file, such as 9801.2004.TIAB.out