A Comparison of 13 Tokenizers on MEDLINE

He Y, Kayaalp M
December 2006 Technical Report.

This report describes a study of how 13 freely available software packages tokenize MEDLINE abstracts. The literature contains few if any comparative evaluations of general-purpose tokenizers, nor is there any such study of tokenizers specific to biomedical text. Biomedical text processing in general, and tokenization in particular, is challenging because biomedical text contains a wide variety of domain-specific terms. This study explores various scenarios taken from actual MEDLINE abstracts and critically evaluates the observed performance of the tested tokenizers. The results show wide variance among the outputs of these tokenizers, and choosing the right tokenizer requires detailed information, which this report aims to compile. The report is intended for readers who are interested in using a particular tokenizer and want to know what behavior to expect from general-purpose and biomedical tokenizers. It is meant to help the reader choose the right tokenizer and/or devise algorithms that can use the resulting tokens effectively with minimal loss of information. The reader will find a list of factors that need to be taken into account in such a decision. The report also discusses the pros and cons of the tested tokenizers.

