Hybrid Ensemble-Rule Algorithm for Improved MEDLINE® Sentence Boundary Detection.


Le DX, Mork JG, Antani S

AMIA Annual Symposium Proceedings 2021;2021:677-686.

Abstract:

Sentence boundary detection (SBD) is a fundamental building block of the Natural Language Processing (NLP) pipeline. Incorrect SBD can degrade every subsequent processing stage and reduce overall performance. In well-behaved corpora, a few simple rules based on punctuation and capitalization are sufficient to detect sentence boundaries. A corpus such as MEDLINE citations, however, is harder because of several syntactic ambiguities, e.g., periods that belong to abbreviations and the capitalization of sentence-initial words. In this manuscript we present an algorithm that addresses these challenges by majority voting among three SBD engines (Python NLTK, pySBD, and Syntok), followed by custom post-processing algorithms that rely on spaCy part-of-speech tagging, abbreviation and capital-letter detection, and general sentence statistics. Experiments on several thousand MEDLINE citations show that the proposed combination of multiple SBD engines and post-processing rules performs better than each individual engine.
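
The majority-voting step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the three splitters below are hypothetical regex stand-ins for NLTK, pySBD, and Syntok, the abbreviation list and example text are invented, and the spaCy-based post-processing rules are omitted. The sketch assumes each engine returns sentences that are verbatim substrings of the input, so every sentence can be mapped back to a character offset and a boundary is kept when at least two of the three engines propose it.

```python
import re
from collections import Counter
from typing import Callable, List, Set

Splitter = Callable[[str], List[str]]

# Hypothetical abbreviation list, for illustration only.
ABBREVS = {"vs.", "fig.", "al.", "dr.", "e.g."}


def naive_split(text: str) -> List[str]:
    """Stand-in engine 1: split after every ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text)


def capital_split(text: str) -> List[str]:
    """Stand-in engine 2: split only when the next word starts with a capital."""
    return re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)


def abbrev_split(text: str) -> List[str]:
    """Stand-in engine 3: naive split, then re-join pieces after an abbreviation.

    Re-joining with a single space assumes the input uses single spaces.
    """
    merged: List[str] = []
    for piece in naive_split(text):
        if merged and merged[-1].split()[-1].lower() in ABBREVS:
            merged[-1] += " " + piece
        else:
            merged.append(piece)
    return merged


def boundaries(text: str, sentences: List[str]) -> Set[int]:
    """Map an engine's sentences to the character offsets where they end."""
    ends: Set[int] = set()
    pos = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        start = text.find(sentence, pos)
        if start < 0:
            continue  # engine altered the text; this boundary cannot be aligned
        pos = start + len(sentence)
        ends.add(pos)
    return ends


def majority_vote(text: str, engines: List[Splitter], min_votes: int = 2) -> List[str]:
    """Keep only the boundaries proposed by at least `min_votes` engines."""
    votes: Counter = Counter()
    for engine in engines:
        votes.update(boundaries(text, engine(text)))
    kept = sorted(offset for offset, n in votes.items() if n >= min_votes)
    end_of_text = len(text.rstrip())
    if not kept or kept[-1] != end_of_text:
        kept.append(end_of_text)  # the end of the text always closes a sentence
    result, start = [], 0
    for end in kept:
        chunk = text[start:end].strip()
        if chunk:
            result.append(chunk)
        start = end
    return result


ENGINES: List[Splitter] = [naive_split, capital_split, abbrev_split]
EXAMPLE = (
    "The dose was 5 mg vs. placebo. p53 levels rose in Fig. 2 of "
    "Smith et al. and later fell. Results were stable."
)

for sentence in majority_vote(EXAMPLE, ENGINES):
    print(sentence)
```

On this invented example the naive splitter over-splits at abbreviation periods and the capital-aware splitter misses the sentence that starts with a lowercase gene name, yet each of those errors receives only one vote and is discarded. Errors shared by two engines would survive the vote, which is one reason a rule-based post-processing stage is a natural complement.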


