Analysis of data that has been extracted from free-text using natural language processing: a likelihood model for misclassification with an application to medical informatics.
Many databases have free-text fields, such as clinicians' notes in medical databases, that pose a challenge to analysis. To analyze the information in free-text fields, usable data must first be abstracted, either through manual review (a potentially time-consuming process) or via an automated (or semi-automated) method such as natural language processing (NLP). NLP uses linguistic and computational algorithms to extract usable information from free-text. However, NLP-derived variables are measured with error: for example, state-of-the-art NLP methods typically achieve sensitivity and specificity of 80-95% for binary variables. Naively treating predictors measured with error leads to familiar problems, such as biased estimates. Combining measurement error methods with NLP-derived predictors may provide a way to analyze free-text data, making analyzable large quantities of data that would be impractical to review manually. In this study, we focus on misclassification. We propose a likelihood model for misclassification in which a binary NLP-derived predictor is used to estimate the risk of a binary outcome. The model incorporates information from a manually validated subset of the data. We describe the likelihood model and investigate its performance with simulations. We show that our misclassification method performs better than either the naive treatment of the NLP-derived predictor or analysis of the manual validation data alone. Finally, we illustrate the method using an NLP-derived indicator of smoking status, taken from clinicians' discharge notes, to predict the risk of smoking-related cancer.
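The abstract does not give the likelihood explicitly, but the setup it describes can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: a latent binary predictor X, an NLP-derived surrogate W with unknown sensitivity and specificity, and a binary outcome Y. The main data contribute P(Y, W) = Σₓ P(Y | X=x) P(W | X=x) P(X=x) (assuming nondifferential misclassification, i.e., W ⟂ Y given X), while the validation subset, where X is observed, contributes the fully factorized likelihood. All parameter values and sample sizes below are made up for the simulation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

rng = np.random.default_rng(0)

# Hypothetical "true" values used only to simulate data.
beta0, beta1 = -1.0, 1.0            # outcome model: P(Y=1|X) = expit(b0 + b1*X)
sens, spec, prev = 0.9, 0.85, 0.3   # misclassification rates and P(X=1)

def simulate(n):
    x = rng.binomial(1, prev, n)                   # true (latent) predictor
    w = np.where(x == 1,
                 rng.binomial(1, sens, n),         # P(W=1 | X=1) = sensitivity
                 rng.binomial(1, 1 - spec, n))     # P(W=1 | X=0) = 1 - specificity
    y = rng.binomial(1, expit(beta0 + beta1 * x))  # binary outcome
    return x, w, y

x_m, w_m, y_m = simulate(5000)   # main data: only (W, Y) used
x_v, w_v, y_v = simulate(500)    # validation subset: (X, W, Y) all observed

def neg_loglik(theta):
    b0, b1, se, sp, pi = theta
    # Validation contribution: X is observed, so the likelihood factorizes.
    p_y = expit(b0 + b1 * x_v)
    p_w = np.where(x_v == 1, se, 1 - sp)
    p_x = np.where(x_v == 1, pi, 1 - pi)
    ll = np.sum(np.log(np.where(y_v == 1, p_y, 1 - p_y)))
    ll += np.sum(np.log(np.where(w_v == 1, p_w, 1 - p_w)))
    ll += np.sum(np.log(p_x))
    # Main contribution: X is latent, so sum over x in {0, 1}.
    total = np.zeros(len(y_m))
    for xval, p_w1, p_x1 in [(0, 1 - sp, 1 - pi), (1, se, pi)]:
        p_y = expit(b0 + b1 * xval)
        total += (np.where(y_m == 1, p_y, 1 - p_y)
                  * np.where(w_m == 1, p_w1, 1 - p_w1) * p_x1)
    ll += np.sum(np.log(total))
    return -ll

res = minimize(neg_loglik, x0=[0.0, 0.0, 0.8, 0.8, 0.5],
               bounds=[(None, None), (None, None),
                       (0.55, 0.999), (0.55, 0.999), (0.01, 0.99)])
print("estimated beta1:", res.x[1])
```

Because the validation subset identifies the sensitivity, specificity, and prevalence parameters, the combined likelihood lets the much larger main sample contribute to estimating the risk coefficient beta1, which is the intuition behind the abstract's claim that the method outperforms both the naive analysis and the validation-only analysis.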