NLP-derived information improves the estimates of risk of disease compared to estimates based on manually extracted data alone.

Callaghan F, Jackson MT, Demner-Fushman D, Abhyankar S, McDonald CJ
5th International Symposium on Semantic Mining in Biomedicine (SMBM 2012), 2012 Sept 3-4, Zurich, Switzerland.
Abstract: 
Natural language processing (NLP) enables researchers to extract large quantities of information from free text that otherwise could only be extracted manually. This information can then be used to answer clinical research questions via statistical analysis. However, NLP extracts information with some degree of error (the sensitivity and specificity of state-of-the-art NLP methods are typically 80-90%), and most statistical methods assume that the information has been observed “without measurement error”. As we show in this paper, if an NLP-derived smoking status predictor is used, for example, to estimate the risk of smoking-related cancer without any adjustment for measurement error, the estimate is biased. Conversely, if a smaller subset of manually extracted data is used alone, then the estimate is unbiased but imprecise, and the corresponding inference methods tend to have low power to detect significant relationships. We propose using a statistical measurement error method, a maximum likelihood (ML) method, that combines information from NLP with manually validated data to produce unbiased estimates that also have good power to detect a significant signal. This method has the potential to open up large free-text databases to statistical analysis for clinical research. With a case study using smoking status to predict smoking-related cancer, and with simulations, we demonstrate that the ML method performs better under a variety of scenarios than using either NLP or manually extracted data alone.
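The bias and its correction can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a binary true smoking status X, a binary NLP label W, a binary cancer outcome Y, and misclassification that does not depend on Y. All numbers (prevalences, 85% sensitivity and specificity, sample sizes, a 10% validated fraction) are hypothetical, and an EM algorithm stands in as one common way to obtain maximum likelihood estimates from a mix of validated and NLP-only records. Expected counts are used instead of random draws so the result is deterministic.

```python
def odds_ratio(p_cases, p_controls):
    """Odds ratio of exposure for cases versus controls."""
    return (p_cases / (1 - p_cases)) / (p_controls / (1 - p_controls))


def p_label(w, x, se, sp):
    """P(W = w | X = x) for an NLP label with sensitivity se, specificity sp."""
    if x == 1:
        return se if w == 1 else 1 - se
    return 1 - sp if w == 1 else sp


def expected_counts(n_by_y, p_x, se, sp, frac_validated):
    """Noise-free counts: validated cells (x, w, y) and NLP-only cells (w, y)."""
    validated, nlp_only = {}, {}
    for y, n in n_by_y.items():
        for x in (0, 1):
            px = p_x[y] if x == 1 else 1 - p_x[y]
            for w in (0, 1):
                cell = n * px * p_label(w, x, se, sp)
                validated[(x, w, y)] = frac_validated * cell
                nlp_only[(w, y)] = nlp_only.get((w, y), 0.0) + (1 - frac_validated) * cell
    return validated, nlp_only


def em_fit(validated, nlp_only, n_iter=2000):
    """EM estimates of P(X=1 | Y=y), sensitivity, and specificity."""
    p_x = {0: 0.5, 1: 0.5}   # arbitrary starting values (assumption)
    se, sp = 0.7, 0.7
    for _ in range(n_iter):
        # E-step: split each NLP-only cell by the posterior P(X=1 | W, Y).
        full = dict(validated)
        for (w, y), n in nlp_only.items():
            like1 = p_x[y] * p_label(w, 1, se, sp)
            like0 = (1 - p_x[y]) * p_label(w, 0, se, sp)
            post1 = like1 / (like1 + like0)
            full[(1, w, y)] += n * post1
            full[(0, w, y)] += n * (1 - post1)
        # M-step: complete-data maximum likelihood updates.
        for y in (0, 1):
            n_x1 = full[(1, 0, y)] + full[(1, 1, y)]
            n_x0 = full[(0, 0, y)] + full[(0, 1, y)]
            p_x[y] = n_x1 / (n_x1 + n_x0)
        x1_total = sum(full[(1, w, y)] for w in (0, 1) for y in (0, 1))
        x0_total = sum(full[(0, w, y)] for w in (0, 1) for y in (0, 1))
        se = sum(full[(1, 1, y)] for y in (0, 1)) / x1_total
        sp = sum(full[(0, 0, y)] for y in (0, 1)) / x0_total
    return p_x, se, sp


# Hypothetical scenario: smoking prevalence 20% in controls, 50% in cases.
TRUE_P = {0: 0.2, 1: 0.5}
TRUE_SE, TRUE_SP = 0.85, 0.85
N_BY_Y = {0: 8000, 1: 2000}
validated, nlp_only = expected_counts(N_BY_Y, TRUE_P, TRUE_SE, TRUE_SP, 0.1)

# Naive analysis: treat the NLP label W as if it were the true status X.
p_w = {}
for y, n in N_BY_Y.items():
    n_w1 = nlp_only[(1, y)] + validated[(0, 1, y)] + validated[(1, 1, y)]
    p_w[y] = n_w1 / n
naive_or = odds_ratio(p_w[1], p_w[0])

# ML analysis: combine validated and NLP-only records.
p_hat, se_hat, sp_hat = em_fit(validated, nlp_only)
ml_or = odds_ratio(p_hat[1], p_hat[0])
true_or = odds_ratio(TRUE_P[1], TRUE_P[0])

print(f"true OR {true_or:.3f}, naive NLP OR {naive_or:.3f}, ML OR {ml_or:.3f}")
```

With these illustrative inputs the naive odds ratio is pulled toward 1 (about 2.45 against a true value of 4), while the combined estimate recovers the true value. Because the inputs are expected counts, the sketch shows bias only; the imprecision of the manually validated subset alone would need a simulation with random sampling.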