Automated Data Extraction for Clinical Databases using Natural Language Processing


Cardiothoracic surgery departments and clinics in the U.S. rely on central agencies, such as the Society of Thoracic Surgeons (STS), to evaluate their operational performance against that of peer institutions. Specifically, 97% of U.S. adult cardiac surgery programs collaborate with the STS National Database and transfer their patient data to it for quality improvement and risk assessment. However, because patient records consist primarily of unstructured text reports, these data transfers carry a large operational overhead and require manual data extraction by teams of experienced data managers. Building on recent breakthroughs in Natural Language Processing, we propose an end-to-end machine learning pipeline that automatically extracts all patient data (structured or unstructured) from multiple sources, over multiple patient visits, and for multiple target outcomes. Because hospitals may follow different data conventions, the pipeline can be extended to new sources or outcomes. Preliminary results on Massachusetts General Hospital data are promising: our methodology achieves an AUC of up to 98% for common diagnoses and 85-95% for more challenging ones. We believe that using an automated pipeline for data extraction has four benefits: i) reducing operational overhead and costs for institutions when transferring their data, ii) increasing data consistency and quality while reducing variation due to human error, iii) allowing institutions to transfer richer data that would otherwise require additional human effort, and iv) creating a unified framework for end-to-end medical predictions and diagnoses from patient medical reports of any type and format.

Oct 15, 2022 10:00 AM
Future of OR workshop - INFORMS Annual Meeting 2022