This project aims to predict three adverse pregnancy outcomes (preeclampsia, preterm delivery, and obstetric hemorrhage) using supervised machine learning models trained on health and demographic data.
This repository contains the Jupyter Notebooks, external resources, and final dataframes used to build machine learning models that predict adverse pregnancy outcomes from the MIMIC-IV dataset.
notebooks
- contains Jupyter Notebooks used for preprocessing, EDA, and model construction. Note that the notebooks have numbers appended to the end of their file names. The numbers represent the general order in which the notebooks should be run. While not all notebooks are dependent on the results of another, many of the notebooks do depend on the CSV output of previous notebooks.
final_dfs
- final dataframes used for model construction
resources
- external resources used for data pre-processing, filtering, and mapping
This project uses the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, a publicly available database sourced from the electronic health records (EHR) of the Beth Israel Deaconess Medical Center in Massachusetts.
The dataset covers a clinical cohort of patients who were admitted to the Emergency Department (ED) or an intensive care unit (ICU) between 2008 and 2019. All patients are older than 18, and the patient records have been de-identified to comply with HIPAA regulations. MIMIC-IV is relational and contains patient demographic data, health metrics, and mapping tables for International Classification of Diseases (ICD) codes, Diagnosis Related Groups (DRGs), and the Healthcare Common Procedure Coding System (HCPCS).
- read data into Postgres
- filter data on pregnancy diagnosis and gender
- pre-process diagnosis, procedure, and medication codes into a truncated "root" representation of the original code
- handle outliers by comparing measurements against the clinically informed outlier thresholds gathered in "MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III"
- explore adverse outcomes by race, marital status, and other demographic factors
- explore common diagnoses and prescriptions for pregnant patients
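As a rough illustration of the code truncation and outlier steps listed above, the sketch below shortens a code to a fixed-length root and marks implausible measurements as missing. The function names, the root length of 3, and the numeric bounds are all assumptions made for this example; they are not taken from the notebooks or from MIMIC-Extract.

```python
from typing import Optional


def code_root(code: str, length: int = 3) -> str:
    """Return the first `length` characters of a code, ignoring dots, spaces, and case."""
    cleaned = code.replace(".", "").strip().upper()
    return cleaned[:length]


# Placeholder bounds for illustration only; these are NOT the MIMIC-Extract values.
PLAUSIBLE_RANGES = {
    "heart_rate": (0.0, 350.0),
    "temperature_c": (25.0, 45.0),
}


def screen_value(name: str, value: float) -> Optional[float]:
    """Keep a measurement that falls inside its plausible range, else mark it missing."""
    low, high = PLAUSIBLE_RANGES[name]
    return value if low <= value <= high else None


# Dotted and undotted spellings of the same ICD-10 code share one root.
roots = [code_root("O14.10"), code_root("o1410"), code_root("64251")]
cleaned = [screen_value("heart_rate", 80.0), screen_value("heart_rate", 900.0)]
```

Truncating to a root trades detail for fewer, denser categories, which can help when many specific codes appear only a handful of times in the cohort.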
For all classifiers:
- apply SMOTE to account for class imbalance in the outcome labels
- apply stratified 5-fold cross-validation to ensure minority classes are represented in every fold
- evaluate via AUC and recall
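The evaluation recipe above can be sketched in plain Python as follows. This is a simplified stand-in, not the notebooks' code: `smote_like` is a hand-rolled approximation of SMOTE, `centroid_scorer` is a toy model used in place of the real classifiers, and the data is synthetic. Oversampling only the training portion of each fold is an assumption of this sketch. In practice, imbalanced-learn's `SMOTE` and scikit-learn's `StratifiedKFold`, `roc_auc_score`, and `recall_score` would be the usual choices.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds


def smote_like(X, y, minority=1, neighbors=3, seed=0):
    """Balance a binary dataset by interpolating between minority-class neighbours."""
    rng = random.Random(seed)
    minority_rows = [row for row, label in zip(X, y) if label == minority]
    needed = len(y) - 2 * len(minority_rows)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    if len(minority_rows) < 2:
        return X_out, y_out  # nothing to interpolate between
    for _ in range(max(needed, 0)):
        base = rng.choice(minority_rows)
        others = sorted(
            (row for row in minority_rows if row is not base),
            key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
        )
        partner = rng.choice(others[:neighbors])
        gap = rng.random()  # position along the segment from base to partner
        X_out.append([a + gap * (b - a) for a, b in zip(base, partner)])
        y_out.append(minority)
    return X_out, y_out


def auc(y_true, scores, positive=1):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        return float("nan")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def recall(y_true, y_pred, positive=1):
    """Share of true positives that the predictions recover."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if tp + fn else 0.0


def centroid_scorer(X, y):
    """Toy model: higher score means closer to the class-1 centroid than to class 0."""
    def centroid(label):
        rows = [row for row, lab in zip(X, y) if lab == label]
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist(row, centre):
        return sum((a - b) ** 2 for a, b in zip(row, centre)) ** 0.5

    c0, c1 = centroid(0), centroid(1)
    return lambda row: dist(row, c0) - dist(row, c1)


def cross_validate(X, y, k=5):
    """Return one (AUC, recall) pair per held-out fold."""
    results = []
    for test_idx in stratified_folds(y, k=k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(y)) if i not in held_out]
        train_y = [y[i] for i in range(len(y)) if i not in held_out]
        # Oversample the training rows only, so no synthetic rows are evaluated.
        train_X, train_y = smote_like(train_X, train_y)
        score = centroid_scorer(train_X, train_y)
        test_y = [y[i] for i in test_idx]
        scores = [score(X[i]) for i in test_idx]
        preds = [1 if s > 0 else 0 for s in scores]
        results.append((auc(test_y, scores), recall(test_y, preds)))
    return results


# Synthetic, imbalanced demo data: 90 negatives and 10 positives.
demo_rng = random.Random(42)
X = [[demo_rng.gauss(0, 1), demo_rng.gauss(0, 1)] for _ in range(90)]
X += [[demo_rng.gauss(3, 1), demo_rng.gauss(3, 1)] for _ in range(10)]
y = [0] * 90 + [1] * 10
fold_metrics = cross_validate(X, y)
```

Keeping synthetic rows out of the held-out fold means the reported AUC and recall are computed on real samples only.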
Constructed models:
- AdaBoost
- Random Forest
- Long Short-Term Memory Network (LSTM)
- Binary LSTM model (predicting adverse outcome vs. no adverse outcome) achieved 88.5% AUC and 94% recall.
- Multi-label LSTM model (predicting one output label per adverse outcome) achieved 77% AUC and 92% recall.
- Binary AdaBoost model achieved 86% recall and 88% precision.
- The models tend to over-predict adverse outcomes (false positives, Type I error) rather than miss them (false negatives, Type II error).
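To make the Type I versus Type II distinction concrete, the sketch below counts both error types from a set of predictions and derives precision and recall from them. The helper name and the toy labels are assumptions for illustration; they are not project results.

```python
def error_breakdown(y_true, y_pred, positive=1):
    """Count Type I (false positive) and Type II (false negative) errors."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    return {
        "type_i_false_positives": fp,
        "type_ii_false_negatives": fn,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Toy labels (not project data): every true case is caught, at the cost of two false alarms.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 0, 0, 0]
report = error_breakdown(y_true, y_pred)
```

A model that leans toward false positives shows high recall with lower precision, which is the pattern this kind of breakdown makes visible.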
- PhysioNet page for MIMIC-IV v3.0
- Nature article on MIMIC-IV
- ICD-10 Diagnosis Repository
- ICD-9 Diagnosis Repository
- NDC for Medications
- ICD-10 Procedure Code Repository
- ICD-9-PCS to ICD-10-PCS Mapping Overview
- "MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III"
- "An Extensive Data Processing Pipeline for MIMIC-IV"
- "Patient Subtyping via Time-Aware LSTM Networks"