Predicting Adverse Pregnancy Outcomes with Machine Learning

Project Overview

This project aims to predict three adverse pregnancy outcomes (preeclampsia, preterm delivery, and obstetric hemorrhage) using supervised machine learning models trained on health and demographic data.

This repository contains the Jupyter Notebooks, external resources, and final dataframes used to build machine learning models that predict adverse pregnancy outcomes from the MIMIC-IV dataset.

Directory Structure

notebooks - contains the Jupyter Notebooks used for preprocessing, EDA, and model construction. Each notebook's file name ends with a number that indicates the general order in which the notebooks should be run. Not every notebook depends on another, but many depend on the CSV output of earlier notebooks.
final_dfs - final dataframes used for model construction
resources - external resources used for data pre-processing, filtering, and mapping

Dataset and Sources

This project uses the Medical Information Mart for Intensive Care (MIMIC)-IV dataset, a publicly available database sourced from the electronic health record (EHR) of the Beth Israel Deaconess Medical Center in Massachusetts.

The dataset covers a clinical cohort of patients who were admitted to the Emergency Department (ED) or an intensive care unit (ICU) between 2008 and 2019. All patients are older than 18 years of age, and the patient records have been de-identified to comply with HIPAA regulations. MIMIC-IV has a relational structure and contains patient demographic data, health metrics, and mapping tables for the International Classification of Diseases (ICD) codes, Diagnosis Related Groups (DRGs), and the Healthcare Common Procedure Coding System (HCPCS).

Methodology

Data Extraction and Cleaning

read data into Postgres
filter data on pregnancy diagnosis and gender
pre-process diagnosis, procedure, and medication codes - results in a truncated "root" representation of the original code
handle outliers by measuring against clinically informed outlier values gathered in "MIMIC-Extract: A Data Extraction, Preprocessing, and Representation Pipeline for MIMIC-III"
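
The code truncation and outlier steps above can be sketched roughly as follows. This is a minimal illustration and not the notebooks' actual code: the function names, the three-character root length, the choice to discard (rather than clip) out-of-range values, and the numeric limits are all assumptions made for the example. The real limits would come from the MIMIC-Extract resource.

```python
def code_root(code, root_len=3):
    """Return the leading characters of a code, ignoring dots and case.

    root_len=3 is an assumed truncation length, not taken from the project.
    """
    return code.replace(".", "").strip().upper()[:root_len]


# Placeholder (low, high) limits for illustration only; substitute the
# clinically informed values used by the project.
OUTLIER_BOUNDS = {
    "heart_rate": (0.0, 350.0),
    "systolic_bp": (0.0, 375.0),
}


def drop_outlier(name, value, bounds=OUTLIER_BOUNDS):
    """Return None when a measurement falls outside its limits.

    Measurements without configured limits pass through unchanged.
    """
    if value is None or name not in bounds:
        return value
    low, high = bounds[name]
    return value if low <= value <= high else None


root = code_root("O14.10")                    # truncated "root" code
kept = drop_outlier("heart_rate", 82.0)       # inside the limits
removed = drop_outlier("heart_rate", 900.0)   # outside the limits
```

Truncating to a root groups closely related codes into one feature, which reduces sparsity at the cost of some clinical detail.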

Exploratory Data Analysis

explore adverse outcomes by race, marital status, and other demographic factors
explore common diagnoses and prescriptions for pregnant patients

Model Selection and Training

For all classifiers:

apply SMOTE to account for class imbalance
apply stratified 5-fold cross-validation to ensure minority classes are represented
evaluate via AUC and recall
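
The three steps above can be sketched with only the Python standard library to show the mechanics. Libraries such as imbalanced-learn (`SMOTE`) and scikit-learn (`StratifiedKFold`, `roc_auc_score`) provide full implementations; the helper names, toy data, and the choice to oversample only the training portion of each fold are assumptions of this sketch, not a description of the notebooks.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split row indices into k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)  # deal each class out round-robin
    return folds


def smote(minority, n_new, k=5, seed=0):
    """Interpolate between minority rows and their nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [row for j, row in enumerate(minority) if j != i]
        others.sort(key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): 20 negatives, 5 positives, two features each.
X = [[float(i), float(i % 3)] for i in range(25)]
y = [0] * 20 + [1] * 5

folds = stratified_folds(y, k=5)
for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(y)) if i not in held_out]
    minority = [X[i] for i in train_idx if y[i] == 1]
    n_extra = sum(1 for i in train_idx if y[i] == 0) - len(minority)
    # Oversample the training portion only, so the test fold stays untouched.
    X_train = [X[i] for i in train_idx] + smote(minority, n_extra)
    y_train = [y[i] for i in train_idx] + [1] * n_extra
    # A classifier would be fitted on (X_train, y_train) and scored on test_idx.
```

Oversampling inside each training fold, rather than before splitting, keeps synthetic copies of a patient's record out of the fold used for evaluation.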

Constructed models:

AdaBoost
Random Forest
Long Short-Term Memory Network (LSTM)

Results

Binary LSTM model (predicting adverse outcome vs. no adverse outcome) achieved 88.5% AUC and 94% recall.
Multi-label LSTM model (predicting a separate output label for each adverse outcome) achieved 77% AUC and 92% recall.
Binary AdaBoost model achieved 86% recall and 88% precision.
The models tend to err toward over-predicting adverse outcomes (false positives, Type I error) rather than missing a diagnosis (false negatives, Type II error).
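
To make the relationship between recall, precision, and the two error types concrete, the sketch below counts them on a small hand-made example. The labels and predictions are invented for illustration and are not project results; the function name is hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count true positives, false positives, false negatives, true negatives."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1  # Type I error: adverse outcome predicted, none occurred
        elif truth == 1 and pred == 0:
            fn += 1  # Type II error: adverse outcome missed
        else:
            tn += 1
    return tp, fp, fn, tn


# Invented labels: 1 = adverse outcome, 0 = no adverse outcome.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
recall = tp / (tp + fn)      # share of true adverse outcomes that were caught
precision = tp / (tp + fp)   # share of predicted adverse outcomes that were real
```

In this toy case there are more false positives than false negatives, the same direction of error described above: recall stays high while precision absorbs the cost.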
