I implemented an end-to-end machine learning model using decision trees and random forests to predict heart disease from a variety of environmental and biological factors. In this project I dug into the hyperparameter tuning of each model to better understand how it behaves. One major difficulty in building the model was that the dataset was extremely imbalanced.
- Which factors contribute most to an individual being at risk for coronary heart disease (CHD)?
- How can an imbalanced dataset be mitigated?
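
Both questions can be illustrated with a short sketch. This is not the project's actual code or data: it assumes a synthetic dataset with roughly 10% positive cases, hypothetical feature names (`feature_0`, `feature_1`, ...), and class weighting as the mitigation, which is only one of several options (resampling is another). It ranks features by the random forest's impurity-based importances and reports recall on the minority class.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the CHD data (assumption): about 10% positive cases.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names
X = pd.DataFrame(X, columns=feature_names)

# Stratify so the train and test sets keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" weights each class inversely to its frequency,
# so errors on the rare positive class cost more during training.
model = RandomForestClassifier(
    n_estimators=200, class_weight="balanced", random_state=0
)
model.fit(X_train, y_train)

# With imbalanced data, minority-class recall says more than plain accuracy.
minority_recall = recall_score(y_test, model.predict(X_test))

# Impurity-based importances, sorted from most to least influential.
importances = pd.Series(
    model.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances)
print(f"minority-class recall: {minority_recall:.2f}")
```

Impurity-based importances can favor high-cardinality features, so it may be worth cross-checking the ranking with `sklearn.inspection.permutation_importance` on the held-out split.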
Decision Tree Classifier, Random Forest Classifier, Imbalanced Data
- All analysis and visualization were done in Python using pandas, NumPy, scikit-learn, seaborn, and matplotlib.
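
The hyperparameter tuning mentioned in the summary could look roughly like the sketch below. The grid values, the F1 scoring choice, and the synthetic data are all assumptions for illustration, not the settings used in the project. It runs a stratified, cross-validated grid search over two decision tree hyperparameters.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the real dataset (assumption).
X, y = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)

# Illustrative grid: tree depth and minimum leaf size both limit overfitting.
param_grid = {
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 20],
}

# Stratified folds keep the class ratio stable across splits.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Score with F1 on the positive class rather than accuracy, since accuracy
# is dominated by the majority class on imbalanced data.
search = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid,
    scoring="f1",
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(f"best cross-validated F1: {search.best_score_:.3f}")
```

The same pattern applies to `RandomForestClassifier` by swapping the estimator and adding parameters such as `n_estimators` or `max_features` to the grid.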