This repository contains the code used for my Women in Data Science conference presentation. You can find the slides for the presentation here. I compared the performance of several random forest packages and showed how to generate confidence intervals for such models.
I use the California Housing dataset to estimate the models. You can download the dataset here. A modified version of the dataset is available on Kaggle.
I fitted four models from different R packages: linear regression from base R, random forests from ranger and grf, and extreme gradient boosting from xgboost. I performed a minimal amount of hyperparameter tuning to improve the performance of xgboost.
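As a rough illustration of this kind of comparison, the sketch below fits a linear regression and two random forests and reports their test RMSE. It is not the code from this repository: the data are synthetic, the names `x1`, `x2`, `x3`, `rmse` and `results` are made up for the example, and xgboost is left out to keep it short. The forests are only fitted if ranger and grf are installed.

```r
# Minimal sketch on synthetic data (an assumption; not the California Housing data).
set.seed(42)
n <- 500
features <- c("x1", "x2", "x3")
dat <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
dat$y <- 10 * sin(pi * dat$x1 * dat$x2) + 5 * dat$x3 + rnorm(n)

# Hold out 20% of the rows for testing.
train_idx <- sample(n, size = 0.8 * n)
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Root mean squared error on the held-out rows.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

results <- numeric(0)

# Linear regression from base R.
fit_lm <- lm(y ~ ., data = train)
results["lm"] <- rmse(test$y, predict(fit_lm, newdata = test))

# Random forest from ranger (formula interface).
if (requireNamespace("ranger", quietly = TRUE)) {
  fit_ranger <- ranger::ranger(y ~ ., data = train, num.trees = 500)
  results["ranger"] <- rmse(test$y, predict(fit_ranger, data = test)$predictions)
}

# Random forest from grf (matrix interface).
if (requireNamespace("grf", quietly = TRUE)) {
  fit_grf <- grf::regression_forest(as.matrix(train[, features]), train$y,
                                    num.trees = 500)
  pred_grf <- predict(fit_grf, as.matrix(test[, features]))$predictions
  results["grf"] <- rmse(test$y, pred_grf)
}

print(round(results, 3))
```

Note the different interfaces: ranger accepts a formula and a data frame, while grf expects a numeric feature matrix and a response vector.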
The repository is organized as follows:
- Load and preprocess data here.
- Visualize the data using point plots, maps and decision trees. I used the original dataset for visualizations and the preprocessed dataset to estimate the models.
- Estimate the models. That is, fit the models, make predictions and compute confidence intervals for the predictions. Choose the optimal number of trees for xgboost by cross-validation. Also, plot the most important features chosen by the models. Estimate variance and confidence intervals using grf.
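The confidence interval step in the last item can be sketched as follows. This is a minimal example under stated assumptions, not the repository's script: the data are synthetic, `normal_ci` is a hypothetical helper, and the interval uses a normal approximation around the grf point prediction with the variance returned by `predict(..., estimate.variance = TRUE)`. The grf part runs only if the package is installed.

```r
# Hypothetical helper: normal-approximation interval from a point
# prediction and its estimated variance.
normal_ci <- function(pred, variance, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)  # about 1.96 for a 95% interval
  se <- sqrt(variance)
  data.frame(prediction = pred, lower = pred - z * se, upper = pred + z * se)
}

if (requireNamespace("grf", quietly = TRUE)) {
  # Synthetic data (an assumption; not the California Housing data).
  set.seed(1)
  n <- 400
  X <- matrix(runif(n * 3), ncol = 3)
  y <- 2 * X[, 1] + X[, 2]^2 + rnorm(n, sd = 0.5)

  # More trees give more stable variance estimates.
  forest <- grf::regression_forest(X, y, num.trees = 2000)

  # Predict at new points and request variance estimates.
  X_new <- matrix(runif(5 * 3), ncol = 3)
  pred <- predict(forest, X_new, estimate.variance = TRUE)

  print(normal_ci(pred$predictions, pred$variance.estimates))
}
```

These intervals describe uncertainty about the estimated conditional mean, not about individual observations, so they are narrower than prediction intervals would be.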
A snapshot of the names and versions of the packages I used is available here.