From 75bfaf84d0a4eb09c634be27617cb61368a871f1 Mon Sep 17 00:00:00 2001
From: alvinthai <alvinthai@gmail.com>
Date: Wed, 27 Dec 2017 00:25:20 -0800
Subject: [PATCH] made notebook a tutorial

---
 examples/example.ipynb | 244 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 237 insertions(+), 7 deletions(-)
diff --git a/examples/example.ipynb b/examples/example.ipynb
index 8a95c96..615b9f0 100644
--- a/examples/example.ipynb
+++ b/examples/example.ipynb
@@ -1,12 +1,5 @@
 {
  "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Import Libraries and Data"
-   ]
-  },
   {
    "cell_type": "code",
    "execution_count": 1,
@@ -19,6 +12,38 @@
     "%cd .."
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Table of Contents"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "- [Modeling](#Modeling)\n",
+    "- [Plot Feature Importance](#Plot-Feature-Importance)\n",
+    "- [Plot Partial Dependence](#Plot-Partial-Dependence)\n",
+    "- [Plot Metric Dependencies vs. Thresholds](#Plot-Metric-Dependencies-vs.-Thresholds)\n",
+    "- [Test and Attach Models](#Test-and-Attach-Models)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Import Libraries and Data"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 2,
@@ -37,6 +62,18 @@
     "from sklearn.tree import DecisionTreeClassifier"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For this tutorial, we will be using the [Covertype dataset](http://archive.ics.uci.edu/ml/datasets/Covertype).\n",
+    "\n",
+    "**Dataset Characteristics**  \n",
+    "- Target variable: <font color='green'>Cover_Type</font> *(7 classes)*  \n",
+    "- Attributes: <font color='green'>54</font> *(10 quantitative, 44 binary)*  \n",
+    "- Rows: <font color='green'>581012</font>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 3,
@@ -48,6 +85,13 @@
     "train_df, test_df = make_forest_cover_data()"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -55,6 +99,21 @@
     "### Modeling"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This is an example of how to initialize the OrderedOVRClassifier model.  \n",
+    "\n",
+    "Let's say we are interested in training separate binary classifiers for <font color='green'>Cover_Type</font> values 2, 6, and 5. The training order is specified with the **ovr_vals** parameter, and the machine learning models used to train OrderedOVRClassifier are defined in **model_dict**.  \n",
+    "\n",
+    "The parameters passed to the OrderedOVRClassifier initializer specify the following steps:\n",
+    "- **Step 1**. Train class *2* vs classes *(1,3,4,5,6,7)* with a RandomForestClassifier.\n",
+    "- **Step 2**. Train class *6* vs classes *(1,3,4,5,7)* with a DecisionTreeClassifier.\n",
+    "- **Step 3**. Train class *5* vs classes *(1,3,4,7)* with an ExtraTreesClassifier.\n",
+    "- **Step 4**. Train a multiclass model on the remaining classes with an LGBMClassifier and pass verbose=10 to the fit function."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 4,
@@ -76,6 +135,19 @@
     "                            model_fit_params=model_fit_params)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since `'Cover_Type'` was defined as the target variable when initializing OrderedOVRClassifier, we do not need to pass the **y** parameter into the fit function.\n",
+    "\n",
+    "We can optionally pass a test set to the fit operation to evaluate our trained models (if none is specified, the fit operation reports evaluation results on the training data). For this tutorial, test_df is passed into the **eval_set** parameter. If using LightGBM or XGBoost, the test set will also be used to trigger early stopping.  \n",
+    "\n",
+    "The threshold for each binary one-vs-rest classifier is picked automatically based on the best weighted F1 score. The positive-classification threshold for each binary classifier can be manually overridden with the self.thresholds property of OrderedOVRClassifier.\n",
+    "\n",
+    "<a id='Cell5'></a>"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 5,
@@ -341,6 +413,13 @@
     "oovr.fit(train_df, eval_set=test_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -348,6 +427,15 @@
     "### Plot Feature Importance"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now that we have a model trained to classify all the data points, let's evaluate which features are impacting our predictions the most.  \n",
+    "\n",
+    "This is an example of how to plot model-agnostic feature importance calculations using OrderedOVRClassifier. Refer to the [API](https://alvinthai.github.io/OrderedOVRClassifier/api_reference.html#plotting-api) for a description of what's going on under the hood."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 6,
@@ -384,6 +472,13 @@
     "oovr.plot_feature_importance(train_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -391,6 +486,19 @@
     "### Plot Partial Dependence"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "`Elevation` is clearly the most important feature among the 54 attributes in the Covertype dataset.  \n",
+    "\n",
+    "We may be interested in seeing how the multiclass predictions vary as `Elevation` changes.  \n",
+    "\n",
+    "Below is an example of how to plot model-agnostic partial dependence calculations (with respect to a single variable) using OrderedOVRClassifier. Refer to the [API](https://alvinthai.github.io/OrderedOVRClassifier/api_reference.html#OrderedOVRClassifier.plot_partial_dependence) for a description of what's going on under the hood.  \n",
+    "\n",
+    "Class *1* is the most probable classification for `Elevation > 3000`, Class *2* is the most probable classification for `Elevation > 2400 and Elevation < 3000`, Classes *3* and *6* are more prevalent at lower `Elevation`, and Class *7* is more prevalent at higher `Elevation`."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 7,
@@ -487,6 +595,13 @@
     "oovr.plot_partial_dependence(train_df, 'Elevation')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -494,6 +609,15 @@
     "### Plot Metric Dependencies vs. Thresholds"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We may be interested in sacrificing accuracy or recall for one class to improve accuracy or recall in another. With OrderedOVRClassifier, we can adjust our binary classification thresholds to our desired preferences!  \n",
+    "\n",
+    "The diagnostic plots below can be generated with the `plot_oovr_dependencies` method. If we are hypothetically looking for 98% recall on Class *1* and are willing to take a hit to 90% recall for Class *2*, these plots tell us we should set the binary classifier threshold for Class *2* to 0.70 to achieve this goal."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 8,
@@ -584,6 +708,13 @@
     "oovr.plot_oovr_dependencies(2, test_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -591,6 +722,17 @@
     "### Test and Attach Models"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "OrderedOVRClassifier can be fit without training the full model pipeline.  \n",
+    "\n",
+    "The code below shows how to skip the training step for the final model (classes 1,3,4,7).  \n",
+    "\n",
+    "Note that a full model pipeline is required for OrderedOVRClassifier to make predictions."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -612,6 +754,13 @@
     "                            train_final_model=False)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 10,
@@ -624,6 +773,13 @@
     "oovr.fit(train_df, eval_set=test_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -631,6 +787,15 @@
     "**Grid Search**"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's execute a simple grid search to see if there are better hyper-parameters for tuning the final model of the pipeline.  \n",
+    "\n",
+    "The `fit_test_grid` method in OrderedOVRClassifier is designed to handle GridSearchCV and RandomizedSearchCV from the sklearn.model_selection module."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 11,
@@ -649,6 +814,13 @@
     "gridm = GridSearchCV(est_lgb, grid, scoring='neg_log_loss')"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 12,
@@ -746,6 +918,13 @@
     "results = oovr.fit_test_grid(gridm, train_df, eval_set=test_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
   {
    "cell_type": "markdown",
    "metadata": {},
@@ -753,6 +932,13 @@
     "**Attach Model**"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The best set of hyper-parameters is {'num_leaves': 250, 'min_child_samples': 5, 'colsample_bytree': 1.0, 'subsample': 0.8}. Let's now train the final model with these hyper-parameters using the `fit_test` method of OrderedOVRClassifier."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 13,
@@ -797,6 +983,20 @@
     "final_model = oovr.fit_test(best_lgb, train_df, eval_set=test_df)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The **final_model** object that is returned from the `fit_test` method can be attached to the pipeline using the `attach_model` method of OrderedOVRClassifier."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 14,
@@ -827,6 +1027,20 @@
     "oovr.attach_model(final_model)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see an overall classification accuracy of 95.2% with the updated final model. This is an improvement over the 94.7% result we got earlier in [Cell 5](#Cell5)!"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 15,
@@ -859,6 +1073,22 @@
     "extended_classification_report(test_df['Cover_Type'], y_pred)"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "---"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "collapsed": true
+   },
+   "source": [
+    "That's it for this tutorial. Feel free to contact me (alvinthai@gmail.com) with any comments or questions you may have!"
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,