
Commit

Switch to tidymodels
juliasilge committed May 15, 2020
1 parent 75a68e2 commit 3893c64
Showing 202 changed files with 27,774 additions and 27,387 deletions.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
The MIT License (MIT)

Copyright (C) 2019 Ines Montani
Copyright (C) 2019 Julia Silge

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
7 changes: 3 additions & 4 deletions README.md
@@ -8,9 +8,8 @@ You can access [this course for free online](https://supervised-ml-course.netlif

This course approaches supervised machine learning using:

- [tidyverse](https://tidyverse.tidyverse.org/) tools
- more mature parts of the [tidymodels](https://github.com/tidymodels) suite of packages
- [caret](https://topepo.github.io/caret/)
- the [tidyverse](https://www.tidyverse.org/)
- the [tidymodels](https://www.tidymodels.org/) ecosystem

The interactive course site is built on the amazing framework created by [Ines Montani](https://ines.io/), originally built for her [spaCy course](https://course.spacy.io). The front-end is powered by
[Gatsby](http://gatsbyjs.org/) and [Reveal.js](https://revealjs.com) and the
@@ -19,7 +18,7 @@ back-end code execution uses [Binder](https://mybinder.org). [Florencia D'Andrea
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/juliasilge/supervised-ML-case-studies-course/binder)
![Netlify Status](https://api.netlify.com/api/v1/badges/3ba21376-9a18-4cf0-960e-2c65e6bc2bbd/deploy-status)

To learn more about building a course on this framework, see Ines's starter repos for making courses in [Python](https://github.com/ines/course-starter-python) and [R](https://github.com/ines/course-starter-r), and her explanation of how the framework works at [the original course repo](https://github.com/ines/spacy-course#-faq).
To learn more about building a course on this framework, see Ines's starter repos for making courses in [Python](https://github.com/ines/course-starter-python) and [R](https://github.com/ines/course-starter-r), and her explanation of how the framework works at [the original course repo](https://github.com/ines/spacy-course#-faq). The original version of this course based on the R package caret [is available here](https://caret-ml-course.netlify.app/).

Please note that this project is released with a [Contributor Code of Conduct](CODE_OF_CONDUCT.md). By contributing to this project, you agree to abide by its terms.

11 changes: 4 additions & 7 deletions binder/install.R
@@ -1,8 +1,5 @@
install.packages("e1071")
install.packages("gbm")
install.packages("xgboost")
install.packages("randomForest")
install.packages("caret")
install.packages("rsample")
install.packages("yardstick")
install.packages("randomForest")
install.packages("rpart")
install.packages("ranger")
install.packages("tidymodels")
install.packages("tidyverse")
2 changes: 1 addition & 1 deletion binder/runtime.txt
@@ -1 +1 @@
r-2019-08-10
r-3.6-2020-05-01
77 changes: 41 additions & 36 deletions chapters/chapter1.md
@@ -8,14 +8,14 @@ type: chapter
id: 1
---

<exercise id="1" title="Making predictions using machine learning" type="slides">
<exercise id="1" title="Make predictions using machine learning" type="slides">

<slides source="chapter1_01">
</slides>

</exercise>

<exercise id="2" title="Choosing an appropriate model">
<exercise id="2" title="Choose an appropriate model">

In this case study, you will predict the fuel efficiency ⛽ of modern cars from characteristics of these cars, like transmission and engine displacement. Fuel efficiency is a numeric value that ranges smoothly from about 15 to 40 miles per gallon. What kind of model will you build?

@@ -47,7 +47,7 @@ A classification model predicts a group membership or discrete class label, not

</exercise>

<exercise id="3" title="Visualizing the fuel efficiency distribution">
<exercise id="3" title="Visualize the fuel efficiency distribution">

The first step before you start modeling is to explore your data. In this course we'll practice using tidyverse functions for exploratory data analysis. Start off this case study by examining your data set and visualizing the distribution of fuel efficiency. The ggplot2 package, with functions like [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html) and [`geom_histogram()`](https://ggplot2.tidyverse.org/reference/geom_histogram.html), is included in the tidyverse.
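A minimal sketch of this kind of plot, assuming the `cars2018` data with an `MPG` column as described above (the bin count and labels here are illustrative choices, not the exercise's exact solution):

```r
library(tidyverse)

# Plot the distribution of fuel efficiency
ggplot(cars2018, aes(x = MPG)) +
  geom_histogram(bins = 25) +
  labs(x = "Fuel efficiency (mpg)",
       y = "Number of cars")
```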

@@ -57,9 +57,9 @@ The first time you run a code exercise, it may take a little while for your Dock

**Wherever you see `___` in a code exercise, replace it with the correct code as instructed. Run the code (via the button) to see if it will run, and submit it (via the other button) to check if it's correct.**

`tidyverse` is loaded for you.
The `tidyverse` metapackage is loaded for you, so you can use readr and ggplot2.

- Print the `cars2018` object. Notice that some of the column names have spaces in them and are surrounded by backticks, like `` `Recommended Fuel` ``.
- Take a look at the `cars2018` object using `glimpse()`. Notice that some of the column names have spaces in them and are surrounded by backticks, like `` `Recommended Fuel` ``.
- Use the appropriate column from the data set in the call to `aes()` so you can plot a histogram of fuel efficiency (MPG).
- Set the correct `x` and `y` labels.

@@ -72,15 +72,15 @@

</exercise>

<exercise id="4" title="Building a simple linear model">
<exercise id="4" title="Build a simple linear model">

Before embarking on more complex machine learning models, it's a good idea to build the simplest possible model to get an idea of what is going on. In this case, that means fitting a simple linear model using base R's `lm()` function.

**Instructions**

- Use [`select()`](https://dplyr.tidyverse.org/reference/select.html) to deselect the two columns `Model` and `Model Index` from the model; these columns tell us the individual identifiers for each car and it would *not* make sense to include them in modeling.
- Fit `MPG` as the predicted quantity, explained by all the predictors, i.e., `.` in the R formula input to `lm()`. (You may have noticed the log distribution of MPG in the last exercise, but don't worry about fitting the logarithm of fuel efficiency yet.)
- Print the summary of the model.
- Print the `summary()` of the model.
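One possible shape for this fit, assuming the `cars2018` data as above (the intermediate name `car_vars` is an illustrative choice):

```r
library(tidyverse)

# Deselect the identifier columns, which should not be used as predictors
car_vars <- cars2018 %>%
  select(-Model, -`Model Index`)

# Fit MPG as explained by all remaining predictors
fit_all <- lm(MPG ~ ., data = car_vars)

summary(fit_all)
```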

<codeblock id="01_04">

@@ -90,7 +90,7 @@ Before embarking on more complex machine learning models, it's a good idea to bu

</exercise>

<exercise id="5" title="Getting started with caret" type="slides">
<exercise id="5" title="Getting started with tidymodels" type="slides">

<slides source="chapter1_05">
</slides>
@@ -99,13 +99,13 @@ Before embarking on more complex machine learning models, it's a good idea to bu

<exercise id="6" title="Training and testing data">

Training models based on all of your data at once is typically not the best choice. 🚫 Instead, you can create subsets of your data that you use for different purposes, such as *training* your model and then *testing* your model.
Training models based on all of your data at once is typically not a good choice. 🚫 Instead, you can create subsets of your data that you use for different purposes, such as *training* your model and then *testing* your model.

Creating training/testing splits reduces overfitting. When you evaluate your model on data that it was not trained on, you get a better estimate of how it will perform on new data.

**Instructions**

- Load the `rsample` package.
- Load the `tidymodels` metapackage, which also includes dplyr for data manipulation.
- Create a data split that divides the original data into 80%/20% sections and (roughly) evenly divides the partitions between the different types of `Transmission`.
- Assign the 80% partition to `car_train` and the 20% partition to `car_test`.
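A sketch of this split with rsample's `initial_split()`, assuming a data set like the one above (the seed value is arbitrary):

```r
library(tidymodels)

set.seed(1234)
# 80%/20% split, stratified so Transmission types are divided evenly
car_split <- initial_split(car_vars, prop = 0.8, strata = Transmission)

car_train <- training(car_split)
car_test  <- testing(car_split)
```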

@@ -119,57 +119,61 @@ Creating training/testing splits reduces overfitting. When you evaluate your mod

</exercise>

<exercise id="7" title="Training models with caret">
<exercise id="7" title="Train models with tidymodels">

Now that your `car_train` data is ready, you can fit a set of models with caret. The [`train()`](https://topepo.github.io/caret/model-training-and-tuning.html#model-training-and-parameter-tuning) function from caret is flexible and powerful. It allows you to try out many different kinds of models and fitting procedures. To start off, train one linear regression model and one random forest model, without any resampling. (This is what `trainControl(method = "none")` does; it turns off all resampling.)
Now that your `car_train` data is ready, you can fit a set of models with tidymodels. When we model data, we deal with model **type** (such as linear regression or random forest), **mode** (regression or classification), and model **engine** (how the models are actually fit). In tidymodels, we capture that modeling information in a model specification, so setting up your model specification can be a good place to start. In these exercises, fit one linear regression model and one random forest model, without any resampling of your data.

**Instructions**

- Load the caret package.
- Train a basic linear regression model on your `car_train` data.
- Load the tidymodels metapackage.
- Fit a basic linear regression model to your `car_train` data.

(Notice that we are fitting to `log(MPG)` since the fuel efficiency had a log normal distribution.)
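A sketch of the specification-then-fit pattern for the linear model, assuming `car_train` from the previous exercise:

```r
library(tidymodels)

# Model specification: type (linear regression) plus engine (lm)
lm_mod <- linear_reg() %>%
  set_engine("lm")

# Fit to the log of fuel efficiency, using all other columns as predictors
fit_lm <- lm_mod %>%
  fit(log(MPG) ~ ., data = car_train)
```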

<codeblock id="01_07_1">

For linear regression, use `method = "lm"`.
For linear regression, use the function `linear_reg()`.

</codeblock>

**Instructions**

- Train a random forest model on your `car_train` data.
- Fit a random forest model to your `car_train` data.

<codeblock id="01_07_2">

For random forest, use `method = "rf"`.
For a random forest model, use the function `rand_forest()`.

</codeblock>
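The random forest fit follows the same specification-then-fit pattern; this sketch assumes the ranger engine, which the binder setup above installs:

```r
library(tidymodels)

# Random forest specification: regression mode with the ranger engine
rf_mod <- rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("regression")

fit_rf <- rf_mod %>%
  fit(log(MPG) ~ ., data = car_train)
```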

</exercise>

<exercise id="8" title="Evaluating your models">
<exercise id="8" title="Evaluate model performance">

The `fit_lm` and `fit_rf` models you just trained are in your environment. It's time to evaluate them! 🤩 For regression models, we will focus on evaluating using the **root mean squared error**. This quantity is measured in the same units as the original data (log of miles per gallon, in our case). Lower values indicate a better fit to the data. It's not too hard to calculate root mean squared error manually, but the [yardstick](https://tidymodels.github.io/yardstick/) package offers convenient functions for this and other model performance metrics.
The `fit_lm` and `fit_rf` models you just trained are in your environment. It's time to see how they did! 🤩 How are we going to do this, though?! 🤔 There are several things to consider, including both what _metrics_ and what _data_ to use.

For regression models, we will focus on evaluating using the **root mean squared error** metric. This quantity is measured in the same units as the original data (log of miles per gallon, in our case). Lower values indicate a better fit to the data. It's not too hard to calculate root mean squared error manually, but the [yardstick](https://tidymodels.github.io/yardstick/) package offers convenient functions for this and many other model performance metrics.
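A sketch of this evaluation, assuming `fit_lm` and `fit_rf` predict `log(MPG)` as above (the column names here mirror the exercise but are illustrative):

```r
library(tidymodels)

# Add predictions from each fitted model as new columns
results <- car_train %>%
  mutate(MPG = log(MPG),
         `Linear regression` = predict(fit_lm, car_train) %>% pull(.pred),
         `Random forest` = predict(fit_rf, car_train) %>% pull(.pred))

# metrics() reports RMSE along with other regression metrics
metrics(results, truth = MPG, estimate = `Linear regression`)
metrics(results, truth = MPG, estimate = `Random forest`)
```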

**Instructions**

- Load the yardstick package.
- Load the tidymodels metapackage, to access yardstick functions.
- Create new columns for model predictions from each of the models you have trained, first linear regression and then random forest.
- Evaluate the performance of these models using [`metrics()`](https://tidymodels.github.io/yardstick/reference/metrics.html) by specifying the column in the original data that contains the real fuel efficiency.
- Evaluate the performance of these models using [`metrics()`](https://tidymodels.github.io/yardstick/reference/metrics.html) by specifying the column that contains the real fuel efficiency.

<codeblock id="01_08">

- Use `fit_lm` to predict the values for linear regression and `fit_rf` to predict values for random forest.
- The "truth" column in the original data is the column that holds fuel efficiency, `MPG`.
- The "truth" column in `results` is the column that holds fuel efficiency, `MPG`.

</codeblock>

</exercise>

<exercise id="9" title="Using the testing data">
<exercise id="9" title="Use the testing data">

"But wait!" you say, because you have been paying attention. 🤔 "That is how these models perform on the *training* data, the data that we used to build these models in the first place." This is _not_ a good idea because when you evaluate on the same data you used to train a model, the performance you estimate is too optimistic.

"But wait!" you say, because you have been paying attention. 🤔 "That is how these models perform on the *training* data, the data that we used to build these models in the first place." Let's evaluate how these simple models perform on the testing data.
Let's evaluate how these simple models perform on the testing data instead.

**Instructions**

@@ -192,39 +196,40 @@ Where you had `car_train` before, switch out to `car_test`.

<exercise id="11" title="Bootstrap resampling">

In the last set of exercises, you trained linear regression and random forest models without any resampling. Resampling can improve the accuracy of machine learning models, and reduce overfitting.
In the last set of exercises, you trained linear regression and random forest models without any resampling. Resampling can help us evaluate our machine learning models more accurately.

Let's try bootstrap resampling, which means creating data sets the same size as the original one by randomly drawing with replacement from the original. In caret, the default behavior for bootstrapping is 25 resamplings, but you can change this using [`trainControl()`](https://topepo.github.io/caret/model-training-and-tuning.html#tune) if desired.
Let's try bootstrap resampling, which means creating data sets the same size as the original one by randomly drawing with replacement from the original. In tidymodels, the default behavior for bootstrapping is 25 resamplings, but you can change this using the `times` argument in [`bootstraps()`](https://tidymodels.github.io/rsample/reference/bootstraps.html) if desired.
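A sketch of this resampling workflow, assuming the model specifications `lm_mod` and `rf_mod` from earlier (`fit_resamples()` comes from the tune package within tidymodels; the seed is arbitrary):

```r
library(tidymodels)

set.seed(1234)
# 25 bootstrap resamples of the training data
car_boot <- bootstraps(car_train, times = 25)

# Evaluate each model specification across the resamples,
# saving out-of-sample predictions for later inspection
lm_res <- lm_mod %>%
  fit_resamples(log(MPG) ~ ., resamples = car_boot,
                control = control_resamples(save_pred = TRUE))

rf_res <- rf_mod %>%
  fit_resamples(log(MPG) ~ ., resamples = car_boot,
                control = control_resamples(save_pred = TRUE))
```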

**Instructions**

The data sets available in your environment are 10% of their original size, to allow the code in this exercise to evaluate quickly. (This means you may see some warnings, such as about rank-deficient fits.)
The data set available in your environment is 10% of its original size, to allow the code in this exercise to evaluate quickly. (This means you will see some warnings, such as about rank-deficient fits.)

- Which data set should you train these models with, `car_train` or `car_test`?
- Train these models using bootstrap resampling. The method for this is `"boot"`.
- Create bootstrap resamples to evaluate these models. The function to create this kind of resample is `bootstraps()`.
- Evaluate both kinds of models, the linear regression model and the random forest model.
- Use the bootstrap resamples you created, `car_boot`, to evaluate both models.

<codeblock id="01_11">

You should still use the training data, `car_train`, for training these models.
First evaluate `lm_mod`, and then evaluate `rf_mod`.

</codeblock>

</exercise>

<exercise id="12" title="Plotting modeling results">
<exercise id="12" title="Plot modeling results">

You just trained models using bootstrap resampling, `cars_lm_bt` and `cars_rf_bt`. These models are available in your environment, trained on the entire training set instead of 10% only. Now let's evaluate how those models performed and compare them. We will again use `metrics()` from the yardstick package, but also we will plot the model predictions to inspect them visually.
You just trained models on bootstrap resamples of the training set and now have the results in `lm_res` and `rf_res`. These results are available in your environment, trained using the entire training set instead of 10% only. Now let's compare them.

Notice in this code how we use [`gather()`](https://tidyr.tidyverse.org/reference/gather.html) from tidyr (another tidyverse package) to tidy the data frame and prepare it for plotting with ggplot2.
Notice in this code how we use [`bind_rows()`](https://dplyr.tidyverse.org/reference/bind.html) from dplyr to combine the results from both models, along with [`collect_predictions()`](https://tune.tidymodels.org/reference/collect_predictions.html) to obtain and format predictions from each resample.
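A sketch of that combining step, assuming `lm_res` and `rf_res` were fit with predictions saved; the name of the truth column in `collect_predictions()` output depends on the outcome used when fitting, so the `log(MPG)` column referenced here is an assumption:

```r
library(tidymodels)

# Gather out-of-sample predictions from both sets of resampling results
results <- bind_rows(
  lm_res %>% collect_predictions() %>% mutate(model = "lm"),
  rf_res %>% collect_predictions() %>% mutate(model = "rf")
)

# Plot predicted vs. observed values, one panel per model
ggplot(results, aes(`log(MPG)`, .pred)) +
  geom_abline(lty = 2, color = "gray50") +
  geom_point(alpha = 0.3) +
  facet_wrap(~model)
```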

**Instructions**

- Use [`mutate()`](https://dplyr.tidyverse.org/reference/mutate.html) to create the new columns with the predictions from the two models you trained.
- Choose which columns should be specified as `truth` and which should be `estimate` when calling `metrics()`.
- First `collect_predictions()` for the linear model.
- Then `collect_predictions()` for the random forest model.

<codeblock id="01_12_1">

Specify the `MPG` column as `truth` and the column created from the prediction (either `Linear regression` or `Random forest`) as `estimate`.
The two sets of results are available in `lm_res` and `rf_res`.

</codeblock>

