Jill Cates, Tariq Hassan, Avinash Prabhakaran
punisheR is a package for feature and model selection in R. Specifically, this package implements tools for forward and backward stepwise feature selection. In order to measure model quality during the selection procedures, we have also implemented the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and the coefficient of determination (R-squared).
The package contains two stepwise feature selection techniques:
- `forward()`: Forward selection starts with one feature and iteratively adds the feature whose inclusion gives the best score under a model fit criterion (a minimal sketch of this greedy procedure appears below). Features are added until either the desired number of features (`n_features`) is reached or the change in score is less than the `min_change` threshold.
- `backward()`: Backward selection (elimination) starts with all features and iteratively removes the feature whose exclusion gives the best score under a model fit criterion. Features are removed until either the desired number of features (`n_features`) is reached or the change in score is less than the `min_change` threshold.
Source: https://en.wikipedia.org/wiki/Stepwise_regression
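To make the greedy procedure concrete, here is a minimal sketch of forward selection in base R, scored by validation-set R-squared. The function name `forward_sketch` and the scoring rule are assumptions for illustration only; punisheR's `forward()` additionally supports the `min_change` stopping rule and the `aic`/`bic` criteria.

```r
# A minimal sketch of greedy forward selection scored by validation R-squared.
# This is illustrative only and is not punisheR's actual implementation.
forward_sketch <- function(X_train, y_train, X_val, y_val, n_features) {
  y_tr <- unlist(y_train)                 # coerce responses to plain vectors
  y_va <- unlist(y_val)
  selected  <- integer(0)
  remaining <- seq_len(ncol(X_train))
  while (length(selected) < n_features) {
    # Score each candidate feature by the validation R-squared of the model
    # that includes it alongside the features selected so far.
    scores <- vapply(remaining, function(j) {
      cols  <- c(selected, j)
      fit   <- lm(y_tr ~ ., data = X_train[, cols, drop = FALSE])
      preds <- predict(fit, newdata = X_val[, cols, drop = FALSE])
      1 - sum((y_va - preds)^2) / sum((y_va - mean(y_va))^2)
    }, numeric(1))
    best      <- remaining[which.max(scores)]   # greedy choice: best scorer wins
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
  }
  selected
}
```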
The package contains three metrics that evaluate model performance:
- `aic()`: The Akaike information criterion (AIC) adds a penalty term which penalizes more complex models. Its formal definition is $-2\ln(L) + 2k$, where $k$ is the number of features and $L$ is the maximized value of the likelihood function.
- `bic()`: The Bayesian information criterion (BIC) adds a penalty term which penalizes complex models to a greater extent than AIC. Its formal definition is $-2\ln(L) + \ln(n) \cdot k$, where $k$ is the number of features, $n$ is the number of observations, and $L$ is the maximized value of the likelihood function.
- `r_squared()`: The coefficient of determination ($R^2$) is the proportion of the variance in the response variable that can be predicted from the explanatory variables.
These three criteria measure the relative quality of models within `forward()` and `backward()` and can be configured using the `criterion` parameter. In general, adding more parameters to a model improves its fit to the training data but makes it more susceptible to overfitting. AIC and BIC counteract this by adding a penalty for the number of features in the model; the lower the AIC or BIC score, the better the model.
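To make the penalty terms concrete, the sketch below recomputes AIC and BIC by hand for an ordinary `lm()` fit using only base R and compares them with `stats::AIC()` and `stats::BIC()`. Note that base R counts the residual variance as an estimated parameter, so `k` here is the number of coefficients plus one.

```r
# Recomputing AIC and BIC by hand for an ordinary lm() fit (base R only).
fit <- lm(hp ~ mpg + cyl, data = mtcars)
ll  <- logLik(fit)               # maximized log-likelihood
k   <- attr(ll, "df")            # number of estimated parameters
n   <- nobs(fit)                 # number of observations

-2 * as.numeric(ll) + 2 * k      # same value as AIC(fit)
-2 * as.numeric(ll) + log(n) * k # same value as BIC(fit)
```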
In the R ecosystem, forward and backward selection are implemented in both the olsrr and MASS packages. The former provides `ols_step_forward()` and `ols_step_backward()` for forward and backward stepwise selection, respectively; both use the p-value as the metric for feature selection. The latter, MASS, contains `stepAIC()`, which supports three modes: forward, backward, or both. Other packages that provide subset selection for regression models are leaps and bestglm.
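For comparison, backward elimination with `MASS::stepAIC()` on the raw mtcars data might look like the sketch below; the formula and settings are illustrative rather than a recommendation.

```r
# Backward elimination with MASS::stepAIC() on mtcars, for comparison.
library(MASS)

full_fit <- lm(hp ~ ., data = mtcars)      # start from the full model
reduced  <- stepAIC(full_fit,              # drop terms while AIC improves
                    direction = "backward", trace = FALSE)
formula(reduced)                           # inspect the retained terms
```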
To demonstrate how punisheR's feature selection and criterion functions work, we will use our demo data `mtcars_data()`, which arranges [mtcars](https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/mtcars.html) into the correct format for our use cases. `mtcars_data()` returns a list of 4 data frames in the following order: X_train, y_train, X_val, and y_val. Horsepower (`hp`) is the response variable (`y`), while the remaining variables of `mtcars` are the predictive features (`X`). The data is split into training data, which is used to fit the model, and validation data, which is used to validate (score) it.
```r
# Loading the demo mtcars data
data <- mtcars_data()
X_train <- data[[1]]
y_train <- data[[2]]
X_val   <- data[[3]]
y_val   <- data[[4]]
```
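If you prefer to prepare the data yourself, an equivalent split can be built directly from `mtcars` along the lines of the sketch below. The 80/20 random split and the `_manual` variable names are assumptions for illustration; `mtcars_data()` may partition the rows differently.

```r
# A hand-rolled train/validation split of mtcars with hp as the response.
# The 80/20 random split is an assumption; mtcars_data() may differ.
set.seed(123)
train_idx <- sample(nrow(mtcars), size = floor(0.8 * nrow(mtcars)))

X_train_manual <- mtcars[train_idx, setdiff(names(mtcars), "hp")]
y_train_manual <- mtcars[train_idx, "hp"]
X_val_manual   <- mtcars[-train_idx, setdiff(names(mtcars), "hp")]
y_val_manual   <- mtcars[-train_idx, "hp"]
```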
There are two parameters that determine how features are selected in forward selection:

- `n_features` specifies the number of features to select. If you set `n_features` to 3, the forward selection function will select the 3 best features for your model.
- `min_change` specifies the minimum change in score required to proceed to the next iteration. The function stops when no remaining feature produces a change larger than the `min_change` threshold.

For forward selection to work, only one of `n_features` and `min_change` can be active; the other must be set to `NULL`.
Let's look at how `n_features` works within forward selection:
```r
forward(X_train, y_train, X_val, y_val, min_change=NULL,
        n_features=2, criterion='aic', verbose=FALSE)
#> [1] 9 4
```
When forward selection is run on the mtcars dataset with `hp` as the response variable, it returns a list of features that form the best model. In the above example, the desired number of features has been specified as 2 and the criterion being used is `aic`. The function returns a list of 2 features.
```r
forward(X_train, y_train, X_val, y_val, min_change=NULL,
        n_features=3, criterion='bic', verbose=FALSE)
#> [1] 9 4 8
```
In the above example, the desired number of features has been specified as 3 and the criterion being used is `bic`. The function returns a list of 3 features.
```r
forward(X_train, y_train, X_val, y_val, min_change=NULL,
        n_features=4, criterion='r-squared', verbose=FALSE)
#> [1] 2 1 6 3
```
In the above example, the desired number of features has been specified as 4 and the criterion being used is `r-squared`. The function returns a list of 4 features.
Forward selection also works by specifying the smallest change in criterion, `min_change`:
```r
forward(X_train, y_train, X_val, y_val, min_change=0.5,
        n_features=NULL, criterion='r-squared', verbose=FALSE)
#> [1] 2 1 6 3 7 5
```
In the example above, forward selection returns a list of 6 features when a minimum change of 0.5 in the `r-squared` score is required for an additional feature to be selected.
Note: When using `aic` or `bic` as the criterion, the value of `min_change` should be chosen carefully, as `aic` and `bic` tend to have much larger values than `r-squared`.
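To see why the scales differ, compare a base-R AIC value with a training R-squared for the same model. This is an assumed example using only base R; punisheR's validation-based scores will differ in value, but the contrast in scale is the same.

```r
# Rough illustration of the difference in scale between criteria (base R).
fit <- lm(hp ~ mpg + cyl, data = mtcars)
AIC(fit)                 # log-likelihood based: in the hundreds for this fit
summary(fit)$r.squared   # proportion of variance explained: always in [0, 1]
```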
Backward selection works in the same way as forward selection: you must configure `n_features` or `min_change`, as well as the `criterion` used to score the model.
```r
backward(X_train, y_train, X_val, y_val,
         n_features=7, min_change=NULL, criterion='aic',
         verbose=FALSE)
#> [1] 1 4 5 7 8 9 10
```

```r
backward(X_train, y_train, X_val, y_val,
         n_features=7, min_change=NULL, criterion='bic',
         verbose=FALSE)
#> [1] 1 4 5 7 8 9 10
```

```r
backward(X_train, y_train, X_val, y_val,
         n_features=7, min_change=NULL, criterion='r-squared',
         verbose=FALSE)
#> [1] 1 2 3 5 6 7 9
```
With `n_features` configured to 7, each example above returns the 7 best features based on model score. You can see above that changing the criterion can result in a different set of "best" features.
In the example below, backward selection returns a list of 10 features when the `min_change` in the `r-squared` criterion is specified as 0.5.
```r
backward(X_train, y_train, X_val, y_val,
         n_features=NULL, min_change=0.5, criterion='r-squared',
         verbose=FALSE)
#> [1] 1 2 3 4 5 6 7 8 9 10
```
punisheR also provides three standalone functions to compute AIC, BIC, and R-squared. For `aic()` and `bic()`, you simply need to pass in the model (e.g., an `lm()` object). You can also pass in the validation data and response variable (`X_val`, `y_val`); by default, `X` and `y` are extracted from the model.
```r
model <- lm(y_train ~ mpg + cyl + disp, data = X_train)

aic(model, X_val, y_val)
#> [1] 217.1279

bic(model, X_val, y_val)
#> [1] 223.0182
```
When scoring the model using AIC and BIC, we can see that the penalty when using `bic` is greater than the penalty obtained using `aic`.
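As a rough cross-check, base R's `AIC()` and `BIC()` can be applied to the same model; these are computed from the training fit's likelihood, so they will generally not match the validation-based values shown above (outputs omitted here).

```r
# Base-R counterparts, computed from the training fit (values not shown).
AIC(model)
BIC(model)
```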
```r
r_squared(model, X_val, y_val)
#> [1] 0.7838625
```
The value returned by `r_squared()` will be between 0 and 1.
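As a point of reference, a validation-set R-squared of this kind can be computed by hand from the model's predictions. The snippet below is a minimal sketch of that calculation and is not necessarily identical to punisheR's internal implementation.

```r
# Computing a validation-set R-squared by hand (illustrative sketch only).
y      <- unlist(y_val)                    # response as a plain numeric vector
preds  <- predict(model, newdata = X_val)  # predictions on the held-out data
ss_res <- sum((y - preds)^2)               # residual sum of squares
ss_tot <- sum((y - mean(y))^2)             # total sum of squares
1 - ss_res / ss_tot                        # coefficient of determination
```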