Coding project 3: neural networks for regression and binary classification.

For this project you will be writing an R package that implements gradient descent algorithms for neural networks with one hidden layer. You are allowed and encouraged to discuss your code/ideas with other students, but it is strictly forbidden to copy any code/documentation/tests/etc from other groups, or any web pages. Your code/project must be written from scratch by the members of your group. Any violations will result in one or more of the following:

  • a zero on this assignment,
  • a failing grade in this class,
  • expulsion from NAU.

Algorithms to code

By now you know how to code functions in C++ or R, so you have the choice:

  • if you code the algorithm in C++, write an R function that calls your C++ code via .C (a hypothetical wrapper of this form is sketched after this list)
  • if you code the algorithm in pure R, use matrix/vector arithmetic for efficiency (do NOT use for loops over data points or features)
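If you choose the C++ option, the R-side wrapper might look roughly like the sketch below. The routine name nnet_descent_interface, its argument list, and the PACKAGE name are all hypothetical, for illustration only; use whatever entry point you actually register in your package.

NNetIterationsC <- function(X.mat, y.vec, max.iterations, step.size, n.hidden.units, is.train) {
  result <- .C(
    "nnet_descent_interface",                        # hypothetical registered C/C++ routine
    as.double(X.mat), as.integer(nrow(X.mat)), as.integer(ncol(X.mat)),
    as.double(y.vec), as.integer(max.iterations), as.double(step.size),
    as.integer(n.hidden.units), as.integer(is.train),
    pred = double(nrow(X.mat) * max.iterations),     # output buffer filled by the compiled code
    PACKAGE = "yourPackageName")                     # hypothetical package name
  matrix(result$pred, nrow(X.mat), max.iterations)   # pred.mat, one column per iteration
}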

Neural networks with early stopping regularization

R Function name: NNetIterations

  • Inputs: X.mat (feature matrix, n_observations x n_features), y.vec (label vector, n_observations x 1), max.iterations (int scalar > 1), step.size (step size for gradient descent), n.hidden.units (number of hidden units), is.train (logical vector of size n_observations, TRUE if the observation is in the train set, FALSE for the validation set).
  • internally, compute a scaled train input matrix X.train (n_train x n_nontrivial_features) whose columns each have mean=0 and sd=1, where n_train is the number of train observations and n_nontrivial_features is the number of features with non-zero variance/sd. Keep track of which features have zero variance, and set the corresponding weights to zero in the final V.mat output. Also keep track of the mean/sd of each column, so you can return V.mat on the original scale (if you use the unscaled X.mat during gradient descent, it will not converge, due to numerical instability).
  • Output: list with named elements:
    • pred.mat, n_observations x max.iterations matrix of predicted values (real number for regression, probability for binary classification).
    • V.mat final weight matrix (n_features+1 x n.hidden.units). The first row of V.mat should be the intercept terms.
    • w.vec final weight vector (n.hidden.units+1). The first element of w.vec should be the intercept term.
    • predict(testX.mat), a function that takes an unscaled test feature matrix and returns a vector of predictions (real numbers for regression, probabilities for binary classification). You should be able to make predictions via
A.mat <- cbind(1, testX.mat) %*% V.mat    # hidden unit activations, n_test x n.hidden.units
Z.mat <- sigmoid(A.mat)                   # sigmoid(a) = 1/(1+exp(-a))
pred.vec <- cbind(1, Z.mat) %*% w.vec     # for binary classification, take sigmoid(pred.vec) to get probabilities
  • if all of y.vec is binary (either 0 or 1) then you should use the logistic loss, otherwise use the square loss.
  • you should optimize the mean loss, i.e. the sum of the loss over all train observations divided by the number of train observations (if you do not divide by the number of observations, the gradient can become numerically unstable for large data sets).
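For reference, here is a minimal sketch of what the pure-R gradient descent loop could look like, assuming sigmoid hidden-unit activations and full-batch updates. It omits input checking and returns the first-layer weights on the scaled scale (V.sc) rather than the required original-scale V.mat (see the scaling FAQ below). All names are illustrative, not required.

sigmoid <- function(a) 1 / (1 + exp(-a))
NNetIterationsSketch <- function(X.mat, y.vec, max.iterations, step.size, n.hidden.units, is.train) {
  is.binary <- all(y.vec %in% c(0, 1))
  X.train <- X.mat[is.train, , drop = FALSE]
  y.train <- y.vec[is.train]
  col.mean <- colMeans(X.train)                    # scaling uses train observations only
  col.sd <- apply(X.train, 2, sd)
  keep <- col.sd > 0                               # drop zero-variance features
  scale.features <- function(M) scale(M[, keep, drop = FALSE], col.mean[keep], col.sd[keep])
  X.sc <- scale.features(X.train)
  X.all.sc <- scale.features(X.mat)                # for predictions on every observation
  n.train <- nrow(X.sc)
  V <- matrix(rnorm((ncol(X.sc) + 1) * n.hidden.units, sd = 0.1), ncol(X.sc) + 1)
  w <- rnorm(n.hidden.units + 1, sd = 0.1)
  pred.mat <- matrix(NA_real_, nrow(X.mat), max.iterations)
  for (iteration in seq_len(max.iterations)) {
    A <- cbind(1, X.sc) %*% V                      # n.train x n.hidden.units
    Z <- sigmoid(A)
    b <- as.numeric(cbind(1, Z) %*% w)             # un-activated output
    if (is.binary) {
      delta <- (sigmoid(b) - y.train) / n.train    # gradient of mean logistic loss wrt b
    } else {
      delta <- (b - y.train) / n.train             # gradient of mean square loss wrt b (up to a constant factor)
    }
    grad.w <- as.numeric(t(cbind(1, Z)) %*% delta)
    delta.hidden <- (delta %*% t(w[-1])) * Z * (1 - Z)   # backpropagate through the sigmoid layer
    grad.V <- t(cbind(1, X.sc)) %*% delta.hidden
    w <- w - step.size * grad.w
    V <- V - step.size * grad.V
    b.all <- cbind(1, sigmoid(cbind(1, X.all.sc) %*% V)) %*% w
    pred.mat[, iteration] <- if (is.binary) sigmoid(b.all) else b.all
  }
  list(pred.mat = pred.mat, V.sc = V, w.vec = w)   # V.sc is still on the scaled scale
}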

R Function name: NNetEarlyStoppingCV

  • Inputs: X.mat, y.vec, fold.vec=sample(rep(1:n.folds, length.out=length(y.vec))), max.iterations, step.size, n.hidden.units, n.folds=4.
  • should use K-fold cross-validation based on the fold IDs provided in fold.vec (n.folds argument should be used to randomly assign fold.vec if it is not specified by the user)
  • for each train/validation split, use NNetIterations to compute the predictions for all observations
  • compute mean.validation.loss.vec, which is a vector (with max.iterations elements) of mean validation loss over all K folds (use the square loss for regression and the 0-1 loss for binary classification).
  • compute mean.train.loss.vec, analogous to above but for the train data.
  • minimize the mean validation loss to determine selected.steps, the optimal number of steps/iterations.
  • finally use NNetIterations(max.iterations=selected.steps) on the whole training data set.
  • Output the same list from NNetIterations and add
    • mean.validation.loss.vec, mean.train.loss.vec (for plotting train/validation loss curves)
    • selected.steps
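A rough sketch of this cross-validation wrapper, assuming an NNetIterations function with the interface above and a fold.vec containing values in 1:n.folds; variable names are illustrative.

NNetEarlyStoppingCVSketch <- function(X.mat, y.vec, fold.vec = NULL, max.iterations, step.size, n.hidden.units, n.folds = 4) {
  if (is.null(fold.vec)) {
    fold.vec <- sample(rep(1:n.folds, length.out = length(y.vec)))
  }
  is.binary <- all(y.vec %in% c(0, 1))
  loss.fun <- if (is.binary) {
    function(pred, y) mean((pred > 0.5) != y)      # 0-1 loss on predicted probabilities
  } else {
    function(pred, y) mean((pred - y)^2)           # square loss
  }
  validation.loss.mat <- matrix(NA_real_, n.folds, max.iterations)
  train.loss.mat <- matrix(NA_real_, n.folds, max.iterations)
  for (validation.fold in seq_len(n.folds)) {
    is.train <- fold.vec != validation.fold
    fit <- NNetIterations(X.mat, y.vec, max.iterations, step.size, n.hidden.units, is.train)
    train.loss.mat[validation.fold, ] <- apply(
      fit$pred.mat[is.train, , drop = FALSE], 2, loss.fun, y = y.vec[is.train])
    validation.loss.mat[validation.fold, ] <- apply(
      fit$pred.mat[!is.train, , drop = FALSE], 2, loss.fun, y = y.vec[!is.train])
  }
  mean.validation.loss.vec <- colMeans(validation.loss.mat)
  mean.train.loss.vec <- colMeans(train.loss.mat)
  selected.steps <- which.min(mean.validation.loss.vec)
  final.fit <- NNetIterations(X.mat, y.vec, selected.steps, step.size, n.hidden.units, rep(TRUE, length(y.vec)))
  c(final.fit, list(
    mean.validation.loss.vec = mean.validation.loss.vec,
    mean.train.loss.vec = mean.train.loss.vec,
    selected.steps = selected.steps))
}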

Documentation and tests

  • for each R function, write documentation with at least one example of how to use it.
  • write at least two tests for each R function, in tests/testthat/test-xxx.R. You should at least test that (1) for valid inputs your function returns an output of the expected type/dimension, (2) for an invalid input, your function stops with an informative error message.
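For example, one pair of tests for NNetIterations might look like the following; the toy data and the expected error-message pattern are illustrative, so match the pattern to whatever message your own input checks produce.

library(testthat)
test_that("NNetIterations returns a prediction matrix of the expected dimension", {
  set.seed(1)
  X.mat <- matrix(rnorm(20 * 3), 20, 3)
  y.vec <- rnorm(20)
  fit <- NNetIterations(X.mat, y.vec, max.iterations = 5, step.size = 0.1,
    n.hidden.units = 2, is.train = rep(c(TRUE, FALSE), 10))
  expect_equal(dim(fit$pred.mat), c(20, 5))
  expect_equal(dim(fit$V.mat), c(4, 2))   # n_features+1 x n.hidden.units
})
test_that("NNetIterations stops with an informative error for non-numeric X.mat", {
  expect_error(
    NNetIterations("not a matrix", 1:5, 5, 0.1, 2, rep(TRUE, 5)),
    "X.mat must be a numeric matrix")     # illustrative message; use your own
})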

Experiments/application: run your code on the following data sets.

  • Binary classification
    • ElemStatLearn::spam 2-class [4601, 57] output is last column (spam).
    • ElemStatLearn::SAheart 2-class [462, 9] output is last column (chd).
    • ElemStatLearn::zip.train: 10-class [7291, 256] output is first column. (ignore classes other than 0 and 1)
  • Regression
    • ElemStatLearn::prostate [97 x 8] output is lpsa column, ignore train column.
    • ElemStatLearn::ozone [111 x 3] output is first column (ozone).
  • For each data set, use 5-fold cross-validation to evaluate the prediction accuracy of your code. For each split, set aside the data in fold s as a test set. Use NNetEarlyStoppingCV to train a model on the other folds (which should be used in your function as internal train/validation sets/splits), then make a prediction on the test fold s.
  • For each train/test split, to show that your algorithm is actually learning something non-trivial from the inputs/features, compute a baseline predictor that ignores the inputs/features.
    • Regression: the mean of the training labels/outputs.
    • Binary classification: the most frequent class/label/output in the training data.
  • For each data set, compute an s x 2 matrix of mean test loss values (a sketch of this experiment loop appears after the report outline below):
    • each row is for a specific test set/fold,
    • the first column is for the neural network predictor,
    • the second column is for the baseline/un-informed predictor.
  • Make one or more plot(s) or table(s) that compares these test loss values. For each of the five data sets, is the neural network more accurate than the baseline?
  • for each data set, run the NNetEarlyStoppingCV function on the entire data set, and plot the mean validation loss as a function of the regularization parameter (the number of gradient descent iterations). Plot the mean train loss in one color, and the mean validation loss in another color. Plot a point and/or text label to emphasize the regularization parameter selected by minimizing the mean validation loss function.
  • Write up your results in vignettes/report.Rmd that shows the R code that you used for the experiments/application, along with the output.
## Data set 1: spam

### Matrix of loss values

print out and/or plot the matrix.

comment on difference in accuracy.

### Train/validation loss plots

plot the two loss functions.

What are the optimal regularization parameters?

## Data set 2: ...
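As a concrete example, the test loss matrix for one data set (spam here) could be computed with a loop like the one below; the hyperparameter values are placeholders, not recommendations.

data(spam, package = "ElemStatLearn")
X.mat <- as.matrix(spam[, -ncol(spam)])
y.vec <- ifelse(spam$spam == "spam", 1, 0)
n.test.folds <- 5
test.fold.vec <- sample(rep(1:n.test.folds, length.out = length(y.vec)))
loss.mat <- matrix(NA_real_, n.test.folds, 2,
  dimnames = list(NULL, c("NNetEarlyStoppingCV", "baseline")))
for (test.fold in 1:n.test.folds) {
  is.test <- test.fold.vec == test.fold
  fit <- NNetEarlyStoppingCV(
    X.mat[!is.test, ], y.vec[!is.test],
    max.iterations = 100, step.size = 0.05, n.hidden.units = 10)  # placeholder values
  pred.vec <- fit$predict(X.mat[is.test, ])                 # predicted probabilities on the test fold
  baseline.pred <- as.numeric(mean(y.vec[!is.test]) > 0.5)  # most frequent class in the train data
  loss.mat[test.fold, 1] <- mean((pred.vec > 0.5) != y.vec[is.test])  # 0-1 test loss
  loss.mat[test.fold, 2] <- mean(baseline.pred != y.vec[is.test])
}
loss.mat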

Grading rubric: 100 points.

Your group should submit a link to your repo on GitHub.

  • 20 points for completeness of report.
    • 4 points for each data set (2 points each for loss matrix and train/validation loss plot)
  • 20 points if your R package passes with no WARNING/ERROR on https://win-builder.r-project.org/
    • minus 5 points for every WARNING/ERROR.
  • 20 points for group evaluations – this is to make sure that each group member participates more or less equally. You will get points deducted if your fellow group members give you a bad evaluation.
  • 20 points for accuracy of your code (I will run tests to make sure your functions give errors for bad inputs, and the proper output for good inputs).
  • 10 points for R documentation pages.
    • 5 points for informative example code.
    • 5 points for documenting types/dimensions of inputs/outputs.
  • 10 points for tests, as described above.

Extra credit:

  • 1-15 points extra credit if, in your Rmd report, you also compare against NNLearnCV/LM__L2CV/LM__EarlyStoppingCV, and comment on whether or not the neural network is more accurate (max 3 points per data set, one point for each algo). Note that the only way to get this to work along with CRAN/win-builder checks is by copying the code from the previous R package(s) to your package for project 3.
  • 10 points extra credit if, in your Rmd report, you use LaTeX code/MathJax to type the equations for the loss, gradient, and backpropagation updates.
  • 10 points if, in your GitHub repo, you setup Travis-CI to check your R package, and have a green badge that indicates a build that passes checks. See blog and docs.
  • if you submit your work early (to me via email) you will get feedback from me and extra credit:
    • First week: 10 points if you have written both functions described above, and you email me with a link to your github repo by Tuesday Apr 2. (1 point per function)
    • Second week: 10 more points if you have started your report, and you email me with the rendered HTML report as an attachment by Tue Apr 9. You will get 2 points of extra credit for the analysis of each data set (1 point for plots of train/validation loss versus regularization parameter, 1 point for 4-fold CV loss matrix table/plot).
    • Third week: do tests/docs, finish report, make sure package passes R CMD check with no WARNING/ERROR on win-builder, send me a link to your github project via email by Fri Apr 12.

FAQ

How to deal with scaling?

Same as in the linear model from project 2. Your NNetIterations function should return V.mat on the original scale.
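For example, if the columns of the train features were scaled as (x - mean)/sd during gradient descent, the first-layer weights learned on the scaled data (call them V.scaled, intercept in the first row) can be converted back like this; a minimal sketch, assuming zero-variance features have already been handled separately.

unscale.V <- function(V.scaled, col.mean, col.sd) {
  V.orig <- V.scaled
  V.orig[-1, ] <- V.scaled[-1, ] / col.sd                                 # per-feature slopes back to original units
  V.orig[1, ] <- V.scaled[1, ] - col.mean %*% V.orig[-1, , drop = FALSE]  # adjust the intercept row
  V.orig
}
## afterwards cbind(1, X.unscaled) %*% V.orig gives the same hidden unit
## activations as cbind(1, X.scaled) %*% V.scaled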

How to deal with intercept/bias terms?

They should be the first element/row of the weight vector/matrix, e.g. use cbind(1, feature.mat) %*% weights.

Do we have to include the intercept terms?

For the first layer weights (V.mat) you do need to include the intercept to report the weights on the original scale. For the second layer weights (w.vec) you can include an intercept term if you want. If you do then the second layer predictions should be cbind(1, Z.mat) %*% w.vec, otherwise just Z.mat %*% w.vec.

In NNetIterations do we have to store/return weight matrices/vectors for all iterations?

No. Just return the final weight matrix/vector after doing max.iterations steps of gradient descent, and make sure to compute the predictions for all observations during each step and return them in the prediction matrix. The is.train input parameter should be a logical vector, with one element per observation, indicating which set each observation belongs to (TRUE=train, FALSE=validation). Train observations should be used for scaling and gradient descent; the others should not be used for scaling/gradient descent, but predictions should be reported for all observations.