Week 4: Supervised Machine Learning - Part 2

This module covers more advanced supervised learning methods, including Naive Bayes classifiers, ensembles of trees (random forests and gradient boosted decision trees), and neural networks.

Naive Bayes Classifiers

Naive Bayes classifiers are called naive because they assume that each feature of an instance is independent of all the others, given the class. In reality features are often correlated, but this assumption makes learning and prediction highly efficient; the trade-off is that generalisation performance may be worse than that of more sophisticated learning models.

Types:

  • Bernoulli: binary features, e.g. word presence/absence
  • Multinomial: discrete count features, e.g. word counts
  • Gaussian: continuous/real-valued features; for each feature, a mean and standard deviation are estimated per class
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# split the data, then fit a Gaussian Naive Bayes classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nbclf = GaussianNB().fit(X_train, y_train)

Gaussian Naive Bayes is typically used for high-dimensional data, where each instance has hundreds, thousands or even more features. On the negative side, when the conditional independence assumption doesn't hold, that is, when there is significant covariance among features for a given class, as is the case with many real-world datasets, more sophisticated classifiers that can account for these dependencies are likely to outperform Naive Bayes.
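As a minimal sketch of the other two variants, using made-up toy feature matrices purely for illustration (the counts, binary indicators and labels below are not from any real dataset):

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import numpy as np

# toy data for illustration only
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 2, 0]])  # e.g. word counts per document
X_binary = (X_counts > 0).astype(int)                              # e.g. word presence/absence
y_toy = np.array([0, 1, 0, 1])

mnb = MultinomialNB().fit(X_counts, y_toy)   # discrete count features
bnb = BernoulliNB().fit(X_binary, y_toy)     # binary features
print(mnb.predict(X_counts), bnb.predict(X_binary))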

Random Forests

Random Forests are an example of an ensemble. An ensemble takes multiple individual learning models and combines them to produce an aggregate model that is more powerful than any of its members alone. By combining different individual models into an ensemble, we can average out their individual mistakes to reduce the risk of overfitting while maintaining strong prediction performance. Recall that decision trees have a tendency to overfit the training data, so the idea behind random forests is to build a collection of trees that each do reasonably well at prediction but are intentionally and randomly varied during construction. This variation happens in two ways: first, the data used to build each tree is a random (bootstrap) sample of the training data; second, the features considered at each split are chosen from a random subset.

from sklearn.ensemble import RandomForestClassifier

# fit a random forest with default parameters and report train/test accuracy
clf = RandomForestClassifier().fit(X_train, y_train)
print('Accuracy of RF classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Pros:

  • Widely used, with excellent prediction performance
  • Doesn't require careful normalisation of features or parameter tuning
  • Easily parallelised across multiple CPUs

Cons:

  • The resulting models are difficult to interpret
  • Like decision trees, not a good choice for high-dimensional tasks

Key Parameters:

  • n_estimators: number of trees in the forest (default 10 in older scikit-learn versions, 100 in newer ones)
  • max_features: has a strong effect on performance; influences the diversity of trees in the forest (the default works well in practice)
  • max_depth: controls the depth of each tree
  • n_jobs: how many CPU cores to use in parallel during training (see the sketch below for all four in use)
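A minimal sketch of these parameters together, reusing the earlier X_train/y_train split; the specific values are illustrative rather than tuned recommendations:

from sklearn.ensemble import RandomForestClassifier

# illustrative values only; max_features = 4 assumes the dataset has at least 4 features
clf = RandomForestClassifier(n_estimators = 100, max_features = 4,
                             max_depth = 6, n_jobs = -1,
                             random_state = 0).fit(X_train, y_train)
print('Accuracy of RF classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))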

Gradient Boosted Decision Trees

Like random forests, gradient boosted decision trees (GBDT) use an ensemble of multiple trees to create more powerful prediction models for classification and regression. During training, GBDT builds a series of small trees (shallow trees are known as weak learners), where each tree attempts to correct the errors of the previous stage. The learning rate controls how strongly the next tree tries to correct the errors of the trees preceding it (a higher learning rate creates more complex trees).

from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = GradientBoostingClassifier(random_state = 0)
clf.fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
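A hedged sketch of how the learning rate and tree depth can be dialled down to reduce complexity (the values 0.01 and 2 are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingClassifier

# a lower learning rate and shallower trees produce a simpler, more regularised ensemble
clf = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2,
                                 random_state = 0).fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))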
Pros:

  • Best off-the-shelf accuracy on many problems
  • Using the model for prediction requires only modest memory and is fast
  • Doesn't require careful normalisation of features
  • Can handle a mix of feature types

Cons:

  • The resulting models are difficult to interpret
  • Like decision trees, not a good choice for high-dimensional tasks
  • Training can require significant computation
  • Requires careful parameter tuning, in particular of the learning rate

Neural Networks

Neural networks (multi-layer perceptrons in scikit-learn) can be used for classification or regression.

from sklearn.neural_network import MLPClassifier

# one hidden layer of 100 units
nnclf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs',
                      random_state = 0).fit(X_train, y_train)

# two hidden layers, with L2 regularisation tuned through alpha
nnclf = MLPClassifier(solver = 'lbfgs', activation = 'tanh', alpha = 5.0,
                      hidden_layer_sizes = [100, 100],
                      random_state = 0).fit(X_train, y_train)

from sklearn.neural_network import MLPRegressor

# alpha controls the L2 regularisation strength; 1.0 here is an illustrative value
mlpreg = MLPRegressor(hidden_layer_sizes = [100, 100],
                      activation = 'tanh',
                      alpha = 1.0,
                      solver = 'lbfgs').fit(X_train, y_train)
Pros:

  • Form the basis of state-of-the-art models and can be combined into advanced architectures

Cons:

  • Larger, more complex models require significant training time, data and customisation
  • Careful preprocessing of the data is needed (see the scaling sketch below)
  • A good choice when the features are of similar types, but not when they are very different
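A minimal sketch of the kind of preprocessing typically needed, reusing the earlier X_train/X_test split; MinMaxScaler is one common choice, not the only option:

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

# fit the scaler on the training data only, then apply the same transform to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

nnclf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs',
                      random_state = 0).fit(X_train_scaled, y_train)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(nnclf.score(X_test_scaled, y_test)))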

Data Leakage

Data leakage occurs when the data used to train the machine learning algorithm happens to include unexpected extra information about the very thing you're trying to predict.
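A hedged illustration of one common form of leakage, preprocessing fitted on the full dataset before splitting (the scaler is just an example transform):

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# leaky: the scaler is fit on the full dataset, so statistics from the
# test data influence how the training data is transformed
X_leaky = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state = 0)

# safer: split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)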

Resources