Week 4: Supervised Machine Learning - Part 2

This module covers more advanced supervised learning methods, including Naive Bayes classifiers, ensembles of trees (random forests and gradient boosted decision trees), and neural networks.

Naive Bayes Classifiers

Naive Bayes classifiers are called naive because they assume that each feature of an instance is independent of all the others, given the class. In reality features are often correlated, but this assumption makes learning and prediction highly efficient; the trade-off is that generalisation performance may be worse than that of more sophisticated learning models.

Types:

  • Bernoulli: binary features, e.g. word presence/absence
  • Multinomial: discrete count features, e.g. word counts
  • Gaussian: continuous/real-valued features; for each feature, a mean and standard deviation are estimated per class
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

# split the data, then fit a Gaussian Naive Bayes classifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nbclf = GaussianNB().fit(X_train, y_train)

Gaussian Naive Bayes is typically used for high-dimensional data, where each instance has hundreds, thousands or even more features. On the negative side, when the conditional independence assumption doesn't hold, that is, when there is significant covariance among features for a given class, as is the case with many real-world datasets, more sophisticated classifiers that can account for these dependencies are likely to outperform Naive Bayes.
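As a minimal sketch of the other two variants, using made-up toy feature matrices purely for illustration (the counts, binary indicators and labels below are not from any real dataset):

from sklearn.naive_bayes import MultinomialNB, BernoulliNB
import numpy as np

# toy data for illustration only
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 4], [0, 2, 0]])  # e.g. word counts per document
X_binary = (X_counts > 0).astype(int)                              # e.g. word presence/absence
y_toy = np.array([0, 1, 0, 1])

mnb = MultinomialNB().fit(X_counts, y_toy)   # discrete count features
bnb = BernoulliNB().fit(X_binary, y_toy)     # binary features
print(mnb.predict(X_counts), bnb.predict(X_binary))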

Random Forests

Random Forests are an example of an ensemble. An ensemble takes multiple individual learning models and combines them to produce an aggregate model that is more powerful than any of its members alone. By combining different individual models into an ensemble, we can average out their individual mistakes to reduce the risk of overfitting while maintaining strong prediction performance. Recall that decision trees have a tendency to overfit the training data, so the idea behind random forests is to build a collection of trees that each do reasonably well at prediction but are intentionally and randomly varied during construction. This variation happens in two ways: first, the data used to build each tree is a random (bootstrap) sample of the training data; second, the features considered at each split are chosen from a random subset.

from sklearn.ensemble import RandomForestClassifier

# fit a random forest with default parameters and report train/test accuracy
clf = RandomForestClassifier().fit(X_train, y_train)
print('Accuracy of RF classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Pros:

  • Widely used, with excellent prediction performance
  • Doesn't require careful normalisation of features or parameter tuning
  • Easily parallelised across multiple CPUs

Cons:

  • The resulting models are difficult to interpret
  • Like decision trees, not a good choice for high-dimensional tasks

Key Parameters:

  • n_estimators: number of trees in the forest (default 10 in older scikit-learn versions, 100 in newer ones)
  • max_features: has a strong effect on performance; influences the diversity of trees in the forest (the default works well in practice)
  • max_depth: controls the depth of each tree
  • n_jobs: how many CPU cores to use in parallel during training (see the sketch below for all four in use)
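A minimal sketch of these parameters together, reusing the earlier X_train/y_train split; the specific values are illustrative rather than tuned recommendations:

from sklearn.ensemble import RandomForestClassifier

# illustrative values only; max_features = 4 assumes the dataset has at least 4 features
clf = RandomForestClassifier(n_estimators = 100, max_features = 4,
                             max_depth = 6, n_jobs = -1,
                             random_state = 0).fit(X_train, y_train)
print('Accuracy of RF classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))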

Gradient Boosted Decision Trees

Like random forests, gradient boosted decision trees (GBDT) use an ensemble of multiple trees to create more powerful prediction models for classification and regression. During training, GBDT builds a series of small trees (shallow trees are known as weak learners), where each tree attempts to correct the errors of the previous stage. The learning rate controls how strongly the next tree tries to correct the errors of the trees preceding it (a higher learning rate creates more complex trees).

from sklearn.ensemble import GradientBoostingClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = GradientBoostingClassifier(random_state = 0)
clf.fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
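A hedged sketch of how the learning rate and tree depth can be dialled down to reduce complexity (the values 0.01 and 2 are illustrative, not recommendations):

from sklearn.ensemble import GradientBoostingClassifier

# a lower learning rate and shallower trees produce a simpler, more regularised ensemble
clf = GradientBoostingClassifier(learning_rate = 0.01, max_depth = 2,
                                 random_state = 0).fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))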
Pros:

  • Best off-the-shelf accuracy on many problems
  • Using the model for prediction requires only modest memory and is fast
  • Doesn't require careful normalisation of features
  • Can handle a mix of feature types

Cons:

  • The resulting models are difficult to interpret
  • Like decision trees, not a good choice for high-dimensional tasks
  • Training can require significant computation
  • Requires careful parameter tuning, in particular of the learning rate

Neural Networks

Neural networks (multi-layer perceptrons in scikit-learn) can be used for classification or regression.

from sklearn.neural_network import MLPClassifier

# one hidden layer of 100 units
nnclf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs',
                      random_state = 0).fit(X_train, y_train)

# two hidden layers, with L2 regularisation tuned through alpha
nnclf = MLPClassifier(solver = 'lbfgs', activation = 'tanh', alpha = 5.0,
                      hidden_layer_sizes = [100, 100],
                      random_state = 0).fit(X_train, y_train)

from sklearn.neural_network import MLPRegressor

# alpha controls the L2 regularisation strength; 1.0 here is an illustrative value
mlpreg = MLPRegressor(hidden_layer_sizes = [100, 100],
                      activation = 'tanh',
                      alpha = 1.0,
                      solver = 'lbfgs').fit(X_train, y_train)
Pros:

  • Form the basis of state-of-the-art models and can be combined into advanced architectures

Cons:

  • Larger, more complex models require significant training time, data and customisation
  • Careful preprocessing of the data is needed (see the scaling sketch below)
  • A good choice when the features are of similar types, but not when they are very different
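A minimal sketch of the kind of preprocessing typically needed, reusing the earlier X_train/X_test split; MinMaxScaler is one common choice, not the only option:

from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier

# fit the scaler on the training data only, then apply the same transform to both splits
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

nnclf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs',
                      random_state = 0).fit(X_train_scaled, y_train)
print('Accuracy of MLP classifier on test set: {:.2f}'.format(nnclf.score(X_test_scaled, y_test)))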

Data Leakage

Data leakage occurs when the data used to train the machine learning algorithm happens to include unexpected extra information about the very thing you're trying to predict.
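A hedged illustration of one common form of leakage, preprocessing fitted on the full dataset before splitting (the scaler is just an example transform):

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# leaky: the scaler is fit on the full dataset, so statistics from the
# test data influence how the training data is transformed
X_leaky = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state = 0)

# safer: split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)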

Resources