This module covers more advanced supervised learning methods, including naive Bayes classifiers, ensembles of trees (random forests, gradient boosted trees) and neural networks.
Naive Bayes classifiers are called naive because they assume that each feature of an instance is independent of all the others, given the class. In reality features are often correlated, but this assumption makes learning and prediction highly efficient; the trade-off is that generalisation performance may be worse than that of more sophisticated learning models.
Types:
- Bernoulli: binary features e.g., word presence/absence
- Multinomial: discrete count features e.g., word counts
- Gaussian: continuous/real-valued features; the model stores each feature's mean and standard deviation per class
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
# X is the feature matrix and y the class labels, assumed already loaded
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nbclf = GaussianNB().fit(X_train, y_train)
Gaussian Naive Bayes is typically used for high-dimensional data, where each instance has hundreds, thousands or even more features. On the negative side, when the conditional independence assumption doesn't hold, i.e., when there is significant covariance among features for a given class, as is the case with many real-world datasets, more sophisticated classification methods that can account for these dependencies are likely to outperform Naive Bayes.
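The Bernoulli and multinomial variants follow the same fit/score pattern; a minimal sketch, where XB_train/XB_test (binary features) and XC_train/XC_test (count features) are hypothetical splits, not data from these notes:
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
# Bernoulli NB: 0/1 features such as word presence/absence
bnbclf = BernoulliNB().fit(XB_train, y_train)
print('Bernoulli NB test accuracy: {:.2f}'.format(bnbclf.score(XB_test, y_test)))
# Multinomial NB: non-negative count features such as word counts
mnbclf = MultinomialNB().fit(XC_train, y_train)
print('Multinomial NB test accuracy: {:.2f}'.format(mnbclf.score(XC_test, y_test)))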
Random forests are an example of an ensemble. An ensemble takes multiple individual learning models and combines them to produce an aggregate model that is more powerful than any of its individual members alone. By combining different individual models into an ensemble, we can average out their individual mistakes, reducing the risk of overfitting while maintaining strong prediction performance. Recall that decision trees have a tendency to overfit the training data, so the idea behind random forests is to build a collection of trees that each do reasonably well at prediction but are intentionally and randomly varied during construction. This variation happens in two ways: first, the data used to build each tree is sampled randomly, and second, the features considered at each split are selected randomly.
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier().fit(X_train, y_train)
print('Accuracy of RF classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of RF classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
Pros | Cons |
---|---|
* Widely used, with excellent prediction performance | * The resulting models are difficult to interpret |
* Doesn't require careful normalisation of features or parameter tuning | * Like decision trees, not a good choice for high-dimensional tasks |
* Easily parallelised across multiple CPUs | |
Key Parameters:
- n_estimators: number of trees to use in the ensemble (default 10)
- max_features: number of features considered at each split; has a strong effect on performance and influences the diversity of the trees in the forest (the default works well in practice)
- max_depth: controls the maximum depth of each tree.
- n_jobs: how many cores to use in parallel during training.
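A short sketch tying these parameters together (the values are illustrative, not tuned):
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators = 100,     # number of trees in the forest
                             max_features = 'sqrt',  # features considered at each split
                             max_depth = 4,          # limit the depth of each tree
                             n_jobs = -1,            # use all available cores
                             random_state = 0).fit(X_train, y_train)
print('Test set accuracy: {:.2f}'.format(clf.score(X_test, y_test)))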
Like random forests, gradient boosted decision trees (GBDT) use an ensemble of multiple trees to create more powerful prediction models for classification and regression. In training, GBDT builds a series of small trees (shallow trees known as weak learners), where each tree attempts to correct the errors of the previous stage. The learning rate controls how strongly each new tree tries to correct the errors of its predecessors (a high learning rate leads to more complex trees).
from sklearn.ensemble import GradientBoostingClassifier
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
clf = GradientBoostingClassifier(random_state = 0)
clf.fit(X_train, y_train)
print('Accuracy of GBDT classifier on training set: {:.2f}'.format(clf.score(X_train, y_train)))
print('Accuracy of GBDT classifier on test set: {:.2f}'.format(clf.score(X_test, y_test)))
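If the default model overfits (training accuracy much higher than test accuracy), a common adjustment is to lower learning_rate and/or max_depth; a sketch with illustrative values:
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier(learning_rate = 0.01,  # weaker correction from each new tree
                                 max_depth = 2,          # shallower weak learners
                                 n_estimators = 100,
                                 random_state = 0).fit(X_train, y_train)
print('Test set accuracy: {:.2f}'.format(clf.score(X_test, y_test)))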
Pros | Cons |
---|---|
* Best off-the-shelf accuracy on many problems | * The resulting models are difficult to interpret |
* Using the model for prediction requires only modest memory and is fast | * Like decision trees, not a good choice for high-dimensional tasks |
* Doesn't require careful normalisation of features | * Training can require significant computation |
* Can handle a mix of feature types | * Requires careful parameter tuning, in particular the learning rate |
Neural networks (multi-layer perceptrons) add one or more hidden layers of units between the input features and the output, and can be used for classification or regression.
from sklearn.neural_network import MLPClassifier
# one hidden layer with 100 units
nnclf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs', random_state = 0).fit(X_train, y_train)
# two hidden layers, with L2 regularisation tuned through alpha
nnclf = MLPClassifier(solver = 'lbfgs', activation = 'tanh', alpha = 5.0, hidden_layer_sizes = [100, 100], random_state = 0).fit(X_train, y_train)
from sklearn.neural_network import MLPRegressor
mlpreg = MLPRegressor(hidden_layer_sizes = [100, 100],
                      activation = 'tanh',
                      alpha = 1.0,        # L2 regularisation strength (illustrative value)
                      solver = 'lbfgs').fit(X_train, y_train)
Pros | Cons |
---|---|
* Form the basis of state-of-the-art models and can be formed into advanced architectures | * Larger, more complex models require significant training time, data and customisation |
 | * Careful preprocessing of the data is needed (see the scaling sketch after this table) |
 | * A good choice when the features are of similar types, but less so when they are of very different types |
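Because MLPs are sensitive to feature scaling, a common pattern is to normalise the inputs before fitting; a minimal sketch using MinMaxScaler, assuming the same X_train/X_test split as above:
from sklearn.preprocessing import MinMaxScaler
from sklearn.neural_network import MLPClassifier
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)         # apply the same transformation to the test data
clf = MLPClassifier(hidden_layer_sizes = [100], solver = 'lbfgs', random_state = 0).fit(X_train_scaled, y_train)
print('Accuracy on scaled test set: {:.2f}'.format(clf.score(X_test_scaled, y_test)))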
Data leakage occurs when the data you're using to train a machine learning algorithm happens to include unexpected extra information about the very thing you're trying to predict.
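One common way this happens in practice is fitting a preprocessing step on the full dataset before splitting, so information about the test rows leaks into training; a minimal sketch contrasting the leaky and the correct order (illustrative only):
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Leaky: the scaler is fit on all rows, so test-set statistics influence training
X_leaky = MinMaxScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_leaky, y, random_state = 0)
# Correct: split first, then fit the scaler on the training data only
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
scaler = MinMaxScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)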
Useful links:
- https://techcrunch.com/2017/04/13/neural-networks-made-easy/
- http://playground.tensorflow.org/
- https://devblogs.nvidia.com/parallelforall/deep-learning-nutshell-core-concepts/
- https://medium.com/@colin.fraser/the-treachery-of-leakage-56a2d7c4e931
- https://www.kaggle.com/c/the-icml-2013-whale-challenge-right-whale-redux/discussion/4865#25839#post25839
- http://martin.zinkevich.org/rules_of_ml/rules_of_ml.pdf