In this exercise we will classify the "20 newsgroups" dataset using our own Naive Bayes classifier and compare it to the built-in scikit-learn version.
The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on messages posted before and after a specific date.
Implement the barebones class provided, `NaiveBayes(BaseEstimator, ClassifierMixin)`, with its `fit`, `predict` and `predict_proba` methods.
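One way the class could be structured is sketched below, assuming a multinomial event model with Laplace smoothing; the `alpha` constructor parameter and the attribute names `class_log_prior_` / `feature_log_prob_` are illustrative choices, not prescribed by the exercise.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class NaiveBayes(BaseEstimator, ClassifierMixin):
    """Multinomial naive Bayes over word-count features (sketch)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # assumed Laplace smoothing strength

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        n_classes, n_features = len(self.classes_), X.shape[1]
        self.class_log_prior_ = np.zeros(n_classes)
        self.feature_log_prob_ = np.zeros((n_classes, n_features))
        for i, c in enumerate(self.classes_):
            Xc = X[y == c]  # documents of class c (works for sparse rows too)
            # log P(c): fraction of training documents with this label
            self.class_log_prior_[i] = np.log(Xc.shape[0] / X.shape[0])
            # log P(word | c): smoothed word counts, normalised per class
            counts = np.asarray(Xc.sum(axis=0)).ravel() + self.alpha
            self.feature_log_prob_[i] = np.log(counts / counts.sum())
        return self

    def _joint_log_likelihood(self, X):
        # log P(c) + sum of count-weighted log P(word | c); X may be sparse or dense
        return np.asarray(X @ self.feature_log_prob_.T) + self.class_log_prior_

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]

    def predict_proba(self, X):
        jll = self._joint_log_likelihood(X)
        jll -= jll.max(axis=1, keepdims=True)  # stabilise before exponentiating
        probs = np.exp(jll)
        return probs / probs.sum(axis=1, keepdims=True)
```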
Steps:
- Load the train data using `fetch_20newsgroups` from `sklearn.datasets`; remove headers, footers and quotes (see the documentation).
- Use `sklearn.feature_extraction.text.CountVectorizer` to count words (`stop_words='english'`).
- Use `sklearn.pipeline.make_pipeline` to chain the vectorizer and the model (a sketch of these steps is given after this list). Note: limit the vocabulary size if you run into memory issues.
- Compare the accuracy over the test data. You can use `accuracy_score` and `classification_report`.
- Compare to the built-in `sklearn.naive_bayes.MultinomialNB`.
- Compare to `TfidfVectorizer` preprocessing (you can use the built-in model for this analysis).
- Plot the learning curve: is the model in the bias or the variance regime? (You can use the built-in model for this analysis; see the learning-curve sketch below.)
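A minimal sketch of how the loading, pipelines and comparisons could be wired together; the variable names, the `max_features=20000` cap, and the reuse of the `NaiveBayes` class sketched above are illustrative assumptions.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

remove = ('headers', 'footers', 'quotes')
train = fetch_20newsgroups(subset='train', remove=remove)
test = fetch_20newsgroups(subset='test', remove=remove)

# Our own classifier on raw word counts (vocabulary capped to limit memory use)
own_model = make_pipeline(
    CountVectorizer(stop_words='english', max_features=20000),
    NaiveBayes(),
)
own_model.fit(train.data, train.target)
own_pred = own_model.predict(test.data)
print('own NB        ', accuracy_score(test.target, own_pred))

# Built-in MultinomialNB on the same counts
sk_model = make_pipeline(
    CountVectorizer(stop_words='english', max_features=20000),
    MultinomialNB(),
)
sk_model.fit(train.data, train.target)
print('MultinomialNB ', accuracy_score(test.target, sk_model.predict(test.data)))

# TF-IDF preprocessing with the built-in model
tfidf_model = make_pipeline(
    TfidfVectorizer(stop_words='english', max_features=20000),
    MultinomialNB(),
)
tfidf_model.fit(train.data, train.target)
print('TF-IDF NB     ', accuracy_score(test.target, tfidf_model.predict(test.data)))

print(classification_report(test.target, own_pred, target_names=test.target_names))
```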
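For the last step, a learning curve can be produced with `sklearn.model_selection.learning_curve`; the sketch below assumes the `sk_model` pipeline from the snippet above and uses the built-in model, as the exercise allows. If training and validation accuracy converge to a similar, low value the model is in the bias regime; a persistent gap between the two curves points to the variance regime.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import learning_curve

sizes, train_scores, val_scores = learning_curve(
    sk_model, train.data, train.target,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=3, scoring='accuracy', n_jobs=-1,
)
plt.plot(sizes, train_scores.mean(axis=1), 'o-', label='training accuracy')
plt.plot(sizes, val_scores.mean(axis=1), 'o-', label='validation accuracy')
plt.xlabel('training set size')
plt.ylabel('accuracy')
plt.legend()
plt.show()
```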