KevinSpek/Naive-Bayes-Model-Implementation

Naive-Bayes-Implementation

Classifying Text Documents Using Multinomial Naive Bayes

In this exercise we classify the "20 newsgroups" dataset using our own naive Bayes classifier and compare it with scikit-learn's built-in version.

The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

The Task

Complete the provided skeleton class NaiveBayes(BaseEstimator, ClassifierMixin) by implementing its fit, predict and predict_proba methods.
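One possible shape for the class is sketched below. It is not the reference solution: the `alpha` smoothing parameter, the fitted attribute names and the `_joint_log_likelihood` helper are assumptions chosen for illustration. The sketch estimates log priors and Laplace-smoothed log word probabilities per class, then scores each document as the log prior plus the count-weighted sum of log word probabilities.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class NaiveBayes(BaseEstimator, ClassifierMixin):
    """Multinomial naive Bayes over word-count features (illustrative sketch)."""

    def __init__(self, alpha=1.0):
        # alpha is an assumed additive (Laplace) smoothing parameter.
        self.alpha = alpha

    def fit(self, X, y):
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        counts = np.zeros((len(self.classes_), X.shape[1]))
        priors = np.zeros(len(self.classes_))
        for i, c in enumerate(self.classes_):
            mask = y == c
            # Total count of each word over the documents of class c.
            # np.asarray(...).ravel() handles dense and scipy sparse input.
            counts[i] = np.asarray(X[mask].sum(axis=0)).ravel()
            priors[i] = mask.mean()
        smoothed = counts + self.alpha
        # log P(word | class), normalised over the vocabulary.
        self.feature_log_prob_ = np.log(smoothed) - np.log(
            smoothed.sum(axis=1, keepdims=True)
        )
        self.class_log_prior_ = np.log(priors)
        return self

    def _joint_log_likelihood(self, X):
        # log P(c) + sum_w count(w) * log P(w | c), one row per document.
        return np.asarray(X @ self.feature_log_prob_.T) + self.class_log_prior_

    def predict_proba(self, X):
        jll = self._joint_log_likelihood(X)
        # Subtract the row maximum before exponentiating to avoid underflow.
        proba = np.exp(jll - jll.max(axis=1, keepdims=True))
        return proba / proba.sum(axis=1, keepdims=True)

    def predict(self, X):
        return self.classes_[np.argmax(self._joint_log_likelihood(X), axis=1)]
```

Working in log space matters here because long documents multiply many small probabilities, which would underflow to zero if computed directly.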

Steps:

  • Load the training data with from sklearn.datasets import fetch_20newsgroups. Remove headers, footers and quotes (see the documentation).
  • Use from sklearn.feature_extraction.text import CountVectorizer to count words (stop_words='english').
  • Use sklearn.pipeline.make_pipeline to chain the vectorizer and the model.
  • Note: limit the vocabulary size if you run into memory issues.
  • Evaluate the accuracy on the test data. You can use accuracy_score and classification_report.
  • Compare with the built-in sklearn.naive_bayes.MultinomialNB.
  • Compare with TfidfVectorizer preprocessing (you can use the built-in model for this analysis).
  • Plot the learning curve: is the model in the bias or the variance regime? (You can use the built-in model for this analysis.)
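The steps above could be wired together roughly as follows. This is a sketch under assumptions, not the exercise's solution: the helper names `load_split`, `compare_vectorizers` and `nb_learning_curve` are hypothetical, the built-in MultinomialNB stands in for the custom class, and `max_features` is used as one way to cap the vocabulary size.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def load_split(subset):
    """Fetch one subset ('train' or 'test') without headers, footers, quotes."""
    data = fetch_20newsgroups(
        subset=subset, remove=("headers", "footers", "quotes")
    )
    return data.data, data.target


def compare_vectorizers(train_docs, y_train, test_docs, y_test, max_features=None):
    """Test accuracy of MultinomialNB with raw counts and with TF-IDF weights."""
    scores = {}
    for name, vectorizer in [("count", CountVectorizer), ("tfidf", TfidfVectorizer)]:
        model = make_pipeline(
            # max_features keeps only the most frequent terms (memory cap).
            vectorizer(stop_words="english", max_features=max_features),
            MultinomialNB(),
        )
        model.fit(train_docs, y_train)
        scores[name] = accuracy_score(y_test, model.predict(test_docs))
    return scores


def nb_learning_curve(docs, y, cv=5):
    """Mean train and validation accuracy at increasing training-set sizes."""
    model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
    sizes, train_scores, val_scores = learning_curve(
        model, docs, y, cv=cv, train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0]
    )
    return sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)


# Intended use on the real data (downloads the dataset on first call):
#   train_docs, y_train = load_split("train")
#   test_docs, y_test = load_split("test")
#   print(compare_vectorizers(train_docs, y_train, test_docs, y_test,
#                             max_features=50_000))
#   sizes, train_acc, val_acc = nb_learning_curve(train_docs, y_train)
```

When reading the learning curve, a large and persistent gap between training and validation accuracy points to the variance regime, while two curves that converge at a low score point to the bias regime. Swapping `MultinomialNB()` for the custom `NaiveBayes()` in the pipeline gives the side-by-side comparison the task asks for, and `classification_report` can replace `accuracy_score` when per-class detail is wanted.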
