Project developed for my master's degree in Computer Science ("Study and Research in Anti-Spam Systems").
For instructions on how to clone, build and run the project, please refer to this guide.
Machine learning library:
Data set information:
- The data sets - Ling Spam, Spam Assassin, TREC (2005, 2006 and 2007) and Unifei (2017 and 2018) - are available here. Each was pre-processed with three feature extraction methods (CHI2, FD and MI) and eight feature vector sizes (8, 16, 32, 64, 128, 256, 512 and 1024).
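As a rough illustration of the pre-processing grid described above, the sketch below enumerates every combination of feature extraction method and feature vector size. It is plain Python written for this README, not code from the project; the names `METHODS`, `SIZES` and `configurations` are hypothetical.

```python
from itertools import product

# Values copied from the data set description above.
METHODS = ["CHI2", "FD", "MI"]
SIZES = [8, 16, 32, 64, 128, 256, 512, 1024]


def configurations():
    """Return every (method, size) pair, ordered by method and then by size."""
    return list(product(METHODS, SIZES))


if __name__ == "__main__":
    for method, size in configurations():
        print(f"{method} with {size} features")
```

Under this reading, each data set exists in 3 x 8 = 24 pre-processed variants.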
Classification methods:
- A1DE - Averaged 1-Dependence Estimator
- A2DE - Averaged 2-Dependence Estimator
- ADTREE - Alternating Decision Trees
- BFTREE - Best-First Tree
- CART - Classification And Regression Trees
- DTNB - Decision Table/Naive Bayes Hybrid Classifier
- FURIA - Fuzzy Unordered Rule Induction Algorithm
- FRF - Fast Random Forest
- HP - Hyper Pipes Classifier
- HT - Hoeffding Tree (VFDT)
- IBK - K-Nearest Neighbours Classifier
- J48 - C4.5 Decision Tree
- J48C - C4.5 Consolidated Decision Tree
- J48G - C4.5 Grafted Decision Tree
- JRIP - Repeated Incremental Pruning to Produce Error Reduction
- LIBLINEAR - Large Linear Classifier
- LIBSVM - Support Vector Machine
- LMT - Logistic Model Trees
- MLP-BFGS - Multilayer Perceptron (custom, multi-threaded, trained with BFGS)
- MLP-BPROP - Multilayer Perceptron (stock, single-threaded, trained with Backpropagation)
- NB - Naive Bayes Classifier
- NBTREE - Decision Tree with Naive Bayes Classifiers at the leaves
- RBF - Radial Basis Function network
- RANDTREE - Random Tree
- REPTREE - Reduced-Error Pruning Tree
- SGD - Stochastic Gradient Descent
- SMO - Sequential Minimal Optimization Algorithm
- SPEGASOS - Stochastic Primal Estimated sub-GrAdient SOlver for SVM
- VP - Voted Perceptron
- WRF - Weka Random Forest
- ZERO-RULE - Zero Rule Algorithm
- SLP-H - Single Layer Perceptron (Hebbian Learning) from wekaclassalgos
- SLP_WH - Single Layer Perceptron (Widrow-Hoff Learning) from wekaclassalgos
- MLP-BP - Multilayer Perceptron (Back Propagation) from wekaclassalgos
- MLP-BDBP - Multilayer Perceptron (Bold Driver Back Propagation - Vogl's Method) from wekaclassalgos
- WDL4J - WekaDeeplearning4J: Deep Learning using Weka
Metrics:
- Precision, recall and F1 score
- Area under Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves
- Training and testing times
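For readers unfamiliar with these metrics, here is a minimal sketch of how precision, recall, F1 score and the area under the ROC curve can be computed from labels and classifier scores. It is plain Python for illustration only and is not how the project itself computes them; the function names, the `"spam"`/`"ham"` labels and the sample values are all assumptions. The ROC area uses the pairwise-ranking formulation (the fraction of spam/ham pairs in which the spam message gets the higher score, with ties counted as half).

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    """Compute precision, recall and F1 for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def roc_auc(y_true, scores, positive="spam"):
    """Area under the ROC curve via pairwise comparison of scores."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Made-up labels and scores, for illustration only.
    y_true = ["spam", "spam", "spam", "ham", "ham", "ham"]
    y_pred = ["spam", "spam", "ham", "spam", "ham", "ham"]
    scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
    print(precision_recall_f1(y_true, y_pred))
    print(roc_auc(y_true, scores))
```

Training and testing times could be measured in the same spirit by wrapping the corresponding calls with `time.perf_counter()`.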
The code also supports t-Distributed Stochastic Neighbor Embedding (t-SNE) for generating two-dimensional plots of the data sets. For more information, please refer to the author's page.