Project developed for my master's degree in Computer Science ("Study and Research in Anti-Spam Systems").
For instructions on how to clone, build and run the project, please refer to "How To".
Machine learning library:
- Weka
Data sets:
- 2017_BASE2 (with 8, 16, 32, 64, 128, 256, 512 and 1024 features, selected with CHI2, DF and MI)
- 2017_MULT10 (with 10, 20, 30, 40, 50, 60, 70, 80, 90 and 100 features, selected with CHI2, DF and MI)
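The feature counts above suggest a ranking-based feature selection step. As an illustration only, and not necessarily how these data sets were actually generated, the sketch below shows a top-k selection with Weka's AttributeSelection filter, assuming CHI2 denotes chi-squared scoring; the ARFF file name and k = 64 are hypothetical, and ChiSquaredAttributeEval ships with Weka 3.6/3.7 (in Weka 3.8 it is available as an optional package).

```java
import weka.attributeSelection.ChiSquaredAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class Chi2SelectionSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name for the full (unselected) corpus; class is the last attribute
        Instances full = DataSource.read("2017_BASE2_full.arff");
        full.setClassIndex(full.numAttributes() - 1);

        // Rank attributes by chi-squared statistic and keep the top 64
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(64);

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new ChiSquaredAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(full);

        // Apply the filter; the class attribute is kept in addition to the selected features
        Instances reduced = Filter.useFilter(full, filter);
        System.out.println("Attributes after selection: " + reduced.numAttributes());
    }
}
```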
Classification methods:
- A1DE - Averaged 1-Dependence Estimator
- A2DE - Averaged 2-Dependence Estimator
- BFTREE - Best-first tree
- CART - Classification And Regression Trees
- DTNB - Decision Table/Naive Bayes Hybrid Classifier
- FURIA - Fuzzy Unordered Rule Induction Algorithm
- FRF - Fast Random Forest
- HP - HyperPipes Classifier
- IBK - K-Nearest Neighbours Classifier
- J48 - C4.5 Decision Tree
- J48C - C4.5 Consolidated Decision Tree
- J48G - C4.5 Grafted Decision Tree
- JRIP - Repeated Incremental Pruning to Produce Error Reduction
- LIBLINEAR - Large Linear Classifier
- LIBSVM - Support Vector Machine
- MLP - Multilayer Perceptron
- NB - Naive Bayes classifier
- NBTREE - Decision Tree with Naive Bayes Classifiers at the leaves
- RBF - Radial Basis Function network
- RT - Random Tree
- SGD - Stochastic Gradient Descent
- SMO - Sequential Minimal Optimization Algorithm
- SPEGASOS - Stochastic Primal Estimated sub-GrAdient SOlver for SVM
- WRF - Weka Random Forest
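These methods correspond to Weka classifiers (several of them distributed as optional Weka packages, e.g. FURIA, A1DE/A2DE and FastRandomForest). As a minimal sketch of how any one of them could be trained and timed, assuming the data sets are ARFF files with the class as the last attribute, the example below uses J48; the file name and the 70/30 split are hypothetical.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ClassifierSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical ARFF file; the class is assumed to be the last attribute
        Instances data = DataSource.read("2017_BASE2_CHI2_256.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Simple 70/30 train/test split, for illustration only
        data.randomize(new Random(1));
        int trainSize = (int) Math.round(data.numInstances() * 0.7);
        Instances train = new Instances(data, 0, trainSize);
        Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

        // Train a C4.5 tree (J48) and record the training time
        J48 classifier = new J48();
        long start = System.currentTimeMillis();
        classifier.buildClassifier(train);
        long trainingTime = System.currentTimeMillis() - start;

        // Evaluate on the held-out split and record the testing time
        Evaluation eval = new Evaluation(train);
        start = System.currentTimeMillis();
        eval.evaluateModel(classifier, test);
        long testingTime = System.currentTimeMillis() - start;

        System.out.println(eval.toSummaryString());
        System.out.println("Training time (ms): " + trainingTime);
        System.out.println("Testing time (ms): " + testingTime);
    }
}
```

Any other classifier from the list can be substituted for J48 in the same way, since all of them implement Weka's Classifier interface.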
Metrics:
- Precision
- Recall
- Area under the Precision-Recall Curve (PRC)
- Area under the Receiver Operating Characteristic (ROC) curve
- F1 score (also known as F-score or F-measure)
- Training time
- Testing time
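Apart from the training and testing times, all of these metrics can be read from a Weka Evaluation object. The sketch below is illustrative only: the file name, the choice of Naive Bayes, the 10-fold cross-validation, and the assumption that class index 0 is the spam class are not taken from the project itself.

```java
import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MetricsSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical file name; any of the feature-selected sets could be used here
        Instances data = DataSource.read("2017_MULT10_MI_50.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation with a simple Naive Bayes classifier
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Per-class metrics; class index 0 is assumed to be the spam class
        int spam = 0;
        System.out.println("Precision: " + eval.precision(spam));
        System.out.println("Recall:    " + eval.recall(spam));
        System.out.println("F1 score:  " + eval.fMeasure(spam));
        System.out.println("ROC area:  " + eval.areaUnderROC(spam));
        System.out.println("PRC area:  " + eval.areaUnderPRC(spam));
    }
}
```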