Mobile Crowd Sensing (MCS) Data Analysis with NB and KNN Classifiers Using Different Feature Selection Methods
This repository contains Python implementations of Naïve Bayes (NB) and K-Nearest Neighbor (KNN) classifiers applied to the MCS dataset. The work explores techniques for improving machine learning performance and was carried out during the 2023 uOttawa ML course.
- Required libraries: scikit-learn, pandas, matplotlib.
- Execute cells in a Jupyter Notebook environment.
- The uploaded code has been executed and tested successfully within the Google Colab environment.
The task is to classify the legitimacy status of each record in the MCS dataset: Legitimate / Fake.
- Features include ID, Latitude, Longitude, Day, Hour, Minute, Duration, RemainingTime, Resources, Coverage, OnPeakHours, GridNumber.
- 'Legitimacy' column represents the target with two classes: 'Legitimate' and 'Fake'.
Dataset Splitting based on 'Day' Feature:
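A minimal sketch of a day-based split follows. The README does not state which days go to which split, so the `train_days` list is an assumption, and the small DataFrame is synthetic stand-in data rather than the real MCS dataset.

```python
import pandas as pd

# Tiny synthetic stand-in for the MCS data; only the columns needed here.
df = pd.DataFrame({
    "Day": [0, 0, 1, 1, 2, 2, 3, 3],
    "Hour": [8, 17, 9, 18, 7, 20, 10, 16],
    "Legitimacy": ["Legitimate", "Fake", "Legitimate", "Legitimate",
                   "Fake", "Legitimate", "Legitimate", "Fake"],
})

# Assumption: days 0-2 form the training set and all other days the test set.
train_days = [0, 1, 2]
train_mask = df["Day"].isin(train_days)
train_df, test_df = df[train_mask], df[~train_mask]

# Separate the features from the target column.
X_train, y_train = train_df.drop(columns="Legitimacy"), train_df["Legitimacy"]
X_test, y_test = test_df.drop(columns="Legitimacy"), test_df["Legitimacy"]

print(len(train_df), len(test_df))
```

Splitting on whole days keeps every record from a given day in a single split, which a random row-level split would not guarantee.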
Baseline Performance of NB and KNN:
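As a rough illustration of a baseline run, the sketch below fits both classifiers on all features and reports the test F1 score. It uses synthetic data from `make_classification` in place of the MCS dataset, and the standardization step, the random split, and `n_neighbors=5` are assumptions, not settings taken from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MCS features (assumption: not the real dataset).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Standardize using statistics from the training split only.
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

models = {
    "BernoulliNB": BernoulliNB(),  # binarizes each feature at 0 by default
    "KNN": KNeighborsClassifier(n_neighbors=5),
}

baseline_f1 = {}
for name, model in models.items():
    model.fit(X_train_s, y_train)
    baseline_f1[name] = f1_score(y_test, model.predict(X_test_s))
    print(f"{name}: F1 = {baseline_f1[name]:.4f}")
```

These scores serve only as a reference point for the later dimensionality reduction and feature selection steps.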
Dimensionality Reduction (DR) using PCA and Autoencoder (AE):
- Explored PCA and AE to find the optimal number of reduced dimensions, judged by F1 score on the test dataset.
- Plotted the number of components against the F1 score for both classifiers to show where each performs best.
- Maximum F1 score, PCA with Bernoulli Naive Bayes: 93.32%
- Maximum F1 score, PCA with K-Nearest Neighbors: 94.81%
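The PCA part of the search described above can be sketched as a sweep over the number of components, recording the test F1 score for each classifier. The data here is synthetic, the scaling step and default hyperparameters are assumptions, and the autoencoder variant is not shown.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the MCS features (assumption: not the real dataset).
X, y = make_classification(n_samples=400, n_features=12, n_informative=6,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

model_classes = {"BernoulliNB": BernoulliNB, "KNN": KNeighborsClassifier}
component_range = list(range(1, X_train_s.shape[1] + 1))
pca_f1 = {name: [] for name in model_classes}

for k in component_range:
    # Fit PCA on the training split only, then project both splits.
    pca = PCA(n_components=k).fit(X_train_s)
    Z_train, Z_test = pca.transform(X_train_s), pca.transform(X_test_s)
    for name, model_cls in model_classes.items():
        clf = model_cls().fit(Z_train, y_train)
        pca_f1[name].append(f1_score(y_test, clf.predict(Z_test)))

# Best number of components and its F1 score, per classifier.
best = {name: (component_range[int(np.argmax(scores))], max(scores))
        for name, scores in pca_f1.items()}
print(best)
```

Plotting `component_range` against `pca_f1[name]` with matplotlib gives the components-versus-F1 curve for each classifier.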
Feature Selection with Filter and Wrapper Methods:
- Explored filter methods (Information Gain, Mutual Information, Variance Threshold, and Chi-Square) to determine the optimal number of features, and analyzed how the F1 score changes with the number of features. This improved on the baseline performance.
- Employed wrapper methods (Forward Feature Selection, Backward Feature Elimination, and Recursive Feature Elimination) to evaluate feature relevance, and investigated how the F1 score changes with the number of features. This also improved on the baseline performance.
- Visualized the training and test datasets as 2D t-SNE plots using the best-performing selection method.
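For the filter methods listed above, one possible sketch uses scikit-learn's `SelectKBest` with mutual information and chi-square scores, sweeping the number of kept features and recording the KNN test F1 score. The synthetic data, the min-max scaling, and the variance cutoff of 0.001 are assumptions made for illustration.

```python
from functools import partial

from sklearn.datasets import make_classification
from sklearn.feature_selection import (SelectKBest, VarianceThreshold, chi2,
                                       mutual_info_classif)
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the MCS features (assumption: not the real dataset).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# chi2 requires non-negative inputs, so scale the training features to [0, 1].
scaler = MinMaxScaler().fit(X_train)
X_train_mm, X_test_mm = scaler.transform(X_train), scaler.transform(X_test)

score_funcs = {
    "mutual_info": partial(mutual_info_classif, random_state=0),
    "chi2": chi2,
}

n_features = X_train_mm.shape[1]
filter_f1 = {name: [] for name in score_funcs}
for name, score_func in score_funcs.items():
    for k in range(1, n_features + 1):
        # Score features on the training split and keep the top k.
        selector = SelectKBest(score_func=score_func, k=k).fit(X_train_mm, y_train)
        clf = KNeighborsClassifier().fit(selector.transform(X_train_mm), y_train)
        y_pred = clf.predict(selector.transform(X_test_mm))
        filter_f1[name].append(f1_score(y_test, y_pred))

# Variance Threshold drops features whose variance is below a cutoff.
vt = VarianceThreshold(threshold=0.001).fit(X_train_mm)
kept_by_variance = int(vt.get_support().sum())
print(filter_f1, kept_by_variance)
```

Each list in `filter_f1` holds one F1 score per feature count, so its maximum indicates the feature count that worked best for that scoring function.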
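The wrapper methods can be sketched with scikit-learn's `SequentialFeatureSelector` (forward and backward directions) and `RFE`. RFE needs an estimator that exposes `coef_` or `feature_importances_`, which KNN does not, so a decision tree is swapped in for that step; that choice, the synthetic data, and `n_keep = 4` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SequentialFeatureSelector
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the MCS features (assumption: not the real dataset).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

n_keep = 4  # hypothetical number of features to keep
knn = KNeighborsClassifier()

selectors = {
    "forward": SequentialFeatureSelector(
        knn, n_features_to_select=n_keep, direction="forward",
        scoring="f1", cv=3),
    "backward": SequentialFeatureSelector(
        knn, n_features_to_select=n_keep, direction="backward",
        scoring="f1", cv=3),
    # Decision tree used here because RFE needs feature importances.
    "rfe": RFE(DecisionTreeClassifier(random_state=0),
               n_features_to_select=n_keep),
}

selected, wrapper_f1 = {}, {}
for name, selector in selectors.items():
    selector.fit(X_train, y_train)
    selected[name] = selector.get_support()  # boolean mask of kept features
    clf = KNeighborsClassifier().fit(selector.transform(X_train), y_train)
    wrapper_f1[name] = f1_score(y_test, clf.predict(selector.transform(X_test)))

print(wrapper_f1)
```

Repeating the loop over a range of `n_keep` values would give the F1 score as a function of the number of features, as described for the wrapper methods.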
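A 2D t-SNE view of the training and test sets could be produced along these lines. The data is synthetic and stands in for the already-selected features, and the perplexity and PCA initialization are assumed values. scikit-learn's `TSNE` has no `transform` method, so each split is embedded independently here.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the selected MCS features (assumption).
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)


def embed_2d(features):
    """Project features to 2D with t-SNE; fitted separately per split."""
    tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
    return tsne.fit_transform(features)


emb_train, emb_test = embed_2d(X_train), embed_2d(X_test)

# One panel per split, points colored by class label.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
panels = [(emb_train, y_train, "Training set"), (emb_test, y_test, "Test set")]
for ax, (emb, labels, title) in zip(axes, panels):
    ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=10)
    ax.set_title(title)
    ax.set_xlabel("t-SNE 1")
    ax.set_ylabel("t-SNE 2")
fig.tight_layout()
```

Because the two embeddings are fitted separately, their axes are not directly comparable; each panel should be read on its own.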
Clustering Analysis using Latitude and Longitude:
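The README does not say which clustering algorithm was used, so the following is only an illustrative sketch using k-means on two coordinate columns. The coordinates are made up, and `n_clusters=3` is an arbitrary assumption.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (latitude, longitude) points around three hypothetical centers.
rng = np.random.default_rng(0)
centers = np.array([[45.40, -75.70], [45.45, -75.60], [45.35, -75.75]])
coords = np.vstack([c + rng.normal(scale=0.005, size=(50, 2)) for c in centers])

# Assumption: k-means with three clusters; the notebook may use another method.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(coords)
cluster_labels = kmeans.labels_

print(np.bincount(cluster_labels))
print(kmeans.cluster_centers_)
```

With real data, the Latitude and Longitude columns of the DataFrame would replace `coords`.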