This project is my entry in a Kaggle competition on network intrusion detection. I built a predictive model from the provided training dataset and submitted predictions for the test set as a CSV file with two columns (ID, Class), where ID matches the test-set row and Class holds the predicted label. Submissions were scored with the F1-score. The work was part of my 2023 master's program at the University of Ottawa, specializing in AI for Cyber Security.
- Required libraries: scikit-learn, pandas, matplotlib, and catboost (for CatBoostClassifier).
- Run the cells in a Jupyter Notebook environment.
- The uploaded code has been run successfully in Google Colab.
The task is to classify each connection as intrusive (1) or not (0).
The features include duration, protocol type, service, flags, and numerical attributes related to network activity. Together, these variables describe network behavior for intrusion detection.
- 'Class': whether the connection is intrusive (1) or not (0)
Data Loading and Exploration:
- Loaded and explored the train and test datasets.
- Checked data information, null values, duplicates, and unique values.
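The exploration checks listed above can be sketched with pandas as follows. The inline CSV and its column names are hypothetical stand-ins for the competition files, which are larger; only the calls themselves are the point.

```python
import io

import pandas as pd

# Hypothetical miniature of the training CSV (assumed column names).
TRAIN_CSV = io.StringIO(
    "ID,duration,protocol_type,service,flag,src_bytes,Class\n"
    "1,0,tcp,http,SF,181,0\n"
    "2,0,udp,domain_u,SF,,0\n"
    "3,0,tcp,private,S0,0,1\n"
    "3,0,tcp,private,S0,0,1\n"
)

train = pd.read_csv(TRAIN_CSV)

train.info()                                   # dtypes and non-null counts
null_counts = train.isnull().sum()             # missing values per column
n_duplicates = int(train.duplicated().sum())   # fully duplicated rows
n_unique = train.nunique()                     # distinct values per column

print(null_counts)
print("duplicate rows:", n_duplicates)
print(n_unique)
```

The same calls would be repeated on the test DataFrame.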
Data Cleaning:
- Handled missing values and duplicates.
- Dropped unnecessary columns ("ID", "duration").
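A minimal sketch of the cleaning step, on a hypothetical toy DataFrame. The notebook does not state here how missing values were handled, so dropping the affected rows is an assumption; imputation would be an equally plausible choice.

```python
import pandas as pd

# Hypothetical toy data: one missing value and one duplicated row.
train = pd.DataFrame({
    "ID": [1, 2, 3, 3],
    "duration": [0, 0, 0, 0],
    "protocol_type": ["tcp", "udp", "tcp", "tcp"],
    "src_bytes": [181.0, None, 0.0, 0.0],
    "Class": [0, 0, 1, 1],
})

cleaned = (
    train
    .dropna()                            # assumption: drop rows with missing values
    .drop_duplicates()                   # remove exact duplicate rows
    .drop(columns=["ID", "duration"])    # columns the project treats as unnecessary
    .reset_index(drop=True)
)

print(cleaned)
```

For the test set, the ID column would need to be set aside before dropping it, since the submission file requires it.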
Data Preprocessing:
- Separated features (X_train) and target variable (y_train).
- Applied one-hot encoding to categorical variables.
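The preprocessing above might look like the sketch below. The data and the choice of `protocol_type` as the only categorical column are illustrative assumptions. The `reindex` step is an addition for safety: it keeps the test matrix aligned with the training columns when a category appears in only one of the two sets.

```python
import pandas as pd

# Hypothetical toy data; "icmp" appears only in the training set.
train = pd.DataFrame({
    "protocol_type": ["tcp", "udp", "tcp", "icmp"],
    "src_bytes": [181, 105, 0, 30],
    "Class": [0, 0, 1, 1],
})
test = pd.DataFrame({
    "protocol_type": ["tcp", "udp"],
    "src_bytes": [12, 300],
})

# Separate features and target.
X_train = train.drop(columns=["Class"])
y_train = train["Class"]

# One-hot encode the categorical columns.
categorical_cols = ["protocol_type"]
X_train = pd.get_dummies(X_train, columns=categorical_cols)
X_test = pd.get_dummies(test, columns=categorical_cols)

# Give the test set exactly the training columns, filling absent ones with 0.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

print(X_train.columns.tolist())
```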
Model Training:
- Selected CatBoostClassifier, with hyperparameter tuning, after trying several different classifiers.
- Applied class weights to handle the imbalanced classes.
- Used soft voting with a threshold for the ensemble predictions.
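To illustrate class weighting and soft voting with a threshold, here is a sketch that deliberately uses two scikit-learn classifiers as stand-ins rather than the tuned CatBoost model from the notebook. The synthetic data, the ensemble members, and the 0.5 threshold are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the intrusion dataset.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.85, 0.15], random_state=0
)
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# class_weight="balanced" reweights classes inversely to their frequency.
models = [
    LogisticRegression(class_weight="balanced", max_iter=1000),
    RandomForestClassifier(
        n_estimators=100, class_weight="balanced", random_state=0
    ),
]
for model in models:
    model.fit(X_tr, y_tr)

# Soft voting: average each model's predicted probability of class 1.
avg_proba = np.mean([m.predict_proba(X_val)[:, 1] for m in models], axis=0)

THRESHOLD = 0.5  # hypothetical value; tune it on validation data
y_pred = (avg_proba >= THRESHOLD).astype(int)

print(y_pred[:10])
```

Lowering the threshold trades precision for recall on the intrusive class, which is why it is worth tuning against the F1-score rather than fixing it at 0.5.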
Model Evaluation and Prediction:
- Evaluated the model using the F1-score metric.
- Generated predictions for the test data.
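As a small worked example of the metric, the F1-score is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall). The labels below are made up for illustration and are not results from the competition.

```python
from sklearn.metrics import f1_score

# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

# precision = 3/4, recall = 3/4, so F1 = 0.75
score = f1_score(y_true, y_pred)
print(f"F1-score: {score:.2f}")  # F1-score: 0.75
```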
Submission File Creation:
- Formatted the predictions into a CSV file with columns (ID, Class).
- Saved the submission file as "Result of CatBoostClassifier model.csv".
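Writing the submission file could be sketched as below. The IDs and predictions are placeholder values; in the notebook they would come from the test set's ID column and the model's predicted labels.

```python
import pandas as pd

# Placeholder values for illustration only.
test_ids = [101, 102, 103]
predictions = [0, 1, 0]

submission = pd.DataFrame({"ID": test_ids, "Class": predictions})

# index=False keeps the file to exactly the two required columns.
path = "Result of CatBoostClassifier model.csv"
submission.to_csv(path, index=False)

# Read the file back as a quick sanity check of the format.
check = pd.read_csv(path)
print(check)
```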