Name | Program & Batch | Role |
---|---|---|
Shashwati | B.Tech ECE, Batch-26, IIIT NR | Team Head |
Darshan Kochar | B.Tech CSE, Batch-27, IIIT NR | Developer & Researcher |
Tejas Keshwani | B.Tech CSE, Batch-27, IIIT NR | Developer & Analyst |
This project presents a deep learning-based approach to classify cybercrime descriptions into multiple categories and subcategories, providing an efficient tool for law enforcement and cybersecurity analysts.
- Multi-Class Classification: Efficient categorization of cybercrime reports into primary categories and subcategories.
- Streamlit App Integration: A user-friendly interface for easy interaction and prediction.
- BERT-Based Fine-Tuning: Employs `bert-base-uncased` for classification tasks.
- Addressing Imbalanced Data: Implements upsampling to improve performance on minority classes.
Cybercrime datasets often exhibit class imbalances, challenging language use, and unique categorizations. This project leverages BERT (Bidirectional Encoder Representations from Transformers) to build robust models for multi-class and multi-label classification.
The fine-tuned models are hosted on Hugging Face for public access:
Darshan Kochar's Hugging Face Models
Model Name | Task | Hugging Face Link |
---|---|---|
Category Classifier | Predict primary cybercrime category | Category Classifier |
Financial Fraud Classifier | Specialized in financial fraud classification | Financial Fraud Classifier |
Women and Child Classifier | Crimes affecting women and children | Women and Child Classifier |
Other Cyber Crime Classifier | Handles all other crime categories | Other Cyber Crime Classifier |
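The table suggests a two-stage setup: the Category Classifier picks a primary category, and a specialized model then predicts the subcategory. The README does not spell out this routing, so the sketch below is an assumption. The category names, the `classify` function, and the keyword-based stand-in models are hypothetical placeholders for the fine-tuned BERT models.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-ins for the fine-tuned models; a real app would call
# the Hugging Face models listed in the table instead.
Classifier = Callable[[str], str]


def category_model(text: str) -> str:
    """Toy primary-category classifier based on keywords (illustration only)."""
    lowered = text.lower()
    if any(word in lowered for word in ("upi", "bank", "card")):
        return "financial_fraud"
    if any(word in lowered for word in ("harassment", "stalking")):
        return "women_child"
    return "other"


# One specialized subcategory model per primary category (names are assumed).
SUBCATEGORY_MODELS: Dict[str, Classifier] = {
    "financial_fraud": lambda text: "upi_fraud" if "upi" in text.lower() else "card_fraud",
    "women_child": lambda text: "cyber_stalking",
    "other": lambda text: "unclassified",
}


def classify(text: str) -> Tuple[str, str]:
    """Route a description through the category model, then its specialist."""
    category = category_model(text)
    sub_category = SUBCATEGORY_MODELS[category](text)
    return category, sub_category


if __name__ == "__main__":
    print(classify("Money was debited from my account through a UPI request"))
```

Keeping the routing in a dictionary means a new specialized model can be added without touching `classify`.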
A demonstration of the project is available at the following link:
Watch here
- Python 3.8+ (Anaconda recommended)
- Libraries: `transformers`, `torch`, `pandas`, `scikit-learn`, `numpy`
conda create -n cyberguard python=3.8
conda activate cyberguard
After activating the environment, install the required packages:
pip install -r requirements.txt
The dataset includes categories and subcategories of cybercrimes (e.g., Phishing, Identity Theft, Malware Attack). The original dataset is available on the official website. The CSV files used for training were produced by preprocessing it, but they are too large to upload here:
- `train.csv`: Training dataset
- `financial.csv`: subset of the original
- `women_child.csv`: subset of the original
- `other.csv`: subset of the original
- `test.csv`: Testing dataset
Note: Due to confidentiality, the actual dataset is not provided here. Ensure your dataset follows the necessary format before training.
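Because the dataset is not distributed, a quick header check can catch format problems before training. This is a minimal sketch: `sub_category` is the label column named in this README, while `category` and the text column name `description` are assumptions, and `missing_columns` is a hypothetical helper.

```python
import csv
import io
from typing import Iterable, List, Sequence

# Assumed column names; adjust them to match your own CSV files.
REQUIRED_COLUMNS = ("description", "category", "sub_category")


def missing_columns(
    csv_file: Iterable[str], required: Sequence[str] = REQUIRED_COLUMNS
) -> List[str]:
    """Return the required columns that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # An empty file yields an empty header.
    present = {name.strip() for name in header}
    return [name for name in required if name not in present]


if __name__ == "__main__":
    sample = io.StringIO("description,category\nLost money via fake link,Fraud\n")
    print(missing_columns(sample))
```

For a file on disk, pass an open file object instead, for example `open("train.csv", newline="", encoding="utf-8")`.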
Run the following notebooks to train the models, then launch the Streamlit app:
EDA.ipynb
Category.ipynb
ffc.ipynb
wcc.ipynb
occ.ipynb
streamlit run App.py
- Arguments:
  - `--epochs`: Number of training epochs.
  - `--batch_size`: Batch size for training.
  - `--lr`: Learning rate.
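One way a training script could expose these arguments is with `argparse`. This is a sketch only: the README does not show the actual script, so `build_parser` and the default values are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the three training arguments listed above."""
    parser = argparse.ArgumentParser(description="Train a cybercrime classifier")
    # Default values are illustrative assumptions, not the project's settings.
    parser.add_argument("--epochs", type=int, default=3, help="Number of training epochs")
    parser.add_argument("--batch_size", type=int, default=16, help="Batch size for training")
    parser.add_argument("--lr", type=float, default=2e-5, help="Learning rate")
    return parser


if __name__ == "__main__":
    # An explicit argument list keeps the demo independent of the command line.
    args = build_parser().parse_args(["--epochs", "5", "--lr", "3e-5"])
    print(f"epochs={args.epochs} batch_size={args.batch_size} lr={args.lr}")
```

In a real script, calling `parse_args()` with no argument list reads the flags from the command line.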
To execute the code using a Streamlit interface, clone the repo and run:
pip install -r requirements.txt
streamlit run x.py
This project includes techniques to handle data imbalance, particularly in the `sub_category` labels. We implement upsampling to create a balanced dataset, improving model performance on minority classes.
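A minimal sketch of random upsampling, using only the standard library: every class is resampled with replacement until it matches the size of the largest class. The README does not describe the exact procedure, so the `upsample` helper, the toy rows, and the fixed seed are assumptions.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, str]  # (description, sub_category)


def upsample(rows: List[Row], seed: int = 42) -> List[Row]:
    """Duplicate minority-class rows at random until all classes are equal in size."""
    if not rows:
        return []
    rng = random.Random(seed)
    by_label: Dict[str, List[Row]] = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    target = max(len(group) for group in by_label.values())
    balanced: List[Row] = []
    for group in by_label.values():
        balanced.extend(group)
        # Sample with replacement to fill the gap up to the majority size.
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


if __name__ == "__main__":
    toy = [("text a", "phishing")] * 5 + [("text b", "identity_theft")] * 2
    print(Counter(label for _, label in upsample(toy)))
```

Upsampling is best applied to the training split only, so that duplicated rows do not leak into the test set.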
The table below summarizes the performance of different models:
Model | Precision | Recall | F1 Score | Accuracy |
---|---|---|---|---|
Category Classifier | 0.9342 | 0.9337 | 0.9342 | 0.9337 |
Financial Fraud Classifier | 0.9296 | 0.9283 | 0.9296 | 0.9280 |
Other Cyber Crime Classifier + `all-mp-net` (from sentence_transformers) | 0.8880 | 0.8850 | 0.8880 | 0.8851 |
Women and Child Classifier | 0.9704 | 0.9806 | 0.9804 | 0.9892 |
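For readers who want to reproduce metrics of this kind, the sketch below computes accuracy along with support-weighted precision, recall, and F1 in plain Python. The README does not state which averaging scheme was used, so weighted averaging is an assumption here, and `weighted_metrics` is a hypothetical helper. Note that with weighted averaging, recall is always equal to accuracy.

```python
from collections import Counter
from typing import Dict, Sequence


def weighted_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Compute accuracy plus support-weighted precision, recall, and F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    support = Counter(y_true)
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, count in support.items():
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        p_label = tp / predicted if predicted else 0.0
        r_label = tp / count
        denom = p_label + r_label
        f_label = 2 * p_label * r_label / denom if denom else 0.0
        # Each class contributes in proportion to its share of true labels.
        weight = count / total
        precision += weight * p_label
        recall += weight * r_label
        f1 += weight * f_label
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


if __name__ == "__main__":
    print(weighted_metrics(["a", "a", "a", "b"], ["a", "a", "b", "b"]))
```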
The Other Cyber Crime Classifier performs less well than the other models, but its performance improved after we combined it with similarity search using chroma_db. We highly encourage everyone to try the user interface and test the models.
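The idea of backing a classifier with similarity search can be illustrated without chroma_db: store labelled example vectors, find the ones closest to a query by cosine similarity, and take a majority vote. This is a plain-Python stand-in, not the project's implementation. The toy vectors, the labels, and the `nearest_label` helper are hypothetical, and a real setup would use sentence embeddings stored in a vector database.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_label(
    query: Sequence[float], index: List[Tuple[Sequence[float], str]], k: int = 3
) -> str:
    """Return the majority label among the k stored vectors most similar to query."""
    ranked = sorted(index, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
    toy_index = [
        ([1.0, 0.0], "hacking"),
        ([0.9, 0.1], "hacking"),
        ([0.0, 1.0], "ransomware"),
        ([0.1, 0.9], "ransomware"),
    ]
    print(nearest_label([1.0, 0.05], toy_index))
```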
This project is licensed under the MIT License. See the `LICENSE` file for details.
- Hugging Face for their open-source transformer models.
- PyTorch for the deep learning framework.