The goal of this project is to create a safer environment on social media. By detecting toxic comments on social media platforms, they can be more easily reported and removed. In the long term, this would allow people to better connect with each other in this increasingly digital world.
In the dataset given, there are 150000+ comments, and each comment is labelled in 6 categories: toxic, severe_toxic, obscene, threat, insult, and identity hate. A comment labelled as 1,0,0,1,0,0 means toxic and threat, while 0,0,1,0,1,0 means obscene and insult. If all 6 categories are 0, then the comments is a "good" comment.
An overwhelming majority (89%) of the data contains good comments, so in each dataset, the number of good and bad comments are balanced. Each dataset also contains 4 files: train, valid, test, and overfit.
There are multiple datasets created for prototyping and testing, each catered towards a different model prototype:
- Data: original and cleaned dataset from kaggle
- Datasets that start with binary: for 6 binary classifiers (6 sub folders, each for a different class)
- Datasets that start with processed_data: for multi-label classification
- Datasets that start with multi: for multi-class classification
The suffixes have different meanings too:
- Datasets that end with aug: with augmented data for the 3 minority classes: severe_toxic, threat, and indentity hate
- Datasets that end with v2: same as data_aug but with more data for threats
- loose vs strict: loose has repeated comments, while strict does not
In the end, two datasets were used: processed_data_aug_v2 and binary_data^6_aug_v2 for the two different approaches in the final model
After, the data can be fed into a globe embedding layer using processing.py, processing_binary.py, or processing_binary^6.py base on the data, and they are converted into batcheed vocab vectors.
Here's a link to the google drive folder that contains all the models and vocabs. Models in folder that contains the word "Aug" are the best performed models. https://drive.google.com/drive/folders/1nSGNEDN2mS2csaRHbWHTHipvGcxatwP7?usp=sharing
Each word is tokenized, and then converted to a GloVe embedding vector. The average vector for the sentence is passed through a fully connected layer which outputs a prediction.
The main model we used were called "CNN_LSTM_2_conv.py" It has 2 Conv2d layer, 2 maxpool layers, 1 LSTM layer and 1 fully connected layer
There are three approaches we experimented with to train the model:
General dataset with 6 classes --> 1 CNN_LSTM classifier --> 6 outputs
General dataset with 6 classes --> 6 CNN_LSTM classifier --> each gives 1 outputs
6 sub-datasets each with 1 class --> 6 CNN_LSTM classifier --> each gives 1 outputs
train loss: 0.617, train acc: 0.797, test loss: 0.619, test acc 0.794
train loss: 0.612, train acc: 0.812, test loss: 0.618, test acc 0.806
train loss: 0.612,train acc: 0.900, test loss: 0.537, test acc 0.879
train loss: 0.510,train acc: 0.986, test loss: 0.528, test acc 0.953