Subtheme Sentiments

The idea is to develop an approach that, given a text sample, identifies its sub-themes along with their respective sentiments.


Approach

Data Exploration

During data exploration I found that there are around 10k data points and around 90 unique labels, but most of the labels are noisy and occur at very low frequency. After some preprocessing and undersampling of the more frequently occurring labels, we end up with 23 unique labels and around 6k data points. See Data Exploration for more details.

My Approach

I treated this problem as multi-label classification and fine-tuned pre-trained BERT models on the data.
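To make the multi-label framing concrete, here is a minimal sketch of how a sample's labels can be turned into a multi-hot target vector. The label list below is a small illustrative subset (the last entry is hypothetical), and `encode_labels` is a name invented for this example; it is not necessarily how preprocess.py does it.

```python
# Illustrative label vocabulary; the real project ends up with 23 labels.
# The last label is hypothetical and only here to show a zero entry.
LABELS = [
    "ease of booking positive",
    "location positive",
    "value for money positive",
    "garage service negative",
]
LABEL_TO_INDEX = {label: i for i, label in enumerate(LABELS)}


def encode_labels(sample_labels):
    """Return a multi-hot vector with 1.0 at the index of each label present."""
    target = [0.0] * len(LABELS)
    for label in sample_labels:
        target[LABEL_TO_INDEX[label]] = 1.0
    return target


print(encode_labels(["location positive", "value for money positive"]))
# [0.0, 1.0, 1.0, 0.0]
```

Each position in the vector is an independent yes/no decision, which is why a sample can carry several sub-theme sentiments at once.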

I chose pre-trained BERT models to leverage the knowledge captured by language models. Since the data mostly consists of reviews, language models should work well on it, and they are also very easy to implement. I used Binary Cross Entropy with Logits as the loss function.
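As a rough illustration of what Binary Cross Entropy with Logits computes, the sketch below implements it in plain Python using the numerically stable form `max(x, 0) - x*y + log(1 + exp(-|x|))`. The function name is made up for this example; in the project itself the loss presumably comes from the deep learning framework (for example PyTorch's `torch.nn.BCEWithLogitsLoss`) rather than hand-written code.

```python
import math


def bce_with_logits(logits, targets):
    """Mean binary cross entropy over labels, computed from raw logits.

    Each label is treated as an independent binary decision, so the loss
    is averaged over all label positions.
    """
    losses = [
        max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
        for x, y in zip(logits, targets)
    ]
    return sum(losses) / len(losses)


# A logit of 0 means probability 0.5, so the loss is log(2) for either target.
print(bce_with_logits([0.0], [1.0]))
```

Working from logits instead of probabilities avoids taking the log of a sigmoid output that has rounded to 0 or 1.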

I tried both the bert-base-uncased and bert-large-uncased pre-trained models. bert-large-uncased performs slightly better, but due to its larger size I stick with bert-base-uncased in this project. For more details check Model Analysis. You can download the trained model from here.

Performance Metrics

Micro F1 score: calculates the metric globally by counting the total true positives, false negatives, and false positives across all labels. This is a better metric when there is class imbalance.

Macro F1 score: calculates the metric for each label and takes the unweighted mean. This does not take label imbalance into account.

https://www.kaggle.com/wiki/MeanFScore

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html

Hamming loss: the fraction of individual label predictions that are incorrect.

https://www.kaggle.com/wiki/HammingLoss
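The three metrics above can be sketched in plain Python as follows. This is only an illustration of the definitions on 0/1 label matrices; the helper names are invented here, and the convention of returning 0.0 when a label has no positives at all is an assumption. In practice a library such as scikit-learn (linked above) would normally be used.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """Return (micro_f1, macro_f1, hamming_loss) for 0/1 label matrices."""
    n_samples, n_labels = len(y_true), len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(c[k] for c in counts) for k in range(3)))
    # Macro: compute F1 per label, then take the unweighted mean.
    macro = sum(f1(*c) for c in counts) / n_labels
    # Hamming loss: wrong label decisions divided by all label decisions.
    hamming = sum(fp + fn for _, fp, fn in counts) / (n_samples * n_labels)
    return micro, macro, hamming


y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 1]]
print(multilabel_metrics(y_true, y_pred))
```

In this toy example the third label is never predicted correctly, which pulls the macro score well below the micro score. The same gap shows up in the results for rare labels.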

Results

After 5 epochs the model started overfitting. More details in Model Analysis.

| Metric | Training | Validation |
| --- | --- | --- |
| BCE Loss | 0.019 | 0.025 |
| F1-Micro-Score | 0.821 | 0.737 |
| F1-Macro-Score | 0.618 | 0.536 |
| Hamming Loss | 0.031 | 0.046 |

Shortcomings and Improvements

  1. As described in Data Exploration, labels with a frequency below 100 were combined into a single label based on their sentiment, so some labels are effectively ignored. We could improve this by oversampling those labels using combinations of co-occurring labels.
  2. Experimenting with additional layers on top of pre-trained BERT could also improve results.
  3. Tuning hyperparameters such as batch size and learning rate could improve results.
  4. I used BCE loss; other loss functions could also improve results.

Usage

Clone the repository and run the following commands from the repository directory.

Install project dependencies from requirements.txt

pip install -r requirements.txt

Preprocessing Data and Saving Train and Validation Data Pickle Files

python preprocess.py

Training, Evaluating and Saving Model

python train.py

Inference

python inference.py --text "Your Review Text"

Example -> python inference.py --text "Good prices. easy to arrange local fitting"
Output -> ['ease of booking positive', 'location positive', 'value for money positive']
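For intuition about how a list of labels like the one above can come out of a multi-label model, here is a minimal sketch: apply a sigmoid to each output logit and keep the labels whose probability clears a threshold. The function names, the 0.5 threshold, the logit values, and the fourth label are all assumptions made for illustration; inference.py may work differently.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def decode_predictions(logits, labels, threshold=0.5):
    """Keep every label whose sigmoid probability is at least `threshold`."""
    return [label for x, label in zip(logits, labels) if sigmoid(x) >= threshold]


# Hypothetical logits for four labels (the last label is made up).
labels = [
    "ease of booking positive",
    "location positive",
    "value for money positive",
    "garage service negative",
]
logits = [2.1, 0.3, 1.4, -3.0]
print(decode_predictions(logits, labels))
# ['ease of booking positive', 'location positive', 'value for money positive']
```

Because each label is thresholded independently, the output can contain zero, one, or several labels for a single review.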

Files

config.py
This file contains all the configuration for preprocessing, training, validation, and inference of the model.

preprocess.py
This file preprocesses the original data, converts it into a multi-label classification problem, and stores the train and validation data as pickle files. All the preprocessing methods are documented with comments in the file itself.

dataset.py
This file creates the custom PyTorch dataset, using the BERT tokenizer to produce all the features required by the BERT model.

dataloader.py
This file creates the batched data loaders for both the train and validation datasets.

model.py
This file creates the custom BERT model for multi-label classification. It uses the Hugging Face transformers library to load pre-trained BERT.

train.py
This file defines the training and validation functions used to train and validate the model. The evaluation metrics are also defined in this file.

validate.py
This file contains the validation function, which takes a data loader and a model and validates the model on that dataset.

utils.py
This file has some utility functions to save models, print metrics, etc.

inference.py
This file contains the inference function. We can pass in review text directly and it will predict labels using the trained BERT model.

Contributing

Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.

Please make sure to update the tests as appropriate.