This repository contains code and documentation for performing sentiment analysis on a Bengali text dataset using traditional machine learning models (Logistic Regression, Random Forest, etc.) and fine-tuning the Dolphin 2.9.4 Llama 3.1 (8B) model for Bengali text processing. The project also documents challenges faced with data preprocessing, NLP toolkit selection, model performance, and hardware limitations.
- Project Overview
- Key Features
- Dataset
- Preprocessing
- Vectorization
- Used Models & Results
- Fine-Tuning Llama 3.1
- Challenges and Limitations
- Future Improvements
## Project Overview

This project aims to:
- Classify sentiments (positive, negative, neutral) of Bengali text data.
- Evaluate traditional ML models and deep learning architectures like LSTM for sentiment classification.
- Fine-tune large language models, such as Llama 3.1 (8B), despite hardware limitations, using alternatives like Dolphin 2.9.4 Llama 3.1 (8B).
- Provide a comprehensive analysis of limitations in NLP toolkit selection, preprocessing issues, and hardware constraints, as well as a detailed comparison of model performance.
## Key Features

- Preprocessing of Bengali text using BNLP's `CleanText` module and `NLTKTokenizer`.
- Traditional ML model evaluation with Logistic Regression, Random Forest, XGBoost, and LightGBM.
- Deep Learning model implementation using LSTM for sentiment analysis.
- Fine-tuning Dolphin 2.9.4 Llama 3.1 (8B) on limited hardware (4GB VRAM).
- Comprehensive evaluation using Stratified K-Fold and K-Fold cross-validation techniques.
- Extensive documentation of challenges and limitations, particularly with hardware and NLP toolkit selection.
## Dataset

- Source: The dataset consists of an Excel file containing Bengali text and sentiment labels (positive, negative, neutral). The data is processed using BNLP for text tokenization and cleaning.
- Loading Method: Utilized `pandas.read_excel()` to load the dataset.
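A minimal loading sketch is shown below; the file name and column names are placeholders, since the actual paths and headers in the repository are not listed here.

```python
import pandas as pd

# Hypothetical file and column names; adjust to match the actual Excel file.
df = pd.read_excel("bengali_sentiment.xlsx")
texts = df["text"].astype(str)
labels = df["sentiment"]  # positive / negative / neutral
print(df.shape)
```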
## Preprocessing

- Text Cleaning: Used BNLP’s `CleanText` module with custom parameters (`fix_unicode=True`, `unicode_norm=True`) to handle Unicode errors and clean the text output effectively.
- Unwanted Strings Removal: Removed the string "See Translation" and reduced duplicate punctuation marks such as ‘।’ (Bengali full stop), ‘,’ (comma), ‘?’ (question mark), and ‘…’ (ellipsis) using `re.sub()`.
- Sentence Tokenization: Used BNLP's `NLTKTokenizer` for tokenizing text at the sentence level. This was necessary because `word_tokenize` removed punctuation, which was crucial for sentiment analysis.
- Stemmer Issue: Initially employed stemmers from `banglanltk`, but they truncated words undesirably (e.g., ‘আমাকে’ to ‘আমা’), leading to loss of meaning. Consequently, stemming was excluded from the preprocessing pipeline.
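The sketch below illustrates these preprocessing steps, assuming the `CleanText` and `NLTKTokenizer` interfaces documented for `bnlp_toolkit`; the regular expression and sample sentence are illustrative only.

```python
import re
from bnlp import CleanText, NLTKTokenizer

# CleanText configured with the parameters mentioned above;
# all other options keep their defaults.
cleaner = CleanText(fix_unicode=True, unicode_norm=True)
tokenizer = NLTKTokenizer()

def preprocess(text):
    text = cleaner(text)
    # Remove the unwanted "See Translation" string.
    text = text.replace("See Translation", "")
    # Collapse repeated punctuation (।, ,, ?, …) into a single character.
    text = re.sub(r"([।,?…])\1+", r"\1", text)
    # Sentence-level tokenization keeps punctuation, unlike word_tokenize.
    return tokenizer.sentence_tokenize(text)

print(preprocess("খুব ভালো লাগলো।। See Translation"))
```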
## Vectorization

- Used `TfidfVectorizer` to transform the cleaned text into TF-IDF feature vectors. Converted the sparse matrix to a dense format with `.toarray()` to facilitate model training.
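A short sketch of this step, using toy documents in place of the actual cleaned dataset:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents standing in for the cleaned Bengali sentences.
cleaned_texts = ["খুব ভালো লাগলো", "একদম ভালো লাগেনি", "মোটামুটি ছিল"]

vectorizer = TfidfVectorizer()
X_sparse = vectorizer.fit_transform(cleaned_texts)  # sparse TF-IDF matrix
X_dense = X_sparse.toarray()                        # dense format for model training
print(X_dense.shape)
```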
## Used Models & Results

| Model | Accuracy (K-Fold) | Accuracy (Stratified K-Fold) | Best Parameters |
|---|---|---|---|
| Logistic Regression | 75% | 75% | `{'C': 10, 'solver': 'liblinear'}` |
| Multinomial Naive Bayes | 65% | 65% | Default |
| Random Forest Classifier | 70% | 75% | `{'max_depth': 10, 'n_estimators': 50}` |
| XGBoost | 60% | 60% | `{'learning_rate': 0.2, 'n_estimators': 100}` |
| LightGBM | 50% | 55% | `{'learning_rate': 0.1, 'n_estimators': 100}` |
| LSTM | 55% | 55% | Default |
- Best Performance: Logistic Regression and Random Forest were the top performers with 75% accuracy, indicating that simpler models combined with effective text vectorization can be highly effective for sentiment analysis.
- LSTM Performance: The LSTM model exhibited poor performance (55% accuracy). The small dataset size and high variance likely contributed to its underperformance.
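A hedged sketch of the evaluation setup follows. It assumes `X_dense` holds the TF-IDF features for the full dataset and `labels` the corresponding sentiment labels (as in the sketches above); the parameter grid simply includes the best values reported in the table, since the exact grids used in the project are not documented here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Illustrative grid containing the reported best parameters (C=10, liblinear).
param_grid = {"C": [0.1, 1, 10], "solver": ["liblinear", "lbfgs"]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      cv=cv, scoring="accuracy")
search.fit(X_dense, labels)
print(search.best_params_, round(search.best_score_, 2))
```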
## Fine-Tuning Llama 3.1

- Licensing Restrictions: Faced difficulties accessing Meta's Llama 3.1 models due to licensing requirements; access was requested but approval was still pending.
- Model Choice: As an alternative, the Dolphin 2.9.4 Llama 3.1 (8B) model from Hugging Face was used for fine-tuning. It is based on Meta's Llama 3.1 8B and has 8.03 billion parameters.
- Context Length: 128K
- Training Sequence Length: 8192
- Prompt Template: ChatML
- Training Hyperparameters:
- Learning rate: 5e-06
- Batch size (training/evaluation): 2
- Epochs: 3
- Gradient accumulation steps: 4
- Optimizer: Adam
- Hardware Limitation: The project was conducted on a GTX 1050Ti with 4GB VRAM, which proved insufficient for fine-tuning large models like Llama 3.1.
- Workarounds: Forced the use of CPU (`no_cuda=True`), but kernel crashes occurred frequently despite adjustments to batch size, gradient accumulation, and mixed precision (`fp16=True`).
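A minimal configuration sketch of the hyperparameters and workarounds listed above, using Hugging Face `TrainingArguments`; the output directory and optimizer variant are assumptions, and `fp16` is shown commented out because mixed precision requires a CUDA device.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="dolphin-llama31-bengali",  # placeholder output directory
    learning_rate=5e-6,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    num_train_epochs=3,
    gradient_accumulation_steps=4,
    optim="adamw_torch",  # Adam-family optimizer (assumed variant)
    no_cuda=True,         # force CPU when the 4GB GPU runs out of memory
    # fp16=True,          # mixed precision was attempted, but it requires a CUDA device
)
```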
## Challenges and Limitations

- Ensuring that important parts of Bengali text, especially punctuation, were preserved during preprocessing.
- Finding a reliable NLP library for Bengali text processing, which led to the exploration of both BNLP and `banglanltk`.
- Data Format: Conversion of the TF-IDF vectorized data to a dense format suitable for the LSTM was required.
- Dataset Size: The limited dataset size (99 rows) led to overfitting and hampered the model's ability to generalize.
- Stemming Issues: The Bangla stemmer truncated important parts of the text, so stemming was excluded.
- Punctuation Loss: Initial tokenization attempts using `word_tokenize` removed punctuation, which affected sentiment classification. Resolved by using `sentence_tokenize`.
- Sparse Matrix Conversion: The output from TF-IDF vectorization was a sparse matrix, which had to be converted to a dense format for model compatibility.
- GPU VRAM Limitation: 4GB VRAM was insufficient to fine-tune the Llama 3.1 model, even with optimizations like smaller batch sizes and gradient accumulation.
- CPU Use: Forced CPU use due to GPU constraints, leading to frequent kernel crashes despite adjustments.
- Best Performing Models: Logistic Regression and Random Forest, both achieving 75% accuracy with the Stratified K-Fold cross-validation method.
- Worst Performing Models: LSTM and LightGBM performed poorly with an accuracy of 55%.
- Due to hardware limitations, fine-tuning the Dolphin 2.9.4 Llama 3.1 model was unsuccessful, with frequent kernel crashes.
## Requirements

- Python 3.8+
- Required libraries (install via `requirements.txt`):
  - bnlp_toolkit (imported as `bnlp`)
  - nltk
  - banglanltk
  - huggingface_hub
  - transformers
  - numpy
  - pandas
  - matplotlib
  - seaborn
  - tensorflow
  - torch
  - scikit-learn
  - xgboost
  - lightgbm
## Installation

- Clone the repository:
  `git clone https://github.com/yourusername/Sentiment-Analysis-ML_Fine-Tune-Llama-3.1.git`
- Install dependencies:
  `pip install -r requirements.txt`
## Future Improvements

- Data Augmentation: Augment the dataset with more samples to improve the performance of the LSTM and other deep learning models.
- Hardware Upgrades: Utilize cloud-based GPU services (e.g., AWS SageMaker, Google Colab, Kaggle) for fine-tuning large models.
- Explore Lightweight Approaches: Investigate parameter-efficient fine-tuning techniques such as QLoRA, or 4-bit quantized models, for resource-constrained systems (see the sketch after this list).
- NLP Toolkit Enhancements: Continue refining the preprocessing pipeline, especially for tasks like stemming and tokenization in Bengali.
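As a starting point for the QLoRA / 4-bit direction mentioned above, here is a hedged sketch using `bitsandbytes` quantization with a LoRA adapter. The model ID is illustrative and should be checked against the actual Hugging Face repository, and even a 4-bit 8B model may still exceed 4GB of VRAM.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "cognitivecomputations/dolphin-2.9.4-llama3.1-8b"  # illustrative repo id

# Load the base model in 4-bit (NF4) to reduce memory usage.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach a small LoRA adapter so only a fraction of parameters are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```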
## License

This project is licensed under the MIT License. See the LICENSE file for more details.