Sentiment analysis is a Natural Language Processing (NLP) application that classifies the emotional or sentimental tone, language, expression or point of view of a text document or corpus. Most of the time, the emotions or attitudes are labeled as positive, negative, somewhat positive or negative, mixed, and so on. Sentiment analysis can therefore help us pick up and interpret the discursive patterns found in language, in order to understand and predict how people evaluate and represent things such as customer support, an item they bought or a medication they have been taking, with applications in feedback analysis, market research, etc. In addition, classification tasks like this one can give us clues about the audience by analyzing the demographics of the users.
The main goal of this project is to explore a dataset of medication reviews by analyzing the relationship between the reviews, the ratings given by their users and the medications' popularity over time, and by testing hypotheses about the dataset distribution, among others. A second goal is to create a machine learning model that predicts the emotion or sentiment expressed in the users' reviews or comments. To that end, NLP techniques and different machine learning algorithms, such as Random Forest Classifier, Naive Bayes Classifier and Long Short-Term Memory (LSTM), were used to create different models.
- Data gathering/loading
- Data exploration (EDA)
- Text preprocessing
- Feature engineering
- Model building, evaluation and hyperparameter tuning
- Model deployment
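To make the text preprocessing and feature engineering steps above more concrete, here is a minimal sketch using only the Python standard library. The stop-word list, the sample reviews and the helper names (`preprocess`, `build_vocabulary`, `bag_of_words`) are assumptions made for illustration; the notebooks may use different techniques and libraries.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "it", "is", "my", "i", "this", "for", "but"}


def preprocess(review):
    """Lowercase a review, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def build_vocabulary(token_lists):
    """Map each distinct token to a column index, in alphabetical order."""
    vocab = sorted({t for tokens in token_lists for t in tokens})
    return {token: i for i, token in enumerate(vocab)}


def bag_of_words(tokens, vocabulary):
    """Turn a token list into a count vector aligned with the vocabulary."""
    counts = Counter(tokens)
    return [counts.get(token, 0) for token in vocabulary]


# Made-up sample reviews, not taken from the project's dataset.
reviews = [
    "This medication worked great for my headaches!",
    "Terrible side effects, it made my nausea worse.",
]
token_lists = [preprocess(r) for r in reviews]
vocabulary = build_vocabulary(token_lists)
features = [bag_of_words(tokens, vocabulary) for tokens in token_lists]
```

Each row of `features` is a numeric vector that a classifier can consume, which is the role feature engineering plays between preprocessing and model building.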
This project is organized into modules and notebooks, which are supplemented with theory, comments and code cells. The repository is divided into the modules below:
- Notebook 1, about data exploration (EDA): `notebook_1_data_exploration.ipynb`
- Notebook 2, about data preprocessing: `notebook_2_data_preprocessing.ipynb`
- Notebooks 3, 4 and 5, about feature engineering: `notebook_3_feature_engineering.ipynb`, `notebook_4_feature_engineering.ipynb` and `notebook_5_feature_engineering.ipynb`
- Notebooks 6 and 7, about Random Forest Classifier modeling and testing: `notebook_6_data_modeling_with_random_forest_classifier.ipynb` and `notebook_7_data_testing_with_random_forest_classifier.ipynb`
- Notebooks 8 and 9, about Naive Bayes Classifier modeling and testing: `notebook_8_data_modeling_with_multinomial_naive_bayes.ipynb` and `notebook_9_data_testing_with_multinomial_naive_bayes.ipynb`
- Notebook 10, about an ensemble model composed of Random Forest Classifier and Naive Bayes Classifier: `notebook_10_data_modeling_with_an_ensemble_model.ipynb`
- Notebook 11, about data modeling with Word2Vec: `notebook_11_data_modeling_with_word2vec.ipynb` (in progress)
- Notebook 12, about data modeling with Long Short-Term Memory (LSTM): `notebook_12_data_modeling_with_LSTM.ipynb` (in progress)
- Under the `models` folder, the models `mnbc_model.joblib` and `rfc_model.pkl`
- Under the `app` folder, the model `ensemble_model.pkl`, the Flask API app `app.py` in which the model is deployed, and the model testing file `reviews_test_app_with_python.ipynb`
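One common way to combine two classifiers, such as the Random Forest and Naive Bayes models listed above, is soft voting: average the class probabilities each model outputs and pick the class with the highest average. The sketch below illustrates that idea in plain Python. It is an assumption that the ensemble notebook works this way, and the function name `soft_vote`, the class labels and the probability values are all hypothetical.

```python
def soft_vote(probabilities_per_model, weights=None):
    """Average per-class probabilities across models and pick the top class.

    probabilities_per_model: list of dicts mapping class label -> probability,
    one dict per model, all sharing the same labels.
    weights: optional list with one non-negative weight per model.
    """
    if weights is None:
        weights = [1.0] * len(probabilities_per_model)
    total = sum(weights)
    labels = probabilities_per_model[0].keys()
    averaged = {
        label: sum(w * probs[label]
                   for w, probs in zip(weights, probabilities_per_model)) / total
        for label in labels
    }
    # The predicted class is the one with the highest averaged probability.
    return max(averaged, key=averaged.get), averaged


# Hypothetical outputs of the two models for a single review.
rfc_probs = {"negative": 0.2, "neutral": 0.3, "positive": 0.5}
mnb_probs = {"negative": 0.5, "neutral": 0.1, "positive": 0.4}

label, averaged = soft_vote([rfc_probs, mnb_probs])
```

With equal weights, the averaged probability for `positive` is (0.5 + 0.4) / 2 = 0.45, the highest of the three, so the ensemble predicts `positive` even though the second model alone would have predicted `negative`.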
Many challenges were faced during the execution of this project, starting with the dataset. As we know, unbalanced training data can lead a machine learning algorithm to make biased classifications. Since the dataset was not balanced, several strategies had to be employed in order to balance the training data.
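As an illustration of what balancing can look like, the sketch below applies random oversampling: examples from the smaller classes are duplicated at random until every class is as large as the biggest one. This is only one of several possible strategies and not necessarily the one used in the notebooks. The function name `random_oversample` and the sample data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(texts, labels, seed=42):
    """Duplicate minority-class examples until all classes have equal size."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible

    # Group the examples by class label.
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)

    # Every class is grown to the size of the largest class.
    target = max(len(items) for items in by_label.values())

    out_texts, out_labels = [], []
    for label, items in by_label.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


# Made-up, unbalanced sample: three positive reviews and one negative.
texts = ["worked well", "great relief", "no side effects", "made me dizzy"]
labels = ["positive", "positive", "positive", "negative"]

balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

Oversampling should be applied to the training split only, after the train/test split, so that duplicated examples do not leak into the evaluation data.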