Sentiment analysis (or opinion mining) is an NLP technique used to determine whether data is positive, negative, or neutral. It is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback and understand customer needs.
In this project we performed sentiment analysis on a dataset of COVID tweets containing 179108 tweets with 13 features, including user name, user description, and user location. The models used include Naive Bayes, Random Forest, and Neural Networks. Feature extraction was done with two methods: Bag of Words and the Tfidf Vectorizer. The training data was created using TextBlob.
Various graphs have been plotted, such as the number of unique values in each column, the number of tweets from different locations, and the number of users in various locations.
The training data was generated by using TextBlob to classify each tweet as positive, neutral, or negative.
Text | Classification Using TextBlob |
---|---|
If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0 | Negative |
Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu | Positive |
@diane3443 @wdunlap @realDonaldTrump Trump never once claimed #COVID19 was a hoax. We all claim that this effort to… https://t.co/Jkk8vHWHb3 | Neutral |
#coronavirus #covid19 deaths continue to rise. It's almost as bad as it ever was. Politicians and businesses want… https://t.co/hXMHooXX2C | Negative |
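
The labelling step can be sketched roughly as follows. This is a minimal illustration, not the project's exact code: the function names `label_from_polarity` and `label_tweet` are hypothetical, and the thresholds (polarity above zero is positive, below zero is negative, exactly zero is neutral) are an assumption, since the cut-offs actually used are not stated here.

```python
def label_from_polarity(polarity: float) -> str:
    """Map a polarity score in [-1.0, 1.0] to a sentiment label."""
    # Assumed thresholds: the cut-offs actually used in the project are not stated.
    if polarity > 0:
        return "Positive"
    if polarity < 0:
        return "Negative"
    return "Neutral"


def label_tweet(text: str) -> str:
    """Label one tweet using TextBlob's polarity score."""
    # Imported here so label_from_polarity can be used without TextBlob installed.
    from textblob import TextBlob

    return label_from_polarity(TextBlob(text).sentiment.polarity)
```

With a pandas DataFrame, something like `df["text"].apply(label_tweet)` would produce the label column, assuming the tweet column is named `text`.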
The tweet text contains many elements that are not useful for the analysis. The cleaning steps include converting text to lowercase, removing text in square brackets, removing links, removing punctuation, removing words containing numbers, removing emojis, removing stopwords, and lemmatization. This makes feature extraction much easier.
Before Cleaning | After Cleaning |
---|---|
If I smelled the scent of hand sanitizers today on someone in the past, I would think they were so intoxicated that… https://t.co/QZvYbrOgb0 | smelled scent hand sanitizers today someone past would think intoxicated that… |
Hey @Yankees @YankeesPR and @MLB - wouldn't it have made more sense to have the players pay their respects to the A… https://t.co/1QvW0zgyPu | hey yankee yankeespr mlb wouldnt made sense player pay respect a… |
@diane3443 @wdunlap @realDonaldTrump Trump never once claimed #COVID19 was a hoax. We all claim that this effort to… https://t.co/Jkk8vHWHb3 | wdunlap realdonaldtrump trump never claimed hoax claim effort to… |
#coronavirus #covid19 deaths continue to rise. It's almost as bad as it ever was. Politicians and businesses want… https://t.co/hXMHooXX2C | coronavirus death continue rise almost bad ever politician business want… |
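
A rough, standard-library-only sketch of most of the cleaning steps is shown below. Several things here are assumptions made for illustration: the function name `clean_text`, the tiny `STOPWORDS` set (a full list such as NLTK's would be used in practice), and the emoji code-point ranges, which cover common emojis but not all of them. Lemmatization is left out because it needs an external library.

```python
import re
import string

# Tiny illustrative stopword list (assumption); use a full list in practice.
STOPWORDS = {"a", "an", "the", "i", "if", "of", "on", "in", "to", "and",
             "it", "is", "was", "they", "were", "so", "that"}

# Common emoji ranges (assumption): regional indicators, symbols, pictographs.
EMOJI_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF\U0001F300-\U0001FAFF\u2600-\u27BF]+"
)


def clean_text(text: str) -> str:
    """Apply the cleaning steps in order and return the cleaned string."""
    text = text.lower()                                # lowercase
    text = re.sub(r"\[.*?\]", "", text)                # text in square brackets
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # links
    text = EMOJI_PATTERN.sub("", text)                 # emojis
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation
    text = re.sub(r"\w*\d\w*", "", text)               # words containing numbers
    tokens = [t for t in text.split() if t not in STOPWORDS]  # stopwords
    return " ".join(tokens)


print(clean_text("Deaths [source] continue to rise https://t.co/abc123 #covid19!"))
```

Links are removed before punctuation so that a URL is dropped as a whole rather than being split into leftover fragments.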
Different models were used, from conventional ML models to Neural Networks, and it was consistently observed that the Tfidf Vectorizer was not a good feature extractor for this dataset. For each model we computed the accuracy score and the balanced accuracy score, and in some cases also performed hyperparameter tuning.
Model | Hyperparameter Tuning | Feature Extractor | Accuracy Score | Balanced Accuracy Score |
---|---|---|---|---|
Gaussian Naive Bayes | None | Bag of Words | 0.692759 | 0.636655 |
Gaussian Naive Bayes | None | Tfidf Vectorizer | 0.613757 | 0.565149 |
Random Forest | None | Bag of Words | 0.736251 | 0.676058 |
Random Forest | Randomized Search CV | Tfidf Vectorizer | 0.635475 | 0.676058 |
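
The comparison in the table can be reproduced in outline with scikit-learn, as in the sketch below. The toy texts and labels are made up for illustration and are not taken from the dataset, the helper name `evaluate` is hypothetical, and default vectorizer and model settings are assumed, so the scores it prints will not match the table.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.naive_bayes import GaussianNB

# Toy cleaned tweets and labels (made up for illustration).
train_texts = ["great news vaccine hope", "death rise bad", "case reported today",
               "happy recovery good", "terrible outbreak worse", "update number case"]
train_labels = ["Positive", "Negative", "Neutral",
                "Positive", "Negative", "Neutral"]
test_texts = ["good hope recovery", "bad death outbreak", "case update today"]
test_labels = ["Positive", "Negative", "Neutral"]


def evaluate(vectorizer):
    """Fit the vectorizer and a Gaussian Naive Bayes model, return both scores."""
    # GaussianNB needs dense arrays, so the sparse matrices are converted.
    X_train = vectorizer.fit_transform(train_texts).toarray()
    X_test = vectorizer.transform(test_texts).toarray()
    predictions = GaussianNB().fit(X_train, train_labels).predict(X_test)
    return (accuracy_score(test_labels, predictions),
            balanced_accuracy_score(test_labels, predictions))


results = {
    "Bag of Words": evaluate(CountVectorizer()),
    "Tfidf Vectorizer": evaluate(TfidfVectorizer()),
}
for name, (accuracy, balanced) in results.items():
    print(f"{name}: accuracy={accuracy:.3f}, balanced accuracy={balanced:.3f}")
```

Balanced accuracy is the mean of the per-class recalls, so it is reported alongside plain accuracy to show how the model does on less frequent classes.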
We created a Neural Network consisting of LSTM, Embedding, BatchNormalization, and densely connected (Dense) layers. The BatchNormalization and Dense layers are each used twice.
Tuning by Keras Tuner | Best Accuracy Score | Best Val_Accuracy Score |
---|---|---|
Before | 0.9312 | 0.9034 |
After | 0.9213 | 0.9002 |
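
One possible arrangement of the layers described above is sketched here with Keras. The layer order, the layer sizes, and the names `VOCAB_SIZE`, `MAX_LEN`, and `NUM_CLASSES` are all assumptions for illustration; the project's actual architecture and hyperparameters may differ.

```python
from tensorflow import keras
from tensorflow.keras import layers

VOCAB_SIZE = 10000  # assumed vocabulary size
MAX_LEN = 50        # assumed padded sequence length
NUM_CLASSES = 3     # positive, neutral, negative

model = keras.Sequential([
    keras.Input(shape=(MAX_LEN,)),
    layers.Embedding(VOCAB_SIZE, 64),   # token ids -> dense vectors
    layers.LSTM(64),                    # sequence -> fixed-size vector
    layers.BatchNormalization(),        # first BatchNormalization layer
    layers.Dense(64, activation="relu"),  # first Dense layer
    layers.BatchNormalization(),        # second BatchNormalization layer
    layers.Dense(NUM_CLASSES, activation="softmax"),  # second Dense layer (output)
])

# Integer labels (0, 1, 2) are assumed, hence the sparse loss.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```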