Sentiment Analysis NLP

Using NLP, this project gauges customer sentiment in online reviews, with support for customization and real-time feedback. It applies TF-IDF, Bag-of-Words (BOW), and LDA text representations together with ML models to the train.csv dataset to support e-commerce decisions. It was completed for an NLP course at uOttawa in 2023.

  • Required libraries: scikit-learn, pandas, matplotlib.
  • Execute cells in a Jupyter Notebook environment.
  • The uploaded code has been executed and tested successfully within the Google Colab environment.

Supervised Sentiment Analysis as a Text Classification Problem

Perform supervised sentiment analysis to categorize user sentiments into three classes: Positive, Negative, and Neutral.

Independent Variables:

  • 'name': Name of the product.
  • 'brand': Brand of the product.
  • 'categories': Categories associated with the product.
  • 'primaryCategories': Primary category of the product.
  • 'reviews.date': Date of the review.
  • 'reviews.text': Text content of the review.
  • 'reviews.title': Title of the review.

Target variable:

  • 'sentiment': Dependent variable indicating the sentiment (Positive, Negative, Neutral) of the review.

Key Tasks Undertaken

  1. Data Exploration:

    • The most common keywords and their counts.

    • The most common Positive words, shown with WordCloud.

    • The most common Negative words, shown with WordCloud.

    • The most common Neutral words, shown with WordCloud.

  2. Data Preparation:

    • Data Cleaning

      • Handling Missing Data: Fewer than 0.1% of cells are missing (10 values in reviews.title), so the affected rows can be dropped or the values imputed, depending on the context.
      • Handling Duplicate Rows: The dataset has 1.5% duplicate rows, which can be removed to ensure data integrity.
    • Renaming and Dropping Columns: Renamed the columns 'reviews.text', 'reviews.title', and 'reviews.date' to 'reviews_text', 'reviews_title', and 'reviews_date', respectively. Also dropped the columns 'name', 'brand', 'categories', 'primaryCategories', and the review date column from the dataset.

    • Sentiment Label Encoding: Created a mapping dictionary for sentiment labels and encoded the 'sentiment' column into numerical form (1 for 'Positive,' -1 for 'Negative,' and 0 for 'Neutral').

    • Creating a 'Polarity Scores' Column: Applied the SentimentIntensityAnalyzer to the 'reviews_text' column to calculate a polarity score for each review. The polarity score represents the sentiment of the text as a continuous value between -1 (negative) and 1 (positive).

    • Balancing Data: The classes are imbalanced, so a resampling technique such as SMOTE can be applied to balance them.
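
The cleaning, renaming, and label-encoding steps above can be sketched with pandas as follows. This is a minimal illustration, not the notebook's actual code: the tiny DataFrame is a made-up stand-in for train.csv, and dropping (rather than imputing) rows with a missing title is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for train.csv; the real data is not reproduced here.
df = pd.DataFrame({
    "reviews.text": ["Great tablet", "Great tablet", "Stopped working",
                     "It is okay", "Does the job"],
    "reviews.title": ["Love it", "Love it", "Bad", None, "Fine"],
    "reviews.date": ["2023-01-01"] * 5,
    "sentiment": ["Positive", "Positive", "Negative", "Neutral", "Neutral"],
})

# Remove exact duplicate rows, then rows whose title is missing
# (assumption: dropping is acceptable because so few values are missing).
df = df.drop_duplicates().dropna(subset=["reviews.title"])

# Rename the dotted column names to underscore form.
df = df.rename(columns={
    "reviews.text": "reviews_text",
    "reviews.title": "reviews_title",
    "reviews.date": "reviews_date",
})

# Encode labels as described: Positive -> 1, Neutral -> 0, Negative -> -1.
sentiment_map = {"Positive": 1, "Neutral": 0, "Negative": -1}
df["sentiment"] = df["sentiment"].map(sentiment_map)

print(df[["reviews_text", "sentiment"]])
```

Any label not present in the mapping dictionary would become NaN after `map`, so it is worth checking for missing values in 'sentiment' after encoding.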

  3. Text Feature Engineering:

    • Case Folding: Convert all text to lowercase so that comparisons between words are consistent.
    • Removing Punctuation: Eliminate special characters and punctuation marks from the text to avoid any interference in analysis.
    • Removing Numbers: Exclude numerical digits from the text as they may not be relevant for certain tasks like sentiment analysis.
    • Removing Stopwords: Remove common words that do not carry much meaning (e.g., "the," "and," "is") using an English stopword list.
    • Removing Rare Words: Eliminate words that appear infrequently in the dataset, as they may not contribute significantly to the analysis.
    • Lemmatization: Convert words to their base or root form (lemmas) to reduce inflected words to a common base form. For example, "running," "runs," and "ran" will all be transformed to "run."
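
The normalization steps above (except lemmatization, which needs an external lexicon) can be sketched with the standard library alone. The stopword set, the `clean` and `remove_rare` helpers, and the `min_count` threshold are all illustrative assumptions, not the project's actual settings.

```python
import re
from collections import Counter

# Small illustrative stopword set; a real run would use a full English list.
STOPWORDS = {"the", "and", "is", "a", "it", "this", "to", "of"}

def clean(text):
    """Lowercase, strip punctuation and digits, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation -> space
    text = re.sub(r"\d+", " ", text)      # digits -> space
    return [tok for tok in text.split() if tok not in STOPWORDS]

def remove_rare(token_lists, min_count=2):
    """Drop tokens that occur fewer than min_count times in the corpus."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return [[t for t in toks if counts[t] >= min_count] for toks in token_lists]

reviews = [
    "The battery is GREAT, and the screen is great!",
    "Battery died after 2 days.",
    "Great screen, great price.",
]
tokens = remove_rare([clean(r) for r in reviews])
print(tokens)
```

Rare-word removal has to run after every review is tokenized, because the counts are taken over the whole corpus rather than per review.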
  4. Text Transformations:

    • Bag-of-Words (BOW): Represent each document as a vector over the vocabulary, ignoring word order. Unlike plain term frequency (TF), the binary form used here ignores how often a word occurs and records only whether it appears or not.

    • Term Frequency-Inverse Document Frequency (TF-IDF): Weight each word by its frequency within a document, scaled down by the number of documents in the corpus that contain it, so that words common to most reviews carry less weight.

    • Latent Dirichlet Allocation (LDA): Perform topic modeling to extract latent topics from the text data. Each document is represented as a mixture of topics.
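
A minimal sketch of the three representations using scikit-learn, one of the required libraries. The four sample documents and `n_components=2` are assumptions chosen only to keep the example small.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical, already-cleaned review texts.
docs = [
    "great battery great screen",
    "battery died quickly",
    "great price great screen",
    "screen cracked quickly",
]

# Binary bag-of-words: 1 if the word occurs in the document, else 0.
bow = CountVectorizer(binary=True)
X_bow = bow.fit_transform(docs)

# TF-IDF: term frequency scaled down for words found in many documents.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

# LDA is fitted on raw word counts; each row of X_lda is a topic mixture.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_lda = lda.fit_transform(counts)

print(X_bow.shape, X_tfidf.shape, X_lda.shape)
```

BOW and TF-IDF produce one column per vocabulary word, while LDA produces one column per topic, so the LDA features are far lower-dimensional.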

  5. Modeling

    • Classification (Random Forest, SVM, Logistic Regression, Gaussian Naive Bayes)

      • BOW Technique

      • TF-IDF Technique

      • LDA Technique

    • Clustering (K-Means, Hierarchical)

      • BOW Technique
            Silhouette Score (K-Means): 81.55401438608376
            Silhouette Score (Hierarchical): 17.925024032592773

      • TF-IDF Technique
            Silhouette Score (K-Means): 0.7683612431807604
            Silhouette Score (Hierarchical): 17.966507375240326

      • LDA Technique
            Silhouette Score (K-Means): 81.55401438608376
            Silhouette Score (Hierarchical): 16.194509
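
For reference, silhouette scores like those listed can be computed as in the sketch below. scikit-learn's `silhouette_score` returns a value in [-1, 1], so the listed figures above 1 were presumably scaled to percentages; that reading is an assumption. The synthetic matrix `X` stands in for the real document features, and three clusters are assumed to mirror the three sentiment classes.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for a document-feature matrix: three separated groups.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(20, 5)),
    rng.normal(loc=3.0, scale=0.1, size=(20, 5)),
    rng.normal(loc=6.0, scale=0.1, size=(20, 5)),
])

# Cluster with K-Means and with hierarchical (agglomerative) clustering.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

# Silhouette score in [-1, 1]; higher means tighter, better-separated clusters.
kmeans_score = silhouette_score(X, kmeans_labels)
hier_score = silhouette_score(X, hier_labels)

print(f"Silhouette (K-Means): {kmeans_score:.3f} ({kmeans_score * 100:.1f}%)")
print(f"Silhouette (Hierarchical): {hier_score:.3f} ({hier_score * 100:.1f}%)")
```

Reporting every score on the same scale, either raw or as a percentage, makes the BOW, TF-IDF, and LDA results directly comparable.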

  6. Evaluations

  7. Champion Model
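
The source does not state how the classifiers were evaluated or how the champion model was chosen. One plausible approach, sketched here under assumptions, is to cross-validate each of the four listed classifiers on TF-IDF features and keep the one with the best mean score. The sample texts, the macro-F1 metric, the 3-fold split, and the `to_dense` helper are all illustrative choices.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# Hypothetical reviews with encoded labels (1 positive, -1 negative, 0 neutral).
texts = [
    "great tablet love it", "excellent screen great price",
    "love the battery life", "great value works great",
    "terrible battery died fast", "screen cracked awful quality",
    "stopped working terrible", "awful slow and broken",
    "it is okay", "average tablet nothing special",
    "works as expected", "okay for the price",
]
labels = [1, 1, 1, 1, -1, -1, -1, -1, 0, 0, 0, 0]

def to_dense(X):
    """GaussianNB needs a dense array, so densify the sparse TF-IDF matrix."""
    return X.toarray()

models = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "SVM": SVC(kernel="linear"),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}

scores = {}
for name, model in models.items():
    # The vectorizer sits inside the pipeline so it is refit on each fold.
    pipe = make_pipeline(
        TfidfVectorizer(),
        FunctionTransformer(to_dense, accept_sparse=True),
        model,
    )
    scores[name] = cross_val_score(
        pipe, texts, labels, cv=3, scoring="f1_macro"
    ).mean()

champion = max(scores, key=scores.get)
print(scores)
print("Champion:", champion)
```

Macro-averaged F1 is used because the classes are described as imbalanced, and plain accuracy can hide poor performance on the minority classes.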