Team Number: BD_090_102_563_577
- Arvind Krishna - PES1UG19CS090
- Sachin Shankar - PES1UG19CS102
- Suhas Thalanki - PES1UG19CS563
- Vishruth Veerendranath - PES1UG19CS577
Project Repository: https://github.com/Big-Data-Gang/Machine-Learning-with-Spark-Streaming
Project title chosen: Machine Learning Using Spark Streaming: Sentiment Analysis
- Data streams were received in `app.py` over a TCP connection.
- The DStream was operated on using `flatMap` to split it into individual tweets.
- Each micro-batch was then converted to an RDD and used for pre-processing (see the sketch below).
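A minimal sketch of this receiving logic, assuming PySpark's streaming API; the host, port, batch interval, and the `process_batch` helper are illustrative assumptions rather than the project's exact values:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="SentimentStreaming")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches (assumed)

# Receive the stream over a TCP socket (host/port assumed)
lines = ssc.socketTextStream("localhost", 6100)

# Split each received payload into individual tweets
tweets = lines.flatMap(lambda payload: payload.split("\n"))

# Each micro-batch arrives as an RDD and is handed off to pre-processing;
# process_batch is a hypothetical stand-in for the project's handler
tweets.foreachRDD(lambda rdd: process_batch(rdd))

ssc.start()
ssc.awaitTermination()
```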
A pre-processing pipeline was built with the following steps (a code sketch follows the list):
- DocumentAssembler - Prepares the raw text into a format that is processable by Spark NLP.
- Tokenizer - Splits the text into individual tokens.
- Normalizer - Removes unwanted characters from the tokens and normalizes them based on regex patterns.
- StopWordsCleaner - Takes a sequence of strings and drops all the stop words from the input sequences.
- Stemmer - Returns hard stems of words, with the objective of retrieving the meaningful part of each word.
- Lemmatizer - Generates the lemmas of words, with the objective of returning a base dictionary word.
- Finisher - Converts annotation results into a format that is easier to use.
- HashingVectorizer - Converts a collection of text documents to a sparse matrix of token occurrences.
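The list above maps onto a Spark NLP pipeline roughly as sketched below; the column names, the pretrained lemmatizer model, and the `n_features` value are illustrative assumptions:

```python
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler, Finisher
from sparknlp.annotator import (Tokenizer, Normalizer, StopWordsCleaner,
                                Stemmer, LemmatizerModel)
from sklearn.feature_extraction.text import HashingVectorizer

# Assumes a Spark session started via sparknlp.start() and a DataFrame df
# with a "tweet" column
document = DocumentAssembler().setInputCol("tweet").setOutputCol("document")
tokenizer = Tokenizer().setInputCols(["document"]).setOutputCol("token")
normalizer = Normalizer().setInputCols(["token"]).setOutputCol("normalized")
cleaner = StopWordsCleaner().setInputCols(["normalized"]).setOutputCol("cleaned")
stemmer = Stemmer().setInputCols(["cleaned"]).setOutputCol("stem")
lemmatizer = LemmatizerModel.pretrained().setInputCols(["cleaned"]).setOutputCol("lemma")
finisher = Finisher().setInputCols(["lemma"]).setOutputCols(["tokens"])

pipeline = Pipeline(stages=[document, tokenizer, normalizer, cleaner,
                            stemmer, lemmatizer, finisher])
processed = pipeline.fit(df).transform(df)

# Hash the finished tokens into a fixed-width sparse matrix; alternate_sign=False
# keeps counts non-negative, which suits BernoulliNB downstream
vectorizer = HashingVectorizer(n_features=2**16, alternate_sign=False)
texts = [" ".join(row.tokens) for row in processed.select("tokens").collect()]
X = vectorizer.transform(texts)
```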
- SGDClassifier - Linear classifiers (SVM, logistic regression, etc.) with SGD training.
- BernoulliNB - Naive Bayes classifier for multivariate Bernoulli models.
- PassiveAggressiveClassifier - Works by responding passively to correct classifications and aggressively to any misclassification.
- MiniBatchKMeans - Works by using small random batches of data of a fixed size to update the cluster centroids. All four models were trained incrementally, as sketched below.
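All four models support scikit-learn's `partial_fit`, which is what makes incremental training over the stream possible. A minimal sketch, where the label encoding and hyperparameters are assumptions:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.cluster import MiniBatchKMeans

classes = np.array([0, 1])  # negative / positive sentiment (assumed encoding)
models = {
    "SGD": SGDClassifier(),
    "NB": BernoulliNB(),   # binarizes the hashed counts internally
    "PA": PassiveAggressiveClassifier(C=1.0),
}
kmeans = MiniBatchKMeans(n_clusters=2)

def train_on_batch(X, y):
    """Called once per streamed batch; X is the hashed sparse matrix."""
    for model in models.values():
        model.partial_fit(X, y, classes=classes)
    kmeans.partial_fit(X)
```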
Each model's test metrics (accuracy, precision, recall and F1 score) were then plotted against batch number and batch size to evaluate the effectiveness of incremental learning, and the plots of the different models were compared to draw conclusions about their relative effectiveness.
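A minimal sketch of how such per-batch metrics could be collected and plotted, reusing the `models` dict from the training sketch above; the plotting details are assumptions:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

history = {name: [] for name in models}  # models: dict from the training sketch

def evaluate_on_batch(X_test, y_test):
    for name, model in models.items():
        pred = model.predict(X_test)
        history[name].append({
            "accuracy": accuracy_score(y_test, pred),
            "precision": precision_score(y_test, pred),
            "recall": recall_score(y_test, pred),
            "f1": f1_score(y_test, pred),
        })

# After the stream ends, plot accuracy against batch number for each model
for name, records in history.items():
    plt.plot([r["accuracy"] for r in records], label=name)
plt.xlabel("Batch number")
plt.ylabel("Accuracy")
plt.legend()
plt.show()
```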
As seen in the above graph, the Passive Aggressive model performed the best in training. This trend was observed across other training batch sizes as well.
As seen in the above graph, the BernoulliNB model performed the best in testing. This trend was observed across other testing batch sizes as well.
From the above two graphs, it can be inferred that the Passive Aggressive Classifier, while having a higher training accuracy, had a worse test accuracy, which is an indicator of overfitting.
Due to this, we conclude that BernoulliNB is the most suitable model.
Models trained over various batch sizes were tested on the test set with a fixed batch size of 1000 in order to obtain a standard accuracy unaffected by the testing batch size. The only remaining variable is then the training batch size that the models were trained on.
- From the above graphs we can see that the batch size does not significantly affect the accuracy of the model; it mainly changes the training time and memory consumption. The PA classifier is slightly more affected by the batch size than the other models.
- Considering the KMeans plots, we can see the centroids move in the plane as each batch of data is added. By the final batch of the stream, the points consistently cluster around the centroid for a given label (a plotting sketch follows).
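A sketch of how the centroid movement could be visualised: PCA is fit per batch purely for plotting, and the dense conversion is feasible only for modest batch and feature sizes (these details are assumptions):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

kmeans = MiniBatchKMeans(n_clusters=2)

def plot_clusters(X, batch_no):
    """X: hashed sparse feature matrix for one streamed batch."""
    kmeans.partial_fit(X)
    pca = PCA(n_components=2)
    points = pca.fit_transform(X.toarray())           # dense copy just for plotting
    centres = pca.transform(kmeans.cluster_centers_)  # project centroids to 2-D
    plt.scatter(points[:, 0], points[:, 1], s=5, alpha=0.4)
    plt.scatter(centres[:, 0], centres[:, 1], c="red", marker="x", s=100)
    plt.title(f"Cluster centroids after batch {batch_no}")
    plt.show()
```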
We searched through a few hyperparameters for each of the classifiers, keeping everything else the same. The plots for the PA and SGD classifiers are below.
Tuning the C parameter (maximum step size) in PA clearly yields a better model overall, i.e., C = 1 gives a better model than C = 5.
Comparing the L1 and L2 penalty functions in SGD, we conclude that L2 is the better penalty. A sketch of how these candidates were compared is below.
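A hedged sketch of this comparison: the candidate values (C = 1 vs C = 5 for PA, L1 vs L2 penalty for SGD) are trained incrementally on identical batches, with everything else left at scikit-learn defaults:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier, PassiveAggressiveClassifier

classes = np.array([0, 1])  # assumed label encoding, as in the training sketch
candidates = {
    "PA_C1": PassiveAggressiveClassifier(C=1.0),
    "PA_C5": PassiveAggressiveClassifier(C=5.0),
    "SGD_L1": SGDClassifier(penalty="l1"),
    "SGD_L2": SGDClassifier(penalty="l2"),
}

def tune_on_batch(X, y):
    """Feed the same batch to every candidate so accuracies are comparable."""
    for model in candidates.values():
        model.partial_fit(X, y, classes=classes)
```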
- Considering the above plots and further analysis, we conclude that Naive Bayes is the best classifier for the given dataset and our device capabilities; it does not suffer much from either overfitting or underfitting.
- From the graphs, it is evident that clustering is not a good fit for this dataset because of the huge size and dimensionality of the vectorized text. The graphs themselves lose a lot of information through the dimensionality reduction technique used (PCA).
- Given the time-space trade-off, a batch size of 5000 was found to be best for our device configuration and available memory: it utilized memory well while not being too slow either.
- The HashingVectorizer was used because it does not need to store a vocabulary dictionary in memory; it scales to very large datasets with very little memory, making it highly efficient.
- SGDClassifier was selected because it fits easily in memory, since only a single training example is processed at a time.
- BernoulliNB was chosen since it is one of the few Naive Bayes classifiers that accepts a sparse matrix as input, removing the need to convert it into a dense matrix.
- PassiveAggressiveClassifier was chosen because, on memory-constrained devices that cannot keep all the training data in memory at once, passive-aggressive classifiers help by learning from each batch incrementally.
- MiniBatchKMeans has the advantage of reducing the computational cost by using a fixed-size subsample of the dataset in each iteration instead of the whole dataset.
- Learnt about and worked with distributed file systems such as Hadoop's HDFS and their management.
- Worked with Data Streaming in order to work with large datasets which would not fit into memory otherwise.
- Learnt Incremental Learning and how it can be used to continuously improve models and also work with Data Streaming and Huge Datasets.
- Improved understanding of Machine Learning.