This project implements a streaming pipeline for processing and analyzing the Amazon Metadata dataset using various techniques such as sampling, preprocessing, frequent itemset mining, and database integration.
This project is designed to:
- Process and analyze the Amazon Metadata dataset in real-time using a streaming pipeline
- Discover frequent itemsets in the dataset using algorithms like Apriori and PCY
- Optimize data processing using the Bloom Filter data structure
- Store the results in MongoDB and visualize them with MongoDB Compass
- Handle large datasets and scale the processing using Kafka
- Perform real-time analytics and gain insights from the dataset
- Preprocess the dataset to prepare it for analysis
- Use multiple consumer applications to perform different tasks and analyses on the data stream
- `sampling.py`: Python script for sampling the Amazon Metadata dataset.
- `pre-processing.py`: Python script for preprocessing the sampled dataset.
- `producer.py`: Python script for the producer application in the streaming pipeline.
- `consumer1.py`, `consumer2.py`, `consumer3.py`: Python scripts for the consumer applications subscribing to the producer's data stream.
- `Apriori.py`: Python script implementing the Apriori algorithm for frequent itemset mining.
- `PCY.py`: Python script implementing the PCY algorithm for frequent itemset mining.
- `Bloomfilter.py`: Python script implementing the Bloom Filter data structure.
- `Database_Apriori.py`: Python script for integrating Apriori results with the MongoDB database.
- `Database_PCY.py`: Python script for integrating PCY results with the MongoDB database.
- `Database_Bloomfilter.py`: Python script for integrating Bloom Filter results with the MongoDB database.
The `sampling.py` script samples the Amazon Metadata dataset, which is then preprocessed by `pre-processing.py`. The `producer.py` script generates a data stream, while `consumer1.py`, `consumer2.py`, and `consumer3.py` subscribe to this stream to perform various tasks, as sketched below.
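As a rough illustration of the streaming setup, here is a minimal producer/consumer pair using the `kafka-python` client. The broker address, the topic name `amazon-metadata`, and the JSON serialization are assumptions made for the sketch, not necessarily what the project's scripts use.

```python
# Minimal sketch of the streaming setup, assuming the kafka-python client,
# a local broker, and a hypothetical topic name "amazon-metadata".
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side (as in producer.py): stream preprocessed records as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("amazon-metadata", {"asin": "B000000001", "related": ["B000000002"]})
producer.flush()

# Consumer side (as in consumer1.py, consumer2.py, consumer3.py):
# subscribe to the topic and process each record as it arrives.
consumer = KafkaConsumer(
    "amazon-metadata",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    record = message.value  # one product's metadata as a dict
    # ... hand the record to the mining logic (Apriori, PCY, Bloom filter) ...
```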
`Apriori.py` and `PCY.py` implement different algorithms for frequent itemset mining, while `Bloomfilter.py` provides support for efficient data processing.
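For reference, a Bloom filter answers "possibly seen" / "definitely not seen" membership queries in constant space, which is what makes it useful for filtering a data stream. The following is a stand-alone toy version using a simple double-hashing scheme; `Bloomfilter.py` may be implemented differently.

```python
# Toy Bloom filter: constant-space set membership with possible false
# positives but no false negatives. A sketch, not the project's Bloomfilter.py.
import hashlib

class BloomFilter:
    def __init__(self, size=10000, num_hashes=5):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive num_hashes bit positions from two halves of an MD5 digest.
        digest = hashlib.md5(str(item).encode("utf-8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.add("B000000001")
print("B000000001" in bf)  # True
print("B999999999" in bf)  # almost certainly False
```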
The `Database_Apriori.py`, `Database_PCY.py`, and `Database_Bloomfilter.py` scripts connect to a MongoDB database and store the results of the analysis, which can then be browsed in MongoDB Compass.
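As a minimal sketch of this integration step, assuming the `pymongo` driver, a local server, and hypothetical database and collection names (the actual scripts may use different names and document shapes):

```python
# Minimal sketch of writing analysis results to MongoDB, assuming pymongo
# and hypothetical database/collection names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
collection = client["amazon_analysis"]["apriori_results"]

# Store one frequent itemset with its support count; the document then
# appears in MongoDB Compass under amazon_analysis > apriori_results.
collection.insert_one({"itemset": ["B000000001", "B000000002"], "support": 42})
```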
This project performs simple analysis on the dataset, including:
- Frequent itemset mining using Apriori and PCY algorithms
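As a toy illustration of what these two algorithms do, the sketch below counts frequent pairs with Apriori-style pruning; PCY differs by also hashing pairs into buckets on the first pass so that pairs landing in infrequent buckets can be skipped on the second. This is a simplified stand-in, not the project's `Apriori.py` or `PCY.py`.

```python
# Toy Apriori pass over transactions (lists of product IDs); simplified,
# not the project's Apriori.py. PCY would additionally hash each pair to a
# bucket during pass 1 and count in pass 2 only pairs in frequent buckets.
from collections import Counter
from itertools import combinations

def frequent_pairs(transactions, min_support):
    # Pass 1: count single items.
    item_counts = Counter(item for t in transactions for item in set(t))
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    # Pass 2 (Apriori pruning): count only pairs whose members are both frequent.
    pair_counts = Counter(
        pair
        for t in transactions
        for pair in combinations(sorted(frequent_items & set(t)), 2)
    )
    return {p: c for p, c in pair_counts.items() if c >= min_support}

baskets = [["a", "b", "c"], ["a", "b"], ["b", "c"], ["a", "c"]]
print(frequent_pairs(baskets, min_support=2))
# {('a', 'b'): 2, ('a', 'c'): 2, ('b', 'c'): 2}
```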
- Run `sampling.py` followed by `pre-processing.py` to sample and preprocess the dataset.
- Run `producer.py` to start the data stream.
- Run `consumer1.py`, `consumer2.py`, and `consumer3.py` to subscribe to the data stream and perform analysis.
- Optionally, run `Apriori.py`, `PCY.py`, and `Bloomfilter.py` for additional analysis.
- Run `Database_Apriori.py`, `Database_PCY.py`, and `Database_Bloomfilter.py` to integrate with the MongoDB database.
- Python 3.x
- Kafka
- MongoDB Compass & MongoDB Connector
- Other dependencies as specified in the code
- Kindly remove the numbers at the start of the file names (e.g., change `(3) producer.py` to `producer.py`); they are included only for clarity.
- Run the `BONUS.sh` file in the terminal with the command `./BONUS.sh`.
- M.Tashfeen Abbasi
- Laiba Mazhar
- Rafia Khan
Thank you!