Author: Muhammad Sheryar Adil
Data Science
GRIP October'23
This repository contains three tasks involving data analysis and machine learning using Python. The tasks are:
- Linear Regression Analysis on Student Scores Data
- K-Means Clustering on Iris Dataset
- Exploratory Data Analysis on Retail Dataset
The objective of this task is to predict the scores of students based on the number of hours they studied using linear regression.
- Import Libraries:
  - Import necessary libraries such as pandas, numpy, matplotlib, and sklearn.
- Load Data:
  - Load the dataset from a URL and display the first 15 rows.
- Data Visualization:
  - Plot the data to visualize the relationship between hours studied and scores obtained.
- Data Preparation:
  - Split the data into features (hours) and target (scores).
- Train-Test Split:
  - Split the data into training and testing sets.
- Model Training:
  - Train a linear regression model on the training data.
- Model Visualization:
  - Plot the regression line with the data points.
- Prediction:
  - Make predictions on the test data.
- Evaluation:
  - Compare the actual and predicted scores.
  - Calculate and display the Mean Absolute Error (MAE).
- A scatter plot of hours vs. scores.
- pandas
- numpy
- matplotlib
- scikit-learn (imported as sklearn)
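The regression steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the data is synthetic stand-in data generated in place of the dataset loaded from the URL, and the column names `Hours` and `Scores`, the 80/20 split, and `random_state=0` are all assumptions. Plotting is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data; the real task loads student scores from a URL.
rng = np.random.default_rng(0)
hours = rng.uniform(1.0, 9.5, size=25)
scores = 10.0 * hours + 2.0 + rng.normal(0.0, 3.0, size=25)
data = pd.DataFrame({"Hours": hours, "Scores": scores})  # assumed column names

# Data preparation: features must be 2-D for scikit-learn, target is 1-D.
X = data[["Hours"]]
y = data["Scores"]

# Train-test split (the 80/20 ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Model training.
model = LinearRegression()
model.fit(X_train, y_train)

# Prediction and evaluation.
y_pred = model.predict(X_test)
comparison = pd.DataFrame({"Actual": y_test.to_numpy(), "Predicted": y_pred})
mae = mean_absolute_error(y_test, y_pred)

print(comparison)
print(f"Mean Absolute Error: {mae:.2f}")
```

The fitted slope and intercept are available as `model.coef_[0]` and `model.intercept_`, which is what one would use to draw the regression line over the scatter plot.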
The objective of this task is to perform clustering on the Iris dataset to identify different species of Iris flowers.
- Import Libraries:
  - Import necessary libraries such as pandas, numpy, matplotlib, and sklearn.
- Load Data:
  - Load the Iris dataset and display the first 5 rows.
- Data Preparation:
  - Extract the feature variables from the dataset.
- Elbow Method:
  - Use the Elbow Method to determine the optimal number of clusters.
- Model Training:
  - Train a K-Means clustering model with the optimal number of clusters.
- Cluster Visualization:
  - Visualize the clusters and their centroids.
- A plot showing the Elbow Method to determine the optimal number of clusters.
- A scatter plot visualizing the clusters and centroids.
- pandas
- numpy
- matplotlib
- scikit-learn (imported as sklearn)
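A minimal sketch of the elbow method and the final K-Means fit might look like the following. It assumes the copy of Iris bundled with scikit-learn (`load_iris`) rather than whatever file the notebook reads, and the choice of 3 clusters is an assumption, since the value read off the elbow plot is not stated here. The settings `n_init=10` and `random_state=0` are likewise illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Assumption: use scikit-learn's bundled Iris data as a DataFrame.
iris = load_iris(as_frame=True)
features = iris.data  # four numeric feature columns, 150 rows
print(features.head())

# Elbow method: record the within-cluster sum of squares (inertia) for k = 1..10.
inertias = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(features)
    inertias.append(km.inertia_)

# Assumed choice of k; in practice it is read off the bend in the inertia curve.
n_clusters = 3
model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
labels = model.fit_predict(features)
centroids = model.cluster_centers_

print(inertias)
print(centroids)
```

Plotting `inertias` against `range(1, 11)` gives the elbow curve, and plotting two feature columns colored by `labels`, with `centroids` overlaid, gives the cluster scatter plot.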
This project involves performing an exploratory data analysis (EDA) on a retail dataset named 'SampleSuperstore.' The analysis aims to understand the business's overall performance, identify profitable and loss-making segments, and offer insights for strategic decision-making.
- Import Libraries:
  - Import necessary libraries such as pandas, numpy, matplotlib, and seaborn.
- Load Data:
  - Load the dataset and display the first 5 rows.
- Data Overview:
  - Display summary statistics and check for missing values.
- Data Cleaning:
  - Remove duplicate entries from the dataset.
- Overall Profit/Loss Analysis:
  - Calculate and display the overall profit/loss of the business.
- Category-wise Analysis:
  - Analyze profit/loss and sales by different categories.
- Region-wise Analysis:
  - Analyze profit/loss and sales by different regions.
- Sub-Category Analysis:
  - Analyze profit/loss by different sub-categories.
- Discount Analysis:
  - Analyze the impact of discount on profit/loss.
- Shipping Mode Analysis:
  - Analyze sales and profit by different shipping modes.
- Segment Analysis:
  - Analyze sales and profit by different customer segments.
- State-wise Analysis:
  - Analyze sales and profit by different states.
- Recommendations:
  - Provide recommendations based on the analysis.
- Overall Profit/Loss: Calculates the total profit or loss for the business.
- Profit/Loss by Category: Analyzes profit and loss across different product categories.
- Profit/Loss by Region: Breaks down profit and loss by geographical region.
- Profit/Loss by Sub-Category: Examines profit and loss at the sub-category level.
- Sales by Category: Visualizes sales distribution across categories.
- Sales by Region: Provides a regional sales breakdown.
- Profit/Loss by Discount: Investigates the impact of discounts on profit and loss.
- Profit & Sales by Ship Mode: Analyzes profit and sales based on different shipping methods.
- Sales & Profit by Segment: Studies the profit and sales across customer segments.
- Sales & Profit by State: Looks at profit and sales performance at the state level.
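The overview, cleaning, and grouped profit/sales analyses above mostly come down to a few pandas calls. The sketch below uses a tiny made-up table instead of the real SampleSuperstore data, and the column names `Category`, `Region`, `Sales`, and `Profit` are assumptions about that file's layout.

```python
import pandas as pd

# Tiny made-up sample, NOT the real SampleSuperstore data.
df = pd.DataFrame({
    "Category": ["Furniture", "Office Supplies", "Technology",
                 "Furniture", "Office Supplies", "Furniture"],
    "Region": ["West", "East", "West", "South", "East", "West"],
    "Sales": [200.0, 50.0, 400.0, 300.0, 50.0, 200.0],
    "Profit": [-20.0, 10.0, 80.0, 15.0, 10.0, -20.0],
})

# Data overview: summary statistics and missing-value counts per column.
summary = df.describe()
missing = df.isnull().sum()

# Data cleaning: drop exact duplicate rows.
clean = df.drop_duplicates()

# Overall profit/loss.
total_profit = clean["Profit"].sum()

# Category-wise and region-wise totals for sales and profit.
by_category = clean.groupby("Category")[["Sales", "Profit"]].sum()
by_region = clean.groupby("Region")[["Sales", "Profit"]].sum()

print(summary)
print(missing)
print(f"Overall profit/loss: {total_profit:.2f}")
print(by_category)
print(by_region)
```

The same `groupby(...)[["Sales", "Profit"]].sum()` pattern would extend to sub-category, ship mode, segment, and state by swapping the grouping column.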
The analysis provides several key insights:
- Overall Profit/Loss: The business has a total profit of $286,241.42.
- Category and Region Performance: Some categories and regions perform better than others. For instance, office supplies generally show positive profits, while furniture and technology vary.
- Impact of Discounts: Higher discounts generally lead to lower profits, indicating the need for optimized discount strategies.
- Shipping and Segmentation Insights: The shipping method and customer segment analysis suggest opportunities to optimize shipping costs and target profitable customer segments.
These insights can be used to make data-driven decisions to improve business profitability and efficiency.
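One way to check the discount insight is to compare average profit at each discount level and look at the sign of the correlation. The sketch below is only illustrative: the numbers are invented, and the column names `Discount` and `Profit` are assumed rather than taken from the notebook.

```python
import pandas as pd

# Invented example orders; these values do not come from SampleSuperstore.
orders = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.5],
    "Profit": [40.0, 60.0, 10.0, 20.0, -30.0, -50.0],
})

# Average profit at each discount level.
mean_profit_by_discount = orders.groupby("Discount")["Profit"].mean()

# Pearson correlation; a negative value means profit falls as discount rises.
correlation = orders["Discount"].corr(orders["Profit"])

print(mean_profit_by_discount)
print(f"Discount/profit correlation: {correlation:.2f}")
```

A negative correlation on the real data would be consistent with the observation that higher discounts tend to go with lower profits, though it would not by itself establish the cause.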
This repository provides a comprehensive analysis of different datasets using linear regression, clustering, and exploratory data analysis techniques. The insights and visualizations derived from these tasks can help in making informed business decisions.