Predicting GPA Using Lifestyle Factors
This repository contains a machine learning project focused on predicting students' GPA based on lifestyle factors like study hours, sleep, extracurricular activities, and stress levels.
Table of Contents
- Project Overview
- Dataset Description
- Project Structure
- Prerequisites
- Getting Started
- Project Workflow
- Results
- Future Enhancements
- Contributing
- License
- Contact
Project Overview
This project examines the relationship between lifestyle choices and academic performance, using students' GPA as the measurable outcome. By building and evaluating predictive models, we aim to identify which lifestyle factors are most strongly associated with GPA. These insights can help educational institutions, advisors, and students make informed decisions about academic performance.
Dataset Description
The dataset comprises several lifestyle factors and the GPA of each student. Key features include:
- Study Hours Per Day: Average hours spent studying daily.
- Extracurricular Hours Per Day: Time spent on extracurricular activities (sports, clubs, etc.).
- Sleep Hours Per Day: Average hours of sleep per night.
- Social Hours Per Day: Time spent socializing with friends or family.
- Physical Activity Hours Per Day: Hours spent on physical activities (exercise, sports).
- Stress Level: Self-reported stress level (Low, Moderate, High), converted to numerical format for modeling.
- GPA: The target variable, representing the students' GPA.
Project Structure
- Data Loading and Exploration:
  - Load and inspect the dataset for missing values and data types.
  - Summarize descriptive statistics to understand the distribution and central tendencies of each variable.
- Data Preprocessing:
  - Missing Values Handling: Impute missing values with the mean.
  - Encoding: Convert categorical variables like stress levels to numerical values.
  - Feature Scaling: Standardize features to ensure they are on a similar scale.
  - Polynomial Features: Generate polynomial terms to capture potential non-linear relationships.
- Exploratory Data Analysis (EDA):
  - Visualize the distribution of GPA.
  - Examine correlations among variables to identify potential predictors.
  - Create histograms, scatter plots, and a heatmap to reveal relationships between lifestyle factors and GPA.
- Model Training and Evaluation:
  - Train multiple regression models to predict GPA based on lifestyle features.
  - Models include Linear Regression, Ridge, Lasso, Random Forest, Gradient Boosting, and XGBoost.
  - Evaluate models using cross-validation and metrics such as R², Mean Absolute Error (MAE), and Mean Squared Error (MSE).
- Hyperparameter Tuning:
  - Fine-tune the hyperparameters of the best-performing model to enhance accuracy.
- Model Interpretation with SHAP:
  - Use SHAP values to explain the importance of each feature, helping interpret the model's predictions.
Prerequisites
Ensure you have Python installed, along with the following libraries:
    pip install pandas numpy matplotlib seaborn scikit-learn xgboost shap
Getting Started
Clone the repository and move into the project directory:
    git clone https://github.com/Shelton-beep/predicting-gpa-using-lifestyle-factors.git
    cd predicting-gpa-using-lifestyle-factors
To run the project, open `predicting-gpa-using-lifestyle-factors.ipynb` and execute each cell in order. The notebook contains detailed code and explanations of each step.
Project Workflow
In the first step, we load the dataset and perform basic data preprocessing:
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # Load dataset
    data = pd.read_csv('path/to/dataset.csv')

    # Check for missing values
    print(data.isnull().sum())

    # Convert categorical 'Stress_Level' to numerical values
    data['Stress_Level'] = data['Stress_Level'].map({'Low': 0, 'Moderate': 1, 'High': 2})

    # Standardize features (every column except the target)
    scaler = StandardScaler()
    data_scaled = scaler.fit_transform(data.drop(columns=['GPA']))
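The snippet above scales the full dataset before any train/test split, and the later snippets use `X_train`, `X_test`, and `y_train` without defining them. One way to tie these together is sketched here, assuming a scikit-learn `Pipeline` that applies the mean imputation, standardization, and polynomial expansion described earlier. The synthetic data, the column names other than `GPA` and `Stress_Level`, the 80/20 split, and the polynomial degree of 2 are illustrative assumptions, not taken from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the real dataset (most column names are assumptions).
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "Study_Hours_Per_Day": rng.uniform(1, 10, n),
    "Sleep_Hours_Per_Day": rng.uniform(4, 10, n),
    "Stress_Level": rng.integers(0, 3, n).astype(float),
})
data["GPA"] = 2.0 + 0.15 * data["Study_Hours_Per_Day"] + rng.normal(0, 0.1, n)
data.loc[::25, "Sleep_Hours_Per_Day"] = np.nan  # inject a few missing values

X = data.drop(columns=["GPA"])
y = data["GPA"]

# Split first so imputation and scaling statistics come from training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

preprocess = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(degree=2, include_bias=False)),
])

# Fit on the training split, then apply the same fitted transform to the test split.
X_train_prepared = preprocess.fit_transform(X_train)
X_test_prepared = preprocess.transform(X_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

Fitting the preprocessing steps on the training split alone keeps information from the test rows out of the means and standard deviations used for scaling.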
EDA helps us understand the dataset and discover relationships between variables. We visualize the distribution of GPA and examine correlations:
    import matplotlib.pyplot as plt
    import seaborn as sns

    # Plot GPA distribution
    sns.histplot(data['GPA'], kde=True)
    plt.title("Distribution of GPA")
    plt.show()

    # Correlation heatmap (numeric columns only)
    plt.figure(figsize=(10, 8))
    sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
    plt.title("Correlation Matrix")
    plt.show()
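The heatmap shows every pairwise correlation. To read off candidate predictors, it can help to rank the features by their correlation with GPA alone. A small sketch follows, using synthetic data in place of the real file; only `GPA` and `Stress_Level` are names used in this README, and the other column names are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset (illustrative values only).
rng = np.random.default_rng(1)
n = 300
study = rng.uniform(1, 10, n)
sleep = rng.uniform(4, 10, n)
stress = rng.integers(0, 3, n)
data = pd.DataFrame({
    "Study_Hours_Per_Day": study,
    "Sleep_Hours_Per_Day": sleep,
    "Stress_Level": stress,
    "GPA": 2.0 + 0.2 * study + rng.normal(0, 0.2, n),
})

# Pearson correlation of each feature with GPA, strongest (by magnitude) first.
gpa_corr = data.corr()["GPA"].drop("GPA")
ranked = gpa_corr.reindex(gpa_corr.abs().sort_values(ascending=False).index)
print(ranked)
```

Correlation captures only linear, pairwise association, so a feature with a weak correlation can still matter in a non-linear model.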
We train various regression models and evaluate them using cross-validation. Here’s an example with Linear Regression:
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    # X_train and y_train are the training features and GPA targets
    model = LinearRegression()
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='r2')
    print("Cross-validated R2 Score for Linear Regression:", scores.mean())
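The same pattern extends to the other models listed in this README. The sketch below assumes scikit-learn's `cross_validate` with several scorers, and uses synthetic data in place of `X_train` and `y_train`. XGBoost is left out only to keep the example free of extra dependencies, and the hyperparameters shown are placeholders rather than the notebook's settings.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_validate

# Synthetic regression data standing in for the lifestyle features and GPA.
X_train, y_train = make_regression(
    n_samples=200, n_features=6, noise=10.0, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Ridge Regression": Ridge(alpha=1.0),
    "Lasso Regression": Lasso(alpha=0.1),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

# scikit-learn scorers follow "higher is better", so error metrics are negated.
scoring = {
    "r2": "r2",
    "mae": "neg_mean_absolute_error",
    "mse": "neg_mean_squared_error",
}

results = {}
for name, model in models.items():
    cv = cross_validate(model, X_train, y_train, cv=5, scoring=scoring)
    results[name] = {
        "r2": cv["test_r2"].mean(),
        "mae": -cv["test_mae"].mean(),  # flip the sign back to a positive error
        "mse": -cv["test_mse"].mean(),
    }
    print(f"{name}: R2={results[name]['r2']:.3f}, "
          f"MAE={results[name]['mae']:.3f}, MSE={results[name]['mse']:.3f}")
```

Collecting the fold averages in one dictionary makes it straightforward to sort the models or write them out as a comparison table.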
Model Evaluation Metrics:
- R² Score: Proportion of variance in GPA explained by the model.
- Mean Absolute Error (MAE): Average absolute difference between predicted and actual GPA.
- Mean Squared Error (MSE): Average squared difference, penalizing larger errors.
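As a worked example of these definitions, the three metrics can be computed by hand for a handful of predictions. The GPA values below are made up for illustration, and the helper name `regression_metrics` is hypothetical. In practice, scikit-learn's `r2_score`, `mean_absolute_error`, and `mean_squared_error` compute the same quantities.

```python
def regression_metrics(y_true, y_pred):
    """Return (R2, MAE, MSE) for two equal-length sequences of numbers.

    Assumes y_true is not constant; otherwise the total sum of squares is zero.
    """
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    return r2, mae, mse


# Illustrative (made-up) actual and predicted GPAs.
actual = [2.0, 3.0, 4.0, 3.0]
predicted = [2.5, 3.0, 3.5, 3.0]

r2, mae, mse = regression_metrics(actual, predicted)
print(f"R2={r2:.2f}, MAE={mae:.2f}, MSE={mse:.3f}")
# prints: R2=0.75, MAE=0.25, MSE=0.125
```

Two of the four predictions are off by 0.5, so the MAE is 0.25, while squaring those errors gives an MSE of 0.125. The residual sum of squares (0.5) is a quarter of the total sum of squares (2.0), which yields an R² of 0.75.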
For the best-performing model (e.g., Random Forest), we use hyperparameter tuning to enhance accuracy.
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import RandomizedSearchCV

    # Candidate values; RandomizedSearchCV samples n_iter (default 10) combinations
    param_grid = {
        'n_estimators': [100, 200, 300],
        'max_depth': [10, 20, 30],
        'min_samples_split': [2, 5, 10]
    }
    grid_search = RandomizedSearchCV(RandomForestRegressor(), param_grid, cv=5, scoring='r2')
    grid_search.fit(X_train, y_train)
    print("Best Hyperparameters:", grid_search.best_params_)
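`RandomizedSearchCV` samples a fixed number of candidates (`n_iter`) from the parameter space rather than trying every combination, so fixing `random_state` makes the search repeatable. The sketch below also checks the tuned model on a held-out test set, which the cross-validated score alone does not cover. The synthetic data, the smaller search space, and the 80/20 split are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Synthetic data standing in for the lifestyle features and GPA.
X, y = make_regression(n_samples=200, n_features=6, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_distributions = {
    "n_estimators": [50, 100],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5],
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions,
    n_iter=4,          # number of sampled candidates (default is 10)
    cv=3,
    scoring="r2",
    random_state=0,    # makes the sampled candidates reproducible
)
search.fit(X_train, y_train)

# best_estimator_ is refit on all of X_train with the best parameters found.
y_pred = search.best_estimator_.predict(X_test)
test_r2 = r2_score(y_test, y_pred)
test_mae = mean_absolute_error(y_test, y_pred)
print("Best hyperparameters:", search.best_params_)
print(f"Held-out R2: {test_r2:.3f}, MAE: {test_mae:.3f}")
```

A large gap between the cross-validated score and the held-out score would suggest the search has overfit to the training folds.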
Using SHAP values, we can interpret the impact of each lifestyle factor on GPA predictions:
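    import shap

    # TreeExplainer supports tree ensembles such as the tuned Random Forest
    explainer = shap.TreeExplainer(grid_search.best_estimator_)
    shap_values = explainer.shap_values(X_test)

    # X is the feature DataFrame that X_train and X_test were split from
    shap.summary_plot(shap_values, X_test, feature_names=X.columns)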
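The summary plot orders features by their overall absolute SHAP magnitude. When a numeric summary is needed, that ranking can also be computed directly, as in this sketch. The helper name `rank_features_by_shap`, the feature names, and the small hand-written array are hypothetical. In practice the array would be the `shap_values` returned by the explainer for a single-output regressor.

```python
import numpy as np


def rank_features_by_shap(shap_values, feature_names):
    """Return (feature, mean |SHAP|) pairs sorted from most to least important.

    shap_values: array of shape (n_samples, n_features), one row per prediction.
    """
    shap_values = np.asarray(shap_values, dtype=float)
    if shap_values.ndim != 2 or shap_values.shape[1] != len(feature_names):
        raise ValueError("shap_values must have one column per feature name")
    importance = np.abs(shap_values).mean(axis=0)
    order = np.argsort(importance)[::-1]
    return [(feature_names[i], float(importance[i])) for i in order]


# Hand-written example values (not real model output).
example_values = [
    [0.30, -0.05, 0.10],
    [-0.20, 0.05, -0.10],
    [0.40, 0.00, 0.10],
]
names = ["Study_Hours_Per_Day", "Sleep_Hours_Per_Day", "Stress_Level"]

ranking = rank_features_by_shap(example_values, names)
for feature, score in ranking:
    print(f"{feature}: {score:.3f}")
```

Taking the absolute value before averaging matters because positive and negative contributions would otherwise cancel, making an influential feature look unimportant.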
Results
The table below summarizes the performance of each model based on cross-validation scores:
| Model | R² Score | Mean Absolute Error | Mean Squared Error |
|---|---|---|---|
| Linear Regression | 0.XX | X.XX | X.XX |
| Ridge Regression | 0.XX | X.XX | X.XX |
| Lasso Regression | 0.XX | X.XX | X.XX |
| Random Forest | 0.XX | X.XX | X.XX |
| Gradient Boosting | 0.XX | X.XX | X.XX |
| XGBoost | 0.XX | X.XX | X.XX |
- Key Predictors: Factors such as study hours, stress level, and sleep hours show significant influence on GPA.
- Model Interpretability: SHAP values reveal which lifestyle choices are most impactful, helping students focus on areas for improvement.
Future Enhancements
Future versions of this project could include:
- Adding more lifestyle factors for improved prediction accuracy.
- Experimenting with more complex models or neural networks.
- Using more sophisticated hyperparameter tuning techniques like Bayesian Optimization.
Contributing
Contributions are welcome! To contribute:
- Fork the repository.
- Create a branch for your feature (`git checkout -b feature/YourFeature`).
- Commit your changes (`git commit -m 'Add YourFeature'`).
- Push to the branch (`git push origin feature/YourFeature`).
- Open a pull request.
License
This project is licensed under the MIT License; see the LICENSE file for details.
Contact
For questions or suggestions, please feel free to reach out or open an issue.