Predicting Student Dropout and Academic Success
This project aims to predict student dropout and academic success in higher education, focusing on STEM fields. A machine learning model was developed using early academic performance and socioeconomic factors. The Random Forest model achieved over 90% test set accuracy.
- Python (>= 3.x)
- Libraries: pandas, numpy, scikit-learn, matplotlib, scipy
- Clone the repository.
- Upload the dataset.
- Run the Python script.
- Data Loading and Preprocessing: The dataset is loaded and cleaned.
- Model Training (Random Forest): A Random Forest classifier is trained on the data.
- Model Evaluation:
  - Accuracy: The model's accuracy is calculated on the test set.
  - Confusion Matrix: A confusion matrix is generated to visualize model performance.
  - ROC Curve: A Receiver Operating Characteristic (ROC) curve is plotted to assess the model's ability to distinguish between classes.
- Additional Analyses (Gender and Scholarship).
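The steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: the data is synthetic (generated with `make_classification`), the column name `dropout`, the binary target, the 80/20 split, and the hyperparameters are all assumptions made for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (assumption: binary target, 1 = dropout).
X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=42
)
df = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(X.shape[1])])
df["dropout"] = y

# Preprocessing: drop rows with missing values (a no-op on this synthetic data).
df = df.dropna()

# Hold out 20% of the rows for testing, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="dropout"),
    df["dropout"],
    test_size=0.2,
    stratify=df["dropout"],
    random_state=42,
)

# Train the Random Forest classifier (hyperparameters are illustrative).
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Evaluate: accuracy, confusion matrix, and the points of the ROC curve.
y_pred = model.predict(X_test)
y_score = model.predict_proba(X_test)[:, 1]  # probability of the positive class

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

print(f"Accuracy: {accuracy:.3f}")
print(f"ROC AUC:  {auc:.3f}")
print("Confusion matrix (rows = true class, columns = predicted class):")
print(cm)
```

The `fpr` and `tpr` arrays can be passed to `matplotlib.pyplot.plot` to draw the ROC curve. Scores on this synthetic data say nothing about the results reported for the real dataset.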
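The Random Forest model predicts dropout and academic success with over 90% accuracy on the test set. Because it relies on early academic performance and socioeconomic factors, its predictions could help institutions intervene early with at-risk students.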
The project relies on several Python libraries for its implementation:
- pandas: Used for data manipulation and cleaning.
- numpy: Utilized for numerical operations and array manipulation.
- scikit-learn: Provides machine learning tools and algorithms.
- matplotlib: Used for data visualization and generating plots.
- scipy: Employed for statistical analyses and tests.
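As an illustration of how scipy might support the gender and scholarship analyses, the sketch below runs a chi-square test of independence between scholarship status and dropout. The test choice, the column names `scholarship_holder` and `dropout`, and the toy records are all hypothetical; the project may use a different test or different variables.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy records, made up purely for illustration.
records = pd.DataFrame(
    {
        "scholarship_holder": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0],
        "dropout":            [0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1],
    }
)

# Contingency table: rows = scholarship status, columns = dropout outcome.
table = pd.crosstab(records["scholarship_holder"], records["dropout"])

# Chi-square test of independence between the two categorical variables.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}, dof = {dof}")
```

With so few rows the expected cell counts are small, so the chi-square approximation is unreliable here; on the full dataset the same call would apply to a table built from the real columns.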