Over nine years in deep space, NASA's Kepler space telescope carried out a planet-hunting mission to discover hidden planets outside our solar system. This project explores machine learning models capable of classifying candidate exoplanets from the raw Kepler dataset.
Which machine learning model best fits the data?
I explored four different machine learning models, each contained in its own notebook:
- Logistic Regression Model
- Random Forest Classifier Model
- Support Vector Machine Classifier Model
- K-Nearest Neighbors Classifier Model
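Assuming the standard scikit-learn implementations (the notebooks themselves are not quoted here), the four models map to these estimator classes:

```python
# Minimal sketch of the four estimators compared, assuming the
# standard scikit-learn classes are used in each notebook.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

models = {
    "Logistic": LogisticRegression(),
    "RFC": RandomForestClassifier(),
    "SVC": SVC(),
    "KNN": KNeighborsClassifier(),
}
```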
A comparative summary of the hypertuned models is in its own notebook (Model_Comparison).
The libraries I used are as follows:
- sklearn
- joblib
- numpy
- pandas
- matplotlib
All notebooks contain pip install cells for joblib and sklearn that can be uncommented to make sure the versions on your machine are up to date.
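As a rough sketch, those cells look like this:

```python
# Uncomment to upgrade the pinned libraries before running a notebook.
# !pip install --upgrade joblib
# !pip install --upgrade scikit-learn
```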
In each notebook, I clean the source data by dropping null values and removing the error columns.
The "y" variable for machine learning is "koi_disposition," which classifies each candidate as "confirmed", "candidate", or "false positive."
The "x" variables are the remaining columns in the dataset. The definitions of the columns are provided at the end of each model notebook, or it can be obtained at Kaggle or the data dictionary.
From the cleaned dataframe, I create a stratified train/test split with random_state=42.
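With scikit-learn, that split looks like this:

```python
from sklearn.model_selection import train_test_split

# Stratify on the target so all three dispositions keep their class
# proportions in both splits; random_state=42 for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)
```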
I scale the data using a quantile transformer and normalizer.
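Assuming these map to scikit-learn's QuantileTransformer and Normalizer, a sketch of the scaling step (transformers fitted on the training split only):

```python
from sklearn.preprocessing import QuantileTransformer, Normalizer

# Fit on the training data only, then apply to both splits.
qt = QuantileTransformer().fit(X_train)
X_train_scaled = qt.transform(X_train)
X_test_scaled = qt.transform(X_test)

norm = Normalizer().fit(X_train_scaled)
X_train_scaled = norm.transform(X_train_scaled)
X_test_scaled = norm.transform(X_test_scaled)
```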
I train and test each model, then hypertune it.
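The notebooks aren't quoted here, but a typical hypertuning step with GridSearchCV might look like this (the parameter grid is hypothetical):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical parameter grid; the actual grids live in the notebooks.
param_grid = {"n_estimators": [100, 200, 500], "max_depth": [None, 10, 20]}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=5)
grid.fit(X_train_scaled, y_train)
print(grid.best_params_, grid.best_score_)
```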
As a final step, I compile the scores and classification reports for the hypertuned models and identify the best-fit model in the Model_Comparison notebook.
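A sketch of that compilation step, assuming a `tuned_models` dict of fitted models (a hypothetical name):

```python
from sklearn.metrics import classification_report

# tuned_models is a hypothetical {name: fitted_model} mapping.
for name, model in tuned_models.items():
    print(f"{name} score: {model.score(X_test_scaled, y_test):.2f}")
    print(classification_report(y_test, model.predict(X_test_scaled)))
```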
The hypertuned RFC (random forest classifier) with the standard scaler fits the data best, based on its model score (0.90, the highest of the four models). Its classification report also has the best precision for the "CONFIRMED" outcome (0.80).
Hypertuning has little impact on the overall model scores (within ±0.02 accuracy in every case).
See the details in the sections below.
The following sections show the overall accuracy scores for both hypertuned and non-hypertuned models, and per-outcome precision for the hypertuned models only.
For details on the hypertuned models' classification reports (including recall, F1 score, etc.), see the Model_Comparison notebook. For classification reports of the non-hypertuned models, see the individual notebooks.
Test Scores
Logistic - Type | Accuracy Scores |
---|---|
Non-hypertuned | 0.87 |
Hypertuned | 0.89 |
Hypertuned Model Outcome Precision
Outcome | Precision Scores |
---|---|
CANDIDATE | 0.82 |
CONFIRMED | 0.76 |
FALSE POSITIVE | 0.99 |
Test Scores
RFC - Type | Accuracy Scores |
---|---|
Non-hypertuned | 0.90 |
Hypertuned | 0.90 |
Hypertuned Model Outcome Precision
Outcome | Precision Scores |
---|---|
CANDIDATE | 0.86 |
CONFIRMED | 0.80 |
FALSE POSITIVE | 0.97 |
Test Scores
SVC - Type | Accuracy Scores |
---|---|
Non-hypertuned | 0.89 |
Hypertuned | 0.88 |
Hypertuned Model Outcome Precision
Outcome | Precision Scores |
---|---|
CANDIDATE | 0.80 |
CONFIRMED | 0.76 |
FALSE POSITIVE | 0.99 |
Test Scores
KNN - Type | Accuracy Scores |
---|---|
Non-hypertuned | 0.89 |
Hypertuned | 0.89 |
Hypertuned Model Outcome Precision
Outcome | Precision Scores |
---|---|
CANDIDATE | 0.86 |
CONFIRMED | 0.75 |
FALSE POSITIVE | 0.99 |