adult-income-dataset

A machine learning approach to identifying the characteristics associated with people's income level (those earning >$50,000 vs <$50,000).

Data

We use the publicly available Adult Income Dataset from the 1994 US Census Bureau database for this task. The census_income_metadata file describes the columns and their data types. We have ~200,000 records in the training set and XXX in the test set.
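As a rough illustration of sanity-checking loaded data against a metadata description, the sketch below compares a DataFrame with an expected schema. The schema dictionary, the column names, the toy labels and the helper `check_against_schema` are all hypothetical placeholders; the notebook's actual checks and the metadata file's format may differ.

```python
import io

import pandas as pd

# Hypothetical schema; in practice this would be transcribed from the metadata file.
EXPECTED_SCHEMA = {
    "age": "numeric",
    "education": "categorical",
    "income": "categorical",
}


def check_against_schema(df, schema):
    """Return a list of problems; an empty list means the frame matches the schema."""
    problems = []
    missing = [col for col in schema if col not in df.columns]
    extra = [col for col in df.columns if col not in schema]
    if missing:
        problems.append(f"missing columns: {missing}")
    if extra:
        problems.append(f"unexpected columns: {extra}")
    for col, kind in schema.items():
        if col not in df.columns:
            continue
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        if kind == "numeric" and not is_numeric:
            problems.append(f"{col}: expected numeric, got {df[col].dtype}")
        if kind == "categorical" and is_numeric:
            problems.append(f"{col}: expected categorical, got {df[col].dtype}")
    return problems


# Toy stand-in for the real training file (values are made up).
sample_csv = io.StringIO("age,education,income\n39,Bachelors,low\n52,Masters,high\n")
sample = pd.read_csv(sample_csv)
problems = check_against_schema(sample, EXPECTED_SCHEMA)
print(problems or "schema check passed")
```

Returning a list of problems rather than raising on the first mismatch makes it easier to see every discrepancy in one pass.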

Solution

The notebook adult_income.ipynb contains end-to-end code for:

Programmatic data loading and sanity checks against the metadata file
Exploratory Data Analysis
Feature Engineering and Feature Encoding
Model Comparison and Selection (ROC AUC, PR AUC etc.)
1. Logistic Regression
2. RandomForest Classifier
3. XGBoost Classifier
4. CatBoost Classifier
Hyperparameter Tuning and Model Validation (Precision, Recall, Accuracy etc.)
Model Interpretation (Feature Importances, SHAP values, Partial Dependence Plots etc.)
Resampling Techniques to handle the imbalanced dataset (SMOTE, Oversampling, Undersampling etc.)
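The model comparison step in the list above can be sketched roughly as follows. This is a minimal example on synthetic, imbalanced data generated with scikit-learn, not the notebook's code: it covers only the two scikit-learn models (XGBoost and CatBoost are left out to keep the example dependency-light), uses average precision as the PR AUC summary, and every dataset parameter is an assumption chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the census data: roughly 10% positive class.
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=5,
    weights=[0.9, 0.1],
    random_state=0,
)
# Stratify so the class ratio is preserved in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Both metrics are computed from predicted probabilities of the positive class.
    proba = model.predict_proba(X_test)[:, 1]
    scores[name] = {
        "roc_auc": roc_auc_score(y_test, proba),
        "pr_auc": average_precision_score(y_test, proba),
    }

for name, metrics in scores.items():
    print(f"{name}: ROC AUC={metrics['roc_auc']:.3f}, PR AUC={metrics['pr_auc']:.3f}")
```

On an imbalanced target, PR AUC is usually the more informative of the two numbers, because ROC AUC can look strong even when precision on the minority class is poor.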

All analysis, observations, and comments are documented within the Python notebook itself.

Results

We observe that the top 8 most important characteristics associated with income are:

Weeks Worked in a Year

Age

Education

Sex

Occupation
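
A ranking like the one above is typically read off feature-importance scores. As a hedged illustration only, the sketch below ranks two synthetic features with scikit-learn's `permutation_importance`; the feature names and data are hypothetical, and it does not reproduce the notebook's method or results.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_rows = 1500

# One feature drives the label, the other is pure noise (both hypothetical).
signal = rng.normal(size=n_rows)
noise = rng.normal(size=n_rows)
X = np.column_stack([signal, noise])
y = (signal + 0.3 * rng.normal(size=n_rows) > 0).astype(int)
feature_names = ["signal_feature", "noise_feature"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Shuffle each column on held-out data and measure the drop in ROC AUC.
result = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=5, random_state=0
)
ranking = sorted(
    zip(feature_names, result.importances_mean),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranking:
    print(f"{name}: {importance:.3f}")
```

Computing the importances on held-out data rather than the training set keeps the ranking from rewarding features the model has merely memorised.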

sandeepchittilla/adult-income-dataset
