This project uses the Pima Indians Diabetes Database, provided as a CSV file. The dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective is to predict, based on diagnostic measurements, whether a patient has diabetes. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.
The Data_Visualization code:
- Loads the CSV file into the Spyder environment.
- Calculates and prints statistical features of each attribute, such as the mean, median, and mode.
- Calculates and prints the correlation coefficient between the target attribute and each of the other columns.
- Draws a scatter plot between two different attributes.
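The steps above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual script: the function names (`describe_attributes`, `correlation_with_target`, `scatter`) are hypothetical, the file name `diabetes.csv` is an assumption, and the six rows in `SAMPLE_CSV` are a small stand-in so the example runs without the real file. The column names follow the Kaggle dataset, where `Outcome` is the target.

```python
import io

import pandas as pd

# Small stand-in for the real file (assumption: illustrative rows only).
# With the actual data you would use something like:
#     df = pd.read_csv("diabetes.csv")   # file name is an assumption
SAMPLE_CSV = """Glucose,BMI,Age,Outcome
148,33.6,50,1
85,26.6,31,0
183,23.3,32,1
89,28.1,21,0
137,43.1,33,1
116,25.6,30,0
"""


def describe_attributes(df, target="Outcome"):
    """Return mean, median, mode, std, min and max for every non-target column."""
    features = df.drop(columns=[target])
    return pd.DataFrame({
        "mean": features.mean(),
        "median": features.median(),
        # mode() can return several rows; keep the first mode per column.
        "mode": features.mode().iloc[0],
        "std": features.std(),
        "min": features.min(),
        "max": features.max(),
    })


def correlation_with_target(df, target="Outcome"):
    """Pearson correlation of each feature column with the target column."""
    return df.drop(columns=[target]).corrwith(df[target])


def scatter(df, x, y):
    """Scatter plot of two attributes (needs matplotlib; not called below)."""
    import matplotlib.pyplot as plt

    ax = df.plot.scatter(x=x, y=y)
    plt.show()
    return ax


df = pd.read_csv(io.StringIO(SAMPLE_CSV))
stats = describe_attributes(df)
corr = correlation_with_target(df)
print(stats)
print(corr)
```

Calling `scatter(df, "Glucose", "BMI")` would then draw one attribute against another.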
The Data_Preprocessing code:
- Imports PCA from sklearn.
- Loads the CSV file into the Spyder environment.
- Normalizes and standardizes every attribute except the target class.
- Generates synthetic data in order to perform eigenvalue and eigenvector decomposition.
- Applies PCA to the given dataset in order to reduce the dimensionality of the data.
- Calculates and prints the covariance matrix after dimensionality reduction.
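As a rough sketch of the scaling and PCA steps listed above, the snippet below uses scikit-learn's `MinMaxScaler`, `StandardScaler`, and `PCA`. Several things here are assumptions rather than facts about the repository: the random matrix `X` stands in for the eight feature columns (target excluded), min-max scaling is taken as the meaning of "normalization", and `n_components=2` is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)

# Stand-in for the feature columns (assumption: 200 rows, 8 correlated features).
X = rng.normal(size=(200, 8)) @ rng.normal(size=(8, 8))

# Normalization: rescale every column to the range [0, 1].
X_norm = MinMaxScaler().fit_transform(X)

# Standardization: zero mean and unit variance for every column.
X_std = StandardScaler().fit_transform(X)

# Reduce the standardized data to two principal components.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_std)

# Covariance matrix of the reduced data. It should be (nearly) diagonal,
# because principal components are uncorrelated with each other.
cov_reduced = np.cov(X_reduced, rowvar=False)

print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Covariance after reduction:\n", cov_reduced)
```

The diagonal of `cov_reduced` holds the variance captured by each component, which is one way to check how much information survives the reduction.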
https://www.kaggle.com/datasets/uciml/pima-indians-diabetes-database
Statistical Analysis:
Normalization and standardization of each column:
Eigenvalue analysis of the synthetic data:
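For readers who want to reproduce this kind of analysis, here is a minimal sketch using NumPy. It is not the repository's code: the synthetic data is assumed to be a two-dimensional correlated Gaussian, and `true_cov`, the sample size, and the random seed are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(42)

# Assumed synthetic data: 1000 samples from a correlated 2-D Gaussian.
true_cov = np.array([[3.0, 1.5],
                     [1.5, 1.0]])
data = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=1000)

# Sample covariance matrix (columns are variables).
cov = np.cov(data, rowvar=False)

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Reorder so the direction of largest variance comes first.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the centered data onto the eigenvectors (the principal axes).
projected = (data - data.mean(axis=0)) @ eigenvectors

print("Eigenvalues:", eigenvalues)
print("Eigenvectors (as columns):\n", eigenvectors)
```

After projection, the covariance of `projected` is diagonal with the eigenvalues on the diagonal, which is the same property PCA relies on.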
PCA on the main dataset: