Welcome! This repo provides an interactive, complete, and practical feature engineering tutorial in Jupyter Notebook. It contains three parts: Data Preprocessing, Feature Selection, and Dimension Reduction, each demonstrated in its own notebook. Because some feature selection algorithms, such as Simulated Annealing and Genetic Algorithm, lack a complete implementation in Python, we also provide corresponding Python scripts (Simulated Annealing, Genetic Algorithm) and cover them in the tutorial for your reference.
- Notebook One covers data preprocessing of static continuous features with scikit-learn, static categorical features with Category Encoders, and time series features with Featuretools.
- Notebook Two covers feature selection: univariate filter methods (scikit-learn), multivariate filter methods (scikit-feature), deterministic wrapper methods (scikit-learn), randomized wrapper methods (our own Python scripts), and embedded methods (scikit-learn).
- Notebook Three covers supervised and unsupervised dimension reduction with scikit-learn.
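
As a rough illustration of how the three parts fit together, the sketch below chains one step from each notebook using scikit-learn: standard scaling (preprocessing), a univariate F-score filter (feature selection), and PCA (dimension reduction). The dataset, the number of selected features, and the number of components are arbitrary choices made for this example; they are not taken from the notebooks.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small built-in dataset: 150 samples, 4 continuous features, 3 classes.
X, y = load_iris(return_X_y=True)

pipeline = Pipeline([
    # Part 1: preprocessing (z-score standardization).
    ("scale", StandardScaler()),
    # Part 2: feature selection (keep the 3 features with the highest F-score).
    ("select", SelectKBest(score_func=f_classif, k=3)),
    # Part 3: dimension reduction (project onto 2 principal components).
    ("reduce", PCA(n_components=2)),
])

X_reduced = pipeline.fit_transform(X, y)
print(X_reduced.shape)  # (150, 2)
```

Wrapping the steps in a `Pipeline` keeps each transformation fitted on the same data, which also makes it easier to reuse the whole chain inside cross-validation.
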
- 1 Data Preprocessing
- 1.1 Static Continuous Variables
- 1.1.1 Discretization
- 1.1.1.1 Binarization
- 1.1.1.2 Binning
- 1.1.2 Scaling
- 1.1.2.1 Standard Scaling (Z-score standardization)
- 1.1.2.2 MinMaxScaler (Scale to range)
- 1.1.2.3 RobustScaler (Anti-outliers scaling)
- 1.1.2.4 Power Transform (Non-linear transformation)
- 1.1.3 Normalization
- 1.1.4 Imputation of missing values
- 1.1.4.1 Univariate feature imputation
- 1.1.4.2 Multivariate feature imputation
- 1.1.4.3 Marking imputed values
- 1.1.5 Feature Transformation
- 1.1.5.1 Polynomial Transformation
- 1.1.5.2 Custom Transformation
- 1.2 Static Categorical Variables
- 1.2.1 Ordinal Encoding
- 1.2.2 One-hot Encoding
- 1.2.3 Hashing Encoding
- 1.2.4 Helmert Coding
- 1.2.5 Sum (Deviation) Coding
- 1.2.6 Target Encoding
- 1.2.7 M-estimate Encoding
- 1.2.8 James-Stein Encoder
- 1.2.9 Weight of Evidence Encoder
- 1.2.10 Leave One Out Encoder
- 1.2.11 CatBoost Encoder
- 1.3 Time Series Variables
- 1.3.1 Time Series Categorical Features
- 1.3.2 Time Series Continuous Features
- 1.3.3 Implementation
- 1.3.3.1 Create EntitySet
- 1.3.3.2 Set up cut-time
- 1.3.3.3 Auto Feature Engineering
- 2 Feature Selection
- 2.1 Filter Methods
- 2.1.1 Univariate Filter Methods
- 2.1.1.1 Variance Threshold
- 2.1.1.2 Pearson Correlation (regression problem)
- 2.1.1.3 Distance Correlation (regression problem)
- 2.1.1.4 F-Score (regression problem)
- 2.1.1.5 Mutual Information (regression problem)
- 2.1.1.6 Chi-squared Statistics (classification problem)
- 2.1.1.7 F-Score (classification problem)
- 2.1.1.8 Mutual Information (classification problem)
- 2.1.2 Multivariate Filter Methods
- 2.1.2.1 Max-Relevance Min-Redundancy (mRMR)
- 2.1.2.2 Correlation-based Feature Selection (CFS)
- 2.1.2.3 Fast Correlation-based Filter (FCBF)
- 2.1.2.4 ReliefF
- 2.1.2.5 Spectral Feature Selection (SPEC)
- 2.2 Wrapper Methods
- 2.2.1 Deterministic Algorithms
- 2.2.1.1 Recursive Feature Elimination (RFE)
- 2.2.2 Randomized Algorithms
- 2.2.2.1 Simulated Annealing (SA)
- 2.2.2.2 Genetic Algorithm (GA)
- 2.3 Embedded Methods
- 2.3.1 Regularization Based Methods
- 2.3.1.1 Lasso Regression (Linear Regression with L1 Norm)
- 2.3.1.2 Logistic Regression (with L1 Norm)
- 2.3.1.3 LinearSVR / LinearSVC
- 2.3.2 Tree Based Methods
- 3 Dimension Reduction
- 3.1 Unsupervised Methods
- 3.1.1 PCA (Principal Components Analysis)
- 3.2 Supervised Methods
- 3.2.1 LDA (Linear Discriminant Analysis)
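
The randomized wrapper methods listed under 2.2.2 treat feature selection as a search over feature subsets. The pure-Python function below is a minimal sketch of the simulated annealing idea, not the implementation shipped in this repo's scripts. At each step it flips one feature in or out of the subset, always accepts an improvement, and accepts a worse subset with a probability that shrinks as the temperature cools. The names `simulated_annealing_select` and `toy_score`, and all parameters and defaults, are hypothetical and chosen only for illustration. In practice `score_fn` would typically be a cross-validated model score.

```python
import math
import random


def simulated_annealing_select(n_features, score_fn, n_iter=200,
                               start_temp=1.0, cooling=0.98, seed=0):
    """Search for a feature subset that maximizes score_fn(mask).

    mask is a tuple of booleans, one per feature. All parameter names
    and defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    current = tuple(rng.random() < 0.5 for _ in range(n_features))
    current_score = score_fn(current)
    best, best_score = current, current_score
    temp = start_temp

    for _ in range(n_iter):
        # Propose a neighbor by flipping one randomly chosen feature.
        i = rng.randrange(n_features)
        candidate = current[:i] + (not current[i],) + current[i + 1:]
        candidate_score = score_fn(candidate)

        # Always accept improvements; accept worse subsets with a
        # probability that shrinks as the temperature cools.
        delta = candidate_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = candidate, candidate_score
            if current_score > best_score:
                best, best_score = current, current_score

        temp *= cooling  # geometric cooling schedule

    return best, best_score


def toy_score(mask):
    # Hypothetical objective: features 0 and 2 are useful, and every
    # other selected feature costs a small penalty.
    useful = {0, 2}
    return sum(1.0 if i in useful else -0.1
               for i, on in enumerate(mask) if on)


best_mask, best_score = simulated_annealing_select(6, toy_score)
selected = [i for i, on in enumerate(best_mask) if on]
print(selected, best_score)
```

Tracking the best subset separately from the current one means the search can wander through worse subsets without losing the best result seen so far.
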
References have been included in each Jupyter Notebook.
If there are any mistakes, please feel free to reach out and correct us!
Yingxiang Chen E-mail: [email protected]
Zihan Yang E-mail: [email protected]