Perfect multicollinearity #352

Open

AniekMarkus opened this issue Jan 9, 2023 · 1 comment

@AniekMarkus

Is your feature request related to a problem? Please describe.
LASSO may behave in an undesired way in the case of perfect multicollinearity between variables. If two variables x_A and x_B have a correlation equal to 1, LASSO can split the coefficient value arbitrarily between the two variables, leading to a less sparse model than necessary (i.e. with more variables selected). It may also make models harder to explain at a later stage (i.e. why are both variables included in a model if they carry the same information?).
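To make the failure mode concrete, here is a minimal sketch using glmnet as a stand-in LASSO solver (PLP fits its models with Cyclops, but the non-uniqueness comes from the LASSO objective itself, not from any particular solver). All data and names below are invented for the demo.

```r
library(glmnet)

set.seed(42)
n  <- 500
xA <- rbinom(n, 1, 0.3)
xB <- xA                                  # perfect copy: cor(xA, xB) == 1
xC <- rbinom(n, 1, 0.3)
y  <- rbinom(n, 1, plogis(-1 + 2 * xA + 0.5 * xC))

X   <- cbind(xA = xA, xB = xB, xC = xC)
fit <- glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: LASSO

# Because xA and xB are identical, any split of the weight between them with
# the same sum gives the same penalized likelihood: the optimum is not unique,
# and which split the solver returns (all on one copy, or spread over both) is
# an arbitrary implementation detail.
coef(fit, s = 0.01)
```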
 
This is a special case that might be common in PLP due to the data-driven way FeatureExtraction creates variables. Parents and children in the vocabulary hierarchy might be perfectly correlated (quite likely when a concept has few descendants; for me it occurred with groups based on more/less detailed ATC codes), as might the short/medium/long-term groups (less likely).
 
Describe the solution you'd like
It would be best if this problem were avoided altogether by detecting and removing these variables before model development. However, this requires an efficient way of finding perfect correlation in a large set of variables; one possible approach is sketched below.
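As a sketch of what such a check could look like (findDuplicateColumns is a hypothetical helper, not part of PLP or FeatureExtraction): for 0/1 covariates, a Pearson correlation of exactly 1 implies the two columns are identical, so perfect positive correlation reduces to finding exact duplicate columns. That can be done in one pass over the nonzero entries by hashing each column's nonzero row indices, avoiding a p × p correlation matrix entirely.

```r
library(Matrix)

# Hypothetical helper (not part of PLP or FeatureExtraction): group the columns
# of a sparse 0/1 matrix by their exact nonzero pattern. Runs in one pass over
# the nonzero entries instead of computing p^2 pairwise correlations.
findDuplicateColumns <- function(X) {
  X <- as(X, "CsparseMatrix")
  keys <- vapply(seq_len(ncol(X)), function(j) {
    from <- X@p[j] + 1L                   # dgCMatrix column pointers
    to   <- X@p[j + 1L]
    idx  <- if (from > to) integer(0) else X@i[from:to]
    digest::digest(idx)                   # hash the nonzero row indices
  }, character(1))
  groups <- split(seq_len(ncol(X)), keys)
  Filter(function(g) length(g) > 1, groups)  # only the duplicated groups
}
```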
 
Describe alternatives you've considered
Alternatively, we could:

  • Adjust the modelling strategy so that it avoids choosing both variables (e.g. by penalizing this).
  • Inspect the fitted model afterwards and adjust it by removing one of the variables and adding its coefficient to the other's (an efficient way of dealing with the problem, though not the prettiest solution); see the sketch after this list.

Additional context
RemoveRedundancy should deal with some of these issues (e.g. the short/medium/long-term groups), although it apparently doesn't cover everything.
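A rough sketch of that post-hoc fix, under the same assumptions as above (mergeDuplicateCoefficients, coefs and dupGroups are all hypothetical names):

```r
# Hypothetical names throughout: coefs is the fitted coefficient vector and
# dupGroups the output of findDuplicateColumns() above. For exactly duplicated
# columns, pooling the weight onto one representative leaves the linear
# predictor, and hence every prediction, unchanged.
mergeDuplicateCoefficients <- function(coefs, dupGroups) {
  for (group in dupGroups) {
    keep <- group[1]
    coefs[keep] <- sum(coefs[group])      # only the sum is identified
    coefs[group[-1]] <- 0                 # drop the redundant copies
  }
  coefs
}
```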
@egillax
Collaborator

egillax commented Jan 17, 2023

A little bit of data to map out this issue.

When extracting a target cohort on the IPCI database with the default covariate settings from FeatureExtraction, I get 12185 features, of which 48% are perfectly correlated with at least one other feature. When looking at high instead of perfect correlation (above 0.8), the percentage rises to 82%.

I checked this for various settings below:

| covariateSettings | Total features | Perfectly correlated with ≥1 other feature | Highly correlated (>0.8) with ≥1 other feature |
| --- | --- | --- | --- |
| default set from FE | 12185 | 48% | 82% |
| age, gender, conditions, procedures, and drug exposures in three time windows | 16828 | 17% | 50% |
| age, gender, conditions, procedures, and drug exposures in one time window | 5990 | 6.4% | 8.3% |
| age, gender and condition occurrence | 1191 | 20% | 23% |
| age, gender and condition era | 1209 | 21% | 23% |
| age, gender and condition group era | 2519 | 65% | 81% |
| age, gender and drug exposures | 4765 | 2.7% | 4.6% |
| age, gender and drug era | 1368 | 6.8% | 12.2% |
| age, gender and drug era group | 2306 | 31% | 57% |

A few notes about this.

  • The 20% perfect collinearity with only condition occurrence is a database-specific issue from our ETL process: some of our source terms are ambiguous, so they map to multiple standard concepts.
  • Adding time windows adds collinearity.
  • The group features are by far the most significant source of collinearity.
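The comment does not include the measurement code, but for reference, here is one way such shares could be computed (correlatedShare is a hypothetical name; this is not necessarily how the numbers in the table were produced). For binary columns the full correlation matrix follows from the co-occurrence counts crossprod(X), so no dense pass over the raw data is needed.

```r
library(Matrix)

# Assumption: X is the sparse 0/1 covariate matrix from FeatureExtraction.
# Note the p x p correlation matrix is dense (~1.2 GB for p = 12185), so at
# that size the computation would have to be done in blocks of columns.
correlatedShare <- function(X, threshold) {
  n  <- nrow(X)
  co <- as.matrix(crossprod(X)) / n       # E[x_a * x_b]
  m  <- Matrix::colSums(X) / n            # column means
  s  <- sqrt(m * (1 - m))                 # Bernoulli standard deviations
  r  <- (co - outer(m, m)) / outer(s, s)  # Pearson correlations
  r[!is.finite(r)] <- 0                   # constant columns have sd 0
  diag(r) <- 0
  mean(rowSums(abs(r) >= threshold) > 0)  # share with at least one partner
}

# correlatedShare(X, 1 - 1e-12)  # "perfect", allowing for floating point
# correlatedShare(X, 0.8)        # "high"
```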
