Perfect multicollinearity #352

Open

AniekMarkus opened this issue Jan 9, 2023 · 1 comment

@AniekMarkus

Is your feature request related to a problem? Please describe.
LASSO may behave in an undesired way in the case of perfect multicollinearity between variables. If two variables x_A and x_B have a correlation equal to 1, LASSO can split the coefficient value arbitrarily between the two variables, leading to a less sparse model than necessary (i.e. with more variables selected). It may also make models harder to explain at a later stage (i.e. why are both variables included in a model if they carry the same information?).
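To make the failure mode concrete, here is a minimal sketch using glmnet as a stand-in LASSO solver (PLP fits its models with Cyclops, but the non-uniqueness comes from the LASSO objective itself, not from any particular solver). All data and names below are invented for the demo.

```r
library(glmnet)

set.seed(42)
n  <- 500
xA <- rbinom(n, 1, 0.3)
xB <- xA                                  # perfect copy: cor(xA, xB) == 1
xC <- rbinom(n, 1, 0.3)
y  <- rbinom(n, 1, plogis(-1 + 2 * xA + 0.5 * xC))

X   <- cbind(xA = xA, xB = xB, xC = xC)
fit <- glmnet(X, y, family = "binomial", alpha = 1)  # alpha = 1: LASSO

# Because xA and xB are identical, any split of the weight between them with
# the same sum gives the same penalized likelihood: the optimum is not unique,
# and which split the solver returns (all on one copy, or spread over both) is
# an arbitrary implementation detail.
coef(fit, s = 0.01)
```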
 
This is a special case that might be common in PLP due to the data-driven way FeatureExtraction creates variables. Parents and children in the vocabulary hierarchy might be perfectly correlated (quite likely when a concept has few descendants; for me it occurred with groups based on more/less detailed ATC codes), as might the short/medium/long-term groups (less likely).
 
Describe the solution you'd like
It would be best if this problem were avoided altogether by detecting and removing these variables before model development. However, this requires an efficient way of finding perfect correlation in a large set of variables; one possible approach is sketched below.
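As a sketch of what such a check could look like (findDuplicateColumns is a hypothetical helper, not part of PLP or FeatureExtraction): for 0/1 covariates, a Pearson correlation of exactly 1 implies the two columns are identical, so perfect positive correlation reduces to finding exact duplicate columns. That can be done in one pass over the nonzero entries by hashing each column's nonzero row indices, avoiding a p × p correlation matrix entirely.

```r
library(Matrix)

# Hypothetical helper (not part of PLP or FeatureExtraction): group the columns
# of a sparse 0/1 matrix by their exact nonzero pattern. Runs in one pass over
# the nonzero entries instead of computing p^2 pairwise correlations.
findDuplicateColumns <- function(X) {
  X <- as(X, "CsparseMatrix")
  keys <- vapply(seq_len(ncol(X)), function(j) {
    from <- X@p[j] + 1L                   # dgCMatrix column pointers
    to   <- X@p[j + 1L]
    idx  <- if (from > to) integer(0) else X@i[from:to]
    digest::digest(idx)                   # hash the nonzero row indices
  }, character(1))
  groups <- split(seq_len(ncol(X)), keys)
  Filter(function(g) length(g) > 1, groups)  # only the duplicated groups
}
```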
 
Describe alternatives you've considered
Alternatively, we could:

  • Adjust the modelling strategy so that it avoids choosing both variables (e.g. by penalizing this).
  • Inspect the fitted model afterwards and adjust it by removing one of the variables and adding its coefficient to the other's (an efficient way of dealing with the problem, though not the prettiest solution); see the sketch after this list.

Additional context
RemoveRedundancy should deal with some of these issues (e.g. the short/medium/long-term groups), although it apparently doesn't cover everything.
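A rough sketch of that post-hoc fix, under the same assumptions as above (mergeDuplicateCoefficients, coefs and dupGroups are all hypothetical names):

```r
# Hypothetical names throughout: coefs is the fitted coefficient vector and
# dupGroups the output of findDuplicateColumns() above. For exactly duplicated
# columns, pooling the weight onto one representative leaves the linear
# predictor, and hence every prediction, unchanged.
mergeDuplicateCoefficients <- function(coefs, dupGroups) {
  for (group in dupGroups) {
    keep <- group[1]
    coefs[keep] <- sum(coefs[group])      # only the sum is identified
    coefs[group[-1]] <- 0                 # drop the redundant copies
  }
  coefs
}
```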
@egillax
Collaborator

egillax commented Jan 17, 2023

A little bit of data to map out this issue.

When extracting a target cohort on the IPCI database with the default covariate settings from FeatureExtraction, I get 12185 features, of which 48% are perfectly correlated with at least one other feature. When looking at high instead of perfect correlation (above 0.8), the percentage rises to 82%.

I checked this for various settings below:

| covariateSettings | Total features | Perfectly correlated with ≥1 other feature | Highly correlated (>0.8) with ≥1 other feature |
| --- | --- | --- | --- |
| default set from FE | 12185 | 48% | 82% |
| age, gender, conditions, procedures, and drug exposures in three time windows | 16828 | 17% | 50% |
| age, gender, conditions, procedures, and drug exposures in one time window | 5990 | 6.4% | 8.3% |
| age, gender and condition occurrence | 1191 | 20% | 23% |
| age, gender and condition era | 1209 | 21% | 23% |
| age, gender and condition group era | 2519 | 65% | 81% |
| age, gender and drug exposures | 4765 | 2.7% | 4.6% |
| age, gender and drug era | 1368 | 6.8% | 12.2% |
| age, gender and drug era group | 2306 | 31% | 57% |

A few notes about this.

  • The 20% perfect collinearity with only condition occurrence is a database-specific issue from our ETL process: some of our source terms are ambiguous, so they map to multiple standard concepts.
  • Adding time windows adds collinearity.
  • The group features are by far the most significant source of collinearity.
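The comment does not include the measurement code, but for reference, here is one way such shares could be computed (correlatedShare is a hypothetical name; this is not necessarily how the numbers in the table were produced). For binary columns the full correlation matrix follows from the co-occurrence counts crossprod(X), so no dense pass over the raw data is needed.

```r
library(Matrix)

# Assumption: X is the sparse 0/1 covariate matrix from FeatureExtraction.
# Note the p x p correlation matrix is dense (~1.2 GB for p = 12185), so at
# that size the computation would have to be done in blocks of columns.
correlatedShare <- function(X, threshold) {
  n  <- nrow(X)
  co <- as.matrix(crossprod(X)) / n       # E[x_a * x_b]
  m  <- Matrix::colSums(X) / n            # column means
  s  <- sqrt(m * (1 - m))                 # Bernoulli standard deviations
  r  <- (co - outer(m, m)) / outer(s, s)  # Pearson correlations
  r[!is.finite(r)] <- 0                   # constant columns have sd 0
  diag(r) <- 0
  mean(rowSums(abs(r) >= threshold) > 0)  # share with at least one partner
}

# correlatedShare(X, 1 - 1e-12)  # "perfect", allowing for floating point
# correlatedShare(X, 0.8)        # "high"
```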
