Is your feature request related to a problem? Please describe.
LASSO may have undesired behavior in case of perfect multicollinearity between variables. If two variables x_A and x_B have correlation equal to 1, LASSO can split the coefficient value in an arbitrary way between the two variables, leading to a less sparse model than necessary (i.e. more variables selected). It may also make models harder to explain later on (i.e. why are both variables included in the model if they carry the same information?).
This special case may be common in PLP due to the data-driven way FeatureExtraction creates variables. For example, parents and children in the concept hierarchy may be perfectly correlated (quite likely when there are few descendants; for me it occurred with groups based on more/less detailed ATC codes), as may short/medium/long-term groups (less likely).
Describe the solution you'd like
It would be best to avoid this problem by detecting and removing perfectly correlated variables before model development. However, this requires an efficient way of detecting perfect correlation in a large set of variables.
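One way such detection could be made tractable: perfect correlation (r = ±1) means one column is an affine transform of the other, so after standardizing, perfectly correlated columns become identical up to sign. Hashing the standardized columns then finds all groups in roughly O(n·p) instead of building a p×p correlation matrix. A minimal Python sketch (not part of FeatureExtraction or PLP; the function name is hypothetical, and it assumes a dense matrix):

```python
# Sketch: group perfectly correlated features by hashing standardized
# columns, avoiding an explicit p x p correlation matrix.
import numpy as np
from collections import defaultdict

def perfect_correlation_groups(X, decimals=10):
    """Group column indices of X that are perfectly (anti)correlated.

    X: (n, p) array. Constant columns are skipped (no defined correlation).
    Standardizing makes r = +/-1 pairs identical up to sign; fixing a
    canonical sign and rounding lets us bucket them with a plain hash.
    """
    X = np.asarray(X, dtype=float)
    buckets = defaultdict(list)
    for j in range(X.shape[1]):
        col = X[:, j]
        sd = col.std()
        if sd == 0:
            continue  # constant column
        z = (col - col.mean()) / sd
        if z[np.argmax(np.abs(z) > 1e-12)] < 0:
            z = -z  # canonical sign, so r = -1 pairs also collide
        buckets[np.round(z, decimals).tobytes()].append(j)
    return [g for g in buckets.values() if len(g) > 1]
```

The rounding step absorbs floating-point noise; for sparse covariate matrices (the typical PLP case) the same idea applies per nonzero pattern.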
Describe alternatives you've considered
Alternatively, we could:
- Adjust the modelling strategy to avoid choosing both variables (e.g. by penalizing this).
- Inspect the fitted model afterwards and adjust it by removing one variable and adding its coefficient to the other (an efficient way of dealing with the problem, though not the prettiest solution).
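The second alternative can be sketched as a post-processing step on the coefficient vector. This is a minimal illustration with hypothetical names, not PatientLevelPrediction's API, and it assumes the grouped features are identical (r = +1 on the same scale); r = -1 or rescaled pairs would need a sign/scale correction:

```python
# Sketch: collapse each group of perfectly correlated features onto one
# representative by summing coefficients, zeroing out the rest.
import numpy as np

def merge_correlated_coefficients(coef, groups):
    """coef: 1-D array of fitted coefficients.
    groups: lists of column indices that are identical columns.
    Returns a copy where each group's total weight sits on its first member.
    """
    merged = np.asarray(coef, dtype=float).copy()
    for g in groups:
        keep, *drop = g
        merged[keep] = merged[g].sum()  # identical columns: summing is equivalent
        merged[drop] = 0.0
    return merged
```

Because the columns are identical, the merged model makes exactly the same predictions as the fitted one, just with fewer nonzero coefficients.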
Additional context
RemoveRedundancy should deal with some of these issues (e.g. in short-medium-long), although it apparently doesn't cover everything.
When extracting features for a target cohort on the IPCI database with the default covariate settings from FeatureExtraction, I get 12185 features, of which 48% are perfectly correlated with at least one other feature. When looking at high instead of perfect correlation (above 0.8), the percentage rises to 82%.
I checked this for various settings below:
| covariateSettings | Total features | Ratio of features with perfect correlation to at least one other | Ratio of features with high correlation (>0.8) to at least one other |
| --- | --- | --- | --- |
| default set from FE | 12185 | 48% | 82% |
| age, gender, conditions, procedures, and drug exposures in three time windows | 16828 | 17% | 50% |
| age, gender, conditions, procedures, and drug exposures in one time window | 5990 | 6.4% | 8.3% |
| age, gender and condition occurrence | 1191 | 20% | 23% |
| age, gender and condition era | 1209 | 21% | 23% |
| age, gender and condition group era | 2519 | 65% | 81% |
| age, gender and drug exposures | 4765 | 2.7% | 4.6% |
| age, gender and drug era | 1368 | 6.8% | 12.2% |
| age, gender and drug era group | 2306 | 31% | 57% |
A few notes about this:
- The 20% perfect collinearity with only condition occurrence is a database-specific issue from the ETL process: some of our source terms are ambiguous, so they map to multiple standard concepts.
- Adding time windows adds collinearity.
- The group features are by far the largest source of collinearity.
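For reference, the ratios in the table above can be reproduced with a straightforward sketch (my naming; memory-hungry for p ≈ 12000, since it builds the full p×p correlation matrix). Note the "perfect" threshold should sit slightly below 1 to absorb floating-point error:

```python
# Sketch: fraction of features with |r| >= threshold to at least one
# other feature. Constant columns are dropped before computing r.
import numpy as np

def correlated_fraction(X, threshold):
    X = np.asarray(X, dtype=float)
    keep = X.std(axis=0) > 0           # correlation undefined for constants
    R = np.corrcoef(X[:, keep], rowvar=False)
    np.fill_diagonal(R, 0.0)           # ignore self-correlation
    hit = (np.abs(R) >= threshold).any(axis=1)
    return hit.mean()                  # fraction over non-constant features
```

For "perfect" correlation, something like `correlated_fraction(X, 1 - 1e-10)` is safer than an exact comparison with 1.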