
Support of NaN #142

Open
lionelkusch opened this issue Jan 23, 2025 · 11 comments
Labels
method implementation (Question regarding methods implementations), test (Question link to tests)

Comments

@lionelkusch
Collaborator

We will need to work out how to handle NaN in the methods and to add the associated tests.

lionelkusch added the method implementation and test labels on Jan 23, 2025
@jpaillard
Collaborator

For LOCO and PermutationImportance, I think that the handling of NaN values can be left to the sklearn pipeline (using a model that supports them natively, or including an imputation step). However, for CPI, I think we need further work inside the library, since not only the main predictive model but also the estimator of the conditional distribution, $p(X^j | X^{-j})$, should support NaNs.

Not sure, but this might also apply to knockoffs?
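The pipeline-based option for LOCO/PermutationImportance can be sketched as follows. This is a minimal illustration, not hidimstat code: the synthetic data and the choice of `LassoCV` are assumptions for demonstration only.

```python
# Sketch: deferring NaN handling to an sklearn pipeline. An imputation step
# upstream makes a NaN-intolerant estimator (here Lasso) usable on data
# with missing values, so importance methods can consume the pipeline as-is.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

# The whole pipeline tolerates NaN even though Lasso alone does not.
model = make_pipeline(SimpleImputer(strategy="mean"), LassoCV(cv=3))
model.fit(X, y)
print(model.score(X, y))
```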

@lionelkusch
Collaborator Author

Do you know whether Lasso and RandomForest can natively handle NaN values?
I ask because dcrt is based on these estimators.

@jpaillard
Collaborator

RandomForest yes, Lasso no

@bthirion
Contributor

For LOCO and PermutationImportance, I think that the handling of NaN values can be left to the sklearn pipeline (using a model that supports them natively, or including an imputation step). However, for CPI, I think we need further work inside the library, since not only the main predictive model but also the estimator of the conditional distribution, $p(X^j | X^{-j})$, should support NaNs.

Not sure, but this might also apply to knockoffs?

But in most cases, the end estimator will be an sklearn one. If we can defer the handling of NaN to it, it would be ideal.

@lionelkusch
Collaborator Author

The problem is not the estimator but the method itself.
In the case of CPI, the imputation model needs to deal with NaN, which is not trivial, and for knockoffs, the estimation of the distribution requires NaN support.
Moreover, in dcrt0, we use Lasso to distill the variable of interest, and Lasso doesn't support NaN.
Consequently, we need to decide on a general policy for dealing with NaN (not supported, partially supported, or fully supported).

@bthirion
Contributor

My best suggestion is to take a look at https://skrub-data.org, which is an sklearn extension for this type of data.
In the first instance, I think it is acceptable not to support NaNs. Why not defer to the user to perform cleaning of the data prior to importance analysis?
That is, the user would implement one of the following strategies:

  • ignore features with NaN values
  • ignore samples with NaN values
  • handle NaNs as a special value (makes sense for categorical data only)
  • impute NaN values
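The four strategies can be sketched in a few lines of pandas. This is a generic illustration of the user-side options, not library code; the column names and fill values are purely for demonstration.

```python
# Four user-side strategies for cleaning NaNs before importance analysis.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
    "cat": ["a", "b", None, "a"],
})

# 1) ignore features (columns) that contain NaN values
drop_cols = df.dropna(axis=1)

# 2) ignore samples (rows) that contain NaN values
drop_rows = df.dropna(axis=0)

# 3) handle NaN as a special value (categorical data only)
special = df["cat"].fillna("missing")

# 4) impute NaN values (here: column mean for numeric features)
imputed = df[["x1", "x2"]].fillna(df[["x1", "x2"]].mean())

print(drop_cols.shape, drop_rows.shape, special.tolist())
```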

@lionelkusch
Collaborator Author

That was my suggestion to @jpaillard, but he told me that NaNs can carry information in themselves for the estimation, so it is important to keep them.

@bthirion
Contributor

Only when you work with categorical variables, AFAIK.

@jpaillard
Collaborator

I would argue for the last two options: for categorical data, treat NaNs as a special value, and for continuous data, impute NaN values.

Why not defer to the user to perform cleaning on the data prior to importance analysis?

The user might want to use a main predictive model/pipeline that handles NaN values (e.g., RandomForestRegressor) but a conditional estimator, $\nu_j(X^{-j}) = p(X^j | X^{-j})$, that does not (e.g., RidgeCV).
While I agree that, for the main predictive model, the strategy for handling NaNs should be the user's responsibility, I think it makes sense to provide a default and customizable strategy for handling NaNs in the conditional sampling step.

@bthirion
Contributor

But then, shouldn't the problem rather be addressed from the beginning, before using hidimstat?

@jpaillard
Collaborator

I was specifically thinking about the case where the predictive model is HistGradientBoosting, for which NaN values are handled even for continuous variables.

This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain.

This may be too specific, and we should leave the management of NaNs to the user.

Projects: None yet
Development: No branches or pull requests
3 participants