
Support of NaN #142

Open
lionelkusch opened this issue Jan 23, 2025 · 11 comments
Labels
method implementation (Question regarding methods implementations), test (Question link to tests)

Comments

@lionelkusch
Collaborator

We will need to work out how to handle NaN in the methods and to add the associated tests.

lionelkusch added the method implementation and test labels on Jan 23, 2025
@jpaillard
Collaborator

For LOCO and PermutationImportance, I think that the handling of NaN values can be left to the sklearn pipeline (using a model that supports them natively, or including an imputation step). However, for CPI, I think we need further work inside the library, since not only the main predictive model but also the estimator of the conditional distribution, $p(X^j | X^{-j})$, should support NaNs.

Not sure, but this might also apply to knockoffs?
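The pipeline-based option for LOCO/PermutationImportance can be sketched as follows. This is a minimal illustration, not hidimstat code: the synthetic data and the choice of `LassoCV` are assumptions for demonstration only.

```python
# Sketch: deferring NaN handling to an sklearn pipeline. An imputation step
# upstream makes a NaN-intolerant estimator (here Lasso) usable on data
# with missing values, so importance methods can consume the pipeline as-is.
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X[:, 0] + rng.normal(scale=0.1, size=100)
X[rng.random(X.shape) < 0.1] = np.nan  # inject ~10% missing values

# The whole pipeline tolerates NaN even though Lasso alone does not.
model = make_pipeline(SimpleImputer(strategy="mean"), LassoCV(cv=3))
model.fit(X, y)
print(model.score(X, y))
```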

@lionelkusch
Collaborator Author

Do you know whether Lasso and RandomForest can natively handle NaN values?
I ask because dcrt is based on these estimators.

@jpaillard
Collaborator

RandomForest yes, Lasso no

@bthirion
Contributor

For LOCO and PermutationImportance, I think that the handling of NaN values can be left to the sklearn pipeline (using a model that supports them natively, or including an imputation step). However, for CPI, I think we need further work inside the library, since not only the main predictive model but also the estimator of the conditional distribution, $p(X^j | X^{-j})$, should support NaNs.

Not sure, but this might also apply to knockoffs?

But in most cases, the end estimator will be an sklearn one. If we can defer the handling of NaN to it, it would be ideal.

@lionelkusch
Collaborator Author

The problem is not the estimator but the method itself.
In the case of CPI, the imputation model needs to deal with NaN, which is not trivial, and for knockoffs, the estimation of the distribution requires NaN support.
Moreover, in dcrt0, we use Lasso to distill the variable of interest, and Lasso doesn't support NaN.
Consequently, we need to decide on a general policy for dealing with NaN (not supported, partially supported, or fully supported).

@bthirion
Contributor

My best suggestion is to take a look at https://skrub-data.org, which is an sklearn extension for this type of data.
In the first instance, I think it is acceptable not to support NaNs. Why not defer to the user to perform cleaning of the data prior to importance analysis?
That is, the user would implement one of the following strategies:

  • ignore features with NaN values
  • ignore samples with NaN values
  • handle NaNs as a special value (makes sense for categorical data only)
  • impute NaN values
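The four strategies can be sketched in a few lines of pandas. This is a generic illustration of the user-side options, not library code; the column names and fill values are purely for demonstration.

```python
# Four user-side strategies for cleaning NaNs before importance analysis.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x1": [1.0, np.nan, 3.0, 4.0],
    "x2": [10.0, 20.0, 30.0, 40.0],
    "cat": ["a", "b", None, "a"],
})

# 1) ignore features (columns) that contain NaN values
drop_cols = df.dropna(axis=1)

# 2) ignore samples (rows) that contain NaN values
drop_rows = df.dropna(axis=0)

# 3) handle NaN as a special value (categorical data only)
special = df["cat"].fillna("missing")

# 4) impute NaN values (here: column mean for numeric features)
imputed = df[["x1", "x2"]].fillna(df[["x1", "x2"]].mean())

print(drop_cols.shape, drop_rows.shape, special.tolist())
```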

@lionelkusch
Collaborator Author

That was my suggestion to @jpaillard, but he told me that NaNs can carry information in themselves for the estimation, so it is important to keep them.

@bthirion
Contributor

Only when you work with categorical variables, AFAIK.

@jpaillard
Collaborator

I would argue for the last two options: for categorical data, treat NaNs as a special value, and for continuous data, impute NaN values.

Why not defer to the user to perform cleaning on the data prior to importance analysis?

The user might want to use a main predictive model/pipeline that handles NaN values (e.g., RandomForestRegressor) but a conditional estimator, $\nu_j(X^{-j}) = p(X^j | X^{-j})$, that does not (e.g., RidgeCV).
While I agree that, for the main predictive model, the strategy for handling NaNs should be the user's responsibility, I think it makes sense to provide a default and customizable strategy for handling NaNs in the conditional sampling step.

@bthirion
Contributor

But then, shouldn't the problem rather be addressed from the beginning, before using hidimstat?

@jpaillard
Collaborator

I was specifically thinking about the case where the predictive model is HistGradientBoosting, for which NaN values are handled even for continuous variables.

This estimator has native support for missing values (NaNs). During training, the tree grower learns at each split point whether samples with missing values should go to the left or right child, based on the potential gain.

This may be too specific, and we should leave the management of NaNs to the user.

Projects: None yet
Development: No branches or pull requests
3 participants