Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Knockoffs(1/4): add comments and docstring of the functions #128

Open
wants to merge 49 commits into
base: main
Choose a base branch
from

Conversation

lionelkusch
Copy link
Collaborator

I aggregated most of the file for knockoff in one file.
I separate the main method from the computation of p_value and e_value and associate tests.

@lionelkusch lionelkusch requested review from jpaillard and removed request for jpaillard January 15, 2025 10:40
model_x_knockoff_filter
model_x_knockoff_pvalue
model_x_knockoff_bootstrap_quantile
model_x_knockoff_bootstrap_e_value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all these functions meant to be public ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they should be public.

@lionelkusch lionelkusch removed the request for review from jpaillard January 17, 2025 13:25
@lionelkusch
Copy link
Collaborator Author

I merge the function model_x_knockoff and model_x_knockoff_aggregation together for having a unique function.

@bthirion
Copy link
Contributor

LMK when you want another review

return test_scores[0]
else:
return test_scores

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really sure if this function should be this "compacted". In this function we are both creating the knockoffs variables and applying the knockoffs statistic. Maybe it could be interesting to have them separate to be able to have more freedom (maybe we want to have knockoffs covariates not computed with the covariance matrix estimation or we want knockoffs statistic without a model like lasso). In the same line, an output like this one is not directly interpretable and the model_x_knockoff_filter is always applied after this. Why compact in the same function the knockoff variable, statistic by not the filter?

Copy link
Collaborator Author

@lionelkusch lionelkusch Feb 5, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, there is only one way to compute the knockoff statistic. I didn't see the reason for cutting the function more.
Additionally, my other refactoring is more or based on the same "separation" of the function:

  • the first part of getting some data or test
  • second part evaluating the p-value from the previous result

Moreover, I was thinking in this case, the test_score was enough for getting the importance variables. Mainly because, for computing the pvalue, there were multiple ways to use this quantity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To fulfill the knockoffs statistic condition for Candès et al. 2018 it is not mandatory to be of that type. It is true that they are usually lasso or logistic regression, but there may be others. Similarly for the knockoff variable construction, there are other ways of generating them, not forcely with the Covariance matrix. Therefore, I think that they should be better separated, even though they are the standard approaches.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's quite easy to transform the function _estimate_distribution, _stat_coefficient_diff, gaussian_knockoff_generation into argument.
Due to we don't have yet an example where is required this separation, I prefer to stay with this implementation.

What do you think @bthirion ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather keep the full outputs. I don't have a good reason to cut the APIs. It is easy for the user to not consider the outputs he/shi is not interested in.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by keeping the full output?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

selected, test_score, thresh, X_tilde

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I push the modification for a better separation of the API

return selected
else:
return selected, pvals

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is also a 'filter', not only computes the p-values. Shouldn't it be more explicit in the function name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by 'filters'?

The name is coming to be homogeneous with the other functions and because there is not another option actually to "filter".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the same way that you have called the previous function 'model_x_knockoff_filter' because it selects the important covariates, the other aggregated functions and this one, are also filters because they return also the selected covariates

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The difference is that the 'model_x_knockoff_filter' is based on a 'filter' proposal in the original paper and the other methods are based on 'empirical_p_value'. For me, this makes the difference.
This is the cause of the name.
I should perhaps change the name of 'model_x_knockoff_filter' to avoid confusion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the function model_x_knockoff_filter.
Do @AngelReyero think that I should still rename the other functions?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you remove the function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't really remove it.
I merged this function with model_x_knockoff, following the comments.
Did I miss something?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case I think it is consistent.

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost ready to merge now.

Sigma : 2D ndarray (n_features, n_features)
Covariance matrix
mu_tilde : 2D ndarray (n_samples, n_features)
The matrix of means used to generate the knockoff variables, returned by gaussian_knockoff_generation.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the feeling that you make long lines (> 80 characters), which makes reading more difficult.
Can be addressed in a later PR though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Black doesn't consider the size of the comment. There will be better formatting once Ruff will be used.

Copy link
Contributor

@bthirion bthirion left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thx.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants