-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Knockoffs(1/4): add comments and docstring of the functions #128
base: main
Are you sure you want to change the base?
Conversation
model_x_knockoff_filter | ||
model_x_knockoff_pvalue | ||
model_x_knockoff_bootstrap_quantile | ||
model_x_knockoff_bootstrap_e_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are all these functions meant to be public ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, they should be public.
Small modification Co-authored-by: bthirion <[email protected]>
I merge the function model_x_knockoff and model_x_knockoff_aggregation together for having a unique function. |
LMK when you want another review |
src/hidimstat/knockoffs.py
Outdated
return test_scores[0] | ||
else: | ||
return test_scores | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really sure if this function should be this "compacted". In this function we are both creating the knockoffs variables and applying the knockoffs statistic. Maybe it could be interesting to have them separate to be able to have more freedom (maybe we want to have knockoffs covariates not computed with the covariance matrix estimation or we want knockoffs statistic without a model like lasso). In the same line, an output like this one is not directly interpretable and the model_x_knockoff_filter is always applied after this. Why compact in the same function the knockoff variable, statistic by not the filter?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, there is only one way to compute the knockoff statistic. I didn't see the reason for cutting the function more.
Additionally, my other refactoring is more or based on the same "separation" of the function:
- the first part of getting some data or test
- second part evaluating the p-value from the previous result
Moreover, I was thinking in this case, the test_score was enough for getting the importance variables. Mainly because, for computing the pvalue, there were multiple ways to use this quantity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To fulfill the knockoffs statistic condition for Candès et al. 2018 it is not mandatory to be of that type. It is true that they are usually lasso or logistic regression, but there may be others. Similarly for the knockoff variable construction, there are other ways of generating them, not forcely with the Covariance matrix. Therefore, I think that they should be better separated, even though they are the standard approaches.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's quite easy to transform the function _estimate_distribution, _stat_coefficient_diff, gaussian_knockoff_generation into argument.
Due to we don't have yet an example where is required this separation, I prefer to stay with this implementation.
What do you think @bthirion ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather keep the full outputs. I don't have a good reason to cut the APIs. It is easy for the user to not consider the outputs he/shi is not interested in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by keeping the full output?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
selected, test_score, thresh, X_tilde
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I push the modification for a better separation of the API
return selected | ||
else: | ||
return selected, pvals | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is also a 'filter', not only computes the p-values. Shouldn't it be more explicit in the function name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you mean by 'filters'?
The name is coming to be homogeneous with the other functions and because there is not another option actually to "filter".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the same way that you have called the previous function 'model_x_knockoff_filter' because it selects the important covariates, the other aggregated functions and this one, are also filters because they return also the selected covariates
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The difference is that the 'model_x_knockoff_filter' is based on a 'filter' proposal in the original paper and the other methods are based on 'empirical_p_value'. For me, this makes the difference.
This is the cause of the name.
I should perhaps change the name of 'model_x_knockoff_filter' to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed the function model_x_knockoff_filter
.
Do @AngelReyero think that I should still rename the other functions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you remove the function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really remove it.
I merged this function with model_x_knockoff, following the comments.
Did I miss something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case I think it is consistent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost ready to merge now.
src/hidimstat/gaussian_knockoff.py
Outdated
Sigma : 2D ndarray (n_features, n_features) | ||
Covariance matrix | ||
mu_tilde : 2D ndarray (n_samples, n_features) | ||
The matrix of means used to generate the knockoff variables, returned by gaussian_knockoff_generation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the feeling that you make long lines (> 80 characters), which makes reading more difficult.
Can be addressed in a later PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Black doesn't consider the size of the comment. There will be better formatting once Ruff will be used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thx.
I aggregated most of the file for knockoff in one file.
I separate the main method from the computation of p_value and e_value and associate tests.