-
-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse Implementation of build_coleman_forest
plus a fix for bottleneck import
#317
Sparse Implementation of build_coleman_forest
plus a fix for bottleneck import
#317
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #317 +/- ##
==========================================
+ Coverage 80.00% 80.27% +0.26%
==========================================
Files 24 24
Lines 2221 2296 +75
Branches 411 422 +11
==========================================
+ Hits 1777 1843 +66
- Misses 312 317 +5
- Partials 132 136 +4 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool! Thanks for the profiling data and documentation. It's very convincing.
Left a few comments to tighten up the PR
Co-authored-by: Adam Li <[email protected]>
@adam2392 I have updated the PR, let me know if I missed anything or if you want to see some more changes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly lgtm. Just some small nits
@adam2392 I added some documentation to address your comments. Let me know what you think. |
I will let @sampan501 @SUKI-O @YuxinB @PSSF23 take a look as well. |
@@ -316,3 +329,209 @@ def get_per_tree_oob_samples(est: BaseForest): | |||
) | |||
oob_samples.append(unsampled_indices) | |||
return oob_samples | |||
|
|||
|
|||
def _get_forest_preds_sparse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe I missed it somewhere, but are there unit tests for these methods anywhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sampan501, @adam2392 No there isn't a test for this function in test_utils.py. This function, _get_forest_preds_sparse
, is a good candidate for a unit test though. I can write one up.
I worry the other functions: _parallel_build_null_forests_sparse
and _compute_null_distribution_coleman_sparse
, like their dense counterparts, do too much to be easily unit tested and are tested via the integration tests. However, if you would like, we can refactor the dense/sparse functions into smaller units so that they can all have tests written. Would probably be cleaner, but would take time.
What would you like to see?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, I think it's fine the way it is. If we find means of re-designing the internal API to make things simpler and easier to test, then it's warranted, but I don't see a clear path. Do you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would take some thought. There is some repetition we might be able to leverage, but I am not sure if it would come at the cost of clarity. So just write up a test for _get_forest_preds_sparse
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think that's sufficient then for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sounds good. Maybe we can make an issue for having such unit tests/re-designing the API so we don't forget. Otherwise, I think this PR lgtm
@sampan501 @adam2392 Hey I added a test, but it looks like I some other tests are failing. I submitted an issue, #328. I'll start looking into it, but it might go faster if you have an intuition on what the issue is. |
This is gonna be an issue due to parameter validation in scikit-learn. I will push a fix when I have time. It seems the rest of the code works tho. I would prioritize other issues until I have the time to push up the fix |
@adam2392 sounds good. Should we let this sit and update and merge after you have a fix for the other issue or merge as is? @sampan501 |
I'm happy to merge if @adam2392 is cool with it |
I will merge once the fix is added and the CI is confirmed to be happy. No rush on this one. |
@ryanhausen if you update wrt |
Thanks @ryanhausen ! |
Reference Issues/PRs
Fixes #310
What does this implement/fix? Explain your changes.
This adds a sparse implementation of
build_coleman_forest
totreeple.stats
. The sparse implementation relies onscipy.stats.sparse
and so only works in the case of binary classification and regression. In my tests usingcProfile
/memray
, the sparse implementation is a little over 50% faster and uses about %7 less memory, see pdf for more information. I am also attaching the profiling dataif you'd like to take a look.
This PR also includres a change to the bottleneck implementation to for the dense implementation of
build_coleman_forest
. Specifically there are two important changes:importlib.util.find_spec("bottleneck")
. The old check usingsys
seemed to work fine and passed tests, but actually didn't seem work unless bottleneck had been imported before, which when the tests are run is true, but may not be the case in other situationsnanmean_f
andanynan_f
are now defined usingdef
so that the warning message can moved into their respective calls rather than at import.Any other comments?
The scipy.stats sparse implementation should be swapped out for pydata.sparse when that implementation is performant enough, as it is more general.