-
Notifications
You must be signed in to change notification settings - Fork 87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use large workspace resource for balanced kmeans #659
base: branch-25.02
Are you sure you want to change the base?
Use large workspace resource for balanced kmeans #659
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @csadorf for the PR!
I do not think this is the right fix. Just a few lines below your change, we calculate a minibatch size. We process the data in batches, therefore this should not go OOM.
I think the workspace_allocator
is conceptually right allocator to use. But we should make sure that calc_minibatch_size
considers the available workspace memory and calculates memory usage correctly.
As a workaround this PR is fine. I do not think there would be a perf impact. But if we decide to go ahead with this, then please create an issue to revise this in next release.
Do you have a reproducer for the problem?
@tfeher Thank you for your comments. The issue is that we do not only use this workspace for allocations related to the For very large datasets, the user must either increase the number of (mesoscale) clusters or use managed memory. However, in the current implementation, even using managed memory will not suffice since the current workspace would run into the resource_allocator limitation at total vram / 4. Increasing the number of clusters commensurate with the dataset size is generally advisable, but I believe that even if we do not generally remove the resource limiter on the workspace, we should at least remove it specifically for the
The reproducer is in rapidsai/cuml#6204 . |
Switch to using
get_large_workspace_resource
instead ofget_workspace_resource
in balanced KMeans code to ensure that we can run it on very large datasets. Previously, large datasets would run intolimited_adaptor_resource_limitation
errors.This change may lead to a performance regression!
Closes #682 .