Use large workspace resource for balanced kmeans #659

Open · wants to merge 2 commits into base: branch-25.02

Conversation

@csadorf commented Feb 5, 2025

Switch to using get_large_workspace_resource instead of get_workspace_resource in the balanced KMeans code to ensure that it can run on very large datasets. Previously, large datasets would run into allocation-limit errors from the workspace's limiting resource adaptor.
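For illustration, a minimal sketch of the change, assuming the RAFT resource API (`raft::resource::get_workspace_resource` / `get_large_workspace_resource`); the surrounding function is hypothetical and the actual call site in the balanced KMeans code may differ:

```cpp
// Sketch only: the surrounding function is an assumption, not the actual
// cuVS call site; the resource getters are from the RAFT resource API.
#include <raft/core/resource/device_memory_resource.hpp>
#include <raft/core/resources.hpp>

void build_clusters_example(raft::resources const& res)
{
  // Before: the default workspace resource, which is wrapped in a limiting
  // adaptor and rejects allocations beyond its cap (about 1/4 of total VRAM,
  // per the discussion below).
  // auto device_memory = raft::resource::get_workspace_resource(res);

  // After: the large-workspace resource, intended for large scratch buffers
  // and not subject to the same cap.
  auto device_memory = raft::resource::get_large_workspace_resource(res);
  (void)device_memory;  // ...passed on to the clustering routines as before...
}
```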

This change may lead to a performance regression!

Closes #682.

copy-pr-bot commented Feb 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@cjnolet added labels improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) Feb 5, 2025
@cjnolet cjnolet marked this pull request as ready for review February 5, 2025 21:11
@cjnolet cjnolet requested a review from a team as a code owner February 5, 2025 21:11
@csadorf csadorf changed the title [WIP] Use large workspace resource for balanced kmeans Use large workspace resource for balanced kmeans Feb 5, 2025
@tfeher (Contributor) left a comment

Thanks @csadorf for the PR!

I do not think this is the right fix. Just a few lines below your change, we calculate a minibatch size. We process the data in batches; therefore, this should not go OOM.

I think the workspace_allocator is conceptually the right allocator to use. But we should make sure that calc_minibatch_size considers the available workspace memory and calculates memory usage correctly.

As a workaround this PR is fine. I do not think there would be a perf impact. But if we decide to go ahead with this, then please create an issue to revise this in the next release.
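For concreteness, a hedged sketch of what such a guard might look like; the signature and the per-row cost model below are assumptions for illustration, not the actual cuVS calc_minibatch_size:

```cpp
// Hedged sketch only: the signature and the per-row cost model are
// assumptions, not the actual cuVS implementation.
#include <algorithm>
#include <cstddef>

std::size_t calc_minibatch_size(std::size_t n_rows,     // rows in the dataset
                                std::size_t dim,        // columns per row
                                std::size_t elem_size,  // bytes per element
                                std::size_t workspace_free_bytes)  // workspace budget
{
  // Approximate per-row scratch cost for one minibatch:
  // a dataset slice plus an integer label per row.
  const std::size_t bytes_per_row = dim * elem_size + sizeof(int);

  // Cap the minibatch so a single batch never exceeds the workspace budget.
  const std::size_t max_rows = workspace_free_bytes / bytes_per_row;
  return std::min(n_rows, std::max(max_rows, std::size_t{1}));
}
```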

Do you have a reproducer for the problem?

@csadorf (Author) commented Feb 10, 2025

> Thanks @csadorf for the PR!
>
> I do not think this is the right fix. Just a few lines below your change, we calculate a minibatch size. We process the data in batches; therefore, this should not go OOM.
>
> I think the workspace_allocator is conceptually the right allocator to use. But we should make sure that calc_minibatch_size considers the available workspace memory and calculates memory usage correctly.
>
> As a workaround this PR is fine. I do not think there would be a perf impact. But if we decide to go ahead with this, then please create an issue to revise this in the next release.

@tfeher Thank you for your comments. The issue is that this workspace is not used only for allocations sized by the minibatch_size: it is also passed (as device_memory) into the build_fine_clusters() function, which allocates mc_trainset_buf [mesocluster_size_max x dim], a buffer on the order of dataset size / n_clusters.

For very large datasets, the user must either increase the number of (meso)clusters or use managed memory. However, in the current implementation even managed memory does not suffice, because the workspace still hits the resource adaptor's limit at total VRAM / 4.

Increasing the number of clusters commensurate with the dataset size is generally advisable, but even if we do not remove the resource limiter on the workspace entirely, we should at least remove it specifically for the mc_trainset_buf allocation, since there is no expectation that this buffer is on the order of minibatch_size.
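To make the scaling argument concrete, a back-of-the-envelope sketch; the helper and its signature are illustrative assumptions, and only the names mc_trainset_buf and mesocluster_size_max come from the code under discussion:

```cpp
// Hypothetical helper for illustration; only mc_trainset_buf and
// mesocluster_size_max come from the code under discussion.
#include <cstddef>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

template <typename T>
rmm::device_uvector<T> alloc_mc_trainset_buf(std::size_t mesocluster_size_max,
                                             std::size_t dim,
                                             rmm::cuda_stream_view stream,
                                             rmm::mr::device_memory_resource* device_memory)
{
  // With roughly balanced mesoclusters, mesocluster_size_max is on the order
  // of n_rows / n_clusters, so this buffer scales with the dataset size, not
  // with the minibatch size; hence it can exceed a workspace limit that was
  // sized with minibatch-scale allocations in mind.
  return rmm::device_uvector<T>(mesocluster_size_max * dim, stream, device_memory);
}
```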

> Do you have a reproducer for the problem?

The reproducer is in rapidsai/cuml#6204.

Labels: cpp, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
Project status: In Progress
3 participants