Use large workspace resource for balanced kmeans #659

Open · wants to merge 2 commits into base: branch-25.02

Conversation

@csadorf commented Feb 5, 2025

Switch to using get_large_workspace_resource instead of get_workspace_resource in the balanced KMeans code to ensure that it can run on very large datasets. Previously, large datasets would run into allocation-limit errors from the workspace's limiting resource adaptor.
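For illustration, a minimal sketch of the change, assuming the RAFT resource API (`raft::resource::get_workspace_resource` / `get_large_workspace_resource`); the surrounding function is hypothetical and the actual call site in the balanced KMeans code may differ:

```cpp
// Sketch only: the surrounding function is an assumption, not the actual
// cuVS call site; the resource getters are from the RAFT resource API.
#include <raft/core/resource/device_memory_resource.hpp>
#include <raft/core/resources.hpp>

void build_clusters_example(raft::resources const& res)
{
  // Before: the default workspace resource, which is wrapped in a limiting
  // adaptor and rejects allocations beyond its cap (about 1/4 of total VRAM,
  // per the discussion below).
  // auto device_memory = raft::resource::get_workspace_resource(res);

  // After: the large-workspace resource, intended for large scratch buffers
  // and not subject to the same cap.
  auto device_memory = raft::resource::get_large_workspace_resource(res);
  (void)device_memory;  // ...passed on to the clustering routines as before...
}
```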

This change may lead to a performance regression!

Closes #682.

copy-pr-bot commented Feb 5, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

@cjnolet added labels improvement (Improves an existing functionality) and non-breaking (Introduces a non-breaking change) Feb 5, 2025
@cjnolet cjnolet marked this pull request as ready for review February 5, 2025 21:11
@cjnolet cjnolet requested a review from a team as a code owner February 5, 2025 21:11
@csadorf csadorf changed the title [WIP] Use large workspace resource for balanced kmeans Use large workspace resource for balanced kmeans Feb 5, 2025
@tfeher (Contributor) left a comment

Thanks @csadorf for the PR!

I do not think this is the right fix. Just a few lines below your change, we calculate a minibatch size. We process the data in batches; therefore, this should not go OOM.

I think the workspace_allocator is conceptually the right allocator to use. But we should make sure that calc_minibatch_size considers the available workspace memory and calculates memory usage correctly.

As a workaround this PR is fine. I do not think there would be a perf impact. But if we decide to go ahead with this, then please create an issue to revise this in the next release.
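For concreteness, a hedged sketch of what such a guard might look like; the signature and the per-row cost model below are assumptions for illustration, not the actual cuVS calc_minibatch_size:

```cpp
// Hedged sketch only: the signature and the per-row cost model are
// assumptions, not the actual cuVS implementation.
#include <algorithm>
#include <cstddef>

std::size_t calc_minibatch_size(std::size_t n_rows,     // rows in the dataset
                                std::size_t dim,        // columns per row
                                std::size_t elem_size,  // bytes per element
                                std::size_t workspace_free_bytes)  // workspace budget
{
  // Approximate per-row scratch cost for one minibatch:
  // a dataset slice plus an integer label per row.
  const std::size_t bytes_per_row = dim * elem_size + sizeof(int);

  // Cap the minibatch so a single batch never exceeds the workspace budget.
  const std::size_t max_rows = workspace_free_bytes / bytes_per_row;
  return std::min(n_rows, std::max(max_rows, std::size_t{1}));
}
```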

Do you have a reproducer for the problem?

@csadorf (Author) commented Feb 10, 2025

> Thanks @csadorf for the PR!
>
> I do not think this is the right fix. Just a few lines below your change, we calculate a minibatch size. We process the data in batches; therefore, this should not go OOM.
>
> I think the workspace_allocator is conceptually the right allocator to use. But we should make sure that calc_minibatch_size considers the available workspace memory and calculates memory usage correctly.
>
> As a workaround this PR is fine. I do not think there would be a perf impact. But if we decide to go ahead with this, then please create an issue to revise this in the next release.

@tfeher Thank you for your comments. The issue is that this workspace is not used only for allocations sized by the minibatch_size: it is also passed (as device_memory) into the build_fine_clusters() function, which allocates mc_trainset_buf [mesocluster_size_max x dim], a buffer on the order of dataset size / n_clusters.

For very large datasets, the user must either increase the number of (meso)clusters or use managed memory. However, in the current implementation even managed memory does not suffice, because the workspace still hits the resource adaptor's limit at total VRAM / 4.

Increasing the number of clusters commensurate with the dataset size is generally advisable, but even if we do not remove the resource limiter on the workspace entirely, we should at least remove it specifically for the mc_trainset_buf allocation, since there is no expectation that this buffer is on the order of minibatch_size.
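To make the scaling argument concrete, a back-of-the-envelope sketch; the helper and its signature are illustrative assumptions, and only the names mc_trainset_buf and mesocluster_size_max come from the code under discussion:

```cpp
// Hypothetical helper for illustration; only mc_trainset_buf and
// mesocluster_size_max come from the code under discussion.
#include <cstddef>
#include <rmm/cuda_stream_view.hpp>
#include <rmm/device_uvector.hpp>
#include <rmm/mr/device/device_memory_resource.hpp>

template <typename T>
rmm::device_uvector<T> alloc_mc_trainset_buf(std::size_t mesocluster_size_max,
                                             std::size_t dim,
                                             rmm::cuda_stream_view stream,
                                             rmm::mr::device_memory_resource* device_memory)
{
  // With roughly balanced mesoclusters, mesocluster_size_max is on the order
  // of n_rows / n_clusters, so this buffer scales with the dataset size, not
  // with the minibatch size; hence it can exceed a workspace limit that was
  // sized with minibatch-scale allocations in mind.
  return rmm::device_uvector<T>(mesocluster_size_max * dim, stream, device_memory);
}
```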

> Do you have a reproducer for the problem?

The reproducer is in rapidsai/cuml#6204.

Labels: cpp, improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)
Project status: In Progress
3 participants