Add Support for Heterogeneous GPU Configurations in the Cuda Component #312

Merged


Treece-Burgess (Contributor) commented Feb 5, 2025

Pull Request Description

This PR adds support for heterogeneous GPU configurations. As a consequence, the following were also updated:

  • How we internally handle the default device id
  • How we handle creating a context for users if one is not provided
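
As a rough illustration of what "heterogeneous" means here, the sketch below checks whether the GPUs in a node differ in compute capability. It is not code from this PR: `compute_capability_t` and `is_heterogeneous` are hypothetical names, and the sketch assumes the major/minor values have already been queried from the driver.

```c
#include <stddef.h>

/* Hypothetical per-device record (not from the PR). In a real tool these
 * values would come from a device query of the compute capability. */
typedef struct {
    int major;
    int minor;
} compute_capability_t;

/* Returns 1 if any device differs in compute capability from device 0,
 * and 0 otherwise (including the empty and single-device cases). */
static int is_heterogeneous(const compute_capability_t *devs, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        if (devs[i].major != devs[0].major || devs[i].minor != devs[0].minor)
            return 1;
    }
    return 0;
}
```

For example, a node with one V100 (compute capability 7.0) and one H100 (9.0) would be reported as heterogeneous, while a node with eight V100s would not.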

Tested on Leconte (8 * V100) and Hexane (1 * V100 & 1 * H100):

Test results (Pass):

  • HelloWorld.cu
  • HelloWorld_noCuCtx.cu
  • concurrent_profiling.cu
  • concurrent_profiling_noCuCtx.cu
  • cudaOpenMP.cu
  • cudaOpenMP_noCuCtx.cu
  • pthreads.cu
  • pthreads_noCuCtx.cu
  • simpleMultiGPU.cu
  • simpleMultiGPU_noCuCtx.cu
  • test_2thr_1gpu_not_allowed.cu
  • test_multi_read_and_reset.cu
  • test_multipass_event_fail.cu

Author Checklist

  • Description
    Why this PR exists. Reference all relevant information, including background, issues, test failures, etc
  • Commits
    Commits are self contained and only do one thing
    Commits have a header of the form: module: short description
    Commits have a body (whenever relevant) containing a detailed description of the addressed problem and its solution
  • Tests
    The PR needs to pass all the tests

@Treece-Burgess added the component-cuda, type-feature, and status-ready-for-review labels Feb 5, 2025
@Treece-Burgess force-pushed the 12.20.24-cuda-multi-gpu branch 2 times, most recently from 3d06eba to fe2a676 February 14, 2025 13:56
dbarry9 (Contributor) commented Feb 19, 2025

I have tested this PR on two configurations:

  • V100+A100
  • V100+H100

I observed accurate counts of FP32 and FP64 events on both systems.

djwoun (Contributor) left a comment

Tested and worked as expected.

  • A100
  • H100

I noticed one minor error in cuptic_ctxarr_update_current and have one suggestion regarding initializing the hash table.

@Treece-Burgess force-pushed the 12.20.24-cuda-multi-gpu branch from a2264bb to 57e9c56 February 24, 2025 20:22
@Treece-Burgess force-pushed the 12.20.24-cuda-multi-gpu branch from 57e9c56 to e411551 February 24, 2025 20:22
@Treece-Burgess merged commit 17ded13 into icl-utk-edu:master Feb 24, 2025
4 checks passed