Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix and update sccache for CUDA 12.8 #2324

Merged
merged 3 commits into from
Feb 8, 2025
Merged

Conversation

trxcllnt
Copy link
Contributor

@trxcllnt trxcllnt commented Jan 28, 2025

This PR updates sccache, tests, and CI for CUDA Toolkit 12.8.

This PR fixes a bug related to the .module_id file generated by cudafe++. This file is now unique (in CTK 12.8) across cudafe++ invocations, is important for device symbol visibility, and must be consistent between cudafe++ and all cicc calls.

The bug arises when building a .cu.o that includes cached PTX. When the cudafe++ command is re-run it generates a new unique .module_id file. This version has a different id than the cached .ptx files, leading to a mismatch in the device symbol names in the PTX/cubins vs. the symbols used by the final host-compilation step.

Luckily the fix is straightforward -- cache cudafe++ invocations. This is safe to do since the cudafe++ input is the output from the host preprocessor, so any changes that affect device code will yield different hashes.

@trxcllnt trxcllnt force-pushed the fix/cuda-12.8 branch 2 times, most recently from 7766656 to a3e9faa Compare January 29, 2025 04:36
@sylvestre
Copy link
Collaborator

maybe i missed the info but why hip needs 22.04?
feels like a kind of a regression to revert to an older ubuntu for the CI

@trxcllnt
Copy link
Contributor Author

@sylvestre ubuntu-latest is 24.04 now, so the build jobs link sccache to glibc 2.38. The hip integration job uses 22.04 with an older glibc and fails.

I copied this fix over from another branch, and at the time rocm/dev-ubuntu-24.04 wasn't published. I just checked again and it looks like it exists now, so I'll change it back.

* cache (and dist-compile) cudafe++ invocations to ensure the original module_id file is restored and used for all PTX compilations
* hash the module_id file (if it exists), `--gen_module_id_file` arg, and `--module_id_file_name` arg so PTX generated by `nvcc -c` and `nvcc -ptx` yield different hashes
* remove `--gen_module_id_file` from internal cicc calls when using CTK<12.8 `nvcc -c`
@@ -1,4 +1,4 @@
FROM ubuntu:22.04
FROM ubuntu:latest
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer to reference a specific version here

Copy link
Contributor Author

@trxcllnt trxcllnt Feb 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is latest to ensure the container uses a glibc version >= the version sccache links to.

For example, I build and test on an Ubuntu 24.04 host. When the tests run this container with sccache built from my host, sccache fails to run since Ubuntu 22.04 has an older glibc.

I could set it to 24.04, but that just defers the problem for 2 more years.

The only potential problem I see is that apt install libcap2 bubblewrap could break in some future release if those packages are renamed (or Ubuntu removes apt, heh), but those would have to be dealt with eventually anyway.

@@ -8,7 +8,7 @@ env:

jobs:
build:
runs-on: ubuntu-latest
runs-on: ubuntu-24.04
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it isn't clear to me why this change?

Copy link
Contributor Author

@trxcllnt trxcllnt Feb 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When GitHub updates ubuntu-latest to 26.04 next year, the integration tests will start failing again in the same way as here until the HIP job is updated to rocm/dev-ubuntu-26.04.

It will stay broken until a new version is published (or we switch images). This lets us update on our schedule while keeping keep CI green. Since we have to update this file in both cases, I chose the option where jobs don't start mysteriously failing due to GitHub changes.

@sylvestre
Copy link
Collaborator

Could you please update one of the readme to document a bit this ?
For example, cudafe could be defined and explain :)
besides these two comments, looks good to me
thanks

@trxcllnt
Copy link
Contributor Author

trxcllnt commented Feb 8, 2025

@sylvestre sure, but can we make it a follow-up PR? The current sccache release is poisoning caches when used with CUDA 12.8.

@sylvestre sylvestre merged commit e42fa94 into mozilla:main Feb 8, 2025
59 checks passed
@sylvestre
Copy link
Collaborator

sure

@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (0cc0c62) to head (8402902).
Report is 139 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #2324       +/-   ##
==========================================
- Coverage   30.91%       0   -30.92%     
==========================================
  Files          53       0       -53     
  Lines       20112       0    -20112     
  Branches     9755       0     -9755     
==========================================
- Hits         6217       0     -6217     
+ Misses       7922       0     -7922     
+ Partials     5973       0     -5973     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants