Fix and update sccache for CUDA 12.8 #2324
Maybe I missed the info, but why does HIP need 22.04?
@sylvestre I copied this fix over from another branch, and at the time
* cache (and dist-compile) cudafe++ invocations to ensure the original module_id file is restored and used for all PTX compilations
* hash the module_id file (if it exists), `--gen_module_id_file` arg, and `--module_id_file_name` arg so PTX generated by `nvcc -c` and `nvcc -ptx` yield different hashes
* remove `--gen_module_id_file` from internal cicc calls when using CTK<12.8 `nvcc -c`
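The hashing rule in the second item of the commit message above can be sketched roughly as follows. This is a minimal illustration, not sccache's actual code: `CiccCall` and `cicc_cache_key` are hypothetical names, and `DefaultHasher` from the standard library is only a stand-in for whatever hash sccache really uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical view of one internal `cicc` invocation (not sccache's real type).
pub struct CiccCall<'a> {
    /// Preprocessed device source fed to cicc.
    pub preprocessed_input: &'a [u8],
    /// Command-line arguments of the call.
    pub args: &'a [&'a str],
    /// Contents of the module_id file, if one exists.
    pub module_id: Option<&'a [u8]>,
}

/// Derive a cache key that changes when the module-id inputs change.
pub fn cicc_cache_key(call: &CiccCall<'_>) -> u64 {
    let mut hasher = DefaultHasher::new();
    call.preprocessed_input.hash(&mut hasher);
    // Assumption: only the two flags named in the commit message are
    // hashed here. The file name value is skipped; the file contents
    // are hashed below instead.
    for arg in call.args {
        if arg.starts_with("--gen_module_id_file")
            || arg.starts_with("--module_id_file_name")
        {
            arg.hash(&mut hasher);
        }
    }
    // `None` and `Some(bytes)` hash differently, so a call without a
    // module_id file never shares a key with one that has it.
    call.module_id.hash(&mut hasher);
    hasher.finish()
}
```

With this shape, two calls on identical device code get different keys as soon as one of them carries the module-id flags or a module_id file, which is the property the commit message asks for.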
@@ -1,4 +1,4 @@
-FROM ubuntu:22.04
+FROM ubuntu:latest
I'd prefer to reference a specific version here
This is `latest` to ensure the container uses a glibc version >= the version sccache links to.
For example, I build and test on an Ubuntu 24.04 host. When the tests run this container with sccache built from my host, sccache fails to run since Ubuntu 22.04 has an older glibc.
I could set it to 24.04, but that just defers the problem for 2 more years.
The only potential problem I see is that `apt install libcap2 bubblewrap` could break in some future release if those packages are renamed (or Ubuntu removes apt, heh), but those would have to be dealt with eventually anyway.
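The glibc constraint described in this comment amounts to a simple version comparison. The sketch below is only an illustration with hypothetical helper names (`parse_glibc_version`, `can_run`); it assumes the binary needs symbols up to the build host's glibc version, which is the worst case rather than a guarantee.

```rust
/// Parse a "major.minor" glibc version string such as "2.39".
pub fn parse_glibc_version(s: &str) -> Option<(u32, u32)> {
    let (major, minor) = s.trim().split_once('.')?;
    Some((major.parse().ok()?, minor.parse().ok()?))
}

/// Returns Some(true) when the container's glibc is at least as new as
/// the version the binary was linked against, None on a parse failure.
pub fn can_run(binary_needs: &str, container_has: &str) -> Option<bool> {
    let needed = parse_glibc_version(binary_needs)?;
    let available = parse_glibc_version(container_has)?;
    // Tuples compare lexicographically: major first, then minor.
    Some(available >= needed)
}
```

Ubuntu 24.04 ships glibc 2.39 and Ubuntu 22.04 ships 2.35, so `can_run("2.39", "2.35")` reports the failure described above, while the reverse direction is fine.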
@@ -8,7 +8,7 @@ env:

 jobs:
   build:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04
it isn't clear to me why this change?
When GitHub updates `ubuntu-latest` to 26.04 next year, the integration tests will start failing again in the same way as here until the HIP job is updated to `rocm/dev-ubuntu-26.04`.
It will stay broken until a new version is published (or we switch images). Pinning the runner version lets us update on our schedule while keeping CI green. Since we have to update this file in both cases, I chose the option where jobs don't start mysteriously failing due to GitHub changes.
Could you please update one of the READMEs to document this a bit?
@sylvestre sure, but can we make it a follow-up PR? The current sccache release is poisoning caches when used with CUDA 12.8.
sure
Codecov Report
All modified and coverable lines are covered by tests ✅

Additional details and impacted files
@@ Coverage Diff @@
## main #2324 +/- ##
==========================================
- Coverage 30.91% 0 -30.92%
==========================================
Files 53 0 -53
Lines 20112 0 -20112
Branches 9755 0 -9755
==========================================
- Hits 6217 0 -6217
+ Misses 7922 0 -7922
+ Partials 5973 0 -5973
This PR updates sccache, tests, and CI for CUDA Toolkit 12.8.
This PR fixes a bug related to the `.module_id` file generated by `cudafe++`. This file is now unique (in CTK 12.8) across `cudafe++` invocations, is important for device symbol visibility, and must be consistent between `cudafe++` and all `cicc` calls.

The bug arises when building a `.cu.o` that includes cached PTX. When the `cudafe++` command is re-run it generates a new unique `.module_id` file. This version has a different id than the cached `.ptx` files, leading to a mismatch in the device symbol names in the PTX/cubins vs. the symbols used by the final host-compilation step.

Luckily the fix is straightforward -- cache `cudafe++` invocations. This is safe to do since the `cudafe++` input is the output from the host preprocessor, so any changes that affect device code will yield different hashes.
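To make the failure mode and the fix concrete, here is a toy model in Rust. Everything in it is hypothetical: `FakeCudafe` merely imitates the "new unique id on every run" behaviour, `ModuleIdCache` stands in for sccache's real cache, and `DefaultHasher` replaces the real hash function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Toy stand-in for cudafe++: each real run yields a fresh module id.
pub struct FakeCudafe {
    runs: u32,
}

impl FakeCudafe {
    pub fn new() -> Self {
        FakeCudafe { runs: 0 }
    }

    /// Pretend to run cudafe++ and return the newly generated module id.
    pub fn run(&mut self, _preprocessed: &str) -> String {
        self.runs += 1;
        format!("module_id_{}", self.runs)
    }
}

/// Hypothetical cache keyed by a hash of the host preprocessor output.
pub struct ModuleIdCache {
    entries: HashMap<u64, String>,
}

impl ModuleIdCache {
    pub fn new() -> Self {
        ModuleIdCache { entries: HashMap::new() }
    }

    /// Return the cached module id for this input, running cudafe++
    /// only on a cache miss.
    pub fn module_id_for(&mut self, cudafe: &mut FakeCudafe, preprocessed: &str) -> String {
        let mut hasher = DefaultHasher::new();
        preprocessed.hash(&mut hasher);
        let key = hasher.finish();
        self.entries
            .entry(key)
            .or_insert_with(|| cudafe.run(preprocessed))
            .clone()
    }
}
```

In this model, asking the cache twice for the same preprocessed input returns the same module id, so previously cached PTX keeps matching, while an uncached re-run of `FakeCudafe::run` on the same input produces a different id, which is the mismatch described above. Changed input hashes to a new key and gets its own id.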