Fix and update sccache for CUDA 12.8 #2324

trxcllnt · 2025-01-28T18:08:58Z

This PR updates sccache, tests, and CI for CUDA Toolkit 12.8.

This PR fixes a bug related to the .module_id file generated by cudafe++. This file is now unique (in CTK 12.8) across cudafe++ invocations, is important for device symbol visibility, and must be consistent between cudafe++ and all cicc calls.

The bug arises when building a .cu.o that includes cached PTX. When the cudafe++ command is re-run it generates a new unique .module_id file. This version has a different id than the cached .ptx files, leading to a mismatch in the device symbol names in the PTX/cubins vs. the symbols used by the final host-compilation step.

Luckily the fix is straightforward -- cache cudafe++ invocations. This is safe to do since the cudafe++ input is the output from the host preprocessor, so any changes that affect device code will yield different hashes.

sylvestre · 2025-01-30T08:46:04Z

maybe i missed the info but why hip needs 22.04?
feels like a kind of a regression to revert to an older ubuntu for the CI

trxcllnt · 2025-01-30T20:10:32Z

@sylvestre ubuntu-latest is 24.04 now, so the build jobs link sccache to glibc 2.38. The hip integration job uses 22.04 with an older glibc and fails.

I copied this fix over from another branch, and at the time rocm/dev-ubuntu-24.04 wasn't published. I just checked again and it looks like it exists now, so I'll change it back.

* cache (and dist-compile) cudafe++ invocations to ensure the original module_id file is restored and used for all PTX compilations * hash the module_id file (if it exists), `--gen_module_id_file` arg, and `--module_id_file_name` arg so PTX generated by `nvcc -c` and `nvcc -ptx` yield different hashes * remove `--gen_module_id_file` from internal cicc calls when using CTK<12.8 `nvcc -c`

drahnr · 2025-02-06T09:56:22Z

tests/harness/Dockerfile.sccache-dist

@@ -1,4 +1,4 @@
-FROM ubuntu:22.04
+FROM ubuntu:latest


I'd prefer to reference a specific version here

This is latest to ensure the container uses a glibc version >= the version sccache links to.

For example, I build and test on an Ubuntu 24.04 host. When the tests run this container with sccache built from my host, sccache fails to run since Ubuntu 22.04 has an older glibc.

I could set it to 24.04, but that just defers the problem for 2 more years.

The only potential problem I see is that apt install libcap2 bubblewrap could break in some future release if those packages are renamed (or Ubuntu removes apt, heh), but those would have to be dealt with eventually anyway.

sylvestre · 2025-02-08T08:47:01Z

.github/workflows/integration-tests.yml

@@ -8,7 +8,7 @@ env:

 jobs:
  build:
-    runs-on: ubuntu-latest
+    runs-on: ubuntu-24.04


it isn't clear to me why this change?

When GitHub updates ubuntu-latest to 26.04 next year, the integration tests will start failing again in the same way as here until the HIP job is updated to rocm/dev-ubuntu-26.04.

It will stay broken until a new version is published (or we switch images). This lets us update on our schedule while keeping keep CI green. Since we have to update this file in both cases, I chose the option where jobs don't start mysteriously failing due to GitHub changes.

sylvestre · 2025-02-08T08:48:34Z

Could you please update one of the readme to document a bit this ?
For example, cudafe could be defined and explain :)
besides these two comments, looks good to me
thanks

trxcllnt · 2025-02-08T17:44:55Z

@sylvestre sure, but can we make it a follow-up PR? The current sccache release is poisoning caches when used with CUDA 12.8.

sylvestre · 2025-02-08T18:05:30Z

sure

codecov-commenter · 2025-02-08T18:05:36Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 0.00%. Comparing base (0cc0c62) to head (8402902).
Report is 139 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #2324       +/-   ##
==========================================
- Coverage   30.91%       0   -30.92%     
==========================================
  Files          53       0       -53     
  Lines       20112       0    -20112     
  Branches     9755       0     -9755     
==========================================
- Hits         6217       0     -6217     
+ Misses       7922       0     -7922     
+ Partials     5973       0     -5973

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

update GHA workflows to CUDA 12.8

4aea9be

robertmaynard approved these changes Jan 28, 2025

View reviewed changes

trxcllnt force-pushed the fix/cuda-12.8 branch 2 times, most recently from 7766656 to a3e9faa Compare January 29, 2025 04:36

trxcllnt added 2 commits January 30, 2025 22:23

build and run integration tests on ubuntu-24.04

8402902

trxcllnt force-pushed the fix/cuda-12.8 branch from a3e9faa to 8402902 Compare January 30, 2025 22:25

drahnr reviewed Feb 6, 2025

View reviewed changes

sylvestre reviewed Feb 8, 2025

View reviewed changes

sylvestre merged commit e42fa94 into mozilla:main Feb 8, 2025
59 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and update sccache for CUDA 12.8 #2324

Fix and update sccache for CUDA 12.8 #2324

trxcllnt commented Jan 28, 2025 •

edited

Loading

sylvestre commented Jan 30, 2025

trxcllnt commented Jan 30, 2025

drahnr Feb 6, 2025

trxcllnt Feb 6, 2025 •

edited

Loading

sylvestre Feb 8, 2025

trxcllnt Feb 8, 2025 •

edited

Loading

sylvestre commented Feb 8, 2025

trxcllnt commented Feb 8, 2025 •

edited

Loading

sylvestre commented Feb 8, 2025

codecov-commenter commented Feb 8, 2025

Fix and update sccache for CUDA 12.8 #2324

Fix and update sccache for CUDA 12.8 #2324

Conversation

trxcllnt commented Jan 28, 2025 • edited Loading

sylvestre commented Jan 30, 2025

trxcllnt commented Jan 30, 2025

drahnr Feb 6, 2025

Choose a reason for hiding this comment

trxcllnt Feb 6, 2025 • edited Loading

Choose a reason for hiding this comment

sylvestre Feb 8, 2025

Choose a reason for hiding this comment

trxcllnt Feb 8, 2025 • edited Loading

Choose a reason for hiding this comment

sylvestre commented Feb 8, 2025

trxcllnt commented Feb 8, 2025 • edited Loading

sylvestre commented Feb 8, 2025

codecov-commenter commented Feb 8, 2025

Codecov Report

trxcllnt commented Jan 28, 2025 •

edited

Loading

trxcllnt Feb 6, 2025 •

edited

Loading

trxcllnt Feb 8, 2025 •

edited

Loading

trxcllnt commented Feb 8, 2025 •

edited

Loading