-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft: Factor build/link/use of VecGeom/CUDA using libraries for simplicity #279
Conversation
Can one of the admins verify this patch? |
For simplicity of future update to |
6b1369e
to
24098bb
Compare
@@ -8,8 +8,8 @@ macro(build_tests TESTS) | |||
foreach(TEST ${TESTS}) | |||
get_filename_component(TARGET_NAME ${TEST} NAME_WE) | |||
add_executable(${TARGET_NAME} ${TEST}) | |||
set_target_properties(${TARGET_NAME} PROPERTIES CUDA_SEPARABLE_COMPILATION ON) | |||
target_link_libraries(${TARGET_NAME} AdePT_cuda AdePT_G4_integration AdePT) | |||
# NB: We do not use PRIVATE for executables due to an apparent limitation in CudaRDCUtils |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What was the symptoms/issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The output from the cmake
step if adding PRIVATE
here:
CMake Error at cmake/CudaRdcUtils.cmake:850 (target_link_libraries):
The keyword signature for target_link_libraries has already been used with
the target "test_atomic". All uses of target_link_libraries with a target
must be either all-keyword or all-plain.
The uses of the keyword signature are here:
* cmake/CudaRdcUtils.cmake:738 (target_link_libraries)
Call Stack (most recent call first):
test/CMakeLists.txt:12 (cuda_rdc_target_link_libraries)
test/CMakeLists.txt:76 (build_tests)
The target_link_libraries
at 850 is "bare" and could probably use a direct PRIVATE
though I do recall there are use cases for PUBLIC
linking of executables (plugins?) if only for CPU.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an actual difference on the link line of the executable between the 2 settings? (I naively thought it would only be a setting for dependency propagation in CMake
and thus 'useless' for executable).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've never observed a difference, though I wondered if there's some connection to symbol visibility/export. We can probably safely ignore any difference for now given the scope in which the RDC linking is required.
<TARGET INCLUDE DIRECTORIES> | ||
${AdePT_INCLUDE_DIRS}) | ||
```cmake | ||
include(CudaRdcUtils) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be done transparently in AdePTConfig.cmake.in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer to enforce an explicit include for now for clarity in downstream use It also matches the Celeritas pattern of use. We can of course review (in both) in future.
@drbenmorgan , both me an @JuanGonzalezCaminero are getting locally linking errors with this branch. I have the VecGeom master and cmake 3.28.3
However, I checked and the symbols are defined in
The error is the same if I change manually the link order of |
also getting when linking example1:
|
@agheata, I've been chatting to @JuanGonzalezCaminero on Mattermost about this - I'm in the process of rebuilding my local stack with the newer CUDA and CMake that he has to see if I can reproduce. What I am now spotting is there is a undeclared (to CMake) circular dependence between |
@pcanal may have insight on the
error. I believe was seen as well in Celeritas and patched? |
One other thing to check: what is the full list of Vecgeom libraries in your installs (e.g. under /home/agheata/geant_src/VecGeom_install/lib)? I've tried locally with both static and shared builds and am simply not able to reproduce. Could you also let me know the versions of VecGeom and Geant4 (down to commit) you are building against, and the exact set of build options used to configure and install them please? There's some weird difference or specific configuration setup that's causing these problems for you but not me or the CI system. |
Yes we have seem similar but slightly different. The fix was celeritas-project/celeritas@ The bottom-line is that you might need to add explicitly more library to the |
Indeed, we are looking now into how to break the dependency of lib AdePT on libAdePT_G4_integration |
and for AdePT:
|
Thanks for the info @agheata, I think the crucial thing is that we have the same VecGeom library structure here, i.e. only a static
Could you run:
and post the compile/link commands for the source file and end executable (should be the last two) please? For comparison, with my local, working build, I get (paths reduced for clarity):
|
This is what I get for test_atomic @drbenmorgan
|
Thanks @JuanGonzalezCaminero, is that failing with the same error as @agheata reported earlier? The only difference I can spot is the use of response files for the device link step. Could you check/post the contents of the files:
please? edit: clarified the path to these files from the build directory. |
Yes, we were having the same errors, the files have: deviceObjects1.rsp:
deviceLinkLibs.rsp
|
Indeed I'm able to build after the last commit @drbenmorgan |
O.k., I have absolutely no idea why things are failing on these specific machines, so the latest commit is a shot in the dark. All it does is add a dummy |
Thanks for the update @JuanGonzalezCaminero, that's something of a relief! If you could also check that your separate "findpackage_test" example building against an install also works, that'd cover that base as well. All I can really imagine is that this is something to do with GCC - could you and @agheata double check and confirm which Linux, gcc and associated toolchain is in use please (i.e. gcc, gcc-ar gcc-ranlib etc)? |
Ah, I wasn't building the examples, I confirm the commit solved the linking issue with AdePT_G4_Integration, however the vecgeom |
Right, same exercise as before, so could you post the full link line for
|
This is what I get:
|
Which looks identical to mine... One last thing to try - could you modify the link command in cuda_rdc_target_link_libraries(example1
PRIVATE
AdePT_G4_integration
AdePT
VecGeom::vecgeom
${HEPMC3_LIBRARIES}
${HEPMC3_FIO_LIBRARIES}
) Note the extra VecGeom link. This should fully declare that as needed in the link rather than relying on transitive inclusion. If that still fails, could you also:
|
As noted in #280, we should wait on the fix to the circular dependency presented there to go through with the existing link structure. I can then rebase here so we can at least decouple these issues (if they are not already orthogonal). As I commented in #280:
and also the CI machine is passing. @agheata, @JuanGonzalezCaminero, which machines are you using, and are they accessible for CERN account holders at all? Are you able to reproduce the errors on One additional cross-check to try (if @pcanal and @sethrj don't mind):
This should expose any fundamental system problems. |
@drbenmorgan Building the current Celeritas
I get the following when it tries to link
But in this case it seems unrelated. It seems I can't compile with gcc 12.3, the compiler I used instead is gcc 10.5 |
@JuanGonzalezCaminero that last error message looks like you're mixing incompatible compilers... |
Thanks @sethrj, much appreciated 🙇! That was with GCC8, CUDA 11.5 (which OS?)? |
linux-rhel8-cascadelake, gcc 8.5.0, CUDA 11.5. |
O.k., on my setup (linux-rocky9-cascadelake, gcc 11.4.1, CUDA 12) the same spack env builds both this PR and celeritas develop perfectly (so even Celeritas with Vecgeom 2.0.0-rc2, which is this tag/tarball |
CMakeLists.txt
Outdated
src/AdePTPhysics.cc | ||
src/HepEMPhysics.cc | ||
src/AdePTGeant4Integration.cpp | ||
src/adept_g4_dummy.cu |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cuda_rdc_
function are meant to add this when needed. The fact that it isn't means that the detection mechanism in the cuda_rdc_
function that the library and/or executable is depending on more than one independent CUDA rdc
library failed (this could indeed be related to the circular depenendy).
@drbenmorgan We just merged the removal of circular dependencies. Could this PR be rebased onto master? |
- Taken as of commit fb03216c0cd701c6aff0eb901370dad19e178503 - celeritas-project/celeritas@fb03216
Due to VecGeom's legacy CUDA design and the exposure of device functions in public APIs of libraries, ensuring correct singular device links in final executables is awkward. The Celeritas project has evolved the build/link pattern used in VecGeom into the "CudaRdcUtils" CMake module that enables projects using VecGeom (or building libraries that expose device APIs) to be correctly integrated and used by downstream clients. This is acheived through the use of custom CMake properties and wrappers around the fundamental target_XXX interfaces CMake provides, making use as transparent as it can be given the problem faced. Use CudaRdcUtils to build, link, install, and use AdePT - Provide FindVecGeom wrapper module to add RDC properties to VecGeom targets - Replace target_XXX calls with cuda_rdc_XXX equivalents - Remove obsolete "AdePT_cuda" and "AdePT_standalone" libraries that CudaRdcUtils build and handles automatically - Install and use FindVecGeom and CudaRdcUtils for use by downstream clients Tests and executables build and run successfully on a local Rocky 9, GCC11, CUDA 12 environment.
Link errors observed on CERN machines report missing symbols that are in AdePT_G4_integration, and with this library on the link line. The cause of this is not totally clear given success on other platforms, though it is known that there is a circular dependency between the AdePT and AdePT_G4_integration layer that might be an underlying issue. To attempt resolution/diagnosis, add a dummy .cu file to AdePT_G4_integration. This forces it to be built as an RDC library and thus onto the device link line alongside AdePT, potnetially resolving the device link errors.
726e762
to
f08d53a
Compare
Latest commits are a very quick and dirty rebase on the latest |
O.k., CI looks good, so @JuanGonzalezCaminero, @agheata could you try building again and see if things are now resolved on your side please? |
Still getting one error here:
Simplified link line:
And the same line on Master:
|
Argh... Could you try adding this line:
at line 64 of examples/example1/CMakeLists.txt please? Are you running on an Ubuntu machine by any chance? |
That line solves it indeed. I'm running on Debian |
Yep, I found out quite some time ago that VecGeom requires this on Ubuntu systems: see https://wiki.gentoo.org/wiki/Project:Quality_Assurance/-Wl,-z,defs_and_-Wl,--no-allow-shlib-undefined . |
Thanks both! @JuanGonzalezCaminero, could you try modifying the line to:
and see what it prints please? Not a fix, but useful to see what it produces. I'll have a think about how to fix this for installs, but in the meantime I think we can use a project wide |
As far as I can tell I get the same error and link line when using those options. Is there something specific I should look for? |
If I understand the option correctly, it should report specifics on what's missing, though perhaps it needs to be applied to the libraries itself? Anyway, I'm now able to reproduce the error locally by adding the |
@drbenmorgan it also works for me when adding |
IIUC, As @WitekPokorski reported that the current AdePT libraries had been successfully linked in Gaussino, I'm first going to go back to the celer-adept joint integration layer and see what happens there when fully linking in and using AdePT before deciding how to progress this PR (I also want to go back and update |
Closing as superseded by #289 |
Due to VecGeom's legacy CUDA design and the exposure of device functions in public APIs of libraries, ensuring correct singular device code links in final executables is, and continues to be, awkward. This largely means reproducing VecGeom's pattern of shared/static/device linked libraries downstream so that usage can be appropriately tracked and the correct linking done at each stage. A suitable CMake module that packages all of the needed functionality is available from the Celeritas project where it's also been tested across a range of use cases and is therefore now pretty robust (as it can be).
This PR imports this CMake module into AdePT (it's a completely standalone file) and ports the build, link and use of AdePT to use it. The primary changes are:
FindVecGeom
wrapper module to add RDC properties to VecGeom targets (to possibly be upstreamed in VecGeom)target_XXX
calls withcuda_rdc_XXX
equivalentsAdePT_cuda
andAdePT_standalone
libraries thatCudaRdcUtils
build and handles automatically locally and in downstream use.FindVecGeom
andCudaRdcUtils
for use by downstream clientsI've tested this locally on a Rocky 9/GCC 11/Cuda 12 system with an underlying VecGeom install with static libs and CUDA enabled. The build and tests all run fine and without problem. I've also been able to take one of the examples, recast it into a standalone project and build and run it against an install of AdePT. However, CI and wider use may show up additional bugs...
Pinging @pcanal and @sethrj for info and any additional comments they want to make.