GPU Support #67

Closed
wants to merge 117 commits into from

TeachRaccooon
Contributor

Introduces barebones GPU support into RandLAPACK.
This is a WIP PR.

@rileyjmurray
Contributor

Ping @TeachRaccooon. OpenMP isn't correctly linked into RandLAPACK when building with CUDA. The problem is that CUDA requires specific compiler flags in order to interface with OpenMP code. So you'll need to change RandLAPACK's top-level CMakeLists.txt to add the line

include(compiler_flags)

before the include(find_cuda). You also need to create a file called compiler_flags.cmake, with the following contents

# set default compiler flags
if (NOT CMAKE_CXX_FLAGS)
    set(tmp "-fPIC -std=c++20 -Wall -Wextra -Wno-unknown-pragmas")
    if ((APPLE) AND ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang"))
        set(tmp "${tmp} -stdlib=libc++")
    endif()
    if ("${CMAKE_BUILD_TYPE}" MATCHES "Release")
        set(tmp "${tmp} -O3 -march=native -mtune=native -fno-trapping-math -fno-math-errno")
        if (NOT "${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
            set(tmp "${tmp} -fno-signaling-nans")
        endif()
    endif()
    set(CMAKE_CXX_FLAGS "${tmp}"
        CACHE STRING "RandLAPACK build defaults"
        FORCE)
endif()
if (NOT CMAKE_CUDA_FLAGS)
    set(tmp "--default-stream per-thread --expt-relaxed-constexpr")
    if ("${CMAKE_BUILD_TYPE}" MATCHES "Release")
        set(tmp "${tmp} -Xcompiler -fopenmp,-Wall,-Wextra,-O3,-march=native,-mtune=native,-fno-trapping-math,-fno-math-errno")
        if (NOT "${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
            set(tmp "${tmp},-fno-signaling-nans")
        endif()
    elseif ("${CMAKE_BUILD_TYPE}" MATCHES "Debug")
        set(tmp "${tmp} -g -G -Xcompiler -fopenmp,-Wall,-Wextra,-O0,-g")
    endif()
    set(CMAKE_CUDA_FLAGS "${tmp}"
        CACHE STRING "CUDA compiler build defaults"
        FORCE)
    string(REGEX REPLACE "-O[0-9]" "-O3" tmp "${CMAKE_CXX_FLAGS_RELEASE}")
    set(CMAKE_CXX_FLAGS_RELEASE "${tmp}"
        CACHE STRING "C++ compiler build defaults"
        FORCE)
    string(REGEX REPLACE "-O[0-9]" "-O3" tmp "${CMAKE_CUDA_FLAGS_RELEASE}")
    set(CMAKE_CUDA_FLAGS_RELEASE "${tmp}"
        CACHE STRING "CUDA compiler build defaults"
        FORCE)
endif()

I have made these changes locally in Jonathan's clone of RandLAPACK's GPU branch. You can make the same changes here on the main PR whenever convenient; Jonathan can then pull them down properly later on.
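
For readers following along, here is a minimal sketch of the include ordering described above. It is not the actual RandLAPACK file: the `project()` arguments and the `CMake/` module directory are assumptions made for illustration, while `compiler_flags` and `find_cuda` are the module names from the comment.

```cmake
# Top-level CMakeLists.txt (illustrative sketch, not the real file)
cmake_minimum_required(VERSION 3.27)

# Assumption: project name and language list are placeholders.
project(RandLAPACK CXX)

# Assumption: compiler_flags.cmake and find_cuda.cmake live in ./CMake.
# include(<name>) searches CMAKE_MODULE_PATH for <name>.cmake.
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake")

# Set the default C++ and CUDA flags first ...
include(compiler_flags)

# ... so they are already in place when CUDA is located and enabled.
include(find_cuda)
```

In the flags file, `-Xcompiler` tells nvcc to forward the comma-separated list that follows it to the host compiler, which is how `-fopenmp` reaches the host side of CUDA translation units.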

@TeachRaccooon
Contributor Author

Just pushed that change.

INSTALL.md Outdated
Comment on lines 15 to 20
## 0. Software requirements
RandLAPACK_GPU temporary requirements:
GNU 13.1.0
NVIDIA 12.4.131 (make sure to use driver v 550)
CMAKE 3.29.2
All that is used to ensure we can compile with C++20 features with no issues.
Contributor

Change CMake spec to 3.27.

Also, make sure RandLAPACK can compile even if CUDA isn't present. Obviously you can't do anything GPU-based, but the CPU-only functionality should still work.

Contributor Author

It doesn't build without GPU support right now. I'll need to figure out a way to fix that.

Contributor

@rileyjmurray rileyjmurray left a comment

Lots of trivial or almost-trivial comments. Some minor comments.

@TeachRaccooon
Contributor Author

@rileyjmurray I just pushed a fix for the project not compiling without CUDA support.
With the way things are set up in this commit, we have to provide the path to CUDA at configuration time (or set the environment variables properly).
That means that if you were, say, using Spack and had just loaded CUDA with it, CMake would most likely not find CUDA.
I don't know whether this is a problem or not.

@rileyjmurray
Contributor

@TeachRaccooon can we just have CMake look for a custom flag like "RequireCUDA" that defaults to false/undefined? Then we'd try to set up CUDA if (and only if) that flag was present and true.
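
A minimal sketch of what such an opt-in flag could look like. Only the name `RequireCUDA` comes from the comment above; the messages and the use of `find_package(CUDAToolkit)` are assumptions, and this is not necessarily how RandLAPACK ended up implementing it.

```cmake
# Opt-in CUDA support (illustrative sketch).
# option() creates a boolean cache entry that defaults to OFF, so a plain
# `cmake ..` never touches CUDA and a CPU-only build keeps working.
option(RequireCUDA "Build with CUDA (GPU) support" OFF)

if (RequireCUDA)
    # check_language() probes for a working CUDA compiler without failing
    # the configure step when none is found.
    include(CheckLanguage)
    check_language(CUDA)
    if (NOT CMAKE_CUDA_COMPILER)
        message(FATAL_ERROR
            "RequireCUDA=ON, but no CUDA compiler was found. "
            "Set CMAKE_CUDA_COMPILER or the CUDACXX environment variable.")
    endif()
    enable_language(CUDA)
    # Assumption: the toolkit libraries are located with the stock
    # FindCUDAToolkit module (available since CMake 3.17).
    find_package(CUDAToolkit REQUIRED)
else()
    message(STATUS "RequireCUDA is OFF: configuring a CPU-only build")
endif()
```

With this shape, CUDA is requested explicitly with `-DRequireCUDA=ON` at configure time, and the failure mode when CUDA is requested but missing is a clear configure-time error rather than a broken build.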

TeachRaccooon added a commit that referenced this pull request Sep 10, 2024
This PR inherits commits originally introduced in PR #67.
The discussion of some of the details can also be found there.
The list of changes is as follows:
1. Introduces a CMake build option for GPU support (specifically, CUDA
support) in RandLAPACK. This is enabled with ``-DRequireCUDA=ON``.
2. Introduces rl_cuda_kernels.cuh, a file containing various GPU utility
functions, including some BLAS- and LAPACK-level routines.
3. Introduces rl_cqrrpt_gpu.cuh, a GPU version of CQRRPT. Note that
since many parts of CQRRPT (including sketching) do not (currently) have
GPU versions, the data offload happens inside the algorithm. The
input data is expected to reside in CPU memory.
4. Introduces rl_cqrrp_gpu.cuh, a GPU version of the CQRRP algorithm, which
accepts data allocated on a GPU.
5. Includes tests for the functions from the above files and benchmarks
(living in test space) for the CQRRP algorithm. In the future, these should
be moved into benchmarking space (built separately). For now, we can
avoid running these with the rest of the tests by using `ctest
--gtest_filter=-*bench*`.

Issues #77 - #80 are related to this PR.

---------

Co-authored-by: Riley John Murray <[email protected]>
Co-authored-by: Riley John Murray <[email protected]>
Co-authored-by: Max Melnichenko <[email protected]>
Co-authored-by: rileyjmurray <[email protected]>
2 participants