GPU Support #67

TeachRaccooon · 2024-04-03T21:51:50Z

Introduces barebones GPU support into RandLAPACK.
This is a WIP PR.

rileyjmurray · 2024-04-24T14:44:58Z

Ping @TeachRaccooon. OpenMP isn't correctly linked into RandLAPACK when building with CUDA. The problem is that CUDA requires specific compiler flags in order to interface with OpenMP code. So you'll need to change RandLAPACK's top-level CMakeLists.txt to add the line

include(compiler_flags)

before the include(find_cuda) line. You also need to create a file called compiler_flags.cmake with the following contents:

# set default compiler flags
if (NOT CMAKE_CXX_FLAGS)
    set(tmp "-fPIC -std=c++20 -Wall -Wextra -Wno-unknown-pragmas")
    if ((APPLE) AND ("${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang"))
        set(tmp "${tmp} -stdlib=libc++")
    endif()
    if ("${CMAKE_BUILD_TYPE}" MATCHES "Release")
        set(tmp "${tmp} -O3 -march=native -mtune=native -fno-trapping-math -fno-math-errno")
        if (NOT "${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
            set(tmp "${tmp} -fno-signaling-nans")
        endif()
    endif()
    set(CMAKE_CXX_FLAGS "${tmp}"
        CACHE STRING "RandLAPACK build defaults"
        FORCE)
endif()
if (NOT CMAKE_CUDA_FLAGS)
    set(tmp "--default-stream per-thread --expt-relaxed-constexpr")
    if ("${CMAKE_BUILD_TYPE}" MATCHES "Release")
        set(tmp "${tmp} -Xcompiler -fopenmp,-Wall,-Wextra,-O3,-march=native,-mtune=native,-fno-trapping-math,-fno-math-errno")
        if (NOT "${CMAKE_CXX_COMPILER_ID}" MATCHES "Clang")
            set(tmp "${tmp},-fno-signaling-nans")
        endif()
    elseif ("${CMAKE_BUILD_TYPE}" MATCHES "Debug")
        set(tmp "${tmp} -g -G -Xcompiler -fopenmp,-Wall,-Wextra,-O0,-g")
    endif()
    set(CMAKE_CUDA_FLAGS "${tmp}"
        CACHE STRING "CUDA compiler build defaults"
        FORCE)
    string(REGEX REPLACE "-O[0-9]" "-O3" tmp "${CMAKE_CXX_FLAGS_RELEASE}")
    set(CMAKE_CXX_FLAGS_RELEASE "${tmp}"
        CACHE STRING "C++ compiler build defaults"
        FORCE)
    string(REGEX REPLACE "-O[0-9]" "-O3" tmp "${CMAKE_CUDA_FLAGS_RELEASE}")
    set(CMAKE_CUDA_FLAGS_RELEASE "${tmp}"
        CACHE STRING "CUDA compiler build defaults"
        FORCE)
endif()

I have made these changes locally in Jonathan's clone of RandLAPACK's GPU branch. You can just make the changes here on the main PR when convenient; Jonathan can then pull them down properly later on.
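For orientation, the include order described above could look roughly like the excerpt below. This is only a sketch: it assumes the helper modules sit in a `CMake/` directory at the source root (as the path CMake/find_cuda.cmake suggests) and that this directory is appended to `CMAKE_MODULE_PATH`; the actual top-level CMakeLists.txt may be organized differently.

```cmake
# Excerpt of a top-level CMakeLists.txt (illustrative sketch, not the real file).

# Assumption: compiler_flags.cmake and find_cuda.cmake live in <source root>/CMake.
list(APPEND CMAKE_MODULE_PATH "${CMAKE_CURRENT_SOURCE_DIR}/CMake")

# Set the default C++ and CUDA flags first, so that the OpenMP-related
# -Xcompiler options are already in place when CUDA is configured.
include(compiler_flags)

# CUDA discovery comes after the flag defaults.
include(find_cuda)
```

Because compiler_flags.cmake only fills in `CMAKE_CXX_FLAGS` and `CMAKE_CUDA_FLAGS` when they are empty, flags passed explicitly on the command line would still take precedence under this ordering.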

TeachRaccooon · 2024-04-24T14:55:17Z

Just pushed that change.

CMake/find_cuda.cmake

rileyjmurray · 2024-09-07T21:04:39Z

INSTALL.md

+## 0. Software requirements
+RandLAPACK_GPU temporary requirements:
+GNU 13.1.0
+NVIDIA 12.4.131 (make sure to use driver v 550)
+CMAKE 3.29.2
+All that is used to ensure we can compile with C++20 features with no issues.


Change CMake spec to 3.27.

Also, make sure RandLAPACK can compile even if CUDA isn't present. Obviously you can't do anything GPU-based, but the CPU-only functionality should still work.

This isn't working without GPU support right now; I'll need to figure out a way to handle that.

INSTALL.md

RandLAPACK.hh

rileyjmurray

Lots of trivial or almost-trivial comments. Some minor comments.

RandLAPACK/drivers/rl_cqrrp.hh

test/drivers/test_cqrrp_gpu.cu

test/drivers/test_cqrrpt_gpu.cu

TeachRaccooon · 2024-09-08T17:25:08Z

@rileyjmurray I just pushed a solution to the issue of the project not compiling without CUDA support.
With the way things are set up in this commit, we are required to provide the path to CUDA at configuration time (or set the environment variables properly).
That means that if you were, say, using Spack and had just loaded CUDA with it, CMake would most likely not find CUDA.
I don't know whether this is a problem or not.

rileyjmurray · 2024-09-08T20:13:19Z

@TeachRaccooon can we just have CMake look for a custom flag like "RequireCUDA" that defaults to false/undefined? Then we'd try to set up CUDA if (and only if) that flag was present and true.
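A minimal sketch of what such an opt-in flag could look like is given below. The option name `RequireCUDA` comes from the suggestion above; everything else, including the `RandLAPACK_HAS_CUDA` variable and the messages, is hypothetical and only meant to illustrate the idea, not to describe what was actually committed.

```cmake
# Opt-in switch: OFF by default, so CPU-only builds never look for CUDA.
option(RequireCUDA "Build RandLAPACK with CUDA support" OFF)

if (RequireCUDA)
    # Probe for a CUDA compiler before enabling the language, so that a
    # missing toolchain produces a clear error message.
    include(CheckLanguage)
    check_language(CUDA)
    if (NOT CMAKE_CUDA_COMPILER)
        message(FATAL_ERROR "RequireCUDA=ON, but no CUDA compiler was found")
    endif()
    enable_language(CUDA)
    find_package(CUDAToolkit REQUIRED)
    # Hypothetical variable that later CMake logic could branch on.
    set(RandLAPACK_HAS_CUDA TRUE)
else()
    message(STATUS "RequireCUDA is OFF; configuring a CPU-only build")
    set(RandLAPACK_HAS_CUDA FALSE)
endif()
```

With a layout like this, a GPU build would be requested with `-DRequireCUDA=ON` at configure time, and omitting the flag would leave the CPU-only path untouched.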

This PR inherits commits originally introduced in PR #67. The discussion of some of the details can also be found there. The list of changes is as follows:

1. Introduces a CMake build option for GPU support (specifically, CUDA support) in RandLAPACK. This is enabled with ``-DRequireCUDA=ON``.
2. Introduces rl_cuda_kernels.cuh - file contains various utility GPU functions, including some BLAS and LAPACK-level routines.
3. Introduces rl_cqrrpt_gpu.cuh - a GPU version of CQRRPT. Note that since many parts of CQRRPT (including sketching) do not (currently) have GPU versions, the data offload happens inside of the algorithm. The input data is expected to be located on a CPU.
4. Introduces rl_cqrrp_gpu.cuh - a GPU version of CQRRP algorithm, which accepts data allocated on a GPU.
5. Includes tests for the functions from the above files and benchmarks (living in test space) for CQRRP algorithm. In the future, these should be moved into benchmarking space (built separately). For now, we can avoid running these with the rest of the tests by using `ctest --gtest_filter=-*bench*`.

Issues #77 - #80 are related to this PR.

---------

Co-authored-by: Riley John Murray <[email protected]>
Co-authored-by: Riley John Murray <[email protected]>
Co-authored-by: Max Melnichenko <[email protected]>
Co-authored-by: rileyjmurray <[email protected]>

rileyjmurray reviewed Sep 7, 2024

CMake/find_cuda.cmake (outdated)

rileyjmurray reviewed Sep 7, 2024

INSTALL.md

rileyjmurray reviewed Sep 7, 2024

RandLAPACK.hh

rileyjmurray requested changes Sep 7, 2024

TeachRaccooon added 21 commits September 8, 2024 18:58

Begin adding GPU support into RandLAPACK

ed82b76

Porting files from Parth's project

bc1476e

Making sure GPU functions work

f86a091

CQRRPT GPU test

92502f7

Update

b658c8e

Update

1c07f29

Quick compilation error comment-out

5b606fd

GPU-based CQRRPT finished

7d87d79

CQRRPT GPU benchmark

94f908e

CQRRPT GPU benchmark

046d71a

Fixing the RNG template issue that is visible on ISAAC

7b01495

Fixing the RNG template issue that is visible on ISAAC

f836b06

Save before update

7a14cfb

Attempt at making GPU kernels work

8beb5e9

Update

c47b0ec

Update core-linux.yaml

e92afdf

Adding CUDA toolkit into our workflow per GPU_SUPPORT PR.

Update core-linux.yaml

9e4cebc

Resolved the issue of cuda kernels being undefined/attempted to be compiled with gcc.

ef2ae3b

Issue not fully fixed.

88d0684

Issue not fully fixed.

51e0d6b

Added a probably temporary fix

88e3f6a

TeachRaccooon and others added 26 commits September 8, 2024 18:59

Update

f46f94d

try

60a22a1

Onto something

dad4f02

Got rid of pivot idx copy

22b3808

Col_swap_bupdate

1bbb4a7

Cusolver stream update

f05957c

Switched copy_mat_gpu to a standard one

d6a5c4b

Reworked copy_mat_gpu

f6f2696

Placing A copy on a separate stream

d3bf1a8

Fixed J copy bug & reverted the col_swap change

f3ae8a0

Synchronization bug fixed

70c0aec

Small fix in regards to the col_swap_gpu

7927ef3

fixed the issue with copying extra data

fb3fe8a

This commit simply demonstrates that the numerical issues are due to POTRF+SYRK on a GPU not staying strictly in fp64 land. The code here is not polished and will be reverted

c93ca35

Reverting the demonstrative change

222596f

Made sure the low-rank cases are handled properly

ca9290a

Update

f14ed67

Some fixes per Riley's comments

4bd5d10

File add

182e498

reduce CMake version dependency. Move GPU benchmark tests into its own executable, so ./bin/RandLAPACK_tests_gpu doesn't dispatch benchmarks by default

ea52ff1

add checks for unnoticed cuda errors before returning from a test

a2f96e2

move error check, and error message

25e8898

Fixed the bug with freeing A_sk_cpy space

ab00b1e

Nearly all fixes that Riley requested were applied.

185380d

Temporary solution to ensure things work without CUDA support

f286f94

Rebased from main. Reverted the CPU fix.

16cdbb8

TeachRaccooon force-pushed the GPU_SUPPORT_NEW branch from e903f41 to 16cdbb8 Compare September 9, 2024 02:00

TeachRaccooon mentioned this pull request Sep 9, 2024

GPU support - September 2024 #76

Merged

TeachRaccooon closed this Sep 9, 2024
