GPU linear operators #163

Draft
wants to merge 24 commits into main

Conversation

Abdelrahman912 (Collaborator)

Just initial rough ideas for the design of GPU linear operators πŸ§‘β€πŸŽ„

@termi-official changed the title from "init design (no working implementation)" to "GPU operators" on Dec 19, 2024.
@termi-official linked an issue on Dec 23, 2024 that may be closed by this pull request.
    element_cache = setup_element_cache(protocol, element_qr, ip, sdh)
    push!(eles_caches, element_cache)
end
return dh.subdofhandlers |> cu, eles_caches |> cu
Owner:

Note that this will be super slow, as we need to do this possibly at each time step.

I already thought about whether it might make sense (and how) to pass the device type around to give more precise control over this funny stuff here.
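A rough sketch of what passing the device type around could look like, reusing the `|> cu` transfer from the snippet above; the device types and the `setup_operator_caches` name are hypothetical and not part of this PR:

using CUDA: cu

# Hypothetical device types, used purely for dispatch.
abstract type AbstractDevice end
struct CpuDevice <: AbstractDevice end
abstract type AbstractGPUDevice <: AbstractDevice end
struct CudaDevice <: AbstractGPUDevice end

# CPU path: nothing to transfer.
setup_operator_caches(::CpuDevice, dh, eles_caches) = (dh.subdofhandlers, eles_caches)

# GPU path: transfer once during operator setup, so the per-time-step assembly
# only reuses the already uploaded data instead of calling `cu` every step.
function setup_operator_caches(::CudaDevice, dh, eles_caches)
    return dh.subdofhandlers |> cu, eles_caches |> cu
end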

Collaborator Author:

I was thinking about the same issue, because you are absolutely right that this is going to be super slow. The problem, though, is that I am currently reworking the GPUDofHandler and GPUSubDofHandler, so I needed to commit everything in case things blow up. I didn't actually intend to push, though πŸ˜‚ but working after the holidays is like forgetting everything 😒.

Owner:

No worries. You can also just start with copy pasting the existing linear operator and changing the internals to your liking to figure out a better API. :)

@termi-official changed the title from "GPU operators" to "GPU linear operators" on Jan 13, 2025.
@termi-official (Owner) left a comment:

Just a quick review.

Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as baseline.
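A minimal sketch of what such a script could contain; `setup_linear_operator` and `update_operator!` below are placeholders for whatever constructor and update entry point this PR ends up exposing:

using BenchmarkTools, CUDA

# Placeholder: build the linear operator for the requested backend and time it.
function benchmark_linear_operator(backend)
    op = setup_linear_operator(backend)          # hypothetical constructor
    return @benchmark update_operator!($op, 0.0) # hypothetical update call
end

display(benchmark_linear_operator(:cpu))
if CUDA.functional()
    display(benchmark_linear_operator(:cuda))
end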

Project.toml Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
CommonSolve = "38540f10-b2f7-11e9-35d8-d573e4eb0ff2"
Owner:

I think this should be removable.

Project.toml Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Owner:

Should be a weak dependency.
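For reference, a weak dependency would roughly mean moving the CUDA entry from [deps] into sections like these in Project.toml (the extension name ThunderboltCUDAExt is a placeholder):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
ThunderboltCUDAExt = "CUDA"

A [compat] entry for CUDA would still be kept as usual.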

Comment on lines 566 to 567
struct BackendCUDA <: AbstractBackend end
struct BackendCPU <: AbstractBackend end
Owner:

Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even in a separate PR).

Collaborator Author:

The thing is that KA doesn't allow dynamic memory allocation for shared arrays; all the interfaces it provides are statically sized, and I try to use dynamic shared memory for the local vectors and matrices when they fit. I thought about using their interfaces anyway, but installing a whole library just to use its structs wouldn't be optimal.
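For context, a minimal CUDA.jl illustration of the dynamic shared-memory pattern being referred to; the kernel body and sizes are purely illustrative:

using CUDA

# The shared-memory size is chosen at launch time via the `shmem` keyword,
# which KernelAbstractions' @localmem (compile-time sizes only) cannot express.
function local_assembly_kernel(n_basefuncs)
    ke = CUDA.CuDynamicSharedArray(Float32, (n_basefuncs, n_basefuncs))
    fe = CUDA.CuDynamicSharedArray(Float32, n_basefuncs,
        sizeof(Float32) * n_basefuncs * n_basefuncs) # byte offset past `ke`
    # ... per-element assembly into ke/fe would go here ...
    return nothing
end

n_basefuncs = 8
shmem_bytes = sizeof(Float32) * (n_basefuncs^2 + n_basefuncs)
@cuda threads=n_basefuncs shmem=shmem_bytes local_assembly_kernel(n_basefuncs)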

@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT}
return new{V, Nothing}(v, nothing)
end
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray}
reinit!(v, cell_coords)
#reinit!(v, cell_coords)
Owner:

Why?

Collaborator Author:

Because the cell values object is a shared instance across kernel threads, it can't store the coords, right?! So I would rather store the coords in the cell cache, which is unique to every thread.

Owner:

I see. I need some more time to think about this, though, and about how to make it compatible with the CPU assembly.
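To make the intent concrete, a very rough sketch of the split being discussed; both struct names are hypothetical and GPU array types are ignored:

# Shared, read-only quadrature/shape data: one instance used by all threads.
struct SharedCellValues{QV}
    qv::QV
end

# Per-thread cell cache: owns the coordinates of the cell it currently visits,
# so the shared values object never has to be reinit!-ed with coordinates.
struct DeviceCellCache{C, SV}
    coords::C
    values::SV   # reference to the shared instance
end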

detJdV = detJ * fe_v.weights[q_point]

Owner:

Suggested change

ws (trailing whitespace)

@termi-official (Owner) left a comment:

Next batch of comments.

)
end

# TODO: here or in ferrite-addons?
Owner:

You can answer these questions relatively easily. Just ask yourself these design questions here:

  1. Does it need CUDA specifically?
  2. Does it need any extra dependency?
  3. Is it potentially shared between extensions (e.g. AMDGPU, CUDA, ...)?

So it should be in Thunderbolt and not in any extension. The first parameter should also be some supertype of the GPU device types to control the dispatch. For the CPU side it should just return the dof handler without touching it.
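Concretely, inside Thunderbolt this might look like the following, reusing the hypothetical device types from the earlier sketch; `setup_device_dofhandler` and `adapt_dofhandler_arrays` are placeholder names:

# CPU side: return the dof handler untouched.
setup_device_dofhandler(::CpuDevice, dh) = dh

# Shared between all GPU extensions (CUDA, AMDGPU, ...): only the
# device-specific array conversion is supplied by the extension.
function setup_device_dofhandler(device::AbstractGPUDevice, dh)
    return adapt_dofhandler_arrays(device, dh)  # hypothetical, implemented per extension
end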

cell_to_sdh = adapt(to, dh.cell_to_subdofhandler .|> (i -> convert(Int32,i)) |> cu)
#subdofhandlers = Tuple(i->_convert_subdofhandler_to_gpu(cell_dofs, cell_dofs_offset, sdh) for sdh in dh.subdofhandlers)
subdofhandlers = adapt_structure(to,dh.subdofhandlers .|> (sdh -> Adapt.adapt_structure(to, sdh)) |> cu)
gpudata = GPUDofHandlerData(
Owner:

Maybe we can be more generic and call this DeviceDofHandlerData? Just an idea which came to my mind.

Collaborator Author:

Maybe; I believe both names are synonyms, but if we do so for the DofHandler, we should rename the other structs as well, shouldn't we?

Owner:

Note that FPGAs/ICs/... are also devices. Agreed on renaming the other structs accordingly.

nodes = Adapt.adapt_structure(to, grid.nodes |> cu)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end
Owner:

Suggested change
end
end

Comment on lines 19 to 23
function Adapt.adapt_structure(to, cysc::CartesianCoordinateSystemCache)
cs = Adapt.adapt_structure(to, cysc.cs)
cv = Adapt.adapt_structure(to, cysc.cv)
return CartesianCoordinateSystemCache(cs, cv)
end
Owner:

I think for such simple ones we can just use the @adapt_structure macro. See https://github.com/JuliaGPU/Adapt.jl/blob/master/src/macro.jl .
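Assuming CartesianCoordinateSystemCache is a plain struct whose fields only need to be adapted one by one, the hand-written method above collapses to:

import Adapt: @adapt_structure

# Generates Adapt.adapt_structure(to, x) by adapting every field and calling
# the default constructor, equivalent to the manual definition above.
@adapt_structure CartesianCoordinateSystemCache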

Comment on lines 5 to 6
fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
Owner:

Do you think it might be simpler to overload adapt_structure here for the different FEValues?



####################
# cuda_iterator.jl #
Owner:

If you think that it helps navigability, then feel free to make a subfolder PR913 and put the files there, where PR913.jl just includes the files in this folder in the right order. If not, then just leave it as it is.

Comment on lines 66 to 70
function allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
return GlobalMemAlloc(Kes, fes)
end
Owner:

Unfortunately, this does not work. We need to move this dispatch into an extension.
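Roughly, the dispatch would move into a package extension along these lines; the extension name matches the placeholder used for Project.toml above, and the imported names are assumed to live in Thunderbolt:

module ThunderboltCUDAExt

using CUDA
import Thunderbolt: allocate_global_mem, FullObject, GlobalMemAlloc

# Same method as in the diff above, but only compiled once CUDA.jl is loaded.
function allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
    Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
    fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
    return GlobalMemAlloc(Kes, fes)
end

end # module ThunderboltCUDAExt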

Comment on lines 8 to 11
abstract type AbstractMemAllocObjects{Tv <: Real} end
struct RHSObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct JacobianObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct FullObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
Owner:

Can we come up with more descriptive names for these?

Collaborator Author:

I am not satisfied with the naming convention either; if you have better names, that would be great.

Maybe I should name them something like BMem, AMem, AbMem.

############
# adapt.jl #
############
Adapt.@adapt_structure GPUGrid
Owner:

Just import @adapt_structure. :)
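That is, something like:

import Adapt: @adapt_structure  # avoids writing out the module prefix below

@adapt_structure GPUGrid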

ndofs = ndofs_per_cell(dh, i)
view = @view dh.cell_dofs[offset:(offset + ndofs - convert(Ti, 1))]
return view
end
Owner:

Suggested change
end
end

Owner:

Also related: how can we decouple the assembly strategy itself from the specific backend?
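One possible shape for such a decoupling, with the strategy and the device as independent dispatch arguments; every name here is hypothetical:

# Strategies describe *how* element contributions are computed and stored ...
abstract type AbstractAssemblyStrategy end
struct PerElementAssemblyStrategy <: AbstractAssemblyStrategy end

# ... while the device (see the hypothetical types above) describes *where*.
function assemble_linear_operator!(b, op, strategy::AbstractAssemblyStrategy, device::AbstractDevice)
    mem = allocate_assembly_memory(strategy, device)      # device-specific allocation
    launch_assembly_kernel!(b, op, strategy, mem, device) # device-specific launch
    return b
end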

Successfully merging this pull request may close these issues: GPU assembly of linear forms.