GPU linear operators #163

Draft
wants to merge 24 commits into main

Conversation

Abdelrahman912 (Collaborator)

Just initial rough ideas for the design of GPU linear operators πŸ§‘β€πŸŽ„

@termi-official changed the title from "init design (no working implementation)" to "GPU operators" on Dec 19, 2024.
@termi-official linked an issue on Dec 23, 2024 that may be closed by this pull request.
    element_cache = setup_element_cache(protocol, element_qr, ip, sdh)
    push!(eles_caches, element_cache)
end
return dh.subdofhandlers |> cu, eles_caches |> cu
Owner:

Note that this will be super slow, as we need to do this possibly at each time step.

I already thought about whether it might make sense (and how) to pass the device type around to give more precise control over this funny stuff here.
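A rough sketch of what passing the device type around could look like, reusing the `|> cu` transfer from the snippet above; the device types and the `setup_operator_caches` name are hypothetical and not part of this PR:

using CUDA: cu

# Hypothetical device types, used purely for dispatch.
abstract type AbstractDevice end
struct CpuDevice <: AbstractDevice end
abstract type AbstractGPUDevice <: AbstractDevice end
struct CudaDevice <: AbstractGPUDevice end

# CPU path: nothing to transfer.
setup_operator_caches(::CpuDevice, dh, eles_caches) = (dh.subdofhandlers, eles_caches)

# GPU path: transfer once during operator setup, so the per-time-step assembly
# only reuses the already uploaded data instead of calling `cu` every step.
function setup_operator_caches(::CudaDevice, dh, eles_caches)
    return dh.subdofhandlers |> cu, eles_caches |> cu
end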

Collaborator Author:

I was thinking about the same issue, because you are absolutely right that this is going to be super slow. The problem, though, is that I am currently reworking the GPUDofHandler and GPUSubDofHandler, so I needed to commit everything in case things blow up. I didn't actually intend to push, though πŸ˜‚ but working after the holidays is like forgetting everything 😒.

Owner:

No worries. You can also just start with copy pasting the existing linear operator and changing the internals to your liking to figure out a better API. :)

@termi-official changed the title from "GPU operators" to "GPU linear operators" on Jan 13, 2025.
@termi-official (Owner) left a comment:

Just a quick review.

Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as baseline.
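A minimal sketch of what such a script could contain; `setup_linear_operator` and `update_operator!` below are placeholders for whatever constructor and update entry point this PR ends up exposing:

using BenchmarkTools, CUDA

# Placeholder: build the linear operator for the requested backend and time it.
function benchmark_linear_operator(backend)
    op = setup_linear_operator(backend)          # hypothetical constructor
    return @benchmark update_operator!($op, 0.0) # hypothetical update call
end

display(benchmark_linear_operator(:cpu))
if CUDA.functional()
    display(benchmark_linear_operator(:cuda))
end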

Project.toml Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
CommonSolve = "38540f10-b2f7-11e9-35d8-d573e4eb0ff2"
Owner:

I think this should be removable.

Project.toml Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
[deps]
Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Owner:

Should be a weak dependency.
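For reference, a weak dependency would roughly mean moving the CUDA entry from [deps] into sections like these in Project.toml (the extension name ThunderboltCUDAExt is a placeholder):

[weakdeps]
CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"

[extensions]
ThunderboltCUDAExt = "CUDA"

A [compat] entry for CUDA would still be kept as usual.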

Comment on lines 566 to 567
struct BackendCUDA <: AbstractBackend end
struct BackendCPU <: AbstractBackend end
Owner:

Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even in a separate PR).

Collaborator Author:

The thing is that KA doesn't allow dynamic memory allocation for shared arrays; all the interfaces it provides are statically sized, and I try to use dynamic shared memory for the local vectors and matrices when they fit. I thought about using their interfaces anyway, but installing a whole library just to use its structs wouldn't be optimal.
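For context, a minimal CUDA.jl illustration of the dynamic shared-memory pattern being referred to; the kernel body and sizes are purely illustrative:

using CUDA

# The shared-memory size is chosen at launch time via the `shmem` keyword,
# which KernelAbstractions' @localmem (compile-time sizes only) cannot express.
function local_assembly_kernel(n_basefuncs)
    ke = CUDA.CuDynamicSharedArray(Float32, (n_basefuncs, n_basefuncs))
    fe = CUDA.CuDynamicSharedArray(Float32, n_basefuncs,
        sizeof(Float32) * n_basefuncs * n_basefuncs) # byte offset past `ke`
    # ... per-element assembly into ke/fe would go here ...
    return nothing
end

n_basefuncs = 8
shmem_bytes = sizeof(Float32) * (n_basefuncs^2 + n_basefuncs)
@cuda threads=n_basefuncs shmem=shmem_bytes local_assembly_kernel(n_basefuncs)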

@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT}
return new{V, Nothing}(v, nothing)
end
function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray}
reinit!(v, cell_coords)
#reinit!(v, cell_coords)
Owner:

Why?

Collaborator Author:

Because the cell values object is a shared instance across kernel threads, it can't store the coords, right?! So I would rather store the coords in the cell cache, which is unique to every thread.

Owner:

I see. I need some more time to think about this, though, and about how to make it compatible with the CPU assembly.
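To make the intent concrete, a very rough sketch of the split being discussed; both struct names are hypothetical and GPU array types are ignored:

# Shared, read-only quadrature/shape data: one instance used by all threads.
struct SharedCellValues{QV}
    qv::QV
end

# Per-thread cell cache: owns the coordinates of the cell it currently visits,
# so the shared values object never has to be reinit!-ed with coordinates.
struct DeviceCellCache{C, SV}
    coords::C
    values::SV   # reference to the shared instance
end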

detJdV = detJ * fe_v.weights[q_point]

Owner:

Suggested change

ws (trailing whitespace)

@termi-official (Owner) left a comment:

Next batch of comments.

)
end

# TODO: here or in ferrite-addons?
Owner:

You can answer these questions relatively easily. Just ask yourself these design questions here:

  1. Does it need CUDA specifically?
  2. Does it need any extra dependency?
  3. Is it potentially shared between extensions (e.g. AMDGPU, CUDA, ...)?

So it should be in Thunderbolt and not in any extension. The first parameter should also be some supertype of the GPU device types to control the dispatch. For the CPU side it should just return the dof handler without touching it.
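Concretely, inside Thunderbolt this might look like the following, reusing the hypothetical device types from the earlier sketch; `setup_device_dofhandler` and `adapt_dofhandler_arrays` are placeholder names:

# CPU side: return the dof handler untouched.
setup_device_dofhandler(::CpuDevice, dh) = dh

# Shared between all GPU extensions (CUDA, AMDGPU, ...): only the
# device-specific array conversion is supplied by the extension.
function setup_device_dofhandler(device::AbstractGPUDevice, dh)
    return adapt_dofhandler_arrays(device, dh)  # hypothetical, implemented per extension
end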

cell_to_sdh = adapt(to, dh.cell_to_subdofhandler .|> (i -> convert(Int32,i)) |> cu)
#subdofhandlers = Tuple(i->_convert_subdofhandler_to_gpu(cell_dofs, cell_dofs_offset, sdh) for sdh in dh.subdofhandlers)
subdofhandlers = adapt_structure(to,dh.subdofhandlers .|> (sdh -> Adapt.adapt_structure(to, sdh)) |> cu)
gpudata = GPUDofHandlerData(
Owner:

Maybe we can be more generic and call this DeviceDofHandlerData? Just an idea which came to my mind.

Collaborator Author:

Maybe; I believe both names are synonyms, but if we do so for the DofHandler, we should rename the other structs as well, shouldn't we?

Owner:

Note that FPGAs/ICs/... are also devices. Agreed on renaming the other structs accordingly.

nodes = Adapt.adapt_structure(to, grid.nodes |> cu)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end
Owner:

Suggested change
end
end

Comment on lines 19 to 23
function Adapt.adapt_structure(to, cysc::CartesianCoordinateSystemCache)
cs = Adapt.adapt_structure(to, cysc.cs)
cv = Adapt.adapt_structure(to, cysc.cv)
return CartesianCoordinateSystemCache(cs, cv)
end
Owner:

I think for such simple ones we can just use the @adapt_structure macro. See https://github.com/JuliaGPU/Adapt.jl/blob/master/src/macro.jl .
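Assuming CartesianCoordinateSystemCache is a plain struct whose fields only need to be adapted one by one, the hand-written method above collapses to:

import Adapt: @adapt_structure

# Generates Adapt.adapt_structure(to, x) by adapting every field and calling
# the default constructor, equivalent to the manual definition above.
@adapt_structure CartesianCoordinateSystemCache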

Comment on lines 5 to 6
fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
Owner:

Do you think it might be simpler to overload adapt_structure here for the different FEValues?



####################
# cuda_iterator.jl #
Owner:

If you think that it helps navigability, then feel free to make a subfolder PR913 and put the files there, where PR913.jl just includes the files in this folder in the right order. If not, then just leave it as it is.

Comment on lines 66 to 70
function allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
return GlobalMemAlloc(Kes, fes)
end
Owner:

Unfortunately, this does not work. We need to move this dispatch into an extension.
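Roughly, the dispatch would move into a package extension along these lines; the extension name matches the placeholder used for Project.toml above, and the imported names are assumed to live in Thunderbolt:

module ThunderboltCUDAExt

using CUDA
import Thunderbolt: allocate_global_mem, FullObject, GlobalMemAlloc

# Same method as in the diff above, but only compiled once CUDA.jl is loaded.
function allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
    Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
    fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
    return GlobalMemAlloc(Kes, fes)
end

end # module ThunderboltCUDAExt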

Comment on lines 8 to 11
abstract type AbstractMemAllocObjects{Tv <: Real} end
struct RHSObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct JacobianObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct FullObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
Owner:

Can we come up with more descriptive names for these?

Collaborator Author:

I am not satisfied with the naming convention either; if you have better names, that would be great.

Maybe I should name them something like BMem, AMem, AbMem.

############
# adapt.jl #
############
Adapt.@adapt_structure GPUGrid
Owner:

Just import @adapt_structure. :)
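That is, something like:

import Adapt: @adapt_structure  # avoids writing out the module prefix below

@adapt_structure GPUGrid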

ndofs = ndofs_per_cell(dh, i)
view = @view dh.cell_dofs[offset:(offset + ndofs - convert(Ti, 1))]
return view
end
Owner:

Suggested change
end
end

Owner:

Also related: how can we decouple the assembly strategy itself from the specific backend?
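One possible shape for such a decoupling, with the strategy and the device as independent dispatch arguments; every name here is hypothetical:

# Strategies describe *how* element contributions are computed and stored ...
abstract type AbstractAssemblyStrategy end
struct PerElementAssemblyStrategy <: AbstractAssemblyStrategy end

# ... while the device (see the hypothetical types above) describes *where*.
function assemble_linear_operator!(b, op, strategy::AbstractAssemblyStrategy, device::AbstractDevice)
    mem = allocate_assembly_memory(strategy, device)      # device-specific allocation
    launch_assembly_kernel!(b, op, strategy, mem, device) # device-specific launch
    return b
end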

Successfully merging this pull request may close these issues: GPU assembly of linear forms.