GPU linear operators #163
base: main
Conversation
src/discretization/operator.jl
Outdated
    element_cache = setup_element_cache(protocol, element_qr, ip, sdh)
    push!(eles_caches, element_cache)
end
return dh.subdofhandlers |> cu, eles_caches |> cu
Note that this will be super slow, as we need to do this possibly at each time step.
I already thought about whether it might make sense (and how) to pass the device type around to give more precise control over this funny stuff here.
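A minimal sketch of what that could look like, assuming the setup_element_cache call from the diff above; the cache struct and function names below are illustrative, not the actual Thunderbolt API:

```julia
using CUDA

# Sketch: do the host-to-device conversion once, at operator construction time,
# and reuse the converted objects at every time step instead of re-running `cu`.
struct CudaOperatorCaches{SDH, EC}
    subdofhandlers_gpu::SDH   # device copy of dh.subdofhandlers
    element_caches_gpu::EC    # device copy of the per-subdomain element caches
end

function build_device_caches(protocol, element_qr, ip, dh)
    eles_caches = [setup_element_cache(protocol, element_qr, ip, sdh) for sdh in dh.subdofhandlers]
    return CudaOperatorCaches(cu(dh.subdofhandlers), cu(eles_caches))
end
```

The per-time-step path would then only launch kernels against the stored device data.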
I was thinking about the same issue, because yes, you are absolutely right, this is going to be super slow. The problem, though, is that I am currently trying to change the gpudofhandler and gpusubdofhandler, so I needed to commit everything in case things blow up. I didn't actually intend to push, though; working after the holidays feels like having forgotten everything.
No worries. You can also just start by copy-pasting the existing linear operator and changing the internals to your liking to figure out a better API. :)
Just a quick review.
Can you also add a benchmark script for CPU vs GPU in benchmarks/operators/linear-operators/? You can use https://github.com/termi-official/Thunderbolt.jl/blob/main/benchmarks/benchmarks-linear-form.jl as a baseline.
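For orientation, a rough skeleton of such a script; op_cpu, op_gpu and run_assembly! are placeholders, not the actual operator API:

```julia
using BenchmarkTools
using CUDA

run_assembly!(op) = op()   # placeholder: update/assemble the linear operator here
op_cpu = () -> nothing     # placeholder CPU operator
op_gpu = () -> nothing     # placeholder GPU operator

suite = BenchmarkGroup()
suite["cpu"] = @benchmarkable run_assembly!($op_cpu)
# Synchronize so the GPU timing includes the kernel, not just the launch.
suite["gpu"] = @benchmarkable CUDA.@sync run_assembly!($op_gpu)

results = run(suite; verbose = true)
```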
Project.toml
Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
 [deps]
 Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
 BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
+CommonSolve = "38540f10-b2f7-11e9-35d8-d573e4eb0ff2"
I think this should be removable.
Project.toml
Outdated
@@ -5,6 +5,8 @@ version = "0.0.1-DEV"
 [deps]
 Adapt = "79e6a3ab-5dfb-504d-930d-738a2a938a0e"
 BlockArrays = "8e7c35d0-a365-5155-bbbb-fb81a777f24e"
+CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
Should be a weak dependency.
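A sketch of what that could look like via a package extension; the extension name ThunderboltCUDAExt is an assumption:

```julia
# Sketch: make CUDA a weak dependency. In Project.toml, CUDA would move from
# [deps] to [weakdeps] and an extension would be declared, e.g.
#
#   [weakdeps]
#   CUDA = "052768ef-5323-5732-b1bb-66c8b64840ba"
#
#   [extensions]
#   ThunderboltCUDAExt = "CUDA"
#
# The CUDA-specific code then lives in ext/ThunderboltCUDAExt.jl:
module ThunderboltCUDAExt

using Thunderbolt
using CUDA

# CUDA-only methods (adapt rules, kernel launches, ...) go here and are only
# loaded when the user also has CUDA.jl in their environment.

end
```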
src/discretization/operator.jl
Outdated
struct BackendCUDA <: AbstractBackend end
struct BackendCPU <: AbstractBackend end
Is there any reason why we do not use KernelAbstractions backends for the dispatch? If there is not, then please wait before adapting it. We might need to discuss this one in more detail (or even in a separate PR).
The thing is that KA doesn't allow dynamic memory allocations for shared arrays; all the interfaces they provide are statically sized shared memory, and I try to use dynamic memory for the local vectors and matrices when they fit into shared memory. I thought we could use their interfaces, but installing a whole library just to use their structs wouldn't be the optimal thing to do.
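For context, a minimal sketch of the CUDA.jl feature referenced here, i.e. shared memory whose size is only fixed at launch time; the kernel itself is a toy:

```julia
using CUDA

# Sketch: dynamically sized shared memory via CuDynamicSharedArray; the size is
# chosen at launch through the `shmem` keyword rather than at compile time.
function element_kernel!(out, n_basefuncs)
    Ke = CuDynamicSharedArray(Float32, (n_basefuncs, n_basefuncs))
    tid = threadIdx().x
    if tid <= n_basefuncs
        Ke[tid, tid] = 1.0f0   # placeholder "assembly"
        out[tid] = Ke[tid, tid]
    end
    return nothing
end

n_basefuncs = 8
out = CUDA.zeros(Float32, n_basefuncs)
shmem_bytes = sizeof(Float32) * n_basefuncs^2
@cuda threads = n_basefuncs shmem = shmem_bytes element_kernel!(out, n_basefuncs)
```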
@@ -10,7 +10,7 @@ struct QuadratureValuesIterator{VT,XT}
         return new{V, Nothing}(v, nothing)
     end
     function QuadratureValuesIterator(v::V, cell_coords::VT) where {V, VT <: AbstractArray}
-        reinit!(v, cell_coords)
+        #reinit!(v, cell_coords)
Why?
Because the cell values are a shared instance across kernels, they can't store the coords, right?! So I'd rather store the coords in the cell cache, which is unique to every thread.
I see. I need some more time to think about this, though, and about how to make it compatible with the CPU assembly.
src/ferrite-addons/PR883.jl
Outdated
detJdV = detJ * fe_v.weights[q_point]
ws (whitespace)
Next batch of comments.
ext/cuda/cuda_adapt.jl
Outdated
)
end

# TODO: here or in ferrite-addons?
You can answer these questions relatively easily. Just ask yourself these design questions here:
- Does it need CUDA specifically?
- Does it need any extra dependency?
- Is it potentially shared between extensions (e.g. AMDGPU, CUDA, ...)?
So it should be in Thunderbolt and not in any extension. The first parameter should also be some supertype of the GPU device types to control the dispatch. For the CPU side it should just return the dof handler without touching it.
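A minimal sketch of that dispatch pattern; all type and function names below are illustrative:

```julia
# Sketch: dispatch on an abstract device type. The CPU method is a no-op, and
# every GPU backend only adds a method for its own device type in its extension.
abstract type AbstractDevice end
abstract type AbstractGPUDevice <: AbstractDevice end
struct CPUDevice <: AbstractDevice end
struct CUDADevice <: AbstractGPUDevice end

# Lives in Thunderbolt: the CPU keeps the dof handler untouched.
adapt_dofhandler(::CPUDevice, dh) = dh

# The CUDA extension would then provide something like
#   adapt_dofhandler(::CUDADevice, dh) = Adapt.adapt(CuArray, dh)
```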
ext/cuda/cuda_adapt.jl
Outdated
cell_to_sdh = adapt(to, dh.cell_to_subdofhandler .|> (i -> convert(Int32,i)) |> cu)
#subdofhandlers = Tuple(i->_convert_subdofhandler_to_gpu(cell_dofs, cell_dofs_offset, sdh) for sdh in dh.subdofhandlers)
subdofhandlers = adapt_structure(to,dh.subdofhandlers .|> (sdh -> Adapt.adapt_structure(to, sdh)) |> cu)
gpudata = GPUDofHandlerData(
Maybe we can be more generic and call this DeviceDofHandlerData? Just an idea which came to my mind.
Maybe; I believe both names are synonyms. But if we do so for the dof handler, we should rename the other structs as well, shouldn't we?
Note that FPGAs/ICs/... are also devices. Agreed on renaming the other structs accordingly.
nodes = Adapt.adapt_structure(to, grid.nodes |> cu)
#TODO subdomain info
return GPUGrid{sdim, cell_type, T, typeof(cells), typeof(nodes)}(cells, nodes)
end
(Suggested change to the lines above; whitespace only.)
ext/cuda/cuda_adapt.jl
Outdated
function Adapt.adapt_structure(to, cysc::CartesianCoordinateSystemCache)
    cs = Adapt.adapt_structure(to, cysc.cs)
    cv = Adapt.adapt_structure(to, cysc.cv)
    return CartesianCoordinateSystemCache(cs, cv)
end
I think for such simple ones we can just use @adapt_structure. See https://github.com/JuliaGPU/Adapt.jl/blob/master/src/macro.jl .
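That is, roughly (for the struct from the diff above):

```julia
using Adapt

# Sketch: the macro generates an adapt_structure method that adapts every field
# and reconstructs the struct, replacing the hand-written method above.
Adapt.@adapt_structure CartesianCoordinateSystemCache
```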
ext/cuda/cuda_adapt.jl
Outdated
fv = Adapt.adapt(to, StaticInterpolationValues(cv.fun_values))
gm = Adapt.adapt(to, StaticInterpolationValues(cv.geo_mapping))
Do you think it might be simpler to overload adapt_structure here for the different FEValues?
src/ferrite-addons/PR913.jl
Outdated
####################
# cuda_iterator.jl #
If you think that it helps navigability, then feel free to make a subfolder PR913 and put the files there, where PR913.jl just includes the files in this folder in the right order. If not, then just leave it as it is.
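In that case PR913.jl would reduce to a thin includer along these lines; the file names are illustrative:

```julia
# Sketch: PR913.jl just pulls in the split-up files in the right order.
include("PR913/memory_allocation.jl")
include("PR913/adapt.jl")
include("PR913/cuda_iterator.jl")
```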
src/ferrite-addons/PR913.jl
Outdated
function allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
    Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
    fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
    return GlobalMemAlloc(Kes, fes)
end
Unfortunately, this does not work. We need to move this dispatch into an extension.
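Roughly, the split could look like this; the extension placement is the point, and the extension name is an assumption:

```julia
# In Thunderbolt itself: only the generic function stub (plus any CPU fallback).
function allocate_global_mem end

# In the CUDA extension (e.g. a hypothetical ThunderboltCUDAExt), where CUDA.zeros is available:
function Thunderbolt.allocate_global_mem(::Type{FullObject{Tv}}, n_cells::Ti, n_basefuncs::Ti) where {Ti <: Integer, Tv <: Real}
    Kes = CUDA.zeros(Tv, n_cells, n_basefuncs, n_basefuncs)
    fes = CUDA.zeros(Tv, n_cells, n_basefuncs)
    return GlobalMemAlloc(Kes, fes)
end
```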
src/ferrite-addons/PR913.jl
Outdated
abstract type AbstractMemAllocObjects{Tv <: Real} end
struct RHSObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct JacobianObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
struct FullObject{Tv<:Real} <:AbstractMemAllocObjects{Tv} end
Can we come up with more descriptive names for these?
I am not satisfied with the naming convention either; if you have better names, that would be great. Maybe I should name them something like BMem, AMem, AbMem.
src/ferrite-addons/PR913.jl
Outdated
############
# adapt.jl #
############
Adapt.@adapt_structure GPUGrid
Just import @adapt_structure. :)
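That is:

```julia
# Sketch: import the macro once, then the Adapt. prefix can be dropped.
import Adapt: @adapt_structure

@adapt_structure GPUGrid
```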
    ndofs = ndofs_per_cell(dh, i)
    view = @view dh.cell_dofs[offset:(offset + ndofs - convert(Ti, 1))]
    return view
end
(Suggested change to the lines above; whitespace only.)
src/gpu/gpu_operator.jl
Outdated
Also related: how can we decouple the assembly strategy itself from the specific backend?
Just initial rough ideas for the design of GPU linear operators.
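One possible shape for that decoupling, sketched with purely illustrative names:

```julia
# Sketch: keep "how element contributions are assembled" (strategy) separate from
# "where the kernels run" (device/backend), so a new backend only needs to provide
# the launch machinery instead of re-implementing every strategy.
abstract type AbstractAssemblyStrategy end
struct ElementAssemblyStrategy <: AbstractAssemblyStrategy end          # one element matrix per cell
struct QuadraturePointAssemblyStrategy <: AbstractAssemblyStrategy end  # one contribution per quadrature point

abstract type AbstractDevice end
struct CPUDevice <: AbstractDevice end
struct CUDADevice <: AbstractDevice end

# The operator carries both independently.
struct LinearOperatorSketch{S <: AbstractAssemblyStrategy, D <: AbstractDevice}
    strategy::S
    device::D
end

# Dispatch splits on the device, while the strategy stays shared across backends.
update_sketch!(op::LinearOperatorSketch{<:Any, CPUDevice}, t) = nothing   # CPU loop would go here
```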