Kranc currently supports DG operators, but not efficiently. I am planning to add a feature that makes this efficient. The basic plan is to loop over a grid function in two layers: an outermost layer that loops over elements (or "tiles" in general), and an innermost layer that loops over all collocation points within an element.
OpenMP parallelization is applied only to the outermost layer. Vectorization is only relevant for the innermost layer. Derivatives etc. are applied "en bloc" to a whole element.
This should also generalize to other numerical methods such as finite differencing. Replace the term "element" by "tile", and "collocation point" by "grid point" for this. The tile size can be chosen freely and is not restricted to the element size as for DG. This optimization has proven beneficial in Chemora for FD on GPUs, so I assume this would also lead to a performance benefit on CPUs.
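A minimal sketch of the two-layer loop structure described above, for a simple pointwise operation on a 2D grid. All names (`scale_tiled`, `NI`, `TILE`, etc.) are illustrative and not Kranc's; the point is only the nesting: the outer layer iterates over tiles and carries the OpenMP parallelism, while the inner layer iterates over the points of one tile in a contiguous, vectorizable fashion.

```c
#include <assert.h>

#define NI 16     /* grid points in i direction */
#define NJ 16     /* grid points in j direction */
#define TILE 4    /* tile size (freely chosen for FD; element size for DG) */

/* Apply out = 2*in over an NI x NJ grid in two loop layers. */
void scale_tiled(const double *in, double *out) {
  /* Outer layer: one iteration per tile; OpenMP parallelism lives here. */
#pragma omp parallel for collapse(2)
  for (int tj = 0; tj < NJ; tj += TILE) {
    for (int ti = 0; ti < NI; ti += TILE) {
      /* Inner layer: all points of one tile; the i loop is contiguous
         in memory, so the compiler can vectorize it. */
      for (int j = tj; j < tj + TILE; ++j) {
        for (int i = ti; i < ti + TILE; ++i) {
          const int idx = j * NI + i;
          out[idx] = 2.0 * in[idx];
        }
      }
    }
  }
}
```

For DG, the tile bounds would coincide with element boundaries; for FD, `TILE` is a tunable parameter.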
I attach a sample showing how the generated code (sans Kranc-typical boilerplate) could look, following the structure described above.
Additionally, I recently tested that using tiling improved performance significantly on Intel MICs, possibly due to giving a larger number of small work units to distribute using OpenMP. Tiling is currently implemented in LoopControl. Why is it better to do tiling in Kranc than in LoopControl or cctk_Loop.h?
I want to add other features such as pre-calculating derivatives per tile instead of calculating them per grid point or per grid function. This is not easily possible otherwise. See the attached gist, which shows the structure of the generated code I want to have.
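A hedged sketch of what "pre-calculating derivatives per tile" could mean, independent of the attached gist (all names here, `tile_deriv_x`, `rhs_tile`, `TILE`, are hypothetical): the derivative stencil is applied once for a whole tile into a small scratch buffer, and the RHS loop then reuses that buffer, instead of re-applying the stencil at every grid point and every use.

```c
#include <assert.h>

#define N 18      /* total points, including one ghost point at each end */
#define TILE 8    /* points per tile */

/* Precompute du/dx for all TILE points of one tile at once, using a
   2nd-order centered difference. t0 is the first point of the tile;
   the caller guarantees u[t0-1] and u[t0+TILE] exist (ghost points). */
static void tile_deriv_x(const double *u, double du[TILE], int t0, double dx) {
  for (int i = 0; i < TILE; ++i)
    du[i] = (u[t0 + i + 1] - u[t0 + i - 1]) / (2.0 * dx);
}

/* RHS evaluation for one tile: the derivative is computed once into a
   per-tile buffer and then reused per point, rather than recomputed
   inside the point loop. */
static void rhs_tile(const double *u, double *rhs, int t0, double dx) {
  double du[TILE];               /* per-tile scratch buffer */
  tile_deriv_x(u, du, t0, dx);
  for (int i = 0; i < TILE; ++i)
    rhs[t0 + i] = -du[i];        /* e.g. advection: u_t = -u_x */
}
```

Doing this inside a Kranc-controlled tile loop, rather than in LoopControl, is what gives the code generator the chance to hoist whole-tile operations like this out of the point loop.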