Compact legendre polynomials #164
Conversation
(convert to draft for the moment ==> it is finished, but we should first merge the other PR, then rebase and review)
Force-pushed from 04194bc to d0d2349
Force-pushed from d0d2349 to 2a87188
Besides replacing ext_acc with the proper "copy module", this is the only PR left over from the old GPU branch; I have now rebased it on top of develop. This PR compacts the Legendre polynomials and thereby removes the zero padding. It is not expected to interfere with any other PR, as it only touches the GEMMs. Since I am touching the CUDA interfaces anyway, I also added const to the pointers in the interface.
@samhatfield @wdeconinck Feel free to review when you find time.
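A minimal sketch of what the const change amounts to, assuming a naive reference GEMM; the function name, signature, and layout here are hypothetical, not the actual ecTrans CUDA interface:

```cpp
// Hypothetical sketch, not the actual ecTrans interface: const-qualifying
// the input pointers makes the compiler reject any accidental write to the
// Legendre coefficients or input fields; only the output stays mutable.
extern "C" void legendre_gemm(const double* pnm,    // compacted polynomials (read-only)
                              const double* fields, // input fields (read-only)
                              double* out,          // result (written)
                              int m, int n, int k)
{
    // Naive row-major reference GEMM: out = pnm * fields.
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
            double acc = 0.0;
            for (int l = 0; l < k; ++l)
                acc += pnm[i * k + l] * fields[l * n + j];
            out[i * n + j] = acc;
            // pnm[0] = 0.0; // would now be a compile-time error
        }
}
```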
I think this all looks fine, and what a nice saving!
Perhaps @samhatfield can run it in a few places to double check all is in order?
If I can find the time this week, I will take a look. It might have to wait until next year.
Sorry, I am gradually getting through this finally :) Will add some more comments soon.
No worries - I merged it with develop.
If I understand right, you have basically flattened the Legendre polynomial work arrays (removing the zero padding). I am happy to merge this pending the conversation above, and once you give me the go-ahead @lukasm91.
I just modified the consts to be "consistent" now. Your description makes sense! I think from my side we are ready to merge.
Legendre polynomials don't need to be stored zero-padded; we can just concatenate them, without the zero padding.
E.g. on tco2559, this saves almost 40 GB per rank (it used to be 64 GB, now it is 26 GB). We initially did this for Leonardo, because the Legendre coefficients used almost the whole device memory.
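For illustration, here is a minimal sketch of where the saving comes from, with made-up grid sizes (not ecTrans code): under triangular truncation T, zonal wavenumber m only has T - m + 1 coefficients, so padding every wavenumber out to the full T + 1 roughly doubles the footprint compared to concatenating the blocks and indexing through per-m offsets.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical sketch (not ecTrans code): compare zero-padded rectangular
// storage of Legendre polynomials with a compact, offset-indexed layout.
int main() {
    const long long T = 2559;    // truncation, e.g. tco2559
    const long long nlat = 2560; // latitude count, illustrative only

    // Per-m offsets into one flat, compact array: wavenumber m
    // contributes only (T - m + 1) coefficients per latitude.
    std::vector<long long> offset(T + 2, 0);
    for (long long m = 0; m <= T; ++m)
        offset[m + 1] = offset[m] + (T - m + 1) * nlat;

    const long long compact = offset[T + 1];            // concatenated size
    const long long padded  = (T + 1) * (T + 1) * nlat; // zero-padded size

    std::printf("padded : %lld values\n", padded);
    std::printf("compact: %lld values (%.1f%% of padded)\n",
                compact, 100.0 * compact / padded);

    // Element (wavenumber m, coefficient c, latitude j) then lives at
    //   flat[offset[m] + c * nlat + j]
    return 0;
}
```

The triangular count alone accounts for roughly a factor of two; the reported tco2559 figures (64 GB down to 26 GB) suggest the padded layout wasted even more than that in practice.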