Parallel computing #9
Hello. I might be able to help with this issue from a programming perspective.
Wow. Someone addressing an issue! I have never, ever, used any kind of multithreading. My uneducated guess would be to use pthreads, but it would break compatibility with other platforms (or am I mistaken?). To be totally honest, I forgot I even had this issue open. My focus switched to OpenCL, since while multithreading sounds cool, I doubt any consumer device can offer 1000+ threads... The issue lies in …
And more specifically this line (that comes up several times):
SENPAI is iterating through the universe to perform the same operation on each atom. This is stupidly inefficient, and this process should be made parallel. I just kept postponing this issue since I have no experience with either multithreading or parallel computing.
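For reference, with OpenMP the shared-memory version of such a loop is often a single annotation. A minimal sketch, assuming a hypothetical `universe_t` and a per-atom update routine (SENPAI's real names differ):

```c
#include <stdint.h>

/* Hypothetical names for illustration; SENPAI's real API differs. */
typedef struct universe_s universe_t;
extern uint64_t universe_atom_nb(const universe_t *u);
extern void atom_update(universe_t *u, uint64_t i);

/* Build with: gcc -fopenmp ... */
void universe_update_all(universe_t *u)
{
  uint64_t n = universe_atom_nb(u);

  /* The iterations are distributed over a thread pool sized to the
     core count. Correct only if atom_update(u, i) touches nothing
     outside atom i's own data. */
  #pragma omp parallel for schedule(static)
  for (uint64_t i = 0; i < n; ++i)
    atom_update(u, i);
}
```

If the per-atom update reads other atoms' positions while writing its own, the positions would need double-buffering before this becomes safe.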
However, I would also like to try an OpenMP approach. This will involve an extra step in the compiling procedure, but we will figure it out. In order to have conclusive results, though, I need a way to run the algorithm deterministically (i.e. for a given input, always get the same output). How can I run your software multiple times and get similar running times and the same output? When I tried running:
Is there a random generator somewhere in the code?
Ouch, 8 hours is mean. The mechanics are fully deterministic, but the simulations cannot be reproduced identically without some (light) changes.

The way SENPAI works is that it loads a system from a …

A quick fix that would allow for a simple simulation to be fully deterministic would be to have …

With such a fix, a simulation such as …

Note the …

Another issue worth pointing out is that …
To sum up, if full determinism for development purposes is required: …
Let me know if you have more questions. Having zero concrete programming experience due to my curriculum, I do understand that my methods, usage of …
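For the record, the generic shape of this kind of determinism fix, assuming the randomness comes from C's `rand()` seeded from the clock (a sketch, not SENPAI's actual code):

```c
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper: seed the C library RNG once at startup.
   A fixed, user-supplied seed makes runs bit-for-bit reproducible;
   a negative value restores the usual time-based seeding. */
void rng_init(long seed)
{
  if (seed >= 0)
    srand((unsigned)seed);       /* deterministic runs for testing */
  else
    srand((unsigned)time(NULL)); /* default: varies run to run     */
}
```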
If Oscars were given for a job well done, I'd nominate you! I did the steps that you suggested, and now we have a deterministic system where I can further investigate which multithreading solution fits best. Moreover, I can check that my changes do not change the output of the current application in any way, assuring the persisting …

All this setup looks extremely good! I will keep you updated.
FYI, there are three main strategies to parallelize this kind of software: shared memory parallelism (i.e. threads, OpenMP); distributed memory parallelism (i.e. MPI, running the code on multiple computers connected by a high-speed network); and accelerators (GPU with CUDA or OpenCL, Xeon Phi, etc.). All of these strategies can be used together, and the biggest codes in the community use all of them. In order of complexity, I would rank them as …

Another thing to consider is what to parallelize: in most MD codes, the most expensive step is the computation of the forces acting on all atoms, and the use of periodic boundary conditions. The actual integration step only takes a fraction of this time. I don't know how much this applies to this project =)
Indeed, you can choose to parallelize at different levels based on the architecture you aim to execute the code on. While accelerators (like GPUs) can yield an outstanding speedup, they introduce an explicit need for new hardware, so the application is likely to become reliant on the architecture of the computing system. Distributed memory parallelism (using the Message Passing Interface, for instance, for communication) is likely to be more efficient when running across multiple systems. If the application is meant to run on ordinary personal computers, where the number of cores is small, shared memory might be the better approach. We may want to try both the shared memory and distributed memory approaches.

What to parallelize is again an important question. I think the best plan is to start with some profiling. I expect that next week I will be able to come back with some preliminary results.
MPI/acceleration is far beyond my skills, I'm afraid, so I won't be commenting on that.
SENPAI's bottleneck indeed lies in the horrible way it handles periodic boundary conditions. It's inefficient beyond any reasoning. I implemented those at the end of an internship this summer, when the heatwave and a failed UV-vis stacked up into a massive burnout, and I never got around to convincing myself to spend more time on them.
I actually have an old ProLiant DL380 waiting to run the thing. If we do get around to trying out MPI, I might invest in a few Raspberry Pis. I'm really busy right now: I'm prototyping a 3D-printable FTIR system, and learning good Git practice to avoid polluting SENPAI's repository now that I have contributions. As such, I'll wait for development regarding parallelism before pushing anything; if anything happens, it's going to be cleanup.
My inexperienced bets are on …

By the way, I lack experience with parallelism, so don't expect me to shine a light on any path, but would it be more efficient to parallelise within parallel threads, if that makes any sense? What I mean is something along the lines of updating each atom's force in parallel, with each of those threads again parallelising all of its computations. I have no choice but to put blind trust in contributions when it comes to this issue/feature.
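On the nested-parallelism question: OpenMP does allow parallel regions inside parallel regions, but on a CPU with a handful of cores the inner teams mostly oversubscribe it. A toy sketch of what nesting would look like, with invented names and a placeholder force law:

```c
#include <omp.h>

/* Outer loop: one thread per atom. Inner loop: each of those threads
   spawns its own team for the pairwise sum. On a machine with few
   cores this multiplies the thread count and oversubscribes them. */
void forces_nested(double *force, const double *pos, int n)
{
  omp_set_max_active_levels(2); /* nesting is disabled by default */

  #pragma omp parallel for
  for (int i = 0; i < n; ++i)
  {
    double f = 0.0;

    #pragma omp parallel for reduction(+ : f)
    for (int j = 0; j < n; ++j)
      if (j != i)
      {
        double r = pos[i] - pos[j];
        f += 1.0 / (r * r); /* placeholder pair interaction */
      }

    force[i] = f;
  }
}
```

In practice the usual choice is a single flat parallel loop over atoms, keeping the inner pair loop serial (or SIMD-vectorized).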
Some preliminary results:

[profiling output, first run]

2nd run:

[profiling output, second run]
I am bewildered. More than 75% of the simulation time is spent just dealing with the periodic boundary conditions. I'll open an issue just for this one. Parallel computing or not, this needs to be dealt with.
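For context, the minimum-image convention that PBC codes typically apply to each atom pair costs only a few operations, so a figure like 75% suggests the overhead lies in how often or how naively it is invoked. A generic sketch (not SENPAI's implementation), assuming a cubic box:

```c
#include <math.h>

/* Minimum-image displacement along one axis of a cubic box of side
   `box`: shift dx by whole box lengths into [-box/2, box/2]. */
static double min_image(double dx, double box)
{
  return dx - box * round(dx / box);
}

/* Squared distance between atoms a and b under PBC. */
double pbc_dist2(const double a[3], const double b[3], double box)
{
  double dx = min_image(a[0] - b[0], box);
  double dy = min_image(a[1] - b[1], box);
  double dz = min_image(a[2] - b[2], box);
  return dx * dx + dy * dy + dz * dz;
}
```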
Random question: from what I see, the code currently does something like the first loop in the sketch below:
What do you think if we change the order of the loops, such that the code becomes the second form in the sketch?
This way, we might be able to parallelize the problem more easily, as … Are there any data dependencies between different iteration steps and different atoms?
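The elided snippets presumably looked roughly like this reconstruction (`update_atom`, `n_steps`, and `n_atoms` are placeholders, not SENPAI identifiers):

```c
typedef struct universe_s universe_t;
extern void update_atom(universe_t *u, long atom, long step);

/* Current order: advance the whole universe one time step at a time. */
void current_order(universe_t *u, long n_steps, long n_atoms)
{
  for (long t = 0; t < n_steps; ++t)   /* outer loop: time steps */
    for (long i = 0; i < n_atoms; ++i) /* inner loop: atoms      */
      update_atom(u, i, t);
}

/* Proposed order: advance each atom through all time steps on its own. */
void proposed_order(universe_t *u, long n_steps, long n_atoms)
{
  for (long i = 0; i < n_atoms; ++i)   /* outer loop: atoms      */
    for (long t = 0; t < n_steps; ++t) /* inner loop: time steps */
      update_atom(u, i, t);
}
```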
FYI: there is a well-explained tutorial on how to use gprof.
Changing the order of the loops is impossible. The force applied to a single atom is a function of its distance to all other atoms, which changes constantly: an iteration is directly dependent on the last one. This is one of the many reasons why the N-body problem is such a PITA. The coordinates of each atom cannot be expressed as a function of time, so numerical integration is mandatory.
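Concretely, a typical integration step (velocity Verlet shown here as an illustration, not necessarily SENPAI's integrator) makes the dependency visible: the force on each atom needs every atom's position from the current step, so time steps cannot be reordered or distributed per atom; only the loops over atoms within a step can run in parallel.

```c
/* Illustration only: one velocity-Verlet step for a toy 1-D system.
   compute_forces must read EVERY position of the current step; that
   cross-atom dependency is what forbids swapping the loops. */
extern void compute_forces(double *f, const double *x, int n);

void verlet_step(double *x, double *v, double *f, int n, double dt, double m)
{
  for (int i = 0; i < n; ++i) /* half-kick, then drift */
  {
    v[i] += 0.5 * dt * f[i] / m;
    x[i] += dt * v[i];
  }

  compute_forces(f, x, n); /* barrier: needs all updated positions */

  for (int i = 0; i < n; ++i) /* second half-kick */
    v[i] += 0.5 * dt * f[i] / m;
}
```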
Got it! 👍 Btw, I re-ran the profiling with a new input. The previous profiling results were generated by running …

Now I tried to increase the number of atoms:
Seems that periodic boundary conditions are not that terrible for performance.
DAMN. Waking up to this on a Christmas morning! This is fantastic work - I'll be reviewing it soon :)
[DRAFT] SENPAI Parallel computing #9 -- OpenMP implementation
Revert "[DRAFT] SENPAI Parallel computing #9 -- OpenMP implementation"
Hey, I just looked at your repo, super cool project indeed!
Oof, I haven't touched parallel computing since 2019. I'm more than rusty; I wouldn't consider myself able to intervene in a beneficial manner. I'm also thinking about using a GPU, just like all the big simulators do. There is no way SENPAI will be used in actual production environments without GPU computing. However, SENPAI is FOSS, and I just can't allow it to rely on CUDA and NVIDIA's desperate pushes to consume the entire HPC market. If it comes down to the OpenCL/CUDA choice, OpenCL it is.
Just a quick thought: how about getting distributed computing going first (MPI support, so that it can run on entire clusters), and then working on GPU computing? The end game would be to have both: distributed computing on machines full of GPUs.
Well, I doubt you'll have a cluster at your disposal to mess around with anytime soon, while it's much more likely you'll have a GPU (or more than one, see mining rigs) readily available to start computing right away. In my opinion you would get the most benefit from implementing GPU support first, but that's just my opinion. Anyway, a good starting point could be to start refactoring the code to support a "solver" module, which can then be implemented differently based on the parallel computing architecture you end up choosing. What do you think about that?
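One way to picture such a solver module in C is a small table of function pointers per backend; every name below is hypothetical:

```c
#include <string.h>

typedef struct universe_s universe_t;

/* A backend-agnostic solver interface: the core program only calls
   through these pointers, and each backend (serial CPU, OpenMP,
   OpenCL, MPI...) fills them in differently. */
typedef struct solver_s
{
  const char *name;
  int  (*init)(universe_t *u);     /* allocate buffers, pick devices  */
  int  (*step)(universe_t *u);     /* advance the system one timestep */
  void (*shutdown)(universe_t *u); /* release backend resources       */
} solver_t;

extern const solver_t solver_serial; /* plain-CPU fallback */
extern const solver_t solver_openmp;
extern const solver_t solver_opencl;

/* Runtime selection, e.g. from a hypothetical --solver flag. */
const solver_t *solver_select(const char *want)
{
  if (want == NULL)                return &solver_serial;
  if (strcmp(want, "openmp") == 0) return &solver_openmp;
  if (strcmp(want, "opencl") == 0) return &solver_opencl;
  return &solver_serial;
}
```

An MPI+OpenCL build would then just be another `solver_t` whose `step` handles the inter-node exchange and kernel launches behind the same three calls.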
I actually have a cluster in my living room, a 12U rack cabinet full of servers waiting to be used, aha. I have no GPU, though. I bought the servers for this exact purpose, and wanted to get GPUs too, but life happened. And the GPU shortage happened.
I don't mind sacrificing portability to have SENPAI run exclusively on Linux x86_64, if it means getting MPI+OpenCL support. However, I'd still like SENPAI to run on personal computers. SENPAI was a useful learning tool for me, and I'd like others to enjoy it as well. Maybe it should have a way to fall back to a more classic "non-distributed, CPU only" mode, maybe with multithreading. How would you describe the solver module in a hypothetical MPI+OpenCL scenario?
No joke! Well, let's go for that then! Although I have to warn you, I have no MPI experience... and no way of testing things out myself, since I have only one computer.
I'm tired of running simulations overnight, gimme some multithreading