[MPI][AMGCL] MPI version of AMGCL solver not performing as expected #3868
Minor comment, just regarding the last point. Can you confirm that you are disabling the OpenMP parallelism in MPI runs (for example by limiting OpenMP to a single thread)?
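For reference, a minimal sketch of one way to do this, assuming the case is driven by a Python script that imports KratosMultiphysics (the exact launch script is not shown in this thread):

```python
# Sketch: force OpenMP to a single thread so that MPI timings are not
# mixed with shared-memory parallelism.  The environment variable must be
# set before any OpenMP-enabled module (e.g. KratosMultiphysics) is loaded;
# equivalently, export OMP_NUM_THREADS=1 in the shell before mpirun.
import os
os.environ["OMP_NUM_THREADS"] = "1"
```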
Yes, this setting was chosen and the run was repeated to double-check. The wording "slow execution" may be a poor formulation: because the termination criterion is approached only slowly, the solution step is repeated many times, and this takes a long time. Looking at a single solution step, the MPI version is faster.
Ok understood. Thanks for the clarification.
I'll need @RiccardoRossi to help me reproduce this. Maybe chat later today?
@RiccardoRossi, I think @jcotela was correct, because you know more than me regarding actually using the MPI amgcl solver in Kratos (I still have very little idea of how to actually run Kratos). I managed to compile enough of Kratos to run the serial part of @swenczowski's example. When I export the matrix, I can solve it with the standalone amgcl solver, both the serial and the MPI one (and I don't see too much of a discrepancy between the two). I cannot compile METIS_APPLICATION on my machine: it seems that the metis headers included in the Kratos source conflict with the metis version installed on my system (I have metis v5.1 and parmetis v4.0 installed on Arch Linux). The error I get is:
@RiccardoRossi you are completely right on that. I did not intend to play down @ddemidov's role in this, I was just trying to solve what looked (to me) like an internal Kratos integration issue before calling him in... (thanks @ddemidov by the way) @ddemidov Kratos "assumes" an older version of metis by default. Can you check that you have set the option
Thanks, that solved the metis issue! Now, when I run
Should I run it as
Yes, please. Also "-n 2" showed the described behavior for me.
I can see that the solver returns nan as error. But, if I solve the same system (saved to a MatrixMarket file by setting the corresponding option) with the standalone solver:

Serial solve:

MPI solve:
When I set

EDIT: In fact, I should be able to temporarily add the matrix-saving code myself.
Partitioning does not seem to be a problem. When I save the matrix parts to separate files, I can still solve those with a (modified) standalone solver:
Just a workaround while we look for the problem: please set "use_block_matrices_if_possible" to false in the solver settings.
Some progress: I was able to reproduce the NaN error with the standalone amgcl solver by switching from MatrixMarket to binary output format. So the slight difference in precision between the text and binary formats matters here, which makes me suspicious about the correctness of the problem (could it be singular?). I still don't know what exactly the problem is, but at least we can be sure that the Kratos code wrapping amgcl is working correctly. Another possible workaround is to switch from damped_jacobi to spai0 as relaxation:
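As an illustration only, a sketch of how both workarounds might be expressed in the Kratos linear-solver settings; "use_block_matrices_if_possible" is quoted later in this thread, while the other key names and values are assumptions about the AMGCL solver configuration rather than verified settings:

```python
# Sketch of a solver configuration applying both workarounds discussed
# above (plain, non-block matrices and spai0 relaxation).  Key names other
# than "use_block_matrices_if_possible" are assumed, not taken verbatim
# from this thread.
import KratosMultiphysics

linear_solver_settings = KratosMultiphysics.Parameters("""{
    "solver_type"                    : "amgcl",
    "smoother_type"                  : "spai0",
    "use_block_matrices_if_possible" : false
}""")
```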
Ok, I think I know the reason for the NaNs: the 168-th diagonal block of the system matrix belonging to the second MPI process is singular (its inverse is a 3x3 matrix of NaNs). The block in question is:

In fact, the whole 168-th row looks bad (column number followed by the 3x3 value):

The rows below and above it look normal.

EDIT: this could be a problem with the block matrix adapter, which expects matrix rows to be sorted by column number. It looks like this is not the case for the matrices I get here. Will look at this later.
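A small SciPy sketch of the two checks described above, namely whether the column indices are sorted within each row and whether any 3x3 diagonal block is numerically singular, assuming the local matrix of one process has been exported to a MatrixMarket file (the file name is a placeholder):

```python
# Sketch: diagnostic checks on one exported MPI partition of the matrix.
# "A_part1.mm" is a placeholder name, not a file produced in this thread.
import numpy as np
from scipy.io import mmread

coo = mmread("A_part1.mm").tocoo()

# 1) Are the column indices non-decreasing within each row, in the order
#    the entries are stored?  (What an adapter expecting sorted rows needs.)
order = np.lexsort((np.arange(coo.nnz), coo.row))  # stable sort by row
cols, rows = coo.col[order], coo.row[order]
unsorted = np.any((np.diff(rows) == 0) & (np.diff(cols) < 0))
print("unsorted columns within some row:", bool(unsorted))

# 2) Look for numerically singular 3x3 diagonal blocks.
A = coo.tocsr()
for i in range(A.shape[0] // 3):
    block = A[3 * i:3 * i + 3, 3 * i:3 * i + 3].toarray()
    if np.linalg.matrix_rank(block) < 3:
        print("singular 3x3 diagonal block at block row", i)
```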
The main reason is this commit: ddemidov/amgcl@3468ad8. It should fix #3868.

Should be fixed by #3896.

@swenczowski, can you please confirm that #3896 helps?
Thank you very much for the update. The branch was checked out and compiled in Release mode. The implementation in #3896 was tested on my local machine in different cases, meaning

The repeated observation was that the solver operates fast and (judging from the results) correctly when launched as "mpirun -n 2". However, when I use 3 or 4 processes on my machine, the previously described unexpected behaviour is still present. I am very sorry. @ddemidov Can you, in return, reproduce this observation? For me, it could also be seen in the case attached to the issue. If you prefer a different case with certain specifications, please just ask and I will try to generate one. (Edit: None of the above-mentioned workarounds was applied at the same time.)
What exactly is the unexpected behavior you observe with mpi=3 and mpi=4? I don't see any problems with your mpi example with #3896. As @jcotela said, remember to disable OpenMP parallelism, or it may seem that the solution is too slow. EDIT: Also, you should not need the workarounds (disabling block solves or switching to spai0).
My apologies for the false alarm. I was convinced that I had

Yes, now the solver seems to perform fast and correctly in all locally tested cases. Thank you very much for your effort.

Edit: Now also successfully tested on CoolMUC2 and a larger case. A great difference in performance compared to the MultiLevel solver!
Hi, I have the same issue again that @swenczowski had once. I am running a simple 2D fluid simulation with the AMGCL solver using OMP with 8 threads, MPI with 1 process and 1 thread, and MPI with 8 processes and 1 thread. I observed the following:

The problem with the 2nd observation is that it does not converge. I used "use_block_matrices_if_possible" : false as well; it does not change the observations I made. I have attached the case which I used to produce the above-mentioned observations. Thanks a lot.
Can it be that the matrix is simply reordered?
I checked entries like (1,1) in both files... they are totally different. I'm not sure how equation ids and dof ids are distributed in OMP and MPI. I thought they were the same, aren't they?
Explaining a little more what I mean: if you import the two A.mm files in, say, Python and solve them with a direct solver, try to see if norm(dx) is the same in the two cases.

There is no guarantee that the ordering of the matrix is the same (generally it will not be).
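A minimal sketch of the suggested check, assuming the OMP and MPI runs exported their systems as A_omp.mm/b_omp.mm and A_mpi.mm/b_mpi.mm (these file names are placeholders, not names used by Kratos):

```python
# Sketch: compare norm(dx) of the OMP- and MPI-exported systems using a
# sparse direct solver, as suggested above.  File names are placeholders.
import numpy as np
from scipy.io import mmread
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import spsolve

def solution_norm(matrix_file, rhs_file):
    A = csr_matrix(mmread(matrix_file))
    b = mmread(rhs_file)
    b = (b.toarray() if hasattr(b, "toarray") else np.asarray(b)).ravel()
    return np.linalg.norm(spsolve(A, b))

print("OMP norm(dx):", solution_norm("A_omp.mm", "b_omp.mm"))
print("MPI norm(dx):", solution_norm("A_mpi.mm", "b_mpi.mm"))
```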
I checked it with direct solvers in Python. They all give the same norm(dx) value. But this still does not solve the issue of non-convergence. That is, if I run the same case with MPI with 8 processes and 1 thread, it does not converge (but the OMP run converges without a problem). Am I doing something wrong with the settings? The following are the settings I am using in the MPI case, apart from the defaults in amgcl.
The following observations were made for the AMGCL linear solver in cases using different solver types. In particular, I tested the "two_fluids" and the "monolithic" solver. For comparison purposes, otherwise absolutely identical cases were computed with the serial and the MPI-parallel version of the solver.
AMGCL (for serial applications with 1 or more threads)

amgcl (for MPI-based applications)
If you are interested in reproducing the behaviour, please find a small and well-known case (original case provided by @rubenzorrilla) in the archive I am attaching.
@philbucher Since I assume that several people may be affected, and also for the sake of documentation, I finally created an issue for this observation.
AMGCLissue.tar.gz