
[MPI][AMGCL] MPI version of AMGCL solver not performing as expected #3868

Open
swenczowski opened this issue Jan 17, 2019 · 27 comments · Fixed by #3896

@swenczowski
Collaborator

The following observations were made for the AMGCL linear solver with different solver types; in particular, I tested the "two_fluids" and the "monolithic" solver. For comparison, otherwise absolutely identical cases were computed with the serial and the MPI-parallel version of the solver.

AMGCL (for serial applications with 1 or more threads)

  • fast solution
  • quick convergence

amgcl (for MPI-based applications)

  • frequent output of "AMGCL: Unknown parameter class", as already mentioned in #2989 ([AMGCL] many warnings after #2807)
  • unsatisfactory convergence behaviour, sometimes even diverging to "+/- nan" for a case that quickly converges with the serial version
  • extremely slow execution

If you are interested in reproducing the behaviour, please find a small and well-known case (original case provided by @rubenzorrilla) in the attached archive.

@philbucher Since I assume that several people may be affected, and also for the sake of documentation, I finally created an issue for this observation.

AMGCLissue.tar.gz

@swenczowski swenczowski added the Help Wanted and Parallel-MPI (Distributed memory parallelism for HPC / clusters) labels on Jan 17, 2019
@jcotela
Member

jcotela commented Jan 17, 2019

Minor comment, just regarding the last point. Can you confirm that you are disabling the OpenMP parallelism in mpi runs? (for example with export OMP_NUM_THREADS=1 in your shell session before calling mpirun). This makes a great difference in performance.
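As an aside, here is a minimal sanity check (a sketch, not part of the Kratos scripts; the environment variable name is the standard OpenMP one, everything else is illustrative) that could be placed at the top of a driver script such as MainKratos.py to catch a forgotten export:

import os
import warnings

# OMP_NUM_THREADS is the standard OpenMP environment variable. If it is unset
# or not equal to 1, each MPI rank may spawn several OpenMP threads and
# oversubscribe the machine, which makes MPI runs look much slower than they are.
if os.environ.get("OMP_NUM_THREADS") != "1":
    warnings.warn("OMP_NUM_THREADS is not set to 1; "
                  "run 'export OMP_NUM_THREADS=1' before calling mpirun.")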

@swenczowski
Collaborator Author

Yes, this setting was used, and the run was repeated to double-check. The wording "slow execution" may be a poor formulation: because the termination criterion is approached only slowly, the solution step is repeated many times, and this takes a long time. Looking at a single solution step, the MPI version is faster.

@jcotela
Member

jcotela commented Jan 17, 2019

OK, understood. Thanks for the clarification.
I believe that @RiccardoRossi is the expert regarding the use of the MPI AMGCL solver in Kratos.

@ddemidov
Member

I'll need @RiccardoRossi to help me reproduce this. Maybe we can chat later today?

@RiccardoRossi
Member

Yes, I can definitely help you.

@jcotela, this is just to acknowledge that @ddemidov is the author of amgcl... and definitely more of an expert than me 😉

@ddemidov
Member

ddemidov commented Jan 18, 2019

@RiccardoRossi, I think @jcotela was correct, because you know more than me about actually using the MPI amgcl solver in Kratos (I still have very little idea of how to actually run Kratos).

I managed to compile enough of Kratos to run the serial part of @swenczowski's example. When I export the matrix, I can solve it with the standalone amgcl solver, both the serial and the MPI one (and I don't see much of a discrepancy between the two).

I cannot compile the METIS_APPLICATION on my machine; it seems that the METIS headers included in the Kratos source conflict with the METIS version installed on my system (I have metis v5.1 and parmetis v4.0 installed on Arch Linux). The error I get is:

FAILED: applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o 
/usr/bin/g++  -DADD_ -DAMGCL_GPGPU -DKRATOS_PYTHON -DKRATOS_USING_MPI -DKratosMetisApplication_EXPORTS -DVEXCL_BACKEND_CUDA -Iapplications/metis_application -I../applications/metis_application -I../external_libraries -I../kratos -I/usr/include/python3.7m -isystem /opt/cuda/include -isystem /home/demidov/work/vexcl -msse3 -std=c++11 -Wno-class-memaccess -funroll-loops -Wall -std=c++11 -Wsuggest-override -fopenmp -O3 -DNDEBUG -fPIC -fvisibility=hidden   -fPIC -std=c++14 -flto -fno-fat-lto-objects -Wall -Wno-missing-braces -Wno-deprecated-declarations -Wno-ignored-attributes -Wno-unused-local-typedefs -Wno-catch-value -std=gnu++11 -MD -MT applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o -MF applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o.d -o applications/metis_application/CMakeFiles/KratosMetisApplication.dir/custom_python/add_processes_to_python.cpp.o -c ../applications/metis_application/custom_python/add_processes_to_python.cpp
In file included from ../applications/metis_application/custom_processes/metis_divide_input_to_partitions_process.h:63,
                 from ../applications/metis_application/custom_processes/metis_divide_heterogeneous_input_process.h:11,
                 from ../applications/metis_application/custom_python/add_processes_to_python.cpp:23:
../applications/metis_application/custom_processes/metis_graph_partitioning_process.h:31:18: error: conflicting declaration of C function ‘int METIS_PartMeshDual(int*, int*, int*, int*, int*, int*, int*, int*, int*)’
       extern int METIS_PartMeshDual(int*, int*, int*, int*, int*, int*, int*, int*, int*);
                  ^~~~~~~~~~~~~~~~~~
In file included from /usr/include/parmetis.h:18,
                 from ../applications/metis_application/custom_processes/metis_divide_heterogeneous_input_process.h:7,
                 from ../applications/metis_application/custom_python/add_processes_to_python.cpp:23:
/usr/include/metis.h:219:16: note: previous declaration ‘int METIS_PartMeshDual(idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, idx_t*, real_t*, idx_t*, idx_t*, idx_t*, idx_t*)’
 METIS_API(int) METIS_PartMeshDual(idx_t *ne, idx_t *nn, idx_t *eptr, idx_t *eind,
                ^~~~~~~~~~~~~~~~~~

@jcotela
Member

jcotela commented Jan 18, 2019

@RiccardoRossi you are completely right on that. I did not intend to play down @ddemidov's role in this; I was just trying to resolve what looked (to me) like an integration issue inside Kratos before calling him in... (thanks @ddemidov, by the way)

@ddemidov Kratos "assumes" an older version of METIS by default. Can you check that you have set the option -DUSE_METIS_5=ON in your CMake configuration?

@ddemidov
Member

Thanks, that solved the METIS issue! Now, when I run python MainKratos.py from the MPI example, I get:

Traceback (most recent call last):
  File "MainKratos.py", line 32, in <module>
    simulation.Run()
  File "/home/demidov/work/Kratos/kratos/python_scripts/analysis_stage.py", line 47, in Run
    self.Initialize()
  File "/home/demidov/work/Kratos/kratos/python_scripts/analysis_stage.py", line 68, in Initialize
    self._GetSolver().ImportModelPart()
  File "/home/demidov/work/Kratos/applications/FluidDynamicsApplication/python_scripts/trilinos_navier_stokes_solver_vmsmonolithic.py", line 110, in ImportModelPart
    self.trilinos_model_part_importer = trilinos_import_model_part_utility.TrilinosImportModelPartUtility(self.main_model_part, self.settings)
  File "/home/demidov/work/Kratos/applications/trilinos_application/python_scripts/trilinos_import_model_part_utility.py", line 18, in __init__
    raise NameError("MPI number of processors is 1.")
NameError: MPI number of processors is 1.

Should I run it as mpirun -np 4 python MainKratos.py instead?

@swenczowski
Collaborator Author

Yes, please. Also "-n 2" showed the described behavior for me.

@jcotela
Member

jcotela commented Jan 18, 2019

@ddemidov see #3880 if you want to test with a single process. (edited: wrong '@')

@ddemidov
Member

ddemidov commented Jan 18, 2019

I can see that the solver returns nan as the error. But if I solve the same system (saved to a MatrixMarket file by setting verbosity=4) with the same parameters using the standalone amgcl solver, I cannot reproduce this.

Serial solve:

~/work/amgcl/build/examples/solver -A A.mm -f b.mm.rhs \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3

Solver
======
Type:             LGMRES(30,3)
Unknowns:         3072
Memory footprint: 2.69 M

Preconditioner
==============
Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.13
Memory footprint:    3.38 M

level     unknowns       nonzeros      memory
---------------------------------------------
    0         3072          20772      2.53 M (89.00%)
    1          403           2567    876.55 K (11.00%)

Iterations: 24
Error:      5.01826e-09

[Profile:       0.317 s] (100.00%)
[ self:         0.001 s] (  0.16%)
[  reading:     0.244 s] ( 77.14%)
[  setup:       0.012 s] (  3.85%)
[  solve:       0.060 s] ( 18.85%)

MPI solve:

mpirun -np 2 ~/work/amgcl/build/examples/mpi/mpi_amg -A A.mm -f b.mm.rhs \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3

World size: 2
Partitioning[PT-Scotch] 2 -> 2
Type:             LGMRES(30,3)
Unknowns:         1550
Memory footprint: 1.37 M

Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.13

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (89.16%) [2]
    1          397           2525 (10.84%) [2]

Iterations: 25
Error:      8.58271e-09

[Profile:         0.310 s] (100.00%)
[ self:           0.031 s] ( 10.14%)
[  partition:     0.007 s] (  2.26%)
[  read:          0.229 s] ( 73.91%)
[  setup:         0.007 s] (  2.22%)
[  solve:         0.036 s] ( 11.47%)

@ddemidov
Member

ddemidov commented Jan 18, 2019

When I set verbosity=4 in the MPI case, I still get a single matrix file. Is there a way to get the partitioned matrix parts? How is the system matrix partitioned here?

EDIT: In fact, I should be able to temporarily add the matrix-saving code myself.

@ddemidov
Member

ddemidov commented Jan 18, 2019

Partitioning does not seem to be the problem. When I save the matrix parts to separate files, I can still solve them with a (modified) standalone solver:

mpirun -np 2 ~/work/amgcl/build/examples/mpi/mpi_amg \
-p precond.coarse_enough=500 \
   precond.relax.type=damped_jacobi \
   precond.coarsening.type=aggregation \
   precond.coarsening.aggr.eps_strong=0 \
   solver.type=lgmres -b3 

World size: 2
Type:             LGMRES(30,3)
Unknowns:         1541
Memory footprint: 1.36 M

Number of levels:    2
Operator complexity: 1.12
Grid complexity:     1.12

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (89.40%) [2]
    1          384           2464 (10.60%) [2]

Iterations: 26
Error:      5.3051e-09

[Profile:     0.204 s] (100.00%)
[ self:       0.033 s] ( 16.42%)
[  read:      0.123 s] ( 60.29%)
[  setup:     0.009 s] (  4.57%)
[  solve:     0.038 s] ( 18.73%)

@RiccardoRossi
Member

Just a workaround while we look for the problem:

please set

   "use_block_matrices_if_possible" : false

@ddemidov
Member

ddemidov commented Jan 19, 2019

Some progress: I was able to reproduce the NaN error with the standalone amgcl solver by switching from MatrixMarket to the binary output format. So the slight difference in precision between the text and binary formats matters here, which makes me suspicious about the correctness of the problem itself (could the system be singular? see the sketch after the timings below).

I still don't know what exactly the problem is, but at least we can be sure that the Kratos code wrapping amgcl is working correctly.

Another possible workaround is to switch from damped_jacobi to spai0 as relaxation:

mpirun -n 2 ~/work/amgcl/build/examples/mpi/mpi_amg -P P.json -B -b3 \
-p precond.relax.type=spai0

World size: 2
Type:             LGMRES(500,3)
Unknowns:         1541
Memory footprint: 21.80 M

Number of levels:    3
Operator complexity: 1.13
Grid complexity:     1.14

level     unknowns       nonzeros
---------------------------------
    0         3072          20772 (88.34%) [2]
    1          387           2471 (10.51%) [2]
    2           49            272 ( 1.16%) [2]

Iterations: 129
Error:      7.84784e-09

[Profile:     0.339 s] (100.00%)
[ self:       0.040 s] ( 11.84%)
[  read:      0.001 s] (  0.44%)
[  setup:     0.018 s] (  5.29%)
[  solve:     0.280 s] ( 82.43%)
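One way to check the singularity suspicion above is to load the exported system and look at its singular values; a sketch assuming the A.mm file written with verbosity=4 (file name as used earlier in this thread), using scipy/numpy:

import numpy as np
import scipy.io

# Load the system matrix exported by the solver (MatrixMarket format).
A = scipy.io.mmread("A.mm").tocsr()

# The system is small (3072 unknowns), so a dense SVD is affordable and gives
# the full singular spectrum; a smallest singular value that is tiny relative
# to the largest one indicates a (nearly) singular system.
s = np.linalg.svd(A.toarray(), compute_uv=False)
print("largest / smallest singular value: {:.3e} / {:.3e}".format(s[0], s[-1]))
print("condition number estimate: {:.3e}".format(s[0] / s[-1]))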

@ddemidov
Member

ddemidov commented Jan 19, 2019

OK, I think I know the reason for the NaNs: the 168-th diagonal block of the system matrix belonging to the second MPI process is singular (its inverse is a 3x3 matrix of NaNs). The block in question is:

-0.00999503 0 0
 0 0 0
 0 0 0

In fact, the whole 168-th row looks bad (column number followed by the 3x3 block value):

159:  -0.00979576 -0.00400725 0.00310985
 0 0 0
 0 0 0

160:  -0.00648279 0.000788925 0.00395082
 0 0 0
 0 0 0

161:  -0.0217919 0 0
 0 0 0
 0 0 0

168:  -0.00999503 0 0
 0 0 0
 0 0 0

The rows below and above it look normal.

EDIT: this could be a problem with the block matrix adapter, which expects the entries in each matrix row to be sorted by column number. It looks like this is not the case for the matrices I get here. I will look at this later.
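For illustration, the singularity of the block reported above can be verified directly with numpy (the values are copied from the comment above; note that numpy refuses to invert the exactly singular block, whereas an unchecked block inverse inside the preconditioner silently produces the NaNs seen here):

import numpy as np

# The 168-th diagonal block reported above: two zero rows make it exactly
# singular, so any relaxation that inverts the diagonal blocks blows up.
block = np.array([[-0.00999503, 0.0, 0.0],
                  [0.0,         0.0, 0.0],
                  [0.0,         0.0, 0.0]])

print("determinant:", np.linalg.det(block))   # 0.0 -> singular
print("rank:", np.linalg.matrix_rank(block))  # 1 instead of 3

try:
    np.linalg.inv(block)
except np.linalg.LinAlgError as err:
    print("inversion fails:", err)  # "Singular matrix"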

ddemidov added a commit that referenced this issue Jan 19, 2019
Main reason is this commit: ddemidov/amgcl@3468ad8
It should fix #3868.
@ddemidov
Member

Should be fixed by #3896.

@ddemidov
Member

@swenczowski, can you please confirm that #3896 helps?

@swenczowski
Collaborator Author

swenczowski commented Jan 19, 2019

Thank you very much for the update. The branch was checked out and compiled in Release mode. The implementation in #3896 was tested on my local machine in different cases, meaning:

  • different solvers
  • different (but still relatively small) mesh sizes
  • different geometries.

The recurring observation was that the solver operates fast and (judging from the results) correctly when launched with "mpirun -n 2". However, when I use 3 or 4 processes on my machine, the previously described unexpected behaviour is still present. I am very sorry.

@ddemidov Can you, in return, reproduce this observation? For me, it could also be seen in the case attached to the issue. If you prefer a different case with certain specifications, please just ask and I will try to generate one.

(Edit: none of the above-mentioned workarounds was applied at the same time.)

@ddemidov
Member

ddemidov commented Jan 20, 2019

What exactly is the unexpected behavior you observe with 3 and 4 MPI processes? I don't see any problems with your MPI example with #3896. As @jcotela said, remember to disable the OpenMP parallelism, or it may seem that the solution is too slow.

EDIT: Also, you should not need the workarounds (disabling block solves or switching to spai0).

@swenczowski
Collaborator Author

swenczowski commented Jan 20, 2019

My apologies for the false alarm. I was convinced that I had export OMP_NUM_THREADS=1 specified in my ~/.bashrc. However, I made a typo there, and several threads were started. I am sorry, and thank you very much for the reminder.

Yes, now the solver seems to perform fast and correctly on all locally tested cases. Thank you very much for your effort.

Edit: Now also successfully tested on CoolMUC2 and a larger case. Great difference in performance compared to the MultiLevel solver!

@sunethwarna
Member

sunethwarna commented Mar 6, 2020

Hi,

I have the same issue again that @swenczowski once had. I am running a simple 2D fluid simulation with the AMGCL solver using OpenMP with 8 threads, MPI with 1 process and 1 thread, and MPI with 8 processes and 1 thread. I observed the following:

  1. OpenMP with 8 threads and MPI with 1 process / 1 thread produce the same "A.mm" file.
  2. OpenMP with 8 threads and MPI with 8 processes / 1 thread produce different "A.mm" files.

The problem with the second observation is that the run does not converge.

I used "use_block_matrices_if_possible" : false as well; it does not change the observations I made.
I checked the example given by @swenczowski as well; it also produces different "A.mm" files for cases 1 and 2.

I have attached the case which I used to produce the above-mentioned observations.
Would you be able to take a look at it, @ddemidov, @RiccardoRossi?

Thanks a lot

caarc_90_2d_slip_omp_mpi_test.zip

@sunethwarna sunethwarna reopened this Mar 6, 2020
@RiccardoRossi
Member

Can it be that the matrix is simply reordered?

@sunethwarna
Member

I checked entries like (1,1) in both files... they are totally different. I'm not sure how equation IDs and DOF IDs are distributed in the OpenMP and MPI runs. I thought they were the same, aren't they?

@RiccardoRossi
Member

Explaining a little more what I mean:

if you import the two A.mm files in, say, Python and solve each with a direct solver, check whether norm(dx) is the same in the two cases.
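A minimal sketch of that check with scipy (the file names A_omp.mm / b_omp.mm and A_mpi.mm / b_mpi.mm are placeholders; use whatever names the exported systems actually have):

import numpy as np
import scipy.io
import scipy.sparse
import scipy.sparse.linalg

def solution_norm(matrix_file, rhs_file):
    # Read the MatrixMarket files, solve with a sparse direct solver and
    # return norm(dx). The norm is invariant under a consistent reordering
    # of the unknowns, so it can be compared between the OpenMP and MPI runs.
    A = scipy.io.mmread(matrix_file).tocsr()
    b = scipy.io.mmread(rhs_file)
    b = np.asarray(b.toarray() if scipy.sparse.issparse(b) else b).ravel()
    dx = scipy.sparse.linalg.spsolve(A, b)
    return np.linalg.norm(dx)

print("OpenMP run:", solution_norm("A_omp.mm", "b_omp.mm"))
print("MPI run:   ", solution_norm("A_mpi.mm", "b_mpi.mm"))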

@RiccardoRossi
Member

There is no guarantee that the ordering of the matrix is the same (generally it will not be).

@sunethwarna
Member

I checked it with direct solvers in Python. They all give the same norm(dx) value. But that still does not solve the issue of non-convergence. That is,

if I run the same case with MPI with 8 processes and 1 thread, it does not converge (while the OpenMP run converges without a problem). Am I doing something wrong with the settings?

The following are the settings I am using in the MPI case, apart from the defaults in amgcl:

"solver_type": "amgcl",
"gmres_krylov_space_dimension": 300,
"coarse_enough": 5000,
"use_block_matrices_if_possible" : false

Is there a way to get around this?
