Crash on free(): invalid pointer #146
Closed
asmunder opened this issue Mar 23, 2020 · 14 comments

@asmunder (Contributor)

One of my 3D PeleC runs crashed with a free(): invalid pointer after almost 4 hours on a few hundred cores. Below is the crash log that I was able to extract. Seems like this is happening in the cleanup stage after a Level 1 solve.

I don't know whether this is a useful bug report, or whether it can be reproduced. But let me know if you are interested in trying to chase this one down, and I can provide more details and the case files.

AMReX commit 93fb085d28349 (Nov 1 2019 - this is the "current submodule" for PeleC)
PeleC commit 1821d36 (Feb 13 2020)

[Level 1 step 8611] Advanced 20480 cells
[Level 1 step 8612] ADVANCE with dt = 1.861440021e-08
... Computing MOL source term at t^{n} 
... Computing MOL source term at t^{n+1} 
... Computing reactions for dt = 1.861440021e-08
[Level 1 step 8612] Advanced 20480 cells
*** glibc detected *** PeleC3d.gnu.MPI.ex: free(): invalid pointer: 0x0000000003955350 ***
======= Backtrace: =========
/lib64/libc.so.6[0x398fe75e5e]
/lib64/libc.so.6[0x398fe78cad]
PeleC3d.gnu.MPI.ex[0x501660]
PeleC3d.gnu.MPI.ex[0x5015ef]
PeleC3d.gnu.MPI.ex[0x4fc32e]
PeleC3d.gnu.MPI.ex[0x4fc41e]
PeleC3d.gnu.MPI.ex[0x54bbb6]
PeleC3d.gnu.MPI.ex[0x54bc3a]
PeleC3d.gnu.MPI.ex[0x5489d3]
PeleC3d.gnu.MPI.ex[0x4fd601]
PeleC3d.gnu.MPI.ex[0x5d9644]
PeleC3d.gnu.MPI.ex[0x7a1139]
PeleC3d.gnu.MPI.ex[0x5d6ce7]
PeleC3d.gnu.MPI.ex[0x5d7b48]
PeleC3d.gnu.MPI.ex[0x5cbcbf]
PeleC3d.gnu.MPI.ex[0x41600a]
/lib64/libc.so.6(__libc_start_main+0x100)[0x398fe1ed20]
PeleC3d.gnu.MPI.ex[0x41690d]


Backtrace.139:
(parsed with parse_bt.py)

0: amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /home/asmunde/codes/amrex/Src/Base/AMReX_BLBackTrace.cpp:167

1: amrex::BLBackTrace::handler(int) at /home/asmunde/codes/amrex/Src/Base/AMReX_BLBackTrace.cpp:71

8: std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1854

9: std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.H:225
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_pair.h:198
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/new_allocator.h:140
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/alloc_traits.h:487
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:650
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:658
 (inlined by) std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1858

10: amrex::FabArrayBase::flushTileArray(amrex::IntVect const&, bool) const at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/new_allocator.h:125
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/alloc_traits.h:462
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:592
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:659
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:2477
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1125
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_map.h:1032
 (inlined by) amrex::FabArrayBase::flushTileArray(amrex::IntVect const&, bool) const at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.cpp:1562

11: amrex::FabArrayBase::clearThisBD(bool) at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.cpp:1614

12: amrex::FabArray<amrex::CutFab>::clear() at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArray.H:973

13: amrex::FabArray<amrex::CutFab>::~FabArray() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) amrex::FabArray<amrex::CutFab>::~FabArray() at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArray.H:1129

14: amrex::EBDataCollection::~EBDataCollection() at /home/asmunde/codes/amrex/Src/EB/AMReX_EBDataCollection.cpp:72 (discriminator 1)

15: amrex::EBFArrayBoxFactory::~EBFArrayBoxFactory() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/atomicity.h:49
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/atomicity.h:82
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:166
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:684
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:1123
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr.h:93
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/EB/AMReX_EBFabFactory.H:27
 (inlined by) amrex::EBFArrayBoxFactory::~EBFArrayBoxFactory() at /home/asmunde/codes/amrex/Src/EB/AMReX_EBFabFactory.H:27

16: amrex::AmrLevel::~AmrLevel() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:107
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:137
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:206
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) amrex::AmrLevel::~AmrLevel() at /home/asmunde/codes/amrex/Src/Amr/AMReX_AmrLevel.cpp:533

17: PeleC::~PeleC() at /home/asmunde/codes/PeleC/Source/PeleC.cpp:599

18: amrex::Amr::regrid(int, double, bool) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2917

19: amrex::Amr::timeStep(int, double, int, int, double) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2030

20: amrex::Amr::coarseTimeStep(double) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2439

21: main at /home/asmunde/codes/PeleC/Source/main.cpp:173

23: _start at ??:?


@jrood-nrel (Contributor)

It looks like this is happening in AmrLevel's destructor, which is AMReX code - specifically flushTileArray. I guess I'm not versed enough with the code to understand why regrid would call the destructors. Unfortunately I would have to do the classic redirection and finger-point to AMReX. Otherwise, is there anything else that could have caused your job to die, such as a time limit or a single rank dying? Can you restart and reproduce this at the same timestep?

@drummerdoc (Contributor)

Side comment... regrid calls the destructor because the box array associated with the AmrLevel is likely to have changed, which would make all the cached info associated with the box array invalid. On a regrid the old AmrLevel is destructed and a new one created; any caches associated with the AmrLevel need to be rebuilt.

@asmunder (Contributor, Author)

So I have been trying to dig a bit more here, and I can reproduce this behaviour, also on a different cluster and with different grid resolutions. When changing these it's not the same time step, and not always the same message - I have seen the variations free(): invalid pointer, corrupted size vs. prev_size, and double free or corruption (!prev). But it's always when trying to regrid.

All my runs without AMR are fine, but with AMR + EB + hydrogen combustion I've not been able to run my case successfully yet.

Just to rule out that I'm doing something stupid somewhere on my side, I'd like to run a Tutorial or other reference case with hydrogen combustion that has AMR and EB, to verify that this does not crash.

Is there such a case I could try?

@jrood-nrel (Contributor)

I am not aware of any cases besides the ones that exist in our Exec directory. Have you tried a newer version of AMReX? I've also been doing a lot of work in the cpp branch, which is replacing the Fortran code with C++ code. Most functionality is there - AMR and EB, for example. That is another avenue to try; in that case look in the ExecCpp directory for examples. That branch also has a very recent AMReX version in the submodule.

@whitmanscu

@asmunder, I got very similar errors trying to use propane/air chemistry with AMR and EBs last year. I saw the same variety of error messages, always when doing a regrid. I talked at some length with @nickwimer and @hsitaram (who was able to reproduce the error) but we never came to a conclusive solution, so I'm interested in any progress we can make here.

I did have some success running the same propane case shrunk down significantly, with correspondingly higher base resolution (~0.2mm/cell vs. ~2mm). This made me think that there may be an issue with base resolution or timestep size, although I couldn't afford to fully refine the base resolution on the full-size case to see if that fixed the issue. I did shrink the base resolution to ~0.5 mm/cell and still got the regrid crash.

That said, more recently I ran some H2/air bluff body cases successfully with AMR and EBs up until the flame tried to exit the domain, at which point I got a different crash: with NSCBCs it was due to the lack of species/reaction terms, and with other outflow conditions it was pressure issues - see #149. But at least I didn't get the regrid error! This was using the LiDryer mech with relatively high base resolution (~0.04mm/cell) and a very small physical domain, on the order of 1cm. What is your physical domain size and base resolution? Even if you see the error at multiple resolutions, could you maybe try a significantly smaller physical domain and resolution and see if you still get the error?

Some other notes on what I found did not error:

  • Any case without AMR.
  • Any case with reactions off. If I restart right before a crash with reactions off, even after having turned them on initially, I did not get a crash.
  • Any case with reactions turned on but no energy source to initiate reactions.
  • Very small domains/base resolutions in general; for reference the successful propane case had ~0.2mm base resolution and my H2 cases have ~0.04mm base resolution.

Notes on what did error:

  • Propane bluff body case with ~0.5mm base resolution or higher.
  • Same case with no bluff body, using an external source to add energy.
  • Same case with no bluff body and an external source, using SDC advance instead of MOL.

All of these cases ran fine without AMR.

Based on the above, if we are in fact dealing with the same issue, I don't think EBs are actually necessary to get this sort of error; it seems to depend on base resolution and/or timestep in conjunction with reactions, and for some reason it triggers specifically on AMR regridding. If I get time in the next few weeks I'll also try to revisit the issue with some of the more recent AMReX/cpp changes, as suggested, to see if they help at all.

@jrood-nrel (Contributor)

This makes me think there is a problem with the destructor during regrid interfacing with the EOS Fortran code, which in turn interfaces with the C code in the mechanisms. I'm hoping the C++ code, along with the updates that have occurred in PelePhysics, will have better luck with this issue, since everything is C++ in that case.

@asmunder (Contributor, Author)

Thanks both for the input.

@jrood-nrel I've noticed comments about this transition to C++, but is there a roadmap for it somewhere? Will all Fortran code be replaced, eventually? I have some custom code for the inlet (cf. issue #141) that is Fortran, in the files pmf_generic.f90, bc_fill_nd.F90, Prob_nd.F90. Do I need to migrate this code to C++?

Tangentially, if I want to run my existing code with a newer AMReX, which AMReX commit should I use?

@whitmanscu My current case has pretty fine resolution: a pseudo-2D expanding channel setup that is 2 cm x 8 cm, on an 8x1024x4096 base grid with resolution 0.02 mm/cell, plus three levels of AMR. I'm doing this because I'm targeting a hydrogen/air reheat flame at 15 bar pressure, so I expect the flame front to be very thin. The corresponding case at 1 bar is fully resolved on the same grid without using AMR, and is running fine.

@drummerdoc (Contributor)

@asmunder Eventually, all the AMReX-based codes will be migrated to use AMReX's kernel launching strategy, which has generally implied the use of C++ kernel functions to maximize the amount of inlining that the compilers can do. However, there is no formal restriction on the programming language used within the kernels. For PeleC, this migration has already happened in one of the branches, but it hasn't been pulled into development yet. The Pele codes are the last of the AMReX codes to make this transition (PeleLM is even behind PeleC though). Even if the BCs and ICs get pulled into C++, you can still call your Fortran helper functions within them if you want... or you can convert your code to C++ as well so that it inlines better for more efficient execution.
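(For readers following along: the launch pattern being referred to looks roughly like the sketch below. This is a made-up scale_field example, not PeleC code; the point is that the work is written as a C++ lambda that AMReX's ParallelFor launches on the device, or runs as an ordinary loop on CPU builds, which is what lets the compiler inline the kernel body.)

#include <AMReX.H>
#include <AMReX_MultiFab.H>
#include <AMReX_GpuQualifiers.H>

void scale_field (amrex::MultiFab& mf, amrex::Real factor)
{
    // Loop over the boxes owned by this rank; TilingIfNotGPU() enables
    // tiling on CPU builds and disables it on GPU builds.
    for (amrex::MFIter mfi(mf, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.tilebox();
        auto const& a = mf.array(mfi);
        // The lambda is the "kernel": launched on the device for GPU builds,
        // executed as a plain triple loop otherwise.
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            a(i,j,k) *= factor;
        });
    }
}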

As for the crash-on-free error, there is a known issue for main that is similar. I can't imagine how it would help, but it's a quick edit to try... if it doesn't help, remove it and keep looking. The trick is to add an extra scope around the code between Initialize and Finalize, so in main.cpp:
....
Initialize(...);
{   // <------------------- Added
    // Code here
}   // <------------------- Added
Finalize();
}

Somehow this guarantees that all objects get their destructors properly called before the program exits.
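Spelled out, the pattern is roughly the following (a minimal sketch; the real main.cpp has more setup, and the exact Initialize arguments are elided here):

#include <AMReX.H>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {   // <---- extra scope: everything constructed inside is destroyed here,
        //       before Finalize() tears down AMReX internals
        // ... build the Amr driver and run the time steps ...
    }
    amrex::Finalize();
    return 0;
}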

@jrood-nrel (Contributor)

@asmunder The C++ code is now in the development branch. Hopefully it will help with this error. There should be plenty of example problems to illustrate how to move your problem setup to C++. There is also an up-to-date AMReX included as a submodule. We don't have a concrete roadmap, but most functionality is there; the most notable exceptions are NSCBC and non-ideal EOS.

@asmunder (Contributor, Author) commented Aug 4, 2020

@jrood-nrel I'm starting to look into this now. (In parallel I'm focusing on a non-EB case where the AMR works fine; there I'm working on other things such as sampling a plane, which I'll open another issue for.)

Unfortunately my existing case uses NSCBC; I guess I'll have to test whether I can get by without too many reflections if I just use a plain inlet/outlet.
Is there a plan for implementing the NSCBC stuff in the C++ code?

@drummerdoc (Contributor) commented Aug 4, 2020

Is there a plan for implementing NSCBC stuff in the C++ code?

Sorry to say that it's lower on our list at LBNL since we are currently focused on meeting our milestone to have all of PeleLM ported to GPU by the end of the fiscal year. PeleC, managed predominantly out of NREL, is focused on performance metrics for the same milestone. My guess is that early in the new FY (October, November) many of these usability aspects will be cleaned up. It may not actually be very hard at all to port the NSCBC stuff if you want to have a look yourself. Sorry that you are left hanging in this process.

@emotheau (Contributor) commented Aug 4, 2020

We can still call the Fortran routine to fill the ghost cells, right?

@drummerdoc (Contributor)

Yes, as long as the routines are marked as DEVICE or HOST_DEVICE, and as long as a suitable Fortran compiler is used.
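(As a rough C++ illustration of the same marking idea - a hypothetical helper, not PeleC's actual boundary routine - a function tagged with AMReX's AMREX_GPU_HOST_DEVICE qualifier compiles for both host and device, which is the C++ analog of what the Fortran routines need:)

#include <AMReX_GpuQualifiers.H>
#include <AMReX_REAL.H>

// Hypothetical inlet-profile helper: the AMREX_GPU_HOST_DEVICE qualifier
// makes it callable from a GPU boundary-fill kernel as well as from host
// code in a non-GPU build.
AMREX_GPU_HOST_DEVICE
inline amrex::Real inflow_velocity (amrex::Real t)
{
    // Placeholder profile; a real case would evaluate the user-defined
    // inlet state here.
    return amrex::Real(10.0) + amrex::Real(0.1) * t;
}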

@asmunder (Contributor, Author) commented Aug 6, 2020

Right, you were saying earlier that it's possible to mix Fortran and C++ routines.
If I follow correctly, you are saying that in PeleC's SourceCpp/BCfill.cpp I would replace AMREX_GPU_DEVICE with AMREX_HOST_DEVICE, since the bcnormal routine will now be provided by a Fortran file. And then I would need to add the calls to impose_NSCBC in the correct places in SourceCpp/MOL.cpp and SourceCpp/Hydro.Cpp, also in some blocks marked with AMREX_HOST_DEVICE? Do I also need to copy the impose_NSCBC_3d.f90 etc. files over to my Exec folder or something?

What would be a suitable compiler? I'm a bit confused by the AMReX GPU page about whether GNU will work or not. And does this require GPUs on the machine, or will it fall back to CPU even though stuff is marked AMREX_GPU_DEVICE?
