Crash on free(): invalid pointer #146
Closed
asmunder opened this issue Mar 23, 2020 · 14 comments

@asmunder (Contributor)

One of my 3D PeleC runs crashed with a free(): invalid pointer after almost 4 hours on a few hundred cores. Below is the crash log that I was able to extract. Seems like this is happening in the cleanup stage after a Level 1 solve.

I don't know whether this is a useful bug report, or whether it can be reproduced. But let me know if you are interested in trying to chase this one down, and I can provide more details and the case files.

AMReX commit 93fb085d28349 (Nov 1 2019 - this is the "current submodule" for PeleC)
PeleC commit 1821d36 (Feb 13 2020)

[Level 1 step 8611] Advanced 20480 cells
[Level 1 step 8612] ADVANCE with dt = 1.861440021e-08
... Computing MOL source term at t^{n} 
... Computing MOL source term at t^{n+1} 
... Computing reactions for dt = 1.861440021e-08
[Level 1 step 8612] Advanced 20480 cells
*** glibc detected *** PeleC3d.gnu.MPI.ex: free(): invalid pointer: 0x0000000003955350 ***
======= Backtrace: =========
/lib64/libc.so.6[0x398fe75e5e]
/lib64/libc.so.6[0x398fe78cad]
PeleC3d.gnu.MPI.ex[0x501660]
PeleC3d.gnu.MPI.ex[0x5015ef]
PeleC3d.gnu.MPI.ex[0x4fc32e]
PeleC3d.gnu.MPI.ex[0x4fc41e]
PeleC3d.gnu.MPI.ex[0x54bbb6]
PeleC3d.gnu.MPI.ex[0x54bc3a]
PeleC3d.gnu.MPI.ex[0x5489d3]
PeleC3d.gnu.MPI.ex[0x4fd601]
PeleC3d.gnu.MPI.ex[0x5d9644]
PeleC3d.gnu.MPI.ex[0x7a1139]
PeleC3d.gnu.MPI.ex[0x5d6ce7]
PeleC3d.gnu.MPI.ex[0x5d7b48]
PeleC3d.gnu.MPI.ex[0x5cbcbf]
PeleC3d.gnu.MPI.ex[0x41600a]
/lib64/libc.so.6(__libc_start_main+0x100)[0x398fe1ed20]
PeleC3d.gnu.MPI.ex[0x41690d]


Backtrace.139:
(parsed with parse_bt.py)

0: amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) at /home/asmunde/codes/amrex/Src/Base/AMReX_BLBackTrace.cpp:167

1: amrex::BLBackTrace::handler(int) at /home/asmunde/codes/amrex/Src/Base/AMReX_BLBackTrace.cpp:71

8: std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1854

9: std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.H:225
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_pair.h:198
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/new_allocator.h:140
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/alloc_traits.h:487
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:650
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:658
 (inlined by) std::_Rb_tree<std::pair<amrex::IntVect, amrex::IntVect>, std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray>, std::_Select1st<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >, std::less<std::pair<amrex::IntVect, amrex::IntVect> >, std::allocator<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> > >::_M_erase(std::_Rb_tree_node<std::pair<std::pair<amrex::IntVect, amrex::IntVect> const, amrex::FabArrayBase::TileArray> >*) at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1858

10: amrex::FabArrayBase::flushTileArray(amrex::IntVect const&, bool) const at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/new_allocator.h:125
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/alloc_traits.h:462
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:592
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:659
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:2477
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_tree.h:1125
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_map.h:1032
 (inlined by) amrex::FabArrayBase::flushTileArray(amrex::IntVect const&, bool) const at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.cpp:1562

11: amrex::FabArrayBase::clearThisBD(bool) at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArrayBase.cpp:1614

12: amrex::FabArray<amrex::CutFab>::clear() at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArray.H:973

13: amrex::FabArray<amrex::CutFab>::~FabArray() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) amrex::FabArray<amrex::CutFab>::~FabArray() at /home/asmunde/codes/amrex/Src/Base/AMReX_FabArray.H:1129

14: amrex::EBDataCollection::~EBDataCollection() at /home/asmunde/codes/amrex/Src/EB/AMReX_EBDataCollection.cpp:72 (discriminator 1)

15: amrex::EBFArrayBoxFactory::~EBFArrayBoxFactory() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/atomicity.h:49
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/ext/atomicity.h:82
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:166
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:684
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr_base.h:1123
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/shared_ptr.h:93
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/EB/AMReX_EBFabFactory.H:27
 (inlined by) amrex::EBFArrayBoxFactory::~EBFArrayBoxFactory() at /home/asmunde/codes/amrex/Src/EB/AMReX_EBFabFactory.H:27

16: amrex::AmrLevel::~AmrLevel() at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:107
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:137
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_construct.h:206
 (inlined by) ?? at /share/apps/modulessoftware/gcc/gcc-7.3.0/include/c++/7.3.0/bits/stl_vector.h:434
 (inlined by) ?? at /home/asmunde/codes/amrex/Src/Base/AMReX_Vector.H:29
 (inlined by) amrex::AmrLevel::~AmrLevel() at /home/asmunde/codes/amrex/Src/Amr/AMReX_AmrLevel.cpp:533

17: PeleC::~PeleC() at /home/asmunde/codes/PeleC/Source/PeleC.cpp:599

18: amrex::Amr::regrid(int, double, bool) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2917

19: amrex::Amr::timeStep(int, double, int, int, double) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2030

20: amrex::Amr::coarseTimeStep(double) at /home/asmunde/codes/amrex/Src/Amr/AMReX_Amr.cpp:2439

21: main at /home/asmunde/codes/PeleC/Source/main.cpp:173

23: _start at ??:?


@jrood-nrel (Contributor)

It looks like this is happening in AmrLevel's destructor, which is AMReX code - specifically flushTileArray. I guess I'm not versed enough with the code to understand why regrid would call the destructors. Unfortunately I would have to do the classic redirection and finger-point to AMReX. Otherwise, is there anything else that could have caused your job to die, such as a time limit or a single rank dying? Can you restart and reproduce this at the same timestep?

@drummerdoc (Contributor)

Side comment... regrid calls the destructor because the box array associated with the AmrLevel is likely to have changed, which would make all the cached info associated with the box array invalid. On a regrid the old AmrLevel is destructed and a new one created; any caches associated with the AmrLevel need to be rebuilt.

@asmunder (Contributor, Author)

So I have been trying to dig a bit more here, and I can reproduce this behaviour, also on a different cluster and with different grid resolutions. When changing these it's not the same time step, and not always the same message - I have seen the variations free(): invalid pointer, corrupted size vs. prev_size, and double free or corruption (!prev). But it's always when trying to regrid.

All my runs without AMR are fine, but with AMR + EB + hydrogen combustion I've not been able to run my case successfully yet.

Just to rule out that I'm doing something stupid somewhere on my side, I'd like to run a Tutorial or other reference case with hydrogen combustion that has AMR and EB, to verify that this does not crash.

Is there such a case I could try?

@jrood-nrel (Contributor)

I am not aware of any cases besides the ones that exist in our Exec directory. Have you tried a newer version of AMReX? I've also been doing a lot of work in the cpp branch, which is replacing the Fortran code with C++ code. Most functionality is there - AMR and EB, for example. That is another avenue to try; in that case look in the ExecCpp directory for examples. That branch also has a very recent AMReX version in the submodule.

@whitmanscu

@asmunder, I got very similar errors trying to use propane/air chemistry with AMR and EBs last year. I saw the same variety of error messages, always when doing a regrid. I talked at some length with @nickwimer and @hsitaram (who was able to reproduce the error) but we never came to a conclusive solution, so I'm interested in any progress we can make here.

I did have some success running the same propane case shrunk down significantly, with correspondingly higher base resolution (~0.2mm/cell vs. ~2mm). This made me think that there may be an issue with base resolution or timestep size, although I couldn't afford to fully refine the base resolution on the full-size case to see if that fixed the issue. I did shrink the base resolution to ~0.5 mm/cell and still got the regrid crash.

That said, more recently I ran some H2/air bluff body cases successfully with AMR and EBs up until the flame tried to exit the domain, at which point I got a different crash: with NSCBCs it was due to the lack of species/reaction terms, and with other outflow conditions it was pressure issues - see #149. But at least I didn't get the regrid error! This was using the LiDryer mech with relatively high base resolution (~0.04mm/cell) and a very small physical domain, on the order of 1cm. What is your physical domain size and base resolution? Even if you see the error at multiple resolutions, could you maybe try a significantly smaller physical domain and resolution and see if you still get the error?

Some other notes on what I found did not error:

  • Any case without AMR.
  • Any case with reactions off. If I restart right before a crash with reactions off, even after having turned them on initially, I did not get a crash.
  • Any case with reactions turned on but no energy source to initiate reactions.
  • Very small domains/base resolutions in general; for reference the successful propane case had ~0.2mm base resolution and my H2 cases have ~0.04mm base resolution.

Notes on what did error:

  • Propane bluff body case with ~0.5mm base resolution or higher.
  • Same case with no bluff body, using an external source to add energy.
  • Same case with no bluff body and an external source, using SDC advance instead of MOL.

All of these cases ran fine without AMR.

Based on the above, if we are in fact dealing with the same issue, I don't think EBs are actually necessary to get this sort of error; it seems to depend on base resolution and/or timestep in conjunction with reactions, and for some reason it triggers specifically on AMR regridding. If I get time in the next few weeks I'll also try to revisit the issue with some of the more recent AMReX/cpp changes, as suggested, to see if they help at all.

@jrood-nrel (Contributor)

This makes me think there is a problem with the destructor during regrid interfacing with the EOS Fortran code, which in turn interfaces with the C code in the mechanisms. I'm hoping the C++ code, along with the updates that have occurred in PelePhysics, will have better luck with this issue, since everything is C++ in that case.

@asmunder (Contributor, Author)

Thanks both for the input.

@jrood-nrel I've noticed comments about this transition to C++, but is there a roadmap for it somewhere? Will all Fortran code be replaced, eventually? I have some custom code for the inlet (cf. issue #141) that is Fortran, in the files pmf_generic.f90, bc_fill_nd.F90, Prob_nd.F90. Do I need to migrate this code to C++?

Tangentially, if I want to run my existing code with a newer AMReX, which AMReX commit should I use?

@whitmanscu My current case has pretty fine resolution: a pseudo-2D expanding channel setup that is 2 cm x 8 cm, on an 8x1024x4096 base grid with resolution 0.02 mm/cell, plus three levels of AMR. I'm doing this because I'm targeting a hydrogen/air reheat flame at 15 bar pressure, so I expect the flame front to be very thin. The corresponding case at 1 bar is fully resolved on the same grid without using AMR, and is running fine.

@drummerdoc (Contributor)

@asmunder Eventually, all the AMReX-based codes will be migrated to use AMReX's kernel launching strategy, which has generally implied the use of C++ kernel functions to maximize the amount of inlining that the compilers can do. However, there is no formal restriction on the programming language used within the kernels. For PeleC, this migration has already happened in one of the branches, but it hasn't been pulled into development yet. The Pele codes are the last of the AMReX codes to make this transition (PeleLM is even behind PeleC though). Even if the BCs and ICs get pulled into C++, you can still call your Fortran helper functions within them if you want... or you can convert your code to C++ as well so that it inlines better for more efficient execution.
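(For readers following along: the launch pattern being referred to looks roughly like the sketch below. This is a made-up scale_field example, not PeleC code; the point is that the work is written as a C++ lambda that AMReX's ParallelFor launches on the device, or runs as an ordinary loop on CPU builds, which is what lets the compiler inline the kernel body.)

#include <AMReX.H>
#include <AMReX_MultiFab.H>
#include <AMReX_GpuQualifiers.H>

void scale_field (amrex::MultiFab& mf, amrex::Real factor)
{
    // Loop over the boxes owned by this rank; TilingIfNotGPU() enables
    // tiling on CPU builds and disables it on GPU builds.
    for (amrex::MFIter mfi(mf, amrex::TilingIfNotGPU()); mfi.isValid(); ++mfi)
    {
        const amrex::Box& bx = mfi.tilebox();
        auto const& a = mf.array(mfi);
        // The lambda is the "kernel": launched on the device for GPU builds,
        // executed as a plain triple loop otherwise.
        amrex::ParallelFor(bx, [=] AMREX_GPU_DEVICE (int i, int j, int k) noexcept
        {
            a(i,j,k) *= factor;
        });
    }
}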

As for the crash-on-free error, there is a known issue for main that is similar. I can't imagine how it would help, but it's a quick edit to try... if it doesn't help, remove it and keep looking. The trick is to add an extra scope around the code between Initialize and Finalize, so in main.cpp:
....
Initialize(...);
{   // <------------------- Added
    // Code here
}   // <------------------- Added
Finalize();
}

Somehow this guarantees that all objects get their destructors properly called before the program exits.
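Spelled out, the pattern is roughly the following (a minimal sketch; the real main.cpp has more setup, and the exact Initialize arguments are elided here):

#include <AMReX.H>

int main (int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {   // <---- extra scope: everything constructed inside is destroyed here,
        //       before Finalize() tears down AMReX internals
        // ... build the Amr driver and run the time steps ...
    }
    amrex::Finalize();
    return 0;
}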

@jrood-nrel (Contributor)

@asmunder The C++ code is now in the development branch. Hopefully it will help with this error. There should be plenty of example problems to illustrate how to move your problem setup to C++. There is also an up-to-date AMReX included as a submodule. We don't have a concrete roadmap, but most functionality is there; the most notable exceptions are NSCBC and non-ideal EOS.

@asmunder (Contributor, Author) commented Aug 4, 2020

@jrood-nrel I'm starting to look into this now. (In parallel I'm focusing on a non-EB case where the AMR works fine; there I'm working on other things such as sampling a plane, which I'll open another issue for.)

Unfortunately my existing case uses NSCBC; I guess I'll have to test whether I can get by without too many reflections if I just use a plain inlet/outlet.
Is there a plan for implementing the NSCBC stuff in the C++ code?

@drummerdoc (Contributor) commented Aug 4, 2020

Is there a plan for implementing NSCBC stuff in the C++ code?

Sorry to say that it's lower on our list at LBNL since we are currently focused on meeting our milestone to have all of PeleLM ported to GPU by the end of the fiscal year. PeleC, managed predominantly out of NREL, is focused on performance metrics for the same milestone. My guess is that early in the new FY (October, November) many of these usability aspects will be cleaned up. It may not actually be very hard at all to port the NSCBC stuff if you want to have a look yourself. Sorry that you are left hanging in this process.

@emotheau (Contributor) commented Aug 4, 2020

We can still call the Fortran routine to fill the ghost cells, right?

@drummerdoc (Contributor)

Yes, as long as the routines are marked as DEVICE or HOST_DEVICE, and as long as a suitable Fortran compiler is used.
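(As a rough C++ illustration of the same marking idea - a hypothetical helper, not PeleC's actual boundary routine - a function tagged with AMReX's AMREX_GPU_HOST_DEVICE qualifier compiles for both host and device, which is the C++ analog of what the Fortran routines need:)

#include <AMReX_GpuQualifiers.H>
#include <AMReX_REAL.H>

// Hypothetical inlet-profile helper: the AMREX_GPU_HOST_DEVICE qualifier
// makes it callable from a GPU boundary-fill kernel as well as from host
// code in a non-GPU build.
AMREX_GPU_HOST_DEVICE
inline amrex::Real inflow_velocity (amrex::Real t)
{
    // Placeholder profile; a real case would evaluate the user-defined
    // inlet state here.
    return amrex::Real(10.0) + amrex::Real(0.1) * t;
}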

@asmunder (Contributor, Author) commented Aug 6, 2020

Right, you were saying earlier that it's possible to mix Fortran and C++ routines.
If I follow correctly, you are saying that in PeleC's SourceCpp/BCfill.cpp I would replace AMREX_GPU_DEVICE with AMREX_HOST_DEVICE, since the bcnormal routine will now be provided by a Fortran file. And then I would need to add the calls to impose_NSCBC in the correct places in SourceCpp/MOL.cpp and SourceCpp/Hydro.Cpp, also in some blocks marked with AMREX_HOST_DEVICE? Do I also need to copy the impose_NSCBC_3d.f90 etc. files over to my Exec folder or something?

What would be a suitable compiler? I'm a bit confused by the AMReX GPU page about whether GNU will work or not. And does this require GPUs on the machine, or will it fall back to CPU even though stuff is marked AMREX_GPU_DEVICE?
