Crash on free(): invalid pointer #146
Comments
It looks like this is happening in AmrLevel's destructor, which is AMReX code. Specifically
Side comment... regarding calls the destructor because the box array associated with the AmrLevel is likely to have changed, which would invalidate all the cached info associated with the box array. On a regrid, the old AmrLevel is destructed and a new one created, so any caches associated with the AmrLevel need to be rebuilt.
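To illustrate the ownership pattern described above, here is a minimal sketch in plain C++. `GridLayout`, `Level`, and `regrid` are hypothetical stand-ins, not AMReX classes or functions; the only assumption is that a cache tied to a level's box layout is owned by that level, so destroying the level on regrid also destroys the stale cache and the new level rebuilds it on first use.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a box layout; not an AMReX type.
struct GridLayout {
    std::vector<int> box_sizes;  // number of cells in each box
};

// Hypothetical stand-in for a level that owns a layout-dependent cache.
class Level {
public:
    explicit Level(GridLayout layout) : layout_(std::move(layout)) {}

    // Lazily build per-box data; it is only valid for this level's layout.
    const std::vector<double>& cached_volumes() {
        if (!cache_) {
            cache_ = std::make_unique<std::vector<double>>();
            for (int n : layout_.box_sizes) {
                cache_->push_back(static_cast<double>(n));
            }
        }
        assert(cache_->size() == layout_.box_sizes.size());
        return *cache_;
    }

private:
    GridLayout layout_;
    std::unique_ptr<std::vector<double>> cache_;  // destroyed with the level
};

// Regrid: the old level (and its cache) is destroyed, a new one is created.
void regrid(std::unique_ptr<Level>& level, GridLayout new_layout) {
    level = std::make_unique<Level>(std::move(new_layout));
}
```

The design point is that nothing outside the level holds a pointer into the cache, so nothing can free or read it after the regrid. A cache held elsewhere would need explicit invalidation at this step.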
So I have been trying to dig a bit more here, and I can reproduce this behaviour, also on a different cluster, and also with different grid resolutions - when changing these, it's not the same time step, and not always the same message - I have seen the variations

All my runs without AMR are fine, but with AMR + EB + hydrogen combustion I've not been able to run my case successfully yet. Just to rule out that I'm doing something stupid somewhere on my side, I'd like to run a Tutorial or other reference case with hydrogen combustion that has AMR and EB, to verify that this does not crash. Is there such a case I could try?
I am not aware of any cases besides the ones that exist in our
@asmunder, I got very similar errors trying to use propane/air chemistry with AMR and EBs last year. I saw the same variety of error messages, always when doing a regrid. I talked at some length with @nickwimer and @hsitaram (who was able to reproduce the error) but we never came to a conclusive solution, so I'm interested in any progress we can make here.

I did have some success running the same propane case shrunk down significantly, with correspondingly higher base resolution (~0.2 mm/cell vs. ~2 mm). This made me think that there may be an issue with base resolution or timestep size, although I couldn't afford to fully shrink my base resolution on the full-size case to see if it fixed the issue. I did shrink the base resolution to ~0.5 mm/cell and still got the regrid crash.

That said, more recently I ran some H2/air bluff body cases successfully with AMR and EBs up until the flame tried to exit the domain, at which point I got a different crash using NSCBCs based on lack of species/reaction terms, or pressure issues with other outflow conditions - see #149. But at least I didn't get the regrid error! This was using the LiDryer mech with relatively high base resolution (~0.04 mm/cell) and a very small physical domain, on the order of 1 cm.

What is your physical domain size and base resolution? Even if you see the error at multiple resolutions, could you maybe try a significantly smaller physical domain and resolution and see if you still get the error?

Some other notes on what I found did not error:
Notes on what did error:
All of these cases ran fine without AMR. Based on the above, if we are in fact dealing with the same issue, I don't think EBs are actually necessary to get this sort of error; it seems to depend on base resolution and/or timestep in conjunction with reactions, and for some reason triggers specifically on AMR regridding. If I get time in the next few weeks I'll also try to revisit the issue with some of the more recent AMReX/cpp changes, as suggested, to see if they help at all.
This makes me think there is a problem with the destructor during regrid interfacing with the EOS Fortran code, which interfaces to the C code in the mechanisms. I'm hoping the C++ code, as well as updates that have occurred in PelePhysics, will have better luck with this issue, since it's all C++ in that case.
Thanks both for the input.

@jrood-nrel I've noticed comments about this transition to C++, but is there a roadmap for it somewhere? Will all Fortran code be replaced, eventually? I have some custom code for the inlet (cf. issue #141) that is Fortran, in the files

Tangentially, if I want to run my existing code with a newer AMReX, which AMReX commit should I use?

@whitmanscu My current case is pretty fine resolution, a pseudo-2D expanding channel setup that is 2 cm x 8 cm - 8x1024x4096 grid with resolution 0.02 mm/cell for the base grid, three levels of AMR. I'm doing this because I'm targeting a hydrogen/air reheat flame at 15 bar pressure, so expecting the flame front to be very thin. The corresponding case at 1 bar is fully resolved on the same grid without using AMR, and is running fine.
@asmunder Eventually, all the AMReX-based codes will be migrated to use AMReX's kernel launching strategy, which has generally implied the use of C++ kernel functions to maximize the amount of inlining that the compilers can do. However, there is no formal restriction on programming language used within the kernels. For PeleC, this migration has already happened in one of the branches, but it hasn't been pulled into

As for the crash on free error, there is a known issue for main that is similar. I can't imagine how it would help, but it's a quick edit to see... if not, remove it and keep looking. But the trick is to add an extra scoping of the code between

Somehow this guarantees that all objects get their destructors properly called before the program exits.
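For anyone unfamiliar with the scoping trick, here is a minimal sketch of the general C++ idea. It does not use AMReX: `initialize_runtime`, `finalize_runtime`, and `Solver` are hypothetical stand-ins for a framework's init/finalize pair and for an object whose destructor still needs the framework. The sketch only shows how an extra pair of braces changes the order of destruction relative to finalization.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for a framework's init/finalize pair (not AMReX API).
static std::vector<std::string> g_log;
static bool g_runtime_alive = false;

void initialize_runtime() {
    assert(!g_runtime_alive);  // guard against double initialization
    g_runtime_alive = true;
    g_log.push_back("initialize");
}

void finalize_runtime() {
    g_runtime_alive = false;
    g_log.push_back("finalize");
}

// Hypothetical object whose destructor needs the runtime,
// e.g. to hand memory back to a pool owned by the runtime.
struct Solver {
    ~Solver() {
        g_log.push_back(g_runtime_alive ? "destroy (runtime alive)"
                                        : "destroy (runtime gone)");
    }
};

// Without the extra scope: `solver` outlives finalize_runtime().
std::vector<std::string> run_unscoped() {
    g_log.clear();
    {
        initialize_runtime();
        Solver solver;
        finalize_runtime();
    }  // solver is destroyed here, after finalize
    return g_log;
}

// With the extra scope: `solver` is destroyed at the inner closing brace.
std::vector<std::string> run_scoped() {
    g_log.clear();
    initialize_runtime();
    {
        Solver solver;
    }  // destructor runs here, while the runtime is still alive
    finalize_runtime();
    return g_log;
}
```

In the unscoped variant the destructor runs only after finalization, which is the ordering that can lead to frees against already torn-down state. Whether that is the actual cause of the crash in this issue is not established.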
@asmunder The C++ code is now in the
@jrood-nrel I'm starting to look into this now. (In parallel I'm focusing on a non-EB case where the AMR works fine; there I'm working on other stuff such as sampling a plane, which I'll open another issue for.) Unfortunately my existing case uses NSCBC, so I guess I'll have to test if I can get by without too many reflections if I just use plain inlet/outlet.
Sorry to say that it's lower on our list at LBNL since we are currently focused on meeting our milestone to have all of PeleLM ported to GPU by the end of the fiscal year. PeleC, managed predominantly out of NREL, is focused on performance metrics for the same milestone. My guess is that early in the new FY (October, November) many of these usability aspects will be cleaned up. It may not actually be very hard at all to port the NSCBC stuff if you want to have a look yourself. Sorry that you are left hanging in this process.
We can still call the Fortran routine to fill the ghost cells, right?
Yes, as long as the routines are marked as DEVICE or HOST_DEVICE, and as long as a suitable Fortran compiler is used.
Right, you were saying earlier that it's possible to mix Fortran and C++ routines. What would be a suitable compiler? - I'm a bit confused from the AMReX GPU page about whether GNU will work or not. And does this require GPUs on the machine, or will it fall back to CPU even though stuff is marked AMREX_GPU_DEVICE?
One of my 3D PeleC runs crashed with a free(): invalid pointer after almost 4 hours on a few hundred cores. Below is the crash log that I was able to extract. Seems like this is happening in the cleanup stage after a Level 1 solve.
I don't know if this is a useful bug report, nor if it can be reproduced. But let me know if you are interested in trying to chase this one down, and I can provide more details and the case files.
AMReX commit 93fb085d28349 (Nov 1 2019 - this is the "current submodule" for PeleC)
PeleC commit 1821d36 (Feb 13 2020)