Possible missing MPI_Type_free in ESMCI_VMKernel? #209
Comments
Thanks for letting us know. This is deep in Gerhard's (@theurich) territory, so I'm going to assign it to him and hopefully he'll have a chance soon to take a look and make sure things are as they should be. What machine is this? I noticed the 12.3 gcc and wondered if this relates to Tom's issue (#397). Is he using a static ESMF? |
This is on Discover. We've observed the same with [email protected] |
@oehmke My guess is @tclune is not using static ESMF. ESMA-Baselibs currently builds ESMF as static and shared and from the experiments @climbfuji and myself have done with spack and other observations, it looks like And I now realize instead of rebuilding ESMF as static only, I could have just set |
Well, my current tests are not looking good for this being the issue. I mean, it's probably a memory leak (maybe?), but it'd be teeny. I've tried a few different ways of doing the As @atrayano said when I talked with him, since it's a double free it's more like MAPL or ESMF is freeing something twice. But, all the |
As a test, per a suggestion by @oehmke, I built ESMF with ESMF_PIO=OFF and ESMF_MOAB=OFF but no change. Dang. But, a thought occurred to me chatting with @atrayano. What if we build MAPL as static along with ESMF? Do that and one of my at-finalize double-free errors (rs_numtiles.x) goes away. So, I'm wondering if static ESMF means everything GEOS makes has to be static as well? |
That’s too bad. Sorry I haven’t done a lot with spack, but why does ESMF need to be built static only for this? (We should figure out this issue anyway, but I was just wondering why that’s a constraint.)
|
NOAA currently wants everything to be static. (Just guessing that this is the reason.)
They may be forced to accept DSO in the future though for multiple reasons. (Vendor/OS may force it and … MAPL3 is going to bake it in fairly deep.)
- Tom
|
I think the right way forward is to re-enable the shared esmf build. I just confirmed that if I do that (flip one character in our spack config file), geos builds and runs correctly. Then we give the UFS folks a heads up that with the next spack-stack release ESMF will be both shared and static, and that they have to fix their build system to correctly pick up the static version (or move away from static libraries - it's a thing of the past anyway). |
See ufs-community/ufs-weather-model#2094 for the heads-up to the UFS that future versions of spack-stack will have both shared and static esmf and mapl. See JCSDA/spack-stack#953 and JCSDA/spack#372 for the spack-stack and spack changes to support GEOS (and build esmf and mapl both shared and static). I agree nonetheless that this issue should be fixed between esmf and mapl so that one can combine shared and static libraries. |
If they will accept a shared version, then I agree, that’s what we should offer them for now. That should give us time to figure out this other problem, so we can offer a combined static version as well.
|
Hey Matt, Was this built with debug (-g)? I’m just wondering if we could coax out more info about where this is failing. Also, do you know if this was happening with a version before 8.6 (e.g. 8.5)? I’m trying to narrow down the possibilities. Thanks.
|
@oehmke Yup. Both GEOS and ESMF with debugging flags. And even that just gave the four usable lines of traceback. |
I think this goes all the way back to 8.3.0, maybe beta snapshot 09. Could also be earlier, but we didn't run the UFS with earlier versions of spack-stack, therefore can't tell. |
Fun stuff. Building ESMF in spack shared fails on macOS in the linker stage, see JCSDA/spack-stack#956 ... |
This looks like it may be an issue with a fix for tracing we put in for Darwin. Would you try setting ESMF_TRACE_LIB_BUILD=OFF when building ESMF and see if that fixes it? Thanks.
|
Thanks so much @oehmke, that worked! I'll submit a PR to spack with the change for macOS when building shared ESMF. Sorry for the late reply, all-day meeting today ... |
This is a big longshot in the dark. @climbfuji and I are trying to get GEOS to work with Spack, namely the JCSDA spack-stack. In the tests by @climbfuji with spack-stack, he kept getting crashes at the end of execution of GEOSgcm (and even smaller more boring programs, but ones that did link to MAPL and thus ESMF).
So I started with mothership spack, and my first test showed all was well. But he reminded me that spack-stack builds ESMF as static-only, no shared. So I built GEOS against a static-only ESMF and, yup, crashes on program exit. Turning on all the debugging flags in GEOS and MAPL didn't help too much but I did get out:
Now, not much traceback, but it does seem to point to MPI type-ish stuff? Maybe? Honestly, I'm reaching here.
So I grepped both ESMF and MAPL and there are many types around, but one thing I saw was that in ESMCI_VMKernel.C you have:
esmf/src/Infrastructure/VM/src/ESMCI_VMKernel.C
Lines 730 to 731 in 609c811
and I don't see a corresponding MPI_Type_free for customType.
Of course, ESMF is complex and this is also C++ code, which I am not very good at. It's possible the frees are done elsewhere? (aka Fun with OO programming!)
It's also possible this has absolutely nothing to do with the crash. I mean, I currently load 51 (!) modules when I run with spack so...that's a lot of things to look at. But the fact that just changing from shared to static ESMF causes a crash does point us toward ESMF...
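To make the create/free pairing I'm asking about concrete, here is a minimal C++ sketch of the general pattern. It is not ESMF code and says nothing about how ESMCI_VMKernel.C actually manages customType. Everything in the fake_mpi namespace is a hypothetical stand-in so the sketch compiles without an MPI installation; in real code those two functions would correspond to something like MPI_Type_contiguous plus MPI_Type_commit, and MPI_Type_free. DatatypeGuard and run_demo are likewise names made up for illustration.

```cpp
#include <cassert>
#include <utility>

// Hypothetical stand-ins for MPI so this compiles anywhere (assumption:
// real code would call MPI_Type_commit / MPI_Type_free instead).
namespace fake_mpi {
typedef int Datatype;                  // stand-in for MPI_Datatype
const Datatype DATATYPE_NULL = -1;     // stand-in for MPI_DATATYPE_NULL
static int live_types = 0;             // committed but not yet freed
static int free_calls = 0;             // how many frees actually ran

Datatype type_create_and_commit() {
  static Datatype next = 0;
  ++live_types;
  return next++;
}

void type_free(Datatype *t) {
  ++free_calls;
  --live_types;
  *t = DATATYPE_NULL;  // MPI_Type_free also resets the handle to the null handle
}
}  // namespace fake_mpi

// Owns one committed datatype and frees it exactly once.
class DatatypeGuard {
public:
  DatatypeGuard() : handle_(fake_mpi::type_create_and_commit()) {}
  ~DatatypeGuard() { release(); }

  // Copying would lead to two frees of one handle, so it is disabled.
  DatatypeGuard(const DatatypeGuard &) = delete;
  DatatypeGuard &operator=(const DatatypeGuard &) = delete;
  DatatypeGuard &operator=(DatatypeGuard &&) = delete;

  // Moving transfers ownership; the source is left holding the null handle.
  DatatypeGuard(DatatypeGuard &&other) noexcept : handle_(other.handle_) {
    other.handle_ = fake_mpi::DATATYPE_NULL;
  }

  fake_mpi::Datatype get() const { return handle_; }

private:
  void release() {
    if (handle_ != fake_mpi::DATATYPE_NULL) fake_mpi::type_free(&handle_);
  }
  fake_mpi::Datatype handle_;
};

// Creates two types, moves one of them, and lets everything go out of scope.
// Returns the number of types still live afterwards.
int run_demo() {
  {
    DatatypeGuard a;
    DatatypeGuard b(std::move(a));  // a no longer owns anything
    DatatypeGuard c;
  }
  return fake_mpi::live_types;
}
```

Calling run_demo() commits two types and frees each exactly once, so nothing leaks and nothing is freed twice. One caveat if this pattern were used with real MPI: the free has to run before MPI_Finalize, so a guard with static storage duration would need an explicit release before finalize rather than relying on its destructor at program exit.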