g-w jobs fail on Hercules with NetCDF: HDF error #2489

Closed
RussTreadon-NOAA opened this issue Apr 15, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@RussTreadon-NOAA
Contributor

What is wrong?

ufs_model.x aborts on Hercules in gdasfcst, gfsfcst, and enkfgdasfcst_mem* with

24:  file: module_write_netcdf.F90 line:          761 NetCDF: HDF error
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24
 0: PASS: fcstRUN phase 1, n_atmsteps =                3 time is         3.137640
srun: error: hercules-01-26: tasks 0-23,25-79: Exited with exit code 1
srun: Terminating StepId=1016751.0
 0: slurmstepd: error: *** STEP 1016751.0 ON hercules-01-26 CANCELLED AT 2024-04-15T11:17:22 ***
24: forrtl: error (78): process killed (SIGTERM)

getsigensmeanp_smooth.x aborts on Hercules in enkfgdasecen000 with

 0: dosmooth = F
 0:         -101 NetCDF: HDF error
 0: 99
srun: error: hercules-05-29: task 0: Exited with exit code 99
srun: Terminating StepId=1016780.0
 0: slurmstepd: error: *** STEP 1016780.0 ON hercules-05-29 CANCELLED AT 2024-04-15T11:20:21 ***

What should have happened?

ufs_model.x and getsigensmeanp_smooth.x should run to completion on Hercules.

What machines are impacted?

Hercules

Steps to reproduce

  1. clone and install DavidNew-NOAA:feature/jediinc2fv3
  2. set up C96C48_hybatmDA
  3. rocotoboot
  4. in the first half cycle, gdasfcst, enkfgdasfcst_mem001, and enkfgdasfcst_mem002 abort with NetCDF: HDF error (a command-level sketch follows below)
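
For reference, a rough command-level sketch of these steps; the fork URL, build/link scripts, and rocotoboot arguments are assumptions based on the usual global-workflow layout and will differ by branch and experiment directory:

# Sketch only: clone the branch, build, and boot the first half cycle.
git clone --recursive https://github.com/DavidNew-NOAA/global-workflow.git
cd global-workflow && git checkout feature/jediinc2fv3
cd sorc && ./build_all.sh && ./link_workflow.sh && cd ..
# Set up the C96C48_hybatmDA cycled experiment (EXPDIR and XML/DB names are
# assumptions), then boot the first half-cycle forecast:
rocotoboot -w $EXPDIR/gfs_cycled.xml -d $EXPDIR/gfs_cycled.db -c <first cycle> -t gdasfcst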

Additional information

Also set up C96C48_ufs_hybatmDA on Hercules. gdasfcst and enkfgdasfcst successfully ran to completion for the first half cycle. gdasfcst, gfsfcst, and enkfgdasecen000 failed on the first full cycle. No changes were made to the executables between the first half cycle and the first full cycle.

Do you have a proposed solution?

No response

@RussTreadon-NOAA RussTreadon-NOAA added bug Something isn't working triage Issues that are triage labels Apr 15, 2024
@BrianCurtis-NOAA
Contributor

@RussTreadon-NOAA I think we saw something similar in the UFSWM here: ufs-community/ufs-weather-model#2015, and the TL;DR was to add export I_MPI_EXTRA_FILESYSTEM=ON to the job card.

Can you give that a try and see if the issue persists?
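
A minimal sketch of what that could look like in a Slurm job card; the scheduler directives and task count are illustrative, and the srun line is modeled on the way ufs_model.x is launched by the workflow:

#!/bin/bash
#SBATCH --job-name=gdasfcst
#SBATCH --nodes=2
# ... remaining scheduler directives ...

# Workaround suggested in ufs-community/ufs-weather-model#2015: enable Intel
# MPI's extra (Lustre) filesystem support before launching the model.
export I_MPI_EXTRA_FILESYSTEM=ON

srun -l --export=ALL -n 80 ./ufs_model.x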

@WalterKolczynski-NOAA
Contributor

While this seems like a different issue, the workflow is not yet supported on Hercules due to a Lustre issue with ln on Rocky 9. We've had a ticket open for quite a while now.

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the triage Issues that are triage label Apr 15, 2024
@RussTreadon-NOAA
Contributor Author

@BrianCurtis-NOAA , thank you for sharing your insight.

env/HERCULES.env currently contains

export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

I replaced the first line above with

export I_MPI_EXTRA_FILESYSTEM=ON

The 20211220 18Z gdasfcst and enkfgdasfcst_mem002 still aborted with

24:  file: module_write_netcdf.F90 line:          761 NetCDF: HDF error
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24

In contrast, enkfgdasfcst_mem001 successfully ran to completion.

As a follow on test I removed

export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

while retaining

export I_MPI_EXTRA_FILESYSTEM=ON

A rerun of gdasfcst still aborted with the NetCDF: HDF error. However, this time enkfgdasfcst_mem002 successfully ran to completion.

The seemingly random nature of this behavior is disturbing.

@RussTreadon-NOAA
Contributor Author

Thank you @WalterKolczynski-NOAA for letting me know that g-w does not support Hercules. It is unfortunate that we cannot reliably run cycled global parallels on Hercules. Have we elevated the ln and NetCDF: HDF error issues to management?

@WalterKolczynski-NOAA
Contributor

The ln issue has been elevated. The HDF error is new to me.

@RussTreadon-NOAA
Contributor Author

OK, we can keep this issue open for awareness. Hercules is not, at present, a viable option for running global parallels.

@RussTreadon-NOAA
Contributor Author

Ran C96C48_ufs_hybatmDA (JEDI ATM) on Hercules. All jobs in the 20240223/18 half cycle and the 20240224/00 full cycle ran. The full cycle runs gdas, enkfgdas, and gfs. JEDI ATM currently runs with DOIAU=NO.

Ran C96C48_hybatmDA (GSI ATM) on Hercules. All jobs in the 20211220/18 half cycle ran. All gdas and enkfgdas jobs in the 20211221/00 and 06 full cycles ran. The 20211221/00 gfsfcst aborted upon ufs_model.x start. GSI ATM runs with DOIAU=YES.

+ exglobal_forecast.sh[152]: /bin/cp -p /work/noaa/da/rtreadon/git/global-workflow/test/exec/ufs_model.x /work/noaa/stmp/rtreadon/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100/fcst.300179/
+ exglobal_forecast.sh[153]: srun -l --export=ALL -n 80 /work/noaa/stmp/rtreadon/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100/fcst.300179/ufs_model.x
 0: MPI startup(): I_MPI_EXTRA_FILESYSTEM_LIST environment variable is not supported.
 0: MPI startup(): Similar variables:
 0:      I_MPI_EXTRA_FILESYSTEM
 0:      I_MPI_EXTRA_FILESYSTEM_FORCE
 0:      I_MPI_EXTRA_FILESYSTEM_NFS_DIRECT
 0: MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
 0:
 0:
 0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
 0:      PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
 0:      STARTING DATE-TIME  MAY 02,2024  15:31:11.701  123  THU   2460433
 0:
 0:
 0: MPI Library = Intel(R) MPI Library 2021.9 for Linux* OS
 0:
 0: MPI Version = 3.1
 0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
 5: Abort(1) on node 5 (rank 5 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 5
 8: Abort(1) on node 8 (rank 8 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 8

Given this, I rewound and rebooted the 20240224/00 gfsfcst from the JEDI ATM run. The reboot failed just like the GSI ATM gfsfcst.

The gfsfcst failure above isn't the NetCDF: HDF error reported in this issue; it is the same failure reported in issue #2551.
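
As an aside, the MPI startup message above points to the impi_info utility for checking supported variables. A quick way to see which filesystem-related I_MPI variables this Intel MPI 2021.9 install actually recognizes (the module name/version and output format are assumptions for Hercules):

# List the filesystem-related I_MPI variables this Intel MPI build recognizes.
module load intel-oneapi-mpi/2021.9.0   # module name/version is an assumption
impi_info | grep -i FILESYSTEM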

@aerorahul
Contributor

@RussTreadon-NOAA Is this still an issue? Forecast tests on Hercules indicate that the model is running cleanly.

@RussTreadon-NOAA
Contributor Author

We may close this issue. It seems the failures were related to not removing the run directory before re-running tests.
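
For the record, a sketch of the cleanup that avoids re-using a stale run directory; the RUNDIRS path mirrors the logs above, and the Rocoto workflow/database file names are assumptions:

# Remove the stale forecast run directory before re-running the cycle
# (path pattern taken from the logs above; adjust user/pslot as needed).
rm -rf /work/noaa/stmp/$USER/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100
# Rewind and reboot the task with Rocoto (XML/DB names are assumptions):
rocotorewind -w gfs_cycled.xml -d gfs_cycled.db -c 202112210000 -t gfsfcst
rocotoboot   -w gfs_cycled.xml -d gfs_cycled.db -c 202112210000 -t gfsfcst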
