g-w jobs fail on Hercules with NetCDF: HDF error #2489

Closed
RussTreadon-NOAA opened this issue Apr 15, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@RussTreadon-NOAA
Contributor

What is wrong?

ufs_model.x aborts on Hercules in gdasfcst, gfsfcst, and enkfgdasfcst_mem* with

24:  file: module_write_netcdf.F90 line:          761 NetCDF: HDF error
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24
 0: PASS: fcstRUN phase 1, n_atmsteps =                3 time is         3.137640
srun: error: hercules-01-26: tasks 0-23,25-79: Exited with exit code 1
srun: Terminating StepId=1016751.0
 0: slurmstepd: error: *** STEP 1016751.0 ON hercules-01-26 CANCELLED AT 2024-04-15T11:17:22 ***
24: forrtl: error (78): process killed (SIGTERM)

getsigensmeanp_smooth.x aborts on Hercules in enkfgdasecen000 with

 0: dosmooth = F
 0:         -101 NetCDF: HDF error
 0: 99
srun: error: hercules-05-29: task 0: Exited with exit code 99
srun: Terminating StepId=1016780.0
 0: slurmstepd: error: *** STEP 1016780.0 ON hercules-05-29 CANCELLED AT 2024-04-15T11:20:21 ***

What should have happened?

ufs_model.x and getsigensmeanp_smooth.x should run to completion on Hercules.

What machines are impacted?

Hercules

Steps to reproduce

  1. clone and install DavidNew-NOAA:feature/jediinc2fv3
  2. set up C96C48_hybatmDA
  3. rocotoboot
  4. in the first half cycle, gdasfcst, enkfgdasfcst_mem001, and enkfgdasfcst_mem002 abort with NetCDF: HDF error (a command-level sketch follows below)
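
For reference, a rough command-level sketch of these steps; the fork URL, build/link scripts, and rocotoboot arguments are assumptions based on the usual global-workflow layout and will differ by branch and experiment directory:

# Sketch only: clone the branch, build, and boot the first half cycle.
git clone --recursive https://github.com/DavidNew-NOAA/global-workflow.git
cd global-workflow && git checkout feature/jediinc2fv3
cd sorc && ./build_all.sh && ./link_workflow.sh && cd ..
# Set up the C96C48_hybatmDA cycled experiment (EXPDIR and XML/DB names are
# assumptions), then boot the first half-cycle forecast:
rocotoboot -w $EXPDIR/gfs_cycled.xml -d $EXPDIR/gfs_cycled.db -c <first cycle> -t gdasfcst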

Additional information

Also set up C96C48_ufs_hybatmDA on Hercules. gdasfcst and enkfgdasfcst successfully ran to completion for the first half cycle. gdasfcst, gfsfcst, and enkfgdasecen000 failed on the first full cycle. No changes were made to the executables between the first half cycle and the first full cycle.

Do you have a proposed solution?

No response

@RussTreadon-NOAA RussTreadon-NOAA added bug Something isn't working triage Issues that are triage labels Apr 15, 2024
@BrianCurtis-NOAA
Contributor

@RussTreadon-NOAA I think we saw something similar in the UFSWM here: ufs-community/ufs-weather-model#2015, and the TL;DR was to add export I_MPI_EXTRA_FILESYSTEM=ON to the job card.

Can you give that a try and see if the issue persists?
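
A minimal sketch of what that could look like in a Slurm job card; the scheduler directives and task count are illustrative, and the srun line is modeled on the way ufs_model.x is launched by the workflow:

#!/bin/bash
#SBATCH --job-name=gdasfcst
#SBATCH --nodes=2
# ... remaining scheduler directives ...

# Workaround suggested in ufs-community/ufs-weather-model#2015: enable Intel
# MPI's extra (Lustre) filesystem support before launching the model.
export I_MPI_EXTRA_FILESYSTEM=ON

srun -l --export=ALL -n 80 ./ufs_model.x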

@WalterKolczynski-NOAA
Contributor

While this seems like a different issue, the workflow is not yet supported on Hercules due to a Lustre issue with ln on Rocky 9. We've had a ticket open for quite a while now.

@WalterKolczynski-NOAA WalterKolczynski-NOAA removed the triage Issues that are triage label Apr 15, 2024
@RussTreadon-NOAA
Contributor Author

@BrianCurtis-NOAA , thank you for sharing your insight.

env/HERCULES.env currently contains

export I_MPI_EXTRA_FILESYSTEM=1
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

I replaced the first line above with

export I_MPI_EXTRA_FILESYSTEM=ON

The 20211220 18Z gdasfcst and enkfgdasfcst_mem002 still aborted with

24:  file: module_write_netcdf.F90 line:          761 NetCDF: HDF error
24: Abort(1) on node 24 (rank 24 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 24

In contrast, enkfgdasfcst_mem001 successfully ran to completion.

As a follow on test I removed

export I_MPI_EXTRA_FILESYSTEM_LIST=lustre

while retaining

export I_MPI_EXTRA_FILESYSTEM=ON

A rerun of gdasfcst still aborted with the NetCDF: HDF error. However, this time enkfgdasfcst_mem002 successfully ran to completion.

The seemingly random nature of this behavior is disturbing.

@RussTreadon-NOAA
Contributor Author

Thank you @WalterKolczynski-NOAA for letting me know that g-w does not support Hercules. It is unfortunate that we cannot reliably run cycled global parallels on Hercules. Have we elevated the ln and NetCDF: HDF error issues to management?

@WalterKolczynski-NOAA
Contributor

The ln issue has been elevated. The HDF error is new to me.

@RussTreadon-NOAA
Contributor Author

OK, we can keep this issue open for awareness. Hercules is not, at present, a viable option for running global parallels.

@RussTreadon-NOAA
Contributor Author

Ran C96C48_ufs_hybatmDA (JEDI ATM) on Hercules. All jobs in the 20240223/18 half cycle and the 20240224/00 full cycle ran. The full cycle runs gdas, enkfgdas, and gfs. JEDI ATM currently runs with DOIAU=NO.

Ran C96C48_hybatmDA (GSI ATM) on Hercules. All jobs in the 20211220/18 half cycle ran. All gdas and enkfgdas jobs in the 20211221/00 and 06 full cycles ran. The 20211221/00 gfsfcst aborted upon ufs_model.x start. GSI ATM runs with DOIAU=YES.

+ exglobal_forecast.sh[152]: /bin/cp -p /work/noaa/da/rtreadon/git/global-workflow/test/exec/ufs_model.x /work/noaa/stmp/rtreadon/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100/fcst.300179/
+ exglobal_forecast.sh[153]: srun -l --export=ALL -n 80 /work/noaa/stmp/rtreadon/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100/fcst.300179/ufs_model.x
 0: MPI startup(): I_MPI_EXTRA_FILESYSTEM_LIST environment variable is not supported.
 0: MPI startup(): Similar variables:
 0:      I_MPI_EXTRA_FILESYSTEM
 0:      I_MPI_EXTRA_FILESYSTEM_FORCE
 0:      I_MPI_EXTRA_FILESYSTEM_NFS_DIRECT
 0: MPI startup(): To check the list of supported variables, use the impi_info utility or refer to https://software.intel.com/en-us/mpi-library/documentation/get-started.
 0:
 0:
 0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * .
 0:      PROGRAM ufs-weather-model HAS BEGUN. COMPILED       0.00     ORG: np23
 0:      STARTING DATE-TIME  MAY 02,2024  15:31:11.701  123  THU   2460433
 0:
 0:
 0: MPI Library = Intel(R) MPI Library 2021.9 for Linux* OS
 0:
 0: MPI Version = 3.1
 0: Abort(1) on node 0 (rank 0 in comm 496): application called MPI_Abort(comm=0x84000003, 1) - process 0
 5: Abort(1) on node 5 (rank 5 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 5
 8: Abort(1) on node 8 (rank 8 in comm 496): application called MPI_Abort(comm=0x84000002, 1) - process 8

Given this, I rewound and rebooted the 20240224/00 gfsfcst from the JEDI ATM run. The reboot failed just like the GSI ATM gfsfcst.

The gfsfcst failure above isn't the NetCDF: HDF error reported in this issue; it is the same failure reported in issue #2551.
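
As an aside, the MPI startup message above points to the impi_info utility for checking supported variables. A quick way to see which filesystem-related I_MPI variables this Intel MPI 2021.9 install actually recognizes (the module name/version and output format are assumptions for Hercules):

# List the filesystem-related I_MPI variables this Intel MPI build recognizes.
module load intel-oneapi-mpi/2021.9.0   # module name/version is an assumption
impi_info | grep -i FILESYSTEM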

@aerorahul
Contributor

@RussTreadon-NOAA Is this still an issue? Forecast tests on Hercules indicate that the model is running cleanly.

@RussTreadon-NOAA
Contributor Author

We may close this issue. It seems the failures were related to not removing the run directory before re-running tests.
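
For the record, a sketch of the cleanup that avoids re-using a stale run directory; the RUNDIRS path mirrors the logs above, and the Rocoto workflow/database file names are assumptions:

# Remove the stale forecast run directory before re-running the cycle
# (path pattern taken from the logs above; adjust user/pslot as needed).
rm -rf /work/noaa/stmp/$USER/RUNDIRS/prtest_gsi_hercules/gfsfcst.2021122100
# Rewind and reboot the task with Rocoto (XML/DB names are assumptions):
rocotorewind -w gfs_cycled.xml -d gfs_cycled.db -c 202112210000 -t gfsfcst
rocotoboot   -w gfs_cycled.xml -d gfs_cycled.db -c 202112210000 -t gfsfcst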
