hafs_3denvar_glbens ctest fails on Hercules #697

Closed
DavidHuber-NOAA opened this issue Feb 13, 2024 · 42 comments

@DavidHuber-NOAA
Collaborator

The hafs_3denvar_glbens ctest is failing to produce identical fv3_dynvars files between the hafs_3denvar_glbens_loproc_updat and hafs_3denvar_glbens_hiproc_updat test cases. This was discovered during testing of #684 and #695.

@TingLei-NOAA
Contributor

@DavidHuber-NOAA I tried looking into this, starting with GSI compiled in debug mode. However, it seems GSI aborts in the ctests in debug mode because of issues that #679 is meant to fix. I am not sure when that PR will be merged into EMC GSI, so would it be possible for you to merge that branch directly into your GSI branch and point me to it?

@DavidHuber-NOAA
Collaborator Author

@TingLei-NOAA I would suggest doing that on a local clone since the PR may still change:

cd <gsi>
git remote add hafs https://github.com/hafs-community/gsi
git fetch hafs
git merge hafs/feature/toff_fix

@TingLei-NOAA
Contributor

@DavidHuber-NOAA But which branch are you using for your current spack setup?

@DavidHuber-NOAA
Collaborator Author

Ah, right:

git clone https://github.com/DavidHuber-NOAA/GSI --recursive -b feature/ss_160 gsi_ss_160
cd gsi_ss_160
git remote add hafs https://github.com/hafs-community/gsi
git fetch hafs
git merge hafs/feature/toff_fix

@RussTreadon-NOAA
Contributor

While running ctests on Hercules for PR #684, failures occurred in the rrfs_3denvar_glbens and hafs_3denvar_hybens tests.

The rrfs_3denvar_glbens failure was due to the loproc and hiproc updat runs (i.e., the spack-stack/1.6.0 build) not producing identical fv3_dynvars. A comparison of fv3_dynvars records between the loproc and hiproc runs shows differences in the u and v wind components:

hercules-login-3:/work2/noaa/stmp/rtreadon/pr684/tmpreg_rrfs_3denvar_glbens$ /work/noaa/da/rtreadon/bin/compare_ncfile.py rrfs_3denvar_glbens_loproc_updat/fv3_dynvars rrfs_3denvar_glbens_hiproc_updat/fv3_dynvars
/work/noaa/da/rtreadon/bin/compare_ncfile.py:12: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  data1 = nc1[varname][:]
/work/noaa/da/rtreadon/bin/compare_ncfile.py:13: DeprecationWarning: `np.bool` is a deprecated alias for the builtin `bool`. To silence this warning, use `bool` by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use `np.bool_` here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  data2 = nc2[varname][:]
xaxis_1 min/max 1=1.0,396.0 min/max 2=1.0,396.0 max abs diff=0.0000000000
xaxis_2 min/max 1=1.0,397.0 min/max 2=1.0,397.0 max abs diff=0.0000000000
yaxis_1 min/max 1=1.0,233.0 min/max 2=1.0,233.0 max abs diff=0.0000000000
yaxis_2 min/max 1=1.0,232.0 min/max 2=1.0,232.0 max abs diff=0.0000000000
zaxis_1 min/max 1=1.0,65.0 min/max 2=1.0,65.0 max abs diff=0.0000000000
Time min/max 1=1.0,1.0 min/max 2=1.0,1.0 max abs diff=0.0000000000
u min/max 1=-38.006363,59.550613 min/max 2=-38.006363,59.550613 max abs diff=0.0057044029
v min/max 1=-26.81582,31.66718 min/max 2=-26.81582,31.66718 max abs diff=10.2818336487
W min/max 1=-2.563452,6.3780456 min/max 2=-2.563452,6.3780456 max abs diff=0.0000000000
DZ min/max 1=-5746.5317,-17.513391 min/max 2=-5746.5317,-17.513391 max abs diff=0.0000000000
T min/max 1=194.30504,313.18134 min/max 2=194.30504,313.18134 max abs diff=0.0000000000
delp min/max 1=140.36311,3325.6877 min/max 2=140.36311,3325.6877 max abs diff=0.0000000000
phis min/max 1=-676.6589,36239.055 min/max 2=-676.6589,36239.055 max abs diff=0.0000000000
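
For long reports like the one above, the fields that differ can be picked out automatically. The following is only a sketch, not part of compare_ncfile.py: the function name `fields_that_differ` and the `tol` parameter are made up for illustration, and the regular expression assumes the exact line layout printed above.

```python
import re

# Matches one summary line of the comparison output, e.g.
# "u min/max 1=-38.0,59.5 min/max 2=-38.0,59.5 max abs diff=0.0057044029"
LINE_RE = re.compile(
    r"^(?P<name>\S+) min/max 1=\S+ min/max 2=\S+ max abs diff=(?P<diff>\S+)$"
)


def fields_that_differ(report, tol=0.0):
    """Return {field: max_abs_diff} for every field whose difference exceeds tol."""
    differing = {}
    for line in report.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip warnings and any other non-summary lines
        diff = float(match.group("diff"))
        if diff > tol:
            differing[match.group("name")] = diff
    return differing


# A few lines copied from the rrfs_3denvar_glbens comparison above.
SAMPLE = """\
u min/max 1=-38.006363,59.550613 min/max 2=-38.006363,59.550613 max abs diff=0.0057044029
v min/max 1=-26.81582,31.66718 min/max 2=-26.81582,31.66718 max abs diff=10.2818336487
W min/max 1=-2.563452,6.3780456 min/max 2=-2.563452,6.3780456 max abs diff=0.0000000000
T min/max 1=194.30504,313.18134 min/max 2=194.30504,313.18134 max abs diff=0.0000000000
"""

print(fields_that_differ(SAMPLE))
```

With the sample lines, only u and v are reported, which matches the reading of the full output.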

The hafs_3denvar_hybens failure is also due to differences between the loproc and hiproc updat runs. However, for this case the differences occur in the T (temperature) record:

xaxis_1 min/max 1=1.0,720.0 min/max 2=1.0,720.0 max abs diff=0.0000000000
xaxis_2 min/max 1=1.0,721.0 min/max 2=1.0,721.0 max abs diff=0.0000000000
yaxis_1 min/max 1=1.0,541.0 min/max 2=1.0,541.0 max abs diff=0.0000000000
yaxis_2 min/max 1=1.0,540.0 min/max 2=1.0,540.0 max abs diff=0.0000000000
zaxis_1 min/max 1=1.0,65.0 min/max 2=1.0,65.0 max abs diff=0.0000000000
Time min/max 1=1.0,1.0 min/max 2=1.0,1.0 max abs diff=0.0000000000
u min/max 1=-49.416553,49.226555 min/max 2=-49.416553,49.226555 max abs diff=0.0000000000
v min/max 1=-49.340206,39.08488 min/max 2=-49.340206,39.08488 max abs diff=0.0000000000
W min/max 1=-2.842067,9.540336 min/max 2=-2.842067,9.540336 max abs diff=0.0000000000
DZ min/max 1=-5518.389,-17.457632 min/max 2=-5518.389,-17.457632 max abs diff=0.0000000000
T min/max 1=181.89748,346.99887 min/max 2=91.40897,346.99887 max abs diff=162.1640167236
delp min/max 1=140.48375,3329.2869 min/max 2=140.48375,3329.2869 max abs diff=0.0000000000
phis min/max 1=-6.5023405e-06,36102.867 min/max 2=-6.5023405e-06,36102.867 max abs diff=0.0000000000
ua min/max 1=-49.661697,46.939003 min/max 2=-49.661697,46.939003 max abs diff=0.0000000000
va min/max 1=-37.868477,49.013893 min/max 2=-37.868477,49.013893 max abs diff=0.0000000000

As a test, I reran rrfs_3denvar_glbens on Hercules. This time the test passed. The fv3_dynvars files are identical between updat and contrl and between loproc and hiproc.

Test project /work2/noaa/da/rtreadon/git/gsi/pr684/build
    Start 3: rrfs_3denvar_glbens
1/1 Test #3: rrfs_3denvar_glbens ..............   Passed  968.61 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 968.61 sec

@TingLei-NOAA
Contributor

An update: the branch incorporating the changes in #679 still shows the issue. As @RussTreadon-NOAA found, the differences are in u and v, as shown below (for v, the differences are on level 23).
[Figure: differences in v at level 23]
Hence, based on the current findings (thanks to @RussTreadon-NOAA's investigation), I agree that this issue is likely caused by the same problem as #694 and #674, in which it was found that, depending on the I_MPI_EXTRA_FILESYSTEM setting, an "hdf error" can occur in the I/O for u and v when writing the analysis. Together with the findings reported here, it is clear that, regardless of how I_MPI_EXTRA_FILESYSTEM is set, there are problems writing out u and v in the fv3reg GSI. Work on this is ongoing and will be reported here.

@ShunLiu-NOAA
Contributor

@TingLei-NOAA Your findings are similar to the streak feature seen in RRFS forecasts. Let's have a short meeting to bring all the information together.

@RussTreadon-NOAA
Contributor

Other instances of hafs_3denvar_glbens failures have found differences between loproc and updat in the T (see above) and delp fields (see here).

@TingLei-NOAA
Contributor

An update: in the update branch I changed, for example,
call check( nf90_open(filenamein,nf90_write,gfile_loc,comm=mpi_comm_world,info=MPI_INFO_NULL) )
to
call check( nf90_open(filenamein,ior(nf90_write,nf90_mpiio),gfile_loc,comm=mpi_comm_world,info=MPI_INFO_NULL) )
In the regression test, the control runs still show differences between loproc and hiproc in fv3_dynvars, while the update runs produce identical files. However, given the sporadic occurrences of this issue that @RussTreadon-NOAA found, I can't say this is the solution. I will prepare a clean branch with all recent changes (including those in #694) for more systematic tests/verifications.

@TingLei-NOAA
Contributor

An update: with PR #698, the hafs regression tests on Hercules still fail because of this issue.

@TingLei-NOAA
Contributor

With the recently updated PR #698 (which adds nf90_collective mode for U and V in their write subroutines, following a suggestion from P. Johnson via the Hercules helpdesk), this problem (differences between loproc and hiproc in fv3_dynvars on Hercules) disappeared in my two runs of the hafs regression tests. @BinLiu-NOAA and @yonghuiweng, could you test this PR and see whether it works stably?

@BinLiu-NOAA
Contributor

BinLiu-NOAA commented Feb 16, 2024

@TingLei-NOAA, Thanks for sharing the good news!

We have experienced many Hercules system-side issues while running the HAFS application on it over the past 1-2 weeks or so. I am not sure whether a Hercules system-side update or fix might have resolved the issue.

@BinLiu-NOAA
Contributor

> With the recently updated PR #698 (which adds nf90_collective mode for U and V in their write subroutines, following a suggestion from P. Johnson via the Hercules helpdesk), this problem (differences between loproc and hiproc in fv3_dynvars on Hercules) disappeared in my two runs of the hafs regression tests. @BinLiu-NOAA and @yonghuiweng, could you test this PR and see whether it works stably?

@yonghuiweng, Could you please help to conduct some tests from your end on Hercules in order to confirm whether or not it is working properly on Hercules? Thanks!

@TingLei-NOAA
Contributor

@yonghuiweng Thanks. However, to test my branch (fv3reg_parallel_io_upgrade), we don't need to merge with toff_fix. Also, would you please confirm whether this issue (differences between loproc and hiproc in fv3_dynvars) was the reason for all those test failures?

@yonghuiweng

@TingLei-NOAA, I couldn't build PR #698 on Hercules because nf90_netcdf4 has no declared type. The error is:
[ 44%] Building Fortran object src/gsi/CMakeFiles/gsi_fortran_obj.dir/cplr_read_wrf_nmm_files.f90.o
/work2/noaa/hwrf/noscrub/yweng/regression/GSI/src/gsi/gsi_rfv3io_mod.f90(4055): error #6404: This name does not have a type, and must have an explicit type. [NF90_NETCDF4]
call check( nf90_open(filename_layout,ior(nf90_netcdf4,ior(nf90_write, nf90_mpiio)),gfile_loc_layout(nio),comm=mpi_comm_read,info=MPI_INFO_NULL) )
-------------------------------------------------------^
/work2/noaa/hwrf/noscrub/yweng/regression/GSI/src/gsi/gsi_rfv3io_mod.f90(4055): error #6362: The data types of the argument(s) are invalid. [IOR]
call check( nf90_open(filename_layout,ior(nf90_netcdf4,ior(nf90_write, nf90_mpiio)),gfile_loc_layout(nio),comm=mpi_comm_read,info=MPI_INFO_NULL) )
-------------------------------------------------------^
/work2/noaa/hwrf/noscrub/yweng/regression/GSI/src/gsi/gsi_rfv3io_mod.f90(4059): error #6362: The data types of the argument(s) are invalid. [IOR]
call check( nf90_open(filenamein,ior(nf90_netcdf4,ior(nf90_write, nf90_mpiio)),gfile_loc,comm=mpi_comm_read,info=MPI_INFO_NULL) )
-----------------------------------------------^
[ 44%] Building Fortran object src/gsi/CMakeFiles/gsi_fortran_obj.dir/lag_fields.f90.o
compilation aborted for /work2/noaa/hwrf/noscrub/yweng/regression/GSI/src/gsi/gsi_rfv3io_mod.f90 (code 1)

My previous test combining #698 and toff_fix may have had problems; I tried several times and may have mixed up the branches.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Feb 18, 2024

@yonghuiweng My apologies. I introduced that bug when I pushed changes from Hera. It should work now.

@yonghuiweng

@TingLei-NOAA, I added nf90_netcdf4 at line 3967 of src/gsi/gsi_rfv3io_mod.f90:
use netcdf, only: nf90_netcdf4, nf90_write, nf90_mpiio, nf90_inq_varid, nf90_var_par_access, nf90_collective
I will upload the test result here in a few minutes.

@yonghuiweng

@TingLei-NOAA , the test shows rrfs_3denvar_glbens passed, but hafs_3denvar_hybens failed. Here is the result,
Test project /work2/noaa/hwrf/noscrub/yweng/regression/GSI/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  305.73 sec
2/7 Test #7: global_enkf ......................***Failed  391.21 sec
3/7 Test #3: rrfs_3denvar_glbens ..............   Passed  669.46 sec
4/7 Test #2: rtma .............................   Passed  1207.54 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed  1213.82 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed  1396.59 sec
7/7 Test #1: global_4denvar ...................   Passed  1445.46 sec

57% tests passed, 3 tests failed out of 7
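
When several ctest logs have to be compared, the failed tests can be pulled out of a summary like the one above with a small parser. This is a sketch under assumptions: the helper name `failed_tests` is hypothetical, and the pattern only recognizes the "Passed" and "***Failed" statuses that appear in this thread, not every status ctest can print.

```python
import re

# Matches a ctest result line such as
# "2/7 Test #7: global_enkf ......................***Failed 391.21 sec"
RESULT_RE = re.compile(
    r"^\s*\d+/\d+ Test\s+#\d+: (?P<name>\S+) \.+\s*"
    r"(?P<status>\*\*\*Failed|Passed)\s+(?P<secs>[\d.]+) sec"
)


def failed_tests(log):
    """Return the names of the tests marked ***Failed, in log order."""
    failed = []
    for line in log.splitlines():
        match = RESULT_RE.match(line)
        if match and match.group("status") != "Passed":
            failed.append(match.group("name"))
    return failed


# Result lines copied from the log above.
LOG = """\
1/7 Test #4: netcdf_fv3_regional .............. Passed 305.73 sec
2/7 Test #7: global_enkf ......................***Failed 391.21 sec
3/7 Test #3: rrfs_3denvar_glbens .............. Passed 669.46 sec
4/7 Test #2: rtma ............................. Passed 1207.54 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1213.82 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed 1396.59 sec
7/7 Test #1: global_4denvar ................... Passed 1445.46 sec
"""

print(failed_tests(LOG))
```

A failed ctest alone does not say whether the loproc/hiproc comparison was the cause; that still has to be read from the per-test regression results file.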

@TingLei-NOAA
Contributor

@yonghuiweng What caused the hafs regression tests to fail? The control comparison is expected to fail, since we know that differences between loproc_cntrl and hiproc_cntrl in fv3_dynvars are the issue, but we expect PR #698 to resolve it, namely by producing identical fv3_dynvars from loproc_updt and hiproc_updt. We can discuss more offline later.

@yonghuiweng

It seems the error is still there, and it shows in hafs_3denvar_hybens_regression_results.txt,
The results between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_hiproc_updat) are not reproducible
Thus, the case has Failed siganl of the regression tests.

@TingLei-NOAA
Contributor

@yonghuiweng Thanks. So it seems this issue is not exactly the same as issue #694, at least. More investigation is needed.

@yonghuiweng

@TingLei-NOAA I tested your branch feature/fv3reg_pio_upgrade_toff_fix on the /work disk of Hercules twice. Both tests show that hybens_loproc_updat and hybens_hiproc_updat are reproducible for both hafs_3d and hafs_4d (the tests still failed, but loproc and hiproc were reproducible).
Before the tests on /work, I ran the test 9 times on /work2, and only a few of the hafs_3d tests gave reproducible loproc and hiproc results. The reason for the failures on /work2 is not clear. Before 5/20/2023, we had a disk issue on /work2 of Orion; the following figure shows the difference between 2 GSI runs:
[Figure: difference between 2 GSI runs on Orion /work2]
Orion then disabled "Hot Pools", which fixed the issue.

@TingLei-NOAA
Contributor

@yonghuiweng Thanks for the investigation, findings, and information. The pattern of differences found in the previous /work2 disk issue is very similar to the pattern found in this issue. Does that mean the culprit here is also the system setup (which is somehow better on /work than on /work2), even though our code changes could help reduce the chance of these failures occurring? Let us have some more offline discussion.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Feb 23, 2024

An update on the "same" issue with rrfs_3denvar_glbens on Hera with the branch feature/fv3reg_parallel_io_upgrade.

1) With "SBATCH --nodes=1 --ntasks-per-node=20", rrfs_3denvar_glbens_loproc_updat aborts with the error message "Caught signal 11 (Segmentation fault: address not mapped to object at address (nil))", which points to the line marked with ** below:
    call check( nf90_var_par_access(gfile_loc, ugrd_VarId, nf90_collective))
    ...
    **call check( nf90_get_var(gfile_loc,ugrd_VarId,work_bu,start=u_startloc,count=u_countloc) )**
If the nf90_var_par_access line (which sets nf90_collective) is commented out, GSI runs smoothly. But since that setting is needed for runs on Hercules, it can't be commented out.
2) With "SBATCH --nodes=2 --ntasks-per-node=10", GSI runs smoothly, but the issue (differences in fv3_dynvars between loproc_updat and hiproc_updat) occurs.
3) With "SBATCH --nodes=4 --ntasks-per-node=5", the issue disappears.

In summary, it seems that the "same"/similar issue with the rrfs test on Hera is related to the memory allocated to the low-level parallel netCDF I/O processes (especially when nf90_collective is used, which increases memory usage). If not "enough" memory is provided, GSI can still run successfully, but possible memory corruption in the parallel netCDF I/O step could cause the differences investigated in this issue.
So, while we have been focusing on the GSI code, the compiled packages, and possible environment variables, we might also test the impact of the sbatch setup, especially in allocating "enough" memory to the low-level netCDF/HDF5 I/O.
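
To make the comparison of the three layouts concrete: all of them run the same 20 MPI tasks, and only the number of tasks sharing a node changes. The sketch below tabulates this. It assumes, as a simplification not stated in the thread, that tasks on a node share that node's memory equally; the function names are illustrative only.

```python
# The three Slurm layouts tried above, with the outcome reported for each.
LAYOUTS = [
    (1, 20, "segmentation fault in nf90_get_var"),
    (2, 10, "runs, but loproc/hiproc fv3_dynvars differ"),
    (4, 5, "issue disappeared"),
]


def total_tasks(nodes, ntasks_per_node):
    """Total number of MPI tasks requested by --nodes and --ntasks-per-node."""
    return nodes * ntasks_per_node


def relative_memory_per_task(ntasks_per_node, baseline_tasks_per_node=20):
    """Memory available per task relative to the 20-tasks-per-node layout.

    Assumption (not from the thread): tasks on a node share its memory equally.
    """
    return baseline_tasks_per_node / ntasks_per_node


for nodes, per_node, outcome in LAYOUTS:
    print(
        f"{nodes} x {per_node:2d} = {total_tasks(nodes, per_node)} tasks, "
        f"{relative_memory_per_task(per_node):.0f}x memory per task: {outcome}"
    )
```

Under that assumption, the layout that works gives each task four times the memory of the layout that segfaults, which is consistent with the memory explanation but does not prove it.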

@TingLei-NOAA
Contributor

TingLei-NOAA commented Mar 1, 2024

An update: after changing a few Slurm parameters related to memory usage, PR #698 passes all regression tests, including the hafs tests, on Hercules. But I still need to work out how to make sure that behavior is stable.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA and @DavidHuber-NOAA : May we close this issue or is it still being actively worked on?

@TingLei-NOAA
Contributor

@RussTreadon-NOAA It is not resolved yet, although I believe it is a machine-specific instability issue (though not necessarily the machine's fault). I am not working on it for the time being. I'd prefer to leave it open to see whether other GSI developers can give it higher priority.

@RussTreadon-NOAA
Contributor

Got it. The GSI Handling Review team will close this issue if we do not see any activity over the next few months.

@RussTreadon-NOAA
Contributor

Ran develop at ebeaba1 and DavidBurrows-NCO:gaea_build at 5981b57 on Hercules. The rrfs_3denvar_glbens and hafs_4denvar_glbens ctests failed due to non-reproducible results. The other tests passed.

Regional developers should be mindful of this behavior when using Hercules to run gsi.x.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , what is the status of this issue?

@TingLei-NOAA
Contributor

An update: using the develop branch on my fork and EMC GSI

commit 9f44c8798c2087aca06df8f629699632e57df431 (HEAD -> develop, origin/develop, origin/HEAD)
Author: Innocent Souopgui <[email protected]>
Date:   Fri Sep 6 07:47:08 2024 -0500

The two hafs tests passed.

The following tests passed:
        hafs_4denvar_glbens
        hafs_3denvar_hybens

rrfs_3denvar_rdasens also passed the reproducibility tests but failed with "Failure memthresh of the regression test".
So it seems the originally reported non-reproducibility issue has disappeared with the current modules, though I am not sure which recently upgraded system component made the difference.
@RussTreadon-NOAA I think this issue can be closed unless new instances are reported.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , good to hear that things now look better.

Please reach out to regional DA staff to get their thoughts. @DavidHuber-NOAA opened this issue. He, too, may have comments to share. Close this issue if there is consensus that the reported problem is no longer a problem.

@TingLei-NOAA
Contributor

@ShunLiu-NOAA @hu5970 (on travel) @yonghuiweng @BinLiu-NOAA @XuLu-NOAA It would be great to know your opinions/suggestions on this issue in light of the recent test results.

@ShunLiu-NOAA
Contributor

@TingLei-NOAA I am OK with closing this issue because it does not appear in your new test.

@DavidHuber-NOAA
Collaborator Author

Glad to hear that the issue is no longer present. I'm OK with closing this issue as well.

@XuLu-NOAA
Contributor

@TingLei-NOAA If the reproducibility has been tested on all platforms, then I'm OK with closing this issue, too.

@DavidHuber-NOAA
Collaborator Author

@XuLu-NOAA @TingLei-NOAA This issue only appeared on Hercules (I believe it was a Rocky 8 issue initially). Tests have been run on other systems (Hera, Jet, and WCOSS2, at least). That said, I think that an additional test on Orion makes sense.

@TingLei-NOAA
Contributor

@DavidHuber-NOAA As recently reported on the GSI PR, all GSI ctests passed with the current system on Orion.

@DavidHuber-NOAA
Collaborator Author

Excellent, then I believe the GSI is working on all platforms correctly.

@XuLu-NOAA
Contributor

Thanks for your efforts! That sounds great! I don't have any additional concerns regarding the issue then.

@TingLei-NOAA
Contributor

TingLei-NOAA commented Sep 13, 2024

An additional note: the previous investigation found that the issue could occur sporadically, though, as reported in #697 (comment), requesting more memory helps achieve more stable behavior. Since we have no plans for systematic intensive tests, and based on the recent successful ctest results, I prefer to close this issue; it can be reopened if any failures are reported again.

@RussTreadon-NOAA
Contributor

Built GSI develop at 9f44c87 on Hercules and Orion. Ran ctests with the following results:

Hercules

Test project /work/noaa/da/rtreadon/git/gsi/develop/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #3: rrfs_3denvar_rdasens .............***Failed  549.69 sec
2/6 Test #6: global_enkf ......................   Passed  855.86 sec
3/6 Test #2: rtma .............................   Passed  1388.29 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  1515.99 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  1584.20 sec
6/6 Test #1: global_4denvar ...................   Passed  1983.51 sec

83% tests passed, 1 tests failed out of 6

Total Test time (real) = 1983.53 sec

The rrfs_3denvar_rdasens failure is due to

The memory for rrfs_3denvar_rdasens_loproc_updat is 1110400 KBs.  This has exceeded maximum allowable memory of 1105051 KBs, resulting in Failure memthresh of the regression test.  

This is not a fatal fail.
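
For a sense of scale, the overshoot in the message above can be worked out from the two numbers it reports. This is plain arithmetic on those values; how the regression scripts derive the allowable maximum is not shown in this thread, and the helper name `memory_excess` is made up for the example.

```python
def memory_excess(used_kb, limit_kb):
    """Return (excess in KB, excess as a percentage of the limit)."""
    excess_kb = used_kb - limit_kb
    return excess_kb, 100.0 * excess_kb / limit_kb


# Values copied from the memthresh message above.
excess_kb, excess_pct = memory_excess(1110400, 1105051)
print(f"over the limit by {excess_kb} KB ({excess_pct:.2f}% of the limit)")
```

The run exceeds the threshold by well under one percent, which fits the assessment that this is not a fatal failure.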

Notably, hafs_3denvar_glbens Passed as @TingLei-NOAA reported.

Orion

Test project /work2/noaa/da/rtreadon/git/gsi/develop/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_rdasens
    Start 4: hafs_4denvar_glbens
    Start 5: hafs_3denvar_hybens
    Start 6: global_enkf
1/6 Test #6: global_enkf ......................   Passed  847.83 sec
2/6 Test #3: rrfs_3denvar_rdasens .............   Passed  1148.03 sec
3/6 Test #2: rtma .............................   Passed  2048.26 sec
4/6 Test #5: hafs_3denvar_hybens ..............   Passed  2851.25 sec
5/6 Test #4: hafs_4denvar_glbens ..............   Passed  3216.17 sec
6/6 Test #1: global_4denvar ...................   Passed  3842.49 sec

100% tests passed, 0 tests failed out of 6

Total Test time (real) = 3842.61 sec

Unfortunately, increased gsi.x wall times are still evident on Orion. On the positive side, all ctests Passed on Orion.
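
The difference between the two machines can be quantified roughly from the ctest times listed above. The sketch below is only a proxy: ctest times cover every run inside a test rather than a single gsi.x execution, and rrfs_3denvar_rdasens is left out because it failed on Hercules. The function name `slowdown` is illustrative.

```python
# ctest times in seconds, copied from the Hercules and Orion logs above.
HERCULES = {
    "global_enkf": 855.86,
    "rtma": 1388.29,
    "hafs_3denvar_hybens": 1515.99,
    "hafs_4denvar_glbens": 1584.20,
    "global_4denvar": 1983.51,
}
ORION = {
    "global_enkf": 847.83,
    "rtma": 2048.26,
    "hafs_3denvar_hybens": 2851.25,
    "hafs_4denvar_glbens": 3216.17,
    "global_4denvar": 3842.49,
}


def slowdown(orion, hercules):
    """Return the Orion/Hercules time ratio for each test present in both logs."""
    return {name: orion[name] / hercules[name] for name in hercules if name in orion}


ratios = slowdown(ORION, HERCULES)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1]):
    print(f"{name:22s} Orion/Hercules = {ratio:.2f}")
```

By this measure global_enkf takes about the same time on both machines, while the GSI analysis tests take noticeably longer on Orion.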
