hafs_3denvar_glbens ctest fails on Hercules #697
@DavidHuber-NOAA I tried looking into this starting from GSI compiled in debug mode, but it seems the GSI ctests abort in debug mode due to issues to be fixed by #679. Since I am not sure when that PR will be merged into the EMC GSI, is it possible for you to merge that branch directly into your GSI branch and point me to it? |
@TingLei-NOAA I would suggest doing that on a local clone since the PR may still change:
```shell
cd <gsi>
git remote add hafs https://github.com/hafs-community/gsi
git fetch hafs
git merge hafs/feature/toff_fix
```
|
@DavidHuber-NOAA But which branch are you using now for your current spack setup? |
Ah, right:
```shell
git clone https://github.com/DavidHuber-NOAA/GSI --recursive -b feature/ss_160 gsi_ss_160
cd gsi_ss_160
git remote add hafs https://github.com/hafs-community/gsi
git fetch hafs
git merge hafs/feature/toff_fix
```
|
While running ctests on Hercules for PR #684, failures occurred in the hafs_3denvar_glbens test. |
An update: the branch incorporating the changes from #679 still shows the issue. As @RussTreadon-NOAA found, the differences are in u and v, as shown below (for v, differences on 23 levels). |
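The kind of level-by-level comparison behind a finding like "differences on 23 levels" can be sketched as follows. This is a hypothetical illustration, not the GSI regression-test code; in practice the fields would be read from the two fv3_dynvars NetCDF files, and the function name and data here are invented for the example:

```python
def differing_levels(field_a, field_b, tol=0.0):
    """Return the vertical-level indices where two fields, indexed as
    [level][j][i], differ by more than tol at any grid point."""
    levels = []
    for k, (lev_a, lev_b) in enumerate(zip(field_a, field_b)):
        if any(abs(a - b) > tol
               for row_a, row_b in zip(lev_a, lev_b)
               for a, b in zip(row_a, row_b)):
            levels.append(k)
    return levels

# Tiny example: v from a loproc and a hiproc run, 3 levels of a 2x2 grid.
v_loproc = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]],
            [[9.0, 9.0], [9.0, 9.0]]]
v_hiproc = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.5], [7.0, 8.0]],
            [[9.0, 9.0], [9.0, 9.0]]]
print(differing_levels(v_loproc, v_hiproc))  # → [1]
```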
@TingLei-NOAA your findings are similar to the streak feature in the RRFS forecast. Let's have a short meeting and collect all the information together. |
An update: After changing, for example,
```fortran
call check( nf90_open(filenamein, nf90_write, gfile_loc, comm=mpi_comm_world, info=MPI_INFO_NULL) )
```
to
```fortran
call check( nf90_open(filenamein, ior(nf90_write, nf90_mpiio), gfile_loc, comm=mpi_comm_world, info=MPI_INFO_NULL) )
```
in the updated branch. |
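The change above combines two NetCDF open-mode flags with Fortran's `ior`, since the mode argument is a bitmask and simply passing one flag would drop the other. A minimal Python sketch of that idea, using illustrative constant values (the real values of `nf90_write` and `nf90_mpiio` are defined by the NetCDF library):

```python
# Hypothetical flag values for illustration only.
NF_WRITE = 0x0001   # open for writing
NF_MPIIO = 0x2000   # request MPI-IO parallel access

def open_mode(*flags):
    """Combine open-mode flags with bitwise OR, like Fortran's ior()."""
    mode = 0
    for f in flags:
        mode |= f
    return mode

mode = open_mode(NF_WRITE, NF_MPIIO)
assert mode & NF_WRITE and mode & NF_MPIIO  # both requests are encoded
```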
An update: with PR #698, the hafs regression tests on Hercules still failed with this issue. |
With the recently updated PR #698 (adding the nf90_collective mode for U and V in their write subroutines, following the suggestion from P. Johnson through the Hercules helpdesk), I have seen this problem (differences between loproc and hiproc in fv3_dynvars on Hercules) disappear in my two runs of the hafs regression tests. @BinLiu-NOAA and @yonghuiweng, would you test this PR and see if it works stably? |
@TingLei-NOAA, Thanks for sharing the good news! We have experienced lots of Hercules system-side issues running the HAFS application over the past 1-2 weeks or so. Not sure if a Hercules system-side update/fix might have resolved the issue. |
@yonghuiweng, Could you please help to conduct some tests from your end on Hercules in order to confirm whether or not it is working properly on Hercules? Thanks! |
@yonghuiweng Thanks. However, for testing my branch (fv3reg_parallel_io_upgrade), we don't need to merge with toff_fix. And would you please confirm whether the issue (differences between loproc and hiproc in fv3_dynvars) was the cause of all those test failures? |
@TingLei-NOAA, I couldn't build PR #698 on Hercules due to an "nf90_netcdf4" type error. The previous test combining #698 and toff_fix may have issues; I tried several times and may have mixed the branches. |
@yonghuiweng My apologies. I introduced that bug when I pushed changes from Hera. It should work now. |
@TingLei-NOAA, I added nf90_netcdf4 at line 3967 of src/gsi/gsi_rfv3io_mod.f90 as, |
@TingLei-NOAA, the test shows rrfs_3denvar_glbens passed but hafs_3denvar_hybens failed. Here is the result: 57% tests passed, 3 tests failed out of 7. |
@yonghuiweng What is the cause of the hafs regression test failures? They are expected to fail, since we know the differences between loproc_cntrl and hiproc_cntrl in fv3_dynvars are the issue; but we expect PR #698 to resolve it, namely by creating identical fv3_dynvars files from loproc_updt and hiproc_updt. We can discuss more off-line later. |
It seems the error is still there; it shows in hafs_3denvar_hybens_regression_results.txt, |
@yonghuiweng Thanks. So it seems this issue is not, at least not exactly, the same as issue #694. More investigation is needed. |
@TingLei-NOAA I tested your branch feature/fv3reg_pio_upgrade_toff_fix on the /work disk of Hercules twice. Both runs show that hybens_loproc_updat and hybens_hiproc_updat are reproducible for both hafs_3d and hafs_4d (those tests failed, but loproc and hiproc were reproducible). |
@yonghuiweng Thanks for the investigation, findings, and information. The pattern of the differences found in the previous issue with the work2 disk setup is very similar to the pattern found in this issue. Does that mean the culprit here is also the system setup (somehow better for the work disk than for work2), even though our code changes could reduce the chances of these occurrences? Let's have some more off-line discussion. |
An update on the "same" issue with rrfs_3denvar_glbens on Hera with the branch feature/fv3reg_parallel_io_upgrade: if the previous line (setting nf90_collective) is commented out, the GSI runs smoothly. But since that setup is needed for runs on Hercules, it can't be commented out. |
An update: After changing a few Slurm parameters related to memory usage, PR #698 passes all regression tests, including the hafs test, on Hercules. But I will see how to make sure that performance is stable. |
@TingLei-NOAA and @DavidHuber-NOAA : May we close this issue or is it still being actively worked on? |
@RussTreadon-NOAA It is not resolved yet, though I believe it is a machine-specific instability issue (not necessarily the machine's fault). I am not working on it for now. I'd prefer to leave it open in case other GSI developers can work on it at higher priority. |
Got it. The GSI Handling Review team will close this issue if we do not see any activity over the next few months. |
@TingLei-NOAA , what is the status of this issue? |
An update: using the develop branch on my fork and the EMC GSI, the two hafs tests passed. rrfs_3denvar_rdasens also passed the reproducibility tests but failed with "Failure memthresh of the regression test". |
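For context, a memthresh failure is a memory-usage check rather than a reproducibility failure. A minimal Python sketch of that kind of check, with hypothetical function name, argument names, and threshold factor (this is not the GSI regression-test implementation):

```python
def check_memthresh(updat_mem_kb, cntrl_mem_kb, scale=1.1):
    """Hypothetical memory-threshold check: the updated run may not use
    more than `scale` times the control run's peak resident memory."""
    threshold = cntrl_mem_kb * scale
    return updat_mem_kb <= threshold, threshold

# Example: updated run peaks at ~2.5 GB vs a ~2.0 GB control run.
ok, threshold = check_memthresh(updat_mem_kb=2_500_000, cntrl_mem_kb=2_000_000)
# 2.5e6 kB exceeds the 2.2e6 kB threshold, so the check fails even
# though the answers themselves are bitwise reproducible.
```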
@TingLei-NOAA, good to hear that things now look better. Please reach out to regional DA staff to get their thoughts. @DavidHuber-NOAA opened this issue; he, too, may have comments to share. Close this issue if there is consensus that the reported problem is no longer a problem. |
@ShunLiu-NOAA @hu5970 (on travel) @yonghuiweng @BinLiu-NOAA @XuLu-NOAA It would be great to know your opinions/suggestions on this issue considering the recent test results. |
@TingLei-NOAA I am OK with closing this issue because it does not appear in your new test. |
Glad to hear that the issue is no longer present. I'm OK with closing this issue as well. |
@TingLei-NOAA If the reproducibility has been tested on all platforms, then I'm OK with closing this issue, too. |
@XuLu-NOAA @TingLei-NOAA This issue only appeared on Hercules (I believe it was a Rocky 8 issue initially). Tests have been run on other systems (Hera, Jet, and WCOSS2, at least). That said, I think that an additional test on Orion makes sense. |
@DavidHuber-NOAA As recently reported on the GSI PR, all GSI ctests passed with the current system on Orion. |
Excellent, then I believe the GSI is working correctly on all platforms. |
Thanks for your efforts! That sounds great! I don't have any additional concerns regarding the issue then. |
An additional note: in the previous investigation of this issue, it was found that the issue could occur spontaneously, though as reported in #697 (comment), requesting more memory helps attain more stable behavior. Since we have no plans for systematic intensive tests, and based on the recent successful ctest results, I prefer to close this issue; it can be re-opened if any failed instances are reported again. |
Build GSI.
Hercules: The rrfs_3denvar_rdasens failure is due to
This is not a fatal fail. Notably, hafs_3denvar_glbens passed, as @TingLei-NOAA reported.
Orion: Unfortunately, increased |
The hafs_3denvar_glbens ctest is failing to produce identical fv3_dynvars files between the hafs_3denvar_glbens_loproc_updat and hafs_3denvar_glbens_hiproc_updat test cases. This was discovered during testing of #684 and #695.