
Modify hpc stack to point to hpc-stack-gfsv16 since hpc-stack is outd… #562

Merged

Conversation

emilyhcliu
Contributor

@emilyhcliu emilyhcliu commented Apr 21, 2023

Description
Based on the discussion in the GW issue #1453 and GSI issue #560

Fixes #560

We need to do the following:

  • GSI develop uses the EPIC-maintained hpc-stack and, later, spack-stack
  • GSI release/gfsda.v16 uses hpc-stack-gfsv16

On Hera, the update of the hpc stack to hpc-stack-gfsv16 is ready, so this PR focuses on updating the Hera modulefiles for gfsda.v16. The updated changes can be found in the gfsda.v16_crtm branch of the emilyhcliu fork.

Here are the changes
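The actual diff is not reproduced in this thread, but the modulefile update amounts to swapping every `hpc-stack` reference for `hpc-stack-gfsv16`. A minimal sketch of how such a swap could be scripted over the Hera Intel modulefiles (the file glob, directory layout, and function name here are assumptions for illustration, not the PR's real diff):

```python
import re
from pathlib import Path

def retarget_stack(modulefiles_dir: str,
                   old: str = "hpc-stack",
                   new: str = "hpc-stack-gfsv16") -> list:
    """Rewrite references to the old stack in each Hera Intel modulefile.

    Returns the names of the files that changed. The glob pattern and
    stack names are illustrative assumptions, not the actual PR diff.
    """
    changed = []
    for path in sorted(Path(modulefiles_dir).glob("gsi_hera.intel*")):
        text = path.read_text()
        # Negative lookahead keeps the replacement idempotent: an existing
        # "hpc-stack-gfsv16" is not rewritten to "hpc-stack-gfsv16-gfsv16".
        new_text = re.sub(rf"{re.escape(old)}(?!-gfsv16)", new, text)
        if new_text != text:
            path.write_text(new_text)
            changed.append(path.name)
    return changed
```

In practice the PR edits the modulefiles by hand; a script like this is mainly useful as a check that no stale `hpc-stack` reference remains afterwards.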

How Has This Been Tested?

Checklist

DUE DATE for this PR is 6/2/2023. If this PR is not merged into develop by this date, the PR will be closed and returned to the developer.

@emilyhcliu emilyhcliu self-assigned this Apr 21, 2023
@RussTreadon-NOAA RussTreadon-NOAA linked an issue May 8, 2023 that may be closed by this pull request
@natalie-perlin

@emilyhcliu - please note that EPIC-maintained stacks are now available, as updated in PR #571.

@emilyhcliu
Contributor Author

@emilyhcliu - please note that EPIC-maintained stacks are now available, as updated in PR #571.

@natalie-perlin Thanks for letting me know about the EPIC hpc-stack update.
We should have the EPIC-maintained stacks in develop.
Thanks for all the good work!!

@emilyhcliu
Contributor Author

@RussTreadon-NOAA I think I should close this PR since we can use the EPIC-maintained stacks in develop (PR #571).
Should we change release/gfsda.v16 from hpc-stack to hpc-stack-gfsv16? If so, I will open a PR.

@RussTreadon-NOAA
Contributor

@emilyhcliu : This PR (#562) merges into NOAA-EMC:release/gfsda.v16. PR #571 merges into NOAA-EMC:develop. Doesn't NOAA-EMC:release/gfsda.v16 need to stay in sync with the operational GFS v16.*?

@emilyhcliu
Contributor Author

emilyhcliu commented May 12, 2023

@emilyhcliu : This PR (#562) merges into NOAA-EMC:release/gfsda.v16. PR #571 merges into NOAA-EMC:develop. Doesn't NOAA-EMC:release/gfsda.v16 need to stay in sync with the operational GFS v16.*?

The CRTM-2.4.0_emc installed in hpc-stack, used by release/gfsda.v16, is not the same version as the one installed on WCOSS-2 for operations. The CRTM difference between hpc-stack and hpc-stack-gfsv16 is the bug fix that handles the floating-overflow error in cloudy radiance simulation from CRTM. The bug fix was merged into CRTM-2.4.0_emc on May 27, 2022.

@HaixiaLiu-NOAA and I worked with EIB to make sure the version with the bug fix was installed on HERA last summer. We checked the source code and knew the bug-fixed version was installed.

Lately, some of our parallel experiments were having trouble, so I went to check the CRTM source installed on Hera again. The source code did not contain the bug fix. Some changes must have been made to hpc-stack since last summer.

I found there were two versions of hpc stacks (hpc-stack and hpc-stack-gfsv16). I reported my findings in issue #1452.
Kate provided very helpful information about the status of the various HPC stacks.

So, I went to check the CRTM-v2.4.0_emc installed in hpc-stack-gfsv16 and confirmed the location of the source code with EIB.
Both stacks pointed to the same source code, which did not have the bug fix; an older version was installed. So, I asked EIB for clarification on April 7 (Friday). The next Monday (April 10), they installed a brand-new CRTM-2.4.0_emc (with the bug fix) for hpc-stack-gfsv16. There was no explanation or communication before they re-installed it.

Anyway, the whole process has been very frustrating for me.
release/gfsda.v16 needs its modules updated from hpc-stack to hpc-stack-gfsv16.
hpc-stack-gfsv16 has the CRTM-2.4.0_emc with the bug fix and also includes the CRTM coefficients for N21.
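Much of the confusion above came from trusting the stack label rather than the installed source. One defensive check is to scan an installation's source tree for a line known to exist only in the bug-fixed CRTM-2.4.0_emc. A hedged sketch (the function name is illustrative, and any marker string would be hypothetical, since the conversation does not show the actual fix):

```python
from pathlib import Path

def source_has_fix(src_root: str, marker: str) -> bool:
    """Return True if any Fortran source under src_root contains the marker.

    `marker` should be a line that exists only in the bug-fixed
    CRTM-2.4.0_emc (e.g. part of the overflow guard); it is a placeholder
    here because the actual fix is not shown in this conversation.
    """
    for path in Path(src_root).rglob("*.f90"):
        if marker in path.read_text(errors="ignore"):
            return True
    return False
```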

@RussTreadon-NOAA
Contributor

Thank you @emilyhcliu for confirming that we need to keep this PR to update NOAA-EMC:release/gfsda.v16.

Two questions:

  1. Is it only the Hera modulefile which needs to be updated?
  2. Have we run the regression tests on Hera to ensure expected behavior?

@emilyhcliu
Contributor Author

I checked Orion before; it does not have the CRTM with the bug fix.
I opened an issue in the hpc-stack repository to request updating CRTM-2.4.0_emc on the HPC machines.

I will run the regression tests for release/gfsda.v16 with the change from hpc-stack to hpc-stack-gfsv16.
The results will change since the simulated values from CRTM will be slightly different, but the differences in the numbers are very tiny.

@emilyhcliu
Contributor Author

Thank you @emilyhcliu for confirming that we need to keep this PR to update NOAA-EMC:release/gfsda.v16.

Two questions:

  1. Is it only the Hera modulefile which needs to be updated?
  2. Have we run the regression tests on Hera to ensure expected behavior?

@RussTreadon-NOAA Just to clarify: the regression tests you mentioned above are the standard regression tests (19 in total) in release/gfsda.v16, correct?

@RussTreadon-NOAA
Contributor

Oops, this is a problem. release/gfsda.v16 uses the old suite of ctests; it is not configured to run the new suite of tests. You can try the old tests, but I suspect many will fail because the required input files cannot be found.

This will be a problem moving forward. I don't think GFS v16.x implementations can move to develop. We move to develop with GFS v17. How do we regression test GFS v16.x changes if release/gfsda.v16 can only run old, likely non-functional, tests?

Please try the old ctests bundled with release/gfsda.v16 and record which tests run and which fail. If an insufficient number of global ctests run, we'll need to update release/gfsda.v16 to the current suite of 9 ctests.

@emilyhcliu
Contributor Author

@RussTreadon-NOAA Thanks for your quick response.

My control is release/gfsda.v16.
My update is gfsda.v16_crtm.
The difference between the control and the update is the module files under modulefiles.

All 19 tests ran to completion except netcdf_fv3_regional for the control; the update runs (low and high) are good.

The error message from the control runs is the following:

 the current GSI parallelization IO for fv3_lam only works for netcdf4
 ncfmt should be            3  GSI will stop  while, dynvars file is
           2
 ****STOP2****  ABORTING EXECUTION w/code=         333
 ****STOP2****  ABORTING EXECUTION w/code=         333
application called MPI_Abort(MPI_COMM_WORLD, 333) - process 0
slurmstepd: error: *** STEP 44887687.0 ON h24c23 CANCELLED AT 2023-05-12T22:06:26 ***

As you expected, many tests failed.

 1/19 Test #17: global_C96_fv3aero ...............***Failed  5174.09 sec
 2/19 Test #16: netcdf_fv3_regional ..............***Failed  5285.91 sec
 3/19 Test  #8: arw_netcdf .......................***Failed  5289.14 sec
 4/19 Test  #2: global_T62_ozonly ................***Failed  5289.85 sec
 5/19 Test  #9: arw_binary .......................***Failed  5290.54 sec
 6/19 Test #11: nmm_netcdf .......................***Failed  5349.01 sec
 7/19 Test #19: global_enkf_T62 ..................   Passed  5413.37 sec
 8/19 Test  #3: global_4dvar_T62 .................***Failed  5835.19 sec
 9/19 Test #13: hwrf_nmm_d2 ......................***Failed  5957.64 sec
10/19 Test #14: hwrf_nmm_d3 ......................   Passed  5967.76 sec
11/19 Test #10: nmm_binary .......................***Failed  5974.89 sec
12/19 Test #15: rtma .............................   Passed  6021.94 sec
13/19 Test  #7: global_lanczos_T62 ...............***Failed  6311.42 sec
14/19 Test  #4: global_4denvar_T126 ..............***Failed  8292.09 sec
15/19 Test #12: nmmb_nems_4denvar ................***Failed  8343.47 sec
16/19 Test #18: global_C96_fv3aerorad ............***Failed  8413.97 sec
17/19 Test  #1: global_T62 .......................***Failed  8775.63 sec
18/19 Test  #5: global_fv3_4denvar_T126 ..........***Failed  8895.95 sec
19/19 Test  #6: global_fv3_4denvar_C192 ..........***Failed  8901.93 sec

I did not find the summary output for each run. (I remember there should be a summary output for each test documenting reproducibility, timing, scalability, etc.)
I will set clean = .false. in regression_var.sh and re-run the regression tests to investigate the failed cases.
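For the record, the pass/fail tally above can be extracted mechanically from a saved ctest log. A small sketch that matches the per-test result lines quoted in this thread (the function name is illustrative):

```python
import re

def summarize_ctest(log: str) -> dict:
    """Count Passed/Failed entries in ctest per-test result lines.

    Relies on ctest's output convention: passing tests print "Passed"
    and failing tests print "***Failed" on their result line.
    """
    passed = len(re.findall(r"\bPassed\b", log))
    failed = len(re.findall(r"\*\*\*Failed\b", log))
    return {"passed": passed, "failed": failed, "total": passed + failed}
```

Applied to the 19-test log above, it would report 3 passed and 16 failed.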

@emilyhcliu
Contributor Author

@RussTreadon-NOAA
I am running a gfsda.v16 experiment on Hera using hpc-stack-gfsv16. The experiment period started from 2023031518, and it is now running 2023041600. The experiment has been running smoothly without any issues so far.

@RussTreadon-NOAA
Contributor

PR #1 has been opened in your fork. This PR replaces the old regression tests in emilyhcliu:gfsda.v16_crtm with the new regression tests used by the authoritative develop.

Only the global ctests work since gfsda.v16_crtm only clones global fix files. I ran the ctests from my fork using executables from your fork as the control.

Hera(hfe10):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/test/build$ ctest -R global
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/test/build
    Start 1: global_3dvar
1/4 Test #1: global_3dvar .....................   Passed  1911.96 sec
    Start 2: global_4dvar
2/4 Test #2: global_4dvar .....................***Failed  480.88 sec
    Start 3: global_4denvar
3/4 Test #3: global_4denvar ...................   Passed  1613.04 sec
    Start 9: global_enkf
4/4 Test #9: global_enkf ......................   Passed  826.31 sec

75% tests passed, 1 tests failed out of 4

Total Test time (real) = 4832.23 sec

The following tests FAILED:
          2 - global_4dvar (Failed)

The global_4dvar ctest fails due to issues in release/gfsda.v16. This is OK since we do not exercise 4dvar options in GFS v16.

I'd like to hear from you, Andrew, and others working on GFS v16 implementations if we want to update release/gfsda.v16 regression tests. If "yes", I'll change this PR from draft to active. If "no", I'll close the PR and delete my branch. We can discuss at the team's convenience.

@emilyhcliu
Contributor Author

@RussTreadon-NOAA I agree with you. We should add the updated regression tests to release/gfsda.v16.

@RussTreadon-NOAA
Contributor

Thank you @emilyhcliu for your reply. I am currently running the updated ctests on Orion. The Hera and Cactus tests are done. Global ctest results will be posted in PR #1 once the Orion tests finish. After this I'll change PR #1 to active status.

@emilyhcliu
Contributor Author

@RussTreadon-NOAA Thanks for working on updating the regression tests for release/gfsda.v16.

@RussTreadon-NOAA
Contributor

@emilyhcliu , we have two hera modulefiles: gsi_hera.intel.lua and gsi_hera.gnu.lua. This PR updates gsi_hera.intel.lua. Do we also need to update gsi_hera.gnu.lua? This may be more a question for EIB and/or the library team than you. I don't know who uses the gnu build on Hera.

@emilyhcliu
Contributor Author

@emilyhcliu , we have two hera modulefiles: gsi_hera.intel.lua and gsi_hera.gnu.lua. This PR updates gsi_hera.intel.lua. Do we also need to update gsi_hera.gnu.lua? This may be more a question for EIB and/or the library team than you. I don't know who uses the gnu build on Hera.

@RussTreadon-NOAA Let's just address the HPC stack change (hpc-stack-gfsv16) for Intel on Hera in this PR.
We can open an issue at the hpc-stack repository (https://github.com/NOAA-EMC/hpc-stack) and request updates to the hpc-stack-gfsv16 modules for the rest of the compilers and HPC machines. If you agree, I will go ahead and open that issue.

@RussTreadon-NOAA
Contributor

It is my understanding that

  • GFS v17 uses develop
  • release/gfsda.v16 is only for GFS v16 implementations

GFS v16 implementations occur on WCOSS2. Pre-implementation testing may occur on Orion or Hera. I assume Hera testing will use intel builds since WCOSS2 uses intel builds. If true, release/gfsda.v16 will not use the Hera gnu build. This plus the fact that we do not have a GSI code manager suggest that we do not need to update the Hera gnu module in release/gfsda.v16.

What's your take?

@emilyhcliu
Contributor Author

It is my understanding that

  • GFS v17 uses develop
  • release/gfsda.v16 is only for GFS v16 implementations

GFS v16 implementations occur on WCOSS2. Pre-implementation testing may occur on Orion or Hera. I assume Hera testing will use intel builds since WCOSS2 uses intel builds. If true, release/gfsda.v16 will not use the Hera gnu build. This plus the fact that we do not have a GSI code manager suggest that we do not need to update the Hera gnu module in release/gfsda.v16.

What's your take?

@RussTreadon-NOAA I am totally OK with it!

@RussTreadon-NOAA
Contributor

Pending two peer reviews, I believe we can move forward with this PR.

Collaborator

@CatherineThomas-NOAA CatherineThomas-NOAA left a comment


Thanks for getting this all sorted out @emilyhcliu. And thank you Russ for updating the regression tests. Looks good to me.

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented May 15, 2023

Thank you @CatherineThomas-NOAA for your review. When you have time, would you please click approve so we record two peer reviews for this PR?

@emilyhcliu, I'll contact the Handling Review team to schedule merger of this PR into release/gfsda.v16 on Tuesday, 5/16.

@RussTreadon-NOAA
Contributor

My bad @CatherineThomas-NOAA . I see that you already approved this PR.

@emilyhcliu , I hope to merge this PR before lunch tomorrow.

@RussTreadon-NOAA
Contributor

Given feedback from the GSI Handling Review team, we will move forward with merging this PR into release/gfsda.v16.

@RussTreadon-NOAA RussTreadon-NOAA merged commit aef6f02 into NOAA-EMC:release/gfsda.v16 May 16, 2023