address Issue 712 : GSI built with debug mode failed in the global_4densvar #722

Merged

Conversation

TingLei-NOAA
Contributor

@TingLei-NOAA TingLei-NOAA commented Mar 15, 2024

Resolves: #712
This PR aims to resolve issues reported in Issue 712, identified during testing of global_4densvar.

1) Correcting the omitted private variable in the OpenMP directive within intlimq fixes the reproducibility issue across runs that use different numbers of MPI processes.

2) The modification in read_nsstbufr.f90 prevents variables from being used while they are undefined, which could affect the results. While this fix stops the use of uninitialized variables (e.g., tpf), whether the approach is algorithmically correct needs to be verified by experts. I hope @emilyhcliu can confirm this as a reviewer of this PR.

3) Adding -init=snan to the Fortran debug compile options helps identify the use of undefined floating-point variables, a type of error that is difficult to trace and can lead to unexplained program behavior.
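The idea behind item 3 can be shown outside Fortran. The following is a minimal Python sketch, not GSI code: it pre-fills a buffer with NaN as a stand-in for what -init=snan does to floating-point variables, and then checks for NaN before use. Python has no trapping signaling NaN, so the explicit check plays the role of the trap; the names `poisoned`, `checked_sum`, and `depth` are hypothetical.

```python
import math

def poisoned(n):
    """Return n floats pre-filled with NaN (a stand-in for -init=snan)."""
    return [math.nan] * n

def checked_sum(values):
    """Sum the values, raising if any element was never assigned."""
    for i, v in enumerate(values):
        if math.isnan(v):
            raise ValueError(f"element {i} used before being defined")
    return sum(values)

# Hypothetical buffer: only the first two entries are ever assigned.
depth = poisoned(3)
depth[0] = 1.5
depth[1] = 2.5

try:
    checked_sum(depth)
    caught = False
except ValueError:
    caught = True  # the never-assigned third entry is detected
```

Without the NaN fill, the third entry would hold an arbitrary value and the sum would silently be wrong, which is the kind of error that is hard to trace.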

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , please identify two peer reviewers. @emilyhcliu is busy with many tasks. You can ask her to be a peer reviewer, but if you do so I recommend that you identify two additional peer reviewers for a total of three peer reviewers.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , include in your description of this PR the issue this PR resolves. Do so by adding Resolves: #issue_number or Fixes: #issue_number in the description of this PR.

@RussTreadon-NOAA RussTreadon-NOAA self-assigned this Mar 16, 2024
@RussTreadon-NOAA
Contributor

WCOSS2 test

Compile gsi.x in debug mode on Cactus with the OMP bug fix from intjcmod.f90 included. Run the global_4denvar ctest. The loproc and hiproc analysis results are identical. This corrects the non-reproducibility issue reported in issue #712.

@TingLei-NOAA
Contributor Author

> @TingLei-NOAA , please identify two peer reviewers. @emilyhcliu is busy with many tasks. You can ask her to be a peer reviewer, but if you do so I recommend that you identify two additional peer reviewers for a total of three peer reviewers.

@RussTreadon-NOAA Thanks. I had talked with Emily earlier and she kindly agreed to take a look at the issue with read_nsstbufr.f90. I will contact her and other possible reviewers to make sure they are available and willing to be reviewers, and then come back to you.

@RussTreadon-NOAA
Contributor

OK. I assigned @emilyhcliu as a reviewer on this PR. Please find two other peer reviewers for a total of three. Have you contacted our lead NSST developer, Xu, to be a potential peer reviewer?

@TingLei-NOAA
Contributor Author

Russ, Thanks. Yes I contacted Xu and realized he was out of office till next week and I am now writing the email to him.

@RussTreadon-NOAA
Contributor

> @TingLei-NOAA , include in your description of this PR the issue this PR resolves. Do so by adding Resolves: #issue_number or Fixes: #issue_number in the description of this PR.

Thanks for trying by adding "Resolves issue #712". This isn't the correct markup. The correct format is "Resolves: #712". I edited the description for you. Now issue #712 is automatically linked with this PR. When this PR is closed, issue #712 will be closed. Developers usually fail to close issues after their PR is merged. It's not the GSI Handling Review team's responsibility to close developer issues. Hence the requirement that developers properly link their issue(s) with the PR.
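A rough way to check a PR description before submitting is sketched below. It is an approximation of the linking rule described above, assuming the keyword set close/fix/resolve and their inflections followed directly by an issue reference; the pattern and the function name `linked_issues` are illustrative, not GitHub's implementation.

```python
import re

# Closing keyword, optional colon, whitespace, then "#<number>".
# Any word between the keyword and the "#" breaks the match.
CLOSING = re.compile(
    r"\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?):?\s+#(\d+)",
    re.IGNORECASE,
)

def linked_issues(description):
    """Return the issue numbers that the description would link."""
    return [int(n) for n in CLOSING.findall(description)]

good = linked_issues("Resolves: #712")      # keyword directly before the reference
bad = linked_issues("Resolves issue #712")  # extra word "issue" in between
```

Here `good` contains 712 while `bad` is empty, which matches the behavior described in the comment above.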

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks. Sorry for missing the required format.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA , @XuLi-NOAA and @XuLu-NOAA kindly agreed to review this PR. Please assign them accordingly. Thanks.

@RussTreadon-NOAA
Contributor

@XuLi-NOAA was already assigned as a reviewer. @XuLu-NOAA has been added.

Contributor

@XuLu-NOAA XuLu-NOAA left a comment

Great work, Ting! Thanks for identifying the private ii issue! It's not easy to find. And the potential data race from mm1 may indeed introduce non-reproducible results with different processors.

@emilyhcliu
Contributor

emilyhcliu commented Mar 20, 2024

@TingLei-NOAA
I looked at read_nsstbufr.f90 and confirmed that tpf2 should be used for NC031001 (bathy) and NC031002 (tesac).
With the change from tpf to tpf2, the GSI assimilation result should be different if the bathy and tesac data types are in the NSST BUFR file. Do you see differences in the regression tests?

I will run a single cycle test using GSI develop with and without your bug fix tomorrow.

@TingLei-NOAA
Contributor Author

@emilyhcliu So many thanks!
The PR with the change doesn't make any difference compared with the control in the global_4denvar regression test.
I speculate the related obs are not used in the actual analysis (rejected by QC?). I will leave any further verification/validation, if needed, to the specialists on this.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , your branch, TingLei-daprediction:feature/gsi_debug_omp, is three commits behind the current head of develop. Please update your branch and rerun regression tests on WCOSS2 and RDHPCS machines.

As is, your branch will not compile on Hera unless you specifically log into CentOS7 nodes. By default Hera logins now take users to Rocky8 nodes. 2/3 of Hera is running Rocky8. GSI develop has been updated to build on Rocky8 Hera nodes.

@emilyhcliu is right. global_4denvar regression test results should differ from the control given your changes to src/gsi/read_nsstbufr.f90. I would expect the initial sst penalties to differ. The omp bug fix to src/gsi/intjcmod.f90 can also alter results but the impact of this bug fix should be less than the read_nsstbufr.f90 change. If you obtained identical results from previous runs of global-4denvar, I suspect something was wrong in your test.

It's important to get @XuLi-NOAA 's input on your change to read_nsstbufr.f90. This is not a trivial change. It could have quantifiable impact in a cycled global parallel.

@TingLei-NOAA
Contributor Author

> @emilyhcliu is right. global_4denvar regression test results should differ from the control given your changes to src/gsi/read_nsstbufr.f90. I would expect the initial sst penalties to differ. The omp bug fix to src/gsi/intjcmod.f90 can also alter results but the impact of this bug fix should be less than the read_nsstbufr.f90 change. If you obtained identical results from previous runs of global-4denvar, I suspect something was wrong in your test.

@RussTreadon-NOAA As described in this PR and the corresponding issue, with "-init=snan" added and without the change to read_nsstbufr.f90, the debug mode GSI would abort in the global_4denvar test. So I believe the global_4denvar regression test for this PR was run as it should be. As for why that change didn't change the results, I would leave this to another issue to be resolved.

@RussTreadon-NOAA
Contributor

I ran global_4denvar on Cactus this morning. The initial updat and contrl sst penalties differ. The final foundation temperature analysis files differ. The temperature difference is not trivial

dtf min/max 1=-13.668449389311288,10.743425714003955 min/max 2=-11.00172847855836,10.743425714003955 max abs diff=4.0568789806
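For readers unfamiliar with this kind of summary line, the three quantities it reports can be computed as in the sketch below. This is a hedged illustration only: the function name `compare_fields` and the sample values are made up, and the actual comparison utility used to produce the line above is not shown in this PR.

```python
def compare_fields(a, b):
    """Return (min, max) of each field and the largest pointwise difference."""
    if len(a) != len(b):
        raise ValueError("fields must have the same number of points")
    max_abs_diff = max(abs(x - y) for x, y in zip(a, b))
    return (min(a), max(a)), (min(b), max(b)), max_abs_diff

# Made-up values for illustration only; not taken from any analysis file.
updat = [-1.0, 0.5, 2.0]
contrl = [-1.0, 0.25, 2.0]
range1, range2, diff = compare_fields(updat, contrl)
```

In this toy case the two min/max pairs agree while the fields still differ at an interior point, so the max abs diff is the number to watch when judging whether two analyses are identical.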

@TingLei-NOAA
Contributor Author

> I ran global_4denvar on Cactus this morning. The initial updat and contrl sst penalties differ. The final foundation temperature analysis files differ. The temperature difference is not trivial
>
> dtf min/max 1=-13.668449389311288,10.743425714003955 min/max 2=-11.00172847855836,10.743425714003955 max abs diff=4.0568789806

That is interesting. @RussTreadon-NOAA, does your control GSI include the fix for the OpenMP directive?

@RussTreadon-NOAA
Contributor

My first test did not include the omp bug fix in develop.

I added the intjcmod.f90 omp bug fix to develop, recompiled, and reran global_4denvar with the following results.

The updat and contrl initial sst penalties still differ

  • updat: sst 8.4144309801987820E+03
  • contrl: sst 8.4007931559073422E+03

Additional information on these differences is found by comparing fort.213

diff global_4denvar_loproc_updat/fort.213 global_4denvar_loproc_contrl/| head -23
14c14
<  o-g 01     sst asm 198 0000       800 -0.454E+00  0.126E+01  0.220E+00  0.220E+00
---
>  o-g 01     sst asm 198 0000       788 -0.432E+00  0.124E+01  0.206E+00  0.206E+00
16c16
<  o-g 01         asm all          20454 -0.265E-01  0.902E+00  0.411E+00  0.411E+00
---
>  o-g 01         asm all          20442 -0.254E-01  0.901E+00  0.411E+00  0.411E+00
23c23
<  o-g 01     sst rej 198 0000       412 -0.486E+01  0.508E+01  0.000E+00  0.000E+00
---
>  o-g 01     sst rej 198 0000       424 -0.484E+01  0.506E+01  0.000E+00  0.000E+00
25,28c25,28
<  o-g 01         rej all           3811 -0.511E+01  0.662E+01  0.000E+00  0.000E+00
<  number of     sst obs that failed gross test =   3811 nonlin qc test =      0
<  type     sst jiter   1 nread     53300 nkeep   24265 num   20454
<  type     sst pen=  0.841443098019878198E+04 qcpen=  0.841443098019878198E+04 r=  0.411383     qcr=  0.411383    
---
>  o-g 01         rej all           3823 -0.511E+01  0.661E+01  0.000E+00  0.000E+00
>  number of     sst obs that failed gross test =   3823 nonlin qc test =      0
>  type     sst jiter   1 nread     53300 nkeep   24265 num   20442
>  type     sst pen=  0.840079315590734222E+04 qcpen=  0.840079315590734222E+04 r=  0.410957     qcr=  0.410957    
33,46c33,46

The final foundation temperature analyses still differ.

tf min/max 1=-13.668449389311288,11.369278461932964 min/max 2=-13.668449389311288,11.369278461932964 max abs diff=0.0000000000

The omp bug fix in intjcmod.f90 is NOT responsible for the sst differences. The change to read_nsstbufr.f90 is responsible. This is @emilyhcliu 's point.
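The fort.213 lines quoted above can be tallied with a short script. The sketch below assumes the column layout seen in those lines (status in the fourth field, observation type in the fifth, count in the seventh); that layout is inferred from this excerpt, and `sst_counts` is a hypothetical helper.

```python
def sst_counts(lines):
    """Map (status, obs_type) -> count for per-type sst lines in fort.213."""
    counts = {}
    for line in lines:
        f = line.split()
        # Per-type lines look like: o-g 01 sst asm 198 0000 800 ...
        if len(f) >= 7 and f[0] == "o-g" and f[2] == "sst" and f[3] in ("asm", "rej"):
            counts[(f[3], f[4])] = int(f[6])
    return counts

updat = sst_counts([
    " o-g 01     sst asm 198 0000       800 -0.454E+00  0.126E+01  0.220E+00  0.220E+00",
    " o-g 01     sst rej 198 0000       412 -0.486E+01  0.508E+01  0.000E+00  0.000E+00",
])
contrl = sst_counts([
    " o-g 01     sst asm 198 0000       788 -0.432E+00  0.124E+01  0.206E+00  0.206E+00",
    " o-g 01     sst rej 198 0000       424 -0.484E+01  0.506E+01  0.000E+00  0.000E+00",
])

moved = updat[("asm", "198")] - contrl[("asm", "198")]
total_updat = sum(updat.values())
total_contrl = sum(contrl.values())
```

The totals for type 198 are the same in both runs, and `moved` observations shift from rejected to assimilated, which is consistent with the change affecting quality control rather than the number of observations read.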

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks for this update. I am updating the PR and will first run global_4denvar with the debug mode GSI (to repeat my previous runs, which used a slightly older GSI develop) and then run it with the GSI built with optimization. I will update here when I have new results.

Contributor

@XuLi-NOAA XuLi-NOAA left a comment

The changes by Ting fix an inconsistency in the observation depth, which is defined and used in different parts of the code. The changes are required and fine to me.
The changes could lead to a different accepted observation count, and therefore alter the NSST analysis. The count difference is limited to two observation types only, TESAC (198) and XBT (199). I think it is unnecessary to evaluate the impact of these changes on the analysis.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Minor question regarding removal of ! line in read_nsstbufr.f90

@@ -566,7 +566,6 @@ subroutine read_nsstbufr(nread,ndata,nodata,gstime,infile,obstype,lunout, &
endif
!
! Determine usage
!
Contributor

Is this change intentional?

Contributor Author

The deleting of this line was done "casually" though intentionally. I can add it back.
BTW: the regression tests with the updated PR are still ongoing.

Contributor

> The deleting of this line was done "casually" though intentionally. I can add it back. BTW: the regression tests with the updated PR is still ongoing.

I'd add the ! back but it doesn't really matter.

The branch for this PR, TingLei-daprediction:feature/gsi_debug_omp is now one commit behind develop.

@ShunLiu-NOAA merged your PR #698 into develop this morning. You can let the current regression tests run but you will need to rerun with the updated and rebuilt develop and TingLei-daprediction:feature/gsi_debug_omp.

@TingLei-NOAA
Contributor Author

An update on the comparison through the global_4denvar test:
Rerunning the global_4denvar test with GSI updated to EMC GSI develop (not including the latest commit) and built in debug mode, I found differences between the control and the update. Going back to the results of my past regression tests, I found that similar differences between loproc_contrl and loproc_updat did exist but were overlooked by me. So I correct my previous statement and acknowledge @RussTreadon-NOAA 's observations on this. My apologies to @emilyhcliu for my mistake in explaining the regression results, and thanks to @RussTreadon-NOAA for the correction.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , thank you for updating TingLei-daprediction:feature/gsi_debug_omp with the current head of develop. On which platforms have you run ctests after updating TingLei-daprediction:feature/gsi_debug_omp? Please post results in this PR.

It would be good to finish up testing so we can merge this PR into develop as soon as possible. The longer it takes to complete testing, the more likely that testing will need to be repeated because develop has changed.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks. I will begin this PR's regression tests on wcoss2 as soon as possible.

@RussTreadon-NOAA
Contributor

Thank you @TingLei-NOAA . Will you also run the ctests on other platforms?

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA I am working on hera (Rocky8) for ctests. Will they (wcoss2 and hera) be enough?

@RussTreadon-NOAA
Contributor

Since developers use Hercules and Orion for pre-implementation testing, my preference is to also run ctests on these machines. I have not run ctests on Jet, but some developers (e.g, Dave Huber) have done this. I think it is good to stress test our code on various platforms.

@TingLei-NOAA
Contributor Author

I will add orion to the list, which I think is then enough. Please let me know if you prefer more.
Thanks.
Ting

@RussTreadon-NOAA
Contributor

Was the HAFS team using Hercules? If so, we should test on Hercules.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA I understand your consideration. Just a question: after I finish ctests on 4 machines for the current PR, if I then need to sync again with the head of EMC GSI, should I redo those regression tests? If that is the scenario, multiple PRs would need to coordinate their regression test runs, which is really difficult to do.

@RussTreadon-NOAA
Contributor

Testing presents a challenge. This is why I posted my comment earlier this morning.

@RussTreadon-NOAA
Contributor

ctests for TingLei-daprediction:feature/gsi_debug_omp -vs- develop are now running on Hercules, Orion, and Hera. I did not start WCOSS2 because you indicated that you were already running tests on Cactus.

@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks a lot. I am running ctests on wcoss2, hera and orion. I will see if @ShunLiu-NOAA can coordinate on #700, which is very close to getting my approval to finish its review process, to avoid some kind of "PR-race" situation :(.

@RussTreadon-NOAA
Contributor

Coordination is critical for agile development. I don't view our develop as being very agile. That said, the coordination point remains. Running ctests is not complicated.

@RussTreadon-NOAA
Contributor

Orion ctests
Install TingLei-daprediction:feature/gsi_debug_omp at e050818 and develop at 2167bc9 on Orion. Build each and run ctests with the following results.

Test project /work2/noaa/da/rtreadon/git/gsi/pr722/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  605.17 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  669.52 sec
3/7 Test #7: global_enkf ......................   Passed  861.63 sec
4/7 Test #2: rtma .............................   Passed  1090.85 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1282.59 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1401.58 sec
7/7 Test #1: global_4denvar ...................***Failed  1870.99 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1871.04 sec

The following tests FAILED:
          1 - global_4denvar (Failed)

The global_4denvar failure is due to

The results (penalty) between the two runs are nonreproducible,
thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

This is expected. The change in read_nsstbufr.f90 alters the number of sst observations passing quality control. The contrl run (develop gsi.x) reports 788 type 198 sst obs passing the initial quality control.

 o-g 01     sst asm 198 0000       788 -0.432E+00  0.124E+01  0.206E+00  0.206E+00

The updat run (TingLei-daprediction:feature/gsi_debug_omp gsi.x) reports 800 type 198 sst obs passing the initial quality control.

 o-g 01     sst asm 198 0000       800 -0.454E+00  0.126E+01  0.220E+00  0.220E+00

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Mar 22, 2024

Hercules ctests
Install TingLei-daprediction:feature/gsi_debug_omp at e050818 and develop at 2167bc9 on Hercules. Build each and run ctests with the following results.

Test project /work/noaa/da/rtreadon/git/gsi/pr722/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  482.88 sec
2/7 Test #3: rrfs_3denvar_glbens ..............***Failed  485.33 sec
3/7 Test #7: global_enkf ......................   Passed  726.10 sec
4/7 Test #2: rtma .............................   Passed  964.95 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1091.72 sec
6/7 Test #5: hafs_4denvar_glbens ..............***Failed  1158.59 sec
7/7 Test #1: global_4denvar ...................***Failed  1681.58 sec

57% tests passed, 3 tests failed out of 7

Total Test time (real) = 1681.59 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          3 - rrfs_3denvar_glbens (Failed)
          5 - hafs_4denvar_glbens (Failed)

The global_4denvar failure is the same as that reported on Orion.

The results (penalty) between the two runs are nonreproducible, thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

The total sst penalty for the updat run is greater than the contrl run.

global_4denvar_loproc_contrl/stdout:sst                          8.4007931559073422E+03
global_4denvar_loproc_updat/stdout:sst                          8.4144309801987820E+03

As with the Orion test, the Hercules updat run assimilates more type 198 sst observations.

It's my understanding that the rrfs_3denvar_glbens failure is expected on Hercules. The failure is due to

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The results between the two runs (rrfs_3denvar_glbens_loproc_updat and rrfs_3denvar_glbens_loproc_contrl) are not reproducible
Thus, the case has Failed siganl of the regression tests.

The updat and contrl fv3_tracer analysis files differ. @TingLei-NOAA , would you please confirm that the rrfs_3denvar_glbens failure is expected on Hercules.

The hafs_4denvar_glbens failure is due to

The runtime for hafs_4denvar_glbens_loproc_updat is 318.232665 seconds.  This has exceeded maximum allowable threshold time of 308.569998 seconds,
resulting in Failure time-thresh of the regression test.

The gsi.x wall times are

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 227.486349
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 217.630020
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 280.518180
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 318.232665

The loproc_updat ran 38 seconds slower than the contrl. The hiproc_updat ran 10 seconds faster than the contrl. These jobs ran in the /work fileset. This fileset is known to have i/o performance sensitivities. This is not a fatal fail.
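The threshold quoted in the failure message, 308.569998 seconds, equals 1.1 times the loproc_contrl wall time of 280.518180 seconds, so the check appears to allow a 10% margin over the control. The sketch below reproduces that arithmetic; the factor of 1.1 is inferred from these two numbers rather than read from the regression scripts, and `exceeds_threshold` is a hypothetical name.

```python
def exceeds_threshold(updat_time, contrl_time, factor=1.1):
    """Return (failed, threshold) for a runtime check against the control.

    The default factor of 1.1 is an assumption inferred from the numbers
    reported in this PR, not taken from the regression test scripts.
    """
    threshold = contrl_time * factor
    return updat_time > threshold, threshold

# loproc wall times reported above (seconds)
failed, threshold = exceeds_threshold(318.232665, 280.518180)

# hiproc wall times reported above: updat ran faster than contrl
hiproc_failed, _ = exceeds_threshold(217.630020, 227.486349)
```

With a margin this tight, a 38 second swing from file system load is enough to trip the check, which supports treating this as a non-fatal failure.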

@RussTreadon-NOAA
Contributor

Hera ctests
Install TingLei-daprediction:feature/gsi_debug_omp at e050818 and develop at 2167bc9 on Hera. Build each and run ctests with the following results.

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr722/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #3: rrfs_3denvar_glbens ..............   Passed  428.97 sec
2/7 Test #4: netcdf_fv3_regional ..............   Passed  486.52 sec
3/7 Test #7: global_enkf ......................   Passed  928.28 sec
4/7 Test #2: rtma .............................   Passed  972.54 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1103.79 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1281.22 sec
7/7 Test #1: global_4denvar ...................***Failed  1805.78 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1805.83 sec

The following tests FAILED:
          1 - global_4denvar (Failed)

global_4denvar failed due to

The results (penalty) between the two runs are nonreproducible, thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

Again, this is an expected result given the change to read_nsstbufr.f90. The updat gsi.x assimilates more sst observations than the contrl gsi.x.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , once you post and explain Cactus results along with confirming the Hercules rrfs_3denvar_glbens behavior this PR can be passed to the Handling Review team for approval and merger into develop.

@RussTreadon-NOAA RussTreadon-NOAA self-requested a review March 22, 2024 17:16
Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

The following has been done

  1. @XuLi-NOAA has confirmed the correctness of the change to read_nsstbufr.f90
  2. two peer reviews have been completed with approval
  3. ctests have been run on Hera and Orion with results posted in this PR. Expected behavior was observed.
  4. Hercules ctests have been run and results posted in this PR. Behavior is presumed to be expected.

Approve pending @TingLei-NOAA 's

  1. confirmation that the Hercules rrfs_3denvar_glbens failure is expected
  2. posting of acceptable WCOSS2 ctest results

@TingLei-NOAA
Contributor Author

Update for regression tests on hera (Rocky8).
global_4denvar failed because of differences in cost between updat and contrl, which is expected.
hafs_4denvar_glbens failed with:

The runtime for hafs_4denvar_glbens_hiproc_updat is 259.271409 seconds.  This has exceeded maximum allowable threshold time of 258.696309 seconds,
resulting in Failure of timethresh2 the regression test

which can be ignored.

global_enkf passed on the second run.
The other tests passed, except that rrfs_3denvar_glbens failed with:

 the regression test has Failed on cost for rrfs_3denvar_glbens_loproc_updat and rrfs_3denvar_glbens_loproc_contrl analyses

I will check whether the OpenMP directive fix in this PR is the reason for the differences by manually adding the fix to the control.
@RussTreadon-NOAA The fv3_tracer differences in rrfs_3denvar_glbens are not expected and had not been reported previously. I would attribute them to the same issue, #697 (the recent PR #698 did not resolve it completely). I will look into this issue on Hercules later.

@RussTreadon-NOAA
Contributor

Ooops, that's not good --> The Hercules rrfs_3denvar_glbens failure is not expected.

The difference in the fv3_tracer fields is in the o3mr field

sphum min/max 1=0.0,0.022526605 min/max 2=0.0,0.022526605 max abs diff=0.0000000000
liq_wat min/max 1=0.0,0.0025440012 min/max 2=0.0,0.0025440012 max abs diff=0.0000000000
ice_wat min/max 1=0.0,0.0005032422 min/max 2=0.0,0.0005032422 max abs diff=0.0000000000
rainwat min/max 1=0.0,0.007244442 min/max 2=0.0,0.007244442 max abs diff=0.0000000000
snowwat min/max 1=0.0,0.0068065105 min/max 2=0.0,0.0068065105 max abs diff=0.0000000000
graupel min/max 1=0.0,0.0047608074 min/max 2=0.0,0.0047608074 max abs diff=0.0000000000
water_nc min/max 1=0.0,1336478600.0 min/max 2=0.0,1336478600.0 max abs diff=0.0000000000
ice_nc min/max 1=0.0,4201064.0 min/max 2=0.0,4201064.0 max abs diff=0.0000000000
rain_nc min/max 1=0.0,613399.6 min/max 2=0.0,613399.6 max abs diff=0.0000000000
o3mr min/max 1=1.2216857e-08,1.5621523e-05 min/max 2=3.852443e-08,1.5621523e-05 max abs diff=0.0000000792
liq_aero min/max 1=10846143.0,71099390000.0 min/max 2=10846143.0,71099390000.0 max abs diff=0.0000000000
ice_aero min/max 1=0.0,6482549.0 min/max 2=0.0,6482549.0 max abs diff=0.0000000000
sgs_tke min/max 1=9.934678e-05,42.890892 min/max 2=9.934678e-05,42.890892 max abs diff=0.0000000000

@RussTreadon-NOAA
Contributor

Hercules rrfs_3denvar_glbens rerun

Rerun ctest rrfs_3denvar_glbens on Hercules. This time the test Passed

Test project /work/noaa/da/rtreadon/git/gsi/pr722/build
    Start 3: rrfs_3denvar_glbens
1/1 Test #3: rrfs_3denvar_glbens ..............   Passed  485.01 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 485.13 sec

@TingLei-NOAA
Contributor Author

> Hercules rrfs_3denvar_glbens rerun
>
> Rerun ctest rrfs_3denvar_glbens on Hercules. This time the test Passed
>
> Test project /work/noaa/da/rtreadon/git/gsi/pr722/build
>     Start 3: rrfs_3denvar_glbens
> 1/1 Test #3: rrfs_3denvar_glbens ..............   Passed  485.01 sec
>
> 100% tests passed, 0 tests failed out of 1
>
> Total Test time (real) = 485.13 sec

@RussTreadon-NOAA Thanks for exploring this further. To me, this confirms issue #697, which could affect various tests.

@RussTreadon-NOAA
Contributor

I reported rrfs_3denvar_glbens reproducibility issues in issue #697 more than a month ago. I thought this issue had been resolved. Non-reproducibility in regional gsi.x tests on Hercules makes this machine questionable for regional GSI development. This is very unfortunate.

@TingLei-NOAA
Contributor Author

On wcoss2, all regression tests passed except global_4denvar, which failed as expected.

@RussTreadon-NOAA
Contributor

@TingLei-NOAA , thank you for briefly reporting the WCOSS2 ctest results. Since the threading bug is most evident on WCOSS2 and the nsst bug alters analysis results, we should provide more information in this PR for the benefit of GSI users. Below is what I have in mind

WCOSS2 ctests
Install TingLei-daprediction:feature/gsi_debug_omp at e050818 and develop at 2167bc9 on Cactus. Run ctests with the following results.

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr722/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............***Failed  484.31 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  486.98 sec
3/7 Test #7: global_enkf ......................   Passed  856.25 sec
4/7 Test #2: rtma .............................   Passed  969.11 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1152.10 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1212.90 sec
7/7 Test #1: global_4denvar ...................***Failed  1683.33 sec

71% tests passed, 2 tests failed out of 7

Total Test time (real) = 1683.34 sec

The following tests FAILED:
          1 - global_4denvar (Failed)
          4 - netcdf_fv3_regional (Failed)

The netcdf_fv3_regional failure is due to

The memory for netcdf_fv3_regional_loproc_updat is 232696 KBs.  This has exceeded maximum allowable memory of 174917 KBs, resulting in Failure memthresh of the regression test.

A check of the task 0 resident set size shows

netcdf_fv3_regional_hiproc_contrl/stdout:The maximum resident set size (KB)                   = 362336
netcdf_fv3_regional_hiproc_updat/stdout:The maximum resident set size (KB)                   = 363484
netcdf_fv3_regional_loproc_contrl/stdout:The maximum resident set size (KB)                   = 159016
netcdf_fv3_regional_loproc_updat/stdout:The maximum resident set size (KB)                   = 232696

The hiproc values are comparable between updat and contrl. The loproc_updat is higher than contrl, but both are less than the hiproc values. As Dave Huber has explained the memory threshold test is not stable. It can generate false positives. This is not a fatal fail.

The global_4denvar failure is due to

The results (penalty) between the two runs are nonreproducible, thus the regression test has Failed on cost for global_4denvar_loproc_updat and global_4denvar_loproc_contrl analyses.

This is expected given the bug fixes in read_nsstbufr.f90 and intjcmod.f90. As shown below, the largest impact in terms of altering the analysis is from the read_nsstbufr.f90 fix. This fix alters the number of assimilated sst observations.

global_4denvar_loproc_contrl/fort.213: o-g 01     sst asm 198 0000       788 -0.432E+00  0.124E+01  0.206E+00  0.206E+00
global_4denvar_loproc_updat/fort.213: o-g 01     sst asm 198 0000       800 -0.454E+00  0.126E+01  0.220E+00  0.220E+00

12 more type 198 observations pass the initial quality control in the updat code. As a result the initial total Jo is slightly higher in the updat run.

global_4denvar_loproc_contrl/stdout: J Global                    7.4755758951304480E+05
global_4denvar_loproc_updat/stdout: J Global                    7.4757122733733628E+05

due to differences in the initial sst Jo

global_4denvar_loproc_contrl/stdout:sst                          8.4007931559073422E+03
global_4denvar_loproc_updat/stdout:sst                          8.4144309801987820E+03
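As a quick consistency check on the four values quoted above, the increase in J Global can be compared with the increase in the sst term. The sketch below only does that arithmetic on the printed numbers; the variable names are illustrative.

```python
# Values copied from the stdout lines quoted above.
j_contrl = float("7.4755758951304480E+05")
j_updat = float("7.4757122733733628E+05")
sst_contrl = float("8.4007931559073422E+03")
sst_updat = float("8.4144309801987820E+03")

delta_j = j_updat - j_contrl        # change in the total initial cost
delta_sst = sst_updat - sst_contrl  # change in the sst contribution

# True when the sst term accounts for the change in J Global
# to within floating-point rounding of the printed values.
sst_explains_change = abs(delta_j - delta_sst) < 1e-6
```

Both differences come out to about 13.64, so the change in the initial total cost is accounted for by the sst term alone.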

Repeat the above ctests with the bug fix to read_nsstbufr.f90 removed. This isolates the impact of the threading bug fix in intjcmod.f90

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr722t/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  494.18 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  495.82 sec
3/7 Test #7: global_enkf ......................   Passed  856.47 sec
4/7 Test #2: rtma .............................   Passed  974.82 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1159.24 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1211.65 sec
7/7 Test #1: global_4denvar ...................   Passed  1682.68 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1682.70 sec

Note that netcdf_fv3_regional passed in this rerun. This again reflects the fact the memory threshold test is not a reliable indicator of memory usage. We should consider removing or at least revising this test in the future.

It's also interesting to see that the global_4denvar test Passed despite the threading bug being present in the contrl runs. This is not entirely surprising. The intjcmod.f90 threading bug has been present for a while. Other PRs have returned Passed results for global_4denvar on WCOSS2 even with the bug present. Since the bug introduces a potential race condition it does not always alter analysis results. Repeated runs of global_4denvar with contrl (threading bug present) and updat (threading bug corrected) might eventually show differences in the final analyses. The differences, however, would likely be small. The greater impact on analyses is the bug fix in read_nsstbufr.f90.
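Why a race of this kind only sometimes changes results can be seen in a small deterministic simulation. The sketch below is not the intjcmod.f90 loop: it replays two fixed interleavings of two workers that each set a loop temporary and then use it, once with the temporary shared and once with it private. The name `ii` is borrowed from the review comments; everything else is hypothetical.

```python
def run(schedule, private):
    """Replay a fixed interleaving of two workers.

    Each worker does two steps: 'set' stores its own index in ii,
    and 'use' writes 10 * ii into its own output slot.
    """
    shared = {"ii": None}
    local = [{"ii": None}, {"ii": None}]
    index = [1, 2]
    out = [None, None]
    for worker, step in schedule:
        store = local[worker] if private else shared
        if step == "set":
            store["ii"] = index[worker]
        else:
            out[worker] = 10 * store["ii"]
    return out

# Worker 0 finishes before worker 1 starts: no overlap.
lucky = [(0, "set"), (0, "use"), (1, "set"), (1, "use")]
# Worker 1 overwrites ii between worker 0's set and use.
unlucky = [(0, "set"), (1, "set"), (0, "use"), (1, "use")]

shared_lucky = run(lucky, private=False)
shared_unlucky = run(unlucky, private=False)
private_unlucky = run(unlucky, private=True)
```

With a shared `ii` the lucky ordering gives the right answer and the unlucky one does not, while a private `ii` is correct under both. Which ordering occurs in a real threaded run depends on timing, so a test can pass many times with the bug present.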

@RussTreadon-NOAA
Contributor

@ShunLiu-NOAA , @CoryMartin-NOAA , and @hu5970 :

This PR

  • has been peer reviewed and approved by @XuLi-NOAA and @XuLu-NOAA .
  • ctests have been run on WCOSS2 (Cactus), Hera, Hercules, and Orion. Results have been reported and explained.

This PR is ready for merger into develop.

@ShunLiu-NOAA
Contributor

@RussTreadon-NOAA It is fine for me to merge this PR.

@RussTreadon-NOAA RussTreadon-NOAA merged commit f7e93ab into NOAA-EMC:develop Mar 24, 2024
4 checks passed
@TingLei-NOAA
Contributor Author

@RussTreadon-NOAA Thanks a lot for all your efforts on this PR/Issue!

@RussTreadon-NOAA RussTreadon-NOAA mentioned this pull request Mar 25, 2024
6 tasks
Development

Successfully merging this pull request may close these issues.

GSI built with debug mode failed in the test global_4denvar on wcoss2
7 participants