-
Notifications
You must be signed in to change notification settings - Fork 155
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
address Issue 712 : GSI built with debug mode failed in the global_4densvar #722
address Issue 712 : GSI built with debug mode failed in the global_4densvar #722
Conversation
…est global_4denvar on wcoss2
@TingLei-NOAA , please identify two peer reviewers. @emilyhcliu is busy with many tasks. You can ask her to be a peer reviewer, but if you do so I recommend that you identify two additional peer reviewers for a total of three peer reviewers. |
@TingLei-NOAA , include in your description of this PR the issue this PR resolves. Do so by adding Resolves: #issue_number or Fixes: #issue_number in the description of this PR. |
WCOSS2 test Compile |
@RussTreadon-NOAA Thanks. I had talked with Emily earlier and she kindly agree to take a look at the issue with read_nsstbufr.f90. I will contact her and other possible reviewers to make sure they are available and are willing to be reviewers and come back to you. |
OK. I assigned @emilyhcliu as a reviewer on this PR. Please find two other peer reviewers for a total of three. Have you contacted our lead NSST developer, Xu, to be a potential peer reviewer? |
Russ, Thanks. Yes I contacted Xu and realized he was out of office till next week and I am now writing the email to him. |
Thanks for trying by adding "Resolves issue #712". This isn't the correct markup. The correct format is "Resolves: #712". I edited the description for you. Now issue #712 is automatically linked with this PR. When this PR is closed, issue #712 will be closed. Developers usually fail to close issues after their PR is merged. It's not the GSI Handling Review team's responsibility to close developer issues. Hence the requirement that developers properly link their issue(s) with the PR. |
@RussTreadon-NOAA Thanks. sorry for missing that right format requirement. |
@RussTreadon-NOAA , @XuLi-NOAA and @XuLu-NOAA kindly agreed to review this PR. Please assign them accordingly. Thanks. |
@XuLi-NOAA was already assigned as a reviewer. @XuLu-NOAA has been added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great work, Ting! Thanks for identifying the private ii issue! It's not easy to find. And the potential data race from mm1 may indeed introduce non-reproducible results with different processors.
@TingLei-NOAA I will run a single cycle test using GSI develop with and without your bug fix tomorrow. |
@emilyhcliu So many thanks! |
@TingLei-NOAA , your branch, As is, your branch will not compile on Hera unless you specifically log into CentOS7 nodes. By default Hera logins now take users to Rocky8 nodes. 2/3 of Hera is running Rocky8. GSI @emilyhcliu is right. global_4denvar regression test results should differ from the control given your changes to It's important to get @XuLi-NOAA 's input on your change to |
@RussTreadon-NOAA As described in this PR and the corresponding issue, with added "-init=snan", if there is no the change with read_nsstbufr.f90, the debug mode GSI would abort in the global_4densvar test. So, it is believed that the regression test: global_4densvar for this PR was run as it should be for this PR. Why that change didn't change the results, I would leave this to another issue to be resolved . |
I ran global_4denvar on Cactus this morning. The initial updat and contrl sst penalties differ. The final foundation temperature analysis files differ. The temperature difference is not trivial
|
That is interesting. @RussTreadon-NOAA does your control GSI include the fix for openmp directive ? |
My first test did not include the omp bug fix in I added the The updat and contrl initial sst penalties still differ
Additional information on these differences is found by comparing
The final foundation temperature analyses still differ.
The omp bug fix in |
@RussTreadon-NOAA Thanks for this update. I am updating the PR and will first run the global_4densvar with debug mode GSI ( to repeat my previous runs with GSI corresponding to a little older GSI develop) and then run them with the GSI built with optimization. I will update here when I have new results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes by Ting fix an inconsistency of the obervation depth, which is defined and used at different part of the code. The changes are required and fine to me.
The changes could lead to different accepted observation count, and thereofore alters the NSST analysis. And the count difference is limited two obervation types, TESAC(198) and XBT(199), only. I think it is unnecessary to evaluate the impact of these changes on the analysis.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor question regarding removal of !
line in read_nsstbufr.f90
src/gsi/read_nsstbufr.f90
Outdated
@@ -566,7 +566,6 @@ subroutine read_nsstbufr(nread,ndata,nodata,gstime,infile,obstype,lunout, & | |||
endif | |||
! | |||
! Determine usage | |||
! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this change intentional?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deleting of this line was done "casually" though intentionally. I can add it back.
BTW: the regression tests with the updated PR is still ongoing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The deleting of this line was done "casually" though intentionally. I can add it back. BTW: the regression tests with the updated PR is still ongoing.
I'd add the !
back but it doesn't really matter.
The branch for this PR, TingLei-daprediction:feature/gsi_debug_omp
is now one commit behind develop
.
@ShunLiu-NOAA merged your PR #698 into develop
this morning. You can let the current regression tests run but you will need to rerun with the updated and rebuilt develop
and TingLei-daprediction:feature/gsi_debug_omp
.
An update on comparison trough global_4dvar test: |
@TingLei-NOAA , thank you for updating It would be good to finish up testing so we can merge this PR into |
@RussTreadon-NOAA Thanks. I will begin this PR's regression tests on wcoss2 as soon as possible. |
Thank you @TingLei-NOAA . Will you also run the ctests on other platforms? |
@RussTreadon-NOAA I am working on hera (Rocky8) for ctests. Will they (wcoss2 and hera) be enough? |
Since developers use Hercules and Orion for pre-implementation testing, my preference is to also run ctests on these machines. I have not run ctests on Jet, but some developers (e.g, Dave Huber) have done this. I think it is good to stress test our code on various platforms. |
I will add orion on the list , which I think then it is enough. Please let me know if you prefer more. |
Was the HAFS team using Hercules? If so, we should test on Hercules. |
@RussTreadon-NOAA I understand your consideration. Just a question, after I finish ctests on 4 machines for the current PR, then I need to sync again with the head of emc gsi, should I re-do those regression tests again? If that is the scenarior, there should some coordination on multiple PR to run their regression tests, which is really difficult to do that. |
Testing presents a challenge. This is why I posted my comment earlier this morning. |
ctests for |
@RussTreadon-NOAA Thanks a lot. I am running ctests on wcoss2 , hera and orion and I will see if @ShunLiu-NOAA could coordinate on #700 which is very close to get my approval to finish its reviewing process to avoid some kind of "PR-race" situations:(. |
Coordination is critical for agile development. I don't view our develop as being very agile. That said, the coordination point remains. Running ctests is not complicated. |
Orion ctests
The
This is exepcted. The change in
The updat run (TingLei-daprediction:feature/gsi_debug_omp gsi.x`) reports 800 type 198 sst obs passing the initial quality control
|
Hercules ctests
The
The total sst penalty for the updat run is greater than the contrl run.
As with the Orion test, the Hercules updat run assimilates more type 198 sst observations. It's my understanding that the
The updat and contrl The
The
The loproc_updat ran 38 seconds slower than the contrl. The hiproc_updat ran 10 seconds faster than the contrl. These jobs ran in the |
Hera ctests
Again, this is an expected result given the change to |
@TingLei-NOAA , once you post and explain Cactus results along with confirming the Hercules |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The following has been done
- @XuLi-NOAA has confirmed the correctness of the change to
read_nsstbufr.f90
fix - two peer reviews have been completed with approval
- ctests have been run on Hera and Orion with results posted in this PR. Expected behavior was observed.
- Hercules ctests have been run and results posted in this PR. Behavior is presumed to be expected.
Approve pending @TingLei-NOAA 's
- confirmation that the Hercules
rrfs_3denvar_glbens
failure is expected - posting of acceptable WCOSS2 ctest results
Update for regression tests on hera (Rocky8).
which could be ignored . global_enkf passed in the second run
I would see if the fix of omp direction in this PR is the reason for the differences by manually adding the fix to the control . |
Ooops, that's not good --> The Hercules The difference in the fv3_tracer fields is in the o3mr field
|
Hercules Rerun ctest
|
@RussTreadon-NOAA Thanks for this further exploring. To me, this confirms the issue #697 which could spread over various tests . |
I reported |
On wcoss2, except for global_4densvar failed as expected, all regression tests passed. |
@TingLei-NOAA , thank you for briefly reporting the WCOSS2 ctest results. Since the threading bug is most evident on WCOSS2 and the nsst bug alters analysis results, we should provide more information in this PR for the benefit of GSI users. Below is what I have in mind WCOSS2 ctests
The
A check of the task 0 resident set size shows
The hiproc values are comparable between updat and contrl. The loproc_updat is higher than contrl, but both are less than the hiproc values. As Dave Huber has explained the memory threshold test is not stable. It can generate false positives. This is not a fatal fail. The
This is expected given the bug fixes in
12 more type 198 observations pass the initial quality control in the updat code. As are result the initial total Jo is slightly higher in the updat run.
due to differences in the initial sst Jo
Repeat the above ctests with the bug fix to
Note that It's also interesting to see that the
|
@ShunLiu-NOAA , @CoryMartin-NOAA , and @hu5970 : This PR
This PR is ready for merger into |
@RussTreadon-NOAA It is fine for me to merge this PR. |
@RussTreadon-NOAA Thanks a lot for all your efforts on this PR/Issue! |
Resolves: #712
This PR aims to resolve issues reported in Issue 712, identified during testing of global_4densvar.
1)The correction of the omitted private variable in the OpenMP directive within intlimq has addressed the reproducibility issue across runs using varying numbers of MPI processes.
3)The addition of -init=snan to the Fortran debugging compile options is particularly beneficial for debugging purpose. This option aids in identifying the use of undefined floating-point variables, a type of error that is difficult to trace and can lead to unexplained program behavior.