Testing Dual-Resolution Ensemble DA in JEDI for HAFS #17

Open
XuLu-NOAA opened this issue Sep 18, 2024 · 5 comments
@XuLu-NOAA (Collaborator)

RRFS and Ting have tested the dual-resolution DA capability in JEDI for MPAS and FV3.
Corresponding discussions include:
NOAA-EMC/RDASApp#112
https://github.com/JCSDA-internal/fv3-jedi/pull/1243

This issue is opened to document the trials for HAFS.

XuLu-NOAA self-assigned this Sep 18, 2024
@XuLu-NOAA (Collaborator, Author) commented Sep 18, 2024

The first issue is that the nest domain does not have a halo3 grid file. The resulting inconsistent x/y grid calculation crashed the run when reading in the mosaic file.

After modifying scripts/exchange_atm_prep.sh, I am able to generate the corresponding halo3 file with:
/scratch1/NCEPDEV/hwrf/save/Xu.Lu/hafsv2_jedidual

Now dual-resolution JEDI (nest domain ingesting the parent-domain ensemble) works almost identically to single-resolution JEDI (parent domain ingesting the parent-domain ensemble), but both are very different from the GSI dual-resolution DA.

The next trial is to investigate the impact of ensemble size and to vary the date, observations, levels, etc.

@XuLu-NOAA (Collaborator, Author)

A new test with a 40-member ensemble at 2024-06-30 12:00 UTC shows consistent horizontal increment patterns among the GSI, JEDI single-resolution, and JEDI dual-resolution configurations. The next item is to enlarge the ensemble grid dimensions and see whether memory usage remains manageable.
The dual configuration is stored here on Hera:
/scratch1/NCEPDEV/hwrf/scrub/Xu.Lu/Backup/SampleCase_3DEnVarDual

@XuLu-NOAA (Collaborator, Author)

Major differences from single-resolution DA:
In the cost function:

  geometry:
    <<: *geometry_configs
  other geometry:
    <<: *ens_geometry_configs

In the variational block:

    geometry:
      <<: *ens_geometry_configs

ens_geometry_configs is the ensemble grid, and geometry_configs is the control grid.
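
For reference, here is a minimal sketch of how the two geometry anchors might be defined and wired together. The anchor names and the geometry / other geometry / variational wiring come from the snippets above; the holder keys (_geometry, _ens_geometry), the fv3-jedi-style geometry entries (fms initialization, akbk, npx/npy/npz), the minimizer choice, and all values are illustrative placeholders, not the settings used in this test.

_geometry: &geometry_configs              # control (nest) grid
  fms initialization:
    namelist filename: ./fmsmpp.nml
    field table filename: ./field_table
  akbk: ./akbk.nc4
  npx: 601                                # placeholder dimensions
  npy: 601
  npz: 81

_ens_geometry: &ens_geometry_configs      # ensemble (parent) grid
  fms initialization:
    namelist filename: ./fmsmpp_ens.nml
    field table filename: ./field_table
  akbk: ./akbk.nc4
  npx: 361                                # placeholder dimensions
  npy: 361
  npz: 81

cost function:
  cost type: 3D-Var
  geometry:
    <<: *geometry_configs                 # analysis/control grid
  other geometry:
    <<: *ens_geometry_configs             # ensemble grid (dual resolution)
  # background, background error, observations, ... omitted

variational:
  minimizer:
    algorithm: DRPCG
  iterations:
  - geometry:
      <<: *ens_geometry_configs           # inner loop solved on the ensemble grid
    ninner: 50
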
The filename anchors also need to be updated:

_filenames: &fv3file_names
  filename_core: '20240630.120000.fv_core.res.nest02.tile2.nc'
  filename_trcr: '20240630.120000.fv_tracer.res.nest02.tile2.nc'
  filename_sfcd: '20240630.120000.sfc_data.nest02.tile2.nc'
  filename_sfcw: '20240630.120000.fv_srf_wnd.res.nest02.tile2.nc'
  filename_cplr: '20240630.120000.coupler.res'

and

_ens_filenames: &ens_fv3file_names
  filename_core: '20240630.120000.fv_core.res.tile1.nc'
  filename_trcr: '20240630.120000.fv_tracer.res.tile1.nc'
  filename_sfcd: '20240630.120000.sfc_data.nc'
  filename_sfcw: '20240630.120000.fv_srf_wnd.res.tile1.nc'
  filename_cplr: '20240630.120000.coupler.res'
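
For reference, a minimal sketch of where these two filename anchors would typically be consumed, assuming fv3-jedi's fms restart state I/O and the members from template ensemble specification; in the full YAML these blocks sit under cost function. The datapath values and the localization placeholder are assumptions; the 40-member count and the 2024-06-30 12Z valid time come from the earlier comment.

background:
  datetime: 2024-06-30T12:00:00Z
  filetype: fms restart
  datapath: ./bkg/                        # placeholder path to nest restarts
  <<: *fv3file_names                      # control (nest) restart files

background error:
  covariance model: ensemble
  members from template:
    template:
      datetime: 2024-06-30T12:00:00Z
      filetype: fms restart
      datapath: ./ens/mem%mem%/           # placeholder path to parent-domain members
      <<: *ens_fv3file_names              # ensemble (parent) restart files
    pattern: '%mem%'
    nmembers: 40
    zero padding: 3
  # localization: (SABER/BUMP configuration omitted)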

@XuLu-NOAA (Collaborator, Author)

Regardless of dual or single resolution, if the ensemble grid is 1440 × 1320, the following memory error occurs:
580: slurmstepd: error: Detected 1 oom_kill event in StepId=113707.0. Some of the step tasks have been OOM Killed.
srun: error: h35m06: task 584: Out Of Memory
srun: Terminating StepId=113707.0
0: slurmstepd: error: *** STEP 113707.0 ON h1c22 CANCELLED AT 2024-09-26T17:54:05 ***
180: forrtl: severe (41): insufficient virtual memory
180:
180: Stack trace terminated abnormally.
184: [h9c03:1489706:0:1489706] Caught signal 11 (Segmentation fault: Sent by the kernel at address (nil))
170: forrtl: severe (41): insufficient virtual memory
170: Image PC Routine Line Source
170: libifcoremt.so.5 000014C248F9BB8D for_allocate_hand Unknown Unknown
170: libsaber.so 000014C215232550 type_com_mp_com_e Unknown Unknown
170: libsaber.so 000014C2152CBE48 type_geom_mp_geom Unknown Unknown
170: libsaber.so 000014C2152B5AA0 type_geom_mp_geom Unknown Unknown
170: libsaber.so 000014C2151B1384 type_bump_mp_bump Unknown Unknown
170: libsaber.so 000014C2151AD2F2 type_bump_mp_bump Unknown Unknown
170: libsaber.so 000014C214DB32A7 bump_create_f90 Unknown Unknown
170: libsaber.so 000014C214C9F0C4 _ZN5saber4bump4BU Unknown Unknown
170: libsaber.so 000014C214CA9CDB _ZN5saber4bump5NI Unknown Unknown
170: libsaber.so 000014C214D7AB9C _ZN5saber22SaberC Unknown Unknown
170: libsaber.so 000014C214C7CCC0 _ZN5saber24SaberC Unknown Unknown
170: libsaber.so 000014C214C8EFC1 _ZN5saber25SaberP Unknown Unknown
170: gdas.x 00000000005BF597 Unknown Unknown Unknown
170: gdas.x 00000000005BF072 Unknown Unknown Unknown
170: gdas.x 00000000005BE995 Unknown Unknown Unknown
170: gdas.x 00000000005BDFD6 Unknown Unknown Unknown
170: gdas.x 00000000005BDF7B Unknown Unknown Unknown
170: gdas.x 00000000005A8900 Unknown Unknown Unknown
170: gdas.x 00000000005A7465 Unknown Unknown Unknown
170: gdas.x 00000000005A4E38 Unknown Unknown Unknown
170: gdas.x 00000000005A42F5 Unknown Unknown Unknown
170: gdas.x 00000000005B6806 Unknown Unknown Unknown
170: gdas.x 00000000005B64D4 Unknown Unknown Unknown
170: gdas.x 00000000006506B1 Unknown Unknown Unknown
170: gdas.x 000000000064FB8F Unknown Unknown Unknown
170: gdas.x 000000000064DA2D Unknown Unknown Unknown
170: gdas.x 0000000000645995 Unknown Unknown Unknown
170: gdas.x 0000000000642BC2 Unknown Unknown Unknown
170: liboops.so 000014C2063FD68C _ZN4oops3Run7exec Unknown Unknown
170: gdas.x 00000000004DC880 Unknown Unknown Unknown
170: gdas.x 00000000004C7FDB Unknown Unknown Unknown
170: libc-2.28.so 000014C1FAA54D85 __libc_start_main Unknown Unknown
170: gdas.x 00000000004C786E Unknown Unknown Unknown
188: ==== backtrace (tid:1489710) ====
188: 0 0x00000000000534e9 ucs_debug_print_backtrace() ???:0
188: 1 0x0000000000012cf0 __funlockfile() :0
188: 2 0x000000000000f6df uw_frame_state_for() /tmp/role.apps/spack-stage/spack-stage-gcc-9.2.0-ku6r4f5qa5obpfnqpa6pezhogxq6sp7h/spack-src/libgcc/unwind-dw2.c:1265
188: 3 0x000000000000f6df uw_frame_state_for() /tmp/role.apps/spack-stage/spack-stage-gcc-9.2.0-ku6r4f5qa5obpfnqpa6pezhogxq6sp7h/spack-src/libgcc/unwind-dw2.c:1265
188: 4 0x0000000000011119 _Unwind_Backtrace() /tmp/role.apps/spack-stage/spack-stage-gcc-9.2.0-ku6r4f5qa5obpfnqpa6pezhogxq6sp7h/spack-src/libgcc/unwind.inc:302
188: =================================

Current solution on Hera:
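# Note: 75 nodes x 8 tasks/node = 600 MPI ranks; under-subscribing the nodes
# leaves more memory per rank (--mem is the per-node memory request in Slurm).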
#SBATCH --nodes=75-75
#SBATCH --tasks-per-node=8
#SBATCH --cpus-per-task=1
#SBATCH -t 00:30:00
#SBATCH --mem=48G
#SBATCH --exclusive

Further testing on Orion is ongoing.

@XuLu-NOAA (Collaborator, Author)

Update on the memory issue: it looks like Orion works fine with only 30 nodes. The issue on Hera is likely due to the available memory on that machine.
