Add restart on failure capability for the forecast executable (NOAA-EMC#2510)

This PR:
- enables restart capability of the forecast executable from a previous
failure.
- saves restarts during the run in a new `DATA` structure. The current
`DATA` structure:
![current `DATA`](https://github.com/NOAA-EMC/global-workflow/assets/11394126/03383e2f-b7f8-43e0-8b78-c8f37a79ab84)
is being replaced by:
![Screenshot 2024-04-19 at 12 55 44 PM](https://github.com/NOAA-EMC/global-workflow/assets/11394126/8ab6e6df-bbdb-43cf-b0dc-8e066f537ee7)
where the colored boxes are described as:
![Screenshot 2024-04-19 at 12 56 14 PM](https://github.com/NOAA-EMC/global-workflow/assets/11394126/30b20e50-6cc8-4433-988a-02d5b484e7b5)
- saves model output from `MOM6` and `CICE` within `MOM6_OUTPUT/` and
`CICE_OUTPUT/` sub-directories. This is done to keep the run directory
clean and easily identify component output.

This PR also:
- replaces linking with copying. This enables the creation of a `DATA`
directory that is self-contained and can be used to diagnose issues
during failures. This is an NCO EE2 requirement and addresses part of an
outstanding Bugzilla issue.
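
The effect of copying instead of linking can be illustrated with a minimal sketch. The `stage_file` helper and the file names are hypothetical and are not taken from the workflow; the sketch only shows why a copied file keeps the run directory usable after its source disappears, while a symbolic link does not.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not part of the workflow): place a file into the run
# directory either as a symbolic link or as a real copy.
stage_file() {
  local mode="$1" src="$2" dest="$3"
  case "${mode}" in
    link) ln -sf "${src}" "${dest}" ;;
    copy) cp -f "${src}" "${dest}" ;;
    *) echo "unknown mode: ${mode}" >&2; return 1 ;;
  esac
}

workdir=$(mktemp -d)
mkdir -p "${workdir}/fix" "${workdir}/DATA"
echo "static data" > "${workdir}/fix/input.txt"

stage_file link "${workdir}/fix/input.txt" "${workdir}/DATA/linked.txt"
stage_file copy "${workdir}/fix/input.txt" "${workdir}/DATA/copied.txt"

# Simulate the source being cleaned up after the run has failed
rm -f "${workdir}/fix/input.txt"

# A dangling link fails the -e test; the copy is still readable
[[ -e "${workdir}/DATA/linked.txt" ]] && link_ok="yes" || link_ok="no"
[[ -e "${workdir}/DATA/copied.txt" ]] && copy_ok="yes" || copy_ok="no"
echo "link usable: ${link_ok}; copy usable: ${copy_ok}"

rm -rf "${workdir}"
```

The trade-off is extra disk usage and copy time in exchange for a directory that can be inspected on its own.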

In the process of enabling the restart capability, functionality in
`forecast_postdet.sh` that does not depend on the outcome of
`forecast_det.sh` is moved to `forecast_predet.sh`. `forecast_det.sh`
determines where the initial conditions will come from: `COM` in the case
of a clean run, or `DATArestart` in the case of a `RERUN`.
This should make it easier to separate **static** configuration and data
(fix files, etc.) from **runtime** configuration (namelists, etc.) and
data (initial conditions).
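
The choice between `COM` and `DATArestart` might be sketched as follows. This is not the logic in `forecast_det.sh`, which is not shown here; the `select_ic_source` helper, the "any file present" criterion, and the `checkpoint.nc` file name are assumptions made for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: report where initial conditions should be read from.
# Assumption: a non-empty restart directory means a previous attempt left
# something to resume from (a RERUN); otherwise the run is clean.
select_ic_source() {
  local com_dir="$1" restart_dir="$2"
  if [[ -d "${restart_dir}" ]] && [[ -n "$(ls -A "${restart_dir}")" ]]; then
    echo "RERUN=YES ${restart_dir}"
  else
    echo "RERUN=NO ${com_dir}"
  fi
}

workdir=$(mktemp -d)
mkdir -p "${workdir}/com" "${workdir}/restart"

# Clean run: nothing has been saved in the restart directory yet
clean=$(select_ic_source "${workdir}/com" "${workdir}/restart")

# After a failure: the restart directory holds a saved checkpoint
touch "${workdir}/restart/checkpoint.nc"
rerun=$(select_ic_source "${workdir}/com" "${workdir}/restart")

echo "${clean%% *}"
echo "${rerun%% *}"

rm -rf "${workdir}"
```

A real implementation would also need to check that the saved restarts are complete and consistent across model components before resuming from them.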

Additionally, this PR:
- adds 3 utility shell scripts in `test/`:
  - `nccmp.sh` - compare netCDF files using `nccmp`
  - `g2cmp.sh` - compare grib2 files using `wgrib2`
  - `f90nmlcmp.sh` - compare Fortran90 nml files using `f90nml` (requires
    modulefiles to load the `py-f90nml` module on RDHPCS platforms)

  They are not used in the workflow, but are useful for users to compare
  files.
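
The interfaces of these scripts are not shown in this excerpt, so the sketch below uses `cmp -s` as a stand-in comparator. The `compare_dirs` helper is hypothetical; it assumes only that a comparator takes two file paths and returns a non-zero status when they differ.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: compare same-named files in two directories with a
# comparator command (default: cmp -s) and report how many differ.
compare_dirs() {
  local dir_a="$1" dir_b="$2"
  shift 2
  local -a comparator=("$@")
  if (( ${#comparator[@]} == 0 )); then comparator=(cmp -s); fi

  local ndiff=0 file name
  for file in "${dir_a}"/*; do
    [[ -f "${file}" ]] || continue
    name=$(basename "${file}")
    # A file missing from dir_b counts as a difference
    if [[ ! -f "${dir_b}/${name}" ]] || ! "${comparator[@]}" "${file}" "${dir_b}/${name}"; then
      echo "DIFFERENT: ${name}"
      ndiff=$(( ndiff + 1 ))
    fi
  done
  echo "${ndiff} file(s) differ"
}

workdir=$(mktemp -d)
mkdir -p "${workdir}/a" "${workdir}/b"
echo "same" > "${workdir}/a/one.txt"
echo "same" > "${workdir}/b/one.txt"
echo "old"  > "${workdir}/a/two.txt"
echo "new"  > "${workdir}/b/two.txt"

result=$(compare_dirs "${workdir}/a" "${workdir}/b")
echo "${result}"

rm -rf "${workdir}"
```

Any extra arguments after the two directories replace the default comparator, so a format-aware tool could be substituted for `cmp -s` if it follows the same exit-status convention.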

Resolves NOAA-EMC#2273

Co-authored-by: Walter Kolczynski - NOAA <[email protected]>
aerorahul and WalterKolczynski-NOAA authored Apr 23, 2024
1 parent 1b6cef5 commit 3b20812
Showing 26 changed files with 1,160 additions and 1,012 deletions.
2 changes: 1 addition & 1 deletion .gitignore
@@ -177,7 +177,7 @@ ush/global_cycle.sh
ush/global_cycle_driver.sh
ush/jediinc2fv3.py
ush/ufsda
ush/finddate.sh
ush/soca
ush/make_NTC_file.pl
ush/make_ntc_bull.pl
ush/make_tif.sh
7 changes: 5 additions & 2 deletions env/HERA.env
@@ -30,8 +30,11 @@ export OMP_STACKSIZE=2048000
export NTHSTACK=1024000000
#export LD_BIND_NOW=1

ulimit -s unlimited
ulimit -a
# Setting stacksize to unlimited on login nodes is prohibited
if [[ -n "${SLURM_JOB_ID:-}" ]]; then
ulimit -s unlimited
ulimit -a
fi

if [[ "${step}" = "prep" ]] || [[ "${step}" = "prepbufr" ]]; then

48 changes: 29 additions & 19 deletions jobs/JGLOBAL_FORECAST
@@ -1,18 +1,29 @@
#! /usr/bin/env bash

source "${HOMEgfs}/ush/preamble.sh"

if (( 10#${ENSMEM:-0} > 0 )); then
export DATAjob="${DATAROOT}/${RUN}efcs${ENSMEM}.${PDY:-}${cyc}"
export DATA="${DATAjob}/${jobid}"
source "${HOMEgfs}/ush/jjob_header.sh" -e "efcs" -c "base fcst efcs"
else
export DATAjob="${DATAROOT}/${RUN}fcst.${PDY:-}${cyc}"
export DATA="${DATAjob}/${jobid}"
source "${HOMEgfs}/ush/jjob_header.sh" -e "fcst" -c "base fcst"
fi

# Create the directory to hold restarts and output from the model in stmp
export DATArestart="${DATAjob}/restart"
if [[ ! -d "${DATArestart}" ]]; then mkdir -p "${DATArestart}"; fi
export DATAoutput="${DATAjob}/output"
if [[ ! -d "${DATAoutput}" ]]; then mkdir -p "${DATAoutput}"; fi

##############################################
# Begin JOB SPECIFIC work
##############################################

# Restart conditions for GFS cycle come from GDAS
rCDUMP=${RUN}
rCDUMP="${RUN}"
export rCDUMP="${RUN/gfs/gdas}"

# Ignore possible spelling error (nothing is misspelled)
@@ -24,47 +35,46 @@ declare -rx gPDY="${GDATE:0:8}"
declare -rx gcyc="${GDATE:8:2}"

# Construct COM variables from templates (see config.com)
YMD=${PDY} HH=${cyc} declare_from_tmpl -rx COM_ATMOS_RESTART COM_ATMOS_INPUT COM_ATMOS_ANALYSIS \
YMD="${PDY}" HH="${cyc}" declare_from_tmpl -rx COM_ATMOS_RESTART COM_ATMOS_INPUT COM_ATMOS_ANALYSIS \
COM_ATMOS_HISTORY COM_ATMOS_MASTER COM_TOP COM_CONF

RUN=${rCDUMP} YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
RUN="${rCDUMP}" YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
COM_ATMOS_RESTART_PREV:COM_ATMOS_RESTART_TMPL

if [[ ${DO_WAVE} == "YES" ]]; then
YMD=${PDY} HH=${cyc} declare_from_tmpl -rx COM_WAVE_RESTART COM_WAVE_PREP COM_WAVE_HISTORY
RUN=${rCDUMP} YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
if [[ "${DO_WAVE}" == "YES" ]]; then
YMD="${PDY}" HH="${cyc}" declare_from_tmpl -rx COM_WAVE_RESTART COM_WAVE_PREP COM_WAVE_HISTORY
RUN="${rCDUMP}" YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
COM_WAVE_RESTART_PREV:COM_WAVE_RESTART_TMPL
declare -rx RUNwave="${RUN}wave"
fi

if [[ ${DO_OCN} == "YES" ]]; then
YMD=${PDY} HH=${cyc} declare_from_tmpl -rx COM_MED_RESTART COM_OCEAN_RESTART COM_OCEAN_INPUT \
if [[ "${DO_OCN}" == "YES" ]]; then
YMD="${PDY}" HH="${cyc}" declare_from_tmpl -rx COM_MED_RESTART COM_OCEAN_RESTART COM_OCEAN_INPUT \
COM_OCEAN_HISTORY COM_OCEAN_ANALYSIS
RUN=${rCDUMP} YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
RUN="${rCDUMP}" YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
COM_OCEAN_RESTART_PREV:COM_OCEAN_RESTART_TMPL
fi

if [[ ${DO_ICE} == "YES" ]]; then
YMD=${PDY} HH=${cyc} declare_from_tmpl -rx COM_ICE_HISTORY COM_ICE_INPUT COM_ICE_RESTART
RUN=${rCDUMP} YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
if [[ "${DO_ICE}" == "YES" ]]; then
YMD="${PDY}" HH="${cyc}" declare_from_tmpl -rx COM_ICE_HISTORY COM_ICE_INPUT COM_ICE_RESTART
RUN="${rCDUMP}" YMD="${gPDY}" HH="${gcyc}" declare_from_tmpl -rx \
COM_ICE_RESTART_PREV:COM_ICE_RESTART_TMPL
fi

if [[ ${DO_AERO} == "YES" ]]; then
YMD=${PDY} HH=${cyc} declare_from_tmpl -rx COM_CHEM_HISTORY
if [[ "${DO_AERO}" == "YES" ]]; then
YMD="${PDY}" HH="${cyc}" declare_from_tmpl -rx COM_CHEM_HISTORY
fi


###############################################################
# Run relevant exglobal script
###############################################################
${FORECASTSH:-${SCRgfs}/exglobal_forecast.sh}
"${FORECASTSH:-${SCRgfs}/exglobal_forecast.sh}"
status=$?
[[ ${status} -ne 0 ]] && exit "${status}"
(( status != 0 )) && exit "${status}"

# Send DBN alerts for EnKF
# TODO: Should these be in post manager instead?
if [[ "${RUN}" =~ "enkf" ]] && [[ "${SENDDBN}" = YES ]]; then
if [[ "${RUN}" =~ "enkf" ]] && [[ "${SENDDBN:-}" == YES ]]; then
for (( fhr = FHOUT; fhr <= FHMAX; fhr += FHOUT )); do
if (( fhr % 3 == 0 )); then
fhr3=$(printf %03i "${fhr}")
@@ -88,6 +98,6 @@ fi
# Remove the Temporary working directory
##########################################
cd "${DATAROOT}" || true
[[ ${KEEPDATA} = "NO" ]] && rm -rf "${DATA}"
[[ "${KEEPDATA}" == "NO" ]] && rm -rf "${DATA}" "${DATArestart}" # do not remove DATAjob. It contains DATAoutput

exit 0
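
The directory layout set up in `jobs/JGLOBAL_FORECAST` can be exercised in isolation with a short sketch. The values of `DATAROOT`, `RUN`, `PDY`, `cyc`, `jobid` and `ENSMEM` below are placeholders chosen for illustration. The sketch also shows why the `10#` prefix matters: without it, a zero-padded member number such as `008` would be rejected as invalid octal in an arithmetic context.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values; in the workflow these come from the job environment.
DATAROOT=$(mktemp -d)
RUN="gefs" PDY="20240423" cyc="00" jobid="fcst.12345" ENSMEM="008"

# 10# forces base-10 so that a zero-padded "008" is read as 8
if (( 10#${ENSMEM:-0} > 0 )); then
  DATAjob="${DATAROOT}/${RUN}efcs${ENSMEM}.${PDY}${cyc}"
else
  DATAjob="${DATAROOT}/${RUN}fcst.${PDY}${cyc}"
fi
DATA="${DATAjob}/${jobid}"
DATArestart="${DATAjob}/restart"
DATAoutput="${DATAjob}/output"
mkdir -p "${DATA}" "${DATArestart}" "${DATAoutput}"

# Clean up the working and restart directories, each passed as its own
# quoted argument, and keep DATAjob because it holds DATAoutput.
rm -rf "${DATA}" "${DATArestart}"

jobdir_name=$(basename "${DATAjob}")
remaining=$(ls "${DATAjob}")
echo "${jobdir_name}: ${remaining}"

rm -rf "${DATAROOT}"
```

Keeping `restart` and `output` one level above the per-attempt `DATA` directory is what allows a second attempt, which gets a new `jobid`, to find what the first attempt saved.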
1 change: 1 addition & 0 deletions modulefiles/module_base.hera.lua
@@ -30,6 +30,7 @@ load(pathJoin("gsi-ncdiag", (os.getenv("gsi_ncdiag_ver") or "None")))
load(pathJoin("crtm", (os.getenv("crtm_ver") or "None")))
load(pathJoin("bufr", (os.getenv("bufr_ver") or "None")))
load(pathJoin("wgrib2", (os.getenv("wgrib2_ver") or "None")))
load(pathJoin("py-f90nml", (os.getenv("py_f90nml_ver") or "None")))
load(pathJoin("py-netcdf4", (os.getenv("py_netcdf4_ver") or "None")))
load(pathJoin("py-pyyaml", (os.getenv("py_pyyaml_ver") or "None")))
load(pathJoin("py-jinja2", (os.getenv("py_jinja2_ver") or "None")))
1 change: 1 addition & 0 deletions modulefiles/module_base.hercules.lua
@@ -26,6 +26,7 @@ load(pathJoin("gsi-ncdiag", (os.getenv("gsi_ncdiag_ver") or "None")))
load(pathJoin("crtm", (os.getenv("crtm_ver") or "None")))
load(pathJoin("bufr", (os.getenv("bufr_ver") or "None")))
load(pathJoin("wgrib2", (os.getenv("wgrib2_ver") or "None")))
load(pathJoin("py-f90nml", (os.getenv("py_f90nml_ver") or "None")))
load(pathJoin("py-netcdf4", (os.getenv("py_netcdf4_ver") or "None")))
load(pathJoin("py-pyyaml", (os.getenv("py_pyyaml_ver") or "None")))
load(pathJoin("py-jinja2", (os.getenv("py_jinja2_ver") or "None")))
1 change: 1 addition & 0 deletions modulefiles/module_base.jet.lua
@@ -29,6 +29,7 @@ load(pathJoin("gsi-ncdiag", (os.getenv("gsi_ncdiag_ver") or "None")))
load(pathJoin("crtm", (os.getenv("crtm_ver") or "None")))
load(pathJoin("bufr", (os.getenv("bufr_ver") or "None")))
load(pathJoin("wgrib2", (os.getenv("wgrib2_ver") or "None")))
load(pathJoin("py-f90nml", (os.getenv("py_f90nml_ver") or "None")))
load(pathJoin("py-netcdf4", (os.getenv("py_netcdf4_ver") or "None")))
load(pathJoin("py-pyyaml", (os.getenv("py_pyyaml_ver") or "None")))
load(pathJoin("py-jinja2", (os.getenv("py_jinja2_ver") or "None")))
1 change: 1 addition & 0 deletions modulefiles/module_base.orion.lua
@@ -27,6 +27,7 @@ load(pathJoin("gsi-ncdiag", (os.getenv("gsi_ncdiag_ver") or "None")))
load(pathJoin("crtm", (os.getenv("crtm_ver") or "None")))
load(pathJoin("bufr", (os.getenv("bufr_ver") or "None")))
load(pathJoin("wgrib2", (os.getenv("wgrib2_ver") or "None")))
load(pathJoin("py-f90nml", (os.getenv("py_f90nml_ver") or "None")))
load(pathJoin("py-netcdf4", (os.getenv("py_netcdf4_ver") or "None")))
load(pathJoin("py-pyyaml", (os.getenv("py_pyyaml_ver") or "None")))
load(pathJoin("py-jinja2", (os.getenv("py_jinja2_ver") or "None")))
1 change: 1 addition & 0 deletions modulefiles/module_base.s4.lua
@@ -26,6 +26,7 @@ load(pathJoin("gsi-ncdiag", (os.getenv("gsi_ncdiag_ver") or "None")))
load(pathJoin("crtm", (os.getenv("crtm_ver") or "None")))
load(pathJoin("bufr", (os.getenv("bufr_ver") or "None")))
load(pathJoin("wgrib2", (os.getenv("wgrib2_ver") or "None")))
load(pathJoin("py-f90nml", (os.getenv("py_f90nml_ver") or "None")))
load(pathJoin("py-netcdf4", (os.getenv("py_netcdf4_ver") or "None")))
load(pathJoin("py-pyyaml", (os.getenv("py_pyyaml_ver") or "None")))
load(pathJoin("py-jinja2", (os.getenv("py_jinja2_ver") or "None")))
15 changes: 6 additions & 9 deletions parm/config/gefs/config.wave
@@ -15,9 +15,6 @@ export CDUMPwave="${RUN}wave"
# In GFS/GDAS, restart files are generated/read from gdas runs
export CDUMPRSTwave="gdas"

# Grids for wave model
export waveGRD=${waveGRD:-'mx025'}

#grid dependent variable defaults
export waveGRDN='1' # grid number for ww3_multi
export waveGRDG='10' # grid group for ww3_multi
@@ -109,8 +106,8 @@ export RSTTYPE_WAV='T' # generate second tier of restart files
rst_dt_gfs=$(( restart_interval_gfs * 3600 )) # TODO: This calculation needs to move to parsing_namelists_WW3.sh
if [[ ${rst_dt_gfs} -gt 0 ]]; then
export DT_1_RST_WAV=0 #${rst_dt_gfs:-0} # time between restart files, set to DTRST=1 for a single restart file
#temporarily set to zero to avoid a clash in requested restart times
#which makes the wave model crash a fix for the model issue will be coming
# temporarily set to zero to avoid a clash in requested restart times
# which makes the wave model crash a fix for the model issue will be coming
export DT_2_RST_WAV=${rst_dt_gfs:-0} # restart stride for checkpointing restart
else
rst_dt_fhmax=$(( FHMAX_WAV * 3600 ))
@@ -121,15 +118,15 @@ export RSTIOFF_WAV=0 # first restart file offset relative to m
#
# Set runmember to default value if not GEFS cpl run
# (for a GFS coupled run, RUNMEN would be unset, this should default to -1)
export RUNMEM=${RUNMEM:--1}
export RUNMEM="-1"
# Set wave model member tags if ensemble run
# -1: no suffix, deterministic; xxxNN: extract two last digits to make ofilename prefix=gwesNN
if [[ ${RUNMEM} = -1 ]]; then
if (( RUNMEM == -1 )); then
# No suffix added to model ID in case of deterministic run
export waveMEMB=
export waveMEMB=""
else
# Extract member number only
export waveMEMB="${RUNMEM: -2}"
export waveMEMB="${RUNMEM}"
fi

# Determine if wave component needs input and/or is coupled
21 changes: 9 additions & 12 deletions parm/config/gfs/config.wave
@@ -15,9 +15,6 @@ export CDUMPwave="${RUN}wave"
# In GFS/GDAS, restart files are generated/read from gdas runs
export CDUMPRSTwave="gdas"

# Grids for wave model
export waveGRD=${waveGRD:-'mx025'}

#grid dependent variable defaults
export waveGRDN='1' # grid number for ww3_multi
export waveGRDG='10' # grid group for ww3_multi
@@ -71,14 +68,14 @@ case "${waveGRD}" in
export wavepostGRD='glo_500'
export waveuoutpGRD=${waveGRD}
;;
"uglo_100km")
#unstructured 100km grid
"uglo_100km")
#unstructured 100km grid
export waveinterpGRD='glo_200'
export wavepostGRD=''
export waveuoutpGRD=${waveGRD}
;;
"uglo_m1g16")
#unstructured m1v16 grid
#unstructured m1v16 grid
export waveinterpGRD='glo_15mxt'
export wavepostGRD=''
export waveuoutpGRD=${waveGRD}
@@ -139,8 +136,8 @@ else # This is a GFS run
rst_dt_gfs=$(( restart_interval_gfs * 3600 )) # TODO: This calculation needs to move to parsing_namelists_WW3.sh
if [[ ${rst_dt_gfs} -gt 0 ]]; then
export DT_1_RST_WAV=0 #${rst_dt_gfs:-0} # time between restart files, set to DTRST=1 for a single restart file
#temporarily set to zero to avoid a clash in requested restart times
#which makes the wave model crash a fix for the model issue will be coming
# temporarily set to zero to avoid a clash in requested restart times
# which makes the wave model crash a fix for the model issue will be coming
export DT_2_RST_WAV=${rst_dt_gfs:-0} # restart stride for checkpointing restart
else
rst_dt_fhmax=$(( FHMAX_WAV * 3600 ))
@@ -152,15 +149,15 @@ fi
#
# Set runmember to default value if not GEFS cpl run
# (for a GFS coupled run, RUNMEN would be unset, this should default to -1)
export RUNMEM=${RUNMEM:--1}
export RUNMEM="-1"
# Set wave model member tags if ensemble run
# -1: no suffix, deterministic; xxxNN: extract two last digits to make ofilename prefix=gwesNN
if [[ ${RUNMEM} = -1 ]]; then
if (( RUNMEM == -1 )); then
# No suffix added to model ID in case of deterministic run
export waveMEMB=
export waveMEMB=""
else
# Extract member number only
export waveMEMB="${RUNMEM: -2}"
export waveMEMB="${RUNMEM}"
fi

# Determine if wave component needs input and/or is coupled