
feature/rm_cpreq #643

Merged: 7 commits merged into NOAA-EMC:develop from feature/rm_cpreq on Jan 10, 2025

Conversation

@malloryprow (Contributor) commented Jan 7, 2025

Note to developers: You must use this PR template!

Description of Changes

This PR removes the usage of cpreq in EVS (an item under code manager fixes and additions). It also sets KEEPDATA to YES for the cam prep jobs, uses PB2NC_SKIP_VALID_TIMES for global_ens wave prep, and adds exclhost to jevs_mesoscale_headline_plots.

It also fixes typos in the PR template.
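For illustration, the core change is swapping cpreq for plain cp in the copy steps; a hypothetical before/after sketch (the actual file names and destinations in the EVS scripts differ):

    # before (hypothetical line):
    cpreq -v $DATA/evs.stats.$COMPONENT.$RUN.$VDATE.stat $COMOUT/
    # after:
    cp -v $DATA/evs.stats.$COMPONENT.$RUN.$VDATE.stat $COMOUT/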

Developer Questions and Checklist

  • Is this a high priority PR? If so, why and is there a date it needs to be merged by?
  • Do you have any planned upcoming annual leave/PTO?
  • Are there any changes needed for when the jobs are supposed to run?
  • The code changes follow NCO's EE2 Standards.
  • Developer's name is removed throughout the code, and ${USER} is used where necessary.
  • References to the feature branch for HOMEevs are removed from the code.
  • J-Job environment variables, COMIN and COMOUT directories, and output follow what has been defined for EVS.
  • Jobs over 15 minutes in runtime have restart capability.
  • If applicable, changes in dev/drivers/scripts or dev/modulefiles have been made in the corresponding ecf/scripts and ecf/defs/evs-nco.def.
  • Jobs contain the appropriate file checking and don't run METplus for any missing data.
  • Code is using METplus wrappers structure and not calling MET executables directly.
  • Log is free of any ERRORs or WARNINGs.
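As a generic illustration of the file-checking item in the checklist above, a minimal sketch (the path, variable names, and METplus conf file here are hypothetical, not the actual EVS scripts):

    # Hypothetical sketch: only call METplus when the expected input exists
    input_file=${COMIN}/prep/${COMPONENT}/${MODELNAME}.${VDATE}/some_input.nc
    if [ -s "$input_file" ]; then
        run_metplus.py -c ${PARMevs}/metplus_config/some_wrapper.conf
    else
        echo "WARNING: $input_file not found; skipping METplus for this valid time."
    fi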

Testing Instructions

Set-up

  1. Clone my fork and checkout branch feature/rm_cpreq
  2. ln -sf /lfs/h2/emc/vpppg/noscrub/emc.vpppg/verification/EVS_fix fix
  3. cd sorc; ./build

For everything below, be sure to set HOMEevs to the location of the clone and COMIN to /lfs/h2/emc/vpppg/noscrub/emc.vpppg/$NET/$evs_ver_2.
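A condensed sketch of the set-up above (the fork URL and clone location are assumptions; adjust to where you actually clone):

    git clone https://github.com/malloryprow/EVS.git    # assumed fork URL
    cd EVS
    git checkout feature/rm_cpreq
    ln -sf /lfs/h2/emc/vpppg/noscrub/emc.vpppg/verification/EVS_fix fix
    cd sorc && ./build
    # In each driver script you submit, point HOMEevs at this clone and COMIN at the emc.vpppg output, e.g.:
    #   export HOMEevs=/path/to/clone/EVS
    #   export COMIN=/lfs/h2/emc/vpppg/noscrub/emc.vpppg/$NET/$evs_ver_2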

✔️ cam prep

  1. cd dev/drivers/scripts/prep/cam
  2. Run: jevs_cam_href_severe_prep.sh
  • Submit with qsub -v vhr=00 and qsub -v vhr=12
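Concretely, the submission looks like this (driver name and vhr values from the steps above):

    cd $HOMEevs/dev/drivers/scripts/prep/cam
    qsub -v vhr=00 jevs_cam_href_severe_prep.sh
    qsub -v vhr=12 jevs_cam_href_severe_prep.sh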

✔️ global_ens prep

  1. cd dev/drivers/scripts/prep/global_ens
  2. Run: jevs_global_ens_wave_grid2obs_prep.sh, jevs_global_ens_atmos_prep.sh, jevs_global_ens_headline_prep.sh

✔️ aqm stats

  1. cd dev/drivers/scripts/stats/aqm
  2. Run: jevs_aqm_grid2obs_stats.sh
  • Submit with qsub -v vhr=00

✔️ global_ens stats

  1. cd dev/drivers/scripts/stats/global_ens
  2. Run: jevs_global_ens_wave_grid2obs_stats.sh
  3. Run: jevs_global_ens_gefs_chem_grid2obs_aeronet_stats.sh
  • Submit with qsub -v vhr=00

✔️ aqm plots

  1. cd dev/drivers/scripts/plots/aqm
  2. Run: jevs_aqm_grid2obs_plots.sh

✔️ global_ens plots

  1. cd dev/drivers/scripts/plots/global_ens
  2. Run: jevs_global_ens_wave_grid2obs_plots.sh

✔️ mesoscale plots

  1. cd dev/drivers/scripts/plots/mesoscale
  2. Run: jevs_mesoscale_grid2obs_plots.sh, jevs_mesoscale_headline_plots.sh, jevs_mesoscale_precip_plots.sh, jevs_mesoscale_snowfall_plots.sh
  • NOTE: jevs_mesoscale_snowfall_plots and jevs_mesoscale_precip_plots have been exceeding their walltime in the emc.vpppg parallel; perhaps we can address the walltime increase in this PR if desired and needed.
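If a walltime increase is wanted, it would be a one-line change to the PBS directive in those two driver scripts; a sketch (the 10-hour value is only an example, not a recommendation):

    #PBS -l walltime=10:00:00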

@AliciaBentley-NOAA (Contributor) left a comment

I have reviewed the changes made in this PR and I believe that they look good. I approve this PR to be merged after it has been tested/vetted. Thanks!

CC @PerryShafran-NOAA @malloryprow

@AndrewBenjamin-NOAA left a comment

I have reviewed these changes and approve this PR provided testing is successful.

@PerryShafran-NOAA (Contributor)

Starting with the vhr=00 run of the cam_href_severe_prep.

@PerryShafran-NOAA (Contributor)

@malloryprow The vhr=00 href cam prep is done.

.o file is here: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/cam/jevs_cam_href_severe_prep.o176418876

There is nothing in COMOUT.

There are many 'not found' messages in here:

0 + echo 'Forecast file /lfs/h2/emc/vpppg/noscrub/mallory.row/evs/v2.0/prep/cam/hiresw.20250106/hireswarw.t00z.MXUPHL25_A24.SSPF.2025010612-2025010712.f36.nc not found for member 1.'

I set COMIN to point to your directory as instructed, but it appears that I need to run something else before I run the href severe prep job.

I'm moving on to global_ens prep; we'll circle back once we figure out what to do here.

@PerryShafran-NOAA (Contributor)

Oh, did you mean emc.vpppg and not mallory.row? Maybe that's the issue?

@malloryprow (Author)

Hahaha, I definitely meant for that to be emc.vpppg and not mallory.row. 🙃 I fixed it.

@PerryShafran-NOAA (Contributor)

Ok cool...take two.

@PerryShafran-NOAA (Contributor)

The job is complete:

.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/cam/jevs_cam_href_severe_prep.o176423216
output directory: /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/prep/cam/href.20250106

There is no working directory as that is actually removed once the job is completed.

I'll submit the vhr=12 job.

@PerryShafran-NOAA (Contributor)

The vhr=12 job is complete:

.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/cam/jevs_cam_href_severe_prep.o176423906

There is no output, because t12z data is missing from emc.vpppg.

echo 'Forecast file /lfs/h2/emc/vpppg/noscrub/emc.vpppg/evs/v2.0/prep/cam/hiresw.20250106/hireswarwmem2.t12z.MXUPHL25_A24.SSPF.2025010612-2025010712.f24.nc not found for member 2.'

@PerryShafran-NOAA (Contributor)

The global_ens prep jobs are all underway.

@malloryprow (Author)

cam prep is good. I think the t12z data might be missing due to the production switch, since the job is supposed to run at 12Z and that is when the switch started. I don't see any cpreq in the log, and the file in COMOUT matches the parallel.

@PerryShafran-NOAA (Contributor)

Let me know if you wish to run cam prep again, just to get in a run where there actually is data.

@malloryprow (Author)

I think we are good!

@PerryShafran-NOAA (Contributor)

Also, as I noted above, the cam prep job removes the working directory (which means KEEPDATA must be set to NO). Might we want to change that?

@PerryShafran-NOAA (Contributor)

For global_ens, which ran yesterday:

global_ens atmos prep:
.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/global_ens/jevs_global_ens_atmos_prep.o176424866
output: /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/prep/global_ens/atmos.20250105
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_global_ens_atmos_prep.176424866.cbqs01

global_ens headline prep:
.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/global_ens/jevs_global_ens_headline_prep.o176424870
output: /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/prep/global_ens/headline.20250105
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_global_ens_headline_prep.176424870.cbqs01

global_ens wave prep:
.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/prep/global_ens/jevs_global_ens_wave_grid2obs_prep.o176424875
output: /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/prep/global_ens/wave.20250106/gefs/grid2obs
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_global_ens_wave_grid2obs_prep.176424875.cbqs01

@PerryShafran-NOAA (Contributor)

Note on global_ens wave prep:

01/07 20:48:14.084 metplus.b630cbb8 (time_looping.py:300) WARNING: PB2NC_SKIP_TIMES is deprecated. Please use PB2NC_SKIP_VALID_TIMES

@malloryprow (Author)

Just pushed a commit to set KEEPDATA to YES. It was set to NO for multiple cam prep jobs.
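For reference, the change amounts to flipping the value exported in the affected cam prep drivers (a sketch; the exact lines in the drivers may differ):

    export KEEPDATA=YES    # was NO; the working directory is now retained after the job completes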

@malloryprow (Author)

Just pushed a commit for PB2NC_SKIP_VALID_TIMES.
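In the METplus configuration this is a rename of the deprecated key flagged in the warning above; a sketch, with a hypothetical value:

    # was: PB2NC_SKIP_TIMES = "%H:12"
    PB2NC_SKIP_VALID_TIMES = "%H:12"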

@PerryShafran-NOAA (Contributor)

As you look at the global_ens prep, I'll start on the aqm stats.

@malloryprow (Author)

global_ens prep for atmos and headline is good. Counts match the parallel and no cpreq in the logs. Do you want to remove /lfs/h2/emc/vpppg/noscrub/perry.shafran/evs/v2.0/prep/global_ens/wave.20250106 and re-run the job after pulling in the new changes?

@PerryShafran-NOAA (Contributor)

No need to remove since 20250106 was run yesterday; today's run would be 20250107. Unless you want me to run the same date as previously.

@malloryprow (Author) commented Jan 8, 2025

aqm plots is good! There are some warnings about missing files and thresholds not being met, but the final tar files match the parallel and there is no cpreq usage in the log files.

@PerryShafran-NOAA (Contributor)

I also submitted the mesoscale plots jobs. For the precip and snowfall, I am submitting with a 10 hr walltime to see how long these jobs take.

However, when I tried to submit the mesoscale headline job, I got this weird error:

46 (clogin06) /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale > qsub jevs_mesoscale_headline_plots.sh
qsub: Failed to set placement to exclhost
Job submit error: 32.
47 (clogin06) /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale 

No clue why these errors are there. I'm quite befuddled. What do you think?

@PerryShafran-NOAA (Contributor)

The global_ens wave plots job is finished.

.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/global_ens/jevs_global_ens_wave_grid2obs_plots.o176516061
tarballs: /lfs/h2/emc/ptmp/perry.shafran/evs/v2.0/plots/global_ens/wave.20250107
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_global_ens_wave_grid2obs_plots.176516061.cbqs01

@PerryShafran-NOAA (Contributor)

I added exclhost to the PBS options for jevs_mesoscale_headline_plots.sh, and the job was submitted normally.
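For reference, exclusive-host placement is requested with a standard PBS directive along these lines (whether the driver combines it with an arrangement such as vscatter is not shown here):

    #PBS -l place=exclhost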

@malloryprow (Author) commented Jan 8, 2025

(Quoting @PerryShafran-NOAA's comment above about the failed exclhost placement when submitting jevs_mesoscale_headline_plots.sh.)

Wow, okay, you got this too! I had been seeing this in the parallel and thought it might be something with the cron, so I emailed NCO about it. It looks like it is not a problem with the submission coming from the cron after all. The wild thing is NCO said this: "This is the first time I've ever seen an error from that hook over millions of executions."

@malloryprow (Author)

global_ens wave plots is all good!

@PerryShafran-NOAA (Contributor)

Huh, you've been seeing this same error? With this specific job or with other jobs as well?

@malloryprow (Author)

Just this job!

@PerryShafran-NOAA (Contributor)

Interesting! I wonder why.

Well, adding exclhost seems to correct things...

@PerryShafran-NOAA (Contributor)

2 of the 4 plot jobs are complete:

mesoscale grid2obs:
.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale/jevs_mesoscale_grid2obs_plots.o176516476
plot tarball: /lfs/h2/emc/ptmp/perry.shafran/evs/v2.0/plots/mesoscale/atmos.20250107
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_grid2obs_plots.176516476.cbqs01

mesoscale headline:
.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale/jevs_mesoscale_headline_plots.o176516638
plot tarball: /lfs/h2/emc/ptmp/perry.shafran/evs/v2.0/plots/mesoscale/headline.20250107
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_headline_plots.176516638.cbqs01

@malloryprow (Author)

Everything looks in order for these two jobs!

@PerryShafran-NOAA (Contributor)

Good news! I think the other two jobs are going to finish after your workday is over, so I guess we'll pick this up on Friday.

@malloryprow (Author)

Did the other two jobs complete?

@PerryShafran-NOAA (Contributor)

Whoops! My apologies, I was going to post the info first thing in the morning yesterday, but I got caught up in other stuff. I'll get you that info in a minute or so.

@PerryShafran-NOAA (Contributor)

meso precip plot:

.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale/jevs_mesoscale_precip_plots.o176516471
Time to run job: 7 hr 42 min
tarball: /lfs/h2/emc/ptmp/perry.shafran/evs/v2.0/plots/mesoscale/atmos.20250107
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_precip_plots.176516471.cbqs01

meso snowfall plot:

.o file: /lfs/h2/emc/vpppg/noscrub/perry.shafran/pr643test/EVS/dev/drivers/scripts/plots/mesoscale/jevs_mesoscale_snowfall_plots.o176516475
Time to run job: 9 hr 15 min
tarball: /lfs/h2/emc/ptmp/perry.shafran/evs/v2.0/plots/mesoscale/atmos.20250107
working directory: /lfs/h2/emc/stmp/perry.shafran/evs_test/prod/tmp/jevs_mesoscale_snowfall_plots.176516475.cbqs01

@malloryprow (Author)

Wow! Those are really long run times. Were there new plots added? There haven't been problems in production so I'm not sure what is causing these jobs to run so long.

@PerryShafran-NOAA (Contributor)

I don't recall adding any new plots to these jobs, but they do seem quite long. I'm curious whether the changes that added the MPMD processing added time to the jobs. I think we talked about this previously when working on the restart capability but moved on because it was taking so long, and I believe all plot processes will be split into 31-day and 90-day jobs soon.

@malloryprow (Author)

Oh yes, I am remembering that now.

@malloryprow (Author)

I can't compare the final tar files to the parallel but the logs look good!

@PerryShafran-NOAA (Contributor)

Oh, I see, you can't compare the tar files because they weren't created due to the walltime exceedance; they are not there on emc.vpppg either. Ah well.

Looks like you gave it the check mark, so time to check the code and then I'll merge.

@PerryShafran-NOAA (Contributor) left a comment

Code works as expected. Approved for merge.

@PerryShafran-NOAA (Contributor)

Before I merge, is this code up to date or do we need to bring in develop?

@malloryprow (Author)

It's good!

@PerryShafran-NOAA (Contributor)

Great! Here we go!

@PerryShafran-NOAA PerryShafran-NOAA merged commit 6c4bae9 into NOAA-EMC:develop Jan 10, 2025
@malloryprow malloryprow deleted the feature/rm_cpreq branch January 10, 2025 13:45
BinbinZhou-NOAA added a commit to BinbinZhou-NOAA/EVS that referenced this pull request Jan 10, 2025
@malloryprow malloryprow linked an issue Jan 10, 2025 that may be closed by this pull request
Linked issue: Change cpreq -v to cp -v across EVS