
Build using compute nodes #3131

Closed
3 tasks
WalterKolczynski-NOAA opened this issue Dec 2, 2024 · 4 comments · Fixed by #3186
Assignees
Labels
feature New feature or request

Comments

@WalterKolczynski-NOAA
Contributor

What new functionality do you need?

The capability to build on compute nodes using the job scheduler

What are the requirements for the new functionality?

Build all components using the system job scheduler, whether that is Slurm or PBS Pro. Most programs should be able to build on compute nodes, but the GDASapp will need to build in the service queue for now, as it currently contacts the outside world during its build.
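As a rough illustration of the requirement, a component build submitted through Slurm might look like the sketch below. The job name, account, walltime, and build script path are all placeholders, not the project's actual job cards; a PBS Pro version would use `#PBS` directives and `qsub` instead of `sbatch`.

```shell
#!/bin/bash
# Hypothetical Slurm job card for building one component on a compute node.
# All directives and paths here are illustrative placeholders.
#SBATCH --job-name=build_component
#SBATCH --account=my_hpc_account     # placeholder HPC account
#SBATCH --nodes=1
#SBATCH --time=02:00:00

# Run the component's build script from the submission directory.
cd "${SLURM_SUBMIT_DIR}/sorc" || exit 1
./build_ufs.sh                       # illustrative component build script
```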

Acceptance Criteria

  • All components build without error using Slurm
  • All components build without error using PBS Pro
  • Output unchanged (building on compute nodes may result in changes; at the very least, all programs should run successfully)

Suggest a solution (optional)

No response

@WalterKolczynski-NOAA WalterKolczynski-NOAA added the feature New feature or request label Dec 2, 2024
@WalterKolczynski-NOAA WalterKolczynski-NOAA self-assigned this Dec 2, 2024
@CoryMartin-NOAA
Contributor

@WalterKolczynski-NOAA will this be an option or the only way to build?

@DavidHuber-NOAA
Contributor

@CoryMartin-NOAA I intend to make this an option.

@RussTreadon-NOAA
Contributor

@DavidHuber-NOAA and @WalterKolczynski-NOAA :

GDASApp issue #1328 documents a successful compute-node (i.e., batch job) build of GDASApp inside g-w develop on WCOSS2 (Dogwood), Hera, and Orion.

@DavidHuber-NOAA
Contributor

Fantastic, thanks @RussTreadon-NOAA!

WalterKolczynski-NOAA pushed a commit that referenced this issue Dec 24, 2024
This creates scripts to run compute-node builds and also refactors the
build_all.sh script to make it easier to build all executables.

In place of various options to control what components are built when
using `build_all.sh`, instead it takes in a list of one or more systems
to build:

- `gfs` builds everything needed for forecast-only gfs (UFS model with
unstructured wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for
unstructured wave grid)
- `gefs` builds everything needed for GEFS (UFS model with structured
wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for structured wave
grid)
- `sfs` builds everything needed for SFS (UFS model in hydrostatic mode with
unstructured wave grid, gfs_utils, ufs_utils, upp, ww3 pre/post for
structured wave grid)
- `gsi` builds GSI-based DA components (gsi_enkf, gsi_monitor,
gsi_utils)
- `gdas` builds JEDI-based DA components (gdas app, gsi_monitor,
gsi_utils)

`all` will build all of the above (mostly for testing)

Examples:
- Build forecast-only GFS: `./build_all.sh gfs`
- Build cycled GFS including coupled DA: `./build_all.sh gfs gsi gdas`
- Build GEFS: `./build_all.sh gefs`
- Build everything (for testing purposes): `./build_all.sh all`
Other options, such as `-d` to build in debug mode, remain unchanged.

The full script signature is now:
```
./build_all.sh [-a UFS_app][-c build_config][-d][-f][-h][-v] [gfs] [gefs] [sfs] [gsi] [gdas] [all]
```

Additionally, there is a new script to build components on the compute
nodes using the job scheduler instead of the login node. This method
takes the load off of the login nodes and may be faster in some cases.
A compute build is invoked using the `build_compute.sh` script, which
behaves similarly to the new `build_all.sh`:
```
./build_compute.sh [-h][-v][-A <hpc-account>] [gfs] [gefs] [sfs] [gsi] [gdas] [all]
```
Compute build will generate a rocoto workflow and then call `rocotorun`
itself repeatedly until either a build fails or all builds succeed, at
which point the script will exit. Since the script calls `rocotorun`
itself, you don't need to set up your own cron job to do so, but
advanced users can still use all the regular rocoto tools on `build.xml`
and `build.db` if they wish.
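The poll-until-done behavior described above can be sketched as the loop below. This is an assumed structure, not the actual internals of `build_compute.sh`; the `poll` function is a hypothetical stand-in for advancing the workflow with `rocotorun -w build.xml -d build.db` and then inspecting job states with `rocotostat`, stubbed here to report success after three iterations so the loop logic itself is runnable anywhere.

```shell
#!/bin/sh
# Sketch of a rocotorun polling loop (assumed structure, not the real script).

attempts=0
state=running

poll() {
  # Stand-in for: rocotorun -w build.xml -d build.db, then checking
  # rocotostat output for a DEAD (failed) or all-SUCCEEDED state.
  # Stubbed to finish after three polls.
  attempts=$((attempts + 1))
  if [ "$attempts" -ge 3 ]; then
    state=done
  fi
}

while [ "$state" = running ]; do
  poll
  # The real script would wait longer between rocotorun calls.
  [ "$state" = running ] && sleep 1
done

echo "stopped polling after $attempts iterations"
```

The loop exits as soon as a terminal state is reached, which matches the described behavior of the script stopping once one build fails or all succeed.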

Some things to note with the compute build:
- When a build fails, other build jobs are not cancelled and will
continue to run.
- Since the script stops running `rocotorun` once one build fails, the
rocoto database will no longer update with the status of the remaining
jobs after that point.
- Similarly, if the terminal running `build_compute.sh` gets
disconnected, the rocoto database will no longer update.
- In either of the above cases, you could run `rocotorun` yourself
manually to update the database as long as the job information hasn't
aged off the scheduler yet.

Resolves #3131

---------

Co-authored-by: Rahul Mahajan <[email protected]>