The issue:
Right now when you run gcm_setup at NCCS, it explicitly asks which architecture you wish to run on. This is because we have a predefined number of o-server nodes we would like to use, so the total number of tasks needed is a function of the architecture. The bottom line is that your gcm_run.j SLURM script will specify an architecture (--constraint), a node count (--nodes), and the cores per node (--ntasks-per-node). This limits you to running on the architecture you asked for even if resources are available on a different architecture, which of course is not optimal.
As a concrete example, consider c720 running on a layout that requires 3456 cores for the model, with 9 o-server nodes requested.
I will use the following abbreviations:
N_M = number of model nodes
N_O = number of o-server nodes
Note that if 3456 doesn't divide evenly, I will use the ceiling for N_M.
On Cascade Lake (45 cores per node):
77 N_M + 9 N_O = 3870 cores on 86 nodes
On Skylake (40 cores per node):
87 N_M + 9 N_O = 3840 cores on 96 nodes
On Haswell (28 cores per node):
124 N_M + 9 N_O = 3724 cores on 133 nodes
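For reference, here is roughly what the architecture-pinned header looks like today, using the Skylake numbers above (a sketch; `sky` is, I believe, the NCCS constraint name for Skylake, and the values are illustrative):

```csh
#!/bin/csh -f
#SBATCH --constraint=sky        # pins the job to Skylake nodes only
#SBATCH --nodes=96              # 87 model nodes + 9 o-server nodes
#SBATCH --ntasks-per-node=40    # Skylake cores per node
```

If Skylake happens to be busy while Haswell sits idle, this job still waits in the Skylake queue.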
Users have requested an architecture-independent configuration of the gcm_run.j script, i.e., one with no --constraint option.
After much discussion, one idea we came up with: simply request the total number of cores we want, and assume that any cores not needed by the model are assigned to the IO server.
A heuristic would set the request, say the IO server gets ~10% of the model cores, so the user job would just specify a single core count.
In the above example, 3456 * 0.1 ≈ 346, so assuming 10% you would want 3456 + 346 = 3802 cores, and that is what the script would request.
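With this heuristic, the batch header shrinks to a bare task count with no constraint at all (again a sketch):

```csh
#!/bin/csh -f
#SBATCH --ntasks=3802    # model cores + ~10% for the IO server; no --constraint,
                         # so SLURM may place the job on any architecture with free nodes
```

SLURM then allocates ceil(3802 / cores-per-node) nodes on whichever architecture it selects.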
Now you get the following (rounding node counts up):
On Cascade Lake:
85 nodes = 77 N_M + 8 N_O
On Skylake:
96 nodes = 87 N_M + 9 N_O
On Haswell:
136 nodes = 124 N_M + 12 N_O
So the gcm_run.j script would detect the total number of cores and the number of nodes, compute the number of nodes needed by the model from NX and NY, and just assign whatever nodes are left over to the IO server.
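A minimal csh sketch of that calculation, assuming it runs inside gcm_run.j once SLURM has granted the allocation; `SLURM_CPUS_ON_NODE` and `SLURM_JOB_NUM_NODES` are standard SLURM environment variables, and the NX/NY values here are just an illustrative c720 layout (in practice they come from the run configuration):

```csh
#!/bin/csh -f
set NX = 24     # illustrative layout: 24 x 144 = 3456 model cores
set NY = 144

@ MODEL_NPES = $NX * $NY
@ CPN = $SLURM_CPUS_ON_NODE                 # cores per node on whatever architecture we landed on
@ N_M = ($MODEL_NPES + $CPN - 1) / $CPN     # ceiling division: nodes the model fills
@ N_O = $SLURM_JOB_NUM_NODES - $N_M         # whatever is left over feeds the IO server
@ OSERVER_NPES = $N_O * $CPN

echo "model: $N_M nodes ($MODEL_NPES cores), IO server: $N_O nodes ($OSERVER_NPES cores)"
```

Note that `SLURM_CPUS_ON_NODE` reports the cores on the node running the script, which is sufficient as long as the allocation lands on a single architecture.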
The result is that the actual number of IO-server nodes varies by architecture, but the total task count remains fixed and the job will run without a constraint. This should provide a broadly applicable solution for users running a standard History configuration who want a "work anywhere at NCCS" setup.