From 3b6b6939742dd9ac33231ad3da007d2c941e2231 Mon Sep 17 00:00:00 2001
From: Ryan Day
Date: Wed, 26 May 2021 09:39:10 -0700
Subject: [PATCH] incorporate feedback from Stephen and Dong

---
 flux/exercises/exercise1.md |  2 +-
 flux/section1.md            | 37 +++++++++++++++++++------------------
 flux/section2.md            |  2 +-
 3 files changed, 21 insertions(+), 20 deletions(-)

diff --git a/flux/exercises/exercise1.md b/flux/exercises/exercise1.md
index 53c51ac..6e53d24 100644
--- a/flux/exercises/exercise1.md
+++ b/flux/exercises/exercise1.md
@@ -17,7 +17,7 @@
 flux-resource: ERROR: [Errno 2] Unable to connect to Flux: ENOENT: No such file
 ```
 2. See "Starting Flux" in [Section 1](/flux/section1).
 3. See "Showing the resources in your Flux allocation" in [Section 1](/flux/section1).
-4. The flux-hwloc man page gives the helpful command `flux hwloc topology | lstopo-no-graphics --if xml -i -` for displaying a detailed view of the hardware topology.
+4. The flux-hwloc man page gives the helpful command `flux hwloc topology | lstopo-no-graphics --if xml -i -` for displaying a detailed view of the hardware topology (this command may not work with all versions of hwloc).
 ---
 [Introduction](/flux/intro) | [Section 1](/flux/section1) | Exercise 1 | [Section 2](/flux/section2)
diff --git a/flux/section1.md b/flux/section1.md
index 1a7dd5f..08bb6e7 100644
--- a/flux/section1.md
+++ b/flux/section1.md
@@ -8,43 +8,44 @@ author: Ryan Day, Lawrence Livermore National Laboratory
 Regardless of what resource management software a cluster is running, the first step in running in a multi-user environment is to get an allocation of hardware resources. Once you have an allocation, you can use Flux to manage your workload on those resources. This section will tell you where to find Flux and how to start it in an allocation even if it is not the main resource manager on the cluster that you are running on.
 ### Finding Flux
 Flux is included in the TOSS operating system on LC systems, so should be available in your standard `PATH`. You can check on this with:
-```
-[day36@fluke108:~]$ which flux
+```console
+[day36@rzalastor1:~]$ which flux
 /usr/bin/flux
-[day36@fluke108:~]$ flux --version
-commands: 0.22.0
-libflux-core: 0.22.0
+[day36@rzalastor1:~]$ flux --version
+commands: 0.26.0
+libflux-core: 0.26.0
 libflux-security: 0.4.0
 build-options: +hwloc==1.11.0
-[day36@fluke108:~]$
+[day36@rzalastor1:~]$
 ```
 Flux is under heavy development. At times you may want a version that is newer than the TOSS version, or just ensure that you stay on a consistent version. Builds of Flux are also installed in `/usr/global/tools/flux/` on LC clusters. You can use one of these versions by adding it to your PATH:
-```
-[day36@fluke108:~]$ export PATH=/usr/global/tools/flux/$SYS_TYPE/flux-c0.18.0-s0.10.0/bin:$PATH
-[day36@fluke108:~]$ which flux
-/usr/global/tools/flux/toss_3_x86_64_ib/flux-c0.18.0-s0.10.0/bin/flux
-[day36@fluke108:~]$ flux --version
+```console
+[day36@rzalastor2:~]$ export PATH=/usr/global/tools/flux/$SYS_TYPE/default/bin:$PATH
+[day36@rzalastor2:~]$ which flux
+/usr/global/tools/flux/toss_3_x86_64_ib/default/bin/flux
+[day36@rzalastor2:~]$ flux --version
 commands: 0.18.0
 libflux-core: 0.18.0
 build-options: +hwloc==1.11.0
-[day36@fluke108:~]$
+[day36@rzalastor2:~]$
 ```
+Note that the `default` and `new` links can change as new versions of Flux are released.
+
 If you are not on an LC cluster, and flux is not already installed, or if you're just into that sort of thing, you can also install Flux using `spack` or build it from source. See [Appendix I](/flux/appendices/appendixI) for more details on those options.
 ### Starting Flux
 Even if you are on a cluster that is running another resource manager, such as Slurm or LSF, you can still use Flux to run your workload.
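Because the `default` and `new` links can change as releases land, it can be worth checking what a link currently resolves to before putting it on your `PATH`. A minimal sketch (the directory layout mirrors the transcript above and may differ on your system):

```console
[day36@rzalastor2:~]$ ls /usr/global/tools/flux/$SYS_TYPE/
[day36@rzalastor2:~]$ readlink -f /usr/global/tools/flux/$SYS_TYPE/default
```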
 You will need to get an allocation, then start Flux on all of the nodes in that allocation with the `flux start` command. This will start `flux-broker` processes on all of the nodes that will gather information about the hardware resources available and communicate between each other to assign your workload to those resources. On a cluster running Slurm, this will look like:
-```
+```console
 [day36@rzalastor2:~]$ salloc -N2 --exclusive
 salloc: Granted job allocation 234174
-sh-4.2$ srun -N2 -n2 --pty flux start
+sh-4.2$ srun -N2 -n2 --pty --mpibind=off flux start
 sh-4.2$ flux mini run -n 2 hostname
 rzalastor6
 rzalastor5
 sh-4.2$
 ```
-If you're on a cluster that is running a multi-user Flux instance, getting an allocation with `flux-broker` processes running is even easier. You can just use the `flux mini alloc` command:
-```
-fill this in when fluke works again
-```
+The `--mpibind=off` flag affects an LC-specific plugin and should not be used on non-LC clusters.
+
+If you're on a cluster that is running a multi-user Flux instance, getting an allocation with `flux-broker` processes running is even easier. You can just use the `flux mini alloc` command to get an interactive allocation, or any of the batch commands described in [Section 3](/flux/section3).
 ### Showing the resources in your Flux instance
 Flux uses [hwloc](http://manpages.org/hwloc/7) to build an internal model of the hardware available in a Flux instance. You can query this model with `flux hwloc`, or see a view of what resources are allocated and available with `flux resource list`.
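On a cluster where a multi-user Flux instance is the system resource manager, the interactive path described above might look like the following sketch (node count and commands are illustrative; `flux mini alloc` drops you into a shell running inside the new instance):

```console
$ flux mini alloc -N2      # request 2 nodes and start a nested Flux instance
$ flux mini run -n2 hostname
$ exit                     # leave the allocation
```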
 For example, in the Flux instance started in the previous section, we have two nodes with 20 cores each:
 ```
diff --git a/flux/section2.md b/flux/section2.md
index f2ead54..c741313 100644
--- a/flux/section2.md
+++ b/flux/section2.md
@@ -7,7 +7,7 @@ author: Ryan Day, Lawrence Livermore National Laboratory
 In the previous section, we learned how to find flux, get an allocation, and query the compute resources in that allocation. Now, we are ready to launch work on those compute resources and get some work done. When you launch work in Flux, that work can be either blocking or non-blocking. Blocking steps will run to completion before more work can be submitted, whereas non-blocking steps are enqueued, allowing you to immediately submit more work in the allocation.
 
-Before we get into submitting and managing job steps, we should also discuss Flux's jobids as they're a bit different than what you'll find in other resource management software. In the introduction to this tutorial, we mentioned that Flux is fully hierarchical. That is, users can launch full flux instances within allocations, then launch more job steps or flux instances within those instances. While this has has many benefits for taking advantage of modern HPC hardware and allowing complex workflows, it also means that the sequential numeric jobids used in traditional resource managers do not match Flux's job model. Flux instead uses hashes of the job parameters and environment, including submit time, to create effectively unique identifiers for each job and job step. There are options to display these identifiers in a number of ways, but the default is an 8 character string prepended by an `f`, e.g. `fBsFXaow5` for the job submitted in the example below. For more details on Flux's identifiers, see the [FLUID documentation](https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_19.html).
+Before we get into submitting and managing job steps, we should also discuss Flux's jobids, as they're a bit different from what you'll find in other resource management software. In the introduction to this tutorial, we mentioned that Flux is fully hierarchical. That is, users can launch full flux instances within allocations, then launch more job steps or flux instances within those instances. While this has many benefits for taking advantage of modern HPC hardware and allowing complex workflows, it also means that the sequential numeric jobids used in traditional resource managers do not match Flux's job model. Flux instead combines the submit time, a generator id, and a sequence number to create effectively unique identifiers for each job and job step. There are options to display these identifiers in a number of ways, but the default is an 8-character string prefixed with an `f`, e.g. `fBsFXaow5` for the job submitted in the example below. For more details on Flux's identifiers, see the [FLUID documentation](https://flux-framework.readthedocs.io/projects/flux-rfc/en/latest/spec_19.html).
 
 ### Submit blocking job steps with `flux mini run`
 If you want your work to block until it completes, the `flux mini run` command will submit a job step and then wait until the step is complete before returning. For example, in a two node allocation, we can launch an mpi program with 4 tasks:
 ```