-
From LANL:
mcoyne@nid00027:~> env | grep LD_LIBRARY
CRAYPAT_LD_LIBRARY_PATH=/opt/cray/pe/gcc-libs:/opt/cray/gcc-libs:/opt/cray/pe/perftools/7.0.6/lib64
LD_LIBRARY_PATH=/usr/projects/consult/mcoyne/craysuse12x/flux-sched/gcc73/0.8.0/lib64:/usr/projects/consult/mcoyne/craysuse12x/python_gcc73_icn/lib:/opt/cray/wlm_detect/1.3.3-6.0.7.1_5.8__g7109084.ari/lib64:/opt/cray/pe/perftools/7.0.6/lib64:/opt/cray/rca/2.2.18-6.0.7.1_5.55__g2aa4f39.ari/lib64:/opt/cray/alps/6.6.43-6.0.7.1_5.54__ga796da32.ari/lib64:/opt/cray/xpmem/2.2.15-6.0.7.1_5.16__g7549d06.ari/lib64:/opt/cray/dmapp/7.1.1-6.0.7.1_6.9__g45d1b37.ari/lib64:/opt/cray/pe/pmi/5.0.14/lib64:/opt/cray/ugni/6.0.14.0-6.0.7.1_3.18__gea11d3d.ari/lib64:/opt/cray/udreg/2.3.2-6.0.7.1_5.18__g5196236.ari/lib64:/opt/cray/pe/libsci/19.02.1/GNU/8.1/x86_64/lib:/opt/cray/pe/mpt/7.7.6/gni/mpich-cray/8.2/lib:/opt/cray/pe/mpt/7.7.6/gni/mpich-gnu/8.2/lib:/usr/projects/consult/mcoyne/craysuse12x/openmpi40/gcc73/4.0.3/lib64:/usr/projects/consult/mcoyne/craysuse12x/library/ucx17_gcc73_icn/1.7.0/lib:/usr/projects/consult/mcoyne/craysuse12x/library/libfabric19_gcc73_icn/1.9.0/lib:/usr/projects/consult/mcoyne/craysuse12x/gcc/gcc73/7.3.1/lib:/usr/projects/consult/mcoyne/craysuse12x/gcc/gcc73/7.3.1/lib64:/usr/projects/consult/mcoyne/craysuse12x/buildtools/lib64:/opt/pmix/gcc4x/3.1.4/lib64:/opt/libevent/gcc4x/2.1.8/lib64:/opt/cray/job/2.2.3-6.0.7.1_5.50__g6c4e934.ari/lib64:/opt/gcc/8.2.0/snos/lib64
CRAY_LD_LIBRARY_PATH=/opt/cray/wlm_detect/1.3.3-6.0.7.1_5.8__g7109084.ari/lib64:/opt/cray/pe/perftools/7.0.6/lib64:/opt/cray/rca/2.2.18-6.0.7.1_5.55__g2aa4f39.ari/lib64:/opt/cray/alps/6.6.43-6.0.7.1_5.54__ga796da32.ari/lib64:/opt/cray/xpmem/2.2.15-6.0.7.1_5.16__g7549d06.ari/lib64:/opt/cray/dmapp/7.1.1-6.0.7.1_6.9__g45d1b37.ari/lib64:/opt/cray/pe/pmi/5.0.14/lib64:/opt/cray/ugni/6.0.14.0-6.0.7.1_3.18__gea11d3d.ari/lib64:/opt/cray/udreg/2.3.2-6.0.7.1_5.18__g5196236.ari/lib64:/opt/cray/pe/libsci/19.02.1/GNU/8.1/x86_64/lib:/opt/cray/pe/mpt/7.7.6/gni/mpich-cray/8.2/lib:/opt/cray/pe/mpt/7.7.6/gni/mpich-gnu/8.2/lib
mcoyne@nid00027:~>
mcoyne@nid00028:~> readelf -d ./xthi
Dynamic section at offset 0x2d78 contains 34 entries:
<CUT>
0x000000000000001d (RUNPATH) Library runpath: [/usr/projects/consult/mcoyne/craysuse12x/library/ucx17_gcc73_icn/1.7.0/lib:/opt/cray/xpmem/default/lib64:/opt/cray/pe/pmi/default/lib64:/opt/cray/alps/default/lib64:/opt/cray/udreg/default/lib64:/opt/cray/wlm_detect/1.3.3-6.0.7.1_5.8__g7109084.ari/lib64:/usr/projects/consult/mcoyne/craysuse12x/openmpi40/gcc73/4.0.3/lib64]
mcoyne@nid00028:~>
mcoyne@nid00028:~> flux mini run -N2 -n 2 env LD_PRELOAD=/usr/projects/consult/mcoyne/craysuse12x/flux-core/gcc73/0.16.0/lib64/flux/libpmi2.so ./xthi
--------------------------------------------------------------------------
The application appears to have been direct launched using "aprun",
but OMPI was not built with ALPS PMI support and therefore cannot
execute. You must build Open MPI using --with-pmi pointing
to the ALPS PMI library location.
Please configure as appropriate and try again.
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
*** and potentially your MPI job)
[nid00029:09876] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach
^C70.874s: job.exception type=cancel severity=0 interrupted by ctrl-C
flux-job: task(s) exited with exit code 143
mcoyne@nid00028:~>
-
I believe this is one step further! But the environment is such that OpenMPI still detects that this is an ALPS system and uses ALPS underneath, and I'd like to find a way to undo this.
No, we don't support PMIx. However, OpenMPI will bootstrap MPI using PMI when it detects that it is running under Flux. I think the following two concrete next steps will get us closer to feasibility:
1. Could you launch OpenMPI under SLURM and see if OpenMPI also uses ALPS underneath (see the sketch just below)? If SLURM can do this without ALPS, we should be able to do the same with Flux.
2. There is some magic we needed to support Spectrum MPI, which is based on OpenMPI. I will find that and send it to you. If we are lucky, it could serve us well in this case too.
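As a concrete way to check that (a sketch only; the component and variable names assume stock OpenMPI 4.x conventions), one could list the ALPS/PMI-related pieces compiled into the OpenMPI build and turn up verbosity under SLURM to see which bootstrap component gets selected:
# Which ALPS/PMIx-related components were compiled into this OpenMPI build?
ompi_info | grep -i -E 'alps|pmix'
# Under SLURM, raise pmix framework verbosity to see which bootstrap component OpenMPI selects.
srun -N2 -n2 env OMPI_MCA_pmix_base_verbose=100 ./xthi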
-
@garlick tells me that the MPI plugins are now at https://github.com/flux-framework/flux-core/tree/master/src/shell/lua.d
We should set up runs with:
:~> flux mini run -N2 -n 2 -o mpi=openmpi env LD_PRELOAD=/usr/projects/consult/mcoyne/craysuse12x/flux-core/gcc73/0.16.0/lib64/flux/libpmi2.so ./xthi
:~> flux mini run -N2 -n 2 -o mpi=spectrum env LD_PRELOAD=/usr/projects/consult/mcoyne/craysuse12x/flux-core/gcc73/0.16.0/lib64/flux/libpmi2.so ./xthi
My guess is we may want to manually learn the combination of environment variables to set and unset to make OpenMPI on Cray detect Flux, and then encapsulate that into our MPI plugin.
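For example (purely a sketch: which variables actually need setting or unsetting on this system is exactly what has to be determined, and ALPS_APP_DEPTH is only an illustrative guess at an ALPS hint worth clearing):
# Hypothetical combination: explicitly select OpenMPI's flux components and
# clear an ALPS hint so the ALPS personality is not chosen (ALPS_APP_DEPTH is a guess).
flux mini run -N2 -n2 \
    env -u ALPS_APP_DEPTH OMPI_MCA_schizo=flux OMPI_MCA_pmix=flux ./xthi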
-
From: "Coyne, Michael K": I did get the version updated current for you flux and its scheduler , this is also with openmpi 4.0.4 .. just in case.. still on the machine gadget..it appears i have a lua module problem i will see if i can fix that .
-
Looks like a Lua dependency has not been installed. @herbein, Stephen: Spack probably doesn't capture this? Mike: yeah, it seems best to satisfy them manually at this point.
-
Agreed. The lua-posix module needs to be installed, but this doesn't look like a Spack installation; it looks like a manual install. If it is a Spack installation, then the lua-posix package might not have been loaded.
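If it is a Spack installation, the missing piece could presumably be satisfied along these lines (lua-luaposix is an assumption about the relevant Spack package name):
# Install the Lua POSIX bindings and bring them into the environment
spack install lua-luaposix
spack load lua-luaposix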
-
I "manually" put lua.posix somewhere it could be found:
mcoyne@nid00028:~> flux mini run -N2 -n 2 -o mpi=spectrum env LD_PRELOAD=/usr/projects/consult/mcoyne/craysuse12x/flux-core/gcc73/0.17.0/lib64/flux/libpmi2.so ./xthi
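For reference, one way to make a manually placed lua-posix visible to Flux's shell plugins is to extend Lua's C module search path (the prefix and Lua version below are placeholders, not the actual paths used here):
# Append the default search path with ";;" and point LUA_CPATH at the manual install
export LUA_CPATH="/some/prefix/lib/lua/5.1/?.so;;"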
-
./xthi is still getting into the vendor-specific logic, which should be avoided. Two things.
Mike: are you the in-house OpenMPI expert at LANL, or is there someone else we can consult with?
-
I haven't tried it yet, but I believe that to have OpenMPI use the Flux "pmix" plugin (it isn't really PMIx; it is the bootstrap method) rather than the Cray one, you can set
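(The exact setting was cut off above; judging from the rest of the thread it is presumably OpenMPI's pmix component selector, e.g.:)
# Guess at the truncated setting: force OpenMPI's flux bootstrap ("pmix") component
export OMPI_MCA_pmix=flux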
-
No, I am not; you would want to chat with Howard Pritchard or Dave Shrader, as well as Jen. They are who I turn to for support at LANL.
-
Do you also see MCA pmix? If not, can we compile that in?
lassen708{dahn}22: ompi_info | grep flux
-
From Mike Coyne:
Here are some of the pmix options in the new OpenMPI 4.0.4:
mcoyne@nid00028:~> nm $PMI_LIBRARY | grep PMI_Get
I will poke around and see if I can figure out where it is coming from, but also of note: your libpmi.so now has some of the PMI2_xxx symbols in it. Mike
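As a general check (the library path below is a placeholder), the dynamic symbol table shows which PMI-1 and PMI-2 entry points a given library exports:
# List PMI/PMI2 entry points exported by Flux's PMI library
nm -D /path/to/flux/lib64/flux/libpmi.so | grep -E ' PMI2?_(Init|Finalize|Get|Put|KVS|Fence)'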
-
My guess is OpenMPI currently uses this under Flux: and hits this problem down in the Cray-specific library, when it should use this instead: . I don't see that MCA module in your output, so I think OpenMPI needs to be reconfigured to compile it in. BTW, with NERSC we are also pursuing the use of an MPICH variant instead.
-
FYI -- @SteVwonder logged onto Cori and tried
-
@hppritcha and @sheepherder82: BTW, there is a bug in some versions of OpenMPI with respect to this: open-mpi/ompi#6730. So we need to make sure we use an OpenMPI version that has the fix, or one into which the bug was never introduced. My guess is the Cray environment sets other OMPI environment variable(s) that can confuse the Flux-OpenMPI interaction. We may need to play some more games to see if there are other variables. If we find a magic formula, we can codify that logic into our MPI shell plugin.
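One quick way to see what the Cray programming environment has pre-set (illustrative only) is to dump the relevant variables inside an allocation:
# Which OpenMPI/PMI/ALPS/Cray variables are already exported by the environment?
env | grep -E '^(OMPI_|PMI_|ALPS_|CRAY)' | sort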
-
Okay, fixed my FLUX_PMI_LIBRARY_PATH and it works now:
Looks like setting the OMPI_MCA_schizo environment variable to flux should help you get further.
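Putting the two together, a run along these lines should exercise the Flux bootstrap path (the libpmi.so path is a placeholder for the site-specific Flux install):
# Point OpenMPI's flux component at Flux's PMI library and force the flux "schizo" personality
flux mini run -N2 -n2 \
    env FLUX_PMI_LIBRARY_PATH=/path/to/flux/lib64/flux/libpmi.so OMPI_MCA_schizo=flux ./xthi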
-
This is great news, @hppritcha and @sheepherder82! Could you do our community a big favor and post your recipe for how to do this (step by step, including how you configured OpenMPI and which version) so that other sites hosting Cray machines can replicate your success? Most importantly, this may be needed by NERSC (@brandongc) if we want OpenMPI to be available for our COVID-19 modeling calculations on Cori. Once this recipe is posted here, I will make sure we put it into our high-level docs alongside "how to use Flux on CORAL": https://flux-framework.readthedocs.io/en/latest/coral.html At some point, some performance benchmarking comparing OpenMPI and Cray MPI on Cray machines would also be interesting, but that is longer term. Tagging @koning and @ptbremer once again so that they know they may have two MPI choices on Cray machines. This is a big advancement. Congratulations and thanks, everyone!
-
@hppritcha: sorry -- I think I missed one important point though:
Hmmm, so we need a modification to Flux proper to make this happen? Presumably you didn't make this change... How did you get the MPI job bootstrapped in
-
Dong, could you let me know when you get a patch to apply to my test Flux install on Trinitite?
This will be necessary for meaningful usage on the machine, probably on Cori as well. Mike
-
@sheepherder82: Sounds reasonable.
So could either you or @hppritcha elaborate on the limitations of @hppritcha's solution at #3064 (comment)?
-
In my limited understanding, Cray's fabric is accessed through their uGNI libraries; on a Cray you have to gain access to the fabric with a security token, and this happens in concert with Cray PMI. I believe Howard built his OpenMPI without uGNI, so it would only be able to reach other nodes via TCP/IP connections, which still go over the fabric. As for myself, I built my test OpenMPIs with both uGNI and UCX (which uses the uGNI libraries), and they of course do not run. Howard can explain better.
-
@sheepherder82: thank you for the explanation. I think Howard's solution would work for the current use cases, as jobs running with Flux are expected to be small scale (a single node or a few nodes). It would still be worthwhile documenting that workaround if you are willing. We can then look at the real solution, but this may take some time due to our other priorities.
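For reference, a similar effect can often be had at runtime without a uGNI-free rebuild by pinning OpenMPI to the TCP BTL (a sketch; whether TCP over the Aries fabric performs acceptably for these small jobs is an assumption):
# Keep OpenMPI off uGNI/UCX and use the plain TCP transport instead
flux mini run -N2 -n2 env OMPI_MCA_pml=ob1 OMPI_MCA_btl=self,tcp ./xthi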
-
@sheepherder82 @dongahn you'll want to take the code in OMPI's orte_odls_alps_get_rdma_creds and paste it into either the flux daemon code or into the flux PMI code. It queries apshepherd or slurmstepd for the RDMA credentials, then sets PMI environment variables that the uGNI BTL can pick up. The process doing the querying doesn't have to be a direct child of apshepherd/slurmstepd, as long as there's no fork/exec in the tree that closes all pipes back up to apshepherd/slurmstepd.
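For context, those credentials normally surface in the application environment as PMI_GNI_* variables set by Cray PMI; under a native launcher they can be inspected with something like the following (the variable names follow common Cray PMI conventions and should be treated as assumptions):
# RDMA credential variables the uGNI BTL expects to find in its environment
srun -N1 -n1 env | grep '^PMI_GNI_'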
-
@hppritcha - thanks for the code citation. This needs to get done in Flux at some point. Pondering it a bit, here's a question: from Slurm/ALPS's point of view, Flux is just another parallel program, so it's issued (I guess?) one credential, or a matched set of them, for its lifetime, which might be useful if Flux itself wanted to use the interconnect for its own message passing. But Flux wants to launch a bunch more parallel programs (some running concurrently) using the resources allocated to it. If it has only been allocated the one credential, can it launch more than one job that uses the interconnect? It seems like we are going to need some way to generate/allocate more of them.
-
As a related question, if we use the uGNI BTL, would that subject us to the jobs-per-node limitation of the Cray network that Slurm enforces? [1] If so, we may want to make that a configuration option so that HTC and MTC users can launch many more MPI jobs per node than just 4.
-
TL;DR: use Intel MPI or OpenMPI if you want to run the same executable inside and outside Flux with reasonable performance on Cori. Intel MPI had better performance when run under Flux.
stock MPICH
I made a build of MPICH 3.3.2 on Cori with
Then using that to build OSU benchmarks and run inside flux:
Applications built this way won't work outside of Flux, since they don't know how to talk to the right PMI.
MPICH with OFI
To run inside Flux (or ...) I got decent performance, but am running into hangs with message size > 8192 (pmodels/mpich#4720).
Cray MPI
And for the record, applications built with Cray MPI get much better performance
but, as expected, do not work with Flux.
Intel MPI
There is reasonable performance inside or outside of Flux. I am not seeing the same ... In my case I built the OSU benchmarks with
openmpi
Works inside and outside Flux, with and without env variables.
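For completeness, the usual way to bootstrap Intel-MPI-built binaries under a foreign PMI server such as Flux is to point I_MPI_PMI_LIBRARY at the launcher's PMI-1 library; a sketch, with the library path standing in for the site's Flux install:
# Have Intel MPI bootstrap through Flux's PMI library instead of its default mechanism
flux mini run -N2 -n2 env I_MPI_PMI_LIBRARY=/path/to/flux/lib64/flux/libpmi.so ./osu_latency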
-
tl;dr I have recreated the experiment above by @brandongc and am going to share the results here. I used the (newly updated!) Flux/Cori module file provided by @SteVwonder. Mostly the same results as last time. I abbreviated the long build outputs by putting <lots of output; passed> to show that the build succeeded. The all-to-all latency tests are very weird looking; I'm working out what that might be right now.
Intel MPI
Here's how I built and set them up:
Here's what I did to run the tests. I tried to stay as close to above as possible. I followed the instructions for building found on the UL HPC tutorials website.
And, here's what it looks like inside of Flux.
The amount of latency here seems...off to me. Very different from above. Maybe I made a mistake in the build for these? I also ran the all-to-all exchanges in and out of Flux with just two points and here's what I got.
Cray MPI
As suspected, much better performance. Here's how I built:
This looks more like it should. Here's the output for the Cray latency tests.
And, just to confirm, it fails when you try running it inside of Flux.
OpenMPI
Here's how I built:
Here's what it looked like when run directly on Slurm:
And here's what it looks like on top of Flux. Like the Intel MPI builds, the slow-down here for all-to-all tests looks way higher than my intuition says it should. The all-to-all test was taking so long that I cancelled it.
In short, the updated benchmarks and newer module file provide relatively the same information as @brandongc's run above. The all-to-all latency tests look very off, though.
-
Question: is there a --with-pmi option yet in flux to support an external PMI such as Cray PMI or PMIx? For PMIx it would need to have an accompanying --with-libevent= as well.
Mike
________________________________
From: Hobbs
Sent: Monday, August 2, 2021 9:49:54 AM
Subject: Re: [flux-framework/flux-core] Flux on Cray xc40 (#3064)
tl;dr I have recreated the experiment above by @brandongc<https://github.com/brandongc> and am going to share the results here. I used the (newly updated!) Flux/Cori module file provided by @SteVwonder<https://github.com/SteVwonder>. Mostly the same results as last time.
The all to all latency tests are very weird looking, I'm working out what that might be right now.
Intel MPI
Here's how I built and set them up:
***@***.***:~> module load impi
***@***.***:~> cd /global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> mkdir build.intel
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> cd build.intel/
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> export OSU_VERSION=5.7.1
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> ../src/osu-micro-benchmarks-${OSU_VERSION}/configure CC=mpiicc CXX=mpiicpc CFLAGS=-I$(pwd)/../src/osu-micro-benchmarks-${OSU_VERSION}/util --prefix=$(pwd)
<lots of output; passed>
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> make && make install
<lots of output; passed>
Here's what I did to run the tests. I tried to stay as close to above as possible. I followed the instructions for building found on the UL HPC tutorials website<https://ulhpc-tutorials.readthedocs.io/en/latest/parallel/mpi/OSU_MicroBenchmarks/#now-for-lazy-frustrated-persons>.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> salloc -C haswell -N 2 -t 30 -q interactive
salloc: Granted job allocation 45035302
salloc: Waiting for resource configuration
salloc: Nodes nid000[13-14] are ready for job
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> srun -N 2 -n 64 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 127.27 117.17 136.51 1000
2 150.56 119.61 180.92 1000
4 152.60 122.23 181.86 1000
8 157.69 126.57 186.97 1000
16 181.82 145.38 216.63 1000
32 197.31 158.86 235.31 1000
64 238.90 190.58 285.65 1000
128 361.25 292.75 429.40 1000
256 692.63 539.10 844.76 1000
512 312.30 308.60 315.20 1000
1024 428.57 424.47 431.45 1000
2048 654.69 650.27 657.29 1000
4096 1129.54 1125.79 1132.02 1000
8192 2080.79 2067.84 2085.96 1000
16384 4055.81 4014.48 4085.02 100
32768 7351.32 7304.75 7385.45 100
65536 13405.11 13294.38 13528.11 100
131072 31475.43 31475.10 31475.67 100
262144 62905.22 62904.77 62905.62 100
524288 125807.64 125806.79 125808.28 100
1048576 251492.44 251491.42 251493.43 100
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> srun -N 2 -n 2 -c 2 --cpu-bind=cores ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.7.1
# Size Latency (us)
0 1.45
1 1.43
2 1.42
4 1.42
8 1.42
16 1.42
32 1.42
64 1.43
128 1.47
256 1.49
512 1.52
1024 1.72
2048 2.04
4096 2.60
8192 3.85
16384 7.73
32768 9.21
65536 12.63
131072 19.23
262144 32.28
524288 59.80
1048576 111.51
2097152 218.56
4194304 433.83
And, here's what it looks like inside of Flux.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> module use /global/common/software/flux/modulefiles/ && module load czmq jansson python flux
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> srun -N 2 -n 2 --pty flux start
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> flux mini run -N 2 -n 64 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 246.87 229.80 260.75 1000
2 265.64 229.09 315.68 1000
4 269.68 232.48 322.24 1000
8 222.04 185.47 267.40 1000
16 192.99 154.85 235.39 1000
32 211.95 168.23 261.08 1000
64 263.65 208.40 328.15 1000
128 397.25 316.77 495.10 1000
256 683.95 527.79 849.17 1000
512 373.94 367.32 378.88 1000
1024 461.51 454.27 466.59 1000
2048 656.11 650.55 660.22 1000
4096 1086.63 1070.57 1093.00 1000
8192 2202.86 2187.05 2217.07 1000
16384 4360.03 4234.52 4407.83 100
32768 7485.17 7354.17 7561.44 100
65536 14794.20 14661.07 14871.55 100
131072 33390.45 33389.98 33391.02 100
262144 66836.87 66836.33 66837.56 100
524288 133499.55 133498.91 133500.33 100
1048576 267122.19 267120.98 267123.71 100
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> flux mini run -N 2 -n 2 ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.7.1
# Size Latency (us)
0 1.85
1 1.84
2 1.81
4 1.81
8 1.80
16 1.79
32 1.82
64 1.82
128 1.88
256 1.92
512 1.97
1024 2.38
2048 3.00
4096 3.83
8192 5.76
16384 7.89
32768 9.87
65536 13.71
131072 21.81
262144 36.93
524288 67.99
1048576 129.84
2097152 256.87
4194304 514.47
The amount of latency here seems...off to me. Very different from above. Maybe I made a mistake in the build for these? I also ran the all-to-all exchanges in and out of Flux with just two points and here's what I got.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> flux mini run -N 2 -n 2 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 6.52 6.51 6.54 1000
2 6.42 6.42 6.42 1000
4 6.37 6.37 6.38 1000
8 6.34 6.33 6.36 1000
16 6.37 6.37 6.38 1000
32 6.38 6.36 6.40 1000
64 6.39 6.39 6.40 1000
128 6.98 6.96 7.00 1000
256 7.12 7.12 7.13 1000
512 7.37 7.34 7.39 1000
1024 6.10 5.96 6.24 1000
2048 6.77 6.56 6.98 1000
4096 8.48 8.37 8.60 1000
8192 9.63 9.52 9.73 1000
16384 13.60 13.51 13.69 100
32768 16.55 16.47 16.62 100
65536 23.08 23.00 23.15 100
131072 58.09 58.03 58.16 100
262144 110.84 110.74 110.95 100
524288 215.80 215.77 215.83 100
1048576 429.07 428.85 429.29 100
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> exit
exit
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.intel> srun -N 2 -n 2 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 5.97 5.75 6.19 1000
2 5.94 5.71 6.17 1000
4 5.99 5.78 6.19 1000
8 5.94 5.71 6.17 1000
16 5.94 5.71 6.17 1000
32 6.04 5.80 6.28 1000
64 6.01 5.72 6.30 1000
128 6.61 5.96 7.27 1000
256 6.67 5.99 7.34 1000
512 6.87 6.14 7.59 1000
1024 5.69 5.28 6.10 1000
2048 6.32 5.86 6.79 1000
4096 7.73 7.16 8.30 1000
8192 8.59 7.97 9.20 1000
16384 12.62 12.48 12.76 100
32768 16.04 15.83 16.25 100
65536 21.75 21.46 22.04 100
131072 53.67 53.56 53.79 100
262144 103.33 103.20 103.46 100
524288 200.93 200.92 200.93 100
1048576 419.53 419.39 419.67 100
Cray MPI
As suspected, much better performance. Here's how I built:
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> mkdir build.cray
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> cd build.cray
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> module unload impi
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> module list -l
- Package -----------------------------+- Versions -+- Last mod. ------
Currently Loaded Modulefiles:
modules/3.2.11.4 2019/10/23 20:26:46
altd/2.0 2021/07/30 23:53:10
darshan/3.2.1 2021/07/30 23:53:10
craype-network-aries 2020/10/16 18:38:06
intel/19.0.3.199 2020/09/29 21:01:29
craype/2.6.2 2020/11/30 15:44:20
cray-libsci/19.06.1 2020/10/16 17:50:09
udreg/2.3.2-7.0.1.1_3.52__g8175d3d.ari 2021/04/20 17:52:07
ugni/6.0.14.0-7.0.1.1_7.54__ge78e5b0.ar 2021/04/20 17:51:14
pmi/5.0.14 2020/10/16 17:19:55
dmapp/7.1.1-7.0.1.1_4.64__g38cf134.ari 2021/04/20 18:01:08
gni-headers/5.0.12.0-7.0.1.1_6.43__g3b1 2021/04/22 20:14:14
xpmem/2.2.20-7.0.1.1_4.23__g0475745.ari 2021/04/20 17:46:10
job/2.2.4-7.0.1.1_3.50__g36b56f4.ari 2021/04/20 17:49:46
dvs/2.12_2.2.167-7.0.1.1_17.6__ge473d3a 2021/04/20 18:13:26
alps/6.6.58-7.0.1.1_6.22__g437d88db.ari 2021/04/20 17:59:21
rca/2.2.20-7.0.1.1_4.65__g8e3fb5b.ari 2021/04/20 17:53:47
atp/2.1.3 2020/10/16 17:20:09
PrgEnv-intel/6.0.5 2020/10/16 18:46:43
craype-haswell 2020/10/16 18:38:06
cray-mpich/7.7.10 2020/10/16 17:51:01
craype-hugepages2M 2020/10/16 18:38:06
python/3.8-anaconda-2020.11 2021/07/30 23:53:12
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> ../src/osu-micro-benchmarks-${OSU_VERSION}/configure CC=cc CXX=CC CFLAGS=-I$(pwd)/../src/osu-micro-benchmarks-${OSU_VERSION}/util --prefix=$(pwd)
<passed>
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> make && make install
<passed>
This looks more like it should. Here's the output for the Cray latency tests.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> salloc -C haswell -N 2 -t 30 -q interactive
salloc: Pending job allocation 45036063
salloc: job 45036063 queued and waiting for resources
salloc: job 45036063 has been allocated resources
salloc: Granted job allocation 45036063
salloc: Waiting for resource configuration
salloc: Nodes nid000[18-19] are ready for job
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> srun -N 2 -n 64 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 18.05 15.98 20.88 1000
2 18.36 16.26 21.28 1000
4 18.19 16.20 20.98 1000
8 18.32 16.39 20.99 1000
16 19.20 16.79 22.56 1000
32 22.03 19.35 25.54 1000
64 29.24 25.27 33.90 1000
128 58.27 58.16 58.58 1000
256 77.97 77.85 78.26 1000
512 105.43 105.34 105.71 1000
1024 163.24 163.16 163.50 1000
2048 306.03 305.88 306.29 1000
4096 586.81 586.73 587.02 1000
8192 1165.13 1165.02 1165.47 1000
16384 2310.44 2310.23 2310.84 100
32768 4881.63 4881.34 4882.35 100
65536 10276.84 10276.32 10277.74 100
131072 21090.14 21089.68 21091.14 100
262144 42743.78 42743.23 42744.97 100
524288 86157.61 86157.09 86158.07 100
1048576 183259.94 183259.63 183260.61 100
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> srun -N 2 -n 2 -c 2 --cpu-bind=cores ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.7.1
# Size Latency (us)
0 1.28
1 1.13
2 1.12
4 1.12
8 1.12
16 1.13
32 1.12
64 1.12
128 1.13
256 1.14
512 1.17
1024 1.38
2048 1.69
4096 2.24
8192 4.12
16384 4.98
32768 6.64
65536 9.96
131072 16.54
262144 29.84
524288 56.32
1048576 109.00
2097152 215.70
4194304 427.55
And, just to confirm, it fails when you try running it inside of Flux.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> module use /global/common/software/flux/modulefiles/ && module load czmq jansson python flux
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> srun -N 2 -n 2 --pty flux start
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> flux mini run -N 2 -n 64 ./mpi/collective/osu_alltoall -f
<lots of errors>
./mpi/collective/osu_alltoall: symbol lookup error: /opt/cray/pe/lib64/libmpich_intel.so.3: undefined symbol: PMI2_Init
flux-job: task(s) exited with exit code 127
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.cray> flux mini run -N 2 -n 2 -c 2 ./mpi/pt2pt/osu_latency
./mpi/pt2pt/osu_latency: symbol lookup error: /opt/cray/pe/lib64/libmpich_intel.so.3: undefined symbol: PMI2_Init
./mpi/pt2pt/osu_latency: symbol lookup error: /opt/cray/pe/lib64/libmpich_intel.so.3: undefined symbol: PMI2_Init
flux-job: task(s) exited with exit code 127
OpenMPI
Here's how I built:
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> mkdir build.ompi
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> module load openmpi
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> ../src/osu-micro-benchmarks-${OSU_VERSION}/configure CC=mpicc CFLAGS=-I$(pwd)/../src/osu-micro-benchmarks-${OSU_VERSION}/util --prefix=$(pwd)
-bash: ../src/osu-micro-benchmarks-5.7.1/configure: No such file or directory
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks> cd build.ompi
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> ../src/osu-micro-benchmarks-${OSU_VERSION}/configure CC=mpicc CFLAGS=-I$(pwd)/../src/osu-micro-benchmarks-${OSU_VERSION}/util --prefix=$(pwd)
<passed>
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> make && make install
<passed>
Here's what it looked like when run directly on Slurm:
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> salloc -C haswell -N 2 -q interactive -t 30
salloc: Granted job allocation 45036907
salloc: Waiting for resource configuration
salloc: Nodes nid000[45-46] are ready for job
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> srun -N 2 -n 64 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 20.75 19.15 23.51 1000
2 20.85 19.20 23.94 1000
4 21.22 18.99 25.00 1000
8 21.23 18.94 24.44 1000
16 25.88 22.86 29.08 1000
32 30.28 26.73 34.05 1000
64 35.51 32.07 38.87 1000
128 59.05 54.86 62.54 1000
256 133.53 125.81 142.25 1000
512 161.15 151.61 169.18 1000
1024 235.51 222.80 248.29 1000
2048 366.17 342.27 379.70 1000
4096 646.16 642.75 649.18 1000
8192 2002.75 1990.74 2021.37 1000
16384 3211.26 3193.98 3234.23 100
32768 5879.79 5857.57 5911.22 100
65536 11646.01 11606.81 11697.40 100
131072 22971.81 22883.80 23056.10 100
262144 45473.89 45308.81 45630.52 100
524288 90481.08 90167.19 90827.41 100
1048576 180587.80 179855.43 181284.43 100
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> srun -N 2 -n 2 -c 2 --cpu-bind=cores ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.7.1
# Size Latency (us)
0 1.20
1 1.23
2 1.22
4 1.23
8 1.22
16 1.22
32 1.22
64 1.22
128 1.22
256 1.23
512 1.27
1024 1.48
2048 1.94
4096 2.31
8192 5.69
16384 6.53
32768 8.20
65536 11.55
131072 18.14
262144 31.33
524288 57.79
1048576 110.42
2097152 217.66
4194304 429.98
And here's what it looks like on top of Flux. Like the Intel MPI builds, the slow-down here for all-to-all tests looks way higher than my intuition says it should. The all-to-all test was taking so long that I cancelled it.
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> module use /global/common/software/flux/modulefiles/
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> module load czmq jansson python flux
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> srun -N 2 -n 2 --pty flux start
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> flux mini run -N 2 -n 64 ./mpi/collective/osu_alltoall -f
# OSU MPI All-to-All Personalized Exchange Latency Test v5.7.1
# Size Avg Latency(us) Min Latency(us) Max Latency(us) Iterations
1 341.68 256.81 411.92 1000
2 307.34 236.89 370.90 1000
4 310.18 236.01 377.80 1000
8 329.68 264.15 398.67 1000
16 331.42 252.27 387.09 1000
32 319.66 245.34 371.27 1000
64 321.67 246.83 375.39 1000
128 338.49 249.34 412.25 1000
^Cflux-job: one more ctrl-C within 2s to cancel or ctrl-Z to detach
^C163.592s: job.exception type=cancel severity=0 interrupted by ctrl-C
flux-job: task(s) exited with exit code 143
***@***.***:/global/cscratch1/sd/hobbs/tutorials/OSU-MicroBenchmarks/build.ompi> flux mini run -N 2 -n 2 -c 2 ./mpi/pt2pt/osu_latency
# OSU MPI Latency Test v5.7.1
# Size Latency (us)
0 9.86
1 9.88
2 9.89
4 9.84
8 10.08
16 9.88
32 9.96
64 10.01
128 10.04
256 10.35
512 10.26
1024 10.38
2048 10.59
4096 12.57
8192 13.74
16384 16.72
32768 24.69
65536 57.33
131072 70.89
262144 94.25
524288 140.66
1048576 226.38
2097152 399.45
4194304 739.48
In short, the updated benchmarks and newer module file provide relatively the same information as @brandongc<https://github.com/brandongc> 's run above. The all-to-all latency tests look very off, though.
-
We have users who want to run OpenMPI on Cray systems at LANL and NERSC. This ticket was opened to track the status of this support.
Current status: