Heterogeneous cluster with different node types
In the context of this Linpack implementation, a heterogeneous system is any system in which not all nodes provide the same compute power. Such systems are supported since version 1.1.0.
One obvious example of such a system is LOEWE-CSC: it combines nodes equipped with two CPUs and an AMD 5870 GPU with nodes powered by four CPUs. Other examples are systems with different GPU or CPU models or clock speeds.
In Linpack the matrix is split into blocks of size NB. These blocks are then distributed equally over all processes, as also described on netlib.org. If one process is slower than the others, the performance of all processes drops, as they have to wait for the slow process at synchronization points.
The problem is solved by handing slower processes fewer blocks to calculate. This way the slow processes finish at the same time as the faster processes. As no process has to wait, the faster processes can contribute their full performance again.
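The idea of weighting the block distribution by node speed can be sketched as follows. This is an illustrative model, not the actual HPL-GPU code: it hands out a fixed number of blocks in proportion to each process column's relative speed, distributing the rounding leftovers to the columns with the largest fractional shares.

```python
# Sketch (not HPL-GPU code): distribute matrix blocks over process
# columns in proportion to each column's relative speed.
def distribute_blocks(n_blocks, col_speeds):
    """Assign n_blocks to process columns weighted by relative speed."""
    total = sum(col_speeds)
    # Ideal (fractional) share per column.
    shares = [n_blocks * s / total for s in col_speeds]
    counts = [int(x) for x in shares]
    # Hand remaining blocks to the columns with the largest remainders.
    by_remainder = sorted(range(len(shares)),
                          key=lambda i: shares[i] - counts[i], reverse=True)
    for i in by_remainder[:n_blocks - sum(counts)]:
        counts[i] += 1
    return counts

# Two full-speed columns and one column at 0.66 relative speed:
print(distribute_blocks(100, [1.0, 1.0, 0.66]))  # -> [38, 37, 25]
```

With this weighting, the slow column receives roughly two thirds as many blocks as each fast column, so all columns finish at about the same time.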
To keep the simple PxQ grid of processes, all processes in one process column are given the same number of blocks. Therefore the slowest member in a process column dictates that column's speed.
As of version 1.1.0, a file node-perf.dat is expected next to HPL.dat. On a heterogeneous system this file needs to contain the relative performance of each node, with values in the range of 0.01 to 1.0. One way to get these numbers is to perform single-node runs and normalize the performance results by the fastest node. If the file does not contain any data, a homogeneous system is assumed.
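The normalization step can be sketched as follows. The host names and Gflops numbers are illustrative, not from a real run; note that plain rounding of 2/3 gives 0.67, while this page rounds it down to 0.66 — the small difference is irrelevant for the block distribution.

```python
# Sketch: derive node-perf.dat entries from measured single-node Gflops.
# Host names and numbers are illustrative, not from a real benchmark.
measured = {"fast1": 600.0, "fast2": 600.0, "slow1": 400.0}

best = max(measured.values())
for host, gflops in measured.items():
    # Relative performance, clamped to the accepted range [0.01, 1.0].
    rel = max(0.01, min(1.0, gflops / best))
    print(f"{host} {rel:.2f}")
```

Running this prints one `host value` pair per line, which is the layout node-perf.dat expects in the example below.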
As the slowest node in a process column dictates that column's performance, you should try to find values for P and Q in HPL.dat that allow columns to be filled with nodes of one speed. Usually no perfect match is possible, but you can still try to minimize the number of fast nodes required to fill up slow columns. Sometimes it even makes sense to remove a few of the slow nodes to get the best overall performance.
Finally you have to sort your hostfile such that each process is placed at the appropriate position in the process grid. It is usually helpful to choose a column-major process grid in HPL.dat, as subsequent nodes will then be placed in the same column.
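With a column-major mapping, consecutive hostfile entries fill one process column of P ranks before moving to the next. The following sketch (using the six-node example on this page) shows which hostfile positions end up in which process column:

```python
# Sketch: column-major PMAP places consecutive ranks down each process
# column, so hostfile entries c*P .. (c+1)*P-1 form process column c.
P, Q = 2, 3  # grid dimensions as set in HPL.dat
hosts = ["fast1", "fast2", "fast3", "fast4", "slow1", "slow2"]

columns = [hosts[c * P:(c + 1) * P] for c in range(Q)]
print(columns)
# -> [['fast1', 'fast2'], ['fast3', 'fast4'], ['slow1', 'slow2']]
```

Each inner list is one process column; the sorted hostfile keeps nodes of equal speed together.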
This example assumes six hosts: four with a single-node performance of 600 Gflops and two with a single-node performance of 400 Gflops. node-perf.dat should look like this:

```
fast1 1.0
fast2 1.0
fast3 1.0
fast4 1.0
slow1 0.66
slow2 0.66
```

The slow nodes achieve 2/3 of the performance of the fast nodes, so the ratio is 0.66 to 1.0.
HPL.dat should look like this (only relevant lines shown):

```
1            PMAP process mapping (0=Row-,1=Column-major)
2            Ps
3            Qs
```
The hostfile should look like this:

```
fast1
fast2
fast3
fast4
slow1
slow2
```
In this case, process columns 1 and 2 are composed of fast nodes, while process column 3 consists of slow nodes.
Note that in principle any combination that produces columns of consistent speed is valid: the two slow nodes could also be in positions one and two, or in positions three and four, of the hostfile. The following hostfile should be avoided, as it creates two slow process columns and wastes the performance of nodes fast3 and fast4:

```
fast1
fast2
fast3
slow1
fast4
slow2
```
I.e., in this setup only process column 1 would run at full performance, while all four nodes of process columns 2 and 3 would run at the speed of the slow nodes.
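The cost of the bad ordering can be quantified with a small back-of-the-envelope model: each process column runs at the speed of its slowest member, so the effective throughput is the slowest speed in each column times the number of nodes in it, summed over columns. The numbers follow the 600/400 Gflops example above.

```python
# Sketch: effective aggregate throughput when every node in a process
# column is throttled to the speed of the column's slowest node.
def effective_gflops(columns):
    return sum(min(col) * len(col) for col in columns)

good = [[600, 600], [600, 600], [400, 400]]  # sorted hostfile
bad  = [[600, 600], [600, 400], [600, 400]]  # unsorted hostfile
print(effective_gflops(good), effective_gflops(bad))  # -> 3200 2800
```

The sorted hostfile yields 3200 Gflops of usable performance versus 2800 Gflops for the unsorted one, since two fast nodes in the mixed columns are throttled to 400 Gflops each.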