
Heterogeneous cluster with different node types


Heterogeneous Systems

In the context of this linpack implementation a heterogeneous system is any system in which not all nodes can provide the same compute power. Such systems have been supported since version 1.1.0.

One obvious example of such a system is LOEWE-CSC: it combines nodes equipped with two CPUs and an AMD 5870 with nodes powered by four CPUs. Other examples are systems with different GPU or CPU versions or clock speeds.

How heterogeneous systems are handled

In linpack the matrix is split into blocks of size NB. These blocks are then equally distributed over all processes. This is also described on netlib.org. If one process is slower than the others, the performance of all processes drops, as they have to wait for the slow process at synchronization points.

The problem is solved by handing slower processes fewer blocks to calculate. This way the slow processes can finish at the same time as the faster ones. As no process has to wait, the faster processes can contribute their full performance again.

To keep the simple PxQ grid of processes, all processes in one process column are given the same number of blocks. Therefore the slowest member of a process column dictates that column's speed.
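
To illustrate the idea, here is a minimal sketch (Python, purely illustrative and not part of the package): it assumes the relative node speeds per process column are known and hands out block columns proportionally to the slowest member of each column.

# Illustrative sketch: distribute block columns over process columns,
# proportionally to the slowest node in each column.
def column_block_shares(grid, total_blocks):
    """grid: one list of relative node speeds per process column.
    Returns the number of block columns handed to each process column."""
    # The slowest member dictates the speed of the whole process column.
    col_speeds = [min(col) for col in grid]
    total_speed = sum(col_speeds)
    # Hand out blocks proportionally; rounding is simplified here.
    return [round(total_blocks * s / total_speed) for s in col_speeds]

# Two fast columns (relative speed 1.0) and one slow column (0.66),
# with two processes per column (P = 2, Q = 3):
print(column_block_shares([[1.0, 1.0], [1.0, 1.0], [0.66, 0.66]], 100))
# -> roughly [38, 38, 25]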

Configuring your run for a heterogeneous system

As of version 1.1.0 a file node-perf.dat is located next to HPL.dat. On a heterogeneous system this file needs to contain the relative performance of each node. The values need to be in the range of 0.01 to 1.0. One way to obtain these numbers is to perform single-node runs and normalize the performance results by the fastest node. If the file does not contain any data, a homogeneous system is assumed.
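
As a minimal sketch of how such a file could be generated (Python, hypothetical helper that is not shipped with the package; the measured Gflops numbers are placeholders):

# Hypothetical helper: build node-perf.dat from measured single-node results.
# Format: one "hostname value" pair per line, values between 0.01 and 1.0.
measured_gflops = {
    "fast1": 600.0, "fast2": 600.0, "fast3": 600.0, "fast4": 600.0,
    "slow1": 400.0, "slow2": 400.0,
}
fastest = max(measured_gflops.values())
with open("node-perf.dat", "w") as f:
    for host, gflops in measured_gflops.items():
        # Normalize by the fastest node and clamp to the allowed range.
        rel = max(0.01, min(1.0, gflops / fastest))
        f.write(f"{host} {rel:.2f}\n")
# Note: 400/600 rounds to 0.67 here; the example below uses 0.66. Both work.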

As the slowest node in a process column dictates that column's performance, you should try to find values for P and Q in HPL.dat that allow you to fill each column with nodes of one speed. Usually no perfect match is possible, but you can still try to minimize the number of fast nodes required to fill up slow columns. Sometimes it even makes sense to remove a few of the slow nodes to get the best overall performance.

Finally you have to sort your hostfile such that each process is placed at the appropriate position in the process grid. For this it is usually helpful to choose a column-major process mapping in HPL.dat. That way subsequent nodes will be placed in the same column.
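
As a rough illustration of the column-major placement (Python, illustrative only; the actual placement is determined by the PMAP setting in HPL.dat and the order of the hostfile), the i-th host of the hostfile ends up in process column i // P. The host names are taken from the example below.

# Illustrative: column-major placement of hosts into a PxQ process grid.
def process_column(host_index, P):
    # With column-major mapping, consecutive hosts fill one column
    # completely before the next column is started.
    return host_index // P

hosts = ["fast1", "fast2", "fast3", "fast4", "slow1", "slow2"]
for i, host in enumerate(hosts):
    print(host, "-> process column", process_column(i, P=2) + 1)
# fast1/fast2 -> column 1, fast3/fast4 -> column 2, slow1/slow2 -> column 3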

Example

This example assumes six hosts, four with a single-node performance of 600 Gflops and two with a single-node performance of 400 Gflops.

node-perf.dat should look like this:

fast1 1.0
fast2 1.0
fast3 1.0
fast4 1.0
slow1 0.66
slow2 0.66

The slow nodes achieve 2/3 of the performance of the fast nodes, so the ratio is 0.66 to 1.0.

HPL.dat should look like this (only relevant lines shown):

1            PMAP process mapping (0=Row-,1=Column-major)
2            Ps
3            Qs

The hostfile should look like this:

fast1
fast2
fast3
fast4
slow1
slow2

In this case, process columns 1 and 2 are composed of fast nodes, while process column 3 consists of slow nodes.

Note that in principle any combination that produces columns of consistent speed is valid. The two slow nodes could also be in positions one and two, or in positions three and four of the hostfile. The following hostfile should be avoided, as it creates two slow process columns and the performance of nodes fast3 and fast4 would be wasted.

fast1
fast2
fast3
slow1
fast4
slow2

That is, in this setup only process column 1 would run at full performance, while all four nodes of process columns 2 and 3 would run at the performance of the slow nodes.
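
Assuming blocks are distributed proportionally to each column's speed and ignoring all overheads, a back-of-the-envelope estimate (Python, illustrative only) of the two hostfiles looks like this:

# Rough estimate: every node in a process column effectively runs at the
# speed of the slowest node in that column (values in Gflops, overheads ignored).
def estimated_total(columns):
    return sum(len(col) * min(col) for col in columns)

good = [[600, 600], [600, 600], [400, 400]]  # fast1/fast2 | fast3/fast4 | slow1/slow2
bad  = [[600, 600], [600, 400], [600, 400]]  # fast1/fast2 | fast3/slow1 | fast4/slow2
print(estimated_total(good))  # 3200
print(estimated_total(bad))   # 2800

Under these simplifying assumptions the badly sorted hostfile loses roughly 400 Gflops in this example.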