
CALDGEMM Performance Optimization Guide (CAL OpenCL without GPU_C)

David Rohr edited this page May 15, 2015 · 2 revisions

Performance Optimization Guide:

To achieve good performance, multiple steps should be performed:

  0. Update Settings for the GPU used.
  1. Optimize Kernel Performance.
  2. Optimize System Performance of GPU-DGEMM (including DMA-transfer, post-/ preprocessing).
  3. Optimize Combined GPU/CPU Performance.
  4. Optimize multi-GPU performance.

If you have multiple GPUs, it is better to go through the following steps with a single GPU first and to try multi-GPU afterwards (step 4). Until then, add -y 0 to each of the following command lines.

In principle, you should try to achieve the following performance: The kernel performance dictates the final performance. Kernel performance is usually 80%-90% of the theoretical peak performance of the GPU. As a rough guide, the CAL kernel should achieve 574 GFLOPS on a 5870 GPU, 623 GFLOPS on a 6970 GPU, and 805 GFLOPS on a 7970 GPU.

Going from single-GPU kernel performance to single-GPU system performance, you should expect a loss of 1%-3%. Scaling to multi-GPU should be almost perfect for 2 GPUs (less than 2% loss), and for 4 GPUs you should expect a loss of less than 4%.

If you then go to HPL, a rough guideline is that HPL should achieve 7%-15% less GFLOPS than DGEMM, while multi-node HPL will encounter an additional 5%-10% loss.
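The loss figures above can be chained into a rough performance budget. The following Python sketch only does that arithmetic. The function name, the dictionary keys, and the assumption that the quoted percentage losses simply multiply are illustrative and not part of CALDGEMM.

```python
# Upper bound on the multi-GPU scaling loss quoted in this guide.
MAX_SCALING_LOSS = {1: 0.0, 2: 0.02, 4: 0.04}


def performance_budget(kernel_gflops, num_gpus=1):
    """Return rough (low, high) GFLOPS estimates per node for each stage.

    Assumption: the percentage losses quoted in the guide multiply.
    """
    if num_gpus not in MAX_SCALING_LOSS:
        raise ValueError("the guide only quotes scaling losses for 2 and 4 GPUs")
    # Kernel -> single-GPU system DGEMM: 1%-3% loss.
    low, high = kernel_gflops * 0.97, kernel_gflops * 0.99
    # Multi-GPU scaling: between no loss and the quoted upper bound.
    low = low * num_gpus * (1.0 - MAX_SCALING_LOSS[num_gpus])
    high = high * num_gpus
    # HPL: 7%-15% below DGEMM.
    hpl = (low * 0.85, high * 0.93)
    # Multi-node HPL: an additional 5%-10% loss.
    hpl_multi_node = (hpl[0] * 0.90, hpl[1] * 0.95)
    return {"dgemm": (low, high), "hpl": hpl, "hpl_multi_node": hpl_multi_node}


# Example: a 7970 with the 805 GFLOPS kernel figure quoted above.
print(performance_budget(805.0))
```

If your measurements fall clearly below the lower bound of a stage, that stage is the one to tune first.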

The following procedure is mostly for CAL. Additional suggestions for OpenCL and CUDA follow later. Still, many aspects of the CAL guide are also valid for OpenCL / CUDA.

Some general remarks at the beginning: CALDGEMM by default uses pinned host memory, which cannot be swapped. It might be necessary to set ulimits accordingly: ulimit -m unlimited; ulimit -l unlimited; ulimit -v unlimited;
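To check quickly whether the limits are already lifted in the current environment, a small Python sketch like the following can help. It uses the standard `resource` module (Unix only). The mapping of ulimit flags to resource constants is the usual one on Linux, and the function name is just an illustration.

```python
import resource

# ulimit flag -> corresponding resource limit (usual Linux mapping).
LIMITS = {
    "-l": resource.RLIMIT_MEMLOCK,  # locked (pinned) memory
    "-v": resource.RLIMIT_AS,       # virtual address space
    "-m": resource.RLIMIT_RSS,      # resident set size
}


def unlimited_limits():
    """Return {ulimit flag: True if the soft limit is unlimited}."""
    result = {}
    for flag, limit in LIMITS.items():
        soft, _hard = resource.getrlimit(limit)
        result[flag] = soft == resource.RLIM_INFINITY
    return result


for flag, ok in unlimited_limits().items():
    print(f"ulimit {flag}: {'unlimited' if ok else 'limited'}")
```

Run it from the same shell that will start the benchmark, since the limits are inherited per process.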

Some GPUs throttle themselves during DGEMM execution. For AMD GPUs, you can use the atitweak Python utility to modify the GPU PowerTune feature (atitweak -p) to overcome this. Keep in mind that this might run the GPU out of spec, so it can damage your hardware if done incorrectly. This is at your own risk. You should at least monitor the temperature constantly when doing so.

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Step 0: Different GPUs require different settings for optimal performance.

In particular, the splitting ratio calculation may not work correctly. Always keep an eye on the GPU time and the CPU time. If one of them is higher than the other, adjust the -j ratio. This is also relevant for the 5000 series due to different clock speeds.

CALDGEMM comes with assembler GPU DGEMM kernels for the CAL runtime. Depending on the particular GPU used, the options in caldgemm_config.h should be adjusted for optimal DGEMM performance.

For the 5xxx series, the following is suggested:

  • Enable exactly CALDGEMM_TRANSPOSED_B, CALDGEMM_44 as DGEMM kernel settings in caldgemm_config.h
  • For the 5xxx series, h can be chosen almost arbitrarily, but a value of at least 1024 is suggested.
  • 5xxx works well both with -o g and -o c

For the 6xxx series the following configuration is suggested:

  • Enable CALDGEMM_TRANSPOSED_B, CALDGEMM_44.
  • It is best to enable the CALDGEMM_44_BT_64 and CALDGEMM_44_BT_64_CONVERT options in caldgemm_config.h.
  • h = 2304 performs best.
  • Use -o c in any case! Make sure that implicit driver sync works (-I), or use the DMA fetch queue (-^).

For the 7xxx series, please enable the following settings (default):

  • CALDGEMM_TRANSPOSED_A, CALDGEMM_44, CALDGEMM_DUAL_ENTRY, CALDGEMM_LATE_EXIT_CONDITION, CALDGEMM_SHIFT_TEXTURE 1
  • h = 3072 works well.
  • -o g usually works better than -o c

In general, it is no longer suggested to use CAL. OpenCL and CUDA are the better options. OpenCL comes only with a reference kernel, but it supports loading an optimized kernel from a 3rd party library. This is the suggested way. CUDA so far also comes only with a reference kernel; this should be changed to CUBLAS in the future.

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Step 1: The kernel performance should be good out of the box. Most kernel parameters cannot be changed on the command line, only at compile time in caldgemm_config.h. Usually the parameters are fine as they are.

Run ./dgemm_bench -v to check the kernel performance. The kernel will usually write its output to host memory.

Some systems have poor DMA performance. You can try to direct the output to GPU memory instead and see whether kernel performance gets better. Run ./dgemm_bench -o g -v for this. If the second option is better, always use -o g. For OpenCL and for the 7xxx AMD series and above, -o g is suggested in general.

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Step 2: Optimize System performance

First check whether DMA is working well. Run ./dgemm_bench -o g -v and look at the copy speeds from and to the device. (-o g is required here to measure PCIe speed.) Anything above 5 GB/s should be fine. If the speed is lower, the GPU threads are probably pinned to the wrong CPU core on a NUMA architecture. You can change the CPU core with the -t option. Try ./dgemm_bench -o g -v -t 0, ./dgemm_bench -o g -v -t 1, etc. to find the best CPU core. Using a CPU core other than zero can lead to problems when using combined GPU/CPU DGEMM.
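The core search described above can be organized with a small Python sketch. It only builds the command lines from this guide and compares copy speeds that you read off the benchmark output yourself. It does not parse dgemm_bench output, and the function names as well as the choice of four candidate cores are illustrative assumptions.

```python
def core_sweep_commands(cores, binary="./dgemm_bench"):
    """Build one DMA test command line per candidate CPU core."""
    return [f"{binary} -o g -v -t {core}" for core in cores]


def pick_core(copy_speed_gbs, threshold_gbs=5.0):
    """Pick the core with the highest copy speed.

    copy_speed_gbs maps CPU core -> copy speed in GB/s, noted down by hand
    (for instance the slower of the two transfer directions).
    Returns (best_core, is_fast_enough).
    """
    best = max(copy_speed_gbs, key=copy_speed_gbs.get)
    return best, copy_speed_gbs[best] > threshold_gbs


# Print the commands to try for cores 0..3 (hypothetical core count).
for command in core_sweep_commands(range(4)):
    print(command)
```

After running the printed commands, pass your own readings to `pick_core` to see which core to use with -t and whether it clears the 5 GB/s mark.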

Test your system GPU DGEMM performance. The parameters you definitely want are -z (multithreading), -p (memory interleaving), and -A (asynchronous DMA transfer). Run ./dgemm_bench -z -p -A -m 40960 -n 40960

This part is only relevant if you found in Step 1 that you want to use -o g: There is a DMA problem in the AMD driver that can be overcome by a workaround. Usually CALDGEMM detects automatically whether the workaround can and must be applied. Still, you should recheck by hand. You can force the workaround using the -I parameter. Rerun the above test: ./dgemm_bench -z -p -A -m 40960 -n 40960 -o g -I. If the performance is better, you have to check whether the results are correct. The workaround only works with some drivers and might produce wrong results with others. To verify, run:

./dgemm_bench -z -p -A -m 40960 -n 40960 -o g -I -e

This part is only relevant if you found you want to use -o c in Step 1: Use the AMD driver hack. Apply the hack and then use the -B parameter. Run ./dgemm_bench -z -p -A -B -m 40960 -n 40960. You'll see a warning if the hack was not applied correctly. Performance is not necessarily better than without -B but the CPU load is decreased. You'll see the difference when using combined CPU/GPU DGEMM.

If you have an Ivy Bridge system with the CAL runtime, add the -. option.

On Intel systems, you can usually restrict CALDGEMM to one output thread with the -= 1 option.

If you have much more GPU power than CPU power, -J 1 is suggested, and you might want to disable dynamic CPU/GPU scheduling (no -s).

You can interleave GPUs among NUMA nodes with the -/ setting (see quad-GPU 7xxx series example below).

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Step 3: Optimize Overall performance.

First check the possible CPU performance: ./dgemm_bench -c -z -p -m 40960 -n 40960. Then do a combined CPU/GPU run: ./dgemm_bench -c -g -l -s -p -z -A -m 40960 -n 40960. Use the -o g, -I, and -B parameters as determined in steps 1 and 2. The performance should be better than in step 2.
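To keep the combined run consistent with what you found in steps 1 and 2, the command line can be assembled programmatically. The following Python sketch only concatenates the flags named in this guide. The function and parameter names are made up for illustration, and it does not check which flag combinations make sense for your driver.

```python
def combined_command(output_to_gpu=False, force_workaround=False,
                     driver_hack=False, size=40960):
    """Build the argument list for a combined CPU/GPU dgemm_bench run."""
    args = ["./dgemm_bench", "-c", "-g", "-l", "-s", "-p", "-z", "-A",
            "-m", str(size), "-n", str(size)]
    if output_to_gpu:        # -o g was better in step 1
        args += ["-o", "g"]
    if force_workaround:     # -I verified as correct in step 2
        args.append("-I")
    if driver_hack:          # AMD driver hack applied, see step 2
        args.append("-B")
    return args


print(" ".join(combined_command(output_to_gpu=True, force_workaround=True)))
```

The returned list can be passed directly to `subprocess.run` if you prefer to launch the benchmark from a script.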

You can alter the CPU/GPU ratio using the -j parameter. Try to tune it such that the GPU and CPU DGEMM times are equal. It is better to set -j rather high, as the dynamic scheduler will compensate for this with a work-stealing algorithm. If you see many 3rd-phase runs in the caldgemm output, then -j is possibly too big.
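As a starting point for this tuning, the next -j value can be estimated from the two measured times. The Python sketch below assumes that -j is the fraction of the work given to the GPU (between 0 and 1) and that both times scale linearly with the share of work. Both are simplifying assumptions, and the estimate ignores the dynamic scheduler, so treat the result as a first guess and err on the high side as described above.

```python
def balanced_ratio(current_j, gpu_time, cpu_time):
    """Estimate the -j value at which GPU and CPU would finish together.

    Assumption: -j is the GPU's share of the work and time is proportional
    to work, so the GPU rate is j / gpu_time and the CPU rate is
    (1 - j) / cpu_time. The balanced share is gpu_rate / (gpu_rate + cpu_rate).
    """
    if not 0.0 < current_j < 1.0:
        raise ValueError("current_j must be strictly between 0 and 1")
    if gpu_time <= 0.0 or cpu_time <= 0.0:
        raise ValueError("times must be positive")
    gpu_rate = current_j / gpu_time
    cpu_rate = (1.0 - current_j) / cpu_time
    return gpu_rate / (gpu_rate + cpu_rate)


# If the GPU takes longer than the CPU, the suggested ratio goes down.
print(balanced_ratio(0.8, gpu_time=12.0, cpu_time=8.0))
```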

If the AMD driver hack is not available, you might get better combined performance by using -o g (also follow the appropriate instructions in step 2).

////////////////////////////////////////////////////////////////////////////////////////////////////////////////

Step 4: There is little you can do to optimize multi-GPU performance. You have to determine the CPU core for each GPU independently. Repeat this part of step 2. Use -y 0, -y 1, -y 2, etc. to optimize each GPU. Finally use -G0 ? -G1 ? -G2 ? and insert the optimal CPU core you obtained for each GPU.

The next step is tuning the -Ux settings. Check whether parallel DMA mode and grouped DMA mode yield a benefit.
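The per-GPU core search and the final -G options can be written down with a short Python sketch. It only formats command lines and flags in the form used in this guide (-y to select the GPU, -t for the core, -G<gpu> <core> for the final mapping). The function names and the example core numbers are hypothetical.

```python
def per_gpu_sweep(gpus, cores, binary="./dgemm_bench"):
    """Build the DMA test command for every (GPU, CPU core) combination."""
    return [f"{binary} -y {gpu} -o g -v -t {core}"
            for gpu in gpus for core in cores]


def gpu_core_flags(core_for_gpu):
    """Turn {gpu index: best CPU core} into '-G0 <core> -G1 <core> ...'."""
    return " ".join(f"-G{gpu} {core}"
                    for gpu, core in sorted(core_for_gpu.items()))


# Hypothetical result of the sweep: GPU 0 works best on core 0, GPU 1 on core 6.
print(gpu_core_flags({0: 0, 1: 6}))
```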

First try to run without CPU. From now on omit the -y 0. The performance should scale almost linearly with multi-GPU.

You can try the -X and -Z options. They usually increase performance for 3 GPUs or more. You might also want to increase w. w = 1536 or w = 2048 can achieve good performance. For larger w a smaller h is suggested. Try h = 3072 for instance.

If you have good multi-GPU performance, try to use the CPU as well. You might need to change the -j value. It is best to start with -j 1, so that the GPUs do almost all of the work. Then decrease -j step by step until you see optimal performance (-j 0.97 ... -j 0.94 ... -j 0.91).
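The step-by-step search over -j can be kept track of with a few lines of Python. This is only a bookkeeping sketch: the step size of 0.03 mirrors the example values above, the function names are illustrative, and the performance numbers have to come from your own benchmark runs.

```python
def ratio_candidates(start=1.0, step=0.03, count=4):
    """Return decreasing -j values, e.g. 1.0, 0.97, 0.94, 0.91."""
    return [round(start - i * step, 2) for i in range(count)]


def best_ratio(measured):
    """Pick the -j value with the highest measured GFLOPS.

    measured is a list of (j, gflops) pairs from your own runs.
    """
    return max(measured, key=lambda pair: pair[1])[0]


for j in ratio_candidates():
    print(f"-j {j}")
```

Stop extending the list once the measured performance starts to drop again.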

////////////////////////////////////////////////////////////////////////////////////////////////////////////////