CALDGEMM Command Line Options
Command Line Options of dgemm_bench: The parameters here are those of dgemm_bench, and the defaults are valid for dgemm_bench. Most parameters translate directly to a CALDGEMM setting; in that case, the relevant CALDGEMM setting with its default in CALDGEMM is listed.
Some CALDGEMM settings are only valid for HPL-GPU. In that case, there is usually still a dgemm_bench option to test the parameter. These parameters are marked (HPL-GPU Setting).
CALDGEMM provides 4 backends: CAL, OpenCL, CUDA, and CPU. Some parameters are valid for only one or some of the backends. This is noted as e.g. (CAL Runtime and OpenCL Runtime only).
CALDGEMM has two DMA frameworks: one keeps the C matrix on the GPU (GPU_C = 1), the other keeps the C matrix on the host (GPU_C = 0). This is switched with the -Oc switch. Some parameters are only valid for one or the other case. This is noted as e.g. (GPU_C = 1 only). The CAL runtime will always use GPU_C = 0, CUDA will always use GPU_C = 1, OpenCL supports both, and for the CPU backend this setting is ignored. In general, GPU_C = 1 should be favored when the GPU is much faster than the CPU (i.e. with a multi-GPU system); GPU_C = 0 is better when GPU and CPU performance do not differ by more than a factor of 4. The GPU_C = 0 option requires preprocessing (DivideBuffer) and postprocessing (MergeBuffer) on the host. Compared to GPU_C = 0, GPU_C = 1 requires half the global host memory bandwidth, but it requires full-duplex DMA transfers instead of half-duplex for GPU_C = 0.
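For illustration (the flag values here are only a sketch, assuming the benchmark binary is called dgemm_bench as on this page): a GPU-only OpenCL run using the GPU_C = 1 framework could be started as
dgemm_bench -O 1 -Oc 1 -g -m 40960 -n 40960
whereas a combined CPU/GPU run that keeps C on the host might look like
dgemm_bench -O 1 -Oc 0 -c -g -m 40960 -n 40960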
-
-?
(dgemm_bench
specific)
Display help on command line options.
-
-e
(default: disabled) (Config->Verify
)
Verify Computational Correctness. The matrix is copied at the beginning of the computation. Sufficient memory must be available. See -7 for verification of large matrices.
-
-q
(default: disabled) (Config->Quiet
)
Suppress display output in caldgemm. Output from dgemm_bench is still active; see -5 to suppress this.
-
-a
(default: disabled) (CAL Runtime Only) (Config->Disassemble
)
Print the disassembled kernel image
-
-i
(default: disabled) (CAL Runtime and OpenCL Runtime Only)
Print IL Kernel used (Config->PrintILKernel
)
-
-if <int>
(default -1 = autodetect) (Config->ForceKernelVariant
)
Force DGEMM Kernel Variant to use. CALDGEMM can use some special kernels for special cases, i.e. general (number 0), with beta=1 (number 1), with beta = 1 and alpha = 0 and hardcoded k (number 2), with beta = -1 and alpha = 1 (number 4). CALDGEMM will automatically use the correct kernel. Used for internal testing only.
-
-o <c|g>
(default: 'c') (Config->DstMemory
)
Specify the output location of the kernel: c = CPU, g = GPU. If 'g' is specified, the GPU writes to GPU global memory and an additional DMA transfer fetches the data to the host. In general, 'c' is the faster option. On some systems DMA is slow and 'g' gives the better kernel performance. See -I in combination with the 'g' option!
-
-I
(default: -1 = autodetect) (CAL Runtime Only) (Config->ImplicitDriverSync)
Force implicit driver sync. A bug in some AMD drivers prohibits DMA transfers and concurrent kernel execution in certain situations. This slows down caldgemm. A workaround is available that relies on a specific driver behavior and might result in wrong results with newer drivers. It is automatically detected whether your driver suffers from the bug and whether the workaround can be applied. This check does not work for newer driver versions though. -I forces the workaround to be enabled.
-
-^ <int>
(CAL Runtime Only) (Config->UseDMAFetchQueue
)
Set DMA fetch queue parameter. Some AMD GPU drivers show a bug with implicit driver sync but still prohibit concurrent DMA transfers (see -I). In this case, the implicit driver sync (-I) cannot be used and must be switched off. This would disallow concurrent DMA transfers. A DMA fetch queue is a second workaround, which works in general but is slower than the implicit driver sync. In general: if the driver does not show the DMA limitation, no workaround should be used (-I 0 -^ 0); if the driver has the limitation and implicit driver sync does not cause data corruption, the implicit driver sync workaround should be used (-I 1 -^ 0); if implicit driver sync does not work, the DMA fetch queue should be used (-I 0 -^ 1).
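As an illustrative example (flag values are a sketch, not a recommendation): on a CAL setup with GPU output memory (-o g) where implicit driver sync is known to be safe, one might run
dgemm_bench -o g -I 1 -^ 0
and switch to -I 0 -^ 1 if the implicit driver sync produces corrupted results.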
-
-h <int>
(default: =4096) (Config->Height
)
Tile size for matrix multiply, default 4096. If you use GPU only DGEMM the matrix sizes must be a multiple of h.
-
-H <int>
(default: =-h) (dgemm_bench
specific)
Reduced block size for actual matrix multiply (buffer size given by -h). I.e. CALDGEMM will allocate buffers for a tile size of h, but then use an actual tile size of H. The reason is that when you want to run caldgemm with different matrix sizes, you should initialize it with large h suited for the largest matrix, but a smaller matrix might favor a smaller h, so tile size can be reduced during runtime in caldgemm. H is used to test the impact of this in dgemm_bench. Used for internal testing.
-
-w <int>
(default: 1024) (Config->Width
)
k for matrix multiply, default 1024.
-
-W <int>
(default: =-w) (dgemm_bench
specific)
Reduced width, see -H. Used for internal testing.
-
-l
(default: disabled) (Config->AutoHeight
)
Automatically select tile size for good performance. The -h parameter defines the maximal size possible. The -l parameter will use smaller tiles for smaller matrices. Activating this is generally a good idea.
-
-m <int>
(default: 4096) (dgemm_bench
specific)
m for matrix multiply. Number of rows of the target matrix. If GPU-only DGEMM is used, this must be a multiple of -H. If small tiles are allowed via the -J switch, it must be a multiple of the minimum small tile size. If the CPU is used as well, m can be arbitrary. The CPU processes the remainder part.
-
-n <int>
(default: 4096) (dgemm_bench
specific)
n for matrix multiply, must be a multiple of h. Number of columns of the target matrix. If GPU-only DGEMM is used, this must be a multiple of -H.
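As a sketch of the size constraints above (values are illustrative): a GPU-only run with the default tile size could use
dgemm_bench -g -h 4096 -m 40960 -n 40960
since both m and n are multiples of 4096; adding -c permits an arbitrary m, because the CPU processes the remainder.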
-
-v
(default: disabled) (Config->VerboseTiming
)
Verbose Synchronous Timing for Single Kernels / Transfers. This disables all asynchronous transfers in caldgemm. Overall performance will be poor. This can be used for directly measuring kernel performance and DMA performance and pre-/ postprocessing performance on CPU (pre-/postprocessing is only used for some operating modes.)
-
-k
(default: disabled) (GPU_C = 0 Only) (Config->AsyncTiming
)
Print Timing of Asynchronous DGEMM Operation. Used for internal testing.
-
-r <int>
(default: 1) (Config->Iterations
)
Number of iterations to run the program (inside caldgemm) Used for internal testing.
-
-R <int>
(default: 1) (dgemm_bench
specific)
Number of iterations to run in the benchmark (separate caldgemm calls). Used for internal testing.
-
-y <int>
(default: -1) (Config->DeviceNum
)
Force Device ID (-1 = all devices) Force the device id to use. You can either specify a single device or provide -1 to use all devices.
-
-Y <int>
(default: 8) (Config->NumDevices
)
Maximal number of devices to use. Setting -Y greater than zero requires -y to be -1
-
-Ya <int>
(default: 8) (Config->NumActiveDevices
)
Use only this many devices as active devices in the main queue. Must be smaller than -Y. Other devices can be used for the async side queue.
-
-Yu
(default: disabled) (Config->AsyncSideQueueUseInactiveDeviceSet
)
If GPUs were disabled via SetNumberDevices for the main queue, the async side queue will use these disabled devices instead of the active devices of the main queue. This improves the parallelism and allows better exploitation of all available devices.
-
-bb <int>
(default: 0 = autodetection) (Config->max_bbuffers
)
Maximum number of allowed bbuffers. In many cases, mostly for OpenCL, autodetection might not work properly. Then -bb should be set to the highest max(m,n)/h which is run.
-
-d
(default: disabled) (Config->Debug
)
Print lots of debug output
-
-z
(default: disabled) (Config->MultiThread)
Enable Multithreading. You definitely want to activate this. For some internal reasons, this is a prerequisite to use multiple GPUs. MultiThreading means asynchronous processing of pre-/postprocessing (required if GPU_C = 0 (-Oc parameter)). In addition, it is required for asynchronous factorization, broadcast, etc. in HPL-GPU.
-
-Z
(default: disabled) (Config->MultiThreadDivide
)
Enable Multithreading for DivideBuffer as well. Requires -z. Only valid for multiple GPUs and only when GPU_C is set to 0. Use -Gx to set the CPUs for GPU pre-/postprocessing!
-
-b
(default: disabled) (dgemm_bench
specific)
Enable internal benchmarking mode. Used for internal testing.
-
-c
(default: disabled) (Config->UseCPU
)
Use CPU for DGEMM. You can supply -g as well to use both CPU and GPU. Supplying neither of them will use GPU only.
-
-g
(default: enabled if and only if -c is disabled) (Config->UseGPU
)
Use GPU for DGEMM. You can supply -c as well to use both CPU and GPU. Supplying neither of them will use GPU only.
-
-f
(default: disabled) (dgemm_bench
specific)
Fast Init (Empty Matrices). The matrices are filled with zeros instead of using a random number generator. Initialization is faster. Use for optimization and benchmarking only. The verification does not work with this initialization method. Nor are the benchmark results meaningful with newer GPUs: multiplication with zeroes draws less power, hence the GPU will run in turbo mode constantly, which is not the case with standard random numbers.
-
-j <dbl>
(default: -1) (Config->GPURatio
)
Ratio of GPU performance to total CPU+GPU performance. Set to -1 for autodetection. This defines how the matrix is split between CPU and GPU. The GPU will process a fraction of j. For DGEMM only, this should be GPU_Perf/(CPU_Perf + GPU_Perf). When used within HPL-GPU, keep in mind that the CPU has to perform other tasks as well, so j should be larger in that case. The -1 autodetection usually lacks a good initial guess for the first run, i.e. it will find a good value over time, but at the beginning it can be far off. You can use a negative value to define an initial guess, i.e. -j -0.7 will start with a ratio of 0.7 and then refine this automatically.
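For example (matrix sizes are arbitrary illustrations): a combined CPU/GPU benchmark that starts from an initial guess of 0.7 and lets CALDGEMM refine the ratio could be invoked as
dgemm_bench -c -g -j -0.7 -m 40960 -n 40960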
-
-jf <dbl>
(default: disabled) (HPL-GPU Setting for Multi-Node runs) (Config->GPURatioDuringFact
)
If greater than zero, this defines a minimum GPU ratio that is used during the factorization phases of Linpack. In these phases, the factorization causes significant CPU load. The ratio should thus be higher than in non-factorization phases. In that case, -jf defines a lower limit to support the autocalculation.
-
-jm <dbl>
(default: disabled) (Config->GPURatioMax
)
If greater than zero, this defines a maximum GPU ratio. This ensures the CPU always gets a certain part of the matrix. This is particularly useful in combination with automatic ratio calculation. Autocalculation works only if the CPU has a certain part. Without -jm, once the CPU part becomes 0, it will usually remain zero and never recover. In that case, you can use -jm 0.99.
-
-jt <dbl>
(default: 0) (Config->GPURatioMarginTime
)
The automatic ratio calculation tries to match GPU and CPU execution time to ensure full utilization of both processors. Performance deteriorates more if the GPU idles, hence it is generally a good idea to aim for a slightly longer GPU execution time than CPU execution time, to compensate for small variations. This parameter defines a margin in seconds by which the GPU time should exceed the CPU time.
-
-js <dbl>
(default: 0.4) (HPL-GPU Setting) (Config->GPUMarginTimeDuringFact
)
In linpack factorization phases, execution time variations can be larger. This setting overrides the -jt setting in this case.
-
-jl <dbl>
(default: 0.2) (HPL-GPU Setting) (Config->GPURatioLookaheadSizeMod
)
With the standard (non-alternate) lookahead, the CPU has to process a small non-quadratic matrix part in the preparatory phase of the lookahead. DGEMM on this part is usually slower than the full DGEMM. This parameter defines an extra factor that virtually increases the lookahead part in the CPU / GPU distribution calculation, to account for the reduced CPU performance. (A setting of zero means no compensation.)
-
-jp <int>
(default: 1) (Config->GPURatioPenalties
)
Apply ratio penalties to the CPU part in some situations to ensure the GPU remains the dominant processor. A setting of 0 disables penalties. A setting of 1 applies a penalty if the CPU took longer than the GPU in the last iteration. A setting of 2 additionally applies a penalty when the CPU part in the last iteration was short, because in that case CPU performance may fluctuate and is not that important.
-
-jq <dbl>
(default: 0.9) (Config->GPURatioPenaltyFactor
)
Penalty factor to apply to CPU part.
-
-s
(default: disabled) (Config->DynamicSched
)
Dynamic CPU / GPU scheduling. Do not use only the fixed ratio specified by -j but use a dynamic CPU/GPU workload scheduling. This includes work-stealing, etc. The value provided by -j is the basis for the scheduling.
-
-M
(default: disabled) (Config->ThirdPhaseDynamicRuns
)
Disable third phase in dynamic scheduling
-
-N
(default: disabled) (Config->SecondPhaseDynamicRuns
)
Disable second phase in dynamic scheduling
-
-rr
(default: disabled) (HPL-GPU Setting) (Config->RereserveLinpackCPU
)
Rereserve Linpack CPU: HPL-GPU requires one CPU core for the broadcast. This core is not available for CPU DGEMM. CALDGEMM can estimate the broadcast time and then try to split the DGEMM into two parts: one part in parallel to the broadcast with one core less, and a second part after the broadcast with all cores. This makes sense when you are not GPU dominated and when you do not have too many CPU cores.
-
-p
(default: disabled) (Config->MemPolicy
)
Interleaving Memory Policy. GotoBLAS usually activates memory interleaving. This leads to a problem with the CAL library: interleaving should be activated only after memory for the CAL library is allocated. Thus it is recommended to disable interleaving in GotoBLAS (apply the patch provided with caldgemm and set NO_MEMINTERLEAVE in the GotoBLAS Make.rule) and use -p.
-
-u
(default: disabled) (Config->DumpMatrix
)
Dump Test Matrix. Used for internal testing only.
-
-1
(default: disabled) (dgemm_bench
specific)
Transpose A Matrix. Provide a transposed input A matrix.
-
-2
(default: disabled) (dgemm_bench
specific)
Transpose B Matrix. Provide a transposed input B matrix.
-
-3
(default: disabled) (dgemm_bench
specific)
Set alpha parameter to 1.0 to test optimized kernel.
-
-#
(default: disabled) (dgemm_bench
specific)
Set beta parameter to 0.0 to test optimized memcpy.
-
-5
(default: disabled) (dgemm_bench
specific)
Quiet Benchmark mode (different from quiet caldgemm mode -q). This suppresses output of dgemm_bench. Output of caldgemm is not suppressed. See -q for this.
-
-6 <int>
(default: not used) (dgemm_bench
specific)
Set m=n = value * tile-size (-h)
-
-4 <int>
(default: not used) (dgemm_bench
specific)
Set m=n to the closest multiple of tile-size (-h) to value
-
-7
(default: disabled) (dgemm_bench
specific)
Verification for large matrices. Compared to -e this does not require the matrix to be copied. However, the output is less elaborate and it only tells you whether the DGEMM succeeded.
-
-8
(default: initial run enabled) (dgemm_bench
specific)
No initial run to negate cache effects. The first run is usually slower as the kernel must be copied to the GPU, etc. Thus, for benchmarks, an initial run is performed before the actual benchmark run is started. The -8 option omits this initial run. The initial run is automatically deactivated if the -d option or some others are given. This option is primarily used for debugging.
-
-9
(default: disabled) (Config->TabularTiming
)
Output a table with timing information
-
-0
(default: disabled) (CAL Runtime only) (Config->DivideToGPU
)
Write the output of the DivideBuffer function directly to the GPU instead of using a separate DMA transfer. This option turned out to not perform well. Better leave it deactivated.
-
-A
(default: disabled) (Config->AsyncDMA
)
Do the DMA transfer to GPU asynchronously. If you are not debugging, always enable this.
-
-Ap
(default: disabled) (Config->PipelinedOperation
)
Enable pipelined CALDGEMM operation mode. RunCALDGEMM will not wait for the DGEMM to finish, but exit earlier. You can then queue a new DGEMM operation already. Use FinishDGEMM to wait for the operation to finish!
-
-Aq
(default: disabled) (Config->PipelineMidMarker
)
Marks a position in terms of matrix_n where the pipeline can check whether the DGEMM has already passed this position via WaitForCALDGEMMProgress(n).
-
-Ab
(default: disabled) (Config->PipelineDoubleBuffer
)
Activating the double buffer doubles all GPU buffers for the A, B, and C matrices. This allows a better overlap of two pipelined DGEMM calls, because the second DGEMM does not have to wait for the first DGEMM to free all buffers before it can start. The performance impact is small, only around 0.5 to 1%, while this option essentially doubles the GPU memory requirements. Still, as CALDGEMM does not usually need that much GPU memory, you should activate this when enough memory is available. In case you run low on GPU memory, this is the first option to deactivate.
-
-L
(default: disabled) (dgemm_bench
specific)
Memory Organisation like in HPL (LINPACK). Do not pack the A, B, C matrices together but use a memory organisation like in HPL where the matrices are stored kind of interleaved.
-
-C
(default: disabled) (dgemm_bench
specific)
Call fake LINPACK callback functions. This is used to test the HPL callback implementation. For internal testing only.
-
-Ca <int>
(default: 0) (HPL-GPU Setting) (Config->AlternateLookahead
)
Set alternate lookahead threshold. Alternate lookahead mode will be used as soon as matrix_n (the HPL-GPU value matrix_n) becomes smaller than this threshold.
-
-Cm <int>
(default: 0 / disabled) (HPL-GPU Setting) (Config->MinimizeCPUPart
)
Minimize the CPU part as soon as the matrix size falls below this threshold.
-
-P <int>
(default: not used) (dgemm_bench
specific)
LDA=LDB=LDC = val for HPL like memory. Forces the leading dimension of the matrices to a specific value. If not set the leading dimensions are chosen such that each row starts at a new cache line.
-
-T
(default: disabled) (dgemm_bench
specific)
Allocate memory using huge tables. Turned out not to perform well for some reason. Better leave it deactivated. To activate this feature, shared memory segments with huge tables must be provided.
-
-B
(default: disabled) (CAL Runtime only) (Config->KeepBuffersMapped
)
Keep DMA Buffers mapped during kernel execution. The Driver Hack is needed for this option. It is only relevant when using "-o c" which, however, is the default value.
-
-x <file>
(default: not used) (dgemm_bench
specific)
Load Matrix from file.
-
-- <int>
(default: disabled) (dgemm_bench
specific)
Run a torture test with n iterations. The torture test will automatically set "-A -B -p -z -g". It will use m and n of 86016. If you do not have sufficient memory available, you can override the m and n settings; make sure you specify -m and -n after --. Without additional options a GPU-only torture test is started. Using the standard options you can run combined GPU/CPU torture tests. E.g. a combined torture test with reduced matrix size can be started by: -- 10 -m 40960 -n 40960 -c -l -se
-
-t <int>
(default: 0) (Config->PinCPU
)
Pin GPU thread to core n. The core which is closest to the GPU should be chosen, mostly 0. The additional merge threads will then use the next possible cores. E.g. running with 1 merge thread and -t 6 will use cores 6 and 7.
-
-ts
(default: disabled) (Config->ShowThreadPinning
)
Visualize the Thread affinities.
-
-tc
(default: disabled) (Config->ShowConfig
)
Print full CALDGEMM config settings at startup.
-
-tr <int>
(default: -2) (Config->PinDeviceRuntimeThreads)
Pin the GPU device runtime threads to this CPU core. Use -2 to use the same core as the CALDGEMM main thread. This is the default, as it frees the other cores for concurrent DGEMM execution. Set to -1 to allow the device runtime to utilize all CPU cores (which is the default of the device runtimes).
-
-K <int>
(default: -1 / none) (Config->PinMainThread
)
Pin the GPU main thread for DMA handling to this core.
-
-Kb <int>
(default: -1 / autodetect) (Config->PinBroadcastThread
)
Pin the broadcast thread, which does all the MPI calls, to this core.
-
-KN <int>
(default: -1 / autodetect) (Config->ForceNumCPUThreads
)
Set the number of CPU cores used to this value.
-
-KO <int>
(default: 0) (Config->CPUCoreOffset
)
Offsets all CPU core pinnings by this value, i.e. this number is added to all pinnings. This can be used when two mpi processes (ranks) run on the same node to shift the cores used by the second process. Consider that this will also affect the linpack broadcast cpu core, so setting -Kb -2 in HPL will likely result in incorrect pinning.
-
-KG <int>
(default: -2 / disabled) (Config->SpawnGPUThread
)
Spawn a GPU thread instead of a cblas thread, and perform cblas calls from calling thread. -2: disabled (default), -1: enabled, >= 0: define the CPU core to pin the caller thread to (PinMainThread will affect the GPU thread!)
-
-Gx <int>
(default: not used) (Config->GPUMapping
)
Use the given CPU core for GPU x. The merge threads will use the next cores, i.e. -G1 12 will do DivideBuffer for GPU 1 on core 12 (if -X is used) and MergeBuffer on core 13 and following. Multiple GPUs assigned to the same core are automatically grouped correctly. Use multiple times for multiple GPUs, e.g. -G0 0 -G1 12 -G2 12. In order to use multiple cores for DivideBuffer, you have to enable the multithreaded DivideBuffer option (-Z)!
-
-Ux <int>
(default: -1 = auto, see -Gx) (Config->PostprocessMapping
)
Pin the CPU postprocessing threads of GPU x to this CPU core; -1 = default mapping. If -Ux and -Gx differ, postprocessing for GPU x happens on core Ux, other tasks happen on core Gx. (See -Gx.)
-
-UAx <int>
(default: -1 = no special pinning) (Config->AllocMapping
)
Allocate memory for the host buffers of GPU x on this CPU die; -1 = default mapping. In grouped DMA mode (see -[) this also defines the CPU core for the host thread responsible for GPU x.
-
-UBx <int>
(default: none) (Config->DMAMapping
)
Set DMA Mapping for GPU x, i.e. the CPU thread responsible for GPU x in parallel DMA mode (See -*) will run on core UBx.
-
-V <int>
(default: automatic) (Config->ThreadSaveDriver
)
Thread-safe GPU runtime: (0: no, 1: yes, -1: use global lock). This is mostly important for the CAL backend. CAL is not completely thread safe; some API functions are not reentrant. -V 0 will use a mutex to protect this, -V 1 will not use a mutex, causing a possible race condition, so it is UNSAFE! For CUDA and OpenCL, the APIs are thread safe, and there is no difference between -V 0 and -V 1. -V -1 has a different meaning: in that case, CALDGEMM will use a global mutex to protect and serialize all GPU API calls. This is for debugging in case one suspects the GPU driver not to be reentrant.
-
-S
(default: not used) (Config->SlowCPU
)
Set slow CPU option (see below)
-
-X
(default: disabled) (Config->ImprovedScheduler
)
Do not use a round-robin scheduler for multi-GPU, but split the matrix along the non-favored direction and process each part by a distinct GPU. This saves BBuffers and is usually faster. This is mandatory for very large matrices.
-
-Xb <int>
(default: 1) (Config->ImprovedSchedulerBalance
)
Use a balanced improved scheduler. Only relevant in combination with -X. There are three modes: 0 = no balancing, 1 = standard balancing, 2 = advanced balancing. 1 or 2 should always be better.
-
-E
(default: 0) (dgemm_bench
specific)
Define random seed to use for matrix initialization. Use 0 for time.
-
-O
(default: enabled) (dgemm_bench
specific)
Define the backend to use. Available options are: 0 = CAL, 1 = OpenCL, 2 = CUDA, 3 = CPU only.
-
-Oc <int>
(default: autodetect) (OpenCL Runtime only) (Config->GPU_C
)
Set this to 1 to enable the alternate GPU_C = 1 DMA framework, which keeps the C matrix on the GPU (see the head of this page!). CAL is forced to -Oc 0 and CUDA is forced to -Oc 1; OpenCL can use both and defaults to 0. It is usually a good idea to enable this with OpenCL, but some drivers do not support it, hence it is disabled by default.
-
-Ol <string>
(default: none) (OpenCL Runtime only) (ConfigOpenCL->kernelLib
)
Set a 3rd party external library that provides the GPU kernels. If this is not set, reference kernels (unoptimized) that come with CALDGEMM are used.
-
-Oe
(default: disabled) (OpenCL Runtime only) (Config->NoConcurrentKernels
)
Do not allow multiple concurrent OpenCL kernels. Some OpenCL devices are slower when they execute multiple DGEMM kernels at the same time. This setting uses OpenCL events to enforce serialization of OpenCL kernels. It does not work well and should not be used. It is better to enforce serialization on the driver side, e.g. on AMD cards via the GPU_NUM_COMPUTE_RINGS=1 environment variable.
-
-Oq
(default: disabled) (Config->SimpleGPUQueuing
)
Use simple GPU Queuing for OpenCL. This comes with less overhead, so it is generally better for the GPU. But it is incompatible with GPU_C = 0 (-Oc option). If you use -Oc 1, you should also enable this. This enforces the Improved Scheduler (-X option)
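An illustrative OpenCL invocation combining the GPU_C = 1 framework with simple GPU queuing (matrix sizes are arbitrary examples):
dgemm_bench -O 1 -Oc 1 -Oq -g -m 40960 -n 40960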
-
-OQ
(default: disabled) (Config->AlternateSimpleQueuing
)
A different variant of simple queuing. The three command queues are not used in round-robin fashion; instead, one is used for kernels, one for transfers to, and one for transfers from the device. This is a different approach compared to the workaround with GPU_NUM_COMPUTE_RINGS=1, which avoids that two DGEMM kernels run at the same time. GPU_NUM_COMPUTE_RINGS must not be set when this option is used.
-
-OM
(default: disabled) (Config->AlternateSimpleQueuingMulti
)
Another variant of the simple queuing. As in the alternate queuing, there are two queues dedicated to DMA transfers, but there are multiple queues for the kernels. Hence, this does not provide the workaround for AMD FirePro GPUs of the Hawaii series, which perform better with -OQ only. However, other GPUs such as NVIDIA's can benefit from more parallelism and are possibly faster with -OM.
-
-Op <int>
(default: disabled) (Config->PreallocData
)
CALDGEMM requires certain internal buffers. The number depends on n/h and m/h, i.e. on the matrix size. As this is not known in advance, these buffers are allocated during runtime. They can be preallocated during the initialization via the -Op option. The maximum number of blocks, max(nb = n/h, mb = m/h), must be provided then.
-
-Oa
(default: disabled) (OpenCL Runtime Only, CUDA support planned) (HPL-GPU Setting) (Config->AsyncSideQueue
)
CALDGEMM can run asynchronous side queues on the GPU to offload other tasks concurrently with DGEMM execution. If this is set, dgemm_bench creates an async side queue and uses this queue to test a single-tile DGEMM.
-
-Or <int>
(default: 480) (OpenCL Runtime Only) (Config->AsyncDGEMMThreshold
)
Threshold for the minimum matrix size when the GPU is used for DGEMM in async side queue.
-
-Os <int>
(default: 128) (OpenCL Runtime Only) (Config->AsyncDTRSMThreshold
)
Threshold for the minimum matrix size when the GPU is used for DTRSM in async side queue.
-
-Od
(default: disabled) (OpenCL Runtime Only) (Config->AsyncDTRSM
)
Use the async side queue (available via the -Oa flag) also for async DTRSM calls.
-
-Ox
(default: disabled) (Config->CPUInContext
) (OpenCL Runtime Only)
Do not place the CPU in the OpenCL context (if possible at all). Some OpenCL runtimes require the CPU in the context for large memory allocations. On the other hand, this can cause the allocation of additional unused resources.
-
-Ot
(default: disabled) (Config->Use3rdPartyTranspose
)
Use 3rdPartyTranspose kernel for matrix transposition, which is provided by 3rd party external library (See -Ol setting)
-
-F
(default: 0) (Config->OpenCLPlatform
)
Define OpenCL Platform ID to use.
-
-Fc
(default: 0) (OpenCL Runtime only) (ConfigOpenCL->OpenCLPlatform
)
Allow the CPU to be used as an OpenCL device.
-
-J <int>
(default: 0) (Config->SmallTiles
)
Allow small tiles to process the remainder on the GPU (0 = disable, 1 = enable, 2 = auto). Auto tries to find a good tile size automatically, which does not always work. In general, for a system with a fast CPU, it is best to leave this at 0. That will ensure the optimal GPU tile size; the CPU does the rest. If the GPU is very fast, set this to 1 to ensure that the remainder part processed by the CPU does not become a bottleneck. In any case, you can try setting 2 when you try to optimize, but the effect is small and if the prediction fails, it deteriorates performance significantly.
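As a sketch (values are illustrative): on a strongly GPU-dominated system one might allow small tiles for the remainder with
dgemm_bench -c -g -J 1 -m 40960 -n 40960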
-
-Q
(default: disabled) (dgemm_bench
specific)
Wait for pressing a key before exiting
-
-!
(default: disabled) (dgemm_bench
specific)
Do not use page locked memory
-
-_
(default: disabled) (OpenCL Runtime and CUDA Runtime only) (dgemm_bench
specific)
Allocate memory using the GPU runtime library (e.g. OpenCL) instead of malloc. This is required for using GPU_C = 1 (-Oc 1 option) in combination with -o c. In general, it is usually faster with GPU_C = 1 regardless of whether -o g or -o c is used. Some drivers do not support this properly.
-
-= <int>
(default: 2) (GPU_C = 0 only) (Config->OutputThreads
)
Define number of MergeBuffer threads per GPU.
-
-%
(default: disabled) (Config->SkipCPUProcessing
)
Skip CPU Pre- and Postprocessing. Leads to incorrect results. For internal testing only
-
-@ <list>
(default: disabled) (Config->ExcludeCPUCores
)
Comma- or semicolon-separated list of CPU cores to exclude. This is useful if you run something in parallel to CALDGEMM, or if you have a Bulldozer or HyperThreading CPU and you want to disable all even or all odd numbered cores. In general, it is a good idea to disable HyperThreading for CALDGEMM.
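For example (assuming, purely as an illustration, a CPU layout where the odd-numbered cores are the HyperThreading siblings):
dgemm_bench -c -g -@ 1,3,5,7
would keep the even-numbered physical cores for CALDGEMM.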
-
-.
(default: disabled) (CAL Runtime only) (Config->RepinDuringActiveWaitForEvent
)
Repin Main Thread During Active Wait for GPU Event. This is a workaround required for the CAL Runtime on Sandy-Bridge-E CPUs. It costs performance, so only enable when needed.
-
-~
(default: disabled) (Config->RepinMainThreadAlways
)
Always repin main thread. This is an alternate workaround for Sandy-Bridge-E CPUs (see -. option)
-
-, <int>
(default: disabled) (CAL Runtime only) (Config->SleepDuringActiveWait
)
Sleep for n usec during active wait for GPU. This can save some CPU resources at the cost of increased latency.
-
-:
(default: disabled) (Config->NumaPinning
)
Enable NUMA pinning. This tries to distribute all employed CPU threads evenly among the NUMA nodes. It has little effect and does not always work, but it practically never has a negative effect.
-
-/ <list>
(default: disabled) (Config->DeviceNums
)
Comma- or semicolon-separated list of GPU devices to use (replaces -y for multiple devices). Usually, -Y 3 will use GPU devices 0, 1, and 2, while -y 3 will use only GPU device 3. This gives more fine-grained control over which GPU devices to use. On NUMA systems, it can be beneficial to interleave devices on different NUMA nodes.
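As an illustration (the actual device-to-NUMA-node assignment depends on the system topology and is assumed here):
dgemm_bench -g -/ 0,2,1,3
would use four GPUs while alternating between devices attached to different NUMA nodes.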
-
-* <int>
(default: 0) (Config->ParallelDMA
)
Enable the Parallel DMA option if n >= this value. Set to a very large number to enable it always. This mode will use a different CPU core to manage each GPU, thus requiring more CPU resources. The cores are defined via the -UBx option.
-
-[ <int>
(default: 0) (Config->GroupParallelDMA
)
Enable the Grouped Parallel DMA option if n < this value. Requires the Parallel DMA option and requires -* > -[ or -* -1. Set this option to -1 in order to always use Grouped Parallel DMA and never standard Parallel DMA. This can group the GPUs and use one CPU core for multiple GPUs. Cores are defined via the -UAx option. As an example, -* -1 -[ 100000 -UA0 0 -UA1 0 -UA2 10 -UA3 10 -UB0 0 -UB1 1 -UB2 10 -UB3 11 will use 4 threads (one per GPU) for matrix sizes above 100000 on cores 0, 1, 10, 11 and 2 threads for smaller matrices on cores 0 and 10, with core 0 handling GPUs 0 and 1.
-
-] <int>
(default: disabled) (AMD GPUs only, uses ADL library) (dgemm_bench
specific)
Maximum allowed GPU temperature (the check is applied after one caldgemm iteration, meaningful in combination with -R).
-
The CALDGEMM config allows the SlowCPU option, which should be used when the CPU is comparatively slow compared to the GPU. It deactivates 2nd and 3rd phase runs and adjusts the tiling size to minimize the 1st phase CPU run.
-
SetNumberDevices(int): Allows the reduction of used GPU devices during runtime.