Below are some recommendations to further improve translation performance; a short sketch applying several of them with the Python API follows the list. Many of these recommendations were used in the WNGT 2020 efficiency task submission.
- Set the compute type to "auto" to automatically select the fastest execution path on the current system
- Reduce the beam size to the minimum value that meets your quality requirement
- When using a beam size of 1, keep `return_scores` disabled if you are not using prediction scores: the final softmax layer can be skipped
- Set `max_batch_size` and pass a larger batch to `translate_batch`: the input sentences will be sorted by length and split into chunks of `max_batch_size` elements for improved efficiency
- Prefer the "tokens" `batch_type` to make the total number of elements in a batch more constant
- Use an Intel CPU supporting AVX512
- If you are translating a large volume of data, prefer increasing `inter_threads` over `intra_threads` to improve scalability
- Avoid setting the total number of threads `inter_threads * intra_threads` larger than the number of physical cores
- Use an NVIDIA GPU with Tensor Cores (Compute Capability >= 7.0)
- Pass multiple GPU IDs to `device_index` to run translations on multiple GPUs
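To make these options concrete, here is a minimal sketch of how they map to the Python API. The model path, GPU IDs, thread counts, and input tokens are placeholders to adapt to your setup.

```python
import ctranslate2

translator = ctranslate2.Translator(
    "ende_ctranslate2",      # placeholder path to a converted model
    device="cuda",           # or "cpu"
    device_index=[0, 1],     # multiple GPU IDs to translate on multiple GPUs
    compute_type="auto",     # select the fastest execution path on this system
    inter_threads=4,         # number of parallel translations (CPU scalability)
    intra_threads=2,         # threads per translation; keep inter*intra <= physical cores
)

# Placeholder pre-tokenized inputs.
batch = [["▁Hello", "▁world", "!"], ["▁How", "▁are", "▁you", "?"]]

results = translator.translate_batch(
    batch,
    beam_size=1,             # smallest beam that meets the quality requirement
    return_scores=False,     # with beam_size=1, lets the final softmax be skipped
    max_batch_size=1024,     # sort by length and split into chunks of this size
    batch_type="tokens",     # keep the number of elements per batch more constant
)
```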
The command line option `--log_throughput` reports the tokens generated per second on the standard error output. This is the recommended metric to compare different runs (higher is better).
See the benchmark scripts.
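If you are benchmarking through the Python API rather than the command line, a rough equivalent is to time `translate_batch` and count the generated tokens yourself. The snippet below is a sketch with a placeholder model and inputs, and assumes a version of the Python API where results expose `hypotheses`.

```python
import time
import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2", device="cpu")
batch = [["▁Hello", "▁world", "!"]] * 64   # placeholder tokenized inputs

start = time.time()
results = translator.translate_batch(batch, beam_size=2)
elapsed = time.time() - start

# Same idea as --log_throughput: generated target tokens per second (higher is better).
generated_tokens = sum(len(result.hypotheses[0]) for result in results)
print(f"{generated_tokens / elapsed:.1f} tokens/s")
```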
The command line option `--log_profiling` reports an execution profile on the standard error output. It prints a list of selected functions in the format:
```
2.51%  80.38%  87.27%  beam_search  557.00ms
```
where the columns mean:
- Percent of time spent in the function
- Percent of time spent in the function and its callees
- Percent of time printed so far
- Name of the function
- Time spent in the function (in milliseconds)
The list is sorted by the last column, from the largest to the smallest time spent in the function.
Allocating memory on the GPU with `cudaMalloc` is costly and is best avoided in high-performance code. For this reason, CTranslate2 integrates caching allocators which enable fast reuse of previously allocated buffers. The allocator can be selected with the environment variable `CT2_CUDA_ALLOCATOR`.

The `cub_caching` allocator, from the CUB project, can be tuned to trade off memory usage and speed. By default, CTranslate2 uses the following values, which have been selected experimentally:
- `bin_growth = 4`
- `min_bin = 3`
- `max_bin = 12`
- `max_cached_bytes = 209715200` (200MB)
You can override these values by setting the environment variable `CT2_CUDA_CACHING_ALLOCATOR_CONFIG` with comma-separated values in the same order as the list above:
```bash
export CT2_CUDA_CACHING_ALLOCATOR_CONFIG=8,3,7,6291455
```
See the description of each value in the allocator implementation.
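For intuition, and assuming these values follow the semantics of CUB's `CachingDeviceAllocator` (the implementation referenced above), the defaults roughly mean that cached block sizes range from 4^3 bytes to 4^12 bytes, with at most 200MB kept cached:

```python
# Assumes CUB CachingDeviceAllocator semantics: cached block sizes are powers
# of bin_growth, from bin_growth**min_bin up to bin_growth**max_bin.
bin_growth, min_bin, max_bin, max_cached_bytes = 4, 3, 12, 209715200

smallest_bin = bin_growth ** min_bin    # 4**3  = 64 bytes
largest_bin = bin_growth ** max_bin     # 4**12 = 16,777,216 bytes (16MB)
print(smallest_bin, largest_bin, max_cached_bytes / 2**20)   # 64 16777216 200.0
```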
CUDA 11.2 introduced an asynchronous allocator with memory pools. It is usually faster than `cub_caching` but uses more memory.
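The allocator is chosen through `CT2_CUDA_ALLOCATOR` as described above. Assuming the asynchronous allocator is exposed under the value `cuda_malloc_async` (check the allocator documentation for the exact names), selecting it from Python could look like this:

```python
import os

# Assumption: "cuda_malloc_async" selects the CUDA 11.2+ asynchronous allocator.
# The variable must be set before the first GPU allocation.
os.environ["CT2_CUDA_ALLOCATOR"] = "cuda_malloc_async"

import ctranslate2

translator = ctranslate2.Translator("ende_ctranslate2", device="cuda")  # placeholder model
```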
Packed GEMM could improve performance for single-core decoding. You can enable this mode by setting the environment variable `CT2_USE_EXPERIMENTAL_PACKED_GEMM=1`. See Intel's article to learn more about packed GEMM.
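As a sketch (the model path is a placeholder), the flag can be set from Python before the library is loaded; since packed GEMM targets single-core decoding, it is paired here with `intra_threads=1`:

```python
import os

# Enable the experimental packed GEMM mode before loading the model.
os.environ["CT2_USE_EXPERIMENTAL_PACKED_GEMM"] = "1"

import ctranslate2

# Packed GEMM targets single-core decoding, hence intra_threads=1.
translator = ctranslate2.Translator("ende_ctranslate2", device="cpu", intra_threads=1)
```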
You can use the script `tools/tune_inter_intra.py` to find the threading configuration that maximizes the global throughput. Simply replace your call to `./build/cli/translate` with `python3 ./tools/tune_inter_intra.py ./build/cli/translate`. The script will run the translation multiple times and report the final tokens per second and the maximum memory usage for each threading combination.
```bash
head -n 100 valid.de | python3 ./tools/tune_inter_intra.py ./build/cli/translate --model ende_ctranslate2 --beam_size 2 > out.csv
column -s, -t < out.csv | sort -k3 -r
```
```
inter_threads  intra_threads  tokens/s  memory_used (MB)
4              2              919.333   918
2              4              919.333   706
1              8              919.333   557
8              1              689.5     914
7              1              689.5     910
3              2              689.5     876
2              3              689.5     731
2              2              689.5     729
1              5              689.5     562
1              7              689.5     553
1              4              689.5     553
1              6              689.5     549
5              1              551.6     914
4              1              551.6     910
6              1              551.6     869
3              1              551.6     861
1              3              551.6     567
2              1              394.0     715
1              2              394.0     562
1              1              212.154   559
```