From e06901a01b47e81718f78a6d1377ee02aaf12f9a Mon Sep 17 00:00:00 2001 From: lindstro Date: Fri, 27 Dec 2024 23:36:15 -0800 Subject: [PATCH] Document HIP support --- docs/source/defs.rst | 1 + docs/source/execution.rst | 99 ++++++++++++++++++++-------------- docs/source/high-level-api.rst | 9 ++-- docs/source/installation.rst | 69 ++++++++++++++++++++---- docs/source/zfpcmd.rst | 13 ++--- 5 files changed, 132 insertions(+), 59 deletions(-) diff --git a/docs/source/defs.rst b/docs/source/defs.rst index 93d1150ff..8713a9a8e 100644 --- a/docs/source/defs.rst +++ b/docs/source/defs.rst @@ -41,3 +41,4 @@ .. |cpprelease| replace:: 1.0.0 .. |verrelease| replace:: 1.0.0 .. |vrdecrelease| replace:: 1.1.0 +.. |hiprelease| replace:: 1.1.0 diff --git a/docs/source/execution.rst b/docs/source/execution.rst index fb02c4bcc..e9acac3fc 100644 --- a/docs/source/execution.rst +++ b/docs/source/execution.rst @@ -7,11 +7,14 @@ Parallel Execution ================== -As of |zfp| |omprelease|, parallel compression (but not decompression) is -supported on multicore processors via `OpenMP `_ -threads. +As of |zfp| |omprelease|, parallel compression is supported on multicore +processors via `OpenMP `_ threads. |zfp| |cudarelease| adds `CUDA `_ support for fixed-rate compression and decompression on the GPU. +|zfp| |hiprelease| further adds support for +`HIP `_ +and for fixed- and variable-rate parallel compression and decompression +for all three back-ends (OpenMP, CUDA, and HIP). Since |zfp| partitions arrays into small independent blocks, a large amount of data parallelism is inherent in the compression scheme that @@ -40,10 +43,10 @@ Execution Policies |zfp| supports multiple *execution policies*, which dictate how (e.g., sequentially, in parallel) and where (e.g., on the CPU or GPU) arrays are -compressed. Currently three execution policies are available: -``serial``, ``omp``, and ``cuda``. The default mode is +compressed. 
Currently four execution policies are available: +``serial``, ``omp``, ``cuda``, and ``hip``. The default mode is ``serial``, which ensures sequential compression on a single thread. -The ``omp`` and ``cuda`` execution policies allow for data-parallel +The ``omp``, ``cuda``, and ``hip`` execution policies allow for data-parallel compression on multiple threads. The execution policy is set by :c:func:`zfp_stream_set_execution` and @@ -62,7 +65,7 @@ Execution Parameters Each execution policy allows tailoring the execution via its associated *execution parameters*. Examples include number of threads, chunk size, -scheduling, etc. The ``serial`` and ``cuda`` policies have no +scheduling, etc. The ``serial``, ``cuda``, and ``hip`` policies have no parameters. The subsections below discuss the ``omp`` parameters. Whenever the execution policy is changed via @@ -216,6 +219,18 @@ The CUDA implementation has a number of limitations: We expect to address these limitations over time. +Using HIP +--------- + +Support for HIP is available as of |zfp| |hiprelease|, allowing |zfp| to be +run in parallel on AMD GPUs. To enable support, the +:c:macro:`ZFP_WITH_HIP` macro must be set and |zfp| must be built with CMake. +See :c:macro:`ZFP_WITH_HIP` for further details. + +The HIP implementation is based on the CUDA implementation, and therefore +the same :ref:`limitations ` apply. + + Setting the Execution Policy ---------------------------- @@ -230,9 +245,10 @@ calling :c:func:`zfp_stream_set_execution` } before calling :c:func:`zfp_compress`. Replacing :code:`zfp_exec_omp` -with :code:`zfp_exec_cuda` enables CUDA execution. If OpenMP or CUDA is -disabled or not supported, then the return value of functions setting these -execution policies and parameters will indicate failure. Execution +with :code:`zfp_exec_cuda` enables CUDA execution. Similarly, +:code:`zfp_exec_hip` enables HIP execution.
If the corresponding execution +policy is disabled or not supported, then the return value of functions +setting these policies and parameters will indicate failure. Execution parameters are optional and may be set using the functions discussed above. The source code for the |zfpcmd| command-line tool includes further examples @@ -241,39 +257,42 @@ decompression in this tool, see the :option:`-x` command-line option. .. note:: As of |zfp| |cudarelease|, the execution policy refers to both - compression and decompression. The OpenMP implementation does not - yet support decompression, and hence :c:func:`zfp_decompress` will - fail if the execution policy is not reset to :code:`zfp_exec_serial` - before calling the decompressor. Similarly, the CUDA implementation - supports only fixed-rate mode and will fail if other compression modes - are specified. + compression and decompression. + +.. note:: + As of |zfp| |vrdecrelease|, variable-rate compression modes are supported + for all execution policies, both for compression and decompression. + However, for parallel decompression, a block index must be provided that + encodes where in the compressed stream each block resides. See the section + on :ref:`parallel decompression ` for further + details. 
 The following table summarizes which execution policies are supported with
 which :ref:`compression modes `:
 
-  +---------------------------------+---------+---------+---------+
-  | (de)compression mode            | serial  | OpenMP  | CUDA    |
-  +===============+=================+=========+=========+=========+
-  |               | expert          | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed rate      | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  | compression   | fixed precision | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed accuracy  | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | reversible      | |check| | |check| |         |
-  +---------------+-----------------+---------+---------+---------+
-  |               | expert          | |check| | |check| |         |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed rate      | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  | decompression | fixed precision | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  |               | fixed accuracy  | |check| | |check| | |check| |
-  |               +-----------------+---------+---------+---------+
-  |               | reversible      | |check| | |check| |         |
-  +---------------+-----------------+---------+---------+---------+
+  +---------------------------------+---------+---------+---------+---------+
+  | (de)compression mode            | serial  | OpenMP  | CUDA    | HIP     |
+  +===============+=================+=========+=========+=========+=========+
+  |               | expert          | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed rate      | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  | compression   | fixed precision | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed accuracy  | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | reversible      | |check| | |check| |         |         |
+  +---------------+-----------------+---------+---------+---------+---------+
+  |               | expert          | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed rate      | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  | decompression | fixed precision | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | fixed accuracy  | |check| | |check| | |check| | |check| |
+  |               +-----------------+---------+---------+---------+---------+
+  |               | reversible      | |check| | |check| |         |         |
+  +---------------+-----------------+---------+---------+---------+---------+
 
 :c:func:`zfp_compress` and :c:func:`zfp_decompress` both return zero if the
 current execution policy is not supported for the requested compression
@@ -290,6 +309,8 @@ function in turn inspects the execution policy given by the
 for executing compression.
 
 
+.. _parallel-decompression:
+
 Parallel Decompression
 ----------------------
 
diff --git a/docs/source/high-level-api.rst b/docs/source/high-level-api.rst
index 62d88f7df..6c328a6af 100644
--- a/docs/source/high-level-api.rst
+++ b/docs/source/high-level-api.rst
@@ -272,7 +272,7 @@ Types
   ::
 
     typedef struct {
-      zfp_exec_policy policy; // execution policy (serial, omp, cuda, ...)
+      zfp_exec_policy policy; // execution policy (serial, omp, cuda, hip, ...)
       void* params;           // execution parameters
     } zfp_execution;
 
@@ -287,14 +287,15 @@ Types
 
 .. c:type:: zfp_exec_policy
 
-  Currently three execution policies are available: serial, OpenMP parallel,
-  and CUDA parallel.
+  Currently four execution policies are available: serial, OpenMP, CUDA, and
+  HIP.
:: typedef enum { zfp_exec_serial = 0, // serial execution (default) zfp_exec_omp = 1, // OpenMP multi-threaded execution - zfp_exec_cuda = 2 // CUDA parallel execution + zfp_exec_cuda = 2, // CUDA parallel execution + zfp_exec_hip = 3 // HIP parallel execution } zfp_exec_policy; ---- diff --git a/docs/source/installation.rst b/docs/source/installation.rst index 4598024d2..4a63b169a 100644 --- a/docs/source/installation.rst +++ b/docs/source/installation.rst @@ -241,17 +241,57 @@ in the same manner that :ref:`build targets ` are specified, e.g., .. c:macro:: ZFP_WITH_CUDA - CMake macro for enabling or disabling CUDA support for - GPU compression and decompression. When enabled, CUDA and a compatible - host compiler must be installed. For a full list of compatible compilers, + CMake macro for enabling or disabling CUDA support for GPU compression and + decompression. When enabled, CUDA 11.0 or later and a compatible host + compiler must be installed. For a full list of compatible compilers, please consult the `NVIDIA documentation `__. - If a CUDA installation is in the user's path, it will be - automatically found by CMake. Alternatively, the CUDA binary directory - can be specified using the :envvar:`CUDA_BIN_DIR` environment variable. + If a CUDA installation is in the user's path, it will be automatically found + by CMake. See also :c:macro:`CMAKE_CUDA_ARCHITECTURES`. CMake default: off. GNU make default: off and ignored. + +.. c:macro:: CMAKE_CUDA_ARCHITECTURES + + `CMake macro `__ + for optionally specifying which + `NVIDIA GPU architectures `__ + to build for. Use a semicolon separated list of architectures to override + the default, e.g., ``35;50;72`` generates code for compute capabilities + 3.5, 5.0, and 7.2. Set to ``all`` to build for all supported architectures. + CMake default: compiler specific. + GNU make default: ignored. + +.. note:: + Setting ``CMAKE_CUDA_ARCHITECTURES=all`` makes it possible to use a single + binary across multiple architectures. 
However, this option can significantly + increase the build time and size of ``libzfp``. + + +.. c:macro:: ZFP_WITH_HIP + + CMake macro for enabling or disabling HIP support for GPU compression and + decompression. If a HIP installation is in the user's path, it will be + automatically found by CMake. Alternatively, one may set the environment + variable :envvar:`HIP_PATH` to point to the HIP installation. Some + platforms further require setting ``CMAKE_C_COMPILER=hipcc`` and + ``CMAKE_CXX_COMPILER=hipcc``. See also :c:macro:`CMAKE_HIP_ARCHITECTURES`. + CMake default: off. + GNU make default: off and ignored. + + +.. c:macro:: CMAKE_HIP_ARCHITECTURES + + `CMake macro `__ + for optionally specifying which + `AMD GPU architectures `__ + to build for. Use a semicolon separated list of architectures to override + the default, e.g., ``gfx900;gfx908``. + CMake default: compiler specific. + GNU make default: ignored. + + .. _rounding: .. c:macro:: ZFP_ROUNDING_MODE @@ -279,6 +319,7 @@ in the same manner that :ref:`build targets ` are specified, e.g., :code:`serial` and :code:`omp` :ref:`execution policies `. Default: :code:`ZFP_ROUND_NEVER`. + .. c:macro:: ZFP_WITH_TIGHT_ERROR **Experimental feature**. When enabled, this feature takes advantage of the @@ -293,6 +334,7 @@ in the same manner that :ref:`build targets ` are specified, e.g., :ref:`execution policies `. Default: undefined/off. + .. c:macro:: ZFP_WITH_DAZ When enabled, blocks consisting solely of subnormal floating-point numbers @@ -312,6 +354,7 @@ in the same manner that :ref:`build targets ` are specified, e.g., :code:`omp`. Default: undefined/off. + .. c:macro:: ZFP_WITH_ALIGNED_ALLOC Use aligned memory allocation in an attempt to align compressed blocks @@ -398,8 +441,8 @@ in the sections below. CMake ^^^^^ -CMake builds require version 3.9 or later. CMake is available -`here `__. +CPU-only CMake builds require version 3.9 or later; see below for GPU build +requirements. CMake is available `here `__. 
OpenMP ^^^^^^ @@ -409,8 +452,14 @@ OpenMP support requires OpenMP 2.0 or later. CUDA ^^^^ -CUDA support requires CUDA 7.0 or later, CMake, and a compatible host -compiler (see :c:macro:`ZFP_WITH_CUDA`). +CUDA support requires CUDA 11.0 or later, CMake 3.23 or later, and a +compatible host compiler (see :c:macro:`ZFP_WITH_CUDA`). + +HIP +^^^ + +HIP support requires ROCm 4.0 or later, CMake 3.21 or later, and a +compatible host compiler (see :c:macro:`ZFP_WITH_HIP`). C/C++ ^^^^^ diff --git a/docs/source/zfpcmd.rst b/docs/source/zfpcmd.rst index 51afd7dde..c0a848144 100644 --- a/docs/source/zfpcmd.rst +++ b/docs/source/zfpcmd.rst @@ -232,7 +232,8 @@ Execution parameters :code:`-x omp=threads,chunk_size` to specify the chunk size in number of blocks (see also :c:func:`zfp_stream_set_omp_chunk_size`). A chunk size of zero is ignored and results in the default size. - Use :code:`-x cuda` to for parallel CUDA compression and decompression. + Use :code:`-x cuda` or :code:`-x hip` for parallel CUDA or HIP + compression and decompression, respectively. As of |cudarelease|, the execution policy applies to both compression and decompression. If the execution policy is not supported for @@ -245,9 +246,9 @@ Block Index ^^^^^^^^^^^ A block index is needed to support variable-rate decompression using any -of the parallel execution policies (OpenMP and CUDA). This index must be -captured and stored to file during compression and later accessed prior to -decompression. +of the parallel execution policies (OpenMP, CUDA, and HIP). This index +must be captured and stored to file during compression and later accessed +prior to decompression. .. option:: -m @@ -258,8 +259,8 @@ decompression. Block index type ("offset" or "hybrid") and granularity in number of blocks per index entry. A granularity of one provides the highest flexibility and - performance potential (especially for CUDA) but also the highest storage - cost. 
+ performance potential (especially for CUDA and HIP) but also the highest + storage cost. See the :ref:`hl-func-index` section for further details.