Document HIP support
lindstro committed Dec 28, 2024
1 parent f3d31d7 commit e06901a
Showing 5 changed files with 132 additions and 59 deletions.
1 change: 1 addition & 0 deletions docs/source/defs.rst
@@ -41,3 +41,4 @@
.. |cpprelease| replace:: 1.0.0
.. |verrelease| replace:: 1.0.0
.. |vrdecrelease| replace:: 1.1.0
.. |hiprelease| replace:: 1.1.0
99 changes: 60 additions & 39 deletions docs/source/execution.rst
@@ -7,11 +7,14 @@
Parallel Execution
==================

As of |zfp| |omprelease|, parallel compression (but not decompression) is
supported on multicore processors via `OpenMP <http://www.openmp.org>`_
threads.
As of |zfp| |omprelease|, parallel compression is supported on multicore
processors via `OpenMP <http://www.openmp.org>`_ threads.
|zfp| |cudarelease| adds `CUDA <https://developer.nvidia.com/about-cuda>`_
support for fixed-rate compression and decompression on the GPU.
|zfp| |hiprelease| further adds support for
`HIP <https://rocm.docs.amd.com/projects/HIP/en/latest/>`_
and for fixed- and variable-rate parallel compression and decompression
for all three back-ends (OpenMP, CUDA, and HIP).

Since |zfp| partitions arrays into small independent blocks, a
large amount of data parallelism is inherent in the compression scheme that
@@ -40,10 +43,10 @@ Execution Policies

|zfp| supports multiple *execution policies*, which dictate how (e.g.,
sequentially, in parallel) and where (e.g., on the CPU or GPU) arrays are
compressed. Currently three execution policies are available:
``serial``, ``omp``, and ``cuda``. The default mode is
compressed. Currently four execution policies are available:
``serial``, ``omp``, ``cuda``, and ``hip``. The default mode is
``serial``, which ensures sequential compression on a single thread.
The ``omp`` and ``cuda`` execution policies allow for data-parallel
The ``omp``, ``cuda``, and ``hip`` execution policies allow for data-parallel
compression on multiple threads.

The execution policy is set by :c:func:`zfp_stream_set_execution` and
@@ -62,7 +65,7 @@ Execution Parameters

Each execution policy allows tailoring the execution via its associated
*execution parameters*. Examples include number of threads, chunk size,
scheduling, etc. The ``serial`` and ``cuda`` policies have no
scheduling, etc. The ``serial``, ``cuda``, and ``hip`` policies have no
parameters. The subsections below discuss the ``omp`` parameters.

Whenever the execution policy is changed via
@@ -216,6 +219,18 @@ The CUDA implementation has a number of limitations:
We expect to address these limitations over time.


Using HIP
---------

Support for HIP is available as of |zfp| |hiprelease|, allowing |zfp| to be
run in parallel on AMD GPUs. To enable support, the
:c:macro:`ZFP_WITH_HIP` macro must be set and |zfp| must be built with CMake.
See :c:macro:`ZFP_WITH_HIP` for further details.

The HIP implementation is based on the CUDA implementation, and therefore
the same :ref:`limitations <cuda-limitations>` apply.


Setting the Execution Policy
----------------------------

@@ -230,9 +245,10 @@ calling :c:func:`zfp_stream_set_execution`
}

before calling :c:func:`zfp_compress`. Replacing :code:`zfp_exec_omp`
with :code:`zfp_exec_cuda` enables CUDA execution. If OpenMP or CUDA is
disabled or not supported, then the return value of functions setting these
execution policies and parameters will indicate failure. Execution
with :code:`zfp_exec_cuda` enables CUDA execution. Similarly,
:code:`zfp_exec_hip` enables HIP execution. If the corresponding execution
policy is disabled or not supported, then the return value of functions
setting these policies and parameters will indicate failure. Execution
parameters are optional and may be set using the functions discussed above.
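
The fallback behavior described above can be sketched as follows. This is a minimal illustration, not code from |zfp|: the enum is a local copy of the documented ``zfp_exec_policy`` values so that the sketch compiles without ``libzfp``, and ``try_set_execution`` and ``select_policy`` are hypothetical helpers. ``try_set_execution`` stands in for :c:func:`zfp_stream_set_execution` and simulates success or failure from a bit mask of enabled policies.

```c
#include <stddef.h>

/* Local copy of the documented zfp_exec_policy values (assumption: kept
   here only so the sketch builds without zfp.h). */
typedef enum {
  zfp_exec_serial = 0, /* serial execution (default) */
  zfp_exec_omp    = 1, /* OpenMP multi-threaded execution */
  zfp_exec_cuda   = 2, /* CUDA parallel execution */
  zfp_exec_hip    = 3  /* HIP parallel execution */
} zfp_exec_policy;

/* Hypothetical stand-in for zfp_stream_set_execution(): returns nonzero if
   the policy's bit is set in 'available', zero otherwise. */
static int try_set_execution(unsigned available, zfp_exec_policy policy)
{
  return (int)((available >> (unsigned)policy) & 1u);
}

/* Return the first policy in 'preferred' that can be set; fall back to
   serial execution if none of them succeeds. */
static zfp_exec_policy select_policy(unsigned available,
                                     const zfp_exec_policy* preferred,
                                     size_t count)
{
  size_t i;
  for (i = 0; i < count; i++)
    if (try_set_execution(available, preferred[i]))
      return preferred[i];
  return zfp_exec_serial;
}
```

In real code, the call to ``try_set_execution`` would be replaced by a call to :c:func:`zfp_stream_set_execution` on the stream, whose return value indicates whether the requested policy was accepted.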

The source code for the |zfpcmd| command-line tool includes further examples
@@ -241,39 +257,42 @@ decompression in this tool, see the :option:`-x` command-line option.

.. note::
As of |zfp| |cudarelease|, the execution policy refers to both
compression and decompression. The OpenMP implementation does not
yet support decompression, and hence :c:func:`zfp_decompress` will
fail if the execution policy is not reset to :code:`zfp_exec_serial`
before calling the decompressor. Similarly, the CUDA implementation
supports only fixed-rate mode and will fail if other compression modes
are specified.
compression and decompression.

.. note::
As of |zfp| |vrdecrelease|, variable-rate compression modes are supported
for all execution policies, both for compression and decompression.
However, for parallel decompression, a block index must be provided that
encodes where in the compressed stream each block resides. See the section
on :ref:`parallel decompression <parallel-decompression>` for further
details.

The following table summarizes which execution policies are supported
with which :ref:`compression modes <modes>`:

+---------------------------------+---------+---------+---------+
| (de)compression mode            | serial  | OpenMP  | CUDA    |
+===============+=================+=========+=========+=========+
|               | expert          | |check| | |check| |         |
|               +-----------------+---------+---------+---------+
|               | fixed rate      | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+
| compression   | fixed precision | |check| | |check| |         |
|               +-----------------+---------+---------+---------+
|               | fixed accuracy  | |check| | |check| |         |
|               +-----------------+---------+---------+---------+
|               | reversible      | |check| | |check| |         |
+---------------+-----------------+---------+---------+---------+
|               | expert          | |check| | |check| |         |
|               +-----------------+---------+---------+---------+
|               | fixed rate      | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+
| decompression | fixed precision | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+
|               | fixed accuracy  | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+
|               | reversible      | |check| | |check| |         |
+---------------+-----------------+---------+---------+---------+
+---------------------------------+---------+---------+---------+---------+
| (de)compression mode            | serial  | OpenMP  | CUDA    | HIP     |
+===============+=================+=========+=========+=========+=========+
|               | expert          | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | fixed rate      | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
| compression   | fixed precision | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | fixed accuracy  | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | reversible      | |check| | |check| |         |         |
+---------------+-----------------+---------+---------+---------+---------+
|               | expert          | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | fixed rate      | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
| decompression | fixed precision | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | fixed accuracy  | |check| | |check| | |check| | |check| |
|               +-----------------+---------+---------+---------+---------+
|               | reversible      | |check| | |check| |         |         |
+---------------+-----------------+---------+---------+---------+---------+

:c:func:`zfp_compress` and :c:func:`zfp_decompress` both return zero if the
current execution policy is not supported for the requested compression
@@ -290,6 +309,8 @@ function in turn inspects the execution policy given by the
for executing compression.


.. _parallel-decompression:

Parallel Decompression
----------------------

9 changes: 5 additions & 4 deletions docs/source/high-level-api.rst
@@ -272,7 +272,7 @@ Types
::

typedef struct {
zfp_exec_policy policy; // execution policy (serial, omp, cuda, ...)
zfp_exec_policy policy; // execution policy (serial, omp, cuda, hip, ...)
void* params; // execution parameters
} zfp_execution;

@@ -287,14 +287,15 @@ Types

.. c:type:: zfp_exec_policy
Currently three execution policies are available: serial, OpenMP parallel,
and CUDA parallel.
Currently four execution policies are available: serial, OpenMP, CUDA, and
HIP.
::

typedef enum {
zfp_exec_serial = 0, // serial execution (default)
zfp_exec_omp = 1, // OpenMP multi-threaded execution
zfp_exec_cuda = 2 // CUDA parallel execution
zfp_exec_cuda = 2, // CUDA parallel execution
zfp_exec_hip = 3 // HIP parallel execution
} zfp_exec_policy;

----
69 changes: 59 additions & 10 deletions docs/source/installation.rst
@@ -241,17 +241,57 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,

.. c:macro:: ZFP_WITH_CUDA
CMake macro for enabling or disabling CUDA support for
GPU compression and decompression. When enabled, CUDA and a compatible
host compiler must be installed. For a full list of compatible compilers,
CMake macro for enabling or disabling CUDA support for GPU compression and
decompression. When enabled, CUDA 11.0 or later and a compatible host
compiler must be installed. For a full list of compatible compilers,
please consult the
`NVIDIA documentation <https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/>`__.
If a CUDA installation is in the user's path, it will be
automatically found by CMake. Alternatively, the CUDA binary directory
can be specified using the :envvar:`CUDA_BIN_DIR` environment variable.
If a CUDA installation is in the user's path, it will be automatically found
by CMake. See also :c:macro:`CMAKE_CUDA_ARCHITECTURES`.
CMake default: off.
GNU make default: off and ignored.


.. c:macro:: CMAKE_CUDA_ARCHITECTURES
`CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_CUDA_ARCHITECTURES.html>`__
for optionally specifying which
`NVIDIA GPU architectures <https://arnon.dk/matching-sm-architectures-arch-and-gencode-for-various-nvidia-cards/>`__
to build for. Use a semicolon-separated list of architectures to override
the default, e.g., ``35;50;72`` generates code for compute capabilities
3.5, 5.0, and 7.2. Set to ``all`` to build for all supported architectures.
CMake default: compiler specific.
GNU make default: ignored.

.. note::
Setting ``CMAKE_CUDA_ARCHITECTURES=all`` makes it possible to use a single
binary across multiple architectures. However, this option can significantly
increase the build time and size of ``libzfp``.


.. c:macro:: ZFP_WITH_HIP
CMake macro for enabling or disabling HIP support for GPU compression and
decompression. If a HIP installation is in the user's path, it will be
automatically found by CMake. Alternatively, one may set the environment
variable :envvar:`HIP_PATH` to point to the HIP installation. Some
platforms further require setting ``CMAKE_C_COMPILER=hipcc`` and
``CMAKE_CXX_COMPILER=hipcc``. See also :c:macro:`CMAKE_HIP_ARCHITECTURES`.
CMake default: off.
GNU make default: off and ignored.


.. c:macro:: CMAKE_HIP_ARCHITECTURES
`CMake macro <https://cmake.org/cmake/help/latest/variable/CMAKE_HIP_ARCHITECTURES.html>`__
for optionally specifying which
`AMD GPU architectures <https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html>`__
to build for. Use a semicolon-separated list of architectures to override
the default, e.g., ``gfx900;gfx908``.
CMake default: compiler specific.
GNU make default: ignored.


.. _rounding:
.. c:macro:: ZFP_ROUNDING_MODE
@@ -279,6 +319,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
:code:`serial` and :code:`omp` :ref:`execution policies <execution>`.
Default: :code:`ZFP_ROUND_NEVER`.


.. c:macro:: ZFP_WITH_TIGHT_ERROR
**Experimental feature**. When enabled, this feature takes advantage of the
@@ -293,6 +334,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
:ref:`execution policies <execution>`.
Default: undefined/off.


.. c:macro:: ZFP_WITH_DAZ
When enabled, blocks consisting solely of subnormal floating-point numbers
@@ -312,6 +354,7 @@ in the same manner that :ref:`build targets <targets>` are specified, e.g.,
:code:`omp`.
Default: undefined/off.


.. c:macro:: ZFP_WITH_ALIGNED_ALLOC
Use aligned memory allocation in an attempt to align compressed blocks
@@ -398,8 +441,8 @@ in the sections below.
CMake
^^^^^

CMake builds require version 3.9 or later. CMake is available
`here <https://cmake.org>`__.
CPU-only CMake builds require version 3.9 or later; see below for GPU build
requirements. CMake is available `here <https://cmake.org>`__.

OpenMP
^^^^^^
@@ -409,8 +452,14 @@ OpenMP support requires OpenMP 2.0 or later.
CUDA
^^^^

CUDA support requires CUDA 7.0 or later, CMake, and a compatible host
compiler (see :c:macro:`ZFP_WITH_CUDA`).
CUDA support requires CUDA 11.0 or later, CMake 3.23 or later, and a
compatible host compiler (see :c:macro:`ZFP_WITH_CUDA`).

HIP
^^^

HIP support requires ROCm 4.0 or later, CMake 3.21 or later, and a
compatible host compiler (see :c:macro:`ZFP_WITH_HIP`).

C/C++
^^^^^
13 changes: 7 additions & 6 deletions docs/source/zfpcmd.rst
@@ -232,7 +232,8 @@ Execution parameters
:code:`-x omp=threads,chunk_size` to specify the chunk size in number
of blocks (see also :c:func:`zfp_stream_set_omp_chunk_size`). A
chunk size of zero is ignored and results in the default size.
Use :code:`-x cuda` to for parallel CUDA compression and decompression.
Use :code:`-x cuda` or :code:`-x hip` for parallel CUDA or HIP
compression and decompression, respectively.

As of |cudarelease|, the execution policy applies to both compression
and decompression. If the execution policy is not supported for
@@ -245,9 +246,9 @@ Block Index
^^^^^^^^^^^

A block index is needed to support variable-rate decompression using any
of the parallel execution policies (OpenMP and CUDA). This index must be
captured and stored to file during compression and later accessed prior to
decompression.
of the parallel execution policies (OpenMP, CUDA, and HIP). This index
must be captured and stored to file during compression and later accessed
prior to decompression.

.. option:: -m <path>

@@ -258,8 +259,8 @@ decompression.

Block index type ("offset" or "hybrid") and granularity in number of blocks
per index entry. A granularity of one provides the highest flexibility and
performance potential (especially for CUDA) but also the highest storage
cost.
performance potential (especially for CUDA and HIP) but also the highest
storage cost.

See the :ref:`hl-func-index` section for further details.
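
To make the granularity trade-off concrete, the following sketch builds a simple offset index from per-block compressed sizes. It is illustrative only and does not reproduce the index format used by |zfp|: ``build_offset_index`` and its parameters are hypothetical names, and block sizes are assumed to be given in bits. One entry is stored for every ``granularity`` blocks, so a granularity of one records the position of every block at the highest storage cost.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: record the bit offset of every 'granularity'-th
   block in a compressed stream. 'block_bits[i]' is the compressed size of
   block i in bits; 'granularity' must be at least one. Returns the number
   of index entries written to 'index'. */
static size_t build_offset_index(const uint32_t* block_bits, size_t blocks,
                                 size_t granularity, uint64_t* index)
{
  uint64_t offset = 0; /* running bit offset of the current block */
  size_t entries = 0;
  size_t i;
  for (i = 0; i < blocks; i++) {
    if (i % granularity == 0)
      index[entries++] = offset;
    offset += block_bits[i];
  }
  return entries;
}
```

With a coarser granularity, a parallel decompressor can jump directly only to the first block of each group and must decode the remaining blocks in that group sequentially, which is why finer granularity offers more parallelism.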
