Some operations can be implemented using more than one library or more than one technique. For example, a GEMM can be implemented for CUDA or ROCm using either the cublas/cublasLt libraries or the hipblas/hipblasLt libraries, respectively. How does one know which implementation is the fastest and should be chosen? TunableOp answers that question. Certain operators have been implemented using multiple strategies as Tunable Operators. At runtime, all strategies are profiled and the fastest is selected for all subsequent operations.
See the :doc:`documentation <cuda.tunable>` for information on how to use it.
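The profile-then-select idea can be illustrated with a minimal pure-Python sketch. This is not how TunableOp is implemented, and it does not use the PyTorch API: the class ``TunableOperator``, its ``key_fn`` and ``repeats`` parameters, and the two toy matrix-multiply functions are all hypothetical names chosen for illustration. The sketch assumes that a strategy is tuned once per input shape and that the winner is cached for later calls with the same shape.

```python
import time
from typing import Callable, Dict, Hashable


class TunableOperator:
    """Hypothetical illustration: time every candidate once per key, then reuse the winner."""

    def __init__(self, candidates: Dict[str, Callable], key_fn: Callable[..., Hashable], repeats: int = 3):
        self.candidates = dict(candidates)
        self.key_fn = key_fn          # maps call arguments to a tuning key (here: the shapes)
        self.repeats = repeats        # timed runs per candidate
        self.selected: Dict[Hashable, str] = {}  # tuning key -> name of the chosen candidate

    def _profile(self, *args) -> str:
        best_name, best_time = None, float("inf")
        for name, fn in self.candidates.items():
            start = time.perf_counter()
            for _ in range(self.repeats):
                fn(*args)
            elapsed = (time.perf_counter() - start) / self.repeats
            if elapsed < best_time:
                best_name, best_time = name, elapsed
        return best_name

    def __call__(self, *args):
        key = self.key_fn(*args)
        if key not in self.selected:
            # First call for this key: profile all strategies and remember the fastest.
            self.selected[key] = self._profile(*args)
        return self.candidates[self.selected[key]](*args)


def matmul_loops(a, b):
    """Toy strategy 1: explicit triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                out[i][j] += a[i][p] * b[p][j]
    return out


def matmul_zip(a, b):
    """Toy strategy 2: row-by-column dot products over the transposed right operand."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


gemm = TunableOperator(
    {"loops": matmul_loops, "zip": matmul_zip},
    key_fn=lambda a, b: (len(a), len(b), len(b[0])),
)

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result = gemm(a, b)        # profiles both strategies for shape (2, 2, 2), then runs the winner
result_again = gemm(a, b)  # reuses the cached choice without profiling again
```

Which strategy wins depends on the machine and the input, so the sketch records the choice in ``gemm.selected`` rather than assuming one. Keying the cache on the input shape is a design choice of this sketch; the actual TunableOp usage and tuning behavior are described in the linked documentation.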
CUDA Sanitizer is a prototype tool for detecting synchronization errors between streams in PyTorch. See the :doc:`documentation <cuda._sanitizer>` for information on how to use it.
