GrCUDA exposes existing GPU kernels and host functions that accept device objects as parameters to all GraalVM languages through the Truffle Interop Library. GrCUDA represents device objects as flat arrays of primitive types and makes them accessible as array-like TruffleObjects to other Truffle languages.
GrCUDA itself is an expression-oriented language. Most expressions in GrCUDA, however, simply evaluate to function objects that can then be used in the host languages.
Contents:
- Device Arrays
- Function Invocations
- Callables
- Function Namespaces
- Kernel and Host Function Signatures
- Built-in Functions
Device arrays are flat multi-dimensional arrays of primitive types.
The arrays are can be accessed from GPU kernels and native host functions that accept GPU pointers as parameter.
The device arrays can also be accessed from host languages through the array constructs available in these host languages.
The memory for the device array is CUDA-managed memory.
Device arrays can be allocated through array allocation expressions and through the built-in DeviceArray
constructor function (see built-in functions).
Device arrays are tied to garbage collector of the VM. Their underlying memory is allocated off-heap through the CUDA Runtime. Only a stub object containing the pointer to the off-heap memory, type, and size information etc, is kept on-heap. Therefore, the heap utilization does not reflect the off-heap utilization. A large device array does not have a large on-heap footprint. The garbage collection pass may not be initiated even though the memory utilization (off-heap) is high.
Just as for other native bindings, GrCUDA is not able to prevent GPU kernels or host functions from capturing pointers to device array objects or their elements. Because device arrays are managed by the garbage collector, capturing of references in native code can potentially lead to dangling references.
The CUDA-managed memory of the device array can be freed explicitly by calling the free()
method of DeviceArray
.
This will release the allocated memory through cudaFree()
.
Once the underlying memory is freed, the DeviceArray
enters a defunct state.
All subsequent accesses with throw an exception.
The Boolean property isMemoryFreed
of DeviceArray
can be checked whether the device array's memory buffer has already be freed.
Device arrays are allocated using a syntax that is similar to arrays C/C++.
Multi-dimensional arrays use row-major (C style) layout.
Use DeviceArray(type, size, 'F')
to create arrays with column-major (Fortran style) order.
The array sizes must be compile-time constants, i.e., either an integer literal or a constant expression. One way to define the array size in array allocation expressions from host languages is through string interpolation (e.g., template strings in JavaScript).
double[100] // array of 10 double-precision elements
int[10 * 100] // array of 1,000 32-bit integer elements
float[1000][28][28] // array of 1000 x 28 x 28 float elements
The supported data types are:
GrCUDA Type | Truffle (Java) Type | Compatible C++ Types |
---|---|---|
boolean |
boolean |
bool , unsigned char |
char |
byte |
char , signed char |
short |
short |
short |
int |
int |
int |
long |
long |
long , long long |
float |
float |
float |
double |
double |
double |
The polyglot expression returns a DeviceArray
object for one-dimensional arrays or a MultDimDeviceArray
for a multi-dimensional array to the host.
Example in Python:
device_array = polyglot.eval(language='grcuda', string='double[5]')
device_array[2] = 1.0
matrix = polyglot.eval(language='grcuda', string='float[28][28]')
matrix[4][3] = 42.0
Example in R:
Note that arrays are 1-indexed in R.
device_array <- eval.polyglot('grcuda', 'double[5]')
device_array[2:3] <- c(1.0, 2.0)
matrix <- eval.polyglot'grcuda', 'float[28][28]')
matrix[5, 4] <- 42.0
Example in NodeJS/JavaScript:
device_array = Polyglot.eval('grcuda', 'double[5]')
device_array[2] = 1.0
matrix = Polyglot.eval('grcuda', 'float[28][28]')
matrix[4][3] = 42.0
Example in Ruby:
device_array = Polyglot.eval('grcuda', 'double[5]')
device_array[2] = 1.0
matrix = Polyglot.eval('grcuda', 'float[28]28]')
matrix[4][3] = 42.0
For Java and the C-binding, element access is exposed through a function call syntax.
Example in Java:
Value deviceArray = polyglot.eval("grcuda", "double[5]");
double oldValue = (Double) deviceArray.getArrayElement(2);
deviceArray.setArrayElement(1, 1.0);
Value matrix = polyglot.eval("grcuda", "float[28][28]");
matrix.getArrayElement(4).setArrayElement(3, 42.0);
Function invocations are evaluated inside GrCUDA, then the return values are passed back to the host language. The argument expressions that can be used in GrCUDA language strings are currently limited to literals and constant integer expressions. For general function invocation, first create a GrCUDA expression that returns a callable back to the host language and then invoke the callable from the host language.
Example in JavaScript:
// cudaDeviceReset() does not return a value back to JavaScript
Polyglot.eval('grcuda','cudaDeviceReset()')
// DeviceArray("float", 1000) returns a device array containing 1,000 floats
const deviceArray = Polyglot.eval('grcuda', 'DeviceArray("float", 1000)')
Callables are Truffle objects that can be invoked from the host language. A common usage pattern is to submit polyglot expression that return callables. Identifiers are inside a namespace. CUDA Functions reside in the root namespace (see built-in functions).
For device arrays and GPU pointers that are passed as arguments to callables, GrCUDA automatically passes the underlying pointers to the native host or kernel functions.
Example in Python:
cudaDeviceReset = polyglot.eval(language='grcuda', string='cudaDeviceReset')
cudaDeviceReset()
cudaMalloccudaFree = polyglot.eval(language='grcuda', string='cudaFree')
mem = polyglot.eval(language='grcuda', string='cudaMalloc(100)')
cudaFree(mem) # free the memory
createDeviceArray = polyglot.eval(language='grcuda', String='DeviceArray')
in_ints = createDeviceArray('int', 100)
out_ints = createDeviceArray('int', 100)
bind = polyglot.eval(language='grcuda', string='bind')
# bind to symbol of native host function, returns callable
inc = bind('<path-to-library>/libinc.so',
'increment(in_arr: in pointer sint32 , out_arr: out pointer sint32, n: uint64): sint32')
inc(in_ints, out_ints, 100)
bindkernel = polyglot.eval(language='grcuda', string='bindkernel')
# bind to symbol of kernel function from binary, returns callable
inc_kernel = bindkernel("inc_kernel.cubin",
'inc_kernel(in_arr: in pointer sint32 , out_arr: out pointer sint32, n: uint64)')
inc_kernel(160, 128)(out_ints, in_ints, 100)
GrCUDA organizes functions in a hierarchical namespace, i.e. namespaces can be nested. The user can register additional kernel and host function into this namespace. The registration is not persisted across instantiations of the GrCUDA language context.
The root namespace is called CU
.
It can be received through a polyglot expression.
Example in Python:
import polyglot
cu = polyglot.eval(language='grcuda', string='CU')
GrCUDA needs to know the signature of native kernel functions and native host functions to correctly pass arguments.
The signature is expressed in NIDL (Native Interface Definition Language) syntax.
The NIDL specification is used in bind()
for host functions and in bindkernel()
for kernel functions.
Multiple host functions or multiple kernel function specifications can be combined into one single .nidl
file, which can then be passed to bindall()
.
This call registers the bindings to all listed functions in the GrCUDA namespace.
The NIDL signature contains the name of the function and the parameters with their types.
Host functions also have a return type.
Since GPU kernel functions always return void
, the return type is omitted for kernel functions in the NIDL syntax.
The parameters are primitive types passed by-value or are pointers to primitive values.
The parameter names can be chosen arbitrarily.
The parameter names are used to improve error messages.
They do not affect the execution.
A complete list of all supported NIDL types and their mapping to Truffle and C++ types can be found in the NIDL type mapping document.
Pointers are used to refer to device arrays.
Device array objects passed as arguments to kernels or host functions automatically decay into pointers.
Pointer are typed and have a direction <direction> pointer <element type>
.
Allowed keywords for pointer directions are in
, out
, and inout
.
NIDL | C++ |
---|---|
in pointer T |
const T* |
out pointer T |
T* |
inout pointer T |
T* |
With the async
execution policy and the with-const
DependencyPolicy, using the const
and in
signals that the input will not be modified by the kernel.
As such, the scheduler will optimize its execution, for example by overlapping it with other kernels that access the same data but do not modify them.
GrCUDA does not currently check that the arrays are actually unmodified!
It is responsibility of the user to use const/in
correctly.
Example Host Function Signatures:
saxpy(n: sint32, alpha: float, xs: in pointer float, ys: inout pointer float): void
The signature declares a host function saxpy
that takes a signed 32-bit int n
, a
single precision floating point value alpha
, a device memory pointer xs
to
constant float, a device memory pointer ys
to float. The function does not return a value and,
therefore, has the return type void
. GrCUDA looks for a C-style function definition
in the shared library, i.e., it searches for the symbol saxpy
.
A matching function would be defined in C++ code as:
extern "C"
void saxpy(int n, float alpha, const float* xs, float *ys) { ... }
If the function is implemented in C++ without C-style export, its symbol name is mangled.
In this case, prefix the function definition with the keyword cxx
which will instruct
GrCUDA to search for a C++ mangled symbol name.
cxx predict(samples: in pointer double, labels: out pointer sint32, num_samples: uint32, weights: in pointer float, num_weights: uint32): sint32
This function returns a signed 32-bit integer. Due to the cxx
keyword, GrCUDA looks for the
symbol with the mangled name _Z7predictPKdPijPKfj
in the shared library.
C++ namespaces are also supported. The namespaces can be specified (using ::
).
namespace aa {
namespace bb {
namespace cc {
float foo(int n, const int* in_arr, int* out_arr) { ... }
}
}
}
The NIDL signature that matches to the resulting mangled symbol name
_ZN2aa2bb2cc3fooEiPKiPi
of the function is:
cxx aa:bb:cc::foo(n: sint32, in_arr: in pointer sint32, out_arr: out pointer sint32): float
Example Kernel Signatures:
update_kernel(gradient: in pointer float, weights: inout pointer float, num_weights: uint32)
This signature declaration refers to a kernel that takes a pointer gradient
to constant float,
a pointer weights
to float, and a value num_weights
as an unsigned 32-bit int.
Note that there return type specification is missing because CUDA kernel functions do not return
any value. GrCUDA looks for the function symbol update_kernel
in the cubin or PTX file.
nvcc
uses C++ name mangling by default. The update_kernel
function would therefore have to be
defined as extern "C"
in the CUDA source code.
GrCUDA can be instructed search for the C++ mangled name by adding cxx
keyword.
cxx predict_kernel(samples: in pointer double, labels: out pointer sint32, num_samples: uint32, weights: in pointer float, num_weights: uint32)
GrCUDA then searches for the symbol _Z14predict_kernelPKdPijPKfj
.
Multiple declaration of host and kernel functions can be specified in a NIDL file and bound into GrCUDA namespace in one single step.
The functions are collected in binding groups within { }
. The following binding groups
exist:
Binding Group | Syntax |
---|---|
host functions in C++ namespace | hostfuncs aa:bb:cc { ... } |
C++ host functions w/o namespace | hostfuncs { ... } |
C-style host functions | chostfuncs { ... } |
kernel functions in C++ namespace | kernels aa:bb:cc { ... } |
kernel functions w/o namespace | kernels { ... } |
C-style kernel functions | ckernels { ... } |
For C++ host functions and kernels the cxx
prefix in the function signature is not used,
since it is already defined in the binding group.
Example Binding Host Function from NIDL File:
$ cat foo_host_bindings.nidl
// host functions with C++ symbols in namespace
hostfuncs aa::bb::cc {
// _ZN2aa2bb2cc20incr_outofplace_hostEPKiPfi
incr_outofplace_host(in_dev_arr: in pointer sint32, out_dev_arr: out pointer float, num_elements: sint32): sint32
// _ZN2aa2bb2cc19gpu_incr_inplaceEPfi
incr_inplace_host(dev_arr: inout pointer float, num_elements: sint32): void
}
// host function with C++ symbols without namespace
hostfuncs {
// _Z21cxx_incr_inplace_hostPfi
cxx_incr_inplace_host(dev_arr: inout pointer float, num_elements: sint32): void
}
// host function with C symbol (extern "C")
chostfuncs {
// c_incr_inplace_host
c_incr_inplace_host(dev_arr: inout pointer float, num_elements: sint32): void
}
The following example in JavaScript uses the foo_host_bindings.nidl
file in bindall()
.
The function are registered in to myfoo
namespace which is directly under the
root namespace. The function executed is aa::bb:cc::incr_inplace_host()
in the
shared library libfoo.so
.
const cu = Polyglot.eval('grcuda', 'CU')
// host all host functions from libfoo.so into 'myfoo' namespace.
cu.bindall('myfoo', 'libfoo.so', 'foo_host_bindings.nidl')
...
// invoke functions in myfoo namespace
cu.myfoo.incr_inplace_host(deviceArray, deviceArray.length)
cu.myfoo.cxx_incr_inplace_host(deviceArray, deviceArray.length)
cu.myfoo.c_incr_inplace_host(deviceArray, deviceArray.length)
Example Binding Kernel Function from NIDL File:
$ cat foo_kernel_bindings.nidl
// kernels with C++ symbols in namespace
kernels aa::bb::cc {
// _ZN2aa2bb2cc22incr_outofplace_kernelEPKiPfi
incr_outofplace_kernel(in_arr: in pointer sint32, out_arr: out pointer float,
num_elements: sint32)
// _ZN2aa2bb2cc19incr_inplace_kernelEPfi
incr_inplace_kernel(arr: inout pointer float, num_elements: sint32)
}
// kernel with C++ symbol without namespace
kernels {
// _Z23cxx_incr_inplace_kernelPfi
cxx_incr_inplace_kernel(arr: inout pointer float, num_elements: sint32)
}
// kernel with C symbol (extern "C")
ckernels {
// c_incr_inplace
c_incr_inplace_kernel(arr: inout pointer float, num_elements: sint32)
}
The following example in JavaScript uses the foo_kernel_bindings.nidl
file in bindall()
.
The kernel functions are registered in to myfoo
namespace which is directly under the
root namespace. The function executed is aa::bb:cc::incr_inplace_host()
in the
shared library.
const cu = Polyglot.eval('grcuda', 'CU')
// host all host functions from libfoo.so into 'myfoo' namespace.
cu.bindall('myfoo', '.so', 'foo_host_bindings.nidl')
...
// invoke functions in myfoo namespace
cu.myfoo.incr_inplace_host(deviceArray, deviceArray.length)
cu.myfoo.cxx_incr_inplace_host(deviceArray, deviceArray.length)
cu.myfoo.c_incr_inplace_host(deviceArray, deviceArray.length)
The built-in functions are located in the root namespace of GrCUDA.
The functions are accessible directly in polyglot expression or, alternatively,
through the CU
root namespace object.
The bind()
function allows binding a symbol which corresponds to a
host function from a shared library (.so) to a callable.
bind()
returns the callable back to the host language.
bind(libraryPath, nidlSignature): callable
libraryPath
: is the full path to the .so file
nidlSignature
: is the function's signature in NIDL syntax.
Add prefix cxx
if the symbol for function has C++ mangled name in the shared library.
Example:
bind("<path-to-library>/libinc.so",
"increment(arr_out: out pointer float, arr_in: pointer float, n_elements: uint64): sint32")
bind("<path-to-library>/libinc.so", "c",
"cxx inc(arr_out: out pointer float, arr_in: pointer float, n_elements: uint64): sint32")
A complete example is given in the bindings tutorial.
Multiple host and kernel functions can be grouped and imported into rCUDA in one single step.
The signatures of host and kernel functions are specified NIDL files using
NIDL syntax. All listed functions are registered in
the GrCUDA targetNamespace
.
bindall(targetNamespace, fileName, nidlFileName)
targetNamespace
: GrCUDA namespace into which all functions listed in the NIDL file that
are imported from fileName
are registered.
nidlFileName
: name of the NIDL that contains specification for the host or kernel
function.
-
Note that the NIDL file can only contain specifications for host functions or kernel functions, but not both.
-
Note that for functions whose symbol have C++ mangled names need to be enclosed in
hostfuncs
andkernels
groups. Thecxx
prefix as used inbind()
andbindkernel()
cannot used inside NIDL files.
Examples:
bindall("myfoo", "libfoo.so", "foo_host_bindings.nidl")
bindall("myfoo", "foo.cubin", "foo_kernel_bindings.nidl")
A complete example is given in the bindings tutorial.
Kernel functions are bound to callable objects in the host
lost language. Kernel functions can be imported from cubin
binary files
or PTX files. A kernel callable is bound to a __global__
symbol in
a cubin
or PTX file using the bindkernel(..)
function.
bindkernel(cubinFile, nidlSignature)
cubinFile
: name of the cubin file that was produced by nvcc
nidlSignature
: is the function's signature in NIDL syntax. Since kernels do not return
a result value, the return type is omitted.
Add prefix cxx
if the symbol for function has C++ mangled name in the shared library.
Example:
bindkernel("inc_kernel.cubin", "inc_kernel(arr_out: out pointer float, arr_in: pointer float, n_elements: uint64)
A complete example is given in the bindings tutorial.
A GPU kernel callable can be created at runtime from a host language string that contains the CUDA C/C++ source code.
buildkernel(cudaSourceString, nidlSignature)
cudaSourceString
: name of the cubin file that was produced by nvcc
nidlSignature
: is the function's signature in NIDL syntax. Since kernels do not return
a result value, the return type is omitted.
The prefix cxx
cannot he specified in build kernel as the NVRTC, used to compile
the kernel, correctly resolves the lowered name itself.
If the kernel in cudaSourceString
has a template parameter, the template argument
in the instantiation can be specified between < >
the kernel name. The template
argument is passed to NVRTC. Thus, type arguments must be C++ types rather than
NIDL types. In the following example the template argument is int
rather than
sint32
.
Example:
const kernelSource = `
template <typename T>
__global__ void inc_kernel(T *out_arr, const T *in_arr, int num_elements) {
for (auto idx = blockIdx.x * blockDim.x + threadIdx.x; idx < num_elements;
idx += gridDim.x * blockDim.x) {
out_arr[idx] = in_arr[idx] + (T{} + 1);
}
}`
const signature =
'inc_kernel<int>(out_arr: out pointer sint32, in_arr: in pointer sint32, num_elments: sint32)'
const incKernel = cu.buildkernel(kernelSource, signature)
A complete example is given in the bindings tutorial.
The getdevices()
functions returns an array that contains all visible
CUDA devices. getdevice(k)
returns the k
visible device, with
k
ranging from 0 to the number of visible devices - 1.
devices = getdevices()
device = getdevice(deviceOrdinal)
deviceOrdinal
: integer k
that for the kth device, k
from 0 to
the number of visible devices
(see cudaGetDevice)
Both functions return Devices
objects which have the following members:
Attribute id
: the device ID (ordinal)
Attribute properties
: property objects containing device attributes
returned by the CUDA runtime cudaDeviceGetAttributeGet()
,
cudaMemgetInfo()
and cuDeviceGetName()
.
Method isCurrent()
: method returns true iff id
is the device
on which the currently active host thread executes device code.
Method setCurrent()
: method sets id
as the device the
currently active host thread should execute device code.
Example:
devices = polyglot.eval(language='grcuda', 'getdevices()')
device0 = polyglot.eval(language='grcuda', 'getdevice(0)')
# identical to device0 = devices[0]
for device in devices:
print('{}: {}, {} multiprocessors'.format(device.id,
device.property.deviceName,
device.property.multiProcessorCount))
# example output
# 0: TITAN V, 80 multiprocessors
# 1: Quadro GP100, 56 multiprocessors
device0.setCurrent()
print(device0.isCurrent()) # true
Table: Device Properties Names (see also CUDA Runtime Documentation)
Property Name:
asyncEngineCount
canFlushRemoteWrites
canMapHostMemory
canUseHostPointerForRegisteredMem
clockRate
computeCapabilityMajor
computeCapabilityMinor
computeMode
computePreemptionSupported
concurrentKernels
concurrentManagedAccess
cooperativeLaunch
cooperativeMultiDeviceLaunch
deviceName
directManagedMemAccessFromHost
eccEnabled
freeDeviceMemory
globalL1CacheSupported
globalMemoryBusWidth
gpuOverlap
hostNativeAtomicSupported
hostRegisterSupported
integrated
isMultiGpuBoard
kernelExecTimeout
l2CacheSize
localL1CacheSupported`
managedMemory
maxBlockDimX
maxBlockDimY
maxBlockDimZ
maxGridDimX
maxGridDimY
maxGridDimZ
maxPitch
maxRegistersPerBlock
maxRegistersPerMultiprocessor
maxSharedMemoryPerBlock
maxSharedMemoryPerBlockOptin
maxSharedMemoryPerMultiprocessor
maxSurface1DLayeredLayers
maxSurface1DWidth
maxSurface2DHeight
maxSurface2DLayeredHeight
maxSurface2DLayeredLayers
maxSurface2DLayeredWidth
maxSurface2DWidth
maxSurface3DDepth
maxSurface3DHeight
maxSurface3DWidth
maxSurfaceCubemapLayeredLayers
maxSurfaceCubemapLayeredWidth
maxSurfaceCubemapWidth
maxTexture1DLayeredLayers
maxTexture1DLayeredWidth
maxTexture1DLinearWidth
maxTexture1DMipmappedWidth
maxTexture1DWidth
maxTexture2DGatherHeight
maxTexture2DGatherWidth
maxTexture2DHeight
maxTexture2DLayeredHeight
maxTexture2DLayeredLayers
maxTexture2DLayeredWidth
maxTexture2DLinearHeight
maxTexture2DLinearPitch
maxTexture2DLinearWidth
maxTexture2DMipmappedHeight
maxTexture2DMipmappedWidth
maxTexture2DWidth
maxTexture3DDepth
maxTexture3DDepthAlt
maxTexture3DHeight
maxTexture3DHeightAlt
maxTexture3DWidth
maxTexture3DWidthAlt
maxTextureCubemapLayeredLayers
maxTextureCubemapLayeredWidth
maxTextureCubemapWidth
maxThreadsPerBlock
maxThreadsPerMultiProcessor
memoryClockRate
multiGpuBoardGroupID
multiProcessorCount
pageableMemoryAccess
pageableMemoryAccessUsesHostPageTables
pciBusId
pciDeviceId
pciDomainId
singleToDoublePrecisionPerfRatio
streamPrioritiesSupported
surfaceAlignment
tccDriver
textureAlignment
texturePitchAlignment
totalConstantMemory
totalDeviceMemory
unifiedAddressing
warpSize
A subset of the functions of the
CUDA Runtime API
are exported into the root namespace of GrCUDA i.e., the CU
object.
opaquePointerToDeviceMemory = polyglot.eval(language='grcuda', string='cudaMalloc(1000))')
...
cu = polyglot.eval(langauge='grcuda', string='CU')
cu.cudaFree(opaquePointerToDeviceMemory)
CUDA functions that take an out pointer, e.g., cudaError_t cudaMalloc(void** devPtr, size_t size)
, are wrapped such that they
return the result through the return value. In this case, errors are signaled
by throwing exceptions of type GrCUDAException
, e.g., cudaMalloc(size: uint64): pointer
.
CUDA Function | NIDL Signature |
---|---|
cudaDeviceGetAttribute | cudaDeviceGetAttribute(attr: sint32, device: sint32): sint32 |
cudaDeviceReset | cudaDeviceReset(): void |
cudaDeviceSynchronize | cudaDeviceSynchronize(): void |
cudaFree | cudaFree(ptr: inout pointer void): void |
cudaGetDevice | cudaGetDevice(): sint32 |
cudaGetDeviceCount | cudaGetDeviceCount(): sint32 |
cudaGetErrorString | cudaGetErrorString(error: sint32): string |
cudaMalloc | cudaMalloc(size: uint64): pointer |
cudaMallocManaged | cudaMallocManaged(size: uint64, flags: uint32): pointer |
cudaSetDevice | cudaSetDevice(device: sint32): void |
cudaMemcpy | cudaMemcpy(dst: out pointer void, src: in pointer void, count: uint64, kind: sint32):void |
The following functions from cuBLAS are are preregistered in in the namespace
BLAS
. The cublasHandle_t
is implicitly provided.
The location of the cuBLAS library (libcublas.so
) can be set via the
--grcuda.CuBLASLibrary= location
. cuBLAS support can be disabled with the option
--grcuda.CuBLASEnabled=false
.
// BLAS-1 AXPY: y[i] := alpha * x[i] + y[i]
cublasSaxpy_v2(n: sint32, alpha: in pointer float, x: in pointer float, incx: sint32,
y: inout pointer float, incy: sint32): sint32
cublasDaxpy_v2(n: sint32, alpha: in pointer double, x: in pointer double, incx: sint32,
y: inout pointer double, incy: sint32): sint32
cublasCaxpy_v2(n: sint32, alpha: in pointer float, x: in pointer float, incx: sint32,
y: inout pointer float, incy: sint32): sint32
cublasZaxpy_v2(n: sint32, alpha: in pointer double, x: in pointer double, incx: sint32,
y: inout pointer double, incy: sint32): sint32
// BLAS-2 GEMV: y[i] := alpha * trans(A) + beta * y[i]
cublasSgemv_v2(trans: sint32, n: sint32, m: sint32, alpha: in pointer float,
A: in pointer float, lda: sint32,
x: in pointer float, incx: sint32,
beta: in pointer float,
y: inout pointer float, incy: sint32): sint32
cublasDgemv_v2(trans: sint32, n: sint32, m: sint32, alpha: in pointer double,
A: in pointer double, lda: sint32,
x: in pointer double, incx: sint32,
beta: in pointer double,
y: inout pointer double, incy: sint32): sint32
cublasCgemv_v2(trans: sint32, n: sint32, m: sint32, alpha: in pointer float,
A: in pointer float, lda: sint32,
x: in pointer float, incx: sint32,
beta: in pointer float,
y: inout pointer float, incy: sint32): sint32
cublasZgemv_v2(trans: sint32, n: sint32, m: sint32, alpha: in pointer double,
A: in pointer double, lda: sint32,
x: in pointer double, incx: sint32,
beta: in pointer double,
y: inout pointer double, incy: sint32): sint32
// BLAS-3 GEMM: C := alpha * transa(A) * transb(B) + beta * C
cublasSgemm_v2(transa: sint32, transb: sint32, n: sint32, m: sint32, k: sint32,
alpha: in pointer float,
A: in pointer float, lda: sint32,
B: in pointer float, ldb: sint32,
beta: in pointer float,
C: in pointer float, ldc: sint32): sint32
cublasDgemm_v2(transa: sint32, transb: sint32, n: sint32, m: sint32, k: sint32,
alpha: in pointer double,
A: in pointer double, lda: sint32,
B: in pointer double, ldb: sint32,
beta: in pointer double,
C: in pointer double, ldc: sint32): sint32
cublasCgemm_v2(transa: sint32, transb: sint32, n: sint32, m: sint32, k: sint32,
alpha: in pointer float,
A: in pointer float, lda: sint32,
B: in pointer float, ldb: sint32,
beta: in pointer float,
C: in pointer float, ldc: sint32): sint32
cublasZgemm_v2(transa: sint32, transb: sint32, n: sint32, m: sint32, k: sint32,
alpha: in pointer double,
A: in pointer double, lda: sint32,
B: in pointer double, ldb: sint32,
beta: in pointer double,
C: in pointer double, ldc: sint32): sint32
The letter S, D, C, Z designate the data type:
{X} |
Type |
---|---|
S |
single precision (32 bit) real number |
D |
double precision (64 bit) real number |
C |
single precision (32 bit) complex number |
Z |
double precision (64 bit) complex number |
Exposed functions from RAPIDS cuML (libcuml.so
) are preregistered with
in the namespace ML
. The cumlHandle_t
argument, is implicitly provided by
GrCUDA and, thus, must be omitted in the polyglot callable.
The cuML function registry can be disabled by setting --grcuda.CuMLEnabled=false
.
The absolute path to the libcuml.so
shared library must be specified in --grcuda.CuMLLibrary=
.
Current set of preregistered cuML functions:
void ML::cumlSpDbscanFit(DeviceArray input, int num_rows, int num_cols, float eps, int min_samples, DeviceArray labels, size_t max_bytes_per_chunk, int verbose)
void ML::cumlDbDbscanFit(DeviceArray input, int num_rows, int num_cols, double eps, int min_samples, DeviceArray labels, size_t max_bytes_per_chunk, int verbose)
expr ::= arrayExpression | funcCall | callable
arrayExpr ::= TypeName ('[' constExpr ']')+
callExpr ::= FunctionName '(' argList? ')'
callable ::= ('::') Identifier
| Namespace '::' Identifier
argList ::= argExpr (',' argExpr)*
argExpr ::= Identifier | String | constExpr
constExpr ::= constTerm ([+-] constTerm)*
constTerm ::= constFactor ([*/%] constFactor)*
constFactor ::= IntegerLiteral | '(' constExpr ')'
TypeName ::= Identifier
FunctionName ::= Identifier
String ::= '"' Character '"'