Newly-spawned tasks should re-set the device #851

Closed
kshyatt opened this issue Apr 20, 2021 · 4 comments
Labels
needs documentation: Documentation is requested.

Comments

@kshyatt
Contributor

kshyatt commented Apr 20, 2021

Describe the bug

We should be able to handle running multiple streams on multiple GPUs simultaneously using our old friends @sync and @async. However, this currently leads to illegal address errors.
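
The issue title suggests that the active device is tracked per task. Assuming that is the case, the plain-Julia sketch below (no GPU required) illustrates the general behaviour involved: a value stored in one task's local storage is not visible in a task spawned from it. The `:device` key and the function name are hypothetical and are not CUDA.jl internals.

```julia
# Task-local storage set in a parent task is not inherited by a child task.
function child_sees_parent_value()
    task_local_storage(:device, 3)   # hypothetical per-task "device" setting
    # The child task starts with its own, empty task-local storage.
    child = @async get(task_local_storage(), :device, nothing)
    return fetch(child)
end

# Run inside a fresh task so the caller's own storage is left untouched.
result = fetch(@async child_sees_parent_value())
println(result === nothing ? "child did not inherit the value" : "child inherited $result")
```

If CUDA.jl's device selection behaves the same way, a task created with @async would have to select its device again before touching arrays that live on it.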

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA

NUM_DEVS = 4
n = 10

function my_func(n::Int)
    @sync begin
        for dev in 1:NUM_DEVS
            @async begin
                device!(dev)
                dev_arrs = [CUDA.rand(2^10, 2^10) for ii in 1:n]
                synchronize()
                @sync begin
                    for arr in dev_arrs
                        @async abs.(arr)
                    end
                end
            end
        end
    end
end
my_func(n)
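
A possible workaround, not verified in this thread, is to select the device again inside every inner task rather than relying on the parent task's selection. The sketch below is a variant of the MWE; the name `my_func_explicit` and the `ndevs` keyword are hypothetical, while `device!`, `CUDA.rand` and `synchronize` are the same calls used above.

```julia
using CUDA

# Hypothetical variant of my_func: every task selects its device explicitly.
function my_func_explicit(n::Int; ndevs::Int = 4)
    @sync for dev in 1:ndevs
        @async begin
            device!(dev)
            dev_arrs = [CUDA.rand(2^10, 2^10) for _ in 1:n]
            synchronize()
            @sync for arr in dev_arrs
                @async begin
                    # Assumption: the child task does not inherit the
                    # parent's device, so it is selected again here.
                    device!(dev)
                    abs.(arr)
                end
            end
        end
    end
end
```

Calling `my_func_explicit(10)` mirrors `my_func(n)` from the MWE and needs at least five visible devices, since it uses device ordinals 1 through 4.
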
Manifest.toml

master for CUDA and dependencies.

Expected behavior

The code should run without illegal address errors when Julia exits.

Version info

Details on Julia:

Julia Version 1.7.0-DEV.534
Commit 05f78df5bd* (2021-02-14 00:20 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, broadwell)

Details on CUDA:

CUDA 11.2

Additional context
The script itself runs without issues; however, when Julia exits, it reports illegal address and finalizer errors.

Commenting out the assertion that fails in Julia's locking code (line 70 of locks.h), I get a segfault:

signal (11): Segmentation fault
in expression starting at /home/kshyatt/multi_gpu.jl:23
jl_eh_restore_state at /home/kshyatt/julia-1.6/src/rtutils.c:254
macro expansion at /home/kshyatt/.julia/packages/GPUCompiler/8sSXl/src/driver.jl:109 [inlined]
emit_julia at /home/kshyatt/.julia/packages/GPUCompiler/8sSXl/src/utils.jl:62
cufunction_compile at /home/kshyatt/.julia/dev/CUDA/src/compiler/execution.jl:299
check_cache at /home/kshyatt/.julia/packages/GPUCompiler/8sSXl/src/cache.jl:47 [inlined]
cached_compilation at /home/kshyatt/.julia/packages/GPUArrays/gjXOn/src/host/broadcast.jl:57 [inlined]
cached_compilation at /home/kshyatt/.julia/packages/GPUCompiler/8sSXl/src/cache.jl:0
#cufunction#280 at /home/kshyatt/.julia/dev/CUDA/src/compiler/execution.jl:289
cufunction at /home/kshyatt/.julia/dev/CUDA/src/compiler/execution.jl:283 [inlined]
macro expansion at /home/kshyatt/.julia/dev/CUDA/src/compiler/execution.jl:102 [inlined]
#launch_heuristic#305 at /home/kshyatt/.julia/dev/CUDA/src/gpuarrays.jl:17 [inlined]
launch_heuristic at /home/kshyatt/.julia/dev/CUDA/src/gpuarrays.jl:17 [inlined]
copyto! at /home/kshyatt/.julia/packages/GPUArrays/gjXOn/src/host/broadcast.jl:63 [inlined]
copyto! at ./broadcast.jl:936 [inlined]
copy at /home/kshyatt/.julia/packages/GPUArrays/gjXOn/src/host/broadcast.jl:47 [inlined]
materialize at ./broadcast.jl:883 [inlined]
#3 at ./task.jl:411
unknown function (ip: 0x7f2e83f0ebac)
_jl_invoke at /home/kshyatt/julia-1.6/src/gf.c:2237 [inlined]
jl_apply_generic at /home/kshyatt/julia-1.6/src/gf.c:2419
jl_apply at /home/kshyatt/julia-1.6/src/julia.h:1703 [inlined]
start_task at /home/kshyatt/julia-1.6/src/task.c:839
unknown function (ip: (nil))
Allocations: 34702275 (Pool: 34692648; Big: 9627); GC: 33
kshyatt added the bug (Something isn't working) label on Apr 20, 2021
@maleadt
Member

maleadt commented Apr 21, 2021

API trace:

julia> my_func(n)
cuDriverGetVersion(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 11020
cuDeviceGetCount(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 9
cuDeviceGet(Base.RefValue{Int32}, 1) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(1)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000000000000
cuCtxSetCurrent(CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002b60bc0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(1)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(1)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(1)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000044fe010
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxSetCurrent(CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33757331456
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x00000000044fe010, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b04000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b04400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuStreamQuery(CuStream(0x0000000002b60bc0, CuContext(0x0000000002dd40d0, instance 53e6ad8edd3cd1c0))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 2) = CUDA_SUCCESS
 1: 2
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(2)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dd40d0
cuCtxSetCurrent(CuContext(0x00000000029572b0, instance c6157752c12b6192)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000544a2e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(2)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(2)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(2)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000056dc740
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxSetCurrent(CuContext(0x00000000029572b0, instance c6157752c12b6192)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 2
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33757331456
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x00000000056dc740, CuContext(0x00000000029572b0, instance c6157752c12b6192)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuStreamQuery(CuStream(0x000000000544a2e0, CuContext(0x00000000029572b0, instance c6157752c12b6192))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 3) = CUDA_SUCCESS
 1: 3
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(3)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029572b0
cuCtxSetCurrent(CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000005ea40f0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(3)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(3)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(3)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000631ef00
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxSetCurrent(CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 3
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33757331456
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x000000000631ef00, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuStreamQuery(CuStream(0x0000000005ea40f0, CuContext(0x00000000031cb7e0, instance 5a5fd5240cac200))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 4) = CUDA_SUCCESS
 1: 4
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(4)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031cb7e0
cuCtxSetCurrent(CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000006cd98e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(4)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(4)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(4)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000006f6b240
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxSetCurrent(CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 4
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 16613113856
 2: 16945512448
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x0000000006f6b240, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamQuery(CuStream(0x0000000006cd98e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000006f9d570
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000006f9d570, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4800000)
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxSynchronize() = CUDA_SUCCESS
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuModuleLoadDataEx(Base.RefValue{Ptr{Nothing}}, Ptr{UInt8} @0x0000000008b39748, 3, 3-element Vector{CUDA.CUjit_option_enum}, 3-element Vector{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008ba3210
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuModuleGetFunction(Base.RefValue{Ptr{Nothing}}, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), _Z27julia_broadcast_kernel_226315CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES4_IS5_EE4_absS3_I8ExtrudedIS0_IS1_Li2ELi1EES3_I4BoolS8_ES3_IS5_S5_EEEES5_) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000089ccb60
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuModuleGetGlobal_v2(Base.RefValue{CuPtr{Nothing}}, Base.RefValue{UInt64}, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), exception_flag) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x00007fb6a7b42e00)
 2: 8
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemHostAlloc(Base.RefValue{Ptr{Nothing}}, 8, 2) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00007fb697e00000
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemHostGetDevicePointer_v2(Base.RefValue{CuPtr{Nothing}}, Ptr{Nothing} @0x00007fb697e00000, 0) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x00007fb697e00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemcpyHtoDAsync_v2(CuGlobal{Ptr{Nothing}}(DeviceBuffer(8 bytes at 0x00007fb6a7b42e00)), Base.RefValue{Ptr{Nothing}}, 8, CuStream(0x0000000006f9d570, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 2: Ptr{Nothing} @0x00007fb697e00000
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000006f9d570, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000064c8be0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000064c8be0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000064c8be0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000079df9e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000079df9e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000079df9e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007fe4da0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007fe4da0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007fe4da0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009529420
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009529420, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009529420, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007f98e40
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007f98e40, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007f98e40, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007f87670
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007f87670, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007f87670, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007e56b80
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007e56b80, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007e56b80, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007b5ac40
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007b5ac40, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007b5ac40, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008f278b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008f278b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008f278b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000802af10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000802af10, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000802af10, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008cd06a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008cd06a0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008cd06a0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000047b0dd0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000047b0dd0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000047b0dd0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008f448b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008f448b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008f448b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000034dc4c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000034dc4c0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000034dc4c0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000004600470
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000004600470, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000004600470, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007aef0b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007aef0b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007aef0b0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000098950e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000098950e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000098950e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000099309e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000099309e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000099309e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000099faaa0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000099faaa0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000099faaa0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009a960e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009a960e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009a960e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009b07b20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009b07b20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009b07b20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008a15940
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008a15940, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008a15940, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009bdc060
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009bdc060, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009bdc060, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009cbb7e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009cbb7e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009cbb7e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009da41e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009da41e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaac00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009da41e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009c72a60
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009c72a60, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009c72a60, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007fde310
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000007fde310, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000007fde310, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009f625a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009f625a0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009f625a0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a019120
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a019120, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aabc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a019120, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a0b46e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a0b46e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a0b46e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a0be660
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a0be660, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a0be660, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a1d7a20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a1d7a20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a1d7a20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a2610e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a2610e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aacc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a2610e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a15a3e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a15a3e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a15a3e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a3c6c20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a3c6c20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a3c6c20, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a462220
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a462220, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a462220, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a48ece0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a48ece0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aadc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000a48ece0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000095922e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000095922e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aae000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000095922e0, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008a59820
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008a59820, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aae400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x00000000089ccb60, CuModule(Ptr{Nothing} @0x0000000008ba3210, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008a59820, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008977060
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002ed1ee0
cuStreamQuery(CuStream(0x0000000008977060, CuContext(0x0000000002ed1ee0, instance 68d2a20687252d23))) = CUDA_ERROR_ILLEGAL_ADDRESS
ERROR: cuGetErrorString(CuError(CUDA.CUDA_ERROR_ILLEGAL_ADDRESS, nothing), Base.RefValue{Cstring}) = CUDA_SUCCESS
 2: Cstring(0x00007fb7d77faf90)
cuGetErrorName(CuError(CUDA.CUDA_ERROR_ILLEGAL_ADDRESS, nothing), Base.RefValue{Cstring}) = CUDA_SUCCESS
 2: Cstring(0x00007fb7d77fa6d7)
CUDA error: an illegal memory access was encountered (code 700, ERROR_ILLEGAL_ADDRESS)
Stacktrace:
 [1] throw_api_error(res::CUDA.cudaError_enum)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/error.jl:88
 [2] query
   @ ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:102 [inlined]
 [3] synchronize(s::CuStream; blocking::Bool)
   @ CUDA ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:117
 [4] synchronize (repeats 2 times)
   @ ~/Julia/pkg/CUDA/lib/cudadrv/stream.jl:117 [inlined]
 [5] top-level scope
   @ ~/Julia/pkg/CUDA/src/initialization.jl:83

Member

maleadt commented Apr 21, 2021

And under compute-sanitizer:

julia> my_func(n)
cuDriverGetVersion(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 11020
cuDeviceGetCount(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 9
cuDeviceGet(Base.RefValue{Int32}, 1) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(1)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000000000000
cuCtxSetCurrent(CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000003cdb610
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(1)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(1)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(1)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000057411e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxSetCurrent(CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33719582720
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x00000000057411e0, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(1)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(1)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b02c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b03c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b04000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000000b04400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuStreamQuery(CuStream(0x0000000003cdb610, CuContext(0x0000000001bfce20, instance 6dd8c03df8dcfbec))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 2) = CUDA_SUCCESS
 1: 2
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(2)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001bfce20
cuCtxSetCurrent(CuContext(0x00000000029e24d0, instance c9efdad935940afd)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000065fb420
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(2)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(2)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(2)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000007c96ec0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxSetCurrent(CuContext(0x00000000029e24d0, instance c9efdad935940afd)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 2
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33719582720
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x0000000007c96ec0, CuContext(0x00000000029e24d0, instance c9efdad935940afd)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(2)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(2)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000001ae4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuStreamQuery(CuStream(0x00000000065fb420, CuContext(0x00000000029e24d0, instance c9efdad935940afd))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 3) = CUDA_SUCCESS
 1: 3
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(3)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000029e24d0
cuCtxSetCurrent(CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000084553a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(3)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(3)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(3)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009e1c8c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxSetCurrent(CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 3
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 33719582720
 2: 34089730048
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x0000000009e1c8c0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(3)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(3)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000002ac4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuStreamQuery(CuStream(0x00000000084553a0, CuContext(0x0000000001f477e0, instance d5b0eb2fcd14bd72))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 4) = CUDA_SUCCESS
 1: 4
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuDevicePrimaryCtxRetain(Base.RefValue{Ptr{Nothing}}, CuDevice(4)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000001f477e0
cuCtxSetCurrent(CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)) = CUDA_SUCCESS
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000a90e860
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(4)) = CUDA_SUCCESS
 1: 1
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_MEMORY_POOLS_SUPPORTED, CuDevice(4)) = CUDA_SUCCESS
 1: 1
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuDeviceGetMemPool(Base.RefValue{Ptr{Nothing}}, CuDevice(4)) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000bfabd00
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxSetCurrent(CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)) = CUDA_SUCCESS
cuCtxGetDevice(Base.RefValue{Int32}) = CUDA_SUCCESS
 1: 4
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemGetInfo_v2(Base.RefValue{UInt64}, Base.RefValue{UInt64}) = CUDA_SUCCESS
 1: 16575365120
 2: 16945512448
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemPoolSetAttribute(CuMemoryPool(Ptr{Nothing} @0x000000000bfabd00, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), CU_MEMPOOL_ATTR_RELEASE_THRESHOLD, Base.RefValue{UInt64}) = CUDA_SUCCESS
 3: 1101004800
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa2c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa3c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamQuery(CuStream(0x000000000a90e860, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009e9cf40
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009e9cf40, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4800000)
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, CuDevice(4)) = CUDA_SUCCESS
 1: 7
cuDeviceGetAttribute(Base.RefValue{Int32}, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, CuDevice(4)) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxSynchronize() = CUDA_SUCCESS
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuModuleLoadDataEx(Base.RefValue{Ptr{Nothing}}, Ptr{UInt8} @0x000000000cc40b08, 3, 3-element Vector{CUDA.CUjit_option_enum}, 3-element Vector{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dc4c440
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuModuleGetFunction(Base.RefValue{Ptr{Nothing}}, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), _Z27julia_broadcast_kernel_219515CuKernelContext13CuDeviceArrayI7Float32Li2ELi1EE11BroadcastedIv5TupleI5OneToI5Int64ES4_IS5_EE4_absS3_I8ExtrudedIS0_IS1_Li2ELi1EES3_I4BoolS8_ES3_IS5_S5_EEEES5_) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dda08d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuModuleGetGlobal_v2(Base.RefValue{CuPtr{Nothing}}, Base.RefValue{UInt64}, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), exception_flag) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x00007f9117d42e00)
 2: 8
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemHostAlloc(Base.RefValue{Ptr{Nothing}}, 8, 2) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00007f9107e00000
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemHostGetDevicePointer_v2(Base.RefValue{CuPtr{Nothing}}, Ptr{Nothing} @0x00007f9107e00000, 0) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x00007f9107e00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemcpyHtoDAsync_v2(CuGlobal{Ptr{Nothing}}(DeviceBuffer(8 bytes at 0x00007f9117d42e00)), Base.RefValue{Ptr{Nothing}}, 8, CuStream(0x0000000009e9cf40, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 2: Ptr{Nothing} @0x00007f9107e00000
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009e9cf40, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d30cfa0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d30cfa0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa4c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d30cfa0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d751fe0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d751fe0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d751fe0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000c7e93e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000c7e93e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000c7e93e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000ccbed20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000ccbed20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000ccbed20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dd5cde0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dd5cde0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa5c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dd5cde0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d382160
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d382160, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d382160, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000c08c3c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000c08c3c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000c08c3c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d1b6680
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d1b6680, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d1b6680, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000cc5e880
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000cc5e880, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa6c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000cc5e880, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d1bdaf0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d1bdaf0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d1bdaf0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009f6a3d0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009f6a3d0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009f6a3d0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000c0e4170
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000c0e4170, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000c0e4170, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009f6a790
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009f6a790, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa7c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009f6a790, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000bf45390
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000bf45390, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000bf45390, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d7c18c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d7c18c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d7c18c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d00d520
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d00d520, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d00d520, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009f8def0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009f8def0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa8c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009f8def0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002f8f1b0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000002f8f1b0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000002f8f1b0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000008770d60
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000008770d60, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000008770d60, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009fe88c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009fe88c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009fe88c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d1e7730
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d1e7730, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aa9c00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d1e7730, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dc5e770
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dc5e770, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dc5e770, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d380c30
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d380c30, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d380c30, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000065eb2e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000065eb2e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaa800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000065eb2e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000e54fda0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000e54fda0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aaac00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000e54fda0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d0097a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d0097a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d0097a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000ddfd5a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000ddfd5a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000ddfd5a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000c5b46f0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000c5b46f0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aab800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000c5b46f0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dd6fa20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dd6fa20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aabc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dd6fa20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dc49820
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dc49820, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dc49820, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x00000000031c38c0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x00000000031c38c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x00000000031c38c0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000009f46fa0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x0000000009f46fa0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aac800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x0000000009f46fa0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dcabe20
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dcabe20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aacc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dcabe20, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d40b960
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d40b960, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d40b960, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dcc07e0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dcc07e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dcc07e0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d3ff260
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d3ff260, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aad800000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d3ff260, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000e20b0a0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000e20b0a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aadc00000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000e20b0a0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000dca9d60
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000dca9d60, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aae000000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000dca9d60, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000d9e7de0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuMemAllocAsync(Base.RefValue{CuPtr{Nothing}}, 4194304, CuStream(0x000000000d9e7de0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS
 1: CuPtr{Nothing}(0x0000003aae400000)
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuOccupancyMaxPotentialBlockSize(Base.RefValue{Int32}, Base.RefValue{Int32}, CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), Ptr{Nothing} @0x0000000000000000, 0, 256) = CUDA_SUCCESS
 1: 480
 2: 256
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuLaunchKernel(CuFunction(Ptr{Nothing} @0x000000000dda08d0, CuModule(Ptr{Nothing} @0x000000000dc4c440, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))), 4096, 1, 1, 256, 1, 1, 0, CuStream(0x000000000d9e7de0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611)), 3-element Vector{Ptr{Nothing}}, Ptr{Nothing} @0x0000000000000000) = CUDA_SUCCESS
cuDeviceGet(Base.RefValue{Int32}, 0) = CUDA_SUCCESS
 1: 0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamCreate(Base.RefValue{Ptr{Nothing}}, CU_STREAM_DEFAULT) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x000000000e52cbe0
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuCtxGetCurrent(Base.RefValue{Ptr{Nothing}}) = CUDA_SUCCESS
 1: Ptr{Nothing} @0x0000000002dc4d10
cuStreamQuery(CuStream(0x000000000e52cbe0, CuContext(0x0000000002dc4d10, instance cb508bbbb2bd611))) = CUDA_SUCCESS

julia> ========= Invalid __global__ read of size 4 bytes
=========     at 0x870 in julia_broadcast_kernel_2195(CuKernelContext,CuDeviceArray<Float32,int=2,int=1>,Broadcasted<void,Tuple<OneTo<Int64>,Tuple<OneTo>>,_abs,Broadcasted<Extruded<CuDeviceArray<Float32,int=2,int=1>,Broadcasted<Bool,Tuple<OneTo>>,Broadcasted<OneTo,OneTo>>>>,OneTo)
=========     by thread (224,0,0) in block (0,0,0)
=========     Address 0xb02000380 is out of bounds

@maleadt
Member

maleadt commented Apr 22, 2021

Well, this took me a bit. The problem is that your @async kernel(...) spawns a new task, which is initialized for the currently-active device rather than for dev, the device on which the arrays were allocated. Ideally the new task would inherit the device used by its parent task, but we can't currently do that: JuliaLang/julia#35757. So you need to repeat the device!(dev) in each @async block, and then everything works.
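A minimal sketch of that workaround, adapted from the MWE at the top of the issue. It is illustrative rather than a tested fix: the only intended change is the extra `device!(dev)` inside the inner `@async` block. Iterating over `CUDA.devices()` instead of the hard-coded `1:NUM_DEVS` is an assumption made here so the sketch does not depend on a particular GPU count.

```julia
using CUDA

function my_func(n::Int)
    @sync begin
        # Assumption: iterate over whatever devices are visible,
        # rather than the fixed 1:NUM_DEVS range from the MWE.
        for dev in CUDA.devices()
            @async begin
                # Select the device for this outer task.
                device!(dev)
                dev_arrs = [CUDA.rand(2^10, 2^10) for _ in 1:n]
                synchronize()
                @sync begin
                    for arr in dev_arrs
                        @async begin
                            # A newly spawned task does not inherit the
                            # parent's device, so select it again here.
                            device!(dev)
                            abs.(arr)
                        end
                    end
                end
            end
        end
    end
end
```

Calling `my_func(10)` on a multi-GPU machine should then launch each broadcast on the device that owns the corresponding array.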

maleadt added the "needs documentation" label and removed the "bug" label on Apr 22, 2021
maleadt changed the title from "Memory errors with nested async tasks across GPUs" to "Newly-spawned tasks should re-set the device" on Apr 22, 2021
@kshyatt
Contributor Author

kshyatt commented Apr 22, 2021

Wild! Thanks for the advice, pretty interesting.
