CUDA.jl 5.1 greatly improves support for two important parts of the CUDA toolkit: unified memory, for accessing GPU memory on the CPU and vice versa, and cooperative groups, which offer a more modular approach to kernel programming.
Unified Memory
Unified memory is a feature of CUDA that allows the programmer to access memory from both the CPU and GPU, relying on the driver to move data between the two. This can be useful for a variety of reasons: to avoid explicit memory copies, to use more memory than the GPU has available, or to be able to incrementally port code to the GPU and still have parts of the application run on the CPU.
julia> gpu = cu([1., 2.]; unified=true)
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
1.0
2.0
julia> # accessing GPU memory from the CPU
gpu[1] = 3;
julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
3.0
2.0
Accessing GPU memory like this used to throw an error, but with CUDA.jl 5.1 it is safe and efficient to perform scalar iteration on CuArrays backed by unified memory. This greatly simplifies porting applications to the GPU, as code that relies on AbstractArray fallbacks from Base, which process arrays element by element, is no longer a problem.
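As a minimal sketch of what this enables, consider a generic function that loops over its argument one element at a time. The helper name `count_positive` is hypothetical and only serves as an illustration; the `cu(...; unified=true)` call is the one shown above. Running it assumes a working CUDA GPU and CUDA.jl 5.1 or later.

```julia
using CUDA

# Hypothetical CPU-style helper: written against AbstractArray and
# iterating element by element, like many fallbacks in Base.
function count_positive(xs::AbstractArray)
    n = 0
    for x in xs          # scalar iteration over the array
        n += x > 0       # a Bool adds as 0 or 1
    end
    return n
end

# Array backed by unified memory, so the CPU loop above can read it directly.
gpu = cu([1f0, -2f0, 3f0]; unified=true)
npos = count_positive(gpu)
```

The same function also accepts a plain Array, which is what makes incremental porting convenient: the code does not need to know where the data lives.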
In addition, CUDA.jl 5.1 also makes it easier to convert CuArrays to Array objects. This is important when wanting to use high-performance CPU libraries like BLAS or LAPACK which do not support CuArrays:
julia> cpu = unsafe_wrap(Array, gpu)
2-element Vector{Float32}:
3.0
2.0
julia> using LinearAlgebra
julia> LinearAlgebra.BLAS.scal!(2f0, cpu);
julia> gpu
2-element CuArray{Float32, 1, CUDA.Mem.UnifiedBuffer}:
6.0
4.0
The reverse is also possible: CPU-based Arrays can now trivially be converted to CuArray objects for use on the GPU, without the need to explicitly allocate unified memory. This further simplifies memory management, as it makes it possible to use the GPU inside of an existing application without having to copy data into a CuArray:
julia> cpu = [1, 2];
julia> gpu = unsafe_wrap(CuArray, cpu)
2-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
1
2
julia> CUDA.@sync gpu .+= 1;
julia> cpu
2-element Vector{Int64}:
2
3
Note that the above methods are prefixed with unsafe because they require careful management of object lifetimes: when creating an Array from a CuArray, the CuArray must be kept alive for as long as the Array is used, and vice versa when creating a CuArray from an Array. Explicit synchronization (i.e., waiting for the GPU to finish computing) is also required, as CUDA.jl cannot synchronize automatically when GPU memory is accessed through a CPU pointer.
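One possible way to follow both rules is sketched below. The function name `scale_on_cpu!` is hypothetical; the sketch assumes the input is a Float32 CuArray backed by unified memory, and it uses `GC.@preserve` from Base as one option for keeping the CuArray alive while the wrapper is in use.

```julia
using CUDA, LinearAlgebra

# Hypothetical helper: scale a unified-memory CuArray in place using CPU BLAS.
function scale_on_cpu!(gpu::CuArray{Float32}, factor::Float32)
    # Wait for outstanding GPU work before touching the memory from the CPU.
    CUDA.synchronize()
    # Keep `gpu` rooted for as long as the wrapping Array is in use.
    GC.@preserve gpu begin
        cpu = unsafe_wrap(Array, gpu)
        LinearAlgebra.BLAS.scal!(factor, cpu)
    end
    return gpu
end

gpu = cu([1f0, 2f0]; unified=true)
gpu .*= 3                   # asynchronous GPU work on the same memory
scale_on_cpu!(gpu, 2f0)
result = Array(gpu)
```

Wrapping the unsafe call in a small function like this keeps the temporary Array from escaping, so it cannot outlive the CuArray it points into.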
Cooperative Groups
Another major improvement in CUDA.jl 5.1 is the greatly expanded set of wrappers for the CUDA cooperative groups API. Cooperative groups are a low-level feature of CUDA that makes it possible to write kernels that are more flexible than the traditional approach of differentiating computations based on thread and block indices. Instead, cooperative groups allow the programmer to use objects representing groups of threads, pass those around, and differentiate computations based on queries on those objects.
function reduce_sum(group, temp, val)
    lane = CG.thread_rank(group)

    # Each iteration halves the number of active threads
    # Each active thread adds temp[lane + i] to its own partial sum
    i = CG.num_threads(group) ÷ 2
    while i > 0
        temp[lane] = val
        CG.sync(group)
        if lane <= i
            val += temp[lane + i]
        end
        CG.sync(group)
        i ÷= 2
    end

    return val  # note: only thread 1 will return full sum
end
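To make the halving loop in reduce_sum easier to follow, here is a sequential CPU sketch of the same scheme. It is not part of CUDA.jl: the name `simulate_reduce_sum` is hypothetical, each vector slot stands in for one thread's `val`, and the sketch assumes the group size is a power of two.

```julia
# Sequential model of the tree reduction: val[lane] plays the role of the
# per-thread `val`, and temp plays the role of the shared scratch array.
function simulate_reduce_sum(vals::AbstractVector{T}) where T
    n = length(vals)
    @assert ispow2(n) "this sketch assumes a power-of-two group size"
    val = collect(vals)
    temp = similar(val)
    i = n ÷ 2
    while i > 0
        temp .= val                    # every thread: temp[lane] = val, then sync
        for lane in 1:i                # only the first i lanes stay active
            val[lane] += temp[lane + i]
        end
        i ÷= 2                         # halve the number of active lanes
    end
    return val[1]                      # only lane 1 holds the full sum
end

total = simulate_reduce_sum(1:8)
```

With eight inputs the loop runs three times (i = 4, 2, 1). In reduce_sum itself, the calls to CG.sync take the place of the sequential ordering used here.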
When the threads of a group call this function, they cooperatively compute the sum of the values passed by each thread in the group. For example, let’s write a kernel that calls this function using a group representing the current thread block:
function sum_kernel_block(sum::AbstractArray{T},
                          input::AbstractArray{T}) where T
    # have each thread compute a partial sum
    my_sum = thread_sum(input)

    # perform a cooperative summation
    temp = CuStaticSharedArray(T, 256)
    g = CG.this_thread_block()
    block_sum = reduce_sum(g, temp, my_sum)

    # combine the block sums
    if CG.thread_rank(g) == 1
        CUDA.@atomic sum[] += block_sum
    end

    return
end

function thread_sum(input::AbstractArray{T}) where T
    sum = zero(T)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    stride = blockDim().x * gridDim().x
    while i <= length(input)
        sum += input[i]
        i += stride
    end
    return sum
end

n = 1<<24
threads = 256
blocks = cld(n, threads)
data = CUDA.rand(n)
sum = CUDA.fill(zero(eltype(data)), 1)
@cuda threads=threads blocks=blocks sum_kernel_block(sum, data)
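The launch above is asynchronous, so the result is only available once the kernel has finished. A simple way to check it is to copy both arrays back to the CPU, which waits for the preceding GPU work, and compare against a CPU sum. The helper below is a hypothetical sketch, and its default tolerance is an assumption chosen to absorb Float32 rounding and the unordered atomic additions.

```julia
using CUDA

# Hypothetical helper: compare a one-element GPU result with a CPU reference.
# `Base.sum` is spelled out because the script above rebinds the name `sum`.
function matches_cpu_sum(gpu_result::CuArray, data::CuArray; rtol=1e-2)
    expected = Base.sum(Array(data))   # copy to the CPU and sum there
    actual = Array(gpu_result)[1]      # copying back waits for the kernel
    return isapprox(actual, expected; rtol=rtol)
end

ok = matches_cpu_sum(CUDA.fill(6f0, 1), cu([1f0, 2f0, 3f0]))
```

After the launch above, the call would be matches_cpu_sum(sum, data).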
This style of programming makes it possible to write kernels that are safer and more modular than traditional kernels. Some CUDA features also require the use of cooperative groups; for example, asynchronous memory copies between global and shared memory are done using the CG.memcpy_async function.
Other Updates
Apart from these two major features, CUDA.jl 5.1 also includes a number of smaller fixes and improvements:
- Support for CUDA 12.3
- Performance improvements related to memory copies, which regressed in CUDA.jl 5.0
- Improvements to the native profiler (CUDA.@profile), now also showing local memory usage, supporting more NVTX metadata, and with better support for Pluto.jl and Jupyter
- Many CUSOLVER and CUSPARSE improvements by @amontoison