GPUs are at the heart of scientific computing, AI, and high-performance computing (HPC), yet most GPU programming remains tied to vendor-specific frameworks like CUDA (NVIDIA), ROCm (AMD), and oneAPI (Intel). While these frameworks offer highly optimized performance, they create major portability issues for developers who need their code to run across different hardware.
Challenges with Vendor-Specific GPU Code
Code Fragmentation – Developers must rewrite and maintain separate codebases for different GPUs.
Limited Deployment Flexibility – AI and HPC workloads must run on various GPU architectures, including cloud-based clusters.
Increased Maintenance Overhead – Supporting multiple GPU architectures requires different optimization strategies.
The ideal solution is a vendor-agnostic GPU programming model, where the same code can run efficiently across different GPUs without modification. This is exactly what KernelAbstractions.jl provides.
The Julia Solution: KernelAbstractions.jl for Unified GPU Programming
To address these challenges, Julia offers KernelAbstractions.jl, a framework that abstracts away vendor-specific details and enables GPU portability. Instead of writing separate kernels for CUDA, ROCm, and oneAPI, KernelAbstractions.jl lets developers write a single GPU kernel that runs on every supported backend.
Single Codebase – Write once, deploy on multiple GPU architectures.
Cross-Platform AI & HPC Workflows – Run AI models and scientific simulations across mixed GPU clusters.
Future-Proof GPU Development – Avoid vendor lock-in as GPU architectures evolve.
Writing Cross-Vendor GPU Code in Julia
Instead of manually optimizing code for different architectures, KernelAbstractions.jl provides a unified interface for defining and launching GPU kernels. Let's see how it works.
Step 1: Writing a Vendor-Neutral GPU Kernel
Before running computations on a GPU, we define a kernel that performs vector addition across multiple threads:
```julia
using KernelAbstractions

@kernel function vector_addition!(A, B, C)
    i = @index(Global)
    @inbounds C[i] = A[i] + B[i]
end
```
Why This Matters: The @kernel macro abstracts away vendor-specific kernel syntax, while @index(Global) returns the global index of the current work-item, so each parallel thread handles one element of the arrays.
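Because nothing in the kernel refers to a particular vendor, it can even be launched on the host CPU for quick testing before any GPU is involved. Here is a minimal sketch, assuming the KernelAbstractions.jl 0.9-style launch API, where the ndrange keyword sets the total number of work-items:

```julia
using KernelAbstractions

A = rand(Float32, 1_000)
B = rand(Float32, 1_000)
C = similar(A)

# Instantiate the kernel for the CPU backend with a workgroup size of 64,
# then launch it over length(C) work-items.
vector_addition!(CPU(), 64)(A, B, C, ndrange = length(C))
KernelAbstractions.synchronize(CPU())

C ≈ A .+ B   # true
```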
Step 2: Running the Same Code on Different GPUs
Now, we execute the same kernel on both NVIDIA (CUDA) and AMD (ROCm) GPUs:
```julia
A = rand(Float32, 1_000_000)
B = rand(Float32, 1_000_000)
C = similar(A)

# Run on an NVIDIA GPU
using CUDA
A_cuda, B_cuda, C_cuda = CuArray(A), CuArray(B), CuArray(C)
vector_addition!(CUDABackend(), 256)(A_cuda, B_cuda, C_cuda, ndrange = length(C_cuda))
KernelAbstractions.synchronize(CUDABackend())

# Run on an AMD GPU
using AMDGPU
A_amd, B_amd, C_amd = ROCArray(A), ROCArray(B), ROCArray(C)
vector_addition!(ROCBackend(), 256)(A_amd, B_amd, C_amd, ndrange = length(C_amd))
KernelAbstractions.synchronize(ROCBackend())
```

Note that each launch takes an ndrange keyword telling KernelAbstractions how many work-items to run; only the backend object and the array type change between vendors.
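In practice you rarely hard-code the backend. KernelAbstractions.get_backend can derive it from the array type itself, so one helper covers CPU, CUDA, and ROCm arrays without branching. The sketch below is illustrative; the wrapper name add_vectors! is hypothetical and not part of any package:

```julia
using KernelAbstractions

# Hypothetical convenience wrapper: infer the backend from the output array,
# launch the portable kernel, and wait for it to finish.
function add_vectors!(C, A, B; workgroupsize = 256)
    backend = KernelAbstractions.get_backend(C)
    vector_addition!(backend, workgroupsize)(A, B, C, ndrange = length(C))
    KernelAbstractions.synchronize(backend)
    return C
end

add_vectors!(C, A, B)                   # plain Arrays → CPU backend
# add_vectors!(C_cuda, A_cuda, B_cuda)  # CuArrays → CUDABackend, if CUDA.jl is loaded
# add_vectors!(C_amd, A_amd, B_amd)     # ROCArrays → ROCBackend, if AMDGPU.jl is loaded
```

The same pattern extends to Intel GPUs through oneAPI.jl's oneArray type.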
Curious about writing portable GPU kernels that work across NVIDIA, AMD, and Intel GPUs? Watch this webinar to explore KernelAbstractions.jl in action!
Real-World Applications of Vendor-Neutral GPU Programming
Vendor-neutral GPU programming in Julia unlocks new possibilities across industries:
1. Deep Learning & AI: Seamless Training and Deployment Across GPUs
2. Scientific Simulations & Digital Twins: Vendor-Neutral GPU Acceleration
3. Climate & Weather Forecasting: Scalable, Multi-GPU Simulations
By embracing vendor-neutral GPU programming, engineers and researchers can build future-proof applications that adapt to evolving hardware landscapes.