Mastering CUDA and High-Performance Computing, Part II
A Deep Dive from Compiler Internals to High-Performance Parallel Computing
Introduction
For a long time, I approached CUDA and GPU performance the way most engineers do: through fragments.
I read the CUDA Programming Guide, the PTX ISA specification, sections of the LLVM Language Reference, scattered forum posts, backend source files, and many whitepapers describing streaming multiprocessors and warp schedulers.
Each document was precise, internally consistent, and technically rigorous, but none of them, in isolation, explained the system. The PTX manual described instructions but not why they existed.
The CUDA guide described memory hierarchies but not how code became those memory accesses. LLVM documentation described passes and SSA form, but without grounding them in the physical constraints of a GPU executing 64 warps simultaneously.
I could see every layer, but I could not see how they connected. Performance tuning still felt empirical: inspect PTX, adjust launch parameters, add __restrict__, benchmark again.
The compiler remained a black box that emitted artifacts, not a system whose behavior I could predict.
The shift came when I stopped treating these as independent specifications and instead focused on the underlying invariants:
how computation is represented
how hardware executes dependency graphs
how those representations evolved alongside processor architecture.
I began reading compiler design notes, backend implementations, and especially the writings and technical explanations of Chris Lattner, tracing how LLVM’s SSA model, register allocation, and instruction selection were explicitly designed to map abstract programs onto finite physical machines.
In parallel, I studied the evolution of hardware itself: from in-order scalar pipelines to superscalar out-of-order execution, and eventually to GPUs, where thousands of threads exist not as independent programs but as replicated instances of the same dependency graph.
It became clear that hardware had stopped becoming fundamentally more complex in its execution model; instead, compilers had absorbed that complexity.
The compiler was no longer just translating syntax: it was actively restructuring computation to satisfy register file limits, memory latency constraints, and instruction throughput requirements.
Modern performance was no longer determined solely by hardware capability, but by how effectively the compiler could expose parallelism within the constraints of that hardware.
That realization reframed everything. CUDA was not a separate programming model layered on top of GPUs; it was a frontend into LLVM’s transformation pipeline, and LLVM itself was the mechanism that reshaped high-level intent into a form the GPU could physically sustain.
The compiler had become the critical mediator between software and silicon, encoding assumptions about latency, bandwidth, register pressure, and execution width directly into the structure of the program.
Understanding CUDA performance therefore required understanding LLVM: not just its syntax, but its internal passes, its SSA graph semantics, and its register allocation strategies.
Only then did the abstraction barrier dissolve, revealing that what ultimately runs on a GPU is not the original kernel, nor even its PTX representation, but the final physical realization of an LLVM-optimized dependency graph constrained by the realities of hardware.
From CUDA kernels to silicon
I remember the exact moment the abstraction barrier finally dissolved into dust.
Until then, CUDA performance tuning felt like an endless cycle of micro-optimizations: insert a __restrict__ here, manually unroll a loop there, inspect the PTX output, rerun benchmarks, adjust launch parameters, and hope for better occupancy.
PTX felt like the canonical truth: the visible, low-level artifact you could reason about, but it was only the tip of the iceberg.
Reading the design notes and deep technical writings of Chris Lattner, combined with the LLVM Language Reference, the NVVM backend specification, and numerous LLVM source dives, revealed a far more rigorous reality: CUDA kernels never become GPU programs directly.
They first exist as complete LLVM SSA graphs. Every subsequent transformation (NVVM lowering, PTX emission, instruction scheduling, register allocation) is a constrained projection of that SSA graph onto the finite hardware resources of an SM.
When Clang compiles a CUDA kernel, the “threads” you write in source code are abstracted away; the compiler produces a single SSA function parameterized by the thread index.
Execution is purely functional, and dependencies are fully explicit. Consider a canonical SAXPY kernel:
__global__ void saxpy(float a, const float* __restrict__ x,
                      const float* __restrict__ y,
                      float* __restrict__ out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = a * x[i] + y[i];
}
Clang then lowers this into an SSA function resembling:
define void @saxpy(float %a,
                   float addrspace(1)* noalias %x,
                   float addrspace(1)* noalias %y,
                   float addrspace(1)* noalias %out) {
entry:
  %tid = call i32 @llvm.nvvm.read.ptx.sreg.tid.x()
  %ctaid = call i32 @llvm.nvvm.read.ptx.sreg.ctaid.x()
  %ntid = call i32 @llvm.nvvm.read.ptx.sreg.ntid.x()
  %block_offset = mul i32 %ctaid, %ntid
  %i32 = add i32 %block_offset, %tid
  %idx = sext i32 %i32 to i64
  %x_ptr = getelementptr float, float addrspace(1)* %x, i64 %idx
  %y_ptr = getelementptr float, float addrspace(1)* %y, i64 %idx
  %out_ptr = getelementptr float, float addrspace(1)* %out, i64 %idx
  %vx = load float, float addrspace(1)* %x_ptr, !noalias !0
  %vy = load float, float addrspace(1)* %y_ptr, !noalias !1
  %mul = fmul contract float %a, %vx
  %sum = fadd contract float %mul, %vy
  store float %sum, float addrspace(1)* %out_ptr, !noalias !2
  ret void
}
Several structural observations are immediately apparent:
SSA nodes encode all dependencies explicitly. Every intermediate %value, like %vx, %mul, %sum, occupies a virtual register that is immutable. Lifetimes are explicit; LLVM can compute dead values and reuse registers globally, long before physical registers exist.
Thread identity is parameterized, not instantiated. %tid, %ctaid, %ntid are just SSA inputs. The compiler sees a single, deterministic function, not thousands of concurrently executing threads. Hardware parallelism emerges only when the same SSA graph is instantiated across SM warps.
Memory hierarchy is embedded in the type system. addrspace(1) is global memory, addrspace(3) is shared memory, addrspace(5) is local/thread memory. This distinction allows alias analysis and load-store motion passes to apply different reordering guarantees depending on latency and side effects.
The role of __restrict__ is now immediately visible in LLVM IR:
%vx = load float, float addrspace(1)* %x_ptr, !noalias !0
This metadata allows the compiler to disambiguate memory locations, enabling Global Value Numbering (GVN), load forwarding, and redundant load elimination.
On a GPU, the difference is dramatic: a single unprovable alias in global memory could serialize hundreds of cycles of latency across a warp.
Another early and profound transformation is the mem2reg pass. Stack allocations introduced in IR construction are converted into pure SSA values:
Before mem2reg:
%tmp = alloca float
store float %vx, float* %tmp
%val = load float, float* %tmp
%mul = fmul float %val, %val
After mem2reg:
%mul = fmul float %vx, %vx
The load and store vanish entirely. On GPUs, this matters because allocas that fail promotion land in thread-local memory, which is backed by device DRAM and costs hundreds of cycles per access. SSA immutability and live-range analysis now become the primary mechanism controlling register pressure downstream.
The next critical stage is SelectionDAG, where LLVM converts SSA IR into a directed acyclic graph of operations:
        fadd
       /    \
    fmul    load y
    /  \
   a    load x
This graph is target-independent but already encodes all data dependencies. DAG nodes are matched against target-specific instruction patterns using heuristics that account for latency, throughput, and register use. For example, the LLVM IR:
%offset = mul i64 %i, 4
%addr = add i64 %base, %offset
becomes PTX:
mul.lo.s64 %rd2, %rd1, 4;
add.s64 %rd3, %rdBase, %rd2;
but at the SASS stage, NVIDIA’s backend can fold this into a single scaled addressing instruction.
Instruction count is not the primary metric; minimizing dependency depth and live-range length dominates warp occupancy and latency hiding.
Machine IR (MIR) after instruction selection but before register allocation exposes these constraints:
%3:gpr64 = IMUL64ri32 %1, 4
%4:gpr64 = ADD64rr %base, %3
Here, %gpr64 are still virtual. LLVM builds an interference graph and maps virtual registers onto finite physical registers.
On Ampere, the 65,536 registers per SM dictate occupancy: a kernel using 32 registers per thread allows 2048 resident threads, while one using 96 registers per thread drops the limit to 682 (rounded down to whole warps, 672 threads in practice).
Correlating LLVM live-range dumps with ptxas --verbose reports confirmed that register pressure originates entirely at the LLVM IR level, long before PTX emission.
Finally, loop transformations illustrate LLVM’s deterministic orchestration of latency hiding. Consider a reduction:
for (int i = 0; i < 1024; i++)
    sum += x[i];
The initial IR contains a loop-carried dependency:
%sum_next = fadd float %sum, %val
which enforces strict serialization. After partial unrolling:
%v0 = load float, ptr
%v1 = load float, ptr+4
%v2 = load float, ptr+8
%v3 = load float, ptr+12
%s1 = fadd float %sum, %v0
%s2 = fadd float %s1, %v1
%s3 = fadd float %s2, %v2
%s4 = fadd float %s3, %v3
LLVM’s scheduler can now interleave loads and arithmetic:
load x[i]
load x[i+1]
fmul previous
fadd previous
By the time PTX is emitted, all parallelism, instruction ordering, and memory dependencies have already been determined; the warp scheduler merely executes what LLVM has exposed.
Global memory latency is hidden not by runtime heuristics, but by compiler-scheduled instruction independence.
Ultimately, CUDA performance is a function of SSA graph transformations, not PTX heuristics. Every critical metric (register pressure, warp occupancy, memory coalescing, instruction-level parallelism) originates in LLVM IR.
Once I understood this, CUDA tuning ceased being black magic: by inspecting IR and tracing passes, I could predict register allocation, spilling, latency hiding, and throughput entirely before compiling to PTX or running on hardware.
The abstraction barrier had vanished, and performance became a deterministic, analyzable function of compiler-driven SSA transformations.
Tracing LLVM Transformations to Hardware
Once I understood that CUDA kernels first exist as LLVM SSA graphs, the next natural question emerged: how exactly do these immutable, dependency-annotated IR values map all the way down to warp execution on the SM?
To answer this, I traced a simple kernel (again, SAXPY) through every stage: SSA → NVVM → PTX → SASS → warp execution.
LLVM SSA Graph
The SSA graph encodes pure computation and memory dependencies:
%vx = load float, float addrspace(1)* %x_ptr, !noalias !0
%vy = load float, float addrspace(1)* %y_ptr, !noalias !1
%mul = fmul contract float %a, %vx
%sum = fadd contract float %mul, %vy
store float %sum, float addrspace(1)* %out_ptr, !noalias !2
Key points:
Each %value is immutable. There is no “overwriting” of registers; instead, values flow along edges in a DAG.
Dependencies are explicit: %sum depends on %mul, which depends on %vx and %a.
Memory type is explicit: addrspace(1) informs alias analysis, reordering, and coalescing passes.
LLVM passes operate entirely on this DAG:
mem2reg removes unnecessary stack loads/stores.
GVN eliminates redundant calculations.
LICM hoists loop-invariant loads out of loops.
LoopUnroll exposes independent operations for the scheduler.
SLPVectorizer packs independent arithmetic into wider SIMD instructions (or pseudo-SIMT operations for the GPU).
At this stage, parallelism is implicit, coming from parameterization over %tid and %ctaid, not from threads or warps.
NVVM Lowering and PTX Generation
LLVM’s NVVM backend converts SSA into PTX pseudo-assembly, a virtual instruction set for NVIDIA GPUs:
// Global memory loads (address arithmetic omitted; %rd_x etc. hold addresses)
ld.global.f32 %f1, [%rd_x];
ld.global.f32 %f0, [%rd_y];
// Arithmetic (%fa holds the parameter a)
mul.f32 %f2, %fa, %f1;
add.f32 %f3, %f2, %f0;
// Store
st.global.f32 [%rd_out], %f3;
Observations:
SSA immutability is preserved: %f1, %f2, %f3 correspond 1:1 with SSA nodes.
Memory hierarchy is respected: loads/stores from addrspace(1) become global accesses; addrspace(3) would become shared.
Instruction reordering is limited only by LLVM metadata and alias analysis, not PTX syntax; PTX reflects already-scheduled operations.
PTX is still virtual: the actual scheduling, register allocation, and instruction fusion are deferred until SASS.
SASS: Physical Instruction Selection
NVIDIA’s ptxas then maps PTX onto SASS, the native ISA of the SM:
IMAD R3, R1, 4, RBase   // Compute address
LD.E R4, [R3]           // Load x[i]
LD.E R5, [R3_y]         // Load y[i]
FFMA R6, R4, Ra, R5     // Fused multiply-add
ST.E [R3_out], R6       // Store result
Key points:
LLVM’s DAG scheduling determines the order of these operations. Instruction count is not minimized; dependency depth and live-range length are.
Register pressure is now concrete: each %value is mapped to a physical register. The interference graph built in MIR ensures no two simultaneously live values share the same physical register.
Instruction fusion happens here (the fused multiply-add), an optimization LLVM hinted at by keeping the fmul/fadd pair contractible in SSA.
By now, warp-level execution semantics are encoded: independent instructions are consecutive, loads can hide arithmetic latency, and dependency chains are minimized.
Warp Execution
Finally, the SM executes the same SASS code across warps of 32 threads each:
Thread 0: load x[0], load y[0], fma, store
Thread 1: load x[1], load y[1], fma, store
...
Thread 31: load x[31], load y[31], fma, store
The warp scheduler selects ready instructions from the DAG instantiated across 32 threads.
Global memory latency hiding emerges naturally: because LLVM already separated independent loads from dependent arithmetic, the scheduler always finds instructions to issue, overlapping memory and compute.
Register pressure limits occupancy: if LLVM IR produced too many live %values per thread, fewer threads can execute concurrently.
The crucial insight: hardware doesn’t create parallelism. It executes what LLVM has already structurally exposed.
Warp scheduling, latency hiding, and coalescing are emergent properties of SSA DAG transformations combined with physical constraints.
Loop Unrolling in SSA → PTX → SASS
Consider a reduction:
for (int i = 0; i < 4; i++)
    sum += x[i];
SSA after partial unroll:
%v0 = load float, %x+0
%v1 = load float, %x+1
%v2 = load float, %x+2
%v3 = load float, %x+3
%s1 = fadd float %sum, %v0
%s2 = fadd float %s1, %v1
%s3 = fadd float %s2, %v2
%s4 = fadd float %s3, %v3
PTX:
ld.global.f32 %f0, [%x]
ld.global.f32 %f1, [%x+4]
ld.global.f32 %f2, [%x+8]
ld.global.f32 %f3, [%x+12]
add.f32 %sum1, %sum, %f0
add.f32 %sum2, %sum1, %f1
add.f32 %sum3, %sum2, %f2
add.f32 %sum4, %sum3, %f3
SASS:
LD.E R0, [R1]
LD.E R2, [R1+4]
LD.E R4, [R1+8]
LD.E R6, [R1+12]
FADD R8, Rsum, R0
FADD R10, R8, R2
FADD R12, R10, R4
FADD R14, R12, R6
ST.E [Rout], R14
Instruction interleaving: the SM scheduler can issue independent loads before dependent FADDs.
Latency hiding: memory loads from global memory overlap with prior arithmetic.
Predictable register pressure: each %vN maps to R0–R6; FADDs reuse registers, keeping occupancy within hardware limits.
Merging History, Architecture, and Software
I remember stepping back and seeing the full picture: GPU evolution, compiler transformations, and kernel performance are inseparable.
Every decision in a CUDA kernel exists in the context of decades of architectural evolution.
In the 1990s, GPUs were fixed-function pipelines: vertices and pixels flowed through rigid stages, parallelism implicit but inaccessible.
Programmable shaders in the early 2000s allowed tiny per-pixel or per-vertex programs (SIMD-style warps, predication, and texture-aware memory) but general-purpose computation was still a hack.
Then CUDA arrived in 2006, exposing threads, blocks, and explicit memory hierarchies: registers, shared memory, global memory. For the first time, developers could directly reason about latency, coalescing, and warp-level execution.
Yet kernels were still only the tip of the iceberg: true performance lives in LLVM IR, long before PTX or hardware sees it.
Every kernel begins as an SSA graph: pure, immutable, fully explicit in dependencies. %tid, %ctaid, %ntid are function parameters, not threads.
Address spaces encode memory hierarchy: addrspace(1) for global, addrspace(3) for shared, addrspace(5) for local. __restrict__ metadata allows disambiguation, enabling GVN, load forwarding, and aggressive reordering.
Loop unrolling, scalar replacement, and DAG scheduling expose independent operations, determining instruction-level parallelism and register pressure.
When LLVM lowers to PTX and eventually SASS, the graph’s structure dictates occupancy, warp scheduling, and latency hiding. The GPU does not invent parallelism; it realizes the parallelism the compiler has already exposed.
Every optimization (register allocation, memory coalescing, interleaved arithmetic) is preordained by SSA transformations.
Compiler, Hardware, and the Deterministic Birth of Performance
Stepping back, it becomes clear that GPU performance is never accidental.
Every step of that evolution, from fixed-function pipelines to programmable shaders, and from early CUDA to modern heterogeneous SMs, is a layer that shaped how we write, reason about, and execute kernels.
But the real magic lies in the compiler, particularly LLVM: it transforms human code into SSA graphs where dependencies, memory hierarchy, and live ranges are explicit, deterministic, and fully analyzable.
PTX and SASS are not optimization stages; they are projections of these decisions onto silicon.
Performance emerges not at runtime but at compile time, where LLVM orchestrates instruction scheduling, loop transformations, memory disambiguation, and register allocation.
The GPU simply executes what LLVM has already exposed: warps, latency hiding, coalesced accesses, and parallelism are consequences of carefully structured SSA graphs.
Every tweak in IR ripples through PTX, affects occupancy, and determines throughput. Understanding this chain (history, architecture, compiler) is what allows a developer to predict, reason about, and ultimately master GPU performance.