Mastering CUDA and High-Performance Computing, Part V
A Deep Dive from Compiler Internals to High-Performance Parallel Computing
The instruction pipeline
There is a moment in GPU optimization work that comes after the first profiling session, after you’ve internalized roofline arithmetic and stopped chasing occupancy for its own sake.
You’ve fixed the obvious things. Coalescing is clean. Shared memory is tiled. Bank conflicts are gone. Register pressure is measured and tolerated. Arithmetic intensity sits above the ridge point. You run the kernel.
The profiler still shows stalls.
Different stalls. The memory stalls are reduced to acceptable levels, but something else is happening.
The SM is doing work, not waiting on DRAM, and yet it is slower than the theoretical ceiling by a factor the memory model doesn’t explain.
This is when the focus shifts from data movement to instruction flow. From where data lives to how instructions are scheduled, issued, and retired.
The second real lesson of GPU programming is this: the instruction pipeline is not transparent.
It has depth, hazards, and throughput ceilings that are entirely independent of memory bandwidth. It can be the bottleneck even when memory is not.
Understanding it requires going deeper into the SM than most CUDA tutorials ever go.
The instruction stream beneath the kernel
Every CUDA kernel you write is compiled to PTX (NVIDIA’s Parallel Thread eXecution intermediate representation) and from there to SASS: Streaming ASSembler, the native instruction set of the GPU.
PTX is portable across GPU generations within broad families. SASS is not. It is tied to a specific architecture, encodes pipeline-specific timing constraints directly into instruction control bits, and is the only thing that actually executes on the hardware.
Most CUDA developers never look at SASS. This is a mistake: not because you need to write it, but because it is the only place you can verify what the compiler actually produced, confirm that instruction selection matches your intent, and identify pipeline bottlenecks that the high-level model cannot expose.
The translation path:
CUDA source (.cu)
↓ [nvcc / clang front-end: C++ parsing, template instantiation]
PTX (target-independent virtual ISA; open, documented)
↓ [ptxas: machine-specific optimization, register allocation, SASS generation]
SASS (cubin; architecture-locked, closed binary)
↓ [CUDA runtime: kernel loading, parameter binding, SM dispatch]
SM execution units

To inspect SASS for any compiled binary, use:
# From a compiled CUDA binary
cuobjdump --dump-sass my_kernel.cubin
# Or directly from a PTX file compiled for a target
ptxas -arch=sm_80 kernel.ptx -o kernel.cubin
cuobjdump --dump-sass kernel.cubin

When you write float x = a * b + c, the compiler emits a single FFMA instruction (fused multiply-add), which performs both operations in one pipeline pass with a single rounding step.
This is semantically different from two separate operations: (a * b) + c computed with FFMA rounds once at the end; float t = a * b; t = t + c; rounds twice.
The compiler fuses by default. Disable with -fmad=false if numerical reproducibility across implementations matters, accepting roughly halved throughput on compute-bound code.
The SASS instruction stream is not a one-to-one transcription of your source. It is the compiler’s best attempt to map your intent to the hardware’s actual execution model.
Understanding that model is what makes the difference between reading SASS as noise and reading it as a diagnostic.
Latency versus throughput
The single most important conceptual distinction in pipeline-level reasoning is between instruction latency and instruction throughput. They govern different bottlenecks, require different fixes, and are frequently confused.
Latency is the number of clock cycles between when an instruction is issued and when its result is available to a subsequent instruction that depends on it. A dependent instruction cannot issue until this interval has elapsed.
Throughput is the inverse of the rate at which independent instructions can flow through an execution unit, expressed as cycles per instruction. It represents the pipeline’s steady-state capacity, ignoring data dependencies entirely.
The following table gives measured values for the Ampere architecture (A100, sm_80), derived from microbenchmarking work by Abdelkhalik et al. (2022) and corroborated across multiple independent sources:
Instruction Latency (cycles) Throughput (cyc/instr, per SMSP)
─────────────────────────────────────────────────────────────────────────────
FFMA (FP32) 4 0.25 (4 FMAs/cycle)
FADD / FMUL (FP32) 4 0.25
IMAD / IADD3 (INT32) 4 0.25
DFMA (FP64) 8 2.0
MUFU.RCP / RSQ 16 4.0
MUFU.SIN / COS 16 4.0
MUFU.EX2 / LG2 16 4.0
HFMA2 (FP16×2) 4 0.25
─────────────────────────────────────────────────────────────────────────────
Shared memory load 23 ~1.0
Shared memory store 19 ~1.0
L1 cache hit (LDG) ~33 ~1.0
L2 cache hit (LDG) ~200 ~1.0
HBM uncached (LDG)           ~290–566†           ~1.0
─────────────────────────────────────────────────────────────────────────────

†The range reflects two different measurement methodologies: pointer-chasing through fully resident arrays (290 cycles, Abdelkhalik et al.) versus pointer-chasing through a working set larger than L2, which forces full HBM round-trips and measured ~566 cycles on A100 (Shi et al., 2025). The lower number approximates L2-miss latency; the upper number is closer to the true cold-DRAM round-trip experienced in bandwidth-saturated conditions.
The MUFU throughput of 4 cycles per instruction requires emphasis. MUFU executes transcendental functions (sinf, cosf, expf, logf, rsqrtf, and reciprocal, when compiled with -use_fast_math or called through the __sinf-style intrinsics) via the Special Function Unit, a separate pipeline from the FP32 FMA units.
On A100, each SM sub-partition has one SFU. That SFU can issue one MUFU instruction every 4 cycles, while the FP32 pipe can issue four FMAs per cycle.
A kernel that mixes heavy transcendental usage with FP32 arithmetic will hit SFU throughput long before it approaches FP32 throughput. This matters for ML activation functions (tanhf, expf in softmax) and any scientific kernel using trigonometry.
Now the fundamental insight: FFMA latency is 4 cycles, but FFMA throughput is 1 instruction per 0.25 cycles (4 per cycle). Four independent FP32 FMAs can enter the pipeline simultaneously every cycle.
If no dependency links them, all four execute in parallel, and the pipeline issues one per cycle. If every instruction reads the result of the previous one, the pipeline must wait 4 cycles between each issue. The throughput ceiling goes unrealized.
This is not hypothetical. It is the dominant performance regime for naive accumulation loops.
The SMSP: the real unit of execution
A critical detail that the previous part’s SM diagram abstracts away: on Ampere (and Volta and Turing), the SM is not monolithic. It is divided into four SM sub-partitions, each designated SMSP in the Nsight Compute metric namespace.
Each SMSP contains:
One warp scheduler (one instruction issued per cycle)
One dispatch unit
An L0 instruction cache (private to the SMSP)
A 16K×32-bit register file (64 KB, 16,384 32-bit registers per SMSP; four SMSPs yield 65,536 total per SM)
16 FP32 CUDA cores (on GA100, which has 64 per SM; gaming Ampere GA10x uses a different split)
16 INT32 cores
8 FP64 cores
1 third-generation Tensor Core
8 Load/Store Units
The SMSP, not the SM, is the scheduling atom. A warp is assigned to a specific SMSP at launch and remains there for its entire lifetime. It does not migrate between SMSPs.
All the warp stall metrics in Nsight Compute that carry the smsp__ prefix are per-SMSP counters; sm__ metrics aggregate across all four. This subdivision matters for two reasons.
First, it explains the per-SMSP warp pool size. On Ampere, each SMSP can host up to 16 warps. Four SMSPs yield the 64-warp-per-SM maximum. The scheduler operates per SMSP, picking one eligible warp per cycle from its local pool of 16.
The practical ceiling for latency hiding is thus determined per SMSP, not per SM. A kernel with 32 resident warps per SM has 8 per SMSP, which is often sufficient for latency hiding when HBM latency is the bottleneck.
Second, it exposes the register file partitioning. The 65,536 registers per SM are physically distributed across four SMSPs, 16,384 each.
When you compute that 32 registers per thread allows 2,048 threads per SM at full occupancy, those 2,048 threads are physically spread 512 per SMSP, each holding 32 registers, fully consuming the 16,384 available.
The constraint is per-SMSP, and violating it in one SMSP limits the entire SM.
The Nsight Compute metric smsp__warps_active.avg.pct_of_peak_sustained_active reports active warp fraction per SMSP averaged over time.
It is more informative than sm__warps_active for diagnosing occupancy limits because it reflects the actual scheduling capacity of the unit that does the scheduling.
The hardware that enforces the dependency graph
Before a warp can issue its next instruction, the warp scheduler must verify that all source operands for that instruction are available.
The hardware mechanism for this is the scoreboard, a per-SMSP tracking structure that records which registers have outstanding writes from in-flight instructions.
Every issued instruction that writes to a register marks that register as “pending” in the scoreboard. When the instruction completes and the result is written to the register file, the mark is cleared.
If the scheduler selects a warp to issue and the warp’s next instruction reads a register that is still marked pending, the warp is stalled. It is not eligible. The scheduler moves to the next warp.
CUDA distinguishes two scoreboard domains based on the source of the pending result:
Short scoreboard (Nsight metric: smsp__pcsamp_warps_issue_stalled_short_scoreboard): tracks results returning through the MIO pipelines with short, bounded latency.
This covers: shared memory loads (23-cycle latency), SFU/MUFU results (16 cycles), indexed constant loads, and warp-level vote instructions.
Fixed-latency arithmetic from the FMA pipe (FP32/INT32/FP16) is handled differently: the hardware knows the exact cycle at which the result will be ready, so the compiler encodes the required wait directly into SASS control bits, and warps held by these dependencies surface under the wait stall reason rather than a scoreboard entry.
Long scoreboard (Nsight metric: smsp__pcsamp_warps_issue_stalled_long_scoreboard): tracks instructions whose completion time is not fixed in advance.
This covers all loads from global memory (L1 hits at ~33 cycles, L2 hits at ~200 cycles, HBM at ~290–566 cycles) and anything that crosses the L1TEX pipeline.
The hardware cannot predict when the data will arrive; it waits for an explicit writeback signal from the memory subsystem.
This split has a crucial diagnostic implication. High stall_long_scoreboard means threads are waiting for data from DRAM. The fix is occupancy (more warps to swap in while waiting), prefetching, or restructured data layout.
High stall_short_scoreboard (or stall_wait, its fixed-latency counterpart) means threads are waiting on arithmetic results or shared memory: a dependency bottleneck in the instruction stream itself. The fix is instruction-level parallelism, not occupancy.
A third stall class completes the picture: MIO throttle (stall_mio_throttle). This appears when the input FIFO to the Memory I/O pipeline is full: too many outstanding memory requests are already in flight and new ones cannot be accepted.
It is distinct from stall_long_scoreboard. The latter means a warp is waiting for a specific result. The former means a warp cannot even submit a new request yet.
MIO throttle is the signature of heavy but poorly coalesced global memory access, where each warp's scattered loads splinter into as many as 32 separate transactions and many independent requests flood the queue simultaneously.
And the fourth: math pipe throttle (stall_math_pipe_throttle). This appears when a warp is ready to issue an FP32 (or FP64, or tensor) instruction but the execution pipeline is already occupied by instructions from other warps.
This is the good stall: it means arithmetic throughput, not memory or dependencies, is the actual ceiling. On a well-tuned compute-bound kernel, stall_math_pipe_throttle should dominate the stall breakdown.
The differential diagnosis in Nsight Compute’s Warp State Statistics section is the most powerful single analytical tool available after the roofline. Reading the dominant stall reason maps directly onto the class of optimization required:
stall_long_scoreboard dominant → memory latency. Fix: more warps, better tiling, async prefetch.
stall_short_scoreboard dominant → arithmetic dependency chain. Fix: ILP, loop unrolling, independent accumulators.
stall_mio_throttle dominant → memory request queue saturation. Fix: coalescing, vectorized loads, reduced memory instruction count.
stall_math_pipe_throttle dominant → compute-bound. Fix: tensor cores if using scalar FP16/FP32, mixed precision, or just accept that you’ve reached peak.
stall_barrier dominant → synchronization structure. Fix: reduced barrier scope, better work distribution, cooperative groups.
stall_not_selected dominant → too many eligible warps, not enough issue bandwidth. Fine above ~20%; unusually high means you’re register-rich with very high ILP and can potentially increase complexity per thread.
The dependency chain
Consider a canonical example: a dot product over a fixed array, unrolled.
float acc = 0.0f;
for (int i = 0; i < N; i++) {
acc += a[i] * b[i];
}

After loading all a[i] and b[i] into registers (ignoring the loads themselves for now), the inner loop body compiles to a sequence of FFMA instructions, each writing to acc and reading the result of the previous one:
FFMA R4, R6, R8, R4 // acc += a[0]*b[0]; R4 depends on R4
FFMA R4, R10, R12, R4 // acc += a[1]*b[1]; R4 depends on R4 from prior
FFMA R4, R14, R16, R4 // acc += a[2]*b[2]; R4 depends on R4 from prior
...

Each FFMA writes to R4 and reads R4. The result is not available until 4 cycles after issue, so every instruction stalls on its predecessor. The effective issue rate is 1 FFMA per 4 cycles.
The FP32 pipeline on an A100 SMSP can issue 4 FMAs per cycle when fed independent work. This naive accumulation uses 1/16 of that capacity.
The fix is independent accumulators, which break the serial chain into parallel chains that the hardware can interleave:
float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
for (int i = 0; i < N; i += 4) {
acc0 += a[i+0] * b[i+0];
acc1 += a[i+1] * b[i+1];
acc2 += a[i+2] * b[i+2];
acc3 += a[i+3] * b[i+3];
}
float acc = (acc0 + acc1) + (acc2 + acc3);
The SASS now has four independent dependency chains. After FFMA R4, ..., R4, the scheduler can immediately issue FFMA R5, ..., R5, FFMA R6, ..., R6, and FFMA R7, ..., R7.
When the first FFMA’s 4-cycle latency expires and R4 is ready, the next FFMA R4 instruction can issue. The pipeline runs at near peak throughput.
This is not a micro-optimization on the margin. On compute-bound kernels with long accumulation chains (reductions, dot products, small GEMMs written without tensor cores) the difference between serial and independent-accumulator form is frequently 4–8× in throughput.
The compiler performs this unrolling automatically in many cases when #pragma unroll is used or when the loop trip count is known at compile time and is small.
It does not reliably unroll across complex loop bodies, through function call boundaries, or when accumulators are accessed via pointers (which the compiler may assume alias).
To verify: inspect the SASS. If you see FFMA R4, ..., R4 repeating with the same destination register, the dependency chain is serialized. If you see FFMA R4, FFMA R5, FFMA R6, FFMA R7 cycling, the compiler found the independent accumulator form.
The profiler can confirm via smsp__pcsamp_warps_issue_stalled_short_scoreboard, together with the wait stall reason that covers fixed-latency FMA dependencies: if either is non-trivial on a compute-bound kernel, the dependency graph is restricting throughput.
Register bank conflicts
The register file on each SMSP is a four-banked SRAM, where each bank is 4 bytes wide. An FFMA instruction reads up to three source registers (two multiplicands and an addend) and one destination.
If two or more source registers of a single instruction map to the same bank, the reads serialize. Each register at index R maps to bank R % 4 (on Ampere; earlier architectures used different moduli and widths).
For an FFMA with sources R4, R8, R12: banks are 0, 0, 0. All three reads conflict. The register file must issue three sequential read operations to bank 0, adding latency to the instruction even if no data dependency exists.
For an FFMA with sources R4, R5, R6: banks are 0, 1, 2. No conflict. All three reads issue in parallel.
This has a direct consequence for accumulation loop unrolling. Consider four accumulators in R4, R5, R6, R7. If the multiplicands for each land in:
FFMA R4, R8, R12, R4 // banks: 0, 0, 0, 0 → conflict on R8, R12, R4
FFMA R5, R9, R13, R5 // banks: 1, 1, 1, 1 → conflict
All three sources of each FFMA map to the same bank (since 8%4=0, 12%4=0, 4%4=0). Every instruction stalls internally. The throughput benefit of ILP is partially eaten by register bank conflicts.
The compiler’s register allocator is aware of this and attempts to assign registers to avoid conflicts in hot instruction sequences. But it operates under register pressure constraints and cannot always do so.
When manual register assignment is needed for peak performance (as in hand-tuned sgemm kernels) this requires explicit attention to bank distribution.
To check: in SASS output from Nsight Compute’s Source view, bank conflict indicators appear per instruction when hardware counter smsp__sass_data_bank_conflicts_pipe_fma_cycles_active is nonzero.
Values above 5% on a compute-bound kernel suggest register allocation is constraining pipeline throughput.
Control divergence
Divergence is well-understood at the conceptual level but frequently mis-modeled quantitatively.
The SIMT execution model on post-Pascal NVIDIA GPUs (Volta and later) uses independent thread scheduling: each thread has its own program counter and stack, and the hardware can reconverge threads that diverged.
This replaced the pre-Volta model where threads were locked to warp-level lockstep with explicit SIMD masking.
What independent thread scheduling provides:
Threads can diverge without being permanently trapped in separate execution paths until an explicit reconvergence point.
The scheduler can interleave instructions from different sub-groups within a warp to improve utilization.
What it does not provide:
SIMT execution is still 32-wide. When a warp diverges, the hardware executes the taken path and the not-taken path serially, masking inactive threads. Independent thread scheduling changes when reconvergence can happen, not whether both paths must execute.
The true cost of a divergent conditional is not “50% efficiency if threads split 50/50.” It is the sum of execution time for all distinct paths through the conditional, not the maximum.
If half the threads take path A (10 instructions) and half take path B (20 instructions), total warp execution time is ~30 instruction-cycles, not ~20.
For nested conditionals, the cost compounds. A kernel with a two-level nested conditional where each level has 50/50 divergence may see a 4× slowdown compared to fully converged execution.
The SASS opcode BRA (branch) is preceded by a predicate evaluation. All threads evaluate the predicate; the warp then issues along the taken path, with threads whose predicate is false masked off.
The reconvergence point is encoded in SSY (set synchronize) and SYNC instructions in the SASS, inserted by the compiler.
The diagnostic metric is smsp__sass_thread_inst_executed_op_control.sum relative to total instructions: high control overhead relative to arithmetic instructions indicates either heavy divergence or loop overhead.
Nsight Compute’s Source Counters section shows per-instruction thread execution counts; instructions in the taken path of a divergent branch show fewer thread executions than instructions outside the branch.
The practical implication: for kernels with data-dependent branching (parsing, tree traversal, sparse format processing), minimizing the number of distinct per-warp execution paths matters more than minimizing the total number of conditionals.
One 32-way divergent conditional is better than eight nested 2-way ones.
Conclusion
After roofline analysis, memory coalescing, and careful tiling, most GPU kernels are no longer limited by bandwidth: they are limited by the instruction pipeline itself.
Understanding the pipeline means looking beyond PTX and into SASS: the real instruction stream executed by the hardware, with its latencies, throughput ceilings, and resource partitions.
Instruction-level bottlenecks (dependency chains, register bank conflicts and control divergence) often dominate performance even when memory stalls are minimal.
Metrics like short and long scoreboard stalls, math pipe utilization, and SMSP-level warp activity provide the only reliable window into these hidden limits.
The practical lesson is simple but profound: to push a compute-bound kernel to its theoretical peak, you must treat the instruction pipeline as first-class terrain.
Unroll loops into independent accumulators, balance register allocation to avoid bank conflicts, minimize divergent paths, and account for specialized units like the SFU.
Only by reasoning at the level of instruction flow, rather than just data movement, can you approach the true limits of GPU performance.



