
CUDA Toolkit: Compilers, Libraries, and the Host-Device Model

March 6, 2026

In the first post we covered GPU performance metrics - bandwidth, TFLOPS, precision. In the second, we went inside the SM and traced how warps execute on partitions. This post connects the software to the hardware: the CUDA Toolkit - the full stack of compilers, libraries, runtime APIs, and profiling tools that let you actually program a GPU.

CUDA stands for Compute Unified Device Architecture. It's NVIDIA's proprietary parallel computing platform, and understanding its components is essential whether you're writing CUDA kernels yourself or just trying to understand why your ML framework needs a specific CUDA version installed.

The Host-Device Model

Every CUDA program operates on a fundamental assumption: there are two separate processors with separate memory.

[Figure: host-device model. CPU with DRAM connected over the PCIe bus to the GPU and its own memory, with cudaMemcpy and cudaMalloc annotations]

  • The Host = the CPU and its system memory (DRAM)
  • The Device = the GPU and its graphics memory (GDDR or HBM)
  • The PCIe Bus = the bridge between them

This is the most important thing to internalize about GPU programming: the CPU and GPU have completely separate memory spaces. Data doesn't magically appear on the GPU. You have to explicitly:

  1. Allocate memory on the GPU - cudaMalloc()
  2. Copy data from CPU memory to GPU memory - cudaMemcpy(Host → Device)
  3. Launch a kernel (GPU function) - kernel<<<blocks, threads>>>()
  4. Copy results back from GPU to CPU - cudaMemcpy(Device → Host)
  5. Free the GPU memory - cudaFree()
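
The five steps above can be sketched as one small program. This is a minimal illustration, not production code: the scale kernel, the buffer size, and the variable names are made up for the example, and error checking is left out to keep the steps visible.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread scales one element.
__global__ void scale(float *data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host buffer lives in ordinary system memory.
    float *h_data = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data = nullptr;
    cudaMalloc(&d_data, bytes);                                 // 1. allocate on the GPU
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);  // 2. host -> device

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;             // round up to cover n
    scale<<<blocks, threads>>>(d_data, 2.0f, n);                // 3. launch the kernel

    // 4. device -> host; this copy waits for the kernel to finish first.
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);                                           // 5. free GPU memory

    printf("h_data[0] = %.1f\n", h_data[0]);
    free(h_data);
    return 0;
}
```

Compile it with nvcc (for example nvcc scale.cu -o scale, where the file name is arbitrary). Note that two of the five steps do nothing but move data across PCIe.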

That PCIe transfer is the bottleneck in many GPU applications. PCIe Gen4 ×16 gives you ~32 GB/s in each direction - compare that to the roughly 2 TB/s of memory bandwidth on an 80 GB A100. That's about a 60× gap (2,000 / 32 ≈ 62) between how fast the GPU can read its own memory and how fast you can feed it new data from the CPU.
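
If you want to see this gap on your own hardware, a rough sketch like the following times one host-to-device copy and one device-to-device copy with CUDA events. The 256 MiB buffer size and the copyGBps helper are choices made for this example; the numbers it prints depend entirely on your GPU, PCIe link, and system, so treat them as a sanity check rather than a benchmark.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: times a single cudaMemcpy with CUDA events
// and returns the achieved copy rate in GB/s.
static float copyGBps(void *dst, const void *src, size_t bytes, cudaMemcpyKind kind) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, kind);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the copy has really finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / 1e9f) / (ms / 1e3f);
}

int main() {
    const size_t bytes = 256u << 20;  // 256 MiB, an arbitrary test size
    void *hostBuf = nullptr, *devA = nullptr, *devB = nullptr;

    cudaMallocHost(&hostBuf, bytes);  // pinned host memory, for full PCIe speed
    cudaMalloc(&devA, bytes);
    cudaMalloc(&devB, bytes);

    // Warm-up copy so one-time initialization cost is not measured.
    cudaMemcpy(devA, hostBuf, bytes, cudaMemcpyHostToDevice);

    float h2d = copyGBps(devA, hostBuf, bytes, cudaMemcpyHostToDevice);
    float d2d = copyGBps(devB, devA, bytes, cudaMemcpyDeviceToDevice);

    // A device-to-device copy reads and writes every byte, so the memory
    // traffic it generates is roughly twice the copy rate.
    printf("host -> device:   %.1f GB/s\n", h2d);
    printf("device -> device: %.1f GB/s copied (~%.1f GB/s of memory traffic)\n",
           d2d, 2.0f * d2d);

    cudaFree(devA);
    cudaFree(devB);
    cudaFreeHost(hostBuf);
    return 0;
}
```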

The first question in GPU optimization isn't "how do I make the kernel faster?" - it's "how do I minimize the data transfer?" Experienced CUDA developers keep data on the GPU as long as possible and batch transfers to amortize the PCIe overhead.

This is also why NVIDIA invested in Unified Memory (cudaMallocManaged()) - it creates a single address space shared by the CPU and GPU, and the driver migrates the underlying pages between them automatically. It simplifies programming but doesn't eliminate the transfer cost; it just hides it.
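
Here is a minimal sketch of what that looks like in code, assuming a GPU and driver that support managed memory. The addOne kernel and the sizes are invented for the example. Notice there is no cudaMemcpy anywhere, but the data still has to cross the bus.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: increments every element.
__global__ void addOne(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 1 << 20;
    int *data = nullptr;

    // One pointer, valid on both the CPU and the GPU.
    cudaMallocManaged(&data, n * sizeof(int));

    for (int i = 0; i < n; ++i) data[i] = i;      // written by the CPU

    addOne<<<(n + 255) / 256, 256>>>(data, n);    // driver moves the pages to the GPU
    cudaDeviceSynchronize();                      // must finish before the CPU reads

    printf("data[41] = %d\n", data[41]);          // pages move back to the CPU here
    cudaFree(data);
    return 0;
}
```

The cudaDeviceSynchronize() call matters: kernel launches are asynchronous, and reading managed memory on the CPU while the kernel is still running is an error on many systems.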

GPGPU: General-Purpose Computing on GPUs

Before CUDA (pre-2007), GPUs could only run graphics shaders. If you wanted to do matrix math on a GPU, you had to disguise your computation as a pixel shader - seriously. The data went in as "textures" and came out as "pixel colors."

CUDA changed that by making the GPU a general-purpose compute device. This is what GPGPU (General-Purpose computing on Graphics Processing Units) means: using a GPU for anything beyond graphics rendering - scientific simulation, machine learning, data analytics, cryptography.

The term "GPGPU" is sometimes used interchangeably with "GPU computing," but technically GPGPU specifically refers to non-graphics workloads running on GPU hardware. Every time you run a PyTorch model on a CUDA device, that's GPGPU.

The nvcc Compiler: From Source to Silicon

The CUDA compiler is nvcc (NVIDIA CUDA Compiler). It's not a simple compiler - it's a compiler driver that orchestrates a multi-stage compilation pipeline.

Here's what happens when you compile a .cu file:

  1. Separation - nvcc separates your code into host code (regular C/C++) and device code (CUDA kernels marked with __global__ or __device__)
  2. Host compilation - the host code is passed to your system's C++ compiler (gcc, clang, or MSVC)
  3. Device compilation - the device code is compiled to PTX (Parallel Thread Execution), NVIDIA's intermediate assembly language
  4. Assembly - PTX is assembled into SASS (Shader ASSembly), the native machine code for a specific GPU architecture
  5. Linking - host and device code are linked into a single executable

PTX is worth understanding. It's NVIDIA's equivalent of LLVM IR or Java bytecode - an architecture-independent intermediate representation. When you compile for a specific compute capability (e.g., -arch=sm_80), nvcc generates SASS code optimized for that exact architecture. But you can also embed PTX in your binary so that the CUDA driver can JIT-compile it for whatever GPU the user actually has.

The -gencode flag in nvcc lets you target multiple architectures in a single binary: -gencode arch=compute_70,code=sm_70 -gencode arch=compute_80,code=sm_80. This is how libraries like PyTorch support multiple GPU generations without separate builds.
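
To make the fat-binary idea concrete, here is a small sketch. The build command in the comment uses the two targets from the text plus a third -gencode entry (code=compute_80) that embeds PTX for JIT compilation on newer GPUs. The file name and the whichArch kernel are made up for the example, and you may need to adjust the architecture list to what your toolkit version still supports.

```cuda
/*
 * Build (file name is hypothetical):
 *   nvcc -gencode arch=compute_70,code=sm_70
 *        -gencode arch=compute_80,code=sm_80
 *        -gencode arch=compute_80,code=compute_80
 *        app.cu -o app
 *
 * code=sm_XX embeds SASS for that GPU; code=compute_XX embeds PTX
 * that the driver can JIT-compile for GPUs newer than sm_80.
 */
#include <cstdio>
#include <cuda_runtime.h>

__global__ void whichArch() {
#ifdef __CUDA_ARCH__
    // __CUDA_ARCH__ is set per device-compilation pass, e.g. 800 for compute_80.
    printf("running code built for compute capability %d.%d\n",
           __CUDA_ARCH__ / 100, (__CUDA_ARCH__ / 10) % 10);
#endif
}

int main() {
    whichArch<<<1, 1>>>();

    // If the binary holds no SASS or PTX usable on this GPU, the launch fails.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaDeviceSynchronize();
    return 0;
}
```

Run on different GPUs, the same executable reports which of the embedded code versions the driver picked.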

CUDA Libraries: Standing on NVIDIA's Shoulders

Most people never write raw CUDA kernels. Instead, they use NVIDIA's highly optimized libraries that do the heavy lifting:

Library | Purpose | CPU Equivalent
cuBLAS | GPU-accelerated linear algebra (matrix multiply, etc.) | OpenBLAS, MKL
cuDNN | Deep learning primitives (convolution, pooling, normalization) | -
cuFFT | Fast Fourier Transform | FFTW
cuRAND | Random number generation | C++ <random>
cuSPARSE | Sparse matrix operations | SuiteSparse
Thrust | High-level parallel algorithms (sort, reduce, scan) | C++ STL algorithms
NCCL | Multi-GPU and multi-node collective communications | MPI

When PyTorch calls torch.matmul() on a CUDA tensor, it's typically calling cuBLAS under the hood. When you run a Conv2d layer, that's usually cuDNN. These libraries are the reason GPU deep learning is fast - NVIDIA engineers have hand-optimized them for each architecture down to the instruction level.
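
For a sense of what a direct library call looks like without a framework in between, here is a small sketch that multiplies two 2×2 matrices with cublasSgemm. The matrices and variable names are made up for the example, error checking is omitted, and the one thing to watch is that cuBLAS expects column-major storage.

```cuda
/* Build (file name is hypothetical): nvcc gemm.cu -lcublas -o gemm */
#include <cstdio>
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 2;
    // Column-major storage: A = [1 2; 3 4], B = identity, so C should equal A.
    const float hA[] = {1, 3, 2, 4};
    const float hB[] = {1, 0, 0, 1};
    float hC[4] = {0};

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);

    // C = alpha * A * B + beta * C, all n x n with leading dimension n.
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);
    // Print row by row: element (r, c) sits at index c * n + r.
    printf("C = [%.0f %.0f; %.0f %.0f]\n", hC[0], hC[2], hC[1], hC[3]);

    cublasDestroy(handle);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}
```

There is no kernel in this file at all: the library picks and launches one for you, which is exactly the appeal.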

From an SRE perspective, this means that CUDA toolkit versions matter. cuDNN 8.x behaves differently from 9.x. A library compiled against CUDA 11.8 may not work with CUDA 12.0. These version dependencies are the source of an enormous number of "works on my machine" GPU debugging sessions.
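
A quick first check when debugging a version mismatch is to ask the runtime and the driver what they think they are. The sketch below uses two runtime API calls for that; the warning logic at the end is a simplification added for this example, since CUDA's compatibility rules have more nuance than a single comparison.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int runtimeVer = 0, driverVer = 0;

    cudaRuntimeGetVersion(&runtimeVer);  // CUDA runtime this binary was built with
    cudaDriverGetVersion(&driverVer);    // newest CUDA version the installed driver supports

    // Versions are encoded as 1000 * major + 10 * minor, e.g. 12040 for 12.4.
    printf("CUDA runtime: %d.%d\n", runtimeVer / 1000, (runtimeVer % 1000) / 10);
    printf("Driver supports up to CUDA: %d.%d\n", driverVer / 1000, (driverVer % 1000) / 10);

    // Simplified check for illustration only.
    if (driverVer < runtimeVer) {
        printf("warning: driver is older than the runtime; consider updating the driver\n");
    }
    return 0;
}
```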

Runtime API vs Driver API

CUDA actually provides two levels of API:

  • Runtime API (higher-level) - this is what the vast majority of CUDA code uses. Functions like cudaMalloc(), cudaMemcpy(), and the <<<blocks, threads>>> kernel launch syntax. It handles context management and module loading automatically; you still have to check the error codes its calls return.
  • Driver API (lower-level) - functions prefixed with cu (not cuda): cuMemAlloc(), cuLaunchKernel(). You manage contexts, modules, and memory explicitly. More control, more boilerplate.

The driver API exists for edge cases: building GPU runtimes, loading PTX at runtime, managing multiple GPU contexts manually. If you're writing application code, stick with the runtime API.
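
To show the difference in boilerplate, here is a sketch of a single device allocation done through the driver API. It uses the device's primary context (the same context the runtime API uses) rather than creating a new one, which is just one of several ways to set this up. With the runtime API the whole program would collapse to a cudaMalloc() and a cudaFree().

```cuda
/* Build (file name is hypothetical): nvcc driver_alloc.cu -lcuda -o driver_alloc */
#include <cstdio>
#include <cuda.h>

int main() {
    CUdevice dev;
    CUcontext ctx;
    CUdeviceptr dptr;

    // Nothing is implicit here: initialize the driver, pick a device,
    // then obtain a context and make it current before touching memory.
    if (cuInit(0) != CUDA_SUCCESS) {
        printf("cuInit failed (no driver or no GPU?)\n");
        return 1;
    }
    cuDeviceGet(&dev, 0);
    cuDevicePrimaryCtxRetain(&ctx, dev);
    cuCtxSetCurrent(ctx);

    if (cuMemAlloc(&dptr, 1024) == CUDA_SUCCESS) {
        printf("allocated 1024 bytes through the driver API\n");
        cuMemFree(dptr);
    }

    cuDevicePrimaryCtxRelease(dev);
    return 0;
}
```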

The Execution Pipeline: Code to Cores

Now let's trace what happens when your CUDA program actually runs on the GPU. This connects everything from the previous two posts:

[Figure: CUDA execution pipeline. Source code through the nvcc compiler to PTX to SASS binary, then through the GigaThread Engine to SMs, partitions, warps, and individual cores]

  1. Your .cu source code gets compiled by nvcc into PTX → SASS binary
  2. At runtime, you call a kernel launch: kernel<<<gridDim, blockDim>>>(args)
  3. The CUDA runtime sends the kernel to the GPU's GigaThread Engine - the top-level hardware scheduler
  4. The GigaThread Engine distributes thread blocks across available SMs. Each SM gets one or more blocks depending on resource availability
  5. Within each SM, blocks are divided into warps (32 threads each)
  6. Warps are assigned to SM partitions, where the partition's warp scheduler manages their execution
  7. Each cycle, the warp scheduler selects a ready warp and dispatches its instruction to the cores (FP32, FP64, INT, tensor)

The key insight: the GigaThread Engine is the GPU's "operating system." It handles block-to-SM assignment, and from there the SM hardware takes over with warp scheduling. As a programmer, you control the grid dimensions (how many blocks) and block dimensions (how many threads per block). The hardware handles everything below that.
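
You can't steer block-to-SM assignment, but you can ask the runtime how many blocks of a given kernel would fit on one SM at a time. The sketch below does that for a few block sizes using the occupancy API. The busyKernel function is a stand-in invented for the example (it is never launched), and the printed numbers depend on your GPU and on the kernel's register and shared memory usage.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration; only its resource usage matters here.
__global__ void busyKernel(float *out) {
    out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blockSizes[] = {64, 256, 1024};
    for (int blockSize : blockSizes) {
        int blocksPerSM = 0;
        // How many blocks of this kernel can be resident on one SM at once,
        // given its register and shared memory needs (0 bytes of dynamic shared memory)?
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, busyKernel, blockSize, 0);

        printf("block size %4d: %2d blocks/SM, %3d warps/SM, %d blocks resident across %d SMs\n",
               blockSize, blocksPerSM,
               blocksPerSM * blockSize / prop.warpSize,
               blocksPerSM * prop.multiProcessorCount,
               prop.multiProcessorCount);
    }
    return 0;
}
```

If your grid has more blocks than the last number, the extra blocks wait until earlier ones finish and free up an SM.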

The Grid, Block, and Thread Hierarchy

When you launch a kernel, you specify:

  • Grid = the total collection of all thread blocks (1D, 2D, or 3D)
  • Block = a group of threads that execute on the same SM and can share memory (1D, 2D, or 3D)
  • Thread = a single execution unit

For example, launching kernel<<<dim3(16, 16), dim3(32, 32)>>>() creates:

  • A grid of 16 × 16 = 256 blocks
  • Each block has 32 × 32 = 1,024 threads
  • Total: 256 × 1,024 = 262,144 threads
  • Each block produces 1,024 / 32 = 32 warps

All 262,144 threads run the same kernel function, but each one knows its position via built-in variables (threadIdx, blockIdx, blockDim, gridDim) so it can work on its own slice of data.
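
The arithmetic above can be checked with a short program that launches exactly that configuration and has every thread compute its own position. The countThreads kernel and the two counters are made up for this sketch; the index math is the usual pattern for mapping a 2D grid of 2D blocks onto a flat array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: every thread works out its global
// position, then records that it ran and the largest linear index seen.
__global__ void countThreads(unsigned int *counter, unsigned int *maxIndex) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;  // global column
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;  // global row
    unsigned int width = gridDim.x * blockDim.x;             // threads per row
    unsigned int idx = y * width + x;                        // flat index

    atomicAdd(counter, 1u);
    atomicMax(maxIndex, idx);
}

int main() {
    unsigned int *d_stats = nullptr;          // [0] = thread count, [1] = max index
    cudaMalloc(&d_stats, 2 * sizeof(unsigned int));
    cudaMemset(d_stats, 0, 2 * sizeof(unsigned int));

    countThreads<<<dim3(16, 16), dim3(32, 32)>>>(d_stats, d_stats + 1);

    unsigned int h_stats[2] = {0, 0};
    cudaMemcpy(h_stats, d_stats, sizeof(h_stats), cudaMemcpyDeviceToHost);

    // 16 * 16 blocks of 32 * 32 threads: 262144 threads, indices 0..262143.
    printf("threads launched: %u, highest linear index: %u\n", h_stats[0], h_stats[1]);

    cudaFree(d_stats);
    return 0;
}
```

In a real kernel, idx is what you would use to pick the element of the input array this thread is responsible for.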

Profiling and Debugging Tools

NVIDIA ships a full suite of profiling and debugging tools with the CUDA Toolkit. These are essential for production GPU workloads:

Tool | Purpose | When to Use
Nsight Systems | System-level profiler - CPU + GPU timeline, kernel launches, memory transfers | First pass: "where is time being spent?"
Nsight Compute | Kernel-level profiler - SM utilization, memory throughput, warp stalls | Deep dive: "why is this kernel slow?"
cuda-gdb | GPU-aware debugger - set breakpoints in kernels, inspect thread state | Debugging kernel logic errors
compute-sanitizer | Memory error detection - out-of-bounds access, race conditions, leaks | Catching memory bugs (like GPU Valgrind)
nvidia-smi | GPU management CLI - utilization, temperature, memory usage, clock speeds | Real-time monitoring and health checks

For SREs, nvidia-smi is the one you'll use daily. It's the top or htop equivalent for GPUs:

$ nvidia-smi
+-----------------------------------------------------------------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M.  |
|=============================================================================|
|   0  A100-SXM4-80GB      On  | 00000000:07:00.0 Off |                    0 |
| N/A   34C    P0    52W / 400W |    345MiB / 81920MiB |      0%      Default |
+-----------------------------------------------------------------------------+

Key metrics to watch: GPU-Util (the share of time at least one kernel was running - low means the GPU is sitting idle, often waiting on the CPU or on data transfers), Memory-Usage (OOM errors come from here), Temperature (throttling often starts around 80–85°C on data center GPUs, though the exact threshold varies by model), and ECC errors (a non-zero uncorrected ECC count suggests the hardware may be failing).

Getting Started: CUDA Samples

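NVIDIA publishes a set of CUDA samples full of example programs: older toolkit releases shipped them in a samples directory, and since CUDA 11.6 they live in the NVIDIA/cuda-samples repository on GitHub. Two worth running first: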

  • deviceQuery - prints your GPU's compute capability, SM count, memory, clock speeds, and supported features. Run this first to verify your CUDA installation is working.
  • bandwidthTest - measures host-to-device, device-to-host, and device-to-device transfer speeds. Good for verifying PCIe bandwidth and detecting hardware issues.

These are the GPU equivalent of "Hello World" - they confirm that the driver, runtime, and hardware are all talking to each other correctly.
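
If you don't have the samples built, a stripped-down version of the deviceQuery idea is only a few lines. This sketch is not the real sample, just a handful of the same fields read through cudaGetDeviceProperties(); if it prints sensible values, the driver, runtime, and hardware can see each other.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability:    %d.%d\n", prop.major, prop.minor);
        printf("  SM count:              %d\n", prop.multiProcessorCount);
        printf("  global memory:         %.1f GiB\n",
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
        printf("  warp size:             %d\n", prop.warpSize);
        printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
    }
    return 0;
}
```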

Wrapping Up the Series

Across these three posts, we've traced the full path from silicon to software:

  1. Performance metrics - bandwidth, TFLOPS, precision, and tensor cores tell you what a GPU can do
  2. SM internals - partitions, warp schedulers, and the SIMT model show you how it does it
  3. The CUDA toolkit - compilers, libraries, and the host-device model give you the tools to make it do it

The through-line: GPU programming is about understanding the hierarchy at every level - from the chip's architecture down to individual warps, from the high-level cuBLAS call down to the PTX instruction it compiles to. The more layers you understand, the better you can diagnose why a workload is slow, why a toolkit version matters, or why a particular GPU is the right (or wrong) choice for your use case.

Learn the hardware. Learn the toolkit. Then the spec sheets start making sense.

Resources