
GPU Performance: Bandwidth, Throughput, and What the Specs Actually Mean

February 27, 2026 · 10 min read

Last week I walked through the fundamentals of GPU architecture - what an SM is, how CUDA organizes threads, and why GPUs exist in the first place. This week I want to answer a more practical question: how do you actually read GPU specs and know what matters?

Because here's something that tripped me up early: you can't just look at "6,912 CUDA cores" and conclude a GPU is fast. Core count is one number in a much bigger equation. And if you're provisioning GPU infrastructure or evaluating hardware for ML workloads, understanding that equation is the difference between overspending and under-delivering.

The Three Pillars of GPU Performance

GPU performance comes down to three things:

  1. Memory bandwidth - how fast data moves between memory and the cores
  2. Throughput (TFLOPS) - how many operations the cores can execute per second
  3. Specialized hardware - tensor cores, RT cores, and other purpose-built units

Miss any one of these and you'll misread what a GPU can actually do. Let's break each one down.

Memory Bandwidth: The Highway Analogy

Memory bandwidth is probably the single most important spec for data-heavy workloads, and it's the one most people gloss over. It tells you how much data the GPU can move from memory to the cores per second, measured in GB/s.

Think of it like a highway. Three things determine how much traffic it can handle:

Memory bandwidth as a highway: bus width is the number of lanes, memory clock is the speed limit, and DDR technology determines how many loads each car carries
  • Bus width = the number of lanes. Each "lane" is 1 bit wide. A 384-bit bus is a 384-lane highway.
  • Memory clock speed = the speed limit. How fast data travels across each lane.
  • Memory technology = the vehicles. DDR (Double Data Rate) means each "car" carries two loads per cycle. GDDR6X uses PAM4 signaling, effectively quadrupling the data rate. HBM stacks memory vertically with massively wide buses.

The formula is straightforward:

Memory Bandwidth (GB/s) = (Bus Width × Effective Data Rate) / 8

The division by 8 converts bits to bytes. Let's run some real numbers.

RTX 3090 (Consumer)

  • Bus width: 384-bit
  • Memory type: GDDR6X
  • Effective data rate: 19.5 Gbps per pin
  • Bandwidth: (384 × 19.5) / 8 = 936 GB/s

A100 80GB SXM (Data Center)

  • Bus width: 5,120-bit (HBM2e, five 1,024-bit stacks)
  • Effective data rate: ~3.19 Gbps per pin
  • Bandwidth: (5120 × 3.19) / 8 ≈ 2,039 GB/s (~2 TB/s)

H100 SXM (Data Center)

  • Bus width: 5,120-bit (HBM3)
  • Effective data rate: ~5.23 Gbps per pin
  • Bandwidth: (5120 × 5.23) / 8 ≈ 3,350 GB/s (~3.35 TB/s)
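To make the arithmetic concrete, here's the bandwidth formula as a few lines of Python. This is just a sanity-check sketch: the per-pin rates are the rounded spec-sheet values from above, so the A100 result lands a couple of GB/s off the official 2,039 figure.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Memory bandwidth in GB/s: (bus width x per-pin data rate) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

# Spec-sheet numbers from the three examples above
print(bandwidth_gb_s(384, 19.5))    # RTX 3090  -> 936.0
print(bandwidth_gb_s(5120, 3.19))   # A100 80GB -> ~2,042 (rounded rate; official: 2,039)
print(bandwidth_gb_s(5120, 5.23))   # H100 SXM  -> ~3,347
```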

Notice what's happening here. The A100 and H100 don't win on clock speed - HBM actually runs at lower clocks than GDDR6X. They win on bus width. A 5,120-bit bus versus 384-bit. That's 13x more lanes. This is why HBM dominates in data center GPUs: it trades clock speed for a massively parallel memory interface.

A wider bus at moderate speed often beats a narrow bus at high speed. The data center GPUs prove this: HBM runs slower clocks per pin but uses 13× the bus width of consumer GDDR6X.

The practical takeaway: if your workload is memory-bound (moving large datasets, running large batch ML inference), bandwidth matters more than core count. No amount of CUDA cores will help if the memory bus can't feed them fast enough.

Throughput: TFLOPS and What They Actually Mean

TFLOPS (Tera Floating-Point Operations Per Second) measures raw computational throughput - how many trillions of floating-point operations the GPU can theoretically perform every second.

The formula:

TFLOPS = Core Count × Clock Speed (GHz) × 2

The "× 2" comes from the fact that each CUDA core can perform one FMA (fused multiply-add) operation per clock cycle, which counts as two floating-point operations (one multiply + one add).

Let's work through the A100 as an example:

  • CUDA cores: 6,912 (FP32)
  • Boost clock: 1.41 GHz
  • FP32 TFLOPS: 6,912 × 1.41 × 2 = 19.5 TFLOPS
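The same calculation in Python, for checking any spec sheet against its advertised number:

```python
def tflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak TFLOPS: cores x clock x 2 (one FMA = a multiply plus an add per cycle)."""
    return cores * clock_ghz * flops_per_cycle / 1000  # GFLOPS -> TFLOPS

print(round(tflops(6912, 1.41), 1))  # A100 FP32 -> 19.5
```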

Now here's where it gets interesting. Neither core count nor clock speed alone determines throughput - they're codependent. Consider two hypothetical GPUs:

| GPU | Cores | Clock (GHz) | FP32 TFLOPS |
|-------|--------|-------------|-------------|
| GPU A | 4,096 | 2.0 | 16.4 |
| GPU B | 6,912 | 1.41 | 19.5 |
| GPU C | 10,240 | 1.0 | 20.5 |

GPU C has the most cores, but GPU B (the A100) still lands close in raw TFLOPS because its clock speed compensates. Meanwhile GPU A has far fewer cores but a high clock, and still delivers competitive throughput. The point: always look at the resulting TFLOPS, not the individual specs.

The Efficiency Question

This brings up a question I've heard a lot: "Which GPU is more efficient?"

The answer is always: efficient at what?

  • Compute efficiency: how many TFLOPS per dollar? The H100 delivers 51 TFLOPS FP32 but costs significantly more than an RTX 4090 at ~83 TFLOPS. For raw bang-per-buck on FP32, the consumer card wins.
  • Power efficiency: how many TFLOPS per watt? The H100 runs at 700W TDP. The A100 runs at 400W. An RTX 4090 runs at 450W. Different power budgets for different operational realities.
  • Workload efficiency: the H100's tensor cores, NVLink, larger HBM, and ECC memory make it the only serious option for large-scale ML training - despite the raw TFLOPS looking comparable to consumer cards.

More cores and higher clocks also mean more power consumption, which means more heat, which means more cooling costs. In data center operations, the watt-per-TFLOPS ratio directly hits your bottom line.
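Running the watt-per-TFLOPS numbers from the bullets above (spec-sheet TDPs and FP32 TFLOPS, so treat these as rough ratios rather than measured efficiency):

```python
# FP32 TFLOPS and TDP taken from the figures cited above
gpus = {
    "H100":     {"fp32_tflops": 51.0, "tdp_w": 700},
    "A100":     {"fp32_tflops": 19.5, "tdp_w": 400},
    "RTX 4090": {"fp32_tflops": 83.0, "tdp_w": 450},
}

for name, spec in gpus.items():
    ratio = spec["fp32_tflops"] / spec["tdp_w"]
    print(f"{name}: {ratio:.3f} FP32 TFLOPS/W")  # 4090 leads on raw FP32 per watt
```

On raw FP32 per watt the consumer card comes out ahead, which is exactly why "efficient at what?" is the right question: this ratio ignores tensor throughput, memory, and interconnect.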

Data Types and Precision: The Speed vs Accuracy Tradeoff

Not all floating-point operations are created equal. The precision (data type) you use determines both the accuracy of your results and the speed at which you get them.

FP16 to FP64 precision spectrum: speed increases left, accuracy increases right, with Pi values and cycle counts

Integer Types

Integers come in several sizes: 8-bit, 16-bit, 32-bit, and 64-bit. In CUDA:

  • int uses 32 bits
  • long uses 64 bits

Floating-Point Types

Floating-point is where it gets more interesting, because these are the types that define GPU performance categories:

| Type | Bits | Name | π approximation | Decimal digits |
|------|------|------|-----------------|----------------|
| FP16 | 16 | Half precision | 3.14 | ~3 |
| FP32 | 32 | Single precision | 3.1415927 | ~7 |
| FP64 | 64 | Double precision | 3.141592653589793 | ~15 |

In CUDA, float maps to FP32 (32 bits) and double maps to FP64 (64 bits).
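You can see the precision loss directly from Python's standard library: `struct` can round-trip a value through half (`e`), single (`f`), and double (`d`) precision formats.

```python
import math
import struct

def roundtrip(value: float, fmt: str) -> float:
    """Pack a Python float (FP64) into a narrower format and unpack it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

print(roundtrip(math.pi, "e"))  # FP16: 3.140625         (~3 good digits)
print(roundtrip(math.pi, "f"))  # FP32: 3.1415927410...  (~7 good digits)
print(math.pi)                  # FP64: 3.141592653589793
```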

The performance difference is significant. An FP64 operation typically takes about 4 clock cycles to complete, while an FP32 operation takes about 2 cycles. That's a 2× speed penalty for double precision - and on many consumer GPUs, the ratio is far worse (1:32 or even 1:64) because they have very few FP64 cores.

This creates a clear decision tree:

  • FP16 (half precision) - fastest, lowest accuracy. Great for ML inference where you're already working with approximate weights. Uses half the memory of FP32, doubles throughput on supported hardware.
  • FP32 (single precision) - the workhorse. Good enough for ML training, most graphics, and general computation. When GPU spec sheets list "CUDA cores," they usually mean the FP32 core count.
  • FP64 (double precision) - the powerhouse for precision. Required for climate modeling, molecular dynamics, financial risk calculations, and any domain where rounding errors compound across millions of iterations.

When someone says a GPU has "6,912 cores," they almost always mean FP32 cores. The FP64 core count is usually much lower - the A100 has 3,456 FP64 cores (half its FP32 count). Consumer GPUs often have a 1:64 FP64-to-FP32 ratio.

Tensor Cores: Purpose-Built for Matrix Math

Starting with the Volta architecture (2017), NVIDIA added a new type of core to the SM: the Tensor Core. These aren't general-purpose ALUs - they're specialized hardware designed to do one thing extremely fast: matrix multiply-accumulate operations.

Why does this matter? Because deep learning essentially is matrix multiplication. Training a neural network means multiplying weight matrices by input matrices billions of times. Tensor cores accelerate this by performing a 4×4 matrix multiply in a single clock cycle - something that would take dozens of cycles on regular CUDA cores.
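In plain Python (no tensor cores involved, obviously), the operation a tensor core completes in a single cycle looks like this: a matrix multiply-accumulate, D = A×B + C, on 4×4 tiles.

```python
def mma_4x4(A, B, C):
    """D = (A @ B) + C for 4x4 matrices - the fused op a tensor core executes."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j] for j in range(4)]
        for i in range(4)
    ]

I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # 4x4 identity
Z4 = [[0] * 4 for _ in range(4)]                                 # 4x4 zeros
print(mma_4x4(I4, I4, Z4))  # identity x identity + zeros = identity
```

Counting the work makes the hardware win obvious: that one tiled op is 64 multiplies and 64 adds, done every cycle, per tensor core.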

The performance gains are dramatic:

| GPU | FP32 TFLOPS (CUDA cores) | FP16 Tensor TFLOPS | Speedup |
|------|--------------------------|--------------------|---------|
| V100 | 15.7 | 125 | ~8× |
| A100 | 19.5 | 312 | ~16× |
| H100 | 51 | 990 | ~19× |

Tensor cores also introduced mixed-precision training: use FP16 for the forward pass (fast), accumulate gradients in FP32 (accurate), and get the best of both worlds. NVIDIA later added TF32 (TensorFloat-32), which gives FP32-level accuracy at FP16-like speed - another architectural trick to push throughput without sacrificing quality.

If you're evaluating GPUs for ML workloads, the tensor core TFLOPS number is the one that actually matters for training performance. Regular CUDA core TFLOPS tell you about general compute.

Compute Capability: GPU Architecture Versioning

Here's something that catches people off guard when they first get into CUDA programming: your GPU has a version number that determines what features are available to your code. This is the Compute Capability (CC), expressed as a two-part number of the form X.Y (like 8.0), where X is the major version (the architecture generation) and Y is the minor version (incremental improvements).

NVIDIA GPU architectures mapped to compute capability versions and minimum CUDA toolkit support
| Architecture | Year | Compute Capability | Min. CUDA Toolkit | Example HPC GPU |
|--------------|------|--------------------|-------------------|-----------------|
| Maxwell | 2014 | 5.x | 6.5 | M60 |
| Pascal | 2016 | 6.x | 8.0 | P100 |
| Volta | 2017 | 7.0 | 9.0 | V100 |
| Turing | 2018 | 7.5 | 10.0 | T4 |
| Ampere | 2020 | 8.x | 11.0 | A100 |
| Hopper | 2022 | 9.x | 11.8 | H100 |

Each compute capability unlocks specific features. Tensor cores? Available from CC 7.0 (Volta) onward. BF16 (BFloat16) support? CC 8.0+ (Ampere). Transformer Engine? CC 9.0 (Hopper). If your code uses features from a higher CC than your hardware supports, it simply won't compile.

The practical rule:

Always match your CUDA toolkit version to your GPU's compute capability. Higher CC requires newer CUDA. Compiling for the wrong CC means either your code won't run, or you'll miss out on hardware-specific optimizations.

You can check your GPU's compute capability by searching "[gpu model] techpowerup" - their spec pages always include it.
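If you'd rather check in code, here's a small lookup helper built from the table above (the version mapping is just that table restated; on recent drivers you can also query the hardware directly with `nvidia-smi --query-gpu=compute_cap --format=csv`):

```python
def min_toolkit(cc: str) -> str:
    """Minimum CUDA toolkit for a compute capability, per the table above."""
    table = {"5": "6.5", "6": "8.0", "7.0": "9.0", "7.5": "10.0", "8": "11.0", "9": "11.8"}
    # Exact match first (Volta 7.0 vs Turing 7.5 differ), then fall back to the major version
    return table.get(cc, table.get(cc.split(".")[0], "unknown"))

print(min_toolkit("8.0"))  # A100 -> 11.0
print(min_toolkit("7.5"))  # T4   -> 10.0
```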

Putting It All Together: Reading a Real Spec Sheet

Let's put everything we've learned side by side with three NVIDIA data center GPUs across three architecture generations:

| Feature | P100 (Pascal) | V100 (Volta) | A100 (Ampere) |
|---------|---------------|--------------|---------------|
| Architecture chip | GP100 | GV100 | GA100 |
| Compute Capability | 6.0 | 7.0 | 8.0 |
| SMs | 56 | 80 | 108 |
| FP32 Cores/SM | 64 | 64 | 64 |
| FP32 Cores Total | 3,584 | 5,120 | 6,912 |
| FP32 TFLOPS | 10.6 | 15.7 | 19.5 |
| FP64 Cores Total | 1,792 | 2,560 | 3,456 |
| FP64 TFLOPS | 5.3 | 7.8 | 9.7 |
| Tensor Cores/SM | N/A | 8 | 4 (3rd gen) |
| Tensor TFLOPS (FP16) | N/A | 125 | 312 |
| Memory | 16 GB HBM2 | 32 GB HBM2 | 80 GB HBM2e |
| Bandwidth | 732 GB/s | 900 GB/s | 2,039 GB/s |

A few things jump out:

  • FP32 cores per SM stayed constant at 64 across three generations. The performance gain came from more SMs (56 → 80 → 108), not more cores per SM.
  • Tensor core TFLOPS dwarfs CUDA core TFLOPS. The A100's tensor cores deliver 16× the throughput of its CUDA cores for half-precision work. If you're doing ML and not using tensor cores, you're leaving most of the GPU on the table.
  • Memory bandwidth more than doubled from V100 to A100 (900 → 2,039 GB/s) thanks to the jump from HBM2 to HBM2e. This is often the biggest real-world performance differentiator.
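The first bullet is easy to verify from the table itself: total FP32 core count is just SM count times the fixed 64 cores per SM.

```python
# (SMs, FP32 cores per SM) from the spec-sheet table above
specs = {"P100": (56, 64), "V100": (80, 64), "A100": (108, 64)}

for name, (sms, cores_per_sm) in specs.items():
    print(f"{name}: {sms * cores_per_sm} FP32 cores")  # 3,584 / 5,120 / 6,912
```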

What's Next

Now that we've covered the metrics, next up is what's actually going on inside the streaming multiprocessor - the SM partitions, warp scheduling, SIMT execution, and how NVIDIA's hardware maps those CUDA threads we wrote about in week one to real silicon.

The takeaway from this week: GPU specs are a system of interlocking variables. Bandwidth, throughput, precision, and specialized hardware all interact. A GPU with massive TFLOPS but limited bandwidth will bottleneck on memory-bound workloads. A GPU with tons of bandwidth but low tensor core throughput will underwhelm on ML training. And a workstation GPU at compute capability 5.0 won't even compile code that targets CC 8.0 features.

Read the full spec sheet. Understand the workload. Then pick the GPU.
