
GPU Performance: Bandwidth, Throughput, and What the Specs Actually Mean

February 27, 2026 · 10 min read

Last week I walked through the fundamentals of GPU architecture - what an SM is, how CUDA organizes threads, and why GPUs exist in the first place. This week I want to answer a more practical question: how do you actually read GPU specs and know what matters?

Because here's something that tripped me up early: you can't just look at "6,912 CUDA cores" and conclude a GPU is fast. Core count is one number in a much bigger equation. And if you're provisioning GPU infrastructure or evaluating hardware for ML workloads, understanding that equation is the difference between overspending and under-delivering.

The Three Pillars of GPU Performance

GPU performance comes down to three things:

  1. Memory bandwidth - how fast data moves between memory and the cores
  2. Throughput (TFLOPS) - how many operations the cores can execute per second
  3. Specialized hardware - tensor cores, RT cores, and other purpose-built units

Miss any one of these and you'll misread what a GPU can actually do. Let's break each one down.

Memory Bandwidth: The Highway Analogy

Memory bandwidth is probably the single most important spec for data-heavy workloads, and it's the one most people gloss over. It tells you how much data the GPU can move from memory to the cores per second, measured in GB/s.

Think of it like a highway. Three things determine how much traffic it can handle:

Memory bandwidth as a highway: bus width is the number of lanes, memory clock is the speed limit, and DDR technology determines how many loads each car carries
  • Bus width = the number of lanes. Each "lane" is 1 bit wide. A 384-bit bus is a 384-lane highway.
  • Memory clock speed = the speed limit. How fast data travels across each lane.
  • Memory technology = the vehicles. DDR (Double Data Rate) means each "car" carries two loads per cycle. GDDR6X uses PAM4 signaling, effectively quadrupling the data rate. HBM stacks memory vertically with massively wide buses.

The formula is straightforward:

Memory Bandwidth (GB/s) = (Bus Width × Effective Data Rate) / 8

The division by 8 converts bits to bytes. Let's run some real numbers.

RTX 3090 (Consumer)

  • Bus width: 384-bit
  • Memory type: GDDR6X
  • Effective data rate: 19.5 Gbps per pin
  • Bandwidth: (384 × 19.5) / 8 = 936 GB/s

A100 80GB SXM (Data Center)

  • Bus width: 5,120-bit (HBM2e, five 1,024-bit stacks)
  • Effective data rate: ~3.19 Gbps per pin
  • Bandwidth: (5120 × 3.19) / 8 ≈ 2,039 GB/s (~2 TB/s)

H100 SXM (Data Center)

  • Bus width: 5,120-bit (HBM3)
  • Effective data rate: ~5.23 Gbps per pin
  • Bandwidth: (5120 × 5.23) / 8 ≈ 3,350 GB/s (~3.35 TB/s)
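To make the arithmetic concrete, here's the bandwidth formula as a few lines of Python. This is just a sanity-check sketch: the per-pin rates are the rounded spec-sheet values from above, so the A100 result lands a couple of GB/s off the official 2,039 figure.

```python
def bandwidth_gb_s(bus_width_bits: int, data_rate_gbps: float) -> float:
    """Memory bandwidth in GB/s: (bus width x per-pin data rate) / 8 bits per byte."""
    return bus_width_bits * data_rate_gbps / 8

# Spec-sheet numbers from the three examples above
print(bandwidth_gb_s(384, 19.5))    # RTX 3090  -> 936.0
print(bandwidth_gb_s(5120, 3.19))   # A100 80GB -> ~2,042 (rounded rate; official: 2,039)
print(bandwidth_gb_s(5120, 5.23))   # H100 SXM  -> ~3,347
```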

Notice what's happening here. The A100 and H100 don't win on clock speed - HBM actually runs at lower clocks than GDDR6X. They win on bus width. A 5,120-bit bus versus 384-bit. That's 13x more lanes. This is why HBM dominates in data center GPUs: it trades clock speed for a massively parallel memory interface.

A wider bus at moderate speed often beats a narrow bus at high speed. The data center GPUs prove this: HBM runs slower clocks per pin but uses 13× the bus width of consumer GDDR6X.

The practical takeaway: if your workload is memory-bound (moving large datasets, running large batch ML inference), bandwidth matters more than core count. No amount of CUDA cores will help if the memory bus can't feed them fast enough.

Throughput: TFLOPS and What They Actually Mean

TFLOPS (Tera Floating-Point Operations Per Second) measures raw computational throughput - how many trillions of floating-point operations the GPU can theoretically perform every second.

The formula:

TFLOPS = Core Count × Clock Speed (GHz) × 2

The "× 2" comes from the fact that each CUDA core can perform one FMA (fused multiply-add) operation per clock cycle, which counts as two floating-point operations (one multiply + one add).

Let's work through the A100 as an example:

  • CUDA cores: 6,912 (FP32)
  • Boost clock: 1.41 GHz
  • FP32 TFLOPS: 6,912 × 1.41 × 2 = 19.5 TFLOPS
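The same calculation in Python, for checking any spec sheet against its advertised number:

```python
def tflops(cores: int, clock_ghz: float, flops_per_cycle: int = 2) -> float:
    """Peak TFLOPS: cores x clock x 2 (one FMA = a multiply plus an add per cycle)."""
    return cores * clock_ghz * flops_per_cycle / 1000  # GFLOPS -> TFLOPS

print(round(tflops(6912, 1.41), 1))  # A100 FP32 -> 19.5
```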

Now here's where it gets interesting. Neither core count nor clock speed alone determines throughput - they're codependent. Consider two hypothetical GPUs:

| GPU | Cores | Clock (GHz) | FP32 TFLOPS |
|-------|--------|-------------|-------------|
| GPU A | 4,096 | 2.0 | 16.4 |
| GPU B | 6,912 | 1.41 | 19.5 |
| GPU C | 10,240 | 1.0 | 20.5 |

GPU C has the most cores, but GPU B (the A100) still lands close in raw TFLOPS because its clock speed compensates. Meanwhile GPU A has far fewer cores but a high clock, and still delivers competitive throughput. The point: always look at the resulting TFLOPS, not the individual specs.

The Efficiency Question

This brings up a question I've heard a lot: "Which GPU is more efficient?"

The answer is always: efficient at what?

  • Compute efficiency: how many TFLOPS per dollar? The H100 delivers 51 TFLOPS FP32 but costs significantly more than an RTX 4090 at ~83 TFLOPS. For raw bang-per-buck on FP32, the consumer card wins.
  • Power efficiency: how many TFLOPS per watt? The H100 runs at 700W TDP. The A100 runs at 400W. An RTX 4090 runs at 450W. Different power budgets for different operational realities.
  • Workload efficiency: the H100's tensor cores, NVLink, larger HBM, and ECC memory make it the only serious option for large-scale ML training - despite the raw TFLOPS looking comparable to consumer cards.

More cores and higher clocks also mean more power consumption, which means more heat, which means more cooling costs. In data center operations, the watt-per-TFLOPS ratio directly hits your bottom line.
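Running the watt-per-TFLOPS numbers from the bullets above (spec-sheet TDPs and FP32 TFLOPS, so treat these as rough ratios rather than measured efficiency):

```python
# FP32 TFLOPS and TDP taken from the figures cited above
gpus = {
    "H100":     {"fp32_tflops": 51.0, "tdp_w": 700},
    "A100":     {"fp32_tflops": 19.5, "tdp_w": 400},
    "RTX 4090": {"fp32_tflops": 83.0, "tdp_w": 450},
}

for name, spec in gpus.items():
    ratio = spec["fp32_tflops"] / spec["tdp_w"]
    print(f"{name}: {ratio:.3f} FP32 TFLOPS/W")  # 4090 leads on raw FP32 per watt
```

On raw FP32 per watt the consumer card comes out ahead, which is exactly why "efficient at what?" is the right question: this ratio ignores tensor throughput, memory, and interconnect.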

Data Types and Precision: The Speed vs Accuracy Tradeoff

Not all floating-point operations are created equal. The precision (data type) you use determines both the accuracy of your results and the speed at which you get them.

FP16 to FP64 precision spectrum: speed increases left, accuracy increases right, with Pi values and cycle counts

Integer Types

Integers come in several sizes: 8-bit, 16-bit, 32-bit, and 64-bit. In CUDA:

  • int uses 32 bits
  • long uses 64 bits

Floating-Point Types

Floating-point is where it gets more interesting, because these are the types that define GPU performance categories:

| Type | Bits | Name | π approximation | Decimal digits |
|------|------|------|-----------------|----------------|
| FP16 | 16 | Half precision | 3.14 | ~3 |
| FP32 | 32 | Single precision | 3.1415927 | ~7 |
| FP64 | 64 | Double precision | 3.141592653589793 | ~15 |

In CUDA, float maps to FP32 (32 bits) and double maps to FP64 (64 bits).
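You can see the precision loss directly from Python's standard library: `struct` can round-trip a value through half (`e`), single (`f`), and double (`d`) precision formats.

```python
import math
import struct

def roundtrip(value: float, fmt: str) -> float:
    """Pack a Python float (FP64) into a narrower format and unpack it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]

print(roundtrip(math.pi, "e"))  # FP16: 3.140625         (~3 good digits)
print(roundtrip(math.pi, "f"))  # FP32: 3.1415927410...  (~7 good digits)
print(math.pi)                  # FP64: 3.141592653589793
```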

The performance difference is significant. An FP64 operation typically takes about 4 clock cycles to complete, while an FP32 operation takes about 2 cycles. That's a 2× speed penalty for double precision - and on many consumer GPUs, the ratio is far worse (1:32 or even 1:64) because they have very few FP64 cores.

This creates a clear decision tree:

  • FP16 (half precision) - fastest, lowest accuracy. Great for ML inference where you're already working with approximate weights. Uses half the memory of FP32, doubles throughput on supported hardware.
  • FP32 (single precision) - the workhorse. Good enough for ML training, most graphics, and general computation. When GPU spec sheets list "CUDA cores," they usually mean the FP32 core count.
  • FP64 (double precision) - the powerhouse for precision. Required for climate modeling, molecular dynamics, financial risk calculations, and any domain where rounding errors compound across millions of iterations.

When someone says a GPU has "6,912 cores," they almost always mean FP32 cores. The FP64 core count is usually much lower - the A100 has 3,456 FP64 cores (half its FP32 count). Consumer GPUs often have a 1:64 FP64-to-FP32 ratio.

Tensor Cores: Purpose-Built for Matrix Math

Starting with the Volta architecture (2017), NVIDIA added a new type of core to the SM: the Tensor Core. These aren't general-purpose ALUs - they're specialized hardware designed to do one thing extremely fast: matrix multiply-accumulate operations.

Why does this matter? Because deep learning essentially is matrix multiplication. Training a neural network means multiplying weight matrices by input matrices billions of times. Tensor cores accelerate this by performing a 4×4 matrix multiply in a single clock cycle - something that would take dozens of cycles on regular CUDA cores.
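In plain Python (no tensor cores involved, obviously), the operation a tensor core completes in a single cycle looks like this: a matrix multiply-accumulate, D = A×B + C, on 4×4 tiles.

```python
def mma_4x4(A, B, C):
    """D = (A @ B) + C for 4x4 matrices - the fused op a tensor core executes."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(4)) + C[i][j] for j in range(4)]
        for i in range(4)
    ]

I4 = [[1 if i == j else 0 for j in range(4)] for i in range(4)]  # 4x4 identity
Z4 = [[0] * 4 for _ in range(4)]                                 # 4x4 zeros
print(mma_4x4(I4, I4, Z4))  # identity x identity + zeros = identity
```

Counting the work makes the hardware win obvious: that one tiled op is 64 multiplies and 64 adds, done every cycle, per tensor core.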

The performance gains are dramatic:

| GPU | FP32 TFLOPS (CUDA cores) | FP16 Tensor TFLOPS | Speedup |
|------|--------------------------|--------------------|---------|
| V100 | 15.7 | 125 | ~8× |
| A100 | 19.5 | 312 | ~16× |
| H100 | 51 | 990 | ~19× |

Tensor cores also introduced mixed-precision training: use FP16 for the forward pass (fast), accumulate gradients in FP32 (accurate), and get the best of both worlds. NVIDIA later added TF32 (TensorFloat-32), which gives FP32-level accuracy at FP16-like speed - another architectural trick to push throughput without sacrificing quality.

If you're evaluating GPUs for ML workloads, the tensor core TFLOPS number is the one that actually matters for training performance. Regular CUDA core TFLOPS tell you about general compute.

Compute Capability: GPU Architecture Versioning

Here's something that catches people off guard when they first get into CUDA programming: your GPU has a version number that determines what features are available to your code. This is the Compute Capability (CC), expressed as a two-part number of the form X.Y (like 8.0), where X is the major version (the architecture generation) and Y is the minor version (incremental improvements).

NVIDIA GPU architectures mapped to compute capability versions and minimum CUDA toolkit support
| Architecture | Year | Compute Capability | Min. CUDA Toolkit | Example HPC GPU |
|--------------|------|--------------------|-------------------|-----------------|
| Maxwell | 2014 | 5.x | 6.5 | M60 |
| Pascal | 2016 | 6.x | 8.0 | P100 |
| Volta | 2017 | 7.0 | 9.0 | V100 |
| Turing | 2018 | 7.5 | 10.0 | T4 |
| Ampere | 2020 | 8.x | 11.0 | A100 |
| Hopper | 2022 | 9.x | 11.8 | H100 |

Each compute capability unlocks specific features. Tensor cores? Available from CC 7.0 (Volta) onward. BF16 (BFloat16) support? CC 8.0+ (Ampere). Transformer Engine? CC 9.0 (Hopper). If your code uses features from a higher CC than your hardware supports, it simply won't compile.

The practical rule:

Always match your CUDA toolkit version to your GPU's compute capability. Higher CC requires newer CUDA. Compiling for the wrong CC means either your code won't run, or you'll miss out on hardware-specific optimizations.

You can check your GPU's compute capability by searching "[gpu model] techpowerup" - their spec pages always include it.
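If you'd rather check in code, here's a small lookup helper built from the table above (the version mapping is just that table restated; on recent drivers you can also query the hardware directly with `nvidia-smi --query-gpu=compute_cap --format=csv`):

```python
def min_toolkit(cc: str) -> str:
    """Minimum CUDA toolkit for a compute capability, per the table above."""
    table = {"5": "6.5", "6": "8.0", "7.0": "9.0", "7.5": "10.0", "8": "11.0", "9": "11.8"}
    # Exact match first (Volta 7.0 vs Turing 7.5 differ), then fall back to the major version
    return table.get(cc, table.get(cc.split(".")[0], "unknown"))

print(min_toolkit("8.0"))  # A100 -> 11.0
print(min_toolkit("7.5"))  # T4   -> 10.0
```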

Putting It All Together: Reading a Real Spec Sheet

Let's put everything we've learned side by side with three NVIDIA data center GPUs across three architecture generations:

| Feature | P100 (Pascal) | V100 (Volta) | A100 (Ampere) |
|---------|---------------|--------------|---------------|
| Architecture chip | GP100 | GV100 | GA100 |
| Compute Capability | 6.0 | 7.0 | 8.0 |
| SMs | 56 | 80 | 108 |
| FP32 Cores/SM | 64 | 64 | 64 |
| FP32 Cores Total | 3,584 | 5,120 | 6,912 |
| FP32 TFLOPS | 10.6 | 15.7 | 19.5 |
| FP64 Cores Total | 1,792 | 2,560 | 3,456 |
| FP64 TFLOPS | 5.3 | 7.8 | 9.7 |
| Tensor Cores/SM | N/A | 8 | 4 (3rd gen) |
| Tensor TFLOPS (FP16) | N/A | 125 | 312 |
| Memory | 16 GB HBM2 | 32 GB HBM2 | 80 GB HBM2e |
| Bandwidth | 732 GB/s | 900 GB/s | 2,039 GB/s |

A few things jump out:

  • FP32 cores per SM stayed constant at 64 across three generations. The performance gain came from more SMs (56 → 80 → 108), not more cores per SM.
  • Tensor core TFLOPS dwarfs CUDA core TFLOPS. The A100's tensor cores deliver 16× the throughput of its CUDA cores for half-precision work. If you're doing ML and not using tensor cores, you're leaving most of the GPU on the table.
  • Memory bandwidth more than doubled from V100 to A100 (900 → 2,039 GB/s) thanks to the jump from HBM2 to HBM2e. This is often the biggest real-world performance differentiator.
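The first bullet is easy to verify from the table itself: total FP32 core count is just SM count times the fixed 64 cores per SM.

```python
# (SMs, FP32 cores per SM) from the spec-sheet table above
specs = {"P100": (56, 64), "V100": (80, 64), "A100": (108, 64)}

for name, (sms, cores_per_sm) in specs.items():
    print(f"{name}: {sms * cores_per_sm} FP32 cores")  # 3,584 / 5,120 / 6,912
```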

What's Next

Now that we've covered the metrics, next up is what's actually going on inside the streaming multiprocessor - the SM partitions, warp scheduling, SIMT execution, and how NVIDIA's hardware maps those CUDA threads we wrote about in week one to real silicon.

The takeaway from this week: GPU specs are a system of interlocking variables. Bandwidth, throughput, precision, and specialized hardware all interact. A GPU with massive TFLOPS but limited bandwidth will bottleneck on memory-bound workloads. A GPU with tons of bandwidth but low tensor core throughput will underwhelm on ML training. And a workstation GPU at compute capability 5.0 won't even compile code that targets CC 8.0 features.

Read the full spec sheet. Understand the workload. Then pick the GPU.
