Interactive zoom through the scales of LLM training, from the internet-scale data corpus down to individual transistors switching in silicon

STAGE 1 / 7

corpus

petabyte · text

Before any training happens, the raw material: roughly 15 trillion tokens of filtered text, scraped from the web, stripped of boilerplate, deduplicated, and quality-ranked. A single human reading 24/7 at 200 words per minute would need ~140,000 years to finish.

[Diagram: relative sizes of text corpora on a log scale, from a single book up to the full training corpus (~15T tokens · ~60 TB compressed). Composition: curated web ~9T tokens (60%), code ~2T (15%), books ~1.5T (10%), papers ~1T (7%), other ~1.5T (8%). One small square = all of Wikipedia (~6B tokens).]
Raw text scraped: ~250B web pages, pre-filter
After dedup & quality filter: ~15T tokens · ~60 TB
Filter ratio: ~99% of raw data discarded
Vocabulary: ~128k unique tokens (BPE)
raw internet HTML → deduplicated, quality-filtered UTF-8 text
Most of the compute cost of data prep is filtering, not scraping. Boilerplate, spam, near-duplicates, and low-quality content get stripped before a single GPU sees a token.
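A minimal sketch of what that filter looks like, assuming exact dedup by hash plus a few hand-rolled heuristics. Production pipelines use MinHash near-dedup, learned quality classifiers, and language ID; every threshold below is illustrative.

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash the same.
    return " ".join(text.lower().split())

def looks_like_quality_text(text: str) -> bool:
    # Illustrative heuristics only; real pipelines use learned quality classifiers.
    words = text.split()
    if len(words) < 50:                                              # too short to be useful
        return False
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:     # mostly markup / symbols
        return False
    if max(words.count(w) for w in set(words)) / len(words) > 0.2:   # heavy repetition / spam
        return False
    return True

def filter_corpus(pages):
    """Yield pages that pass the quality check and are not exact duplicates."""
    seen = set()
    for page in pages:                        # pages: iterable of extracted page texts
        if not looks_like_quality_text(page):
            continue
        digest = hashlib.sha256(normalize(page).encode("utf-8")).digest()
        if digest in seen:                    # exact duplicate of something already kept
            continue
        seen.add(digest)
        yield page
```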
STAGE 2 / 7

tokens

kilobyte · integers

Text is useless to a GPU. Every character sequence gets mapped to integers using byte-pair encoding — a lossless compression that learns which character clusters appear together most often. Common words become one token; rare strings get chopped into sub-word pieces.

[Diagram: a sentence transformed into tokens, then integer IDs. 1. Raw text (UTF-8 bytes): "The quick brown fox jumps over the lazy dog." 2. BPE tokenization (vocab = 128k) splits it into tokens. 3. Integer IDs (int32 array · 4 bytes each): [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13].]
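A toy version of the encoding step. The merge table, vocabulary, and IDs below are made up for illustration; a real 128k-vocab tokenizer learns on the order of 100k merges from data and runs in optimized native code.

```python
# Toy BPE encoder: start from individual characters, then greedily apply
# learned merges in priority order. Merge table and IDs are made up.
MERGES = [("t", "h"), ("th", "e"), ("f", "o"), ("fo", "x")]   # learned, most frequent first
VOCAB = {"the": 791, "fox": 39935, " ": 220}                  # token string -> integer ID

def token_id(tok: str) -> int:
    # Fall back to a byte-level ID for anything not in the vocab.
    return VOCAB.get(tok, 1000 + ord(tok[0]))

def bpe_encode(text: str) -> list[int]:
    pieces = list(text)                       # start from single characters (bytes, really)
    for left, right in MERGES:                # apply merges in learned priority order
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)   # fuse the pair into one token
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return [token_id(p) for p in pieces]

print(bpe_encode("the fox"))   # [791, 220, 39935]: common words collapse to one token each
```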
Vocabulary size: ~128k unique token IDs
Average compression: ~4 bytes per token
Storage per token: 4 bytes (int32)
Tokenizer speed: ~1M tokens/sec/CPU core
UTF-8 bytes → int32 array of token IDs
The full corpus, post-tokenization, is roughly 15 trillion integers — about 60 TB stored as int32. This lives on NVMe SSDs across hundreds of storage nodes, never fitting on a single machine.
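A sketch of how one shard of that store is read, using numpy's memory mapping; the shard here is a miniature stand-in written to a local file.

```python
import numpy as np

# Write a miniature "shard" of tokenized corpus to disk, then memory-map it back.
# Real shards are tens of GB each; memory-mapping lets the loader index into them
# as if they were in-RAM arrays, pulling only the touched pages off NVMe.
tokens = np.random.randint(0, 128_000, size=1_000_000, dtype=np.int32)
tokens.tofile("corpus_shard_000.bin")                       # flat int32 file, 4 bytes/token

shard = np.memmap("corpus_shard_000.bin", dtype=np.int32, mode="r")
print(shard.shape[0], "tokens,", shard.nbytes / 1e6, "MB")  # 1,000,000 tokens, 4 MB
print(shard[500_000:500_008])                               # eight consecutive token IDs

# At full scale: 15e12 tokens * 4 bytes = 60 TB, split across hundreds of such shards.
```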
STAGE 3 / 7

batches

megabyte · tensors

The 15T token stream gets shuffled and packed into fixed-shape rectangles. Each training step feeds the cluster a single batch — typically ~16 million tokens arranged as ~2,000 sequences of 8,192 tokens each. Storage → CPU RAM → GPU memory over PCIe at ~64 GB/s.

[Diagram: the linear token stream (15 trillion tokens, shuffled) being packed into a [batch, seq_len] tensor for one training step: batch = 2,048 rows, seq_len = 8,192 tokens per row.]
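The packing step itself is a reshape. A sketch with numpy at the shapes above, using a random stand-in for the real memory-mapped token stream:

```python
import numpy as np

BATCH, SEQ_LEN = 2048, 8192
TOKENS_PER_STEP = BATCH * SEQ_LEN             # ~16.8M tokens per training step

def next_batch(token_stream: np.ndarray, step: int) -> np.ndarray:
    """Slice one step's worth of tokens and reshape into [batch, seq_len]."""
    start = step * TOKENS_PER_STEP
    flat = token_stream[start : start + TOKENS_PER_STEP]
    return flat.reshape(BATCH, SEQ_LEN)        # ~67 MB as int32: the unit fed to the GPUs

# Stand-in for the real 15T-token stream (which lives memory-mapped on NVMe).
stream = np.random.randint(0, 128_000, size=4 * TOKENS_PER_STEP, dtype=np.int32)
batch = next_batch(stream, step=3)
print(batch.shape, batch.dtype, batch.nbytes / 1e6, "MB")   # (2048, 8192) int32 ~67 MB
```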
Global batch size: ~16M tokens per step
Sequence length: 8,192 tokens (often 32k+)
PCIe 5.0 bandwidth: 64 GB/s CPU → GPU
Steps to finish epoch: ~940k for 15T / 16M batches
int32 token stream on NVMe → [2048, 8192] tensor in GPU HBM
The pipeline runs async — while GPU N is computing on batch K, the CPU is already prefetching batch K+1 from disk. If the data loader stalls, thousands of GPUs sit idle burning electricity.
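A minimal sketch of that prefetch pattern, using one background thread and a bounded queue. Real loaders use multiple worker processes, pinned host memory, and CUDA streams; `load_batch` is a placeholder for whatever reads a batch off NVMe.

```python
import queue
import threading

def prefetching_loader(load_batch, num_steps, depth=2):
    """Yield batches while a background thread stays `depth` batches ahead."""
    buf = queue.Queue(maxsize=depth)          # bounded: producer blocks if the GPU falls behind

    def producer():
        for step in range(num_steps):
            buf.put(load_batch(step))         # disk -> CPU RAM happens here, off the hot path
        buf.put(None)                         # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := buf.get()) is not None:
        yield batch                           # the consumer's compute overlaps with the next load

# Usage sketch: while the model computes on batch K, the producer is loading K+1.
# for batch in prefetching_loader(lambda s: next_batch(stream, s), num_steps=10):
#     train_step(batch)
```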
STAGE 4 / 7

cluster

exabyte · cluster

One model is too big for one GPU. Training a frontier LLM means 25,000–100,000 H100s wired together with three overlapping networks: NVLink inside a node, InfiniBand between nodes, and Ethernet for control. Weights, activations, and gradients get sharded across the whole mesh.

[Diagram: hierarchical view of the training cluster. Cluster: ~25,000 GPUs on an InfiniBand fabric, 400 Gb/s per link. Pod: 512 GPUs (64 nodes × 8 GPUs). Node: 8 GPUs in an all-to-all NVLink 4 mesh via NVSwitch, 900 GB/s. Single H100: 80 GB · ~1 PFLOPS · 80B transistors · 4nm · 814 mm² · 700W TDP · 1.83 GHz boost.]
Cluster size: ~25,000 H100 GPUs (Llama 3)
NVLink (intra-node): 900 GB/s per GPU
InfiniBand (inter-node): 400 Gb/s per link
Total training compute: ~10²⁵ FLOPs, ~2 months
1 logical model (~500 GB weights) → sharded across 25,000 GPUs via data + tensor + pipeline parallelism
The communication layer often dominates the compute layer. After every step, gradients from every GPU have to be summed across the entire cluster — an all-reduce that moves terabytes of data over InfiniBand in milliseconds.
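What that all-reduce has to compute, sketched in numpy for the data-parallel case. Real clusters do this with NCCL ring/tree all-reduce over NVLink and InfiniBand, in chunks, overlapped with the backward pass; this only shows the math every step must produce.

```python
import numpy as np

def all_reduce_mean(local_grads: list[np.ndarray]) -> np.ndarray:
    """Every replica ends the step holding the same averaged gradient."""
    total = np.sum(local_grads, axis=0)       # sum contributions from every data-parallel rank
    return total / len(local_grads)           # average so the effective batch is the global batch

# 8 replicas, each holding gradients from its own shard of the 16M-token batch.
rng = np.random.default_rng(0)
per_gpu = [rng.standard_normal(1_000) for _ in range(8)]
avg = all_reduce_mean(per_gpu)
# After the (real, distributed) all-reduce, all 8 GPUs apply this same `avg` in their
# optimizer step, keeping every copy of the weights identical.
```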
STAGE 5 / 7

one gpu

gigabyte · hbm

Zoom in on a single GPU. 80 GB of HBM3 memory on the edges, feeding the compute die at 3.35 TB/s. The die itself: 132 streaming multiprocessors, each housing 4 tensor cores. Data flows HBM → L2 cache → SM registers → tensor cores → back out, millions of times per second.

[Diagram: cross-section of the H100 SXM5 package, with HBM3 stacks (16 GB each) flanking the compute die (814 mm² · 80B transistors). On the die: 50 MB of L2 cache shared across all SMs, and 132 SMs with 4 tensor cores each, 528 tensor cores total. Highlighted tile = one SM executing a warp of 32 threads.]
HBM3 memory: 80 GB @ 3.35 TB/s
Streaming multiprocessors: 132 SMs · 528 tensor cores
BF16 throughput: ~989 TFLOPS dense
FP8 throughput: ~1,979 TFLOPS dense
weight + activation tiles in HBM → matrix multiplies on tensor cores
Training a frontier LLM is fundamentally a bandwidth problem, not a compute problem. Tensor cores can do ~1 PFLOP/s but HBM only feeds them ~3 TB/s. Every architectural decision — flash attention, tensor parallelism, FP8 — exists to get more math done per byte moved. The memory wall is the real constraint; compute is cheap relative to moving bits around.
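A back-of-the-envelope version of that argument, using the H100 figures above. The crossover it computes is the arithmetic intensity (FLOPs per byte moved) a kernel needs before compute, not HBM bandwidth, becomes the limit.

```python
# Roofline arithmetic for one H100, using the figures quoted above.
PEAK_FLOPS = 989e12        # BF16 dense tensor-core throughput, FLOP/s
HBM_BW     = 3.35e12       # HBM3 bandwidth, bytes/s

# A kernel is compute-bound only if it does more than this many FLOPs per byte it moves.
breakeven_intensity = PEAK_FLOPS / HBM_BW          # ~295 FLOPs per byte
print(breakeven_intensity)

# Big matmul (M = N = 4096, K = 16384, BF16 = 2 bytes/element):
M, N, K, B = 4096, 4096, 16384, 2
flops = 2 * M * N * K                              # multiply + add per FMA
bytes_moved = B * (M * K + K * N + M * N)          # read A, read B, write C (ignoring tile reuse)
print(flops / bytes_moved)                         # ~1800 FLOPs/byte -> comfortably compute-bound

# Elementwise op (e.g. adding a bias to the same output):
print((M * N) / (B * 2 * M * N))                   # 0.25 FLOPs/byte -> hopelessly memory-bound
```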
STAGE 6 / 7

matmul

megaflop · matmul

Almost everything the GPU does during training is matrix multiplication — about 99% of the FLOPs. Activations times weights in the forward pass, gradients times activations in the backward pass. Everything else (attention softmax, layer norm, activation functions) is rounding error by comparison.

[Diagram: A (activations, 4096 × 16384) × B (weights, 16384 × 4096) = C (output, 4096 × 4096). Each output element is a row-by-column dot product of 16,384 FMAs. Total FMAs per multiply: 4,096 × 4,096 × 16,384 ≈ 275 billion; on one H100 at ~1 PFLOPS that takes ~0.55 ms, with tensor cores doing 256+ FMAs per cycle per core.]
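The same accounting, written out. The triple loop is what a GEMM is; tensor cores execute it as tiled blocks rather than scalar operations, and the reference loop below would be absurdly slow at these sizes, but the FLOP count is identical.

```python
import numpy as np

def naive_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reference GEMM: every C[i, j] is K fused multiply-adds down a row and a column."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):                # 16,384 FMAs per output element at the sizes above
                acc += A[i, k] * B[k, j]      # one fused multiply-add
            C[i, j] = acc
    return C

# Counting the work at the real layer sizes (no need to actually run the loop):
M, N, K = 4096, 4096, 16384
fmas = M * N * K                              # ~2.75e11 FMAs = ~5.5e11 FLOPs
print(f"{fmas:.3g} FMAs, {fmas * 2 / 989e12 * 1e3:.2f} ms on one H100 at BF16 peak")
```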
[Diagram: the training loop, five stages repeated about a million times: load batch from HBM → forward pass → loss → backward pass (≈ 2× forward FLOPs) → all-reduce + optimizer.]
the training loop · ~1 million repetitions per model
Single output element: 16,384 multiply-adds
Total per matmul: ~275B FMAs (one layer · one batch)
Per tensor core per cycle: 256+ FMAs (BF16 · 4th gen)
Matmuls per step: ~25,000 across all layers
matrix A [M, K] × matrix B [K, N] → matrix C [M, N], via M×N×K fused multiply-adds
Training is forward pass → loss → backward pass → weight update, repeated ~1 million times. Each pass through the model is just these matrix multiplies, chained — the entire intelligence of an LLM reduces to a sequence of GEMM operations run at astronomical scale.
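A runnable miniature of that loop in numpy: one linear layer trained to predict the next token over a toy vocabulary. Sizes, task, and optimizer are deliberately tiny stand-ins; only the structure (forward, loss, backward, update) matches the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, BATCH, STEPS, LR = 64, 32, 128, 300, 1.0

embed = rng.standard_normal((VOCAB, DIM))    # fixed random token embeddings
W = np.zeros((DIM, VOCAB))                   # the weights we actually train

for step in range(STEPS):
    tokens = rng.integers(0, VOCAB, size=BATCH)           # load batch
    targets = (tokens + 1) % VOCAB                         # toy task: predict token + 1
    x = embed[tokens]                                      # forward: embedding lookup ...
    logits = x @ W                                         # ... then one matmul
    logits -= logits.max(axis=1, keepdims=True)            # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(BATCH), targets]).mean()
    dlogits = probs.copy()                                 # backward: softmax + cross-entropy grad
    dlogits[np.arange(BATCH), targets] -= 1.0
    dlogits /= BATCH
    dW = x.T @ dlogits                                     # gradient via the transposed matmul
    W -= LR * dW                                           # optimizer: plain SGD update
    if step % 50 == 0:
        print(step, round(float(loss), 3))                 # loss falls from ~4.16 toward 0
```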
STAGE 7 / 7

silicon

picosecond · gate

At the bottom of the stack: a single fused-multiply-add circuit. Three numbers arrive as voltage patterns on a few dozen wires. Transistors built on a 4-nanometer process switch between 0V and ~0.7V in tens of picoseconds, implementing the logic of multiply-then-add. 80 billion of these switches coordinate on every clock tick.

[Diagram: a fused multiply-add circuit. Inputs A = 0.73, B = 0.91, C = 0.12 (FP8, 8 bits each) flow through a multiplier (A × B = 0.664, ~200 transistors, ~20 ps latency) and an adder ((A × B) + C = 0.784, ~300 transistors, ~15 ps latency) into an FP32 accumulator; next cycle the result is added back into C. One FMA consumes ~500 transistors; an H100 fires ~550 trillion FMAs/sec. At any instant, billions of transistors are switching between 0V and 0.7V, rearranging the state of the entire die ~2 billion times per second.]
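The same arithmetic in software, with the diagram's values. A dot product is this one operation repeated K times into a single accumulator; note that `a * b + c` here rounds the product before adding, whereas a hardware FMA keeps full precision until the final rounding.

```python
import numpy as np

a, b, c = np.float32(0.73), np.float32(0.91), np.float32(0.12)
print(a * b + c)                       # ~0.7843: the single FMA from the diagram

def dot_as_fma_chain(row: np.ndarray, col: np.ndarray) -> np.float32:
    """A dot product is the same circuit fired K times, accumulating into one register."""
    acc = np.float32(0.0)              # the FP32 accumulator ("next cycle: += to C")
    for a_k, b_k in zip(row, col):
        acc = a_k * b_k + acc          # one fused multiply-add per element pair
    return acc

rng = np.random.default_rng(0)
row = rng.standard_normal(16_384).astype(np.float32)   # one row of A
col = rng.standard_normal(16_384).astype(np.float32)   # one column of B
print(dot_as_fma_chain(row, col), np.dot(row, col))    # 16,384 FMAs -> one element of C
```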
Transistor count: 80B on one 814 mm² die
Gate switching time: ~10-20 picoseconds
Clock cycle: ~550 ps @ 1.83 GHz
Supply voltage: ~0.7 volts
3 numbers as voltage patterns on ~24 wires → 1 number (a×b + c) as a new voltage pattern
Zoom all the way back out: the 10,000 words of your question become ~13,000 tokens, become billions of matrix elements, become quadrillions of transistor switches, and the result is one more token of response. That's inference: the forward half of this pipeline, run once per output token. Training is the same stack plus the backward pass and a weight update, repeated about a million times over 15 trillion tokens.
THE WALL · WHAT YOU ACTUALLY GET

3% of peak FP8

1,979 TFLOPS available · the other 97% is tensor cores waiting for memory

now what