Interactive zoom through the scales of LLM training, from the internet-scale data corpus down to individual transistors switching in silicon

STAGE 1 / 7

corpus

petabyte · text

Before any training happens, the raw material: roughly 15 trillion tokens of filtered text, scraped from the web, stripped of boilerplate, deduplicated, and quality-ranked. A single human reading 24/7 at 200 words per minute would need ~140,000 years to finish.

[Diagram: relative sizes of text corpora on a log scale, from a single book up to the full training corpus (~15T tokens · ~60 TB compressed). Composition: curated web ~9T tokens (60%), code ~2T (15%), books ~1.5T (10%), papers ~1T (7%), other ~1.5T (8%). One small square = all of Wikipedia (~6B tokens).]
Raw text scraped: ~250B web pages, pre-filter
After dedup & quality filter: ~15T tokens · ~60 TB
Filter ratio: ~99% of raw data discarded
Vocabulary: ~128k unique tokens (BPE)
raw internet HTML → deduplicated, quality-filtered UTF-8 text
Most of the compute cost of data prep is filtering, not scraping. Boilerplate, spam, near-duplicates, and low-quality content get stripped before a single GPU sees a token.
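A minimal sketch of what that filter looks like, assuming exact dedup by hash plus a few hand-rolled heuristics. Production pipelines use MinHash near-dedup, learned quality classifiers, and language ID; every threshold below is illustrative.

```python
import hashlib

def normalize(text: str) -> str:
    # Collapse whitespace and lowercase so trivially different copies hash the same.
    return " ".join(text.lower().split())

def looks_like_quality_text(text: str) -> bool:
    # Illustrative heuristics only; real pipelines use learned quality classifiers.
    words = text.split()
    if len(words) < 50:                                              # too short to be useful
        return False
    if sum(c.isalpha() for c in text) / max(len(text), 1) < 0.6:     # mostly markup / symbols
        return False
    if max(words.count(w) for w in set(words)) / len(words) > 0.2:   # heavy repetition / spam
        return False
    return True

def filter_corpus(pages):
    """Yield pages that pass the quality check and are not exact duplicates."""
    seen = set()
    for page in pages:                        # pages: iterable of extracted page texts
        if not looks_like_quality_text(page):
            continue
        digest = hashlib.sha256(normalize(page).encode("utf-8")).digest()
        if digest in seen:                    # exact duplicate of something already kept
            continue
        seen.add(digest)
        yield page
```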
STAGE 2 / 7

tokens

kilobyte · integers

Text is useless to a GPU. Every character sequence gets mapped to integers using byte-pair encoding — a lossless compression that learns which character clusters appear together most often. Common words become one token; rare strings get chopped into sub-word pieces.

[Diagram: a sentence transformed into tokens, then integer IDs. 1. Raw text (UTF-8 bytes): "The quick brown fox jumps over the lazy dog." 2. BPE tokenization (vocab = 128k) splits it into tokens. 3. Integer IDs (int32 array · 4 bytes each): [791, 4062, 14198, 39935, 35308, 927, 279, 16053, 5679, 13].]
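A toy version of the encoding step. The merge table, vocabulary, and IDs below are made up for illustration; a real 128k-vocab tokenizer learns on the order of 100k merges from data and runs in optimized native code.

```python
# Toy BPE encoder: start from individual characters, then greedily apply
# learned merges in priority order. Merge table and IDs are made up.
MERGES = [("t", "h"), ("th", "e"), ("f", "o"), ("fo", "x")]   # learned, most frequent first
VOCAB = {"the": 791, "fox": 39935, " ": 220}                  # token string -> integer ID

def token_id(tok: str) -> int:
    # Fall back to a byte-level ID for anything not in the vocab.
    return VOCAB.get(tok, 1000 + ord(tok[0]))

def bpe_encode(text: str) -> list[int]:
    pieces = list(text)                       # start from single characters (bytes, really)
    for left, right in MERGES:                # apply merges in learned priority order
        merged, i = [], 0
        while i < len(pieces):
            if i + 1 < len(pieces) and pieces[i] == left and pieces[i + 1] == right:
                merged.append(left + right)   # fuse the pair into one token
                i += 2
            else:
                merged.append(pieces[i])
                i += 1
        pieces = merged
    return [token_id(p) for p in pieces]

print(bpe_encode("the fox"))   # [791, 220, 39935]: common words collapse to one token each
```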
Vocabulary size: ~128k unique token IDs
Average compression: ~4 bytes per token
Storage per token: 4 bytes (int32)
Tokenizer speed: ~1M tokens/sec/CPU core
UTF-8 bytes → int32 array of token IDs
The full corpus, post-tokenization, is roughly 15 trillion integers — about 60 TB stored as int32. This lives on NVMe SSDs across hundreds of storage nodes, never fitting on a single machine.
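A sketch of how one shard of that store is read, using numpy's memory mapping; the shard here is a miniature stand-in written to a local file.

```python
import numpy as np

# Write a miniature "shard" of tokenized corpus to disk, then memory-map it back.
# Real shards are tens of GB each; memory-mapping lets the loader index into them
# as if they were in-RAM arrays, pulling only the touched pages off NVMe.
tokens = np.random.randint(0, 128_000, size=1_000_000, dtype=np.int32)
tokens.tofile("corpus_shard_000.bin")                       # flat int32 file, 4 bytes/token

shard = np.memmap("corpus_shard_000.bin", dtype=np.int32, mode="r")
print(shard.shape[0], "tokens,", shard.nbytes / 1e6, "MB")  # 1,000,000 tokens, 4 MB
print(shard[500_000:500_008])                               # eight consecutive token IDs

# At full scale: 15e12 tokens * 4 bytes = 60 TB, split across hundreds of such shards.
```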
STAGE 3 / 7

batches

megabyte · tensors

The 15T token stream gets shuffled and packed into fixed-shape rectangles. Each training step feeds the cluster a single batch — typically ~16 million tokens arranged as ~2,000 sequences of 8,192 tokens each. Storage → CPU RAM → GPU memory over PCIe at ~64 GB/s.

[Diagram: the linear token stream (15 trillion tokens, shuffled) being packed into a [batch, seq_len] tensor for one training step: batch = 2,048 rows, seq_len = 8,192 tokens per row.]
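The packing step itself is a reshape. A sketch with numpy at the shapes above, using a random stand-in for the real memory-mapped token stream:

```python
import numpy as np

BATCH, SEQ_LEN = 2048, 8192
TOKENS_PER_STEP = BATCH * SEQ_LEN             # ~16.8M tokens per training step

def next_batch(token_stream: np.ndarray, step: int) -> np.ndarray:
    """Slice one step's worth of tokens and reshape into [batch, seq_len]."""
    start = step * TOKENS_PER_STEP
    flat = token_stream[start : start + TOKENS_PER_STEP]
    return flat.reshape(BATCH, SEQ_LEN)        # ~67 MB as int32: the unit fed to the GPUs

# Stand-in for the real 15T-token stream (which lives memory-mapped on NVMe).
stream = np.random.randint(0, 128_000, size=4 * TOKENS_PER_STEP, dtype=np.int32)
batch = next_batch(stream, step=3)
print(batch.shape, batch.dtype, batch.nbytes / 1e6, "MB")   # (2048, 8192) int32 ~67 MB
```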
Global batch size: ~16M tokens per step
Sequence length: 8,192 tokens (often 32k+)
PCIe 5.0 bandwidth: 64 GB/s CPU → GPU
Steps to finish epoch: ~940k for 15T / 16M batches
int32 token stream on NVMe → [2048, 8192] tensor in GPU HBM
The pipeline runs async — while GPU N is computing on batch K, the CPU is already prefetching batch K+1 from disk. If the data loader stalls, thousands of GPUs sit idle burning electricity.
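A minimal sketch of that prefetch pattern, using one background thread and a bounded queue. Real loaders use multiple worker processes, pinned host memory, and CUDA streams; `load_batch` is a placeholder for whatever reads a batch off NVMe.

```python
import queue
import threading

def prefetching_loader(load_batch, num_steps, depth=2):
    """Yield batches while a background thread stays `depth` batches ahead."""
    buf = queue.Queue(maxsize=depth)          # bounded: producer blocks if the GPU falls behind

    def producer():
        for step in range(num_steps):
            buf.put(load_batch(step))         # disk -> CPU RAM happens here, off the hot path
        buf.put(None)                         # sentinel: no more batches

    threading.Thread(target=producer, daemon=True).start()
    while (batch := buf.get()) is not None:
        yield batch                           # the consumer's compute overlaps with the next load

# Usage sketch: while the model computes on batch K, the producer is loading K+1.
# for batch in prefetching_loader(lambda s: next_batch(stream, s), num_steps=10):
#     train_step(batch)
```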
STAGE 4 / 7

cluster

exabyte · cluster

One model is too big for one GPU. Training a frontier LLM means 25,000–100,000 H100s wired together with three overlapping networks: NVLink inside a node, InfiniBand between nodes, and Ethernet for control. Weights, activations, and gradients get sharded across the whole mesh.

[Diagram: hierarchical view of the training cluster. Cluster: ~25,000 GPUs on an InfiniBand fabric, 400 Gb/s per link. Pod: 512 GPUs (64 nodes × 8 GPUs). Node: 8 GPUs in an all-to-all NVLink 4 mesh via NVSwitch, 900 GB/s. Single H100: 80 GB · ~1 PFLOPS · 80B transistors · 4nm · 814 mm² · 700W TDP · 1.83 GHz boost.]
Cluster size: ~25,000 H100 GPUs (Llama 3)
NVLink (intra-node): 900 GB/s per GPU
InfiniBand (inter-node): 400 Gb/s per link
Total training compute: ~10²⁵ FLOPs, ~2 months
1 logical model (~500 GB weights) → sharded across 25,000 GPUs via data + tensor + pipeline parallelism
The communication layer often dominates the compute layer. After every step, gradients from every GPU have to be summed across the entire cluster — an all-reduce that moves terabytes of data over InfiniBand in milliseconds.
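What that all-reduce has to compute, sketched in numpy for the data-parallel case. Real clusters do this with NCCL ring/tree all-reduce over NVLink and InfiniBand, in chunks, overlapped with the backward pass; this only shows the math every step must produce.

```python
import numpy as np

def all_reduce_mean(local_grads: list[np.ndarray]) -> np.ndarray:
    """Every replica ends the step holding the same averaged gradient."""
    total = np.sum(local_grads, axis=0)       # sum contributions from every data-parallel rank
    return total / len(local_grads)           # average so the effective batch is the global batch

# 8 replicas, each holding gradients from its own shard of the 16M-token batch.
rng = np.random.default_rng(0)
per_gpu = [rng.standard_normal(1_000) for _ in range(8)]
avg = all_reduce_mean(per_gpu)
# After the (real, distributed) all-reduce, all 8 GPUs apply this same `avg` in their
# optimizer step, keeping every copy of the weights identical.
```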
STAGE 5 / 7

one gpu

gigabyte · hbm

Zoom in on a single GPU. 80 GB of HBM3 memory on the edges, feeding the compute die at 3.35 TB/s. The die itself: 132 streaming multiprocessors, each housing 4 tensor cores. Data flows HBM → L2 cache → SM registers → tensor cores → back out, millions of times per second.

[Diagram: cross-section of the H100 SXM5 package, with HBM3 stacks (16 GB each) flanking the compute die (814 mm² · 80B transistors). On the die: 50 MB of L2 cache shared across all SMs, and 132 SMs with 4 tensor cores each, 528 tensor cores total. Highlighted tile = one SM executing a warp of 32 threads.]
HBM3 memory: 80 GB @ 3.35 TB/s
Streaming multiprocessors: 132 SMs · 528 tensor cores
BF16 throughput: ~989 TFLOPS dense
FP8 throughput: ~1,979 TFLOPS dense
weight + activation tiles in HBM → matrix multiplies on tensor cores
Training a frontier LLM is fundamentally a bandwidth problem, not a compute problem. Tensor cores can do ~1 PFLOP/s but HBM only feeds them ~3 TB/s. Every architectural decision — flash attention, tensor parallelism, FP8 — exists to get more math done per byte moved. The memory wall is the real constraint; compute is cheap relative to moving bits around.
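A back-of-the-envelope version of that argument, using the H100 figures above. The crossover it computes is the arithmetic intensity (FLOPs per byte moved) a kernel needs before compute, not HBM bandwidth, becomes the limit.

```python
# Roofline arithmetic for one H100, using the figures quoted above.
PEAK_FLOPS = 989e12        # BF16 dense tensor-core throughput, FLOP/s
HBM_BW     = 3.35e12       # HBM3 bandwidth, bytes/s

# A kernel is compute-bound only if it does more than this many FLOPs per byte it moves.
breakeven_intensity = PEAK_FLOPS / HBM_BW          # ~295 FLOPs per byte
print(breakeven_intensity)

# Big matmul (M = N = 4096, K = 16384, BF16 = 2 bytes/element):
M, N, K, B = 4096, 4096, 16384, 2
flops = 2 * M * N * K                              # multiply + add per FMA
bytes_moved = B * (M * K + K * N + M * N)          # read A, read B, write C (ignoring tile reuse)
print(flops / bytes_moved)                         # ~1800 FLOPs/byte -> comfortably compute-bound

# Elementwise op (e.g. adding a bias to the same output):
print((M * N) / (B * 2 * M * N))                   # 0.25 FLOPs/byte -> hopelessly memory-bound
```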
STAGE 6 / 7

matmul

megaflop · matmul

Almost everything the GPU does during training is matrix multiplication — about 99% of the FLOPs. Activations times weights in the forward pass, gradients times activations in the backward pass. Everything else (attention softmax, layer norm, activation functions) is rounding error by comparison.

[Diagram: A (activations, 4096 × 16384) × B (weights, 16384 × 4096) = C (output, 4096 × 4096). Each output element is a row-by-column dot product of 16,384 FMAs. Total FMAs per multiply: 4,096 × 4,096 × 16,384 ≈ 275 billion; on one H100 at ~1 PFLOPS that takes ~0.55 ms, with tensor cores doing 256+ FMAs per cycle per core.]
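The same accounting, written out. The triple loop is what a GEMM is; tensor cores execute it as tiled blocks rather than scalar operations, and the reference loop below would be absurdly slow at these sizes, but the FLOP count is identical.

```python
import numpy as np

def naive_matmul(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Reference GEMM: every C[i, j] is K fused multiply-adds down a row and a column."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=np.float32)
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):                # 16,384 FMAs per output element at the sizes above
                acc += A[i, k] * B[k, j]      # one fused multiply-add
            C[i, j] = acc
    return C

# Counting the work at the real layer sizes (no need to actually run the loop):
M, N, K = 4096, 4096, 16384
fmas = M * N * K                              # ~2.75e11 FMAs = ~5.5e11 FLOPs
print(f"{fmas:.3g} FMAs, {fmas * 2 / 989e12 * 1e3:.2f} ms on one H100 at BF16 peak")
```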
[Diagram: the training loop, five stages repeated about a million times: load batch from HBM → forward pass → loss → backward pass (≈ 2× forward FLOPs) → all-reduce + optimizer.]
the training loop · ~1 million repetitions per model
Single output element: 16,384 multiply-adds
Total per matmul: ~275B FMAs (one layer · one batch)
Per tensor core per cycle: 256+ FMAs (BF16 · 4th gen)
Matmuls per step: ~25,000 across all layers
matrix A [M, K] × matrix B [K, N] → matrix C [M, N], via M×N×K fused multiply-adds
Training is forward pass → loss → backward pass → weight update, repeated ~1 million times. Each pass through the model is just these matrix multiplies, chained — the entire intelligence of an LLM reduces to a sequence of GEMM operations run at astronomical scale.
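A runnable miniature of that loop in numpy: one linear layer trained to predict the next token over a toy vocabulary. Sizes, task, and optimizer are deliberately tiny stand-ins; only the structure (forward, loss, backward, update) matches the real thing.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM, BATCH, STEPS, LR = 64, 32, 128, 300, 1.0

embed = rng.standard_normal((VOCAB, DIM))    # fixed random token embeddings
W = np.zeros((DIM, VOCAB))                   # the weights we actually train

for step in range(STEPS):
    tokens = rng.integers(0, VOCAB, size=BATCH)           # load batch
    targets = (tokens + 1) % VOCAB                         # toy task: predict token + 1
    x = embed[tokens]                                      # forward: embedding lookup ...
    logits = x @ W                                         # ... then one matmul
    logits -= logits.max(axis=1, keepdims=True)            # numerically stable softmax
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(BATCH), targets]).mean()
    dlogits = probs.copy()                                 # backward: softmax + cross-entropy grad
    dlogits[np.arange(BATCH), targets] -= 1.0
    dlogits /= BATCH
    dW = x.T @ dlogits                                     # gradient via the transposed matmul
    W -= LR * dW                                           # optimizer: plain SGD update
    if step % 50 == 0:
        print(step, round(float(loss), 3))                 # loss falls from ~4.16 toward 0
```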
STAGE 7 / 7

silicon

picosecond · gate

At the bottom of the stack: a single fused-multiply-add circuit. Three numbers arrive as voltage patterns on a few dozen wires. Transistors built on a 4-nanometer process switch between 0V and ~0.7V in tens of picoseconds, implementing the logic of multiply-then-add. 80 billion of these switches coordinate on every clock tick.

[Diagram: a fused multiply-add circuit. Inputs A = 0.73, B = 0.91, C = 0.12 (FP8, 8 bits each) flow through a multiplier (A × B = 0.664, ~200 transistors, ~20 ps latency) and an adder ((A × B) + C = 0.784, ~300 transistors, ~15 ps latency) into an FP32 accumulator; next cycle the result is added back into C. One FMA consumes ~500 transistors; an H100 fires ~550 trillion FMAs/sec. At any instant, billions of transistors are switching between 0V and 0.7V, rearranging the state of the entire die ~2 billion times per second.]
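The same arithmetic in software, with the diagram's values. A dot product is this one operation repeated K times into a single accumulator; note that `a * b + c` here rounds the product before adding, whereas a hardware FMA keeps full precision until the final rounding.

```python
import numpy as np

a, b, c = np.float32(0.73), np.float32(0.91), np.float32(0.12)
print(a * b + c)                       # ~0.7843: the single FMA from the diagram

def dot_as_fma_chain(row: np.ndarray, col: np.ndarray) -> np.float32:
    """A dot product is the same circuit fired K times, accumulating into one register."""
    acc = np.float32(0.0)              # the FP32 accumulator ("next cycle: += to C")
    for a_k, b_k in zip(row, col):
        acc = a_k * b_k + acc          # one fused multiply-add per element pair
    return acc

rng = np.random.default_rng(0)
row = rng.standard_normal(16_384).astype(np.float32)   # one row of A
col = rng.standard_normal(16_384).astype(np.float32)   # one column of B
print(dot_as_fma_chain(row, col), np.dot(row, col))    # 16,384 FMAs -> one element of C
```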
Transistor count: 80B on one 814 mm² die
Gate switching time: ~10-20 picoseconds
Clock cycle: ~550 ps @ 1.83 GHz
Supply voltage: ~0.7 volts
3 numbers as voltage patterns on ~24 wires → 1 number (a×b + c) as a new voltage pattern
Zoom all the way back out: the 10,000 words of your question become ~13,000 tokens, become billions of matrix elements, become quadrillions of transistor switches, and the result is one more token of response. That's inference: the forward half of this pipeline, run once per output token. Training is the same stack plus the backward pass and a weight update, repeated about a million times over 15 trillion tokens.
THE WALL · WHAT YOU ACTUALLY GET

3% of peak FP8

1,979 TFLOPS available · the other 97% is tensor cores waiting for memory

now what