methodology · last updated 2026-04-30
The equations behind Roofline.
A first-order analytical model of training and serving large transformers on modern accelerators. Useful as a mental model; not a substitute for measured runs. This appendix records the equations we use, the constants we assume, and the failure modes we know about.
1. The roofline model
The roofline model[1] plots achievable throughput against arithmetic intensity. It separates a workload into the two regimes that a single accelerator can be in: bound by peak compute, or bound by memory bandwidth. The crossover point between the two is a function of the hardware alone, not the workload.
MFU (model FLOPs utilization) is the empirical derate that captures kernel-level inefficiency: dispatch overhead, non-tensor ops, mixed-precision conversions, partial waves. We default to 0.45 for training, 0.35 for decode, and 0.55 for prefill on H100-class hardware[2]. The slider exposes this for users matching a measured run.
Implementation: calculateRidgePoint, calculateArithmeticIntensity, achievedThroughput in lib/sim/roofline.ts.
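For readers who want to port the roofline step, a minimal TypeScript sketch of those three helpers follows. The HardwareSpec shape and the H100 numbers are illustrative assumptions, not the simulator's own types or the contents of lib/sim/roofline.ts.

```ts
// Illustrative sketch only; not the actual lib/sim/roofline.ts source.
interface HardwareSpec {
  peakFlops: number;    // FLOP/s at the chosen precision
  memBandwidth: number; // bytes/s of HBM
}

// Ridge point: the arithmetic intensity (FLOP/byte) where the two roofs meet.
// A property of the hardware alone, not of the workload.
function calculateRidgePoint(hw: HardwareSpec): number {
  return hw.peakFlops / hw.memBandwidth;
}

// Arithmetic intensity of a workload: FLOPs performed per byte moved from HBM.
function calculateArithmeticIntensity(flops: number, bytesMoved: number): number {
  return flops / bytesMoved;
}

// Achievable throughput is the lower of the two roofs, derated by MFU.
function achievedThroughput(hw: HardwareSpec, intensity: number, mfu: number): number {
  return mfu * Math.min(hw.peakFlops, intensity * hw.memBandwidth);
}

// H100-class FP8 numbers (assumed: ~1.98e15 FLOP/s, ~3.35e12 B/s) give a
// ridge point near 591 FLOP/byte, close to the ≈ 590 figure checked in §6.
const h100 = { peakFlops: 1.98e15, memBandwidth: 3.35e12 };
console.log(calculateRidgePoint(h100).toFixed(0));
console.log(achievedThroughput(h100, 100, 0.35).toExponential(2)); // memory-bound decode regime
```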
2. Parameters and memory
We follow the parameter-counting formula tabulated in the Ultra-Scale Playbook[2]. For a dense decoder-only transformer:
N = h · v + L · ( 12 h² + 13 h ) + 2 h

where h is hidden dim, v is vocab size, and L is layer count. Because real models report a published parameter count, our implementation uses params_b from the spec directly and reserves the formula above as a sanity check — see paramCount in lib/sim/memory.ts.
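A minimal sketch of that sanity check, using GPT-3 175B dimensions as the worked example; the function name and the example are illustrative, not the paramCount implementation itself.

```ts
// Sanity-check parameter count for a dense decoder-only transformer.
// Illustrative sketch; the simulator reads params_b from the spec directly.
function denseParamCount(hidden: number, vocab: number, layers: number): number {
  return hidden * vocab + layers * (12 * hidden ** 2 + 13 * hidden) + 2 * hidden;
}

// GPT-3 175B dimensions: h = 12288, v = 50257, L = 96.
console.log((denseParamCount(12288, 50257, 96) / 1e9).toFixed(1)); // ≈ 174.6, vs the published 175B
```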
2.1 Training memory footprint
Mixed-precision training with the AdamW optimizer[3] maintains the following per parameter:
bytes/param = 2 (BF16 weight) + 4 (FP32 master weight) + 4 (FP32 gradient) + 4 (FP32 momentum) + 4 (FP32 variance) = 18 bytes/param
with FP32 grad accumulation: +2 → 20 bytes/param

For Llama 3 70B[4], this is 70e9 × 18 ≈ 1.26 TB of optimizer-and-weight state — over fifteen H100 80 GB GPUs of capacity, before any activations. The Ultra-Scale Playbook reports 1.40 TB for the same model with FP32 grad accumulation; our eval evals/benchmarks.test.ts § 2 passes within 5 %.
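In code the accounting is a one-liner; this sketch assumes only the BF16 + FP32 AdamW layout above.

```ts
// Bytes of weight + gradient + AdamW state per parameter, mixed precision.
// Sketch of the §2.1 accounting only.
function trainingBytesPerParam(fp32GradAccum = false): number {
  const base = 2 + 4 + 4 + 4 + 4; // BF16 weight, FP32 master, grad, momentum, variance = 18
  return fp32GradAccum ? base + 2 : base; // 20 with FP32 grad accumulation
}

function trainingStateBytes(params: number, fp32GradAccum = false): number {
  return params * trainingBytesPerParam(fp32GradAccum);
}

console.log((trainingStateBytes(70e9) / 1e12).toFixed(2));       // ≈ 1.26 TB (Llama 3 70B)
console.log((trainingStateBytes(70e9, true) / 1e12).toFixed(2)); // ≈ 1.40 TB with FP32 grad accum
```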
2.2 ZeRO sharding
ZeRO[5] partitions the optimizer state, gradients, and weights across data-parallel ranks. Per-GPU memory drops to:
ZeRO-1 → weights + grads + (opt_state / N_dp)
ZeRO-2 → weights + (grads + opt_state) / N_dp
ZeRO-3 → ( weights + grads + opt_state ) / N_dp (≡ FSDP)

FSDP is functionally ZeRO-3 with the parameters reshaped into flat shards. We model it as such in memoryPerGpuFsdp at lib/sim/parallelism.ts.
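A sketch of the per-GPU arithmetic under the three stages, assuming the 2/4/12 bytes-per-parameter split from §2.1; this is illustrative, not the memoryPerGpuFsdp source.

```ts
// Per-GPU bytes of weights, gradients, and optimizer state under ZeRO stages.
// Illustrative sketch; assumes the §2.1 mixed-precision layout.
function zeroBytesPerGpu(params: number, nDp: number, stage: 0 | 1 | 2 | 3): number {
  const weights = 2 * params;   // BF16 weights
  const grads = 4 * params;     // FP32 gradients
  const optState = 12 * params; // FP32 master + momentum + variance
  if (stage === 0) return weights + grads + optState;
  if (stage === 1) return weights + grads + optState / nDp;
  if (stage === 2) return weights + (grads + optState) / nDp;
  return (weights + grads + optState) / nDp; // ZeRO-3, i.e. FSDP full shard
}

// 70B across 64 data-parallel ranks under ZeRO-3: ≈ 19.7 GB of state per GPU.
console.log((zeroBytesPerGpu(70e9, 64, 3) / 1e9).toFixed(1));
```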
2.3 Activation memory
The Korthikanti et al. formula[6], in BF16:
m_act = L · seq · bs · h · ( 34 + 5 · n_heads · seq / h )

The constant 34 captures attention QKV projections, attention output, MLP, layernorms, and dropout; the 5 · n_heads · seq / h term is the attention-score blowup that makes long context training painful. In practice, every long-context training run uses full activation checkpointing[6], which trades roughly 70 % of activation memory for one extra forward pass per step.
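The same formula as a function, with the 70 % checkpointing recovery assumed above exposed as a flag; the dimensions in the example are 70B-class and illustrative.

```ts
// Korthikanti activation-memory formula (bytes, BF16); illustrative sketch.
function activationBytes(
  layers: number, seq: number, batch: number, hidden: number, nHeads: number,
  fullCheckpointing = false,
): number {
  const perLayerPerToken = 34 + (5 * nHeads * seq) / hidden;
  const full = layers * seq * batch * hidden * perLayerPerToken;
  // Assumption carried over from §2.3: full checkpointing recovers ~70 % of
  // activation memory in exchange for one extra forward pass.
  return fullCheckpointing ? 0.3 * full : full;
}

// 70B-class dims (L=80, h=8192, 64 heads), seq=8192, batch=1:
console.log((activationBytes(80, 8192, 1, 8192, 64) / 1e9).toFixed(0), 'GB');       // ≈ 1901 GB
console.log((activationBytes(80, 8192, 1, 8192, 64, true) / 1e9).toFixed(0), 'GB'); // ≈ 570 GB
```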
2.4 KV cache
kv_bytes = 2 · L · batch · seq · kv_heads · head_dim · bytes_per_elem

The factor of 2 holds K and V separately. Modern open weights use grouped-query attention (GQA): Llama 3 70B has 64 attention heads but 8 KV heads[4], cutting cache size by 8×. At batch=16, seq=8192, FP16, this is approximately 43 GB — checked at evals/benchmarks.test.ts § 3.
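As code, with the Llama 3 70B GQA example from the text; a sketch of the formula, not the lib/sim implementation.

```ts
// KV-cache bytes for a decode batch; sketch of the §2.4 formula.
function kvCacheBytes(
  layers: number, batch: number, seq: number,
  kvHeads: number, headDim: number, bytesPerElem: number,
): number {
  return 2 * layers * batch * seq * kvHeads * headDim * bytesPerElem; // 2 = K and V
}

// Llama 3 70B (80 layers, 8 KV heads, head_dim 128), batch=16, seq=8192, FP16:
console.log((kvCacheBytes(80, 16, 8192, 8, 128, 2) / 1e9).toFixed(0), 'GB'); // ≈ 43
```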
3. Communication and parallelism
3.1 Ring all-reduce
The bandwidth-optimal collective, used for every DP/FSDP gradient sync inside an NVLink domain:
t_ring = ( 2 (N - 1) / N ) · ( message_size / per_link_bandwidth )

See ringAllReduceTime in lib/sim/collectives.ts. We ignore the alpha (latency) term because for the GB-scale messages of LLM gradient sync, bandwidth dominates by three orders of magnitude.
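A sketch of the bandwidth-only ring term; the 450 GB/s per-link figure in the example is an illustration-only assumption, not a constant from the simulator.

```ts
// Bandwidth term of a ring all-reduce; the latency (alpha) term is dropped,
// as in the text above. Illustrative sketch.
function ringAllReduceTime(messageBytes: number, nRanks: number, perLinkBandwidth: number): number {
  return ((2 * (nRanks - 1)) / nRanks) * (messageBytes / perLinkBandwidth);
}

// 140 GB of BF16 gradients for a 70B model across 8 GPUs, assuming ~450 GB/s
// of effective per-direction NVLink bandwidth (an assumption, not a spec value):
console.log(ringAllReduceTime(140e9, 8, 450e9).toFixed(2), 's'); // ≈ 0.54
```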
3.2 Hierarchical all-reduce
When data parallelism crosses node boundaries, the runtime decomposes the all-reduce into intra-node reduce-scatter on NVLink, inter-node all-reduce on InfiniBand, and intra-node all-gather on NVLink[7]:
t = t_intra_RS + t_inter_AR + t_intra_AG
t_inter_AR = ( 2 (n_nodes - 1) / n_nodes ) · ( msg / gpus_per_node ) / IB_BW

Inter-node bandwidth is typically 10–40× lower than intra-node NVLink, so the inter-node phase dominates at scale. Implementation: hierarchicalAllReduceTime.
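The three-phase decomposition as a sketch; the NVLink and InfiniBand bandwidths in the example are illustrative assumptions.

```ts
// Three-phase hierarchical all-reduce; illustrative sketch of §3.2.
function hierarchicalAllReduceTime(
  messageBytes: number, nNodes: number, gpusPerNode: number,
  nvlinkBw: number, ibBw: number, // bytes/s
): number {
  const intraRS = ((gpusPerNode - 1) / gpusPerNode) * (messageBytes / nvlinkBw);
  const interAR = ((2 * (nNodes - 1)) / nNodes) * ((messageBytes / gpusPerNode) / ibBw);
  const intraAG = intraRS; // all-gather moves the same volume as reduce-scatter
  return intraRS + interAR + intraAG;
}

// 140 GB of gradients, 16 nodes x 8 GPUs, 450 GB/s NVLink vs 50 GB/s InfiniBand
// (both assumed): the inter-node term is the single largest phase.
console.log(hierarchicalAllReduceTime(140e9, 16, 8, 450e9, 50e9).toFixed(2), 's');
```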
3.3 Pipeline bubble
For 1F1B / AFAB schedules with pipeline depth P and m microbatches per optimizer step[8]:
r_bubble = (P - 1) / m (1F1B / AFAB)
r_bubble_intl = (P - 1) / (v · m) (interleaved, v stages/GPU)

See pipelineBubbleFraction. 1F1B has the same bubble fraction as AFAB but caps in-flight activations at P microbatches instead of m; the interleaved variant pays v× more communication for a v× reduction in bubble.
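The bubble fractions as one small function; the P=8, m=32, v=2 example mirrors the eval case in §6.

```ts
// Pipeline bubble fraction; sketch of the §3.3 formulas.
function pipelineBubbleFraction(depth: number, microbatches: number, stagesPerGpu = 1): number {
  return (depth - 1) / (stagesPerGpu * microbatches);
}

// P=8, m=32, as in the §6 eval:
console.log(pipelineBubbleFraction(8, 32));    // ≈ 0.219 (1F1B / AFAB)
console.log(pipelineBubbleFraction(8, 32, 2)); // ≈ 0.109 (interleaved, v=2)
```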
3.4 Tensor parallelism beyond the NVLink domain
When tensor-parallel degree exceeds the per-node NVLink domain (e.g. TP > 8 on HGX H100), the all-reduce on every transformer block traverses InfiniBand. The Ultra-Scale Playbook[2] reports a roughly 40 % step-time penalty under this regime; we encode this as a fixed 1.40× multiplier rather than attempt to model the full per-block latency. This is empirical, not derived. Validated at evals/benchmarks.test.ts § 9.
3.5 The GB200 NVL72 case
The GB200 NVL72 packages 72 Blackwell GPUs in a single NVLink-5 domain[9]. This raises the cliff in §3.4 from TP=8 to TP=72 — a quantitative change with qualitative consequences for very large models. The simulator reads nvlink_domain from each GPU spec and applies the inter-node penalty only when TP exceeds it.
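How that gate might look in code, assuming only the nvlink_domain field named above; a sketch, not the simulator's dispatch logic.

```ts
// Gate for the §3.4 penalty: a fixed 1.40x step-time multiplier only when the
// tensor-parallel degree exceeds the GPU's NVLink domain. Illustrative sketch.
interface GpuSpecLike {
  nvlink_domain: number; // 8 for HGX H100, 72 for GB200 NVL72
}

function tpCrossDomainMultiplier(gpu: GpuSpecLike, tpDegree: number): number {
  return tpDegree > gpu.nvlink_domain ? 1.4 : 1.0;
}

console.log(tpCrossDomainMultiplier({ nvlink_domain: 8 }, 16));  // 1.4 on HGX H100
console.log(tpCrossDomainMultiplier({ nvlink_domain: 72 }, 16)); // 1.0 on GB200 NVL72
```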
4. Throughput and time-to-train
Per token and per optimizer step:

flops_per_token = 6 · N (dense)
flops_per_token ≈ 6 · N_active (MoE)
t_compute = ( tokens_per_step · flops_per_token ) / ( N_gpus · MFU · peak_FLOPS )
t_comm = t_DP_all_reduce + bubble_fraction · t_compute
t_step = max(t_compute, t_comm) + (1 - overlap) · min(t_compute, t_comm)

The factor of 6 is forward (2×) plus backward (≈ 4×). For MoE, only the active expert FLOPs contribute per token; we ignore router overhead, which is about one percent of total FLOPs and within the model's noise budget. Default compute–communication overlap is 0.8, reflecting backward prefetch in modern FSDP stacks[2].
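The step-time composition as a function; every numeric input in the example (batch tokens, MFU, peak FLOPS, all-reduce time, bubble) is an illustrative assumption rather than a simulator default.

```ts
// Step-time composition from the §4 formulas; illustrative sketch.
function stepTimeSeconds(opts: {
  tokensPerStep: number; flopsPerToken: number;
  nGpus: number; mfu: number; peakFlopsPerGpu: number; // FLOP/s
  dpAllReduceSeconds: number; bubbleFraction: number;
  overlap: number; // 0..1; the text defaults this to 0.8
}): number {
  const tCompute = (opts.tokensPerStep * opts.flopsPerToken) /
                   (opts.nGpus * opts.mfu * opts.peakFlopsPerGpu);
  const tComm = opts.dpAllReduceSeconds + opts.bubbleFraction * tCompute;
  return Math.max(tCompute, tComm) + (1 - opts.overlap) * Math.min(tCompute, tComm);
}

// Dense 70B (6 x 70e9 FLOP/token), 4M-token global batch, 1024 GPUs at MFU 0.45,
// ~0.99e15 FLOP/s peak BF16 per GPU; the comm inputs are placeholder assumptions.
console.log(stepTimeSeconds({
  tokensPerStep: 4e6, flopsPerToken: 6 * 70e9,
  nGpus: 1024, mfu: 0.45, peakFlopsPerGpu: 0.99e15,
  dpAllReduceSeconds: 0.3, bubbleFraction: 0.05, overlap: 0.8,
}).toFixed(2), 's per step');
```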
timeToTrain in lib/sim/workloads.ts integrates t_step over the planned token budget. Empirically this under-estimates published wall-clock by 10–20 %; the gap is unmodeled overhead — batch warmup, restart-from-checkpoint, stragglers — discussed in §6.
5. Cost model
We model owned-fleet TCO, not cloud rental rates. The capex–to–TCO multiplier is:
TCO_3yr ≈ capex_gpu · 1.35
where 1.35 ≈ 1 + 3 · (power_$ / capex) + 3 · (cooling_$ / capex) + 3 · (real_estate_$ / capex) + 3 · 0.04 (ops headcount, 4 %/yr of capex)

Power is priced at $0.07/kWh continuous, scaled by PUE = 1.45[10]; cooling capex at $1.50/W of GPU TDP; real estate at $1M/MW; staff at 4 % of capex per year, which we corroborate against the SemiAnalysis ClusterMAX accounting[11] of roughly $200 K per cluster engineer at one engineer per 1,000–2,000 GPUs of modern hardware.
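A sketch of how those terms compose into a multiplier. The per-GPU capex and wattage inputs in the example are assumptions the text does not pin down, so the printed value will not exactly reproduce the 1.35 headline.

```ts
// Composition of the §5 terms into a 3-year TCO multiplier; illustrative sketch.
function tcoMultiplier3yr(opts: {
  capexPerGpu: number;  // $ per GPU, burdened
  gpuTdpWatts: number;
  powerPerKwh?: number; coolingPerWatt?: number; realEstatePerMw?: number;
  pue?: number; opsFractionPerYear?: number;
}): number {
  const {
    capexPerGpu, gpuTdpWatts,
    powerPerKwh = 0.07, pue = 1.45, coolingPerWatt = 1.5,
    realEstatePerMw = 1e6, opsFractionPerYear = 0.04,
  } = opts;
  const hours3yr = 3 * 8760;
  const powerCost = (gpuTdpWatts / 1000) * pue * powerPerKwh * hours3yr; // $ over 3 yr, continuous draw
  const coolingCost = coolingPerWatt * gpuTdpWatts;                      // $1.50/W of TDP
  const realEstateCost = (gpuTdpWatts * pue / 1e6) * realEstatePerMw;    // $1M per MW of facility load
  const opsCost = 3 * opsFractionPerYear * capexPerGpu;                  // 4 %/yr of capex
  return 1 + (powerCost + coolingCost + realEstateCost + opsCost) / capexPerGpu;
}

// 700 W TDP at $30k burdened capex per GPU (both assumed inputs):
console.log(tcoMultiplier3yr({ capexPerGpu: 30_000, gpuTdpWatts: 700 }).toFixed(2)); // ≈ 1.25
```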
ClusterMAX is the better source for what production tenants actually pay — they price rented goodput with mean-time-between-failure accounting baked in, which is the layer above ours. Our numbers are useful for capacity planning; theirs are useful for procurement decisions.
6. Reality checks
The simulator's outputs are pinned against 27 published benchmarks in evals/benchmarks.test.ts. All currently pass within ±25 % on common workloads and ±50 % on edge cases. The categories are (a representative check is sketched after the list):
- Parameter counts. Llama 3 8B / 70B / 405B[4], dense formula within 5 %.
- Training memory. 1B / 7B / 13B / 70B / 175B at the Ultra-Scale Playbook tabulated values[2].
- Activation memory. Korthikanti formula at seq = 1k, 8k, 32k with and without checkpointing[6].
- KV cache. Llama 3 70B GQA at batch=16 seq=8k → 43 GB.
- Ridge points. H100 FP8 ≈ 590 FLOP/byte vs vendor ≈ 600[12].
- Decode throughput. Llama 3 70B FP8 batch=1 on H100, 12–18 tok/s, against published ~14.
- Time-to-train. GPT-3 175B on 1024 A100s[13]: model says 28 days, paper reports ~34; gap is the unmodeled overhead band.
- Pipeline bubble. 1F1B and interleaved at P=8, m=32, v=2 against playbook tables.
- NVLink-domain crossing penalty. TP=8 → TP=16 inter-node, 40 % vs playbook ≈ 43 %.
- Bottleneck classification. Decode at batch=1 should always classify as memory-bound; prefill at seq ≥ 2k as compute-bound.
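One pinned check might look like the following, assuming a vitest-style runner and the kvCacheBytes sketch from §2.4; the import path and test layout are hypothetical, not the contents of evals/benchmarks.test.ts.

```ts
// Shape of one pinned check; assumes a vitest-style runner. The import path and
// helper are hypothetical, not the actual evals/benchmarks.test.ts.
import { describe, expect, it } from 'vitest';
import { kvCacheBytes } from '../lib/sim/memory'; // hypothetical location of the §2.4 sketch

describe('KV cache', () => {
  it('Llama 3 70B GQA, batch=16, seq=8k, FP16 is ≈ 43 GB', () => {
    const predicted = kvCacheBytes(80, 16, 8192, 8, 128, 2);
    const reference = 43e9;
    expect(Math.abs(predicted - reference) / reference).toBeLessThan(0.25); // ±25 % band
  });
});
```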
7. What the model does not capture
Roofline is a first-order model. The omissions below are deliberate, but they are also load-bearing — if your question depends on any of them, the simulator is the wrong tool.
- Communication overlap with compute beyond the single tunable overlap_fraction. We do not model per-layer prefetch granularity, gradient bucketing, or the difference between FSDP-1 and FSDP-2 schedulers.
- Goodput, MTBF, and hot-spare overhead. Real clusters lose 5–15 % of capacity to failures and checkpoint restarts. SemiAnalysis[11] covers this layer; we do not.
- Storage layer. Checkpoint write bandwidth, dataset shuffle I/O, and parallel filesystem cost are excluded entirely.
- Cooling efficiency variation. We use a single PUE constant. Liquid-cooled GB200 racks run lower; air-cooled retrofit halls run higher.
- Network topology beyond NVLink-domain. We assume a fat-tree-equivalent inter-node fabric with no congestion. Real clusters have rail-aligned topologies and bandwidth taper at the spine.
- Closed-model architectures. Specs for Claude, GPT-5, and Gemini are estimates inferred from public discussion (see estimated_arch: true in lib/data/models.ts). The UI marks these with “(est.)”; treat their outputs as directional only.
8. Sources
- Williams, S., Waterman, A., Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM, 2009. https://dl.acm.org/doi/10.1145/1498765.1498785 · verified 2026-04-30
- Nanotron / Hugging Face. The Ultra-Scale Playbook. https://huggingface.co/spaces/nanotron/ultrascale-playbook · verified 2026-04-24
- Loshchilov, I., Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. https://arxiv.org/abs/1711.05101 · verified 2026-04-30
- Meta AI. The Llama 3 Herd of Models. 2024. https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ · verified 2026-04-24
- Rajbhandari, S. et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20. https://arxiv.org/abs/1910.02054 · verified 2026-04-30
- Korthikanti, V. et al. Reducing Activation Recomputation in Large Transformer Models. MLSys 2023. https://arxiv.org/abs/2205.05198 · verified 2026-04-30
- Patarasuk, P., Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 2009. https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf · verified 2026-04-30
- Narayanan, D. et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC21. https://arxiv.org/abs/2104.04473 · verified 2026-04-30
- NVIDIA. GB200 NVL72. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ · verified 2026-04-29
- Uptime Institute. Global Data Center Survey 2024 (PUE). https://uptimeinstitute.com/resources/research/annual-survey-2024 · verified 2026-04-30 (figure cited; survey access gated)
- SemiAnalysis. How Much Do GPU Clusters Really Cost? (ClusterMAX). https://newsletter.semianalysis.com/p/how-much-do-gpu-clusters-really-cost · verified 2026-04-30
- NVIDIA. H100 Tensor Core GPU Datasheet. https://www.nvidia.com/en-us/data-center/h100/ · verified 2026-04-24
- Brown, T. et al. Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165 · verified 2026-04-24
- NVIDIA. H200 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h200/ · verified 2026-04-24
- NVIDIA. HGX B200 / B300 Platform. https://www.nvidia.com/en-us/data-center/hgx/ · verified 2026-04-29
- AMD. Instinct MI300X Accelerator. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html · verified 2026-04-24
- AMD. Instinct MI325X Accelerator. https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html · verified 2026-04-29
- Google Cloud. TPU v5p Training. https://cloud.google.com/tpu/docs/v5p-training · verified 2026-04-24
- Google Cloud. TPU Trillium (v6e). https://cloud.google.com/tpu/docs/v6e · verified 2026-04-29
- AWS. Trainium 2 / Trn2 Instances. https://aws.amazon.com/ec2/instance-types/trn2/ · verified 2026-04-29
- DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. https://arxiv.org/abs/2412.19437 · verified 2026-04-24
- DeepSeek-AI. DeepSeek-R1 Technical Report. 2025. https://arxiv.org/abs/2501.12948 · verified 2026-04-29
- Alibaba (Qwen team). Qwen 3 Release. 2025. https://qwenlm.github.io/blog/qwen3/ · verified 2026-04-29
- Mistral AI. Mistral Large 2. 2024. https://mistral.ai/news/mistral-large-2407 · verified 2026-04-29
- Mistral AI. Mixtral 8×22B. 2024. https://mistral.ai/news/mixtral-8x22b/ · verified 2026-04-24
- Meta AI. Llama 4 Multimodal Intelligence. 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ · verified 2026-04-29
v0.2 · data last verified 2026-04-30 · first-order model; under-claims accuracy by design.