methodology · last updated 2026-04-30

The equations behind Roofline.

A first-order analytical model of training and serving large transformers on modern accelerators. Useful as a mental model; not a substitute for measured runs. This appendix records the equations we use, the constants we assume, and the failure modes we know about.

1. The roofline model

The roofline model[1] plots achievable throughput against arithmetic intensity. It separates a workload into the two regimes that a single accelerator can be in: bound by peak compute, or bound by memory bandwidth. The crossover point between the two is a function of the hardware alone, not the workload.

Arithmetic intensity (AI) is FLOPs per byte moved between HBM and SRAM. The ridge point (RP) is the AI at which the accelerator transitions from memory-bound to compute-bound.
AI = FLOPs / bytes_moved
RP = peak_FLOPS / peak_HBM_bandwidth
throughput = min( peak_FLOPS, AI × peak_HBM_bandwidth ) × MFU

MFU (model FLOPs utilization) is the empirical derate that captures kernel-level inefficiency: dispatch overhead, non-tensor ops, mixed-precision conversions, partial waves. We default to 0.45 for training, 0.35 for decode, and 0.55 for prefill on H100-class hardware[2]. The slider exposes this for users matching a measured run.

Implementation: calculateRidgePoint, calculateArithmeticIntensity, achievedThroughput in lib/sim/roofline.ts.
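
A minimal sketch of the shape of these functions, with illustrative signatures (the real exports may differ):

    // Ridge point in FLOP/byte: the AI above which the device is compute-bound.
    function calculateRidgePoint(peakFlops: number, peakHbmBw: number): number {
      return peakFlops / peakHbmBw;
    }

    // Arithmetic intensity of a workload in FLOP/byte.
    function calculateArithmeticIntensity(flops: number, bytesMoved: number): number {
      return flops / bytesMoved;
    }

    // Roofline throughput in FLOP/s, derated by MFU.
    function achievedThroughput(ai: number, peakFlops: number, peakHbmBw: number, mfu: number): number {
      return Math.min(peakFlops, ai * peakHbmBw) * mfu;
    }

    // e.g. H100 FP8 (§6): calculateRidgePoint(1.979e15, 3.35e12) ≈ 590 FLOP/byte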

2. Parameters and memory

We follow the parameter-counting formula tabulated in the Ultra-Scale Playbook[2]. For a dense decoder-only transformer:

N = h · v + L · ( 12 h² + 13 h ) + 2 h

where h is hidden dim, v is vocab size, and L is layer count. Because real models report a published parameter count, our implementation uses params_b from the spec directly and reserves the formula above as a sanity check — see paramCount in lib/sim/memory.ts.
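
The sanity check is a one-liner; a sketch with argument names of our choosing:

    // Dense decoder-only parameter count: embeddings + L blocks + final layernorm.
    function paramCount(h: number, v: number, L: number): number {
      return h * v + L * (12 * h * h + 13 * h) + 2 * h;
    }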

2.1 Training memory footprint

Mixed-precision training with the AdamW optimizer[3] maintains the following per parameter:

bytes/param = 2 (BF16 weight) + 4 (FP32 master weight) + 4 (FP32 gradient)
            + 4 (FP32 momentum) + 4 (FP32 variance)
            = 18 bytes/param
with FP32 grad accumulation: +2 → 20 bytes/param

For Llama 3 70B[4], this is 70e9 × 18 ≈ 1.26 TB of optimizer-and-weight state — over fifteen H100 80 GB GPUs of capacity, before any activations. The Ultra-Scale Playbook reports 1.40 TB for the same model with FP32 grad accumulation; our eval (evals/benchmarks.test.ts § 2) passes within 5 %.

2.2 ZeRO sharding

ZeRO[5] partitions the optimizer state, gradients, and weights across data-parallel ranks. Per-GPU memory drops to:

ZeRO-1 → weights + grads + opt_state / N_dp
ZeRO-2 → weights + (grads + opt_state) / N_dp
ZeRO-3 → ( weights + grads + opt_state ) / N_dp   (≡ FSDP)

FSDP is functionally ZeRO-3 with the parameters reshaped into flat shards. We model it as such in memoryPerGpuFsdp at lib/sim/parallelism.ts.
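
A sketch of the three stages, reusing the 18-bytes/param split from §2.1 (shapes are illustrative):

    // Per-GPU bytes of weights + gradients + optimizer state under ZeRO.
    function memoryPerGpuZero(params: number, nDp: number, stage: 1 | 2 | 3): number {
      const weights = 2 * params;   // BF16 weights
      const grads = 4 * params;     // FP32 gradients
      const optState = 12 * params; // FP32 master weights + momentum + variance
      switch (stage) {
        case 1: return weights + grads + optState / nDp;
        case 2: return weights + (grads + optState) / nDp;
        case 3: return (weights + grads + optState) / nDp; // ≡ FSDP
      }
    }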

2.3 Activation memory

The Korthikanti et al. formula[6], in BF16:

m_act = L · seq · bs · h · ( 34 + 5 · n_heads · seq / h )

The constant 34 captures the attention QKV projections, attention output, MLP, layernorms, and dropout; the 5 · n_heads · seq / h term is the attention-score blowup that makes long-context training painful. In practice, nearly every long-context training run uses full activation checkpointing[6], which trades roughly 70 % of activation memory for one extra forward pass per step.
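
In code, with checkpointing applied as a flat factor (a sketch; the 70 % saving above is a rough figure, not a measured constant):

    // Korthikanti activation bytes; the constants 34 and 5 already include the
    // 2-byte BF16 element size.
    function activationBytes(L: number, seq: number, bs: number, h: number,
                             nHeads: number, fullCheckpointing = false): number {
      const raw = L * seq * bs * h * (34 + (5 * nHeads * seq) / h);
      return fullCheckpointing ? raw * (1 - 0.7) : raw;
    }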

2.4 KV cache

kv_bytes = 2 · L · batch · seq · kv_heads · head_dim · bytes_per_elem

The factor of 2 holds K and V separately. Modern open-weight models use grouped-query attention (GQA): Llama 3 70B has 64 attention heads but 8 KV heads[4], cutting cache size by 8×. At batch=16, seq=8192, FP16, this is approximately 43 GB — checked at evals/benchmarks.test.ts § 3.
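
A sketch of the calculation, with the Llama 3 70B numbers (L = 80, head_dim = 128) as the usage example:

    // KV-cache bytes: K and V held separately, hence the leading 2.
    function kvCacheBytes(L: number, batch: number, seq: number, kvHeads: number,
                          headDim: number, bytesPerElem: number): number {
      return 2 * L * batch * seq * kvHeads * headDim * bytesPerElem;
    }

    // kvCacheBytes(80, 16, 8192, 8, 128, 2) ≈ 4.29e10 bytes ≈ 43 GB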

3. Communication and parallelism

3.1 Ring all-reduce

The bandwidth-optimal collective, used for every DP/FSDP gradient sync inside an NVLink domain:

t_ring = ( 2 (N - 1) / N ) · ( message_size / per_link_bandwidth )

See ringAllReduceTime in lib/sim/collectives.ts. We ignore the alpha (latency) term because for the GB-scale messages of LLM gradient sync, bandwidth dominates by three orders of magnitude.
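
As code (a sketch; the real ringAllReduceTime may take a hardware spec rather than raw numbers):

    // Bandwidth term of ring all-reduce; the alpha (latency) term is omitted.
    function ringAllReduceTime(messageBytes: number, nRanks: number, perLinkBw: number): number {
      return ((2 * (nRanks - 1)) / nRanks) * (messageBytes / perLinkBw);
    }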

3.2 Hierarchical all-reduce

When data parallelism crosses node boundaries, the runtime decomposes the all-reduce into intra-node reduce-scatter on NVLink, inter-node all-reduce on InfiniBand, and intra-node all-gather on NVLink[7]:

t = t_intra_RS + t_inter_AR + t_intra_AG
t_inter_AR = ( 2 (n_nodes - 1) / n_nodes ) · ( msg / gpus_per_node ) / IB_BW

Inter-node bandwidth is typically 10–40× lower than intra-node NVLink, so the inter-node phase dominates at scale. Implementation: hierarchicalAllReduceTime.
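
A sketch of the decomposition, assuming reduce-scatter and all-gather each cost half a ring all-reduce:

    // Intra-node RS on NVLink, inter-node AR on InfiniBand, intra-node AG on NVLink.
    function hierarchicalAllReduceTime(msgBytes: number, nNodes: number, gpusPerNode: number,
                                       nvlinkBw: number, ibBw: number): number {
      const intraRS = ringAllReduceTime(msgBytes, gpusPerNode, nvlinkBw) / 2;
      const interAR = ((2 * (nNodes - 1)) / nNodes) * (msgBytes / gpusPerNode / ibBw);
      const intraAG = intraRS; // all-gather mirrors the reduce-scatter
      return intraRS + interAR + intraAG;
    }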

3.3 Pipeline bubble

For 1F1B / AFAB schedules with pipeline depth P and m microbatches per optimizer step[8]:

r_bubble      = (P - 1) / m          (1F1B / AFAB)
r_bubble_intl = (P - 1) / (v · m)    (interleaved, v stages/GPU)

See pipelineBubbleFraction. Relative to AFAB, 1F1B keeps the same bubble fraction but caps in-flight activation memory; the interleaved variant pays extra communication for a v-fold reduction in the bubble.
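
Both schedules reduce to one expression (v = 1 recovers plain 1F1B/AFAB):

    // Fraction of the step lost to the pipeline bubble.
    function pipelineBubbleFraction(p: number, microbatches: number, v = 1): number {
      return (p - 1) / (v * microbatches);
    }

    // P=8, m=32 → 7/32 ≈ 0.22; interleaved with v=2 → 7/64 ≈ 0.11 (§6)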

3.4 Tensor parallelism beyond the NVLink domain

When tensor-parallel degree exceeds the per-node NVLink domain (e.g. TP > 8 on HGX H100), the all-reduce on every transformer block traverses InfiniBand. The Ultra-Scale Playbook[2] reports a roughly 40 % step-time penalty under this regime; we encode this as a fixed 1.40× multiplier rather than attempt to model the full per-block latency. This is empirical, not derived. Validated at evals/benchmarks.test.ts § 9.

3.5 The GB200 NVL72 case

The GB200 NVL72 packages 72 Blackwell GPUs in a single NVLink-5 domain[9]. This raises the cliff in §3.4 from TP=8 to TP=72 — a quantitative change with qualitative consequences for very large models. The simulator reads nvlink_domain from each GPU spec and applies the inter-node penalty only when TP exceeds it.

4. Throughput and time-to-train

For a dense or MoE decoder:

flops_per_token = 6 · N           (dense)
flops_per_token ≈ 6 · N_active    (MoE)
t_compute = ( tokens_per_step · flops_per_token ) / ( N_gpus · MFU · peak_FLOPS )
t_comm = t_DP_all_reduce + bubble_fraction · t_compute
t_step = max(t_compute, t_comm) + (1 - overlap) · min(t_compute, t_comm)

The factor of 6 is forward (2×) plus backward (≈ 4×). For MoE, only the active-expert FLOPs contribute per token; we ignore router overhead, which is on the order of one percent of total FLOPs and within the model's noise budget. Default compute–communication overlap is 0.8, reflecting backward prefetch in modern FSDP stacks[2].
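
The composition in code (function names are ours; lib/sim/workloads.ts may organize this differently):

    // Cluster-wide compute time for one optimizer step.
    function computeTime(tokensPerStep: number, flopsPerToken: number,
                         nGpus: number, mfu: number, peakFlops: number): number {
      return (tokensPerStep * flopsPerToken) / (nGpus * mfu * peakFlops);
    }

    // The smaller of compute/comm hides behind the larger, except for the
    // (1 - overlap) exposed fraction.
    function stepTime(tCompute: number, tComm: number, overlap = 0.8): number {
      return Math.max(tCompute, tComm) + (1 - overlap) * Math.min(tCompute, tComm);
    }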

timeToTrain in lib/sim/workloads.ts integrates t_step over the planned token budget. Empirically this underestimates published wall-clock time by 10–20 %; the gap is unmodeled overhead — batch warmup, restart-from-checkpoint, stragglers — discussed in §6.

5. Cost model

We model owned-fleet TCO, not cloud rental rates. The capex-to-TCO multiplier is:

TCO_3yr ≈ capex_gpu · 1.35
1.35 ≈ 1 + 3 · (power_$ / capex) + 3 · (cooling_$ / capex)
         + 3 · (real_estate_$ / capex) + 3 · 0.04   (ops headcount, 4 %/yr of capex)

Power is priced at $0.07/kWh continuous, scaled by PUE = 1.45[10]; cooling capex at $1.50/W of GPU TDP; real estate at $1M/MW; staff at 4 % of capex per year, which we corroborate against the SemiAnalysis ClusterMAX accounting[11] of roughly $200K per cluster engineer at one engineer per 1,000–2,000 GPUs of modern hardware.
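
The multiplier in code (a sketch; every input is an annualized cost expressed as a fraction of GPU capex):

    const YEARS = 3;
    // 3-year owned-fleet TCO from annualized cost fractions of GPU capex.
    function tco3yr(capexGpu: number, powerFrac: number, coolingFrac: number,
                    realEstateFrac: number, opsFrac = 0.04): number {
      const multiplier = 1 + YEARS * (powerFrac + coolingFrac + realEstateFrac + opsFrac);
      return capexGpu * multiplier; // ≈ capex · 1.35 with the §5 defaults
    }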

ClusterMAX is the better source for what production tenants actually pay — they price rented goodput with mean-time-between-failure accounting baked in, which is the layer above ours. Our numbers are useful for capacity planning; theirs are useful for procurement decisions.

6. Reality checks

The simulator's outputs are pinned against 27 published benchmarks in evals/benchmarks.test.ts. All currently pass within ±25 % on common workloads and ±50 % on edge cases. The categories are:

  • Parameter counts. Llama 3 8B / 70B / 405B[4], dense formula within 5 %.
  • Training memory. 1B / 7B / 13B / 70B / 175B at the Ultra-Scale Playbook tabulated values[2].
  • Activation memory. Korthikanti formula at seq = 1k, 8k, 32k with and without checkpointing[6].
  • KV cache. Llama 3 70B GQA at batch=16 seq=8k → 43 GB.
  • Ridge points. H100 FP8 ≈ 590 FLOP/byte vs vendor ≈ 600[12].
  • Decode throughput. Llama 3 70B FP8 batch=1 on H100, 12–18 tok/s, against published ~14.
  • Time-to-train. GPT-3 175B on 1024 A100s[13]: model says 28 days, paper reports ~34; gap is the unmodeled overhead band.
  • Pipeline bubble. 1F1B and interleaved at P=8, m=32, v=2 against playbook tables.
  • NVLink-domain crossing penalty. TP=8 → TP=16 inter-node, 40 % vs playbook ≈ 43 %.
  • Bottleneck classification. Decode at batch=1 should always classify as memory-bound; prefill at seq ≥ 2k as compute-bound.

7. What the model does not capture

Roofline is a first-order model. The omissions below are deliberate, but they are also load-bearing — if your question depends on any of them, the simulator is the wrong tool.

  1. Communication overlap with compute beyond the single tunable overlap_fraction. We do not model per-layer prefetch granularity, gradient bucketing, or the difference between FSDP-1 and FSDP-2 schedulers.
  2. Goodput, MTBF, and hot-spare overhead. Real clusters lose 5–15 % of capacity to failures and checkpoint restarts. SemiAnalysis[11] covers this layer; we do not.
  3. Storage layer. Checkpoint write bandwidth, dataset shuffle I/O, and parallel filesystem cost are excluded entirely.
  4. Cooling efficiency variation. We use a single PUE constant. Liquid-cooled GB200 racks run lower; air-cooled retrofit halls run higher.
  5. Network topology beyond NVLink-domain. We assume a fat-tree-equivalent inter-node fabric with no congestion. Real clusters have rail-aligned topologies and bandwidth taper at the spine.
  6. Closed-model architectures. Specs for Claude, GPT-5, and Gemini are estimates inferred from public discussion (see lib/data/models.ts estimated_arch: true). The UI marks these with “(est.)”; treat their outputs as directional only.

8. Sources

  1. Williams, S., Waterman, A., Patterson, D. Roofline: An Insightful Visual Performance Model for Multicore Architectures. CACM, 2009. https://dl.acm.org/doi/10.1145/1498765.1498785 · verified 2026-04-30
  2. Nanotron / Hugging Face. The Ultra-Scale Playbook. https://huggingface.co/spaces/nanotron/ultrascale-playbook · verified 2026-04-24
  3. Loshchilov, I., Hutter, F. Decoupled Weight Decay Regularization. ICLR 2019. https://arxiv.org/abs/1711.05101 · verified 2026-04-30
  4. Meta AI. The Llama 3 Herd of Models. 2024. https://ai.meta.com/research/publications/the-llama-3-herd-of-models/ · verified 2026-04-24
  5. Rajbhandari, S. et al. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. SC20. https://arxiv.org/abs/1910.02054 · verified 2026-04-30
  6. Korthikanti, V. et al. Reducing Activation Recomputation in Large Transformer Models. MLSys 2023. https://arxiv.org/abs/2205.05198 · verified 2026-04-30
  7. Patarasuk, P., Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 2009. https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf · verified 2026-04-30
  8. Narayanan, D. et al. Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM. SC21. https://arxiv.org/abs/2104.04473 · verified 2026-04-30
  9. NVIDIA. GB200 NVL72. https://www.nvidia.com/en-us/data-center/gb200-nvl72/ · verified 2026-04-29
  10. Uptime Institute. Global Data Center Survey 2024 (PUE). https://uptimeinstitute.com/resources/research/annual-survey-2024 · verified 2026-04-30 (figure cited; survey access gated)
  11. SemiAnalysis. How Much Do GPU Clusters Really Cost? (ClusterMAX). https://newsletter.semianalysis.com/p/how-much-do-gpu-clusters-really-cost · verified 2026-04-30
  12. NVIDIA. H100 Tensor Core GPU Datasheet. https://www.nvidia.com/en-us/data-center/h100/ · verified 2026-04-24
  13. Brown, T. et al. Language Models are Few-Shot Learners. NeurIPS 2020. https://arxiv.org/abs/2005.14165 · verified 2026-04-24
  14. NVIDIA. H200 Tensor Core GPU. https://www.nvidia.com/en-us/data-center/h200/ · verified 2026-04-24
  15. NVIDIA. HGX B200 / B300 Platform. https://www.nvidia.com/en-us/data-center/hgx/ · verified 2026-04-29
  16. AMD. Instinct MI300X Accelerator. https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html · verified 2026-04-24
  17. AMD. Instinct MI325X Accelerator. https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html · verified 2026-04-29
  18. Google Cloud. TPU v5p Training. https://cloud.google.com/tpu/docs/v5p-training · verified 2026-04-24
  19. Google Cloud. TPU Trillium (v6e). https://cloud.google.com/tpu/docs/v6e · verified 2026-04-29
  20. AWS. Trainium 2 / Trn2 Instances. https://aws.amazon.com/ec2/instance-types/trn2/ · verified 2026-04-29
  21. DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. https://arxiv.org/abs/2412.19437 · verified 2026-04-24
  22. DeepSeek-AI. DeepSeek-R1 Technical Report. 2025. https://arxiv.org/abs/2501.12948 · verified 2026-04-29
  23. Alibaba (Qwen team). Qwen 3 Release. 2025. https://qwenlm.github.io/blog/qwen3/ · verified 2026-04-29
  24. Mistral AI. Mistral Large 2. 2024. https://mistral.ai/news/mistral-large-2407 · verified 2026-04-29
  25. Mistral AI. Mixtral 8×22B. 2024. https://mistral.ai/news/mixtral-8x22b/ · verified 2026-04-24
  26. Meta AI. Llama 4 Multimodal Intelligence. 2025. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ · verified 2026-04-29

v0.2 · data last verified 2026-04-30 · first-order model; under-claims accuracy by design.