Llama 3 70B · b=1 · FP8 · 1,024 × NVIDIA
memory-bound at 22.1K tok/s · $3.66/GPU-hr · 71 GB / 81.92 TB

You're reading ~70 GB of weights from HBM per token to do ~420 GFLOPs of math. The tensor cores sit at 1% utilization — they're fast enough; they just can't get fed. To earn more throughput you need HBM bandwidth, bigger batches, or smarter attention — not more FLOPS. That's the wall.
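The roofline arithmetic behind that claim fits in a few lines. A minimal sketch using the figures above (~420 GFLOPs and ~70 GB moved per token, 2.0 PFLOPS FP8 peak, 3.35 TB/s HBM):

```python
# Roofline check for b=1 FP8 decode on one H100 SXM, using the
# numbers quoted in the text (not measured values).
PEAK_FP8_FLOPS = 2.0e15      # tensor-core peak, FLOP/s
HBM_BW = 3.35e12             # HBM bandwidth, bytes/s

flops_per_token = 420e9      # math per decoded token
bytes_per_token = 70e9       # full FP8 weight read per token

ai = flops_per_token / bytes_per_token          # arithmetic intensity, FLOP/byte
ridge = PEAK_FP8_FLOPS / HBM_BW                 # ridge point, ~597 FLOP/byte
attainable = min(PEAK_FP8_FLOPS, ai * HBM_BW)   # roofline: lower of the two roofs
util = attainable / PEAK_FP8_FLOPS

print(f"AI = {ai:.0f} FLOP/byte (ridge ≈ {ridge:.0f}) -> "
      f"{'memory' if ai < ridge else 'compute'}-bound, "
      f"{util:.1%} tensor-core utilization")
```

At 6 FLOP/byte against a ~597 FLOP/byte ridge, the attainable rate is ~20 TFLOPS, which is where the 1% utilization figure comes from.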

[roofline chart · x: arithmetic intensity (FLOP/byte) · y: throughput (FLOPS) · Peak FP8: 2.0 PFLOPS · HBM roof: 3.35 TB/s · ridge AI ≈ 591 FLOP/byte · marker: Llama 3 70B · b=1 · FP8]
§ 01

Inside the system

where the bits are right now
[die map · NVIDIA H100 SXM · 814 mm² · 80B transistors · 8 × HBM3 stacks (10 GB each) · NVLink 900 GB/s · PCIe Gen5 64 GB/s · decode, batch=1]
HBM 3.35 / 3.35 TB/s
SMs 1 / 132 active
NVLink idle
reading 70 GB of weights per token · SMs starving. The HBM pipes run full and amber; the tensor cores sit mostly dark. This is the memory wall in pixel form.
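The starvation falls straight out of the bandwidth math. A sketch of the per-token ceiling, ignoring KV-cache traffic and kernel overheads:

```python
# Why the SMs go dark at batch=1: each decoded token re-reads the full
# weight set, so tokens/s is capped by HBM streaming time, not compute.
HBM_BW = 3.35e12        # bytes/s of HBM bandwidth feeding the weights
WEIGHT_BYTES = 70e9     # FP8 Llama 3 70B

latency_s = WEIGHT_BYTES / HBM_BW    # time just to stream the weights once
ceiling_tok_s = 1 / latency_s        # upper bound on decode rate per replica

print(f"{latency_s * 1e3:.1f} ms/token -> ≤ {ceiling_tok_s:.0f} tok/s per replica")
# Real decode lands below this ceiling once KV-cache reads and
# launch/sync overheads are counted.
```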
§ 02

What it costs

dollars, rates, and counterfactuals
greenfield on-prem · 3-yr amortized
GPUs · $30.7M
Interconnect · $4.6M
Chassis & CPUs · $5.1M
Power infra · $7.2M
Cooling · $4.3M
Real estate · $15.6M
Fabric & storage · $5.4M
Total capex
$72.9M
3-yr TCO: $98.4M
1,024 × NVIDIA H100 SXM
$ / GPU-hour
$3.66/hr
3-yr amortized
incl. power + ops
$ / M tokens
$47 / M tok
decode at batch=1
continuous batching: ~$2.36
≈ 0.97 Gulfstream G700s ($75M each) · ≈ 3 Manhattan penthouses ($25M each)
capex indicative · power $0.08/kWh · 4%/yr ops · PUE 1.45 · greenfield on-prem · batch=1 decode (continuous batching drops $/M tokens ~20×)
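Both headline rates can be reproduced from the TCO. A back-of-envelope sketch assuming straight-line 3-yr amortization over all 1,024 GPUs (a simplification; the page's own cost model may differ in detail):

```python
# $/GPU-hr and $/M tokens from the figures above.
TCO = 98.4e6            # 3-yr total cost of ownership, $
GPUS = 1024
HOURS = 3 * 365 * 24    # amortization window, hours

dollars_per_gpu_hr = TCO / (GPUS * HOURS)             # ≈ $3.66/GPU-hr

cluster_tok_s = 22.1e3                                # b=1 decode throughput
cluster_dollars_s = GPUS * dollars_per_gpu_hr / 3600  # burn rate, $/s
dollars_per_mtok = cluster_dollars_s / cluster_tok_s * 1e6   # ≈ $47/M tok
# The page's ~20x continuous-batching factor takes this to roughly $2.4/M tok.

print(f"${dollars_per_gpu_hr:.2f}/GPU-hr · ${dollars_per_mtok:.0f}/M tok")
```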
§ 03

Where it lives

watts, floor space, cooling
1 GPU
NVIDIA H100 SXM
700 W
80 GB HBM
1 tray
8 GPUs · HGX board
5.6 kW
NVIDIA × 8
1 rack
NVL-class, 72 GPUs
50 kW
liquid-cooled
Your cluster
15 racks
1,024 GPUs · 717 kW
IT load
Site
Building + substation
data hall · power · chillers
1.04 MW
PUE 1.45 · ~4,000 sqft
IT load · 717 kW · ≈ 611 US homes' continuous draw
Total facility · 1.04 MW · ≈ peak draw of a Walmart Supercenter
Annual energy · 9.1 GWh/yr · ≈ 0.23× a small college campus
Heat rejected · 1.04 MW · ≈ 1,642 gal/day evaporated at cooling tower
Training run energy · 201K MWh over 8 days · ≈ 1.059% of Iceland's annual electricity
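The power chain is three multiplies. A sketch treating the 700 W GPU draw as the whole IT load, as the 717 kW figure implies (real IT load would add CPUs, fabric, and storage):

```python
# GPU -> IT load -> facility -> annual energy, using the page's figures.
GPU_W = 700      # per-GPU draw, W
GPUS = 1024
PUE = 1.45       # facility power / IT power

it_kw = GPU_W * GPUS / 1e3               # ≈ 717 kW IT load
facility_mw = it_kw * PUE / 1e3          # ≈ 1.04 MW at the wall
annual_gwh = facility_mw * 8760 / 1e3    # ≈ 9.1 GWh/yr if run flat out

print(f"IT {it_kw:.0f} kW -> facility {facility_mw:.2f} MW -> {annual_gwh:.1f} GWh/yr")
```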
that's training. now try Serve →