
THE CONTENTS OF THIS PAGE WERE GENERATED WITH AI

AI Design Constraints

All numbers are converted to GB/s of sustained throughput, giving a single axis of comparison. For latency-only items, effective throughput is derived from a typical access size (e.g., a 4 KB read).
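A minimal sketch of those conversions, assuming illustrative access sizes; the helper names are mine, not from any library:

```python
# Back-of-envelope conversions used throughout the table (a sketch;
# the access sizes are illustrative assumptions, not measurements).

def line_rate_gbs(gbps: float) -> float:
    """Network line rate in Gbps -> GB/s (divide by 8 bits/byte)."""
    return gbps / 8

def latency_to_gbs(access_bytes: float, latency_s: float) -> float:
    """Effective throughput of a latency-bound access:
    bytes moved per access / time per access, in GB/s."""
    return access_bytes / latency_s / 1e9

print(line_rate_gbs(400))            # InfiniBand NDR 400G: 50.0 GB/s
print(latency_to_gbs(4096, 16e-6))   # NVMe 4K random @ ~16 us: ~0.256 GB/s
print(latency_to_gbs(4096, 2e-3))    # HDD 4K random @ ~2 ms seek: ~0.002 GB/s
```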

The Table

| Layer | GB/s | Category | Source |
| --- | --- | --- | --- |
| B200 HBM3e | 8,000 | GPU Memory | NVIDIA B200 Datasheet — 192 GB HBM3e, 8 TB/s bandwidth |
| H100 SXM HBM3 | 3,350 | GPU Memory | NVIDIA H100 Product Page — 80 GB HBM3, 3.35 TB/s bandwidth |
| A100 80GB HBM2e | 2,000 | GPU Memory | NVIDIA A100 Datasheet (PDF) — 80 GB HBM2e, 2,039 GB/s (SXM) |
| NVLink 5 (B200, per GPU) | 1,800 | GPU Interconnect | NVIDIA B200 Datasheet — 1.8 TB/s bidirectional |
| NVLink 4 (H100, per GPU) | 900 | GPU Interconnect | NVIDIA Hopper Architecture In-Depth — 900 GB/s bidirectional |
| NVLink 3 (A100, per GPU) | 600 | GPU Interconnect | NVIDIA A100 Architecture Whitepaper (PDF) — 600 GB/s bidirectional |
| L1 cache (per core, x86) | ~500 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers — 0.5 ns per reference; multiple 64 B loads per cycle sustain on the order of 500 GB/s (a single serial 64 B / 0.5 ns would be only ~128 GB/s) |
| DDR5 server memory (8-ch) | ~300 | System Memory | Typical 8-channel DDR5-4800 to DDR5-5600 server config; 38.4–44.8 GB/s/channel × 8 ≈ 307–358 GB/s peak |
| L2 cache (per core, x86) | ~100–200 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers — ~7 ns per reference; pipelined access sustains far more than a serial 64 B / 7 ns |
| PCIe Gen5 x16 (duplex) | 128 | Bus | Rambus PCIe 5.0 Overview — 32 GT/s × 16 lanes, 128 GB/s aggregate duplex |
| PCIe Gen5 x16 (unidirectional) | 64 | Bus | Rambus PCIe 5.0 Overview — 64 GB/s per direction |
| InfiniBand NDR 400G | 50 | Network (inter-node) | NVIDIA DGX SuperPOD Cabling Guide — NDR Overview — 400 Gbps = 50 GB/s |
| NVMe SSD Gen5 sequential | ~14 | Storage | WD_BLACK SN8100 Press Release — up to 14.9 GB/s read |
| 100GbE network | 12.5 | Network (datacenter) | 100 Gbps ÷ 8 = 12.5 GB/s (line rate) |
| NVMe SSD Gen4 sequential | ~7 | Storage | Typical Gen4 x4 NVMe — 7 GB/s sequential read |
| 25GbE network | 3.1 | Network (NIC) | 25 Gbps ÷ 8 ≈ 3.1 GB/s (line rate) |
| Protobuf parse throughput | ~1 | Serialization | Estimated: 1 KB in ~1 μs; see Colin Scott's Latency Numbers for methodology |
| NVMe SSD random 4K reads | ~0.25 | Storage (random) | Derived: 4 KB / ~16 μs per read at queue depth 1 ≈ 0.25 GB/s, scaling with queue depth; Jeff Dean Latency Numbers (updated SSD random read ~16 μs) |
| HDD sequential | ~0.2 | Storage | Typical 7200 RPM HDD sequential throughput |
| JSON parse throughput | ~0.1 | Serialization | Estimated: 1 KB in ~10 μs; Beyond Latency Numbers Every Programmer Should Know |
| Single TCP flow, cross-region | ~0.03–0.25 | Network (WAN) | Bandwidth-delay product limited: window_size / RTT; 40 ms RTT with typical 1–10 MB windows (sketch below) |
| HDD random 4K reads | ~0.002 | Storage (random) | Derived: 4 KB / ~2–10 ms per seek ≈ 0.0004–0.002 GB/s; Jeff Dean Latency Numbers |
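To make the WAN row concrete, a minimal sketch of the bandwidth-delay-product limit; the 40 ms RTT and the 1 MiB / 8 MiB windows are illustrative assumptions, not measurements:

```python
# A single TCP flow cannot exceed window_size / RTT, regardless of link rate.

def bdp_limited_gbs(window_bytes: int, rtt_s: float) -> float:
    """Max throughput of one flow, in GB/s, given its window and round-trip time."""
    return window_bytes / rtt_s / 1e9

rtt = 0.040  # ~40 ms cross-region round trip (assumed)
for window in (1 << 20, 1 << 23):  # 1 MiB and 8 MiB windows (assumed)
    print(f"{window >> 20} MiB window: {bdp_limited_gbs(window, rtt):.3f} GB/s")
# 1 MiB window: 0.026 GB/s
# 8 MiB window: 0.210 GB/s
```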

Key Ratios

| Ratio | Value | Architectural Implication |
| --- | --- | --- |
| HBM (H100) vs InfiniBand NDR | 67× | Tensor parallelism stays intra-node; pipeline/data parallelism goes inter-node |
| NVLink 4 (H100) vs InfiniBand NDR | 18× | Same as above; crossing the node boundary costs roughly an order of magnitude |
| NVMe sequential vs HDD random | 7,000× | SSDs changed everything for serving; random access on spinning disk is catastrophic |
| SSD sequential vs JSON parse | 140× | If your hot path deserializes JSON, your serialization format is slower than your storage |
| L1 cache vs main memory (latency) | ~200× | 0.5 ns vs 100 ns per reference; cache-friendly data structures (contiguous arrays > linked lists) dominate performance |
| B200 HBM3e vs H100 HBM3 | 2.4× | Generational bandwidth improvement; keeps tensor cores fed at lower precision |
| NVLink 5 vs NVLink 4 | 2× | Blackwell doubles the intra-node interconnect |
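These ratios follow directly from the GB/s column above; a quick sketch to reproduce them, with values copied from the table:

```python
# Key ratios recomputed from the table's GB/s values (a sketch).
gbs = {
    "H100 HBM3": 3350, "B200 HBM3e": 8000,
    "NVLink 4": 900, "NVLink 5": 1800,
    "InfiniBand NDR": 50,
    "NVMe Gen5 seq": 14, "HDD random": 0.002,
    "JSON parse": 0.1,
}
print(gbs["H100 HBM3"] / gbs["InfiniBand NDR"])  # 67x
print(gbs["NVLink 4"] / gbs["InfiniBand NDR"])   # 18x
print(gbs["NVMe Gen5 seq"] / gbs["HDD random"])  # 7000x
print(gbs["NVMe Gen5 seq"] / gbs["JSON parse"])  # 140x
print(gbs["B200 HBM3e"] / gbs["H100 HBM3"])      # ~2.4x
print(gbs["NVLink 5"] / gbs["NVLink 4"])         # 2x
print(100e-9 / 0.5e-9)  # L1 vs main-memory latency (0.5 ns vs 100 ns): 200x
```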

Notes

Canonical References