All numbers are converted to GB/s of sustained throughput to give a single axis of comparison. For latency-only items, an effective throughput is derived from a typical access size (see the sketch after the table).
| Layer | GB/s | Category | Source |
|---|---|---|---|
| B200 HBM3e | 8,000 | GPU Memory | NVIDIA B200 Datasheet — 192GB HBM3e, 8 TB/s bandwidth |
| H100 SXM HBM3 | 3,350 | GPU Memory | NVIDIA H100 Product Page — 80GB HBM3, 3 TB/s+ bandwidth |
| A100 80GB HBM2e | 2,000 | GPU Memory | NVIDIA A100 Datasheet (PDF) — 80GB HBM2e, 2,039 GB/s (SXM) |
| NVLink 5 (B200, per GPU) | 1,800 | GPU Interconnect | NVIDIA B200 Datasheet — 1.8 TB/s bidirectional |
| NVLink 4 (H100, per GPU) | 900 | GPU Interconnect | NVIDIA Hopper Architecture In-Depth — 900 GB/s bidirectional |
| NVLink 3 (A100, per GPU) | 600 | GPU Interconnect | NVIDIA A100 Architecture Whitepaper (PDF) — 600 GB/s bidirectional |
| L1 cache (per core, x86) | ~500 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers — 0.5ns per ref ≈ ~500 GB/s at cache-line granularity |
| DDR5 server memory (8-ch) | ~300 | System Memory | Typical 8-channel DDR5-4800 to DDR5-5600 server config; 38.4–44.8 GB/s/channel × 8 ≈ 307–358 GB/s theoretical peak |
| L2 cache (per core, x86) | ~100–200 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers — ~7ns per ref |
| PCIe Gen5 x16 (duplex) | 128 | Bus | Rambus PCIe 5.0 Overview — 32 GT/s × 16 lanes, 128 GB/s aggregate duplex |
| PCIe Gen5 x16 (unidirectional) | 64 | Bus | Rambus PCIe 5.0 Overview — 64 GB/s per direction |
| InfiniBand NDR 400G | 50 | Network (inter-node) | NVIDIA DGX SuperPOD Cabling Guide — NDR Overview — 400 Gbps = 50 GB/s |
| NVMe SSD Gen5 sequential | ~14 | Storage | WD_BLACK SN8100 Press Release — up to 14.9 GB/s read |
| 100GbE network | 12.5 | Network (datacenter) | 100 Gbps ÷ 8 = 12.5 GB/s (line rate) |
| NVMe SSD Gen4 sequential | ~7 | Storage | Typical Gen4 x4 NVMe — 7 GB/s sequential read |
| 25GbE network | 3.1 | Network (NIC) | 25 Gbps ÷ 8 = 3.1 GB/s (line rate) |
| Protobuf parse throughput | ~1 | Serialization | Estimated: 1KB in ~1μs; see Colin Scott's Latency Numbers for methodology |
| NVMe SSD random 4K reads | ~0.25 | Storage (random) | Derived: ~16μs per 4KB read at queue depth 1 → 4KB / 16μs ≈ 0.25 GB/s; see Jeff Dean Latency Numbers (updated SSD random read ~16μs) |
| HDD sequential | ~0.2 | Storage | Typical 7200 RPM HDD sequential throughput |
| JSON parse throughput | ~0.1 | Serialization | Estimated: 1KB in ~10μs; Beyond Latency Numbers Every Programmer Should Know |
| Single TCP flow, cross-region | ~0.03–0.25 | Network (WAN) | Bandwidth-delay product limited: window_size / RTT. 40ms RTT with typical window sizes |
| HDD random 4K reads | ~0.002 | Storage (random) | Derived: ~2–10ms seek per 4KB IOP; Jeff Dean Latency Numbers |
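
To make the conversions explicit, here is a minimal sketch in Python of how the latency-only rows reduce to effective GB/s. The access sizes, latencies, and TCP window sizes are the assumed figures from the Source column above, not measurements.

```python
# Effective throughput for latency-bound operations at queue depth 1:
# one access of `access_bytes` completes every `latency_s` seconds.
def effective_gbps(access_bytes: float, latency_s: float) -> float:
    return access_bytes / latency_s / 1e9

print(effective_gbps(4 * 1024, 16e-6))   # NVMe random 4K, ~16us/read  -> ~0.26 GB/s
print(effective_gbps(4 * 1024, 2e-3))    # HDD random 4K, ~2ms/read    -> ~0.002 GB/s
print(effective_gbps(1024, 1e-6))        # Protobuf, ~1KB in ~1us      -> ~1 GB/s
print(effective_gbps(1024, 10e-6))       # JSON, ~1KB in ~10us         -> ~0.1 GB/s

# A single TCP flow is limited by the bandwidth-delay product:
# throughput ≈ window_size / RTT.
def tcp_flow_gbps(window_bytes: float, rtt_s: float) -> float:
    return window_bytes / rtt_s / 1e9

print(tcp_flow_gbps(1 * 1024**2, 40e-3))   # 1 MB window, 40ms RTT  -> ~0.03 GB/s
print(tcp_flow_gbps(10 * 1024**2, 40e-3))  # 10 MB window, 40ms RTT -> ~0.26 GB/s
```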

Ratios worth internalizing, computed from the table above:

| Ratio | Value | Architectural Implication |
|---|---|---|
| HBM (H100) vs InfiniBand NDR | 67× | Tensor parallelism stays intra-node; pipeline/data parallelism goes inter-node |
| NVLink (H100) vs InfiniBand NDR | 18× | Same as above — crossing node boundary drops ~1 order of magnitude |
| NVMe sequential vs HDD random | 7,000× | SSDs changed everything for serving; random access on spinning disk is catastrophic |
| SSD sequential vs JSON parse | 140× | If your hot path deserializes JSON, your serialization format is slower than your storage |
| L1 cache vs main memory (latency) | ~200× | Per-reference cost (0.5ns vs ~100ns); cache-friendly data structures (contiguous arrays > linked lists) dominate performance |
| B200 HBM3e vs H100 HBM3 | 2.4× | Generational bandwidth improvement; keeps tensor cores fed at lower precision |
| NVLink 5 vs NVLink 4 | 2× | Blackwell doubles intra-node interconnect |
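
To make the first two ratios concrete, here is a rough sketch of how long it takes to move a hypothetical 10 GB of activations or gradients at each layer's bandwidth. The 10 GB figure is made up for illustration; the bandwidths are the table's.

```python
# Time to move a hypothetical 10 GB tensor at each layer's bandwidth
# (bandwidth numbers from the table above; 10 GB is illustrative only).
SIZE_GB = 10

layers_gbps = {
    "H100 HBM3": 3350,
    "NVLink 4 (per GPU)": 900,
    "InfiniBand NDR": 50,
    "100GbE": 12.5,
}

for name, bw in layers_gbps.items():
    print(f"{name:20s} {SIZE_GB / bw * 1e3:7.1f} ms")

# H100 HBM3                3.0 ms
# NVLink 4 (per GPU)      11.1 ms
# InfiniBand NDR          200.0 ms
# 100GbE                  800.0 ms
```

The jump from ~11 ms on NVLink to ~200 ms over the fabric is the node boundary: bandwidth-hungry collectives (tensor parallelism) stay on NVLink, while pipeline and data parallelism, which communicate less per step, cross InfiniBand.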