Working Notes: a commonplace notebook for recording & exploring ideas.
Home. Site Map. Subscribe. More at expLog.
— Kunal
This has been a fairly busy week. I've settled on using this newsletter as my microblog: I accumulate entries over the week (refining them until Sunday) and then start all over again. It's slightly more coherent than managing different accounts.
Used some more travel time to continue iterating on my first essay for the year, djn. Writing the essay cleared my head a lot, particularly around vibe limits, or risk management -- it took several iterations to get the graph to a place that felt intuitive.
I've fallen out of practice with writing coherently; the essays are a good way to rebuild those muscles. This time around I'm trying to use both Claude and Codex as line editors; last time I hired real editors, but that felt a bit too expensive just for the sake of writing practice.
Given how painful writing can be, I'm very annoyed by how much it clears up my thinking.
As I look for ways to learn faster, I've been experimenting with using agents to review and critique my writing without directly rewriting anything. That way the improvement is mine, and my writing gets smoother again.
The Worst Possible Outcome would be internalizing the ChatGPT voice, but I think I can watch out for that and explicitly prompt the agents against it. So far the structural feedback has resonated well.
This week's experiment is asking Claude to help me write as a blend of the writers I look up to most.
A recent habit I've picked up is quickly running through a typ.ing challenge or exercise before I start working, as a warmup and brain reset. The accuracy and speed I hit also give me quick feedback on how fresh and comfortable I am at that point in time, which is useful.
The website is fairly wonderful and my favorite of the many typing tutors and challenges I've used online, all the way from the venerable GNU Typist to Typeracer and ZType; it's satisfying on several levels. It's run by ZSA: their keyboards have always been tempting, but I'm currently extremely satisfied with the NuPhy while I'm out and about and the Glove80 while I'm at a desk.
The reset exercise that prompted this post:
Today's typ.ing daily challenge:
Speed: 125wpm
Accuracy: 100.00%
Position: 2 out of 87 players today
Streak: 9
typ.ing/daily
Generally, I only trust I understand something if I can implement it.
My new favorite way to learn and internalize how a given system works is to get Claude to write out an execution plan for me; then I implement it to the best of my ability, using Claude for debugging when I get stuck.
I used this to play with GRPO with some success in very limited time, and I've been doing the same with flow matching. It does mean I don't build as much debugging muscle up front as I would while struggling through a problem on my own, but it significantly increases the amount of exploration I can do without repeatedly getting exhausted.
Asking for idiomatic ways to do things after writing them up also helps me improve much faster.
I asked Claude to list general design constraints, a more up-to-date set of the numbers everyone should know, and then to rearrange them so that everything was in GB/s and I could keep things straight in my head.
This is useful enough that I wanted to make sure I captured it somewhere; I still want to play more with the numbers and results, so this week's letter it is. The rest of this section was written by Claude.
All numbers converted to GB/s of sustained throughput for a single axis of comparison. For latency-only items, effective throughput is derived from a typical access size.
| Layer | GB/s | Category | Source |
|---|---|---|---|
| B200 HBM3e | 8,000 | GPU Memory | NVIDIA B200 Datasheet – 192GB HBM3e, 8 TB/s bandwidth |
| H100 SXM HBM3 | 3,350 | GPU Memory | NVIDIA H100 Product Page – 80GB HBM3, 3 TB/s+ bandwidth |
| A100 80GB HBM2e | 2,000 | GPU Memory | NVIDIA A100 Datasheet (PDF) – 80GB HBM2e, 2,039 GB/s (SXM) |
| NVLink 5 (B200, per GPU) | 1,800 | GPU Interconnect | NVIDIA B200 Datasheet – 1.8 TB/s bidirectional |
| NVLink 4 (H100, per GPU) | 900 | GPU Interconnect | NVIDIA Hopper Architecture In-Depth – 900 GB/s bidirectional |
| NVLink 3 (A100, per GPU) | 600 | GPU Interconnect | NVIDIA A100 Architecture Whitepaper (PDF) – 600 GB/s bidirectional |
| L1 cache (per core, x86) | ~500 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers – 0.5ns per ref → ~500 GB/s at cache-line granularity |
| DDR5 server memory (8-ch) | ~300–360 | System Memory | Typical 8-channel DDR5-4800 to DDR5-5600 server config; ~38.4–44.8 GB/s/channel × 8 |
| L2 cache (per core, x86) | ~100–200 | CPU Cache | Jeff Dean / Peter Norvig Latency Numbers – ~7ns per ref |
| PCIe Gen5 x16 (duplex) | 128 | Bus | Rambus PCIe 5.0 Overview – 32 GT/s × 16 lanes, 128 GB/s aggregate duplex |
| PCIe Gen5 x16 (unidirectional) | 64 | Bus | Rambus PCIe 5.0 Overview – 64 GB/s per direction |
| InfiniBand NDR 400G | 50 | Network (inter-node) | NVIDIA DGX SuperPOD Cabling Guide – NDR Overview – 400 Gbps = 50 GB/s |
| NVMe SSD Gen5 sequential | ~14 | Storage | WD_BLACK SN8100 Press Release – up to 14.9 GB/s read |
| 100GbE network | 12.5 | Network (datacenter) | 100 Gbps ÷ 8 = 12.5 GB/s (line rate) |
| NVMe SSD Gen4 sequential | ~7 | Storage | Typical Gen4 x4 NVMe – 7 GB/s sequential read |
| 25GbE network | 3.1 | Network (NIC) | 25 Gbps ÷ 8 = 3.1 GB/s (line rate) |
| Protobuf parse throughput | ~1 | Serialization | Estimated: 1KB in ~1μs; see Colin Scott's Latency Numbers for methodology |
| NVMe SSD random 4K reads | ~0.25 | Storage (random) | Derived: ~16μs per 4KB IOP × queue depth; see Jeff Dean Latency Numbers (updated SSD random read ~16μs) |
| HDD sequential | ~0.2 | Storage | Typical 7200 RPM HDD sequential throughput |
| JSON parse throughput | ~0.1 | Serialization | Estimated: 1KB in ~10μs; Beyond Latency Numbers Every Programmer Should Know |
| Single TCP flow, cross-region | ~0.03–0.25 | Network (WAN) | Bandwidth-delay product limited: window_size / RTT; 40ms RTT with typical window sizes |
| HDD random 4K reads | ~0.002 | Storage (random) | Derived: ~2–10ms seek per 4KB IOP; Jeff Dean Latency Numbers |
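To keep myself honest about the latency-only rows, here's a quick sketch of the derivation, back in my own words: divide a typical access size by the latency per access. The access sizes are the ones the table assumes.

```python
# Convert a latency-per-access number into effective GB/s, the way the
# latency-only rows in the table are derived (serial accesses, no overlap).
def effective_gbps(bytes_per_access: float, latency_s: float) -> float:
    return bytes_per_access / latency_s / 1e9

# NVMe random read: ~16us per 4KB IOP
nvme_random = effective_gbps(4096, 16e-6)   # ~0.26 GB/s
# Protobuf parse: ~1KB in ~1us
protobuf = effective_gbps(1024, 1e-6)       # ~1 GB/s
# JSON parse: ~1KB in ~10us
json_parse = effective_gbps(1024, 10e-6)    # ~0.1 GB/s
# Single TCP flow: bandwidth-delay product, window_size / RTT;
# e.g. a 4MB window over a 40ms cross-region RTT
tcp_flow = (4 * 2**20) / 40e-3 / 1e9        # ~0.1 GB/s

for name, v in [("nvme random", nvme_random), ("protobuf", protobuf),
                ("json parse", json_parse), ("tcp flow", tcp_flow)]:
    print(f"{name}: {v:.3f} GB/s")
```

These land on the ~0.25, ~1, ~0.1, and ~0.03–0.25 rows above; larger windows or deeper queue depths push the storage and WAN numbers up.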
| Ratio | Value | Architectural Implication |
|---|---|---|
| HBM (H100) vs InfiniBand NDR | 67× | Tensor parallelism stays intra-node; pipeline/data parallelism goes inter-node |
| NVLink (H100) vs InfiniBand NDR | 18× | Same as above – crossing the node boundary drops ~1 order of magnitude |
| NVMe sequential vs HDD random | 7,000× | SSDs changed everything for serving; random access on spinning disk is catastrophic |
| SSD sequential vs JSON parse | 140× | If your hot path deserializes JSON, your serialization format is slower than your storage |
| L1 cache vs main memory | ~500× | Cache-friendly data structures (contiguous arrays > linked lists) dominate performance |
| B200 HBM3e vs H100 HBM3 | 2.4× | Generational bandwidth improvement; keeps tensor cores fed at lower precision |
| NVLink 5 vs NVLink 4 | 2× | Blackwell doubles intra-node interconnect |
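A quick sanity check, in my own words again: the ratios follow directly from the GB/s column, so recomputing them is a one-liner per row. The values below are copied from the table above.

```python
# GB/s values copied from the table above.
gbps = {
    "H100 HBM3": 3350,
    "NVLink 4": 900,
    "InfiniBand NDR": 50,
    "NVMe Gen5 seq": 14,
    "JSON parse": 0.1,
    "HDD random 4K": 0.002,
}

def ratio(fast: str, slow: str) -> float:
    """How many times faster the first layer is than the second."""
    return gbps[fast] / gbps[slow]

print(round(ratio("H100 HBM3", "InfiniBand NDR")))     # 67: tensor parallelism stays intra-node
print(round(ratio("NVLink 4", "InfiniBand NDR")))      # 18: node boundary costs ~an order of magnitude
print(round(ratio("NVMe Gen5 seq", "HDD random 4K")))  # 7000: random spinning-disk access is catastrophic
print(round(ratio("NVMe Gen5 seq", "JSON parse")))     # 140: JSON parsing is slower than the storage under it
```

Having the dictionary around also makes it easy to ask new questions of the table, like where a 200GbE NIC or a Gen6 SSD would slot in.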