Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
In my quest to implement transformers in plain JAX, I've been skimming and implementing ideas from a lot of different old papers. The goal is to build a much deeper intuition about the weight updates, costs, and parallelism, and to become much more confident about what each decision adds to the transformer.
After reading about transformers in a lot of different books and places, I can triangulate a bit on what the papers are describing and trying to accomplish, but I wouldn't have been able to do any of this without access to cheat sheets like Seb Raschka's Building a Transformer from Scratch.
For this attempt I've been trying to build a transformer that can count, because it's trivial to generate data and then see how far I can get. The exercise has been very valuable for building intuition around the different decisions in the transformer architecture, and for figuring out which parts contribute to what the model is doing. I'm recording losses and trivial evals along with one-liner descriptions as I go, to explore the space.
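One nice property of the counting task is how cheap the data is. A minimal sketch of what a batch generator might look like in plain JAX (the vocabulary size, sequence length, and function names here are my assumptions, not the actual setup from the experiment):

```python
import jax
import jax.numpy as jnp

VOCAB = 16    # assumed: tokens are just the integers 0..15
SEQ_LEN = 8   # assumed sequence length

def make_batch(key, batch_size):
    """Each example is an ascending run like [3, 4, 5, ...];
    the target is the input shifted left by one position."""
    starts = jax.random.randint(key, (batch_size, 1), 0, VOCAB - SEQ_LEN)
    seqs = starts + jnp.arange(SEQ_LEN + 1)   # (batch, SEQ_LEN + 1)
    return seqs[:, :-1], seqs[:, 1:]          # inputs, next-token targets

xs, ys = make_batch(jax.random.PRNGKey(0), 4)
print(xs.shape, ys.shape)  # (4, 8) (4, 8)
```

With data this simple, any failure to learn points squarely at the architecture or training choices rather than the dataset.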
I stumbled upon AI Systems Performance Engineering by Chris Fregly recently, and it's a beautiful, extremely relevant book covering a lot of the magic needed to make GPUs truly go brr. It's also really up to date, focused on Blackwell GPUs, with brief discussions of whatever is going to come next.
The latest Dresden Files novel (Twelve Months) released on the 20th, and given I was on a plane I ended up finishing it the same day. A satisfying read, and one that makes me want to work out more regularly, which was a nice outcome.
The second plane ride this week gave me time to catch up with more books:
I'm constantly amazed by Claude and then always a little bit let down when I run into the next bug in the generated code. More to play with here.