Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
In my quest to implement transformers in plain JAX, I've been skimming and implementing ideas from a lot of different old papers. The goal is to build a much deeper intuition about the weight updates, costs, and parallelism, and to become much more confident about what each decision adds to the transformer.
After reading about transformers in a lot of different books and places, I can triangulate a bit on what the papers are describing and trying to accomplish, but I wouldn't have been able to do any of this without access to cheat sheets like Seb Raschka's Building a Transformer from Scratch.
For this attempt I've been trying to build a transformer that can count, because it's trivial to generate data and then see how far I can get. The exercise has been very valuable for building intuition around the different decisions in the transformer architecture, and for figuring out which parts contribute to what the model is doing. I'm recording losses and trivial evals along with one-liner descriptions as I go, to explore the space.
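One nice property of the counting task is how cheap the data is. A minimal sketch of what a batch generator might look like in plain JAX (the vocabulary size, sequence length, and function names here are my assumptions, not the actual setup from the experiment):

```python
import jax
import jax.numpy as jnp

VOCAB = 16    # assumed: tokens are just the integers 0..15
SEQ_LEN = 8   # assumed sequence length

def make_batch(key, batch_size):
    """Each example is an ascending run like [3, 4, 5, ...];
    the target is the input shifted left by one position."""
    starts = jax.random.randint(key, (batch_size, 1), 0, VOCAB - SEQ_LEN)
    seqs = starts + jnp.arange(SEQ_LEN + 1)   # (batch, SEQ_LEN + 1)
    return seqs[:, :-1], seqs[:, 1:]          # inputs, next-token targets

xs, ys = make_batch(jax.random.PRNGKey(0), 4)
print(xs.shape, ys.shape)  # (4, 8) (4, 8)
```

With data this simple, any failure to learn points squarely at the architecture or training choices rather than the dataset.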
I stumbled upon AI Systems Performance Engineering by Chris Fregly recently, and it's a beautiful, extremely relevant book covering a lot of the magic needed to make GPUs truly go brr. It's also really up to date, focused on Blackwell GPUs, with brief discussions of whatever is going to come next.
The latest Dresden Files novel (Twelve Months) released on the 20th, and given I was on a plane I ended up finishing it the same day. A satisfying read, and one that makes me want to work out more regularly, which was a nice outcome.
The second plane ride this week gave me time to catch up with more books:
I'm constantly amazed by Claude and then always a little bit let down when I run into the next bug in the generated code. More to play with here.