Working Notes: a commonplace notebook for recording & exploring ideas.
Home. Site Map. Subscribe. More at expLog. — Kunal

Lecture 3: Architectures & Hyperparams

Changes to Transformers

Notes across changes over the year; beautiful slides

Architecture

pre-vs-post norm
- extract the layer norm and don't have it happen in the residual stream
- pre-norm is a more stable architecture
- used as a stability aid for large nets
- layer norm in residual bad
  - residual gives an identity connection between first / last layers
  - this helps propagate gradients
  - but with lns in between gradient propagation gets messed up
"double norm"
- have layer norm both before and after ffn, and even the multi head attn
- grok, gemma2, olmo2 just keeps ln after
layernorm vs rms norm
- normalize the activations and shift them
- modern models use rms norms -- don't do any mean adjustment
- does just as well, and simpler -- no mean calculation & no bias term
no bias terms
- FFN(x) = sigma(xW1)W2
- apparently bias terms are bad for stability (empirically)
gated activations
- almost all models now use a gated linear unit
- swish is x * sigmoid(x)
- swiglu is very popular now
- gelu isn't monotonically decreasing
- nn optimization dynamics: using momentum, activations don't get trapped here
parallel layers
- some models have tried parallel layers
- instead of serial mlp, do mlp and attention simultaneously
- can share a lot of stuff for systems efficiency
- most models have been serial
position embeddigs
- original: sine/cosine embeddings
- rope embeddings are most common
  - relative positions of the vectors
  - <f(x, i), f(y, j)> = g(x, y, i-j)
  - should only be expressible in relative difference
  - vector rotation works really well here
  - rotate every pair of dimensions with diferent velocities

Hyperparameters

d_ff = 4 d_model
- there are several exceptions
heads = model_dim
- Keep dimensions fixed as we add more heads
dmodel / nlayer ~ 128
- has generally been stuck to
- this controls the amount of parallelism
- tensor parallel needs fastest networking
vocabulary sizes
- ternding upwards
- monolingual: 30-50k
- multilingual: 100-250k
dropout / regularization
- pretraining is just doing a single epoch
- dropout is out of fashion
- weight decay still used, interacts with learning rate

Stability Tricks

l2 norm of gradient graph is pretty scary
- can stop converging
stability issues arise from softmaxes
- output softmax
  - z-loss
  - auxiliary loss term that tries to force z loss to 0
- in self attention
  - attention operation
  - inputs to softmax are bounded using layer norms for q & k
  - from multimodal models
just add layer norms
logit soft-capping
- use tanh to cap logit values

Attention Heads

gqa / mqa
- saving inference costs
- arithmetic intensity
  - lots of compute per memory access
- with kv cache, the model access pattern breaks
- MQA -- have multiple heads for query / multi query attention
  - only 1 dimension for keys and values
  - improves arithmetic intensity
- GQA -- grouped query
  - trade off between inference and expressiveness
sparse / sliding window attention
- sparser structured attention
- now have transformer blocks
  - full self attention every 4 blocks with no position embedding
  - the rest of them have rope

Followups
- memory movement is all you need paper
- optimization dynamics exploration
- noam shazeer's paper on gated linear units
- simulate effect of hyperparams/options in training / behavior and look up any experiments
- layer norm implementation
- catch up with the olmo papers
- probably want to work through cs224n as well, particularly transformers assignment