Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Lecture 3: Architectures & Hyperparams
Notes across changes over the year; beautiful slides
Architecture
- pre-vs-post norm
- pull the layer norm out so it doesn't sit in the residual stream
- pre-norm is the more stable architecture
- used as a stability aid for large nets
- layer norm in the residual path is bad
- the residual gives an identity connection between the first and last layers
- this helps gradients propagate
- but with layer norms in between, gradient propagation gets messed up
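A minimal pre-norm block sketch in PyTorch (layer names and sizes are mine, purely illustrative): the norms sit on the branches, so the residual stream itself stays a clean identity path.

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """Sketch of a pre-norm transformer block: x -> x + sublayer(norm(x)),
    so no layer norm ever sits on the identity/residual path itself."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Normalize only the branch input; add the sublayer output back
        # to the untouched residual stream.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        x = x + self.ffn(self.norm2(x))
        return x
```

Post-norm would instead wrap the sum, `x = norm(x + sublayer(x))`, putting a norm between every pair of layers on the gradient path.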
- "double norm"
- layer norm both before and after the FFN, and even the multi-head attention
- examples: Grok, Gemma 2; OLMo 2 just keeps the layer norm after
- layernorm vs rms norm
- normalize the activations and shift them
- modern models use rms norms -- don't do any mean adjustment
- does just as well, and simpler -- no mean calculation & no bias term
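A side-by-side sketch of the two norms (my own illustration; `eps` and shapes assumed): RMSNorm skips both the mean subtraction and the bias term.

```python
import torch

def layer_norm(x, weight, bias, eps=1e-6):
    # LayerNorm: center by the mean, scale by the std, then shift/scale.
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mean) / torch.sqrt(var + eps) * weight + bias

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square only -- no mean, no bias.
    rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x / rms * weight
```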
- no bias terms
- FFN(x) = sigma(xW1)W2
- apparently bias terms are bad for stability (empirically)
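The formula above as code (a sketch; sigma is GELU here purely for illustration) -- `bias=False` drops the bias vectors entirely:

```python
import torch
import torch.nn as nn

class BiasFreeFFN(nn.Module):
    """FFN(x) = sigma(x W1) W2, with no bias terms anywhere."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)
        self.sigma = nn.GELU()

    def forward(self, x):
        return self.w2(self.sigma(self.w1(x)))
```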
- gated activations
- almost all models now use a gated linear unit
- swish is x * sigmoid(x)
- swiglu is very popular now
- gelu/swish aren't monotonic -- they dip below zero for negative inputs
- nn optimization dynamics: with momentum, activations don't get trapped in that dip
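A SwiGLU FFN sketch (following the Shazeer GLU-variants formulation; the layer names are mine): the swish branch gates the linear branch elementwise, and biases are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """SwiGLU(x) = (swish(x W1) * (x V)) W2, with no bias terms."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff, bias=False)  # gate branch
        self.v = nn.Linear(d_model, d_ff, bias=False)   # value branch
        self.w2 = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        # F.silu is exactly swish: x * sigmoid(x).
        return self.w2(F.silu(self.w1(x)) * self.v(x))
```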
- parallel layers
- some models have tried parallel layers
- instead of serial mlp, do mlp and attention simultaneously
- can share a lot of stuff for systems efficiency
- most models have been serial
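The serial-vs-parallel difference in a few lines (hypothetical helpers; `attn`/`mlp` stand in for any sublayer callables):

```python
def serial_block(x, norm1, norm2, attn, mlp):
    # Standard serial block: attention first, then the MLP sees its output.
    x = x + attn(norm1(x))
    return x + mlp(norm2(x))

def parallel_block(x, norm, attn, mlp):
    # Parallel block: both sublayers read the same normalized input, so
    # a single norm is shared and the input matmuls can be fused.
    h = norm(x)
    return x + attn(h) + mlp(h)
```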
- position embeddings
- original: sine/cosine embeddings
- rope embeddings are most common
- relative positions of the vectors
- <f(x, i), f(y, j)> = g(x, y, i-j)
- the inner product should depend only on the relative offset i - j
- vector rotation works really well here
- rotate every pair of dimensions at a different angular velocity
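A NumPy sketch of the rotation (the half-split pairing convention here is an assumption; real implementations differ in how they pair dimensions):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (x[k], x[k + d/2]) pair by angle pos * base**(-2k/d):
    early pairs spin quickly, later pairs slowly."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-2.0 * np.arange(half) / d)  # per-pair angular velocity
    theta = pos * freqs
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)
```

Because each pair is rotated by an angle linear in position, <f(x, i), f(y, j)> depends only on i - j: shifting both positions by the same amount leaves the inner product unchanged.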
Hyperparameters
- d_ff = 4 d_model
- there are several exceptions
- num_heads * head_dim = d_model
- keep the per-head dimension fixed as we add more heads
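Concretely, the heads-vs-model-dim rule above (illustrative numbers, roughly LLaMA-7B-shaped):

```python
# num_heads * head_dim = d_model; the per-head dimension is what stays
# fixed, so adding heads means growing d_model in step.
d_model, n_heads = 4096, 32
head_dim = d_model // n_heads
assert head_dim * n_heads == d_model  # heads exactly tile the model dim
print(head_dim)  # -> 128
```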
- dmodel / nlayer ~ 128
- has generally been stuck to
- this controls the amount of parallelism
- tensor parallel needs fastest networking
- vocabulary sizes
- trending upwards
- monolingual: 30-50k
- multilingual: 100-250k
- dropout / regularization
- pretraining is just doing a single epoch
- dropout is out of fashion
- weight decay still used, interacts with learning rate
Stability Tricks
Attention Heads
- Followups
- memory movement is all you need paper
- optimization dynamics exploration
- noam shazeer's paper on gated linear units
- simulate effect of hyperparams/options in training / behavior and look up any experiments
- layer norm implementation
- catch up with the olmo papers
- probably want to work through cs224n as well, particularly transformers assignment