Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Everyone seems to be doing it
several subcomponents called experts that are sparsely activated
main difference
can have same flops as a dense model while having more parameters
for same flops, get better perf from the MoE
drawbacks -- harder to make efficient
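The params-vs-flops point can be made concrete with a little arithmetic. A sketch with toy sizes (all numbers here are assumptions for illustration, not any real model's config), assuming each expert is a full-size FFN and top-1 routing:

```python
# Toy sizes, just to illustrate the params-vs-FLOPs trade-off of an MoE.
d_model, d_ff = 1024, 4096
n_experts, top_k = 8, 1

dense_params = 2 * d_model * d_ff          # up- and down-projection of one FFN
moe_params = n_experts * dense_params      # every expert holds a full FFN

dense_flops = 2 * dense_params             # ~2 FLOPs per parameter per token
moe_flops = top_k * 2 * dense_params       # only top_k experts run per token

print(moe_params // dense_params)          # 8x the parameters
print(moe_flops == dense_flops)            # same per-token compute with top-1
```

With top-1 routing the per-token compute matches the dense model exactly while total parameters scale with the expert count; that extra capacity is where the better perf-per-flop comes from.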
started in eastern models, qwen, deepseek v3, etc.
deepseek had nailed the architecture up front
infrastructure becomes much more complicated
routing decisions are not differentiable -- makes optimization hard
classic MoE replaces the ffn with an moe layer with sparse routing decisions
can have a sparsely routed attention layer too, rare to see
options
almost all moes do token choice
top k -- k should be at least 2 for exploration
at the end can average it
even hashing instead of routing helps
some moes tried to use RL to route correctly
compute cost was too prohibitive
use gating to eliminate other experts from the router
top 2 routing will have twice the active parameters of top 1
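Token-choice top-k routing can be sketched in a few lines of numpy (toy sizes, a simplified sketch of the general technique, not any specific model's router):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 4, 2          # toy sizes (assumptions)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_ffn(x, w_router, experts):
    """Token-choice top-k routing: each token picks its top_k experts;
    the other experts' gates are zeroed so they contribute nothing."""
    probs = softmax(w_router @ x)        # one score per expert
    chosen = np.argsort(probs)[-top_k:]  # indices of the top_k experts
    gates = np.zeros(n_experts)
    gates[chosen] = probs[chosen]
    gates /= gates.sum()                 # renormalise over the chosen experts
    # weighted average of the selected experts' outputs
    return sum(gates[i] * (experts[i] @ x) for i in chosen)

w_router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
x = rng.normal(size=d)
y = moe_ffn(x, w_router, experts)       # same shape as a dense ffn output
```

The argsort/gather is where differentiability breaks: gradients flow through the kept gate values, but the *choice* of which experts to keep is a hard decision.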
people wanted to have a lot of experts
deepseek cut the expert into lots of smaller pieces
smaller matrices, but more fine-grained experts
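Why fine-grained slicing helps: splitting each expert keeps total and active parameters fixed while exploding the number of possible expert combinations the router can express. A quick check with toy sizes (the configs below are assumptions for illustration):

```python
from math import comb

d_model, d_ff = 1024, 4096

# coarse config: 8 full-width experts, top-2 routing
coarse_params = 8 * 2 * d_model * d_ff
coarse_active = 2 * 2 * d_model * d_ff

# fine-grained: slice every expert into 4 (d_ff/4 wide), activate 4x as many
fine_params = 32 * 2 * d_model * (d_ff // 4)
fine_active = 8 * 2 * d_model * (d_ff // 4)

print(fine_params == coarse_params, fine_active == coarse_active)  # True True
print(comb(8, 2), comb(32, 8))   # 28 vs 10518300 possible expert combinations
```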
olmoe ablations
common configs
original motivation for shared expert -- tentatively to keep experts same size
training is gnarly; cannot turn on all experts
sparse gating is not differentiable
RL to figure out routing -- most principled
stochastic approximations
heuristic exploration
skipping this just means 2 experts pick up everything
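The standard heuristic fix for that collapse is an auxiliary load-balancing loss. A minimal sketch of the Switch-Transformer-style version (n_experts * sum over experts of f_i * P_i; the example values are made up):

```python
import numpy as np

def load_balance_loss(probs, assignments, n_experts):
    """Auxiliary loss n_experts * sum_i(f_i * P_i), where
    f_i = fraction of tokens whose top-1 choice is expert i and
    P_i = mean router probability on expert i.
    Minimised (== 1) when routing is perfectly uniform."""
    f = np.bincount(assignments, minlength=n_experts) / len(assignments)
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

# uniform routing over 4 experts -> the minimum value, 1.0
probs = np.full((8, 4), 0.25)
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balance_loss(probs, assignments, 4))        # 1.0

# collapsed routing: everything to expert 0 -> 4.0
probs_bad = np.tile(np.array([1.0, 0, 0, 0]), (8, 1))
assignments_bad = np.zeros(8, dtype=int)
print(load_balance_loss(probs_bad, assignments_bad, 4))  # 4.0
```

The f term is non-differentiable (hard counts) but P is differentiable, so the product still pushes the router toward uniform usage.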
enables additional types of parallelism
multiple experts on a single device
gpt4
stability is hard; tricks
deepseek moe architecture
architecture itself hasn't changed significantly from v1-v3
16b params with 2.8b active
2 shared + 64 fine grained
topk routing
auxiliary loss per device/expert
v2 -- 236bn params, 21bn active
same arch
sharding experts very finely means each token's experts land on many devices
too much cross-device communication
fine grained 160/10 experts + 2 shared, 6 activate
restrict devices first, and then choose top k
engage with the systems concerns
communication balancing loss
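The device-limited selection step can be sketched as a two-stage top-k (toy sizes and random scores; a simplified sketch of the idea, not DeepSeek's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices, experts_per_device, top_m, top_k = 4, 8, 2, 6   # toy sizes

scores = rng.random(n_devices * experts_per_device)  # router affinity per expert
by_device = scores.reshape(n_devices, experts_per_device)

# step 1: keep only the top_m devices, ranked by their best expert's score
device_rank = by_device.max(axis=1).argsort()[-top_m:]
mask = np.full_like(by_device, -np.inf)
mask[device_rank] = by_device[device_rank]

# step 2: ordinary top_k over the surviving experts
chosen = np.argsort(mask.ravel())[-top_k:]

# every token now talks to at most top_m devices, bounding communication
print(len(set(chosen // experts_per_device)))   # <= top_m
```

Restricting devices first slightly degrades routing quality but caps the all-to-all traffic per token, which is the systems concern the note is pointing at.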
deepseek v3
671b, 37b active
shared 1 + fine grained 256 experts, 8 active
using sigmoid scores (instead of softmax) + top k experts + top m node-limited routing
aux-loss-free balancing (per-expert bias terms) + a seq-wise aux loss
at inference time can't control the sequence mix -- batch-level balance doesn't guarantee balance within one sequence
so have stronger balancing that works at a seq level
other bits of deepseek v3
mla -- multihead latent attention
instead of reducing heads, project heads into a lower space
cache the lower space
merge projection matrix into q
rope conflicts with mla caching
apply rope on non compressed dimension
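The MLA caching trick, sketched with numpy (toy dimensions, single head, no rope; all sizes are assumptions, and a real implementation folds the up-projection into the query weights as the note says):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 512, 64, 128   # toy sizes (assumptions)

W_down = rng.normal(size=(d_latent, d_model))  # shared KV down-projection
W_uk = rng.normal(size=(d_model, d_latent))    # up-projection for keys
W_uv = rng.normal(size=(d_model, d_latent))    # up-projection for values

h = rng.normal(size=(seq_len, d_model))        # hidden states

# cache only the low-rank latent, not full K and V
kv_cache = h @ W_down.T                        # (seq_len, d_latent)

# reconstruct keys/values on demand; at inference W_uk is merged into the
# query projection so this up-projection never runs explicitly
k = kv_cache @ W_uk.T
v = kv_cache @ W_uv.T

full_cache = 2 * seq_len * d_model             # standard K + V cache entries
mla_cache = seq_len * d_latent
print(full_cache // mla_cache)                 # 16x smaller cache here
```

The rope conflict falls out of this: merging W_uk into the query only works if the cached latent is position-independent, hence rope goes on a separate uncompressed dimension.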
mtp
pass hidden state and send it to a lightweight model to predict 1 token in the future
just do this with 1 token ahead
(play with this)
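A heavily simplified sketch of the MTP idea (toy sizes; the extra head here is a single tanh layer purely for illustration -- the real module is a small transformer block, and the t+2 loss is only used during training or for speculative decoding):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, seq_len = 64, 100, 10   # toy sizes (assumptions)

h = rng.normal(size=(seq_len, d_model))     # trunk hidden states

W_main = rng.normal(size=(vocab, d_model))  # usual next-token head (t+1)
W_mtp = rng.normal(size=(d_model, d_model)) # lightweight extra block
W_head = rng.normal(size=(vocab, d_model))  # head for the extra prediction

logits_next = h @ W_main.T                  # predicts token t+1, as usual

# MTP branch: a cheap transform of the same hidden state predicts t+2;
# its loss is added to the main loss during training
h_mtp = np.tanh(h @ W_mtp.T)
logits_plus2 = h_mtp @ W_head.T
```

Keeping it to just one extra token ahead keeps the auxiliary model small relative to the trunk.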
Follow ups