Working Notes: a commonplace notebook for recording & exploring ideas.
— Kunal
Everyone seems to be doing it
several subcomponents called experts that are sparsely activated
main difference
can have same flops as a dense model while having more parameters
for same flops, get better perf from the MoE
drawbacks -- harder to make efficient
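The params-vs-flops point can be made concrete with a little arithmetic. A sketch with toy sizes (all numbers here are assumptions for illustration, not any real model's config), assuming each expert is a full-size FFN and top-1 routing:

```python
# Toy sizes, just to illustrate the params-vs-FLOPs trade-off of an MoE.
d_model, d_ff = 1024, 4096
n_experts, top_k = 8, 1

dense_params = 2 * d_model * d_ff          # up- and down-projection of one FFN
moe_params = n_experts * dense_params      # every expert holds a full FFN

dense_flops = 2 * dense_params             # ~2 FLOPs per parameter per token
moe_flops = top_k * 2 * dense_params       # only top_k experts run per token

print(moe_params // dense_params)          # 8x the parameters
print(moe_flops == dense_flops)            # same per-token compute with top-1
```

With top-1 routing the per-token compute matches the dense model exactly while total parameters scale with the expert count; that extra capacity is where the better perf-per-flop comes from.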
started in eastern models, qwen, deepseek v3, etc.
deepseek had nailed the architecture up front
infrastructure becomes much more complicated
routing decisions are not differentiable -- makes optimization hard
classic MoE replaces the ffn with an moe layer with sparse routing decisions
can have a sparsely routed attention layer too, rare to see
options
almost all moes do token choice
top k -- k should be at least 2 for exploration
at the end can average it
even hashing instead of routing helps
some moes tried to use RL to route correctly
compute cost was too prohibitive
use gating to eliminate other experts from the router
top 2 routing will have twice the active parameters of top 1
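Token-choice top-k routing can be sketched in a few lines of numpy (toy sizes, a simplified sketch of the general technique, not any specific model's router):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 4, 2          # toy sizes (assumptions)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def moe_ffn(x, w_router, experts):
    """Token-choice top-k routing: each token picks its top_k experts;
    the other experts' gates are zeroed so they contribute nothing."""
    probs = softmax(w_router @ x)        # one score per expert
    chosen = np.argsort(probs)[-top_k:]  # indices of the top_k experts
    gates = np.zeros(n_experts)
    gates[chosen] = probs[chosen]
    gates /= gates.sum()                 # renormalise over the chosen experts
    # weighted average of the selected experts' outputs
    return sum(gates[i] * (experts[i] @ x) for i in chosen)

w_router = rng.normal(size=(n_experts, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
x = rng.normal(size=d)
y = moe_ffn(x, w_router, experts)       # same shape as a dense ffn output
```

The argsort/gather is where differentiability breaks: gradients flow through the kept gate values, but the *choice* of which experts to keep is a hard decision.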
people wanted to have a lot of experts
deepseek cut the expert into lots of smaller pieces
smaller matrices, but more fine-grained experts
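Why fine-grained slicing helps: splitting each expert keeps total and active parameters fixed while exploding the number of possible expert combinations the router can express. A quick check with toy sizes (the configs below are assumptions for illustration):

```python
from math import comb

d_model, d_ff = 1024, 4096

# coarse config: 8 full-width experts, top-2 routing
coarse_params = 8 * 2 * d_model * d_ff
coarse_active = 2 * 2 * d_model * d_ff

# fine-grained: slice every expert into 4 (d_ff/4 wide), activate 4x as many
fine_params = 32 * 2 * d_model * (d_ff // 4)
fine_active = 8 * 2 * d_model * (d_ff // 4)

print(fine_params == coarse_params, fine_active == coarse_active)  # True True
print(comb(8, 2), comb(32, 8))   # 28 vs 10518300 possible expert combinations
```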
olmoe ablations
common configs
original motivation for shared expert -- tentatively to keep experts same size
training is gnarly; cannot turn on all experts
sparse gating is not differentiable
RL to figure out routing -- most principled
stochastic approximations
heuristic exploration
skipping this just means 2 experts pick up everything
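The standard heuristic fix for that collapse is an auxiliary load-balancing loss. A minimal sketch of the Switch-Transformer-style version (n_experts * sum over experts of f_i * P_i; the example values are made up):

```python
import numpy as np

def load_balance_loss(probs, assignments, n_experts):
    """Auxiliary loss n_experts * sum_i(f_i * P_i), where
    f_i = fraction of tokens whose top-1 choice is expert i and
    P_i = mean router probability on expert i.
    Minimised (== 1) when routing is perfectly uniform."""
    f = np.bincount(assignments, minlength=n_experts) / len(assignments)
    P = probs.mean(axis=0)
    return n_experts * float(f @ P)

# uniform routing over 4 experts -> the minimum value, 1.0
probs = np.full((8, 4), 0.25)
assignments = np.array([0, 1, 2, 3, 0, 1, 2, 3])
print(load_balance_loss(probs, assignments, 4))        # 1.0

# collapsed routing: everything to expert 0 -> 4.0
probs_bad = np.tile(np.array([1.0, 0, 0, 0]), (8, 1))
assignments_bad = np.zeros(8, dtype=int)
print(load_balance_loss(probs_bad, assignments_bad, 4))  # 4.0
```

The f term is non-differentiable (hard counts) but P is differentiable, so the product still pushes the router toward uniform usage.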
enables additional types of parallelism
multiple experts on a single device
gpt4
stability is hard; tricks
deepseek moe architecture
architecture itself hasn't changed significantly from v1-v3
16b params with 2.8b active
2 shared + 64 fine grained
topk routing
auxiliary loss per device/expert
v2 -- 236bn params, 21bn active
same arch
sharding experts very finely means each token's experts land on many devices
too much cross-device communication
fine grained 160/10 experts + 2 shared, 6 activate
restrict devices first, and then choose top k
engage with the systems concerns
communication balancing loss
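The device-limited selection step can be sketched as a two-stage top-k (toy sizes and random scores; a simplified sketch of the idea, not DeepSeek's exact algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
n_devices, experts_per_device, top_m, top_k = 4, 8, 2, 6   # toy sizes

scores = rng.random(n_devices * experts_per_device)  # router affinity per expert
by_device = scores.reshape(n_devices, experts_per_device)

# step 1: keep only the top_m devices, ranked by their best expert's score
device_rank = by_device.max(axis=1).argsort()[-top_m:]
mask = np.full_like(by_device, -np.inf)
mask[device_rank] = by_device[device_rank]

# step 2: ordinary top_k over the surviving experts
chosen = np.argsort(mask.ravel())[-top_k:]

# every token now talks to at most top_m devices, bounding communication
print(len(set(chosen // experts_per_device)))   # <= top_m
```

Restricting devices first slightly degrades routing quality but caps the all-to-all traffic per token, which is the systems concern the note is pointing at.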
deepseek v3
671b, 37b active
shared 1 + fine grained 256 experts, 8 active
using sigmoid scores (instead of softmax) + top k experts + top m node-limited routing
aux-loss-free balancing (per-expert bias terms) + a seq-wise aux loss
at inference time can't control the sequence mix -- batch-level balance doesn't guarantee balance within one sequence
so have stronger balancing that works at a seq level
other bits of deepseek v3
mla -- multihead latent attention
instead of reducing heads, project heads into a lower space
cache the lower space
merge projection matrix into q
rope conflicts with mla caching
apply rope on non compressed dimension
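The MLA caching trick, sketched with numpy (toy dimensions, single head, no rope; all sizes are assumptions, and a real implementation folds the up-projection into the query weights as the note says):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_latent, seq_len = 512, 64, 128   # toy sizes (assumptions)

W_down = rng.normal(size=(d_latent, d_model))  # shared KV down-projection
W_uk = rng.normal(size=(d_model, d_latent))    # up-projection for keys
W_uv = rng.normal(size=(d_model, d_latent))    # up-projection for values

h = rng.normal(size=(seq_len, d_model))        # hidden states

# cache only the low-rank latent, not full K and V
kv_cache = h @ W_down.T                        # (seq_len, d_latent)

# reconstruct keys/values on demand; at inference W_uk is merged into the
# query projection so this up-projection never runs explicitly
k = kv_cache @ W_uk.T
v = kv_cache @ W_uv.T

full_cache = 2 * seq_len * d_model             # standard K + V cache entries
mla_cache = seq_len * d_latent
print(full_cache // mla_cache)                 # 16x smaller cache here
```

The rope conflict falls out of this: merging W_uk into the query only works if the cached latent is position-independent, hence rope goes on a separate uncompressed dimension.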
mtp
pass hidden state and send it to a lightweight model to predict 1 token in the future
just do this with 1 token ahead
(play with this)
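A heavily simplified sketch of the MTP idea (toy sizes; the extra head here is a single tanh layer purely for illustration -- the real module is a small transformer block, and the t+2 loss is only used during training or for speculative decoding):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab, seq_len = 64, 100, 10   # toy sizes (assumptions)

h = rng.normal(size=(seq_len, d_model))     # trunk hidden states

W_main = rng.normal(size=(vocab, d_model))  # usual next-token head (t+1)
W_mtp = rng.normal(size=(d_model, d_model)) # lightweight extra block
W_head = rng.normal(size=(vocab, d_model))  # head for the extra prediction

logits_next = h @ W_main.T                  # predicts token t+1, as usual

# MTP branch: a cheap transform of the same hidden state predicts t+2;
# its loss is added to the main loss during training
h_mtp = np.tanh(h @ W_mtp.T)
logits_plus2 = h_mtp @ W_head.T
```

Keeping it to just one extra token ahead keeps the auxiliary model small relative to the trunk.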
Follow ups