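Mixture-of-Experts (MoE) routes each token to a subset of expert FFN layers via top-k routing. Only k of N experts are active per token, so per-token FLOPs scale with k rather than N, and total parameter count can grow without a matching increase in compute. Examples: Switch Transformer (k=1), Mixtral 8x7B (k=2 of 8). An auxiliary load-balancing loss keeps the router from collapsing onto a few experts. Key challenge: communication overhead in distributed settings, since experts sharded across devices require all-to-all token exchange. Mixtral 8x7B matches or outperforms LLaMA-2-70B on most reported benchmarks while using about 13B active parameters per token (of roughly 47B total), i.e. about 1/5 of the dense model's per-token compute.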
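A minimal sketch of the routing and balancing ideas above, in plain Python for readability. Assumptions, none of which come from a specific codebase: the function names (`top_k_route`, `moe_forward`, `load_balancing_loss`) are hypothetical, experts are plain callables on lists of floats, gate weights are a softmax over the selected logits only (the Mixtral-style choice), and the auxiliary loss follows the Switch Transformer form `alpha * N * sum_i f_i * P_i` with top-1 dispatch and an assumed default of `alpha=0.01`.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Select the k highest-scoring experts for one token.

    Returns (expert_indices, gate_weights). The weights are a softmax over
    the selected logits only, so they sum to 1.
    """
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([router_logits[i] for i in chosen])
    return chosen, weights


def moe_forward(x, router_logits, experts, k):
    """Gate-weighted sum of the k selected experts' outputs.

    The remaining N - k experts are never called, which is where the
    per-token compute saving comes from.
    """
    chosen, weights = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in zip(chosen, weights):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


def load_balancing_loss(batch_router_logits, alpha=0.01):
    """Switch-Transformer-style auxiliary loss for top-1 dispatch.

    f_i: fraction of tokens whose top-1 expert is i.
    P_i: mean router probability assigned to expert i over the batch.
    Loss = alpha * N * sum_i f_i * P_i (smallest when routing is uniform).
    """
    n_tokens = len(batch_router_logits)
    n_experts = len(batch_router_logits[0])
    probs = [softmax(logits) for logits in batch_router_logits]

    f = [0.0] * n_experts
    for p in probs:
        top1 = max(range(n_experts), key=lambda i: p[i])
        f[top1] += 1.0 / n_tokens

    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]
    return alpha * n_experts * sum(fi * Pi for fi, Pi in zip(f, P))
```

Usage, with toy experts that just scale their input: `moe_forward([1.0, 2.0], [0.0, 5.0, 1.0, 2.0], experts, k=2)` calls only experts 1 and 3. In the loss, a batch where every token prefers the same expert scores higher than one where tokens spread evenly, which is the pressure that discourages expert collapse. A real implementation would batch this on tensors and add capacity limits and cross-device dispatch, which this sketch omits.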
- moe
- mixture-of-experts
- sparse
- routing