Mixture of Experts: Sparse Activation for Scale

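A Mixture-of-Experts (MoE) layer replaces the dense FFN with N expert FFNs plus a learned router that sends each token to only the top-k of them (top-k routing). Because only k of the N experts run per token, per-token compute scales with k while total parameter count scales with N, so capacity can grow without a proportional increase in FLOPs. Switch Transformer uses k=1; Mixtral uses k=2 of 8 experts. An auxiliary load-balancing loss keeps the router from sending most tokens to a few experts (expert collapse). Key challenge: communication overhead in distributed settings, since experts are typically sharded across devices and tokens must be exchanged between them. Mixtral 8x7B is reported to match or outperform LLaMA-2-70B while activating about 13B of its roughly 47B parameters per token, i.e. about 1/5 of the dense model's active parameters rather than 1/4.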
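
The routing and balancing ideas above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the function names (top_k_route, moe_layer, load_balancing_loss) are hypothetical, the gate is a softmax over the k selected router logits (one common choice), and the auxiliary loss follows the Switch-style form N * sum_i f_i * P_i with the scaling coefficient left out.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k):
    # Pick the k experts with the largest router logits for one token.
    # Gate weights are a softmax over the selected logits only, so they sum to 1.
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    chosen = order[:k]
    gates = softmax([router_logits[i] for i in chosen])
    return chosen, gates

def moe_layer(x, router_logits, experts, k):
    # Run only the k chosen experts and sum their outputs weighted by the gates.
    chosen, gates = top_k_route(router_logits, k)
    out = [0.0] * len(x)
    for i, g in zip(chosen, gates):
        y = experts[i](x)
        out = [o + g * v for o, v in zip(out, y)]
    return out

def load_balancing_loss(batch_logits, num_experts):
    # Switch-style auxiliary loss (assumption: top-1 dispatch counts, no coefficient).
    # f[i]: fraction of tokens whose highest-probability expert is i.
    # p[i]: mean router probability assigned to expert i.
    n_tokens = len(batch_logits)
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for logits in batch_logits:
        probs = softmax(logits)
        best = max(range(num_experts), key=lambda i: probs[i])
        f[best] += 1.0 / n_tokens
        for i in range(num_experts):
            p[i] += probs[i] / n_tokens
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))
```

With perfectly uniform routing the loss evaluates to 1, and it grows as tokens concentrate on fewer experts, which is why adding it to the training objective discourages collapse. In a real model the experts would be FFN modules and the router logits would come from a learned linear projection of the token representation.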

moe
mixture-of-experts
sparse
routing