Forge Capsule

Mixture of Experts: Sparse Activation for Scale

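A Mixture of Experts (MoE) layer replaces the dense feed-forward network (FFN) in a transformer block with N expert FFNs and a learned router. For each token the router activates only the top-k experts, so per-token compute scales with k rather than N. In principle this lets a model with about 1T total parameters run at roughly the per-token cost of a ~100B dense model. Challenges: load balancing across experts (usually encouraged with an auxiliary loss) and the all-to-all communication needed to send tokens to experts on other devices. Mixtral 8×7B (released December 2023): 8 experts per layer, top-2 routing, about 13B of 47B parameters active per token; it matches or outperforms Llama 2 70B on the benchmarks reported in the source paper.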
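The routing step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Mixtral's implementation: the function name `moe_layer`, the weight shapes, and the ReLU expert FFN are all hypothetical (Mixtral's experts use SwiGLU), and the auxiliary loss follows the Switch Transformer formulation (N times the sum over experts of dispatch fraction times mean router probability), which is one common choice rather than the only one. The gates are a softmax over the k selected logits.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def moe_layer(tokens, router_w, expert_w1, expert_w2, top_k=2):
    """Sparse MoE forward pass (illustrative sketch).

    tokens:    (T, d)     token representations
    router_w:  (d, N)     router weights, one column per expert
    expert_w1: (N, d, h)  first FFN matrix of each expert
    expert_w2: (N, h, d)  second FFN matrix of each expert
    Returns (output, chosen expert indices, auxiliary load-balancing loss).
    """
    n_experts = router_w.shape[1]
    logits = tokens @ router_w                                  # (T, N)

    # Pick the k highest-scoring experts per token.
    top_idx = np.argsort(logits, axis=-1)[:, -top_k:]           # (T, k)
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)   # (T, k)
    gates = softmax(top_logits, axis=-1)                        # rows sum to 1

    out = np.zeros_like(tokens)
    for e in range(n_experts):
        # Tokens that selected expert e, and which of their k slots it fills.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        # Assumption: a plain ReLU FFN stands in for the real expert.
        hidden = np.maximum(tokens[token_ids] @ expert_w1[e], 0.0)
        out[token_ids] += gates[token_ids, slot][:, None] * (hidden @ expert_w2[e])

    # Switch-Transformer-style auxiliary loss (assumed formulation):
    # equals 1.0 when routing is perfectly uniform, larger when skewed.
    probs = softmax(logits, axis=-1)
    frac_tokens = np.bincount(top_idx.ravel(), minlength=n_experts) / top_idx.size
    mean_prob = probs.mean(axis=0)
    aux_loss = n_experts * float(np.sum(frac_tokens * mean_prob))
    return out, top_idx, aux_loss


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, d, h, N = 16, 8, 32, 8   # 8 experts, top-2, as in the Mixtral example
    x = rng.normal(size=(T, d))
    y, idx, aux = moe_layer(
        x,
        rng.normal(size=(d, N)),
        rng.normal(size=(N, d, h)) * 0.1,
        rng.normal(size=(N, h, d)) * 0.1,
        top_k=2,
    )
    print(y.shape, idx.shape, aux)
```

Only k of the N expert FFNs run for any given token, which is where the gap between total and active parameters comes from. In a distributed setting the per-expert loop becomes the all-to-all exchange: tokens are sent to the devices holding their selected experts and the results are sent back.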

Source: https://arxiv.org/abs/2401.04088
