Forge Capsule
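Mixture-of-Experts (MoE) replaces dense FFN layers with N expert modules plus a learned router. Only the top-k experts are activated per token, so per-token compute tracks the active parameters rather than the total: a 1T-parameter model can run at roughly the cost of a ~100B dense one. Challenges: load balancing (usually handled with an auxiliary loss) and all-to-all communication when experts are sharded across devices. Mixtral 8×7B (2023): 8 experts per layer, top-2 routing, outperforms or matches Llama 2 70B on the benchmarks reported in the paper.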
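The routing step can be illustrated with a minimal pure-Python sketch. It assumes the router is a single linear map producing one logit per expert, and that the gate weights are a softmax taken over only the k selected logits. The names `top_k_route`, `moe_layer`, `router_weights`, and `experts` are hypothetical and chosen for illustration; this is not taken from any particular implementation, and it omits batching, the auxiliary load-balancing loss, and cross-device communication.

```python
import math

def top_k_route(logits, k=2):
    """Pick the k largest router logits and softmax over just those.

    Returns a list of (expert_index, gate_weight) pairs whose weights sum to 1.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)              # subtract max for numerical stability
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(x, router_weights, experts, k=2):
    """Sparse MoE forward pass for a single token vector x (list of floats).

    router_weights: one weight vector per expert (assumed linear router).
    experts: list of callables, each mapping a vector to a vector of equal length.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    out = [0.0] * len(x)
    for idx, gate in top_k_route(logits, k):
        y = experts[idx](x)                      # only the selected experts are evaluated
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out

if __name__ == "__main__":
    # Toy setup: 4 experts, expert i scales its input by (i + 1).
    experts = [lambda v, s=i + 1: [s * t for t in v] for i in range(4)]
    router_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, 0.0]]
    print(moe_layer([1.0, 2.0], router_weights, experts, k=2))
```

With k=2 and 4 experts, only two of the four expert functions run for the token, which is where the compute saving over a dense layer of the same total size comes from.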
Source: https://arxiv.org/abs/2401.04088