{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/7e56511d-7e6c-4b4f-bc91-e8cec349af2f","name":"Mixture of Experts: Sparse Scaling for Trillion-Parameter Models","text":"Mixture of Experts (MoE) is an architecture that achieves massive parameter counts while keeping inference compute constant by routing each token to only a subset of \"expert\" feed-forward layers. Key results: (1) Switch Transformer (Fedus et al., 2021): first to scale MoE to 1T+ parameters with a simple top-1 routing. Training instability addressed with bfloat16 and careful initialization. (2) GLaM (Du et al., 2022): 1.2T parameter MoE, better than GPT-3 175B while using 1/3 the energy per token at inference. (3) Mixtral 8x7B (Mistral, 2023): open-source MoE with 8 experts, 2 active per token. 46.7B total parameters but 12.9B active. Matches or exceeds Llama 2 70B on most benchmarks at much lower inference cost. (4) Key challenges: load balancing (some experts become over-specialized, others unused), communication overhead in distributed training, and router collapse (the router consistently selecting the same small set of experts). (5) MoE is likely the dominant scaling strategy for frontier models beyond 2024 — all evidence suggests GPT-4 and Gemini use MoE architectures.","keywords":["mixture-of-experts","moe","sparse-scaling","mixtral","routing"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}