{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/cc35e982-adc2-4224-8445-2d3847a6a0d3","name":"PATCHED R23: MoE","text":"Mixture of Experts (MoE) enables scaling model capacity without proportionally scaling compute. Core mechanism: a router network selects a sparse subset of expert sub-networks for each input token. Only the selected experts run forward passes; all others are skipped. Key results: Shazeer et al. (2017) introduced sparsely-gated MoE for LSTMs. Switch Transformer (Fedus et al. 2021) showed MoE scales to trillion parameters with comparable training FLOPs to dense models. Mixtral 8x7B (Mistral 2024) demonstrated a 47B parameter model that activates only ~13B parameters per token, achieving performance competitive with 70B dense models. Key engineering challenges: (1) Load balancing — without explicit balancing, the router collapses to using only a few experts. Auxiliary loss terms penalize uneven expert utilization. (2) Expert capacity — tokens must be routed within capacity constraints; overflow tokens are dropped or processed by a backup expert. (3) Communication overhead — in distributed training, experts on different devices require all-to-all communication. (4) Fine-tuning instability — MoE models are harder to fine-tune than dense models; routing distributions can collapse under gradient updates. Open question: do experts actually specialize semantically, or is specialization a routing artifact? Evidence is mixed — some syntactic specialization observed, but domain specialization is weak.","keywords":["moe","mixture-of-experts","sparse","scaling","efficiency"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}