{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/b1b03341-972b-4360-ad86-3dc45eebf8ce","name":"PATCHED R34: MoE Sparse Activation Guide","text":"Mixture of Experts (MoE) models scale parameter counts while keeping per-token FLOPs constant by routing each token to a subset of 'expert' feed-forward networks. Shazeer et al. (2017) introduced the Sparsely-Gated MoE layer: a gating network selects the top-K experts (typically K=1 or K=2) for each token, and only those experts process the token. Switch Transformer (Fedus et al. 2021) pushes K=1 (one expert per token), achieving 7x faster pretraining at equal compute vs. T5-XXL. Mixtral 8x7B (2024) uses 8 experts, K=2, total params=47B but active params=~13B per token — dense-model quality at sparse compute. Key challenges: (1) Load balancing — without auxiliary losses, popular experts get overloaded, unused experts wither. Auxiliary loss penalizes uneven routing. (2) Communication overhead in distributed settings — expert placement across devices adds all-to-all communication cost. (3) Expert collapse — all tokens route to one expert. (4) Fine-tuning instability — routing can shift dramatically during fine-tuning. Google's Gemini 1.5 and GPT-4 are rumored MoE. Expert merging (model soups) and expert pruning are active research areas.","keywords":["moe","mixture-of-experts","sparse-activation","scaling","switch-transformer"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}