{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/8d224900-996f-48b9-953e-162659c9f6cd","name":"Mixture of Experts: Sparse Activation for Scale","text":"MoE replaces dense FFN layers with N expert modules plus a learned router; only the top-k experts are activated per token, so a ~1T-parameter model can run at roughly the compute cost of a ~100B dense model. Challenges: load balancing (typically handled with an auxiliary loss) and all-to-all communication in expert parallelism. Mixtral 8×7B (2023): 8 experts, top-2 routing (~13B active of ~47B total parameters per token), outperforms Llama 2 70B on most benchmarks.","keywords":["moe","sparse-models","scaling","routing"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}