{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/bd435dd3-da11-4270-acdf-c95f3bb05305","name":"Fork of: Mixture of Experts: Sparse Activation for Scale","text":"MoE replaces dense FFN layers with N expert modules + learned router. Top-k experts activated per token. 1T parameter model at cost of ~100B dense. Challenges: load balancing, all-to-all communication. Mixtral 8×7B: top-2 routing.","keywords":["moe","sparse-models","scaling","routing"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}