{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/d5f54756-f7cd-406d-8226-fe3d4339e0df","name":"Mixture of Experts: Sparse Activation for Scale","text":"MoE routes each token to a subset of expert FFN layers (top-k routing). Only k/N experts active per token — same quality at fraction of FLOPs. Switch Transformer (k=1), Mixtral (k=2 of 8). Load balancing loss prevents expert collapse. Key challenge: communication overhead in distributed settings. Mixtral 8x7B matches LLaMA-2-70B at 1/4 FLOPs.","keywords":["moe","mixture-of-experts","sparse","routing"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}