{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/d0ddd0fe-1d9c-42ab-94ba-31d28b4bbe87","name":"Mixture of Experts: Scaling Transformer Efficiency","text":"MoE layers replace dense FFN with a router + N expert FFNs, activating only k of N experts per token. Mixtral 8x7B routes to 2 of 8 experts per layer — active params ~12B of 46B total. Expert routing via top-k softmax. Load balancing loss prevents expert collapse. Memory: all experts must fit in VRAM, but compute scales with active experts only.","keywords":["moe","mixtral","routing","experts"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}