Forge Capsule
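MoE layers replace the dense FFN with a router plus N expert FFNs and activate only k of the N experts per token. Mixtral 8x7B routes each token to 2 of 8 experts per layer, so roughly 13B of its ~47B total parameters are active per token. Routing is a top-k softmax: the router scores every expert, keeps the k highest, and normalizes their weights with a softmax. An auxiliary load-balancing loss helps prevent expert collapse, where the router sends most tokens to a few experts. Memory: all experts must fit in VRAM, but per-token compute scales with the active experts only.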
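The routing and balancing steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not Mixtral's implementation: the experts are plain ReLU FFNs, the softmax is taken over the selected top-k logits, and the auxiliary loss follows one common form (Switch-Transformer style, N * sum(f_e * p_e)), since the note does not say which form is meant. All function names (`top_k_route`, `moe_forward`, `load_balancing_loss`, `make_expert`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_route(logits, k):
    """logits: (tokens, n_experts). Returns (idx, weights), each (tokens, k)."""
    # Indices of the k largest logits per token (order within the k is arbitrary).
    idx = np.argpartition(logits, -k, axis=-1)[:, -k:]
    top_logits = np.take_along_axis(logits, idx, axis=-1)
    # Assumption: softmax over the selected logits only, so weights sum to 1.
    return idx, softmax(top_logits, axis=-1)

def moe_forward(x, w_router, experts, k):
    """x: (tokens, d_model); w_router: (d_model, n_experts); experts: callables."""
    logits = x @ w_router
    idx, weights = top_k_route(logits, k)
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        # Tokens that selected expert e, and which of their k slots it occupies.
        token_ids, slot = np.nonzero(idx == e)
        if token_ids.size == 0:
            continue  # this expert does no compute for this batch
        out[token_ids] += weights[token_ids, slot][:, None] * expert(x[token_ids])
    return out, logits, idx

def load_balancing_loss(logits, idx, n_experts):
    """Assumed form: n_experts * sum_e f_e * p_e; equals 1.0 when perfectly balanced."""
    probs = softmax(logits, axis=-1)
    # f_e: fraction of routing assignments that went to expert e.
    f = np.bincount(idx.ravel(), minlength=n_experts) / idx.size
    # p_e: mean router probability assigned to expert e.
    p = probs.mean(axis=0)
    return n_experts * np.sum(f * p)

def make_expert(rng, d_model, d_ff):
    # Hypothetical simplified expert: two-layer ReLU FFN.
    w1 = rng.normal(scale=0.02, size=(d_model, d_ff))
    w2 = rng.normal(scale=0.02, size=(d_ff, d_model))
    return lambda h: np.maximum(h @ w1, 0.0) @ w2

rng = np.random.default_rng(0)
n_experts, k, d_model, d_ff, tokens = 8, 2, 16, 32, 64
experts = [make_expert(rng, d_model, d_ff) for _ in range(n_experts)]
w_router = rng.normal(size=(d_model, n_experts))
x = rng.normal(size=(tokens, d_model))

out, logits, idx = moe_forward(x, w_router, experts, k)
print(out.shape, load_balancing_loss(logits, idx, n_experts))
```

Each expert only runs on the tokens routed to it, which is why compute tracks k rather than N, while every expert's weights still have to be held in memory.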