Mixture of Experts: Scaling Transformer Efficiency

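An MoE layer replaces the dense FFN with a router plus N expert FFNs and activates only k of the N experts for each token. Mixtral 8x7B routes each token to 2 of 8 experts per layer, so about 12.9B of its 46.7B total parameters are active per token (the total is below 8 x 7B because only the FFN blocks are replicated; attention and embedding weights are shared). Routing is top-k: a linear router scores every expert, the k highest-scoring experts are selected, and in Mixtral a softmax over the selected logits gives the mixing weights. An auxiliary load-balancing loss keeps the router from collapsing onto a few experts. Memory: all experts must be resident (typically in VRAM) because any token may select any of them, but per-token compute scales with the k active experts only.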
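The routing and balancing steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not Mixtral's implementation: the names `top_k_route`, `moe_layer`, and `load_balance_loss` are hypothetical, each expert is a plain two-matrix ReLU FFN (Mixtral uses SwiGLU experts), and the auxiliary loss follows the Switch-Transformer-style form N * sum_e f_e * P_e, with f_e computed over all k routing slots.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_route(x, w_router, k):
    """Pick k experts per token; return indices, mixing weights, full router probs."""
    logits = x @ w_router                               # (tokens, n_experts)
    top_idx = np.argsort(-logits, axis=-1)[:, :k]       # k highest-scoring experts
    top_logits = np.take_along_axis(logits, top_idx, axis=-1)
    weights = softmax(top_logits, axis=-1)              # softmax over the selected k only
    return top_idx, weights, softmax(logits, axis=-1)

def moe_layer(x, w_router, experts, k=2):
    """Sparse MoE forward pass: each token is processed by only its k experts."""
    top_idx, weights, probs = top_k_route(x, w_router, k)
    out = np.zeros_like(x)
    for e, (w_in, w_out) in enumerate(experts):
        # Tokens that selected expert e, and which of their k slots it occupies.
        token_ids, slot = np.nonzero(top_idx == e)
        if token_ids.size == 0:
            continue
        h = np.maximum(x[token_ids] @ w_in, 0.0)        # ReLU FFN (assumption)
        gate = weights[token_ids, slot][:, None]
        out[token_ids] += gate * (h @ w_out)
    return out, top_idx, probs

def load_balance_loss(top_idx, probs, n_experts):
    """Auxiliary loss (assumed Switch-style form); equals 1.0 for perfectly uniform routing."""
    counts = np.bincount(top_idx.ravel(), minlength=n_experts)
    f = counts / top_idx.size        # fraction of routing slots sent to each expert
    p = probs.mean(axis=0)           # mean router probability per expert
    return n_experts * np.sum(f * p)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens, d_model, d_ff, n_experts = 16, 8, 32, 8
    x = rng.normal(size=(tokens, d_model))
    w_router = rng.normal(size=(d_model, n_experts))
    experts = [(rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model)))
               for _ in range(n_experts)]
    out, top_idx, probs = moe_layer(x, w_router, experts, k=2)
    print(out.shape, load_balance_loss(top_idx, probs, n_experts))
```

The loop visits every expert's weights but multiplies only the tokens routed to it, which mirrors the memory/compute split described above: all N experts are held in memory, while each token pays for k FFN evaluations.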

moe
mixtral
routing
experts