Forge Capsule
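Mixture of Experts (MoE) routes each token to a subset of expert FFN layers via top-k routing. Only k of N experts run per token, so per-token compute scales with k while total parameter count scales with N. Examples: Switch Transformer (k=1), Mixtral (k=2 of 8). An auxiliary load-balancing loss prevents expert collapse, where the router sends most tokens to a few experts. Key challenge: communication overhead in distributed settings, since tokens must be sent to whichever devices host their selected experts. Mixtral 8x7B matches or beats LLaMA-2-70B on most reported benchmarks while activating about 13B of its roughly 47B parameters per token, i.e. roughly 1/4 to 1/5 of the per-token FLOPs.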
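To make the routing and balancing ideas concrete, here is a minimal pure-Python sketch for a single MoE layer. It is an illustration under stated assumptions, not any model's actual implementation: the function names (`top_k_route`, `moe_layer`, `load_balancing_loss`) are hypothetical, gate weights are a softmax over the k selected logits, and the auxiliary loss follows the Switch-Transformer-style form N * sum_i f_i * P_i, with f_i normalized by k as an assumed generalization to k > 1.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts for one token.

    Returns [(expert_index, gate_weight), ...]. Gate weights are a softmax
    over the selected logits only (an assumption made for this sketch).
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, router_logits, experts, k):
    """Gate-weighted sum of the k chosen experts' outputs.

    Only the selected experts are called; the other N - k cost nothing
    for this token.
    """
    out = [0.0] * len(token)
    for idx, weight in top_k_route(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out


def load_balancing_loss(batch_logits, k):
    """Auxiliary loss N * sum_i f_i * P_i over a batch of tokens.

    f_i: fraction of routing assignments that went to expert i.
    P_i: mean router probability assigned to expert i.
    Equals 1.0 under perfectly uniform routing and grows as routing
    concentrates on a few experts.
    """
    n_experts = len(batch_logits[0])
    counts = [0] * n_experts
    prob_sums = [0.0] * n_experts
    for logits in batch_logits:
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
        for idx, _ in top_k_route(logits, k):
            counts[idx] += 1
    n_tokens = len(batch_logits)
    f = [c / (n_tokens * k) for c in counts]
    p_mean = [s / n_tokens for s in prob_sums]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p_mean))
```

In a real system the experts are FFN blocks sharded across devices, so the loop in `moe_layer` becomes a dispatch-and-combine step over the network, which is where the communication overhead arises.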