Forge Capsule

PATCHED R34: MoE Sparse Activation Guide

Mixture of Experts (MoE) models scale parameter count while keeping per-token FLOPs roughly constant by routing each token to a small subset of 'expert' feed-forward networks. Shazeer et al. (2017) introduced the Sparsely-Gated MoE layer: a gating network selects the top-K experts (typically K=1 or K=2) for each token, and only those experts process the token. Switch Transformer (Fedus et al. 2021) simplifies routing to K=1 (one expert per token) and reports up to 7x faster pretraining than T5-Base at equal compute, along with a 4x speedup over T5-XXL. Mixtral 8x7B (2024) uses 8 experts with K=2: total params=47B but active params=~13B per token, aiming for dense-model quality at sparse compute cost. Key challenges: (1) Load balancing: without auxiliary losses, popular experts get overloaded, unused...
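The routing described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation from any of the cited papers: the names `top_k_route`, `moe_layer`, and `load_balance_loss` are hypothetical, the gate renormalizes softmax probabilities over the selected experts (one common variant), and the auxiliary loss follows the Switch Transformer form alpha * N * sum_i(f_i * P_i) for top-1 routing, with `alpha=0.01` chosen here only as an example value.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(gate_logits, k):
    """Pick the k highest-probability experts for one token.

    Returns (expert_index, weight) pairs; weights are renormalized
    so they sum to 1 over the selected experts.
    """
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_layer(x, gate_logits, experts, k):
    """Run only the selected experts and mix their outputs by gate weight."""
    routes = top_k_route(gate_logits, k)
    out = [0.0] * len(x)
    for idx, weight in routes:
        y = experts[idx](x)  # experts that were not selected are never called
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, routes


def load_balance_loss(batch_gate_logits, num_experts, alpha=0.01):
    """Switch-style auxiliary loss for top-1 routing (illustrative alpha).

    f[i]: fraction of tokens whose top-1 expert is i.
    p[i]: mean router probability assigned to expert i.
    """
    n_tokens = len(batch_gate_logits)
    f = [0.0] * num_experts
    p = [0.0] * num_experts
    for logits in batch_gate_logits:
        probs = softmax(logits)
        top1 = max(range(num_experts), key=lambda i: probs[i])
        f[top1] += 1.0 / n_tokens
        for i in range(num_experts):
            p[i] += probs[i] / n_tokens
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, p))


if __name__ == "__main__":
    # Four toy "experts": expert i scales its input by (i + 1).
    experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]
    out, routes = moe_layer([1.0, 2.0], [1.0, 3.0, 2.0, 0.0], experts, k=2)
    print("selected experts:", [i for i, _ in routes])
    print("output:", out)
```

With K=2, only two of the four toy experts are evaluated for the token, which is the source of the gap between total and active parameters. The auxiliary loss grows when both the dispatch fractions and the router probabilities concentrate on the same experts, so adding it to the training objective nudges the router toward a more even spread.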

Source: https://arxiv.org/abs/2101.03961
