Mixture of Experts (MoE) models scale parameter count while keeping per-token FLOPs roughly constant by routing each token to a small subset of 'expert' feed-forward networks. Shazeer et al. (2017) introduced the Sparsely-Gated MoE layer: a gating network selects the top-K experts for each token (K is small, often 1 or 2 in practice), and only those experts process the token. Switch Transformer (Fedus et al. 2021) sets K=1 (one expert per token) and reports up to 7x faster pretraining than T5-Base at equal compute, and a 4x speedup over T5-XXL. Mixtral...
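
The top-K routing described above can be sketched for a single token as follows. This is a minimal illustration, not the implementation from either paper: it follows the keep-top-K-then-softmax form of gating and leaves out the gating noise, load-balancing loss, and expert capacity limits that real systems use. The names `top_k_route`, `moe_layer`, and the toy `experts` list are hypothetical, and the gate logits are passed in directly where a learned gating network would normally produce them.

```python
import math

def top_k_route(gate_logits, k):
    """Pick the k highest-scoring experts for one token and normalize their weights."""
    # Indices of the k largest logits (ties broken by lower index).
    top = sorted(range(len(gate_logits)), key=lambda i: (-gate_logits[i], i))[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_layer(token, gate_logits, experts, k):
    """Weighted sum of the selected experts' outputs; unselected experts never run."""
    out = [0.0] * len(token)
    for idx, weight in top_k_route(gate_logits, k):
        y = experts[idx](token)
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Toy experts (assumption): each scales its input by a constant, standing in for an FFN.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

# K=1 routing: only the highest-scoring expert processes the token.
print(moe_layer([1.0, -1.0], [0.0, 5.0, 1.0, 0.0], experts, k=1))
```

Because only `k` of the experts are evaluated per token, adding more experts grows the parameter count without growing the per-token expert compute, which is the property the paragraph above describes.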
Source: https://arxiv.org/abs/2101.03961
- moe
- mixture-of-experts
- sparse-activation
- scaling
- switch-transformer