PATCHED R23: MoE

Mixture of Experts (MoE) scales model capacity (parameter count) without proportionally scaling per-token compute. Core mechanism: a router network selects a sparse subset of expert sub-networks for each input token. Only the selected experts run a forward pass for that token; all others are skipped. Key results: Shazeer et al. (2017) introduced the sparsely-gated MoE layer for LSTMs. Switch Transformer (Fedus et al. 2021) showed that MoE scales to over a trillion parameters while keeping per-token training FLOPs comparable to those of a much smaller dense model. Mixtral 8x7B (Mistral...
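
The routing mechanism described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation from any of the cited papers: the function name `moe_forward`, the parameter names, the single-matrix `tanh` experts (standing in for full feed-forward blocks), and the choice to softmax over only the selected logits are all hypothetical. It also omits load-balancing losses and capacity limits.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_forward(tokens, router_w, expert_ws, k=2):
    """Sparse top-k MoE layer (illustrative sketch).

    tokens:    (n_tokens, d_model) input activations
    router_w:  (d_model, n_experts) router weights (hypothetical name)
    expert_ws: (n_experts, d_model, d_model) one weight matrix per toy expert
    Returns the layer output and the number of tokens each expert processed.
    """
    logits = tokens @ router_w                           # (n_tokens, n_experts)
    topk = np.argsort(-logits, axis=-1)[:, :k]           # indices of the k highest-scoring experts
    topk_logits = np.take_along_axis(logits, topk, axis=-1)
    gates = softmax(topk_logits, axis=-1)                # mixing weights over selected experts only

    out = np.zeros_like(tokens)
    calls = np.zeros(len(expert_ws), dtype=int)
    for e, w in enumerate(expert_ws):
        # Rows (tokens) that selected expert e, and which top-k slot it occupies.
        token_idx, slot = np.nonzero(topk == e)
        if token_idx.size == 0:
            continue                                     # expert skipped entirely: no compute spent
        calls[e] = token_idx.size
        expert_out = np.tanh(tokens[token_idx] @ w)      # toy expert: one linear map + tanh
        out[token_idx] += gates[token_idx, slot][:, None] * expert_out
    return out, calls

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 4))
router_w = rng.normal(size=(4, 8))
expert_ws = rng.normal(size=(8, 4, 4))
out, calls = moe_forward(tokens, router_w, expert_ws, k=2)
```

In this sketch the total number of expert evaluations is always `n_tokens * k`, regardless of how many experts exist, which is the sense in which capacity grows without a proportional increase in compute.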

Source: https://arxiv.org/abs/2101.03961

moe
mixture-of-experts
sparse
scaling
efficiency