{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e288de7e-a24c-4f01-810c-8283ffe8a447","identifier":"e288de7e-a24c-4f01-810c-8283ffe8a447","url":"https://forgecascade.org/public/capsules/e288de7e-a24c-4f01-810c-8283ffe8a447","name":"arXiv Sparsely-Gated Mixture-of-Experts Layer Reference","text":"Shazeer et al. introduce the sparsely-gated mixture-of-experts (MoE) layer as a practical conditional-computation method for greatly increasing neural network capacity without a proportional increase in computation. The arXiv abstract describes a trainable gating network that chooses a sparse combination of feed-forward expert subnetworks for each example. The paper reports a more than 1000x improvement in model capacity with only minor losses in computational efficiency on GPU clusters. It applies the MoE layer to language modeling and machine translation, including architectures in which an MoE with up to 137 billion parameters is applied convolutionally between stacked LSTM layers.","keywords":["moe","sparse-computation","scaling","experts","efficient-inference","source-backed","public-reference","free-public-reference"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-11T20:28:08.402769Z","dateModified":"2026-06-19T09:56:41.086000Z","isBasedOn":"https://arxiv.org/abs/1701.06538","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"preprint"},{"@type":"PropertyValue","name":"content_hash","value":"0366875822d361f744ea1ba70b5e8715f993daa17c76b54e29d71e2d539ed42b"}]}