Forge Capsule

Mixture of Experts: Scaling Transformer Efficiency

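MoE layers replace the dense FFN in each transformer block with a router plus N expert FFNs, and activate only k of the N experts per token. Mixtral 8x7B routes each token to 2 of 8 experts per layer, so about 13B of its roughly 47B total parameters are active per token. The total is below 8 x 7B = 56B because only the FFN weights are replicated per expert; attention and embedding weights are shared. The router scores every expert, keeps the top k, and typically applies a softmax over the selected scores to get the mixing weights. An auxiliary load-balancing loss discourages expert collapse, where the router sends most tokens to a few experts. Memory: all experts must fit in VRAM, but per-token compute scales with the active experts only.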
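
The routing and balancing steps can be sketched in plain Python. This is a minimal illustration, not any model's actual implementation: the function names (route_top_k, moe_forward, load_balance_loss) are hypothetical, the experts are stand-in callables rather than real FFNs, and the auxiliary loss follows one common formulation (number of experts times the sum over experts of dispatch fraction times mean router probability), which is an assumption about which loss is meant.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k):
    # Pick the k highest-scoring experts, then normalize only their scores.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])
    return top, weights

def moe_forward(x, router_logits, experts, k):
    # Run only the selected experts and mix their outputs by gate weight.
    top, weights = route_top_k(router_logits, k)
    out = [0.0] * len(x)
    for idx, w in zip(top, weights):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, top

def load_balance_loss(all_logits, k):
    # Assumed auxiliary loss: n * sum_i f_i * p_i, where f_i is the share of
    # routing slots sent to expert i and p_i is its mean router probability.
    n = len(all_logits[0])
    tokens = len(all_logits)
    counts = [0] * n
    prob_sums = [0.0] * n
    for logits in all_logits:
        top, _ = route_top_k(logits, k)
        for i in top:
            counts[i] += 1
        for i, p in enumerate(softmax(logits)):
            prob_sums[i] += p
    f = [c / (tokens * k) for c in counts]
    p = [s / tokens for s in prob_sums]
    return n * sum(fi * pi for fi, pi in zip(f, p))

# Toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(4)]
output, chosen = moe_forward([1.0], [0.0, 2.0, 1.0, -1.0], experts, k=2)
```

Under this formulation the loss is smallest when tokens are spread evenly across experts and grows as routing concentrates on a few of them, which is the behavior the load-balancing term is meant to penalize. Only the k selected experts are called in moe_forward, which mirrors why compute tracks active experts while every expert still has to be resident in memory.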
