Forge Capsule
Transformer (Vaswani et al. 2017): encoder stack + decoder stack. Encoder layer: multi-head self-attention + FFN. Decoder layer: masked self-attention + cross-attention + FFN. Pre-training: BERT (MLM + NSP), GPT (causal LM), T5 (text-to-text). Scaling laws (Hoffmann et al. 2022): compute-optimal training uses tokens ≈ 20× params. Emergent abilities: chain-of-thought and in-context learning are reported to appear mainly above ~10B params. RLHF: reward model + PPO fine-tuning. DPO: direct preference optimization, which trains on preference pairs directly and bypasses the reward model. Mixtral: sparse MoE, 8×7B with top-2 routing, 46.7B total / 12.9B active params per token.
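Worked arithmetic for two figures above (tokens ≈ 20× params; Mixtral 46.7B total / 12.9B active): a rough sketch, not an official accounting. The function names are hypothetical, and the layer sizes (dim 4096, 32 layers, FFN hidden size 14336, 32 query heads / 8 KV heads of size 128, vocab 32000, untied embeddings) are assumptions taken from the publicly released Mixtral 8x7B configuration, not from these notes.

```python
def compute_optimal_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Training tokens under the tokens ~= 20 x params rule of thumb."""
    return tokens_per_param * n_params


def moe_param_counts(
    dim: int = 4096,          # assumed model width
    n_layers: int = 32,       # assumed layer count
    hidden_dim: int = 14336,  # assumed FFN hidden size per expert
    n_heads: int = 32,        # assumed query heads
    n_kv_heads: int = 8,      # assumed key/value heads (grouped-query attention)
    head_dim: int = 128,      # assumed per-head size
    vocab_size: int = 32000,  # assumed vocabulary size
    n_experts: int = 8,
    top_k: int = 2,
) -> tuple[int, int]:
    """Return (total, active-per-token) parameter counts for a sparse MoE decoder."""
    # Attention: query and output projections are dim x dim,
    # key and value projections are smaller because of grouped-query attention.
    attn = 2 * dim * (n_heads * head_dim) + 2 * dim * (n_kv_heads * head_dim)
    expert = 3 * dim * hidden_dim   # gated FFN: three weight matrices per expert
    router = dim * n_experts        # linear gate that scores the experts
    norms = 2 * dim                 # two normalization weight vectors per layer
    shared = attn + router + norms  # used by every token

    embeddings = 2 * vocab_size * dim  # input embedding + untied output head
    final_norm = dim

    total = n_layers * (shared + n_experts * expert) + embeddings + final_norm
    # With top-k routing each token passes through only top_k experts per layer.
    active = n_layers * (shared + top_k * expert) + embeddings + final_norm
    return total, active


total, active = moe_param_counts()
print(f"{total / 1e9:.1f}B total / {active / 1e9:.1f}B active")  # 46.7B total / 12.9B active
print(f"{compute_optimal_tokens(70e9) / 1e12:.1f}T tokens for a 70B model")  # 1.4T
```

Takeaway: "8×7B" is not 56B because only the FFN experts are replicated; attention, router, and embeddings are shared. Top-2 routing then touches 2 of the 8 experts per layer, which is where the gap between total and active params comes from.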