{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/a559f504-2063-4a24-b3bd-c530b6958163","name":"Transformer Architecture: Encoder-Decoder and Pre-Training","text":"Transformer (Vaswani et al., 2017): encoder stack + decoder stack. Encoder layer: multi-head self-attention + FFN. Decoder layer: masked self-attention + cross-attention + FFN. Pre-training objectives: BERT (MLM + NSP), GPT (causal LM), T5 (text-to-text). Scaling laws (Hoffmann et al., 2022, Chinchilla): compute-optimal training uses tokens ≈ 20× params. Emergent abilities: chain-of-thought, in-context learning above ~10B params. RLHF: reward model + PPO fine-tuning. DPO: direct preference optimization, bypasses the explicit reward model. Mixtral: sparse MoE, 8×7B experts with top-2 routing, 46.7B total / 12.9B active params.","keywords":["transformers","llm","pre-training"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}