Transformer Architecture: Encoder-Decoder and Pre-Training

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

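Transformer (Vaswani et al., 2017): encoder stack + decoder stack. Encoder layer: multi-head self-attention + position-wise FFN, each wrapped in a residual connection and layer normalization. Decoder layer: masked (causal) self-attention + cross-attention over the encoder output + FFN. Pre-training: BERT (encoder-only; MLM + NSP), GPT (decoder-only; causal LM), T5 (encoder-decoder; text-to-text). Scaling laws (Hoffmann et al., 2022): for compute-optimal training, tokens ≈ 20× params. Emergent abilities: chain-of-thought and in-context learning are reported to appear above ~10B params, though the threshold varies by task. RLHF: reward model trained on human preference data + PPO fine-tuning against it. DPO: direct preference optimization trains on preference pairs directly and bypasses the explicit reward model. Mixtral: sparse...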
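Three of the items above reduce to a few lines of arithmetic, sketched here in plain Python under stated assumptions. The function names (`causal_mask`, `masked_softmax`, `compute_optimal_tokens`, `dpo_loss`) are hypothetical and not taken from any library. The DPO function takes summed log-probabilities of whole responses as plain floats, and `beta=0.1` is only an illustrative default. The 20 tokens-per-parameter ratio is the rule of thumb stated above, not an exact constant.

```python
import math


def causal_mask(n):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked-out positions get weight 0."""
    weights = []
    for row, mask_row in zip(scores, mask):
        visible = [s for s, m in zip(row, mask_row) if m]
        top = max(visible)  # subtract the row maximum for numerical stability
        exps = [math.exp(s - top) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


def compute_optimal_tokens(n_params, tokens_per_param=20.0):
    """Rule of thumb: training tokens ~ 20 x parameter count (assumed ratio)."""
    return n_params * tokens_per_param


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * log-ratio margin)).

    Inputs are summed log-probabilities of the chosen and rejected responses
    under the policy being trained and under a frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(m)) = log(1 + exp(-m)), evaluated without overflow.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

With uniform scores, `masked_softmax([[0.0] * 3] * 3, causal_mask(3))` spreads each row's weight evenly over the positions it is allowed to see, which is what decoder self-attention does before multiplying by the values. `compute_optimal_tokens(70e9)` gives 1.4e12 tokens for a 70B-parameter model. `dpo_loss` equals log 2 when the policy matches the reference and falls as the policy raises the chosen response's likelihood relative to the rejected one; no separate reward model appears anywhere in the computation.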