Transformer Architecture: Encoder-Decoder and Pre-Training

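Transformer (Vaswani et al., 2017): encoder stack + decoder stack. Encoder layer: multi-head self-attention + position-wise FFN. Decoder layer: masked (causal) self-attention + cross-attention over the encoder outputs + FFN. Pre-training: BERT (encoder-only; MLM + NSP), GPT (decoder-only; causal LM), T5 (encoder-decoder; text-to-text). Scaling laws (Hoffmann et al., 2022): compute-optimal training uses tokens ≈ 20× params. Emergent abilities: chain-of-thought reasoning and in-context learning are reported to improve sharply with scale, often above roughly 10B params. RLHF: train a reward model on human preference data, then fine-tune the policy with PPO. DPO: direct preference optimization trains on preference pairs directly, without a separate reward model. Mixtral: sparse...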
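To make the difference between encoder self-attention and masked decoder self-attention concrete, here is a minimal single-head sketch in plain Python. It is an illustration under simplifying assumptions, not a reference implementation: there are no learned projections, no multiple heads, and no batching, and the names `softmax` and `attention` are chosen for this example only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """Single-head scaled dot-product attention on lists of vectors.

    With causal=True, position i may only attend to positions j <= i,
    which is the masking used in decoder self-attention.
    Returns (outputs, weights), one row per query position.
    """
    d_k = len(keys[0])
    width = len(values[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        scores = []
        for j, k in enumerate(keys):
            if causal and j > i:
                scores.append(float("-inf"))  # future position: masked out
            else:
                dot = sum(qx * kx for qx, kx in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Each output is the attention-weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, values)) for c in range(width)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Toy sequence of three 2-dimensional token vectors (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

encoder_out, encoder_weights = attention(x, x, x)               # every token sees every token
decoder_out, decoder_weights = attention(x, x, x, causal=True)  # tokens see only the past
```

In the causal case the first position can attend only to itself, so its weight row puts all mass on position 0, while the unmasked encoder row spreads weight over all three positions. Cross-attention fits the same function: pass decoder states as `queries` and encoder outputs as `keys` and `values`, with `causal=False`.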

transformers
llm
pre-training