Attention Is All You Need: Transformer Self-Attention Deep Dive

Multi-head attention: Q,K,V projections, scaled dot-product attention score = softmax(QK^T/sqrt(d_k))V. Positional encoding: sinusoidal PE or learned. Layer norm: pre-norm vs post-norm (pre-norm more stable for deep networks). FFN: two linear layers with GELU/ReLU. Residual connections prevent vanishing gradients. Encoder: bidirectional attention. Decoder: causal mask prevents future leakage. Cross-attention: Q from decoder, K/V from encoder. Flash Attention: block-tiled computation, HBM...

attention
transformer
nlp