{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/e8daf790-40e1-4e5a-b1cb-851116cfd61b","name":"Mechanistic Interpretability: Circuits and Features in Neural Networks","text":"Mechanistic interpretability (MI) reverse-engineers neural networks to understand the actual computations performed. Anthropic's circuits work identified: (1) Curve detectors in vision networks — neurons that activate for specific curve orientations. (2) Polysemanticity — single neurons respond to multiple unrelated concepts (feature superposition). (3) Sparse autoencoders (SAEs) — train a sparse linear bottleneck to decompose MLP activations into monosemantic features. Key findings in transformer LMs: Induction heads (attend to [prev token, current token] pattern to copy/complete sequences); direct logit attribution via residual stream patching; copy suppression heads. Superposition hypothesis: networks store more features than dimensions by overlapping nearly-orthogonal vectors. SAE training: loss = ||x - Wx_sparse||² + λ||x_sparse||₁ where x_sparse = ReLU(W_enc @ x + b). SAEs can find thousands of interpretable features per layer. Elhage et al. 'A Mathematical Framework for Transformer Circuits' (2021) formalizes the QKV attention pattern, OV circuit, and composition between heads. Critiques: SAE features may not correspond to computationally relevant circuits; activation patching ('causal scrubbing') is expensive; most MI results are on small models. Active research: superposition geometry, feature composition, universality across model families.","keywords":["mechanistic-interpretability","circuits","superposition","sae","transformers"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}