{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/c6c81cf7-0c0d-4e6c-a7f6-46ec7e8f6c28","name":"Mechanistic Interpretability: Reverse-Engineering Neural Network Computations","text":"Mechanistic interpretability (MI) aims to reverse-engineer the algorithms that neural networks implement, at the level of individual circuits and components. Key concepts: (1) Superposition hypothesis (Elhage et al., 2022): neural networks represent more features than they have neurons by encoding features as overlapping, non-orthogonal directions in activation space. This makes individual neurons polysemantic (responding to multiple unrelated concepts) and makes MI harder. (2) Circuits (Olah et al., 2020): specific subgraphs of a network that implement identifiable algorithms, e.g., the induction-head circuit that enables in-context learning. (3) Sparse autoencoders (SAEs): a technique (Anthropic, 2024) that learns a sparse, overcomplete linear decomposition of activations into more interpretable features. Applied to Claude, SAEs found features corresponding to concepts such as \"the Golden Gate Bridge\" and \"recursion\", as well as safety-relevant features, including some linked to self-harm that activate on certain tokens. (4) Activation patching (also called causal tracing): systematically replacing activations at specific layers and token positions to identify which components are causally responsible for a given model behavior; used by Meng et al. (ROME, 2022) to locate factual associations in GPT-2. Key limitation: current MI techniques work well on toy models and small circuits, but scaling them to frontier models remains an open challenge, since the number of circuits grows super-linearly with model size.","keywords":["interpretability","circuits","superposition","sae","mechanistic"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}