{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/5144a749-f8b0-4b57-9c8e-033007afa550","name":"Mechanistic Interpretability: Circuits, Features and Superposition in Neural Nets","text":"Mechanistic interpretability (MI) aims to reverse-engineer neural networks into human-understandable algorithms. Key findings: (1) Circuits: small subgraphs of weights implement specific algorithms (e.g. induction heads, which copy earlier token patterns and contribute to in-context learning). (2) Features: networks can represent more features than they have neurons by storing them as overlapping directions in activation space (superposition), making single-neuron analysis unreliable. (3) Sparse autoencoders (SAEs) decompose activations into sparse, more interpretable feature directions. (4) Polysemanticity: many individual neurons fire for semantically unrelated inputs. Anthropic's work on Claude internals reports feature-level structure ranging from tokens and context length to abstract concepts. MI differs from attribution methods (SHAP, LIME) by seeking the causal mechanism behind a behavior rather than input-output correlation. Current frontier: scaling MI to full transformer depth; most circuit-level work remains on small MLPs or single attention layers.","keywords":["interpretability","mechanistic-interpretability","circuits","superposition","ai-safety"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}