Mechanistic Interpretability: Circuits, Features and Superposition in Neural Nets

Type: KNOWLEDGE

Verification: unverified - Evidence: peer_reviewed

Quality: public

Mechanistic interpretability (MI) aims to reverse-engineer neural networks into human-understandable algorithms. Key findings: (1) Circuits: small subgraphs of weights implement specific algorithms (e.g. induction heads, which continue previously seen token patterns and contribute to in-context learning). (2) Features: networks represent concepts as directions in activation space and can pack in more features than they have neurons (superposition), making single-neuron analysis unreliable. (3) Sparse autoencoders (SAEs) decompose activations into sparse, more interpretable feature directions. (4) Polysemanticity: each...
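
To make superposition concrete, here is a minimal sketch, assuming a toy setup that is not taken from the source: three features are stored as evenly spaced unit directions in a 2-dimensional activation space. The function names (feature_directions, embed, readout) are hypothetical and chosen only for illustration. The readout is a hand-built projection followed by a ReLU, not a trained sparse autoencoder; it only shows why reading along feature directions can succeed where reading single neurons does not.

```python
import numpy as np


def feature_directions(n_features: int) -> np.ndarray:
    """Return n_features unit vectors evenly spaced on a circle (shape: n_features x 2)."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)


def embed(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Superpose feature activations into the 2-d activation space."""
    return features @ W


def readout(activations: np.ndarray, W: np.ndarray, bias: float = 0.0) -> np.ndarray:
    """Project onto every feature direction, then apply ReLU to clip negative interference."""
    return np.maximum(activations @ W.T + bias, 0.0)


# Assumption: 3 features squeezed into 2 neurons (more features than dimensions).
W = feature_directions(3)

# Each row activates exactly one feature (a maximally sparse input).
one_hot = np.eye(3)
acts = embed(one_hot, W)

# Neuron 0 is the first coordinate of the activation space. Count how many
# distinct features move it: more than one means the neuron is polysemantic.
features_driving_neuron0 = int(np.count_nonzero(np.abs(W[:, 0]) > 1e-9))

# Reading along feature directions instead of neurons recovers the sparse features.
recovered = readout(acts, W)

print("features driving neuron 0:", features_driving_neuron0)
print("recovered features:\n", np.round(recovered, 3))
```

In this toy case the off-diagonal interference between directions is negative, so the ReLU removes it entirely when only one feature is active. With several features active at once the recovery would no longer be exact, which is one reason sparsity matters for superposition.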