Mechanistic Interpretability: Circuits, Features and Superposition in Neural Nets

Mechanistic interpretability (MI) aims to reverse-engineer neural networks into human-understandable algorithms. Key findings: (1) Circuits: small subgraphs of weights implement specific algorithms (e.g. induction heads, which support in-context learning by copying the token that followed an earlier occurrence of the current token). (2) Features: networks can represent more concepts than they have neurons by storing them as overlapping directions in activation space (superposition), which makes single-neuron analysis unreliable. (3) Sparse autoencoders (SAEs) decompose activations into more interpretable feature directions. (4) Polysemanticity: each...
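
To make superposition and the SAE idea concrete, here is a minimal toy sketch in Python with NumPy. It is not taken from the summary above: the three-features-in-two-neurons setup, the function names (feature_directions, encode, sae_features, decode) and the hand-set, tied weights are all assumptions chosen for illustration. A real SAE learns its dictionary by training with a sparsity penalty; this one is constructed by hand.

```python
import numpy as np


def feature_directions(n_features: int) -> np.ndarray:
    """Hypothetical dictionary: n_features unit vectors spread evenly in 2-D."""
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (n_features, 2)


def encode(feature_values: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """Superpose features: the 2-neuron activation is a weighted sum of directions."""
    return feature_values @ directions  # (2,)


def sae_features(activation: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """SAE-style encoder with hand-set tied weights: ReLU(directions @ activation)."""
    return np.maximum(directions @ activation, 0.0)  # (n_features,)


def decode(features: np.ndarray, directions: np.ndarray) -> np.ndarray:
    """SAE-style decoder: rebuild the activation from the feature coefficients."""
    return features @ directions  # (2,)


W = feature_directions(3)        # 3 features packed into 2 neurons
x = np.array([1.0, 0.0, 0.0])    # sparse input: only feature 0 is active
h = encode(x, W)                 # the activation the network would "see"

raw_scores = W @ h               # approximately [1.0, -0.5, -0.5]: interference
f = sae_features(h, W)           # the ReLU removes the negative interference
h_hat = decode(f, W)             # reconstruction of the original activation

# Neuron 0 has a nonzero weight for every feature, so it is polysemantic.
neuron0_weights = W[:, 0]        # approximately [1.0, -0.5, -0.5]

print("activation:", h)
print("raw scores:", raw_scores)
print("recovered features:", f)
print("reconstruction:", h_hat)
```

In this toy case, reading neuron 0 on its own is ambiguous because all three features write to it, while the overcomplete ReLU readout recovers the single active feature exactly. Recovery only works this cleanly because the input is sparse; with several features active at once, the interference terms no longer cancel.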

interpretability
mechanistic-interpretability
circuits
superposition
ai-safety