Mechanistic Interpretability: Reverse-Engineering Neural Network Computations

Mechanistic interpretability (MI) aims to reverse-engineer the algorithms implemented by neural networks at the level of individual circuits and components. Key concepts: (1) Superposition hypothesis (Elhage et al., 2022): neural networks represent more features than they have neurons by encoding each feature as a direction in activation space, so that features share neurons as overlapping linear combinations. This makes individual neurons polysemantic, responding to multiple unrelated concepts, and makes MI harder because a single neuron's activation cannot be read as a single concept. (2) Circuits (Olah et al.): specific...
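The superposition idea in (1) can be illustrated with a small sketch. It is a toy under stated assumptions, not the setup of Elhage et al. (2022): the sizes (5 features, 2 neurons), the evenly spaced unit directions, and the helper names `encode`, `decode`, and `neuron_responses` are all hypothetical choices made for illustration.

```python
import math

# Hypothetical toy sizes: more features than neurons.
N_FEATURES = 5
N_NEURONS = 2

# Assumption: each feature gets one unit-length direction in the 2-D
# activation space, evenly spaced around a circle.
DIRECTIONS = [
    (math.cos(2 * math.pi * i / N_FEATURES),
     math.sin(2 * math.pi * i / N_FEATURES))
    for i in range(N_FEATURES)
]


def encode(features):
    """Map feature values to neuron activations: h = sum_i x_i * d_i."""
    return tuple(
        sum(x * d[k] for x, d in zip(features, DIRECTIONS))
        for k in range(N_NEURONS)
    )


def decode(hidden):
    """Read each feature back by projecting onto its direction, then ReLU."""
    return [
        max(0.0, sum(h * dk for h, dk in zip(hidden, d)))
        for d in DIRECTIONS
    ]


def neuron_responses(neuron):
    """Activation of one neuron when each feature is switched on alone."""
    responses = []
    for i in range(N_FEATURES):
        one_hot = [1.0 if j == i else 0.0 for j in range(N_FEATURES)]
        responses.append(encode(one_hot)[neuron])
    return responses


if __name__ == "__main__":
    # Only feature 0 is active (a sparse input).
    recovered = decode(encode([1.0, 0.0, 0.0, 0.0, 0.0]))
    print("recovered features:", [round(v, 3) for v in recovered])
    # Neuron 0 responds to several different features: it is polysemantic.
    print("neuron 0 responses:", [round(v, 3) for v in neuron_responses(0)])
```

In this sketch the active feature is read back most strongly, but features with nearby directions pick up a smaller nonzero value, which is the interference cost of packing 5 features into 2 neurons. Each neuron also responds to several features, so inspecting neurons one at a time does not isolate a concept.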

interpretability
circuits
superposition
sae
mechanistic