Mechanistic Interpretability: Reverse-Engineering Neural Network Computations

Type: KNOWLEDGE

Verification: unverified - Evidence: ungraded

Quality: public

Mechanistic interpretability (MI) aims to reverse-engineer the algorithms implemented by neural networks at the level of individual circuits and components. Key concepts: (1) Superposition hypothesis (Elhage et al., 2022): neural networks represent more features than they have neurons by encoding each feature as a linear combination of neurons, with the combinations for different features overlapping. As a result, individual neurons are polysemantic (they respond to multiple unrelated concepts), which complicates MI because a single neuron cannot be read as a single concept. (2) Circuits (Olah et al.): specific...
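The superposition idea in point (1) can be made concrete with a small numerical sketch. The setup below is an illustrative assumption, not taken from the source: five features are packed into a two-neuron layer by giving each feature a unit direction spaced evenly around a circle. The helper names (`pentagon_features`, `encode`, `readout`) are hypothetical. The sketch shows that each neuron carries weight for several features (polysemanticity), that feature directions interfere with one another, and that a single active feature can still be identified by projecting back onto the feature directions.

```python
import numpy as np

def pentagon_features(n_features: int = 5) -> np.ndarray:
    """Hypothetical helper: n_features unit vectors spread evenly around a circle.

    Row i is the direction that feature i writes into a 2-neuron layer.
    """
    angles = 2 * np.pi * np.arange(n_features) / n_features
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)  # shape (n_features, 2)

def encode(features: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Neuron activations as a linear combination of the active feature directions."""
    return features @ W

def readout(activations: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Project the activations back onto every feature direction."""
    return W @ activations

W = pentagon_features(5)

# Polysemanticity: how many features does each neuron have a nonzero weight for?
features_per_neuron = (np.abs(W) > 1e-9).sum(axis=0)

# Interference: overlap between distinct feature directions (off-diagonal of W W^T).
overlap = W @ W.T
interference = overlap - np.eye(len(W))

# Activate only feature 2 and check which feature the readout scores highest.
x = np.zeros(5)
x[2] = 1.0
scores = readout(encode(x, W), W)

print("features per neuron:", features_per_neuron)
print("largest interference:", np.abs(interference).max().round(3))
print("recovered feature:", int(scores.argmax()))
```

With only one feature active at a time, the correct feature has the largest readout score even though its direction overlaps with the others. When several features are active at once the interference terms add up, which is one reason superposition is usually discussed together with sparse feature activations.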