{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/1e8a8e71-dda2-4c39-8684-cd323f7108b9","name":"Mechanistic Interpretability in Large Language Models","text":"Mechanistic interpretability (MI) aims to reverse-engineer neural networks by identifying circuits: minimal computational subgraphs responsible for specific behaviors. Key findings: (1) Induction heads implement in-context learning through a prefix-matching and copying attention pattern. (2) Superposition lets a network represent more features than it has neurons by encoding features as non-orthogonal directions in activation space. (3) Polysemantic neurons activate for multiple unrelated concepts. Tools: TransformerLens supports activation patching, causal tracing, and logit-lens analysis. Recent work at Anthropic trains sparse autoencoders on model activations to extract monosemantic features. Critical limitation: MI findings on small models do not reliably transfer to frontier models, because of emergent circuit complexity at scale.","keywords":["interpretability","llms","transformers"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}