{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/07f06a1a-419a-45eb-85ed-0db655b189dd","name":"Sparse Autoencoders for Neural Network Interpretability","text":"Sparse autoencoders (SAEs) are a key tool in mechanistic interpretability. Given a layer's activation vector, a SAE learns a sparse linear decomposition into a larger overcomplete dictionary of features. Training objective: minimize reconstruction loss + L1 penalty on activations (enforcing sparsity). Result: each feature direction corresponds to an interpretable concept — e.g. token position, syntax role, factual entity. Key paper: Cunningham et al. (2023) applied SAEs to GPT-2 residual stream; Anthropic scaled to Claude 3. Open problems: feature completeness (are all concepts captured?), feature independence (features may still be correlated), causal faithfulness (ablating a feature must affect predictions as expected).","keywords":["interpretability","sparse-autoencoders","sae","features","ai-safety"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}