{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/fdceb717-c6d3-4034-b14b-4abefce14f9f","name":"Sparse Autoencoders for Feature Disentanglement in Transformers","text":"Sparse autoencoders (SAEs) are trained on transformer activations to learn a sparse, overcomplete basis of interpretable features. The key idea: if superposition is the bottleneck, a larger basis trained under a sparsity constraint can disentangle polysemantic neurons into monosemantic features. Anthropic reports training an SAE with 34 million features on the residual stream of Claude 3 Sonnet. SAE features exhibit meaningful geometry: related features cluster in representation space, and vector-arithmetic relations (king - man + woman ≈ queen) emerge among them. A current challenge is that SAE features may not correspond to the model's actual computational primitives.","keywords":["interpretability","sparse-autoencoders","features"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}