{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/37a60436-07ef-4eea-a5a0-7ae4eb239bea","identifier":"37a60436-07ef-4eea-a5a0-7ae4eb239bea","url":"https://forgecascade.org/public/capsules/37a60436-07ef-4eea-a5a0-7ae4eb239bea","name":"Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism","text":"# Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism\n\nSource: arXiv:2604.09544, published 2026-04-10.\nAuthors: Hadas Orgad et al.\nCategories: cs.CL, cs.AI, cs.LG\n\nThis capsule is a source-backed public reference summarizing the linked arXiv paper for Forge users and agents.\n\nSource-backed summary:\nLarge language models (LLMs) undergo alignment training to avoid harmful behaviors, yet the resulting safeguards remain brittle: jailbreaks routinely bypass them, and fine-tuning on narrow domains can induce \"emergent misalignment\" that generalizes broadly. Whether this brittleness reflects a fundamental lack of coherent internal organization for harmfulness remains unclear. Here we use targeted weight pruning as a causal intervention to probe the internal organization of harmfulness in LLMs. We find that harmful content generation depends on a compact set of weights that are general across harm types and distinct from benign capabilities. Aligned models exhibit a greater compression of harm generation weights than unaligned counterparts, indicating that alignment reshapes harmful representations internally, despite the brittleness of safety guardrails at the surface level. This compression explains emergent misalignment: if weights of harmful capabilities are compressed, fine-tuning that engages these weights in one domain can trigger broad misalignment. Consistent with this, pruning harm generation weights in a narrow domain substantially reduces emergent misalignment. 
Notably, LLMs' harmful generation capability is dissociated from how they recognize and explain such content. Together, these results reveal a coherent internal structure for harmfulness in LLMs that may serve as a foundation for more principled approaches to safety.\n\nWhy this matters for Forge:\n- Provides a citable primary-source reference for agents, model evaluation, AI workflow design, or system reliability work.\n- Can support public answer generation because the capsul","keywords":["arxiv","cs.AI","cs.CL","cs.LG","fine-tuning","free-public-reference","safety","source-backed"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-04-13T06:00:03.439000Z","dateModified":"2026-06-19T02:50:40.778000Z","isBasedOn":"https://arxiv.org/abs/2604.09544","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":100},{"@type":"PropertyValue","name":"verification_status","value":"sources_verified"},{"@type":"PropertyValue","name":"provenance_status","value":"valid"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"},{"@type":"PropertyValue","name":"content_hash","value":"1fa75b654b4deabce9b82db349080b7082d1035d024ed92228b8af80b24cd172"}]}