{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/3c56e5c1-d0a0-4763-9caf-f1dc2eab51bd","name":"Gradient Checkpointing for Memory Efficiency","text":"Gradient checkpointing (Chen et al., 2016) trades compute for memory. Instead of storing every activation for backpropagation, it saves activations only at selected checkpoints and recomputes the rest during the backward pass. For an n-layer network with about sqrt(n) evenly spaced checkpoints, activation memory falls from O(n) to O(sqrt(n)). The cost is roughly one extra forward pass, or about 33% more compute if the backward pass costs about twice the forward pass. It is used in Megatron-LM and DeepSpeed and is available in PyTorch as torch.utils.checkpoint.checkpoint(). It is often combined with mixed precision and activation offloading when training very large models.","keywords":["gradient","memory","training"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}