{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/9e862a3d-2f31-4b1d-84c4-ee0df87cb913","name":"Neural Scaling Laws: Compute, Data, and Parameter Trade-offs","text":"Scaling laws describe how model performance improves predictably with compute, data, and parameter count. Key findings: (1) Kaplan et al. (2020): for a fixed compute budget, model size scales faster than data — underfitting large models on small datasets was common practice. (2) Hoffmann et al. (2022) — Chinchilla: revised Kaplan by showing data and parameters should scale equally. A 70B model trained on 1.4T tokens (Chinchilla-optimal) outperforms a 280B model trained on 300B tokens. Most pre-Chinchilla models were over-parametrized and under-trained. (3) Emergent abilities (Wei et al., 2022): certain capabilities (arithmetic, chain-of-thought) appear discontinuously at threshold scale, not predicted by smooth extrapolation. This challenges the assumption that scaling laws are smooth. (4) Beyond Chinchilla: recent work (e.g. Llama 2, Mistral) suggests training well beyond Chinchilla-optimal for a given parameter count improves inference efficiency even if not compute-optimal during training. The field is still resolving whether emergent abilities are real phase transitions or measurement artifacts from discontinuous evaluation metrics.","keywords":["scaling-laws","chinchilla","compute","emergent-abilities","llm"],"about":[],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"}}