{"@context":"https://schema.org","@type":"CreativeWork","@id":"https://forgecascade.org/public/capsules/6e41e1e7-391c-471a-b571-845ff65e6296","identifier":"6e41e1e7-391c-471a-b571-845ff65e6296","url":"https://forgecascade.org/public/capsules/6e41e1e7-391c-471a-b571-845ff65e6296","name":"Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate","text":"# Quantifying Hyperparameter Transfer and the Importance of Embedding Layer Learning Rate\n\n**Authors:** Dayal Singh Kalra, Maissam Barkeshli\n**arXiv:** https://arxiv.org/abs/2605.21486v1\n**Published:** 2026-05-20T17:59:40Z\n\n## Abstract\nHyperparameter transfer allows extrapolating optimal optimization hyperparameters from small to large scales, making it critical for training large language models (LLMs). This is done either by fitting a scaling law to the hyperparameters or by a judicious choice of parameterization, such as Maximal Update ($μ$P), that renders optimal hyperparameters approximately scale invariant. In this paper, we first develop a framework to quantify hyperparameter transfer through three metrics: (1) the quality of the scaling law fit, (2) the robustness to extrapolation errors, and (3) the asymptotic loss penalty due to choice of parameterization. Next, we investigate through a comprehensive series of ablations why $μ$P appears to offer high-quality learning rate transfer relative to standard parameterization (SP), as existing theory is inadequate. We find that the overwhelming benefit of $μ$P relative to SP when training with AdamW arises simply from maximizing the learning rate of the embedding layer. In SP, the embedding layer learning rate acts as a bottleneck that induces training instabilities; increasing it by a factor of width to match $μ$P dramatically smooths out training while improving hyperparameter transfer. We also find that weight decay improves the scaling law fits, while, in the fixed token-per-parameter setting, it hurts the robustness of the extrapolation.","keywords":["cs.LG","cond-mat.dis-nn","cs.AI","stat.ML"],"about":[{"@type":"Thing","name":"learning disability"},{"@type":"Thing","name":"Abnormal iron deposition in mitochondria"},{"@type":"Thing","name":"vestibular system"},{"@type":"Thing","name":"learning"},{"@type":"Thing","name":"associative learning"},{"@type":"Thing","name":"ABri amyloidosis"},{"@type":"Thing","name":"Deep longitudinal plantar crease"},{"@type":"Thing","name":"NBL1"},{"@type":"Thing","name":"extensor digitorum communis"},{"@type":"Thing","name":"gingival fibromatosis-progressive deafness syndrome"},{"@type":"Thing","name":"Artificial Intelligence"},{"@type":"Thing","name":"Messaging Applications"},{"@type":"Thing","name":"Downgrade System Image"},{"@type":"Thing","name":"APT5"},{"@type":"Thing","name":"Mustard Tempest"},{"@type":"Thing","name":"SUNBURST"},{"@type":"Thing","name":"BoomBox"},{"@type":"Thing","name":"RAPIDPULSE"}],"citation":[],"isPartOf":{"@type":"Dataset","name":"Forge Cascade Knowledge Graph","url":"https://forgecascade.org"},"publisher":{"@type":"Organization","name":"Forge Cascade","url":"https://forgecascade.org"},"dateCreated":"2026-05-21T06:00:06.188000Z","dateModified":"2026-05-21T06:00:06.188000Z","isBasedOn":"https://arxiv.org/abs/2605.21486v1","additionalProperty":[{"@type":"PropertyValue","name":"trust_level","value":65},{"@type":"PropertyValue","name":"verification_status","value":"source_linked"},{"@type":"PropertyValue","name":"evidence_level","value":"primary_source"}]}