Scaling Laws & μ-Transfer
Estimated reading time: 20 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will estimate the compute-optimal model size for a given FLOP budget using Chinchilla scaling laws, work out why hyperparameters break when transferring across scales, and apply μ-Transfer rules to move from a 125M prototype to a 7B production model.
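The Chinchilla arithmetic can be sketched directly. Substituting D = 20N into C = 6ND gives C = 120N², so N = √(C/120) and D = 20N. A minimal sketch (function name is mine):

```python
import math

def chinchilla_optimal(flops: float) -> tuple[float, float]:
    """Return the compute-optimal (params, tokens) for a FLOP budget.

    Uses C = 6*N*D with the Chinchilla ratio D = 20*N, which gives
    C = 120*N^2, hence N = sqrt(C / 120) and D = 20*N.
    """
    n = math.sqrt(flops / 120.0)  # optimal parameter count
    d = 20.0 * n                  # optimal token count (~20 tokens per parameter)
    return n, d

n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n / 1e9:.1f}B parameters, D ≈ {d / 1e9:.0f}B tokens")
```

For a 10²³ FLOP budget this lands at roughly 29B parameters trained on roughly 577B tokens, which is the kind of back-of-the-envelope estimate the exercises below ask for.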
By the end, you will be able to do all of this with back-of-the-envelope arithmetic.

μ-Transfer (Maximal Update Parameterization, μP) provides rules for transferring hyperparameters from a small proxy model to a large target model. The table below summarizes what typically breaks at each jump in scale:
| Scale | What Breaks | Mitigation |
|---|---|---|
| 125M → 350M | Usually fine with same HP | Minor LR adjustment |
| 350M → 7B | LR, batch size need retuning | Use μ-Transfer |
| 7B → 70B | Everything needs retuning | μ-Transfer + careful monitoring |
| 70B → 700B | Communication becomes bottleneck | Parallelism strategy matters |
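As a concrete instance of the "Use μ-Transfer" rows above: under μP, the Adam learning rate for hidden (matrix-like) weights scales inversely with width, so the transfer multiplier is width_proxy / width_target. A minimal sketch (function name is mine; μP prescribes different rules for input/output layers, which this sketch omits):

```python
def mup_transfer_lr(lr_proxy: float, width_proxy: int, width_target: int) -> float:
    """Transfer a tuned Adam LR for hidden weights from a proxy to a target width.

    Under muP, hidden-layer Adam learning rates scale as 1/width, so the
    multiplier is width_proxy / width_target.
    """
    multiplier = width_proxy / width_target
    return lr_proxy * multiplier

# Proxy tuned at width 512 with lr=6e-4, target width 8192:
# multiplier = 512 / 8192 = 1/16, so the transferred LR is 6e-4 / 16 = 3.75e-5.
print(mup_transfer_lr(6e-4, 512, 8192))
```

The point of the rule is that you tune once on the cheap proxy and carry the result over, rather than re-sweeping learning rates at every scale.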
Case studies: Meta (Llama 2) and Google (PaLM 2).
Each question requires calculation or estimation, not just recall.
1. You have a compute budget of 10^23 FLOPs. Using the Chinchilla rule (C = 6ND, D = 20N), compute the optimal parameter count N and token count D. Express both in billions.
2. You tuned lr=6e-4 on a proxy model with width 512. Using μ-Transfer, compute the transferred LR for a target model with width 8192. Show the multiplier calculation.

Papers:
Open Questions:
Next up: most operations are memory-bound, not compute-bound. We'll learn to profile and identify the true bottleneck.