# Scaling Laws & μ-Transfer
Estimated reading time: 20 minutes
Scaling laws tell us how model performance changes with size. μ-Transfer shows how to transfer hyperparameters across scales.
This lesson connects theory to practice: how do you go from a 125M prototype to a 7B production model without wasting millions in compute?
## Learning Progression (Easy -> Hard)
Use this sequence as you read:
- Start with **Chinchilla Scaling Laws** to build core intuition and shared vocabulary.
- Move to **Why Hyperparameters Don't Transfer** to understand the mechanism behind the intuition.
- Apply the idea in **μ-Transfer: The Solution** with concrete examples and implementation details.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
- Finish with research extensions to connect today’s mental model to open problems.
## Chinchilla Scaling Laws
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
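The headline Chinchilla result (Hoffmann et al., 2022) is that, for a fixed compute budget, parameter count and training tokens should grow together, landing near 20 training tokens per parameter, with training compute approximated by C ≈ 6·N·D. The sketch below is a minimal sizing aid built only on those two approximations; the exact constants depend on the fitted scaling law and your data mix.

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Split a FLOP budget into a roughly compute-optimal (params, tokens) pair.

    Uses the standard approximations: training compute C ~ 6 * N * D, and the
    Chinchilla heuristic of ~20 training tokens per parameter.
    """
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

for budget in (1e21, 1e23, 1e25):   # FLOP budgets from prototype to frontier-scale runs
    n, d = chinchilla_optimal(budget)
    print(f"C={budget:.0e} FLOPs -> ~{n/1e9:.1f}B params, ~{d/1e12:.2f}T tokens")
```

For a given budget this returns the model size and token count that the fit predicts will minimize loss; production runs often deviate from it deliberately, as discussed under Production Reality below.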
### Instructor Lens
## Why Hyperparameters Don't Transfer
Flow bridge: Building on Chinchilla Scaling Laws, this section adds the next layer of conceptual depth.
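One way to see the mechanism: with standard parameterization, Adam-style updates have a roughly fixed per-entry size, so a single step perturbs a wide layer's output far more than a narrow layer's, and the optimal learning rate drifts downward as width grows. The toy below is a minimal sketch of that effect (not a training run); the 1/width rescaling it applies in the second column previews the μP fix in the next section.

```python
import numpy as np

def activation_shift(width, lr, rng):
    """Relative change in a hidden layer's output after one Adam-like step.

    W uses the standard 1/fan_in init; the update has per-entry magnitude `lr`
    (as Adam's normalized steps roughly do) and is correlated with the input,
    which is what real gradient updates look like.
    """
    x = rng.standard_normal(width)                      # O(1) activations
    W = rng.standard_normal((width, width)) / np.sqrt(width)
    dW = lr * np.outer(np.ones(width), np.sign(x))      # entries of size lr, aligned with x
    return np.linalg.norm(dW @ x) / np.linalg.norm(W @ x)

rng = np.random.default_rng(0)
for n in (256, 1024, 4096):
    fixed = activation_shift(n, lr=1e-3, rng=rng)             # same LR at every width
    scaled = activation_shift(n, lr=1e-3 * 256 / n, rng=rng)  # LR shrunk by 1/width
    print(f"width={n:5d}  fixed-LR shift={fixed:6.3f}  width-scaled-LR shift={scaled:6.3f}")
```

With a fixed learning rate the one-step shift grows roughly linearly with width; with the 1/width-scaled rate it stays flat, which is why the learning rate tuned on a small model is too hot for a large one.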
## μ-Transfer: The Solution
Flow bridge: Building on Why Hyperparameters Don't Transfer, this section adds the next layer of conceptual depth.
μ-Transfer (maximal update parameterization, μP) provides width-aware rules for initialization, learning rates, and output multipliers, so that hyperparameters tuned on a small proxy model remain near-optimal as the model gets wider.
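A minimal sketch of the best-known rule, assuming Adam-style optimizers: tune hyperparameters on a narrow proxy model, then scale the learning rate of hidden weight matrices by base_width / width on the target model, while embeddings, biases, and norm parameters keep the tuned value. The helper below and its name-matching heuristic are illustrative, not the `mup` library's API; the full recipe also rescales the output logits (a 1/width multiplier) and uses 1/d_head attention scaling instead of 1/sqrt(d_head).

```python
def mup_lr_groups(named_params, base_width, width, base_lr):
    """Illustrative per-parameter-group LRs following the muP rule for Adam:
    hidden matrices get base_lr * (base_width / width); vector-like params
    (embeddings, biases, norms) keep the LR tuned on the small proxy model.
    """
    hidden, vector_like = [], []
    for name, p in named_params:
        # Heuristic split: 2-D hidden matrices scale with width on both axes;
        # embedding / unembedding tables do not, so they stay in the other group.
        if getattr(p, "ndim", 0) == 2 and "embed" not in name and "lm_head" not in name:
            hidden.append(p)
        else:
            vector_like.append(p)
    # NOTE: under muP the output head also needs a 1/width logit multiplier,
    # which this sketch does not apply.
    return [
        {"params": hidden, "lr": base_lr * base_width / width},
        {"params": vector_like, "lr": base_lr},
    ]

# Usage sketch: tune base_lr on a width-256 proxy, reuse it at width 4096.
# groups = mup_lr_groups(model.named_parameters(), base_width=256, width=4096, base_lr=3e-3)
# optimizer = torch.optim.AdamW(groups)
```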
## Scale Thought Experiment
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| 125M → 350M | Usually fine with same HP | Minor LR adjustment |
| 350M → 7B | LR, batch size need retuning | Use μ-Transfer |
| 7B → 70B | Everything needs retuning | μ-Transfer + careful monitoring |
| 70B → 700B | Communication becomes bottleneck | Parallelism strategy matters |
## Production Reality
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
Meta (Llama 2):
- Trained scaling law models first (many small runs)
- Used scaling laws to predict 70B performance
- Trained longer than Chinchilla-optimal (2T tokens)
- Reasoning: inference cost dominates, worth extra training
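To put rough numbers on that tradeoff, the back-of-the-envelope sketch below compares the 2T-token run against the ~20 tokens/parameter heuristic, using the standard C ≈ 6·N·D training and ~2·N per-token inference approximations (not Meta's own accounting):

```python
# Figures from the bullets above: Llama 2 70B, trained on 2T tokens.
n_params = 70e9
d_actual = 2e12

# Chinchilla heuristic: ~20 training tokens per parameter is compute-optimal.
d_optimal = 20 * n_params
print(f"Chinchilla-optimal tokens for a 70B model: ~{d_optimal/1e12:.1f}T "
      f"(actual {d_actual/1e12:.1f}T, {d_actual/d_optimal:.1f}x longer)")

# Extra training compute paid for the longer run (training FLOPs ~ 6*N*D).
extra_train = 6 * n_params * (d_actual - d_optimal)

# How many generated tokens cost the same compute as that extra training,
# at ~2*N FLOPs per token of 70B inference.
equiv_inference_tokens = extra_train / (2 * n_params)
print(f"Extra training: ~{extra_train:.1e} FLOPs, equivalent to serving "
      f"~{equiv_inference_tokens:.1e} tokens of 70B inference")
```

The point of the exercise: the one-time training premium is fixed, while inference cost recurs with every served token, so a heavily deployed model can justify training well past the compute-optimal token count.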
Google (PaLM 2):
- Compute-optimal training
- Mixture of different data types
- Careful scaling of model vs data
## Checkpoint Questions
Use these to verify understanding before moving on:
- Can you do this without notes: Interpret Chinchilla scaling laws and compute-optimal model sizing?
- Can you do this without notes: Explain why hyperparameters don't transfer naively across scales?
- Can you do this without notes: Apply μ-Transfer principles to scale up models efficiently?
## Research Hooks
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) — The Chinchilla paper
- "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (Yang et al., 2022) — The μ-Transfer paper
Open Questions:
- Do scaling laws hold for multimodal models?
- How should we scale for reasoning vs knowledge tasks?
- Can we predict emergent capabilities from scaling laws?
Next up: Most operations are memory-bound, not compute-bound. We'll learn to profile and identify the true bottleneck.