Scaling Laws & μ-Transfer

Estimated reading time: 20 minutes

Scaling laws tell us how model performance changes with parameters, data, and compute. μ-Transfer shows how to carry hyperparameters across those scales.

This lesson connects theory to practice: how do you go from a 125M prototype to a 7B production model without wasting millions in compute?

Learning Progression (Easy → Hard)#

Use this sequence as you read:

  1. Start with Chinchilla Scaling Laws to build core intuition and shared vocabulary.
  2. Move to Why Hyperparameters Don't Transfer to understand the mechanism behind the intuition.
  3. Apply the idea in μ-Transfer: The Solution, which turns that mechanism into concrete scaling rules and implementation details.
  4. Then zoom out to the scale-level tradeoffs to see what breaks, and how to mitigate it, at larger model and system sizes.
  5. Map the concept to production constraints to understand how teams make practical tradeoffs.
  6. Finish with research extensions to connect today’s mental model to open problems.

Chinchilla Scaling Laws#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

scaling_laws.py (interactive diagram and code editor not rendered here; a sketch follows below)
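In place of the interactive scaling_laws.py example, here is a minimal stand-in sketch (my own, not the lesson's file). It uses two rule-of-thumb approximations from the Chinchilla analysis: training compute C ≈ 6·N·D for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter, which makes both N and D grow as √C.

```python
# Back-of-envelope Chinchilla sizing. Uses rule-of-thumb constants
# (C ~= 6*N*D, D_opt ~= 20*N_opt), not the paper's fitted coefficients,
# so N_opt = sqrt(C / (6 * 20)) and D_opt = 20 * N_opt.
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (optimal_params, optimal_tokens) for a training compute budget."""
    n_opt = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt

for flops in (1e21, 1e22, 1e23, 1e24):
    n, d = chinchilla_optimal(flops)
    print(f"C = {flops:.0e} FLOPs -> ~{n / 1e9:.1f}B params, ~{d / 1e12:.2f}T tokens")
```

The takeaway: doubling the compute budget should grow the model and the dataset together (each by roughly √2), not just one of them.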


Why Hyperparameters Don't Transfer#

Flow bridge: Building on Chinchilla Scaling Laws, this section adds the next layer of conceptual depth.

hp_transfer_problem.py (interactive code editor not rendered here; a sketch follows below)
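The sketch below is an illustrative stand-in for the lesson's hp_transfer_problem.py, assuming a plain Linear layer trained with Adam. It shows the core symptom: under standard parameterization, one optimizer step with a fixed learning rate moves a wide layer's output proportionally more than a narrow layer's, so the learning rate that was right on the small model is too hot on the large one.

```python
import torch

def relative_output_change(width: int, lr: float = 1e-3, batch: int = 256) -> float:
    """Relative change in a Linear layer's output after one Adam step."""
    torch.manual_seed(0)
    layer = torch.nn.Linear(width, width, bias=False)  # standard PyTorch init
    opt = torch.optim.Adam(layer.parameters(), lr=lr)
    x = torch.randn(batch, width)                      # O(1)-scale activations

    with torch.no_grad():
        y_before = layer(x).clone()

    layer(x).square().mean().backward()                # any smooth loss will do
    opt.step()

    with torch.no_grad():
        y_after = layer(x)
    return ((y_after - y_before).norm() / y_before.norm()).item()

for width in (128, 512, 2048):
    print(f"width={width:5d}  relative output change after one step: "
          f"{relative_output_change(width):.4f}")
# The same learning rate perturbs the wide layer more than the narrow one,
# which is why an LR tuned on a small model can destabilize a large one.
```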

μ-Transfer: The Solution#

Flow bridge: Building on Why Hyperparameters Don't Transfer, this section adds the next layer of conceptual depth.

μ-Transfer (maximal update parameterization, μP) provides width-aware scaling rules for initialization and per-layer learning rates, so that hyperparameters tuned on a small proxy model remain near-optimal at larger widths:

mu_transfer.py (interactive diagram and code editor not rendered here; a sketch follows below)

Step 1: Standard parameterization fails. With default PyTorch initialization and a single global learning rate, hyperparameters tuned on a small model make the large model diverge, forcing an expensive re-tuning pass at full scale.
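Here is a minimal sketch of the fix, assuming an MLP widened from a base width of 256 (this is my illustration of μP-style rules, not the paper's reference code or the authors' mup package). Hidden and readout weights get Adam learning rates scaled by base_width / width and the readout starts at zero, while input-side weights keep the base learning rate; see the μ-Transfer paper for the complete parameterization.

```python
import torch
import torch.nn as nn

def build_mup_mlp(width: int, base_width: int = 256, base_lr: float = 1e-3,
                  d_in: int = 64, d_out: int = 10):
    """Widen an MLP while rescaling init and per-group Adam LRs, μP-style."""
    inp = nn.Linear(d_in, width)                    # input weights: keep base_lr
    hidden = nn.Linear(width, width, bias=False)    # hidden weights: LR shrinks with width
    readout = nn.Linear(width, d_out, bias=False)   # output weights

    nn.init.normal_(hidden.weight, std=width ** -0.5)  # variance ~ 1/fan_in
    nn.init.zeros_(readout.weight)                     # common μP practice

    model = nn.Sequential(inp, nn.ReLU(), hidden, nn.ReLU(), readout)

    # Per-parameter-group Adam LRs: hidden and readout weights scale as
    # base_width / width; input-side weights keep the base LR.
    groups = [
        {"params": inp.parameters(), "lr": base_lr},
        {"params": hidden.parameters(), "lr": base_lr * base_width / width},
        {"params": readout.parameters(), "lr": base_lr * base_width / width},
    ]
    return model, torch.optim.Adam(groups)

# The base_lr swept at width=base_width can be reused unchanged at width=2048.
model, opt = build_mup_mlp(width=2048)
```

With these per-group rules, the learning rate found by sweeping the small proxy model transfers directly to the wide model instead of being re-tuned at full cost.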

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

Scale          What breaks                            Mitigation
125M → 350M    Usually fine with the same HPs         Minor LR adjustment
350M → 7B      LR and batch size need retuning        Use μ-Transfer
7B → 70B       Everything needs retuning              μ-Transfer + careful monitoring
70B → 700B     Communication becomes the bottleneck   Parallelism strategy matters

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

Meta (Llama 2):

  • Trained scaling law models first (many small runs)
  • Used scaling laws to predict 70B performance
  • Trained longer than Chinchilla-optimal (2T tokens)
  • Reasoning: inference cost dominates, so training past the compute-optimal point pays off (quick check after this list)
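A back-of-envelope check of that reasoning, using only the 20-tokens-per-parameter heuristic plus the 70B / 2T figures above (not Meta's internal analysis):

```python
# Rule-of-thumb check: Chinchilla-optimal data for a 70B model
# at ~20 tokens per parameter versus what Llama 2 actually used.
params = 70e9
chinchilla_tokens = 20 * params            # ~1.4e12 tokens
llama2_tokens = 2e12                       # Llama 2 70B trained on ~2T tokens

extra_training_flops = 6 * params * (llama2_tokens - chinchilla_tokens)
print(f"Chinchilla-optimal: ~{chinchilla_tokens / 1e12:.1f}T tokens")
print(f"Llama 2 used: {llama2_tokens / 1e12:.1f}T tokens "
      f"(~{llama2_tokens / chinchilla_tokens:.2f}x compute-optimal)")
print(f"Extra training compute: ~{extra_training_flops:.1e} FLOPs")
```

The extra ~0.6T tokens cost roughly 2.5e23 additional training FLOPs, a one-time price that is easy to justify when inference cost scales with model size and is paid on every request.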

Google (PaLM 2):

  • Compute-optimal training
  • Mixture of different data types
  • Careful scaling of model vs data

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you do this without notes: Interpret Chinchilla scaling laws and compute-optimal model sizing?
  2. Can you do this without notes: Explain why hyperparameters don't transfer naively across scales?
  3. Can you do this without notes: Apply μ-Transfer principles to scale up models efficiently?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) — The Chinchilla paper
  2. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (Yang et al., 2022) — The μ-Transfer paper

Open Questions:

  • Do scaling laws hold for multimodal models?
  • How should we scale for reasoning vs knowledge tasks?
  • Can we predict emergent capabilities from scaling laws?

Next up: Most operations are memory-bound, not compute-bound. We'll learn to profile and identify the true bottleneck.