
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute

  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers

  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics

  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking

  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)


Scaling Laws & μ-Transfer

Estimated reading time: 20 minutes

Previous: ← Initialization & Residual Connections
Next: Bandwidth & Profiling →

In this tutorial, you will estimate the compute-optimal model size for a given FLOP budget using Chinchilla scaling laws, work out why hyperparameters tuned at one scale break at another, and apply μ-Transfer rules to move from a 125M prototype to a 7B production model.

By the end you will be able to:

  • Compute the optimal parameter count and token count for a given compute budget
  • Predict how much loss improvement a 10x compute increase buys
  • Transfer the learning rate from a small proxy model to a large target using μ-Transfer scaling rules
💡 Core Idea

Model loss follows a power law: L(N, D) = A/N^alpha + B/D^beta + E, where E is the irreducible loss (the best any model can achieve). For a fixed compute budget C = 6ND, the Chinchilla-optimal split trains on roughly 20 tokens per parameter. 10x more parameters reduces the reducible component of loss (the A/N^alpha term) by roughly 2x, but the irreducible loss E remains, so the total loss improvement is always less than 2x.
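The "10x parameters buys roughly 2x on the reducible term" claim follows directly from the exponent. A quick check, assuming alpha ≈ 0.34 (roughly the value fitted in the Chinchilla paper):

```python
alpha = 0.34  # assumed exponent for the A/N^alpha term, roughly the Chinchilla fit

# Multiplying N by 10 divides A/N^alpha by 10**alpha:
shrink = 10 ** alpha
print(f"10x params shrinks the reducible term by {shrink:.2f}x")
```

Because the irreducible loss is untouched, the improvement in total loss is strictly smaller than this factor.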

Chinchilla Scaling Laws

⚠️ The Chinchilla Insight

GPT-3 (175B params, 300B tokens) was undertrained.

Chinchilla (70B params, 1.4T tokens) outperforms much larger models (like 280B Gopher) trained with the same compute, by spending that compute on a smaller model and more data.

Rule of thumb: Train on ~20 tokens per parameter.

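A minimal sketch of the compute-optimal split, assuming the C = 6ND cost model and the ~20 tokens-per-parameter rule above (the function name is mine, not from any library):

```python
def chinchilla_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into params N and tokens D using C = 6*N*D, D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N**2.
    n = (flops / 120) ** 0.5
    d = 20 * n
    return n, d

n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n / 1e9:.1f}B parameters, D ≈ {d / 1e9:.0f}B tokens")
# N ≈ 28.9B parameters, D ≈ 577B tokens
```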

Why Hyperparameters Don't Transfer

❌ The Transfer Problem

You tune the learning rate on a 125M model: lr = 3e-4 works great. You scale to a 7B model with the same LR: training diverges.

Why? The optimal LR depends on model width. Naive transfer wastes millions in failed experiments.

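The width dependence is easy to reproduce in a toy setting. The sketch below (my own illustration, not the page's hp_transfer_problem.py) runs plain gradient descent on a one-sample linear regression; the gradient's scale grows with input width, so a learning rate that is stable at width 256 explodes at width 4096:

```python
import numpy as np

def final_residual(width: int, lr: float, steps: int = 10) -> float:
    """Gradient descent on (w @ x - y)**2 for a single sample x of all ones."""
    x = np.ones(width)   # ||x||^2 == width, so gradient magnitude scales with width
    w = np.zeros(width)
    y = 1.0
    for _ in range(steps):
        residual = w @ x - y
        w -= lr * 2 * residual * x   # stable only if lr < 1 / width
    return abs(w @ x - y)

lr = 1e-3
print(final_residual(256, lr))    # shrinks toward 0: stable
print(final_residual(4096, lr))   # blows up: same lr diverges at larger width
```

The same fixed lr sits below the 1/width stability threshold at one width and above it at the other, which is exactly the failure mode described above.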

μ-Transfer: The Solution

μ-Transfer (hyperparameter transfer under the maximal update parametrization, μP) provides rules for carrying hyperparameters from a small proxy model to a large target:

Step 1: Standard parameterization fails. With normal PyTorch defaults, you tune on a small model, then the large model diverges. Expensive re-tuning required.
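For the learning rate specifically, the headline μ-Transfer rule for matrix-like (hidden) weights under Adam is that the optimal LR scales as 1/width, so you multiply the proxy LR by width_proxy / width_target. A sketch of just that rule (real μP also adjusts initialization scales and output multipliers, which this one-liner ignores):

```python
def mu_transfer_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Scale an LR tuned at base_width to target_width (μP rule for hidden weights)."""
    return base_lr * base_width / target_width

# LR tuned on a width-256 proxy, transferred to a width-4096 target:
print(mu_transfer_lr(3e-4, 256, 4096))  # 1.875e-05
```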

Scale Thought Experiment

| Scale | What breaks | Mitigation |
| --- | --- | --- |
| 125M → 350M | Usually fine with same HP | Minor LR adjustment |
| 350M → 7B | LR, batch size need retuning | Use μ-Transfer |
| 7B → 70B | Everything needs retuning | μ-Transfer + careful monitoring |
| 70B → 700B | Communication becomes the bottleneck | Parallelism strategy matters |

Production Reality

Meta (Llama 2):

  • Trained scaling law models first (many small runs)
  • Used scaling laws to predict 70B performance
  • Trained longer than Chinchilla-optimal (2T tokens)
  • Reasoning: inference cost dominates, worth extra training

Google (PaLM 2):

  • Compute-optimal training
  • Mixture of different data types
  • Careful scaling of model vs data

Break It: Scaling Mistakes

❌ Common Mistakes

  1. Training Chinchilla-optimal when inference cost dominates — Chinchilla minimizes training loss per FLOP, but a smaller model trained longer (like Llama 2 at 2T tokens for 70B params) may be cheaper to serve. If your model serves millions of requests, overtraining a smaller model saves more money overall.

  2. Applying scaling laws outside their fitted range — Chinchilla coefficients were fit on models up to 70B. Extrapolating to 1T+ parameters assumes the power law holds, which is not guaranteed. Emergence effects, data quality ceilings, and architecture changes can all break the extrapolation.

  3. Transferring LR without mu-Transfer scaling — A learning rate of 3e-4 that works at width 256 will cause divergence at width 4096 under standard parameterization. The mu-Transfer multiplier for this case is 256/4096 = 0.0625, giving a transferred LR of approximately 1.9e-5.
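Mistake 1 is ultimately a break-even calculation. Using the C = 6ND training cost and roughly 2N FLOPs per generated token at inference, you can estimate how many served tokens it takes before the smaller, overtrained model wins. All concrete numbers below are hypothetical stand-ins, not Llama 2's real figures, and we simply assume both models reach the same loss:

```python
def break_even_served_tokens(n_big, d_big, n_small, d_small):
    """Served tokens T where 6*n_big*d_big + 2*n_big*T == 6*n_small*d_small + 2*n_small*T."""
    extra_training = 6 * (n_small * d_small - n_big * d_big)  # overtraining overhead
    serving_savings_per_token = 2 * (n_big - n_small)
    return extra_training / serving_savings_per_token

# Hypothetical: a 120B Chinchilla-optimal model (2.4T tokens, i.e. 20 tokens/param)
# vs a 70B model overtrained on 5T tokens to (we assume) the same loss.
t = break_even_served_tokens(n_big=120e9, d_big=2.4e12, n_small=70e9, d_small=5e12)
print(f"break-even at ~{t / 1e12:.1f}T served tokens")  # break-even at ~3.7T served tokens
```

Past the break-even point, every additional served token is pure savings for the smaller model, which is the economic argument behind overtraining.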

Checkpoint Questions

Each question requires calculation or estimation, not just recall.

  1. You have a compute budget of 10^23 FLOPs. Using the Chinchilla rule (C = 6ND, D = 20N), compute the optimal parameter count N and token count D. Express both in billions.
  2. You tuned lr=6e-4 on a proxy model with width 512. Using mu-Transfer, compute the transferred LR for a target model with width 8192. Show the multiplier calculation.
  3. Meta trained Llama 2 70B on 2T tokens instead of the Chinchilla-optimal ~1.4T. Given that serving cost per token is proportional to parameter count, explain when overtraining (more tokens than optimal) is the right economic decision. Estimate the serving cost ratio of a 70B model vs a Chinchilla-optimal 120B model that achieves the same loss.

Research Hooks

Papers:

  1. "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) — The Chinchilla paper
  2. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (Yang et al., 2022) — The μ-Transfer paper

Open Questions:

  • Do scaling laws hold for multimodal models?
  • How should we scale for reasoning vs knowledge tasks?
  • Can we predict emergent capabilities from scaling laws?

Next up: Most operations are memory-bound, not compute-bound. We'll learn to profile and identify the true bottleneck.