
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute

  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers

  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics

  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking

  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)


Scaling Laws & μ-Transfer

Estimated reading time: 20 minutes

Previous: ← Initialization & Residual Connections
Next: Bandwidth & Profiling →

In this tutorial, you will estimate the compute-optimal model size for a given FLOP budget using Chinchilla scaling laws, work out why hyperparameters tuned at one scale break at another, and apply μ-Transfer rules to move from a 125M prototype to a 7B production model.

By the end you will be able to:

  • Compute the optimal parameter count and token count for a given compute budget
  • Predict how much loss improvement a 10x compute increase buys
  • Transfer the learning rate from a small proxy model to a large target using μ-Transfer scaling rules
💡 Core Idea

Model loss follows a power law: L(N, D) = A/N^alpha + B/D^beta + E, where E is the irreducible loss (the best any model can achieve). For a fixed compute budget C = 6ND, the Chinchilla-optimal split trains on roughly 20 tokens per parameter. 10x more parameters reduces the reducible component of loss (the A/N^alpha term) by roughly 2x, but the irreducible loss E remains, so the total loss improvement is always less than 2x.
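The "10x parameters buys roughly 2x on the reducible term" claim follows directly from the exponent. A quick check, assuming alpha ≈ 0.34 (roughly the value fitted in the Chinchilla paper):

```python
alpha = 0.34  # assumed exponent for the A/N^alpha term, roughly the Chinchilla fit

# Multiplying N by 10 divides A/N^alpha by 10**alpha:
shrink = 10 ** alpha
print(f"10x params shrinks the reducible term by {shrink:.2f}x")
```

Because the irreducible loss is untouched, the improvement in total loss is strictly smaller than this factor.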

Chinchilla Scaling Laws

⚠️ The Chinchilla Insight

GPT-3 (175B params, 300B tokens) was undertrained.

Chinchilla (70B params, 1.4T tokens) outperforms much larger models (like 280B Gopher) trained with the same compute, by spending that compute on a smaller model and more data.

Rule of thumb: Train on ~20 tokens per parameter.

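A minimal sketch of the compute-optimal split, assuming the C = 6ND cost model and the ~20 tokens-per-parameter rule above (the function name is mine, not from any library):

```python
def chinchilla_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into params N and tokens D using C = 6*N*D, D = 20*N."""
    # Substituting D = 20*N into C = 6*N*D gives C = 120*N**2.
    n = (flops / 120) ** 0.5
    d = 20 * n
    return n, d

n, d = chinchilla_optimal(1e23)
print(f"N ≈ {n / 1e9:.1f}B parameters, D ≈ {d / 1e9:.0f}B tokens")
# N ≈ 28.9B parameters, D ≈ 577B tokens
```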

Why Hyperparameters Don't Transfer

❌ The Transfer Problem

You tune the learning rate on a 125M model: lr = 3e-4 works great. You scale to a 7B model with the same LR: training diverges.

Why? The optimal LR depends on model width. Naive transfer wastes millions in failed experiments.

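The width dependence is easy to reproduce in a toy setting. The sketch below (my own illustration, not the page's hp_transfer_problem.py) runs plain gradient descent on a one-sample linear regression; the gradient's scale grows with input width, so a learning rate that is stable at width 256 explodes at width 4096:

```python
import numpy as np

def final_residual(width: int, lr: float, steps: int = 10) -> float:
    """Gradient descent on (w @ x - y)**2 for a single sample x of all ones."""
    x = np.ones(width)   # ||x||^2 == width, so gradient magnitude scales with width
    w = np.zeros(width)
    y = 1.0
    for _ in range(steps):
        residual = w @ x - y
        w -= lr * 2 * residual * x   # stable only if lr < 1 / width
    return abs(w @ x - y)

lr = 1e-3
print(final_residual(256, lr))    # shrinks toward 0: stable
print(final_residual(4096, lr))   # blows up: same lr diverges at larger width
```

The same fixed lr sits below the 1/width stability threshold at one width and above it at the other, which is exactly the failure mode described above.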

μ-Transfer: The Solution

μ-Transfer (hyperparameter transfer under the maximal update parametrization, μP) provides rules for carrying hyperparameters from a small proxy model to a large target:

Step 1: Standard parameterization fails. With normal PyTorch defaults, you tune on a small model, then the large model diverges. Expensive re-tuning required.
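For the learning rate specifically, the headline μ-Transfer rule for matrix-like (hidden) weights under Adam is that the optimal LR scales as 1/width, so you multiply the proxy LR by width_proxy / width_target. A sketch of just that rule (real μP also adjusts initialization scales and output multipliers, which this one-liner ignores):

```python
def mu_transfer_lr(base_lr: float, base_width: int, target_width: int) -> float:
    """Scale an LR tuned at base_width to target_width (μP rule for hidden weights)."""
    return base_lr * base_width / target_width

# LR tuned on a width-256 proxy, transferred to a width-4096 target:
print(mu_transfer_lr(3e-4, 256, 4096))  # 1.875e-05
```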

Scale Thought Experiment

| Scale | What breaks | Mitigation |
| --- | --- | --- |
| 125M → 350M | Usually fine with same HP | Minor LR adjustment |
| 350M → 7B | LR, batch size need retuning | Use μ-Transfer |
| 7B → 70B | Everything needs retuning | μ-Transfer + careful monitoring |
| 70B → 700B | Communication becomes the bottleneck | Parallelism strategy matters |

Production Reality

Meta (Llama 2):

  • Trained scaling law models first (many small runs)
  • Used scaling laws to predict 70B performance
  • Trained longer than Chinchilla-optimal (2T tokens)
  • Reasoning: inference cost dominates, worth extra training

Google (PaLM 2):

  • Compute-optimal training
  • Mixture of different data types
  • Careful scaling of model vs data

Break It: Scaling Mistakes

❌ Common Mistakes

  1. Training Chinchilla-optimal when inference cost dominates — Chinchilla minimizes training loss per FLOP, but a smaller model trained longer (like Llama 2 at 2T tokens for 70B params) may be cheaper to serve. If your model serves millions of requests, overtraining a smaller model saves more money overall.

  2. Applying scaling laws outside their fitted range — Chinchilla coefficients were fit on models up to 70B. Extrapolating to 1T+ parameters assumes the power law holds, which is not guaranteed. Emergence effects, data quality ceilings, and architecture changes can all break the extrapolation.

  3. Transferring LR without mu-Transfer scaling — A learning rate of 3e-4 that works at width 256 will cause divergence at width 4096 under standard parameterization. The mu-Transfer multiplier for this case is 256/4096 = 0.0625, giving a transferred LR of approximately 1.9e-5.
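Mistake 1 is ultimately a break-even calculation. Using the C = 6ND training cost and roughly 2N FLOPs per generated token at inference, you can estimate how many served tokens it takes before the smaller, overtrained model wins. All concrete numbers below are hypothetical stand-ins, not Llama 2's real figures, and we simply assume both models reach the same loss:

```python
def break_even_served_tokens(n_big, d_big, n_small, d_small):
    """Served tokens T where 6*n_big*d_big + 2*n_big*T == 6*n_small*d_small + 2*n_small*T."""
    extra_training = 6 * (n_small * d_small - n_big * d_big)  # overtraining overhead
    serving_savings_per_token = 2 * (n_big - n_small)
    return extra_training / serving_savings_per_token

# Hypothetical: a 120B Chinchilla-optimal model (2.4T tokens, i.e. 20 tokens/param)
# vs a 70B model overtrained on 5T tokens to (we assume) the same loss.
t = break_even_served_tokens(n_big=120e9, d_big=2.4e12, n_small=70e9, d_small=5e12)
print(f"break-even at ~{t / 1e12:.1f}T served tokens")  # break-even at ~3.7T served tokens
```

Past the break-even point, every additional served token is pure savings for the smaller model, which is the economic argument behind overtraining.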

Checkpoint Questions

Each question requires calculation or estimation, not just recall.

  1. You have a compute budget of 10^23 FLOPs. Using the Chinchilla rule (C = 6ND, D = 20N), compute the optimal parameter count N and token count D. Express both in billions.
  2. You tuned lr=6e-4 on a proxy model with width 512. Using mu-Transfer, compute the transferred LR for a target model with width 8192. Show the multiplier calculation.
  3. Meta trained Llama 2 70B on 2T tokens instead of the Chinchilla-optimal ~1.4T. Given that serving cost per token is proportional to parameter count, explain when overtraining (more tokens than optimal) is the right economic decision. Estimate the serving cost ratio of a 70B model vs a Chinchilla-optimal 120B model that achieves the same loss.

Research Hooks

Papers:

  1. "Training Compute-Optimal Large Language Models" (Hoffmann et al., 2022) — The Chinchilla paper
  2. "Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer" (Yang et al., 2022) — The μ-Transfer paper

Open Questions:

  • Do scaling laws hold for multimodal models?
  • How should we scale for reasoning vs knowledge tasks?
  • Can we predict emergent capabilities from scaling laws?

Next up: Most operations are memory-bound, not compute-bound. We'll learn to profile and identify the true bottleneck.