Adam, Warmup & Scheduling
Estimated reading time: 18 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will implement AdamW from scratch, design a cosine learning rate schedule with warmup, and calculate the correct learning rate when scaling batch size.
By the end you will be able to do all three. Start with the Adam update rule, which maintains two exponential moving averages per parameter:
m = β₁ × m + (1 - β₁) × g # First moment (mean of gradients)
v = β₂ × v + (1 - β₂) × g² # Second moment (uncentered variance of gradients)
m̂ = m / (1 - β₁ᵗ) # Bias correction
v̂ = v / (1 - β₂ᵗ) # Bias correction
θ = θ - lr × m̂ / (√v̂ + ε) # Normalized update
The key insight: dividing by √v̂ normalizes each parameter's update by the magnitude of its recent gradients, so every parameter takes a roughly unit-scale step regardless of how large or small its raw gradients are.
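The update rule above extends directly to AdamW, where weight decay is applied to the parameter directly rather than folded into the gradient. Here is a minimal single-parameter sketch (function and variable names are my own, and the default hyperparameters are illustrative):

```python
import numpy as np

def adamw_step(param, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a single parameter (or array of parameters).

    The decoupled decay on the last line is the only difference from
    plain Adam with L2 regularization: decay never passes through the
    √v̂ normalization.
    """
    m = b1 * m + (1 - b1) * grad             # first moment (mean)
    v = b2 * v + (1 - b2) * grad ** 2        # second moment (uncentered variance)
    m_hat = m / (1 - b1 ** t)                # bias correction
    v_hat = v / (1 - b2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # normalized update
    param = param - lr * weight_decay * param            # decoupled weight decay
    return param, m, v
```

Used in a loop, the caller carries `m`, `v`, and the step counter `t` (starting at 1, so bias correction is well defined) alongside each parameter, which is exactly the per-parameter state that makes Adam's memory overhead 2x the model size.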
| Scale | What Breaks | Mitigation |
|---|---|---|
| Small models | Nothing—Adam just works | Default hyperparameters |
| Large batch training | LR scaling breaks above 32K batch | LARS, LAMB optimizers |
| Very deep transformers | Warmup needs to be longer | 2000-4000 warmup steps |
| Long training runs | Adam's memory overhead adds up | 8-bit Adam, Adafactor |
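The warmup lengths in the table slot directly into a standard warmup-plus-cosine schedule. A sketch of one such schedule follows; the specific values (peak LR, step counts) are placeholders, not recommendations:

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=2000, total_steps=100_000,
               min_lr=3e-5):
    """Linear warmup to max_lr, then cosine decay down to min_lr.

    During warmup the LR ramps linearly from ~0 to max_lr, protecting the
    early steps when Adam's second-moment estimate v is still unreliable.
    After warmup, progress runs 0 -> 1 and cos(pi * progress) runs 1 -> -1,
    sweeping the LR smoothly from max_lr to min_lr.
    """
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

In practice this function would be called once per optimizer step and the result written into each parameter group's learning rate; deeper transformers simply move `warmup_steps` toward the top of the 2000-4000 range from the table.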
OpenAI (GPT-4 and Predecessors):
8-bit Adam (Memory Optimization):
Each question requires calculation or diagnosis, not just recall. For example: one parameter receives a constant gradient of 0.1 at every step for 100 steps. Another parameter's gradient alternates [1.0, -1.0, 1.0, -1.0, ...]. Using Adam defaults (β₂ = 0.999), compute the approximate effective learning rate for each. Which parameter gets larger updates, and why?

Papers:
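Questions like this one can be checked numerically with a scalar simulation of Adam's update (a sketch; the function name and defaults are mine):

```python
import math

def adam_last_update(grads, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Run Adam on one scalar parameter's gradient stream and return the
    magnitude of the final update, lr * |m_hat| / (sqrt(v_hat) + eps)."""
    m = v = 0.0
    update = 0.0
    for t, g in enumerate(grads, start=1):
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        update = lr * m_hat / (math.sqrt(v_hat) + eps)
    return abs(update)

# Constant gradient: m_hat ~ 0.1, sqrt(v_hat) ~ 0.1, so the update ~ lr.
steady = adam_last_update([0.1] * 100)
# Alternating gradient: m_hat nearly cancels while v_hat stays ~1,
# so the update is an order of magnitude smaller than lr.
alternating = adam_last_update([1.0 if t % 2 == 0 else -1.0
                                for t in range(100)])
```

The constant-gradient parameter takes full-sized steps because its mean and standard deviation match; the alternating one is heavily damped because its mean is near zero while its variance is large, which is exactly the noise-suppression behavior Adam is designed for.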
Open Questions:
Next up: Backprop isn't "chain rule backwards"—it's graph traversal. Understanding the computation graph lets you catch bugs that waste $2M training runs.