Adam, Warmup & Scheduling
Estimated reading time: 18 minutes
Adaptive learning rates and the art of the training schedule.
Adam is not "SGD but better." It's a different beast entirely—per-parameter learning rates that adapt during training. Understanding when and why to use Adam (and its variants) is essential for research engineering.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Adam Algorithm to build core intuition and shared vocabulary.
- Move to AdamW: Why Weight Decay Must Be Decoupled to understand the mechanism behind the intuition.
- Apply the idea in Learning Rate Warmup with concrete examples or implementation details.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
- Finish with research extensions to connect today’s mental model to open problems.
The Adam Algorithm#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
g = ∇L(θ)                      # Gradient at step t
m = β₁ × m + (1 - β₁) × g      # First moment (running mean of gradients)
v = β₂ × v + (1 - β₂) × g²     # Second moment (running mean of squared gradients)
m̂ = m / (1 - β₁ᵗ)              # Bias correction
v̂ = v / (1 - β₂ᵗ)              # Bias correction
θ = θ - lr × m̂ / (√v̂ + ε)      # Normalized update
The key insight: dividing by √v̂ normalizes each parameter's update by the running root-mean-square of its gradients, so every parameter takes steps of roughly the same magnitude (about lr) regardless of its raw gradient scale.
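To make the update concrete, here is a minimal NumPy sketch of a single Adam step; the function name and the toy quadratic at the end are illustrative, not a reference implementation:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for parameter array theta with gradient g (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g              # first moment estimate
    v = beta2 * v + (1 - beta2) * g**2           # second moment estimate
    m_hat = m / (1 - beta1**t)                   # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy usage: minimize ||theta - 1||² starting from theta = 0.
theta, m, v = np.zeros(4), np.zeros(4), np.zeros(4)
for t in range(1, 101):
    g = 2 * (theta - 1.0)                        # gradient of the toy loss
    theta, m, v = adam_step(theta, g, m, v, t, lr=0.1)
print(theta)                                     # approaches [1, 1, 1, 1]
```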
Instructor Lens#
AdamW: Why Weight Decay Must Be Decoupled#
Flow bridge: Building on The Adam Algorithm, this section adds the next layer of conceptual depth.
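The mechanism is easiest to see side by side. If L2 regularization is folded into the gradient, the decay term is also divided by √v̂, so parameters with large gradient histories are barely regularized; AdamW applies weight decay directly to the weights, outside the adaptive step. A minimal NumPy sketch of the contrast, using hypothetical helpers rather than the torch.optim internals:

```python
import numpy as np

def adam_l2(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term enters the gradient and is rescaled by 1/√v̂."""
    g = g + wd * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw(theta, g, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: weight decay is decoupled and applied directly to the weights."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat, v_hat = m / (1 - beta1**t), v / (1 - beta2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta - lr * wd * theta, m, v
```

In PyTorch, the weight_decay argument of torch.optim.Adam corresponds to the first form, while torch.optim.AdamW implements the decoupled second form.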
Learning Rate Warmup#
Flow bridge: Building on AdamW: Why Weight Decay Must Be Decoupled, this section adds the next layer of conceptual depth.
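As a concrete illustration of the schedules discussed here (a sketch with assumed hyperparameters, not the lesson's reference code), linear warmup followed by cosine decay can be written as a single function of the step index:

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup from ~0 to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. lr_at_step(0, 3e-4, 2000, 100_000) ≈ 1.5e-7, ramping to 3e-4 by step 1999
```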
Batch Size ↔ Learning Rate Coupling#
Flow bridge: Building on Learning Rate Warmup, this section adds the next layer of conceptual depth.
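The usual heuristics for this coupling fit in a small helper; treat the rule choice and reference values as illustrative assumptions (linear scaling is the classic SGD heuristic, square-root scaling is a common suggestion for Adam, and both break down at very large batches, as the next section's table notes):

```python
def scaled_lr(base_lr, base_batch, new_batch, rule="sqrt"):
    """Rescale a tuned learning rate when the batch size changes."""
    ratio = new_batch / base_batch
    return base_lr * (ratio if rule == "linear" else ratio ** 0.5)

# Hypothetical example: tuned at batch 256 with lr 3e-4, moving to batch 4096.
print(scaled_lr(3e-4, 256, 4096, rule="sqrt"))   # ≈ 1.2e-3
```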
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| Small models | Nothing—Adam just works | Default hyperparameters |
| Large batch training | LR scaling breaks above 32K batch | LARS, LAMB optimizers |
| Very deep transformers | Warmup needs to be longer | 2000-4000 warmup steps |
| Long training runs | Adam's memory overhead adds up | 8-bit Adam, Adafactor |
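To put numbers on the memory row, here is the back-of-envelope arithmetic, assuming fp32 optimizer states and an illustrative 7B-parameter model:

```python
params = 7e9                        # illustrative 7B-parameter model
fp32_state = params * 2 * 4         # m and v, 4 bytes each in fp32
int8_state = params * 2 * 1         # 8-bit m and v (ignoring small per-block scale factors)
print(f"fp32 Adam states:  {fp32_state / 1e9:.0f} GB")   # ~56 GB
print(f"8-bit Adam states: {int8_state / 1e9:.0f} GB")   # ~14 GB, a 75% reduction
```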
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
OpenAI (GPT-4 and Predecessors); a code sketch of this recipe follows the list:
- AdamW with β₁=0.9, β₂=0.95 (slightly lower β₂ for stability)
- Cosine decay with warmup
- Gradient clipping at 1.0
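A hedged PyTorch sketch of that recipe; the placeholder model, schedule lengths, peak learning rate, and weight decay value are illustrative assumptions, not OpenAI's actual configuration:

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

model = torch.nn.Linear(512, 512)                     # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,  # peak LR is illustrative
                        betas=(0.9, 0.95), weight_decay=0.1)

warmup_steps, total_steps = 2000, 100_000             # illustrative schedule lengths
sched = SequentialLR(
    opt,
    schedulers=[
        LinearLR(opt, start_factor=1e-3, total_iters=warmup_steps),   # warmup
        CosineAnnealingLR(opt, T_max=total_steps - warmup_steps),     # cosine decay
    ],
    milestones=[warmup_steps],
)

# Each training step, after loss.backward():
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step(); sched.step(); opt.zero_grad()
```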
8-bit Adam (Memory Optimization):
- Quantize optimizer states to 8-bit
- 75% memory reduction for optimizer states
- Minimal impact on training quality (see the usage sketch after this list)
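A minimal usage sketch, assuming the bitsandbytes library is installed and a CUDA device is available; hyperparameters are illustrative:

```python
import torch
import bitsandbytes as bnb                    # assumes bitsandbytes is installed

model = torch.nn.Linear(512, 512).cuda()      # placeholder model
opt = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4,
                          betas=(0.9, 0.95), weight_decay=0.1)
# Optimizer states m and v are stored in 8-bit blocks and dequantized on the fly,
# cutting optimizer-state memory by roughly 75% versus fp32 states.
```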
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you explain Adam's per-parameter learning rate adaptation mechanism?
- Without notes, can you implement AdamW and explain why decoupling weight decay matters?
- Without notes, can you design learning rate warmup and decay schedules for stable training?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, 2019) — The AdamW paper explaining why weight decay matters for transformers
- "On the Variance of the Adaptive Learning Rate and Beyond" (Liu et al., 2020) — RAdam adds warmup to Adam's variance estimate
Open Questions:
- Can we automatically tune warmup length based on model architecture?
- Is there an optimizer that combines Adam's adaptivity with SGD's generalization?
Next up: Backprop isn't "chain rule backwards"—it's graph traversal. Understanding the computation graph lets you catch bugs that waste $2M training runs.