Adam, Warmup & Scheduling

Estimated reading time: 18 minutes

Adaptive learning rates and the art of the training schedule.

Adam is not "SGD but better." It's a different beast entirely—per-parameter learning rates that adapt during training. Understanding when and why to use Adam (and its variants) is essential for research engineering.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with The Adam Algorithm to build core intuition and shared vocabulary.
  2. Move to AdamW: Why Weight Decay Must Be Decoupled to understand the mechanism behind the intuition.
  3. Apply the ideas in Learning Rate Warmup and Batch Size ↔ Learning Rate Coupling, where the schedules become concrete code.
  4. Then zoom out to the Scale Thought Experiment to see which parts break at larger model and system scales.
  5. Map the concepts onto Production Reality to see how teams make practical tradeoffs.
  6. Finish with the Research Hooks to connect today's mental model to open problems.

The Adam Algorithm#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

m = β₁ × m + (1 - β₁) × g           # First moment (mean of gradients)
v = β₂ × v + (1 - β₂) × g²          # Second moment (variance of gradients)
m̂ = m / (1 - β₁ᵗ)                   # Bias correction
v̂ = v / (1 - β₂ᵗ)                   # Bias correction
θ = θ - lr × m̂ / (√v̂ + ε)          # Normalized update

The key insight: dividing by √v̂ normalizes each parameter's update by the running root-mean-square of its recent gradients, so parameters with large or noisy gradients take proportionally smaller steps.
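
To make the update concrete, here is the same step as a runnable sketch in plain NumPy. The function name and default hyperparameter values are illustrative, not taken from this lesson:

import numpy as np

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a flat parameter array theta with gradient g (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g                        # first moment: EMA of gradients
    v = beta2 * v + (1 - beta2) * g ** 2                   # second moment: EMA of squared gradients
    m_hat = m / (1 - beta1 ** t)                           # bias correction for zero-initialized m
    v_hat = v / (1 - beta2 ** t)                           # bias correction for zero-initialized v
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)    # normalized update
    return theta, m, v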

AdamW: Why Weight Decay Must Be Decoupled#

Flow bridge: Building on The Adam Algorithm, this section adds the next layer of conceptual depth.

adamw_implementation.py
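
The mechanism in brief: if weight decay is folded into the gradient as an L2 term, it gets rescaled by 1/√v̂ along with the rest of the update, so parameters with a large gradient history are barely regularized at all. AdamW instead subtracts lr × λ × θ from the weights directly, keeping the decay uniform across parameters. A minimal sketch in plain Python follows; the function name and hyperparameter defaults are illustrative, not values from this lesson:

import numpy as np

def adamw_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW step: Adam's moments, but weight decay applied to the weights directly."""
    # Adam-with-L2 would instead do g = g + weight_decay * theta here,
    # letting the decay term get divided by sqrt(v_hat) along with the gradient.
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)    # adaptive gradient step
    theta = theta - lr * weight_decay * theta              # decoupled decay: not rescaled by v_hat
    return theta, m, v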

Learning Rate Warmup#

Flow bridge: Building on AdamW: Why Weight Decay Must Be Decoupled, this section adds the next layer of conceptual depth.

lr_schedule.py

Step 1: Warmup phase (0 → 1000 steps). The learning rate increases linearly from 0 to max_lr, giving Adam's moment estimates (especially the second moment v) time to stabilize before the optimizer takes full-size steps.
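
A minimal schedule sketch in plain Python: linear warmup followed by the cosine decay mentioned under Production Reality below. The warmup length matches the 1000 steps above; max_lr, total_steps, and min_lr are placeholder values.

import math

def lr_at(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps            # linear ramp: ~0 -> max_lr
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)                            # clamp once training passes total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))      # 1 -> 0 over the decay phase
    return min_lr + (max_lr - min_lr) * cosine

As a quick sanity check: lr_at(0) is near zero, lr_at(1000) returns max_lr, and lr_at(total_steps) returns min_lr.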

Batch Size ↔ Learning Rate Coupling#

Flow bridge: Building on Learning Rate Warmup, this section adds the next layer of conceptual depth.

batch_lr_scaling.py
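
Two commonly cited heuristics for re-tuning the learning rate when the batch size changes: the linear scaling rule (scale the learning rate proportionally to batch size, standard for SGD-style training) and square-root scaling, which is often suggested for Adam. A sketch, with the reference batch size and base learning rate as placeholder values:

def scaled_lr(batch_size, base_lr=3e-4, base_batch_size=256, rule="sqrt"):
    """Rescale the learning rate when the batch size changes from a reference setup."""
    ratio = batch_size / base_batch_size
    if rule == "linear":     # linear scaling rule, common for SGD
        return base_lr * ratio
    if rule == "sqrt":       # square-root scaling, often suggested for adaptive optimizers
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule}")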

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

Scale | What Breaks | Mitigation
Small models | Nothing; Adam just works | Default hyperparameters
Large batch training | LR scaling breaks above ~32K batch size | LARS, LAMB optimizers
Very deep transformers | Warmup needs to be longer | 2000-4000 warmup steps
Long training runs | Adam's memory overhead adds up | 8-bit Adam, Adafactor

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

OpenAI (GPT-4 and Predecessors), sketched in code after this list:

  • AdamW with β₁=0.9, β₂=0.95 (slightly lower β₂ for stability)
  • Cosine decay with warmup
  • Gradient clipping at 1.0
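
One way to express that recipe in PyTorch; the learning rate, weight decay, and step counts below are placeholder values, not reported GPT-4 settings:

import math
import torch

model = torch.nn.Linear(1024, 1024)     # stand-in for a real transformer
opt = torch.optim.AdamW(model.parameters(), lr=3e-4,
                        betas=(0.9, 0.95), weight_decay=0.1)   # lower beta2 for stability

warmup_steps, total_steps = 1000, 100_000

def cosine_with_warmup(step):
    """Multiplicative LR factor: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return 0.5 * (1.0 + math.cos(math.pi * progress))

sched = torch.optim.lr_scheduler.LambdaLR(opt, cosine_with_warmup)

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)   # gradient clipping at 1.0
#   opt.step(); sched.step(); opt.zero_grad()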

8-bit Adam (Memory Optimization), with a usage sketch after this list:

  • Quantize optimizer states to 8-bit
  • 75% memory reduction for optimizer states
  • Minimal impact on training quality
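
A usage sketch assuming the bitsandbytes library is installed and exposes its 8-bit optimizers under bnb.optim; the hyperparameters are placeholders:

import bitsandbytes as bnb
import torch

model = torch.nn.Linear(1024, 1024)     # stand-in for a real transformer
# Intended as a drop-in replacement for torch.optim.AdamW, with optimizer states kept in 8-bit.
opt = bnb.optim.AdamW8bit(model.parameters(), lr=3e-4, betas=(0.9, 0.95))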

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you do this without notes: Explain Adam's per-parameter learning rate adaptation mechanism?
  2. Can you do this without notes: Implement AdamW and explain why decoupling weight decay matters?
  3. Can you do this without notes: Design learning rate warmup and decay schedules for stable training?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, 2019) — The AdamW paper explaining why weight decay matters for transformers
  2. "On the Variance of the Adaptive Learning Rate and Beyond" (Liu et al., 2020) — RAdam adds warmup to Adam's variance estimate

Open Questions:

  • Can we automatically tune warmup length based on model architecture?
  • Is there an optimizer that combines Adam's adaptivity with SGD's generalization?

Next up: Backprop isn't "chain rule backwards"—it's graph traversal. Understanding the computation graph lets you catch bugs that waste $2M training runs.