
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
The Memory Wall15m
Gradient Flow Under Pressure18m
Optimizers
SGD & Momentum15m
Adam, Warmup & Scheduling18m
Gradient Mechanics
Backprop as Graph Transformation20m
Initialization & Residual Connections18m
Scaling Laws & μ-Transfer20m
Systems Thinking
Bandwidth & Profiling18m
The Debugging Flowchart22m



Adam, Warmup & Scheduling

Estimated reading time: 18 minutes


In this tutorial, you will implement AdamW from scratch, design a cosine learning rate schedule with warmup, and calculate the correct learning rate when scaling batch size.

By the end you will be able to:

  • Compute Adam's effective per-parameter learning rate from gradient statistics
  • Choose warmup length and decay schedule for a given training budget
  • Apply the linear scaling rule and identify when it breaks down
💡

Core Idea

Adam gives each parameter its own effective learning rate by dividing by the running RMS of its gradients. Parameters with noisy gradients get smaller updates; parameters with consistent gradients get larger updates. This is adaptive normalization, not just momentum.

The Adam Algorithm#

m = β₁ × m + (1 - β₁) × g           # First moment (mean of gradients)
v = β₂ × v + (1 - β₂) × g²          # Second moment (variance of gradients)
m̂ = m / (1 - β₁ᵗ)                   # Bias correction
v̂ = v / (1 - β₂ᵗ)                   # Bias correction
θ = θ - lr × m̂ / (√v̂ + ε)          # Normalized update

The key insight: dividing by √v̂ normalizes each parameter's update by the running RMS of its gradients, so update magnitudes approach lr regardless of raw gradient scale.
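This normalization can be seen numerically. The sketch below (a hypothetical helper, not from the lesson's editor) runs the moment updates above for two scalar parameters: one with steady gradients of 0.1, one with sign-alternating gradients of ±1. Despite the noisy parameter having 10× larger gradients, its update ends up far smaller:

```python
import math

def adam_update(grads, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Final Adam update for a scalar parameter fed this gradient history."""
    m = v = 0.0
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g        # first moment
        v = beta2 * v + (1 - beta2) * g * g    # second moment
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Consistent gradients: m_hat ~ 0.1, sqrt(v_hat) ~ 0.1, update magnitude ~ lr
steady = adam_update([0.1] * 100)
# Alternating gradients: m_hat is damped toward 0 while sqrt(v_hat) ~ 1,
# so the update is a small fraction of lr
noisy = adam_update([(-1.0) ** t for t in range(100)])
```

The steady parameter's update is roughly lr itself; the alternating one is over an order of magnitude smaller, which is exactly the adaptive normalization described above.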


AdamW: Why Weight Decay Must Be Decoupled#

⚠️

The AdamW Fix

Original Adam applied weight decay as part of the gradient: g = ∇L + λθ

But Adam then scales this by 1/√v. So weight decay gets scaled differently for each parameter!

AdamW fixes this: apply weight decay after the adaptive scaling: θ = θ - lr × (m̂/(√v̂ + ε) + λθ)

adamw_implementation.py
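A minimal NumPy sketch of one AdamW step, following the update rules above (the function name and defaults are illustrative, not the lesson's reference code):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update; returns updated (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad          # first moment
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied OUTSIDE the 1/sqrt(v_hat) scaling,
    # so every parameter decays at the same relative rate
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v

# Minimizing f(theta) = theta^2 for a few steps
theta = np.array([1.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 101):
    theta, m, v = adamw_step(theta, 2 * theta, m, v, t)
```

Note that `t` starts at 1: with `t = 0` the bias-correction denominators would be zero.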

Learning Rate Warmup#

💡

Why Warmup?

At random initialization, gradients point in essentially random directions. Taking large steps amplifies this randomness, causing loss spikes or divergence.

Warmup lets the gradients "settle down" before taking big steps. It's like warming up an engine before driving fast.

lr_schedule.py

Step 1: Warmup phase (0 → 1000 steps). Learning rate increases linearly from 0 to max_lr. This lets gradients stabilize before we take big steps.
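The warmup-then-cosine schedule can be sketched as follows (the 1000-step warmup matches the example above; `max_lr` and `total_steps` are assumed values):

```python
import math

def lr_at_step(step, max_lr=3e-4, warmup_steps=1000, total_steps=10000):
    """Linear warmup from 0 to max_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps    # warmup: 0 -> max_lr
    # cosine decay: max_lr -> 0 over the remaining steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this is passed to the optimizer as a multiplier per step (e.g. via a lambda-style scheduler) rather than called by hand.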

Batch Size and Learning Rate Coupling#

⚠️

Linear Scaling Rule

When you increase batch size, gradient noise decreases. You can take bigger steps:

lr_new = lr_base × (batch_new / batch_base)

This works up to a point (~32K batch size for ImageNet), then breaks down.

batch_lr_scaling.py
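A minimal sketch of the rule (the helper name and the 32K cap default are illustrative; the cap is the rough empirical limit mentioned above, not a hard constant):

```python
def scaled_lr(lr_base, batch_base, batch_new, max_batch=32768):
    """Linear scaling rule: lr_new = lr_base * (batch_new / batch_base)."""
    if batch_new > max_batch:
        # Above ~32K the rule breaks down; consider LARS/LAMB instead
        raise ValueError("batch size beyond the linear-scaling regime")
    return lr_base * (batch_new / batch_base)

# Doubling batch size doubles the learning rate
new_lr = scaled_lr(lr_base=3e-4, batch_base=512, batch_new=1024)
```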

Scale Thought Experiment#

Scale | What Breaks | Mitigation
Small models | Nothing, Adam just works | Default hyperparameters
Large batch training | LR scaling breaks above ~32K batch | LARS, LAMB optimizers
Very deep transformers | Warmup needs to be longer | 2000-4000 warmup steps
Long training runs | Adam's memory overhead adds up | 8-bit Adam, Adafactor

Production Reality#

OpenAI (GPT-4 and Predecessors):

  • AdamW with β₁=0.9, β₂=0.95 (slightly lower β₂ for stability)
  • Cosine decay with warmup
  • Gradient clipping at 1.0

8-bit Adam (Memory Optimization):

  • Quantize optimizer states to 8-bit
  • 75% memory reduction for optimizer states
  • Minimal impact on training quality

Break It: Common Optimizer Failures#

❌

Common Mistakes

  1. Forgetting bias correction — Without the 1 - beta^t correction, early updates are biased toward zero. This causes slow starts or requires longer warmup to compensate.

  2. Applying weight decay inside the gradient — Standard L2 regularization gets scaled by Adam's 1/sqrt(v). Embedding parameters (large v) get almost no decay. Use AdamW, not Adam + L2.

  3. Scaling LR without adjusting warmup — When doubling batch size and LR, you also need to scale warmup steps. Otherwise the larger LR hits before gradients stabilize.

Checkpoint Questions#

Each question requires calculation or diagnosis, not just recall.

  1. A model has a parameter whose gradient has been [0.1, 0.1, 0.1] for 100 steps. Another parameter has gradient [1.0, -1.0, 1.0, -1.0, ...] alternating. Using Adam defaults (beta2=0.999), compute the approximate effective learning rate for each. Which parameter gets larger updates and why?
  2. You are training a 7B model with base config batch=256, lr=3e-4, warmup=2000 steps. You scale to batch=2048. Compute the new learning rate using the linear scaling rule. Should warmup steps change?
  3. A training run shows loss NaN at step 50 (before warmup ends). The learning rate at step 50 is 5e-6. Identify the most likely cause and the first two fixes to try.

Research Hooks#

Papers:

  1. "Decoupled Weight Decay Regularization" (Loshchilov & Hutter, 2019) — The AdamW paper explaining why weight decay matters for transformers
  2. "On the Variance of the Adaptive Learning Rate and Beyond" (Liu et al., 2020) — RAdam explains warmup as compensating for the high variance of Adam's adaptive learning rate early in training, and rectifies that variance directly

Open Questions:

  • Can we automatically tune warmup length based on model architecture?
  • Is there an optimizer that combines Adam's adaptivity with SGD's generalization?

Next up: Backprop isn't "chain rule backwards"—it's graph traversal. Understanding the computation graph lets you catch bugs that waste $2M training runs.