SGD & Momentum
Estimated reading time: 15 minutes
The simplest optimizers reveal the deepest truths about loss landscapes.
SGD isn't just "gradient descent"—the noise is a feature, not a bug. This lesson builds intuition for how optimizers navigate loss surfaces.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Loss Landscape Isn't Convex to build core intuition and shared vocabulary.
- Move to Vanilla SGD to understand the mechanism behind the intuition.
- Apply the idea in Momentum: Smoothing the Ride with concrete examples and implementation details.
- Challenge your understanding in the failure-mode section and check what breaks first.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
The Loss Landscape Isn't Convex#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
Real neural network loss surfaces have:
- Saddle points, which outnumber local minima as dimensionality grows
- Flat plateaus where gradients are nearly zero
- Narrow ravines with steep walls and a gently sloping floor
- Many near-equivalent minima that can differ in how well they generalize
Vanilla SGD#
Flow bridge: Building on The Loss Landscape Isn't Convex, this section adds the next layer of conceptual depth.
The simplest update rule:
θ = θ - lr × ∇L(θ)
Each step uses the gradient computed on the current minibatch rather than the full dataset. The resulting gradient noise acts as implicit regularization, nudging the iterate out of sharp minima and past saddle points.
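The update rule above can be sketched on a toy problem. The quadratic loss, noise scale, and learning rate below are illustrative assumptions standing in for a real model and real minibatch sampling:

```python
import numpy as np

# Toy loss L(theta) = 0.5 * ||theta||^2, so the true gradient is theta itself.
rng = np.random.default_rng(0)

def minibatch_grad(theta):
    # True gradient plus noise, standing in for minibatch sampling.
    return theta + 0.1 * rng.normal(size=theta.shape)

theta = np.array([2.0, -3.0])
lr = 0.1
for _ in range(100):
    theta = theta - lr * minibatch_grad(theta)  # theta <- theta - lr * grad

print(np.linalg.norm(theta))  # small, but hovers at a noise floor, never exactly 0
```

Note that the iterate never settles exactly at the minimum: the minibatch noise keeps it jittering in a neighborhood whose size scales with the learning rate.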
Momentum: Smoothing the Ride#
Flow bridge: Building on Vanilla SGD, this section adds the next layer of conceptual depth.
Momentum accumulates gradients over time, like a ball rolling downhill:
v = β × v + ∇L(θ) # Accumulate velocity
θ = θ - lr × v # Update with velocity
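The two-line update translates directly into code. This is a minimal sketch on a 1-D quadratic (the loss, lr, and β values are illustrative, not tuned):

```python
def sgd_momentum_step(theta, v, grad, lr=0.05, beta=0.9):
    """One heavy-ball momentum update: v is an exponentially weighted sum of gradients."""
    v = beta * v + grad      # accumulate velocity
    theta = theta - lr * v   # step along the smoothed direction
    return theta, v

# Toy loss L(theta) = 0.5 * theta^2, so grad = theta.
theta, v = 5.0, 0.0
for _ in range(200):
    theta, v = sgd_momentum_step(theta, v, grad=theta)
print(theta)  # converges toward 0
```

Reading β as "trust in the accumulated direction": at β = 0.9 the velocity is roughly a sum over the last ~10 gradients, so consistent directions compound while noisy components cancel.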
Break It: Momentum Gone Wrong#
Flow bridge: Now that the core mechanism is clear, stress-test it under realistic failure conditions.
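The characteristic failure is overshoot: with β close to 1, accumulated velocity carries the iterate past the minimum and it oscillates for a long time. A toy comparison on a 1-D quadratic (illustrative values, not a real training run):

```python
import numpy as np

def run(beta, lr=0.1, steps=300, theta0=5.0):
    """Heavy-ball momentum on L(theta) = 0.5 * theta^2 (grad = theta)."""
    theta, v = theta0, 0.0
    traj = [theta]
    for _ in range(steps):
        v = beta * v + theta
        theta = theta - lr * v
        traj.append(theta)
    return np.array(traj)

plain = run(beta=0.0)   # vanilla SGD: decays monotonically toward 0
heavy = run(beta=0.99)  # too much momentum: overshoots and rings

print("sign changes, beta=0.0 :", int(np.sum(np.diff(np.sign(plain)) != 0)))
print("sign changes, beta=0.99:", int(np.sum(np.diff(np.sign(heavy)) != 0)))
```

With β = 0 the iterate never crosses zero; with β = 0.99 it repeatedly overshoots the minimum, which is exactly the oscillation failure mode to visualize.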
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| Small models | Nothing—vanilla SGD often sufficient | Keep it simple |
| Large batches | Noise reduction hurts generalization | Scale LR with batch size, add warmup |
| Deep networks | Gradient scale varies across layers | Per-layer momentum (rare) or Adam |
| Very long training | Momentum can perpetuate stale directions | Momentum warmup, schedule β |
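One mitigation from the table, momentum warmup with a scheduled β, might be sketched as a linear ramp (the endpoint and warmup-length values here are hypothetical):

```python
def beta_schedule(step, warmup_steps=1000, beta_min=0.5, beta_max=0.9):
    """Linearly ramp momentum from beta_min to beta_max over warmup_steps,
    so early noisy gradients are not trusted as strongly as later ones."""
    if step >= warmup_steps:
        return beta_max
    frac = step / warmup_steps
    return beta_min + frac * (beta_max - beta_min)
```

Starting with a lower β limits how far early, poorly-scaled gradients can propagate into the velocity; once training stabilizes, the full β takes over.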
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
Why SGD Still Wins Sometimes:
- Vision models (ResNets) often train better with SGD + momentum
- Lower memory than Adam (no per-parameter state)
- With good LR schedules, matches or beats Adam
Google's Recipe for Vision:
- SGD + momentum (β=0.9)
- Cosine LR schedule
- Heavy data augmentation
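The cosine schedule in that recipe follows a standard formula; this sketch uses illustrative base_lr and step counts, not the exact production configuration:

```python
import math

def cosine_lr(step, total_steps, base_lr=0.1, min_lr=0.0):
    """Cosine decay from base_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

The schedule starts at base_lr, decays slowly at first, fastest in the middle, and flattens out near min_lr at the end of training.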
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you explain why minibatch noise helps escape local minima and saddle points?
- Without notes, can you implement momentum and interpret β as "trust in the accumulated direction"?
- Without notes, can you sketch the oscillation and overshooting failure modes?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., 2013) — Shows that well-tuned momentum (including Nesterov momentum) combined with careful initialization can train deep networks previously thought to require second-order methods
- "The Marginal Value of Adaptive Gradient Methods" (Wilson et al., 2017) — Argues SGD+momentum generalizes better than Adam in some cases
Open Questions:
- Why does SGD+momentum sometimes generalize better than Adam despite slower convergence?
- Can we get Adam's adaptivity with SGD's generalization?
Next up: Adam takes a different approach—per-parameter learning rates that adapt during training.