SGD & Momentum

Estimated reading time: 15 minutes

The simplest optimizers reveal the deepest truths about loss landscapes.

SGD isn't just "gradient descent"—the noise is a feature, not a bug. This lesson builds intuition for how optimizers navigate loss surfaces.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with The Loss Landscape Isn't Convex to build core intuition and shared vocabulary.
  2. Move to Vanilla SGD to understand the mechanism behind the intuition.
  3. Apply the idea in Momentum: Smoothing the Ride, which works through the concrete update rule and examples.
  4. Stress-test your understanding in Break It: Momentum Gone Wrong and check what breaks first.
  5. Zoom out in the Scale Thought Experiment to see how the same concept holds at larger model and system sizes.
  6. Finish with Production Reality to understand how teams make practical tradeoffs.

The Loss Landscape Isn't Convex#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

Real neural network loss surfaces are not convex bowls. They have local minima, saddle points, flat plateaus, and narrow curved valleys, all at once, and the optimizer has to navigate every one of these features.
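
To make this concrete, here is a minimal NumPy sketch of a one-dimensional non-convex "loss" with two basins of different depth. The function and every constant are illustrative choices, not values from this lesson; the point is only that plain gradient descent started in the shallow basin has no reason to leave it.

```python
import numpy as np

def toy_loss(theta):
    # A 1-D non-convex "loss": two basins separated by a bump, the left one deeper.
    return theta**4 - 3 * theta**2 + theta

grid = np.linspace(-2.5, 2.5, 10001)
values = toy_loss(grid)

# Local minima: grid points that are lower than both of their neighbors.
is_min = (values[1:-1] < values[:-2]) & (values[1:-1] < values[2:])
minima = grid[1:-1][is_min]
print("approximate minima:", minima)            # roughly -1.30 and +1.13
print("loss at the minima:", toy_loss(minima))  # the left basin is clearly deeper
```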

Vanilla SGD#

Flow bridge: Building on The Loss Landscape Isn't Convex, this section adds the next layer of conceptual depth.

The simplest update rule:

θ = θ - lr × ∇L(θ)

Each step uses the gradient on the current minibatch. The minibatch noise provides implicit regularization.

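A minimal sketch of the kind of experiment sgd_visualization.py points at: vanilla SGD on a toy anisotropic bowl, with Gaussian noise standing in for minibatch noise. The loss, noise scale, and step counts below are illustrative assumptions, not values from the lesson.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(theta):
    # Gradient of a simple anisotropic bowl: L = 0.5 * (x^2 + 10 * y^2).
    return np.array([theta[0], 10.0 * theta[1]])

def sgd(theta, lr=0.05, noise=0.5, steps=200):
    """Vanilla SGD: theta <- theta - lr * noisy_gradient."""
    for _ in range(steps):
        g = grad(theta) + noise * rng.normal(size=2)  # minibatch noise, modeled as Gaussian
        theta = theta - lr * g
    return theta

theta = sgd(np.array([2.0, 2.0]))
print("final theta:", theta)  # close to the minimum at the origin, but still jittering around it
```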

Momentum: Smoothing the Ride#

Flow bridge: Building on Vanilla SGD, this section adds the next layer of conceptual depth.

Momentum accumulates gradients over time, like a ball rolling downhill:

v = β × v + ∇L(θ)     # Accumulate velocity
θ = θ - lr × v         # Update with velocity
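
Along the lines of momentum_comparison.py, here is a minimal sketch that runs the update above with β = 0 (vanilla SGD) and β = 0.9 on a toy "narrow valley" quadratic. The loss and all constants are illustrative assumptions; what matters is the relative behavior of the two runs.

```python
import numpy as np

def grad(theta):
    # Narrow-valley quadratic: shallow along x (curvature 1), steep along y (curvature 50).
    return np.array([theta[0], 50.0 * theta[1]])

def run(beta, lr=0.035, steps=100):
    """SGD with momentum; beta = 0.0 recovers vanilla SGD."""
    theta = np.array([5.0, 1.0])
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = beta * v + grad(theta)   # accumulate velocity
        theta = theta - lr * v       # update with velocity
    return theta

for beta in (0.0, 0.9):
    theta = run(beta)
    print(f"beta={beta}: theta={theta}, distance to optimum={np.linalg.norm(theta):.3f}")
# With the same lr and step budget, the momentum run ends noticeably closer to the optimum:
# the cross-valley (y) components largely cancel while progress along x accumulates.
```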

Step 1: Without momentum, SGD zigzags. Each gradient points "across" the narrow valley rather than "along" it, so vanilla SGD bounces back and forth between the valley walls and makes slow progress toward the minimum.

Break It: Momentum Gone Wrong#

Flow bridge: Now that the core mechanism is clear, stress-test it under realistic failure conditions.

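Along the lines of break_momentum.py, here is a minimal sketch (same toy quadratic as above; all constants are illustrative) of two classic failure modes: β set too high, and a learning rate that the accumulated velocity can no longer tolerate.

```python
import numpy as np

def loss(theta):
    return 0.5 * (theta[0] ** 2 + 50.0 * theta[1] ** 2)

def grad(theta):
    return np.array([theta[0], 50.0 * theta[1]])

def run(beta, lr, steps=8):
    """Heavy-ball momentum on the toy quadratic; returns the loss after each step."""
    theta = np.array([5.0, 1.0])
    v = np.zeros_like(theta)
    history = []
    for _ in range(steps):
        v = beta * v + grad(theta)
        theta = theta - lr * v
        history.append(round(loss(theta), 1))
    return history

# beta too high: the "ball" keeps overshooting, so the loss bounces up and down
# instead of decreasing steadily.
print("beta=0.99, lr=0.035:", run(beta=0.99, lr=0.035))
# lr too large for the steep direction: the accumulated velocity blows up and the loss diverges.
print("beta=0.90, lr=0.200:", run(beta=0.90, lr=0.200))
```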

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

| Scale | What breaks | Mitigation |
| --- | --- | --- |
| Small models | Nothing; vanilla SGD is often sufficient | Keep it simple |
| Large batches | Noise reduction hurts generalization | Lower LR, increase momentum |
| Deep networks | Gradient scale varies across layers | Per-layer momentum (rare) or Adam |
| Very long training | Momentum can perpetuate stale directions | Momentum warmup, schedule β (sketched below) |
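
As a concrete reading of the last row, a momentum warmup can be written as a small schedule function. The shape and values below are illustrative assumptions, not a standard recipe.

```python
def momentum_schedule(step, warmup_steps=1000, beta_start=0.5, beta_end=0.9):
    """Linearly ramp the momentum coefficient beta during warmup, then hold it constant."""
    if step >= warmup_steps:
        return beta_end
    return beta_start + (beta_end - beta_start) * step / warmup_steps

print(momentum_schedule(0), momentum_schedule(500), momentum_schedule(2000))  # 0.5 0.7 0.9
```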

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

Why SGD Still Wins Sometimes:

  • Vision models (ResNets) often train better with SGD + momentum
  • Lower memory than Adam (no per-parameter state)
  • With good LR schedules, matches or beats Adam

Google's Recipe for Vision (sketched in code after this list):

  • SGD + momentum (β=0.9)
  • Cosine LR schedule
  • Heavy data augmentation
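
A minimal PyTorch sketch of the optimizer-and-schedule half of such a recipe. The model, learning rate, and epoch count are placeholders, and the data pipeline and augmentation are omitted entirely.

```python
import torch
from torch import nn

model = nn.Linear(512, 10)  # placeholder; a real vision recipe would use e.g. a ResNet

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=90)

for epoch in range(90):
    # ... one epoch of training: forward pass, loss, loss.backward(), optimizer.step() ...
    scheduler.step()  # anneal the learning rate along a cosine curve each epoch
```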

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you explain, without notes, why minibatch noise helps escape local minima?
  2. Can you implement momentum from memory and interpret β as "trust in direction"?
  3. Can you visualize the oscillation and overshooting failure modes?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., 2013) — Shows momentum's role in escaping saddle points
  2. "The Marginal Value of Adaptive Gradient Methods" (Wilson et al., 2017) — Argues SGD+momentum generalizes better than Adam in some cases

Open Questions:

  • Why does SGD+momentum sometimes generalize better than Adam despite slower convergence?
  • Can we get Adam's adaptivity with SGD's generalization?

Next up: Adam takes a different approach—per-parameter learning rates that adapt during training.