
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)







SGD & Momentum

Estimated reading time: 15 minutes

Previous: ← Gradient Flow Under Pressure
Next: Adam, Warmup & Scheduling →

Training a neural network means navigating a high-dimensional loss surface where flat regions stall progress and sharp minima hurt generalization. Choosing how the optimizer moves through that surface — and how much "memory" it carries — directly controls convergence speed, stability, and final model quality.

In this tutorial, you will:

  • Measure how minibatch noise changes SGD's trajectory on a test surface
  • Implement momentum from scratch and tune β to control the speed/stability tradeoff
  • Compare vanilla SGD, classical momentum, and Nesterov momentum side by side
  • Diagnose oscillation, overshooting, and divergence — the three momentum failure modes

By the end, you will be able to pick momentum settings for a training run and predict when they will break.

Why Noise Is a Feature

Real loss surfaces are not smooth bowls. They contain saddle points (gradient near zero but not a minimum), sharp local minima (low loss but poor generalization), and flat plateaus (tiny gradients, slow progress).

Full-batch gradient descent always moves toward the nearest downhill direction. That sounds optimal, but it means the optimizer gets trapped in the first minimum it reaches — often a sharp one.

Minibatch SGD sees a different loss surface on every step because each batch samples different data. That per-step noise acts as implicit regularization: it randomly kicks the optimizer out of sharp minima while flat, wide minima are stable under the noise.

💡

The Core Tradeoff

Larger batches reduce gradient noise, which speeds up each step but makes the optimizer more likely to settle in sharp minima.

Smaller batches add noise, which slows convergence per step but biases toward flatter minima that generalize better.

This is why learning rate and batch size must be tuned together — they jointly control the noise scale.

Worked example: noise scale

The effective noise in SGD scales as:

noise ∝ lr / sqrt(batch_size)

Consider two setups training a ResNet-50:

| Setup | Learning Rate | Batch Size | Relative Noise |
|-------|---------------|------------|----------------|
| A | 0.1 | 256 | 0.1 / 16 = 0.00625 |
| B | 0.4 | 4096 | 0.4 / 64 = 0.00625 |

Same effective noise under this proxy. If noise scales as lr / sqrt(batch_size), then preserving noise implies a sqrt scaling rule: when batch size scales by k, scale learning rate by sqrt(k).

In practice, some large-batch recipes use linear scaling + warmup (Goyal et al., 2017) for optimization stability, even though it does not keep lr / sqrt(batch_size) constant.
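Both rules are one-liners to compare. A quick sketch, using the noise proxy stated above (a heuristic, not a universal law); the function names are illustrative:

```python
import math

def relative_noise(lr, batch_size):
    """Noise proxy from the text: lr / sqrt(batch_size)."""
    return lr / math.sqrt(batch_size)

# Setups A and B from the table above give the same noise scale.
print(relative_noise(0.1, 256))   # 0.00625
print(relative_noise(0.4, 4096))  # 0.00625

def sqrt_scaled_lr(base_lr, base_batch, new_batch):
    """Scale lr to preserve the noise proxy when batch size changes."""
    return base_lr * math.sqrt(new_batch / base_batch)

def linear_scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule (Goyal et al., 2017), normally paired with warmup."""
    return base_lr * (new_batch / base_batch)

# Going from batch 256 to 2048 (an 8x increase), starting from lr = 0.1:
print(sqrt_scaled_lr(0.1, 256, 2048))    # ≈ 0.283
print(linear_scaled_lr(0.1, 256, 2048))  # 0.8
```

The two rules disagree by a factor of sqrt(8) here; linear scaling is the more aggressive choice and relies on warmup for stability.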

Vanilla SGD: The Baseline

The simplest update rule:

θ = θ - lr × ∇L(θ)

Each step uses only the gradient from the current minibatch. There is no memory of past gradients.

Try this: Run the cell below and compare the clean (full-batch) and noisy (minibatch) paths. Watch where they end up relative to the optimum at (1, 1).

sgd_visualization.py
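If the interactive cell does not load, here is a minimal self-contained sketch of the same experiment. Additive Gaussian gradient noise is a simplification standing in for minibatch sampling, and the hyperparameters and starting point are illustrative choices:

```python
import random

def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2; optimum at (1, 1)."""
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return dx, dy

def sgd(noise=0.0, lr=1e-3, steps=5000, seed=0):
    """Plain SGD; Gaussian noise on the gradient stands in for minibatch sampling."""
    rng = random.Random(seed)
    x, y = -1.0, 1.0  # a common Rosenbrock starting point
    for _ in range(steps):
        gx, gy = rosenbrock_grad(x, y)
        x -= lr * (gx + noise * rng.gauss(0.0, 1.0))
        y -= lr * (gy + noise * rng.gauss(0.0, 1.0))
    return x, y

print("clean (full-batch):", sgd(noise=0.0))
print("noisy (minibatch): ", sgd(noise=5.0))
```

Comparing the two printed endpoints against the optimum at (1, 1) is the point of the exercise; the clean run follows the valley deterministically while the noisy run jitters around it.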

Step 1: The Rosenbrock is adversarial for SGD. The valley is 100x steeper across than along. Each gradient step overshoots side-to-side while barely moving forward — classic zigzag.

Momentum: Adding Memory to SGD

Vanilla SGD has no memory — each step uses only the current gradient. If the gradient keeps pointing in roughly the same direction, we are wasting information by ignoring that consistency.

Momentum fixes this by maintaining a running average of past gradients (the "velocity"):

v = β × v + ∇L(θ)       # Accumulate velocity
θ = θ - lr × v           # Update parameters
⚠️

Interpreting β

Think of β as "how much do I trust my current direction?"

  • β = 0: No memory. Pure SGD — every step is independent.
  • β = 0.9: Standard. 90% history, 10% current gradient. Smooths noise effectively.
  • β = 0.99: Very high inertia. Takes ~100 steps to "forget" an old gradient. Can overshoot.

The effective learning rate with momentum is amplified by a factor of 1 / (1 - β). So momentum β = 0.9 with lr = 0.01 behaves like an effective lr of 0.1 in the steady-state direction.

Worked example: effective learning rate

When the gradient is constant (steady-state), the velocity converges to v = g / (1 - β) where g is the gradient.

| β | Amplification Factor | Effective lr (lr = 0.01) |
|------|----------------------|--------------------------|
| 0 | 1x | 0.01 |
| 0.9 | 10x | 0.10 |
| 0.95 | 20x | 0.20 |
| 0.99 | 100x | 1.00 |

This is why high momentum requires lower base learning rates. The combination of lr and β together controls the true step size.
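The v = g / (1 - β) fixed point is easy to verify numerically. A quick check of the amplification factors in the table, iterating the velocity update with a constant gradient:

```python
def steady_state_velocity(beta, g=1.0, steps=2000):
    """Iterate v = beta * v + g with a constant gradient g until near steady state."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + g
    return v

for beta in (0.0, 0.9, 0.95, 0.99):
    v = steady_state_velocity(beta)
    print(f"beta={beta}: v -> {v:.2f}  (predicted g/(1-beta) = {1.0 / (1.0 - beta):.2f})")
```

Each measured velocity matches the predicted 1x, 10x, 20x, and 100x amplification to printed precision.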

Try this: Run the cell below, then look at three things: (1) final distance to optimum, (2) oscillation count, and (3) which β gives the best tradeoff.

momentum_comparison.py
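If the cell does not load, a minimal stand-alone version of the comparison follows. It uses the same Rosenbrock surface and heavy-ball update as the lesson; the learning rate and step count are illustrative choices, tuned small enough that all three runs stay stable:

```python
def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2; optimum at (1, 1)."""
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return dx, dy

def momentum_sgd(beta, lr=1e-4, steps=20000):
    """Classical (heavy-ball) momentum: v = beta*v + g, theta -= lr*v."""
    x, y = -1.0, 1.0
    vx = vy = 0.0
    for _ in range(steps):
        gx, gy = rosenbrock_grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return x, y

for beta in (0.0, 0.5, 0.9):
    x, y = momentum_sgd(beta)
    dist = ((x - 1) ** 2 + (y - 1) ** 2) ** 0.5
    print(f"beta={beta}: final=({x:.3f}, {y:.3f}), distance to optimum={dist:.4f}")
```

The β=0.9 run benefits from the 10x effective learning rate along the valley and ends closest to (1, 1) in the same step budget.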

Step 1: No momentum zigzags. Without momentum (β=0), SGD bounces across the narrow Rosenbrock valley. The gradient points across the valley on each step, not along it, so progress toward the minimum is slow.

Nesterov Momentum: Look Before You Leap

Classical momentum updates the velocity using the gradient at the current position. Nesterov momentum (Nesterov Accelerated Gradient / NAG) evaluates the gradient at the projected position — where momentum would take you before the correction:

v = β × v + ∇L(θ - lr × β × v)   # Gradient at look-ahead position
θ = θ - lr × v

The intuition: if momentum is about to carry you too far, Nesterov "looks ahead" and applies a correction before arriving. This gives faster convergence on convex problems and often better behavior near minima.

💡

When Nesterov Helps Most

Nesterov momentum matters most in regions where the loss surface curvature changes rapidly — like approaching a minimum. Classical momentum overshoots because it computes the gradient at the old position; Nesterov computes it at the look-ahead position and can "brake" earlier.

In practice, the difference is often small for well-tuned hyperparameters, but Nesterov has stronger theoretical convergence guarantees on convex problems and is available as a one-line switch in most frameworks (e.g., nesterov=True in PyTorch's SGD).

nesterov_comparison.py
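If the cell does not load, a minimal sketch of the comparison, using the Rosenbrock surface and the look-ahead update given above (hyperparameters are illustrative):

```python
def rosenbrock_grad(x, y):
    """Gradient of f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2; optimum at (1, 1)."""
    dx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    dy = 200 * (y - x ** 2)
    return dx, dy

def run(nesterov, lr=1e-4, beta=0.9, steps=20000):
    x, y = -1.0, 1.0
    vx = vy = 0.0
    for _ in range(steps):
        if nesterov:
            # Gradient at the look-ahead point theta - lr * beta * v
            gx, gy = rosenbrock_grad(x - lr * beta * vx, y - lr * beta * vy)
        else:
            # Classical momentum: gradient at the current point
            gx, gy = rosenbrock_grad(x, y)
        vx = beta * vx + gx
        vy = beta * vy + gy
        x -= lr * vx
        y -= lr * vy
    return x, y

for nesterov in (False, True):
    x, y = run(nesterov)
    dist = ((x - 1) ** 2 + (y - 1) ** 2) ** 0.5
    print(f"nesterov={nesterov}: final=({x:.3f}, {y:.3f}), distance={dist:.4f}")
```

With well-behaved hyperparameters like these, expect the two endpoints to be close; Nesterov's advantage shows up mainly as less ringing when you push lr or β toward instability.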

Break It: Three Momentum Failure Modes

Momentum introduces a new dimension of failure beyond vanilla SGD. Here are the three ways momentum breaks, in order from most to least common.

❌

Failure Mode Checklist

1. Oscillation (too much lr for the curvature)

  • Symptom: loss bounces up and down without decreasing
  • Cause: effective lr = lr / (1 - β) exceeds the stability threshold
  • First fix: halve the learning rate

2. Overshooting (high β near a minimum)

  • Symptom: loss decreases then increases, repeating in a cycle
  • Cause: velocity carries the optimizer past the minimum
  • First fix: reduce β from 0.99 to 0.9

3. Stale momentum (loss landscape changes)

  • Symptom: loss spikes after a learning rate schedule step or data distribution shift
  • Cause: velocity still reflects old gradients from a different loss surface
  • First fix: reset velocity (momentum restart) at schedule boundaries

Try this: Run the cell below to see overshoot in action. Compare three configurations that attempt to fix it — which strategy works best, and at what cost?

break_momentum.py

Step 1: The broken config has effective lr = 0.1. With lr=0.001 and β=0.99, the amplification factor is 100x, giving an effective learning rate of 0.1. On the Rosenbrock surface this is far too aggressive.

Scale Thought Experiment

How do SGD and momentum choices change as model and batch sizes grow?

| Scale | What Changes | Typical Response |
|-------|--------------|------------------|
| Small model (10M params) | Vanilla SGD often sufficient; noise provides enough regularization | lr=0.01, β=0.9, batch=32 |
| Large batch (4K-32K) | Less noise per step; optimizer converges faster but to sharper minima | Scale lr linearly with batch; keep β=0.9; add warmup |
| Deep model (100+ layers) | Gradient magnitude varies 100x across layers; a single lr struggles | Switch to Adam (per-parameter lr) or use per-layer lr scaling |
| Very long training (100K+ steps) | Momentum velocity accumulates stale gradients across LR schedule changes | Restart momentum at schedule boundaries, or use warmup after decay steps |

Memory cost: SGD vs Adam

One reason SGD+momentum persists at scale is memory:

| Optimizer | State per Parameter | 7B Model (FP32 state) |
|-----------|---------------------|------------------------|
| SGD + momentum | 1 float (velocity) | 7B × 4 bytes = 28 GB |
| Adam | 2 floats (m + v) | 7B × 8 bytes = 56 GB |

That extra 28 GB for Adam can be the difference between fitting on one GPU or needing two. For large-scale vision training (where SGD+momentum works well), this matters.
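The arithmetic generalizes to any model size. A small helper (decimal GB, 1 GB = 10^9 bytes, matching the table's convention; the function name is illustrative):

```python
def optimizer_state_gb(n_params, floats_per_param, bytes_per_float=4):
    """FP32 optimizer state size in decimal GB (1 GB = 1e9 bytes)."""
    return n_params * floats_per_param * bytes_per_float / 1e9

print(optimizer_state_gb(7e9, 1))  # SGD + momentum (velocity only): 28.0
print(optimizer_state_gb(7e9, 2))  # Adam (m and v): 56.0
```

The same helper answers the checkpoint question below for other model sizes and optimizers.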

Production Reality

Where SGD + Momentum Still Wins:

  • Computer vision: ResNets, ViTs at moderate scale
  • Google's recipe: SGD + momentum (β=0.9) + cosine LR schedule + heavy data augmentation
  • Any setting where Adam's memory overhead pushes you to a smaller batch or model

Where Adam Wins:

  • Language models: loss surfaces are less well-conditioned; per-parameter adaptivity matters
  • Fine-tuning: different layers need different learning rates
  • Multi-modal models: combining modalities with very different gradient scales

The Practical Heuristic:

  1. Start with Adam (lr=3e-4, β1=0.9, β2=0.999) — it works for most things
  2. If you're training a vision model and hitting memory limits, try SGD+momentum
  3. If SGD works, it usually generalizes slightly better on held-out data

Checkpoint Questions

Test your understanding with these operational questions:

  1. Estimate: You train with lr=0.01 and β=0.95. What is the effective learning rate in steady state? If training diverges, what single change do you try first?

  2. Calculate: A team switches from batch size 256 to batch size 2048 (8x increase). Using the sqrt-scaling noise rule, what should the new learning rate be if the original was lr=0.1? How does that compare to the linear-scaling warmup heuristic?

  3. Diagnose: You see this training log pattern — loss decreases for 1000 steps, then spikes after a learning rate schedule drop, then slowly recovers. What is the likely cause, and what would you change?

  4. Compare: For a 13B parameter model on a single 80GB A100, how much optimizer state memory does SGD+momentum require vs Adam (both using FP32 state)? Does Adam fit?

Research Hooks

Key Papers:

  1. "On the Importance of Initialization and Momentum in Deep Learning" (Sutskever et al., 2013) — Shows momentum's role in escaping saddle points and how β interacts with learning rate schedules.
  2. "The Marginal Value of Adaptive Gradient Methods" (Wilson et al., 2017) — Argues SGD+momentum generalizes better than Adam in some vision settings, sparking the "SGD vs Adam" debate.
  3. "Accurate, Large Minibatch SGD" (Goyal et al., 2017) — The linear scaling rule and gradual warmup, enabling batch sizes up to 8192 for ResNet training.

Open Questions:

  • Why does SGD+momentum sometimes generalize better than Adam despite slower convergence? Is it the implicit regularization from fewer state variables, or something about the optimization trajectory?
  • Can we get Adam's per-parameter adaptivity without doubling the optimizer memory? (LION, Sophia, and other recent optimizers attempt this.)

Next up: Adam takes a different approach — per-parameter learning rates that adapt during training.