SGD & Momentum
Estimated reading time: 15 minutes
Build the mental models that separate research engineers from ML practitioners.
Training a neural network means navigating a high-dimensional loss surface where flat regions stall progress and sharp minima hurt generalization. Choosing how the optimizer moves through that surface — and how much "memory" it carries — directly controls convergence speed, stability, and final model quality.
By the end of this tutorial, you will be able to pick momentum settings for a training run and predict when they will break.
Real loss surfaces are not smooth bowls. They contain saddle points (gradient near zero but not a minimum), sharp local minima (low loss but poor generalization), and flat plateaus (tiny gradients, slow progress).
Full-batch gradient descent always moves in the direction of steepest descent. That sounds optimal, but it means the optimizer gets trapped in the first minimum it reaches — often a sharp one.
Minibatch SGD sees a different loss surface on every step because each batch samples different data. That per-step noise acts as implicit regularization: it randomly kicks the optimizer out of sharp minima while flat, wide minima are stable under the noise.
The effective noise in SGD scales as:
noise ∝ lr / sqrt(batch_size)
Consider two setups training a ResNet-50:
| Setup | Learning Rate | Batch Size | Relative Noise |
|---|---|---|---|
| A | 0.1 | 256 | 0.1 / 16 = 0.00625 |
| B | 0.4 | 4096 | 0.4 / 64 = 0.00625 |
Same effective noise under this proxy. If noise scales as lr / sqrt(batch_size), then preserving noise implies a sqrt scaling rule: when batch size scales by k, scale learning rate by sqrt(k).
In practice, some large-batch recipes use linear scaling + warmup (Goyal et al., 2017) for optimization stability, even though it does not keep lr / sqrt(batch_size) constant.
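The noise proxy and the sqrt scaling rule above can be sketched in a few lines; the function names here are illustrative, not from any library:

```python
import math

def noise_proxy(lr: float, batch_size: int) -> float:
    """Relative SGD noise under the lr / sqrt(batch_size) proxy."""
    return lr / math.sqrt(batch_size)

def sqrt_scaled_lr(base_lr: float, base_batch: int, new_batch: int) -> float:
    """When batch size grows by k, scale lr by sqrt(k) to preserve the proxy."""
    return base_lr * math.sqrt(new_batch / base_batch)

# The two setups from the table above give the same noise under this proxy:
print(noise_proxy(0.1, 256))              # setup A
print(noise_proxy(0.4, 4096))             # setup B, same value
print(sqrt_scaled_lr(0.1, 256, 4096))     # 0.4
```

Swapping `sqrt_scaled_lr` for linear scaling (`base_lr * new_batch / base_batch`) reproduces the Goyal et al. heuristic instead.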
The simplest update rule:
θ = θ - lr × ∇L(θ)
Each step uses only the gradient from the current minibatch. There is no memory of past gradients.
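As a minimal sketch (using NumPy and a toy quadratic loss, not the notebook's actual cell), the update is one line:

```python
import numpy as np

def sgd_step(theta, grad, lr):
    """Vanilla SGD: no memory, just the current gradient."""
    return theta - lr * grad

# Toy quadratic loss L(theta) = 0.5 * ||theta - optimum||^2, optimum at (1, 1)
optimum = np.array([1.0, 1.0])
theta = np.zeros(2)
for _ in range(100):
    grad = theta - optimum          # gradient of the toy quadratic
    theta = sgd_step(theta, grad, lr=0.1)
print(theta)  # close to [1, 1]
```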
Try this: Run the cell below and compare the clean (full-batch) and noisy (minibatch) paths. Watch where they end up relative to the optimum at (1, 1).
Vanilla SGD has no memory — each step uses only the current gradient. If the gradient keeps pointing in roughly the same direction, we are wasting information by ignoring that consistency.
Momentum fixes this by maintaining a running average of past gradients (the "velocity"):
v = β × v + ∇L(θ) # Accumulate velocity
θ = θ - lr × v # Update parameters
When the gradient is constant (steady-state), the velocity converges to v = g / (1 - β) where g is the gradient.
| β | Amplification Factor | lr = 0.01 Effective lr |
|---|---|---|
| 0.0 | 1x | 0.01 |
| 0.9 | 10x | 0.10 |
| 0.95 | 20x | 0.20 |
| 0.99 | 100x | 1.00 |
This is why high momentum requires lower base learning rates. The combination of lr and β together controls the true step size.
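A quick numerical check of the steady-state claim, with illustrative names — iterating v = β × v + g under a constant gradient converges to g / (1 - β), the amplification factor in the table:

```python
def momentum_velocity(beta: float, grad: float, steps: int) -> float:
    """Accumulate v = beta * v + grad under a constant gradient."""
    v = 0.0
    for _ in range(steps):
        v = beta * v + grad
    return v

g = 1.0
for beta in (0.0, 0.9, 0.95, 0.99):
    v = momentum_velocity(beta, g, steps=2000)
    # Steady state matches g / (1 - beta): 1x, 10x, 20x, 100x
    print(beta, v, g / (1 - beta))
```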
Try this: Run the cell below, then look at three things: (1) final distance to optimum, (2) oscillation count, and (3) which β gives the best tradeoff.
Classical momentum updates the velocity using the gradient at the current position. Nesterov momentum (Nesterov Accelerated Gradient / NAG) evaluates the gradient at the projected position — where momentum would take you before the correction:
v = β × v + ∇L(θ - lr × β × v) # Gradient at look-ahead position
θ = θ - lr × v
The intuition: if momentum is about to carry you too far, Nesterov "looks ahead" and applies a correction before arriving. This gives faster convergence on convex problems and often better behavior near minima.
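The look-ahead update can be sketched on the same toy quadratic as before; `nesterov_step` and `grad_fn` are illustrative names, and this is the formulation written above, not the reparameterized form some libraries use:

```python
import numpy as np

def nesterov_step(theta, v, lr, beta, grad_fn):
    """NAG: evaluate the gradient at the look-ahead point theta - lr * beta * v."""
    v = beta * v + grad_fn(theta - lr * beta * v)
    theta = theta - lr * v
    return theta, v

optimum = np.array([1.0, 1.0])
grad_fn = lambda th: th - optimum   # gradient of 0.5 * ||th - optimum||^2

theta, v = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, v = nesterov_step(theta, v, lr=0.1, beta=0.9, grad_fn=grad_fn)
print(theta)  # near [1, 1]
```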
Momentum introduces a new dimension of failure beyond vanilla SGD. Here are the three ways momentum breaks, in order from most to least common.
Try this: Run the cell below to see overshoot in action. Compare three configurations that attempt to fix it — which strategy works best, and at what cost?
How do SGD and momentum choices change as model and batch sizes grow?
| Scale | What Changes | Typical Response |
|---|---|---|
| Small model (10M params) | Vanilla SGD often sufficient; noise provides enough regularization | lr=0.01, β=0.9, batch=32 |
| Large batch (4K-32K) | Less noise per step; optimizer converges faster but to sharper minima | Scale lr linearly with batch; keep β=0.9; add warmup |
| Deep model (100+ layers) | Gradient magnitude varies 100x across layers; single lr struggles | Switch to Adam (per-parameter lr) or use per-layer lr scaling |
| Very long training (100K+ steps) | Momentum velocity accumulates stale gradients across LR schedule changes | Restart momentum at schedule boundaries; or use warmup after decay steps |
One reason SGD+momentum persists at scale is memory:
| Optimizer | State per Parameter | 7B Model (FP32 state) |
|---|---|---|
| SGD | 1 float (velocity) | 7B x 4 bytes = 28 GB |
| Adam | 2 floats (m + v) | 7B x 8 bytes = 56 GB |
That extra 28 GB for Adam can be the difference between fitting on one GPU or needing two. For large-scale vision training (where SGD+momentum works well), this matters.
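The arithmetic in the table is simple enough to wrap in a helper (assuming FP32 state at 4 bytes per float and GB = 10⁹ bytes, counting only optimizer state, not parameters or gradients):

```python
def optimizer_state_gb(num_params: float, floats_per_param: int) -> float:
    """Optimizer state in GB, assuming FP32 (4 bytes per float)."""
    return num_params * floats_per_param * 4 / 1e9

# 7B-parameter model from the table above
print(optimizer_state_gb(7e9, 1))  # SGD+momentum: 28.0 GB
print(optimizer_state_gb(7e9, 2))  # Adam (m + v): 56.0 GB
```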
Where SGD + Momentum Still Wins:
Where Adam Wins:
The Practical Heuristic:
Test your understanding with these operational questions:
Estimate: You train with lr=0.01 and β=0.95. What is the effective learning rate in steady state? If training diverges, what single change do you try first?
Calculate: A team switches from batch size 256 to batch size 2048 (8x increase). Using the sqrt-scaling noise rule, what should the new learning rate be if the original was lr=0.1? How does that compare to the linear-scaling warmup heuristic?
Diagnose: You see this training log pattern — loss decreases for 1000 steps, then spikes after a learning rate schedule drop, then slowly recovers. What is the likely cause, and what would you change?
Compare: For a 13B parameter model on a single 80GB A100, how much optimizer state memory does SGD+momentum require vs Adam (both using FP32 state)? Does Adam fit?
Key Papers:
Open Questions:
Next up: Adam takes a different approach — per-parameter learning rates that adapt during training.