
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)


Gradient Flow Under Pressure

Estimated reading time: 18 minutes

Previous: ← The Memory Wall | Next: SGD & Momentum →

In this tutorial, you will trace how gradients break when stored in 16-bit floats, build a loss scaler that prevents the breakage, and learn when to reach for BF16 instead.

Mixed-precision training is how every large model ships today. The failure mode is simple: small gradients underflow to exactly zero in FP16, and layers stop learning. By the end of this tutorial you will be able to:

  • Compute the layer depth at which FP16 gradients underflow for a given attenuation factor
  • Implement static and dynamic loss scaling from scratch
  • Decide between BF16 and FP16 for a given hardware and model configuration

IEEE 754 Float Formats

Every floating-point number is stored as three fields: a sign bit, exponent bits, and mantissa (fraction) bits.

💡 The Rule That Matters

Exponent bits control range; mantissa bits control precision. BF16 keeps FP32's 8 exponent bits (same range) but drops mantissa from 23 to 7 bits (less precision). FP16 cuts the exponent to 5 bits, which shrinks the representable range drastically.

For gradient flow, range matters more than precision. A gradient rounded from 1.23456e-6 to 1.23e-6 still updates a parameter. A gradient underflowed to 0.0 does nothing.

Here are the concrete numbers you need to remember:

| Format | Exponent bits | Mantissa bits | Smallest positive normal | Largest value |
|--------|---------------|---------------|--------------------------|---------------|
| FP32 | 8 | 23 | ~1.2 x 10⁻³⁸ | ~3.4 x 10³⁸ |
| FP16 | 5 | 10 | ~6.1 x 10⁻⁵ (~6 x 10⁻⁸ subnormal) | 65,504 |
| BF16 | 8 | 7 | ~1.2 x 10⁻³⁸ | ~3.4 x 10³⁸ |

The FP16 floor (~6 x 10⁻⁸ including subnormals) is where gradients go to die. Compare that with BF16's floor (~1.2 x 10⁻³⁸), which matches FP32 and is effectively never a problem during training.
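These limits can be checked directly with NumPy's `finfo` (the `smallest_subnormal` attribute assumes NumPy ≥ 1.22):

```python
import numpy as np

for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    # tiny = smallest positive normal; smallest_subnormal = absolute floor
    print(dtype.__name__, info.tiny, info.smallest_subnormal, info.max)
```

NumPy has no native bfloat16, so BF16's limits cannot be queried this way without an extension such as `ml_dtypes`; its range matches FP32's.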

The FP16 Underflow Problem

⚠️ The Failure Mode

When a gradient value falls below FP16's smallest representable number (~6 x 10⁻⁸), it does not round to a small value. It becomes exactly zero. The parameter stops updating entirely, and there is no error message or warning.

During backpropagation, the gradient at layer k is a product of per-layer Jacobians from the output back to layer k, so its magnitude is roughly the product of the per-layer Jacobian norms. If each layer attenuates the gradient by a factor a (typically 0.7-0.9 for well-initialized networks), the gradient at layer k is approximately a^k.

Worked example: With attenuation factor 0.8 and 50 layers, the gradient at the earliest layer is 0.8⁵⁰ = 1.4 x 10⁻⁵. This is above FP16's floor, so it survives. But with attenuation 0.7 and 50 layers: 0.7⁵⁰ = 1.8 x 10⁻⁸, which is below the FP16 floor. That gradient becomes zero.
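The arithmetic above is easy to verify in a few lines (Python floats are 64-bit, so the products themselves do not underflow):

```python
FP16_FLOOR = 6e-8  # approximate smallest positive FP16 subnormal

for attenuation in (0.8, 0.7):
    grad = attenuation ** 50
    verdict = "survives" if grad > FP16_FLOOR else "underflows"
    print(f"{attenuation}^50 = {grad:.1e} -> {verdict}")
```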

Run this simulation to see underflow in action. Before you run it: predict at which layer the gradient will underflow with attenuation = 0.7.

underflow_simulation.py
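A minimal version of that simulation might look like this, assuming NumPy (the original file is not reproduced here, and the exact cutoff layer depends on FP16's round-to-nearest behavior):

```python
import numpy as np

def first_underflow_layer(attenuation: float, n_layers: int = 100):
    """Return the first layer whose gradient casts to exactly 0.0 in FP16."""
    for k in range(n_layers):
        grad = attenuation ** k        # true gradient magnitude (float64)
        if np.float16(grad) == 0.0:    # FP16 rounds sub-threshold values to zero
            return k
    return None  # no underflow within n_layers

for a in (0.9, 0.8, 0.7):
    print(f"attenuation {a}: first zero-gradient layer = {first_underflow_layer(a)}")
```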

Step 1: Read the table. With attenuation 0.7, underflow begins around layer 46-47. Every layer beyond that receives a zero gradient — those weights are frozen.

Loss Scaling: The Fix for FP16

Loss scaling prevents underflow by multiplying the loss (and therefore all gradients via the chain rule) by a large constant before the backward pass. After the backward pass, gradients are divided by the same constant to restore correct magnitudes. The key is that the multiplication happens while values are still in a representable range.

Worked example: A gradient of 1e-8 underflows in FP16 (below 6e-8). But if you multiply by 65,536 first, it becomes 6.5e-4 — well within FP16 range. After the backward pass, divide by 65,536 to recover the true gradient.

loss_scaling.py
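A stripped-down version of that experiment, assuming NumPy (a sketch of the idea, not the original file):

```python
import numpy as np

SCALE = 65536.0  # 2^16, a common static loss scale

def fp16_roundtrip(grad: float, scale: float = SCALE) -> float:
    """Scale, cast to FP16 (as the backward pass would), then unscale in FP32."""
    scaled = np.float16(grad * scale)         # scaling happens before the cast
    return float(np.float32(scaled) / scale)  # unscale in higher precision

for grad in (1e-6, 1e-8, 1e-10, 1e-12):
    naive = float(np.float16(grad))
    print(f"grad={grad:.0e}  unscaled fp16={naive:.2e}  with scaling={fp16_roundtrip(grad):.2e}")
```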

Step 1: Count the saves. Without scaling, gradients below ~6e-8 underflow. With a scale of 65,536, values as small as ~1e-12 survive (65536 x 1e-12 = 6.5e-8, just above the FP16 floor).

Dynamic Loss Scaling

A fixed scale factor is fragile: too low and gradients still underflow; too high and they overflow. Dynamic loss scaling solves this by adapting the scale during training.

The algorithm:

  1. Start with a high scale (e.g., 65,536)
  2. After each backward pass, check for inf/nan in gradients
  3. If overflow detected: halve the scale, skip this parameter update
  4. If N consecutive steps pass without overflow: double the scale

This is exactly what PyTorch's torch.cuda.amp.GradScaler does internally.

dynamic_scaling.py
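The four-step algorithm above can be sketched in plain Python (a simplified analogue of what `GradScaler` does; the growth interval of 2,000 matches PyTorch's default, the rest is illustrative):

```python
import math

class DynamicLossScaler:
    """Halve the scale on overflow, double it after `growth_interval` clean steps."""

    def __init__(self, init_scale=65536.0, growth_interval=2000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._good_steps = 0

    def update(self, grads) -> bool:
        """Return True if this step's update should be applied; adapt the scale."""
        overflow = any(math.isinf(g) or math.isnan(g) for g in grads)
        if overflow:
            self.scale /= 2        # back off: gradients blew past FP16's max
            self._good_steps = 0   # overflow resets the good-step counter
            return False           # skip this parameter update
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2        # probe a higher scale again
            self._good_steps = 0
        return True

scaler = DynamicLossScaler()
print(scaler.update([1e-3, float("inf")]), scaler.scale)  # overflow: scale halves
print(scaler.update([1e-3, 2e-3]), scaler.scale)          # clean step: update applied
```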

Step 1: Trace the overflows. At step 3, the scale halves from 65,536 to 32,768. At step 7, it halves again to 16,384. Each overflow resets the good-step counter.

BF16: When You Can Skip Loss Scaling

BF16 keeps FP32's 8 exponent bits, giving it the same range (~1e-38 to ~3.4e38). This means gradients almost never underflow in BF16, and you do not need loss scaling at all.

The trade-off is precision: BF16 has only 7 mantissa bits (~2-3 significant digits) compared to FP16's 10 mantissa bits (~3-4 digits). For most training workloads, this precision loss has negligible effect on convergence.
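NumPy has no bfloat16 type, but because BF16 is simply the top 16 bits of an FP32 value, its behavior can be approximated by masking off the low mantissa bits (truncation rather than true round-to-nearest, so this slightly overstates the error):

```python
import numpy as np

def to_bf16(x: float) -> float:
    """Emulate BF16 by zeroing the low 16 bits of an FP32 bit pattern."""
    bits = np.array(x, dtype=np.float32).view(np.uint32)
    return float((bits & np.uint32(0xFFFF0000)).view(np.float32))

print(to_bf16(1e-30))            # survives: BF16 shares FP32's exponent range
print(float(np.float16(1e-30)))  # FP16: underflows to exactly 0.0
print(to_bf16(1.2345))           # only ~2-3 significant digits survive
```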

| Format | Range | Precision | Loss scaling needed? | Hardware support |
|--------|-------|-----------|----------------------|------------------|
| FP32 | ~1e-38 to ~3.4e38 | ~7 digits | No | All GPUs |
| FP16 | ~6e-8 to 65,504 | ~3-4 digits | Yes (recommended) | V100, all modern GPUs |
| BF16 | ~1e-38 to ~3.4e38 | ~2-3 digits | No | A100, H100, TPU v3+ |

💡 Decision Rule

If your hardware supports BF16 (A100, H100, TPU v3+), use BF16 by default. It gives you FP32-level range with FP16-level speed, and no loss scaling overhead.

Reach for FP16 + dynamic loss scaling only when: (a) your hardware lacks BF16 support (V100, older GPUs), or (b) a specific kernel/library requires FP16.

Break It: Diagnosing Numerical Issues

This section walks through the symptoms you will see in practice when numerical precision goes wrong, and the first things to check.

Symptom 1: Loss plateaus early, gradient norms collapse to zero.

  • Likely cause: FP16 underflow in early layers.
  • First check: print per-layer gradient norms. If early layers show exactly 0.0, that confirms underflow.
  • First fix: enable dynamic loss scaling, or switch to BF16.

Symptom 2: Loss spikes to NaN or inf intermittently.

  • Likely cause: overflow. Gradients or activations exceeded FP16's max (65,504).
  • First check: look at the loss scaler's overflow count. If it is rising, your scale is too high.
  • First fix: reduce initial loss scale, or switch to BF16.

Symptom 3: Training converges but final quality is slightly worse than expected.

  • Likely cause: accumulated rounding error from low precision.
  • First check: compare a short training run in FP32 vs your mixed-precision setup. If FP32 is measurably better, precision is hurting.
  • First fix: use FP32 master weights (this is standard in mixed-precision training — compute in FP16/BF16, accumulate in FP32).

diagnose_underflow.py
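A sketch of what such a diagnostic might look like (the per-layer norms and the 10x "at risk" margin are synthetic stand-ins; in a real run you would read the norms off `param.grad` after the backward pass):

```python
import numpy as np

FP16_FLOOR = float(np.finfo(np.float16).smallest_subnormal)  # ~6e-8

def classify(norm: float) -> str:
    if norm < FP16_FLOOR:
        return "UNDERFLOW"   # rounds to zero in FP16: the layer is frozen
    if norm < 10 * FP16_FLOOR:
        return "at risk"     # one bad batch away from the floor
    return "healthy"

# Hypothetical per-layer gradient norms for a 12-layer net
norms = [1e-3 * 0.35 ** k for k in range(12)]
for k, norm in enumerate(norms):
    print(f"layer {k:2d}  norm={norm:.2e}  {classify(norm)}")
```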

Step 1: Read the diagnostic output. Layers 0-7 are healthy. Layers 8-9 are getting close to the FP16 floor. Layers 10-11 are underflowing (one is below the minimum, one is exactly zero).

Scale Thought Experiment

How does the precision problem change as you scale up?

| Scenario | What breaks | Why | Mitigation |
|----------|-------------|-----|------------|
| Shallow net (10 layers), small batch | Nothing — FP16 is fine | Gradients stay in representable range | Standard mixed precision |
| Deep net (100+ layers) | Early-layer underflow in FP16 | 0.85¹⁰⁰ ≈ 8.7 x 10⁻⁸, near the FP16 floor | BF16, or aggressive loss scaling |
| Large batch (8K+ samples) | Gradient averaging pushes magnitudes down | Mean of 8K gradients is ~90x smaller than a single-sample gradient | Higher loss scale to compensate |
| Long training (weeks) | Accumulated FP16 rounding errors | Master weights drift from the true value over millions of steps | FP32 master weights (standard practice) |
| Very large model (70B+) | Activation memory forces FP16/BF16 | Cannot afford FP32 for activations | BF16 compute + FP32 accumulation |

Production Reality

NVIDIA Mixed Precision (FP16 path):

  • Master weights stored in FP32
  • Forward and backward pass computed in FP16
  • Dynamic loss scaling with GradScaler (starts at 65,536, halves on overflow)
  • Gradient accumulation and optimizer step in FP32
  • Reference: Mixed Precision Training (Micikevicius et al., 2018)
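The value of FP32 master weights is easy to demonstrate: near 1.0, FP16's spacing is about 1e-3, so a 1e-4 update is rounded away every single step unless it accumulates in FP32 first (a toy sketch, not the NVIDIA recipe itself):

```python
import numpy as np

steps, update = 1000, np.float32(1e-4)

# Pure FP16: each tiny update rounds away against FP16's ~1e-3 spacing at 1.0
w16 = np.float16(1.0)
for _ in range(steps):
    w16 = np.float16(w16 + np.float16(update))

# FP32 master weight: updates accumulate, cast down to FP16 only for compute
w32 = np.float32(1.0)
for _ in range(steps):
    w32 = np.float32(w32 + update)

print(f"pure FP16 weight: {float(w16)}")      # stuck at 1.0
print(f"FP32 master:      {float(w32):.4f}")  # ~1.1
```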

Google TPU / A100+ (BF16 path):

  • BF16 compute, FP32 accumulation
  • No loss scaling needed — BF16 range matches FP32
  • This is the default for most large-scale training today (GPT-4, PaLM, Gemini all use BF16)

Choosing in practice: Use this rubric in a model review:

  1. Check hardware. A100/H100/TPU v3+? Default to BF16.
  2. Check symptoms. Loss plateau + zero gradient norms? Numerical issue, not optimization issue. Fix precision before tuning hyperparameters.
  3. Monitor two signals. Track both overflow counts (from the scaler) and per-layer gradient norms (for underflow). Stable global loss can hide local gradient starvation.

Checkpoint Questions

Test your understanding with these operational questions:

  1. Estimate: A 96-layer Transformer has average per-layer gradient attenuation of 0.82. At which layer will FP16 gradients first underflow? (Hint: solve 0.82^k = 6e-8 for k.)

  2. Calculate: You are training with loss scale = 32,768 in FP16. What is the smallest gradient that will survive without underflowing? (Hint: the gradient times the scale must exceed the FP16 floor.)

  3. Diagnose: Your training loss has been flat for 2,000 steps. Gradient norms for layers 0-20 are ~1e-3, but layers 21-48 all show exactly 0.0. What is the most likely cause, and what is the cheapest fix?

  4. Decide: You have a cluster of V100 GPUs (no BF16 support) and need to train a 30-layer model. FP16 with no loss scaling, FP16 with dynamic loss scaling, or FP32 — which do you pick and why?

Research Hooks

Papers:

  1. "Mixed Precision Training" (Micikevicius et al., 2018) — The foundational paper that established loss scaling and FP16 mixed-precision as practical for training. arXiv:1710.03740
  2. "FP8 Formats for Deep Learning" (Micikevicius et al., 2022) — Extends the precision frontier to 8-bit floats, defining E4M3 and E5M2 formats. arXiv:2209.05433

Open questions:

  • Can per-layer adaptive loss scaling outperform global scaling for very deep or heterogeneous architectures?
  • Where is the precision floor for long-context training — does BF16's ~2-digit precision eventually hurt attention score resolution at 128K+ context?
  • FP8 training is emerging on H100s. What are the failure modes, and when does FP8 precision become insufficient?

Next up: We explore how optimizers navigate loss landscapes — and why SGD's noise is a feature, not a bug.