Gradient Flow Under Pressure

Estimated reading time: 18 minutes

When you scale to 70B parameters, gradients become fragile.

FP32 is safe but slow. FP16 is fast but breaks. BF16 is Goldilocks—but only if you understand why. This lesson demystifies mixed precision training.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with IEEE 754 Float Formats to build core intuition and shared vocabulary.
  2. Move to The FP16 Problem to understand the mechanism behind the intuition.
  3. Apply the idea in Loss Scaling: The Fix for FP16 with concrete examples or implementation details.
  4. Then zoom out to scale-level tradeoffs to see how the same concept holds at larger model and system sizes.
  5. Map the concept to production constraints to understand how teams make practical tradeoffs.
  6. Finish with research extensions to connect today’s mental model to open problems.

IEEE 754 Float Formats#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

All modern hardware uses IEEE 754 floating point. Understanding the format explains why training breaks.

Bit layout recap: FP32 is 1 sign + 8 exponent + 23 mantissa bits, FP16 is 1 + 5 + 10, and BF16 is 1 + 8 + 7.

The key insight: exponent bits determine range, mantissa bits determine precision. BF16 keeps FP32's 8 exponent bits, sacrificing precision for range.
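
To make the layouts concrete, here is a small optional sketch that prints each format's limits and shows the precision gap. It assumes NumPy and PyTorch are installed; it is not one of the lesson's editor files.

```python
# float_formats.py (illustrative sketch, not one of the lesson's editor files)
import numpy as np
import torch

# Range and precision limits straight from the format definitions.
for dtype in (np.float32, np.float16):
    info = np.finfo(dtype)
    print(f"{info.dtype}: max={info.max:.3e}, "
          f"smallest normal={info.tiny:.3e}, mantissa bits={info.nmant}")

bf16 = torch.finfo(torch.bfloat16)
print(f"bfloat16: max={bf16.max:.3e}, smallest normal={bf16.tiny:.3e}")

# Precision in action: 1 + 1/512 survives FP16 (10 mantissa bits)
# but rounds back to exactly 1.0 in BF16 (7 mantissa bits).
x = torch.tensor(1.0 + 1.0 / 512)
print(x.to(torch.float16))   # tensor(1.0020, dtype=torch.float16)
print(x.to(torch.bfloat16))  # tensor(1., dtype=torch.bfloat16)
```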

The FP16 Problem#

Flow bridge: Building on IEEE 754 Float Formats, this section adds the next layer of conceptual depth.

Let's simulate how gradients vanish through deep networks:

underflow_simulation.py
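
The interactive editor is not reproduced here, so below is a minimal sketch of what this file might contain, assuming the constant 0.7x-per-layer attenuation described in the step that follows (NumPy is an assumption):

```python
# underflow_simulation.py (sketch): track a shrinking gradient and see where FP16 loses it.
import numpy as np

ATTENUATION = 0.7   # assumed average shrink factor per layer, as in the text
NUM_LAYERS = 50

true_grad = 1.0     # "ideal" gradient magnitude, tracked in float64
for layer in range(1, NUM_LAYERS + 1):
    true_grad *= ATTENUATION
    as_fp32 = np.float32(true_grad)   # what survives if gradients are stored in FP32
    as_fp16 = np.float16(true_grad)   # what survives if gradients are stored in FP16
    if layer % 10 == 0:
        print(f"layer {layer:2d}: true={true_grad:.3e}  fp32={as_fp32:.3e}  fp16={as_fp16:.3e}")

# By layer 50 the true value (~1.8e-8) is below FP16's smallest representable
# magnitude, so the FP16 copy reads exactly zero: the gradient signal is gone.
print("FP16 gradient is zero:", np.float16(true_grad) == 0.0)
```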

Step 1: Understand the math. Each layer multiplies gradients by its Jacobian. If that shrinks the gradient by 0.7× on average, then after 50 layers the gradient is scaled by 0.7⁵⁰ ≈ 1.8×10⁻⁸, which is below FP16's smallest representable value (about 6×10⁻⁸), so it rounds to zero.

Loss Scaling: The Fix for FP16#

Flow bridge: With the failure mechanism in place, apply the fix concretely with implementation details.

The insight: multiply the loss by a large constant before backpropagation so the gradients stay above FP16's underflow threshold, then divide the gradients by the same constant before the optimizer step.

loss_scaling.py
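
Below is a minimal sketch of what this file might contain: the scale/unscale bookkeeping around a toy PyTorch model. The model, the scale factor of 1024, and the SGD settings are illustrative assumptions; a real FP16 run would also wrap the forward pass in autocast, as shown later in the lesson.

```python
# loss_scaling.py (sketch): scale the loss up, backprop, then unscale the gradients.
import torch

SCALE = 1024.0  # static scale factor; large enough to lift tiny gradients above FP16's floor

model = torch.nn.Linear(16, 1)                        # master weights kept in FP32
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
x, target = torch.randn(8, 16), torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), target)

# 1) Backprop on the scaled loss, so every gradient is SCALE times larger.
(loss * SCALE).backward()

# 2) Unscale the gradients before the optimizer sees them.
with torch.no_grad():
    for p in model.parameters():
        if p.grad is not None:
            p.grad.div_(SCALE)

# 3) Step on correctly sized gradients, exactly as in full precision.
optimizer.step()
optimizer.zero_grad()
print(f"loss: {loss.item():.4f}")
```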

Dynamic Loss Scaling#

Flow bridge: Building on the static fix, make the scale factor adapt automatically during training.

In practice, we don't know the right scale factor in advance. Dynamic loss scaling adapts:

  1. Start with a high scale (65536)
  2. If gradients overflow (become inf/NaN), skip the optimizer step and halve the scale
  3. If N steps pass without overflow, double the scale
dynamic_scaling.py
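
Below is a minimal sketch of that policy as a small class. PyTorch's torch.cuda.amp.GradScaler implements the same idea; the parameter names and 2000-step growth interval here are illustrative assumptions rather than the lesson's exact file.

```python
# dynamic_scaling.py (sketch): the three rules above as a tiny scaler class.
import torch


class DynamicLossScaler:
    def __init__(self, init_scale=65536.0, growth_interval=2000):
        self.scale = init_scale                 # rule 1: start with a high scale
        self.growth_interval = growth_interval  # clean steps required before growing
        self._good_steps = 0

    def update(self, overflowed: bool) -> bool:
        """Adjust the scale after a backward pass; return True if the step should apply."""
        if overflowed:
            self.scale /= 2.0                   # rule 2: halve on inf/NaN and skip the update
            self._good_steps = 0
            return False
        self._good_steps += 1
        if self._good_steps >= self.growth_interval:
            self.scale *= 2.0                   # rule 3: grow again after N clean steps
            self._good_steps = 0
        return True


def has_overflow(parameters) -> bool:
    """True if any scaled gradient contains inf or NaN."""
    return any(
        p.grad is not None and not torch.isfinite(p.grad).all()
        for p in parameters
    )
```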

BF16: The Modern Solution#

Flow bridge: Building on Dynamic Loss Scaling, this section adds the next layer of conceptual depth.

| Format | Range | Precision | Loss Scaling? | Hardware |
|--------|-------|-----------|---------------|----------|
| FP32 | ±3.4×10³⁸ | ~7 digits | No | All GPUs |
| FP16 | ±65,504 | ~3 digits | Recommended | V100, older |
| BF16 | ±3.4×10³⁸ | ~3 digits | No | A100, H100, TPU |
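
In framework terms the switch is small. Here is a minimal PyTorch sketch of one BF16 training step; the toy model and optimizer choices are illustrative assumptions, and Ampere-or-newer hardware (or a slow CPU fallback) is assumed.

```python
# bf16_training_step.py (sketch): BF16 autocast with no loss scaling.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(16, 1).to(device)            # parameters stay in FP32
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

x = torch.randn(8, 16, device=device)
target = torch.randn(8, 1, device=device)

# The forward matmul runs in BF16; because BF16 keeps FP32's exponent range,
# no GradScaler is needed, unlike the FP16 recipe above.
with torch.autocast(device_type=device, dtype=torch.bfloat16):
    loss = torch.nn.functional.mse_loss(model(x), target)

loss.backward()        # gradients accumulate into the FP32 parameter buffers
optimizer.step()
optimizer.zero_grad()
```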

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

| Scale | What Breaks | Mitigation |
|-------|-------------|------------|
| Small batches, shallow nets | Nothing; FP16 usually fine | Standard mixed precision |
| Large batches (8K+) | Gradient averaging → underflow | Aggressive loss scaling |
| Very deep nets (100+ layers) | Compounding attenuation | BF16, gradient checkpointing |
| Long training runs | Accumulated numerical errors | Periodic FP32 master weights |

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

NVIDIA Mixed Precision Training:

  • Master weights in FP32, compute in FP16
  • Dynamic loss scaling: starts at 65536, halves on overflow (see the sketch after this list)
  • Reference: Mixed Precision Training
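
These bullets map directly onto PyTorch's AMP API. Here is a minimal sketch of one training step in that style; the toy model is an illustrative assumption, and FP16 autocast assumes a CUDA device.

```python
# amp_fp16_step.py (sketch): FP32 master weights, FP16 compute, dynamic loss scaling.
import torch

device = "cuda"  # FP16 autocast is a GPU feature; this sketch assumes a CUDA device
model = torch.nn.Linear(16, 1).to(device)            # FP32 master weights
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()                 # dynamic loss scaling, initial scale 65536

x = torch.randn(8, 16, device=device)
target = torch.randn(8, 1, device=device)

with torch.autocast(device_type="cuda", dtype=torch.float16):
    loss = torch.nn.functional.mse_loss(model(x), target)   # forward pass under FP16 autocast

scaler.scale(loss).backward()   # backprop on the scaled loss
scaler.step(optimizer)          # unscales grads, skips the step if inf/NaN appeared
scaler.update()                 # grows or shrinks the scale for the next iteration
optimizer.zero_grad()
```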

Google TPU (BF16 Native):

  • BF16 compute with FP32 accumulation
  • No loss scaling needed—BF16 range matches FP32
  • This is why Google pushed for BF16 hardware support

Teacher Walkthrough: Picking Precision Under Real Constraints#

Flow bridge: Translate the mechanism into a production decision you can defend.

Use this quick rubric the way a staff engineer would in a model review:

  1. Start from hardware support. If your fleet is A100/H100/TPU v4+, default to BF16. On older cards, FP16 plus dynamic loss scaling is still practical.

  2. Check failure signature before changing architecture. If loss plateaus early and gradient norms collapse toward zero, treat this as a numerical issue first, not an optimization-theory issue.

  3. Choose the cheapest stable option. If BF16 is available, prefer it for simplicity. If not, keep FP16 compute but retain FP32 master weights and dynamic scaling.

  4. Verify with two probes, not one. Monitor both overflow counts and underflow-sensitive layers (often the early layers); stable global loss can still hide local gradient starvation. The sketch after this list shows both probes.
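
Here is a minimal sketch of both probes; the function and argument names are hypothetical and meant to be adapted to your own training loop.

```python
# precision_probes.py (sketch): the two monitoring probes from step 4.
import torch


def log_precision_probes(model, scaler=None, num_early_layers=3):
    """Print overflow-related and underflow-related signals after one training step."""
    # Probe 1: overflow pressure. A falling GradScaler scale means inf/NaN keep appearing.
    if scaler is not None:
        print(f"loss scale: {scaler.get_scale():.1f}")

    # Probe 2: underflow-sensitive layers. Near-zero grad norms in the earliest
    # parameters are the classic signature of gradient starvation.
    for i, (name, p) in enumerate(model.named_parameters()):
        if i >= num_early_layers:
            break
        if p.grad is not None:
            print(f"{name}: grad norm = {p.grad.norm().item():.3e}")
```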

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you do this without notes: Explain the IEEE 754 representation and why FP16's range is problematic?
  2. Can you do this without notes: Identify gradient underflow conditions and their symptoms?
  3. Can you do this without notes: Implement loss scaling to prevent underflow?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "Mixed Precision Training" (Micikevicius et al., 2018) — The foundational paper that made FP16 training practical
  2. "FP8 Formats for Deep Learning" (Micikevicius et al., 2022) — The next frontier: 8-bit floats for even faster training

Open Questions:

  • Can we design loss scaling that's aware of per-layer gradient distributions?
  • At what point does reduced precision hurt final model quality vs just training dynamics?

Next up: We'll explore how optimizers navigate loss landscapes—and why SGD's noise is a feature, not a bug.