Gradient Flow Under Pressure
Estimated reading time: 18 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will trace how gradients break when stored in 16-bit floats, build a loss scaler that prevents the breakage, and learn when to reach for BF16 instead.
Mixed-precision training is how every large model ships today. The failure mode is simple: small gradients underflow to exactly zero in FP16, and layers stop learning. By the end of this tutorial you will be able to:
- Trace exactly where gradients underflow when stored in 16-bit floats
- Build a loss scaler that prevents the breakage
- Decide when to reach for BF16 instead
Every floating-point number is stored as three fields: a sign bit, exponent bits, and mantissa (fraction) bits.
Here are the concrete numbers you need to remember:
| Format | Exponent bits | Mantissa bits | Smallest positive normal | Largest value |
|---|---|---|---|---|
| FP32 | 8 | 23 | ~1.2 x 10⁻³⁸ | ~3.4 x 10³⁸ |
| FP16 | 5 | 10 | ~6.1 x 10⁻⁵ (normal), ~6 x 10⁻⁸ (subnormal) | 65,504 |
| BF16 | 8 | 7 | ~1.2 x 10⁻³⁸ | ~3.4 x 10³⁸ |
The FP16 floor (~6 x 10⁻⁸ including subnormals) is where gradients go to die. Compare that with BF16's floor (~1.2 x 10⁻³⁸), which matches FP32 and is effectively never a problem during training.
During backpropagation, the gradient magnitude at layer k is governed by the product of per-layer Jacobian norms from the output back to layer k. If each layer attenuates the gradient by a factor a (typically 0.7-0.9 for well-initialized networks), the gradient magnitude k layers from the output is approximately a^k.
Worked example: With attenuation factor 0.8 and 50 layers, the gradient at the earliest layer is 0.8⁵⁰ ≈ 1.4 x 10⁻⁵. This is above FP16's floor, so it survives. But with attenuation 0.7 and 50 layers: 0.7⁵⁰ ≈ 1.8 x 10⁻⁸, which is below the FP16 subnormal floor of ~6 x 10⁻⁸. That gradient becomes exactly zero.
Run this simulation to see underflow in action. Before you run it: predict at which layer the gradient will underflow with attenuation = 0.7.
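If you cannot run the interactive version, here is a minimal pure-Python sketch of the same experiment. It uses `struct`'s half-precision format code `'e'` to round values to IEEE-754 FP16; the `to_fp16` and `first_underflow_layer` helpers are illustrative names, not library calls, and the simulation checks whether the true gradient a^k is still representable when stored in FP16:

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

def first_underflow_layer(attenuation: float, n_layers: int = 60):
    """Return the first layer (counting back from the output) whose true
    gradient a^k flushes to exactly zero when stored in FP16, or None."""
    for layer in range(1, n_layers + 1):
        exact = attenuation ** layer      # true gradient magnitude at this depth
        if to_fp16(exact) == 0.0:         # does FP16 storage flush it to zero?
            return layer
    return None

print(first_underflow_layer(0.8))  # survives all 60 layers -> None
print(first_underflow_layer(0.7))  # underflows somewhere in the high 40s
```

Note that underflow is a cliff, not a slope: one layer earlier the gradient still round-trips through FP16, one layer deeper it is exactly zero.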
Loss scaling prevents underflow by multiplying the loss (and therefore all gradients via the chain rule) by a large constant before the backward pass. After the backward pass, gradients are divided by the same constant to restore correct magnitudes. The key is that the multiplication happens while values are still in a representable range.
Worked example: A gradient of 1e-8 underflows in FP16 (below 6e-8). But if you multiply by 65,536 first, it becomes 6.5e-4 — well within FP16 range. After the backward pass, divide by 65,536 to recover the true gradient.
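That rescue is easy to verify directly. A short sketch, again using `struct`'s half-precision format to round to FP16 (`to_fp16` is an illustrative helper, not a library function):

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

grad = 1e-8
scale = 65_536.0

print(to_fp16(grad))                  # 0.0 -- below the FP16 floor, underflows
print(to_fp16(grad * scale))          # ~6.55e-4 -- comfortably representable
print(to_fp16(grad * scale) / scale)  # ~1e-8 recovered after unscaling
```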
A fixed scale factor is fragile: too low and gradients still underflow; too high and they overflow. Dynamic loss scaling solves this by adapting the scale during training.
The algorithm:
1. Start with a large scale (e.g., 65,536).
2. If any scaled gradient overflows to inf or NaN, skip the optimizer step and halve the scale.
3. After a fixed number of consecutive clean steps (e.g., 2,000), double the scale.
This keeps the scale hovering just below the largest value that avoids overflow.
This is exactly what PyTorch's `torch.cuda.amp.GradScaler` does internally.
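The loop is short enough to sketch in plain Python. The class name `DynamicLossScaler` and its defaults (initial scale 65,536, halve on overflow, double after 2,000 clean steps) mirror GradScaler's documented behavior, but this is an illustration, not PyTorch's actual implementation:

```python
import math

class DynamicLossScaler:
    """Toy dynamic loss scaler: halve the scale on overflow, double it
    after a run of clean steps. Illustrative only, not a real API."""

    def __init__(self, init_scale=65_536.0, growth_interval=2_000):
        self.scale = init_scale
        self.growth_interval = growth_interval
        self._clean_steps = 0

    def update(self, grads):
        """Inspect this step's gradients; return True if the optimizer
        step should be applied, False if it must be skipped."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            self.scale /= 2          # back off: scaling pushed gradients past 65,504
            self._clean_steps = 0
            return False             # skip this optimizer step entirely
        self._clean_steps += 1
        if self._clean_steps >= self.growth_interval:
            self.scale *= 2          # probe a higher scale
            self._clean_steps = 0
        return True

scaler = DynamicLossScaler()
scaler.update([float("inf")])        # overflow -> step skipped, scale halves
print(scaler.scale)                  # 32768.0
```

Note the design choice: on an overflow step the optimizer update is skipped entirely, which is why occasional inf gradients are harmless under dynamic scaling.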
BF16 keeps FP32's 8 exponent bits, giving it the same range (~1e-38 to ~3.4e38). This means gradients almost never underflow in BF16, and you do not need loss scaling at all.
The trade-off is precision: BF16 has only 7 mantissa bits (~2-3 significant digits) compared to FP16's 10 mantissa bits (~3-4 digits). For most training workloads, this precision loss has negligible effect on convergence.
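You can see that precision gap in a few lines. A BF16 value is just a float32 with the low 16 mantissa bits dropped; this `to_bf16` helper truncates those bits (real hardware rounds to nearest even, so this is a slight simplification for illustration):

```python
import struct

def to_bf16(x: float) -> float:
    """Truncate a float to bfloat16: keep float32's sign, 8 exponent
    bits, and top 7 mantissa bits; zero the rest. (Hardware rounds to
    nearest even; truncation is close enough for illustration.)"""
    bits = int.from_bytes(struct.pack('<f', x), 'little')
    return struct.unpack('<f', (bits & 0xFFFF0000).to_bytes(4, 'little'))[0]

print(to_bf16(2e-38))    # tiny values survive: same exponent range as FP32
print(to_bf16(3.14159))  # 3.140625 -- only ~3 significant digits remain
```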
| Format | Range | Precision | Loss scaling needed? | Hardware support |
|---|---|---|---|---|
| FP32 | ~1e-38 to ~3.4e38 | ~7 digits | No | All GPUs |
| FP16 | ~6e-8 to 65,504 | ~3-4 digits | Yes (recommended) | V100, all modern GPUs |
| BF16 | ~1e-38 to ~3.4e38 | ~2-3 digits | No | A100, H100, TPU v3+ |
This section walks through the symptoms you will see in practice when numerical precision goes wrong, and the first things to check.
Symptom 1: Loss plateaus early, gradient norms collapse to zero. First check: log per-layer gradient norms; exact zeros concentrated in the earliest layers point to FP16 underflow. Cheapest fix: enable (or raise) loss scaling, or switch to BF16.
Symptom 2: Loss spikes to NaN or inf intermittently. First check: whether scaled gradients are overflowing FP16's 65,504 ceiling; a fixed or overly high loss scale will do this. Cheapest fix: dynamic loss scaling, which skips the bad step and halves the scale.
Symptom 3: Training converges but final quality is slightly worse than expected. First check: that master weights and optimizer state are kept in FP32; accumulated 16-bit rounding over millions of steps quietly degrades the final model.
How does the precision problem change as you scale up?
| Scenario | What breaks | Why | Mitigation |
|---|---|---|---|
| Shallow net (10 layers), small batch | Nothing — FP16 is fine | Gradients stay in representable range | Standard mixed precision |
| Deep net (100+ layers) | Early-layer underflow in FP16 | 0.85¹⁰⁰ ≈ 8.8e-8, just above the FP16 subnormal floor | BF16, or aggressive loss scaling |
| Large batch (8K+ samples) | Gradient averaging pushes magnitudes down | Mean of 8K near-uncorrelated gradients is ~90x (≈ √8192) smaller than a single-sample gradient | Higher loss scale to compensate |
| Long training (weeks) | Accumulated FP16 rounding errors | Master weights drift from true value over millions of steps | FP32 master weights (standard practice) |
| Very large model (70B+) | Activation memory forces FP16/BF16 | Cannot afford FP32 for activations | BF16 compute + FP32 accumulation |
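The large-batch row is just the 1/√N statistics of averaging. A quick sketch, modeling per-sample gradients for one parameter as random ±1 values (the ±1 model and the sample count are illustrative assumptions):

```python
import random

random.seed(0)
N = 8_192  # batch size

# Per-sample gradients for one parameter, modeled as random +/-1 values.
per_sample = [random.choice((-1.0, 1.0)) for _ in range(N)]

batch_mean = sum(per_sample) / N
print(abs(batch_mean))  # typically ~1/sqrt(8192) ~= 0.011, vs. 1.0 per sample
```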
NVIDIA Mixed Precision (FP16 path):
- `GradScaler` (starts at 65,536, halves on overflow)

Google TPU / A100+ (BF16 path):
Choosing in practice: Use this rubric in a model review:
- Hardware supports BF16 (A100, H100, TPU v3+)? Default to BF16 and drop loss scaling.
- FP16-only hardware (e.g., V100)? Use FP16 with dynamic loss scaling.
- Either way, keep FP32 master weights and do accumulations in FP32.
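The decision reduces to a couple of branches. A toy chooser (the function name and return strings are made up for illustration, following the comparison table above):

```python
def pick_precision(has_bf16_hardware: bool, can_afford_fp32: bool) -> str:
    """Toy precision chooser mirroring the format comparison table."""
    if has_bf16_hardware:
        return "bf16 (no loss scaling needed)"   # A100 / H100 / TPU v3+
    if can_afford_fp32:
        return "fp32"                            # slow but numerically safe
    return "fp16 + dynamic loss scaling"         # e.g., V100 clusters

print(pick_precision(has_bf16_hardware=False, can_afford_fp32=False))
```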
Test your understanding with these operational questions:
Estimate: A 96-layer Transformer has average per-layer gradient attenuation of 0.82. At which layer will FP16 gradients first underflow? (Hint: solve 0.82^k = 6e-8 for k.)
Calculate: You are training with loss scale = 32,768 in FP16. What is the smallest gradient that will survive without underflowing? (Hint: the gradient times the scale must exceed the FP16 floor.)
Diagnose: Your training loss has been flat for 2,000 steps. Gradient norms for layers 0-20 are ~1e-3, but layers 21-48 all show exactly 0.0. What is the most likely cause, and what is the cheapest fix?
Decide: You have a cluster of V100 GPUs (no BF16 support) and need to train a 30-layer model. FP16 with no loss scaling, FP16 with dynamic loss scaling, or FP32 — which do you pick and why?
Papers:
Open questions:
Next up: We explore how optimizers navigate loss landscapes — and why SGD's noise is a feature, not a bug.