The Debugging Flowchart
Estimated reading time: 22 minutes
Senior engineers debug with a systematic methodology, not guesswork. This capstone integrates everything you've learned into a single debugging flowchart.
Training failures are expensive: a $2M training run that diverges after three days is a disaster. This lesson gives you a systematic approach to catching problems early.
Learning Progression (Easy → Hard)#
Use this sequence as you read:
- Start with The Master Flowchart to build core intuition and shared vocabulary.
- Move to Pattern 1: Divergence (Loss → inf/nan) to understand the mechanism behind the intuition.
- Apply the idea in Pattern 2: Plateau (Loss Not Decreasing) with concrete examples and implementation details.
- Challenge your understanding in the failure-mode sections and check what breaks first.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
The Master Flowchart#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
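As a minimal sketch (not part of any library), the flowchart's top-level branches can be written as a single `diagnose` helper. The thresholds and the oscillation heuristic below are illustrative assumptions, not a standard recipe:

```python
import math

def diagnose(loss_history: list[float]) -> str:
    """Map a training symptom to one of the pattern sections below."""
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "Pattern 1: Divergence - check gradient norms; lower LR, clip gradients"
    window = loss_history[-100:]
    if max(window) - min(window) < 1e-3:          # illustrative plateau threshold
        return "Pattern 2: Plateau - check gradient magnitude; raise LR, revisit init"
    if _oscillating(window):
        return "Pattern 3: Instability - lower LR or increase batch size"
    return "Loss is decreasing - keep monitoring"

def _oscillating(window: list[float]) -> bool:
    # Crude heuristic: a healthy loss curve rarely flips direction on most steps.
    deltas = [b - a for a, b in zip(window, window[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return flips > 0.6 * max(len(deltas) - 1, 1)
```

The point is not the exact thresholds but the branching order: rule out divergence first, then a plateau, then instability.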
Instructor Lens#
Pattern 1: Divergence (Loss → inf/nan)#
Flow bridge: Building on The Master Flowchart, this section covers the most urgent branch: loss that blows up to inf/nan.
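A minimal sketch of the usual first response: log the global gradient norm every step and clip it. This assumes a standard PyTorch training loop that supplies `loss`, `model`, and `optimizer`; `MAX_GRAD_NORM` is an illustrative value.

```python
import torch

MAX_GRAD_NORM = 1.0  # illustrative value; tune per model

def backward_and_step(loss: torch.Tensor,
                      model: torch.nn.Module,
                      optimizer: torch.optim.Optimizer) -> float:
    """Backward pass with gradient-norm logging and clipping."""
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping, which is
    # exactly the statistic to watch for incoming divergence.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    if not torch.isfinite(total_norm):
        raise RuntimeError(f"Non-finite gradient norm: {total_norm.item()}")
    optimizer.step()
    optimizer.zero_grad()
    return total_norm.item()
```

A gradient norm that trends upward for many steps before the loss goes to nan is the early warning you want on a dashboard.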
Pattern 2: Plateau (Loss Not Decreasing)#
Flow bridge: Building on Pattern 1: Divergence, this section covers the opposite failure: loss that refuses to move at all.
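To check gradient magnitude, a minimal sketch that prints per-parameter gradient statistics right after `backward()`; `model` is assumed to be your `nn.Module`.

```python
import torch

def log_grad_stats(model: torch.nn.Module) -> None:
    """Print per-parameter gradient statistics; call right after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient (frozen, detached, or unused?)")
            continue
        g = param.grad.detach()
        print(f"{name}: norm={g.norm().item():.3e}  mean_abs={g.abs().mean().item():.3e}")
```

Tiny gradients everywhere usually point to too low a learning rate or a poor initialization; gradients missing on specific layers point to a wiring bug in the model.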
Pattern 3: Instability (Oscillating Loss)#
Flow bridge: Building on Pattern 2: Plateau, this section covers the case in between: loss that moves but keeps oscillating instead of converging.
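One way to make "oscillating" concrete is to compare the loss's short-window noise to its downward trend. The window length and the 5x ratio below are illustrative assumptions:

```python
from collections import deque
import statistics

WINDOW = 50                      # illustrative window length
recent = deque(maxlen=WINDOW)

def check_stability(loss_value: float) -> None:
    """Call once per step with the scalar training loss."""
    recent.append(loss_value)
    if len(recent) < WINDOW:
        return
    trend = recent[0] - recent[-1]       # net improvement over the window
    noise = statistics.stdev(recent)     # how much the loss bounced around
    if noise > 5 * max(trend, 1e-8):     # 5x is an illustrative ratio
        print("Loss is oscillating more than it is improving: "
              "lower the learning rate or increase the batch size.")
```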
The Complete Debugging Checklist#
Flow bridge: With the three patterns in hand, consolidate them into a checklist you can run under realistic failure conditions.
Quick Reference Card#
Flow bridge: Building on The Complete Debugging Checklist, this card condenses it into a table you can scan mid-incident.
| Symptom | First Check | Likely Fix |
|---|---|---|
| Loss → nan | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Memory error | Batch size | Lower batch, gradient checkpointing |
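For the "Memory error" row, gradient checkpointing trades compute for memory. A minimal sketch, where `blocks` is a hypothetical ModuleList of transformer layers; the only real API used is `torch.utils.checkpoint.checkpoint`:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Wraps a stack of blocks so their activations are recomputed in backward."""

    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are recomputed during backward
            # instead of being stored, cutting peak memory for extra FLOPs.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Try lowering the batch size first; reach for checkpointing when the batch is already as small as the optimization can tolerate.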
Production Debugging Tools#
Flow bridge: The checks above are manual; these tools automate the monitoring and profiling for real runs.
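A minimal sketch of `torch.profiler` (one of the tools listed under Research Hooks), assuming an existing `train_step()` function; the schedule values and log directory are illustrative:

```python
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),  # view in TensorBoard
    record_shapes=True,
)

with prof:
    for step in range(6):
        train_step()   # assumed: your existing forward/backward/optimizer step
        prof.step()    # advance the wait/warmup/active schedule
```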
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | Debugging Challenge | Approach |
|---|---|---|
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |
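A minimal sketch of the "automated monitoring, early stopping" row: a watchdog that checkpoints and aborts when the loss goes non-finite or the gradient norm spikes. The threshold and checkpoint filename are illustrative assumptions:

```python
import math
import torch

GRAD_NORM_LIMIT = 100.0  # illustrative; set from healthy-run statistics

def watchdog(step: int, loss: float, grad_norm: float,
             model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Abort the run (after saving state) when training looks unhealthy."""
    if math.isfinite(loss) and grad_norm < GRAD_NORM_LIMIT:
        return
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        f"emergency_step{step}.pt",
    )
    raise RuntimeError(
        f"Aborting at step {step}: loss={loss}, grad_norm={grad_norm}. "
        "Restart from the last healthy checkpoint with a lower learning rate."
    )
```

At production scale a failed assertion like this is far cheaper than letting a diverged run burn GPU-hours until someone notices.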
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you apply a systematic methodology to diagnose training failures?
- Without notes, can you distinguish between loss plateau, divergence, and instability?
- Without notes, can you use gradient statistics to identify root causes?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Tools to Master:
- PyTorch Profiler (torch.profiler)
- NVIDIA Nsight Systems
- Weights & Biases (wandb)
- TensorBoard
Papers:
- "On the Difficulty of Training Recurrent Neural Networks" (Pascanu et al., 2013) — Classic analysis of gradient problems
- "Visualizing and Understanding Recurrent Networks" (Karpathy et al., 2015) — Debugging through visualization
Congratulations! You've completed Track 0: Foundations. You now have the mental models that separate senior research engineers from juniors. Next steps: Apply these principles to the advanced tracks on LLM training, parallelism, and inference.