The Debugging Flowchart
Estimated reading time: 22 minutes
Senior engineers debug with a systematic methodology, not guesswork. This capstone integrates everything you've learned into a single debugging flowchart.
Training failures are expensive: a $2M training run that diverges after three days is a disaster. This lesson gives you a systematic approach to catching problems early.
Learning Progression (Easy → Hard)#
Use this sequence as you read:
- Start with The Master Flowchart to build core intuition and shared vocabulary.
- Move to Pattern 1: Divergence (Loss → inf/nan) to understand the mechanism behind the intuition.
- Apply the idea in Pattern 2: Plateau (Loss Not Decreasing) with concrete examples and implementation details.
- Challenge your understanding in the failure-mode sections and check what breaks first.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
The Master Flowchart#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
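As a minimal sketch (not part of any library), the flowchart's top-level branches can be written as a single `diagnose` helper. The thresholds and the oscillation heuristic below are illustrative assumptions, not a standard recipe:

```python
import math

def diagnose(loss_history: list[float]) -> str:
    """Map a training symptom to one of the pattern sections below."""
    latest = loss_history[-1]
    if math.isnan(latest) or math.isinf(latest):
        return "Pattern 1: Divergence - check gradient norms; lower LR, clip gradients"
    window = loss_history[-100:]
    if max(window) - min(window) < 1e-3:          # illustrative plateau threshold
        return "Pattern 2: Plateau - check gradient magnitude; raise LR, revisit init"
    if _oscillating(window):
        return "Pattern 3: Instability - lower LR or increase batch size"
    return "Loss is decreasing - keep monitoring"

def _oscillating(window: list[float]) -> bool:
    # Crude heuristic: a healthy loss curve rarely flips direction on most steps.
    deltas = [b - a for a, b in zip(window, window[1:])]
    flips = sum(1 for d1, d2 in zip(deltas, deltas[1:]) if d1 * d2 < 0)
    return flips > 0.6 * max(len(deltas) - 1, 1)
```

The point is not the exact thresholds but the branching order: rule out divergence first, then a plateau, then instability.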
Instructor Lens#
Pattern 1: Divergence (Loss → inf/nan)#
Flow bridge: Building on The Master Flowchart, this section covers the most urgent branch: loss that blows up to inf/nan.
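A minimal sketch of the usual first response: log the global gradient norm every step and clip it. This assumes a standard PyTorch training loop that supplies `loss`, `model`, and `optimizer`; `MAX_GRAD_NORM` is an illustrative value.

```python
import torch

MAX_GRAD_NORM = 1.0  # illustrative value; tune per model

def backward_and_step(loss: torch.Tensor,
                      model: torch.nn.Module,
                      optimizer: torch.optim.Optimizer) -> float:
    """Backward pass with gradient-norm logging and clipping."""
    loss.backward()
    # clip_grad_norm_ returns the total norm *before* clipping, which is
    # exactly the statistic to watch for incoming divergence.
    total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
    if not torch.isfinite(total_norm):
        raise RuntimeError(f"Non-finite gradient norm: {total_norm.item()}")
    optimizer.step()
    optimizer.zero_grad()
    return total_norm.item()
```

A gradient norm that trends upward for many steps before the loss goes to nan is the early warning you want on a dashboard.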
Pattern 2: Plateau (Loss Not Decreasing)#
Flow bridge: Building on Pattern 1: Divergence, this section covers the opposite failure: loss that refuses to move at all.
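To check gradient magnitude, a minimal sketch that prints per-parameter gradient statistics right after `backward()`; `model` is assumed to be your `nn.Module`.

```python
import torch

def log_grad_stats(model: torch.nn.Module) -> None:
    """Print per-parameter gradient statistics; call right after loss.backward()."""
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient (frozen, detached, or unused?)")
            continue
        g = param.grad.detach()
        print(f"{name}: norm={g.norm().item():.3e}  mean_abs={g.abs().mean().item():.3e}")
```

Tiny gradients everywhere usually point to too low a learning rate or a poor initialization; gradients missing on specific layers point to a wiring bug in the model.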
Pattern 3: Instability (Oscillating Loss)#
Flow bridge: Building on Pattern 2: Plateau, this section covers the case in between: loss that moves but keeps oscillating instead of converging.
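One way to make "oscillating" concrete is to compare the loss's short-window noise to its downward trend. The window length and the 5x ratio below are illustrative assumptions:

```python
from collections import deque
import statistics

WINDOW = 50                      # illustrative window length
recent = deque(maxlen=WINDOW)

def check_stability(loss_value: float) -> None:
    """Call once per step with the scalar training loss."""
    recent.append(loss_value)
    if len(recent) < WINDOW:
        return
    trend = recent[0] - recent[-1]       # net improvement over the window
    noise = statistics.stdev(recent)     # how much the loss bounced around
    if noise > 5 * max(trend, 1e-8):     # 5x is an illustrative ratio
        print("Loss is oscillating more than it is improving: "
              "lower the learning rate or increase the batch size.")
```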
The Complete Debugging Checklist#
Flow bridge: With the three patterns in hand, consolidate them into a checklist you can run under realistic failure conditions.
Quick Reference Card#
Flow bridge: Building on The Complete Debugging Checklist, this card condenses it into a table you can scan mid-incident.
| Symptom | First Check | Likely Fix |
|---|---|---|
| Loss → nan | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Memory error | Batch size | Lower batch, gradient checkpointing |
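For the "Memory error" row, gradient checkpointing trades compute for memory. A minimal sketch, where `blocks` is a hypothetical ModuleList of transformer layers; the only real API used is `torch.utils.checkpoint.checkpoint`:

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedStack(torch.nn.Module):
    """Wraps a stack of blocks so their activations are recomputed in backward."""

    def __init__(self, blocks: torch.nn.ModuleList):
        super().__init__()
        self.blocks = blocks

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block in self.blocks:
            # Activations inside `block` are recomputed during backward
            # instead of being stored, cutting peak memory for extra FLOPs.
            x = checkpoint(block, x, use_reentrant=False)
        return x
```

Try lowering the batch size first; reach for checkpointing when the batch is already as small as the optimization can tolerate.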
Production Debugging Tools#
Flow bridge: The checks above are manual; these tools automate the monitoring and profiling for real runs.
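A minimal sketch of `torch.profiler` (one of the tools listed under Research Hooks), assuming an existing `train_step()` function; the schedule values and log directory are illustrative:

```python
from torch.profiler import (ProfilerActivity, profile, schedule,
                            tensorboard_trace_handler)

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),  # view in TensorBoard
    record_shapes=True,
)

with prof:
    for step in range(6):
        train_step()   # assumed: your existing forward/backward/optimizer step
        prof.step()    # advance the wait/warmup/active schedule
```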
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | Debugging Challenge | Approach |
|---|---|---|
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |
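A minimal sketch of the "automated monitoring, early stopping" row: a watchdog that checkpoints and aborts when the loss goes non-finite or the gradient norm spikes. The threshold and checkpoint filename are illustrative assumptions:

```python
import math
import torch

GRAD_NORM_LIMIT = 100.0  # illustrative; set from healthy-run statistics

def watchdog(step: int, loss: float, grad_norm: float,
             model: torch.nn.Module, optimizer: torch.optim.Optimizer) -> None:
    """Abort the run (after saving state) when training looks unhealthy."""
    if math.isfinite(loss) and grad_norm < GRAD_NORM_LIMIT:
        return
    torch.save(
        {"step": step, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
        f"emergency_step{step}.pt",
    )
    raise RuntimeError(
        f"Aborting at step {step}: loss={loss}, grad_norm={grad_norm}. "
        "Restart from the last healthy checkpoint with a lower learning rate."
    )
```

At production scale a failed assertion like this is far cheaper than letting a diverged run burn GPU-hours until someone notices.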
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you apply a systematic methodology to diagnose training failures?
- Without notes, can you distinguish between loss plateau, divergence, and instability?
- Without notes, can you use gradient statistics to identify root causes?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Tools to Master:
- PyTorch Profiler (torch.profiler)
- NVIDIA Nsight Systems
- Weights & Biases (wandb)
- TensorBoard
Papers:
- "On the Difficulty of Training Recurrent Neural Networks" (Pascanu et al., 2013) — Classic analysis of gradient problems
- "Visualizing and Understanding Recurrent Networks" (Karpathy et al., 2015) — Debugging through visualization
Congratulations! You've completed Track 0: Foundations. You now have the mental models that separate senior research engineers from juniors. Next steps: Apply these principles to the advanced tracks on LLM training, parallelism, and inference.