The Debugging Flowchart

Estimated reading time: 22 minutes

Senior engineers debug with a systematic methodology rather than intuition alone. This capstone integrates everything you've learned into a single debugging flowchart.

Training failures are expensive. A $2M training run that diverges after 3 days is a disaster. This lesson gives you a systematic approach for catching problems early.

Learning Progression (Easy → Hard)#

Use this sequence as you read:

  1. Start with The Master Flowchart to build core intuition and a shared vocabulary.
  2. Move to Pattern 1: Divergence (Loss → inf/nan) to see how a symptom is traced back to its root cause.
  3. Apply the same process to Pattern 2: Plateau (Loss Not Decreasing) and Pattern 3: Instability (Oscillating Loss).
  4. Challenge your understanding with The Complete Debugging Checklist and the Quick Reference Card, and note what breaks first in your own runs.
  5. Then zoom out to the Scale Thought Experiment to see how the same methodology holds at larger model and system sizes.
  6. Finish with Production Debugging Tools to understand how teams make practical tradeoffs, and Research Hooks for where to go next.

The Master Flowchart#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

(Diagram: the master debugging flowchart.)
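
In code, the flowchart's entry point is a single decision: classify the recent loss history into one of the three patterns below, then follow that pattern's branch. Here is a minimal, illustrative sketch of that first decision; the function name classify_loss_curve and all thresholds are assumptions to tune for your own runs, not part of the original diagram.

```python
import math

def classify_loss_curve(losses, window=100):
    """Classify recent loss behavior as 'diverging', 'plateau', 'oscillating', or 'healthy'.

    A minimal sketch of the flowchart's first decision; thresholds are illustrative.
    """
    recent = losses[-window:]
    if len(recent) < 10:
        return "healthy"  # not enough history to judge yet

    # Divergence: any inf/nan, or loss trending sharply upward.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "diverging"
    half = len(recent) // 2
    early, late = recent[:half], recent[half:]
    mean_early = sum(early) / len(early)
    mean_late = sum(late) / len(late)
    if mean_late > 1.5 * abs(mean_early) + 1e-8:
        return "diverging"

    # Oscillation: step-to-step swings that are large relative to the loss level.
    swings = [abs(b - a) for a, b in zip(recent, recent[1:])]
    mean_swing = sum(swings) / len(swings)
    if mean_swing > 0.2 * (abs(mean_late) + 1e-8):
        return "oscillating"

    # Plateau: no meaningful improvement between the two halves of the window.
    if (mean_early - mean_late) < 0.01 * (abs(mean_early) + 1e-8):
        return "plateau"

    return "healthy"
```

For example, classify_loss_curve([2.31] * 60 + [2.30] * 60) returns "plateau", which routes you to Pattern 2.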

Pattern 1: Divergence (Loss → inf/nan)#

Flow bridge: Building on The Master Flowchart, this section walks its first branch: loss that blows up to inf or nan.

diagnose_divergence.py
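
As a hedged sketch of what a diagnostic in the spirit of diagnose_divergence.py might check, assuming a standard PyTorch model inspected right after loss.backward(): look for non-finite parameters first, then for exploding gradient norms, exactly as the quick-reference card below suggests. The threshold and messages are illustrative.

```python
import torch

def diagnose_divergence(model, max_grad_norm=10.0):
    """Check the usual divergence suspects right after loss.backward().

    A sketch, not the original diagnose_divergence.py; the threshold is illustrative.
    """
    findings = []

    # 1. Non-finite parameters mean the damage is already done: roll back to a checkpoint.
    for name, p in model.named_parameters():
        if not torch.isfinite(p).all():
            findings.append(f"non-finite values in parameter {name}: restore from a checkpoint")

    # 2. Exploding gradients are the most common cause: lower the LR or clip gradients.
    total_sq = 0.0
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        if not torch.isfinite(p.grad).all():
            findings.append(f"non-finite gradient in {name}: check inputs, loss scaling, and LR")
            continue
        total_sq += float(p.grad.float().norm() ** 2)
    grad_norm = total_sq ** 0.5
    if grad_norm > max_grad_norm:
        findings.append(
            f"global grad norm {grad_norm:.1f} exceeds {max_grad_norm}: "
            "lower the learning rate or enable gradient clipping"
        )

    return findings or ["no obvious cause found: inspect the offending data batch and the LR schedule"]
```

Run it after the backward pass of a suspicious step and before optimizer.step(), so the gradients it inspects are the ones that caused the blow-up.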

Pattern 2: Plateau (Loss Not Decreasing)#

Flow bridge: Building on Pattern 1, this section covers the opposite failure: loss that refuses to move at all.

diagnose_plateau.py
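
Similarly, a minimal sketch of a plateau diagnostic in the spirit of diagnose_plateau.py, assuming a PyTorch model and optimizer inspected after loss.backward(); the vanishing-gradient threshold of 1e-7 is an illustrative assumption.

```python
import torch

def diagnose_plateau(model, optimizer, vanish_threshold=1e-7):
    """Check the usual plateau suspects: vanishing gradients and a too-small learning rate.

    A sketch, not the original diagnose_plateau.py; thresholds are illustrative.
    Assumes loss.backward() has just been called so .grad is populated.
    """
    findings = []

    # 1. Per-layer gradient magnitudes near zero point to bad init or saturated activations.
    for name, p in model.named_parameters():
        if p.grad is None:
            continue
        grad_rms = p.grad.float().pow(2).mean().sqrt().item()
        if grad_rms < vanish_threshold:
            findings.append(f"{name}: grad RMS {grad_rms:.2e} is near zero (check init / activations)")

    # 2. A plateau often just means the updates are too small for the loss surface.
    for group in optimizer.param_groups:
        if group["lr"] < 1e-6:
            findings.append(f"learning rate {group['lr']:.2e} is very low: check the warmup/decay schedule")

    return findings or ["gradients look healthy: try a higher LR, more capacity, or audit the data pipeline"]
```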

Pattern 3: Instability (Oscillating Loss)#

Flow bridge: Building on Pattern 2, this section covers the remaining branch: loss that neither diverges nor settles, but swings back and forth.

diagnose_instability.py
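
And a minimal sketch of an instability diagnostic in the spirit of diagnose_instability.py; it needs only the loss history plus the current batch size and learning rate, and every threshold in it is an illustrative assumption.

```python
def diagnose_instability(losses, batch_size, lr, window=200):
    """Check the usual instability suspects: loss noise that dwarfs actual progress,
    plus the batch-size / learning-rate combination.

    A sketch, not the original diagnose_instability.py; thresholds are illustrative.
    """
    recent = losses[-window:]
    if len(recent) < 20:
        return ["not enough history to judge oscillation"]

    findings = []
    # Compare the total step-to-step movement against the net improvement over the window.
    swings = [abs(b - a) for a, b in zip(recent, recent[1:])]
    total_path = sum(swings)
    net_change = abs(recent[0] - recent[-1])
    if total_path > 5.0 * (net_change + 1e-8):
        findings.append(
            f"loss is mostly oscillating: it moved {total_path:.3g} in total "
            f"but only {net_change:.3g} net over the window"
        )
        if batch_size < 64:
            findings.append(f"batch size {batch_size} is small: a larger batch reduces gradient noise")
        findings.append(f"try lowering the learning rate from {lr:.2e} (or lengthening warmup)")

    return findings or ["loss noise looks proportionate to progress: likely healthy"]
```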

The Complete Debugging Checklist#

Flow bridge: With the three patterns in hand, stress-test them against a step-by-step checklist you can run on a real failing job.

Step 1: Check the loss curve shape. Is it diverging (heading to inf/nan), plateauing (stuck), or unstable (oscillating)? Each shape has different causes and maps to one of the three patterns above.

Quick Reference Card#

Flow bridge: Building on The Complete Debugging Checklist, this card condenses the flowchart into a one-line symptom-to-fix lookup.

| Symptom | First Check | Likely Fix |
| --- | --- | --- |
| Loss → nan | Gradient norms | Lower LR, gradient clipping |
| Loss stuck | Gradient magnitude | Higher LR, better init |
| Loss oscillates | Batch size, LR | Lower LR, larger batch |
| Val loss rises | Regularization | Dropout, weight decay |
| Slow progress | Learning rate | Increase LR, check warmup |
| Memory error | Batch size | Lower batch, gradient checkpointing |
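
The same card, encoded as a small lookup table so it can be wired into the monitoring code later in the lesson; the dictionary and helper below are a convenience sketch, not an official artifact.

```python
# The quick-reference card as a lookup table: symptom -> (first check, likely fix).
QUICK_REFERENCE = {
    "loss -> nan":     ("gradient norms",     "lower LR, gradient clipping"),
    "loss stuck":      ("gradient magnitude", "higher LR, better init"),
    "loss oscillates": ("batch size, LR",     "lower LR, larger batch"),
    "val loss rises":  ("regularization",     "dropout, weight decay"),
    "slow progress":   ("learning rate",      "increase LR, check warmup"),
    "memory error":    ("batch size",         "lower batch, gradient checkpointing"),
}

def first_check(symptom):
    """Return the first thing to check and the likely fix for a known symptom."""
    check, fix = QUICK_REFERENCE[symptom]
    return f"check {check}; likely fix: {fix}"

# first_check("loss stuck") -> "check gradient magnitude; likely fix: higher LR, better init"
```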

Production Debugging Tools#

Flow bridge: The flowchart tells you what to check; these tools make the checks automatic during real training runs.

debugging_tools.py
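
As a sketch of the kind of tooling such a file might contain, here is a minimal training-run monitor that logs loss and the global gradient norm to TensorBoard and fails fast on non-finite loss; the class name, alert threshold, and log directory are illustrative, and SummaryWriter can be swapped for wandb.log.

```python
import math
import torch
from torch.utils.tensorboard import SummaryWriter

class TrainingMonitor:
    """Log loss and gradient norms each step and fail fast on non-finite loss.

    A sketch of the kind of tooling debugging_tools.py would contain; names
    and thresholds are illustrative.
    """

    def __init__(self, model, log_dir="runs/debug", grad_norm_alert=10.0):
        self.model = model
        self.writer = SummaryWriter(log_dir)
        self.grad_norm_alert = grad_norm_alert

    def step(self, step, loss):
        loss_value = float(loss)
        if not math.isfinite(loss_value):
            raise RuntimeError(f"non-finite loss at step {step}; stop and inspect the last batch")

        # Global gradient norm: the single most useful divergence early-warning signal.
        total_sq = 0.0
        for p in self.model.parameters():
            if p.grad is not None:
                total_sq += float(p.grad.float().norm() ** 2)
        grad_norm = total_sq ** 0.5

        self.writer.add_scalar("train/loss", loss_value, step)
        self.writer.add_scalar("train/grad_norm", grad_norm, step)
        if grad_norm > self.grad_norm_alert:
            print(f"[warn] step {step}: grad norm {grad_norm:.1f} exceeds {self.grad_norm_alert}")
```

Call monitor.step(step, loss) after loss.backward() and before optimizer.step(), so the logged gradient norm reflects the update about to be applied.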

Scale Thought Experiment#

Flow bridge: With the single-GPU workflow in place, extend it to larger model and system scales.

| Scale | Debugging Challenge | Approach |
| --- | --- | --- |
| Local (1 GPU) | Quick iteration | Many small experiments |
| Single node (8 GPUs) | Longer runs | Log everything, catch issues early |
| Multi-node (64+ GPUs) | Expensive failures | Extensive validation before scaling |
| Production (1000+ GPUs) | Can't afford restarts | Automated monitoring, early stopping |

Checkpoint Questions#

Use these to verify understanding before moving on; you should be able to answer each without notes:

  1. Can you apply a systematic methodology to diagnose a training failure?
  2. Can you distinguish between a loss plateau, divergence, and instability?
  3. Can you use gradient statistics to identify root causes?

Research Hooks#

Flow bridge: Use this practical baseline as a jumping-off point for the tools and papers that go deeper.

Tools to Master:

  • PyTorch Profiler (torch.profiler), with a short usage example after this list
  • NVIDIA Nsight Systems
  • Weights & Biases (wandb)
  • TensorBoard
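
For the first tool on the list, a minimal torch.profiler example that profiles a handful of training steps and prints the most expensive operators; train_step_fn is an assumed callable that runs one full forward/backward/optimizer step.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_train_steps(train_step_fn, num_steps=10):
    """Profile a few training steps and print the most expensive operators.

    train_step_fn is assumed to run one full forward/backward/optimizer step.
    """
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)

    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(num_steps):
            train_step_fn()

    # Sort by self CPU time; use "cuda_time_total" instead when profiling GPU kernels.
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```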

Papers:

  1. "On the Difficulty of Training Recurrent Neural Networks" (Pascanu et al., 2013) — Classic analysis of gradient problems
  2. "Visualizing and Understanding Recurrent Networks" (Karpathy et al., 2015) — Debugging through visualization

Congratulations! You've completed Track 0: Foundations. You now have the mental models that separate senior research engineers from juniors. Next steps: Apply these principles to the advanced tracks on LLM training, parallelism, and inference.