Backprop as Graph Transformation
Estimated reading time: 20 minutes
Backprop isn't "chain rule backwards." It's a graph traversal.
Understanding the computation graph lets you catch bugs that waste $2M training runs. This lesson teaches you to reason about gradient flow as graph transformation.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Computation Graph to build core intuition and shared vocabulary.
- Move to Gradient Accumulation: += Not = to understand the mechanism behind the intuition.
- Apply the idea in Detachment: Breaking the Graph with concrete examples and implementation details.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
- Finish with research extensions to connect today’s mental model to open problems.
The Computation Graph#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
During the forward pass, PyTorch builds a directed acyclic graph (DAG):
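A minimal sketch of that graph in standard PyTorch (an illustrative example, not the lesson's original listing): every operation attaches a grad_fn node, and backward() walks those nodes in reverse.

```python
import torch

# Each operation run during the forward pass records a grad_fn node.
x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)

y = x * w          # MulBackward0 node
z = y + 1          # AddBackward0 node
loss = z ** 2      # PowBackward0 node

print(loss.grad_fn)                 # <PowBackward0 ...>
print(loss.grad_fn.next_functions)  # edge back to the AddBackward0 node

# backward() traverses this DAG in reverse topological order.
loss.backward()
print(x.grad, w.grad)               # tensor(42.) tensor(28.)
```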
Gradient Accumulation: += Not =#
Flow bridge: Building on The Computation Graph, this section adds the next layer of conceptual depth.
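The key mechanism, in a minimal sketch assuming standard PyTorch semantics: backward() accumulates into .grad with +=, so repeated backward passes through the same parameter add up unless you reset the gradient.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

loss1 = (w * 2) ** 2   # d(loss1)/dw = 8*w
loss2 = (w * 3) ** 2   # d(loss2)/dw = 18*w

loss1.backward()
print(w.grad)          # tensor(8.)

loss2.backward()
print(w.grad)          # tensor(26.) -- added to the existing grad, not replaced

w.grad.zero_()         # explicit reset; optimizer.zero_grad() does this for you
```

This is why forgetting zero_grad() silently mixes gradients from different batches, and also why deliberate gradient accumulation over micro-batches works at all.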
Detachment: Breaking the Graph#
Flow bridge: Building on Gradient Accumulation: += Not =, this section adds the next layer of conceptual depth.
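A small illustration (my sketch, assuming standard PyTorch semantics) of how detach() and torch.no_grad() cut the graph:

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(5.0, requires_grad=True)

y = x * 3
loss = y.detach() * w    # detach() cuts the edge back to x
loss.backward()

print(w.grad)            # tensor(6.) -- gradient still reaches w
print(x.grad)            # None       -- the path to x was severed

# torch.no_grad() goes further: no graph is built at all (e.g. for inference).
with torch.no_grad():
    pred = x * w
print(pred.requires_grad)  # False
```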
Build Your Own Autograd#
Flow bridge: Apply the concept through concrete implementation details before moving to harder edge cases.
Let's implement a complete mini-autograd to see how it all fits together:
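Below is a minimal scalar-valued sketch in the micrograd style (illustrative, not a full implementation): each Value node records its parents and a local backward rule, and backward() topologically sorts the DAG before applying those rules in reverse, accumulating into .grad with +=.

```python
import math

class Value:
    """A scalar node in the computation graph."""

    def __init__(self, data, parents=(), op=""):
        self.data = data
        self.grad = 0.0                 # accumulated, hence += in the rules below
        self._backward = lambda: None   # local gradient rule for this node
        self._parents = set(parents)
        self._op = op

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other), "+")
        def _backward():
            self.grad += out.grad       # d(out)/d(self) = 1
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other), "*")
        def _backward():
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,), "tanh")
        def _backward():
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the DAG, then apply local rules in reverse order.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for p in v._parents:
                    build(p)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# Usage: a tiny neuron; gradients flow back to the inputs and weight.
x, w, b = Value(2.0), Value(-3.0), Value(1.0)
out = (x * w + b).tanh()
out.backward()
print(out.data, w.grad, x.grad)
```

The same three ingredients appear in PyTorch's autograd: graph construction during forward, reverse topological traversal, and += accumulation into each node's gradient.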
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| Small models | Nothing yet; graph memory is negligible | Standard autograd |
| Large models | Activation memory for backward | Gradient checkpointing |
| Very long sequences | Quadratic attention activations | FlashAttention, chunked backward |
| Multi-GPU | Graph needs to sync across devices | Distributed autograd, tensor parallelism |
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
Gradient Checkpointing: Instead of storing all activations, recompute them during backward:
- 30-40% memory savings
- 20-30% compute overhead
- Used in all large model training (Megatron-LM, DeepSpeed, FSDP)
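A hedged sketch of the idea using PyTorch's generic torch.utils.checkpoint API (assuming a recent PyTorch where use_reentrant=False is available; Megatron-LM, DeepSpeed, and FSDP ship their own variants):

```python
import torch
from torch.utils.checkpoint import checkpoint

layer1 = torch.nn.Linear(1024, 1024)
layer2 = torch.nn.Linear(1024, 1024)
x = torch.randn(32, 1024, requires_grad=True)

# Activations inside the checkpointed segment are not stored; they are
# recomputed from x during the backward pass, trading compute for memory.
h = checkpoint(lambda t: torch.relu(layer1(t)), x, use_reentrant=False)
loss = layer2(h).sum()
loss.backward()
print(x.grad.shape)   # torch.Size([32, 1024])
```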
Debugging Tools:
- tensor.register_hook(fn) — Inspect gradients during backward
- torch.autograd.grad() — Compute gradients without .backward()
- torch.autograd.set_detect_anomaly(True) — Find NaN/Inf sources
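A short sketch of the first two tools in use (tensor names are my own, chosen for illustration):

```python
import torch

w = torch.randn(4, requires_grad=True)
loss = (w ** 2).sum()

# The hook fires when the gradient w.r.t. w is computed during backward.
w.register_hook(lambda g: print("grad reaching w:", g))
loss.backward()

# autograd.grad returns gradients directly, without writing to .grad.
x = torch.tensor(3.0, requires_grad=True)
(dy_dx,) = torch.autograd.grad(x ** 2, x)
print(dy_dx)   # tensor(6.)
```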
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you trace gradient flow through a dynamic computation graph?
- Without notes, can you identify when gradients are accumulated versus overwritten?
- Without notes, can you debug common gradient-flow bugs (no_grad, detach, missing requires_grad)?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "Training Deep Nets with Sublinear Memory Cost" (Chen et al., 2016) — The original gradient checkpointing paper
- "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018) — Comprehensive overview of autograd techniques
Open Questions:
- Can we learn optimal checkpointing strategies for a given model architecture?
- How do we efficiently compute second-order gradients (Hessians) for large models?
Next up: Initialization determines whether training starts or stalls. We'll derive why the random initialization scale must depend on layer width.