Backprop as Graph Transformation

Estimated reading time: 20 minutes

Backprop isn't "chain rule backwards." It's a graph traversal.

Understanding the computation graph lets you catch bugs that waste $2M training runs. This lesson teaches you to reason about gradient flow as graph transformation.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with The Computation Graph to build core intuition and shared vocabulary.
  2. Move to Gradient Accumulation: += Not = to understand the mechanism behind the intuition.
  3. Apply the idea in Detachment: Breaking the Graph and Build Your Own Autograd through concrete code.
  4. Then zoom out to the Scale Thought Experiment to see how the same mechanism holds up at larger model and system sizes.
  5. Map the concept to Production Reality to understand how teams make practical tradeoffs.
  6. Finish with Research Hooks to connect today's mental model to open problems.

The Computation Graph#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

During the forward pass, PyTorch builds a directed acyclic graph (DAG):

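A quick way to see this graph (tensor names here are illustrative) is to inspect the grad_fn nodes PyTorch records as the forward pass runs; .backward() later walks these nodes in reverse:

```python
import torch

# Each forward-pass operation records a grad_fn node plus edges to its inputs.
x = torch.randn(2, requires_grad=True)
y = x * 2
z = y.sum()

print(z.grad_fn)                 # SumBackward0: the last node added to the graph
print(z.grad_fn.next_functions)  # edge back to the MulBackward0 node that made y
print(x.grad_fn)                 # None: leaf tensors are inputs, not results
```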

Gradient Accumulation: += Not =#

Flow bridge: Building on The Computation Graph, this section adds the next layer of conceptual depth.

gradient_accumulation.py
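A minimal sketch of the behavior this section is named for (illustrative tensors, not necessarily the lesson's exact demo): every call to .backward() adds into .grad with +=, so gradients pile up until you clear them, which is why training loops call optimizer.zero_grad() each step.

```python
import torch

w = torch.tensor([1.0, 2.0], requires_grad=True)

# First backward pass: .grad is written for the first time.
(w ** 2).sum().backward()
print(w.grad)  # tensor([2., 4.])

# Second backward pass WITHOUT zeroing: gradients accumulate (+=), they are not overwritten.
(w ** 2).sum().backward()
print(w.grad)  # tensor([4., 8.])

# Reset before the next optimization step, as optimizer.zero_grad() would.
w.grad = None
```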

Detachment: Breaking the Graph#

Flow bridge: Building on Gradient Accumulation: += Not =, this section adds the next layer of conceptual depth.

detachment_demo.py
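A minimal sketch of the demo, using the two parameters w1 and w2 referenced in the step below; the detached variant is included for contrast (exact shapes are illustrative):

```python
import torch

# Two parameters in a small two-stage computation.
w1 = torch.randn(3, 3, requires_grad=True)
w2 = torch.randn(3, 3, requires_grad=True)
x = torch.randn(1, 3)

# Normal flow: gradients reach both parameters.
h = x @ w1
(h @ w2).sum().backward()
print(w1.grad is not None, w2.grad is not None)  # True True

# Detached flow: .detach() cuts the graph, so w1 receives no gradient.
w1.grad, w2.grad = None, None
h = (x @ w1).detach()                # h now has no recorded history
(h @ w2).sum().backward()
print(w1.grad, w2.grad is not None)  # None True
```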

Step 1: Normal flow. Without detachment, gradients flow through the entire graph. Both w1 and w2 receive gradients.

Build Your Own Autograd#

Flow bridge: Apply the concept through concrete implementation details before moving to harder edge cases.

Let's implement a complete mini-autograd to see how it all fits together:

mini_autograd.py
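A minimal sketch of such an engine, in the spirit of small scalar autograd implementations like micrograd (the lesson's mini_autograd.py may differ in detail): each Value node records its parents and the local derivatives with respect to them, and backward() topologically sorts the graph and accumulates gradients with +=.

```python
class Value:
    """A scalar node in the computation graph."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # upstream nodes this value depends on
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a+b)/da = 1, d(a+b)/db = 1
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        # d(a*b)/da = b, d(a*b)/db = a
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order: every node appears after all of its parents.
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)

        self.grad = 1.0  # dL/dL = 1
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                # Accumulate (+=), never overwrite: a node may feed many consumers.
                parent.grad += local * node.grad


# Usage: y = x1 * x2 + x1  =>  dy/dx1 = x2 + 1, dy/dx2 = x1
x1, x2 = Value(3.0), Value(4.0)
y = x1 * x2 + x1
y.backward()
print(y.data, x1.grad, x2.grad)  # 15.0 5.0 3.0
```

Note how this reproduces the earlier sections in miniature: gradients are accumulated into parent.grad with +=, and a value created outside the graph (no recorded parents) simply receives nothing, which is what detach() emulates.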

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

| Scale | What Breaks | Mitigation |
| --- | --- | --- |
| Small models | Graph memory is negligible | Standard autograd |
| Large models | Activation memory for backward | Gradient checkpointing |
| Very long sequences | Quadratic attention activations | FlashAttention, chunked backward |
| Multi-GPU | Graph needs to sync across devices | Distributed autograd, tensor parallelism |

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

Gradient Checkpointing: Instead of storing all activations, recompute them during the backward pass (a usage sketch follows the list below):

  • 30-40% memory savings
  • 20-30% compute overhead
  • Used in all large model training (Megatron-LM, DeepSpeed, FSDP)
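A usage sketch with torch.utils.checkpoint, where a toy block stands in for a transformer layer (sizes are illustrative):

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# Toy block standing in for a transformer layer.
block = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
x = torch.randn(8, 512, requires_grad=True)

# Activations inside `block` are not stored during forward;
# they are recomputed during backward, trading compute for memory.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```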

Debugging Tools (combined in the sketch after this list):

  • tensor.register_hook(fn) — Inspect gradients during backward
  • torch.autograd.grad() — Compute gradients without .backward()
  • torch.autograd.set_detect_anomaly(True) — Find NaN/Inf sources
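A short sketch putting the three together (illustrative tensors):

```python
import torch

torch.autograd.set_detect_anomaly(True)  # surface the op that produced a NaN/Inf

w = torch.randn(4, requires_grad=True)
x = torch.randn(4)

# Print the gradient flowing into w during the backward pass.
w.register_hook(lambda g: print("grad wrt w:", g))

loss = (w * x).sum()
loss.backward()

# Or compute gradients functionally, without populating .grad:
loss2 = (w * x).sum()
(grad_w,) = torch.autograd.grad(loss2, w)
print(grad_w)
```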

Checkpoint Questions#

Use these to verify, without notes, that you can do the following before moving on:

  1. Trace gradient flow through a dynamic computation graph.
  2. Identify when gradients are accumulated versus overwritten.
  3. Debug common gradient-flow bugs (no_grad, detach, missing requires_grad).

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "Training Deep Nets with Sublinear Memory Cost" (Chen et al., 2016) — The original gradient checkpointing paper
  2. "Automatic Differentiation in Machine Learning: a Survey" (Baydin et al., 2018) — Comprehensive overview of autograd techniques

Open Questions:

  • Can we learn optimal checkpointing strategies for a given model architecture?
  • How do we efficiently compute second-order gradients (Hessians) for large models?

Next up: Initialization determines whether training starts or stalls. We'll derive why the random initialization scale must depend on layer width.