Initialization & Residual Connections
Estimated reading time: 18 minutes
How you initialize weights determines whether training starts or stalls.
Proper initialization maintains gradient variance across layers. Too small and gradients vanish; too large and they explode. This lesson teaches the math behind initialization schemes.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Variance Problem to build core intuition and shared vocabulary.
- Move to Xavier/Glorot Initialization to understand the mechanism behind the intuition.
- Apply the idea in He/Kaiming Initialization for ReLU with concrete examples and implementation details.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
- Finish with research extensions to connect today’s mental model to open problems.
The Variance Problem#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
Let's see what happens with naive initialization:
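Below is a minimal NumPy sketch of our own (the helper name `final_layer_std` is not from the lesson) that pushes a random batch through a deep stack of purely linear layers at three weight scales:

```python
import numpy as np

def final_layer_std(weight_std, depth=50, width=512, seed=0):
    """Push a random batch through `depth` purely linear layers whose
    weights are drawn from Normal(0, weight_std) and return the std
    of the final activations."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((64, width))
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        x = x @ W
    return float(x.std())

for weight_std in (0.01, 1.0, np.sqrt(1.0 / 512)):
    # 0.01 -> activations shrink toward 0; 1.0 -> they blow up;
    # sqrt(1/width) -> variance is roughly preserved layer to layer.
    print(f"weight std {weight_std:.4f} -> output std {final_layer_std(weight_std):.3e}")
```

Each linear layer multiplies the activation variance by roughly width × weight_std², so only the sqrt(1/width)-scaled run stays in a usable range. The schemes below bake in exactly this logic, adjusted for the activation function.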
Xavier/Glorot Initialization#
Flow bridge: Building on The Variance Problem, this section adds the next layer of conceptual depth.
For linear layers with tanh/sigmoid activations:
W ~ Normal(0, σ) with σ = √(2 / (fan_in + fan_out))
or equivalently:
W ~ Uniform(-√(6 / (fan_in + fan_out)), √(6 / (fan_in + fan_out)))
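As a concrete reference, here is a minimal NumPy sketch of both variants (the helper names `xavier_normal` / `xavier_uniform` are ours, not taken from any particular library):

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Var(W) = 2 / (fan_in + fan_out), i.e. std = sqrt(2 / (fan_in + fan_out)).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def xavier_uniform(fan_in, fan_out, rng):
    # Uniform(-a, a) has variance a**2 / 3, so a = sqrt(6 / (fan_in + fan_out))
    # gives the same variance as the normal variant above.
    a = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-a, a, size=(fan_in, fan_out))

W = xavier_normal(512, 2048, np.random.default_rng(0))
print("empirical std:", W.std(), "target:", np.sqrt(2.0 / (512 + 2048)))
```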
He/Kaiming Initialization for ReLU#
Flow bridge: Building on Xavier/Glorot Initialization, this section adds the next layer of conceptual depth.
ReLU kills half the activations (sets negative values to 0). This halves the variance at each layer!
To compensate, we need 2x the variance:
W ~ Normal(0, σ) with σ = √(2 / fan_in)
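A quick sketch (our own toy check, not the lesson's code) that draws He-initialized weights and verifies empirically that the activation scale survives a deep ReLU stack:

```python
import numpy as np

def he_normal(fan_in, fan_out, rng):
    # std = sqrt(2 / fan_in): the factor of 2 compensates for ReLU zeroing
    # out roughly half of each layer's pre-activations.
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 1024))
for _ in range(30):
    x = np.maximum(x @ he_normal(1024, 1024, rng), 0.0)  # linear layer + ReLU
# Root-mean-square activation stays near 1 instead of collapsing toward 0.
print("RMS activation after 30 ReLU layers:", float(np.sqrt(np.mean(x ** 2))))
```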
Residual Connections Change Everything#
Flow bridge: Building on He/Kaiming Initialization for ReLU, this section adds the next layer of conceptual depth.
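With a residual block y = x + F(x), the identity path carries the signal forward even when F starts near zero, and the variances of the two branches add, so a deep stack of blocks steadily inflates the residual stream. The usual fix, previewed in the Production Reality section below, is to shrink the residual branch at initialization. Here is a toy NumPy sketch of the effect, under our own simplifying assumptions (purely linear branches, no normalization):

```python
import numpy as np

rng = np.random.default_rng(0)
width, num_blocks = 512, 64
x = rng.standard_normal((64, width))

for scale_branch in (False, True):
    h = x.copy()
    for _ in range(num_blocks):
        W = rng.normal(0.0, np.sqrt(1.0 / width), size=(width, width))
        branch = h @ W                     # stand-in for the block's F(x)
        if scale_branch:
            branch /= np.sqrt(num_blocks)  # down-scale the residual branch
        h = h + branch                     # residual (skip) connection
    label = "scaled branch  " if scale_branch else "unscaled branch"
    print(label, "-> residual stream std:", f"{float(h.std()):.3g}")
```

Without the scaling, the stream's variance roughly doubles per block; with a 1/√N scale it grows only by a small constant factor over the whole stack, which is why residual networks tolerate a much wider range of initializations than plain deep stacks.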
Pre-Norm vs Post-Norm#
Flow bridge: Building on Residual Connections Change Everything, this section adds the next layer of conceptual depth.
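The two layouts differ only in where LayerNorm sits relative to the residual add: post-norm (the original Transformer) normalizes the sum, while pre-norm normalizes the sublayer's input and leaves the skip path as a pure identity, which is the main reason deep pre-norm stacks train more stably. A hedged PyTorch sketch of the two block shapes (the class names and the generic `sublayer` argument are ours):

```python
import torch.nn as nn

class PostNormBlock(nn.Module):
    """Post-norm layout: normalize *after* the residual add.
    The skip path itself passes through LayerNorm, which is part of why
    very deep post-norm stacks need warmup and careful initialization."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(dim)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-norm layout: normalize the sublayer's input and keep the skip
    path an identity, so gradients flow through an un-normalized shortcut."""
    def __init__(self, dim, sublayer):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(dim)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))
```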
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| Shallow networks (< 10 layers) | Nothing; most inits work | Keep it simple |
| Deep networks (50+ layers) | Vanishing/exploding gradients | Residual connections, careful init |
| Very wide layers | Initialization scale matters more | Scale by 1/√width |
| Transformers | Self-attention has its own dynamics | Pre-norm, smaller init for residual |
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
Modern Transformer Initialization:
- Embeddings: Normal(0, 0.02)
- Attention: Xavier or small constant
- FFN: He for ReLU, Xavier for GELU
- Residual scaling: divide the init std of residual-branch output projections by √(2 × num_layers) (GPT-2 style)
Why 0.02? It works well empirically for language models; the exact value matters less than being in the right ballpark.
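Putting the recipe together, here is a hedged PyTorch sketch of the GPT-2-style combination from the list above (Normal(0, 0.02) everywhere plus residual scaling). The assumption that residual-branch output projections are the Linear modules whose names end in "c_proj" is ours; adapt the name test to your model:

```python
import math
import torch.nn as nn

def init_gpt2_style(model: nn.Module, num_layers: int, std: float = 0.02) -> None:
    """Re-initialize Linear/Embedding weights roughly the way GPT-2 does.

    Assumption (ours): residual-branch output projections are the Linear
    modules whose names end in "c_proj" (attention output + FFN down-proj).
    """
    for name, module in model.named_modules():
        if isinstance(module, (nn.Linear, nn.Embedding)):
            nn.init.normal_(module.weight, mean=0.0, std=std)
        if isinstance(module, nn.Linear) and module.bias is not None:
            nn.init.zeros_(module.bias)
        if isinstance(module, nn.Linear) and name.endswith("c_proj"):
            # Each block writes to the residual stream twice (attention + FFN),
            # so there are 2 * num_layers residual adds to account for.
            nn.init.normal_(module.weight, mean=0.0,
                            std=std / math.sqrt(2 * num_layers))
```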
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you derive why the initialization scale must depend on layer width?
- Without notes, can you implement Xavier (Glorot) and He (Kaiming) initialization?
- Without notes, can you explain how residual connections change initialization requirements?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (Glorot & Bengio, 2010) — The Xavier initialization paper
- "Delving Deep into Rectifiers" (He et al., 2015) — He initialization for ReLU
Open Questions:
- Can we derive optimal initialization for arbitrary architectures automatically?
- How should initialization change for very sparse networks (MoE)?
Next up: Scaling laws tell us how model performance changes with size—and μ-Transfer shows how to transfer hyperparameters across scales.