Initialization & Residual Connections

Estimated reading time: 18 minutes

How you initialize weights determines whether training starts or stalls.

Proper initialization maintains gradient variance across layers. Too small and gradients vanish; too large and they explode. This lesson teaches the math behind initialization schemes.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with The Variance Problem to build the core intuition: why activation scale drifts layer by layer.
  2. Move to Xavier/Glorot Initialization to see how that variance argument becomes a concrete weight scale.
  3. Apply the same argument in He/Kaiming Initialization for ReLU, where the nonlinearity changes the constant.
  4. Then zoom out to residual connections and the scale-level tradeoffs that appear in deeper and wider models.
  5. Map the recipe to production constraints to see how real transformer stacks are initialized.
  6. Finish with research extensions to connect today’s mental model to open problems.

The Variance Problem#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

Let's see what happens with naive initialization:

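Here is a minimal sketch (assuming PyTorch) in the spirit of variance_explosion.py: push a unit-variance batch through a deep stack of linear layers and watch the activation scale collapse or blow up depending on the weight scale.

```python
# variance_explosion.py (sketch): how activation scale drifts under naive init.
# For y = W x with W_ij ~ N(0, sigma^2) and roughly independent, zero-mean inputs,
# Var(y_i) ≈ fan_in * sigma^2 * Var(x_j), so each linear layer multiplies the
# activation std by sqrt(fan_in) * sigma. Repeat 50 times and the scale either
# collapses toward zero or overflows.
import torch

torch.manual_seed(0)
DEPTH, WIDTH = 50, 512

def run(weight_std: float) -> None:
    x = torch.randn(1024, WIDTH)                      # unit-variance input batch
    for layer in range(DEPTH):
        w = torch.randn(WIDTH, WIDTH) * weight_std    # naive: same std for every layer
        x = x @ w                                     # linear only, to isolate the effect
        if layer % 10 == 9:
            print(f"std={weight_std:.4f}  layer {layer + 1:2d}: activation std = {x.std().item():.3e}")

run(0.01)                  # too small: activations shrink toward zero
run(1.0)                   # too large: activations overflow to inf
run(1.0 / WIDTH ** 0.5)    # scaled by 1/sqrt(fan_in): std stays near 1 (what Xavier/He formalize)
```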


Xavier/Glorot Initialization#

Flow bridge: Building on The Variance Problem, this section derives the weight scale that keeps activation variance steady through tanh/sigmoid layers.

For linear layers with tanh/sigmoid activations, draw weights with standard deviation √(2 / (fan_in + fan_out)):

W ~ Normal(0, √(2 / (fan_in + fan_out)))

or equivalently (a Uniform(−a, a) distribution has variance a²/3, so the bound below gives the same variance):

W ~ Uniform(-√(6 / (fan_in + fan_out)), √(6 / (fan_in + fan_out)))

The scale is a compromise: fan_in · σ² = 1 keeps activation variance constant on the forward pass, fan_out · σ² = 1 keeps gradient variance constant on the backward pass, and σ² = 2 / (fan_in + fan_out) splits the difference.
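
A minimal sketch (assuming PyTorch) in the spirit of xavier_init.py: implement both Xavier variants by hand for an nn.Linear weight and compare against the built-ins in torch.nn.init.

```python
# xavier_init.py (sketch): hand-rolled Xavier/Glorot init vs. torch.nn.init.
import math
import torch
from torch import nn

def xavier_normal_(weight: torch.Tensor) -> torch.Tensor:
    fan_out, fan_in = weight.shape                   # nn.Linear stores weight as (out, in)
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return weight.data.normal_(0.0, std)

def xavier_uniform_(weight: torch.Tensor) -> torch.Tensor:
    fan_out, fan_in = weight.shape
    bound = math.sqrt(6.0 / (fan_in + fan_out))      # Uniform(-a, a) has variance a^2 / 3
    return weight.data.uniform_(-bound, bound)

layer = nn.Linear(784, 256)
xavier_normal_(layer.weight)
print("hand-rolled std:  ", layer.weight.std().item())

nn.init.xavier_normal_(layer.weight)                 # built-in equivalent
print("torch.nn.init std:", layer.weight.std().item())
# Both should land near sqrt(2 / (784 + 256)) ≈ 0.044.
```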

He/Kaiming Initialization for ReLU#

Flow bridge: Building on Xavier/Glorot Initialization, this section adapts the same variance argument to ReLU, which discards half the signal.

ReLU kills half the activations (sets negative values to 0). This halves the variance at each layer!

To compensate, we need 2x the variance:

W ~ Normal(0, √(2 / fan_in))

ReLU halves the variance: for an input with mean 0, ReLU zeros out half the values, so the expected squared activation (the quantity the next layer's weights actually propagate) is cut in half. Doubling the weight variance restores it.
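
A minimal sketch (assuming PyTorch) that checks both claims numerically: ReLU halves the mean square of a zero-mean signal, and the extra factor of 2 in He init keeps the activation scale stable through many ReLU layers.

```python
# He/Kaiming sketch: the factor of 2 undoes what ReLU removes.
import torch

torch.manual_seed(0)

# 1) ReLU halves the mean square of a zero-mean, symmetric signal.
x = torch.randn(1_000_000)
print((x ** 2).mean().item())                # ~1.0
print((torch.relu(x) ** 2).mean().item())    # ~0.5

# 2) With He init (std = sqrt(2 / fan_in)), the activation scale stays put across depth.
depth, width = 50, 512
h = torch.randn(1024, width)
for _ in range(depth):
    w = torch.randn(width, width) * (2.0 / width) ** 0.5
    h = torch.relu(h @ w)
print("activation std after 50 ReLU layers:", h.std().item())   # stays O(1), not 0 or inf
```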

Residual Connections Change Everything#

Flow bridge: Building on He/Kaiming Initialization for ReLU, this section shows why skip connections make deep networks far less sensitive to the exact initialization.

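A minimal sketch (assuming PyTorch) in the spirit of residual_connections.py: with x + f(x), the identity path carries both the signal and the gradient even when the residual branch is initialized very small, so a 50-layer stack no longer collapses.

```python
# residual_connections.py (sketch): the identity path keeps signal and gradient alive.
import torch
from torch import nn

torch.manual_seed(0)
DEPTH, WIDTH = 50, 256

class Block(nn.Module):
    """One block: either plain f(x) or residual x + f(x), with a tiny branch init."""
    def __init__(self, width: int, residual: bool):
        super().__init__()
        self.fc = nn.Linear(width, width)
        nn.init.normal_(self.fc.weight, std=0.01)    # deliberately small branch init
        nn.init.zeros_(self.fc.bias)
        self.residual = residual

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = torch.relu(self.fc(x))
        return x + out if self.residual else out

def probe(residual: bool) -> None:
    net = nn.Sequential(*[Block(WIDTH, residual) for _ in range(DEPTH)])
    x = torch.randn(64, WIDTH, requires_grad=True)
    out = net(x)
    out.pow(2).mean().backward()
    print(f"residual={residual}:  output std {out.std().item():.2e},  input grad norm {x.grad.norm().item():.2e}")

probe(residual=False)   # plain stack: activations and gradients collapse toward zero
probe(residual=True)    # residual stack: both stay on the order of the input scale
```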

Pre-Norm vs Post-Norm#

Flow bridge: Building on Residual Connections Change Everything, this section compares the two places LayerNorm can sit relative to the residual add.
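
The practical difference is where LayerNorm sits relative to the residual add: post-norm (the original Transformer ordering) normalizes after the add, while pre-norm (used by GPT-2 and most modern stacks) normalizes inside the branch and leaves the skip path untouched. A minimal sketch (assuming PyTorch) of the two orderings around a generic sublayer:

```python
# Pre-norm vs post-norm (sketch): same ingredients, different ordering.
import torch
from torch import nn

class PostNormBlock(nn.Module):
    """Post-norm: norm(x + sublayer(x)). The residual path itself passes through
    LayerNorm, which is why deep post-norm stacks tend to need warmup and careful init."""
    def __init__(self, d: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

class PreNormBlock(nn.Module):
    """Pre-norm: x + sublayer(norm(x)). The skip path is a pure identity, so
    gradients reach early layers regardless of how the branch is initialized."""
    def __init__(self, d: int, sublayer: nn.Module):
        super().__init__()
        self.sublayer, self.norm = sublayer, nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

d = 64
x = torch.randn(8, d)
print(PostNormBlock(d, nn.Linear(d, d))(x).shape)   # torch.Size([8, 64])
print(PreNormBlock(d, nn.Linear(d, d))(x).shape)    # torch.Size([8, 64])
```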

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

Scale | What Breaks | Mitigation
Shallow networks (< 10 layers) | Nothing; most inits work | Keep it simple
Deep networks (50+ layers) | Vanishing/exploding gradients | Residual connections, careful init
Very wide layers | Initialization scale matters more | Scale by 1/√width
Transformers | Self-attention has its own dynamics | Pre-norm, smaller init for residual

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

Modern Transformer Initialization:

  • Embeddings: Normal(0, 0.02)
  • Attention: Xavier or small constant
  • FFN: He for ReLU, Xavier for GELU
  • Residual scaling: Divide by √(2 × num_layers) (GPT-2 style)

Why 0.02? Empirically works well for language models. The exact value matters less than being in the right ballpark.
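
A minimal sketch (assuming PyTorch) of how the embedding and residual-scaling rules above can be wired into one init function; the NUM_LAYERS constant and the is_residual_proj flag are illustrative, and the plain 0.02 default stands in for the per-sublayer choices.

```python
# Sketch of the recipe above, GPT-2 style. NUM_LAYERS and is_residual_proj are
# illustrative placeholders, not a real library API.
import math
import torch
from torch import nn

NUM_LAYERS = 12   # hypothetical depth; the residual scaling depends on it

def init_weights(module: nn.Module) -> None:
    if isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)     # embeddings: Normal(0, 0.02)
    elif isinstance(module, nn.Linear):
        nn.init.normal_(module.weight, mean=0.0, std=0.02)     # ballpark default for linears
        if module.bias is not None:
            nn.init.zeros_(module.bias)
        # Projections that write into the residual stream get scaled down (GPT-2 style):
        if getattr(module, "is_residual_proj", False):          # hypothetical marker attribute
            module.weight.data /= math.sqrt(2 * NUM_LAYERS)

# Usage (sketch): tag the attention/FFN output projections with is_residual_proj = True,
# then call model.apply(init_weights).
```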

Checkpoint Questions#

Use these to verify understanding before moving on; try each without notes:

  1. Derive why the random initialization scale must depend on layer width.
  2. Implement Xavier (Glorot) and He (Kaiming) initialization.
  3. Explain how residual connections change initialization requirements.

Research Hooks#

Flow bridge: Use this practical baseline to frame the research questions that remain open.

Papers:

  1. "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (Glorot & Bengio, 2010) — The Xavier initialization paper
  2. "Delving Deep into Rectifiers" (He et al., 2015) — He initialization for ReLU

Open Questions:

  • Can we derive optimal initialization for arbitrary architectures automatically?
  • How should initialization change for very sparse networks (MoE)?

Next up: Scaling laws tell us how model performance changes with size—and μ-Transfer shows how to transfer hyperparameters across scales.