
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)
Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)
Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)
Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)




Initialization & Residual Connections

Estimated reading time: 18 minutes


In this tutorial, you will measure how activation variance changes across layers under different initialization schemes, implement Xavier and He initialization, and observe how residual connections stabilize gradient flow in deep networks.

By the end you will be able to:

  • Compute the correct weight standard deviation for a given layer width and activation function
  • Choose between Xavier and He initialization based on the nonlinearity
  • Explain why pre-norm transformers are more stable than post-norm
💡

Core Idea

Each layer multiplies activations (forward) and gradients (backward) by a factor that depends on weight variance and layer width. If that factor is above 1, values explode exponentially with depth. If below 1, they vanish. The goal is to set weight variance so the per-layer factor is exactly 1.

The Variance Problem

Let's see what happens with naive initialization:

variance_explosion.py
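The editor contents were not captured in this export. A minimal NumPy sketch of the experiment (hypothetical code, using purely linear layers so the variance dynamics are easy to see):

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 20, 512
x = rng.standard_normal((256, width))  # batch of unit-variance inputs

results = {}
for std in (1.0, (1.0 / width) ** 0.5):  # naive init vs variance-preserving init
    h = x
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * std
        h = h @ W  # linear layers only, to isolate the variance dynamics
    results[std] = h.std()
    print(f"init std={std:.4f}: activation std after {depth} layers = {h.std():.3e}")
```

With std=1.0, each layer multiplies the variance by roughly `width`, so activations blow up by dozens of orders of magnitude; with std=1/√width the per-layer factor is 1 and the activation scale stays put.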

Xavier/Glorot Initialization

For linear layers with tanh/sigmoid activations:

W ~ Normal(0, √(2 / (fan_in + fan_out)))

or equivalently:

W ~ Uniform(-√(6 / (fan_in + fan_out)), √(6 / (fan_in + fan_out)))
💡

Xavier Derivation

For a linear layer y = Wx:

  • Var(y) = fan_in × Var(W) × Var(x)

To maintain Var(y) = Var(x), we need:

  • Var(W) = 1 / fan_in

The backward pass analogously requires Var(W) = 1 / fan_out to keep gradient variance constant. Xavier compromises between the forward and backward requirements, setting Var(W) = 2 / (fan_in + fan_out).

xavier_init.py
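The editor contents were not captured here either. A self-contained sketch of both Xavier variants, with a quick check that a square layer preserves variance (function names are illustrative):

```python
import numpy as np

def xavier_normal(fan_in, fan_out, rng):
    # Var(W) = 2 / (fan_in + fan_out)
    std = (2.0 / (fan_in + fan_out)) ** 0.5
    return rng.standard_normal((fan_in, fan_out)) * std

def xavier_uniform(fan_in, fan_out, rng):
    # Uniform(-a, a) has variance a^2 / 3, so a = sqrt(6 / (fan_in + fan_out))
    limit = (6.0 / (fan_in + fan_out)) ** 0.5
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
x = rng.standard_normal((4096, 1024))  # unit-variance input
W = xavier_normal(1024, 1024, rng)
y = x @ W
print(f"Var(x) = {x.var():.3f}   Var(xW) = {y.var():.3f}")
```

Note that variance is preserved exactly only when fan_in == fan_out (then 2/(fan_in + fan_out) = 1/fan_in); for rectangular layers Xavier is a compromise between the forward and backward conditions.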

He/Kaiming Initialization for ReLU

ReLU kills half the activations (sets negative values to 0). This halves the variance at each layer!

To compensate, we need 2x the variance:

W ~ Normal(0, √(2 / fan_in))
Step 1: ReLU halves variance. For zero-mean input, ReLU zeros out the negative half of the values, cutting the second moment (and, approximately, the variance) in half.

Step 2: Compensate in the weights. Doubling the weight variance to Var(W) = 2 / fan_in restores the per-layer factor to 1, so activations neither explode nor vanish.
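Both steps are easy to verify numerically. A hedged sketch: the first two prints confirm that ReLU halves the second moment of a unit-variance input, and the loop shows He init holding the activation scale steady through a deep ReLU stack:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
m2_raw = np.mean(x**2)                      # second moment of input, ≈ 1.0
m2_relu = np.mean(np.maximum(x, 0.0)**2)    # ≈ 0.5: ReLU halves the second moment
print(f"E[x^2] = {m2_raw:.3f}   E[relu(x)^2] = {m2_relu:.3f}")

# He init compensates with Var(W) = 2 / fan_in
fan_in = 1024
h = rng.standard_normal((4096, fan_in))
for _ in range(10):
    W = rng.standard_normal((fan_in, fan_in)) * (2.0 / fan_in) ** 0.5
    h = np.maximum(h @ W, 0.0)
print(f"E[h^2] after 10 ReLU layers = {np.mean(h**2):.3f}")  # stays ≈ 1, no collapse
```

With Xavier's 1/fan_in variance instead, the same stack would lose half its second moment per layer, roughly a 1000x shrink over 10 layers.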

Residual Connections Change Everything

[Diagram: a residual block, y = x + F(x)]
💡

The Skip Connection Magic

In y = x + F(x), the gradient is ∂y/∂x = 1 + ∂F/∂x.

That "+1" provides a gradient highway—gradients can flow directly without attenuation. This is why ResNets can be 1000+ layers deep!

residual_connections.py
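The editor contents were not captured. A minimal sketch of the "+1 highway" effect (hypothetical setup): propagate a unit gradient backward through 50 layers whose branch Jacobians each shrink vectors by roughly half, with and without the identity skip:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 50, 64
scale = 0.5 / width**0.5  # each branch Jacobian shrinks a random vector by ~0.5x

g_plain = rng.standard_normal(width)
g_plain /= np.linalg.norm(g_plain)  # unit upstream gradient
g_res = g_plain.copy()
for _ in range(depth):
    J = rng.standard_normal((width, width)) * scale
    g_plain = g_plain @ J                 # plain stack: grad = product of J_l -> vanishes
    g_res = g_res @ (np.eye(width) + J)   # residual: grad = product of (I + J_l) -> survives
print(f"plain    ||grad|| = {np.linalg.norm(g_plain):.2e}")
print(f"residual ||grad|| = {np.linalg.norm(g_res):.2e}")
```

The plain stack's gradient norm decays like 0.5^50 (numerically zero), while the residual stack's gradient stays usable: the identity term in each (I + J) keeps the product well-conditioned.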

Pre-Norm vs Post-Norm

⚠️

Layer Norm Placement Matters

Post-norm (original Transformer): y = LayerNorm(x + F(x))
Pre-norm (GPT-2+): y = x + F(LayerNorm(x))

Pre-norm is more stable because the skip path is left untouched by LayerNorm: gradients flow through the identity term unimpeded, and each sublayer sees a normalized input, so its output stays bounded no matter how large the residual stream grows.
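The two orderings can be sketched side by side. This toy comparison (hypothetical, with random linear branches) shows the characteristic difference in the residual stream itself: pre-norm lets the stream grow gently (roughly √depth), while post-norm pins it to unit scale at the cost of normalizing across the skip path:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # normalize each row to zero mean, unit std
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def post_norm_block(x, branch):   # original Transformer ordering
    return layer_norm(x + branch(x))

def pre_norm_block(x, branch):    # GPT-2-style ordering
    return x + branch(layer_norm(x))

rng = np.random.default_rng(0)
depth, width = 48, 256
Ws = [rng.standard_normal((width, width)) / width**0.5 for _ in range(depth)]

x = rng.standard_normal((8, width))
h_pre, h_post = x, x
for W in Ws:
    h_pre = pre_norm_block(h_pre, lambda h: h @ W)
    h_post = post_norm_block(h_post, lambda h: h @ W)
print(f"pre-norm  stream std: {h_pre.std():.2f}")   # grows ~ sqrt(depth)
print(f"post-norm stream std: {h_post.std():.2f}")  # pinned to ~1 by the final LN
```

The linear growth of the pre-norm stream is benign; the stability difference shows up in training, where post-norm forces every gradient to pass through a LayerNorm instead of the clean identity path.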

Scale Thought Experiment

| Scale | What Breaks | Mitigation |
|---|---|---|
| Shallow networks (< 10 layers) | Nothing; most inits work | Keep it simple |
| Deep networks (50+ layers) | Vanishing/exploding gradients | Residual connections, careful init |
| Very wide layers | Initialization scale matters more | Scale by 1/√width |
| Transformers | Self-attention has its own dynamics | Pre-norm, smaller init for residual |

Production Reality

Modern Transformer Initialization:

  • Embeddings: Normal(0, 0.02)
  • Attention: Xavier or small constant
  • FFN: He for ReLU, Xavier for GELU
  • Residual scaling: Divide by √(2 × num_layers) (GPT-2 style)

Why 0.02? Empirically works well for language models. The exact value matters less than being in the right ballpark.
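The recipe above can be sketched as a single init function. This is an illustrative sketch, not GPT-2's actual code; the parameter names and shapes are assumptions, and it follows the GPT-2 convention that only projections writing into the residual stream get the extra 1/√(2N) scaling:

```python
import numpy as np

def init_gpt2_style(num_layers, d_model, d_ff, vocab_size, rng):
    """Illustrative sketch of the recipe above (names/shapes are hypothetical)."""
    resid_scale = 1.0 / (2 * num_layers) ** 0.5  # GPT-2-style residual scaling
    params = {"embed": rng.normal(0.0, 0.02, (vocab_size, d_model))}
    for i in range(num_layers):
        # input-side projections: plain small-normal init
        params[f"block{i}.attn.qkv"] = rng.normal(0.0, 0.02, (d_model, 3 * d_model))
        params[f"block{i}.ffn.in"] = rng.normal(0.0, 0.02, (d_model, d_ff))
        # projections that write into the residual stream get the extra scaling
        params[f"block{i}.attn.out"] = rng.normal(0.0, 0.02 * resid_scale, (d_model, d_model))
        params[f"block{i}.ffn.out"] = rng.normal(0.0, 0.02 * resid_scale, (d_ff, d_model))
    return params

rng = np.random.default_rng(0)
params = init_gpt2_style(num_layers=4, d_model=64, d_ff=256, vocab_size=1000, rng=rng)
print(f"embed std:     {params['embed'].std():.4f}")
print(f"resid-out std: {params['block0.attn.out'].std():.4f}")
```

Scaling only the residual-writing projections keeps each branch's contribution small enough that the stream's variance stays roughly constant across all 2N residual additions.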

Break It: Initialization Failures

❌

Common Mistakes

  1. Using He init with tanh/sigmoid — He initialization assumes ReLU halves the variance. With tanh (which preserves variance near the origin), the 2x factor overshoots, causing activations to grow across layers. Use Xavier for symmetric activations.

  2. Forgetting residual scaling in deep transformers — A 96-layer transformer with standard init accumulates residual variance proportional to depth. Without dividing the residual branch output by sqrt(2 * num_layers), activations grow as sqrt(L) and training becomes unstable.

  3. Initializing all layers identically regardless of role — Embedding layers, attention projections, and FFN layers have different fan-in/fan-out ratios. Using a single std=0.02 everywhere works as a rough heuristic but is suboptimal for very deep or very wide models.

Checkpoint Questions

Each question requires calculation or diagnosis, not just recall.

  1. A linear layer has fan_in=1024 and fan_out=4096. Compute the Xavier std and the He std. Which should you use if the next operation is GELU (approximately symmetric near zero)?
  2. A 48-layer transformer uses GPT-2-style residual scaling (1/sqrt(2*N)). Compute the scaling factor applied to the output of each residual branch. If someone removes this scaling, by what factor does the residual stream variance grow after all 48 layers (assuming unit-variance branches)?
  3. Training a 20-layer ReLU network diverges at step 1 with NaN loss. You check and find weights were initialized with std=1.0 and hidden_dim=512. Compute what the activation variance is at layer 10 (approximate) and explain why it diverges. What std should you use instead?

Research Hooks

Papers:

  1. "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (Glorot & Bengio, 2010) — The Xavier initialization paper
  2. "Delving Deep into Rectifiers" (He et al., 2015) — He initialization for ReLU

Open Questions:

  • Can we derive optimal initialization for arbitrary architectures automatically?
  • How should initialization change for very sparse networks (MoE)?

Next up: Scaling laws tell us how model performance changes with size—and μ-Transfer shows how to transfer hyperparameters across scales.