Build the mental models that separate research engineers from ML practitioners.
Initialization & Residual Connections
Estimated reading time: 18 minutes
In this tutorial, you will measure how activation variance changes across layers under different initialization schemes, implement Xavier and He initialization, and observe how residual connections stabilize gradient flow in deep networks.
By the end you will be able to:
Compute the correct weight standard deviation for a given layer width and activation function
Choose between Xavier and He initialization based on the nonlinearity
Explain why pre-norm transformers are more stable than post-norm
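The core measurement in this tutorial, tracking activation statistics layer by layer under different weight scales, can be sketched in a few lines. This is a minimal NumPy sketch; the width, depth, batch size, and the two standard deviations compared are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def activation_second_moments(depth, width, weight_std):
    """Push a Gaussian batch through `depth` random ReLU layers and
    record the mean-square activation E[x^2] after each one."""
    x = rng.standard_normal((1024, width))
    moments = []
    for _ in range(depth):
        W = rng.normal(0.0, weight_std, size=(width, width))
        x = np.maximum(x @ W, 0.0)        # linear layer + ReLU
        moments.append((x ** 2).mean())
    return moments

width = 256
# A naive small std collapses the signal; He init (sqrt(2/width)) keeps
# the mean-square activation near 1 across depth.
collapsed = activation_second_moments(10, width, 0.01)
he = activation_second_moments(10, width, np.sqrt(2.0 / width))
print(f"layer 10, std=0.01:  {collapsed[-1]:.2e}")
print(f"layer 10, He init:   {he[-1]:.2f}")
```

Plotting `moments` against layer index makes the exponential collapse (or stability) visually obvious.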
ReLU zeros every negative activation. For a zero-mean, symmetric input, this cuts the second moment (mean square) in half at each layer.
To compensate, He initialization doubles the weight variance:
W ~ Normal(0, σ²), with σ = √(2 / fan_in)
Step 1: ReLU halves the second moment. For a zero-mean, symmetric input, ReLU zeros half the values. The surviving positive half contributes the same mean square as before, so the total second moment E[x²] is cut in half.
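This halving can be verified numerically in one line of NumPy (a sketch using a large standard-normal sample as the zero-mean input):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)

# ReLU zeros the negative half; for a symmetric zero-mean input,
# the mean of x^2 (the second moment) is cut in half.
relu_x = np.maximum(x, 0.0)
print(f"E[x^2]       = {np.mean(x ** 2):.3f}")       # ~1.0
print(f"E[ReLU(x)^2] = {np.mean(relu_x ** 2):.3f}")  # ~0.5
```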
Each question requires calculation or diagnosis, not just recall.
A linear layer has fan_in=1024 and fan_out=4096. Compute the Xavier std and the He std. Which should you use if the next operation is GELU (approximately symmetric near zero)?
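One way to check your arithmetic for this exercise (a sketch using the standard Xavier-normal and He-normal formulas):

```python
import math

fan_in, fan_out = 1024, 4096
xavier_std = math.sqrt(2.0 / (fan_in + fan_out))  # ≈ 0.0198
he_std = math.sqrt(2.0 / fan_in)                  # ≈ 0.0442
print(f"Xavier std: {xavier_std:.4f}")
print(f"He std:     {he_std:.4f}")
```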
A 48-layer transformer uses GPT-2-style residual scaling (1/sqrt(2*N), where N is the number of layers). Compute the scaling factor applied to the output of each residual branch. If someone removes this scaling, by what factor does the residual stream variance grow after all 48 layers (assuming unit-variance branches)?
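Both numbers can be checked with back-of-envelope arithmetic (a sketch assuming each layer contributes two independent, unit-variance residual branches, one attention and one MLP, whose variances add):

```python
import math

n_layers = 48
branches = 2 * n_layers                 # attention + MLP branch per layer
scale = 1.0 / math.sqrt(branches)       # GPT-2-style residual scaling
print(f"scale per branch: {scale:.4f}")  # ≈ 0.1021

# Residual stream variance after all branches, starting from 1:
var_scaled = 1.0 + branches * scale ** 2   # each branch adds 1/(2N)
var_unscaled = 1.0 + branches * 1.0        # each branch adds 1
print(f"final variance with scaling:    {var_scaled:.1f}")   # 2.0
print(f"final variance without scaling: {var_unscaled:.1f}")  # 97.0
```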
Training a 20-layer ReLU network diverges at step 1 with NaN loss. You check and find weights were initialized with std=1.0 and hidden_dim=512. Compute what the activation variance is at layer 10 (approximate) and explain why it diverges. What std should you use instead?
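The diagnosis can be checked directly (a sketch assuming the second-moment analysis from Step 1: each ReLU layer multiplies E[x²] by fan_in · σ² / 2):

```python
import math

hidden_dim, std, depth = 512, 1.0, 10
growth_per_layer = hidden_dim * std ** 2 / 2   # ReLU halves the second moment
var_layer10 = growth_per_layer ** depth
print(f"growth per layer:     {growth_per_layer}")     # 256.0
print(f"variance at layer 10: {var_layer10:.2e}")      # ≈ 1.2e24
print(f"correct He std:       {math.sqrt(2.0 / hidden_dim):.4f}")  # 0.0625
```

With activations at ~10²⁴ by layer 10, the forward pass overflows float precision well before layer 20, hence NaN at step 1; √(2/512) = 0.0625 restores unit growth per layer.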