Full Curriculum

Seven tracks covering foundations through production LLM training. Start with Track 0 (free) or unlock everything with Premium.

136 lessons · 84+ hours · Run code in browser
Track 1 · Premium · 11 lessons · 3.5 hrs

Track 1: Modern LLM Training

Master LLM-specific architectural decisions — attention, KV cache, tokenization, and scaling.

Attention Mechanics

1. Attention as Routing · 18 min
2. The KV Cache Trap · 18 min
3. Prefill vs Decode · 14 min

Context & Position

4. Long Context Pathologies · 17 min
5. Rotary Position Embeddings · 20 min

Tokenization & Data

6. Tokenization Tradeoffs · 14 min
7. BPE Vocab Sizing · 17 min
8. Data Mixtures & Contamination · 20 min

Advanced Architectures

9. Grouped Query Attention · 17 min
10. Mixture of Experts · 23 min
11. Capstone: Design a 120M LLM · 35 min
Track 2 · Premium · 21 lessons · 12.0 hrs

Track 2: Distributed Training

Master the parallelism strategies for training models that don't fit on a single GPU.

Why Distributed?

1. The Single-GPU Wall · 25 min
2. Memory Anatomy of Training · 30 min
3. The Parallelism Zoo · 25 min
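A quick preview of the memory-anatomy lesson: under mixed-precision Adam, a commonly cited estimate is about 16 bytes of static state per parameter (fp16 weights and gradients, plus fp32 master weights and two fp32 Adam moments), before counting activations. A back-of-envelope sketch (function name is illustrative):

```python
# Static training memory per parameter under mixed-precision Adam:
# fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
# + fp32 Adam first moment (4) + fp32 second moment (4) = 16 bytes.
BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4

def training_memory_gb(n_params: float) -> float:
    """Approximate static training memory in GB (excludes activations)."""
    return n_params * BYTES_PER_PARAM / 1e9

print(f"7B model: ~{training_memory_gb(7e9):.0f} GB")  # → ~112 GB, well past one 80 GB GPU
```

This is why even a 7B model hits the single-GPU wall before activations are counted.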

Communication Primitives

4. AllReduce Deep Dive · 30 min
5. AllGather & ReduceScatter · 30 min
6. Bandwidth vs Latency · 25 min
7. Communication Overlap · 30 min
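The all-reduce lessons above hinge on one number: a ring all-reduce (reduce-scatter followed by all-gather) moves roughly 2(p−1)/p times the buffer size per GPU, so its per-GPU cost is nearly independent of GPU count for large p. A minimal sketch (names are illustrative, not from NCCL or any framework):

```python
# Ring all-reduce communication volume per GPU: the reduce-scatter
# phase and the all-gather phase each move (p-1)/p of the buffer.
def ring_allreduce_bytes_per_gpu(buffer_bytes: int, num_gpus: int) -> float:
    p = num_gpus
    return 2 * (p - 1) / p * buffer_bytes

# Syncing 1 GB of gradients across 8 GPUs moves 1.75 GB per GPU...
print(ring_allreduce_bytes_per_gpu(1_000_000_000, 8) / 1e9)    # → 1.75
# ...and still under 2 GB per GPU across 256 GPUs.
print(ring_allreduce_bytes_per_gpu(1_000_000_000, 256) / 1e9)  # → ~1.99
```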

Data Parallelism

8. DDP from Scratch · 35 min
9. Gradient Synchronization · 30 min
10. FSDP: Sharding Everything · 40 min

Model Parallelism

11. Tensor Parallelism Basics · 35 min
12. TP for Attention · 30 min
13. Pipeline Parallelism · 40 min
14. The Pipeline Bubble · 35 min

3D Parallelism & Frameworks

15. 3D Parallelism Design · 40 min
16. DeepSpeed ZeRO Stages · 35 min
17. Megatron-LM Patterns · 35 min

Production Concerns

18. Memory Optimization Stack · 35 min
19. Fault Tolerance & Elastic Training · 35 min
20. Debugging Distributed Hell · 40 min

Capstone

21. Capstone: Simulate 7B Training · 60 min
Track 3 · Premium · 23 lessons · 11.4 hrs

Track 3: Scaling Laws & Compute

Predict model performance from compute budget and plan training runs that maximize capability per dollar.

Foundations of Scaling

1. The Scaling Hypothesis · 28 min
2. Fitting Power Laws from Data · 28 min
3. Uncertainty & Prediction Intervals · 32 min

The Seminal Papers

4. Kaplan et al. — The Original Scaling Laws · 32 min
5. Hoffmann/Chinchilla — The Revision · 32 min
6. Kaplan vs Hoffmann — The Full Comparison · 32 min

Compute & Cost

7. FLOPs Estimation — The 6ND Rule · 28 min
8. Cost Modeling — From FLOPs to Dollars · 30 min
9. Budget Planning — Allocating Limited Compute · 30 min
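The 6ND rule in lesson 7 is compact enough to preview here: training compute for a dense transformer is approximately C ≈ 6·N·D FLOPs, where N is parameter count and D is training tokens. A minimal sketch (the Chinchilla-style shapes are illustrative inputs):

```python
# The "6ND rule": ~6 FLOPs per parameter per training token
# (2 for the forward pass, 4 for the backward pass).
def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens

# Chinchilla-like run: 70B parameters on 1.4T tokens.
print(f"{train_flops(70e9, 1.4e12):.2e} FLOPs")  # → 5.88e+23 FLOPs
```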

Data Scaling

10. Data-Constrained Scaling · 28 min
11. Epoch Degradation — The Cost of Repetition · 30 min
12. Data Quality vs Quantity · 28 min

Inference Scaling

13. Training vs Inference Cost · 30 min
14. Inference-Optimal Training · 32 min
15. Test-Time Compute — Inference Scaling Laws · 35 min
16. Quantization Scaling Laws · 32 min

Advanced Topics

17. Emergent Capabilities — The Claim · 30 min
18. Debunking Emergence — The Reality · 32 min
19. Multimodal Scaling Laws · 32 min
20. Beyond Text — Scaling New Domains · 28 min

Capstone

21. Capstone Part 1 — Design a Training Plan · 28 min
22. Capstone Part 2 — Analyze a Real Paper · 24 min
23. Capstone Part 3 — Build a Cost Estimator · 22 min
Track 4 · Premium · 24 lessons · 12.5 hrs

Track 4: Post-Training & Alignment

Master the post-training pipeline — SFT, RLHF, DPO, and Constitutional AI to make capable models useful and safe.

Foundations of Post-Training

1. Why Post-Training? — The Gap Between Capable and Useful · 28 min
2. Data Formats & Collection · 28 min

Supervised Fine-Tuning

3. SFT Fundamentals · 28 min
4. Catastrophic Forgetting · 32 min
5. Data Quality for SFT · 28 min
6. Efficient Fine-Tuning — LoRA and Beyond · 32 min

Reward Modeling

7. Reward Model Architecture · 32 min
8. Reward Model Training Dynamics · 32 min
9. Evaluating Reward Models · 28 min

RLHF with PPO

10. Policy Gradient Fundamentals · 32 min
11. PPO Derivation — Trust Regions Made Simple · 38 min
12. Advantage Estimation — GAE · 32 min
13. KL Penalty & Reference Models · 32 min
14. Practical RLHF · 32 min

Direct Preference Optimization

15. DPO Derivation — The Closed-Form Solution · 38 min
16. DPO vs RLHF — Tradeoffs in Practice · 28 min
17. The Preference Optimization Zoo · 28 min

Constitutional AI & RLAIF

18. Constitutional AI — Anthropic's Approach · 32 min
19. RLAIF — AI Feedback at Scale · 33 min
20. Scaling Oversight — Beyond Human Feedback · 35 min

Evaluation & Deployment

21. Alignment Evaluation — Measuring HHH · 28 min
22. Iteration & Deployment — The Alignment Feedback Loop · 28 min

Capstone

23. Capstone Part 1 — Full Pipeline: SFT to DPO · 38 min
24. Capstone Part 2 — Evaluate & Iterate · 28 min
Track 5 · Premium · 24 lessons · 26.3 hrs

Track 5: Inference at Scale

Master LLM serving — vLLM internals, batching strategies, speculative decoding, and production deployment patterns.

Foundations

1. Why Inference Is Different from Training · 60 min
2. Metrics That Matter · 65 min
3. Anatomy of a Single Request · 70 min

Memory Architecture

4. KV Cache Revisited — Serving Perspective · 75 min
5. PagedAttention — Virtual Memory for Attention · 80 min
6. Memory Management — OS Concepts for LLM Serving · 85 min
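The memory pressure these lessons address comes from one formula: KV cache per sequence = 2 (K and V) × layers × KV heads × head dim × sequence length × bytes per element. A sketch using a Llama-2-7B-like shape (32 layers, 32 KV heads, head dim 128 — shapes assumed here for illustration):

```python
# Size of the KV cache for one sequence: K and V tensors (the factor 2)
# stored per layer, per KV head, per position, at dtype_bytes precision.
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, dtype_bytes: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes

# One 4k-token sequence on a 7B-class model in fp16:
gb = kv_cache_bytes(32, 32, 128, 4096) / 1e9
print(f"{gb:.2f} GB per sequence")  # → 2.15 GB per sequence
```

At ~2 GB per 4k-token sequence, a handful of long requests exhausts an 80 GB GPU, which is the motivation for PagedAttention-style memory management.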

Batching & Scheduling

7. Continuous Batching — The Key to Throughput · 70 min
8. Scheduling Strategies · 65 min
9. Iteration-Level Batching — The vLLM Insight · 65 min

vLLM Internals

10. vLLM Architecture Overview · 65 min
11. The vLLM Scheduler Deep Dive · 75 min
12. Engine Deep Dive — Where Compute Happens · 70 min

Advanced Optimizations

13. Speculative Decoding — Trading Draft Compute for Latency · 75 min
14. Prefix Caching — Shared Prompts, Shared Compute · 60 min
15. Quantization for Serving — Memory vs Quality · 70 min
16. KV Cache Compression — Squeezing More Sequences · 65 min

Multi-GPU & Distribution

17. Tensor Parallelism for Inference · 65 min
18. Pipeline Serving — When TP Doesn't Scale · 55 min
19. Load Balancing — Multi-Instance Serving · 60 min

Production Concerns

20. Cost Modeling — The Business of Serving · 55 min
21. Monitoring & Debugging — Finding the Bottleneck · 60 min
22. Deployment Patterns — Architectures for Production · 50 min

Capstone

23. Capstone Part A: Design Inference System · 60 min
24. Capstone Part B: Optimize & Benchmark · 60 min
Track 6 · Premium · 24 lessons · 15.1 hrs

Track 6: Mechanistic Interpretability

Understand what happens inside neural networks — probing, attention analysis, causal methods, and circuit discovery.

Foundations & Tooling

1. Why Mechanistic Interpretability? · 28 min
2. Hooks & Activation Access · 32 min
3. TransformerLens Deep Dive · 38 min

Probing & Its Limits

4. Linear Probes — The Simplest Tool · 28 min
5. Probing Methodology · 32 min
6. The Limits of Probing · 27 min

Attention Analysis

7. Attention Patterns · 32 min
8. QK and OV Circuits · 33 min
9. Attention as Information Movement · 27 min

MLP Analysis & Features

10. Neurons and Features · 32 min
11. Superposition · 38 min
12. Polysemanticity · 30 min
13. Sparse Autoencoders · 45 min

Causal Methods

14. Activation Patching · 35 min
15. Path Patching · 38 min
16. Ablations and Knockouts · 32 min

Circuit Discovery

17. Induction Heads · 38 min
18. The IOI Circuit · 43 min
19. Circuit Validation · 35 min

Frontiers & Limitations

20. What Interpretability Can't Tell You · 30 min
21. Scaling Interpretability: GPT-2 to GPT-4 · 35 min
22. Research Frontiers in Mechanistic Interpretability · 32 min

Capstone

23. Capstone Project: End-to-End Circuit Discovery · 90 min
24. Capstone Project: Validate and Document Like a Research Paper · 75 min

Track 0 is completely free. Sign in to save progress and unlock Track 1.