The Memory Wall
Estimated reading time: 15 minutes
Modern deep learning has a dirty secret: we've been lying to you about compute.
Everyone talks about FLOPs, TFLOPS, and compute utilization. But here's the uncomfortable truth that seasoned practitioners learn the hard way: memory, not compute, is almost always your bottleneck.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Memory Hierarchy to build core intuition and shared vocabulary.
- Move to Mental Model: The Roofline to understand the mechanism behind the intuition.
- Apply the idea in Toy Implementation: Memory Calculator with concrete examples and implementation details.
- Challenge your understanding in the failure-mode section and check what breaks first.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
The Memory Hierarchy#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
GPUs have a memory hierarchy, just like CPUs — but the numbers are dramatically different.
The key insight: each level is ~10x faster but ~100x smaller. Your job as an ML engineer is to keep data in fast memory as long as possible.
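To make this concrete, the sketch below compares how long it takes to move 1 GB of data at each level, using rough A100-class bandwidth figures (approximations for intuition, not vendor specs):

```python
# Rough, order-of-magnitude bandwidths for an A100-class system (assumed figures).
bandwidth_gb_per_s = {
    "on-chip SRAM (shared memory / L1)": 19_000,  # ~19 TB/s aggregate, ~20 MB total
    "HBM (GPU DRAM)": 1_500,                      # ~1.5-2 TB/s, 40-80 GB
    "host DRAM over PCIe Gen4": 32,               # ~32 GB/s, hundreds of GB
    "NVMe SSD": 5,                                # ~3-7 GB/s, terabytes
}

tensor_gb = 1.0  # a 1 GB tensor of activations
for level, bw in bandwidth_gb_per_s.items():
    print(f"{level:35s} ~{1000 * tensor_gb / bw:8.2f} ms to move {tensor_gb:.0f} GB")
```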
Mental Model: The Roofline#
Flow bridge: Building on The Memory Hierarchy, this section adds the next layer of conceptual depth.
The "roofline model" helps you understand whether you're compute-bound or memory-bound.
Toy Implementation: Memory Calculator#
Flow bridge: Apply the concept through concrete implementation details before moving to harder edge cases.
Let's build a simple tool to estimate memory requirements.
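Here is a minimal sketch of such a calculator. It assumes mixed-precision training with Adam, where the usual rule of thumb is about 16 bytes per parameter (2 for fp16/bf16 weights, 2 for gradients, 12 for fp32 master weights plus the two Adam moments), and it deliberately ignores activations:

```python
def training_memory_gb(n_params: float,
                       param_bytes: int = 2,    # fp16/bf16 weights
                       grad_bytes: int = 2,     # fp16/bf16 gradients
                       optim_bytes: int = 12):  # fp32 master weights + Adam m and v
    """Rough lower bound on training memory for weights, grads, and optimizer state.
    Ignores activations, KV cache, framework overhead, and fragmentation."""
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

for n in (7e9, 13e9, 70e9):
    print(f"{n/1e9:>3.0f}B params: ~{training_memory_gb(n):,.0f} GB before activations")
```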
Break It: When Estimates Go Wrong#
Flow bridge: Now that the core mechanism is clear, stress-test it under realistic failure conditions.
# This can OOM on a 24GB GPU, even though the forward-pass math says it should fit
# Why? Extra activation and gradient memory for the backward pass
# (Note: This is PyTorch code for illustration - it won't run in the browser)
import torch
model = torch.nn.Linear(16384, 16384).cuda()     # ~1 GB of fp32 weights
x = torch.randn(32, 4096, 16384, device="cuda")  # 8GB input
# Forward seems fine...
y = model(x)  # 8GB output; x is also saved for the backward pass
# But backward allocates even more
loss = y.sum()
loss.backward()  # OOM! Needs an 8GB gradient buffer for y plus the weight gradient
Try calculating the memory yourself:
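To check your answer, here is one accounting for the snippet above (assuming fp32 everywhere and that x does not require gradients):

```python
GiB = 1024**3
fp32 = 4  # bytes per element

weights = 16384 * 16384 * fp32         # ~1 GiB of Linear weights
x       = 32 * 4096 * 16384 * fp32     # 8 GiB input, saved for the weight gradient
y       = 32 * 4096 * 16384 * fp32     # 8 GiB output, still referenced in Python
grad_y  = 32 * 4096 * 16384 * fp32     # 8 GiB gradient buffer allocated by backward()
grad_w  = 16384 * 16384 * fp32         # ~1 GiB weight gradient

peak = weights + x + y + grad_y + grad_w
print(f"~{peak / GiB:.0f} GiB peak, before allocator overhead and workspace")  # ~26 GiB
```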
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
What happens as we scale from 7B → 70B → 700B parameters?
| Scale | Weights (fp16) | Min GPUs (80GB) | Real Requirement |
|---|---|---|---|
| 7B | 14 GB | 1 | 1 A100 (room for KV cache) |
| 70B | 140 GB | 2 | 4+ A100s (tensor parallel) |
| 175B | 350 GB | 5 | 8+ A100s (need headroom) |
| 700B | 1.4 TB | 18 | 32+ A100s (pipeline parallel) |
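The "Min GPUs" column simply divides the weights by 80 GB of device memory; the "real requirement" is roughly that figure padded by a headroom factor. A rough sketch, where the 1.8x headroom multiplier is an illustrative assumption rather than a rule:

```python
import math

def gpus_needed(n_params: float, gpu_mem_gb: float = 80,
                bytes_per_param: int = 2, headroom: float = 1.0):
    """GPUs needed to hold the weights, padded by a headroom factor for
    KV cache, activations, fragmentation, and communication buffers."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb, math.ceil(weights_gb * headroom / gpu_mem_gb)

for n in (7e9, 70e9, 175e9, 700e9):
    weights_gb, naive = gpus_needed(n)
    _, padded = gpus_needed(n, headroom=1.8)  # illustrative ~1.8x overhead
    print(f"{n/1e9:>4.0f}B: {weights_gb:>6.0f} GB weights, "
          f"{naive:>2d} GPUs naive, ~{padded} with headroom")
```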
The non-linear jump in "real requirement" comes from:
- Communication overhead between GPUs
- Memory fragmentation
- Need for KV cache and activation memory
- Redundant storage for fault tolerance
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
How do big labs actually handle the memory wall?
- Mixed precision training — Use fp16/bf16 for compute, fp32 for sensitive operations. Halves memory, doubles throughput.
- Gradient checkpointing — Don't store all activations; recompute them during backward. Trades compute for memory (see the sketch after this list).
- ZeRO/FSDP — Shard optimizer states, gradients, and weights across GPUs. Memory scales with GPU count.
- FlashAttention — Fuse attention operations to avoid materializing the N×N attention matrix. Game-changer for long sequences.
- Offloading — Move optimizer states to CPU/NVMe. Slower but enables training models that don't fit.
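As a concrete illustration of the first two items, here is a minimal PyTorch sketch; the model, sizes, and loss are placeholders, and checkpointing every block is a simplification:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Placeholder model: a stack of large MLP blocks (requires a CUDA device)
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()) for _ in range(8)]
).cuda()
opt = torch.optim.AdamW(blocks.parameters())

x = torch.randn(16, 1024, 4096, device="cuda")

# Mixed precision: run the forward/backward math in bf16
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    h = x
    for block in blocks:
        # Gradient checkpointing: drop this block's activations now,
        # recompute them during the backward pass
        h = checkpoint(block, h, use_reentrant=False)
    loss = h.float().pow(2).mean()  # dummy loss

loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```

With bf16 no loss scaling is needed; with fp16 you would typically add a torch.cuda.amp.GradScaler around the backward pass and optimizer step.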
Teacher Walkthrough: Deciding the First Fix When You Hit OOM#
Flow bridge: Convert memory concepts into a practical incident response sequence.
When a run fails with out-of-memory, apply fixes in this order:
- Reduce activation footprint first using checkpointing or smaller micro-batches.
- Switch precision strategy to BF16/FP16 where stable.
- Shard before scaling hardware blindly using FSDP/ZeRO.
- Only then adjust architecture choices such as sequence length, model width, or MoE.
This ordering preserves model intent while minimizing restart cost.
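For example, the first step often means shrinking the micro-batch while preserving the effective batch size through gradient accumulation; a minimal sketch with a placeholder model and random data:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder model
opt = torch.optim.AdamW(model.parameters())

target_batch = 64   # the batch size the run was designed for
micro_batch = 8     # small enough to fit after the OOM
accum_steps = target_batch // micro_batch

opt.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 4096, device="cuda")  # placeholder data
    loss = model(x).pow(2).mean() / accum_steps        # scale so grads match the full batch
    loss.backward()                                    # gradients accumulate in .grad
opt.step()
opt.zero_grad(set_to_none=True)
```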
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you explain why memory bandwidth, not compute, is usually the bottleneck?
- Without notes, can you sketch a mental model of the GPU memory hierarchy?
- Without notes, can you calculate memory requirements for model weights and activations?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
The memory wall is an active research area:
- Mixture of Experts (MoE): Only activate a subset of parameters per token. More parameters, same memory bandwidth.
- Linear attention: Replace O(N²) attention with O(N) alternatives. But accuracy tradeoffs exist.
- Quantization-aware training: Train models that work well at int8/int4. Inference memory drops 4-8x (a quick calculation follows this list).
- Hardware evolution: HBM3 promises 2x bandwidth. CXL enables memory pooling across servers.
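To put numbers on the quantization point, weight memory scales directly with bits per parameter (a rough calculation that ignores quantization scales, zero-points, and the KV cache):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit weights: ~{weight_memory_gb(7e9, bits):4.1f} GB")
# 28 -> 14 -> 7 -> 3.5 GB: int8/int4 versus fp32 is the 4-8x drop mentioned above
```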
Next up: We'll see how gradients actually flow through computation graphs, and why understanding this unlocks optimization techniques.