The Memory Wall
Estimated reading time: 15 minutes
Modern deep learning has a dirty secret: we've been lying to you about compute.
Everyone talks about FLOPs, TFLOPS, and compute utilization. But here's the uncomfortable truth that seasoned practitioners learn the hard way: memory, not compute, is almost always your bottleneck.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with The Memory Hierarchy to build core intuition and shared vocabulary.
- Move to Mental Model: The Roofline to understand the mechanism behind the intuition.
- Apply the idea in Toy Implementation: Memory Calculator with concrete examples and implementation details.
- Challenge your understanding in the failure-mode section and check what breaks first.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
The Memory Hierarchy#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
GPUs have a memory hierarchy, just like CPUs — but the numbers are dramatically different.
The key insight: each level is ~10x faster but ~100x smaller. Your job as an ML engineer is to keep data in fast memory as long as possible.
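To make this concrete, the sketch below compares how long it takes to move 1 GB of data at each level, using rough A100-class bandwidth figures (approximations for intuition, not vendor specs):

```python
# Rough, order-of-magnitude bandwidths for an A100-class system (assumed figures).
bandwidth_gb_per_s = {
    "on-chip SRAM (shared memory / L1)": 19_000,  # ~19 TB/s aggregate, ~20 MB total
    "HBM (GPU DRAM)": 1_500,                      # ~1.5-2 TB/s, 40-80 GB
    "host DRAM over PCIe Gen4": 32,               # ~32 GB/s, hundreds of GB
    "NVMe SSD": 5,                                # ~3-7 GB/s, terabytes
}

tensor_gb = 1.0  # a 1 GB tensor of activations
for level, bw in bandwidth_gb_per_s.items():
    print(f"{level:35s} ~{1000 * tensor_gb / bw:8.2f} ms to move {tensor_gb:.0f} GB")
```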
Mental Model: The Roofline#
Flow bridge: Building on The Memory Hierarchy, this section adds the next layer of conceptual depth.
The "roofline model" helps you understand whether you're compute-bound or memory-bound.
Toy Implementation: Memory Calculator#
Flow bridge: Apply the concept through concrete implementation details before moving to harder edge cases.
Let's build a simple tool to estimate memory requirements.
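Here is a minimal sketch of such a calculator. It assumes mixed-precision training with Adam, where the usual rule of thumb is about 16 bytes per parameter (2 for fp16/bf16 weights, 2 for gradients, 12 for fp32 master weights plus the two Adam moments), and it deliberately ignores activations:

```python
def training_memory_gb(n_params: float,
                       param_bytes: int = 2,    # fp16/bf16 weights
                       grad_bytes: int = 2,     # fp16/bf16 gradients
                       optim_bytes: int = 12):  # fp32 master weights + Adam m and v
    """Rough lower bound on training memory for weights, grads, and optimizer state.
    Ignores activations, KV cache, framework overhead, and fragmentation."""
    return n_params * (param_bytes + grad_bytes + optim_bytes) / 1e9

for n in (7e9, 13e9, 70e9):
    print(f"{n/1e9:>3.0f}B params: ~{training_memory_gb(n):,.0f} GB before activations")
```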
Break It: When Estimates Go Wrong#
Flow bridge: Now that the core mechanism is clear, stress-test it under realistic failure conditions.
# This can OOM on a 24GB GPU, even though the forward-pass math says it should fit
# Why? Extra activation and gradient memory for the backward pass
# (Note: This is PyTorch code for illustration - it won't run in the browser)
import torch
model = torch.nn.Linear(16384, 16384).cuda()     # ~1 GB of fp32 weights
x = torch.randn(32, 4096, 16384, device="cuda")  # 8GB input
# Forward seems fine...
y = model(x)  # 8GB output; x is also saved for the backward pass
# But backward allocates even more
loss = y.sum()
loss.backward()  # OOM! Needs an 8GB gradient buffer for y plus the weight gradient
Try calculating the memory yourself:
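To check your answer, here is one accounting for the snippet above (assuming fp32 everywhere and that x does not require gradients):

```python
GiB = 1024**3
fp32 = 4  # bytes per element

weights = 16384 * 16384 * fp32         # ~1 GiB of Linear weights
x       = 32 * 4096 * 16384 * fp32     # 8 GiB input, saved for the weight gradient
y       = 32 * 4096 * 16384 * fp32     # 8 GiB output, still referenced in Python
grad_y  = 32 * 4096 * 16384 * fp32     # 8 GiB gradient buffer allocated by backward()
grad_w  = 16384 * 16384 * fp32         # ~1 GiB weight gradient

peak = weights + x + y + grad_y + grad_w
print(f"~{peak / GiB:.0f} GiB peak, before allocator overhead and workspace")  # ~26 GiB
```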
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
What happens as we scale from 7B → 70B → 700B parameters?
| Scale | Weights (fp16) | Min GPUs (80GB) | Real Requirement |
|---|---|---|---|
| 7B | 14 GB | 1 | 1 A100 (room for KV cache) |
| 70B | 140 GB | 2 | 4+ A100s (tensor parallel) |
| 175B | 350 GB | 5 | 8+ A100s (need headroom) |
| 700B | 1.4 TB | 18 | 32+ A100s (pipeline parallel) |
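The "Min GPUs" column simply divides the weights by 80 GB of device memory; the "real requirement" is roughly that figure padded by a headroom factor. A rough sketch, where the 1.8x headroom multiplier is an illustrative assumption rather than a rule:

```python
import math

def gpus_needed(n_params: float, gpu_mem_gb: float = 80,
                bytes_per_param: int = 2, headroom: float = 1.0):
    """GPUs needed to hold the weights, padded by a headroom factor for
    KV cache, activations, fragmentation, and communication buffers."""
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb, math.ceil(weights_gb * headroom / gpu_mem_gb)

for n in (7e9, 70e9, 175e9, 700e9):
    weights_gb, naive = gpus_needed(n)
    _, padded = gpus_needed(n, headroom=1.8)  # illustrative ~1.8x overhead
    print(f"{n/1e9:>4.0f}B: {weights_gb:>6.0f} GB weights, "
          f"{naive:>2d} GPUs naive, ~{padded} with headroom")
```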
The non-linear jump in "real requirement" comes from:
- Communication overhead between GPUs
- Memory fragmentation
- Need for KV cache and activation memory
- Redundant storage for fault tolerance
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
How do big labs actually handle the memory wall?
- Mixed precision training — Use fp16/bf16 for compute, fp32 for sensitive operations. Halves memory, doubles throughput.
- Gradient checkpointing — Don't store all activations; recompute them during backward. Trades compute for memory (see the sketch after this list).
- ZeRO/FSDP — Shard optimizer states, gradients, and weights across GPUs. Memory scales with GPU count.
- FlashAttention — Fuse attention operations to avoid materializing the N×N attention matrix. Game-changer for long sequences.
- Offloading — Move optimizer states to CPU/NVMe. Slower but enables training models that don't fit.
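As a concrete illustration of the first two items, here is a minimal PyTorch sketch; the model, sizes, and loss are placeholders, and checkpointing every block is a simplification:

```python
import torch
from torch.utils.checkpoint import checkpoint

# Placeholder model: a stack of large MLP blocks (requires a CUDA device)
blocks = torch.nn.ModuleList(
    [torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU()) for _ in range(8)]
).cuda()
opt = torch.optim.AdamW(blocks.parameters())

x = torch.randn(16, 1024, 4096, device="cuda")

# Mixed precision: run the forward/backward math in bf16
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    h = x
    for block in blocks:
        # Gradient checkpointing: drop this block's activations now,
        # recompute them during the backward pass
        h = checkpoint(block, h, use_reentrant=False)
    loss = h.float().pow(2).mean()  # dummy loss

loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```

With bf16 no loss scaling is needed; with fp16 you would typically add a torch.cuda.amp.GradScaler around the backward pass and optimizer step.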
Teacher Walkthrough: Deciding the First Fix When You Hit OOM#
Flow bridge: Convert memory concepts into a practical incident response sequence.
When a run fails with out-of-memory, apply fixes in this order:
- Reduce activation footprint first using checkpointing or smaller micro-batches.
- Switch precision strategy to BF16/FP16 where stable.
- Shard before scaling hardware blindly using FSDP/ZeRO.
- Only then adjust architecture choices such as sequence length, model width, or MoE.
This ordering preserves model intent while minimizing restart cost.
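For example, the first step often means shrinking the micro-batch while preserving the effective batch size through gradient accumulation; a minimal sketch with a placeholder model and random data:

```python
import torch

model = torch.nn.Linear(4096, 4096).cuda()  # placeholder model
opt = torch.optim.AdamW(model.parameters())

target_batch = 64   # the batch size the run was designed for
micro_batch = 8     # small enough to fit after the OOM
accum_steps = target_batch // micro_batch

opt.zero_grad(set_to_none=True)
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 4096, device="cuda")  # placeholder data
    loss = model(x).pow(2).mean() / accum_steps        # scale so grads match the full batch
    loss.backward()                                    # gradients accumulate in .grad
opt.step()
opt.zero_grad(set_to_none=True)
```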
Checkpoint Questions#
Use these to verify understanding before moving on:
- Without notes, can you explain why memory bandwidth, not compute, is usually the bottleneck?
- Without notes, can you sketch a mental model of the GPU memory hierarchy?
- Without notes, can you calculate memory requirements for model weights and activations?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
The memory wall is an active research area:
- Mixture of Experts (MoE): Only activate a subset of parameters per token. More parameters, same memory bandwidth.
- Linear attention: Replace O(N²) attention with O(N) alternatives. But accuracy tradeoffs exist.
- Quantization-aware training: Train models that work well at int8/int4. Inference memory drops 4-8x (a quick calculation follows this list).
- Hardware evolution: HBM3 promises 2x bandwidth. CXL enables memory pooling across servers.
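To put numbers on the quantization point, weight memory scales directly with bits per parameter (a rough calculation that ignores quantization scales, zero-points, and the KV cache):

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    return n_params * bits / 8 / 1e9

for bits in (32, 16, 8, 4):
    print(f"7B model at {bits:>2}-bit weights: ~{weight_memory_gb(7e9, bits):4.1f} GB")
# 28 -> 14 -> 7 -> 3.5 GB: int8/int4 versus fp32 is the 4-8x drop mentioned above
```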
Next up: We'll see how gradients actually flow through computation graphs, and why understanding this unlocks optimization techniques.