The Memory Wall
Estimated reading time: 15 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will build a reusable workflow for answering two questions before any training or inference run: will this workload fit in GPU memory, and will it be compute-bound or memory-bound?
By the end, you will be able to:
- estimate each memory bucket of a transformer from model and batch dimensions
- classify operations as compute- or memory-bound with the roofline model
- triage out-of-memory failures in a principled order
Think of GPU memory as a stack of containers. The closer to the compute units, the faster (and smaller) the memory.
Takeaway: Most performance tricks reduce trips to HBM: kernel fusion avoids round-tripping intermediates, FlashAttention avoids materializing the full attention matrix, and quantization shrinks every byte that moves.
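As a sanity check, the savings from fusing an elementwise chain can be counted in bytes. This is a sketch under idealized assumptions (every tensor read from and written to HBM exactly once; real kernels add cache effects):

```python
def traffic_bytes(n, dtype_bytes=2):
    """HBM traffic for y = relu(x) + b on n fp16 elements, unfused vs fused."""
    # Unfused: relu reads x and writes tmp; add reads tmp and b and writes y.
    unfused = (n + n) * dtype_bytes + (n + n + n) * dtype_bytes
    # Fused: one kernel reads x and b, writes y. The tmp round-trip disappears.
    fused = (n + n + n) * dtype_bytes
    return unfused, fused

unfused, fused = traffic_bytes(1 << 24)  # 16M elements
print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB")
```

Because these ops are memory-bound, the 5/3 reduction in bytes moved translates almost directly into a 5/3 speedup.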
The roofline model answers: Am I compute-bound or memory-bound?
Worked example (A100 80GB):
| Spec | Value |
|---|---|
| Peak FP16 tensor-core FLOPs | 312 TFLOP/s |
| HBM bandwidth | 2.0 TB/s |
| Ridge point | 312 / 2.0 = 156 FLOPs/byte |
Now classify some ops:
| Operation | Typical Intensity | Bound? |
|---|---|---|
| ReLU, add, dropout | ~1 FLOP/byte | Memory-bound |
| LayerNorm, Softmax | ~5 FLOPs/byte | Memory-bound |
| Large GEMM (4096 × 4096) | ~2000 FLOPs/byte | Compute-bound |
Interpretation: Element-wise ops and reductions are almost always memory-bound. Large matmuls are compute-bound if batch size is big enough.
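The classification above can be reproduced with a few lines of arithmetic. This is an idealized sketch that ignores cache reuse and tiling, which is why a naive count for the GEMM lands below the table's rough figure:

```python
RIDGE = 312e12 / 2.0e12  # A100: 312 TFLOP/s / 2.0 TB/s = 156 FLOPs/byte

def bound(intensity_flops_per_byte):
    """Classify an op against the A100 ridge point."""
    return "compute-bound" if intensity_flops_per_byte >= RIDGE else "memory-bound"

# Square fp16 GEMM, C = A @ B with N = 4096:
n = 4096
flops = 2 * n**3             # one multiply + one add per inner-product term
bytes_moved = 3 * n * n * 2  # read A and B, write C (fp16 = 2 bytes)
intensity = flops / bytes_moved
print(intensity, bound(intensity))  # ~1365 FLOPs/byte, compute-bound
print(bound(1.0))                   # elementwise op: memory-bound
```

Note that intensity for a square GEMM grows linearly with N (it is N/3 here), which is why small matmuls can still be memory-bound.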
Before optimizing, know what is eating your VRAM. Here are the main buckets for transformers:
| Bucket | When it matters | Scales with |
|---|---|---|
| Weights | training + inference | num_params |
| KV cache | inference (long context) | batch * seq * hidden * layers |
| Activations | training | batch * seq * hidden * layers |
| Gradients | training | num_params |
| Optimizer states (Adam) | training | num_params |
Estimate these buckets for your own configuration. Then change seq_length from 2048 to 128000 and observe how the KV cache comes to dominate.
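A minimal estimator for these buckets can be sketched as follows (decimal GB; `transformer_memory_gb` and its defaults are illustrative, not a library API, and activations are omitted because they depend heavily on checkpointing):

```python
def transformer_memory_gb(params_b, hidden, layers, batch, seq,
                          dtype_bytes=2, kv_head_ratio=1.0, training=False):
    """Rough per-bucket memory estimate in decimal GB."""
    GB = 1e9
    weights = params_b * 1e9 * dtype_bytes / GB
    # Factor of 2 is for K and V; kv_head_ratio < 1 models GQA/MQA.
    kv_cache = 2 * batch * seq * hidden * layers * kv_head_ratio * dtype_bytes / GB
    buckets = {"weights": weights, "kv_cache": kv_cache}
    if training:
        buckets["gradients"] = weights
        buckets["adam_states"] = 2 * params_b * 1e9 * 4 / GB  # fp32 m and v
    return buckets

# 7B model (h=4096, L=32), batch 8: compare short vs long context
for seq in (2048, 128000):
    b = transformer_memory_gb(7, hidden=4096, layers=32, batch=8, seq=seq)
    print(seq, {k: round(v, 1) for k, v in b.items()})
```

At seq=2048 the KV cache (~8.6 GB) sits below the weights (14 GB); at seq=128000 it balloons to hundreds of GB and dwarfs everything else.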
The following code will OOM on a 40 GB GPU, even though the forward tensors look reasonable:

```python
# PyTorch illustration — will not run in browser
import torch

model = torch.nn.Linear(16384, 16384).cuda()        # ~1.1 GB of fp32 weights
x = torch.randn(32, 4096, 16384, device="cuda",
                requires_grad=True)                 # ~8.6 GB input (fp32)
y = model(x)                                        # ~8.6 GB output — forward fits
loss = y.sum()
loss.backward()  # OOM — grads for y, x, and the weights roughly double the footprint
```
Calculate it yourself:
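Here is the accounting, step by step (a sketch that ignores the bias, the loss scalar, and CUDA context overhead):

```python
GB = 1e9
fp32 = 4  # bytes per element

weights = 16384 * 16384 * fp32 / GB        # ~1.1 GB
x = 32 * 4096 * 16384 * fp32 / GB          # ~8.6 GB
y = x                                      # same shape, ~8.6 GB
forward = weights + x + y                  # ~18.3 GB — fits comfortably in 40 GB

# Backward materializes grad_y, grad_x (x requires grad), and the weight grad:
backward_extra = y + x + weights           # another ~18.3 GB
peak = forward + backward_extra            # ~36.5 GB + allocator overhead → OOM
print(round(forward, 1), round(peak, 1))
```

The lesson: the forward footprint is only half the story; every gradient tensor mirrors the tensor it belongs to.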
When a run fails with out-of-memory, apply fixes in this order (cheapest first):
1. Lower the batch size, using gradient accumulation to preserve the effective batch.
2. Switch to mixed precision (BF16/FP16) if you are still training in FP32.
3. Enable activation checkpointing to trade recompute for activation memory.
4. Shard gradients and optimizer states across GPUs (e.g. ZeRO/FSDP).
5. Offload to CPU or quantize as a last resort.
Estimation is for planning. Profiling is for truth. If you have access to a CUDA GPU:
```python
# Run on a machine with CUDA; assumes `optimizer` exists and that
# `step()` runs your forward pass and returns the loss.
import torch

torch.cuda.reset_peak_memory_stats()
loss = step()
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)
print("peak GB:", torch.cuda.max_memory_allocated() / 1e9)
```
What happens as we scale from 7B to 700B parameters?
| Scale | Weights (FP16) | Min GPUs (80 GB) | Real Requirement |
|---|---|---|---|
| 7B | 14 GB | 1 | 1 A100 (room for KV cache) |
| 70B | 140 GB | 2 | 4+ A100s (tensor parallel) |
| 175B | 350 GB | 5 | 8+ A100s (need headroom) |
| 700B | 1.4 TB | 18 | 32+ A100s (pipeline parallel) |
The non-linear jump in real requirement comes from:
- KV cache and activations that grow with batch size and context length, on top of the weights
- parallelism overheads: replicated buffers, communication workspace, pipeline bubbles
- the headroom needed to absorb allocator fragmentation and temporary buffers
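The "Min GPUs" column can be reproduced with a one-line estimator (a sketch; the `overhead` factor is an illustrative knob for the non-weight buckets, not a standard constant):

```python
import math

def min_gpus(params_b, gpu_gb=80, dtype_bytes=2, overhead=1.0):
    """Lower-bound GPU count from weights alone, in decimal GB."""
    weights_gb = params_b * dtype_bytes  # params in billions × bytes/param = GB
    return math.ceil(weights_gb * overhead / gpu_gb)

print(min_gpus(175))                 # 5 — weights alone
print(min_gpus(175, overhead=1.8))   # 8 — with headroom, matching "8+" above
```

The gap between the two calls is exactly the gap between the "Min GPUs" and "Real Requirement" columns.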
How do teams handle the memory wall? In practice, by combining the levers above: tensor and pipeline parallelism to split weights across devices, sharded optimizers (ZeRO/FSDP), activation checkpointing, quantization, and paged KV caches for long-context serving.
Answer these without looking back:
Calculate: A 13B model in BF16. How many GB just for weights? (Show your work.)
Calculate: A 7B model (h=4096, L=32) serving batch=8 at seq=32768 with MHA (kv_head_ratio=1.0). Estimate KV cache in GB. (Formula: 2 * batch * seq * hidden * layers * kv_head_ratio * bytes_per_element)
Decide: An operation has arithmetic intensity of 3 FLOPs/byte. On an A100 (ridge ~156 FLOPs/byte), is it memory-bound or compute-bound? What does this imply about optimization strategy?
Diagnose: Training OOMs at backward. Your estimate shows 35 GB on a 40 GB card. Name two likely causes and the first fix to try.
The memory wall is an active research area: attention variants that shrink the KV cache (MQA/GQA), lower-precision formats such as FP8 and 4-bit quantization, paged and offloaded KV caches, and IO-aware kernels in the FlashAttention line all attack it from different angles.
Next up: We will see how gradients flow through computation graphs, and why understanding this unlocks optimization techniques.