The Memory Wall

Estimated reading time: 15 minutes

Modern deep learning has a dirty secret: we've been lying to you about compute.

Everyone talks about FLOPs, TFLOPS, and compute utilization. But here's the uncomfortable truth that separates research engineers from practitioners: memory, not compute, is almost always your bottleneck.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with The Memory Hierarchy to build core intuition and shared vocabulary.
  2. Move to Mental Model: The Roofline to understand the mechanism behind the intuition.
  3. Apply the idea in Toy Implementation: Memory Calculator with concrete numbers.
  4. Stress-test your understanding in Break It: When Estimates Go Wrong and see what breaks first.
  5. Then zoom out to the Scale Thought Experiment to see how the same constraints play out at larger model and system sizes.
  6. Finally, map the concept to Production Reality to see how teams make practical tradeoffs.

The Memory Hierarchy#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

GPUs have a memory hierarchy, just like CPUs — but the numbers are dramatically different.

[Diagram: the GPU memory hierarchy]

The key insight: each level is ~10x faster but ~100x smaller. Your job as an ML engineer is to keep data in fast memory as long as possible.
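
To make that concrete, here is a back-of-the-envelope sketch (the bandwidth figure is an assumed round number; substitute your GPU's datasheet value) of the floor that weight traffic alone puts on a single forward pass:

HBM_BANDWIDTH = 2.0e12  # bytes/s; an assumed round number, not any specific GPU

def min_weight_read_ms(num_params, bytes_per_param=2):
    """Lower bound: time to stream every fp16 weight from HBM exactly once."""
    return num_params * bytes_per_param / HBM_BANDWIDTH * 1e3

print(f"7B fp16 model: {min_weight_read_ms(7e9):.1f} ms per forward pass, best case")
# No amount of extra compute makes the pass faster than moving the weights.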

Mental Model: The Roofline#

Flow bridge: Building on The Memory Hierarchy, this section adds the next layer of conceptual depth.

The "roofline model" helps you understand whether you're compute-bound or memory-bound.

Toy Implementation: Memory Calculator#

Flow bridge: Apply the concept through concrete implementation details before moving to harder edge cases.

Let's build a simple tool to estimate memory requirements and walk through the formula behind it:

Step 1: Understand the formula. Model weights are just params × bytes_per_param. For fp16, that's 2 bytes per parameter. A 7B model = 14GB just for weights.
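
A minimal sketch of the calculator (the training estimate assumes mixed-precision Adam, roughly 16 bytes per parameter, and deliberately ignores activations):

memory_calculator.py

def weight_memory_gb(num_params, bytes_per_param=2):
    """Weights only: params x bytes per param (fp16/bf16 = 2 bytes)."""
    return num_params * bytes_per_param / 1e9

def training_memory_gb(num_params):
    """Rough mixed-precision Adam footprint: fp16 weights + fp16 grads
    + fp32 master weights + two fp32 optimizer moments (~16 bytes/param).
    Activations are NOT included."""
    return num_params * (2 + 2 + 4 + 4 + 4) / 1e9

for n in (7e9, 70e9):
    print(f"{n/1e9:.0f}B params: {weight_memory_gb(n):.0f} GB of weights, "
          f"~{training_memory_gb(n):.0f} GB to train before activations")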

Break It: When Estimates Go Wrong#

Flow bridge: Now that the core mechanism is clear, stress-test it under realistic failure conditions.

# This will OOM on a 40GB GPU, even though the math says it should fit
# Why? Activation memory for the backward pass
# (Note: This is PyTorch code for illustration - it needs a CUDA GPU to run)

import torch

model = torch.nn.Linear(16384, 16384).cuda()
x = torch.randn(32, 4096, 16384, device="cuda", requires_grad=True)  # 8 GB input (fp32)

# Forward seems fine...
y = model(x)  # 8GB output

# But backward needs to store activations
loss = y.sum()
loss.backward()  # OOM! Needs gradients for x, y, and weight

Try calculating the memory yourself:

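One way to fill in activation_memory.py, tallying the tensors from the snippet above (fp32, 4 bytes per element; the exact set of live tensors is an approximation of what autograd holds):

activation_memory.py

GIB = 1024**3
BYTES = 4  # fp32

def gib(*shape):
    n = 1
    for d in shape:
        n *= d
    return n * BYTES / GIB

weight = gib(16384, 16384)       # ~1 GiB of fp32 weights
x      = gib(32, 4096, 16384)    # ~8 GiB input, saved for the weight gradient
y      = gib(32, 4096, 16384)    # ~8 GiB output
grad_y = y                       # ~8 GiB, allocated when backward starts
grad_x = x                       # ~8 GiB, because x requires grad
grad_w = weight                  # ~1 GiB

total = weight + x + y + grad_y + grad_x + grad_w
print(f"~{total:.0f} GiB of live tensors, before fragmentation and workspaces")

That lands within a few gigabytes of a 40 GB card before the CUDA context, allocator fragmentation, and library workspaces are counted, which is why the "it should fit" math fails in practice.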

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

What happens as we scale from 7B → 70B → 700B parameters?

Scale   Weights (fp16)   Min GPUs (80 GB)   Real requirement
7B      14 GB            1                  1 A100 (room for KV cache)
70B     140 GB           2                  4+ A100s (tensor parallel)
175B    350 GB           5                  8+ A100s (need headroom)
700B    1.4 TB           18                 32+ A100s (pipeline parallel)
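
A quick check of the first two numeric columns (the "real requirement" column reflects operational headroom and cannot be derived from weight size alone):

import math

GPU_GB = 80
for params_b in (7, 70, 175, 700):
    weights_gb = params_b * 2                  # fp16: 2 bytes per parameter
    min_gpus = math.ceil(weights_gb / GPU_GB)  # weights alone, no KV cache or activations
    print(f"{params_b}B: {weights_gb} GB of weights -> at least {min_gpus} x 80 GB GPUs")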

The non-linear jump in "real requirement" comes from:

  • Communication overhead between GPUs
  • Memory fragmentation
  • Need for KV cache and activation memory
  • Redundant storage for fault tolerance

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

How do big labs actually handle the memory wall?

  1. Mixed precision training — Use fp16/bf16 for compute, fp32 for sensitive operations. Roughly halves weight and activation memory and can double throughput.

  2. Gradient checkpointing — Don't store all activations; recompute them during backward. Trades compute for memory. (Techniques 1 and 2 are sketched in code after this list.)

  3. ZeRO/FSDP — Shard optimizer states, gradients, and weights across GPUs. Memory scales with GPU count.

  4. FlashAttention — Fuse attention operations to avoid materializing the N×N attention matrix. Game-changer for long sequences.

  5. Offloading — Move optimizer states to CPU/NVMe. Slower but enables training models that don't fit.
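
A minimal PyTorch sketch of techniques 1 and 2 together (the layer count and sizes are arbitrary placeholders; ZeRO/FSDP and offloading need a multi-GPU launch and are omitted):

import torch
from torch.utils.checkpoint import checkpoint

# A toy stack of layers standing in for transformer blocks.
layers = torch.nn.ModuleList(
    [torch.nn.Linear(4096, 4096) for _ in range(8)]
).cuda()
x = torch.randn(8, 1024, 4096, device="cuda", requires_grad=True)

with torch.autocast("cuda", dtype=torch.bfloat16):   # mixed precision for the matmuls
    h = x
    for layer in layers:
        # Gradient checkpointing: drop this layer's activations now and
        # recompute them during backward, trading compute for memory.
        h = checkpoint(layer, h, use_reentrant=False)
    loss = h.float().sum()

loss.backward()   # recomputation happens here, layer by layer

bf16 autocast assumes an Ampere-or-newer GPU; on older hardware, use fp16 with a gradient scaler instead.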

Teacher Walkthrough: Deciding the First Fix When You Hit OOM#

Flow bridge: Convert memory concepts into a practical incident response sequence.

When a run fails with out-of-memory, apply fixes in this order:

  1. Reduce activation footprint first using checkpointing or smaller micro-batches (see the sketch below).
  2. Switch precision strategy to BF16/FP16 where stable.
  3. Shard before scaling hardware blindly using FSDP/ZeRO.
  4. Only then adjust architecture choices such as sequence length, model width, or MoE.

This ordering preserves model intent while minimizing restart cost.
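
For step 1, the cheapest lever is usually a smaller micro-batch plus gradient accumulation, which keeps the effective batch size fixed; a sketch with placeholder model and batch sizes:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.AdamW(model.parameters())

global_batch, micro_batch = 64, 8          # same effective batch, ~8x less activation memory
accum_steps = global_batch // micro_batch

optimizer.zero_grad()
for _ in range(accum_steps):
    x = torch.randn(micro_batch, 1024, device="cuda")  # stand-in for a real data loader
    loss = model(x).pow(2).mean()                       # placeholder loss
    (loss / accum_steps).backward()   # scale so summed grads match the full batch
optimizer.step()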

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you explain, without notes, why memory bandwidth, not compute, is almost always the bottleneck?
  2. Can you sketch the GPU memory hierarchy from memory, including how speed and capacity change between levels?
  3. Can you calculate the memory requirements for a model's weights and activations?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

The memory wall is an active research area:

  • Mixture of Experts (MoE): Only activate a subset of parameters per token. More parameters, same memory bandwidth.
  • Linear attention: Replace O(N²) attention with O(N) alternatives. But accuracy tradeoffs exist.
  • Quantization-aware training: Train models that work well at int8/int4. Inference memory drops 4-8x.
  • Hardware evolution: HBM3 promises 2x bandwidth. CXL enables memory pooling across servers.

Next up: We'll see how gradients actually flow through computation graphs, and why understanding this unlocks optimization techniques.