
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)


The Memory Wall

Estimated reading time: 15 minutes


In this tutorial, you will build a reusable workflow for answering two questions before any training or inference run:

  • Capacity: Will the tensors fit in VRAM, or will you OOM?
  • Bandwidth: Even if they fit, can the GPU move data fast enough to keep compute busy?

By the end, you will be able to:

  • Estimate VRAM for weights, KV cache, activations, and optimizer states
  • Decide whether an operation is memory-bound or compute-bound
  • Diagnose an OOM and choose the cheapest fix

The Memory Hierarchy

Think of GPU memory as a stack of containers. The closer to the compute units, the faster (and smaller) the memory.

(Diagram: the GPU memory hierarchy, from registers to shared memory/L1 to L2 cache to HBM, each level larger and slower than the last.)

Takeaway: Most performance tricks reduce trips to HBM:

  • Fuse ops so you do not write intermediate tensors to HBM
  • Reuse data in cache/registers when possible
  • Choose algorithms that avoid materializing giant tensors
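The first bullet is easy to quantify with a back-of-envelope sketch. The traffic model below is a simplification (every tensor read from and written to HBM exactly once, BF16 elements assumed), but it shows why fusing `y = relu(a + b)` into one kernel cuts traffic by 40%:

```python
def elementwise_traffic_bytes(n_elements, dtype_bytes=2, fused=False):
    """Estimate HBM traffic for y = relu(a + b) over n_elements.

    Unfused: the add kernel reads a and b and writes a temporary;
             the relu kernel reads the temporary and writes y.
    Fused:   one kernel reads a and b and writes y; the intermediate
             stays in registers and never touches HBM.
    """
    if fused:
        reads, writes = 2 * n_elements, 1 * n_elements
    else:
        reads, writes = 3 * n_elements, 2 * n_elements
    return (reads + writes) * dtype_bytes

n = 4096 * 4096
unfused = elementwise_traffic_bytes(n, fused=False)
fused = elementwise_traffic_bytes(n, fused=True)
print(f"unfused: {unfused / 1e6:.0f} MB, fused: {fused / 1e6:.0f} MB")
print(f"traffic saved by fusion: {1 - fused / unfused:.0%}")
```

Since these ops are memory-bound, the 40% traffic reduction translates almost directly into a 40% speedup.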

Mental Model: The Roofline

The roofline model answers: Am I compute-bound or memory-bound?

⚠️

Two Numbers You Compare

Arithmetic Intensity = FLOPs / bytes_loaded

Ridge point = peak_FLOPs / peak_bandwidth (FLOPs/byte)

  • If intensity < ridge → memory-bound
  • If intensity > ridge → compute-bound

Worked example (A100 80GB):

| Spec | Value |
|---|---|
| Peak FP16 tensor-core FLOPs | 312 TFLOP/s |
| HBM bandwidth | 2.0 TB/s |
| Ridge point | 312 / 2.0 = 156 FLOPs/byte |

Now classify some ops:

| Operation | Typical Intensity | Bound? |
|---|---|---|
| ReLU, add, dropout | ~1 FLOP/byte | Memory-bound |
| LayerNorm, Softmax | ~5 FLOPs/byte | Memory-bound |
| Large GEMM (4096 × 4096) | ~2000 FLOPs/byte | Compute-bound |

Interpretation: Element-wise ops and reductions are almost always memory-bound. Large matmuls are compute-bound if batch size is big enough.
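This classification is mechanical enough to script. A sketch, using the A100 numbers from the table above (the per-op FLOP and byte counts in the examples are idealized):

```python
A100_PEAK_FLOPS = 312e12   # FP16 tensor-core peak, FLOP/s
A100_BANDWIDTH = 2.0e12    # HBM bandwidth, bytes/s
RIDGE = A100_PEAK_FLOPS / A100_BANDWIDTH  # 156 FLOPs/byte

def classify(flops, bytes_moved, ridge=RIDGE):
    """Return (arithmetic intensity, roofline regime) for an op."""
    intensity = flops / bytes_moved
    regime = "compute-bound" if intensity > ridge else "memory-bound"
    return intensity, regime

# ReLU over FP16: 1 FLOP per element, 2 bytes read + 2 bytes written
print(classify(flops=1, bytes_moved=4))  # memory-bound

# Square GEMM in FP16: 2*M*N*K FLOPs vs (M*K + K*N + M*N) * 2 bytes
M = N = K = 4096
print(classify(2 * M * N * K, (M * K + K * N + M * N) * 2))  # compute-bound
```

Note how the GEMM's intensity grows with matrix size: for a square matmul it scales roughly as K/3, which is why small or skinny matmuls can still end up memory-bound.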

Toy Implementation: Memory Calculator

Before optimizing, know what is eating your VRAM. Here are the main buckets for transformers:

| Bucket | When it matters | Scales with |
|---|---|---|
| Weights | training + inference | num_params |
| KV cache | inference (long context) | batch * seq * hidden * layers |
| Activations | training | batch * seq * hidden * layers |
| Gradients | training | num_params |
| Optimizer states (Adam) | training | num_params |
⚠️

Mixed Precision Is Not Always a Memory Win

In training, you may still keep FP32 gradients, FP32 optimizer states, and sometimes an FP32 master copy of weights.

BF16/FP16 is often a throughput win, but total VRAM can stay similar unless you shard or offload.

Run the calculator below. Then change seq_length from 2048 to 128000 and observe how KV cache dominates.

memory_calculator.py

Step 1: Start with weights. weights = params * bytes/param. In BF16 that is 2 bytes/param, so 7B = 14 GB just for weights.
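A minimal version of such a calculator, as a sketch: the byte counts assume BF16 weights, gradients, and activations with FP32 Adam states plus an FP32 master copy, the activation term is a crude one-tensor-per-layer proxy, and the 7B config below is illustrative.

```python
GB = 1e9  # decimal GB, matching "7B params * 2 bytes = 14 GB"

def estimate_memory_gb(num_params, batch, seq, hidden, layers,
                       weight_bytes=2,   # BF16 weights
                       grad_bytes=2,     # BF16 gradients
                       optim_bytes=12,   # FP32 Adam m, v + FP32 master copy
                       act_bytes=2, kv_bytes=2):
    """Rough per-bucket VRAM estimate for the table above."""
    weights = num_params * weight_bytes
    grads = num_params * grad_bytes
    optim = num_params * optim_bytes
    acts = batch * seq * hidden * layers * act_bytes
    kv = 2 * batch * seq * hidden * layers * kv_bytes  # K and V per layer
    return {
        "weights": weights / GB,
        "gradients": grads / GB,
        "optimizer": optim / GB,
        "activations": acts / GB,
        "kv_cache": kv / GB,
        "train_total": (weights + grads + optim + acts) / GB,
        "infer_total": (weights + kv) / GB,
    }

for seq in (2048, 128_000):  # at long context, KV cache dominates
    est = estimate_memory_gb(7e9, batch=8, seq=seq, hidden=4096, layers=32)
    print(f"seq={seq}: kv_cache={est['kv_cache']:.1f} GB, "
          f"infer_total={est['infer_total']:.1f} GB")
```

At seq=2048 the KV cache (~8.6 GB) is smaller than the weights; at seq=128000 it swells past 500 GB and dwarfs everything else.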

Break It: When Estimates Go Wrong

❌

Common Mistakes

  1. Counting only weights — Training adds grads + optimizer states + activations. Inference adds KV cache.

  2. Ignoring overhead — Fragmentation, temporary workspaces, and kernel caches can cost 10-25%.

  3. Forgetting what scales with tokens — KV cache and activations scale with batch * seq. Increasing context is rarely free.

The following code will OOM on a 40 GB GPU, even though the forward tensors look reasonable:

```python
# PyTorch illustration — will not run in browser
import torch

model = torch.nn.Linear(16384, 16384).cuda()
x = torch.randn(32, 4096, 16384, device="cuda")  # ~8.6 GB input (fp32)

y = model(x)  # ~8.6 GB output — forward seems fine

loss = y.sum()
loss.backward()  # OOM — backward needs gradients too
```

Calculate it yourself:

activation_memory.py
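A sketch of that calculation for the Linear example above (FP32, so 4 bytes per element; exactly which buffers autograd keeps live is an assumption here, and allocator overhead pushes the real peak higher still):

```python
def gb(n_elements, bytes_per=4):  # FP32: 4 bytes per element
    return n_elements * bytes_per / 1e9

batch, seq, d = 32, 4096, 16384
x = batch * seq * d   # input, saved by autograd for the weight gradient
y = batch * seq * d   # output
w = d * d             # Linear weight (bias is negligible)

forward_live = gb(x) + gb(y) + gb(w)
# backward allocates fresh buffers: grad of y, grad of the weight,
# and (if the input requires grad) grad of x
backward_extra = gb(y) + gb(w) + gb(x)

print(f"input/output each:  {gb(x):.1f} GB")
print(f"forward live:       {forward_live:.1f} GB")
print(f"forward + backward: {forward_live + backward_extra:.1f} GB")
```

Roughly 36 GB of live tensors, plus fragmentation, cuBLAS workspaces, and the CUDA context, is enough to exhaust a 40 GB card.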

Fixing OOM: Ordered Checklist

When a run fails with out-of-memory, apply fixes in this order (cheapest first):

  1. Reduce micro-batch size — Cuts activations and KV cache immediately.
  2. Enable gradient checkpointing — Trades compute for memory by recomputing activations.
  3. Switch to BF16/FP16 — Often a throughput win; sometimes a memory win.
  4. Shard with FSDP/ZeRO — Distributes optimizer states and gradients across GPUs.
  5. Offload to CPU/NVMe — Slower, but enables runs that otherwise do not fit.
  6. Reduce sequence length or model width — Last resort; changes the model.
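The tradeoff in step 2 can be quantified with a back-of-envelope sketch. The √L checkpoint spacing is the standard scheme, but the per-layer activation proxy and the "2·√L live layers" accounting below are simplifications:

```python
import math

def activation_gb(batch, seq, hidden, layers, bytes_per=2, checkpoint=False):
    """Activation memory with and without sqrt(L) gradient checkpointing.

    Without: all L layers keep their activations until backward.
    With:    keep ~sqrt(L) checkpoints and recompute one segment at a
             time in backward, so only ~2*sqrt(L) layers' worth is live.
    """
    per_layer = batch * seq * hidden * bytes_per
    live_layers = 2 * math.sqrt(layers) if checkpoint else layers
    return per_layer * live_layers / 1e9

args = dict(batch=8, seq=2048, hidden=4096, layers=32)
full = activation_gb(**args)
ckpt = activation_gb(**args, checkpoint=True)
print(f"no checkpointing:      {full:.1f} GB")
print(f"sqrt(L) checkpointing: {ckpt:.1f} GB "
      f"(~{full / ckpt:.0f}x less, at roughly one extra forward pass)")
```

The savings grow with depth: at 32 layers you save about 3x, at 128 layers closer to 6x, which is why checkpointing is near-universal for large models.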
💡

Operational Heuristic

Treat memory incidents as systems incidents, not immediate model-redesign problems. Most are solved faster through execution and sharding choices.

Optional: Profile Peak Memory (Real GPU)

Estimation is for planning. Profiling is for truth. If you have access to a CUDA GPU:

```python
# Run on a machine with CUDA; assumes step() runs your forward pass and
# returns the loss, and that model and optimizer are already defined.
import torch

torch.cuda.reset_peak_memory_stats()

loss = step()  # forward
loss.backward()
optimizer.step()
optimizer.zero_grad(set_to_none=True)

print("peak GB:", torch.cuda.max_memory_allocated() / 1e9)
```

Scale Thought Experiment

What happens as we scale from 7B to 700B parameters?

| Scale | Weights (FP16) | Min GPUs (80 GB) | Real Requirement |
|---|---|---|---|
| 7B | 14 GB | 1 | 1 A100 (room for KV cache) |
| 70B | 140 GB | 2 | 4+ A100s (tensor parallel) |
| 175B | 350 GB | 5 | 8+ A100s (need headroom) |
| 700B | 1.4 TB | 18 | 32+ A100s (pipeline parallel) |

The non-linear jump in real requirement comes from:

  • Communication overhead between GPUs
  • Memory fragmentation
  • KV cache and activation memory
  • Redundant storage for fault tolerance
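The "Min GPUs" column is just weight bytes divided by card capacity, rounded up. A sketch that reproduces it (the 1.7x headroom factor standing in for the overheads above is an illustrative assumption):

```python
import math

def min_gpus(num_params, gpu_gb=80, bytes_per_param=2, headroom=1.0):
    """GPUs needed just to hold the weights, padded by a headroom
    factor for KV cache, activations, and fragmentation."""
    weight_gb = num_params * bytes_per_param / 1e9
    return math.ceil(weight_gb * headroom / gpu_gb)

for params in (7e9, 70e9, 175e9, 700e9):
    print(f"{params / 1e9:>4.0f}B: weights-only {min_gpus(params):>2} GPUs, "
          f"with ~1.7x headroom {min_gpus(params, headroom=1.7):>2}")
```

Even this padded estimate understates the real requirement at the largest scales, where parallelism topology (pipeline stages must divide evenly, tensor parallel groups must fit in a node) forces further rounding up.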

Production Reality

How do teams handle the memory wall?

  1. Mixed precision — BF16/FP16 where possible, FP32 where necessary.
  2. Gradient checkpointing — Recompute activations in backward; trades compute for memory.
  3. ZeRO/FSDP — Shard optimizer states, gradients, and sometimes weights across GPUs.
  4. FlashAttention — Avoid materializing the full N*N attention matrix.
  5. Offloading — Move some states to CPU/NVMe when nothing else fits.
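Point 4's savings are easy to size: the materialized score matrix grows as N², while FlashAttention streams it through on-chip tiles and never writes it to HBM. A sketch (the batch and head counts are illustrative):

```python
def attn_matrix_gb(batch, heads, seq, bytes_per=2):
    """Memory for the full seq x seq attention score matrix (FP16)."""
    return batch * heads * seq * seq * bytes_per / 1e9

for seq in (2048, 32768, 131072):
    print(f"seq={seq:>6}: {attn_matrix_gb(1, 32, seq):>9.1f} GB")
```

At seq=32768 the full matrix alone is ~69 GB for a single sequence, which is why long-context inference is simply impossible without tiled attention.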

Checkpoint Questions

Answer these without looking back:

  1. Calculate: A 13B model in BF16. How many GB just for weights? (Show your work.)

  2. Calculate: A 7B model (h=4096, L=32) serving batch=8 at seq=32768 with MHA (kv_head_ratio=1.0). Estimate KV cache in GB. (Formula: 2 * batch * seq * hidden * layers * kv_head_ratio * bytes_per_element)

  3. Decide: An operation has arithmetic intensity of 3 FLOPs/byte. On an A100 (ridge ~156 FLOPs/byte), is it memory-bound or compute-bound? What does this imply about optimization strategy?

  4. Diagnose: Training OOMs at backward. Your estimate shows 35 GB on a 40 GB card. Name two likely causes and the first fix to try.

Research Hooks

The memory wall is an active research area:

  • Mixture of Experts (MoE): Activate a subset of parameters per token. More parameters, similar bandwidth.
  • Linear attention: Replace O(N^2) attention with O(N) alternatives. Accuracy tradeoffs exist.
  • Quantization-aware training: Train models that work well at int8/int4. Inference memory drops 4-8x.
  • Hardware evolution: New HBM generations increase bandwidth; interconnects change how we think about memory per GPU.

Next up: We will see how gradients flow through computation graphs, and why understanding this unlocks optimization techniques.