
Track 0: Foundations

Build the mental models that separate research engineers from ML practitioners.

Memory & Compute
  • The Memory Wall (15m)
  • Gradient Flow Under Pressure (18m)

Optimizers
  • SGD & Momentum (15m)
  • Adam, Warmup & Scheduling (18m)

Gradient Mechanics
  • Backprop as Graph Transformation (20m)
  • Initialization & Residual Connections (18m)
  • Scaling Laws & μ-Transfer (20m)

Systems Thinking
  • Bandwidth & Profiling (18m)
  • The Debugging Flowchart (22m)



Bandwidth & Profiling

Estimated reading time: 18 minutes


In this tutorial, you will calculate arithmetic intensity for common transformer operations, use the roofline model to classify each as memory-bound or compute-bound, and estimate the speedup from kernel fusion.

By the end you will be able to:

  • Compute the ridge point for a given GPU and determine which operations fall below it
  • Estimate the time saved by fusing element-wise operations with matrix multiplies
  • Apply a systematic profiling workflow: identify bottleneck first, then choose the right optimization
💡

Core Idea

The H100 delivers ~990 TFLOPS (FP16 Tensor Core) and 3.35 TB/s of memory bandwidth — a ratio of roughly 295 FLOPs per byte. (The often-cited 1,979 TFLOPS is the FP8 peak, which gives a ridge of ~590, but most transformer training uses FP16/BF16.) Any operation with arithmetic intensity below this ridge point is memory-bound and cannot use more than a fraction of available compute. Most transformer operations (LayerNorm, softmax, activation functions) fall well below this threshold.

Arithmetic Intensity

Arithmetic intensity = FLOPs performed / bytes moved (loads and stores combined)

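A minimal sketch of the calculation. The per-element FLOP count for GELU and the matrix sizes are illustrative assumptions; the H100 figures come from the core idea above.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """Arithmetic intensity = FLOPs performed / bytes moved to and from HBM."""
    return flops / bytes_moved

# GELU on a [32, 2048, 4096] FP16 tensor: ~10 FLOPs/element (assumption),
# 2 bytes read + 2 bytes written per element.
n = 32 * 2048 * 4096
gelu_ai = arithmetic_intensity(10 * n, 4 * n)             # 2.5 FLOPs/byte

# Square matmul C = A @ B with N = 4096 in FP16:
# 2*N^3 FLOPs, 3 tensors of N^2 elements * 2 bytes moved.
N = 4096
matmul_ai = arithmetic_intensity(2 * N**3, 3 * N**2 * 2)  # ~1365 FLOPs/byte

ridge = 990e12 / 3.35e12  # H100 FP16 ridge point, ~295 FLOPs/byte
for name, ai in [("GELU", gelu_ai), ("MatMul", matmul_ai)]:
    kind = "memory" if ai < ridge else "compute"
    print(f"{name}: {ai:.1f} FLOPs/byte -> {kind}-bound")
```

Note how far apart the two operations land: the matmul exceeds the ridge point by almost 5x, while GELU sits two orders of magnitude below it.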

The Roofline Model

💡

Reading the Roofline

The roofline shows max achievable performance:

  • Left of ridge point: sloped line (memory-limited)
  • Right of ridge point: flat line (compute-limited)

Most operations live on the sloped part!

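The roofline itself is a one-line formula. This sketch evaluates it for the two operations above; the arithmetic intensities plugged in are the illustrative values from the previous example.

```python
def roofline(ai: float, peak_flops: float, peak_bw: float) -> float:
    """Attainable FLOP/s under the roofline model: min(peak compute, AI * bandwidth)."""
    return min(peak_flops, ai * peak_bw)

PEAK_FP16 = 990e12  # H100 FP16 Tensor Core peak, FLOP/s
PEAK_BW = 3.35e12   # HBM bandwidth, bytes/s

# GELU (AI ~ 2.5) sits on the sloped, memory-limited part of the roofline;
# a large matmul (AI ~ 1365) sits on the flat, compute-limited part.
for name, ai in [("GELU", 2.5), ("4096^3 MatMul", 4096 / 3)]:
    attainable = roofline(ai, PEAK_FP16, PEAK_BW)
    print(f"{name}: {attainable / 1e12:.1f} TFLOP/s "
          f"({attainable / PEAK_FP16:.1%} of peak)")
```

A memory-bound GELU tops out below 1% of peak compute no matter how efficient the arithmetic is; only raising its arithmetic intensity (fusion) moves it up the slope.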

Kernel Fusion: The Solution

⚠️

Why Fused Kernels Matter

Naive implementation of y = dropout(gelu(x @ W)):

  1. Load x, W → compute matmul → store result
  2. Load result → compute GELU → store
  3. Load result → apply dropout → store

Three memory round-trips!

Fused implementation: Load once → matmul → GELU → dropout → store once.

FlashAttention fuses: matmul → softmax → matmul into one kernel.

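The round-trip accounting above can be quantified directly. This sketch counts HBM traffic for the intermediate tensor only (inputs x and W must be read in either case); the tensor shape is an illustrative assumption.

```python
# Byte accounting for y = dropout(gelu(x @ W)) on a [32, 2048, 4096]
# FP16 intermediate tensor (shape is an illustrative assumption).

ELEM = 32 * 2048 * 4096  # elements in the intermediate tensor
BYTES = 2                # FP16

# Unfused: matmul writes y, GELU reads + writes y, dropout reads + writes y.
unfused_traffic = (1 + 2 + 2) * ELEM * BYTES  # 5 tensor-sized transfers
# Fused: GELU and dropout run in the matmul epilogue, so y is written once.
fused_traffic = 1 * ELEM * BYTES

saved_gb = (unfused_traffic - fused_traffic) / 1e9
print(f"HBM traffic eliminated by fusion: {saved_gb:.2f} GB per forward pass")
```

At 3.35 TB/s, those ~2 GB of avoided transfers are roughly 0.6 ms saved on a single fused op, every step.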

Step 1: Identify the bottleneck. Use arithmetic intensity to determine if you're memory-bound or compute-bound. Most operations are memory-bound!

Scale Thought Experiment

| Scale | What Breaks | Mitigation |
| --- | --- | --- |
| Single GPU | Memory bandwidth for activations | Kernel fusion, FlashAttention |
| Multi-GPU (data parallel) | Communication bandwidth | Gradient compression, overlap |
| Multi-GPU (tensor parallel) | All-reduce synchronization | Fast interconnects (NVLink) |
| Multi-node | Network bandwidth | Pipeline parallelism |

Production Reality

FlashAttention:

  • Fuses QK matmul → softmax → V matmul
  • Never materializes N×N attention matrix
  • 2-4x speedup on long sequences
  • Memory: O(N) instead of O(N²)
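The O(N²) term is easy to underestimate. A quick sketch of the memory that the materialized score matrix alone would need (single attention head, FP16; sequence lengths are illustrative):

```python
# Memory for the N x N attention score matrix that FlashAttention
# never materializes. Single head, FP16; sizes are illustrative.

def score_matrix_bytes(n: int, dtype_bytes: int = 2) -> int:
    return n * n * dtype_bytes

for n in [2048, 8192, 32768]:
    print(f"N = {n:6d}: {score_matrix_bytes(n) / 1e9:.3f} GB per head")
```

At N = 32K the score matrix alone costs over 2 GB per head; multiply by heads and batch and standard attention runs out of HBM long before compute becomes the limit.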

Triton:

  • Write custom fused kernels in Python
  • Compiler generates optimized CUDA
  • Used for research-grade optimizations

Teacher Walkthrough: Profiling Without Lying to Yourself

When performance is poor, the fastest path to a correct fix is:

  1. Start with one kernel and one metric. Pick the slowest kernel in the trace and determine whether it is memory- or compute-bound via arithmetic intensity.

  2. Validate the bottleneck with counters. If you think you are memory-bound, look for high memory throughput and low math utilization. If you think you are compute-bound, confirm the opposite.

  3. Change the data movement before changing the math. Most wins come from fewer reads and writes: fusion, better layouts, fewer intermediate tensors.

  4. Only then tweak the algorithm. Algorithm changes are high risk and can easily regress correctness; fusion and layout are usually safer first.
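Steps 1 and 2 can be reduced to two counters per kernel. A minimal sketch, assuming H100 peaks; the 70% / 10% utilization thresholds are rules of thumb, not profiler output.

```python
# Validate a bottleneck hypothesis from two profiler counters:
# bytes moved and FLOPs executed over the kernel's elapsed time.
# Thresholds (0.7, 0.1) are illustrative rules of thumb.

H100_BW = 3.35e12     # bytes/s
H100_FP16 = 990e12    # FLOP/s

def diagnose(bytes_moved: float, flops: float, elapsed_s: float) -> str:
    bw_util = (bytes_moved / elapsed_s) / H100_BW
    math_util = (flops / elapsed_s) / H100_FP16
    if bw_util > 0.7 and math_util < 0.1:
        return "memory-bound: reduce data movement (fusion, layout, fewer intermediates)"
    if math_util > 0.7:
        return "compute-bound: change the algorithm or precision"
    return "neither saturated: check launch overhead or synchronization"

# A LayerNorm-like kernel: moves 2 GB in 0.7 ms while doing ~5 GFLOP.
print(diagnose(2e9, 5e9, 0.7e-3))
# A large matmul: moves 0.1 GB in 1 ms while doing ~800 GFLOP.
print(diagnose(1e8, 8e11, 1e-3))
```

If neither counter is saturated, the roofline is the wrong lens entirely; look at launch overhead and synchronization instead.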

💡

Practical Rule

If you cannot point to where bytes move in a profile, you are guessing. The roofline gives a prior, the profiler gives ground truth.

Break It: Optimization Misdirection

❌

Common Mistakes

  1. Optimizing compute for a memory-bound kernel — Adding tensor cores or mixed precision to a LayerNorm kernel has no effect because the bottleneck is memory bandwidth, not arithmetic throughput. Check arithmetic intensity before choosing an optimization strategy.

  2. Benchmarking fused kernels on tiny inputs — Kernel launch overhead (~5-10 microseconds) dominates at small sizes, masking the fusion benefit. Always benchmark at production-scale shapes (batch 32+, seq 2048+, hidden 4096+).

  3. Ignoring L2 cache effects — Small tensors that fit in L2 cache (50 MB on H100) see effective bandwidth of ~6 TB/s, nearly 2x the HBM bandwidth. A kernel that appears memory-bound at large scale may be compute-bound at small scale. Profile at your actual working set size.
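Mistake 2 can be checked with arithmetic alone. This sketch assumes a ~7 µs launch cost and an ideal bandwidth-limited element-wise kernel (one read, one write per FP16 element); measure both on your own stack before trusting the numbers.

```python
# How much of a kernel's wall time is launch overhead at different shapes.
# 7 us launch cost and ideal HBM-limited kernel time are assumptions.

LAUNCH_US = 7.0
HBM_BW = 3.35e12  # bytes/s

def elementwise_kernel_us(n_elems: int, dtype_bytes: int = 2) -> float:
    """Ideal time for an element-wise kernel: one read + one write per element."""
    return 2 * n_elems * dtype_bytes / HBM_BW * 1e6

for shape in [(8, 128, 512), (32, 2048, 4096)]:
    n = shape[0] * shape[1] * shape[2]
    t = elementwise_kernel_us(n)
    frac = LAUNCH_US / (t + LAUNCH_US)
    print(f"{shape}: kernel {t:7.1f} us, launch overhead = {frac:.0%} of total")
```

At the small shape, launch overhead is ~90% of the measurement, so any fusion benefit is invisible; at the production shape it drops to ~2%.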

Checkpoint Questions

Each question requires calculation or diagnosis, not just recall.

  1. A GELU activation on a tensor of shape [32, 2048, 4096] in FP16 requires approximately 10 FLOPs per element. Calculate the arithmetic intensity in FLOPs/byte, compare it to the H100 FP16 ridge point (~295), and compute the theoretical GPU utilization percentage.
  2. An unfused sequence of MatMul + LayerNorm + Dropout reads and writes the intermediate tensor three times. The tensor is [32, 2048, 4096] in FP16. Calculate the total extra bytes transferred compared to a fused version that reads once and writes once. Express the wasted bandwidth in GB.
  3. A profiler shows a transformer layer step takes 12 ms. The MatMul kernels account for 4 ms and achieve 85% compute utilization. The remaining 8 ms is spent on element-wise and normalization kernels at 2% compute utilization. Identify the optimization with the highest impact and estimate the achievable speedup if those kernels are fused with adjacent MatMuls.

Research Hooks

Papers:

  1. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022) — The key innovation in modern transformers
  2. "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams et al., 2009) — The foundational roofline model

Open Questions:

  • Can we automatically fuse arbitrary computation graphs?
  • How do we optimize for heterogeneous hardware (CPU/GPU/TPU)?

Next up: The capstone lesson—a systematic debugging flowchart that integrates everything you've learned.