Bandwidth & Profiling
Estimated reading time: 18 minutes
Most operations are memory-bound, not compute-bound. This lesson teaches you to profile and identify the true bottleneck.
Modern GPUs have massive compute throughput but limited memory bandwidth. Understanding this mismatch is key to optimization.
Learning Progression (Easy -> Hard)#
Use this sequence as you read:
- Start with Arithmetic Intensity to build core intuition and shared vocabulary.
- Move to The Roofline Model to understand the mechanism behind the intuition.
- Apply the idea in Kernel Fusion: The Solution with concrete examples or implementation details.
- Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
- Map the concept to production constraints to understand how teams make practical tradeoffs.
- Finish with research extensions to connect today’s mental model to open problems.
Arithmetic Intensity#
Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.
Arithmetic intensity = FLOPs performed / bytes moved (loads and stores)
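To make the ratio concrete, here is a minimal sketch comparing an elementwise add against a large matmul. The sizes and the simple byte accounting (count each operand read once, each output written once, ignore caching) are illustrative assumptions, not a precise model of any particular GPU.

```python
def arithmetic_intensity(flops, bytes_moved):
    """Arithmetic intensity = FLOPs performed / bytes moved to and from memory."""
    return flops / bytes_moved

# Elementwise add of two fp32 vectors of length n:
# n FLOPs; 3n * 4 bytes moved (read a, read b, write c).
n = 1_000_000
add_ai = arithmetic_intensity(n, 3 * n * 4)
print(f"elementwise add: {add_ai:.3f} FLOPs/byte")  # ~0.083 -> memory-bound

# Square fp32 matmul of size m x m:
# 2m^3 FLOPs; 3m^2 * 4 bytes moved (read A, read B, write C).
m = 4096
matmul_ai = arithmetic_intensity(2 * m**3, 3 * m**2 * 4)
print(f"matmul:          {matmul_ai:.1f} FLOPs/byte")  # ~683 -> compute-bound
```

The gap of four orders of magnitude is the whole story: the matmul does thousands of FLOPs per byte it touches, while the elementwise add does less than one.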
Instructor Lens#
The Roofline Model#
Flow bridge: Building on Arithmetic Intensity, this section adds the next layer of conceptual depth.
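The roofline model reduces to one line of code: attainable performance is the minimum of peak compute and arithmetic intensity times peak bandwidth. The sketch below uses a hypothetical GPU with 100 TFLOP/s peak compute and 2 TB/s memory bandwidth; substitute your own hardware's numbers.

```python
def attainable_tflops(ai, peak_tflops, peak_bw_tb_s):
    """Roofline: perf is capped by either the compute roof or the bandwidth slope."""
    return min(peak_tflops, ai * peak_bw_tb_s)

# Hypothetical GPU: 100 TFLOP/s peak compute, 2 TB/s HBM bandwidth (assumed values).
PEAK, BW = 100.0, 2.0
ridge_point = PEAK / BW  # AI (FLOPs/byte) where a kernel stops being memory-bound: 50

print(attainable_tflops(0.083, PEAK, BW))  # elementwise add: ~0.17 TFLOP/s
print(attainable_tflops(683.0, PEAK, BW))  # large matmul: capped at 100 TFLOP/s
```

Any kernel whose arithmetic intensity sits left of the ridge point is bandwidth-limited no matter how clever its math; the only fix is moving fewer bytes.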
Kernel Fusion: The Solution#
Flow bridge: Building on The Roofline Model, this section adds the next layer of conceptual depth.
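Fusion helps because intermediates stay in registers instead of round-tripping through HBM. A back-of-the-envelope sketch for `y = relu(x * w + b)` over fp32 vectors (assumed sizes; the accounting counts each HBM read and write once):

```python
n = 1_000_000
bytes_per = 4  # fp32

# Unfused: each kernel reads its inputs from and writes its output to HBM.
#   t1 = x * w     -> read x, w; write t1   : 3n elements
#   t2 = t1 + b    -> read t1, b; write t2  : 3n elements
#   y  = relu(t2)  -> read t2; write y      : 2n elements
unfused_bytes = (3 + 3 + 2) * n * bytes_per

# Fused: one kernel reads x, w, b once and writes y once;
# t1 and t2 never leave registers.
fused_bytes = 4 * n * bytes_per

print(unfused_bytes / fused_bytes)  # 2.0 -> half the traffic, ~2x faster if memory-bound
```

Because all three ops are deep in the memory-bound region of the roofline, halving the bytes moved roughly halves the runtime.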
Scale Thought Experiment#
Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.
| Scale | What Breaks | Mitigation |
|---|---|---|
| Single GPU | Memory bandwidth for activations | Kernel fusion, FlashAttention |
| Multi-GPU (data parallel) | Communication bandwidth | Gradient compression, overlap |
| Multi-GPU (tensor parallel) | All-reduce synchronization | Fast interconnects (NVLink) |
| Multi-node | Network bandwidth | Pipeline parallelism |
Production Reality#
Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.
FlashAttention:
- Fuses QK matmul → softmax → V matmul
- Never materializes N×N attention matrix
- 2-4x speedup on long sequences
- Memory: O(N) instead of O(N²)
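The O(N) versus O(N²) claim is easy to check with rough accounting. The sketch below is a simplification: it counts only the score matrix for standard attention versus an O(N·d) working set (output rows plus per-row softmax statistics) for a Flash-style kernel, with assumed sequence length, head dimension, and fp16 storage.

```python
def attn_memory_bytes(n, d, bytes_per=2):
    """Rough memory accounting for one attention head (fp16).

    standard: materializes the full N x N score matrix in HBM.
    flash:    keeps only O(N * d) output rows and running softmax stats.
    """
    standard = n * n * bytes_per
    flash = n * d * bytes_per
    return standard, flash

std, flash = attn_memory_bytes(n=8192, d=64)
print(std / 2**20, flash / 2**20)  # 128.0 MiB vs 1.0 MiB per head
```

At sequence length 8192 the score matrix alone is 128 MiB per head; across dozens of heads and layers, never materializing it is the difference between fitting and not fitting in memory.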
Triton:
- Write custom fused kernels in Python
- Compiler generates optimized CUDA
- Used for research-grade optimizations
Teacher Walkthrough: Profiling Without Lying to Yourself#
Flow bridge: Convert roofline intuition into a repeatable workflow you can use on any model.
When performance is poor, the fastest path to a correct fix is:
1. Start with one kernel and one metric. Pick the slowest kernel in the trace and determine whether it is memory- or compute-bound via arithmetic intensity.
2. Validate the bottleneck with counters. If you think you are memory-bound, look for high memory throughput and low math utilization. If you think you are compute-bound, confirm the opposite.
3. Change the data movement before changing the math. Most wins come from fewer reads and writes: fusion, better layouts, fewer intermediate tensors.
4. Only then tweak the algorithm. Algorithm changes are high risk and can easily regress correctness; fusion and layout changes are usually safer first.
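Step 2 of this workflow, validating the suspected bottleneck with counters, can be sketched as a small decision rule. The utilization threshold and the GPU peaks below are illustrative assumptions; real profilers (e.g. Nsight Compute) report these utilizations directly.

```python
def classify_bottleneck(measured_tflops, measured_bw_tb_s,
                        peak_tflops, peak_bw_tb_s, threshold=0.6):
    """Cross-check a suspected bottleneck against hardware counters.

    A kernel near peak bandwidth but far from peak math is memory-bound,
    and vice versa. The 0.6 threshold is illustrative, not universal.
    """
    math_util = measured_tflops / peak_tflops
    bw_util = measured_bw_tb_s / peak_bw_tb_s
    if bw_util >= threshold and math_util < threshold:
        return "memory-bound"
    if math_util >= threshold and bw_util < threshold:
        return "compute-bound"
    return "inconclusive: profile further"

# Hypothetical counters for a fused elementwise kernel
# on an assumed 100 TFLOP/s, 2 TB/s GPU:
print(classify_bottleneck(0.15, 1.8, 100.0, 2.0))  # memory-bound
```

The "inconclusive" branch matters: a kernel far from both roofs is usually limited by something else entirely (latency, occupancy, launch overhead), which is exactly the case where profiling deeper beats guessing.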
Checkpoint Questions#
Use these to verify understanding before moving on:
- Can you do this without notes: Calculate arithmetic intensity to predict whether an operation is memory- or compute-bound?
- Can you do this without notes: Interpret roofline model diagrams for GPU workloads?
- Can you do this without notes: Use profiling tools to identify actual bottlenecks?
Research Hooks#
Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.
Papers:
- "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022) — The key innovation in modern transformers
- "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams et al., 2009) — The foundational roofline model
Open Questions:
- Can we automatically fuse arbitrary computation graphs?
- How do we optimize for heterogeneous hardware (CPU/GPU/TPU)?
Next up: The capstone lesson—a systematic debugging flowchart that integrates everything you've learned.