Bandwidth & Profiling

Estimated reading time: 18 minutes

Most deep-learning operations are memory-bound, not compute-bound. This lesson shows how to profile a workload and identify the true bottleneck.

Modern GPUs have massive compute throughput but limited memory bandwidth. Understanding this mismatch is key to optimization.

Learning Progression (Easy -> Hard)#

Use this sequence as you read:

  1. Start with Arithmetic Intensity to build core intuition and shared vocabulary.
  2. Move to The Roofline Model to understand the mechanism behind the intuition.
  3. Apply the idea in Kernel Fusion: The Solution with concrete examples or implementation details.
  4. Then zoom out to scale-level tradeoffs so the same concept holds at larger model and system sizes.
  5. Map the concept to production constraints to understand how teams make practical tradeoffs.
  6. Finish with research extensions to connect today’s mental model to open problems.

Arithmetic Intensity#

Flow bridge: Start here; this section establishes the base mental model for the rest of the lesson.

Arithmetic intensity = FLOPs performed / bytes moved (reads + writes)

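As a minimal sketch of the formula above (assuming fp32, so 4 bytes per element, and counting only main-memory traffic), compare an elementwise add with a square matmul:

```python
# Arithmetic intensity = FLOPs performed / bytes moved.
# Illustrative fp32 accounting: 4 bytes per element, ideal caching assumed.

def elementwise_add_intensity(n: int) -> float:
    """c = a + b over n elements: n FLOPs, 3n fp32 elements of traffic."""
    flops = n
    bytes_moved = 3 * n * 4  # read a, read b, write c
    return flops / bytes_moved

def matmul_intensity(n: int) -> float:
    """n x n matmul: 2n^3 FLOPs, three n x n fp32 tensors of traffic."""
    flops = 2 * n**3
    bytes_moved = 3 * n * n * 4  # read A, read B, write C (ideal reuse)
    return flops / bytes_moved

print(f"add (any n):     {elementwise_add_intensity(1024):.3f} FLOPs/byte")
print(f"matmul (n=1024): {matmul_intensity(1024):.1f} FLOPs/byte")
```

The add stays at about 0.08 FLOPs/byte no matter how large `n` gets, while matmul intensity grows with `n` — which is why elementwise ops are almost always memory-bound and large matmuls can be compute-bound.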

The Roofline Model#

Flow bridge: Building on Arithmetic Intensity, this section adds the next layer of conceptual depth.

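The roofline model says attainable performance is the minimum of the compute roof and the bandwidth slope. A minimal sketch, using illustrative peak numbers loosely based on an A100 (these are assumptions, not measured values):

```python
def attainable_flops(intensity: float, peak_flops: float, peak_bw: float) -> float:
    """Roofline: performance is capped by compute or by memory bandwidth,
    whichever binds first at this arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, peak_bw * intensity)

# Assumed hardware peaks (roughly A100-class, fp32): adjust for your GPU.
PEAK_FLOPS = 19.5e12  # 19.5 TFLOP/s
PEAK_BW = 1.5e12      # 1.5 TB/s HBM bandwidth

# The "ridge point": the intensity where the two roofs meet.
ridge = PEAK_FLOPS / PEAK_BW

for ai in [0.08, 1.0, ridge, 100.0]:
    bound = "memory" if ai < ridge else "compute"
    tflops = attainable_flops(ai, PEAK_FLOPS, PEAK_BW) / 1e12
    print(f"AI = {ai:6.2f} FLOPs/byte -> {tflops:5.2f} TFLOP/s ({bound}-bound)")
```

Any kernel whose intensity sits left of the ridge point (here, 13 FLOPs/byte) cannot reach peak compute no matter how well it is tuned — only reducing memory traffic helps.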

Kernel Fusion: The Solution#

Flow bridge: Building on The Roofline Model, this section adds the next layer of conceptual depth.

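Fusion removes the intermediate reads and writes between back-to-back elementwise ops. A toy sketch in plain Python (lists stand in for GPU tensors; the traffic counts in the comments are the point, not the speed):

```python
def unfused(x, scale, bias):
    """Three separate 'kernels': each one reads a full input and
    writes a full intermediate back to memory."""
    t1 = [v * scale for v in x]        # read x,  write t1
    t2 = [v + bias for v in t1]        # read t1, write t2
    return [max(v, 0.0) for v in t2]   # read t2, write out

def fused(x, scale, bias):
    """One 'kernel': each element is loaded once and stored once;
    t1 and t2 never touch memory."""
    return [max(v * scale + bias, 0.0) for v in x]

x = [-2.0, -1.0, 0.5, 3.0]
assert unfused(x, 2.0, 1.0) == fused(x, 2.0, 1.0)
# Unfused: 6 full passes over the data. Fused: 2. Same math, ~3x less traffic.
```

Because all three ops are memory-bound, cutting traffic by ~3x cuts runtime by roughly the same factor on real hardware — the FLOPs were never the limit.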

Step 1: Identify the bottleneck. Use arithmetic intensity to determine if you're memory-bound or compute-bound. Most operations are memory-bound!

Scale Thought Experiment#

Flow bridge: With the local mechanism in place, extend it to larger model, context, and system scales.

| Scale | What Breaks | Mitigation |
| --- | --- | --- |
| Single GPU | Memory bandwidth for activations | Kernel fusion, FlashAttention |
| Multi-GPU (data parallel) | Communication bandwidth | Gradient compression, overlap |
| Multi-GPU (tensor parallel) | All-reduce synchronization | Fast interconnects (NVLink) |
| Multi-node | Network bandwidth | Pipeline parallelism |

Production Reality#

Flow bridge: Carry these tradeoffs into production constraints and team-level operating decisions.

FlashAttention:

  • Fuses QK matmul → softmax → V matmul
  • Never materializes N×N attention matrix
  • 2-4x speedup on long sequences
  • Memory: O(N) instead of O(N²)
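The trick behind never materializing the N×N matrix is the online (streaming) softmax: a running max, normalizer, and weighted sum are enough to get the exact answer chunk by chunk. A 1-D sketch of that idea (the function name and chunking are illustrative, not FlashAttention's actual API):

```python
import math

def streaming_softmax_weighted_sum(scores, values, chunk=2):
    """Exact softmax-weighted sum computed in chunks, keeping only O(1)
    running state -- the core FlashAttention idea, reduced to one row."""
    m = float("-inf")  # running max (numerical stability)
    z = 0.0            # running sum of exp(score - m)
    acc = 0.0          # running weighted sum of values
    for i in range(0, len(scores), chunk):
        s, v = scores[i:i + chunk], values[i:i + chunk]
        m_new = max(m, max(s))
        # Rescale previous partial sums when the running max changes.
        corr = math.exp(m - m_new) if m != float("-inf") else 0.0
        z = z * corr + sum(math.exp(x - m_new) for x in s)
        acc = acc * corr + sum(math.exp(x - m_new) * val for x, val in zip(s, v))
        m = m_new
    return acc / z

scores = [1.0, 3.0, 0.5, 2.0]
values = [10.0, 20.0, 30.0, 40.0]
naive = sum(math.exp(s) * v for s, v in zip(scores, values)) \
        / sum(math.exp(s) for s in scores)
assert abs(streaming_softmax_weighted_sum(scores, values) - naive) < 1e-9
```

FlashAttention applies this per query row with tiled K/V blocks kept in fast on-chip SRAM, which is where the O(N) memory and the speedup come from.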

Triton:

  • Write custom fused kernels in Python
  • Compiler generates optimized CUDA
  • Used for research-grade optimizations

Teacher Walkthrough: Profiling Without Lying to Yourself#

Flow bridge: Convert roofline intuition into a repeatable workflow you can use on any model.

When performance is poor, the fastest path to a correct fix is:

  1. Start with one kernel and one metric. Pick the slowest kernel in the trace and determine whether it is memory- or compute-bound via arithmetic intensity.

  2. Validate the bottleneck with counters. If you think you are memory-bound, look for high memory throughput and low math utilization. If you think you are compute-bound, confirm the opposite.

  3. Change the data movement before changing the math. Most wins come from fewer reads and writes: fusion, better layouts, fewer intermediate tensors.

  4. Only then tweak the algorithm. Algorithm changes are high risk and can easily regress correctness; fusion and layout are usually safer first.
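Step 2 above can be approximated even without GPU counters: measure effective bandwidth and compare it against the hardware peak. A rough stdlib-only probe (numbers are illustrative; real profiling should use hardware counters via tools like Nsight Compute or `perf`, but the measure-then-compare workflow is the same):

```python
import time

def measured_copy_bandwidth_gbs(n_bytes: int = 64 * 1024 * 1024,
                                trials: int = 3) -> float:
    """Time a large buffer copy and report effective GB/s.
    Takes the best of several trials to reduce timer noise."""
    src = bytearray(n_bytes)
    best = float("inf")
    for _ in range(trials):
        t0 = time.perf_counter()
        dst = bytes(src)  # one read pass + one write pass over the buffer
        best = min(best, time.perf_counter() - t0)
    assert len(dst) == n_bytes
    return (2 * n_bytes) / best / 1e9  # count both read and write traffic

print(f"~{measured_copy_bandwidth_gbs():.1f} GB/s effective copy bandwidth")
```

If a kernel's achieved bandwidth is already near this kind of measured ceiling, it is memory-bound, and the fix is less traffic (step 3), not faster math.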

Checkpoint Questions#

Use these to verify understanding before moving on:

  1. Can you do this without notes: Calculate arithmetic intensity to predict memory vs compute bound operations?
  2. Can you do this without notes: Interpret roofline model diagrams for GPU workloads?
  3. Can you do this without notes: Use profiling tools to identify actual bottlenecks?

Research Hooks#

Flow bridge: Use this practical baseline to frame the open research questions that remain unresolved.

Papers:

  1. "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" (Dao et al., 2022) — The key innovation in modern transformers
  2. "Roofline: An Insightful Visual Performance Model for Multicore Architectures" (Williams et al., 2009) — The foundational roofline model

Open Questions:

  • Can we automatically fuse arbitrary computation graphs?
  • How do we optimize for heterogeneous hardware (CPU/GPU/TPU)?

Next up: The capstone lesson—a systematic debugging flowchart that integrates everything you've learned.