Bandwidth & Profiling
Estimated reading time: 18 minutes
Build the mental models that separate research engineers from ML practitioners.
In this tutorial, you will calculate arithmetic intensity for common transformer operations, use the roofline model to classify each as memory-bound or compute-bound, and estimate the speedup from kernel fusion.
By the end you will be able to:
Arithmetic intensity = FLOPs / bytes transferred to and from memory (loads and stores)
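The definition above can be turned into a small roofline calculator. The following is a minimal sketch, not a measurement tool: the `Gpu` class, the helper names, and the hardware numbers are assumptions chosen so that the ridge point lands near the ~295 FLOPs/byte figure used later in this lesson. It also assumes ideal traffic, where every operand is read once and every result is written once, and it counts stores as well as loads in the byte total.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gpu:
    """Hypothetical hardware description for roofline estimates."""
    peak_flops: float      # peak math throughput, FLOP/s
    mem_bandwidth: float   # peak memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        # Intensity (FLOPs/byte) at which the memory roof meets the compute roof.
        return self.peak_flops / self.mem_bandwidth


# Assumed round numbers, roughly an H100-class part in dense FP16.
H100_ASSUMED = Gpu(peak_flops=989e12, mem_bandwidth=3.35e12)


def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    return flops / bytes_moved


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    # C[m, n] = A[m, k] @ B[k, n]: 2*m*n*k FLOPs (one multiply and one add each).
    flops = 2 * m * n * k
    # Ideal traffic: read A and B once, write C once.
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return arithmetic_intensity(flops, bytes_moved)


def elementwise_intensity(flops_per_elem: float, bytes_per_elem: int = 2) -> float:
    # One read and one write per element.
    return flops_per_elem / (2 * bytes_per_elem)


def classify(intensity: float, gpu: Gpu) -> str:
    return "compute-bound" if intensity >= gpu.ridge_point else "memory-bound"


def attainable_flops(intensity: float, gpu: Gpu) -> float:
    # Roofline: performance is capped by whichever roof is lower.
    return min(gpu.peak_flops, intensity * gpu.mem_bandwidth)


if __name__ == "__main__":
    ops = {
        "matmul 4096x4096x4096": matmul_intensity(4096, 4096, 4096),
        "elementwise, 4 FLOPs/elem": elementwise_intensity(4),
    }
    for name, ai in ops.items():
        frac = attainable_flops(ai, H100_ASSUMED) / H100_ASSUMED.peak_flops
        print(f"{name}: AI={ai:.1f} FLOPs/byte, {classify(ai, H100_ASSUMED)}, "
              f"at most {frac:.1%} of peak")
```

Real kernels usually move more bytes than this ideal count (cache misses, re-reads of tiles), so measured intensity tends to be lower than the estimate.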
| Scale | What Breaks | Mitigation |
|---|---|---|
| Single GPU | Memory bandwidth for activations | Kernel fusion, FlashAttention |
| Multi-GPU (data parallel) | Communication bandwidth | Gradient compression, overlap |
| Multi-GPU (tensor parallel) | All-reduce synchronization | Fast interconnects (NVLink) |
| Multi-node | Network bandwidth | Pipeline parallelism |
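To get a feel for why communication bandwidth dominates in the data-parallel rows, here is a rough bandwidth-only sketch of a ring all-reduce over gradients. It relies on the standard result that each GPU sends about 2(N-1)/N times the payload in a ring all-reduce. The model size, link bandwidths, and function names are illustrative assumptions, and latency and compute overlap are ignored.

```python
def ring_allreduce_bytes_per_gpu(payload_bytes: float, num_gpus: int) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    if num_gpus < 2:
        return 0.0
    return 2 * (num_gpus - 1) / num_gpus * payload_bytes


def allreduce_seconds(payload_bytes: float, num_gpus: int,
                      link_bandwidth: float) -> float:
    """Bandwidth-only lower bound on all-reduce time; latency is ignored."""
    return ring_allreduce_bytes_per_gpu(payload_bytes, num_gpus) / link_bandwidth


if __name__ == "__main__":
    params = 7e9               # hypothetical model size
    grad_bytes = params * 2    # FP16 gradients, 2 bytes each
    links = {
        "assumed intra-node link (450 GB/s)": 450e9,
        "assumed inter-node network (50 GB/s)": 50e9,
    }
    for name, bandwidth in links.items():
        t = allreduce_seconds(grad_bytes, num_gpus=8, link_bandwidth=bandwidth)
        print(f"{name}: at least {t * 1e3:.0f} ms per gradient all-reduce")
```

Comparing this lower bound with the time of one backward pass indicates whether overlap alone can hide the communication or whether the payload itself has to shrink.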
FlashAttention:
Triton:
When performance is poor, the fastest path to a correct fix is:
1. Start with one kernel and one metric. Pick the slowest kernel in the trace and determine whether it is memory- or compute-bound via arithmetic intensity.
2. Validate the bottleneck with counters. If you think you are memory-bound, look for high memory throughput and low math utilization. If you think you are compute-bound, confirm the opposite.
3. Change the data movement before changing the math. Most wins come from fewer reads and writes: fusion, better layouts, fewer intermediate tensors.
4. Only then tweak the algorithm. Algorithm changes are high risk and can easily regress correctness; fusion and layout are usually safer first.
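The "change the data movement first" step can be estimated before writing any kernel. The sketch below assumes a chain of memory-bound elementwise operations in which every unfused op reads its full input and writes its full output, while the fused kernel reads once and writes once. The function names, the example shape, and the bandwidth figure are hypothetical, and the resulting speedup is an upper bound that holds only while the fused kernel stays memory-bound.

```python
from math import prod


def tensor_bytes(shape: tuple, bytes_per_elem: int = 2) -> int:
    """Size of a dense tensor in bytes (FP16 by default)."""
    return prod(shape) * bytes_per_elem


def unfused_traffic(shape: tuple, num_ops: int, bytes_per_elem: int = 2) -> int:
    # Each op reads the whole tensor and writes the whole tensor.
    return 2 * num_ops * tensor_bytes(shape, bytes_per_elem)


def fused_traffic(shape: tuple, bytes_per_elem: int = 2) -> int:
    # One read of the input, one write of the final output.
    return 2 * tensor_bytes(shape, bytes_per_elem)


def fusion_speedup_bound(shape: tuple, num_ops: int) -> float:
    # For memory-bound kernels, time scales with bytes moved.
    return unfused_traffic(shape, num_ops) / fused_traffic(shape)


if __name__ == "__main__":
    shape = (8, 1024, 4096)    # hypothetical activation shape
    num_ops = 3                # e.g. a bias add, an activation, and a scale
    bandwidth = 3.35e12        # assumed memory bandwidth, bytes/s
    before = unfused_traffic(shape, num_ops)
    after = fused_traffic(shape)
    print(f"unfused: {before / 1e6:.0f} MB, fused: {after / 1e6:.0f} MB")
    print(f"time at assumed bandwidth: {before / bandwidth * 1e6:.0f} us "
          f"-> {after / bandwidth * 1e6:.0f} us")
    print(f"speedup upper bound: {fusion_speedup_bound(shape, num_ops):.1f}x")
```

Under these assumptions the bound equals the number of fused ops, which is why removing intermediate tensors tends to pay off before any change to the math.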
Each question requires calculation or diagnosis, not just recall.
1. An operation on a tensor of shape [32, 2048, 4096] in FP16 requires approximately 10 FLOPs per element. Calculate the arithmetic intensity in FLOPs/byte, compare it to the H100 FP16 ridge point (~295), and compute the theoretical GPU utilization percentage.
2. An unfused sequence of operations reads and writes a tensor of shape [32, 2048, 4096] in FP16. Calculate the total extra bytes transferred compared to a fused version that reads once and writes once. Express the wasted bandwidth in GB.

Papers:
Open Questions:
Next up: The capstone lesson—a systematic debugging flowchart that integrates everything you've learned.