Lesson 13: Graph & Kernel Optimization — Fusion, CUDA Graphs, and TensorRT

Every GPU operation pays a fixed kernel-launch overhead, and each unfused operation writes its result to memory only for the next operation to read it back. A model made of hundreds of small ops can spend more time on this overhead than on useful computation. The fix: fuse operations, capture and replay the graph (CUDA Graphs / torch.compile), and finally compile to a tuned engine wit

Running 100 separate operations is like buying 100 items with 100 separate trips to the store. Fusion is doing it all in one trip. TensorRT is planning the optimal route through the whole store in advance.
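
To put rough numbers on the "100 trips versus one trip" picture, here is a toy cost model. It is a sketch only: the 5-microsecond launch overhead, the 1 TB/s memory bandwidth, and the helper names (estimated_time, chain_costs) are assumptions made up for illustration, not measurements of any particular GPU, and compute time is ignored entirely.

```python
# Toy cost model. Every constant here is an illustrative assumption,
# not a measurement of real hardware.
LAUNCH_OVERHEAD_S = 5e-6   # assumed fixed cost per kernel launch
BANDWIDTH_BYTES_S = 1e12   # assumed memory bandwidth (1 TB/s)


def estimated_time(num_kernels, bytes_moved):
    """Launch cost plus memory-traffic cost; compute time is ignored."""
    return num_kernels * LAUNCH_OVERHEAD_S + bytes_moved / BANDWIDTH_BYTES_S


def chain_costs(num_ops, tensor_bytes):
    """Estimate a chain of elementwise ops run one by one vs. fused."""
    # Unfused: every op launches its own kernel, reads its input from
    # memory, and writes its output back to memory.
    unfused = estimated_time(num_ops, 2 * tensor_bytes * num_ops)
    # Fused: one launch, one read of the input, one write of the result;
    # intermediates never leave the kernel.
    fused = estimated_time(1, 2 * tensor_bytes)
    return unfused, fused


# 100 ops over a small tensor of 1024 float32 values (4 KiB).
unfused, fused = chain_costs(num_ops=100, tensor_bytes=4 * 1024)
print(f"unfused: {unfused * 1e6:.1f} us, fused: {fused * 1e6:.1f} us")
```

Under these assumed numbers the launch term dominates for small tensors, which is the case the lesson is describing. For large tensors the memory-traffic term grows instead, and fusion helps mainly by removing the intermediate reads and writes.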

Kernel fusion: Merging several operations into one kernel — fewer launches and less writing/reading of intermediate memory.
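torch.compile / CUDA Graphs: CUDA Graphs record a sequence of kernel launches once and replay the whole sequence with a single launch, removing most per-kernel launch overhead. torch.compile traces the model and generates fused kernels; its mode="reduce-overhead" option additionally uses CUDA Graphs.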
TensorRT: Compiles a trained model into an engine tuned for the target GPU, combining layer fusion, kernel selection, reduced precision (FP16/INT8), and graph capture.
KV-cache: In LLMs, store past tokens' keys/values so they aren't recomputed for each new token.
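
The KV-cache entry can be illustrated with a small pure-Python sketch. Everything here is a simplification for illustration: the scalar "weights" W_Q, W_K, W_V stand in for learned projection matrices, the function names are hypothetical, and the usual 1/sqrt(d) scaling and multiple heads are left out. The point is only that caching keys and values yields the same attention outputs while computing each token's key/value once.

```python
import math

# Hypothetical scalar "weights" standing in for learned projection matrices.
W_Q, W_K, W_V = 0.5, 1.5, 2.0


def project(x, w):
    """Toy projection: scale every component of the token vector."""
    return [xi * w for xi in x]


def attend(query, keys, values):
    """Softmax attention of one query over all keys and values."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def decode_without_cache(tokens):
    """Recompute keys/values for the whole prefix at every step."""
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        prefix = tokens[: t + 1]
        keys = [project(x, W_K) for x in prefix]
        values = [project(x, W_V) for x in prefix]
        kv_projections += len(prefix)
        outputs.append(attend(project(tokens[t], W_Q), keys, values))
    return outputs, kv_projections


def decode_with_cache(tokens):
    """Compute each token's key/value once and keep them in a cache."""
    outputs, kv_projections = [], 0
    cached_keys, cached_values = [], []
    for x in tokens:
        cached_keys.append(project(x, W_K))
        cached_values.append(project(x, W_V))
        kv_projections += 1
        outputs.append(attend(project(x, W_Q), cached_keys, cached_values))
    return outputs, kv_projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_count = decode_without_cache(tokens)
fast_out, fast_count = decode_with_cache(tokens)
print(slow_count, fast_count)  # prints "10 4"
```

For n tokens the uncached loop performs 1 + 2 + ... + n key/value projections while the cached loop performs n, so the saving grows quadratically with sequence length. The trade-off is the memory needed to hold the cache.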