AI Inference & GPU Performance

A beginner-friendly take on NVIDIA's AI inference path: it starts from absolute zero (what a model is, what a GPU is) and ramps up gradually. You'll learn what inference is, how to measure latency and throughput correctly (warmup, torch.cuda.synchronize(), percentiles), and how to tell whether a workload is compute-bound or memory-bound. From there you'll speed it up with batching, mixed precision (FP16/BF16 on Tensor Cores), quantization (INT8/FP8), kernel fusion, CUDA Graphs, and TensorRT, finishing with a Triton-style benchmark capstone.

Lesson 1: What Even Is AI and a Model? — Starting From Zero
Lesson 2: What Is a GPU, and Why Not Just a CPU?
Lesson 3: What Is Inference? — The Intuition, No Code
Lesson 4: Where Does the Data 'Live'? — CPU, GPU, and the Trip Between
Lesson 5: 'Performance Mode' — The Two Switches Before You Run
Lesson 6: Latency vs Throughput — Two Metrics, Two Goals
Lesson 7: Measure Correctly — Warmup, synchronize, and Percentiles
Lesson 8: The Bottleneck — Compute-bound vs Memory-bound
Lesson 9: Batching — How to Raise Throughput
Lesson 10: Dynamic Batching & Serving — Grouping Requests in Real Time
Lesson 11: Precision — FP32 → FP16/BF16 and Tensor Cores
Lesson 12: Quantization — INT8/FP8
Lesson 13: Graph & Kernel Optimization — Fusion, CUDA Graphs, and TensorRT
Lesson 14: The Benchmark Capstone — Putting It All Together
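
As a small preview of the measurement recipe named in the overview (warmup, synchronize, percentiles), here is a minimal sketch of a timing harness. It is not the course's own code: the names `benchmark` and `percentile`, the default iteration counts, and the nearest-rank percentile method are all assumptions chosen for illustration. It uses only the Python standard library so it runs anywhere; the optional `sync` argument is where a GPU synchronization call would go.

```python
import time


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an already sorted list (q in 0..100)."""
    n = len(sorted_vals)
    # Ceiling of q * n / 100, clamped to at least rank 1.
    rank = max(1, -(-q * n // 100))
    return sorted_vals[int(rank) - 1]


def benchmark(fn, warmup=10, iters=100, sync=None):
    """Time fn() after a warmup phase and report latency percentiles.

    sync is an optional zero-argument callable (hypothetical hook) that
    blocks until queued work has finished, so the timer sees real work.
    """

    def run_once():
        fn()
        if sync is not None:
            sync()

    # Warmup: these runs are executed but never timed.
    for _ in range(warmup):
        run_once()

    latencies_ms = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)

    latencies_ms.sort()
    total_s = sum(latencies_ms) / 1000.0
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        # Calls completed per second over the timed runs only.
        "calls_per_s": iters / total_s if total_s > 0 else float("inf"),
    }


if __name__ == "__main__":
    # Stand-in CPU workload; a real run would call a model here.
    print(benchmark(lambda: sum(i * i for i in range(10_000))))
```

On a GPU with PyTorch, one would typically pass `sync=torch.cuda.synchronize`, because CUDA kernels launch asynchronously and an unsynchronized timer can stop before the work is done. Reporting p95 and p99 alongside the median shows the slow tail that an average would hide.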