Lesson 6: Latency vs Throughput — Two Metrics, Two Goals
An engineer says 'the model is fast.' The first question at NVIDIA: fast in what sense? Latency (how quickly one answer returns) or throughput (how many answers per second)? These are two different goals that sometimes clash — and you must know which you're targeting before touching the code.
Latency is how long until your dish arrives from the kitchen. Throughput is how many dishes the kitchen sends out per hour. You can cook many dishes together (high throughput) but then each single dish waits a bit longer (higher latency).
- Latency
- The time for a single request to return — from input to output. Measured in milliseconds. Critical for real-time UX.
- Throughput
- How many items/requests are processed per second. Measured in items/sec. Critical for bulk processing and cost.