Lesson 12: Quantization — INT8/FP8

FP16 halved the memory. Quantization goes further: represent weights and activations as 8-bit integers (INT8), one byte each instead of FP32's four. That is 4x smaller than FP32 (2x smaller than FP16) and fast on hardware with integer arithmetic units. The cost is a small rounding error, and the secret to keeping accuracy is calibration. In this lesson we see the formula and the trade-off.

Quantization is like rounding prices to whole dollars instead of cents. You save space and compute faster, and the final result is usually almost identical, as long as you pick the right 'rounding' (scale).

INT8 quantization: Representing values as 8-bit integers (1 byte) instead of FP32 (4 bytes). 4x smaller and typically faster, at a small accuracy cost.
Scale: A factor mapping the FP32 range to the integer range. Quantize: q = round(x / scale), clamped to [-128, 127] for signed INT8; dequantize: x ≈ q * scale.
Calibration: Running representative data to measure the real value range and pick a scale that minimizes rounding error.
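
To make the scale and calibration definitions above concrete, here is a minimal sketch in plain Python. It assumes symmetric signed INT8 and the simplest calibration rule, taking the largest absolute value seen in the calibration data (often called max-abs calibration); real toolkits may use other rules such as percentiles or entropy-based methods. The function names calibrate_scale, quantize, and dequantize are hypothetical, chosen for illustration, and the Gaussian samples stand in for representative data.

```python
import random

# Signed 8-bit integer range (assumption: symmetric signed INT8).
INT8_MIN, INT8_MAX = -128, 127


def calibrate_scale(samples):
    """Pick a scale so the largest magnitude in the samples maps to 127."""
    max_abs = max(abs(x) for x in samples)
    # Guard against an all-zero calibration set, where any scale works.
    return max_abs / INT8_MAX if max_abs > 0 else 1.0


def quantize(x, scale):
    """q = round(x / scale), clamped to the signed 8-bit range."""
    # Python's round() rounds exact halves to the nearest even integer.
    q = round(x / scale)
    return max(INT8_MIN, min(INT8_MAX, q))


def dequantize(q, scale):
    """Recover an approximation of the original value: x is roughly q * scale."""
    return q * scale


# Stand-in for "representative data": 1000 samples from a standard normal.
random.seed(0)
calibration_data = [random.gauss(0.0, 1.0) for _ in range(1000)]
scale = calibrate_scale(calibration_data)

# Round-trip a few values and show the rounding error for each.
values = [-0.75, 0.0, 0.1234, 0.9]
for x in values:
    q = quantize(x, scale)
    x_hat = dequantize(q, scale)
    print(f"x={x:+.4f}  q={q:+4d}  x_hat={x_hat:+.4f}  error={abs(x - x_hat):.5f}")
```

For any value inside the calibrated range, the round-trip error is at most scale / 2, which is why a scale fitted to the real value range matters: a scale that is too large wastes integer levels and inflates the error, while one that is too small forces out-of-range values to be clamped.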