Lesson 11: Precision — FP32 → FP16/BF16 and Tensor Cores
Your model is stored in FP32 — 4 bytes per number. But inference usually doesn't need all that precision. Moving to FP16 or BF16 halves the memory and engages the Tensor Cores — dedicated hardware units that multiply matrices in low precision far faster. In this lesson we see how to get that speedup
FP32 is writing every number with tons of digits after the decimal point. Inference usually needs fewer digits (FP16) — half the space, and the GPU multiplies them much faster with dedicated hardware.
- FP16 / BF16
- 16-bit (2-byte) formats instead of FP32 (4-byte). They halve memory and speed up. BF16 keeps FP32's exponent range.
- Tensor Cores
- Hardware units in NVIDIA GPUs for low-precision matrix multiply-accumulate (FP16/BF16/INT8) — far faster than FP32.
- autocast (mixed precision)
- Runs heavy ops in FP16 and keeps sensitive ops in FP32 — speed with stability.