FP32 to INT4 Quantization

Visualize how floating-point ranges are compressed into 4-bit integers.

Quantization Scheme

Unsigned Int4 [0, 15]

Dynamic Range

Min Value

Max Value

Scale (s): 1.3333e+0

Zero Point (z): 8

q = clamp(round(x/s) + z, 0, 15)
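The quantization formula above can be sketched in a few lines of Python. This is a minimal illustration, not the page's own implementation; the function name `quantize_uint4` is hypothetical, and Python's built-in `round` (which rounds exact ties to the nearest even integer) is assumed to be an acceptable stand-in for the page's rounding rule.

```python
def quantize_uint4(x: float, s: float, z: int) -> int:
    """Map a real value x to an unsigned 4-bit code in [0, 15].

    Hypothetical helper: implements q = clamp(round(x/s) + z, 0, 15).
    """
    q = round(x / s) + z        # scale, round to nearest integer, shift by zero point
    return max(0, min(15, q))   # clamp to the representable unsigned int4 range


s = 4.0 / 3.0  # the scale shown on the page (1.3333e+0)
z = 8          # the zero point shown on the page

print(quantize_uint4(2.5, s, z))     # in-range value
print(quantize_uint4(100.0, s, z))   # far above the range: saturates at 15
print(quantize_uint4(-100.0, s, z))  # far below the range: saturates at 0
```

Values outside the dynamic range are not an error; they saturate to the nearest end of the code range, which is what the clamp expresses.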

Derivation (asymmetric)

s = (r_max - r_min) / (15 - 0)

z = round(0 - r_min / s)
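As a rough sketch of the derivation above, the following Python computes the scale and zero point from a dynamic range. The page does not show its Min Value and Max Value, so the range [-10, 10] used here is an assumption, chosen only because it reproduces the displayed s and z; the function name `asymmetric_params` is likewise hypothetical.

```python
def asymmetric_params(r_min: float, r_max: float,
                      q_min: int = 0, q_max: int = 15) -> tuple[float, int]:
    """Derive (scale, zero_point) for asymmetric quantization.

    Hypothetical helper following s = (r_max - r_min) / (q_max - q_min)
    and z = round(q_min - r_min / s).
    """
    s = (r_max - r_min) / (q_max - q_min)
    z = round(q_min - r_min / s)
    return s, z


# ASSUMPTION: the page's Min/Max inputs are not visible; -10 and 10 are
# illustrative values consistent with the displayed scale and zero point.
s, z = asymmetric_params(-10.0, 10.0)
print(f"s = {s:.4e}, z = {z}")
```

With this choice the zero point is the integer code that represents the real value 0, so zero quantizes and dequantizes without error.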

Input Value (x): 2.50

Unsigned Int4

x̂ ≈ 2.67

1. Scale & Shift

2.50 / 1.33 + 8

= 9.88

2. Round & Clamp

round(...)

10 (int4)

3. Dequantize

(10 - 8) * 1.33

≈ 2.667

Error (x - x̂)

-1.67e-1
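The dequantize step and the error readout can be reproduced with a short Python sketch. It takes the values from the worked example (x = 2.5, q = 10, s ≈ 1.3333, z = 8); the function name `dequantize_uint4` is hypothetical, and the error is assumed to be defined as x - x̂, which matches the sign of the displayed value.

```python
def dequantize_uint4(q: int, s: float, z: int) -> float:
    """Recover an approximate real value from a 4-bit code: x_hat = (q - z) * s."""
    return (q - z) * s


s, z = 4.0 / 3.0, 8   # scale and zero point from the page
x = 2.5               # input value from the worked example
q = 10                # code produced by the round-and-clamp step

x_hat = dequantize_uint4(q, s, z)
error = x - x_hat     # assumed sign convention: original minus reconstruction

print(f"x_hat = {x_hat:.3f}")   # x_hat = 2.667
print(f"error = {error:.2e}")   # error = -1.67e-01
```

For inputs inside the dynamic range, round-to-nearest keeps the absolute error at or below s/2 (about 0.667 here); larger errors occur only when the clamp saturates.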