FP32 to INT4 Quantization

Visualize how floating point ranges are compressed into 4-bit integers.

Configuration

Unsigned Int4 [0, 15]

Dynamic Range

Model Parameters

Scale (s): 1.3333e+0
Zero Point (z): 8
q = clamp(round(x/s) + z, 0, 15)

Derivation (asymmetric)

s = (r_max - r_min) / (15 - 0)
z = round(0 - r_min / s)
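
The derivation above can be sketched in Python as follows. The range r_min = -10, r_max = 10 is an assumption (the page does not state it); it is chosen only because it reproduces the displayed s ≈ 1.3333 and z = 8. The function name `asymmetric_params` is hypothetical.

```python
def asymmetric_params(r_min, r_max, q_min=0, q_max=15):
    """Scale and zero point for asymmetric quantization to [q_min, q_max]."""
    s = (r_max - r_min) / (q_max - q_min)
    # Note: Python's round() sends exact halves to the nearest even integer,
    # which may differ from the rounding rule used by the tool itself.
    z = round(q_min - r_min / s)
    return s, z


# Assumed range; not stated in the source.
s, z = asymmetric_params(-10.0, 10.0)
print(f"s = {s:.4e}, z = {z}")
```

Other ranges give a different zero point; a range that starts at 0, for example, yields z = 0.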
Unsigned Int4: q = 10
x = 2.50, x̂ ≈ 2.67
1. Scale & Shift: 2.50 / 1.33 + 8 = 9.88
2. Round & Clamp: round(9.88) = 10 (int4)
3. Dequantize: (10 - 8) * 1.33 = 2.667
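
The three steps can be reproduced with a short Python sketch. It uses the s and z shown under Model Parameters and the input x = 2.50 from step 1; the function names `quantize` and `dequantize` are hypothetical. Because z is an integer, rounding x/s before adding z (as in the formula) gives the same result as rounding x/s + z (as in step 1).

```python
def quantize(x, s, z, q_min=0, q_max=15):
    """q = clamp(round(x / s) + z, q_min, q_max)."""
    return max(q_min, min(q_max, round(x / s) + z))


def dequantize(q, s, z):
    """x_hat = (q - z) * s."""
    return (q - z) * s


s, z = 20 / 15, 8          # scale and zero point as displayed (1.3333, 8)
x = 2.5
q = quantize(x, s, z)      # steps 1 and 2
x_hat = dequantize(q, s, z)  # step 3
print(q, round(x_hat, 3), round(x - x_hat, 3))
```

With these inputs the sketch should arrive at q = 10 and x̂ ≈ 2.667, matching the values displayed above.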

Quantization Grid

Integer levels: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
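
Each integer level q on the grid corresponds to the real value (q - z) * s. A minimal sketch, assuming the displayed s ≈ 1.3333 and z = 8 and including level 0 (the format is [0, 15]), lists the representable values:

```python
s, z = 20 / 15, 8  # scale and zero point as displayed (1.3333, 8)

# Real value represented by each 4-bit level q in [0, 15].
grid = [(q - z) * s for q in range(16)]

for q, value in enumerate(grid):
    print(f"q = {q:2d}  ->  {value:8.4f}")
```

Adjacent grid points are s apart, so any x inside the representable span lands within s/2 of its nearest level.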
Error (x - x̂): -1.67e-1