Modern AI's "Weight Quantization": From FP32 to INT4, How Models "Slim Down" Without Losing Intelligence

In the AI community, we often hear the term "Quantization." A 70B-parameter model needs about 140GB of VRAM for its weights in FP16. After quantization, the same model runs in roughly 40GB or even less, within reach of consumer-grade hardware, while maintaining nearly identical performance. That is the magic of quantization.

But what exactly is quantization doing? Is it simply "rounding off"?

1. The Core Conflict: Precision vs. Memory

Deep learning models are essentially massive matrix multiplication machines. Each "parameter" (Weight) in the model is typically represented using FP32 (single-precision floating-point) during training. An FP32 number occupies 32 bits (4 bytes).

For a 70B (70 billion parameter) model:
$70 \times 10^9 \text{ parameters} \times 4 \text{ bytes} \approx 280\text{ GB}$
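The same arithmetic can be repeated for other bit-widths. The snippet below is a minimal sketch: the helper name `weight_memory_gb` is made up for illustration, and it counts the weights only, ignoring activations, the KV cache, and the small overhead of storing scales and zero-points.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS_70B = 70e9  # 70 billion parameters, as in the example above

for name, bits in [("FP32", 32), ("FP16/BF16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name:>9}: {weight_memory_gb(PARAMS_70B, bits):6.1f} GB")
```

Halving the bit-width halves the footprint, which is why the step from FP16 to INT4 matters so much for a model of this size.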

This means you would need several H100 GPUs just to barely load the model. Even in FP16 (2 bytes per weight), the weights still take about 140GB. The bottleneck during inference often lies not in computational speed (TFLOPS), but in Memory Bandwidth: moving weights from VRAM to the compute units takes far longer than the computation itself.

The goal of quantization is simple: represent the same number using fewer bits (Bit-width), thereby reducing memory usage and the amount of data that has to be moved for every generated token.

2. The Essence of Quantization: Mapping and Scaling

Quantization is not simple truncation; it is a mapping process. The most common method is Linear Quantization.

Suppose we have a set of FP32 weights $\mathbf{W}$ whose values lie in the range $[-1.5, 2.5]$. We want to quantize them to INT8 (range $[-128, 127]$).

The quantization formula is typically:
$$Q = \text{round}\left(\frac{W}{S} + Z\right)$$
Where $S$ is the Scale factor, and $Z$ is the Zero-point.

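Scale ($S$): Maps the dynamic range of FP32 to the INT8 range. For example, $S = (2.5 - (-1.5)) / (127 - (-128)) \approx 0.0157$.
Zero-point ($Z$): Handles asymmetric distributions, ensuring that $0$ in FP32 maps precisely to an integer. For the range above, $Z = \text{round}(-128 - (-1.5)/S) = -32$, so $W = 0$ maps to $Q = -32$.
Dequantization inverts the mapping: $W \approx S \cdot (Q - Z)$.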

Thus, numbers that originally required 32 bits for storage are reduced to 8 bits, cutting memory usage directly to $1/4$ of the original.
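The mapping above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a production kernel: the function names are made up for this example, and the range $[-1.5, 2.5]$ is passed in explicitly instead of being calibrated from data.

```python
import numpy as np


def quantize_int8(w, w_min, w_max):
    """Asymmetric linear quantization: Q = round(W / S + Z), clipped to INT8."""
    qmin, qmax = -128, 127
    scale = (w_max - w_min) / (qmax - qmin)
    # Choose Z so that w_min lands on qmin (rounded to an integer).
    zero_point = int(round(qmin - w_min / scale))
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point


def dequantize(q, scale, zero_point):
    """Approximate inverse of the mapping: W ~ S * (Q - Z)."""
    return (q.astype(np.float32) - zero_point) * scale


w = np.array([-1.5, -0.3, 0.0, 0.7, 2.5], dtype=np.float32)
q, scale, zp = quantize_int8(w, -1.5, 2.5)
w_hat = dequantize(q, scale, zp)

print("scale:", scale, "zero_point:", zp)
print("quantized:", q)
print("max abs error:", np.max(np.abs(w_hat - w)))
```

Each weight now occupies one byte instead of four, and the round-trip error stays within about half a quantization step ($S/2$) for values inside the chosen range.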

3. From INT8 to INT4: The Tipping Point of Precision Collapse

When we compress the bit-width further to INT4, things become tricky. INT4 has only $2^4 = 16$ possible values. If we use simple linear mapping over a whole matrix, many weights that differ only slightly collapse onto the same value, causing the model to suffer from "intelligence dropouts" or garbled output.

To solve this problem, the industry has introduced more advanced schemes:

A. Group-wise Quantization

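Instead of using a single $S$ for the entire matrix, weights are divided into small groups (e.g., every 64 parameters form a group), with each group having its own Scale and Zero-point. Each group's 16 levels then cover only that group's local range, so a single outlier no longer stretches the grid for the whole matrix. The price is a small amount of extra storage for the per-group parameters.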
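To see why grouping helps, the sketch below quantizes the same weights to 4 bits twice: once with a single scale for the whole tensor and once with groups of 64. It is a simplified illustration, not the scheme of any particular library: the function names are hypothetical, scales and zero-points are kept as floats, and the weights are synthetic with one injected outlier.

```python
import numpy as np


def quantize_groupwise_int4(w, group_size=64):
    """Asymmetric 4-bit quantization with one scale/zero-point per group."""
    qmin, qmax = 0, 15  # unsigned 4-bit codes
    groups = w.reshape(-1, group_size)
    g_min = groups.min(axis=1, keepdims=True)
    g_max = groups.max(axis=1, keepdims=True)
    # Guard against a zero range in a constant group.
    scale = np.maximum(g_max - g_min, 1e-8) / (qmax - qmin)
    zero_point = np.round(qmin - g_min / scale)
    q = np.clip(np.round(groups / scale + zero_point), qmin, qmax).astype(np.uint8)
    return q, scale, zero_point


def dequantize_groupwise(q, scale, zero_point, shape):
    """Invert the per-group mapping and restore the original shape."""
    return ((q.astype(np.float32) - zero_point) * scale).reshape(shape)


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 256)).astype(np.float32)
w[0, 0] = 1.0  # a single outlier stretches the range of whatever group holds it

# One group spanning the whole tensor is equivalent to per-tensor quantization.
q_t, s_t, z_t = quantize_groupwise_int4(w, group_size=w.size)
err_tensor = np.abs(dequantize_groupwise(q_t, s_t, z_t, w.shape) - w).mean()

q_g, s_g, z_g = quantize_groupwise_int4(w, group_size=64)
err_group = np.abs(dequantize_groupwise(q_g, s_g, z_g, w.shape) - w).mean()

print("per-tensor mean abs error:", err_tensor)
print("group-wise mean abs error:", err_group)
```

With grouping, the outlier only degrades the one group it belongs to; the other groups keep a fine-grained grid.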

B. NF4 (NormalFloat 4) — The Core of QLoRA

NF4 is a special quantization format designed for normal distributions. Research has found that the weight distribution of LLMs closely approximates a normal distribution $\mathcal{N}(0, \sigma^2)$. Instead of using evenly spaced integer mapping, NF4 defines these 16 values based on the quantiles of the normal distribution. This means more representation precision is allocated to areas with higher probability density, thereby maintaining astonishing performance at extremely low bit-widths.
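The quantile idea can be sketched as follows. Note the assumptions: this builds a simplified 16-level codebook from evenly spaced quantiles of $\mathcal{N}(0, 1)$ and is not the exact NF4 table used by QLoRA; the function names are made up for this example, and a single absmax scale is used for the whole tensor for brevity.

```python
from statistics import NormalDist

import numpy as np


def normal_quantile_codebook(num_levels=16):
    """Levels placed at evenly spaced quantiles of N(0, 1), scaled to [-1, 1]."""
    nd = NormalDist()
    # Offset by half a step so the probabilities never hit 0 or 1.
    probs = (np.arange(num_levels) + 0.5) / num_levels
    levels = np.array([nd.inv_cdf(float(p)) for p in probs])
    return levels / np.abs(levels).max()


def quantize_with_codebook(w, codebook):
    """Normalize by absmax, then store the index of the nearest level."""
    absmax = np.abs(w).max()
    normalized = w / absmax
    idx = np.abs(normalized[:, None] - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax


def dequantize_with_codebook(idx, absmax, codebook):
    """Look the level up again and undo the absmax normalization."""
    return codebook[idx] * absmax


rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=4096)

quantile_cb = normal_quantile_codebook()
uniform_cb = np.linspace(-1.0, 1.0, 16)  # evenly spaced baseline

errors = {}
for name, cb in [("quantile", quantile_cb), ("uniform", uniform_cb)]:
    idx, absmax = quantize_with_codebook(w, cb)
    errors[name] = np.abs(dequantize_with_codebook(idx, absmax, cb) - w).mean()
    print(f"{name:>8} codebook mean abs error: {errors[name]:.6f}")
```

Because the quantile levels crowd around zero, where most normally distributed weights sit, they should typically give a lower average error on such data than 16 evenly spaced levels.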

4. What is the Cost of Quantization?

Although memory is saved, it comes at a cost:
1. Quantization Error: Every number is approximated, so the model's predicted probability distribution shifts and generation quality declines slightly (Perplexity increases).
2. Dequantization Overhead: GPU compute units typically do not support direct INT4 $\times$ FP16 matrix multiplication. In practice, weights are stored in VRAM as INT4, "dequantized" back to FP16 on the fly just before computation, used for the matrix multiplication, and then the dequantized weights are discarded. Although this process is fast, it still incurs overhead.
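The store-unpack-compute cycle described in point 2 can be imitated on the CPU with NumPy. This is only a sketch of the data flow under simplifying assumptions: real GPU kernels fuse these steps, the function names are hypothetical, and a single scale and zero-point are used for the whole matrix.

```python
import numpy as np


def pack_int4(q):
    """Pack pairs of 4-bit codes (values 0..15) into single bytes."""
    pairs = q.astype(np.uint8).reshape(-1, 2)
    return (pairs[:, 0] | (pairs[:, 1] << 4)).astype(np.uint8)


def unpack_int4(packed):
    """Recover the two 4-bit codes stored in each byte."""
    low = packed & 0x0F
    high = packed >> 4
    return np.stack([low, high], axis=1).reshape(-1)


def matmul_dequant(x, packed, scale, zero_point, shape):
    """Dequantize the packed weights to FP16 right before the matmul."""
    q = unpack_int4(packed).reshape(shape)
    w = ((q.astype(np.float16) - zero_point) * scale).astype(np.float16)
    # w is a temporary: it is dropped as soon as the product is computed.
    return x @ w


rng = np.random.default_rng(0)
w_q = rng.integers(0, 16, size=(8, 4))  # pretend these are stored 4-bit codes
packed = pack_int4(w_q)

x = np.ones((2, 8), dtype=np.float16)
y = matmul_dequant(x, packed, scale=0.01, zero_point=8, shape=w_q.shape)

print("bytes stored:", packed.nbytes, "for", w_q.size, "weights")
print("output shape:", y.shape)
```

Only the packed bytes live in memory permanently; the FP16 copy of the weights exists just long enough to be multiplied.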

5. Practical Advice for Developers

If you are hesitant about which version to choose when deploying models:
- FP16/BF16: The baseline, highest precision, suitable for production environments with sufficient resources.
- INT8 (SmoothQuant/GPTQ): A very robust choice with almost imperceptible loss; the preferred option when it fits in memory.
- INT4 (AWQ/GPTQ/GGUF): Often the only viable option for running large models on consumer-grade hardware. For general tasks with tight VRAM constraints, AWQ usually maintains slightly better precision than GPTQ.
- $\le$ INT3: Currently in the academic exploration phase or for extremely small-scale deployments. Do not expect it to maintain complex logical capabilities.

Summary: Quantization is a typical trade-off art in AI engineering: it gives up a little precision in exchange for large savings in memory and bandwidth, and with them, usability. It has allowed LLMs to move from expensive server clusters into the laptops of every developer.
