In the previous post, we examined how differences in Number Formats affect hardware area and power consumption. We established that FP32 is excessively expensive and heavy from a hardware standpoint. Today, we will discuss the technology that puts this heavy data on a diet: Quantization.
Many people think of quantization as simply a "compression technique to reduce model file size." The file size does shrink, but the real reason System Architects are obsessed with quantization lies elsewhere: 'Bandwidth' and 'Data Movement.'
In this article, we will take a cold, hard look at the physical benefits that appear inside the chip when FP32 (32-bit Floating Point) is converted to INT8 (8-bit Integer), and at what we lose in return (the Trade-off).
1. The Magic of Bandwidth: Effectively Quadrupling the Highway
The biggest bottleneck limiting system performance is often Memory (the so-called Memory Wall). Let's assume you are using the latest LPDDR5 memory to send 50GB of data per second to the NPU.
- Using FP32: One parameter is 4 Bytes (32 bits). Parameters transferable per second = 50 GB/s ÷ 4 B = 12.5 billion.
- Using INT8: One parameter is 1 Byte (8 bits). Parameters transferable per second = 50 GB/s ÷ 1 B = 50 billion.
Even without increasing the physical memory speed, reducing the data size to 1/4 results in a 4x increase in effective Data Throughput.
This isn't just about speed. The efficiency of the chip's internal SRAM also improves by 4x. A 4MB cache can hold only 1 million FP32 parameters, but it can hold 4 million INT8 parameters. This drastically reduces the frequency of expensive DRAM accesses, maximizing power efficiency.
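Here is a quick back-of-the-envelope sketch of this arithmetic in Python. The 50 GB/s and 4 MB figures are the illustrative numbers from above, not the specs of any particular chip:

```python
# Back-of-the-envelope: effective throughput and on-chip capacity
# for different parameter widths. All figures are illustrative.

BANDWIDTH_BYTES_PER_S = 50e9  # assumed LPDDR5 link: 50 GB/s
SRAM_BYTES = 4e6              # assumed on-chip SRAM: 4 MB (decimal MB)

for name, bytes_per_param in [("FP32", 4), ("INT8", 1)]:
    params_per_s = BANDWIDTH_BYTES_PER_S / bytes_per_param
    params_in_sram = SRAM_BYTES / bytes_per_param
    print(f"{name}: {params_per_s / 1e9:.1f}B params/s, "
          f"{params_in_sram / 1e6:.1f}M params in SRAM")
```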
2. The Principle of Quantization
So, how do real numbers become integers? The most widely used method is Affine Quantization (Asymmetric Quantization), which maps each real value to an integer using the following ingredients (combined into a formula and a short sketch after the list):
- x_float: The original floating-point value (Input).
- S (Scale Factor): The multiplier determining how much of the real range fits into one integer step (Step Size).
- Z (Zero Point): The integer offset to which the real value 0.0 maps.
- x_int: The converted integer value (-128 to 127 or 0 to 255 for INT8).
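Putting these together, the forward and reverse mappings are:

x_int = clamp( round(x_float / S) + Z, qmin, qmax )
x_float ≈ S * (x_int - Z)

Below is a minimal NumPy sketch of this mapping. The helper names and the example range [-1.0, 2.0] are mine, chosen for illustration:

```python
import numpy as np

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    """Affine quantization: x_int = clamp(round(x/S) + Z, qmin, qmax)."""
    x_int = np.round(x / scale) + zero_point
    return np.clip(x_int, qmin, qmax).astype(np.int8)

def dequantize(x_int, scale, zero_point):
    """Approximate reconstruction: x_float ≈ S * (x_int - Z)."""
    return scale * (x_int.astype(np.float32) - zero_point)

# Map the real range [-1.0, 2.0] onto the 256 signed INT8 codes.
scale = (2.0 - (-1.0)) / 255                    # real width of one integer step
zero_point = int(round(-128 - (-1.0) / scale))  # integer that real 0.0 maps to

x = np.array([-1.0, 0.0, 0.1, 0.11, 2.0], dtype=np.float32)
x_q = quantize(x, scale, zero_point)
print(x_q)                                # e.g. [-128  -43  -35  -34  127]
print(dequantize(x_q, scale, zero_point))
```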
Simply put, it's like changing the markings on a high-resolution analog ruler to sparse digital markings. In this process, error inevitably occurs.
3. The Trade-off: Quantization Noise
The error generated when converting FP32 to INT8 is called Quantization Noise.
32-bit floating-point can express minute differences, down to magnitudes around 10^-45 (the smallest FP32 denormal). INT8, however, can only represent 256 distinct values (2^8). For example, 0.1 and 0.11 might both map to the same integer, 10, after quantization.
This loss of information leads to a drop in Model Accuracy. It comes from two sources, demonstrated numerically in the sketch after this list:
- Rounding Error: Error caused by rounding to the nearest integer.
- Clipping Error: Error caused by forcing large values (Outliers) outside the representable range to the maximum/minimum values.
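Both error types show up immediately in a tiny numeric sketch. The calibration range of [-1.0, 1.0] and the symmetric zero point are assumptions for this example:

```python
import numpy as np

scale = 2.0 / 255   # range [-1.0, 1.0] split into 255 integer steps
zero_point = 0      # symmetric quantization: real 0.0 maps to integer 0

x = np.array([0.1, 0.11, 5.0], dtype=np.float32)  # 5.0 is an outlier
x_int = np.clip(np.round(x / scale) + zero_point, -128, 127)
x_hat = scale * (x_int - zero_point)

print(x_int)      # [ 13.  14. 127.]  the outlier saturates at 127
print(x_hat - x)  # small rounding errors, then a clipping error near -4.0
```

Note that 0.1 and 0.11 land on adjacent integer codes here; with a wider calibration range (a larger step size), they would collapse onto the same code, as in the example above.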
Architect’s Insight:
Hardware engineers constantly weigh this trade-off. "Do we sacrifice 1% accuracy to gain 4x speed?"
Fortunately, deep learning models have high Robustness to noise. Just as a photo of a dog with slight noise is still recognized as a dog, a model often produces the same final result even with some quantization noise mixed into the weights. This is the basis for boldly discarding FP32.
4. Dynamic Range and Calibration
The core decision in quantization is "which range do we split into 256 parts?" This process is called Calibration.
- Min-Max: Sets the range based on the minimum and maximum values of the data. It's easy to implement, but a single Outlier can ruin the precision of the entire range.
- Entropy/Histogram: Sets the range around where the data is most concentrated and boldly cuts off the rest (Clipping). This is a smarter method to minimize information loss.
Hardware accelerators typically manage this Scale Factor (S) and Zero Point (Z) separately per layer (Layer-wise) or per channel (Channel-wise) to preserve precision.
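A minimal sketch of the two calibration strategies, assuming symmetric signed INT8; the percentile cut below is a simplified stand-in for a full entropy/histogram search:

```python
import numpy as np

def minmax_scale(x, qmax=127):
    """Min-Max calibration: the range covers every observed value."""
    return np.abs(x).max() / qmax

def percentile_scale(x, pct=99.9, qmax=127):
    """Clipping calibration: the range covers only the bulk of the data.
    (A percentile cut stands in for entropy/histogram methods here.)"""
    return np.percentile(np.abs(x), pct) / qmax

rng = np.random.default_rng(0)
acts = rng.normal(0.0, 1.0, 100_000).astype(np.float32)
acts[0] = 80.0  # one extreme outlier

# The single outlier inflates the Min-Max step size by more than an
# order of magnitude, wasting most of the 256 codes on values that
# never actually occur.
print(f"Min-Max step:    {minmax_scale(acts):.4f}")
print(f"Percentile step: {percentile_scale(acts):.4f}")
```

Per-channel quantization simply applies one of these estimators along each output-channel axis instead of over the whole tensor, trading one (S, Z) pair of metadata per channel for noticeably better precision.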
5. Conclusion: Gaining More Than Losing
The path from FP32 to INT8 undeniably involves a loss of precision. However, the hardware benefits gained in return (4x bandwidth, higher power efficiency, smaller silicon area) are truly massive.
Modern AI semiconductors are now pushing beyond INT8 toward INT4, and even 1-bit (Binary) Quantization. "Expressing intelligence with the minimum number of bits": that is the ultimate goal of NPU architecture.