In the previous post, we explored the difference between Training and Inference, seeing how inference-only NPUs lighten the hardware structure. One of the key keywords for this optimization was 'Reduction of Precision.' To a software engineer, data is merely an abstract variable type like float (32-bit) or int (32-bit). However, to a System Architect designing silicon chips, data carries physical 'Weight.'
An increase in the number of bits means more strands of wire to transport data, more Flip-Flops to store it, and most importantly, an exponential increase in the Silicon Area of the logic circuits required to compute them.
In this article, we will analyze why FP32 (Floating Point 32-bit), the standard for deep learning, is such a heavy and expensive format from a hardware perspective, and the butterfly effect that transitioning to INT8 (Fixed Point) brings to system performance.
1. IEEE 754: The Structural Complexity of Real Numbers
The floating-point data (FP32) we commonly use follows the IEEE 754 Standard. To represent a very wide Dynamic Range, this format splits a number into three parts:
- Sign (1-bit): Positive/Negative
- Exponent (8-bit): Determines the magnitude/range
- Mantissa (23-bit): Determines the precision/significant digits
While mathematically elegant, this structure is a Nightmare for hardware implementation. Even a simple addition requires a complex sequence of steps:
- Denormalization (Alignment): Comparing the exponents of the two numbers and shifting the smaller mantissa to align their binary points.
- Mantissa Add: Adding the aligned mantissas.
- Normalization: Bit-shifting the result to restore it to the standard format (1.xxx).
- Rounding: Processing the least significant bits to match precision.
All these steps require complex logic blocks like Comparators, Barrel Shifters, and Leading Zero Detectors. This is why FP32 arithmetic units are expensive.
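The four steps above can be sketched in software. Below is a toy model of the adder datapath, assuming positive operands represented as (exponent, mantissa) pairs with the hidden bit stored explicitly (24 bits total); the rounding step is omitted for brevity.

```python
# Toy FP32-style adder for positive numbers: mantissas are 24-bit
# integers with the hidden leading 1 stored explicitly (Q1.23 format).
MANT_BITS = 23

def fp_add(exp_a, mant_a, exp_b, mant_b):
    # Alignment: ensure operand A has the larger exponent, then shift
    # operand B's mantissa right until the exponents match.
    # (In silicon: a comparator plus a barrel shifter.)
    if exp_a < exp_b:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    mant_b >>= exp_a - exp_b

    # Mantissa add: a plain integer addition.
    mant = mant_a + mant_b
    exp = exp_a

    # Normalization: shift back into 1.xxx form on overflow.
    # (In silicon: a leading-one detector plus another shifter.)
    while mant >= (1 << (MANT_BITS + 1)):
        mant >>= 1
        exp += 1

    return exp, mant

# 1.5 is (exp=0, mant=1.5 * 2^23); 1.5 + 1.5 normalizes to 3.0
print(fp_add(0, 12582912, 0, 12582912))  # (1, 12582912), i.e. 1.5 * 2^1
```

Notice that the only cheap step is the integer addition in the middle; everything around it is the overhead the article is describing.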
2. Exponential Increase in Hardware Area
A key metric in hardware design cost is Silicon Die Area. Larger area means fewer chips per wafer (lower Net Die), lower Yield, and higher unit cost. The area of a Multiplier scales roughly quadratically (O(N²)) with respect to the input bit width (N).
- FP32 Multiplier: Requires multiplication logic for 23-bit mantissas (roughly 24 * 24 including the hidden bit), plus exponent addition and normalization logic.
- INT8 Multiplier: A simple 8 * 8 integer multiplier. No complex normalization or shifters required.
Quantitative Analysis:
Based on a 45nm process, the area of a single FP32 multiplier is roughly equivalent to that of 18.5 INT8 multipliers. This means that by abandoning FP32 for INT8, you can pack about 18 times more processing cores into the same chip area. This is the secret behind the explosive Throughput of NPUs.
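As a sanity check on the quadratic claim, we can use the number of partial products in an N-bit array multiplier as a crude area proxy (an assumption for illustration; real standard-cell areas differ):

```python
# Partial-product count of an N x N array multiplier as an area proxy.
def mult_area_units(n_bits):
    return n_bits ** 2

mantissa_mult = mult_area_units(24)  # 23-bit mantissa + hidden bit
int8_mult = mult_area_units(8)       # plain 8 x 8 integer multiply
print(mantissa_mult / int8_mult)     # 9.0
```

The mantissa multiplier alone accounts for a 9x gap; the exponent adders, barrel shifters, and rounding logic push the measured ratio toward the 18.5x figure above.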
3. Power Consumption and Heat
A more serious issue is Power. The complex logic blocks of FP32 mentioned above toggle (switch) every clock cycle, consuming power.
- FP32 Addition: Approx. 0.9 pJ (Pico Joules)
- INT8 Addition: Approx. 0.03 pJ
- Energy Efficiency Gap: ~30x
When running massive models with hundreds of millions of parameters, this 30x difference determines "whether a smartphone battery dies in an hour or lasts all day." Furthermore, power consumption leads directly to Heat, which causes throttling that forces the chip to lower its operating clock speed.
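Plugging the per-operation figures above into a hypothetical workload of 100 million additions per inference (an illustrative count, not a measured one) makes the gap concrete:

```python
# Energy for the additions alone in one inference pass, using the
# 45nm per-op figures quoted above. The op count is illustrative.
PICO = 1e-12
n_ops = 100_000_000
e_fp32 = n_ops * 0.9 * PICO    # joules for FP32 additions
e_int8 = n_ops * 0.03 * PICO   # joules for INT8 additions
ratio = e_fp32 / e_int8        # the ~30x efficiency gap
```

Every joule saved here is also a joule that never becomes heat, which is why the same ratio shows up in thermal headroom.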
4. Relieving the Memory Bandwidth Bottleneck
The weight of data is felt not just inside the arithmetic units but also on the Memory Bus, the highway transporting data.
- FP32: 4 Bytes per parameter
- INT8: 1 Byte per parameter
Given the same DRAM bandwidth (e.g., 100GB/s), you can supply 4 times more data when loading INT8 compared to FP32.
Considering that most AI inference tasks are Memory-Bound (limited by data loading speed rather than computation speed), shrinking the data to 1/4 of its size is the most definitive optimization, boosting total system performance by up to 4x.
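To make the arithmetic concrete, here is the time to stream the weights of a hypothetical 1-billion-parameter model once from DRAM at the 100GB/s figure above:

```python
# Time to stream every weight from DRAM once, at a fixed 100 GB/s.
params = 1_000_000_000
bandwidth = 100e9                 # bytes per second
t_fp32 = params * 4 / bandwidth   # seconds for 4-byte weights
t_int8 = params * 1 / bandwidth   # seconds for 1-byte weights
speedup = t_fp32 / t_int8         # 4x fewer bytes -> up to 4x throughput
```

In a memory-bound regime the arithmetic units sit idle waiting for this transfer, so the 4x reduction in bytes translates almost directly into end-to-end speedup.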
5. Fixed Point and Quantization
So, how do we convert FP32 to INT8? It’s not just about truncating decimals. We use the concept of Fixed Point.
In hardware, we fix the position of the decimal point (an agreement between the programmer and hardware) and simply run Integer ALUs. This process is called Quantization.
Of course, trying to fit the wide range of 32-bit into a narrow 8-bit container results in information loss (Accuracy Drop). However, Deep Learning models have massive redundancy, so slight errors in individual parameters do not significantly affect the final result. Leveraging this, we can maintain accuracy while drastically lowering hardware costs.
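A minimal sketch of the idea, assuming symmetric linear quantization with a single shared scale factor (the function names and sample weights are illustrative):

```python
# Symmetric linear quantization: map FP32 weights onto INT8 with one
# shared scale factor, then dequantize to inspect the rounding error.
def quantize_int8(values):
    # The largest magnitude maps to +/-127; everything else scales down.
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.82, -1.27, 0.003, 2.54, -0.5]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
# 2.54 maps to 127 exactly; tiny values like 0.003 collapse toward 0.
# The per-weight error is bounded by half the scale step (s / 2).
```

The `s / 2` error bound per weight is exactly the Accuracy Drop mentioned above; the redundancy of the model is what absorbs it.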
6. Conclusion: The Wisdom of Choosing the Right Container
The role of a Hardware engineer is not to build the most precise calculator possible. It is to "select the smallest (Area) and lowest power (Power) data format that satisfies the required accuracy."
Recently, new formats like BF16 (Bfloat16) or FP8 have emerged as compromises between FP32 and INT8, being adopted in modern chips like the NVIDIA H100. This illustrates the ongoing evolution of hardware design, constantly balancing training stability and inference efficiency.
References: High-Performance Hardware for Machine Learning