In the previous post, we explored the difference between Training and Inference, seeing how inference-only NPUs lighten the hardware structure. One of the key keywords for this optimization was 'Reduction of Precision.' To a software engineer, data is merely an abstract variable type like float (32-bit) or int (32-bit). However, to a System Architect designing silicon chips, data carries physical 'Weight.'
An increase in the number of bits means more strands of wire to transport data, more Flip-Flops to store it, and most importantly, an exponential increase in the Silicon Area of the logic circuits required to compute them.
In this article, we will analyze why FP32 (Floating Point 32-bit), the standard for deep learning, is such a heavy and expensive format from a hardware perspective, and the butterfly effect that transitioning to INT8 (Fixed Point) brings to system performance.
1. IEEE 754: The Structural Complexity of Real Numbers
The floating-point data (FP32) we commonly use follows the IEEE 754 Standard. To represent a very wide Dynamic Range, this format splits a number into three parts:
- Sign (1-bit): Positive/Negative
- Exponent (8-bit): Determines the magnitude/range
- Mantissa (23-bit): Determines the precision/significant digits
While mathematically elegant, this structure is a Nightmare for hardware implementation. Even a simple addition requires a complex sequence of steps:
- Denormalization (Alignment): Comparing the exponents of the two numbers and shifting the smaller mantissa to align their binary points.
- Mantissa Add: Adding the aligned mantissas.
- Normalization: Bit-shifting the result to restore it to the standard format (1.xxx).
- Rounding: Processing the least significant bits to match precision.
All these steps require complex logic blocks like Comparators, Barrel Shifters, and Leading Zero Detectors. This is why FP32 arithmetic units are expensive.
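The four steps above can be sketched in software. Below is a toy model of the adder datapath, assuming positive operands represented as (exponent, mantissa) pairs with the hidden bit stored explicitly (24 bits total); the rounding step is omitted for brevity.

```python
# Toy FP32-style adder for positive numbers: mantissas are 24-bit
# integers with the hidden leading 1 stored explicitly (Q1.23 format).
MANT_BITS = 23

def fp_add(exp_a, mant_a, exp_b, mant_b):
    # Alignment: ensure operand A has the larger exponent, then shift
    # operand B's mantissa right until the exponents match.
    # (In silicon: a comparator plus a barrel shifter.)
    if exp_a < exp_b:
        exp_a, mant_a, exp_b, mant_b = exp_b, mant_b, exp_a, mant_a
    mant_b >>= exp_a - exp_b

    # Mantissa add: a plain integer addition.
    mant = mant_a + mant_b
    exp = exp_a

    # Normalization: shift back into 1.xxx form on overflow.
    # (In silicon: a leading-one detector plus another shifter.)
    while mant >= (1 << (MANT_BITS + 1)):
        mant >>= 1
        exp += 1

    return exp, mant

# 1.5 is (exp=0, mant=1.5 * 2^23); 1.5 + 1.5 normalizes to 3.0
print(fp_add(0, 12582912, 0, 12582912))  # (1, 12582912), i.e. 1.5 * 2^1
```

Notice that the only cheap step is the integer addition in the middle; everything around it is the overhead the article is describing.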
2. Exponential Increase in Hardware Area
A key metric in hardware design cost is Silicon Die Area. Larger area means fewer chips per wafer (lower Net Die), lower Yield, and higher unit cost. The area of a Multiplier scales roughly quadratically (O(N²)) with respect to the input bit width (N).
- FP32 Multiplier: Requires multiplication logic for 23-bit mantissas (roughly 24 * 24 including the hidden bit), plus exponent addition and normalization logic.
- INT8 Multiplier: A simple 8 * 8 integer multiplier. No complex normalization or shifters required.
Quantitative Analysis:
Based on a 45nm process, the area of a single FP32 multiplier is roughly equivalent to that of 18.5 INT8 multipliers. This means that by abandoning FP32 for INT8, you can pack about 18 times more processing cores into the same chip area. This is the secret behind the explosive Throughput of NPUs.
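As a sanity check on the quadratic claim, we can use the number of partial products in an N-bit array multiplier as a crude area proxy (an assumption for illustration; real standard-cell areas differ):

```python
# Partial-product count of an N x N array multiplier as an area proxy.
def mult_area_units(n_bits):
    return n_bits ** 2

mantissa_mult = mult_area_units(24)  # 23-bit mantissa + hidden bit
int8_mult = mult_area_units(8)       # plain 8 x 8 integer multiply
print(mantissa_mult / int8_mult)     # 9.0
```

The mantissa multiplier alone accounts for a 9x gap; the exponent adders, barrel shifters, and rounding logic push the measured ratio toward the 18.5x figure above.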
3. Power Consumption and Heat
A more serious issue is Power. The complex logic blocks of FP32 mentioned above toggle (switch) every clock cycle, consuming power.
- FP32 Addition: Approx. 0.9 pJ (Pico Joules)
- INT8 Addition: Approx. 0.03 pJ
- Energy Efficiency Gap: ~30x
When running massive models with hundreds of millions of parameters, this 30x difference determines "whether a smartphone battery dies in an hour or lasts all day." Furthermore, power consumption leads directly to Heat, which causes throttling that forces the chip to lower its operating clock speed.
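Plugging the per-operation figures above into a hypothetical workload of 100 million additions per inference (an illustrative count, not a measured one) makes the gap concrete:

```python
# Energy for the additions alone in one inference pass, using the
# 45nm per-op figures quoted above. The op count is illustrative.
PICO = 1e-12
n_ops = 100_000_000
e_fp32 = n_ops * 0.9 * PICO    # joules for FP32 additions
e_int8 = n_ops * 0.03 * PICO   # joules for INT8 additions
ratio = e_fp32 / e_int8        # the ~30x efficiency gap
```

Every joule saved here is also a joule that never becomes heat, which is why the same ratio shows up in thermal headroom.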
4. Relieving the Memory Bandwidth Bottleneck
The weight of data is felt not just inside the arithmetic units but also on the Memory Bus, the highway transporting data.
- FP32: 4 Bytes per parameter
- INT8: 1 Byte per parameter
Given the same DRAM bandwidth (e.g., 100GB/s), you can supply 4 times more data when loading INT8 compared to FP32.
Considering that most AI inference tasks are Memory-Bound (limited by data loading speed rather than computation speed), shrinking the data to 1/4 of its size is the most definitive optimization, boosting total system performance by up to 4x.
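To make the arithmetic concrete, here is the time to stream the weights of a hypothetical 1-billion-parameter model once from DRAM at the 100GB/s figure above:

```python
# Time to stream every weight from DRAM once, at a fixed 100 GB/s.
params = 1_000_000_000
bandwidth = 100e9                 # bytes per second
t_fp32 = params * 4 / bandwidth   # seconds for 4-byte weights
t_int8 = params * 1 / bandwidth   # seconds for 1-byte weights
speedup = t_fp32 / t_int8         # 4x fewer bytes -> up to 4x throughput
```

In a memory-bound regime the arithmetic units sit idle waiting for this transfer, so the 4x reduction in bytes translates almost directly into end-to-end speedup.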
5. Fixed Point and Quantization
So, how do we convert FP32 to INT8? It’s not just about truncating decimals. We use the concept of Fixed Point.
In hardware, we fix the position of the decimal point (an agreement between the programmer and hardware) and simply run Integer ALUs. This process is called Quantization.
Of course, trying to fit the wide range of 32-bit into a narrow 8-bit container results in information loss (Accuracy Drop). However, Deep Learning models have massive redundancy, so slight errors in individual parameters do not significantly affect the final result. Leveraging this, we can maintain accuracy while drastically lowering hardware costs.
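A minimal sketch of the idea, assuming symmetric linear quantization with a single shared scale factor (the function names and sample weights are illustrative):

```python
# Symmetric linear quantization: map FP32 weights onto INT8 with one
# shared scale factor, then dequantize to inspect the rounding error.
def quantize_int8(values):
    # The largest magnitude maps to +/-127; everything else scales down.
    scale = max(abs(v) for v in values) / 127.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.82, -1.27, 0.003, 2.54, -0.5]
q, s = quantize_int8(weights)
restored = dequantize(q, s)
# 2.54 maps to 127 exactly; tiny values like 0.003 collapse toward 0.
# The per-weight error is bounded by half the scale step (s / 2).
```

The `s / 2` error bound per weight is exactly the Accuracy Drop mentioned above; the redundancy of the model is what absorbs it.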
6. Conclusion: The Wisdom of Choosing the Right Container
The role of a Hardware engineer is not to build the most precise calculator possible. It is to "select the smallest (Area) and lowest power (Power) data format that satisfies the required accuracy."
Recently, new formats like BF16 (Bfloat16) or FP8 have emerged as compromises between FP32 and INT8, being adopted in modern chips like the NVIDIA H100. This illustrates the ongoing evolution of hardware design, constantly balancing training stability and inference efficiency.
References: High-Performance Hardware for Machine Learning