Hardware Optimization - Float to Integer

Designing NPU (Neural Processing Unit) architectures at a low-power AI semiconductor startup constantly reminds me of a harsh truth: software engineering and hardware engineering are fundamentally different worlds. Python-based AI frameworks freely perform 32-bit floating-point (float32) arithmetic as if it cost nothing. However, implementing floating-point logic in a low-power edge FPGA environment—where thermal management is a matter of life and death—is practically 'hardware suicide' in terms of area and power consumption.

Ultimately, the core competency of a hardware engineer lies in how elegantly these heavy floating-point operations can be converted into fast, lightweight fixed-point / integer arithmetic. In this article, I will share three 'true hardware optimization' skills I applied while designing an RTL MAC (Multiply-Accumulate) array.

1. Fixed-Point and Shifting

AI model quantization parameters are usually awkward floats like 0.00379.... How do we multiply by this in hardware? We use the Q-format (fixed-point) method: multiply the float by a large constant to eliminate the decimal point, perform the math in integers, and then divide back down to restore the scale.

However, a hardware divider is vastly larger and slower than a multiplier. Therefore, we always scale by 2^N (usually N = 16 or 32), so that the division can be replaced by a bit shift (>> 16) whose hardware cost converges to zero.

$$Result = \frac{Input \times 12700}{2^{16}} \quad \Rightarrow \quad Result = (Input \times 12700) \gg 16$$
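As a sanity check, the trick can be modeled in a few lines of Python—a bit-accurate software reference, not the RTL itself; the constant 12700 and the Q16 shift are taken from the example above:

```python
SCALE = 12700   # integer scaling constant from the example above
SHIFT = 16      # dividing by 2**16 becomes a free right shift in hardware

def fixed_scale(x: int) -> int:
    """Integer-only approximation of x * (12700 / 2**16)."""
    # Python's >> is an arithmetic (sign-preserving) shift, like Verilog's >>>
    return (x * SCALE) >> SHIFT

print(fixed_scale(1000))        # truncating result: 193
print(1000 * SCALE / 2 ** 16)   # exact float result: 193.786...
```

Note the truncation error in the last fractional bits—that is exactly the accuracy problem the next section deals with.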

2. Drop the Heavy Adders: The Hardware Hacker's Rounding Technique

A plain right shift (>>) is a truncation (floor) operation: it mercilessly discards everything below the binary point. As this error accumulates, the AI model's accuracy degrades severely. Therefore, we must implement rounding (round half up).

If you code with a software mindset, you usually end up with this:

// Wastes hardware resources (Bad)
shifted = (scaled + 64'sd32768) >>> 16;

This is the textbook method: adding 32768 (half of 2^16) beforehand to force a carry. But from a hardware perspective, this code is terrible. To perform this single addition, a massive 64-bit adder is synthesized, which consumes area and degrades your critical-path timing.

True Hardware Optimization Code:

// Optimizes both Area and Timing (Good)
shifted = (scaled >>> 16) + scaled[15];

When we divide by 2^16, the most significant bit of the discarded fractional part (the bit representing 0.5) is precisely bit 15 (scaled[15]). If this bit is 1, the fractional part is >= 0.5.

Therefore, instead of burning a wide adder on the 32768 constant, we perform the shift first and simply add scaled[15] to the LSB of the integer result, almost like a carry-in. The rounding result is bit-for-bit identical, but the synthesis tool handles it cheaply using the DSP slice's internal routing instead of creating an extra standalone adder. This is the true coding skill of a hardware engineer.
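The equivalence claim is easy to verify with a brute-force check in Python (a quick software model; Python's >> on negative integers behaves like Verilog's arithmetic >>>):

```python
def round_textbook(x: int) -> int:
    # Add 0.5 in Q16 (32768 = 2**15) before shifting: needs a wide adder in HW.
    return (x + 32768) >> 16

def round_carry_in(x: int) -> int:
    # Shift first, then add bit 15 of the discarded fraction as a carry-in.
    return (x >> 16) + ((x >> 15) & 1)

# Exhaustive check over a window of positive and negative values:
assert all(round_textbook(x) == round_carry_in(x)
           for x in range(-(1 << 20), 1 << 20))
print("bit-for-bit identical")
```

The reason it always holds: writing x = 65536*q + r with 0 <= r < 65536, both forms reduce to q + (1 if r >= 32768 else 0).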

3. Perfect DSP48E2 Mapping

Out of fear of arithmetic overflow, it is a common habit to define variables generously as 64-bit (logic signed [63:0]).

However, the high-performance math block built into Xilinx FPGAs, the DSP48E2 slice, has a native output width of 48 bits. If you declare a variable as 64 bits, Vivado is forced to cascade two or more expensive DSP slices or waste numerous LUTs to construct a custom 64-bit calculator.

localparam logic signed [14:0] OUT_SCALE = 15'sd12700; 

function automatic logic [26:0] compute_out(input logic [32:0] acc_in);
    // Using 48-bit instead of 64-bit to map perfectly to DSP48E2 (Best)
    logic signed [47:0] scaled, shifted;
    begin
        // 33-bit * 15-bit = 48-bit (Fits perfectly and safely!)
        scaled  = $signed(acc_in) * OUT_SCALE;
        shifted = (scaled >>> 16) + scaled[15];
        
        compute_out = shifted[26:0];
    end
endfunction
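For reference, here is a hedged bit-accurate Python model of the function above (assuming acc_in holds a signed 33-bit accumulator value, as in the RTL; the final 27-bit slice is reproduced with a mask):

```python
OUT_SCALE = 12700   # same 15-bit constant as the RTL localparam

def compute_out_model(acc_in: int) -> int:
    scaled = acc_in * OUT_SCALE                       # 33b x 15b -> 48b product
    shifted = (scaled >> 16) + ((scaled >> 15) & 1)   # shift + carry-in rounding
    return shifted & ((1 << 27) - 1)                  # shifted[26:0]

print(compute_out_model(1000))   # 12,700,000 / 65536 = 193.79 -> rounds to 194
```

A reference model like this is handy as the golden model in a testbench: drive the RTL and the Python model with the same stimulus and compare outputs.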

Rigorously calculate the maximum mathematical bit-width of your incoming data (acc_in) and your scaling constant (OUT_SCALE). If the maximum possible value fits within 48 bits, you must size your variables to [47:0]. By doing this, the multiplication and addition (rounding) will be perfectly synthesized into exactly one DSP slice, drastically reducing power consumption. For low-power AI semiconductors fighting thermal issues, this is not an option; it is a necessity.
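The worst-case check described above takes only a few lines (again assuming a signed 33-bit acc_in and the 15-bit OUT_SCALE from the example):

```python
OUT_SCALE = 12700
ACC_MIN, ACC_MAX = -(1 << 32), (1 << 32) - 1   # signed 33-bit accumulator range
S48_MIN, S48_MAX = -(1 << 47), (1 << 47) - 1   # signed 48-bit DSP output range

prod_min = ACC_MIN * OUT_SCALE
prod_max = ACC_MAX * OUT_SCALE

assert S48_MIN <= prod_min and prod_max <= S48_MAX
print("worst-case product fits in one 48-bit DSP48E2 result")
```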

References: AMD
