Designing NPU (Neural Processing Unit) architectures at a low-power AI semiconductor startup constantly reminds me of a harsh truth: the world of software engineering is fundamentally different from hardware engineering. Python-based AI frameworks freely perform 32-bit floating-point (float32) arithmetic as if precision were free. However, implementing floating-point logic in a low-power edge FPGA environment, where thermal management is a matter of life and death, is practically 'hardware suicide' in terms of area and power consumption.
Ultimately, a core competency of the hardware engineer lies in how elegantly these heavy floating-point operations can be converted into fast, lightweight fixed-point / integer arithmetic. In this article, I will share three 'true hardware optimization' techniques I applied while designing an RTL MAC (Multiply-Accumulate) array.
1. Fixed-Point and Shifting
AI model quantization parameters are usually complex floats like 0.00379.... How do we multiply this in hardware? We use the Q-Format (Fixed-point) method: we multiply the float by a large number to eliminate the decimal point, perform the integer math, and then divide it back down to scale.
However, a hardware divider is vastly larger and slower than a multiplier. Therefore, we always scale by a power of two, 2^N (usually N = 16 or 32), so that the division back down can be replaced with a bit shift (>> 16) whose hardware cost converges to zero.
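As a quick cross-check (a Python sketch, not the RTL itself), here is the Q16 flow end to end. The scale value 0.00379 and the accumulator value are hypothetical examples, not taken from any real model:

```python
# Minimal Q16 fixed-point sketch (Python cross-check, not RTL).
# The scale 0.00379 is a hypothetical quantization parameter.
SHIFT = 16                                  # scale by 2^16

scale_f = 0.00379
scale_q16 = round(scale_f * (1 << SHIFT))   # float -> integer constant (done offline)

acc = 123456                                # example integer accumulator value
scaled = acc * scale_q16                    # pure integer multiply (cheap in hardware)
result = scaled >> SHIFT                    # divide by 2^16 via bit shift (cost ~0)

print(scale_q16, result)                    # -> 248 467
```

Note that `result` (467) slightly undershoots the true product `123456 * 0.00379 ≈ 467.9`: part of the gap is quantization of the scale itself, and part is the truncation of the shift, which is exactly what the next section addresses.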
2. Drop the Heavy Adders: The Hardware Hacker's Rounding Technique
A plain right shift (>>) is a truncation (floor) operation that mercilessly discards everything below the binary point. As this error accumulates, the model's accuracy degrades severely. Therefore, we must implement rounding (round half up).
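To see why this matters, here is an illustrative Python sketch (my own, not from the RTL): sweeping every possible 16-bit fractional pattern shows that truncation error is one-sided and accumulates, while round-half-up error stays centered near zero:

```python
# Illustrative sketch: accumulated error of truncation (>> 16) vs.
# round-half-up when scaling by 2^16, swept over every possible
# 16-bit fractional bit pattern.
SHIFT = 16
HALF = 1 << (SHIFT - 1)          # 32768, the bit representing 0.5

trunc_err = sum((v >> SHIFT) - v / (1 << SHIFT) for v in range(1 << SHIFT))
round_err = sum(((v + HALF) >> SHIFT) - v / (1 << SHIFT) for v in range(1 << SHIFT))

print(trunc_err, round_err)      # -> -32767.5 0.5
```

The truncation bias (-32767.5 over 65536 samples, i.e. -0.5 per operation on average) compounds with every MAC, while rounding nets out to almost nothing.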
If you code with a software mindset, you usually end up with this:
// Wastes hardware resources (Bad)
shifted = (scaled + 64'sd32768) >>> 16;

This is the textbook method when dividing by 2^16: adding 32768 (half of 2^16) beforehand to force a carry. But from a hardware perspective, this code is terrible. To perform this single addition, a massive 64-bit adder is synthesized, consuming area and degrading your critical-path timing.
True Hardware Optimization Code:
// Optimizes both Area and Timing (Good)
shifted = (scaled >>> 16) + scaled[15];

When we divide by 2^16, the most significant bit of the discarded fractional part (the bit representing 0.5) is precisely bit 15 (scaled[15]). If this bit is 1, the fractional part is >= 0.5.
Therefore, instead of burning a heavy adder on the 32768 constant, we perform the shift first and simply add scaled[15] to the LSB of the integer result, almost like a carry-in. This yields a mathematically identical rounding result, and the synthesis tool absorbs it into the DSP's internal routing without creating an extra standalone adder. This is the true coding skill of a hardware engineer.
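A quick Python cross-check (a sketch, not the RTL) confirms the two formulations are bit-identical even for negative signed values, since Python's >> on negative integers is an arithmetic (floor) shift just like Verilog's >>>:

```python
# Cross-check: the "heavy adder" rounding and the carry-in style
# rounding produce identical results for signed inputs.
SHIFT = 16
HALF = 1 << (SHIFT - 1)                      # 32768

def round_with_adder(x: int) -> int:
    # Textbook: add 0.5 (32768) first, then shift. Needs a wide adder in HW.
    return (x + HALF) >> SHIFT

def round_with_carry_in(x: int) -> int:
    # Shift first, then add bit 15 of the discarded fraction as a carry-in.
    return (x >> SHIFT) + ((x >> (SHIFT - 1)) & 1)

# Sweep positive and negative values; every pair must match.
for x in range(-(1 << 20), 1 << 20, 997):
    assert round_with_adder(x) == round_with_carry_in(x)
print("bit-identical")
```

The equivalence is easy to see algebraically: writing x = q * 2^16 + r with 0 <= r < 2^16, both expressions equal q when r < 32768 and q + 1 otherwise.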
3. Perfect DSP48E2 Mapping
Out of fear of arithmetic overflow, it is a common habit to define variables generously as 64-bit (logic signed [63:0]).
However, the high-performance math block built into Xilinx FPGAs, the DSP48E2 slice, has a native output width of 48 bits. If you declare a variable as 64 bits, Vivado is forced to cascade two or more expensive DSP slices or waste numerous LUTs to construct a custom 64-bit calculator.
localparam logic signed [14:0] OUT_SCALE = 15'sd12700;
function automatic logic [26:0] compute_out(input logic [32:0] acc_in);
// Using 48-bit instead of 64-bit to map perfectly to DSP48E2 (Best)
logic signed [47:0] scaled, shifted;
begin
// 33-bit * 15-bit = 48-bit (Fits perfectly and safely!)
scaled = $signed(acc_in) * OUT_SCALE;
shifted = (scaled >>> 16) + scaled[15];
compute_out = shifted[26:0];
end
endfunction

Rigorously calculate the maximum mathematical bit-width of your incoming data (acc_in) and your scaling constant (OUT_SCALE). If the maximum possible value fits within 48 bits, size your variables as [47:0]. By doing this, the multiplication and the rounding addition will synthesize into exactly one DSP slice, drastically reducing power consumption. For low-power AI semiconductors fighting thermal issues, this is not an option; it is a necessity.
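For simulation cross-checks, a bit-accurate software reference model of compute_out is handy. Below is a hypothetical Python sketch mirroring the RTL widths (33-bit signed input, 48-bit intermediate, 27-bit output); the helper to_signed is my own, not part of the RTL:

```python
# Hypothetical bit-accurate reference model of compute_out (a sketch
# for testbench cross-checks, not the RTL itself). Widths mirror the
# RTL: 33-bit input, 48-bit intermediate, 27-bit output.
OUT_SCALE = 12700                            # 15-bit signed constant from the RTL
SHIFT = 16

def to_signed(value: int, bits: int) -> int:
    """Reinterpret the low `bits` bits of value as two's complement."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def compute_out(acc_in: int) -> int:
    acc = to_signed(acc_in, 33)              # $signed(acc_in)
    scaled = acc * OUT_SCALE                 # 33b * 15b fits within 48b
    shifted = (scaled >> SHIFT) + ((scaled >> (SHIFT - 1)) & 1)   # carry-in rounding
    return shifted & ((1 << 27) - 1)         # shifted[26:0] bit-slice

print(compute_out(1 << 16))                  # acc_in = 65536 -> 12700
```

Driving the RTL and this model with the same random acc_in vectors and comparing outputs is a cheap way to catch width or sign-extension mistakes before synthesis.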
References: AMD