[RTL] RTL Arithmetic: Bit Extension, Saturation Operations, and Rounding

When designing RTL code, it's common to declare wire [7:0] a, b, c and then carelessly write assign c = a + b; This is not a syntax error, but it is a common source of "silent bugs": the carry out of the 8-bit sum is dropped without any warning, so the result is wrong whenever a + b exceeds 255.

Handling numbers in digital circuits goes beyond simple mathematical calculations; it means physically managing bit width and data types (signed/unsigned).

In this article, we will cover three arithmetic topics that RTL engineers must master: bit-width extension rules, the pitfalls of signed operations, and the saturation and rounding techniques that are key to DSP design.

1. Bit width does not increase automatically (Bit Width Expansion)

Hardware can only store values as large as a given register allows. If the result of an operation exceeds the capacity of the register, an overflow occurs: the upper bits are lost and the stored data is corrupted.

Addition / Subtraction

When adding two N-bit numbers, the result requires at most N+1 bits.

  • Rule: Sum_Width = Max(A_Width, B_Width) + 1
  • Example: 8 bits (255) + 8 bits (255) = 510 (9 bits required)
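
As a minimal sketch of the addition rule, the module below computes the same sum into an 8-bit and a 9-bit wire. The module name add_width_demo and its port names are hypothetical, chosen only for illustration.

```verilog
// Illustrative only: add_width_demo and its ports are hypothetical names.
module add_width_demo (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output wire [7:0] sum_narrow,  // same width as the inputs: carry-out is lost
    output wire [8:0] sum_full     // Max(8, 8) + 1 = 9 bits: cannot overflow
);
    // The operands are sized to the width of the assignment target,
    // so only the 9-bit version keeps the carry.
    assign sum_narrow = a + b;  // 255 + 255 -> 8'hFE  (254, wrapped)
    assign sum_full   = a + b;  // 255 + 255 -> 9'h1FE (510, correct)
endmodule
```

Because Verilog widens the operands to match the left-hand side before adding, declaring the result one bit wider is enough; no explicit concatenation is needed in this simple case.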

Multiplication

If we multiply N bits by M bits, the result will require N+M bits.

  • Rule: Prod_Width = A_Width + B_Width
  • Example: 8 bits (255) * 8 bits (255) = 65,025 (16 bits required)
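
The multiplication rule can be sketched the same way. The module name mul_width_demo and its ports are again hypothetical.

```verilog
// Illustrative only: mul_width_demo and its ports are hypothetical names.
module mul_width_demo (
    input  wire [7:0]  a,
    input  wire [7:0]  b,
    output wire [15:0] prod  // A_Width + B_Width = 8 + 8 = 16 bits
);
    // The full 16-bit product is kept because the target is 16 bits wide.
    assign prod = a * b;     // 255 * 255 -> 16'hFE01 (65,025)
endmodule
```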

Tip: When designing RTL, you should get into the habit of calculating the width of the resulting wire in advance and declaring it generously.

2. ‘Signed’ and ‘Unsigned’

Some of the most damaging bugs in Verilog come from mixing signed and unsigned operands.

2's Complement and its Pitfalls

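According to the Verilog standard, if an expression contains even a single unsigned operand, the entire expression is evaluated as unsigned: signed operands are reinterpreted as unsigned and are zero-extended rather than sign-extended. In effect, this is an implicit cast.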

reg  signed [3:0] a = -2;  // 1110 (negative)
reg         [3:0] b =  1;  // 0001 (positive); a plain reg is unsigned by default
wire signed [4:0] result;

assign result = a + b;
// Expected value: -1
// Actual value: b is unsigned, so the whole expression is unsigned.
// a is zero-extended to 01110 (14) -> 14 + 1 = 15 (fatal error!)

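Solution: Explicitly declare all operands as signed, or convert them with the $signed() system function. Note that $signed() only reinterprets the existing bits, so an unsigned operand whose MSB may be set should first be zero-extended by one bit, as in $signed({1'b0, b}).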
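
A small simulation sketch of the fix, under the assumption that b must keep its full unsigned range. The module name signed_add_demo and the wire names wrong and fixed are hypothetical.

```verilog
// Illustrative only: signed_add_demo, wrong and fixed are hypothetical names.
module signed_add_demo;
    reg  signed [3:0] a = -2;  // 4'b1110
    reg         [3:0] b =  1;  // 4'b0001, unsigned by default

    // Mixed expression: evaluated as unsigned, a is zero-extended.
    wire signed [4:0] wrong = a + b;

    // b is zero-extended by one bit and then marked signed, so every
    // operand is signed and a is sign-extended as intended.
    wire signed [4:0] fixed = a + $signed({1'b0, b});

    initial begin
        #1;  // let the continuous assignments settle
        $display("wrong=%0d fixed=%0d", wrong, fixed);  // wrong=15 fixed=-1
    end
endmodule
```

Prepending the zero bit before calling $signed() matters for larger values: a bare $signed(b) would turn a 4-bit value of 8 or more into a negative number.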

3. Implementing Saturation Logic

In audio or video signal processing, overflow shows up as noise, because a value that should sit at full scale suddenly jumps to the opposite end of the range.

  • Wrap-around (normal behavior): If the maximum value is exceeded, the result wraps around to 0 (e.g. 255 + 1 = 0 in 8-bit unsigned).
  • Saturation: The result is clipped to the maximum value when it would exceed it (e.g. 255 + 1 = 255).

In RTL, the operation is computed with one extra bit. For unsigned operands, that extra most significant bit (the overflow bit) signals overflow, and a MUX then selects the maximum value instead of the wrapped result.
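
A minimal sketch of this idea for 8-bit unsigned addition. The module name sat_add_u8 and its ports are hypothetical, and signed saturation (which must also clip at the minimum value) is not covered here.

```verilog
// Illustrative only: sat_add_u8 and its ports are hypothetical names.
module sat_add_u8 (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output wire [7:0] y
);
    // 9-bit sum: bit 8 is the carry, i.e. the overflow flag.
    wire [8:0] sum = a + b;

    // MUX: clip to the maximum on overflow, otherwise pass the sum through.
    assign y = sum[8] ? 8'hFF : sum[7:0];
endmodule
```

With this module, 255 + 1 yields 255 instead of wrapping to 0, while sums that fit in 8 bits pass through unchanged.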

4. Practical Techniques: Implementing Rounding Efficiently

When performing division or truncation, simply discarding the lower-order bits always rounds toward the floor, which biases the result downward whenever the discarded bits are non-zero. To compensate for this, rounding (Round Half Up) is used.

In particular, when dividing by 2 (e.g., finding the average), there is a very efficient practical pattern that uses the least significant bit (LSB) and needs no complex extra logic.

Principle: If the fractional part is 0.5, round up.

In binary, the LSB (least significant bit) of the original value corresponds to the 0.5 place of the result after dividing by 2.

  • Shifting 1 bit to the right (>>1) means dividing by 2 (Integer Div).
  • If the LSB that disappears at this time is 1, it means that the exact result was X.5.
  • Therefore, adding the LSB to the quotient naturally rounds it up.

Practical Code: Finding the Mean (Applying Rounding)

module calc_average_round (
    input  wire [7:0] data_a,
    input  wire [7:0] data_b,
    output reg  [7:0] avg_out
);
    // 1. Calculate the sum (1 bit extension to prevent overflow)
    wire [8:0] sum;
    assign sum = data_a + data_b;

    // 2. Divide by 2 with Rounding
    // Principle: (Sum / 2) + (Sum % 2)
    // sum[8:1] : upper bits (quotient divided by 2)
    // sum[0] : least significant bit (remainder, i.e. whether it is 0.5)
    
    always @(*) begin
        // Add the remainder to the quotient.
        avg_out = sum[8:1] + sum[0];
    end

endmodule

Check the operation

  • When the sum is 10 (1010):
    • sum[8:1] = 101 (5)
    • sum[0] = 0
    • Result: 5 + 0 = 5 (10 / 2 = 5) -> correct
  • When the sum is 11 (1011):
    • sum[8:1] = 101 (5)
    • sum[0] = 1
    • Result: 5 + 1 = 6 (11 / 2 = 5.5 rounded up -> 6) -> correct

This method is very efficient in terms of area and timing because it implements rounding with just one adder and no separate comparator.
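
The two hand checks above can be extended to every input pair with a small self-checking testbench. This is a sketch that assumes the calc_average_round module from this article is compiled alongside it; the testbench name tb_calc_average_round and its variables are hypothetical.

```verilog
// Illustrative only: tb_calc_average_round and its variables are hypothetical.
// Requires the calc_average_round module shown earlier in this article.
module tb_calc_average_round;
    reg  [7:0] data_a, data_b;
    wire [7:0] avg_out;
    integer i, j, expected, errors;

    calc_average_round dut (
        .data_a (data_a),
        .data_b (data_b),
        .avg_out(avg_out)
    );

    initial begin
        errors = 0;
        for (i = 0; i < 256; i = i + 1) begin
            for (j = 0; j < 256; j = j + 1) begin
                data_a = i;
                data_b = j;
                #1;  // let the combinational logic settle
                // Round-half-up reference computed in 32-bit integers.
                expected = (i + j + 1) / 2;
                if (avg_out !== expected[7:0]) errors = errors + 1;
            end
        end
        $display("mismatches: %0d", errors);
        $finish;
    end
endmodule
```

Since floor(sum / 2) + (sum % 2) equals floor((sum + 1) / 2) for every non-negative sum, the run should report zero mismatches; the largest result, 255, still fits in the 8-bit output.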

5. Conclusion: Details Make Quality

Arithmetic operations are the most basic part of RTL design, but also the part where many mistakes occur.

  1. Bit Width: Increase the register size to accommodate larger computational results.
  2. Type: Never mix signed and unsigned.
  3. Refinement: Use saturation for filters and image processing, and LSB addition (rounding) for averaging.

References: IEEE Standard
