In the previous article, "3 Mappings of Conv Operations", we explored the heavy trade-off (e.g., Im2Col) of exchanging memory capacity for computation speed when mapping Convolution operations onto hardware.
While the primary workload of a CNN accelerator is concentrated in Convolution, two essential operations must accompany it for the functional completeness of the architecture: Pooling and Padding.
- padding=1: "Just fill the border with a line of zeros."
- MaxPool2d(2): "Pick the largest number out of this 2x2 grid."
To a software engineer, these are merely options. However, these simple tasks, which account for less than 1% of a model's total FLOPs, present hardware architects with the structural headaches of "Irregularity" and "Buffering."
In this article, we will uncover the hardware issues of Pooling and Padding—the culprits that quietly consume chip Area and complicate control logic behind the main MAC units.
1. Padding: How to Process '0', the Non-Existent Data
Zero Padding is a technique of filling the periphery of an image with zeros to maintain image size or preserve edge features. The question is, "Where do we get these zeros from?"
Software Approach (Memory Waste)
The easiest way is to actually create a new image in memory (DRAM) with a border filled with zeros. However, this is a massive waste of bandwidth. Asking a hardware engineer to use expensive DRAM bandwidth to read meaningless "zero" data is unacceptable.
Hardware Approach (On-the-fly Generation)
Therefore, NPUs use an "On-the-fly (Real-time Generation)" method. Only the original image is stored in memory, and the input-port logic tracks coordinates to inject fake '0's as data is read (a behavioral sketch follows the list below). This requires complex Control Logic (an FSM: Finite State Machine):
- It must check every clock cycle whether the current pixel coordinate (x, y) is outside the image Boundary.
- If it is outside, it must Stall the memory read and instead inject a '0' value into the arithmetic unit via a Multiplexer (MUX).
- This boundary check logic, while seemingly simple, can become a primary cause of Timing issues when the chip operates at high speeds.
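Below is a minimal behavioral sketch in Python (not RTL) of this on-the-fly scheme. The function name `padded_stream` and the nested-list image format are illustrative assumptions; real hardware would realize the same boundary comparison and MUX selection as logic inside the read FSM.

```python
# Behavioral sketch only: models the boundary-check logic and the 2:1 MUX
# that selects between a memory read and the constant '0'.
def padded_stream(image, pad):
    """Yield the pixels of a zero-padded image without ever storing it.

    `image` is a list of rows. Coordinates walk the padded frame in
    raster-scan order; out-of-bounds (y, x) positions skip the memory
    read and inject a constant 0 instead.
    """
    h, w = len(image), len(image[0])
    for y in range(-pad, h + pad):               # padded frame, row by row
        for x in range(-pad, w + pad):
            inside = 0 <= y < h and 0 <= x < w   # boundary comparators
            yield image[y][x] if inside else 0   # MUX: memory vs. '0'

# A 2x2 image with pad=1 streams out as a 4x4 frame:
# list(padded_stream([[1, 2], [3, 4]], 1))
# -> [0, 0, 0, 0,  0, 1, 2, 0,  0, 3, 4, 0,  0, 0, 0, 0]
```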
2. Pooling: The Dilemma of Streaming Data and Line Buffers
Max Pooling (2x2) is an operation that selects the maximum value among 4 pixels. It looks like a very lightweight operation, implementable with just a few Comparators. However, the real problem lies in the "Order of Data Arrival."
Hardware reads an image not as a whole, but Row-by-Row, like a TV scan line (Raster Scan order).
The Necessity of Line Buffers
To perform pooling with a 2 * 2 window, data from the first row (Row N) and the second row (Row N+1) are needed simultaneously. However, since data arrives one row at a time, the hardware must store the entire first row somewhere and wait until the second row arrives. The memory required for this is called a Line Buffer.
The wider the image (e.g., a 4K frame), the larger the Line Buffer must be.
- Cost Analysis: Just to perform a few comparison operations, we must allocate KB to MB of SRAM to store an entire line of the image. This is a significant overhead in terms of Chip Area (see the sketch below).
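Here is a minimal streaming sketch in Python, assuming raster-scan input and an even image width. The `deque` stands in for the SRAM line buffer and the `max()` calls for the comparators; the function name `maxpool2x2_stream` is an illustrative choice, not a standard API.

```python
from collections import deque

def maxpool2x2_stream(pixels, width):
    """Streaming 2x2 max pooling over raster-scan input.

    Even rows are parked in the line buffer (one full image row of
    SRAM); outputs only appear while the odd row streams in.
    """
    line_buf = deque(maxlen=width)  # the SRAM line buffer: one full row
    col, row = 0, 0
    left_max = 0                    # partial max of the window's left column
    for p in pixels:
        if row % 2 == 0:
            line_buf.append(p)                 # buffer Row N, emit nothing
        else:
            above = line_buf.popleft()         # matching pixel from Row N
            if col % 2 == 0:
                left_max = max(p, above)       # 1 comparator
            else:
                yield max(left_max, p, above)  # 2 more comparators
        col += 1
        if col == width:
            col, row = 0, row + 1

# Example: a 4x2 image -> two pooled outputs.
# list(maxpool2x2_stream([1, 2, 3, 4, 5, 6, 7, 8], width=4)) -> [6, 8]
```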
3. Synchronization and Pipeline Bubbles
Structures where a Pooling layer immediately follows a Convolution layer (Conv-Pool) are very common. Here, a Rate Mismatch problem occurs.
- Conv Output: Spits out a pixel every clock cycle (assuming Stride=1).
- Pool (2x2) Input: Waits until 2 rows are collected, then groups 4 pixels to spit out 1 result.
The Pooling unit must remain Idle while waiting for data, and then process it instantly once collected. This process creates Bubbles where the pipeline flow is interrupted. To prevent this, an additional FIFO (First-In-First-Out) buffer is needed between Conv and Pool.
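A toy timeline makes this burstiness concrete. The sketch below assumes Conv delivers one pixel per cycle in raster order and that the 2x2 Pool (fed by its line buffer) can only complete a window on odd rows and odd columns; the zeros are the bubbles a FIFO must absorb so the downstream stage sees a steadier flow.

```python
def pool_output_timeline(width, rows):
    """Mark the cycles (1) on which a 2x2 Pool can emit a result,
    assuming one Conv pixel arrives per cycle in raster-scan order."""
    timeline = []
    for r in range(rows):
        for c in range(width):
            emits = (r % 2 == 1) and (c % 2 == 1)  # window completes here
            timeline.append(1 if emits else 0)     # 0 = pipeline bubble
    return timeline

# For a 4-wide image: one result per 4 input cycles on average,
# but bursty, never uniform.
# pool_output_timeline(4, 2) -> [0, 0, 0, 0, 0, 1, 0, 1]
```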
Ultimately, Pooling, which was supposed to be a "simple operation," becomes a rather heavy module accompanied by Line Buffers (SRAM) + FIFOs + Complex Control Logic.
4. The Trap of Global Average Pooling (GAP)
Global Average Pooling, used at the end of ResNet or MobileNet, is even more severe. It must average over an entire feature map of 7x7 or larger for each channel.
This means holding the Accumulator value until the entire image is finished. In a streaming architecture, GAP becomes a Latency Bottleneck where the next result cannot be output until all data is received.
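A minimal single-channel sketch of this behavior, assuming pixels arrive one per cycle in raster order: the accumulator register must stay live across the whole stream, and the one output can only leave the unit after the final pixel.

```python
def gap_stream(pixels, num_pixels):
    """Streaming Global Average Pooling for one channel.

    No partial result is usable: the sum is only meaningful once every
    pixel has arrived, which is the latency bottleneck described above.
    """
    acc = 0
    for p in pixels:            # one pixel per cycle, raster-scan order
        acc += p                # accumulator held live the entire time
    return acc / num_pixels     # single result, emitted at the very end

# For a 7x7 feature map: 49 cycles of accumulation before the 1 output.
# gap_stream(range(49), 49) -> 24.0
```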
5. Conclusion
From a hardware architecture perspective, "Computational Simplicity" and "Implementation Simplicity" are distinct. While Padding and Pooling are kindergarten-level math, for hardware that must stream data in real-time, they are obstacles that disrupt data flow and force buffering.
The recent trend of gradually eliminating Pooling layers, whether in Transformers or by replacing them with Stride=2 Convolutions, is not unrelated to this Hardware Efficiency, in addition to accuracy considerations.