In the previous post, we confirmed how inefficient the MLP (Fully Connected Layer) is from a hardware perspective. Because each weight is fetched once, used exactly once, and then discarded, performance runs into the Memory Wall: it is limited by memory bandwidth rather than by compute.
However, the real protagonist that allowed deep learning to change the world was not the MLP, but the CNN (Convolutional Neural Network). While algorithm researchers praise CNNs for "capturing spatial features of images well," Hardware Architects like us love CNNs for a completely different reason.
That reason is "Locality" and "Reuse." In this article, we will uncover the physical reasons why the Sliding Window method of CNNs maximizes the efficiency of the SRAM (On-chip Buffer) inside semiconductor chips and why NPUs can only unleash their full performance (TOPS) when running CNNs.
1. Locality: The Reason for Cache Memory
One of the most important concepts in Computer Architecture theory is the Locality of Reference.
- Temporal Locality: If data was referenced recently, it is likely to be referenced again soon.
- Spatial Locality: If data was referenced, data located near it is likely to be referenced soon.
Caches and SRAM buffers inside CPUs, GPUs, and NPUs are built on this principle. The idea is to keep frequently used data in fast on-chip SRAM instead of going all the way to slow, power-hungry DRAM. An MLP gets little benefit from this, because each weight is used only once. A CNN, in contrast, is the ultimate champion of the locality principle.
2. Sliding Window: Data Reuse
Let's look at the core operation of CNN, Convolution, from a hardware perspective. A 3 * 3 filter (kernel) slides one step at a time over a large input image, computing a multiply-accumulate over the 9 pixels under it at every position. Tremendous Data Reuse occurs during this process.
A. Weight Reuse
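In MLP, a weight is multiplied by one input and is then done. But in CNN, a single filter (Weight set) sweeps from the top-left to the bottom-right of the image.
If the input image is 224 * 224 (stride 1, with padding so that the output is also 224 * 224), each of the 3 * 3 filter weights is reused a staggering 50,176 times (224 * 224). Without padding the output shrinks to 222 * 222, which is still 49,284 uses per weight.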
- DRAM Access: 1 time (Filter Load)
- Operation (MAC): Over 50,000 times per weight (about 450,000 MACs for the full 3 * 3 filter)
- Result: Arithmetic Intensity explodes.
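The reuse count above can be checked with a short calculation. This is a minimal sketch, assuming a single input channel, a single filter, and stride 1; the function names conv_output_size and weight_reuse are ours, chosen for illustration.

```python
def conv_output_size(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> int:
    """Standard output-size formula for one spatial dimension."""
    return (in_size + 2 * padding - kernel) // stride + 1


def weight_reuse(in_size: int, kernel: int, stride: int = 1, padding: int = 0) -> dict:
    """Count how often each weight of one filter is used on a square image."""
    out = conv_output_size(in_size, kernel, stride, padding)
    positions = out * out               # window positions = uses of each weight
    macs = positions * kernel * kernel  # total MACs for one filter, one channel
    return {"output_size": out, "uses_per_weight": positions, "total_macs": macs}


same = weight_reuse(224, 3, stride=1, padding=1)   # padded: output stays 224 * 224
valid = weight_reuse(224, 3, stride=1, padding=0)  # unpadded: output is 222 * 222

print(same)
print(valid)
```

The 9 weights are loaded once, yet they feed roughly 450,000 MACs. That ratio of work to traffic is what "Arithmetic Intensity explodes" refers to.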
B. Input Reuse
The Sliding Window moves sideways by one step. At this point, the previous window and the current window share (overlap) most pixels.
When a 3 * 3 window moves one step horizontally (stride 1), 6 out of the 9 pixels are identical to the previous step. This means input data doesn't need to be fetched from DRAM every time; it can be temporarily stored in on-chip registers or a Line Buffer and reused continuously.
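To make the overlap concrete, here is a small sketch that counts pixel fetches with and without on-chip reuse. It assumes stride 1 and no padding. The set named seen is an idealized stand-in for a buffer that never evicts a pixel; a real Line Buffer only holds a few rows at a time, so treat the "buffered" number as a best case. All names are hypothetical.

```python
def window_coords(row: int, col: int, k: int = 3) -> set:
    """Pixel coordinates covered by a k x k window whose top-left corner is (row, col)."""
    return {(row + dr, col + dc) for dr in range(k) for dc in range(k)}


def pixel_fetches(height: int, width: int, k: int = 3) -> tuple:
    """Compare pixel fetches with and without on-chip reuse (stride 1, no padding)."""
    naive = 0
    seen = set()  # idealized buffer: once a pixel is on chip, it is never refetched
    for r in range(height - k + 1):
        for c in range(width - k + 1):
            coords = window_coords(r, c, k)
            naive += len(coords)  # no reuse: refetch all k*k pixels for every window
            seen |= coords        # with reuse: each pixel is fetched at most once
    return naive, len(seen)


shared = window_coords(0, 0) & window_coords(0, 1)
print(len(shared))  # pixels shared by two horizontally adjacent 3 * 3 windows

naive, buffered = pixel_fetches(32, 32)
print(naive, buffered, round(naive / buffered, 2))
```

On a 32 * 32 image the naive scheme fetches 8,100 pixels while the buffered one fetches 1,024, close to the 9x upper bound for a 3 * 3 window.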
3. SRAM Efficiency
Thanks to these reuse characteristics, CNN-specific NPUs (Accelerators) can adopt the following memory hierarchy strategy:
- Load: Fetch filters (Weights) and a part of the image (Input Row) from DRAM to the Global Buffer (Large SRAM) inside the chip.
- Multicast: Distribute the data in the Global Buffer to hundreds of Processing Elements (PEs).
- Compute & Reuse: Each PE stores data in its local Register File (RF) and performs thousands of multiplications without even looking at the DRAM.
This is the secret behind the high performance of NPUs. They minimize high-energy DRAM accesses and complete computations at the low-energy SRAM and Register levels.
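The Load, Multicast, and Compute & Reuse steps can be sketched as a toy model that tallies accesses at each level. This is only an illustration under strong assumptions: one PE, one filter, one input channel, stride 1, no padding, and input pixels read from the Global Buffer on every MAC. It does not reproduce the scheduling of any real NPU, and the function name is hypothetical.

```python
from collections import Counter


def run_conv_with_counters(image, weights):
    """Single-channel convolution that counts accesses per memory level."""
    k = len(weights)
    h, w = len(image), len(image[0])
    counts = Counter()

    # Load: weights and image go from DRAM to the Global Buffer exactly once.
    counts["dram"] += k * k + h * w
    global_buffer = {"weights": weights, "image": image}

    # Multicast: the filter is copied from the Global Buffer into the PE's register file.
    rf_weights = [row[:] for row in global_buffer["weights"]]
    counts["global_buffer"] += k * k

    out = []
    for r in range(h - k + 1):
        out_row = []
        for c in range(w - k + 1):
            acc = 0
            for i in range(k):
                for j in range(k):
                    # Compute & Reuse: the weight comes from the register file,
                    # the pixel from the Global Buffer. DRAM is never touched here.
                    counts["register_file"] += 1
                    counts["global_buffer"] += 1
                    acc += rf_weights[i][j] * global_buffer["image"][r + i][c + j]
                    counts["mac"] += 1
            out_row.append(acc)
        out.append(out_row)
    return out, counts


image = [[1] * 8 for _ in range(8)]
weights = [[1] * 3 for _ in range(3)]
out, counts = run_conv_with_counters(image, weights)
print(dict(counts))
```

Even in this tiny 8 * 8 case, 73 DRAM accesses support 324 MACs, and the gap widens as the image grows.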
Quantitative Fact:
Based on a 45nm process, the energy for a single DRAM access is about 640 pJ, while accessing a small on-chip SRAM (8KB) costs about 10 pJ. Thanks to the high reusability of CNNs, most accesses can be served at roughly 10 pJ instead of 640 pJ, and many of them further down at the register level (about 0.1 pJ). This is why CNNs are hardware-friendly.
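A back-of-the-envelope estimate shows what these figures mean for weight traffic. The sketch below uses only the per-access energies quoted above and assumes one 3 * 3 filter applied at 224 * 224 positions. It ignores input pixels, output writes, and the energy of the MAC itself, so it is an illustration rather than a full energy model.

```python
# Per-access energy figures quoted in the text (45nm), in picojoules.
ENERGY_PJ = {"dram": 640.0, "sram": 10.0, "register": 0.1}


def energy_pj(accesses: dict) -> float:
    """Total energy for a dict of {memory level: number of accesses}."""
    return sum(ENERGY_PJ[level] * n for level, n in accesses.items())


weights = 3 * 3
uses_per_weight = 224 * 224  # stride 1, padded output assumed

# No reuse: every single weight use goes back to DRAM.
no_reuse = energy_pj({"dram": weights * uses_per_weight})

# Full reuse: fetch each weight from DRAM once, stage it in SRAM once,
# then serve every use from the register file.
full_reuse = energy_pj({
    "dram": weights,
    "sram": weights,
    "register": weights * uses_per_weight,
})

print(f"no reuse:   {no_reuse / 1e6:.1f} uJ")
print(f"full reuse: {full_reuse / 1e6:.3f} uJ")
print(f"ratio:      {no_reuse / full_reuse:.0f}x")
```

Under these assumptions the weight-access energy drops by more than three orders of magnitude.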
4. Compute-Bound
In the last post, we described MLP as Memory-Bound: the arithmetic units sit idle because data does not arrive fast enough.
However, because CNNs have high data reuse rates, once data is fetched, the arithmetic units can chew on it for a long time. In other words, we enter the Compute-Bound domain where computation speed (TOPS) determines overall performance, not memory bandwidth.
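The boundary between the two regimes can be illustrated with the roofline model, where attainable performance is the smaller of peak compute and bandwidth times arithmetic intensity. The accelerator numbers and the two intensity values below are hypothetical, picked only to show both sides of the boundary.

```python
def attainable_tops(peak_tops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Roofline model: performance is capped by compute or by memory traffic."""
    # GB/s * ops/byte gives Gops/s; divide by 1000 to get Tops/s.
    memory_bound_tops = bandwidth_gbs * ops_per_byte / 1000.0
    return min(peak_tops, memory_bound_tops)


# Hypothetical accelerator, for illustration only.
PEAK_TOPS = 10.0
BANDWIDTH_GBS = 50.0

workloads = [("MLP-like (little reuse)", 2.0), ("CNN-like (heavy reuse)", 500.0)]
for name, intensity in workloads:
    tops = attainable_tops(PEAK_TOPS, BANDWIDTH_GBS, intensity)
    bound = "compute-bound" if tops >= PEAK_TOPS else "memory-bound"
    print(f"{name}: {tops:.2f} TOPS ({bound})")
```

With these made-up numbers the low-reuse workload reaches only 1% of peak, while the high-reuse workload hits the compute ceiling.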
From this point on, the architect's skill becomes crucial. "How do we keep thousands of multipliers (MACs) running at close to 100% utilization?" This concern leads directly to Dataflow Optimization and Mapping Strategies.
5. Conclusion: CNN is a Blessing for Hardware
In conclusion, hardware loves CNNs not just because they are "famous," but because they possess a structure with "High Arithmetic Intensity that allows massive amounts of computation with little memory bandwidth." With the advent of CNNs, AI semiconductors finally moved beyond being 'memory shuttles' to becoming true 'computational accelerators.'
However, nothing is perfect. When trying to map this efficient CNN operation to actual hardware, a new challenge begins: how to unroll the six nested loops of the convolution onto a fixed array of compute units.
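For reference, here is one common way to write the convolution as six nested loops. This is a sketch assuming batch size 1, stride 1, and no padding. The loop order and variable names are our own choice, and a different order is exactly what a different dataflow amounts to.

```python
def direct_conv(inputs, weights):
    """Direct convolution as six nested loops (stride 1, no padding, batch size 1).

    inputs:  [in_channels][height][width]
    weights: [out_channels][in_channels][k][k]
    """
    c_in, h, w = len(inputs), len(inputs[0]), len(inputs[0][0])
    c_out, k = len(weights), len(weights[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    out = [[[0] * out_w for _ in range(out_h)] for _ in range(c_out)]

    for m in range(c_out):                  # 1. output channels (filters)
        for y in range(out_h):              # 2. output rows
            for x in range(out_w):          # 3. output columns
                for c in range(c_in):       # 4. input channels
                    for i in range(k):      # 5. kernel rows
                        for j in range(k):  # 6. kernel columns
                            out[m][y][x] += weights[m][c][i][j] * inputs[c][y + i][x + j]
    return out


# Tiny example: 2 input channels of 4 * 4 ones, 3 filters of 3 * 3 ones.
inputs = [[[1] * 4 for _ in range(4)] for _ in range(2)]
weights = [[[[1] * 3 for _ in range(3)] for _ in range(2)] for _ in range(3)]
result = direct_conv(inputs, weights)
print(result)
```

Each output value here sums 2 * 3 * 3 = 18 products. Which of the six loops gets spread across PEs, and which stays sequential, is the mapping decision the hardware has to make.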
In the next post, we will explore the three core strategies for processing CNN operations: "Three Mappings of Conv Operations: Direct vs. Im2Col vs. Winograd."
References: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks