In the previous post, MLP and Memory Wall, we discussed the "memory wall" phenomenon, where memory bandwidth limits system performance. In CNN and Locality, we examined how CNNs elegantly address this problem through locality-friendly, sequential data processing.
ResNet (Residual Network), introduced in 2015, is considered one of the greatest inventions in deep learning history. It solved the problem of "deeper layers leading to worse training" with the very simple idea of Skip Connections (Y = F(X) + X). However, while software engineers cheered ResNet's elegance, hardware architects were left clutching their heads.
This seemingly simple addition (+X) shattered the tidy, sequential memory-management rules that hardware had relied on. Today, we will analyze how ResNet torments the Memory Buffers and Schedulers inside the chip.
1. Sequentiality
Models prior to ResNet (like AlexNet, VGG) had a very simple Chain Structure.
For hardware, this structure makes memory management incredibly easy.
- Write the output of Layer 1 to memory.
- Read it to compute Layer 2.
- The moment Layer 2's output starts being generated, Layer 1's data can be overwritten.
The Lifetime of the data is very short. This meant that with just two small on-chip buffers (SRAM) playing Ping-Pong, huge models could be run without issues.
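To make the ping-pong idea concrete, here is a minimal Python sketch; the toy layer functions (scale, relu) and the buffer shapes are invented for illustration and do not model any particular NPU runtime.

```python
import numpy as np

def scale(src, dst):
    np.multiply(src, 2.0, out=dst)      # toy "layer": read src, write dst

def relu(src, dst):
    np.maximum(src, 0.0, out=dst)       # toy activation layer

def run_chain(x, layers):
    # Two SRAM-like buffers: each layer reads one and writes the other.
    # After the swap, the old input buffer is immediately reused, so only
    # two buffers are ever live no matter how deep the chain is.
    buf_a, buf_b = x.copy(), np.empty_like(x)
    for layer in layers:
        layer(buf_a, buf_b)             # read buf_a, write buf_b
        buf_a, buf_b = buf_b, buf_a     # swap roles: old input becomes scratch
    return buf_a

x = np.random.randn(1, 8, 8, 16).astype(np.float32)
y = run_chain(x, [scale, relu, scale])
```

No matter how deep the chain is, only two activation buffers are ever alive at once; this is exactly the invariant the skip connection breaks in the next section.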
2. Skip Connection
But ResNet's Skip Connection (Shortcut) breaks this rule.
Input data X is sent off through the Conv path (a complex 3×3 convolution, etc.). At the same time, X must remain alive so it can be added to the result later. The problem is that while Conv(X) is being computed (latency), X has to be stored somewhere.
- The memory space for X cannot be deallocated until Conv(X) is finished.
- The Lifetime of data X is forcibly extended.
This causes X to occupy limited on-chip memory (SRAM) resources for a long time, leading to a shortage of buffer space for other operations.
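To see the lifetime extension explicitly, here is a toy liveness pass over a hypothetical op list; the miniature IR format and the op names are made up for this example, not taken from a real compiler.

```python
# Toy "IR": each op lists the tensors it reads and the tensor it writes.
CHAIN = [
    ("conv1", ["x"],  "t1"),
    ("conv2", ["t1"], "t2"),
    ("conv3", ["t2"], "y"),
]
RESNET_BLOCK = [
    ("conv1", ["x"],       "t1"),
    ("relu",  ["t1"],      "t2"),
    ("conv2", ["t2"],      "t3"),
    ("add",   ["t3", "x"], "y"),   # the skip path: x is read again here
]

def lifetimes(ops):
    """Return {tensor: (first_use, last_use)} as op indices."""
    live = {}
    for i, (_, reads, write) in enumerate(ops):
        for t in reads + [write]:
            first, _ = live.get(t, (i, i))
            live[t] = (first, i)
    return live

print(lifetimes(CHAIN))         # x dies immediately after op 0
print(lifetimes(RESNET_BLOCK))  # x stays live until the final add (op 3)
```

In the chain model every intermediate can be freed one step after it is produced, while in the residual block x's interval covers the whole block; that interval is the buffer pressure described above.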
3. The Dilemma of Memory Hierarchy: SRAM vs. DRAM
What if the operations inside the Residual Block are extensive, requiring X to be held for a long time, but the chip's internal SRAM capacity is insufficient? The architect is forced to kick X out to off-chip DRAM (Spill) and bring it back later (Fill).
- Read X: Read for Conv operation.
- Spill X: Save X to DRAM for later addition (if SRAM is full).
- Compute F(X): Perform convolution operations.
- Fill X: Read X back from DRAM for the addition (F(X)+X).
This process generates unnecessary DRAM Traffic (Bandwidth consumption).
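A back-of-the-envelope estimate of that extra traffic, using an illustrative FP16 activation of shape 56x56x256 (roughly an early ResNet-50 stage); the numbers are indicative only.

```python
def spill_fill_traffic(h, w, c, bytes_per_elem=2):
    """Extra DRAM traffic (bytes) from spilling X and filling it back:
    one write to DRAM plus one read back. Ignores bursts, compression,
    and any reuse the compiler might still manage to find."""
    tensor_bytes = h * w * c * bytes_per_elem
    return 2 * tensor_bytes  # spill (write) + fill (read)

# Illustrative FP16 activation, 56 x 56 x 256
extra = spill_fill_traffic(56, 56, 256)
print(f"{extra / 2**20:.2f} MiB")  # ~3.06 MiB of traffic a chain model would not need
```

If many residual blocks end up spilling, this traffic piles on top of the weight and activation traffic the model already needs.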
4. The Nightmare of Streaming Architecture: Synchronization
In Streaming Architecture or FPGA designs, where data flows through a pipeline rather than being processed in batches, an even bigger problem arises.
- Main Path: Conv -> ReLU -> Conv (slow, since it carries the heavy computation)
- Skip Path: Just a wire connection
The two data streams must meet at the final Adder, but their arrival times differ. Data X arriving via the Skip Path must wait until the Main Path's computation is finished.
For this, hardware requires additional FIFO (First-In-First-Out) Buffers to hold the data temporarily. The deeper the model and the larger the image resolution, the larger this FIFO grows, from Kilobytes (KB) to Megabytes (MB), eating away at the chip's area.
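Here is a rough sizing sketch; the latency, line width, and channel count are purely illustrative assumptions, not numbers from any specific FPGA or NPU design.

```python
def skip_fifo_bytes(main_path_latency_cycles, elems_per_cycle, bytes_per_elem=2):
    """Rough skip-path FIFO size for a streaming pipeline.

    While the main path (Conv -> ReLU -> Conv) is busy for
    main_path_latency_cycles, the skip path keeps delivering
    elems_per_cycle elements that must be parked in a FIFO until the
    adder can consume them. Real designs add margin for stalls.
    """
    return main_path_latency_cycles * elems_per_cycle * bytes_per_elem

# Illustrative: two 3x3 convs on a 224-pixel-wide image need a few lines
# of latency, say ~4 lines at 224 pixel positions per line (1 per cycle),
# with 64 FP16 channels arriving per pixel position.
latency_cycles = 4 * 224
print(skip_fifo_bytes(latency_cycles, elems_per_cycle=64) / 1024, "KiB")  # 112 KiB
```

Double the resolution or the channel count and the FIFO doubles with it, which is how these buffers creep from KB into MB territory.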
5. Element-wise Addition
Finally, the Element-wise Addition (F(X) + X) itself is a problem. We usually only count Multiplication (MAC) costs, but Addition is a typical Memory-Bound operation.
- Operation: 1 Addition
- Memory Access: 2 Reads (F(X), X), 1 Write (Y)
The Arithmetic Intensity is extremely low. Since ResNet repeats this Memory-Bound operation at every residual block, the NPU's computation units stall while waiting for the addition's operands to load.
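A quick calculation of that arithmetic intensity, assuming FP16 operands (the tensor size is an arbitrary example and cancels out anyway):

```python
def arithmetic_intensity_add(n_elems, bytes_per_elem=2):
    """FLOPs per byte for Y = F(X) + X over n_elems elements:
    1 addition per element, 2 reads (F(X), X) and 1 write (Y)."""
    flops = n_elems
    bytes_moved = 3 * n_elems * bytes_per_elem
    return flops / bytes_moved

print(arithmetic_intensity_add(56 * 56 * 256))  # ~0.17 FLOP/byte with FP16
# A 3x3 convolution reuses each loaded value many times and can reach tens
# of FLOPs per byte, so the adder step sits far below the compute roof and
# is almost always limited by memory bandwidth.
```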
6. Conclusion: The Cost of Flexibility, and the Next Step
ResNet's Skip Connections revolutionized deep learning accuracy, but they presented hardware engineers with the challenging homework of "Non-sequential Data Management." To solve this problem, modern NPU compilers employ sophisticated Memory Allocation algorithms, and hardware sometimes includes dedicated compressors for Skip Connection data.
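As one hedged illustration of what such an allocator has to juggle, here is a simplified lifetime-aware, first-fit SRAM planner in Python; it is a sketch only, and the lifetimes in the usage example mirror the toy residual block from section 2, not any real compiler's IR.

```python
def assign_offsets(lifetimes, sizes):
    """First-fit static SRAM planning from tensor lifetimes (simplified).

    Two tensors may share an address range only if their lifetime
    intervals (op indices) do not overlap. Tensors are placed in order
    of start time; the candidate offset is bumped past every placement
    that conflicts in both time and space.
    """
    placements = {}  # tensor -> (offset, size)
    for t in sorted(lifetimes, key=lambda k: lifetimes[k][0]):
        s0, e0 = lifetimes[t]
        offset, moved = 0, True
        while moved:
            moved = False
            for u, (off_u, sz_u) in placements.items():
                s1, e1 = lifetimes[u]
                time_overlap = not (e0 < s1 or e1 < s0)
                space_overlap = offset < off_u + sz_u and off_u < offset + sizes[t]
                if time_overlap and space_overlap:
                    offset, moved = off_u + sz_u, True
        placements[t] = (offset, sizes[t])
    return placements

# Lifetimes of the toy residual block above: x is live across the whole
# block, so it can never share space with the temporaries t1..t3.
life  = {"x": (0, 3), "t1": (0, 1), "t2": (1, 2), "t3": (2, 3), "y": (3, 3)}
sizes = {t: 1_605_632 for t in life}   # 56*56*256 FP16 activations, ~1.5 MiB each
print(assign_offsets(life, sizes))     # t3 reuses t1's slot; x reuses nothing
```

The long-lived skip tensor is exactly the allocation that refuses to fold into the ping-pong pattern, which is why compilers spend so much effort deciding where (SRAM or DRAM) it should live.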
This concludes the [Category 1. AI & HW Fundamentals] series. Throughout these 12 posts, we have encountered various bottlenecks:
- MLP: Slow due to lack of memory bandwidth (Memory-Bound).
- MobileNet: Slow due to complex structures leaving arithmetic units idle (Utilization Issue).
- ResNet: System stalls due to memory management and buffering requirements (Buffer Management).
So, if the NPU I designed (or am analyzing) is slow, whose fault is it? Is it the arithmetic units, or is it the memory?
In the upcoming [Category 2. NPU Design & Optimization] series, starting with the next post, we will explore the "Roofline Model." This is the ultimate analysis tool for System Architects, capable of diagnosing these complex bottlenecks with a single, clear graph.
References: K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image Recognition," 2015.