In previous posts, we learned about Quantization techniques to shave down data size to reduce hardware costs. So, why do we try so desperately to reduce data size?
The reason is that a 'monster that devours memory bandwidth' lives inside deep learning models. That monster is named MLP (Multi-Layer Perceptron), or the Fully Connected Layer (FC Layer).
Although it is the "Hello World" level structure learned first in software tutorials, for a Hardware Architect, the MLP is one of the most inefficient and troublesome layers. In this article, we will verify the semiconductor industry adage, "Compute is free, data movement is expensive," by analyzing the Memory Wall phenomenon and how MLPs degrade system performance.
1. Structural Problem of MLP
The definition of a Fully Connected Layer is that "every neuron in the previous layer is connected to every neuron in the next layer." Mathematically, this is Y = WX + B, a Matrix-Vector Multiplication (MVM).
This 'all-to-all connection' is the problem. Let's assume 1,000 input neurons and 1,000 output neurons.
- Required Operations: 1,000 * 1,000 = 1,000,000 (1M) MACs
- Required Parameters (Weights): 1,000 * 1,000 = 1,000,000 (1M) weights
You fetch 1 million parameters from memory, multiply each exactly once, and you're done. There is no reuse: a weight (Wij) is multiplied with exactly one input (Xj) and then discarded, never to be touched again during that inference.
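The lack of reuse is easy to see in a plain-loop version of Y = WX + B. This is only a minimal sketch: the function name `fc_forward` and the tiny 2x3 example are hypothetical, chosen for illustration rather than taken from any real framework.

```python
def fc_forward(W, x, b):
    """Compute y = W x + b with plain loops and count the MACs performed."""
    n_out, n_in = len(W), len(x)
    y = list(b)                      # start from the bias
    macs = 0
    for i in range(n_out):
        for j in range(n_in):
            y[i] += W[i][j] * x[j]   # W[i][j] is read here and nowhere else
            macs += 1
    return y, macs


# Hypothetical toy layer: 3 inputs, 2 outputs.
W = [[1, 2, 3],
     [4, 5, 6]]
x = [1, 0, -1]
b = [10, 20]

y, macs = fc_forward(W, x, b)
num_weights = len(W) * len(W[0])
print(y, macs, num_weights)
```

The MAC count always equals the number of weights, so each fetched weight pays for exactly one operation. Scaling the same loop to 1,000 x 1,000 gives the 1M MACs and 1M weights listed above.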
2. The Von Neumann Bottleneck and the Memory Wall
Most modern computers follow the Von Neumann Architecture. This structure separates the Processing Unit and the Memory Unit, connecting them via a Bus. The problem arises from the difference in the speed of technological advancement.
- Processor Speed: Compute throughput has grown by orders of magnitude over the past few decades, riding Moore's Law.
- Memory Speed (DRAM): Capacity has increased, but the rate of moving data (Bandwidth) and the access delay (Latency) have not kept pace.
Due to this gap, the processor screams for data, but the memory cannot supply it in time, leaving the processor idle. This is the Memory Wall.
MLPs collide head-on with this memory wall. Even if the Arithmetic Logic Unit (ALU) shouts, "I'm done calculating! Give me the next number!", the memory bus is clogged hauling weights, and the chip can end up delivering less than 10% of its peak throughput.
3. Arithmetic Intensity: The Metric of Efficiency
To quantify this phenomenon, we use the concept of Arithmetic Intensity: the number of operations performed per byte of data fetched from memory.
In short, it answers "How many operations (bang) do I get for every 1 byte of data fetched from memory (buck)?"
- Convolution Layer (CNN): A single filter is swept across the entire image, so each weight is reused thousands of times. -> High Arithmetic Intensity (Compute-Bound)
- FC Layer (MLP): Just take one weight, multiply it once, and done. -> Extremely Low Arithmetic Intensity (Memory-Bound)
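The contrast between the two layer types can be put into rough numbers. The sketch below makes simplifying assumptions that are not in the original text: it counts only weight traffic (activations are ignored), uses 1-byte INT8 weights, and assumes each convolution weight is fetched once and kept on-chip while the filter sweeps the image. The function names and layer sizes are hypothetical.

```python
def fc_intensity(n_in, n_out, bytes_per_weight=1):
    """MACs per byte of weights fetched for an FC layer at batch size 1."""
    macs = n_in * n_out
    weight_bytes = n_in * n_out * bytes_per_weight
    return macs / weight_bytes


def conv_intensity(out_h, out_w, k, c_in, c_out, bytes_per_weight=1):
    """MACs per byte of weights fetched for a k x k convolution layer."""
    # Every weight is applied once per output pixel.
    macs = out_h * out_w * k * k * c_in * c_out
    weight_bytes = k * k * c_in * c_out * bytes_per_weight
    return macs / weight_bytes


# Hypothetical layer sizes, chosen only for illustration.
fc = fc_intensity(1000, 1000)
conv = conv_intensity(out_h=56, out_w=56, k=3, c_in=64, c_out=64)
print(f"FC:   {fc:.0f} MAC per weight byte")
print(f"Conv: {conv:.0f} MACs per weight byte")
```

Under these assumptions the convolution's reuse factor is simply the number of output pixels (56 x 56 here), while the FC layer stays stuck at one MAC per weight.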
Even if an NPU can perform 100 trillion operations per second (100 TOPS), when running an MLP the delivered performance might drop to a disastrous 1 to 2 TOPS because memory bandwidth, not compute, sets the limit.
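A drop of this size follows from the roofline model, which caps attainable throughput at the smaller of peak compute and (arithmetic intensity x memory bandwidth). The figures in this sketch are assumptions for illustration, not measurements: 1 TB/s (or 0.5 TB/s) of DRAM bandwidth, INT8 weights, and one MAC counted as 2 operations, which gives 2 ops per weight byte at batch size 1.

```python
def attainable_tops(peak_tops, bandwidth_tb_per_s, ops_per_byte):
    """Roofline estimate of attainable throughput in TOPS."""
    # (tera-bytes / s) * (ops / byte) = tera-ops / s
    memory_bound_tops = bandwidth_tb_per_s * ops_per_byte
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 100.0          # from the text
OPS_PER_BYTE_FC = 2.0      # assumption: INT8 weights, 1 MAC = 2 ops, batch size 1

for bandwidth in (0.5, 1.0):   # assumed DRAM bandwidths in TB/s
    tops = attainable_tops(PEAK_TOPS, bandwidth, OPS_PER_BYTE_FC)
    utilization = 100.0 * tops / PEAK_TOPS
    print(f"{bandwidth} TB/s -> {tops} TOPS ({utilization:.0f}% of peak)")
```

With these assumed numbers the chip lands at 1 to 2 TOPS, which is 1 to 2 percent of its peak. A layer with enough reuse (a high ops-per-byte value) would instead hit the 100 TOPS compute ceiling.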
4. Solution 1: Batch Processing
The easiest way to solve this is to increase the Batch Size.
Instead of processing just one input vector ($X$) at a time, you collect 100 of them and process them together.
- Batch Size = 1: Loading weight W -> 1 operation -> End
- Batch Size = 100: Loading weight W -> 100 operations (reused for inputs 1-100) -> End
This changes Matrix-Vector Multiplication (MVM) into Matrix-Matrix Multiplication (MMM). You can load weight W into the cache (SRAM) once and reuse it 100 times, increasing Arithmetic Intensity.
(However, as discussed in previous posts, there is a dilemma that increasing batch size is difficult in real-time inference due to Latency constraints.)
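How much batching helps can be estimated with a short sketch. It assumes the whole weight matrix fits in on-chip SRAM so it is fetched from DRAM once per batch, that every value is 1 byte (INT8), and that each input and output activation crosses the memory bus once. The function name and these assumptions are illustrative, not from the original post.

```python
def batched_fc_intensity(n_in, n_out, batch, bytes_per_value=1):
    """MACs per byte moved for an FC layer processing `batch` inputs at once."""
    macs = batch * n_in * n_out
    weight_values = n_in * n_out                  # fetched once, reused per input
    activation_values = batch * (n_in + n_out)    # inputs read, outputs written
    bytes_moved = (weight_values + activation_values) * bytes_per_value
    return macs / bytes_moved


for batch in (1, 10, 100):
    intensity = batched_fc_intensity(1000, 1000, batch)
    print(f"batch={batch:>3}: {intensity:.1f} MACs per byte")
```

The intensity grows almost linearly with batch size at first, then flattens as activation traffic starts to matter, which is why batching is a lever with diminishing returns rather than a free cure.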
5. Solution 2: Massive On-chip SRAM
What if you can't increase the batch size? Then keep the weights inside the chip. DRAM is slow and power-hungry, but SRAM inside the chip is extremely fast.
Recent AI semiconductor startups (Groq, Graphcore, etc.) and Tesla's FSD chip pack tens to hundreds of MB of SRAM onto the chip. This is an attempt to keep MLP weights resident inside the chip, so they never have to cross the Memory Wall during inference. Of course, this has the downside of increasing chip area and cost.
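Whether a model's weights can stay resident comes down to simple capacity arithmetic. The sketch below is a back-of-the-envelope check under stated assumptions: the layer shapes and the 256 MiB SRAM budget are hypothetical and do not describe any specific chip, and space for activations or code is ignored.

```python
MIB = 1024 * 1024


def weights_fit_on_chip(layer_shapes, sram_bytes, bytes_per_weight=1):
    """Return (total weight bytes, whether they fit in the SRAM budget)."""
    total_weights = sum(n_in * n_out for n_in, n_out in layer_shapes)
    total_bytes = total_weights * bytes_per_weight
    return total_bytes, total_bytes <= sram_bytes


# Hypothetical model: eight 4096 x 4096 FC layers.
layers = [(4096, 4096)] * 8
sram_budget = 256 * MIB    # assumed on-chip SRAM capacity

for name, width in (("INT8", 1), ("FP16", 2), ("FP32", 4)):
    total_bytes, fits = weights_fit_on_chip(layers, sram_budget, width)
    print(f"{name}: {total_bytes // MIB} MiB, fits on chip: {fits}")
```

This is also where quantization pays off a second time: halving the bytes per weight halves the SRAM needed to keep the same model resident.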
6. Conclusion: Why CNNs Were Loved
The reason CNNs (Convolutional Neural Networks) became mainstream in image processing over pure MLP models wasn't just accuracy, but also hardware efficiency.
MLPs, lacking Locality and Reuse, are 'money pits' for hardware. However, recent Transformer (GPT) models once again rely on massive matrix operations (attention and feed-forward blocks built from FC Layers), so Hardware Architects are once again waging war on the memory wall, this time by stacking HBM (High Bandwidth Memory).
In the next post, we will explore a structure with opposite characteristics to MLPs, a favorite of hardware engineers: "CNN and Locality: Maximizing Hardware On-chip Buffer Efficiency."