AI Architecture 7. MLP and the Memory Wall

In previous posts, we learned about Quantization techniques to shave down data size to reduce hardware costs. So, why do we try so desperately to reduce data size?

The reason is that a 'monster that devours memory bandwidth' lives inside deep learning models. That monster is named MLP (Multi-Layer Perceptron), or the Fully Connected Layer (FC Layer).

Although it is the "Hello World" structure that software tutorials teach first, for a Hardware Architect the MLP is one of the most inefficient and troublesome layers. In this article, we will test the semiconductor industry adage, "Compute is free, data movement is expensive," by analyzing the Memory Wall phenomenon and how MLPs degrade system performance.

Related articles

✅AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon

✅AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid

✅AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs/NPUs

✅AI Architecture 4. Training vs. Inference

1. Structural Problem of MLP

The definition of a Fully Connected Layer is that "every neuron in the previous layer is connected to every neuron in the next layer." Mathematically, this is Y = WX + B, a Matrix-Vector Multiplication (MVM).

This 'all-to-all connection' is the problem. Let's assume 1,000 input neurons and 1,000 output neurons.

Required Operations: 1,000 * 1,000 = 1,000,000 (1M) MACs
Required Parameters (Weights): 1,000 * 1,000 = 1,000,000 (1M) weights

You fetch 1 million parameters from memory, multiply each exactly once, and you're done. There is no Reuse. A weight (W_ij) is multiplied with only one input (X_j) and then discarded; it is not used in any later operation.
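The counting above can be reproduced with a minimal Python sketch. The helper names `fc_layer_counts` and `matvec` are illustrative and not from the original post, and the bias term is ignored for simplicity.

```python
def fc_layer_counts(n_in: int, n_out: int) -> dict:
    """Count MACs and weights for one fully connected layer (bias ignored)."""
    macs = n_in * n_out      # one multiply-accumulate per (output, input) pair
    weights = n_in * n_out   # one weight per connection
    return {"macs": macs, "weights": weights, "macs_per_weight": macs / weights}


def matvec(W, x):
    """Naive y = W x: every W[i][j] is read once, used once, never again."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


counts = fc_layer_counts(1000, 1000)
print(counts)  # {'macs': 1000000, 'weights': 1000000, 'macs_per_weight': 1.0}
```

The ratio of exactly one MAC per weight is the "no Reuse" property: every fetched parameter pays for only a single operation.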

2. The Von Neumann Bottleneck and the Memory Wall

Most modern computers follow the Von Neumann Architecture. This structure separates the Processing Unit and the Memory Unit, connecting them via a Bus. The problem arises from the difference in the speed of technological advancement.

Processor Speed: Compute throughput has grown by orders of magnitude over the last few decades, driven by transistor scaling (Moore's Law) and ever wider parallelism.
Memory Speed (DRAM): Capacity has increased, but the rate of moving data (Bandwidth) and the response time (Latency) have not kept pace.

Due to this gap, the processor screams for data, but the memory cannot supply it in time, leaving the processor idle. This is the Memory Wall.

MLPs collide head-on with this memory wall. Even if the Arithmetic Logic Unit (ALU) shouts, "I'm done calculating! Give me the next number!", the memory bus is clogged hauling weights, so the chip can end up delivering only a small fraction of its peak throughput, in bad cases less than 10%.

3. Arithmetic Intensity: The Metric of Efficiency

To explain this phenomenon in engineering terms, we use the concept of Arithmetic Intensity.

$$\text{Arithmetic Intensity} = \frac{\text{Total Floating Point Operations (FLOPs)}}{\text{Total Data Accessed (Bytes)}}$$

In short, it indicates "How many operations (bang) can I get for every 1 byte of data fetched from memory (buck)?"

Convolution Layer (CNN): A single filter sweeps across the entire image, so each weight is reused thousands of times. -> High Arithmetic Intensity (Compute-Bound)
FC Layer (MLP): Each weight is fetched, multiplied once, and discarded. -> Extremely Low Arithmetic Intensity (Memory-Bound)
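To put rough numbers on this comparison, here is a small Python sketch. It assumes an idealized case in which each tensor (weights, inputs, outputs) crosses the memory bus exactly once, one byte per value (INT8), stride 1 with "same" padding, and counts each MAC as two operations. The layer shapes and function names are hypothetical examples, not taken from a specific model.

```python
def fc_intensity(n_in: int, n_out: int, bytes_per_value: int = 1) -> float:
    """Ops per byte for y = W x with a single input vector."""
    ops = 2 * n_in * n_out                       # multiply + add per weight
    traffic = (n_in * n_out + n_in + n_out) * bytes_per_value
    return ops / traffic


def conv_intensity(h: int, w: int, c_in: int, c_out: int, k: int,
                   bytes_per_value: int = 1) -> float:
    """Ops per byte for a k x k convolution over an h x w feature map."""
    ops = 2 * h * w * c_in * c_out * k * k       # each weight reused h*w times
    traffic = (k * k * c_in * c_out              # filter weights
               + h * w * c_in                    # input feature map
               + h * w * c_out) * bytes_per_value  # output feature map
    return ops / traffic


fc = fc_intensity(1000, 1000)
conv = conv_intensity(56, 56, 64, 64, 3)
print(f"FC:   {fc:.1f} ops/byte")
print(f"Conv: {conv:.1f} ops/byte")
```

Under these assumptions the FC layer lands at roughly 2 operations per byte, while the convolution reaches several hundred, which is the gap between memory-bound and compute-bound behavior.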

Even if an NPU can perform 100 trillion operations per second (100 TOPS), when running an MLP at batch size 1 the delivered performance might drop to a disastrous 1 to 2 TOPS because of memory bandwidth limits.
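The drop from 100 TOPS to single digits can be sketched with a simple roofline-style estimate: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. The bandwidth figures below (500 GB/s and 1,000 GB/s) and the intensity of 2 ops/byte (an INT8 FC layer at batch size 1) are assumptions chosen for illustration, not measurements of any particular chip.

```python
def attainable_tops(peak_tops: float, bandwidth_gb_s: float,
                    ops_per_byte: float) -> float:
    """Roofline estimate: min(compute peak, bandwidth * arithmetic intensity)."""
    memory_bound_tops = bandwidth_gb_s * 1e9 * ops_per_byte / 1e12
    return min(peak_tops, memory_bound_tops)


PEAK_TOPS = 100.0          # compute peak from the text
FC_OPS_PER_BYTE = 2.0      # assumed: INT8 weights, each used for one MAC

for bandwidth in (500.0, 1000.0):   # hypothetical DRAM bandwidths in GB/s
    tops = attainable_tops(PEAK_TOPS, bandwidth, FC_OPS_PER_BYTE)
    print(f"{bandwidth:6.0f} GB/s -> {tops:.1f} TOPS "
          f"({100 * tops / PEAK_TOPS:.0f}% of peak)")
```

With these assumed numbers the estimate comes out at about 1 to 2 TOPS, so the ALUs sit idle for the vast majority of the time no matter how large the compute peak is.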

4. Solution 1: Batch Processing

The easiest way to solve this is to increase the Batch Size.

Instead of processing a single input ($X$), you collect 100 inputs and process them at once.

Batch Size = 1: Loading weight W -> 1 operation -> End
Batch Size = 100: Loading weight W -> 100 operations (reused for inputs 1-100) -> End

This changes Matrix-Vector Multiplication (MVM) into Matrix-Matrix Multiplication (MMM). You can load weight W into the cache (SRAM) once and reuse it 100 times, increasing Arithmetic Intensity.
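The effect of batching on Arithmetic Intensity can be sketched as follows. The model assumes the weights are fetched once per batch and held in on-chip SRAM, that every input and output vector still crosses the memory bus, and that values are one byte each; the function name is illustrative.

```python
def fc_intensity_batched(n_in: int, n_out: int, batch: int,
                         bytes_per_value: int = 1) -> float:
    """Ops per byte for Y = W X when X holds `batch` input vectors."""
    ops = 2 * n_in * n_out * batch                 # work grows with the batch
    traffic = (n_in * n_out                        # weights: loaded once
               + batch * (n_in + n_out)) * bytes_per_value  # activations
    return ops / traffic


for batch in (1, 10, 100):
    ai = fc_intensity_batched(1000, 1000, batch)
    print(f"batch={batch:4d}  intensity={ai:7.1f} ops/byte")
```

In this model the gain from batch 1 to batch 100 is somewhat less than 100x, because activation traffic also grows with the batch, but the weight traffic that dominated at batch 1 is now amortized over 100 inputs.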

(However, as discussed in previous posts, there is a dilemma that increasing batch size is difficult in real-time inference due to Latency constraints.)

5. Solution 2: Massive On-chip SRAM

What if you can't increase the batch size? Put all the weights inside the chip. DRAM is slow and power-hungry, but SRAM inside the chip is extremely fast.

AI semiconductor startups (Groq, Graphcore, etc.) and Tesla's FSD chip pack tens to hundreds of MBs of SRAM onto the chip. This is an attempt to make MLP weights Resident inside the chip, so that most weight fetches never touch DRAM and the Memory Wall is largely sidestepped. Of course, this has the downside of increasing chip area and cost.
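Whether weights can stay resident is mostly a capacity question. The sketch below adds up the weight footprint of a stack of FC layers and compares it with an SRAM budget. The layer shapes and the 64 MB and 230 MB budgets are hypothetical examples, and the check ignores activations, biases, and any buffering overhead.

```python
def weight_bytes(layer_shapes, bytes_per_weight: int = 1) -> int:
    """Total weight footprint for FC layers given as (n_in, n_out) pairs."""
    return sum(n_in * n_out for n_in, n_out in layer_shapes) * bytes_per_weight


def fits_on_chip(layer_shapes, sram_mb: float, bytes_per_weight: int = 1) -> bool:
    """True if all weights fit in an SRAM of `sram_mb` megabytes (MiB)."""
    return weight_bytes(layer_shapes, bytes_per_weight) <= sram_mb * 1024 * 1024


small_mlp = [(1000, 1000)] * 3      # hypothetical: three 1000x1000 layers
large_mlp = [(4096, 4096)] * 32     # hypothetical: 32 layers of 4096x4096

print(fits_on_chip(small_mlp, sram_mb=64))    # INT8 weights, 64 MB budget
print(fits_on_chip(large_mlp, sram_mb=230))   # INT8 weights, 230 MB budget
```

The small model needs about 3 MB and fits easily; the large one needs 512 MB at one byte per weight and does not, which is why quantization and on-chip SRAM are usually considered together.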

6. Conclusion: Why CNNs Were Loved

The reason CNNs (Convolutional Neural Networks) became mainstream in image processing over pure MLP models wasn't just accuracy, but also hardware efficiency.

MLPs, lacking Locality and Reuse, are 'money pits' for hardware. However, recent Transformer models such as GPT once again rely on massive matrix operations (the attention projections and feed-forward blocks are built from FC Layers), so Hardware Architects are once again waging war on this memory wall, this time by stacking HBM (High Bandwidth Memory).

In the next post, we will explore a structure with opposite characteristics to MLPs, a favorite of hardware engineers: "CNN and Locality: Maximizing Hardware On-chip Buffer Efficiency."
