AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs/NPUs

In previous posts, we examined the significant hardware costs incurred by the fundamental operation of artificial neurons, the MAC (Multiply-Accumulate), and how Activation Functions (like ReLU) help mitigate these costs.

In this post, we broaden our perspective to discuss the massive wave of computation generated by these countless neurons combining: Matrix Multiplication (MatMul).

The history of deep learning is essentially a "struggle to multiply ever-larger matrices faster." In this struggle, the CPU (Central Processing Unit), which held the throne as the brain of computers for decades, has stepped down, and the GPU (Graphics Processing Unit) and NPU (Neural Processing Unit) have ascended as the new masters.

What exactly happens inside deep learning models that renders the general-purpose CPU ineffective? The secret lies in 'Parallelism' and the 'Design Philosophy of Computer Architecture.'

1. The Nature of Deep Learning: A Massive Matrix Factory

The deep learning models we handle, when stripped of their abstractions, are essentially gigantic matrix calculators.

  • Fully Connected Layer (MLP): Multiplication of an input vector and a weight matrix (Y = WX + B).
  • Convolution Layer (CNN): Multiplication of image pixels and filter kernels, which mathematically reduces to large matrix multiplications (via the im2col / Toeplitz matrix transformation).
  • Attention Mechanism (Transformer): Consecutive multiplications between the Query, Key, and Value matrices (softmax(Q · Kᵀ) · V).
[Figure: Matrix multiplication operations]
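The layers listed above can all be sketched as plain matrix products. A minimal pure-Python illustration with toy sizes (the `matmul` helper and the 2×2 values are purely illustrative, not any framework's API):

```python
def matmul(A, B):
    """Naive matrix multiply: C[i][j] = sum_k A[i][k] * B[k][j]."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def transpose(M):
    return [list(col) for col in zip(*M)]

# Fully connected layer: Y = W X + B (X as a column vector)
W = [[1.0, 2.0], [3.0, 4.0]]   # 2x2 weight matrix
X = [[1.0], [1.0]]             # 2x1 input vector
Bias = [[0.5], [0.5]]
Y = [[y[0] + b[0]] for y, b in zip(matmul(W, X), Bias)]

# Attention scores: Q * K^T (the softmax and final * V are omitted here)
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0]]
scores = matmul(Q, transpose(K))
```

However the layer is dressed up, the hot loop underneath is the same triple-nested multiply-accumulate.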

Statistically, over 90% of deep learning inference and training time is consumed by MatMul and its associated data movement. Therefore, the performance of AI hardware ultimately boils down to: "How many matrix multiplications can it process simultaneously?"

The problem is, conventional CPUs were not designed to handle this type of workload.

2. The Limits of CPU Architecture: SISD

The CPU is the jack-of-all-trades of a computer. It needs to run operating systems, browse the web, and handle spreadsheet calculations. Its goal is to execute unpredictable and complex tasks quickly. To achieve this, CPUs possess a small number of very powerful Cores. Each core is armed with complex Branch Prediction, Out-of-Order Execution logic, and Large Caches.

This structure is optimized for SISD (Single Instruction, Single Data) - processing one data point with one instruction at a time. Surprisingly, the area dedicated to ALUs (Arithmetic Logic Units) that actually perform additions and multiplications within a CPU core is relatively small. Most of the area is occupied by Control Logic, which is responsible for "figuring out what instruction to execute next and shoveling data around."

Architect’s Insight:

Matrix multiplication in deep learning is an incredibly simple and repetitive task (A1 * B1, A2 * B2, ...). There are almost no branches (if-statements), and the next operation is entirely predictable. Deploying a CPU with complex control logic for such simple tasks is akin to "driving a Ferrari in stop-and-go traffic." Those intelligent control units in the CPU spend most of their time idle during MatMul operations. It's a massive waste of silicon area and power.
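The point about branches is easy to see in the innermost kernel of any matrix multiply, the dot product. A sketch (plain Python, for illustration only):

```python
def dot(a, b):
    """The MAC kernel at the heart of MatMul: acc += a[i] * b[i].
    There are no data-dependent branches and the memory access
    pattern is fully predictable, so a CPU's branch predictor and
    out-of-order machinery have almost nothing to contribute here."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y   # one multiply-accumulate (MAC) per element
    return acc

result = dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```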

3. The GPU Revolution: SIMD

In contrast, GPUs are born different. GPUs were originally created to render 3D graphics. Calculating coordinates and coloring millions of pixels on a screen is a process of repeating identical, independent operations. For such tasks, GPUs adopted the SIMD (Single Instruction, Multiple Data) architecture.

[Figure: SIMD architecture]

SIMD is a method where "a single instruction processes multiple data points simultaneously." Instead of having a complex Control Unit for each core like a CPU, a GPU has a single Instruction Unit that simultaneously commands dozens or hundreds of simple ALUs (often called CUDA cores).

  • CPU: "Core 1 do A, Core 2 do B..." (Individual Instructions)
  • GPU (SIMD): "Alright, all 1000 cores here, start 'multiplying' right now!" (Broadcast Instruction)

The overhead of fetching, decoding, and scheduling instructions is amortized across numerous ALUs, maximizing efficiency. This structure coincidentally aligned perfectly with the matrix multiplication workloads of deep learning.
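The amortization idea can be sketched conceptually: one decoded "instruction" drives many lanes at once. The simulation below is plain Python executed serially, standing in for real SIMD hardware (no actual intrinsics are used):

```python
def simd_mul(lanes_a, lanes_b):
    """Conceptual SIMD: ONE 'multiply' instruction is fetched and
    decoded once, then applied to every lane (simulated serially
    here). The front-end cost is amortized over all the lanes."""
    return [a * b for a, b in zip(lanes_a, lanes_b)]

# 8-lane vector multiply: one broadcast instruction, eight results
out = simd_mul([1, 2, 3, 4, 5, 6, 7, 8],
               [2, 2, 2, 2, 2, 2, 2, 2])
```

On real hardware the eight multiplications happen in the same cycle; the win is that the expensive fetch/decode/schedule work is paid once rather than eight times.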

4. GPU: Extending to SIMT

GPUs use SIMD to apply the same instruction to multiple data elements simultaneously, accelerating vector operations. However, GPUs go beyond simple SIMD by adopting the SIMT (Single Instruction, Multiple Threads) model.

SIMT executes one instruction across many threads, but each thread maintains its own registers (and, on recent architectures, its own program counter), allowing conditional branches and divergence handling. Threads are grouped into warps, which execute in lockstep; when divergence occurs, GPUs use mask-based execution to manage control flow. This approach delivers massive data parallelism while preserving flexibility for complex operations like large-scale matrix multiplication.
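Mask-based divergence handling can be simulated in a few lines. The sketch below models a 4-thread warp hitting an `if v > 0` branch (names like `warp_execute` are made up for illustration; this is a conceptual model, not CUDA):

```python
def warp_execute(values):
    """Simulate a 4-thread warp diverging on `if v > 0`.
    In lockstep SIMT the warp walks through BOTH paths; a mask
    disables the threads that did not take the current path."""
    mask = [v > 0 for v in values]          # per-thread predicate
    # Pass 1: taken branch, executed only by threads with mask on
    path_a = [v * 2 if m else None for v, m in zip(values, mask)]
    # Pass 2: else branch, executed only by threads with mask off
    path_b = [v - 1 if not m else None for v, m in zip(values, mask)]
    # Reconverge: each thread keeps the result of its own path
    return [a if m else b for a, b, m in zip(path_a, path_b, mask)]

result = warp_execute([3, -1, 5, 0])
```

Note the cost model this implies: a divergent warp pays for both paths, which is why branch-heavy code maps poorly onto GPUs while branch-free MatMul maps perfectly.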

5. NPU: Beyond GPUs to Systolic Arrays

While GPUs opened the era of deep learning, they aren't perfect dedicated AI hardware. GPUs still carry baggage unnecessary for deep learning, such as texture units and rasterizers tailored for graphics processing.

The NPU (Neural Processing Unit), born to be 100% dedicated to AI, moves beyond the GPU's SIMD concept and introduces structures optimized to an extreme degree for matrix multiplication. A prime example is the Systolic Array in Google's TPU.

  • GPU: Fetches data from memory (registers), computes, and writes it back. With countless ALUs trying to access memory simultaneously, memory bottlenecks occur easily.
  • NPU (Systolic Array): Input data (matrix rows) enters from the left of the ALU array, and weight data flows in from the top. Each ALU performs a multiply-accumulate, passes its input operand to the neighbor on its right, and passes its partial sum to the neighbor below, so every value fetched from memory is reused across an entire row or column of ALUs.

This structure maximizes "Data Reuse," minimizing memory access, which is the biggest enemy of deep learning hardware. This is the decisive reason why NPUs offer superior power efficiency (Performance/Watt) compared to GPUs.
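The data-reuse idea can be sketched with a toy output-stationary systolic schedule, where each processing element (PE) owns one output and operands arrive in a staggered wavefront. This is a conceptual simulation of the scheduling, not of any real TPU (the cycle formula and names are illustrative):

```python
def systolic_matmul(A, B):
    """Toy n x n output-stationary systolic array.
    PE(i,j) accumulates C[i][j]; A's rows stream in from the left,
    B's columns from the top. At cycle t, PE(i,j) consumes the pair
    (A[i][k], B[k][j]) with k = t - i - j, modeling the skewed
    arrival of data as it is handed from PE to PE."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for t in range(3 * n - 2):          # cycles for the full wavefront
        for i in range(n):
            for j in range(n):
                k = t - i - j
                if 0 <= k < n:
                    C[i][j] += A[i][k] * B[k][j]  # one MAC per PE per cycle
    return C

C = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Each element of A and B is read from memory once and then forwarded between PEs, instead of every ALU fetching its own copy: that is the reuse that buys the Performance/Watt advantage.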

6. Conclusion: From Latency to the Era of Throughput

CPUs were designed to minimize Latency—how quickly a single complex task can be completed.

However, the era of deep learning is dominated by Throughput. Even if individual operations are slightly slower, the key is to maximize the total amount of work completed per unit of time by handling millions of operations simultaneously.

This massive requirement for parallel processing, met first by SIMD, carried GPUs to the center of AI, and now NPUs, armed with architectures even more specialized for matrix multiplication, are taking up the mantle.

In the next post, we will explore a difficult challenge that remains despite these hardware advancements: "The Decisive Differences Between Training and Inference and Their Impact on NPU Design."

References: N. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," ISCA 2017.
