AI Architecture 15. The Heart of the Systolic Array

The Von Neumann architecture, the origin of modern computing, separates the 'Processing Unit (CPU)' from the 'Storage Unit (Memory)'. The fatal flaw of this structure is that data must be fetched from memory and written back for every operation. As pointed out in the previous posts (13. Roofline Model and 14. Dataflow), this data-movement cost is the biggest barrier limiting the performance of deep learning accelerators.

In 1978, H.T. Kung and C.E. Leiserson proposed the concept of the 'Systolic Array' to solve this problem. The name derives from 'systole', the phase of the heartbeat in which the heart muscle contracts and pumps blood into the circulatory system: data, once pumped out of memory, flows rhythmically through an array of Processing Elements (PEs) and is reused many times along the way.

This technology became the de facto standard for deep learning hardware when Google revealed it as the core architecture of its AI-specific processor, the TPU (Tensor Processing Unit), in its 2017 paper. This article analyzes the operating principles of the Systolic Array and how the Google TPU used it to maximize matrix multiplication (GEMM) efficiency.

1. Structural Characteristics: 2D Mesh Topology of PEs

Unlike general-purpose CPUs or GPUs, which manage thousands of threads and shuttle data through complex cache and register hierarchies, a Systolic Array has a very simple and regular 2D Mesh Topology.

  • Processing Element (PE): The unit of computation occupying each cell of the array. Each PE has a very simple structure, containing only a Multiply-Accumulate (MAC) unit and a small register (see the code sketch after this list).
  • Local Interconnect: Each PE is connected only to its immediate neighbors. Since they are not connected to a global data bus, there is no need to send data over long distances, drastically reducing wiring complexity and power consumption.
  • No Global Control: Individual PEs do not decode complex instructions. They perform simple repetitive tasks—processing incoming data and passing it to the neighbor—synchronized to a single clock signal from a central controller.
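
To make this concrete, here is a minimal Python sketch of a single PE, using only the ingredients named above: one MAC unit, one fixed weight register, and outputs toward the right and lower neighbors. The class and method names are illustrative assumptions, not TPU internals.

```python
from dataclasses import dataclass

@dataclass
class PE:
    """Minimal weight-stationary processing element (illustrative sketch)."""
    weight: float = 0.0  # pre-loaded element of the weight matrix, fixed during compute

    def step(self, a_in, psum_in):
        """One clock tick: multiply-accumulate, then forward to neighbors.

        a_in comes from the left neighbor, psum_in from the PE above.
        Returns (a_out, psum_out): a_out goes to the right-hand PE,
        psum_out to the PE below. No instruction decoding is involved;
        every PE repeats exactly this, every cycle.
        """
        psum_out = psum_in + a_in * self.weight  # the MAC operation
        a_out = a_in                             # pass the input along unchanged
        return a_out, psum_out
```

Wiring N² of these into a grid, with each `step` output feeding a neighbor's input on the next clock, is the entire architecture; the dataflow described next determines what travels where.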

2. Operating Mechanism: Dataflow Based on Weight Stationary

Google TPU v1 implemented the Weight Stationary (WS) dataflow using a Systolic Array. Let's analyze the operating principle step by step using an N × N matrix multiplication (C = A × B); a runnable sketch of the whole sequence follows the steps below.

[Figure: Systolic data flow of the Matrix Multiply Unit]
  1. Weight Pre-loading: Before the computation starts, the values of Matrix B (the weights) are pre-loaded into the array, one per PE. These values remain fixed (Stationary) in the PEs' registers until the operation is complete.
  2. Input Streaming: The values of Matrix A (Input Data) flow from the left side of the array to the right. The key here is 'Skewing'. Data rows do not enter simultaneously; the first row enters at T=0, the second at T=1, and so on. This creates a diagonal 'Wavefront' data flow.
  3. Systolic Flow & Accumulation:
    • Each PE multiplies the input value received from the left (A_row) with the weight it holds (B_fixed).
    • It adds the result (partial sum) to the value coming down from above.
    • It passes the input value (A) to the right PE and the updated partial sum (C) to the PE below.
    • This process occurs regularly every clock cycle, like a heartbeat.
  4. Output Draining: The completed result values (C) are sequentially pushed out from the bottom of the array.
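
The timing is easiest to see in a small cycle-level simulation. The sketch below is a plain-Python model built on the assumptions above, not a description of the real hardware pipeline; it reproduces all four phases: weights pre-loaded into `W`, inputs skewed by one cycle per row, a MAC-and-forward in every PE every cycle, and finished results draining from the bottom edge.

```python
import numpy as np

def systolic_matmul_ws(A, B):
    """Cycle-level sketch of C = A @ B on an N x N weight-stationary array.

    Illustrative simulation only; variable names and structure are
    assumptions, not the actual TPU implementation.
    """
    N = A.shape[0]
    W = B.copy()               # 1. weight pre-loading: W[k, j] sits fixed in PE(k, j)
    a_pipe = np.zeros((N, N))  # A value held in each PE, moving left -> right
    s_pipe = np.zeros((N, N))  # partial sum held in each PE, moving top -> bottom
    C = np.zeros((N, N))

    for t in range(3 * N - 2):  # fill + steady state + drain
        # 3. update PEs from bottom-right to top-left, so every read still
        #    sees the value its neighbor held at the end of the last cycle
        for k in range(N - 1, -1, -1):
            for j in range(N - 1, -1, -1):
                if j > 0:
                    a_in = a_pipe[k, j - 1]                # from the left neighbor
                else:
                    i = t - k                              # 2. skewing: row k of the
                    a_in = A[i, k] if 0 <= i < N else 0.0  # stream enters k cycles late
                s_in = s_pipe[k - 1, j] if k > 0 else 0.0  # from the PE above
                s_pipe[k, j] = s_in + a_in * W[k, j]       # multiply-accumulate
                a_pipe[k, j] = a_in                        # pass A to the right
        # 4. output draining: a finished C value leaves each column's bottom
        #    edge once its partial sum has traversed all N rows
        for j in range(N):
            i = t - (N - 1) - j
            if 0 <= i < N:
                C[i, j] = s_pipe[N - 1, j]
    return C
```

For random 8 × 8 matrices, `np.allclose(systolic_matmul_ws(A, B), A @ B)` returns True, and the loop runs exactly 3N - 2 = 22 cycles. The 2N - 2 cycles beyond the ideal N are the fill-and-drain overhead revisited in Section 4.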

3. Key Advantages

Why did Google choose the Systolic Array among many architectures? The answer lies in Compute Density and Bandwidth Efficiency.

A. Processing O(N²) Compute with O(N) Bandwidth

This is the most powerful feature of the Systolic Array. An N × N PE array contains N² computing units, yet the I/O ports connected to external memory sit only on the boundary, numbering O(N).

In other words, once N words are fetched from memory, each is reused N times inside the array, sustaining N² operations per cycle. This is the most effective way to resolve the memory bandwidth bottleneck and push the system into the compute-bound region of the Roofline Model.
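
A back-of-the-envelope check makes the scaling visible. Assuming, for simplicity, one word entering per row and one result leaving per column each cycle (2N boundary words against N² MACs), the compute-per-word ratio grows linearly with N:

```python
def macs_per_boundary_word(N):
    """Steady-state compute vs. boundary I/O for an N x N array (rough model).

    Assumes one input word per row and one output word per column per
    cycle; real arrays differ in detail, but the O(N^2)/O(N) shape holds.
    """
    macs_per_cycle = N * N        # every PE fires once per steady-state cycle
    words_per_cycle = 2 * N       # boundary traffic in and out
    return macs_per_cycle / words_per_cycle  # = N / 2, linear in N

for N in (16, 128, 256):
    print(f"N = {N:>3}: {macs_per_boundary_word(N):.0f} MACs per boundary word")
```

Doubling the array side doubles the arithmetic intensity, which is exactly the lever the Roofline Model says a bandwidth-limited accelerator needs.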

B. High Area Efficiency

Each PE consists purely of a MAC unit, without complex control units or multi-level caches, so far more computing units can be integrated into the same silicon area. TPU v1 packed a massive 256 × 256 array (65,536 MACs) and, even at a modest 700 MHz clock, reached about 92 TOPS (Tera Operations Per Second) of peak 8-bit throughput.
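
That peak figure follows directly from the array size and the clock, counting each MAC as two operations (one multiply, one add):

```python
macs = 256 * 256            # 65,536 MAC units in the TPU v1 matrix unit
clock_hz = 700e6            # 700 MHz clock
ops_per_sec = macs * 2 * clock_hz   # multiply + add, every MAC, every cycle
print(f"peak = {ops_per_sec / 1e12:.0f} TOPS")  # ~92 TOPS, matching the TPU paper
```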

4. Disadvantages and Limitations: Lack of Flexibility

Every design has trade-offs. While the Systolic Array shows extreme efficiency for matrix multiplication, it handles other kinds of workloads poorly.

A. Latency and Pipeline Priming (Fill & Drain Overhead)

It takes time to fill the array so that all PEs are active (Fill) and to drain the last data out (Drain). This is called pipeline priming time. Therefore, if the amount of data to be processed (the batch size) is small, the operation finishes before the pipeline is ever fully filled, and PE utilization drops drastically. This is why TPUs are advantageous for 'Large Batch' inference or training but disadvantageous for real-time 'Batch 1' inference.
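
A rough model makes the batch-size sensitivity concrete. If M input rows stream through an N × N array, the useful work is M·N² MACs, but the array is occupied for roughly M + 2N - 2 cycles (M compute waves plus fill and drain), giving utilization ≈ M / (M + 2N - 2). This is a simplifying assumption, not a measured TPU number:

```python
def pe_utilization(batch_rows, N=256):
    """Fraction of PE cycles doing useful MACs under a simple fill/drain model."""
    total_cycles = batch_rows + 2 * N - 2  # compute waves + pipeline priming
    return batch_rows / total_cycles

for m in (1, 256, 1024, 8192):
    print(f"batch rows = {m:>4}: utilization ~ {pe_utilization(m):.1%}")
```

With N = 256, a single row keeps the PEs busy about 0.2% of the time, while 8,192 rows push utilization above 94%: the quantitative version of "large batches win".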

B. Difficulty in Handling Sparsity

Systolic Arrays assume Dense Matrices in which most entries are nonzero. Even if the input is a Sparse Matrix with many zeros, the regular rhythm cannot be broken, so meaningless multiplications by zero are still performed. To address this, hardware support for Structured Sparsity is an active area of research.
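
A tiny numerical illustration (with a hypothetical 90%-sparse input, not data from any benchmark) shows the scale of the waste: every zero operand still occupies a full MAC cycle.

```python
import numpy as np

rng = np.random.default_rng(0)
mask = rng.random((256, 256)) > 0.9      # keep roughly 10% of the entries
A = rng.random((256, 256)) * mask        # ~90% of the operands are zero
wasted = np.mean(A == 0)                 # zero multiplications still take cycles
print(f"~{wasted:.0%} of MACs multiply by zero yet consume full cycles")
```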

[Conclusion]

The Systolic Array is the architecture that most elegantly implements the grand principle of NPU design—"Minimizing Data Movement Cost"—at the hardware level. This method of performing massive parallel computations with only local connections has become the textbook standard for modern AI semiconductors since the success of Google TPU.

However, for tens of thousands of PEs to operate ceaselessly, a powerful memory system must back them up to supply data without interruption. In the next post, we will explore the data movement strategy of "Memory Hierarchy: DRAM - Global Buffer - PE Register" to feed this Systolic Array.

References: N. P. Jouppi et al., 'In-Datacenter Performance Analysis of a Tensor Processing Unit', ISCA 2017.
