AI Architecture 14. Dataflow Taxonomy: Weight Stationary (TPU) vs Output Stationary vs Row Stationary

In the previous post, we quantitatively confirmed that hardware performance limits are often set by memory bandwidth. From a system architect's perspective, this fact carries an even weightier implication: moving data to the compute units costs far more energy than performing the actual computation.

Research indicates that fetching data from DRAM consumes roughly 200 times more energy than fetching it from a Register File (RF). The core of high-performance NPU design is therefore not merely adding more MAC units, but devising a strategy that keeps data in on-chip memory or registers for as long as possible to maximize reuse.
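
To put a number on this, the sketch below amortizes the DRAM fetch cost over register-level reuse. It is a minimal back-of-the-envelope model: the normalized access costs follow the ratios reported for Eyeriss-class designs (RF ≈ 1x, inter-PE ≈ 2x, global buffer ≈ 6x, DRAM ≈ 200x, relative to one MAC), while the fetch_energy helper and the reuse factor are illustrative assumptions.

```python
# Back-of-the-envelope energy model. Normalized access costs (relative to
# one MAC operation) follow the ratios reported for Eyeriss-class designs;
# the reuse factor is a hypothetical "MACs performed per DRAM fetch".
ENERGY = {"RF": 1, "NoC": 2, "buffer": 6, "DRAM": 200}

def fetch_energy(num_macs: int, reuse: int) -> float:
    """Energy spent feeding one operand stream to num_macs MACs.

    Each value is fetched once from DRAM, then served `reuse` times
    from the register file before being evicted.
    """
    dram_fetches = num_macs / reuse
    rf_reads = num_macs
    return dram_fetches * ENERGY["DRAM"] + rf_reads * ENERGY["RF"]

# Amortized cost per MAC: ~201 units with no reuse vs. ~3 units at 100x reuse.
print(fetch_energy(10_000, reuse=1) / 10_000)    # 201.0
print(fetch_energy(10_000, reuse=100) / 10_000)  # 3.0
```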

This strategy of spatio-temporal mapping is called a Dataflow. Depending on which data type is kept Stationary, the character of the NPU architecture changes completely. This article provides an in-depth analysis of the three primary Dataflows: Weight Stationary (WS), Output Stationary (OS), and Row Stationary (RS).

1. Weight Stationary (WS): Fix the Weights

Concept & Mechanism

Weight Stationary fixes the Weights (Filters), the key operands of deep learning operations, in registers inside each PE (Processing Element), while Inputs (Input Activations) and Partial Sums move through the array.

  1. Weights are pre-loaded into PE registers and held stationary.
  2. Input data (Input Feature Maps) are broadcast or streamed through the array.
  3. Calculated results (Partial Sums) move to adjacent PEs to be accumulated.
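
The loop nest below is a minimal Python sketch of this schedule for a matrix multiply. It is an illustration, not an actual NPU kernel: weight_stationary_matmul is a hypothetical helper, and each (m, k) iteration stands in for one PE.

```python
import numpy as np

def weight_stationary_matmul(W, X):
    """Toy weight-stationary schedule for Y = W @ X.

    Each (m, k) iteration plays the role of one PE: the weight is loaded
    into a local register once and stays put while all N inputs stream
    past it; partial sums are accumulated as they would be while hopping
    between neighboring PEs.
    """
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w_reg = W[m, k]                  # weight pinned in the PE register
            for n in range(N):               # inputs streamed through the array
                Y[m, n] += w_reg * X[k, n]   # psum forwarded and accumulated
    return Y

W, X = np.arange(6).reshape(2, 3), np.arange(12).reshape(3, 4)
assert np.allclose(weight_stationary_matmul(W, X), W @ X)
```

Note the loop order: the weight register is written once per (m, k) pair and then read N times, which is exactly the reuse WS is built around.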

Representative Architecture

  • Google TPU (Tensor Processing Unit) v1: Implemented WS using a Systolic Array structure.

Pros

  • Efficiency for CNN/LLM: Once loaded, a CNN filter or an LLM weight matrix is reused across many inputs. WS exploits this reuse to cut weight memory access costs.
  • Simplified Control: Since weights are static after loading, the control logic for flowing inputs is relatively simple.

Cons

  • Partial Sum Movement Cost: Partial sums must continuously move between PEs until accumulation is complete, consuming interconnect bandwidth.

2. Output Stationary (OS): Fix the Results

Concept & Mechanism

Output Stationary fixes the Partial Sums required to produce the final Output Activation in the PE's internal registers.

  1. Each PE is responsible for one Output Pixel.
  2. Inputs and Weights required to compute this output are streamed to the PE.
  3. Partial sums do not leave the PE until the accumulation is finished.
  4. Only the final result is written out to memory.
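
Analogously, here is a minimal sketch of the OS schedule for a matrix multiply. Again output_stationary_matmul is a hypothetical helper; the accumulator variable stands in for the PE's psum register.

```python
import numpy as np

def output_stationary_matmul(W, X):
    """Toy output-stationary schedule for Y = W @ X.

    Each (m, n) iteration plays the role of one PE that owns a single
    output element: the partial sum lives in a local accumulator for the
    entire reduction and is written out exactly once, while weights and
    inputs are streamed in every step.
    """
    M, K = W.shape
    _, N = X.shape
    Y = np.empty((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0                     # psum pinned in the PE register
            for k in range(K):            # weights and inputs streamed per cycle
                acc += W[m, k] * X[k, n]
            Y[m, n] = acc                 # single write-out of the final result
    return Y

W, X = np.arange(6).reshape(2, 3), np.arange(12).reshape(3, 4)
assert np.allclose(output_stationary_matmul(W, X), W @ X)
```

Compared with the WS loop nest, the reduction loop over k is now innermost, so the psum never leaves the accumulator until it is final.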

Representative Architecture

  • ShiDianNao: An early NPU architecture optimized for tasks with high operation density within specific windows, such as image processing.

Pros

  • Minimize Partial Sum Memory Access: Reads and writes of partial sums happen only within registers, drastically reducing global buffer traffic for partial sums. (This matters because partial sums typically carry higher bit-width precision than inputs or weights.)

Cons

  • Input/Weight Bandwidth: Inputs and weights must be broadcast or unicast every cycle, which can raise the global bandwidth required to keep the PEs fed.

3. Row Stationary (RS): Maximizing 2D Reuse

Concept & Mechanism

Row Stationary is the core technique of the Eyeriss architecture proposed at MIT. Unlike WS or OS, which fix a single data type, RS is a composite method designed to maximize reuse of Inputs, Weights, and Partial Sums simultaneously.

  1. Due to the nature of Convolution, 2D planar data is processed in a sliding window manner.
  2. RS maps data to PEs in units of 1D Rows.
  3. With a larger RF (Register File) per PE, each PE keeps a row of Weights stationary, streams a row of Inputs past it, and accumulates the corresponding row of Partial Sums internally.
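
The sketch below mimics this row-wise decomposition for a small 2D convolution. It is a simplified illustration of the idea rather than the actual Eyeriss mapping (which also spreads PEs diagonally to reuse ifmap rows); row_stationary_conv2d is a hypothetical helper, and each (r, oy) pair stands in for one PE.

```python
import numpy as np

def row_stationary_conv2d(ifmap, kernel):
    """Toy row-stationary schedule for a 2D convolution (valid padding).

    Each (r, oy) iteration plays the role of one PE doing a 1D convolution:
    a kernel row stays stationary in its RF, an ifmap row streams past it,
    and the resulting row of partial sums is accumulated into the output
    row it shares with the other PEs.
    """
    H, W = ifmap.shape
    R, S = kernel.shape
    out = np.zeros((H - R + 1, W - S + 1))
    for r in range(R):                        # one PE row per kernel row
        k_row = kernel[r]                     # kernel row stationary in the RF
        for oy in range(H - R + 1):
            i_row = ifmap[oy + r]             # ifmap row streamed through
            psum_row = np.array([i_row[ox:ox + S] @ k_row
                                 for ox in range(W - S + 1)])
            out[oy] += psum_row               # psum rows accumulated across PEs
    return out

ifmap, kernel = np.arange(25.0).reshape(5, 5), np.ones((3, 3))
ref = np.array([[(ifmap[y:y + 3, x:x + 3] * kernel).sum() for x in range(3)]
                for y in range(3)])
assert np.allclose(row_stationary_conv2d(ifmap, kernel), ref)
```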

Representative Architecture

  • MIT Eyeriss: An edge NPU pursuing extreme energy efficiency.

Pros

  • Overall Energy Optimization: Achieves balanced reuse across Inputs, Weights, and Outputs without biasing towards a specific data type, thereby minimizing total system energy.

Cons

  • Complex Control Logic: The data mapping scheme is highly complex, increasing the difficulty of compiler and hardware controller design.
  • Increased PE Area: Requires larger local memory (SRAM/RF) per PE to store the complex data sets.

4. Comparison Analysis & Conclusion

| Characteristic | Weight Stationary (WS) | Output Stationary (OS) | Row Stationary (RS) |
| --- | --- | --- | --- |
| Stationary Data | Weights (Filters) | Partial Sums (Outputs) | Rows of Weights & Inputs |
| Moving Data | Inputs, Partial Sums | Inputs, Weights | Inputs (diagonal), Psums |
| Optimization Goal | Min. Weight Reads | Min. Psum R/W | Min. Total Data Movement |
| Suitable Models | Large CNNs, LLMs (Batch↑) | Depthwise Conv, MLP | General CNN (Mobile/Edge) |
| Examples | Google TPU, NVDLA | ShiDianNao | MIT Eyeriss |

[Figure: Dataflow comparison]

In conclusion, no single Dataflow is universally superior; the best choice is determined by the characteristics of the workload.

  • WS is advantageous for server-grade inference with large batch sizes and high filter reuse.
  • OS can be beneficial when image sizes are large, channels are few, or partial sum data size is substantial.
  • RS is preferred in mobile/edge environments with strict power constraints, despite the design complexity, due to its highest energy efficiency.

Modern high-performance NPUs (e.g., NVIDIA Tensor Cores, Google TPU v4) are moving away from a single fixed dataflow, instead reconfiguring or mixing dataflows according to layer characteristics (Conv vs. FC, kernel size, etc.); a toy version of such a per-layer selection is sketched below.
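
As a rough illustration only: the Layer fields, the thresholds, and the pick_dataflow heuristic below are purely illustrative assumptions, not any vendor's actual scheduling policy.

```python
from dataclasses import dataclass

@dataclass
class Layer:
    kind: str        # "conv", "depthwise", or "fc"
    batch: int
    channels: int
    spatial: int     # output feature-map height/width

def pick_dataflow(layer: Layer) -> str:
    """Pick a dataflow per layer; the thresholds here are purely illustrative."""
    if layer.kind == "fc" and layer.batch >= 32:
        return "WS"   # large batch: each weight is reused across the batch
    if layer.kind == "depthwise" or (layer.spatial > 64 and layer.channels < 32):
        return "OS"   # little weight reuse, many psums worth keeping local
    return "RS"       # balanced reuse as the general-purpose default

print(pick_dataflow(Layer("fc", batch=128, channels=4096, spatial=1)))       # WS
print(pick_dataflow(Layer("depthwise", batch=1, channels=32, spatial=112)))  # OS
print(pick_dataflow(Layer("conv", batch=1, channels=256, spatial=14)))       # RS
```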

References: V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, 2017.
