In the previous post, we quantitatively confirmed that hardware performance limits are often set by memory bandwidth. From a system architect's perspective, this fact carries an even more significant implication: the energy cost of moving data to the compute unit far exceeds the cost of performing the computation itself.
Research indicates that fetching data from DRAM consumes approximately 200 times more energy than fetching it from a Register File (RF). The core of high-performance NPU design is therefore not merely adding more MAC units, but devising a strategy that keeps data in on-chip memory or registers as long as possible so it can be reused.
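To see why this ratio dominates the design, consider a back-of-the-envelope calculation. This is a sketch in normalized energy units, assuming RF access = 1 and DRAM access = 200 per the figure cited above: one DRAM fetch amortized over many register-file reuses quickly drives the average cost per operand use toward the RF cost.

```python
# Back-of-the-envelope reuse math using the ~200x DRAM-vs-RF figure above.
# Normalized units (illustrative assumption): RF access = 1, DRAM access = 200.
E_RF, E_DRAM = 1, 200

def energy_per_use(reuse: int) -> float:
    # One DRAM fetch amortized over 'reuse' accesses served from the RF.
    return (E_DRAM + reuse * E_RF) / reuse

for r in (1, 10, 100):
    print(f"reuse={r:>3}: {energy_per_use(r):6.1f} energy units per use")
# reuse=  1:  201.0
# reuse= 10:   21.0
# reuse=100:    3.0  -> roughly a 67x saving from reuse alone
```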
This strategy of spatio-temporal mapping is called a Dataflow. Depending on which data type is kept Stationary, the character of the NPU architecture changes completely. This article provides an in-depth analysis of the three primary Dataflows: Weight Stationary (WS), Output Stationary (OS), and Row Stationary (RS).
1. Weight Stationary (WS): Fix the Weights
Concept & Mechanism
Weight Stationary fixes the Weights (Filters), a key element of deep learning operations, in registers inside each PE (Processing Element), while allowing Inputs (Input Activations) and Partial Sums to move (a loop-nest sketch follows the list below).
- Weights are pre-loaded into PE registers and held stationary.
- Input data (Input Feature Maps) are broadcast or streamed through the array.
- Calculated results (Partial Sums) move to adjacent PEs to be accumulated.
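As a mental model, the WS schedule can be written as a loop nest in which each weight sits in a PE register while inputs stream past. The following is a minimal, hypothetical sketch in plain Python (the function name is ours, not any vendor's API); it models the reuse order of WS, not the TPU's actual systolic wiring.

```python
import numpy as np

# Toy loop-nest model of Weight Stationary: each weight is fetched once,
# then reused across every input column before the next weight is loaded.

def weight_stationary_matmul(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    M, K = W.shape
    K2, N = X.shape
    assert K == K2
    Y = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w_reg = W[m, k]                 # weight pinned in a PE register...
            for n in range(N):              # ...reused across all N inputs
                Y[m, n] += w_reg * X[k, n]  # partial sum moves onward
    return Y

W, X = np.random.randn(4, 3), np.random.randn(3, 5)
assert np.allclose(weight_stationary_matmul(W, X), W @ X)
```

Note how each `Y[m, n]` is touched `K` times: this is exactly the partial-sum movement cost discussed in the Cons below.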
Representative Architecture
- Google TPU (Tensor Processing Unit) v1: Implemented WS using a Systolic Array structure.
Pros
- Efficiency for CNNs/LLMs: Once loaded, filters in CNNs and weight matrices in LLMs are reused across many inputs. WS exploits this reuse to cut memory access costs.
- Simplified Control: Since weights are static after loading, the control logic for streaming inputs is relatively simple.
Cons
- Partial Sum Movement Cost: Partial sums must continuously move between PEs until accumulation is complete, consuming interconnect bandwidth.
2. Output Stationary (OS): Fix the Results
Concept & Mechanism
Output Stationary fixes the Partial Sums needed to produce each final Output Activation in the PE's internal registers (see the loop-nest sketch after this list).
- Each PE is responsible for one Output Pixel.
- Inputs and Weights required to compute this output are streamed to the PE.
- Partial sums do not leave the PE until the accumulation is finished.
- Only the final result is written out to memory.
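The OS schedule inverts the loop order relative to WS: the accumulator is pinned per output, and operands stream in. A minimal sketch under the same toy-model assumptions as the WS example above:

```python
import numpy as np

# Toy loop-nest model of Output Stationary: the accumulator never leaves
# its PE register; only the finished result is written out.

def output_stationary_matmul(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0                     # partial sum lives in a PE register
            for k in range(K):            # weights AND inputs stream in
                acc += W[m, k] * X[k, n]
            Y[m, n] = acc                 # written to memory exactly once
    return Y

W, X = np.random.randn(4, 3), np.random.randn(3, 5)
assert np.allclose(output_stationary_matmul(W, X), W @ X)
```

Here the inner loop fetches a fresh weight and input every step, which is precisely the bandwidth pressure noted in the Cons below.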
Representative Architecture
- ShiDianNao: An early NPU architecture optimized for tasks with high operation density within specific windows, such as image processing.
Pros
- Minimize Partial Sum Memory Access: Since the read/write process for partial sums occurs only within registers, global buffer traffic for partial sums is drastically reduced. (This is significant as partial sum data often requires higher bit-width precision).
Cons
- Input/Weight Bandwidth: Inputs and weights must be broadcast or unicast to the PEs every cycle, potentially raising the global bandwidth required to supply them.
3. Row Stationary (RS): Maximizing 2D Reuse
Concept & Mechanism
Row Stationary is the core technique of the Eyeriss architecture proposed at MIT. Unlike WS or OS, which fix a single data type, RS is a composite scheme designed to maximize reuse of Inputs, Weights, and Partial Sums simultaneously.
- Due to the nature of Convolution, 2D planar data is processed in a sliding window manner.
- RS maps data to PEs in units of 1D Rows.
- With a larger RF (Register File) per PE, each PE keeps a row of weights stationary, slides a row of inputs past it, and accumulates the corresponding row of partial sums internally (see the row-primitive sketch after this list).
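A rough way to express the RS idea in code is to decompose a 2D convolution into 1D row primitives, with each hypothetical PE holding one filter row stationary. This sketch captures only the row decomposition; Eyeriss layers folding, tiling, and diagonal input reuse on top of it.

```python
import numpy as np

# Toy Row Stationary decomposition: each "PE" keeps one FILTER ROW in its
# RF, slides one INPUT ROW past it, and emits a row of partial sums; rows
# from vertically adjacent PEs are summed into one output row.

def row_stationary_conv2d(I: np.ndarray, W: np.ndarray) -> np.ndarray:
    H, Wi = I.shape                       # input height / width
    R, S = W.shape                        # filter height / width
    out = np.zeros((H - R + 1, Wi - S + 1))
    for oy in range(H - R + 1):           # one output row at a time
        for r in range(R):                # one PE per filter row
            w_row, i_row = W[r], I[oy + r]
            for ox in range(Wi - S + 1):  # 1D convolution row primitive
                out[oy, ox] += np.dot(w_row, i_row[ox:ox + S])
    return out

I, W = np.random.randn(6, 6), np.random.randn(3, 3)
assert row_stationary_conv2d(I, W).shape == (4, 4)
```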
Representative Architecture
- MIT Eyeriss: An edge NPU pursuing extreme energy efficiency.
Pros
- Overall Energy Optimization: Achieves balanced reuse across Inputs, Weights, and Outputs without biasing towards a specific data type, thereby minimizing total system energy.
Cons
- Complex Control Logic: The data mapping scheme is highly complex, increasing the difficulty of compiler and hardware controller design.
- Increased PE Area: Requires larger local memory (SRAM/RF) per PE to store the complex data sets.
4. Comparison Analysis & Conclusion
| Characteristic | Weight Stationary (WS) | Output Stationary (OS) | Row Stationary (RS) |
|---|---|---|---|
| Stationary Data | Weights (Filters) | Partial Sums (Outputs) | Rows of Weights & Inputs |
| Moving Data | Inputs, Partial Sums | Inputs, Weights | Inputs (Diagonal), Psums |
| Optimization Goal | Min. Weight Reads | Min. Psum R/W | Min. Total Data Movement |
| Suitable Models | Large CNNs, LLMs (Batch↑) | Depthwise Conv, MLP | General CNN (Mobile/Edge) |
| Examples | Google TPU, NVDLA | ShiDianNao | MIT Eyeriss |
In conclusion, there is no universally superior Dataflow; the right choice is determined by the characteristics of the workload.
- WS is advantageous for server-grade inference with large batch sizes and high filter reuse.
- OS can be beneficial when image sizes are large, channels are few, or partial sum data size is substantial.
- RS is preferred in power-constrained mobile/edge environments: despite its design complexity, it delivers the best overall energy efficiency.
Modern high-performance NPUs (e.g., NVIDIA Tensor Cores, Google TPU v4) are no longer fixed to a single dataflow; they reconfigure or mix dataflows flexibly according to layer characteristics (Conv vs. FC, kernel size, etc.).
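As a closing illustration, that layer-dependent choice can be imagined as a tiny dispatch heuristic. This is a toy sketch with made-up decision rules, not how any real compiler or scheduler decides; it merely encodes the rule of thumb from the comparison above, picking whichever operand has the most reuse to amortize.

```python
# Purely illustrative heuristic: keep stationary whichever operand is
# reused the most (made-up rules, for intuition only).

def pick_dataflow(batch: int, reduction_k: int, is_conv: bool) -> str:
    weight_reuse = batch            # each weight is reused across the batch
    psum_reuse = reduction_k        # each psum is accumulated K times
    if is_conv and min(weight_reuse, psum_reuse) > 1:
        return "RS"                 # balanced reuse across all operand types
    return "WS" if weight_reuse >= psum_reuse else "OS"

print(pick_dataflow(batch=64, reduction_k=512, is_conv=False))    # -> OS
print(pick_dataflow(batch=1024, reduction_k=256, is_conv=False))  # -> WS
```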
References: V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, 2017.