In the previous post, we quantitatively confirmed that hardware performance limits are often set by memory bandwidth. From a system architect's perspective, this fact carries an even more significant implication: the energy cost of moving data to the compute unit far exceeds the cost of performing the computation itself.
Research indicates that fetching data from DRAM consumes approximately 200 times more energy than fetching it from a Register File (RF). The core of high-performance NPU design is therefore not merely adding more MAC units, but devising a strategy that keeps data in on-chip memory or registers as long as possible so it can be reused.
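To see why this ratio dominates the design, consider a back-of-the-envelope calculation. This is a sketch in normalized energy units, assuming RF access = 1 and DRAM access = 200 per the figure cited above: one DRAM fetch amortized over many register-file reuses quickly drives the average cost per operand use toward the RF cost.

```python
# Back-of-the-envelope reuse math using the ~200x DRAM-vs-RF figure above.
# Normalized units (illustrative assumption): RF access = 1, DRAM access = 200.
E_RF, E_DRAM = 1, 200

def energy_per_use(reuse: int) -> float:
    # One DRAM fetch amortized over 'reuse' accesses served from the RF.
    return (E_DRAM + reuse * E_RF) / reuse

for r in (1, 10, 100):
    print(f"reuse={r:>3}: {energy_per_use(r):6.1f} energy units per use")
# reuse=  1:  201.0
# reuse= 10:   21.0
# reuse=100:    3.0  -> roughly a 67x saving from reuse alone
```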
This strategy of spatio-temporal mapping is called a Dataflow. Depending on which data type is kept Stationary, the character of the NPU architecture changes completely. This article provides an in-depth analysis of the three primary Dataflows: Weight Stationary (WS), Output Stationary (OS), and Row Stationary (RS).
1. Weight Stationary (WS): Fix the Weights
Concept & Mechanism
Weight Stationary fixes the Weights (Filters), a key element of deep learning operations, in registers inside each PE (Processing Element), while allowing Inputs (Input Activations) and Partial Sums to move (a loop-nest sketch follows the list below).
- Weights are pre-loaded into PE registers and held stationary.
- Input data (Input Feature Maps) are broadcast or streamed through the array.
- Calculated results (Partial Sums) move to adjacent PEs to be accumulated.
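As a mental model, the WS schedule can be written as a loop nest in which each weight sits in a PE register while inputs stream past. The following is a minimal, hypothetical sketch in plain Python (the function name is ours, not any vendor's API); it models the reuse order of WS, not the TPU's actual systolic wiring.

```python
import numpy as np

# Toy loop-nest model of Weight Stationary: each weight is fetched once,
# then reused across every input column before the next weight is loaded.

def weight_stationary_matmul(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    M, K = W.shape
    K2, N = X.shape
    assert K == K2
    Y = np.zeros((M, N))
    for m in range(M):
        for k in range(K):
            w_reg = W[m, k]                 # weight pinned in a PE register...
            for n in range(N):              # ...reused across all N inputs
                Y[m, n] += w_reg * X[k, n]  # partial sum moves onward
    return Y

W, X = np.random.randn(4, 3), np.random.randn(3, 5)
assert np.allclose(weight_stationary_matmul(W, X), W @ X)
```

Note how each `Y[m, n]` is touched `K` times: this is exactly the partial-sum movement cost discussed in the Cons below.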
Representative Architecture
- Google TPU (Tensor Processing Unit) v1: Implemented WS using a Systolic Array structure.
Pros
- Efficiency for CNNs/LLMs: Once loaded, filters in CNNs and weight matrices in LLMs are reused across many inputs. WS exploits this reuse to cut memory access costs.
- Simplified Control: Since weights are static after loading, the control logic for streaming inputs is relatively simple.
Cons
- Partial Sum Movement Cost: Partial sums must continuously move between PEs until accumulation is complete, consuming interconnect bandwidth.
2. Output Stationary (OS): Fix the Results
Concept & Mechanism
Output Stationary fixes the Partial Sums needed to produce each final Output Activation in the PE's internal registers (see the loop-nest sketch after this list).
- Each PE is responsible for one Output Pixel.
- Inputs and Weights required to compute this output are streamed to the PE.
- Partial sums do not leave the PE until the accumulation is finished.
- Only the final result is written out to memory.
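The OS schedule inverts the loop order relative to WS: the accumulator is pinned per output, and operands stream in. A minimal sketch under the same toy-model assumptions as the WS example above:

```python
import numpy as np

# Toy loop-nest model of Output Stationary: the accumulator never leaves
# its PE register; only the finished result is written out.

def output_stationary_matmul(W: np.ndarray, X: np.ndarray) -> np.ndarray:
    M, K = W.shape
    _, N = X.shape
    Y = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            acc = 0.0                     # partial sum lives in a PE register
            for k in range(K):            # weights AND inputs stream in
                acc += W[m, k] * X[k, n]
            Y[m, n] = acc                 # written to memory exactly once
    return Y

W, X = np.random.randn(4, 3), np.random.randn(3, 5)
assert np.allclose(output_stationary_matmul(W, X), W @ X)
```

Here the inner loop fetches a fresh weight and input every step, which is precisely the bandwidth pressure noted in the Cons below.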
Representative Architecture
- ShiDianNao: An early NPU architecture optimized for tasks with high operation density within specific windows, such as image processing.
Pros
- Minimize Partial Sum Memory Access: Since the read/write process for partial sums occurs only within registers, global buffer traffic for partial sums is drastically reduced. (This is significant as partial sum data often requires higher bit-width precision).
Cons
- Input/Weight Bandwidth: Inputs and weights must be broadcast or unicast to the PEs every cycle, potentially raising the global bandwidth required to supply them.
3. Row Stationary (RS): Maximizing 2D Reuse
Concept & Mechanism
Row Stationary is the core technique of the Eyeriss architecture proposed at MIT. Unlike WS or OS, which fix a single data type, RS is a composite scheme designed to maximize reuse of Inputs, Weights, and Partial Sums simultaneously.
- Due to the nature of Convolution, 2D planar data is processed in a sliding window manner.
- RS maps data to PEs in units of 1D Rows.
- With a larger RF (Register File) per PE, each PE keeps a row of weights stationary, slides a row of inputs past it, and accumulates the corresponding row of partial sums internally (see the row-primitive sketch after this list).
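A rough way to express the RS idea in code is to decompose a 2D convolution into 1D row primitives, with each hypothetical PE holding one filter row stationary. This sketch captures only the row decomposition; Eyeriss layers folding, tiling, and diagonal input reuse on top of it.

```python
import numpy as np

# Toy Row Stationary decomposition: each "PE" keeps one FILTER ROW in its
# RF, slides one INPUT ROW past it, and emits a row of partial sums; rows
# from vertically adjacent PEs are summed into one output row.

def row_stationary_conv2d(I: np.ndarray, W: np.ndarray) -> np.ndarray:
    H, Wi = I.shape                       # input height / width
    R, S = W.shape                        # filter height / width
    out = np.zeros((H - R + 1, Wi - S + 1))
    for oy in range(H - R + 1):           # one output row at a time
        for r in range(R):                # one PE per filter row
            w_row, i_row = W[r], I[oy + r]
            for ox in range(Wi - S + 1):  # 1D convolution row primitive
                out[oy, ox] += np.dot(w_row, i_row[ox:ox + S])
    return out

I, W = np.random.randn(6, 6), np.random.randn(3, 3)
assert row_stationary_conv2d(I, W).shape == (4, 4)
```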
Representative Architecture
- MIT Eyeriss: An edge NPU pursuing extreme energy efficiency.
Pros
- Overall Energy Optimization: Achieves balanced reuse across Inputs, Weights, and Outputs without biasing towards a specific data type, thereby minimizing total system energy.
Cons
- Complex Control Logic: The data mapping scheme is highly complex, increasing the difficulty of compiler and hardware controller design.
- Increased PE Area: Requires larger local memory (SRAM/RF) per PE to store the complex data sets.
4. Comparison Analysis & Conclusion
| Characteristic | Weight Stationary (WS) | Output Stationary (OS) | Row Stationary (RS) |
|---|---|---|---|
| Stationary Data | Weights (Filters) | Partial Sums (Outputs) | Rows of Weights & Inputs |
| Moving Data | Inputs, Partial Sums | Inputs, Weights | Inputs (Diagonal), Psums |
| Optimization Goal | Min. Weight Reads | Min. Psum R/W | Min. Total Data Movement |
| Suitable Models | Large CNNs, LLMs (Batch↑) | Depthwise Conv, MLP | General CNN (Mobile/Edge) |
| Examples | Google TPU, NVDLA | ShiDianNao | MIT Eyeriss |
In conclusion, there is no universally superior Dataflow; the right choice is determined by the characteristics of the workload.
- WS is advantageous for server-grade inference with large batch sizes and high filter reuse.
- OS can be beneficial when image sizes are large, channels are few, or partial sum data size is substantial.
- RS is preferred in power-constrained mobile/edge environments: despite its design complexity, it delivers the best overall energy efficiency.
Modern high-performance NPUs (e.g., NVIDIA Tensor Cores, Google TPU v4) are no longer fixed to a single dataflow; they reconfigure or mix dataflows flexibly according to layer characteristics (Conv vs. FC, kernel size, etc.).
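As a closing illustration, that layer-dependent choice can be imagined as a tiny dispatch heuristic. This is a toy sketch with made-up decision rules, not how any real compiler or scheduler decides; it merely encodes the rule of thumb from the comparison above, picking whichever operand has the most reuse to amortize.

```python
# Purely illustrative heuristic: keep stationary whichever operand is
# reused the most (made-up rules, for intuition only).

def pick_dataflow(batch: int, reduction_k: int, is_conv: bool) -> str:
    weight_reuse = batch            # each weight is reused across the batch
    psum_reuse = reduction_k        # each psum is accumulated K times
    if is_conv and min(weight_reuse, psum_reuse) > 1:
        return "RS"                 # balanced reuse across all operand types
    return "WS" if weight_reuse >= psum_reuse else "OS"

print(pick_dataflow(batch=64, reduction_k=512, is_conv=False))    # -> OS
print(pick_dataflow(batch=1024, reduction_k=256, is_conv=False))  # -> WS
```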
References: V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, "Efficient Processing of Deep Neural Networks: A Tutorial and Survey," Proceedings of the IEEE, vol. 105, no. 12, 2017.