AI Architecture 16. Memory Hierarchy: Minimize Data Movement Costs

As we explored in the previous post (Systolic Array), powerful Processing Elements (PEs) are essential, but the primary concern for a system architect is: "How can we supply them with data seamlessly and cost-effectively?"

According to Professor Mark Horowitz of Stanford, at the 45nm process node, the energy required to fetch data from DRAM is 200 to 1,000 times higher than the energy consumed by a 64-bit Floating-Point Multiply-Accumulate (FMA) operation itself. Performance and power efficiency therefore depend not on the number of compute units, but on minimizing data movement distance. To achieve this, NPUs employ a highly optimized 3-level memory hierarchy.

1. Quantitative Understanding of Data Movement Costs

The memory hierarchy is designed around a fundamental correlation between physical distance and energy consumption (normalized energy cost per access at 45nm):

  • PE Register File: 1 (Closest, Cheapest)
  • PE Local Scratchpad: ~2x
  • Global Buffer (Shared SRAM): ~20x
  • Off-chip Memory (DRAM): ~200x to 1,000x (Farthest, Most Expensive)
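The impact of these ratios can be sketched numerically. The following is an illustrative model, not hardware data: it uses the normalized costs above with a hypothetical access profile, and shows how even a tiny fraction of DRAM accesses dominates the total energy budget.

```python
# Normalized energy cost per access at each level (45nm figures above).
ENERGY = {"register": 1, "scratchpad": 2, "global_buffer": 20, "dram": 200}

def total_energy(accesses: dict) -> int:
    """Sum normalized energy over a per-level access-count profile."""
    return sum(ENERGY[level] * count for level, count in accesses.items())

# Hypothetical profile: fewer than 1% of accesses hit DRAM...
profile = {
    "register": 1_000_000,
    "scratchpad": 100_000,
    "global_buffer": 10_000,
    "dram": 10_000,
}
e = total_energy(profile)
# ...yet DRAM still accounts for the majority of the energy.
dram_share = ENERGY["dram"] * profile["dram"] / e
```

Under this sketch, DRAM is under 1% of accesses but roughly 59% of the energy, which is exactly why the design goal below is phrased in terms of DRAM avoidance.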

This disparity clarifies the ultimate mission of NPU design: "Minimize DRAM access and solve as much as possible on-chip." This is the essence of Data Reuse.

2. 3-Level NPU Memory Hierarchy Analysis

Level 1: Off-chip Memory (DRAM/HBM) - The Warehouse

  • Role: Stores all model parameters (Weights) and large-scale feature maps.
  • Characteristic: Largest capacity (GBs) but high latency and limited bandwidth (Memory Wall). Modern high-performance NPUs use HBM (High Bandwidth Memory) to overcome bandwidth limitations.

Level 2: Global Buffer (On-chip SRAM) - The Distribution Center

  • Role: A buffer zone between DRAM and PEs. It prefetches and stores "Tiles" of data to be processed next.
  • Characteristic: High-speed SRAM with several MBs of capacity (e.g., Google TPU v1’s 24MB Unified Buffer).
  • Strategy: Uses Double Buffering to hide DRAM access latency by loading the next data chunk into Buffer B while the PE array processes Buffer A.
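The ping-pong pattern behind double buffering can be sketched in software. This is a minimal illustration, with caller-supplied `load_tile` and `compute_tile` functions standing in for the DRAM transfer and the PE array, respectively:

```python
from concurrent.futures import ThreadPoolExecutor

def run_double_buffered(tiles, load_tile, compute_tile):
    """Overlap loads with compute using two buffers (ping-pong).

    `load_tile` models a DRAM -> buffer transfer; `compute_tile` models the
    PE array consuming one filled buffer. While the PEs process the current
    buffer, the loader thread fills the other one in the background.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_tile, tiles[0])   # prefetch tile 0 into "Buffer A"
        for i in range(len(tiles)):
            current = pending.result()                 # wait until the buffer is filled
            if i + 1 < len(tiles):
                # kick off the next load into "Buffer B" before computing
                pending = loader.submit(load_tile, tiles[i + 1])
            results.append(compute_tile(current))      # PEs work while the load proceeds
    return results
```

If `compute_tile` takes at least as long as `load_tile`, the transfer latency is fully hidden behind the computation.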

Level 3: PE Register File (RF) - The Workbench

  • Role: Supplies data directly to the MAC units within the PE.
  • Characteristic: Tiny capacity (KBs) but single-cycle access and negligible energy consumption.
  • Strategy: This is where Dataflow strategies (Weight Stationary, Output Stationary, Row Stationary) come into play. By keeping specific data (e.g., weights) stationary in registers, the PE avoids redundant data requests to higher levels of the hierarchy.
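The Weight Stationary idea is purely a question of loop order. The sketch below is an illustrative 1-D convolution (not any specific NPU's dataflow): each weight is fetched once, modeling "load into the PE register", and reused across every input position before the next weight is fetched.

```python
def weight_stationary_conv1d(inputs, weights):
    """1-D convolution with a weight-stationary loop order (illustrative).

    The outer loop fetches each weight exactly once; the inner loop reuses it
    across all output positions, instead of re-fetching weights per output.
    """
    out_len = len(inputs) - len(weights) + 1
    outputs = [0] * out_len
    for k, w in enumerate(weights):       # fetch weight w once: it stays "stationary"
        for i in range(out_len):          # reuse w across every output position
            outputs[i] += w * inputs[i + k]
    return outputs
```

Swapping the two loops would compute the same result but fetch each weight `out_len` times, which is exactly the redundant traffic the stationary strategy eliminates.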

3. Strategy: Tiling (Blocking)

When model weights and activations exceed the on-chip buffer capacity, the workload is processed using Tiling.

  1. Slicing: Divide massive matrices into smaller "Tiles" that fit into the buffer.
  2. Mapping: Load one tile from DRAM to the buffer.
  3. Maximum Reuse: Iteratively process the data within the tile across the PE array.
  4. Write-back: Save the final result back to DRAM.
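The four steps above can be sketched as a blocked matrix multiply. This is a generic software illustration (plain nested lists, a hypothetical `tile` edge length), not an NPU kernel:

```python
def tiled_matmul(A, B, tile):
    """Tiled (blocked) matrix multiply over plain nested lists.

    A: n x m, B: m x p. Each block is conceptually "loaded" once (steps 1-2)
    and fully reused by the innermost loops (step 3) before moving on.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]                  # accumulator; step 4's write-back target
    for i0 in range(0, n, tile):                     # steps 1-2: select the next tile
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # step 3: maximum reuse inside the resident tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C                                         # step 4: final result written back
```

The result is identical to an untiled multiply; only the order of memory accesses changes, so that each tile's data is reused many times while it is resident in the (fast, small) buffer.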

Maintaining a high Arithmetic Intensity (FLOPs performed per byte fetched from DRAM) is vital; otherwise, the system becomes Memory-bound due to frequent DRAM fetches.
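Arithmetic intensity is easy to estimate for a matrix multiply. The sketch below assumes, optimistically, that each operand and the result cross the DRAM boundary exactly once (i.e., tiling achieves perfect on-chip reuse); `bytes_per_elem=2` models FP16 data:

```python
def matmul_arithmetic_intensity(n, m, p, bytes_per_elem=2):
    """FLOPs per DRAM byte for an (n x m) @ (m x p) matmul, assuming each
    operand and the result are transferred exactly once (ideal reuse)."""
    flops = 2 * n * m * p                                # one multiply + one add per term
    dram_bytes = (n * m + m * p + n * p) * bytes_per_elem
    return flops / dram_bytes

# For square 1024 x 1024 FP16 matrices, intensity is 1024/3, about 341 FLOPs/byte.
ai = matmul_arithmetic_intensity(1024, 1024, 1024)
```

If the tiles are too small to achieve this reuse, the denominator grows with repeated fetches and the intensity collapses toward the memory-bound regime.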

4. Conclusion

A superior NPU is not just about TOPS; it's about the intelligent utilization of the narrow but fast on-chip space to minimize trips to the expensive DRAM warehouse.

References: NVIDIA Tech Blog
