AI Architecture 16. Memory Hierarchy: Minimize Data Movement Costs

As we explored in the previous post (Systolic Array), powerful Processing Elements (PEs) are essential, but the primary concern for a system architect is: "How can we supply them with data seamlessly and cost-effectively?"

According to Professor Mark Horowitz of Stanford, at the 45nm process node, the energy required to fetch data from DRAM is 200 to 1,000 times higher than the energy consumed by a 64-bit Floating-Point Multiply-Accumulate (FMA) operation itself. Performance and power efficiency therefore depend not on the number of compute units, but on minimizing data movement distance. To achieve this, NPUs employ a highly optimized 3-level memory hierarchy.

1. Quantitative Understanding of Data Movement Costs

The memory hierarchy is designed around a fundamental correlation between physical distance and energy consumption (normalized energy cost per access at 45nm):

  • PE Register File: 1 (Closest, Cheapest)
  • PE Local Scratchpad: ~2x
  • Global Buffer (Shared SRAM): ~20x
  • Off-chip Memory (DRAM): ~200x to 1,000x (Farthest, Most Expensive)
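The impact of these ratios can be sketched numerically. The following is an illustrative model, not hardware data: it uses the normalized costs above with a hypothetical access profile, and shows how even a tiny fraction of DRAM accesses dominates the total energy budget.

```python
# Normalized energy cost per access at each level (45nm figures above).
ENERGY = {"register": 1, "scratchpad": 2, "global_buffer": 20, "dram": 200}

def total_energy(accesses: dict) -> int:
    """Sum normalized energy over a per-level access-count profile."""
    return sum(ENERGY[level] * count for level, count in accesses.items())

# Hypothetical profile: fewer than 1% of accesses hit DRAM...
profile = {
    "register": 1_000_000,
    "scratchpad": 100_000,
    "global_buffer": 10_000,
    "dram": 10_000,
}
e = total_energy(profile)
# ...yet DRAM still accounts for the majority of the energy.
dram_share = ENERGY["dram"] * profile["dram"] / e
```

Under this sketch, DRAM is under 1% of accesses but roughly 59% of the energy, which is exactly why the design goal below is phrased in terms of DRAM avoidance.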

This disparity clarifies the ultimate mission of NPU design: "Minimize DRAM access and solve as much as possible on-chip." This is the essence of Data Reuse.

2. 3-Level NPU Memory Hierarchy Analysis

Level 1: Off-chip Memory (DRAM/HBM) - The Warehouse

  • Role: Stores all model parameters (Weights) and large-scale feature maps.
  • Characteristic: Largest capacity (GBs) but high latency and limited bandwidth (Memory Wall). Modern high-performance NPUs use HBM (High Bandwidth Memory) to overcome bandwidth limitations.

Level 2: Global Buffer (On-chip SRAM) - The Distribution Center

  • Role: A buffer zone between DRAM and PEs. It prefetches and stores "Tiles" of data to be processed next.
  • Characteristic: High-speed SRAM with several MBs of capacity (e.g., Google TPU v1’s 24MB Unified Buffer).
  • Strategy: Uses Double Buffering to hide DRAM access latency by loading the next data chunk into Buffer B while the PE array processes Buffer A.
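The ping-pong pattern behind double buffering can be sketched in software. This is a minimal illustration, with caller-supplied `load_tile` and `compute_tile` functions standing in for the DRAM transfer and the PE array, respectively:

```python
from concurrent.futures import ThreadPoolExecutor

def run_double_buffered(tiles, load_tile, compute_tile):
    """Overlap loads with compute using two buffers (ping-pong).

    `load_tile` models a DRAM -> buffer transfer; `compute_tile` models the
    PE array consuming one filled buffer. While the PEs process the current
    buffer, the loader thread fills the other one in the background.
    """
    results = []
    with ThreadPoolExecutor(max_workers=1) as loader:
        pending = loader.submit(load_tile, tiles[0])   # prefetch tile 0 into "Buffer A"
        for i in range(len(tiles)):
            current = pending.result()                 # wait until the buffer is filled
            if i + 1 < len(tiles):
                # kick off the next load into "Buffer B" before computing
                pending = loader.submit(load_tile, tiles[i + 1])
            results.append(compute_tile(current))      # PEs work while the load proceeds
    return results
```

If `compute_tile` takes at least as long as `load_tile`, the transfer latency is fully hidden behind the computation.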

Level 3: PE Register File (RF) - The Workbench

  • Role: Supplies data directly to the MAC units within the PE.
  • Characteristic: Tiny capacity (KBs) but single-cycle access and negligible energy consumption.
  • Strategy: This is where Dataflow strategies (Weight Stationary, Output Stationary, Row Stationary) come into play. By keeping specific data (e.g., weights) stationary in registers, the PE avoids redundant data requests to higher levels of the hierarchy.
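The Weight Stationary idea is purely a question of loop order. The sketch below is an illustrative 1-D convolution (not any specific NPU's dataflow): each weight is fetched once, modeling "load into the PE register", and reused across every input position before the next weight is fetched.

```python
def weight_stationary_conv1d(inputs, weights):
    """1-D convolution with a weight-stationary loop order (illustrative).

    The outer loop fetches each weight exactly once; the inner loop reuses it
    across all output positions, instead of re-fetching weights per output.
    """
    out_len = len(inputs) - len(weights) + 1
    outputs = [0] * out_len
    for k, w in enumerate(weights):       # fetch weight w once: it stays "stationary"
        for i in range(out_len):          # reuse w across every output position
            outputs[i] += w * inputs[i + k]
    return outputs
```

Swapping the two loops would compute the same result but fetch each weight `out_len` times, which is exactly the redundant traffic the stationary strategy eliminates.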

3. Strategy: Tiling (Blocking)

When model weights and activations exceed the on-chip buffer capacity, the workload is processed using Tiling.

  1. Slicing: Divide massive matrices into smaller "Tiles" that fit into the buffer.
  2. Mapping: Load one tile from DRAM to the buffer.
  3. Maximum Reuse: Iteratively process the data within the tile across the PE array.
  4. Write-back: Save the final result back to DRAM.
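The four steps above can be sketched as a blocked matrix multiply. This is a generic software illustration (plain nested lists, a hypothetical `tile` edge length), not an NPU kernel:

```python
def tiled_matmul(A, B, tile):
    """Tiled (blocked) matrix multiply over plain nested lists.

    A: n x m, B: m x p. Each block is conceptually "loaded" once (steps 1-2)
    and fully reused by the innermost loops (step 3) before moving on.
    """
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]                  # accumulator; step 4's write-back target
    for i0 in range(0, n, tile):                     # steps 1-2: select the next tile
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                # step 3: maximum reuse inside the resident tile
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, p)):
                        for k in range(k0, min(k0 + tile, m)):
                            C[i][j] += A[i][k] * B[k][j]
    return C                                         # step 4: final result written back
```

The result is identical to an untiled multiply; only the order of memory accesses changes, so that each tile's data is reused many times while it is resident in the (fast, small) buffer.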

Maintaining a high Arithmetic Intensity (FLOPs performed per byte fetched from DRAM) is vital; otherwise, the system becomes Memory-bound due to frequent DRAM fetches.
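Arithmetic intensity is easy to estimate for a matrix multiply. The sketch below assumes, optimistically, that each operand and the result cross the DRAM boundary exactly once (i.e., tiling achieves perfect on-chip reuse); `bytes_per_elem=2` models FP16 data:

```python
def matmul_arithmetic_intensity(n, m, p, bytes_per_elem=2):
    """FLOPs per DRAM byte for an (n x m) @ (m x p) matmul, assuming each
    operand and the result are transferred exactly once (ideal reuse)."""
    flops = 2 * n * m * p                                # one multiply + one add per term
    dram_bytes = (n * m + m * p + n * p) * bytes_per_elem
    return flops / dram_bytes

# For square 1024 x 1024 FP16 matrices, intensity is 1024/3, about 341 FLOPs/byte.
ai = matmul_arithmetic_intensity(1024, 1024, 1024)
```

If the tiles are too small to achieve this reuse, the denominator grows with repeated fetches and the intensity collapses toward the memory-bound regime.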

4. Conclusion

A superior NPU is not just about TOPS; it's about the intelligent utilization of the narrow but fast on-chip space to minimize trips to the expensive DRAM warehouse.

References: NVIDIA Tech Blog
