{"id":1383,"date":"2026-02-26T14:42:45","date_gmt":"2026-02-26T05:42:45","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1383"},"modified":"2026-02-27T10:07:58","modified_gmt":"2026-02-27T01:07:58","slug":"ai-architecture-16-npu-optimization-memory-hierarchy","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-16-npu-optimization-memory-hierarchy\/","title":{"rendered":"AI Architecture 16. Memory Hierarchy: Minimize Data Movement Costs"},"content":{"rendered":"<p class=\"wp-block-paragraph\">As we explored in the previous post (<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-15-systolic-array-architecture\/\">Systolic Array<\/a>), while powerful Processing Elements (PEs) are essential, the primary concern for a system architect is: \"How can we supply data seamlessly and cost-effectively?\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">According to Professor Mark Horowitz of Stanford, at the 45nm process node, the energy required to fetch data from DRAM is 200 to 1,000 times higher than the energy consumed by a 64-bit Floating-Point Multiply-Accumulate (FMA) operation itself. 
Performance and power efficiency depend not on the number of units, but on \"Minimizing Data Movement Distance.\" To achieve this, NPUs employ a highly optimized 3-level memory hierarchy.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1383_694810-81 .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1383_8b4693-43 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1383_8b4693-43 > .kt-inside-inner-col,.kadence-column1383_8b4693-43 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1383_8b4693-43 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1383_8b4693-43 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1383_8b4693-43 > .kt-inside-inner-col > 
.aligncenter{width:100%;}.kadence-column1383_8b4693-43 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1383_8b4693-43{position:relative;}@media all and (max-width: 1024px){.kadence-column1383_8b4693-43 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1383_8b4693-43 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1383_8b4693-43\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Related articles<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-14-dataflow-taxonomy-ws-os-rs\/\" data-type=\"post\" data-id=\"1352\">AI Architecture 14. Dataflow Taxonomy: TPU vs Output Stationary vs Row Stationary<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-15-systolic-array-architecture\/\" data-type=\"post\" data-id=\"1364\">AI Architecture 15. The Heart of Systolic Array<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. Quantitative Understanding of Data Movement Costs<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The memory hierarchy is designed based on the fundamental correlation between physical distance and power consumption. 
The approximate energy cost per access at 45nm, normalized to a register-file access, is:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>PE Register File:<\/strong> 1x (Closest, Cheapest)<\/li>\n\n\n\n<li><strong>PE Local Scratchpad:<\/strong> ~2x<\/li>\n\n\n\n<li><strong>Global Buffer (Shared SRAM):<\/strong> ~20x<\/li>\n\n\n\n<li><strong>Off-chip Memory (DRAM):<\/strong> <strong>~200x to 1,000x<\/strong> (Farthest, Most Expensive)<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">This disparity clarifies the ultimate mission of NPU design: <strong>\"Minimize DRAM access and do as much work as possible on-chip.\"<\/strong> This is the essence of Data Reuse.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. 3-Level NPU Memory Hierarchy Analysis<\/h2>\n\n\n\n<h3 class=\"wp-block-heading\">Level 1: Off-chip Memory (DRAM\/HBM) - The Warehouse<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Role:<\/strong> Stores all model parameters (Weights) and large-scale feature maps.<\/li>\n\n\n\n<li><strong>Characteristic:<\/strong> Largest capacity (GBs) but high latency and limited bandwidth (Memory Wall). Modern high-performance NPUs use HBM (High Bandwidth Memory) to mitigate bandwidth limitations.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Level 2: Global Buffer (On-chip SRAM) - The Distribution Center<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Role:<\/strong> A buffer zone between DRAM and PEs. 
It prefetches and stores \"Tiles\" of data to be processed next.<\/li>\n\n\n\n<li><strong>Characteristic:<\/strong> High-speed SRAM with several MBs of capacity (e.g., Google TPU v1\u2019s 24MB Unified Buffer).<\/li>\n\n\n\n<li><strong>Strategy:<\/strong> <strong>Uses Double Buffering<\/strong> to hide DRAM access latency by loading the next data chunk into Buffer B while the PE array processes Buffer A.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">Level 3: PE Register File (RF) - The Workbench<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Role:<\/strong> Supplies data directly to the MAC units within the PE.<\/li>\n\n\n\n<li><strong>Characteristic:<\/strong> Tiny capacity (KBs) but single-cycle access and the lowest energy cost per access.<\/li>\n\n\n\n<li><strong>Strategy:<\/strong> This is where the <a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-14-dataflow-taxonomy-ws-os-rs\/\">Dataflow (WS, OS, RS)<\/a> is applied. By keeping specific data (e.g., weights) Stationary in registers, it prevents redundant data requests to the Global Buffer and DRAM.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3. Strategy: Tiling (Blocking)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When model weights exceed on-chip buffer capacity, the NPU uses <strong>Tiling<\/strong>.<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Slicing:<\/strong> Divide massive matrices into smaller \"Tiles\" that fit into the buffer.<\/li>\n\n\n\n<li><strong>Mapping:<\/strong> Load one tile from DRAM to the buffer.<\/li>\n\n\n\n<li><strong>Maximum Reuse:<\/strong> Iteratively process the data within the tile across the PE array.<\/li>\n\n\n\n<li><strong>Write-back:<\/strong> Save the final result back to DRAM.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Maintaining a high Arithmetic Intensity (operations performed per byte fetched from DRAM) is vital; otherwise, the system becomes Memory-bound due to frequent DRAM fetches.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. 
Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">A superior NPU is not just about TOPS; it's about the <strong>intelligent utilization of the small but fast on-chip memory to minimize trips to the expensive DRAM warehouse<\/strong>.<\/p>\n\n\n<style>.kadence-column1383_80ea5b-17 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1383_80ea5b-17 > .kt-inside-inner-col,.kadence-column1383_80ea5b-17 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1383_80ea5b-17 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1383_80ea5b-17 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1383_80ea5b-17 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1383_80ea5b-17 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1383_80ea5b-17{position:relative;}@media all and (max-width: 1024px){.kadence-column1383_80ea5b-17 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1383_80ea5b-17 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1383_80ea5b-17\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Related articles<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-14-dataflow-taxonomy-ws-os-rs\/\" data-type=\"post\" data-id=\"1352\">AI Architecture 14. Dataflow Taxonomy: TPU vs Output Stationary vs Row Stationary<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-15-systolic-array-architecture\/\" data-type=\"post\" data-id=\"1364\">AI Architecture 15. 
The Heart of Systolic Array<\/a><\/p>\n<\/div><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">References: <a href=\"https:\/\/developer.nvidia.com\/blog\/\" target=\"_blank\" rel=\"noopener\">NVIDIA Tech Blog<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>As we explored in the previous post (Systolic Array), while powerful Processing Elements (PEs) are essential, the primary concern for a system architect is: \"How can we supply data seamlessly and cost-effectively?\"<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1383","post","type-post","status-publish","format-standard","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1383","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1383"}],"version-history":[{"count":4,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1383\/revisions"}],"predecessor-version":[{"id":1419,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1383\/revisions\/1419"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1383"}],"wp:term":[{"taxono
my":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/categories?post=1383"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1383"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}