{"id":1294,"date":"2026-01-14T10:16:23","date_gmt":"2026-01-14T01:16:23","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1294"},"modified":"2026-01-14T10:16:26","modified_gmt":"2026-01-14T01:16:26","slug":"ai-architecture-8-cnn-locality-sram-data-reuse","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-8-cnn-locality-sram-data-reuse\/","title":{"rendered":"AI Architecture 8. CNN and Locality"},"content":{"rendered":"<p class=\"wp-block-paragraph\">In the previous post, we confirmed how inefficient MLP (Fully Connected Layer) is from a hardware perspective. Due to its structure of fetching a weight once, using it exactly once, and then discarding it, system performance suffers from the Memory Wall phenomenon, limited by memory bandwidth.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, the real protagonist that allowed deep learning to change the world was not the MLP, but the CNN (Convolutional Neural Network). While algorithm researchers praise CNNs for \"capturing spatial features of images well,\" Hardware Architects like us love CNNs for a completely different reason.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">That reason is \"Locality\" and \"Reuse.\" In this article, we will uncover the physical reasons why the <strong>Sliding Window<\/strong> method of CNNs maximizes the efficiency of the <strong>SRAM (On-chip Buffer)<\/strong> inside semiconductor chips and why NPUs can only unleash their full performance (TOPS) when running CNNs.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 
.kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1294_317aa1-60 .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1294_eef87d-a6 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1294_eef87d-a6 > .kt-inside-inner-col,.kadence-column1294_eef87d-a6 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1294_eef87d-a6 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1294_eef87d-a6 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1294_eef87d-a6 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1294_eef87d-a6 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1294_eef87d-a6{position:relative;}@media all and (max-width: 1024px){.kadence-column1294_eef87d-a6 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1294_eef87d-a6 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1294_eef87d-a6\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Related 
articles<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/\" data-type=\"post\" data-id=\"1255\">AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. Locality: The Reason for Cache Memory<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">One of the most important concepts in Computer Architecture theory is the Locality of Reference.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Temporal Locality:<\/strong> If data was referenced recently, it is likely to be referenced again soon.<\/li>\n\n\n\n<li><strong>Spatial Locality:<\/strong> If data was referenced, data located near it is likely to be referenced soon.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">Caches and SRAM buffers inside CPUs, GPUs, and NPUs are built around this principle. The idea is to keep frequently used data in the fast, on-chip SRAM instead of going all the way to slow, power-hungry DRAM. An MLP gives this principle little to work with, because each weight is used once and then discarded. In contrast, the CNN is the ultimate champion of locality.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. 
Sliding Window: Data Reuse<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let's look at the core operation of a CNN, convolution, from a hardware perspective. A 3 * 3 filter (kernel) slides one step at a time over a large input image, computing a dot product with the pixels under it at each position. Tremendous Data Reuse occurs during this process.<\/p>\n\n\n<style>.kb-image1294_f081b4-85.kb-image-is-ratio-size, .kb-image1294_f081b4-85 .kb-image-is-ratio-size{max-width:260px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1294_f081b4-85.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1294_f081b4-85 .kb-image-is-ratio-size{align-self:unset;}.kb-image1294_f081b4-85 figure{max-width:260px;}.kb-image1294_f081b4-85 .image-is-svg, .kb-image1294_f081b4-85 .image-is-svg img{width:100%;}.kb-image1294_f081b4-85 .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1294_f081b4-85.kb-image-is-ratio-size, .kb-image1294_f081b4-85 .kb-image-is-ratio-size{max-width:260px;width:100%;}.kb-image1294_f081b4-85 figure{max-width:260px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1294_f081b4-85\"><figure class=\"aligncenter size-full\"><img data-dominant-color=\"e0e383\" data-has-transparency=\"false\" style=\"--dominant-color: #e0e383;\" loading=\"lazy\" decoding=\"async\" width=\"264\" height=\"200\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-2.jpg\" alt=\"Kernel sliding\" class=\"kb-img wp-image-1305 not-transparent\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-2.jpg 264w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-2-16x12.jpg 16w\" sizes=\"auto, (max-width: 264px) 100vw, 264px\" \/><figcaption>Kernel sliding<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">A. Weight Reuse<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">In an MLP, a weight is multiplied by one input and then discarded. 
But in a CNN, a single filter (weight set) sweeps from the top-left to the bottom-right of the image.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If the input image is 224 * 224, and we assume a stride of 1 with padding so that the output is also 224 * 224, the same 3 * 3 filter weights are reused at each of the 50,176 output positions (224 * 224).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>DRAM Access:<\/strong> 1 time (Filter Load)<\/li>\n\n\n\n<li><strong>Operation (MAC):<\/strong> Over 50,000 per weight, or about 450,000 (9 * 50,176) for the whole filter<\/li>\n\n\n\n<li><strong>Result:<\/strong> Arithmetic Intensity (operations per byte fetched from DRAM) rises dramatically.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">B. Input Reuse<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The Sliding Window moves sideways by one step. At this point, the previous window and the current window share (overlap) most pixels.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">When a 3 * 3 window moves one step, 6 out of the 9 pixels are identical to the previous step. This means input data doesn't need to be fetched from DRAM every time; it can be held in on-chip registers or a Line Buffer and reused across consecutive windows.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">3. SRAM Efficiency<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Thanks to these reuse characteristics, CNN-specific NPUs (Accelerators) can adopt the following memory hierarchy strategy:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Load:<\/strong> Fetch filters (Weights) and a part of the image (Input Row) from DRAM to the Global Buffer (Large SRAM) inside the chip.<\/li>\n\n\n\n<li><strong>Multicast:<\/strong> Distribute the data in the Global Buffer to hundreds of Processing Elements (PEs).<\/li>\n\n\n\n<li><strong>Compute &amp; Reuse:<\/strong> Each PE stores data in its local Register File (RF) and performs thousands of multiplications without touching DRAM.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">This is the secret behind the high performance of NPUs. 
They minimize high-energy DRAM accesses and complete computations at the low-energy SRAM and Register levels.<\/p>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p class=\"wp-block-paragraph\">Quantitative Fact:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Based on a 45nm process, the energy for a single DRAM access is about 640 pJ, while accessing a small on-chip SRAM (8KB) is about 10 pJ. Thanks to the high reusability of CNNs, we can reduce a 640 pJ cost to 10 pJ, and further down to the register level (0.1 pJ). This is why CNNs are hardware-friendly.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">4. Compute-Bound<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In the last post, we described MLP as Memory-Bound. The arithmetic units are idle because data isn't arriving.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, because CNNs have high data reuse rates, once data is fetched, the arithmetic units can chew on it for a long time. In other words, we enter the <strong>Compute-Bound<\/strong> domain where computation speed (TOPS) determines overall performance, not memory bandwidth.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">From this point on, the architect's skill becomes crucial. \"How do we keep thousands of multipliers (MACs) running at 100% utilization?\" This concern leads directly to Dataflow Optimization and Mapping Strategies.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Conclusion: CNN is a Blessing for Hardware<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">In conclusion, hardware loves CNNs not just because they are \"famous,\" but because they possess a structure with \"High Arithmetic Intensity that allows massive amounts of computation with little memory bandwidth.\" With the advent of CNNs, AI semiconductors finally moved beyond being 'memory shuttles' to becoming true 'computational accelerators.'<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, nothing is perfect. 
When trying to map this efficient CNN operation to actual hardware, a new challenge begins: how to unravel (Unroll) the complex 6-level loops.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the next post, we will explore the three core strategies for processing CNN operations: \"Three Mappings of Conv Operations: Direct vs. Im2Col vs. Winograd.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">References: <em><a href=\"https:\/\/ieeexplore.ieee.org\/document\/7738524\" target=\"_blank\" rel=\"noopener\">An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks<\/a><\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>In the previous post, we confirmed how inefficient MLP (Fully Connected Layer) is from a hardware perspective. Due to its structure of fetching a weight 
once,<\/p>","protected":false},"author":1,"featured_media":1305,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1294","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1294","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1294"}],"version-history":[{"count":4,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1294\/revisions"}],"predecessor-version":[{"id":1322,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1294\/revisions\/1322"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media\/1305"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1294"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/categories?post=1294"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1294"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}