{"id":1324,"date":"2026-01-16T11:21:18","date_gmt":"2026-01-16T02:21:18","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1324"},"modified":"2026-01-19T11:26:56","modified_gmt":"2026-01-19T02:26:56","slug":"ai-architecture-10-pooling-padding-hardware-issue","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-10-pooling-padding-hardware-issue\/","title":{"rendered":"AI Architecture 10. Padding and Pooling Hardware Issues"},"content":{"rendered":"<p class=\"wp-block-paragraph\">In the previous <a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-9-convolution-operation-mapping\/\" data-type=\"post\" data-id=\"1312\">3 Mappings of Conv Operations<\/a>, we explored the massive trade-off (like Im2Col) of exchanging memory capacity for computation speed to optimize Convolution operations on hardware.<\/p>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">If the primary workload of a CNN accelerator is concentrated on Convolution, there are essential operations that must accompany it for the functional completeness of the architecture: <strong>Pooling<\/strong> and <strong>Padding<\/strong>.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>padding=1: \"Just fill the border with a line of zeros.\"<\/li>\n\n\n\n<li>MaxPool2d(2): \"Pick the largest number out of this 2x2 grid.\"<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">To a software engineer, these are merely options. 
However, these simple tasks, which account for less than 1% of the model's total FLOPs, present hardware architects with the structural headaches of \"Irregularity\" and \"Buffering.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this article, we will uncover the hardware issues of Pooling and Padding, the culprits that quietly consume chip Area and complicate control logic behind the main MAC units.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1324_830c74-ec .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1324_202a14-85 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1324_202a14-85 > .kt-inside-inner-col,.kadence-column1324_202a14-85 > 
.kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1324_202a14-85 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1324_202a14-85 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1324_202a14-85 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1324_202a14-85 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1324_202a14-85{position:relative;}@media all and (max-width: 1024px){.kadence-column1324_202a14-85 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1324_202a14-85 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1324_202a14-85\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Related articles<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/\" data-type=\"post\" data-id=\"1255\">AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. 
Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. Padding: How to Process '0', the Non-Existent Data<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Zero Padding is a technique of filling the periphery of an image with zeros to maintain image size or preserve edge features. The question is, \"Where do we get these zeros from?\"<\/p>\n\n\n<style>.kb-image1324_87f535-49.kb-image-is-ratio-size, .kb-image1324_87f535-49 .kb-image-is-ratio-size{max-width:330px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1324_87f535-49.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1324_87f535-49 .kb-image-is-ratio-size{align-self:unset;}.kb-image1324_87f535-49 figure{max-width:330px;}.kb-image1324_87f535-49 .image-is-svg, .kb-image1324_87f535-49 .image-is-svg img{width:100%;}.kb-image1324_87f535-49 .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1324_87f535-49.kb-image-is-ratio-size, .kb-image1324_87f535-49 .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1324_87f535-49 figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1324_87f535-49\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/medium.com\/@draj0718\/zero-padding-in-convolutional-neural-networks-bf1410438e99\" class=\"kb-advanced-image-link\" target=\"_blank\" rel=\"noopener noreferrer\"><img data-dominant-color=\"a5c2d6\" data-has-transparency=\"false\" style=\"--dominant-color: #a5c2d6;\" loading=\"lazy\" decoding=\"async\" width=\"336\" height=\"336\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-5-edited.jpg\" alt=\"Zero Padding\" class=\"kb-img wp-image-1326 not-transparent\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-5-edited.jpg 336w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-5-edited-300x300.jpg 300w, 
https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-5-edited-150x150.jpg 150w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-5-edited-12x12.jpg 12w\" sizes=\"auto, (max-width: 336px) 100vw, 336px\" \/><\/a><figcaption>Zero Padding<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">Software Approach (Memory Waste)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">The easiest way is to actually create a new image in memory (DRAM) with a border filled with zeros. However, this is a massive waste of bandwidth. Asking a hardware engineer to use expensive DRAM bandwidth to read meaningless \"zero\" data is unacceptable.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Hardware Approach (On-the-fly Generation)<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Therefore, NPUs use an <strong>\"On-the-fly (Real-time Generation)\"<\/strong> method. Only the original image is stored in memory, and the input port logic calculates Coordinates to insert fake '0's as data is read. This requires complex Control Logic (FSM: Finite State Machine).<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li>It must check every clock cycle whether the current pixel coordinate (x, y) is outside the image Boundary.<\/li>\n\n\n\n<li>If it is outside, it must Stall the memory read and instead inject a '0' value into the arithmetic unit via a Multiplexer (MUX).<\/li>\n\n\n\n<li>This boundary check logic, while seemingly simple, can become a primary cause of Timing issues when the chip operates at high speeds.<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\">2. Pooling: The Dilemma of Streaming Data and Line Buffers<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Max Pooling (2 * 2) is an operation that selects the maximum value among 4 pixels. It looks like a very lightweight operation implementable with just a few Comparators. 
However, the real problem lies in the \"Order of Data Arrival.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Hardware reads an image not as a whole, but Row-by-Row, like a TV scan line (Raster Scan order).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">The Necessity of Line Buffers<\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">To perform pooling with a 2 * 2 window, data from the first row (Row N) and the second row (Row N+1) are needed simultaneously. However, since data arrives one row at a time, the hardware must store the entire first row somewhere and wait until the second row arrives. The memory required for this is called a Line Buffer.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The wider the image (e.g., a 4K frame), the larger the Line Buffer must be, since its size scales with image width, channel count, and data bit-width.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Cost Analysis:<\/strong> Just to perform a few comparison operations, we must allocate KB to MB of <strong>SRAM<\/strong> to store an entire line of the image. This is a significant overhead in terms of Chip Area.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3. Synchronization and Pipeline Bubbles<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Structures where a Pooling layer immediately follows a Convolution layer (Conv-Pool) are very common. Here, a Rate Mismatch problem occurs.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Conv Output:<\/strong> Emits one pixel every clock cycle (assuming Stride=1).<\/li>\n\n\n\n<li><strong>Pool (2 * 2) Input:<\/strong> Buffers the first row, then, as the second row arrives, reduces each group of 4 pixels to 1 result.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph\">The Pooling unit must remain Idle while waiting for data, and then process it instantly once collected. This process creates Bubbles where the pipeline flow is interrupted. 
To prevent this, an additional FIFO (First-In-First-Out) buffer is needed between Conv and Pool.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Ultimately, Pooling, which was supposed to be a \"simple operation,\" becomes a rather heavy module accompanied by <strong>Line Buffers (SRAM) + FIFOs + Complex Control Logic<\/strong>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">4. The Trap of Global Average Pooling (GAP)<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Global Average Pooling, used at the end of ResNet or MobileNet, is even more severe. It must average every value in each channel's feature map, which is typically 7 * 7 or larger.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This means holding the Accumulator value until the entire feature map has streamed through. In a streaming architecture, GAP becomes a Latency Bottleneck where no result can be output until all of the channel's data has been received.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Conclusion<\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">From a hardware architecture perspective, \"Computational Simplicity\" and \"Implementation Simplicity\" are distinct. 
While Padding and Pooling are kindergarten-level math, for hardware that must stream data in real-time, they are obstacles that disrupt data flow and force buffering.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">The recent trend of gradually eliminating Pooling layers, whether in <strong>Transformer<\/strong> models or in CNNs that downsample with <strong>Stride=2 Convolution<\/strong>, is not unrelated to this Hardware Efficiency, in addition to accuracy considerations.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">References: <em><a href=\"https:\/\/ieeexplore.ieee.org\/document\/4384049\" target=\"_blank\" rel=\"noopener\">Efficient Hardware Architecture for Moving Window Operations<\/a><\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>In the previous 3 Mappings of Conv Operations, we explored the massive trade-off (like Im2Col) of exchanging memory 
...<\/p>","protected":false},"author":1,"featured_media":1326,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1324","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1324","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1324"}],"version-history":[{"count":2,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1324\/revisions"}],"predecessor-version":[{"id":1370,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1324\/revisions\/1370"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media\/1326"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1324"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/categories?post=1324"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1324"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}