{"id":1255,"date":"2026-01-06T12:24:32","date_gmt":"2026-01-06T03:24:32","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1255"},"modified":"2026-01-08T11:54:17","modified_gmt":"2026-01-08T02:54:17","slug":"ai-architecture-2-activation-relu-vs-sigmoid","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/","title":{"rendered":"AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid"},"content":{"rendered":"<p class=\"wp-block-paragraph translation-block\">In the previous post, we examined how expensive the MAC (Multiply-Accumulate) operation\u2014the core of artificial neurons\u2014is in terms of hardware, specifically regarding Multiplier area and Memory Bandwidth.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>Y<\/mi><mo>=<\/mo><mo movablelimits=\"false\">\u2211<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>W<\/mi><mo>\u00d7<\/mo><mi>X<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>+<\/mo><mi>B<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">Y = \\sum (W \\times X) + B<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">Once the MAC operation is complete, the resulting value ($Y$) must pass through a gateway called the Activation Function before moving to the next layer. For a software engineer, this is a trivial task, often just a single function call like torch.relu() or torch.sigmoid(). 
However, for a hardware engineer, the activation function is a complex optimization problem balancing \"Computational Complexity\" and \"Silicon Area.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In this post, we will dig into how the activation functions that provide non-linearity in deep learning are implemented on silicon, and the physical reasons why modern NPUs avoid <strong>Sigmoid<\/strong> and have little choice but to favor <strong>ReLU<\/strong>.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1255_dc1b99-9a .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 
0.9rem);}}<\/style>\n\n<style>.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col,.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1255_b5b72b-b5{position:relative;}@media all and (max-width: 1024px){.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1255_b5b72b-b5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1255_b5b72b-b5\"><div class=\"kt-inside-inner-col\">\n<p class=\"wp-block-paragraph\"><strong>Related articles<\/strong><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p class=\"wp-block-paragraph\">\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>1. 
Necessity of Non-linearity<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Before discussing hardware costs, we must address why we use expensive activation functions at all. Without activation functions, a deep learning model collapses into a single giant linear transformation, no matter how many layers it stacks. For two layers, for example:<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>Y<\/mi><mo>=<\/mo><msub><mi>W<\/mi><mn>2<\/mn><\/msub><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>W<\/mi><mn>1<\/mn><\/msub><mi>X<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><msub><mi>W<\/mi><mn>2<\/mn><\/msub><msub><mi>W<\/mi><mn>1<\/mn><\/msub><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mi>X<\/mi><mo>=<\/mo><msub><mi>W<\/mi><mrow><mi>n<\/mi><mi>e<\/mi><mi>w<\/mi><\/mrow><\/msub><mi>X<\/mi><\/mrow><annotation encoding=\"application\/x-tex\">Y = W_2(W_1X) = (W_2 W_1)X = W_{new}X<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">To learn complex patterns in data, we absolutely need Non-linear Functions to break this linearity. The problem is that implementing non-linearity in hardware is surprisingly tricky.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>2. 
Sigmoid &amp; Tanh: The Transcendental Nightmare<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">Let's look at the formulas for <strong>Sigmoid<\/strong> and <strong>Tanh (Hyperbolic Tangent)<\/strong>, which were popular in early deep learning models.<\/p>\n\n\n<style>.kb-image1255_f2abe5-7d.kb-image-is-ratio-size, .kb-image1255_f2abe5-7d .kb-image-is-ratio-size{max-width:500px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1255_f2abe5-7d.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1255_f2abe5-7d .kb-image-is-ratio-size{align-self:unset;}.kb-image1255_f2abe5-7d figure{max-width:500px;}.kb-image1255_f2abe5-7d .image-is-svg, .kb-image1255_f2abe5-7d .image-is-svg img{width:100%;}.kb-image1255_f2abe5-7d .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1255_f2abe5-7d.kb-image-is-ratio-size, .kb-image1255_f2abe5-7d .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1255_f2abe5-7d figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1255_f2abe5-7d\"><figure class=\"aligncenter size-full\"><img data-dominant-color=\"f7f7f6\" data-has-transparency=\"false\" style=\"--dominant-color: #f7f7f6;\" loading=\"lazy\" decoding=\"async\" width=\"523\" height=\"348\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-3.jpg\" alt=\"Sigmoid &amp; Tanh\" class=\"kb-img wp-image-1308 not-transparent\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-3.jpg 523w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-3-300x200.jpg 300w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-3-18x12.jpg 18w\" sizes=\"auto, (max-width: 523px) 100vw, 523px\" \/><figcaption>Sigmoid &amp; Tanh<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>S<\/mi><mi>i<\/mi><mi>g<\/mi><mi>m<\/mi><mi>o<\/mi><mi>i<\/mi><mi>d<\/mi><mo form=\"prefix\" 
stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mn>1<\/mn><mrow><mn>1<\/mn><mo>+<\/mo><msup><mi>e<\/mi><mrow><mo lspace=\"0em\" rspace=\"0em\">\u2212<\/mo><mi>x<\/mi><\/mrow><\/msup><\/mrow><\/mfrac><mo separator=\"true\">,<\/mo><mspace width=\"1em\"><\/mspace><mi>T<\/mi><mi>a<\/mi><mi>n<\/mi><mi>h<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mfrac><mrow><msup><mi>e<\/mi><mi>x<\/mi><\/msup><mo>\u2212<\/mo><msup><mi>e<\/mi><mrow><mo lspace=\"0em\" rspace=\"0em\">\u2212<\/mo><mi>x<\/mi><\/mrow><\/msup><\/mrow><mrow><msup><mi>e<\/mi><mi>x<\/mi><\/msup><mo>+<\/mo><msup><mi>e<\/mi><mrow><mo lspace=\"0em\" rspace=\"0em\">\u2212<\/mo><mi>x<\/mi><\/mrow><\/msup><\/mrow><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">Sigmoid(x) = \\frac{1}{1 + e^{-x}}, \\quad Tanh(x) = \\frac{e^x - e^{-x}}{e^x + e^{-x}}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">The term that drives hardware engineers to despair is the exponential function e<sup>x<\/sup>.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Unlike addition or multiplication, exponentials are Transcendental Functions. Digital Logic is fundamentally based on binary arithmetic (0s and 1s), so hardware that perfectly calculates transcendental functions does not exist. There is only \"Approximation.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here are three common ways to implement e<sup>x<\/sup> in hardware:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Taylor Series:<\/strong>\n<ul class=\"wp-block-list\">\n<li><strong>Cost:<\/strong> Requires numerous multipliers and adders. 
As the range of $x$ widens, the number of required terms explodes, increasing Latency.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>CORDIC Algorithm:<\/strong> An iterative method built from shift-and-add rotation steps.\n<ul class=\"wp-block-list\">\n<li><strong>Cost:<\/strong> Uses few or no multipliers but requires multiple clock cycles per result, resulting in low Throughput.<\/li>\n<\/ul>\n<\/li>\n\n\n\n<li><strong>Look-Up Table (LUT):<\/strong> Storing pre-calculated values in memory and retrieving them. (<strong>Most common method<\/strong>)<\/li>\n<\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>3. The LUT Dilemma: Area vs. Precision<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">Most NPUs use the LUT (Look-Up Table) method to implement Sigmoid or Tanh. When an input $x$ arrives, the corresponding output $y$ is fetched from memory (ROM\/RAM).<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">However, a critical trade-off exists:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>For High Precision:<\/strong> The table size grows. For instance, storing a 16-bit result for every possible 16-bit input requires the following amount of memory:<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><msup><mn>2<\/mn><mn>16<\/mn><\/msup><mo>\u00d7<\/mo><mn>16<\/mn><mtext>bit<\/mtext><mo>\u2248<\/mo><mn>1<\/mn><mtext>Mb<\/mtext><\/mrow><annotation encoding=\"application\/x-tex\">2^{16} \\times 16\\text{bit} \\approx 1\\text{Mb}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Using this much on-chip memory (SRAM) for a single neuron is prohibitively expensive.<\/li>\n\n\n\n<li><strong>For Small Area:<\/strong> We must reduce the table size. A common way to do so is <strong>PWL (Piecewise Linear Approximation)<\/strong>, which divides the curve into segments and approximates each one with a straight line. 
However, this complicates control logic and introduces approximation errors that can degrade model accuracy.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">Ultimately, Sigmoid is a headache for hardware, either \"consuming expensive memory\" or \"overworking the arithmetic units.\"<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>4. ReLU: The Savior of Hardware Architects<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">When ReLU (Rectified Linear Unit) became the standard with AlexNet in 2012, hardware engineers likely cheered louder than AI researchers.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>R<\/mi><mi>e<\/mi><mi>L<\/mi><mi>U<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mrow><mi>max<\/mi><mo>\u2061<\/mo><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mn>0<\/mn><mo separator=\"true\">,<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">ReLU(x) = \\max(0, x)<\/annotation><\/semantics><\/math><\/div>\n\n\n<style>.kb-image1255_cb5c9c-33.kb-image-is-ratio-size, .kb-image1255_cb5c9c-33 .kb-image-is-ratio-size{max-width:500px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1255_cb5c9c-33.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1255_cb5c9c-33 .kb-image-is-ratio-size{align-self:unset;}.kb-image1255_cb5c9c-33 figure{max-width:500px;}.kb-image1255_cb5c9c-33 .image-is-svg, .kb-image1255_cb5c9c-33 .image-is-svg img{width:100%;}.kb-image1255_cb5c9c-33 .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1255_cb5c9c-33.kb-image-is-ratio-size, .kb-image1255_cb5c9c-33 .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1255_cb5c9c-33 figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1255_cb5c9c-33\"><figure 
class=\"aligncenter size-large\"><img data-dominant-color=\"fcfcfc\" data-has-transparency=\"false\" style=\"--dominant-color: #fcfcfc;\" loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"768\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4-1024x768.jpg\" alt=\"\" class=\"kb-img wp-image-1310 not-transparent\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4-1024x768.jpg 1024w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4-300x225.jpg 300w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4-768x576.jpg 768w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4-16x12.jpg 16w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-4.jpg 1200w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><figcaption>ReLU<\/figcaption><\/figure><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">What is the hardware cost of this formula? Surprisingly, it is Near Zero Cost.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In hardware, numbers are typically represented in Two's Complement. In this system, determining whether a number is positive or negative only requires checking the MSB (Most Significant Bit, or Sign Bit).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>MSB = 0 (Positive):<\/strong> Pass the input x as is.<\/li>\n\n\n\n<li><strong>MSB = 1 (Negative):<\/strong> Mask all bits to 0.<\/li>\n<\/ul>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">Implemented in logic circuits, this requires just one Multiplexer (MUX) or a few AND gates. Compared to the thousands of gates needed for exponentials or the massive area of LUTs, ReLU's area and power consumption effectively converge to zero.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>5. The Hidden Benefit: Sparsity &amp; Zero-Skipping<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">The gift ReLU gives to hardware goes beyond being cheap to implement. 
The real gift is 'Zero'.<\/p>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">After passing through ReLU, a large share of the activations, often around half or more depending on the model and layer, becomes exactly 0. This is called Sparsity. Hardware architectures actively exploit these zeros to optimize performance.<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Memory Savings:<\/strong> Data that is 0 often doesn't need to be stored in or transmitted to DRAM (Compression).<\/li>\n\n\n\n<li><strong>Zero-Skipping:<\/strong> If an input is zero, the hardware can be designed to skip the multiplication entirely (for example, via Clock Gating), drastically saving power.<\/li>\n<\/ol>\n\n\n\n<p class=\"wp-block-paragraph\">Sigmoid outputs values close to 0 but rarely exactly 0, making it difficult to apply such sparsity-based optimizations. In contrast, ReLU acts as a traffic light, telling the hardware, \"You don't need to compute this!\"<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>6. Modern Trends: GELU and Swish<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">Recent models like Transformers (BERT, GPT) and EfficientNet use functions like GELU or Swish.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>S<\/mi><mi>w<\/mi><mi>i<\/mi><mi>s<\/mi><mi>h<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>x<\/mi><mo>\u22c5<\/mo><mi>S<\/mi><mi>i<\/mi><mi>g<\/mi><mi>m<\/mi><mi>o<\/mi><mi>i<\/mi><mi>d<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo separator=\"true\">,<\/mo><mspace width=\"1em\"><\/mspace><mi>G<\/mi><mi>E<\/mi><mi>L<\/mi><mi>U<\/mi><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>=<\/mo><mi>x<\/mi><mo>\u22c5<\/mo><mrow><mi mathvariant=\"normal\">\u03a6<\/mi><\/mrow><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>x<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">Swish(x) = x \\cdot Sigmoid(x), \\quad GELU(x) = x \\cdot \\Phi(x)<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p class=\"wp-block-paragraph\">Here, \u03a6(x) is the cumulative distribution function of the standard normal distribution. These functions combine the benefits of ReLU with a smooth curve and often yield higher accuracy. However, from a hardware perspective, the \"Sigmoid Nightmare\" returns.<\/p>\n\n\n\n<p class=\"wp-block-paragraph translation-block\">To handle these, modern NPUs often employ \"Hardware-friendly approximations\" (piecewise linear versions similar to ReLU) or dedicate small, specialized LUT units (Special Function Units, SFU). The balancing act between accuracy and hardware cost is ongoing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>7. Conclusion: Hardware-Friendly Algorithms Survive<\/strong><\/h2>\n\n\n\n<p class=\"wp-block-paragraph\">While software researchers introduce complex formulas to boost accuracy by 1%, System Architects worry about the silicon area and power those formulas will consume.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Ultimately, the history of deep learning has been a series of victories for \"structures that are easy to implement in hardware.\" ReLU displaced Sigmoid because of its learning efficiency, but it was ReLU's outstanding hardware efficiency that allowed it to become mainstream.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">In the next post, we will discuss the core reason why GPUs and NPUs overtook CPUs in AI computing: \"The Aesthetics of Matrix Multiplication (MatMul) and Parallel Processing.\"<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">References: <em><a href=\"https:\/\/ieeexplore.ieee.org\/document\/8114708\" target=\"_blank\" rel=\"noopener\">Efficient Processing of Deep Neural Networks<\/a><\/em><\/p>","protected":false},"excerpt":{"rendered":"<p>In the previous post, we examined how expensive the MAC (Multiply-Accumulate) operation\u2014the core of artificial neurons\u2014is in terms of hardware, specifically regarding Multiplier area and Memory Bandwidth.<\/p>","protected":false},"author":1,"featured_media":1308,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1255","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1255","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1255"}],"version-history":[{"count":7,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1255\/revisions"}],"predecessor-version":[{"id":1317,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1255\/revisions\/1317"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rtl
earner.com\/en\/wp-json\/wp\/v2\/media\/1308"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1255"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/categories?post=1255"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1255"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}