{"id":1270,"date":"2026-01-09T09:45:45","date_gmt":"2026-01-09T00:45:45","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1270"},"modified":"2026-01-08T11:54:40","modified_gmt":"2026-01-08T02:54:40","slug":"ai-architecture-5-number-formats-fp32-hardware-cost","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-5-number-formats-fp32-hardware-cost\/","title":{"rendered":"[AI Architecture] 5. The Weight of Data (Number Formats): How FP32 Impacts Hardware Area and Power"},"content":{"rendered":"<p>In the previous post, we explored the difference between Training and Inference, seeing how inference-only NPUs lighten the hardware structure. One of the key keywords for this optimization was 'Reduction of Precision.' To a software engineer, data is merely an abstract variable type like <code>float<\/code> (32-bit) or <code>int<\/code> (32-bit). However, to a System Architect designing silicon chips, data carries physical 'Weight.'<\/p>\n\n\n\n<p class=\"translation-block\">An increase in the number of bits means more <strong>strands of wire<\/strong> to transport data, more <strong>Flip-Flops<\/strong> to store it, and most importantly, an exponential increase in the Silicon Area of the logic circuits required to compute them.<\/p>\n\n\n\n<p>In this article, we will analyze why FP32 (Floating Point 32-bit), the standard for deep learning, is such a heavy and expensive format from a hardware perspective, and the butterfly effect that transitioning to INT8 (Fixed Point) brings to system performance.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b 
.kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1270_9367dd-6b .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1270_e94cc8-04 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1270_e94cc8-04 > .kt-inside-inner-col,.kadence-column1270_e94cc8-04 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1270_e94cc8-04 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1270_e94cc8-04 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1270_e94cc8-04 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1270_e94cc8-04 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1270_e94cc8-04{position:relative;}@media all and (max-width: 1024px){.kadence-column1270_e94cc8-04 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1270_e94cc8-04 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1270_e94cc8-04\"><div class=\"kt-inside-inner-col\">\n<p><strong>Related articles<\/strong><\/p>\n\n\n\n<p>\u2705<a 
href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/\" data-type=\"post\" data-id=\"1255\">AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. Structural Complexity of IEEE 754: Real Numbers<\/h2>\n\n\n\n<p>The floating-point data (FP32) we commonly use follows the <strong>IEEE 754 Standard<\/strong>. 
To represent a very wide Dynamic Range, this format splits a number into three parts:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Sign (1-bit):<\/strong> Positive\/Negative<\/li>\n\n\n\n<li><strong>Exponent (8-bit):<\/strong> Determines the magnitude\/range<\/li>\n\n\n\n<li><strong>Mantissa (23-bit):<\/strong> Determines the precision\/significant digits<\/li>\n<\/ol>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>V<\/mi><mi>a<\/mi><mi>l<\/mi><mi>u<\/mi><mi>e<\/mi><mo>=<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mo form=\"prefix\" stretchy=\"false\">\u2212<\/mo><mn>1<\/mn><msup><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mrow><mi>S<\/mi><mi>i<\/mi><mi>g<\/mi><mi>n<\/mi><\/mrow><\/msup><mo>\u00d7<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mn>1.<\/mn><mi>M<\/mi><mi>a<\/mi><mi>n<\/mi><mi>t<\/mi><mi>i<\/mi><mi>s<\/mi><mi>s<\/mi><mi>a<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><mo>\u00d7<\/mo><msup><mn>2<\/mn><mrow><mo form=\"prefix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">(<\/mo><mi>E<\/mi><mi>x<\/mi><mi>p<\/mi><mi>o<\/mi><mi>n<\/mi><mi>e<\/mi><mi>n<\/mi><mi>t<\/mi><mo>\u2212<\/mo><mn>127<\/mn><mo form=\"postfix\" stretchy=\"false\" lspace=\"0em\" rspace=\"0em\">)<\/mo><\/mrow><\/msup><\/mrow><annotation encoding=\"application\/x-tex\">Value = (-1)^{Sign} \\times (1.Mantissa) \\times 2^{(Exponent &#8211; 127)}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>While mathematically elegant, this structure is a Nightmare for hardware implementation. 
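<\/p>\n\n\n\n<p>To make the three fields concrete, here is a minimal C sketch (the struct and function names are my own, and it assumes <code>float<\/code> is IEEE 754 binary32, which is true on essentially all modern hardware) that extracts them with plain shifts and masks:<\/p>

```c
#include <stdint.h>
#include <string.h>

/* The three IEEE 754 binary32 fields. */
typedef struct {
    uint32_t sign;     /* 1 bit                              */
    uint32_t exponent; /* 8 bits, biased by 127              */
    uint32_t mantissa; /* 23 bits, hidden leading 1 implied  */
} fp32_fields;

static fp32_fields fp32_decompose(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits); /* reinterpret the bit pattern without UB */
    fp32_fields r = {
        .sign     = bits >> 31,
        .exponent = (bits >> 23) & 0xFFu,
        .mantissa = bits & 0x7FFFFFu,
    };
    return r;
}
```

<p>For example, 1.0f decomposes to sign 0, exponent 127 (the bias alone), and mantissa 0. Every arithmetic unit must first undo this packing before it can compute.<\/p>\n\n\n\n<p>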
Even a simple addition requires a complex sequence of steps:<\/p>\n\n\n\n<ol start=\"1\" class=\"wp-block-list\">\n<li><strong>Alignment (Denormalization):<\/strong> Comparing the exponents of the two operands and shifting the smaller mantissa so their radix points line up.<\/li>\n\n\n\n<li><strong>Mantissa Addition:<\/strong> Adding the aligned mantissas.<\/li>\n\n\n\n<li><strong>Normalization:<\/strong> Bit-shifting the result to restore it to the standard format (1.xxx).<\/li>\n\n\n\n<li><strong>Rounding:<\/strong> Processing the least significant bits to match the target precision.<\/li>\n<\/ol>\n\n\n\n<p>All these steps require complex logic blocks like <strong>Comparators, Barrel Shifters, and Leading Zero Detectors.<\/strong> This is why FP32 arithmetic units are expensive.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Quadratic Growth in Hardware Area<\/h2>\n\n\n\n<p>A key metric in hardware design cost is Silicon Die Area. Larger area means fewer chips per wafer (a lower net die count), lower Yield, and higher unit cost. The area of a Multiplier scales roughly quadratically, on the order of N<sup>2<\/sup>, with respect to the input bit width (N).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>FP32 Multiplier:<\/strong> Requires multiplication logic for 23-bit mantissas (roughly 24 * 24 including the hidden bit), plus exponent addition and normalization logic.<\/li>\n\n\n\n<li><strong>INT8 Multiplier:<\/strong> A simple 8 * 8 integer multiplier. No complex normalization or shifters required.<\/li>\n<\/ul>\n\n\n\n<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\">\n<p>Quantitative Analysis:<\/p>\n\n\n\n<p>Based on a 45nm process, the area of a single FP32 multiplier is roughly equivalent to that of 18.5 INT8 multipliers. This means that by abandoning FP32 for INT8, you can pack about 18 times more processing cores into the same chip area. This is the secret behind the explosive Throughput of NPUs.<\/p>\n<\/blockquote>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Power Consumption and Heat<\/h2>\n\n\n\n<p>A more serious issue is Power. The complex logic blocks of FP32 mentioned above toggle (switch) every clock cycle, consuming power.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>FP32 Addition:<\/strong> Approx. 0.9 pJ (Pico Joules)<\/li>\n\n\n\n<li><strong>INT8 Addition:<\/strong> Approx. 0.03 pJ<\/li>\n\n\n\n<li><strong>Energy Efficiency Gap:<\/strong> <strong>~30x<\/strong><\/li>\n<\/ul>\n\n\n\n<p>When running massive models with hundreds of millions of parameters, this 30x difference determines \"whether a smartphone battery dies in an hour or lasts all day.\" Furthermore, power consumption leads directly to Heat, which causes throttling that forces the chip to lower its operating clock speed.<\/p>\n\n\n<style>.kb-image1270_aacbbc-84.kb-image-is-ratio-size, .kb-image1270_aacbbc-84 .kb-image-is-ratio-size{max-width:700px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1270_aacbbc-84.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1270_aacbbc-84 .kb-image-is-ratio-size{align-self:unset;}.kb-image1270_aacbbc-84 figure{max-width:700px;}.kb-image1270_aacbbc-84 .image-is-svg, .kb-image1270_aacbbc-84 .image-is-svg img{width:100%;}.kb-image1270_aacbbc-84 .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1270_aacbbc-84.kb-image-is-ratio-size, .kb-image1270_aacbbc-84 .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1270_aacbbc-84 figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1270_aacbbc-84\"><figure class=\"aligncenter size-full\"><img data-dominant-color=\"cacce4\" data-has-transparency=\"false\" style=\"--dominant-color: #cacce4;\" loading=\"lazy\" decoding=\"async\" width=\"850\" height=\"680\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-1-2.jpg\" alt=\"\" class=\"kb-img wp-image-1276 not-transparent\" 
srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-1-2.jpg 850w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-1-2-300x240.jpg 300w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-1-2-768x614.jpg 768w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-1-2-15x12.jpg 15w\" sizes=\"auto, (max-width: 850px) 100vw, 850px\" \/><figcaption>Energy comparison for fp32 vs int4 hardware<\/figcaption><\/figure><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">4. Relieving the Memory Bandwidth Bottleneck<\/h2>\n\n\n\n<p>The weight of data is felt not just inside the arithmetic units but also on the Memory Bus, the highway transporting data.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li class=\"translation-block\">FP32: 4 Bytes per parameter<\/li>\n\n\n\n<li class=\"translation-block\">INT8: 1 Byte per parameter<\/li>\n<\/ul>\n\n\n\n<p>Given the same DRAM bandwidth (e.g., 100GB\/s), you can supply 4 times more data when loading INT8 compared to FP32.<\/p>\n\n\n\n<p>Considering that most AI inference tasks are Memory-Bound (limited by data loading speed rather than computation speed), reducing data size by 1\/4 is the most definitive optimization to boost total system performance by up to 4x.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">5. Fixed Point and Quantization<\/h2>\n\n\n\n<p>So, how do we convert FP32 to INT8? It\u2019s not just about truncating decimals. 
We use the concept of <strong>Fixed Point<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mi>R<\/mi><mi>e<\/mi><mi>a<\/mi><mi>l<\/mi><mi>_<\/mi><mi>V<\/mi><mi>a<\/mi><mi>l<\/mi><mi>u<\/mi><mi>e<\/mi><mo>\u2248<\/mo><mi>S<\/mi><mi>c<\/mi><mi>a<\/mi><mi>l<\/mi><mi>e<\/mi><mi>_<\/mi><mi>F<\/mi><mi>a<\/mi><mi>c<\/mi><mi>t<\/mi><mi>o<\/mi><mi>r<\/mi><mo>\u00d7<\/mo><mo form=\"prefix\" stretchy=\"false\">(<\/mo><mi>I<\/mi><mi>n<\/mi><mi>t<\/mi><mi>e<\/mi><mi>g<\/mi><mi>e<\/mi><mi>r<\/mi><mi>_<\/mi><mi>V<\/mi><mi>a<\/mi><mi>l<\/mi><mi>u<\/mi><mi>e<\/mi><mo>\u2212<\/mo><mi>Z<\/mi><mi>e<\/mi><mi>r<\/mi><mi>o<\/mi><mi>_<\/mi><mi>P<\/mi><mi>o<\/mi><mi>i<\/mi><mi>n<\/mi><mi>t<\/mi><mo form=\"postfix\" stretchy=\"false\">)<\/mo><\/mrow><annotation encoding=\"application\/x-tex\">Real\\_Value \\approx Scale\\_Factor \\times (Integer\\_Value - Zero\\_Point)<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>In hardware, we fix the position of the radix point (an agreement between the programmer and the hardware) and simply run Integer ALUs. This process is called Quantization.<\/p>\n\n\n\n<p>Of course, trying to fit the wide range of 32-bit values into a narrow 8-bit container results in information loss (Accuracy Drop). However, Deep Learning models have massive redundancy, so slight errors in individual parameters do not significantly affect the final result. Leveraging this, we can maintain accuracy while drastically lowering hardware costs.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">6. Conclusion: The Wisdom of Choosing the Right Container<\/h2>\n\n\n\n<p>The role of a Hardware engineer is not to build the most precise calculator possible. It is to \"select the smallest (Area) and lowest power (Power) data format that satisfies the required accuracy.\"<\/p>\n\n\n\n<p>Recently, new formats like BF16 (Bfloat16) and FP8 have emerged as compromises between FP32 and INT8, and are being adopted in modern chips such as the NVIDIA H100. 
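<\/p>\n\n\n\n<p>BF16 in particular is cheap to derive from FP32: it keeps the sign bit and the full 8-bit exponent (so the dynamic range survives) and truncates the mantissa from 23 bits to 7. A minimal round-toward-zero conversion, sketched in C under the same IEEE 754 binary32 assumption as before (the function name is illustrative):<\/p>

```c
#include <stdint.h>
#include <string.h>

/* BF16 = the upper 16 bits of an FP32 word: sign(1) + exponent(8) + mantissa(7).
   This truncates (rounds toward zero); production converters typically
   round to nearest instead. */
static uint16_t fp32_to_bf16_truncate(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}
```

<p>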
This illustrates the ongoing evolution of hardware design, constantly balancing training stability and inference efficiency.<\/p>\n\n\n<style>.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col,.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1270_fdfbdc-f7{position:relative;}@media all and (max-width: 1024px){.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1270_fdfbdc-f7 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1270_fdfbdc-f7\"><div class=\"kt-inside-inner-col\">\n<p><strong>Related articles<\/strong><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/\" data-type=\"post\" data-id=\"1255\">AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. 
The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<p>References: <a href=\"https:\/\/neurips.cc\/virtual\/2015\/tutorial\/4894\" target=\"_blank\" rel=\"noopener\">High-Performance Hardware for Machine Learning<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>In the previous post, we explored the difference between Training and Inference, seeing how inference-only NPUs lighten the hardware structure. One of the key keywords for this optimization was 'Reduction of Precision.'<\/p>","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1270","post","type-post","status-publish","format-standard","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1270","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1270"}],"version-history":[{"count":5,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1270\/revisi
ons"}],"predecessor-version":[{"id":1319,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1270\/revisions\/1319"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1270"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/categories?post=1270"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1270"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}