{"id":1278,"date":"2026-01-12T00:33:57","date_gmt":"2026-01-11T15:33:57","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1278"},"modified":"2026-01-08T11:54:49","modified_gmt":"2026-01-08T02:54:49","slug":"ai-architecture-6-int8-quantization-basics","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-6-int8-quantization-basics\/","title":{"rendered":"AI Architecture 6. INT8 Quantization Basics"},"content":{"rendered":"<p>In the previous post, we examined how differences in Number Formats affect hardware area and power consumption. We established that FP32 is excessively expensive and heavy from a hardware standpoint. Today, we will discuss the technology that puts this heavy data on a diet: Quantization.<\/p>\n\n\n\n<p class=\"translation-block\">Many people think of quantization simply as a \"compression technique to reduce model file size.\" While reduced file size is true, the real reason System Architects are obsessed with quantization lies elsewhere. It is because of <strong>'Bandwidth'<\/strong> and <strong>'Data Movement.'<\/strong><\/p>\n\n\n\n<p>In this article, we will coldly analyze the physical benefits that occur inside the chip when FP32 (32-bit Floating Point) converts to INT8 (8-bit Integer), and what we lose in return (the Trade-off).<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1278_54c870-c7 .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1278_914e97-d0 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1278_914e97-d0 > .kt-inside-inner-col,.kadence-column1278_914e97-d0 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1278_914e97-d0 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1278_914e97-d0 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1278_914e97-d0 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1278_914e97-d0 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1278_914e97-d0{position:relative;}@media all and (max-width: 1024px){.kadence-column1278_914e97-d0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1278_914e97-d0 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div 
class=\"wp-block-kadence-column kadence-column1278_914e97-d0\"><div class=\"kt-inside-inner-col\">\n<p><strong>Related articles<\/strong><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-1-neuron-hardware-mac-analysis\/\" data-type=\"post\" data-id=\"1248\">AI Architecture 1. Anatomy of an Artificial Neuron: Y=WX+B on Silicon<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-2-activation-relu-vs-sigmoid\/\" data-type=\"post\" data-id=\"1255\">AI Architecture 2. The Cost of Activation: Free ReLU vs. Expensive Sigmoid<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-3-matmul-simd-parallel-processing\/\" data-type=\"post\" data-id=\"1263\">AI Architecture 3. The Aesthetics of MatMul: Why Deep Learning Chooses GPUs\/NPUs<\/a><\/p>\n\n\n\n<p>\u2705<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-4-training-vs-inference\/\" data-type=\"post\" data-id=\"1267\">AI Architecture 4. Training vs. Inference<\/a><\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. The Magic of Bandwidth: Effectively Quadrupling the Highway<\/h2>\n\n\n\n<p>The biggest bottleneck limiting system performance is often Memory (the so-called Memory Wall). Let's assume you are using the latest LPDDR5 memory to send 50GB of data per second to the NPU.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Using FP32:<\/strong> One parameter is 4 Bytes (32 bits). Transferable parameters per second = 50{GB} \/ 4{B} = 12.5{Billion}.<\/li>\n\n\n\n<li><strong>Using INT8:<\/strong> One parameter is 1 Byte (8 bits). Transferable parameters per second = 50 {GB} \/ 1{B} = 50{Billion}.<\/li>\n<\/ul>\n\n\n\n<p>Even without increasing the physical memory speed, reducing the data size to 1\/4 results in a 4x increase in effective Data Throughput.<\/p>\n\n\n\n<p>This isn't just about speed. The efficiency of the chip's internal <strong>SRAM<\/strong> also improves by 4x. A 4MB cache can hold only 1 million FP32 parameters, but it can hold 4 million INT8 parameters. This drastically reduces the frequency of expensive DRAM accesses, maximizing power efficiency.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. The Principle of Quantization<\/h2>\n\n\n\n<p>So, how do real numbers become integers? 
<h2>2. The Principle of Quantization</h2>

<p>So how do real numbers become integers? The most widely used method is affine quantization (also called asymmetric quantization).</p>

$$ x_{int} = \text{round}\left( \frac{x_{float}}{S} \right) + Z $$

$$ x_{dequant} = S \times (x_{int} - Z) $$

<ul>
<li><strong>x<sub>float</sub>:</strong> the original floating-point value (input).</li>
<li><strong>S (scale factor):</strong> the step size, i.e., how much of the real-number range each integer step covers.</li>
<li><strong>Z (zero point):</strong> the offset that determines which integer the real value 0 maps to.</li>
<li><strong>x<sub>int</sub>:</strong> the resulting integer value (−128 to 127, or 0 to 255, for INT8).</li>
</ul>

<p>Simply put, it is like replacing the fine markings of a high-resolution analog ruler with sparse digital markings. In this process, <strong>error</strong> inevitably occurs.</p>
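<p>Here is a minimal NumPy sketch of the two formulas above, assuming simple per-tensor affine quantization. The function names and the sample tensor are my own for illustration, not taken from any specific framework.</p>

<pre><code class="language-python">import numpy as np

def affine_quant_params(x_min, x_max, qmin=-128, qmax=127):
    """Derive scale S and zero point Z that map [x_min, x_max] onto [qmin, qmax]."""
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    # x_int = round(x_float / S) + Z, clamped to the INT8 range
    return np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)

def dequantize(x_int, scale, zero_point):
    # x_dequant = S * (x_int - Z)
    return scale * (x_int.astype(np.float32) - zero_point)

x = np.array([-0.9, -0.05, 0.0, 0.1, 0.11, 0.8], dtype=np.float32)
S, Z = affine_quant_params(x.min(), x.max())
xq = quantize(x, S, Z)
xd = dequantize(xq, S, Z)
print(xq)        # the INT8 codes actually stored and moved around the chip
print(xd - x)    # quantization noise: the recovered values are only approximate
</code></pre>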
<figure><img src="https://rtlearner.com/wp-content/uploads/2026/01/image-1-3.jpg" alt="Quantization" width="600" height="378" /><figcaption>Quantization</figcaption></figure>

<h2>3. The Trade-off: Quantization Noise</h2>

<p>The error introduced when converting FP32 to INT8 is called quantization noise.</p>

<p>A 32-bit float can represent values as small as roughly 10<sup>-45</sup> and distinguish correspondingly minute differences. INT8, by contrast, offers only 256 distinct codes (2<sup>8</sup>). For example, 0.1 and 0.11 might both map to the same integer 10 after quantization.</p>

<p>This loss of information shows up as a <strong>drop in model accuracy</strong>, and it comes from two sources (both demonstrated in the short sketch after the list):</p>

<ul>
<li><strong>Rounding error:</strong> error from snapping each value to the nearest integer step.</li>
<li><strong>Clipping error:</strong> error from forcing large values (outliers) outside the representable range to the maximum/minimum.</li>
</ul>
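<p>Both error types are easy to see with a few lines of NumPy. The scale of 0.05 below is a deliberately coarse toy value chosen to make the effects obvious, not a realistic calibration result.</p>

<pre><code class="language-python">import numpy as np

scale, zero_point = 0.05, 0       # deliberately coarse toy parameters

# Rounding error: two nearby values collapse onto the same INT8 code.
vals = np.array([0.10, 0.11], dtype=np.float32)
codes = np.clip(np.round(vals / scale) + zero_point, -128, 127).astype(np.int8)
print(codes)                      # [2 2] -> 0.10 and 0.11 become indistinguishable

# Clipping error: an outlier beyond the representable range (about +/-6.4) is clamped.
outlier = np.float32(10.0)
code = np.clip(np.round(outlier / scale) + zero_point, -128, 127)
print(float(scale * code))        # ~6.35, far from the original 10.0
</code></pre>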
<blockquote>
<p>Architect's Insight:</p>

<p>Hardware engineers constantly weigh this trade-off: "Do we sacrifice 1% accuracy to gain 4x speed?"</p>

<p>Fortunately, deep learning models are quite robust to noise. Just as a slightly noisy photo of a dog is still recognized as a dog, a model often produces the same final answer even with some quantization noise mixed into its weights. This robustness is what justifies boldly discarding FP32.</p>
</blockquote>

<h2>4. Dynamic Range and Calibration</h2>

<p>The core engineering decision in quantization is choosing which range of real values to split into 256 steps. This process is called <strong>calibration</strong>, and the two most common strategies are compared in the short sketch at the end of this section.</p>

<ul>
<li><strong>Min-Max:</strong> set the range from the minimum and maximum of the observed data. It is easy to implement, but a single outlier can stretch the range and ruin the precision of everything else.</li>
<li><strong>Entropy/Histogram:</strong> set the range around where the data is actually concentrated and boldly clip the rest. This is the smarter way to minimize overall information loss.</li>
</ul>

<p>Hardware accelerators typically manage the scale factor (S) and zero point (Z) per layer (layer-wise) or per channel (channel-wise) to preserve precision.</p>
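<p>To see why calibration matters, here is a small sketch comparing min-max calibration against simple percentile clipping. The toy tensor, the clipping percentiles, and the quant_error helper are invented for this illustration; real histogram/entropy calibrators search for the clipping threshold that minimizes information loss (e.g., via KL divergence) rather than using a fixed percentile.</p>

<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
# Toy activation tensor: mostly small values plus a couple of large outliers.
acts = np.concatenate([rng.normal(0.0, 0.1, 10_000),
                       np.array([8.0, -7.5])]).astype(np.float32)

def quant_error(x, lo, hi, levels=256):
    """Mean absolute error after uniform quantization of x into the range [lo, hi]."""
    scale = (hi - lo) / (levels - 1)
    q = np.clip(np.round((x - lo) / scale), 0, levels - 1)
    return np.abs((q * scale + lo) - x).mean()

# Min-Max calibration: the range is dictated entirely by the two outliers.
err_minmax = quant_error(acts, acts.min(), acts.max())

# Percentile clipping: keep the central 99.9% of values and clip the rest.
lo, hi = np.percentile(acts, [0.05, 99.95])
err_clipped = quant_error(acts, lo, hi)

print(f"min-max error: {err_minmax:.5f}")
print(f"clipped error: {err_clipped:.5f}")   # much smaller for the bulk of the data
</code></pre>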
<h2>5. Conclusion: Gaining More Than Losing</h2>

<p>The path from FP32 to INT8 definitely involves a loss of precision. However, the hardware benefits gained in return (4x effective bandwidth, higher power efficiency, smaller compute area) are truly massive.</p>

<p>Modern AI semiconductors are now pushing beyond INT8 toward <strong>INT4 and even 1-bit (binary) quantization</strong>. Expressing intelligence with the minimum number of bits: that is the ultimate goal of NPU architecture.</p>

<p>References: <em><a href="https://leimao.github.io/article/Neural-Networks-Quantization/">Quantization for Neural Networks</a></em></p>