{"id":1343,"date":"2026-01-21T08:56:01","date_gmt":"2026-01-20T23:56:01","guid":{"rendered":"https:\/\/rtlearner.com\/?p=1343"},"modified":"2026-01-21T08:56:02","modified_gmt":"2026-01-20T23:56:02","slug":"ai-architecture-13-roofline-model-analysis","status":"publish","type":"post","link":"https:\/\/rtlearner.com\/en\/ai-architecture-13-roofline-model-analysis\/","title":{"rendered":"AI Architecture 13. Roofline Model Analysis"},"content":{"rendered":"<p>In our previous posts, we discussed the two main culprits degrading deep learning model performance: '<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-7-mlp-layer-memory-wall\/\">Memory-bound<\/a>' and '<a href=\"https:\/\/rtlearner.com\/en\/ai-architecture-8-cnn-locality-sram-data-reuse\/\">Compute-bound<\/a>' bottlenecks. In practice, however, when deploying a new model onto an NPU, it is hard to tell by intuition alone whether the model has a memory problem or a compute problem, because layers with very different characteristics are intertwined.<\/p>\n\n\n\n<p>At this point, the 'Roofline Model' becomes an essential analysis framework for engineers. Proposed by researchers at UC Berkeley in 2009, this model quantitatively visualizes the relationship between processor compute performance and memory bandwidth on a 2D graph. 
It defines the 'Theoretical Performance Roof' that the hardware can achieve and, by showing how far below that roof the model currently sits, serves as a reference for choosing the direction of optimization.<\/p>\n\n\n<style>.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-content-wrap{padding-top:var(--global-kb-spacing-sm, 1.5rem);padding-right:var(--global-kb-spacing-sm, 1.5rem);padding-bottom:var(--global-kb-spacing-sm, 1.5rem);padding-left:var(--global-kb-spacing-sm, 1.5rem);box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-contents-title-wrap{padding-top:0px;padding-right:0px;padding-bottom:0px;padding-left:0px;}.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-contents-title{font-weight:regular;font-style:normal;}.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-content-wrap .kb-table-of-content-list{font-weight:regular;font-style:normal;margin-top:var(--global-kb-spacing-sm, 1.5rem);margin-right:0px;margin-bottom:0px;margin-left:0px;}@media all and (max-width: 767px){.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-contents-title{font-size:var(--global-kb-font-size-md, 1.25rem);}.kb-table-of-content-nav.kb-table-of-content-id1343_87576b-82 .kb-table-of-content-wrap .kb-table-of-content-list{font-size:var(--global-kb-font-size-sm, 0.9rem);}}<\/style>\n\n<style>.kadence-column1343_4cdc71-97 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1343_4cdc71-97 > .kt-inside-inner-col,.kadence-column1343_4cdc71-97 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1343_4cdc71-97 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1343_4cdc71-97 > 
.kt-inside-inner-col{flex-direction:column;}.kadence-column1343_4cdc71-97 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1343_4cdc71-97 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1343_4cdc71-97{position:relative;}@media all and (max-width: 1024px){.kadence-column1343_4cdc71-97 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1343_4cdc71-97 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1343_4cdc71-97\"><div class=\"kt-inside-inner-col\">\n<p><strong>Related articles<\/strong><\/p>\n\n\n\n<p>\u2705<\/p>\n<\/div><\/div>\n\n\n\n<h2 class=\"wp-block-heading\">1. Structure of the Roofline Model: Two Axes Determining Performance<\/h2>\n\n\n\n<p>To interpret the Roofline graph, one must clearly understand the engineering definitions of the X and Y axes.<\/p>\n\n\n<style>.kb-image1343_67a088-3b.kb-image-is-ratio-size, .kb-image1343_67a088-3b .kb-image-is-ratio-size{max-width:650px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1343_67a088-3b.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1343_67a088-3b .kb-image-is-ratio-size{align-self:unset;}.kb-image1343_67a088-3b figure{max-width:650px;}.kb-image1343_67a088-3b .image-is-svg, .kb-image1343_67a088-3b .image-is-svg img{width:100%;}.kb-image1343_67a088-3b .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1343_67a088-3b.kb-image-is-ratio-size, .kb-image1343_67a088-3b .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1343_67a088-3b figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1343_67a088-3b\"><figure class=\"aligncenter size-full\"><a href=\"https:\/\/www.researchgate.net\/figure\/Roofline-model-based-on-49-The-x-axis-represents-the-operational-or-computational_fig2_362880728\" 
class=\"kb-advanced-image-link\" target=\"_blank\" rel=\"noopener noreferrer\"><img data-dominant-color=\"f7f0e8\" data-has-transparency=\"false\" style=\"--dominant-color: #f7f0e8;\" loading=\"lazy\" decoding=\"async\" width=\"850\" height=\"551\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-7.jpg\" alt=\"Roofline model\" class=\"kb-img wp-image-1347 not-transparent\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-7.jpg 850w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-7-300x194.jpg 300w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-7-768x498.jpg 768w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-7-18x12.jpg 18w\" sizes=\"auto, (max-width: 850px) 100vw, 850px\" \/><\/a><figcaption>Roofline model<\/figcaption><\/figure><\/div>\n\n\n\n<h3 class=\"wp-block-heading\">A. Y-Axis: Attainable Performance<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Unit:<\/strong> GFLOPS (Giga Floating-point Operations Per Second) or TOPS (Tera Operations Per Second).<\/li>\n\n\n\n<li><strong>Meaning:<\/strong> Represents the number of operations processable per second. A higher value indicates faster processing speed, which is proportional to the number of Processing Elements (PEs) and the Clock Frequency of the hardware.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">B. 
X-Axis: Arithmetic Intensity (Operational Intensity)<\/h3>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Unit:<\/strong> FLOPs\/Byte (or Ops\/Byte).<\/li>\n\n\n\n<li><strong>Meaning:<\/strong> The core metric of this model, answering the question, \"How many operations are performed per byte of data transferred to or from memory?\"<\/li>\n<\/ul>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>Arithmetic&nbsp;Intensity<\/mtext><mo>=<\/mo><mfrac><mtext>Total&nbsp;FLOPs&nbsp;(Workload)<\/mtext><mtext>Total&nbsp;Bytes&nbsp;Transferred&nbsp;(Memory&nbsp;Traffic)<\/mtext><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\text{Arithmetic Intensity} = \\frac{\\text{Total FLOPs (Workload)}}{\\text{Total Bytes Transferred (Memory Traffic)}}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>High arithmetic intensity means that once data is loaded from memory, it is reused repeatedly in on-chip memory (registers, cache) to perform many operations (High Data Reuse). Conversely, low intensity means low data reuse: each loaded value feeds only a few operations before it is discarded.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">2. Two Performance Limits: Slanted and Flat<\/h2>\n\n\n\n<p>The Roofline graph consists of two limit lines, 'Slanted' and 'Flat', set by the physical constraints of the hardware. At any given arithmetic intensity, attainable performance is the lower of the two.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A. 
Slanted Roof: Memory-bound Region<\/h3>\n\n\n\n<p>This is the left side of the graph, the <strong>low Arithmetic Intensity (X-axis) region<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>Attainable&nbsp;Performance<\/mtext><mo>=<\/mo><mtext>Arithmetic&nbsp;Intensity<\/mtext><mo>\u00d7<\/mo><mtext>Peak&nbsp;Memory&nbsp;Bandwidth<\/mtext><\/mrow><annotation encoding=\"application\/x-tex\">\\text{Attainable Performance} = \\text{Arithmetic Intensity} \\times \\text{Peak Memory Bandwidth}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Characteristics:<\/strong> In this zone, increasing the number of Processing Elements (PEs) does not improve overall performance (Y-axis). The slope of the line is determined entirely by Peak Memory Bandwidth.<\/li>\n\n\n\n<li><strong>Relevant Operations:<\/strong> Element-wise operations (Add, Mul), Activation functions (ReLU, Sigmoid), Batch Normalization, etc.<\/li>\n\n\n\n<li><strong>Cause of Bottleneck:<\/strong> Data transfer cannot keep up with computation, so PEs sit idle or stalled until data arrives.<\/li>\n\n\n\n<li><strong>Optimization Direction:<\/strong> Physically expand memory bandwidth (e.g., using HBM) or reduce the amount of transferred data through model compression\/quantization.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">B. 
Flat Roof: Compute-bound Region<\/h3>\n\n\n\n<p>This is the right side of the graph, the <strong>high Arithmetic Intensity region<\/strong>.<\/p>\n\n\n\n<div class=\"wp-block-math\"><math display=\"block\"><semantics><mrow><mtext>Attainable&nbsp;Performance<\/mtext><mo>=<\/mo><mtext>Peak&nbsp;Compute&nbsp;Performance<\/mtext><\/mrow><annotation encoding=\"application\/x-tex\">\\text{Attainable Performance} = \\text{Peak Compute Performance}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Characteristics:<\/strong> Once arithmetic intensity exceeds a certain threshold, memory bandwidth no longer acts as a limiting factor. The performance limit here is determined by the NPU's Peak Compute Performance, causing the graph to plateau.<\/li>\n\n\n\n<li><strong>Relevant Operations:<\/strong> Convolution layers with large kernels, Fully Connected (Dense) layers run at large batch sizes, etc.<\/li>\n\n\n\n<li><strong>Cause of Bottleneck:<\/strong> The utilization of computation units is already nearing 100%, meaning physical calculation capacity is saturated.<\/li>\n\n\n\n<li><strong>Optimization Direction:<\/strong> Increase clock frequency or add more PEs that can work in parallel.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">3. 
Ridge Point: The Inflection Point of Optimization<\/h2>\n\n\n\n<p>The point where the slanted and flat lines intersect is called the Ridge Point (or Knee Point).<\/p>\n\n\n<style>.kb-image1343_15e63c-09.kb-image-is-ratio-size, .kb-image1343_15e63c-09 .kb-image-is-ratio-size{max-width:650px;width:100%;}.wp-block-kadence-column > .kt-inside-inner-col > .kb-image1343_15e63c-09.kb-image-is-ratio-size, .wp-block-kadence-column > .kt-inside-inner-col > .kb-image1343_15e63c-09 .kb-image-is-ratio-size{align-self:unset;}.kb-image1343_15e63c-09 figure{max-width:650px;}.kb-image1343_15e63c-09 .image-is-svg, .kb-image1343_15e63c-09 .image-is-svg img{width:100%;}.kb-image1343_15e63c-09 .kb-image-has-overlay:after{opacity:0.3;}@media all and (max-width: 767px){.kb-image1343_15e63c-09.kb-image-is-ratio-size, .kb-image1343_15e63c-09 .kb-image-is-ratio-size{max-width:290px;width:100%;}.kb-image1343_15e63c-09 figure{max-width:290px;}}<\/style>\n<div class=\"wp-block-kadence-image kb-image1343_15e63c-09\"><figure class=\"aligncenter size-large is-resized\"><img data-dominant-color=\"f8f8f8\" data-has-transparency=\"false\" style=\"--dominant-color: #f8f8f8;\" decoding=\"async\" src=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8-1024x614.jpg\" alt=\"Ridge Point\" class=\"kb-img wp-image-1349 not-transparent\" width=\"728px\" height=\"auto\" srcset=\"https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8-1024x614.jpg 1024w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8-300x180.jpg 300w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8-768x461.jpg 768w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8-18x12.jpg 18w, https:\/\/rtlearner.com\/wp-content\/uploads\/2026\/01\/image-3-8.jpg 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Ridge Point<\/figcaption><\/figure><\/div>\n\n\n\n<div class=\"wp-block-math\"><math 
display=\"block\"><semantics><mrow><mtext>Ridge&nbsp;Point&nbsp;(X-value)<\/mtext><mo>=<\/mo><mfrac><mtext>Peak&nbsp;Compute&nbsp;Performance<\/mtext><mtext>Peak&nbsp;Memory&nbsp;Bandwidth<\/mtext><\/mfrac><\/mrow><annotation encoding=\"application\/x-tex\">\\text{Ridge Point (X-value)} = \\frac{\\text{Peak Compute Performance}}{\\text{Peak Memory Bandwidth}}<\/annotation><\/semantics><\/math><\/div>\n\n\n\n<p>The X-value at this point is a crucial indicator defining the characteristics of the hardware architecture.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Ridge Point located to the Right:<\/strong> Memory bandwidth is insufficient relative to the hardware's compute performance. Therefore, extremely high data reuse (high arithmetic intensity) is required to extract maximum hardware performance.<\/li>\n\n\n\n<li><strong>Ridge Point located to the Left:<\/strong> Maximum performance can be reached even with relatively low arithmetic intensity. This implies the memory subsystem is robustly designed.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">4. Practical NPU Optimization Strategy: Move the Dot<\/h2>\n\n\n\n<p>When a specific layer of a deep learning model is analyzed and plotted as a coordinate (dot) on the Roofline graph, if that dot is located below the roof (Limit Line), it signals the need for optimization. Optimization is the engineering process of moving this dot 'Up' or 'Right'.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">A. 
Ceiling Analysis (Moving Up: Improving Utilization)<\/h3>\n\n\n\n<p>If the dot sits well below the roof line, the layer is not fully utilizing the theoretical performance of the hardware.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Causes:<\/strong> Inefficient instruction scheduling, pipeline stalls, latency due to cache misses, software overhead, etc.<\/li>\n\n\n\n<li><strong>Solution:<\/strong> Improve hardware utilization by increasing cache hit rates through Loop Tiling or optimizing the instruction pipeline at the compiler level.<\/li>\n<\/ul>\n\n\n\n<h3 class=\"wp-block-heading\">B. Increasing Arithmetic Intensity (Moving Right: Enhancing Data Reuse)<\/h3>\n\n\n\n<p>Moving a dot from the Memory-bound region (slanted) to the right (towards Compute-bound) allows for higher performance under the same memory bandwidth constraints.<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>Strategy:<\/strong> Layer Fusion is typically the most effective technique. For example, instead of writing the output of a Conv layer to DRAM and reading it back, the output is consumed immediately by the following ReLU or Pooling operation while it is still in registers or L1 cache. This shrinks the denominator (Memory Traffic), which raises Arithmetic Intensity.<\/li>\n<\/ul>\n\n\n\n<h2 class=\"wp-block-heading\">5. Conclusion and Implications: Direction of Architecture Design<\/h2>\n\n\n\n<p>Early CNN models had high computation density, exhibiting strong Compute-bound characteristics. 
However, recent Transformer-based LLMs (Large Language Models) have seen a surge in parameter count and are typically Memory-bound, most visibly during token-by-token generation, where weights are read from memory but reused for only a few operations.<\/p>\n\n\n\n<p>Consequently, modern NPU architectures are evolving not just to increase the number of compute cores, but to <strong>raise the slanted part of the Roofline (secure bandwidth)<\/strong> by integrating HBM (High Bandwidth Memory), and to keep more data on-chip by maximizing SRAM capacity.<\/p>\n\n\n\n<p>In conclusion, NPU performance optimization begins not with a vague judgment that \"my model is slow,\" but with a quantitative understanding of \"Which zone on the Roofline graph does each layer of my model occupy?\" This is a core analytical capability required of system designers and AI engineers.<\/p>\n\n\n<style>.kadence-column1343_292809-e5 > .kt-inside-inner-col{box-shadow:0px 0px 14px 0px rgba(0, 0, 0, 0.2);}.kadence-column1343_292809-e5 > .kt-inside-inner-col,.kadence-column1343_292809-e5 > .kt-inside-inner-col:before{border-top-left-radius:0px;border-top-right-radius:0px;border-bottom-right-radius:0px;border-bottom-left-radius:0px;}.kadence-column1343_292809-e5 > .kt-inside-inner-col{column-gap:var(--global-kb-gap-sm, 1rem);}.kadence-column1343_292809-e5 > .kt-inside-inner-col{flex-direction:column;}.kadence-column1343_292809-e5 > .kt-inside-inner-col > .aligncenter{width:100%;}.kadence-column1343_292809-e5 > .kt-inside-inner-col:before{opacity:0.3;}.kadence-column1343_292809-e5{position:relative;}@media all and (max-width: 1024px){.kadence-column1343_292809-e5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}@media all and (max-width: 767px){.kadence-column1343_292809-e5 > .kt-inside-inner-col{flex-direction:column;justify-content:center;}}<\/style>\n<div class=\"wp-block-kadence-column kadence-column1343_292809-e5\"><div class=\"kt-inside-inner-col\">\n<p><strong>Related articles<\/strong><\/p>\n\n\n\n<p>\u2705<\/p>\n<\/div><\/div>\n\n\n\n<p>References: <a 
href=\"https:\/\/en.wikipedia.org\/wiki\/Roofline_model\" target=\"_blank\" rel=\"noopener\">Wikipedia<\/a><\/p>","protected":false},"excerpt":{"rendered":"<p>In our previous posts, we discussed the two main culprits degrading deep learning model performance:<\/p>","protected":false},"author":1,"featured_media":1349,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_kadence_starter_templates_imported_post":false,"_kad_post_transparent":"","_kad_post_title":"","_kad_post_layout":"","_kad_post_sidebar_id":"","_kad_post_content_style":"","_kad_post_vertical_padding":"","_kad_post_feature":"","_kad_post_feature_position":"","_kad_post_header":false,"_kad_post_footer":false,"_kad_post_classname":"","footnotes":""},"categories":[116],"tags":[117,118],"class_list":["post-1343","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-and-hw-fundamentals","tag-ai","tag-architecture"],"_links":{"self":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1343","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/comments?post=1343"}],"version-history":[{"count":5,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1343\/revisions"}],"predecessor-version":[{"id":1378,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/posts\/1343\/revisions\/1378"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media\/1349"}],"wp:attachment":[{"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/media?parent=1343"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/ca
tegories?post=1343"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/rtlearner.com\/en\/wp-json\/wp\/v2\/tags?post=1343"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}