AI Architecture 13. Roofline Model Analysis

In our previous posts, we discussed the two main culprits degrading deep learning model performance: 'Memory-bound' and 'Compute-bound' bottlenecks. In practice, however, when deploying a new model onto an NPU, it is difficult to judge by intuition that "this model has a memory problem," because many layers with different characteristics are intertwined.

At this point, the 'Roofline Model' becomes an essential analysis framework for engineers. Proposed by a UC Berkeley research team in 2009, the model plots attainable compute performance against arithmetic intensity on a 2D graph, bounded by the processor's peak compute performance and peak memory bandwidth. It defines the 'Theoretical Performance Roof' that the hardware can reach and shows how far the model currently sits below that roof, which makes it a practical reference for choosing an optimization direction.

1. Structure of the Roofline Model: Two Axes Determining Performance

To interpret the Roofline graph, one must clearly understand the engineering definitions of the X and Y axes.

Figure: Roofline model

A. Y-Axis: Attainable Performance

  • Unit: GFLOPS (Giga Floating-point Operations Per Second) or TOPS (Tera Operations Per Second).
  • Meaning: The number of operations the hardware actually completes per second for a given workload. A higher value indicates faster processing. The upper limit of this axis, the peak compute performance, is proportional to the number of Processing Elements (PEs) and the clock frequency of the hardware.

B. X-Axis: Arithmetic Intensity (Operational Intensity)

  • Unit: FLOPs/Byte (or Ops/Byte).
  • Meaning: The core metric of this model, indicating "How many operations are performed per byte of data moved to or from memory?"
$$\text{Arithmetic Intensity} = \frac{\text{Total FLOPs (Workload)}}{\text{Total Bytes Transferred (Memory Traffic)}}$$

High arithmetic intensity means that once data is loaded from memory, it is reused repeatedly in on-chip memory (registers, cache) to perform many operations (high data reuse). Conversely, low intensity means low data reuse: each loaded value is used for only a few operations and then discarded.
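To make the definition concrete, the following is a minimal sketch that estimates arithmetic intensity for two operations. The function names, the FP32 element size, and the assumption that every tensor crosses the memory bus exactly once are illustrative choices, not measurements from a real NPU.

```python
def elementwise_add_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """Intensity of c = a + b over n elements: one FLOP per element,
    two reads and one write per element."""
    flops = n
    bytes_moved = 3 * n * bytes_per_elem
    return flops / bytes_moved


def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 4) -> float:
    """Intensity of C[m, n] = A[m, k] @ B[k, n], assuming A and B are each
    read once and C is written once (ideal on-chip reuse)."""
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    bytes_moved = (m * k + k * n + m * n) * bytes_per_elem
    return flops / bytes_moved


add_ai = elementwise_add_intensity(1_000_000)
mm_ai = matmul_intensity(1024, 1024, 1024)
print(f"element-wise add: {add_ai:.3f} FLOPs/Byte")
print(f"1024^3 matmul:    {mm_ai:.1f} FLOPs/Byte")
```

Under these assumptions the element-wise add stays at 1/12 FLOPs/Byte regardless of tensor size, while the matrix multiplication's intensity grows with the matrix dimensions because each loaded element is reused many times.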

2. Two Performance Limits: Slanted and Flat

The shape of the Roofline graph is divided into two limit lines, 'Slanted' and 'Flat', depending on the physical constraints of the hardware.

A. Slanted Roof: Memory-bound Region

This is the left side of the graph, where Arithmetic Intensity (X-axis) is low.

$$\text{Attainable Performance} = \text{Arithmetic Intensity} \times \text{Peak Memory Bandwidth}$$
  • Characteristics: In this zone, increasing the number of Processing Elements (PEs) does not improve overall performance (Y-axis). Performance rises linearly with arithmetic intensity, and the slope of that line is the peak memory bandwidth.
  • Relevant Operations: Element-wise operations (Add, Mul), Activation functions (ReLU, Sigmoid), Batch Normalization, etc.
  • Cause of Bottleneck: Data transfer speed cannot keep up with computation speed, causing PEs to wait in an idle/stalled state until data arrives.
  • Optimization Direction: Physically expand memory bandwidth (e.g., using HBM) or reduce the amount of transferred data through model compression/quantization.

B. Flat Roof: Compute-bound Region

This is the right side of the graph, where Arithmetic Intensity is high.

$$\text{Attainable Performance} = \text{Peak Compute Performance}$$
  • Characteristics: Once arithmetic intensity exceeds a certain threshold, memory bandwidth no longer acts as a limiting factor. The performance limit here is determined by the NPU's Peak Compute Performance, causing the graph to plateau.
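  • Relevant Operations: Convolution layers with large kernels, Fully Connected (Dense) layers processed with large batch sizes, etc. (A Dense layer at batch size 1 is a matrix-vector product with little weight reuse, so it tends to fall in the memory-bound region instead.)
  • Cause of Bottleneck: For a layer sitting at the roof, the utilization of computation units is already nearing 100%, meaning physical calculation capacity is saturated.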
  • Optimization Direction: Increase clock frequency or increase the number of parallelizable PEs.
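The two limits above combine into a single bound, the minimum of the slanted and flat roofs. The sketch below evaluates it for a hypothetical accelerator; the 100 TFLOPS and 1 TB/s figures are assumptions picked for round numbers, not the specification of any real chip.

```python
# Hypothetical accelerator, chosen only for illustration (not a real product).
PEAK_FLOPS = 100e12      # 100 TFLOPS peak compute
PEAK_BANDWIDTH = 1e12    # 1 TB/s peak memory bandwidth


def attainable_performance(intensity: float,
                           peak_flops: float = PEAK_FLOPS,
                           peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Roofline bound in FLOP/s for a kernel with the given FLOPs/Byte."""
    memory_roof = intensity * peak_bandwidth  # slanted part of the roof
    compute_roof = peak_flops                 # flat part of the roof
    return min(memory_roof, compute_roof)


for ai in (0.25, 10.0, 100.0, 400.0):
    bound = attainable_performance(ai)
    region = "memory-bound" if ai * PEAK_BANDWIDTH < PEAK_FLOPS else "compute-bound"
    print(f"AI = {ai:7.2f} FLOPs/Byte -> {bound / 1e12:6.2f} TFLOPS ({region})")
```

On this assumed hardware, a kernel at 0.25 FLOPs/Byte can never exceed 0.25 TFLOPS no matter how many PEs are added, while anything at or above 100 FLOPs/Byte is capped by the flat roof.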

3. Ridge Point: The Inflection Point of Optimization

The point where the slanted and flat lines intersect is called the Ridge Point (or Knee Point).

Figure: Ridge Point
$$\text{Ridge Point (X-value)} = \frac{\text{Peak Compute Performance}}{\text{Peak Memory Bandwidth}}$$

The X-value at this point is a crucial indicator defining the characteristics of the hardware architecture.

  • Ridge Point located to the Right: Memory bandwidth is insufficient relative to the hardware's compute performance. Therefore, extremely high data reuse (high arithmetic intensity) is required to extract maximum hardware performance.
  • Ridge Point located to the Left: Maximum performance can be reached even with relatively low arithmetic intensity. This implies the memory subsystem is robustly designed.
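As a rough illustration of how the ridge point characterizes hardware, the sketch below compares two hypothetical chips with the same peak compute but different memory bandwidth. The chip names and all numbers are invented for this example.

```python
def ridge_point(peak_flops: float, peak_bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/Byte) where the slanted and flat roofs meet."""
    return peak_flops / peak_bandwidth


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Label a kernel by which side of the ridge point it falls on."""
    if intensity < ridge_point(peak_flops, peak_bandwidth):
        return "memory-bound"
    return "compute-bound"


# Two made-up chips: same compute, different memory systems.
chips = {
    "chip_a": (100e12, 0.5e12),  # 100 TFLOPS, 0.5 TB/s
    "chip_b": (100e12, 2.0e12),  # 100 TFLOPS, 2.0 TB/s
}

layer_intensity = 80.0  # FLOPs/Byte of a hypothetical layer

for name, (flops, bandwidth) in chips.items():
    ridge = ridge_point(flops, bandwidth)
    label = classify(layer_intensity, flops, bandwidth)
    print(f"{name}: ridge = {ridge:.0f} FLOPs/Byte, layer is {label}")
```

The same layer is memory-bound on the chip whose ridge point lies further to the right and compute-bound on the other, which is why the ridge point is read as a property of the hardware rather than of the model.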

4. Practical NPU Optimization Strategy: Move the Dot

When a specific layer of a deep learning model is analyzed and plotted as a coordinate (dot) on the Roofline graph, the dot always lies on or below the roof (Limit Line). The gap between the dot and the roof, together with the part of the roof directly above it, shows where optimization is needed. Optimization is the engineering process of moving this dot 'Up' or 'Right'.

A. Ceiling Analysis (Moving Up: Improving Utilization)

If the coordinate is significantly below the roof line, it suggests a failure to fully utilize the theoretical performance of the hardware.

  • Causes: Inefficient instruction scheduling, pipeline stalls, latency due to cache misses, software overhead, etc.
  • Solution: Improve hardware utilization by increasing cache hit rates through Loop Tiling or optimizing the instruction pipeline at the compiler level.
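One simple way to quantify "how far below the roof" a layer sits is the ratio of measured performance to the Roofline bound at the same arithmetic intensity. The sketch below assumes a hypothetical accelerator and made-up profiling numbers; in practice the measured values would come from a profiler.

```python
# Hypothetical accelerator limits (assumptions for illustration only).
PEAK_FLOPS = 100e12      # 100 TFLOPS
PEAK_BANDWIDTH = 1e12    # 1 TB/s

# (layer name, arithmetic intensity in FLOPs/Byte, measured FLOP/s).
# These profiling numbers are invented for the example.
measurements = [
    ("conv3x3", 150.0, 60e12),
    ("layernorm", 0.5, 0.2e12),
]


def roof_efficiency(measured_flops: float, intensity: float,
                    peak_flops: float = PEAK_FLOPS,
                    peak_bandwidth: float = PEAK_BANDWIDTH) -> float:
    """Fraction of the Roofline bound that the layer actually achieves."""
    roof = min(peak_flops, intensity * peak_bandwidth)
    return measured_flops / roof


efficiencies = {name: roof_efficiency(perf, ai) for name, ai, perf in measurements}
for name, eff in efficiencies.items():
    print(f"{name}: {eff:.0%} of its roof")
```

A low ratio points to the utilization problems listed above (stalls, cache misses, scheduling), whereas a ratio near 1 means the layer is already at its roof and can only improve by moving right or by changing the hardware.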

B. Increasing Arithmetic Intensity (Moving Right: Enhancing Data Reuse)

Moving a coordinate from the Memory-bound region (slanted) to the right (towards Compute-bound) allows for higher performance under the same memory bandwidth constraints.

  • Strategy: Layer Fusion is one of the most effective techniques. For example, instead of writing the output of a Conv layer to DRAM and reading it back, the output is used immediately as input for the next ReLU or Pooling operation while it is still in registers or L1 cache. This reduces the denominator (Memory Traffic) and thereby increases Arithmetic Intensity.
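The effect of fusion on the denominator can be estimated with simple byte counting. The sketch below compares a Conv followed by ReLU with and without fusion. The layer shape (a 1x1 convolution on a 56x56 feature map with 64 input and 64 output channels), the FP16 element size, and the counting of ReLU as one operation per element are all assumptions for illustration.

```python
def conv_relu_intensity(in_elems: int, weight_elems: int, out_elems: int,
                        conv_flops: int, bytes_per_elem: int = 2,
                        fused: bool = False) -> float:
    """Arithmetic intensity of Conv + ReLU, counting off-chip traffic only."""
    flops = conv_flops + out_elems  # ReLU counted as one op per output element
    # Conv: read input and weights, write one output tensor.
    traffic_elems = in_elems + weight_elems + out_elems
    if not fused:
        # Unfused ReLU: read the conv output back, then write the result.
        traffic_elems += 2 * out_elems
    return flops / (traffic_elems * bytes_per_elem)


# Hypothetical layer: 1x1 conv, 56x56 feature map, 64 -> 64 channels.
h, w, c_in, c_out = 56, 56, 64, 64
in_elems = h * w * c_in
out_elems = h * w * c_out
weight_elems = c_in * c_out
conv_flops = 2 * h * w * c_in * c_out  # multiply + add per weight per pixel

unfused_ai = conv_relu_intensity(in_elems, weight_elems, out_elems, conv_flops)
fused_ai = conv_relu_intensity(in_elems, weight_elems, out_elems, conv_flops, fused=True)
print(f"unfused: {unfused_ai:.2f} FLOPs/Byte")
print(f"fused:   {fused_ai:.2f} FLOPs/Byte")
```

The operation count is identical in both cases; only the traffic changes, so the dot moves right on the graph without any change to the arithmetic itself.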

5. Conclusion and Implications: Direction of Architecture Design

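Early CNN models had high computation density and were strongly Compute-bound. Recent Transformer-based LLMs (Large Language Models), however, have seen a surge in parameter count. During token-by-token generation at small batch sizes, every weight must be read from memory for each generated token while being used for only a few operations, so these models typically show Memory-bound characteristics.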
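One way to see why a large parameter count pushes LLM inference toward the memory-bound region is to estimate the arithmetic intensity of a single weight matrix applied to a batch of token vectors. The sketch below is a simplification: the 4096x4096 layer size, FP16 weights, and the assumption that weights and activations each cross the memory bus once are illustrative, and attention and KV-cache traffic are ignored.

```python
def linear_layer_intensity(batch: int, d_in: int, d_out: int,
                           bytes_per_elem: int = 2) -> float:
    """Intensity of Y[batch, d_out] = X[batch, d_in] @ W[d_in, d_out]."""
    flops = 2 * batch * d_in * d_out
    traffic_elems = d_in * d_out + batch * d_in + batch * d_out  # W, X, Y
    return flops / (traffic_elems * bytes_per_elem)


d_model = 4096  # hypothetical hidden size
intensity_by_batch = {b: linear_layer_intensity(b, d_model, d_model)
                      for b in (1, 8, 64, 256)}
for batch, ai in intensity_by_batch.items():
    print(f"batch = {batch:3d}: {ai:7.2f} FLOPs/Byte")
```

With a single token per step the layer performs roughly one FLOP per byte of weights read, far to the left of the ridge point of most accelerators; larger batches reuse each weight across more tokens and move the layer to the right.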

Consequently, modern NPU architectures are evolving not just to increase the number of compute cores, but also to raise the slanted part of the Roofline by securing more bandwidth with HBM (High Bandwidth Memory), and to reduce off-chip traffic by maximizing on-chip SRAM capacity.

In conclusion, NPU performance optimization begins not with a vague judgment that "my model is slow," but with a quantitative understanding of "Which zone on the Roofline graph does each layer of my model occupy?" This is a core analytical capability required of system designers and AI engineers.


