AI Architecture 11. Depthwise Separable Conv: The MobileNet Paradox

In the previous post, 3 Mappings of Conv Operations, we looked at the Im2Col method: a strategy that trades extra memory for computational speed (GEMM) when processing standard convolutions in hardware.

In 2017, Google introduced MobileNet, a revolutionary model for mobile environments. It reduced the number of operations (FLOPs) and parameters to about 1/10th of existing models while maintaining decent accuracy. The secret lay in a unique structure called Depthwise Separable Convolution.

However, this structure presented hardware engineers with an intriguing phenomenon known as "The MobileNet Paradox."

"The FLOPs were reduced by 90%, so why didn't the actual execution speed (Latency) increase by the same amount?"

In this article, we will uncover, from the perspective of a System Architect, the physical reasons why this seemingly perfect software diet produces a side effect: plummeting Utilization inside the hardware (NPUs/GPUs).

1. Depthwise Separable Convolution

First, let's briefly review how MobileNet reduced operations. The key is splitting the standard Conv operation into two steps (a code sketch follows the list):

  1. Depthwise Conv: Applies a separate filter to each input channel. (No information exchange between channels)
  2. Pointwise Conv: Mixes information between channels using a 1 * 1 filter.

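To make the two steps concrete, here is a minimal NumPy sketch of a depthwise separable convolution (stride 1, no padding, a single image in (H, W, C) layout). The shapes and function names are illustrative assumptions, not MobileNet's actual implementation:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Reference (unoptimized) depthwise separable conv.
    x          : input feature map, shape (H, W, C)
    dw_kernels : one K*K filter per input channel, shape (K, K, C)
    pw_kernels : 1*1 mixing weights, shape (C, N)
    Stride 1, no padding; names and shapes are illustrative only.
    """
    H, W, C = x.shape
    K = dw_kernels.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Step 1: Depthwise Conv -- each channel is convolved with its own filter;
    # no information exchange between channels.
    dw_out = np.zeros((Ho, Wo, C))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[i, j, c] = np.sum(x[i:i+K, j:j+K, c] * dw_kernels[:, :, c])

    # Step 2: Pointwise Conv -- a 1*1 filter mixes information across channels.
    # Per pixel this is just a (C,) @ (C, N) matrix product.
    pw_out = dw_out @ pw_kernels          # shape (Ho, Wo, N)
    return pw_out

out = depthwise_separable_conv(
    np.random.rand(8, 8, 16),             # H=8, W=8, C=16
    np.random.rand(3, 3, 16),             # K=3 depthwise filters
    np.random.rand(16, 32),               # N=32 output channels
)
print(out.shape)                          # (6, 6, 32)
```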
Given input data H * W * C and kernel size K * K, the operation count decreases as follows:

  • Standard Conv: $H \cdot W \cdot C \cdot K^2 \cdot N$ (N = number of output channels)
  • Depthwise Separable: $(H \cdot W \cdot C \cdot K^2) + (H \cdot W \cdot C \cdot N)$

Taking the ratio of the two gives $\frac{1}{N} + \frac{1}{K^2}$; with a 3 * 3 kernel and a reasonably large N, this works out to about an 8~9x reduction in operations. Looking at the numbers alone, the hardware does roughly 9 times less work, so it should be roughly 9 times faster. But reality tells a different story.
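As a quick sanity check of the formulas, the script below plugs in an illustrative layer shape (the concrete numbers are assumptions, not taken from a specific MobileNet layer):

```python
H, W, C, N, K = 56, 56, 64, 128, 3       # illustrative layer shape

standard  = H * W * C * K * K * N        # standard conv MACs
depthwise = H * W * C * K * K            # depthwise step
pointwise = H * W * C * N                # pointwise (1*1) step

ratio = standard / (depthwise + pointwise)
print(f"standard : {standard:>12,}")     # 231,211,008
print(f"separable: {depthwise + pointwise:>12,}")   # 27,496,448
print(f"reduction: {ratio:.1f}x")        # ~8.4x, i.e. 1 / (1/N + 1/K**2)
```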

2. Cause of Paradox 1: Low Arithmetic Intensity

The core metric determining hardware performance is Arithmetic Intensity. That is, "How many operations are performed for every byte of data fetched from memory?"

  • Standard Conv: One input channel interacts with all N output channel filters. Data Reuse is very high.
  • Depthwise Conv: Each input channel interacts only with its own single K * K filter. There is no Cross-channel Reuse at all.

Depthwise Conv painstakingly fetches data from memory, multiplies it a few times, and finishes. This is similar to the [MLP and Memory Wall] problem discussed in Post #7. Before the arithmetic units can get busy, memory bandwidth becomes the bottleneck (Memory-Bound), causing performance degradation.
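To put rough numbers on this, the sketch below estimates arithmetic intensity for both layer types, under the idealized assumption that each input, weight, and output element is moved exactly once (ignoring caches and tiling) and that data is fp16 (2 bytes per element):

```python
def arithmetic_intensity(macs, bytes_moved):
    """FLOPs per byte, counting each MAC as 2 FLOPs (multiply + add)."""
    return 2 * macs / bytes_moved

H, W, C, N, K, BYTES = 56, 56, 64, 128, 3, 2   # fp16, illustrative shape

# Standard conv: input + K*K*C*N weights + output, each touched once (ideal reuse).
std_macs  = H * W * C * K * K * N
std_bytes = BYTES * (H * W * C + K * K * C * N + H * W * N)

# Depthwise conv: input + K*K*C weights + output; each input byte feeds only
# K*K multiplies of a single channel, so there is far less work per byte.
dw_macs  = H * W * C * K * K
dw_bytes = BYTES * (H * W * C + K * K * C + H * W * C)

print(f"standard : {arithmetic_intensity(std_macs, std_bytes):7.1f} FLOPs/byte")   # ~342
print(f"depthwise: {arithmetic_intensity(dw_macs, dw_bytes):7.1f} FLOPs/byte")     # ~4.5
```

With these shapes the depthwise layer does almost two orders of magnitude less work per byte fetched, which is exactly the Memory-Bound regime described above.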

3. Cause of Paradox 2: Fragmentation of MAC Utilization

Most high-performance NPUs or GPUs bundle dozens or hundreds of MAC units into a massive array (like a Systolic Array) or wide vector registers (SIMD) to process huge matrix multiplications (GEMM).

For example, let's assume an NPU where 64 MACs operate as a group.

  • Pointwise Conv (1 * 1, Channel Mixing): Since it involves inter-channel operations, the 64 MACs are fully utilized (Dense), running at 100% efficiency.
  • Depthwise Conv (Channel Independent): Each channel is independent. What happens if the hardware scheduler tries to parallelize across channels but, because of memory layout or data dependency constraints, ends up processing only 1 channel at a time?
    • Only 1 out of 64 works, and 63 are idle. (Utilization 1.5%)

This is the MAC Starvation phenomenon. To the hardware, Depthwise operations are "workloads chopped too finely," reducing the efficiency of large-scale parallel processing units.
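The effect can be illustrated with a toy utilization model of a hypothetical 64-MAC group. The mapping rules (all lanes busy for pointwise, one channel per cycle in the depthwise worst case) are simplifying assumptions, not a description of any specific NPU:

```python
GROUP_SIZE = 64   # hypothetical NPU: 64 MACs issue together as one group

def utilization(active_lanes_per_cycle):
    """Fraction of MAC lanes doing useful work each cycle."""
    return active_lanes_per_cycle / GROUP_SIZE

# Pointwise (1*1) conv: each pixel needs a C x N grid of dot products, so all
# 64 lanes can be handed independent output-channel partial sums.
print(f"pointwise : {utilization(64):.2%}")   # 100.00%

# Depthwise conv, worst case: the scheduler is stuck feeding one channel at a
# time, so a single lane works while 63 sit idle (MAC Starvation).
print(f"depthwise : {utilization(1):.2%}")    # 1.56% -- the ~1.5% from the text
```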

4. Memory Layout Conflict: Channel-First vs. Channel-Last

Hardware efficiency is determined by how data is laid out in memory.

  • NCHW (Channel-First): Channels come first, so each channel's entire H * W plane is stored contiguously.
  • NHWC (Channel-Last): Channels are grouped by pixel position.

Generally, high-performance NPUs prefer NHWC for vector operations. This is because fetching all channels for a specific pixel at once is efficient for 1 * 1 Pointwise Conv.

However, Depthwise Conv is a spatial operation requiring pixels from a 3 * 3 area. In an NHWC structure, neighboring pixels of the same channel are C elements apart in memory, causing Cache Misses or necessitating complex Shuffle logic.
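The sketch below makes this visible by computing the flat-memory offsets of a 3 * 3 same-channel window under each layout (batch of 1; the tensor shape is an illustrative assumption):

```python
def flat_offset(n, c, h, w, C, H, W, layout):
    """Flat element index of tensor position (n, c, h, w) under the given layout."""
    if layout == "NCHW":
        return ((n * C + c) * H + h) * W + w
    else:  # NHWC
        return ((n * H + h) * W + w) * C + c

C, H, W = 64, 56, 56
ch, center_h, center_w = 0, 10, 10

for layout in ("NCHW", "NHWC"):
    offsets = sorted(
        flat_offset(0, ch, center_h + dh, center_w + dw, C, H, W, layout)
        for dh in (-1, 0, 1) for dw in (-1, 0, 1)
    )
    span = offsets[-1] - offsets[0]
    print(f"{layout}: 3*3 same-channel window spans {span} elements")
```

With this shape, the window sits within a span of about 114 elements in NCHW but is spread across roughly 7,300 elements in NHWC, which is what forces the cache misses or shuffle logic mentioned above.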

5. Conclusion: FLOPs is Not Speed

MobileNet is undoubtedly an excellent model. However, it is also a prime example proving that "Low FLOPs" does not necessarily mean "Fast Hardware Speed (Low Latency)."

  • Standard Conv: High operation count, but it is cost-effective labor thanks to high hardware utilization.
  • Depthwise Conv: Low operation count, but it is inefficient labor due to low hardware utilization.

Recent NPU architectures are evolving to solve this by including dedicated Depthwise acceleration engines or by fusing Pointwise Conv with other layers. As hardware engineers, we must not be blinded by simple FLOPs numbers, but must be able to see through to the actual load the hardware pipeline will experience.

In the next post, we will explore the hardware headache hidden behind the deep learning revolution brought by ResNet: "ResNet and Bottlenecks: Problems Skip Connections Pose to Hardware Memory Management and Buffer Scheduling."

References: Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017.
