AI Architecture 11. Depthwise Separable Conv: The MobileNet Paradox

In the previous post, 3 Mappings of Conv Operations, we looked at the Im2Col method: a strategy that trades extra memory for computational speed (GEMM) when processing standard convolutions in hardware.

In 2017, Google introduced MobileNet, a revolutionary model for mobile environments. It reduced the number of operations (FLOPs) and parameters to about 1/10th of existing models while maintaining decent accuracy. The secret lay in a unique structure called Depthwise Separable Convolution.

However, this structure presented hardware engineers with an intriguing phenomenon known as "The MobileNet Paradox."

"The FLOPs were reduced by 90%, so why didn't the actual execution speed (Latency) increase by the same amount?"

In this article, we will uncover, from the perspective of a System Architect, the physical reasons why this seemingly perfect software diet produces a side effect: plummeting Utilization inside the hardware (NPUs/GPUs).

1. Depthwise Separable Convolution

First, let's briefly review how MobileNet reduced operations. The key is splitting the standard Conv operation into two steps (a code sketch follows the list):

  1. Depthwise Conv: Applies a separate filter to each input channel. (No information exchange between channels)
  2. Pointwise Conv: Mixes information between channels using a 1 * 1 filter.

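To make the two steps concrete, here is a minimal NumPy sketch of a depthwise separable convolution (stride 1, no padding, a single image in (H, W, C) layout). The shapes and function names are illustrative assumptions, not MobileNet's actual implementation:

```python
import numpy as np

def depthwise_separable_conv(x, dw_kernels, pw_kernels):
    """Reference (unoptimized) depthwise separable conv.
    x          : input feature map, shape (H, W, C)
    dw_kernels : one K*K filter per input channel, shape (K, K, C)
    pw_kernels : 1*1 mixing weights, shape (C, N)
    Stride 1, no padding; names and shapes are illustrative only.
    """
    H, W, C = x.shape
    K = dw_kernels.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Step 1: Depthwise Conv -- each channel is convolved with its own filter;
    # no information exchange between channels.
    dw_out = np.zeros((Ho, Wo, C))
    for c in range(C):
        for i in range(Ho):
            for j in range(Wo):
                dw_out[i, j, c] = np.sum(x[i:i+K, j:j+K, c] * dw_kernels[:, :, c])

    # Step 2: Pointwise Conv -- a 1*1 filter mixes information across channels.
    # Per pixel this is just a (C,) @ (C, N) matrix product.
    pw_out = dw_out @ pw_kernels          # shape (Ho, Wo, N)
    return pw_out

out = depthwise_separable_conv(
    np.random.rand(8, 8, 16),             # H=8, W=8, C=16
    np.random.rand(3, 3, 16),             # K=3 depthwise filters
    np.random.rand(16, 32),               # N=32 output channels
)
print(out.shape)                          # (6, 6, 32)
```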
Given input data H * W * C and kernel size K * K, the operation count decreases as follows:

  • Standard Conv: $H \cdot W \cdot C \cdot K^2 \cdot N$ (N = number of output channels)
  • Depthwise Separable: $(H \cdot W \cdot C \cdot K^2) + (H \cdot W \cdot C \cdot N)$

Taking the ratio of the two gives $\frac{1}{N} + \frac{1}{K^2}$; with a 3 * 3 kernel and a reasonably large N, this works out to about an 8~9x reduction in operations. Looking at the numbers alone, the hardware does roughly 9 times less work, so it should be roughly 9 times faster. But reality tells a different story.
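As a quick sanity check of the formulas, the script below plugs in an illustrative layer shape (the concrete numbers are assumptions, not taken from a specific MobileNet layer):

```python
H, W, C, N, K = 56, 56, 64, 128, 3       # illustrative layer shape

standard  = H * W * C * K * K * N        # standard conv MACs
depthwise = H * W * C * K * K            # depthwise step
pointwise = H * W * C * N                # pointwise (1*1) step

ratio = standard / (depthwise + pointwise)
print(f"standard : {standard:>12,}")     # 231,211,008
print(f"separable: {depthwise + pointwise:>12,}")   # 27,496,448
print(f"reduction: {ratio:.1f}x")        # ~8.4x, i.e. 1 / (1/N + 1/K**2)
```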

2. Cause of Paradox 1: Low Arithmetic Intensity

The core metric determining hardware performance is Arithmetic Intensity. That is, "How many operations are performed for every byte of data fetched from memory?"

  • Standard Conv: One input channel interacts with all N output channel filters. Data Reuse is very high.
  • Depthwise Conv: Each input channel interacts only with its own single K * K filter. There is no Cross-channel Reuse at all.

Depthwise Conv painstakingly fetches data from memory, multiplies it a few times, and finishes. This is similar to the [MLP and Memory Wall] problem discussed in Post #7. Before the arithmetic units can get busy, memory bandwidth becomes the bottleneck (Memory-Bound), causing performance degradation.
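To put rough numbers on this, the sketch below estimates arithmetic intensity for both layer types, under the idealized assumption that each input, weight, and output element is moved exactly once (ignoring caches and tiling) and that data is fp16 (2 bytes per element):

```python
def arithmetic_intensity(macs, bytes_moved):
    """FLOPs per byte, counting each MAC as 2 FLOPs (multiply + add)."""
    return 2 * macs / bytes_moved

H, W, C, N, K, BYTES = 56, 56, 64, 128, 3, 2   # fp16, illustrative shape

# Standard conv: input + K*K*C*N weights + output, each touched once (ideal reuse).
std_macs  = H * W * C * K * K * N
std_bytes = BYTES * (H * W * C + K * K * C * N + H * W * N)

# Depthwise conv: input + K*K*C weights + output; each input byte feeds only
# K*K multiplies of a single channel, so there is far less work per byte.
dw_macs  = H * W * C * K * K
dw_bytes = BYTES * (H * W * C + K * K * C + H * W * C)

print(f"standard : {arithmetic_intensity(std_macs, std_bytes):7.1f} FLOPs/byte")   # ~342
print(f"depthwise: {arithmetic_intensity(dw_macs, dw_bytes):7.1f} FLOPs/byte")     # ~4.5
```

With these shapes the depthwise layer does almost two orders of magnitude less work per byte fetched, which is exactly the Memory-Bound regime described above.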

3. Cause of Paradox 2: Fragmentation of MAC Utilization

Most high-performance NPUs or GPUs bundle dozens or hundreds of MAC units into a massive array (like a Systolic Array) or wide vector registers (SIMD) to process huge matrix multiplications (GEMM).

For example, let's assume an NPU where 64 MACs operate as a group.

  • Pointwise Conv (1 * 1, Channel Mixing): Since it involves inter-channel operations, the 64 MACs are fully utilized (Dense), running at 100% efficiency.
  • Depthwise Conv (Channel Independent): Each channel is independent. What happens if the hardware scheduler tries to parallelize across channels but, because of memory layout or data dependency constraints, ends up processing only 1 channel at a time?
    • Only 1 out of 64 works, and 63 are idle. (Utilization 1.5%)

This is the MAC Starvation phenomenon. To the hardware, Depthwise operations are "workloads chopped too finely," reducing the efficiency of large-scale parallel processing units.
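The effect can be illustrated with a toy utilization model of a hypothetical 64-MAC group. The mapping rules (all lanes busy for pointwise, one channel per cycle in the depthwise worst case) are simplifying assumptions, not a description of any specific NPU:

```python
GROUP_SIZE = 64   # hypothetical NPU: 64 MACs issue together as one group

def utilization(active_lanes_per_cycle):
    """Fraction of MAC lanes doing useful work each cycle."""
    return active_lanes_per_cycle / GROUP_SIZE

# Pointwise (1*1) conv: each pixel needs a C x N grid of dot products, so all
# 64 lanes can be handed independent output-channel partial sums.
print(f"pointwise : {utilization(64):.2%}")   # 100.00%

# Depthwise conv, worst case: the scheduler is stuck feeding one channel at a
# time, so a single lane works while 63 sit idle (MAC Starvation).
print(f"depthwise : {utilization(1):.2%}")    # 1.56% -- the ~1.5% from the text
```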

4. Memory Layout Conflict: Channel-First vs. Channel-Last

Hardware efficiency is determined by how data is laid out in memory.

  • NCHW (Channel-First): Channels come first, so each channel's entire H * W plane is stored contiguously.
  • NHWC (Channel-Last): Channels are grouped by pixel position.

Generally, high-performance NPUs prefer NHWC for vector operations. This is because fetching all channels for a specific pixel at once is efficient for 1 * 1 Pointwise Conv.

However, Depthwise Conv is a spatial operation requiring pixels from a 3 * 3 area. In an NHWC structure, neighboring pixels of the same channel are C elements apart in memory, causing Cache Misses or necessitating complex Shuffle logic.
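The sketch below makes this visible by computing the flat-memory offsets of a 3 * 3 same-channel window under each layout (batch of 1; the tensor shape is an illustrative assumption):

```python
def flat_offset(n, c, h, w, C, H, W, layout):
    """Flat element index of tensor position (n, c, h, w) under the given layout."""
    if layout == "NCHW":
        return ((n * C + c) * H + h) * W + w
    else:  # NHWC
        return ((n * H + h) * W + w) * C + c

C, H, W = 64, 56, 56
ch, center_h, center_w = 0, 10, 10

for layout in ("NCHW", "NHWC"):
    offsets = sorted(
        flat_offset(0, ch, center_h + dh, center_w + dw, C, H, W, layout)
        for dh in (-1, 0, 1) for dw in (-1, 0, 1)
    )
    span = offsets[-1] - offsets[0]
    print(f"{layout}: 3*3 same-channel window spans {span} elements")
```

With this shape, the window sits within a span of about 114 elements in NCHW but is spread across roughly 7,300 elements in NHWC, which is what forces the cache misses or shuffle logic mentioned above.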

5. Conclusion: FLOPs is Not Speed

MobileNet is undoubtedly an excellent model. However, it is also a prime example proving that "Low FLOPs" does not necessarily mean "Fast Hardware Speed (Low Latency)."

  • Standard Conv: High operation count, but it is cost-effective labor thanks to high hardware utilization.
  • Depthwise Conv: Low operation count, but it is inefficient labor due to low hardware utilization.

Recent NPU architectures are evolving to solve this by including dedicated Depthwise acceleration engines or by fusing Pointwise Conv with other layers. As hardware engineers, we must not be blinded by simple FLOPs numbers, but must be able to see through to the actual load the hardware pipeline will experience.

In the next post, we will explore the hardware headache hidden behind the deep learning revolution brought by ResNet: "ResNet and Bottlenecks: Problems Skip Connections Pose to Hardware Memory Management and Buffer Scheduling."

References: Howard et al., "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," 2017.
