AI Architecture 9. Three Mappings of Conv Operations: Direct vs. Im2Col vs. Winograd

In the previous post, we learned that hardware loves CNNs (Convolutional Neural Networks) because of Locality and Data Reuse. Theoretically, CNNs seem like the perfect hardware-friendly algorithm.

However, when trying to implement this algorithm on silicon, architects face a serious hurdle: "Complex Loop Structures." Convolution operations essentially consist of 6 to 7 nested loops (Batch, Out Channel, In Channel, Height, Width, Kernel H, Kernel W). If mapped directly to hardware (Direct Conv), the memory access patterns become chaotic, causing cache efficiency (Hit Rate) to plummet.

So, engineers came up with a brilliant idea: "Even if we waste some memory, what if we convert this complex operation into simple Matrix Multiplication?"

Today, we will compare three representative strategies used by hardware and compilers to process Conv operations— Direct, Im2Col, and Winograd—and uncover the law of "Equivalent Exchange between Memory and Speed" hidden within.

1. Direct Convolution

The most intuitive method is to implement the mathematical formula as is. The Sliding Window scans the image one step at a time, multiplying and accumulating.

  • Pros: Requires absolutely no additional memory (Zero Memory Overhead). You only need the input image and weights.
  • Cons: Hardware implementation is hellish.
    • Irregular Memory Access: Data addresses jump discontinuously every time the window moves.
    • Difficulty in Parallelization: It is very tricky to keep SIMD (Single Instruction, Multiple Data) units fully utilized.

Unless it's early-stage hardware or an embedded environment with extremely limited memory, the Direct method is rarely used due to its slow speed.
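To make the "6 to 7 nested loops" concrete, here is a minimal sketch of direct convolution in NumPy (single batch, stride 1, no padding). This is illustrative only; real kernels would never loop in Python like this, which is exactly the point about how much structure the hardware has to absorb.

```python
import numpy as np

def direct_conv2d(x, w):
    """Naive direct convolution.
    x: input of shape (C_in, H, W); w: weights of shape (C_out, C_in, KH, KW).
    Every loop in the mathematical formula is written out explicitly."""
    C_in, H, W = x.shape
    C_out, _, KH, KW = w.shape
    OH, OW = H - KH + 1, W - KW + 1
    y = np.zeros((C_out, OH, OW))
    for co in range(C_out):              # output channel
        for oh in range(OH):             # output row
            for ow in range(OW):         # output column
                for ci in range(C_in):           # input channel
                    for kh in range(KH):         # kernel row
                        for kw in range(KW):     # kernel column
                            y[co, oh, ow] += x[ci, oh + kh, ow + kw] * w[co, ci, kh, kw]
    return y
```

Notice how the innermost accesses `x[ci, oh + kh, ow + kw]` jump around memory as the window slides: this is the irregular access pattern described above.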

2. Im2Col (Image to Column)

This is the standard method used by most GPUs (cuDNN) and deep learning frameworks today. The core idea is "Convert 3D Convolution into 2D Matrix Multiplication (GEMM)."

How it Works

  1. Im2Col Transformation: The pixels covered by the 3×3 kernel at each position (the Receptive Field) are extracted and flattened into a single long Column Vector.
  2. Repeat this for every window position to create a massive Input Matrix.
  3. Flatten the filters (Weights) in the same way to create a Weight Matrix.
  4. Now, multiply the two massive matrices (GEMM).
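The four steps above can be sketched as follows (a minimal single-channel, stride-1 version in NumPy; the hypothetical names `im2col` and `conv_as_gemm` are mine, not a library API):

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll each (kh, kw) receptive field of x (shape H, W) into one column.
    Result: a (kh*kw, OH*OW) matrix -- overlapping pixels are duplicated."""
    H, W = x.shape
    OH, OW = H - kh + 1, W - kw + 1
    cols = np.empty((kh * kw, OH * OW))
    for oh in range(OH):
        for ow in range(OW):
            patch = x[oh:oh + kh, ow:ow + kw]
            cols[:, oh * OW + ow] = patch.ravel()   # steps 1-2: build the Input Matrix
    return cols

def conv_as_gemm(x, w):
    """Convolution expressed as a single matrix multiply (GEMM)."""
    kh, kw = w.shape
    OH, OW = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    cols = im2col(x, kh, kw)
    w_row = w.reshape(1, -1)                         # step 3: flatten the filter
    return (w_row @ cols).reshape(OH, OW)            # step 4: one GEMM, then reshape
```

The duplication is visible in the shapes: for a 3×3 kernel at stride 1, `cols` stores roughly nine copies of every interior pixel.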

Trade-off

  • Pros (Speed): The complex Conv operation becomes a GEMM (General Matrix Multiply) problem, for which decades of library optimization (BLAS, cuBLAS) exist. The matrix units in GPUs and NPUs can be kept nearly fully utilized, drastically increasing speed.
  • Cons (Memory): Severe memory waste occurs. Overlapping pixels in the Sliding Window are duplicated when they are laid out as a matrix.
    • For example, a 3×3 kernel at Stride=1 inflates the data volume to roughly 9 times the original image, because each pixel appears in up to nine overlapping windows.

The architect asks: "Is it okay for memory usage to increase 9-fold?" In most cases, the answer is "Yes." This is because Computational Throughput is more valuable than memory capacity.

3. Winograd Algorithm

If Im2Col changed the "Structure" to gain speed, Winograd changes the "Math" to reduce the number of operations itself.

Normally, generating a 2×2 output tile with a 3×3 filter requires 2 × 2 × 9 = 36 multiplications. The Winograd algorithm reduces this to 16 multiplications, a 2.25× reduction.
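The number 16 follows from the minimal-filtering result of Lavin & Gray: computing an m×m output tile with an r×r filter, written F(m×m, r×r), needs (m + r − 1)² element-wise multiplications:

```latex
\mu\big(F(m \times m,\ r \times r)\big) = (m + r - 1)^2
\qquad\Rightarrow\qquad
\mu\big(F(2 \times 2,\ 3 \times 3)\big) = (2 + 3 - 1)^2 = 16
\quad\text{vs.}\quad
2^2 \cdot 3^2 = 36
```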

How it Works

The input (d) and filter (g) are first transformed into another domain (similar in spirit to how FFT-based convolution works), multiplied element-wise there, and then inverse-transformed back.

Y = A^T [(G g G^T) ⊙ (B^T d B)] A

  • G, B, A: Pre-defined transformation matrices (Constants)
  • ⊙: Element-wise (Hadamard) multiplication
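The 1D case F(2, 3) makes the formula tangible: 2 outputs of a 3-tap filter with only 4 multiplications instead of 6. The sketch below uses the F(2, 3) transformation matrices from Lavin & Gray and checks the result against a plain sliding-window (correlation) convolution.

```python
import numpy as np

# F(2, 3) transformation constants (Lavin & Gray).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G  = np.array([[1,    0,   0],
               [0.5,  0.5, 0.5],
               [0.5, -0.5, 0.5],
               [0,    0,   1]], dtype=float)
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """1D Winograd F(2,3): Y = A^T [(G g) ⊙ (B^T d)].
    d: 4 input samples, g: 3 filter taps -> 2 outputs."""
    U = G @ g      # filter transform (can be precomputed offline)
    V = BT @ d     # input transform (additions/subtractions only)
    M = U * V      # the only 4 multiplications
    return AT @ M  # inverse transform

print(winograd_f23(np.array([1., 2., 3., 4.]), np.array([1., 1., 1.])))  # [6. 9.]
```

Note where the "Cons" come from: `BT @ d` and `AT @ M` cost extra additions, and the 0.5 entries in G are exactly the floating-point constants that complicate INT8 quantization.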

Trade-off

  • Pros: The multiplication (MAC) count is drastically reduced. It shows overwhelming performance in models dominated by 3×3 Convs.
  • Cons 1 (Transform Cost): Multiplications decrease, but Additions increase during the transform process. It may also require dedicated hardware units for transformation.
  • Cons 2 (Precision): Transformation matrices contain many floating-point constants, which can cause Numerical Errors. It is tricky to combine with INT8 quantization.

4. The Hardware Architect's Choice

So, which method should we choose? The answer is "It depends."

  • GPUs with ample memory needing high performance: Use Im2Col + GEMM without hesitation. Parallel processing is more important than memory waste.
  • Mobile NPUs with small kernel sizes (3×3): Actively utilize Winograd to increase computational efficiency.
  • IoT MCUs with extremely limited memory: We have no choice but to use Direct Conv or perform Im2Col in very small Tile units.

5. Conclusion

The history of CNN acceleration is a constant tug-of-war between "Memory Space" and "Computation Time."

  • Im2Col: Discard space to buy time.
  • Winograd: Trade complexity (additions) to reduce multiplications.

As hardware engineers, we must accurately understand the environment where the chip will be used (Memory Bandwidth, Capacity, Target Performance) and choose the optimal weapon among these algorithms to map onto the hardware.

In the next post, we will explore another essential element of CNNs, which looks simple but complicates hardware Line Buffer design: "Hardware Issues with Pooling and Padding."

References: Lavin & Gray, "Fast Algorithms for Convolutional Neural Networks" (CVPR 2016)
