AI Architecture 4. Training vs. Inference

In previous posts, we learned that MAC operations, Memory Hierarchies, and Parallel Processing (SIMD) form the foundation of deep learning hardware. Now we stand at the biggest crossroads in the AI semiconductor market: Training vs. Inference. Heavyweight GPUs such as NVIDIA's H100 handle both. However, recent NPUs in smartphones and autonomous driving chips focus solely on 'Inference.'

To a software engineer, inference is just a couple of lines of code: switching to model.eval() and turning off gradients with torch.no_grad().

But to us, Hardware Architects, "removing the training function" implies a massive 'diet' that fundamentally overturns the chip's design philosophy. In this post, we will analyze the dramatic changes and optimizations that occur inside the hardware when we put down the heavy burden of Backpropagation.

1. Backpropagation: The Heaviest Load on Hardware

The core of Training is Backpropagation. It involves calculating the error (Loss) between the model's prediction and the ground truth, calculating gradients, and updating the weights (W). For the hardware, the costly part is what the backward pass demands of memory:

  • During Training: The outputs of Layer 1, Layer 2... up to Layer 100 must all be stored in memory (DRAM/HBM). They are needed later for the Backward Pass to compute derivatives.
  • During Inference: Once Layer 1's output is used to calculate Layer 2, Layer 1's data can be immediately discarded.

This difference is massive. Inference-only NPUs can often get by with small On-chip Buffers (SRAM) that hold data only temporarily. In contrast, training chips typically require tens of gigabytes of expensive HBM (High Bandwidth Memory). This is one of the main reasons inference-only designs are cheaper.
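To make the gap concrete, here is a rough sketch of the activation memory each mode needs. The layer count, the values per layer, and the helper name `activation_bytes` are assumptions chosen for illustration, not measurements of any real model or chip.

```python
def activation_bytes(layer_sizes, bytes_per_value, training):
    """Estimate activation memory for one input sample.

    layer_sizes: number of output values produced by each layer (hypothetical).
    """
    if training:
        # Every layer's output is kept until the backward pass consumes it.
        return sum(layer_sizes) * bytes_per_value
    # Inference: only one layer's input and output must coexist at any time.
    peak = max(a + b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return peak * bytes_per_value

# Assumed toy network: 100 layers, 1,000,000 output values each, FP32 (4 bytes).
layers = [1_000_000] * 100
train_bytes = activation_bytes(layers, 4, training=True)
infer_bytes = activation_bytes(layers, 4, training=False)

print(f"training : {train_bytes / 1e6:.0f} MB")
print(f"inference: {infer_bytes / 1e6:.0f} MB")
```

Under these assumptions, training holds 50 times more activation data than inference for a single sample, and the training figure grows further with batch size.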

[Figure: Backpropagation]

2. Freedom of Precision: From FP32 to INT8

During training, Gradient values can be extremely small. Repeatedly multiplying tiny values like 0.00001 produces numbers too small for a narrow number format to represent, so they underflow to zero and the gradient signal is lost (a hardware-level cousin of the Vanishing Gradient problem). To prevent this, training hardware must support arithmetic units with a wide Dynamic Range, such as FP32 (32-bit Floating Point) or BF16 (Brain Floating Point, which keeps FP32's 8-bit exponent but shortens the mantissa).
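The effect of dynamic range can be checked with a few lines of standard-library Python. This sketch rounds one small gradient into two 16-bit formats. FP16 (IEEE half precision) is not discussed above; it is brought in here only as an assumed example of a narrow-range format, and BF16 is emulated by simple truncation of an FP32 value rather than by any particular chip's rounding mode.

```python
import struct

def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits) and convert back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_bf16(x: float) -> float:
    """Emulate bfloat16 (8 exponent bits) by keeping the top 16 bits of FP32."""
    top_two_bytes = struct.pack(">f", x)[:2]
    return struct.unpack(">f", top_two_bytes + b"\x00\x00")[0]

grad = 1e-8  # an assumed tiny gradient value
fp16_grad = to_fp16(grad)
bf16_grad = to_bf16(grad)

print("FP16:", fp16_grad)  # too small for FP16's range, becomes zero
print("BF16:", bf16_grad)  # survives, with reduced precision
```

The same value that vanishes in the narrow-exponent format survives in BF16, which is why exponent width matters more than mantissa width for gradients.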

However, a model that has finished training (Frozen Weights) is much more robust. Even if we lower the precision of the weights, the final output (whether it's a dog or a cat) rarely changes. Inference-only NPUs take advantage of this:

  • They remove complex Floating Point Units (FPUs).
  • Instead, they adopt simple and small Integer ALUs (INT8 or INT4).
  • This allows packing 4x to 8x more processing units into the same silicon area.
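A minimal sketch of the idea, assuming symmetric per-tensor quantization (one common scheme, not necessarily what any particular NPU uses). The weights, the input, and the two-class "dog vs. cat" setup are made-up values for illustration.

```python
def quantize_int8(values):
    """Symmetric quantization: map floats to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    return [round(v / scale) for v in values], scale

# Assumed toy classifier: one weight row per class (0 = dog, 1 = cat).
weights = [[0.42, -1.30, 0.07],
           [-0.55, 0.91, 0.33]]
x = [0.9, 0.4, -2.0]

# Reference result in floating point.
fp_scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Quantize weights (one scale for the whole tensor) and the input.
flat_q, w_scale = quantize_int8([w for row in weights for w in row])
q_weights = [flat_q[0:3], flat_q[3:6]]
q_x, x_scale = quantize_int8(x)

# The MACs now run on integers only; one rescale at the end recovers real units.
int_acc = [sum(w * xi for w, xi in zip(row, q_x)) for row in q_weights]
int8_scores = [acc * w_scale * x_scale for acc in int_acc]

fp_class = fp_scores.index(max(fp_scores))
int8_class = int8_scores.index(max(int8_scores))
print(fp_scores, int8_scores, fp_class == int8_class)
```

The scores shift slightly, but the winning class stays the same, which is all a classifier's consumer sees.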

3. Dataflow and Buffering

For training chips (GPUs), Throughput is king. You just need to shove in hundreds of images at once (Batch Size = 256, 512...) to increase the average processing speed.

But for inference, especially for Real-time Services, Latency is life.

  • You can't make a user wait for 255 other users' questions to pile up before answering their chatbot query.
  • An autonomous car can't wait to collect 32 frames of images before hitting the brakes when it sees an obstacle.

Therefore, inference NPUs must be designed to perform well even at Batch Size = 1. This makes the speed of fetching Weights critical. For a fully connected layer, a training chip running Batch Size = 256 fetches each weight once and reuses it 256 times, but an inference chip at Batch Size = 1 fetches each weight once, uses it once, and discards it.
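The reuse argument can be put into numbers with a small sketch. It assumes a single fully connected layer whose weights are re-read from memory for every batch (no on-chip pinning); the layer dimensions and the helper name `weight_traffic_bytes` are hypothetical.

```python
def weight_traffic_bytes(in_features, out_features, bytes_per_weight,
                         num_samples, batch_size):
    """Bytes of weights read from memory to process num_samples inputs,
    assuming the full weight matrix is fetched once per batch."""
    num_batches = -(-num_samples // batch_size)  # ceiling division
    return in_features * out_features * bytes_per_weight * num_batches

# Assumed layer: 4096 x 4096 weights stored as INT8 (1 byte each).
layer = dict(in_features=4096, out_features=4096, bytes_per_weight=1)

traffic_b1 = weight_traffic_bytes(**layer, num_samples=256, batch_size=1)
traffic_b256 = weight_traffic_bytes(**layer, num_samples=256, batch_size=256)

print(f"batch 1  : {traffic_b1 / 2**20:.0f} MiB of weight traffic")
print(f"batch 256: {traffic_b256 / 2**20:.0f} MiB of weight traffic")
```

The arithmetic work is identical in both cases, yet the Batch 1 run moves 256 times more weight data, so memory bandwidth rather than MAC count sets its speed.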

Because of this, inference-only NPUs obsess over Memory Bandwidth Efficiency and Weight Stationary architectures that keep weights pinned inside the chip.

4. Disappearing Hardware Blocks (Logic Removal)

The moment you declare "I will only do inference," the architect can erase many blocks from the chip blueprint.

  1. Removal of Transpose Units: Backpropagation requires multiplying by the transpose of the weight matrix (W^T). Hardware to transpose matrices is needed for training but unnecessary in inference chips.
  2. Removal of Gradient Accumulators: Logic and buffers to accumulate gradients across batches are not needed.
  3. Removal of Complex Loss Function Logic: Hardware for operations like Cross-Entropy to compare against ground truth is unnecessary.
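The three blocks in the list above map directly onto the backward pass of a single linear layer, sketched here in plain Python. A squared-error loss is assumed instead of Cross-Entropy to keep the arithmetic short, and the weights, inputs, and targets are made-up values.

```python
def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def transpose(m):
    """The operation a training chip's transpose hardware would perform."""
    return [list(col) for col in zip(*m)]

def forward(W, x):
    # Inference needs only this function.
    return matvec(W, x)

def backward(W, x, y, target, grad_W):
    # Loss logic: gradient of 0.5 * ||y - target||^2 with respect to y.
    dy = [yi - ti for yi, ti in zip(y, target)]
    # Transpose unit: the gradient flowing to the previous layer is W^T * dy.
    dx = matvec(transpose(W), dy)
    # Gradient accumulator: add this sample's outer product dy * x^T.
    for i, dyi in enumerate(dy):
        for j, xj in enumerate(x):
            grad_W[i][j] += dyi * xj
    return dx

W = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5]]
batch = [([1.0, -2.0], [0.0, 0.0, 0.0]),
         ([0.5, 1.0], [1.0, 0.0, 2.0])]

grad_W = [[0.0, 0.0] for _ in W]
dxs = []
for x, target in batch:
    y = forward(W, x)
    dxs.append(backward(W, x, y, target, grad_W))

print(dxs)
print(grad_W)
```

Everything inside `backward`, along with the `grad_W` buffer, disappears from an inference-only design; only `forward` remains.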

Spending the freed silicon area on more Cache memory, or on Low Power design that extends battery life, is the core competitive advantage of Edge NPUs.

5. Conclusion: Use the Right Tool for the Job

In the past, a single GPU handled all AI operations, but the market is now diverging. We have "Training Beast Chips (NVIDIA H100, Google TPU v5)" that train massive models in data centers, and "Lightweight Inference Chips (Apple Neural Engine, Qualcomm Hexagon)" that briskly output results on smartphones or edge devices.

As System Architects, we must clearly understand this difference. "The freedom of memory and simplicity of circuits gained by discarding backpropagation." This is the secret to how AI was able to fit into the smartphones in our palms.

In the next post, we will dive into the microscopic world inside the chip and explore how data is represented as 0s and 1s.

References: In-Datacenter Performance Analysis of a Tensor Processing Unit
