‘A virtual DPU within a GPU’: Could clever hardware hack be behind DeepSeek’s groundbreaking AI efficiency?

Lovabledaniels | January 30, 2025 | 158 Views

A new approach called DualPipe seems to be the key to DeepSeek’s success
One expert describes it as an on-GPU virtual DPU that maximizes bandwidth efficiency
While DeepSeek has used Nvidia GPUs only, one wonders how AMD’s Instinct would fare

China’s DeepSeek AI chatbot has stunned the tech industry, representing a credible alternative to OpenAI’s ChatGPT at a fraction of the cost.

A recent paper revealed DeepSeek-V3 was trained on a cluster of 2,048 Nvidia H800 GPUs – crippled versions of the H100 (we can only imagine how much more powerful it would be running on AMD Instinct accelerators!). Training reportedly required 2.79 million GPU-hours in total, including pretraining on 14.8 trillion tokens, and cost – according to calculations made by The Next Platform – a mere $5.58 million.
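The cost figure can be sanity-checked with simple arithmetic. The sketch below assumes a rental rate of $2 per GPU-hour, which is not stated in this article but is the rate that reproduces the quoted total. It also estimates wall-clock time under the idealized assumption that all 2,048 GPUs were busy for the whole run. The function names are illustrative only.

```python
# Back-of-the-envelope check of the reported training cost.
GPU_HOURS = 2.79e6            # total GPU-hours, as reported in the article
NUM_GPUS = 2048               # cluster size, as reported in the article
RATE_USD_PER_GPU_HOUR = 2.0   # ASSUMPTION: rental price, not given in the article


def training_cost_usd(gpu_hours, rate):
    """Cost if every GPU-hour is billed at a flat rental rate."""
    return gpu_hours * rate


def wall_clock_days(gpu_hours, num_gpus):
    """Elapsed days if all GPUs run in parallel for the entire job."""
    return gpu_hours / num_gpus / 24


cost = training_cost_usd(GPU_HOURS, RATE_USD_PER_GPU_HOUR)
days = wall_clock_days(GPU_HOURS, NUM_GPUS)

print(f"Estimated cost: ${cost / 1e6:.2f} million")
print(f"Idealized wall-clock time on {NUM_GPUS} GPUs: about {days:.0f} days")
```

Under these assumptions the estimate lands on the $5.58 million figure and a run of roughly eight weeks. It covers GPU rental only, not research, staff, or earlier experiments.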

How DeepSeek’s developers managed this feat likely comes down to a clever hack.

A virtual DPU on the GPU itself

First, some background. DeepSeek is an advanced Mixture-of-Experts (MoE) language model designed to optimize performance by selectively activating only the most relevant parts of its architecture for each task. The third version of the model, DeepSeek-V3, features a total of 671 billion parameters, with only 37 billion activated for any given token prediction. This selective activation massively reduces computational costs while maintaining high performance and accuracy – which you’ll see if you try it.
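To make selective activation concrete, here is a minimal sketch of top-k expert routing in plain Python. It is a generic illustration, not DeepSeek's gating algorithm: the eight toy experts, the gate scores, the choice of k = 2, and the softmax over the selected scores are all assumptions made for the example. The final lines work out what share of the model's parameters is active per token, using the figures quoted above.

```python
import math


def top_k_route(scores, k):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    exps = [math.exp(scores[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


def moe_forward(x, experts, gate_scores, k):
    """Run only the selected experts and blend their outputs by gate weight."""
    chosen, weights = top_k_route(gate_scores, k)
    output = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return output, chosen


# Hypothetical experts: expert i simply multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
# Hypothetical gate scores for one token (a real router computes these).
gate_scores = [0.1, 2.0, -1.0, 0.5, 1.5, -0.3, 0.0, 0.7]

y, chosen = moe_forward(2.0, experts, gate_scores, k=2)
print(f"Experts used: {chosen} out of {len(experts)}")

# Share of parameters active per token, from the article's figures.
TOTAL_PARAMS = 671e9
ACTIVE_PARAMS = 37e9
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS
print(f"Active parameters per token: {active_fraction:.1%}")
```

Only two of the eight toy experts run for this token, so the remaining six cost nothing. At full scale, the same idea means roughly one parameter in eighteen is used per token prediction.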

It’s easy to be skeptical of DeepSeek and the claims made regarding its training, but the paper reveals some of the magic the developers came up with to make the most of the crippled hardware they had to work with. This includes the creation of the DualPipe algorithm for efficient pipeline parallelism.

According to the information published by DeepSeek, DualPipe overlaps forward and backward computation, reduces latency, and optimizes data movement across GPUs. By efficiently managing communication, it minimizes idle time (pipeline bubbles) and dynamically balances GPU compute cores (Streaming Multiprocessors) between computation and communication, preventing data transfer bottlenecks as the model scales.
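To see why pipeline bubbles matter, it helps to estimate how much time a conventional pipeline spends idle. The sketch below uses the standard textbook estimate for a simple synchronous pipeline in which every stage takes the same time; it is a baseline for comparison, not a model of DualPipe itself. The function name is illustrative, and the configuration of 8 pipeline ranks and 20 micro-batches matches DeepSeek's published example schedule.

```python
def bubble_fraction(pp_ranks, micro_batches):
    """Idle share of a simple synchronous pipeline with equal-cost stages.

    Each rank waits (pp_ranks - 1) slots while the pipeline fills and drains,
    out of (micro_batches + pp_ranks - 1) slots in total.
    """
    return (pp_ranks - 1) / (micro_batches + pp_ranks - 1)


PP_RANKS = 8  # pipeline-parallel ranks, as in DeepSeek's example schedule

for m in (8, 20, 80):
    idle = bubble_fraction(PP_RANKS, m)
    print(f"{PP_RANKS} ranks, {m} micro-batches: {idle:.1%} idle")
```

With 8 ranks and 20 micro-batches, this baseline leaves each GPU idle for about a quarter of the step. Adding micro-batches shrinks the bubble but raises memory pressure, which is why scheduling schemes that overlap computation with communication are attractive.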

A commenter on The Next Platform describes DualPipe as “essentially creating a virtual DPU on the GPU itself to handle all-to-all communication,” which highlights its role in optimizing data transfer efficiency.

The paper goes into further detail: “In order to ensure sufficient computational performance for DualPipe, we customize efficient cross-node all-to-all communication kernels (including dispatching and combining) to conserve the number of SMs dedicated to communication. The implementation of the kernels is co-designed with the MoE gating algorithm and the network topology of our cluster. To be specific, in our cluster, cross-node GPUs are fully interconnected with IB, and intra-node communications are handled via NVLink.”
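The dispatch and combine steps mentioned in the quote can be pictured with a small simulation. This is a pure-Python sketch of the data movement only, under made-up assumptions: four ranks, two tokens per rank, and a hypothetical routing table saying which rank hosts each token's expert. It does not reflect DeepSeek's actual kernels, which run on the GPU over InfiniBand and NVLink.

```python
def all_to_all(send_buffers):
    """send_buffers[src][dst] holds what rank `src` sends to rank `dst`.

    Returns recv_buffers, where recv_buffers[dst][src] is that same payload.
    """
    n = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(n)] for dst in range(n)]


NUM_RANKS = 4
# Hypothetical routing: each rank holds (token, rank hosting its chosen expert).
tokens = [
    [("a0", 1), ("a1", 3)],
    [("b0", 0), ("b1", 1)],
    [("c0", 3), ("c1", 3)],
    [("d0", 2), ("d1", 0)],
]

# Dispatch: bucket each token by destination rank, then exchange.
send = [[[] for _ in range(NUM_RANKS)] for _ in range(NUM_RANKS)]
for src, items in enumerate(tokens):
    for tok, dst in items:
        send[src][dst].append(tok)
received = all_to_all(send)

# Stand-in for expert computation: tag each token with the rank that handled it.
processed = [
    [[f"{tok}@expert{dst}" for tok in bucket] for bucket in received[dst]]
    for dst in range(NUM_RANKS)
]

# Combine: a second all-to-all sends every result back to its source rank.
returned = all_to_all(processed)
combined = [[t for bucket in returned[r] for t in bucket] for r in range(NUM_RANKS)]

for rank, results in enumerate(combined):
    print(f"rank {rank} got back: {results}")
```

Every token crosses the network twice, once out and once back, which is why the cost of this exchange grows quickly as experts are spread over more GPUs. Results arrive grouped by the rank that processed them, so a real implementation also has to restore the original token order.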

Example DualPipe scheduling for 8 PP ranks and 20 micro-batches in two directions. The micro-batches in the reverse direction are symmetric to those in the forward direction, so we omit their batch ID for illustration simplicity. Two cells enclosed by a shared black border have mutually overlapped computation and communication. (Image credit: DeepSeek)
