← back to projects

ZenFlow — stall-free offloading

How asynchronous gradient updates keep the GPU busy while offloading to the CPU.

The idea — split gradients by impact

top-k important → applied on GPU now the rest → offloaded to CPU, applied async
GPUforward / backward
CPUasync update

High-impact gradients are applied immediately on the GPU; the remaining gradients are offloaded to the CPU and updated asynchronously, fully overlapped with the next step's compute.

The result — no more GPU stalls

ZeRO-Offload  — GPU sits idle while the CPU updates
GPU compute
stall
GPU compute
stall
ZenFlow  — GPU never pauses; CPU updates overlap underneath
GPU compute — uninterrupted
GPU compute GPU stall (idle, waiting on CPU) CPU async update (overlapped) important gradient → instant GPU update