
λScale — fast serverless LLM inference

RDMA multicast plus execute-while-load removes the cold-start wait in serverless GPU clusters.

The idea — execute while loading

[Figure: RDMA multicast. A model source (RDMA sender) streams layers to GPU Node 1, GPU Node 2, and GPU Node 3; each node is labeled "layers arriving…".]

The source node multicasts model layer chunks over RDMA to all GPU nodes simultaneously. As each node receives layers (teal progress bar), it immediately begins inference on the layers already loaded (bright sweep) while the rest are still in transit. Loading and execution are pipelined, so no node waits for the full model before it starts work.
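The overlap described above can be sketched on a single node with a queue between a receiver and an executor. This is a minimal illustration, not λScale's implementation: a Python thread stands in for the RDMA receive path, plain functions stand in for model layers, and the names `stream_layers` and `execute_while_load` are hypothetical.

```python
import queue
import threading


def stream_layers(layers, inbox):
    """Stand-in for the receive side of the transfer: deliver layers in order."""
    for layer in layers:
        inbox.put(layer)
    inbox.put(None)  # sentinel: no more layers are coming


def execute_while_load(layers, x):
    """Apply each layer as soon as it arrives instead of waiting for all of them."""
    inbox = queue.Queue()
    loader = threading.Thread(target=stream_layers, args=(layers, inbox))
    loader.start()

    applied = 0
    while True:
        layer = inbox.get()  # blocks only until the next layer has arrived
        if layer is None:
            break
        x = layer(x)  # run the layer that is already loaded
        applied += 1

    loader.join()
    return x, applied


# Toy "model": four layers that each add a constant to the input.
layers = [lambda v, k=k: v + k for k in range(1, 5)]
result, applied = execute_while_load(layers, 0)
print(result, applied)
```

The executor never asks whether the whole model is present; it only blocks when the next layer in order has not arrived yet, which is the property the figure illustrates.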

The result — no cold-start wait

Baseline serverless: full model load first, then execute.
λScale: load and execute overlap; finishes ~45% sooner.

[Figure: timeline comparing the two schedules. Legend: model loading (RDMA multicast), inference execution, overlap zone (execute-while-load), layer chunk in flight (RDMA).]
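To see where a saving of this size can come from, the two schedules can be compared with a toy timing model. The per-layer times below are hypothetical, chosen so the toy result lands at the figure's ~45%; they are not measurements from λScale. The model also assumes layers arrive and execute strictly in order.

```python
def baseline_finish(load_times, exec_times):
    """Load every layer first, then execute every layer."""
    return sum(load_times) + sum(exec_times)


def overlapped_finish(load_times, exec_times):
    """Execute each layer once it has arrived and the previous layer is done."""
    arrived = 0.0  # time at which the current layer has fully arrived
    done = 0.0     # time at which the previous layer finished executing
    for load, run in zip(load_times, exec_times):
        arrived += load
        done = max(arrived, done) + run
    return done


# Hypothetical workload: 10 layers, 1 time unit to load and 1 to execute each.
n = 10
load = [1.0] * n
run = [1.0] * n

base = baseline_finish(load, run)
fast = overlapped_finish(load, run)
saving = 1 - fast / base
print(f"baseline={base}, overlapped={fast}, saving={saving:.0%}")
```

In this model, when loading is at least as slow as execution, the overlapped schedule finishes one layer's execution time after the last layer arrives, so the saving grows with the number of layers and with how closely execution time matches load time.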