λScale — fast serverless LLM inference

RDMA multicast and execute-while-load together eliminate the cold-start wait in serverless GPU clusters.

The idea — execute while loading

The source node multicasts model layer chunks over RDMA to all GPU nodes simultaneously. As each node receives layers (shown as a teal progress bar), it immediately begins inference on the layers already loaded (shown as a bright sweep) while the rest are still in transit. Loading and execution are pipelined, so no node waits for the full model before it starts work.
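The overlap described above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not λScale's implementation: the `receiver` thread stands in for the RDMA receive path, `NUM_LAYERS` and the arithmetic "layer" computation are hypothetical, and a real system would hand over weight tensors rather than integer IDs.

```python
import queue
import threading

NUM_LAYERS = 4  # hypothetical model depth, chosen for illustration


def receiver(chunks, num_layers):
    """Stand-in for the RDMA receive path: layers arrive in order."""
    for layer_id in range(num_layers):
        chunks.put(layer_id)  # a real system would deliver weights here
    chunks.put(None)  # sentinel: the transfer is complete


def execute_while_load(num_layers=NUM_LAYERS):
    """Run each layer as soon as it arrives instead of waiting for all."""
    chunks = queue.Queue()
    executed = []
    loader = threading.Thread(target=receiver, args=(chunks, num_layers))
    loader.start()

    activation = 0
    while True:
        layer_id = chunks.get()  # blocks only until the next layer arrives
        if layer_id is None:
            break
        activation += layer_id + 1  # toy stand-in for a layer's forward pass
        executed.append(layer_id)

    loader.join()
    return activation, executed
```

The consumer loop never waits for the whole model, only for the next layer it needs, which is the property the animation illustrates.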

The result — no cold-start wait

Figure: timeline comparison of the two approaches.

Baseline serverless: the full model loads first, then execution starts (model load, then execute).

λScale: load and execute overlap, so the run finishes ~45% sooner.

Legend: model loading (RDMA multicast); inference execution; overlap zone (execute-while-load); layer chunk in flight (RDMA).
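The timeline comparison can be made concrete with a simple timing model. The sketch below is an illustration under assumptions, not the measurement behind the figure: the layer count and the per-layer load and execute times are made-up values, layers are assumed to arrive one after another at a constant rate, and each layer is assumed to run only after it has arrived and the previous layer has finished.

```python
def baseline_ms(layers, load_ms, exec_ms):
    """Load every layer first, then execute every layer."""
    return layers * load_ms + layers * exec_ms


def overlapped_ms(layers, load_ms, exec_ms):
    """Execute layer i once it has arrived and layer i-1 is done."""
    done = 0
    for i in range(layers):
        arrived = (i + 1) * load_ms  # layers arrive back to back
        done = max(done, arrived) + exec_ms
    return done


# Hypothetical numbers: 10 layers, 100 ms to load and 100 ms to run each.
base = baseline_ms(10, 100, 100)    # 2000
fast = overlapped_ms(10, 100, 100)  # 1100
saving = 1 - fast / base            # about 0.45
```

With these assumed values the overlapped run finishes about 45% sooner, in line with the figure. The actual saving depends on the real ratio of load time to execute time and on the number of layers.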