DLRover-RM — elastic resource optimization
How runtime-adaptive scaling, seamless migration, and OOM prevention cut DLRM training cost in the cloud.
Elastic workers · seamless migration · OOM prevention
Resource Manager
Workers: W1 (active), W2 (active), W3 (active), W4 (added), W5 (active)
The resource manager continuously monitors runtime metrics and elastically
adds or removes workers (W4 in the diagram) to keep utilization high.
When a worker fails (W3 in the diagram), it is migrated seamlessly, without a
full job restart, thanks to flash checkpointing. As memory usage approaches the
OOM limit, the manager adjusts the allocation in advance to prevent crashes.
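
One monitoring tick of such a control loop could be sketched roughly as follows. This is an illustrative toy and not DLRover's actual API: `WorkerMetrics`, `plan_actions`, the action names, and every threshold are hypothetical, chosen only to show the three behaviors (migrating a failed worker, raising memory before the OOM limit, and scaling workers on CPU utilization) side by side.

```python
from dataclasses import dataclass


@dataclass
class WorkerMetrics:
    """Hypothetical per-worker snapshot; not a DLRover data structure."""
    name: str
    cpu_util: float       # fraction of allocated CPU in use, 0.0 to 1.0
    mem_used_gb: float
    mem_limit_gb: float
    healthy: bool = True


def plan_actions(workers, high_cpu=0.85, low_cpu=0.30, mem_headroom=0.90):
    """Return (action, worker_name) decisions for one monitoring tick.

    All thresholds are illustrative assumptions, not values from DLRover-RM.
    """
    actions = []
    healthy = [w for w in workers if w.healthy]

    # Failed workers are migrated one by one instead of restarting the job.
    for w in workers:
        if not w.healthy:
            actions.append(("migrate", w.name))

    # Raise the memory allocation before usage reaches the limit.
    for w in healthy:
        if w.mem_used_gb >= mem_headroom * w.mem_limit_gb:
            actions.append(("grow_memory", w.name))

    # Elastic scaling on the average CPU utilization of healthy workers.
    if healthy:
        avg_cpu = sum(w.cpu_util for w in healthy) / len(healthy)
        if avg_cpu > high_cpu:
            actions.append(("add_worker", "new"))
        elif avg_cpu < low_cpu and len(healthy) > 1:
            idlest = min(healthy, key=lambda w: w.cpu_util)
            actions.append(("remove_worker", idlest.name))
    return actions


if __name__ == "__main__":
    snapshot = [
        WorkerMetrics("W1", 0.92, 10.0, 16.0),
        WorkerMetrics("W2", 0.90, 15.0, 16.0),
        WorkerMetrics("W3", 0.00, 0.0, 16.0, healthy=False),
        WorkerMetrics("W5", 0.88, 9.0, 16.0),
    ]
    for action in plan_actions(snapshot):
        print(action)
```

With this made-up snapshot, the plan migrates W3, grows memory for W2 (which is above 90% of its limit), and requests one additional worker because the healthy workers are CPU-saturated. A real manager would act on these decisions asynchronously and rely on checkpoints to restore the migrated worker's state; that part is omitted here.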
Result — shorter job completion time
Static allocation — over-provisioned, stalls on failure
DLRover-RM — elastic, ~31% faster, higher completion rate
Legend: active worker; migrating worker (seamless); worker failure / OOM zone; CPU utilization (high); memory pressure / low utilization