
DLRover-RM — elastic resource optimization

How runtime-adaptive scaling, seamless migration, and OOM prevention cut DLRM training cost in the cloud.

Elastic workers · seamless migration · OOM prevention

[Diagram: the Resource Manager oversees workers W1 to W5 (W1, W2, W3, and W5 active; W4 added) and tracks two gauges, CPU utilization and memory pressure.]

The resource manager continuously monitors runtime metrics and elastically adds or removes workers (W4 in the diagram) to keep utilization high. When a worker fails (W3 flashes red), that worker is migrated seamlessly, with no full restart, thanks to flash checkpointing. As memory pressure approaches the OOM limit, the manager adjusts the memory allocation in advance to prevent crashes.
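The decision logic described above can be sketched as a single planning step. This is a minimal illustration, not DLRover-RM's actual algorithm: the `Metrics` fields, the `plan` function, and every threshold (`low`, `high`, `oom_margin`, `growth`, `max_workers`) are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    cpu_util: float      # mean CPU utilization across workers, 0.0 to 1.0
    mem_used_gb: float   # memory in use on the most loaded worker
    mem_limit_gb: float  # current per-worker memory allocation


def plan(workers: int, m: Metrics, low: float = 0.5, high: float = 0.9,
         oom_margin: float = 0.85, growth: float = 1.5,
         max_workers: int = 64) -> dict:
    """Return the next allocation. All thresholds are illustrative assumptions."""
    target = workers
    if m.cpu_util > high and workers < max_workers:
        target = workers + 1   # workers are saturated: scale out
    elif m.cpu_util < low and workers > 1:
        target = workers - 1   # over-provisioned: scale in

    mem_limit = m.mem_limit_gb
    if m.mem_used_gb / m.mem_limit_gb >= oom_margin:
        # Memory is close to the limit: raise the allocation before an OOM.
        mem_limit = m.mem_limit_gb * growth

    return {"workers": target, "mem_limit_gb": mem_limit}


print(plan(4, Metrics(cpu_util=0.95, mem_used_gb=14.0, mem_limit_gb=16.0)))
# → {'workers': 5, 'mem_limit_gb': 24.0}
```

A real manager would run a step like this on a timer and would likely smooth the metrics over a window, so that a single noisy sample does not trigger a scaling action.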

Result: shorter job completion time (JCT)

Static allocation (over-provisioned, stalls on failure): 100% (baseline JCT)
DLRover-RM (elastic, higher completion rate): ~69% of baseline JCT, i.e. ~31% shorter
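The two figures above describe the same measurement from different angles. The short calculation below shows how a JCT of ~69% of baseline corresponds to a ~31% reduction in completion time; the derived throughput speedup is plain arithmetic on the rounded figure, not a number reported by the source.

```python
baseline_jct = 1.00   # static allocation, normalized to 100%
dlrover_jct = 0.69    # DLRover-RM, ~69% of baseline (rounded figure from the chart)

reduction = 1 - dlrover_jct / baseline_jct   # fraction of completion time saved
speedup = baseline_jct / dlrover_jct         # equivalent throughput ratio

print(f"JCT reduction: {reduction:.0%}")   # → JCT reduction: 31%
print(f"Speedup: {speedup:.2f}x")          # → Speedup: 1.45x
```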
Legend: active worker; migrating worker (seamless); worker failure / OOM zone; CPU utilization (high); memory pressure / low utilization.