
DLRover-RM — elastic resource optimization

How runtime-adaptive scaling, seamless migration, and OOM prevention cut DLRM training cost in the cloud.

Elastic workers · seamless migration · OOM prevention

[Diagram: a Resource Manager overseeing workers W1 (active), W2 (active), W3 (active), W4 (added), and W5 (active), with gauges for CPU utilization and memory pressure.]

The resource manager continuously monitors runtime metrics such as CPU utilization and memory pressure. It elastically adds or removes workers (W4 in the diagram is a newly added worker) to keep utilization high. When a worker fails (W3 in the diagram), it is migrated seamlessly, without a full job restart, thanks to flash checkpointing. As memory pressure approaches the OOM limit, the manager adjusts the allocation ahead of time to prevent crashes.
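The control loop described above can be sketched in a few lines of Python. This is a minimal illustration, not DLRover-RM's actual implementation: the `WorkerMetrics` type, the `plan_actions` function, and all thresholds (`CPU_HIGH`, `CPU_LOW`, `MEM_HEADROOM`, `MEM_GROWTH`) are hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerMetrics:
    """Hypothetical per-worker snapshot of runtime metrics."""
    name: str
    cpu_util: float       # fraction of allocated CPU in use, 0.0 to 1.0
    mem_used_gb: float
    mem_limit_gb: float
    healthy: bool = True


# Assumed thresholds, for illustration only.
CPU_HIGH = 0.85      # workers are saturated: scale out
CPU_LOW = 0.40       # workers are mostly idle: scale in
MEM_HEADROOM = 0.90  # act once usage reaches 90% of the memory limit
MEM_GROWTH = 1.5     # factor applied to the limit when pre-adjusting


def plan_actions(workers):
    """Return a list of actions for one monitoring tick."""
    actions = []
    for w in workers:
        if not w.healthy:
            # Failed worker: migrate it instead of restarting the whole job.
            actions.append(("migrate", w.name))
        elif w.mem_used_gb >= MEM_HEADROOM * w.mem_limit_gb:
            # Close to the OOM limit: raise the allocation before a crash.
            actions.append(
                ("raise_memory", w.name, round(w.mem_limit_gb * MEM_GROWTH, 1))
            )

    healthy = [w for w in workers if w.healthy]
    if healthy:
        avg_cpu = sum(w.cpu_util for w in healthy) / len(healthy)
        if avg_cpu > CPU_HIGH:
            actions.append(("add_worker",))
        elif avg_cpu < CPU_LOW and len(healthy) > 1:
            actions.append(("remove_worker",))
    return actions


if __name__ == "__main__":
    snapshot = [
        WorkerMetrics("W1", cpu_util=0.92, mem_used_gb=10, mem_limit_gb=16),
        WorkerMetrics("W2", cpu_util=0.90, mem_used_gb=15, mem_limit_gb=16),
        WorkerMetrics("W3", cpu_util=0.0, mem_used_gb=0, mem_limit_gb=16, healthy=False),
        WorkerMetrics("W5", cpu_util=0.95, mem_used_gb=9, mem_limit_gb=16),
    ]
    for action in plan_actions(snapshot):
        print(action)
```

In this snapshot the sketch raises W2's memory limit, migrates the failed W3, and requests one extra worker because the healthy workers are saturated. A real manager would act on metric trends over time rather than on a single snapshot.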

Result — shorter job completion time

Static allocation (over-provisioned, stalls on failure): 100% (baseline JCT)

DLRover-RM (elastic, higher completion rate): ~69% of baseline JCT, i.e. ~31% shorter
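The two figures above are the same measurement stated two ways. As a quick check of the arithmetic, the snippet below converts a relative JCT into a percentage reduction and an equivalent speedup factor. The function name `jct_summary` is hypothetical, and the 0.69 input is simply the rounded figure quoted above.

```python
def jct_summary(relative_jct):
    """Convert a JCT expressed as a fraction of baseline into
    (fractional reduction, speedup factor)."""
    reduction = 1.0 - relative_jct
    speedup = 1.0 / relative_jct
    return reduction, speedup


reduction, speedup = jct_summary(0.69)
print(f"JCT reduction: {reduction:.0%}, speedup: {speedup:.2f}x")
# prints "JCT reduction: 31%, speedup: 1.45x"
```

Note that a 31% shorter completion time corresponds to finishing about 1.45 times as fast, not 1.31 times.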

Legend: active worker · migrating worker (seamless) · worker failure / OOM zone · CPU utilization (high) · memory pressure / low utilization