ZenFlow: Stall-Free Offloading Training via Asynchronous Updates
Async updates kill the offloading stall — train bigger models on the same GPUs.
PhD candidate at DS² Lab, University of Virginia, with Yue Cheng; also working closely with Juncheng Yang at Harvard. Building systems across the LLM stack — training, inference, storage — with a soft spot for the I/O bottleneck. When I'm not chasing throughput, I'm chasing bebop.
Pipeline-parallel LoRA training across multiple GPUs.
Resource autoscaling that understands recommendation training workloads.
A tensor-centric storage layer for AI model hubs — compressing checkpoints by exploiting their internal structure.
Synergistic dedup + compression tuned to how LLM weights actually look on disk.
Adapt the serving stack at runtime — swap layers, resize KV cache, ride the workload.
Cold start is no longer a death sentence for serverless LLMs.
SLO-aware LLM serving — TTFT/TPOT guards with credit-based batching for workloads with heterogeneous deadlines.
First benchmark for text-to-infographic generation — 600 tests across 30 infographic types, automated reliability checks via atomic yes/no questions.
Human-agent system for interactive educational documents — multi-agent pipeline (Planner / Executor / Evaluator) plus a human-readable DocSpec IR.
Stall-free async offloading for LLM training. Integrated into DeepSpeed via official PR.
Tensor-centric storage layer for AI model hubs.
Model-aware dedup + compression tuned to how LLM weights actually look on disk.
Fast scaling for serverless LLM inference.
First benchmark for text-to-infographic generation reliability.