
MorphServe — runtime-adaptive LLM serving

How quantized layer swapping and KV cache resizing keep SLOs intact under load spikes.

Mechanism — layer swapping + KV cache resizing

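[Figure: a stack of six transformer layers, a KV cache bar, and a memory-pressure gauge (LO to HI), shown under LOW PRESSURE and HIGH PRESSURE. Layers 2 and 5 switch between FP16 and INT8; the other four stay FP16. A P95 TTFT trace is plotted against the SLO limit.]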

As memory pressure spikes, MorphServe swaps low-impact layers from FP16 to INT8 (drawn amber and slimmer in the figure) and shrinks the KV cache to free capacity. Both are restored when pressure drops. Throughout the spike, P95 TTFT stays well inside the SLO threshold, with ~92% fewer SLO violations than full-precision serving.

Legend: FP16 layer (full precision) · INT8 layer (quantized swap) · KV cache capacity · Memory pressure · P95 TTFT (stays in SLO)
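The behavior in the figure can be read as a small control loop driven by memory pressure. The sketch below is only an illustration of that reading, not MorphServe's actual implementation: the names `Layer`, `ServerState`, and `step`, the per-layer `sensitivity` score, the `HIGH`/`LOW` thresholds, and the `KV_STEP` block size are all assumptions made for the example.

```python
from dataclasses import dataclass

# Assumed thresholds on memory pressure (fraction of memory in use).
# The gap between them acts as a hysteresis band so the server does not flap.
HIGH, LOW = 0.85, 0.60
KV_STEP = 64  # assumed number of KV cache blocks released or reclaimed per step


@dataclass
class Layer:
    index: int
    sensitivity: float       # hypothetical score: higher = quantizing hurts more
    precision: str = "FP16"


@dataclass
class ServerState:
    layers: list
    kv_blocks: int           # current KV cache capacity, in blocks
    kv_blocks_max: int
    kv_blocks_min: int


def step(state: ServerState, pressure: float) -> ServerState:
    """Apply one control step for the observed memory pressure."""
    if pressure >= HIGH:
        # Under high pressure: quantize the least sensitive FP16 layer...
        fp16 = [l for l in state.layers if l.precision == "FP16"]
        if fp16:
            min(fp16, key=lambda l: l.sensitivity).precision = "INT8"
        # ...and shrink the KV cache, never going below the floor.
        state.kv_blocks = max(state.kv_blocks_min, state.kv_blocks - KV_STEP)
    elif pressure <= LOW:
        # Under low pressure: restore the most sensitive INT8 layer first...
        int8 = [l for l in state.layers if l.precision == "INT8"]
        if int8:
            max(int8, key=lambda l: l.sensitivity).precision = "FP16"
        # ...and grow the KV cache back toward its full size.
        state.kv_blocks = min(state.kv_blocks_max, state.kv_blocks + KV_STEP)
    # Between LOW and HIGH nothing changes.
    return state


if __name__ == "__main__":
    # Six layers, with layers 2 and 5 (indices 1 and 4) given the lowest
    # sensitivity so they are the ones that get swapped, as in the figure.
    scores = [0.9, 0.1, 0.8, 0.7, 0.2, 0.95]
    state = ServerState(
        layers=[Layer(i, s) for i, s in enumerate(scores)],
        kv_blocks=256, kv_blocks_max=256, kv_blocks_min=128,
    )
    for pressure in [0.5, 0.9, 0.9, 0.7, 0.5, 0.5]:
        step(state, pressure)
        print(pressure, [l.precision for l in state.layers], state.kv_blocks)
```

Using two thresholds rather than one is a design choice of this sketch: it keeps layers from being swapped back and forth when pressure hovers near a single cutoff. A real system would also need to account for the cost of each swap and for requests already holding KV cache blocks.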