
mLoRA — pipeline-parallel LoRA fine-tuning

Fine-tuning many LoRA adapters concurrently on multiple GPUs with no idle bubbles.

[Figure: Pipeline comparison, naive sequential vs. mLoRA interleaved, for adapters A–D on GPU 0 through GPU 3. A shared frozen base model is loaded once and reused by all adapters. Top panel (naive / sequential): adapters run one at a time and GPU stages sit idle (bubbles). Bottom panel (mLoRA): adapters are interleaved, bubbles are filled, and all GPUs stay busy.]

In the naive schedule, each adapter's four pipeline stages run one at a time, so at every step only one GPU is working and the other three sit idle (bubbles). mLoRA's LoRA-aware pipeline interleaves adapters A–D so that every GPU executes a stage at every time step, which eliminates the bubbles and cuts average fine-tuning time by ~30%.
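The difference between the two schedules can be illustrated with a toy model. The sketch below is not mLoRA's scheduler: it assumes four equal-length stages, one stage per GPU, and a steady-state interleaving in which GPU g runs adapter (t - g) mod 4 at step t, which is the pattern the figure shows. The function names are hypothetical, and the model ignores warm-up, backward passes, and communication, so it does not reproduce the ~30% figure.

```python
ADAPTERS = ["A", "B", "C", "D"]
NUM_GPUS = 4  # one pipeline stage per GPU (assumption of this sketch)


def naive_schedule(adapters, num_gpus):
    """grid[gpu][t] holds the adapter running on that GPU at step t, or None for a bubble.

    Adapters run strictly one after another; within an adapter, stage s runs at
    step s, so only one GPU is busy at any step.
    """
    steps = len(adapters) * num_gpus
    grid = [[None] * steps for _ in range(num_gpus)]
    for k, name in enumerate(adapters):
        for stage in range(num_gpus):
            grid[stage][k * num_gpus + stage] = name
    return grid


def interleaved_schedule(adapters, num_gpus, steps):
    """Steady-state interleaving: GPU g runs adapter (t - g) mod n at step t."""
    n = len(adapters)
    return [
        [adapters[(t - gpu) % n] for t in range(steps)]
        for gpu in range(num_gpus)
    ]


def utilization(grid):
    """Fraction of (GPU, step) cells that are doing work rather than idling."""
    cells = [cell for row in grid for cell in row]
    return sum(cell is not None for cell in cells) / len(cells)
```

Calling `utilization(naive_schedule(ADAPTERS, NUM_GPUS))` gives 0.25, matching the "three of four GPUs idle" description, while `utilization(interleaved_schedule(ADAPTERS, NUM_GPUS, 16))` gives 1.0 under these simplifying assumptions.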

Figure legend: Adapter A, Adapter B, Adapter C, Adapter D, pipeline bubble (idle), shared frozen base model.