Balancing Latency and Throughput of Distributed Inference by Interleaved Parallelism
Description
Distributed large-model inference still faces a dilemma in balancing latency and throughput, or equivalently cost and performance. Tensor parallelism can reduce latency, but it incurs substantial communication cost. Conversely, pipeline parallelism excels at throughput but falls short in minimizing execution time.

To address this challenge, we introduce a novel solution: interleaved parallelism. This approach interleaves computation and communication across requests. Our runtime system uses GPU scheduling techniques to overlap communication and computation kernels, enabling this new form of parallelism for distributed large-model inference. Extensive evaluations show that our proposal outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput in most cases.
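The abstract does not include implementation details, but the core mechanism it relies on, overlapping one request's communication with another request's computation, can be illustrated with standard CUDA streams. The following is a minimal sketch under that assumption; `gemm_kernel`, the buffer names, and the use of an async copy as a stand-in for an inter-GPU collective are all illustrative, not the authors' runtime system.

```cuda
// Hypothetical sketch: overlap batch A's compute with batch B's
// communication using two CUDA streams. In a real distributed setup the
// async copy would be a collective (e.g., an NCCL all-reduce).
#include <cuda_runtime.h>
#include <cstdio>

__global__ void gemm_kernel(const float* in, float* out, int n) {
    // Stand-in for a transformer layer's compute work (e.g., a GEMM).
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *a_in, *a_out, *b_dev, *b_host;
    cudaMalloc(&a_in, n * sizeof(float));
    cudaMalloc(&a_out, n * sizeof(float));
    cudaMalloc(&b_dev, n * sizeof(float));
    cudaMallocHost(&b_host, n * sizeof(float));  // pinned memory enables async copy

    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);

    // Batch A computes on one stream while batch B's data moves on another;
    // the GPU scheduler can run the two concurrently.
    gemm_kernel<<<(n + 255) / 256, 256, 0, compute>>>(a_in, a_out, n);
    cudaMemcpyAsync(b_host, b_dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, comm);

    cudaStreamSynchronize(compute);
    cudaStreamSynchronize(comm);
    printf("compute and communication overlapped across streams\n");

    cudaStreamDestroy(compute);
    cudaStreamDestroy(comm);
    cudaFree(a_in); cudaFree(a_out); cudaFree(b_dev); cudaFreeHost(b_host);
    return 0;
}
```

Interleaving across requests generalizes this pattern: while one request's layer output is being exchanged between GPUs, another request's layer is computed, keeping both the compute units and the interconnect busy.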
Event Type
Posters
Research Posters
Time
Tuesday, 14 November 2023, 10am - 5pm MST
Registration Categories
TP
XO/EX