BLAD: Adaptive Load Balanced Scheduling and Operator Overlap Pipeline for Accelerating the Dynamic GNN Training
Description
Dynamic graph neural networks are widely used for learning time-evolving graphs, but prior approaches to training them are inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced through careful workload scheduling, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.

We propose BLAD, a system that exploits both observations, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler first allocates each group of snapshots to a GPU, alleviating cross-GPU communication. The snapshots within a group are then carefully assigned to processes on that GPU, enabling compute-intensive NN operators to overlap with memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize this overlap. Experiments show that BLAD achieves a 27.2% average speedup in training time over state-of-the-art solutions without affecting final accuracy.
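For intuition, here is a minimal sketch of the two-level scheduling idea described in the abstract, assuming each snapshot carries a scalar load estimate. The names (Snapshot, assign_groups_to_gpus, pair_for_overlap) and heuristics (greedy longest-processing-time placement, alternating split) are illustrative assumptions, not BLAD's actual algorithm.

```python
# Hypothetical sketch of two-level snapshot scheduling; not BLAD's real API.
from dataclasses import dataclass
import heapq

@dataclass
class Snapshot:
    sid: int
    load: float  # assumed per-snapshot cost estimate (e.g., edge count)

def assign_groups_to_gpus(groups, num_gpus):
    """Level 1: greedily place snapshot groups on the least-loaded GPU
    (longest-processing-time first), so temporally adjacent snapshots
    share a device and cross-GPU communication is avoided."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]  # (total load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for group in sorted(groups, key=lambda g: -sum(s.load for s in g)):
        total, gpu = heapq.heappop(heap)
        placement[gpu].append(group)
        heapq.heappush(heap, (total + sum(s.load for s in group), gpu))
    return placement

def pair_for_overlap(group):
    """Level 2: split a group across two processes on the same GPU so
    compute-intensive NN operators of one snapshot can overlap with
    memory-intensive graph operators of another."""
    ordered = sorted(group, key=lambda s: s.load)
    return ordered[0::2], ordered[1::2]  # two co-scheduled process queues

# Toy usage: 4 groups of 4 snapshots spread over 2 GPUs.
groups = [[Snapshot(i, 1.0 + i % 3) for i in range(j, j + 4)]
          for j in range(0, 16, 4)]
print(assign_groups_to_gpus(groups, num_gpus=2))
```

The greedy heap-based placement keeps per-GPU load roughly balanced; the actual system presumably uses a more refined cost model and overlap-aware ordering than this alternating split.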
Event Type
Paper
Time
Tuesday, 14 November 2023, 3:30pm - 4pm MST
Location
301-302-303
Tags
Artificial Intelligence/Machine Learning
Registration Categories
TP