
Presentation

Optimizing MPI Collectives on Shared Memory Multi-Cores
Description
Collective communication operations, such as broadcasts and reductions, are frequent performance bottlenecks in Message Passing Interface (MPI) programs. As the number of processor cores integrated into CPUs grows, it is increasingly common to run multiple MPI processes on a shared-memory machine to exploit hardware parallelism. In this context, optimizing MPI collective communication for shared-memory execution is crucial. This paper identifies two primary limitations of existing MPI collective implementations on shared-memory systems: extensive redundant data movement when performing reduction collectives, and ineffective use of non-temporal instructions for streamed data processing. To address these limitations, we propose two optimization techniques designed to minimize data movement and enhance the use of non-temporal instructions. We integrate our optimizations into OpenMPI and evaluate them through micro-benchmarks and real-world application tests on two multi-core clusters. Experiments show that our approach significantly outperforms existing techniques by 1.2-6.4x.
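To make the second optimization target concrete, the sketch below illustrates what "non-temporal instructions for streamed data" means on x86: copying a large, write-once buffer with streaming stores so the destination bypasses the cache hierarchy and avoids polluting it. This is an illustrative example only, not the paper's implementation; the function name `copy_streamed` and the SSE2 intrinsics shown are assumptions chosen for portability on x86-64.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

/* Illustrative sketch (not the paper's code): copy `bytes` bytes from src to
 * dst using non-temporal stores. The destination must be 16-byte aligned and
 * `bytes` a multiple of 16; a real implementation would handle the edges. */
static void copy_streamed(void *dst, const void *src, size_t bytes) {
    __m128i *d = (__m128i *)dst;
    const __m128i *s = (const __m128i *)src;
    for (size_t i = 0; i < bytes / sizeof(__m128i); ++i) {
        __m128i v = _mm_loadu_si128(s + i);  /* ordinary cached load */
        _mm_stream_si128(d + i, v);          /* non-temporal (streaming) store */
    }
    _mm_sfence();  /* order streamed stores before subsequent accesses */
}
```

Streaming stores pay off when the destination will not be read again soon, which is typical of the intermediate copies in large-message collectives; for small buffers that stay cache-resident, ordinary stores are usually faster.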
Event Type
Paper
Time
Tuesday, 14 November 2023, 3:30pm - 4pm MST
Location
403-404
Tags
Distributed Computing
Message Passing
Programming Frameworks and System Software
Registration Categories
TP
Award Finalists
Best Student Paper Finalist
Reproducibility Badges