OSU INAM: A Profiling and Visualization Tool for High-Performance GPU-enabled HPC Clusters
Time: Tuesday, June 23rd, 2:45pm - 2:50pm
Description: As heterogeneous computing (CPUs, GPUs, etc.) and networking (NVLinks, X-Bus, etc.) hardware continue to advance, it becomes increasingly essential and challenging to understand the interactions between High-Performance Computing (HPC) and Deep Learning applications/frameworks, the communication middleware they rely on, the underlying communication fabric that this middleware depends on, and the schedulers that manage HPC clusters. Such understanding will enable application developers/users, system administrators, and middleware developers to maximize the efficiency and performance of the individual components that comprise a modern HPC system and to solve different grand challenge problems. Moreover, determining the root cause of performance degradation is complex for the domain scientist. The scale of emerging HPC clusters further exacerbates the problem and brings new challenges in gathering, storing, and visualizing the information in real time. These issues lead to the following broad challenge: How can we design a tool that enables an in-depth understanding of the communication traffic on the interconnect and GPU through tight integration with the MPI runtime at scale? OSU INAM is a tool designed to address this challenge. It can profile a cluster of more than 2000 nodes at sub-second granularity and is being deployed at various large-scale supercomputer centers.