Presentation

Enabling Performance for NGC Containers on the Slingshot 11 Interconnect
Description
Containers based on NVIDIA GPU Cloud (NGC) images have become increasingly popular for deploying optimized software on NVIDIA GPUs, particularly for ML/AI frameworks and models. However, the software stack within NGC images lacks the components needed to interact with the HPE Slingshot 11 interconnect, a high-speed network used in some of the world's most powerful supercomputers. This gap makes it harder to run containers efficiently for this combination of systems and use cases.

This presentation shares insights into the process of enabling NGC-based containers to use Slingshot 11. It covers key elements for optimizing application performance, including NCCL collective communication, the libfabric communication framework, and GPUDirect RDMA. It also presents quantitative results from synthetic benchmarks that measure communication bandwidth and deep learning performance using the PyTorch framework.
Event Type
Workshop
Time
Monday, 13 November 2023, 9:50am - 9:55am MST
Location
607
Registration Categories
W