Presentation

Programming Model for Habana/Gaudi2 Accelerators and Its Impact on Deep Learning Inference/Training Performance at Scale
Description
I will discuss the multi-stream-based execution environment of Habana/Gaudi systems that is exposed to deep learning frameworks, and I will show how one can combine compute, networking, and DMA at high performance and with low run-time overheads. I will highlight the performance of the Habana Collective Communication Library at scale in terms of bandwidth and message rate, and demonstrate its impact on the deep learning training and inference performance of a few neural network models, including vision models and large language models. In the second part of the talk, I will highlight the challenges in communication scaling, especially the congestion that we observe between leaf and spine switches under certain conditions. I will describe the solutions that we are currently deploying, including congestion control algorithms and packet/message spraying techniques at the endpoint, and share our results.
Event Type
Workshop
Time
Monday, 13 November 2023, 2pm - 2:30pm MST
Location
708
Tags
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
Registration Categories
W