Presentation

Scaling Up to 32 GPUs to a Single Node Without Changing a Single Line of Code
Description
This technical deep dive demonstrates scaling an application to as many as 32 accelerators in a single node, which until now was only possible on a supercomputer. No changes to the application software are needed for either HPC or AI workloads, saving users considerable time and effort in porting software.

This new capability was made possible by a deep integration between the engineering teams of AMD and GigaIO. It utilizes off-the-shelf servers and GPUs connected over GigaIO’s native PCIe memory fabric, which provides the same performance and latency as if those accelerators were housed within the server sheet metal.

This talk will cover the steps to create this first-of-its-kind server, the GigaIO SuperNODE, including how to identify and resolve issues that prevent the enumeration of large numbers of GPUs, such as hardcoded limits within ROCm, physical address bit inconsistencies between CPUs (Milan, Genoa) and GPUs, and memory address issues in the VBIOS.
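
The physical address bit inconsistency mentioned above can be illustrated with a minimal sketch. The bit widths used here are hypothetical placeholders, not the actual values for Milan, Genoa, or any GPU; the point is only that an address must fit within the narrower of the two widths to be reachable from both sides.

```python
def shared_address_bits(cpu_bits: int, gpu_bits: int) -> int:
    """Address width usable by both sides: the narrower of the two."""
    return min(cpu_bits, gpu_bits)


def is_reachable(address: int, address_bits: int) -> bool:
    """True if the address can be expressed in the given number of bits."""
    return 0 <= address < (1 << address_bits)


# Hypothetical widths chosen purely for illustration.
CPU_BITS = 52
GPU_BITS = 48

usable_bits = shared_address_bits(CPU_BITS, GPU_BITS)

# An address the CPU can express but that lies above the GPU's range.
high_address = 1 << 50
cpu_ok = is_reachable(high_address, CPU_BITS)
both_ok = is_reachable(high_address, usable_bits)
print(f"usable bits: {usable_bits}, CPU can reach: {cpu_ok}, both can reach: {both_ok}")
```

Under these assumed widths, a resource mapped at the high address would be visible to the CPU yet unreachable by the device, which is the kind of mismatch that has to be resolved before a large number of GPUs can be enumerated.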

GigaIO will demonstrate how frameworks such as PyTorch and TensorFlow “just work” when run on this all-AMD system, without changing a single line of code. The plug-and-play nature of this solution opens new possibilities for generative AI and machine learning workloads, especially given the current availability constraints on GPUs.
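
As a rough illustration of what "just work" means in practice, the sketch below lists the accelerators a framework can see using only standard PyTorch calls. It is not taken from the talk. ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` interface, so the same script would be expected to report however many devices the node enumerates. The helper name `visible_accelerators` is hypothetical.

```python
def visible_accelerators() -> list[str]:
    """Return the names of all accelerators PyTorch can see.

    Returns an empty list when PyTorch is not installed or when
    no devices are visible.
    """
    try:
        import torch
    except ImportError:
        return []
    # device_count() is 0 when no GPU is visible to the framework.
    return [torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())]


names = visible_accelerators()
print(f"{len(names)} accelerator(s) visible")
for index, name in enumerate(names):
    print(f"  device {index}: {name}")
```

Nothing in the script refers to the fabric or to the number of devices, which is the sense in which application code stays unchanged.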

Limitations encountered include the need for server vendors to be willing to modify their server BIOS to accommodate the unexpected number of PCIe endpoints and to support dynamic allocation of resources. As such, this solution is only available on selected platforms from server vendors who have undertaken that effort. Other limiting factors include the total number of PCIe bus IDs and the available MMIO space.
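
The bus ID and MMIO limits can be made concrete with a small back-of-the-envelope sketch. A PCIe bus number is 8 bits wide, so a single segment has at most 256 buses. Every other figure below (buses consumed per GPU, buses reserved for the rest of the system, BAR size, aperture size) is a hypothetical assumption for illustration, not a measurement from the SuperNODE.

```python
PCIE_BUSES_PER_SEGMENT = 256  # PCIe bus numbers are 8 bits wide
GIB = 1 << 30


def fits_in_bus_range(num_gpus: int, buses_per_gpu: int, buses_reserved: int) -> bool:
    """True if all GPUs plus reserved buses fit in one PCIe segment."""
    return buses_reserved + num_gpus * buses_per_gpu <= PCIE_BUSES_PER_SEGMENT


def mmio_window_bytes(num_gpus: int, bar_bytes_per_gpu: int) -> int:
    """Total MMIO space needed if every GPU maps one BAR of the given size."""
    return num_gpus * bar_bytes_per_gpu


def fits_in_aperture(num_gpus: int, bar_bytes_per_gpu: int, aperture_bytes: int) -> bool:
    """True if the combined BARs fit in the MMIO aperture the BIOS provides."""
    return mmio_window_bytes(num_gpus, bar_bytes_per_gpu) <= aperture_bytes


# Hypothetical figures: 32 GPUs, 64 GiB BAR each, 1 TiB MMIO aperture.
window = mmio_window_bytes(32, 64 * GIB)
print(f"MMIO needed: {window // GIB} GiB")
print("fits in 1 TiB aperture:", fits_in_aperture(32, 64 * GIB, 1024 * GIB))
print("bus IDs fit (4 per GPU, 64 reserved):", fits_in_bus_range(32, 4, 64))
print("bus IDs fit (8 per GPU, 64 reserved):", fits_in_bus_range(32, 8, 64))
```

With these assumed numbers the GPUs would need 2 TiB of MMIO space, twice the assumed aperture, which shows why BIOS support for larger and dynamically allocated resource windows matters.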
Event Type
Exhibitor Forum
Time
Tuesday, 14 November 2023
1:30pm - 2pm MST
Location
503-504
Tags
Accelerators
Artificial Intelligence/Machine Learning
Registration Categories
TP
XO/EX