Presentation

Supercluster-Scale ML Training with Oracle Cloud Infrastructure
Description
We have seen a substantial increase in the use of Oracle Cloud Infrastructure (OCI) for training large language models (LLMs), as more and more startups and established companies seek to gain an edge with increasingly large and more accurate models. These models share a need for efficient GPU cluster computing: the ability to scale training to hundreds or thousands of GPUs for an extended period of time while maintaining performance and efficiency. Performance is crucial both at the level of the individual GPU and when scaling across the network. Scaling the training of these large models is complex and difficult to tune, and it requires a cost-effective infrastructure that can provide availability, resiliency, and performance at scale.

In this talk, we will discuss our approach to supporting the needs of these large language models, building on years of experience running HPC on bare-metal instances with a very low-latency network. We will present Oracle’s “SuperCluster”, which scales to thousands or tens of thousands of NVIDIA A100 and H100 GPUs with low latency and high inter-node bandwidth of up to 3,200 Gbps. This time-tested bare-metal instance platform is combined with intelligent job placement, locality awareness, and additional tuning to make ML work at the largest scales. Oracle’s SuperClusters have been rigorously tested on well-known public benchmarks such as Megatron, where they reach very high throughput, as well as on proprietary cutting-edge models that are commonly used in machine learning. We will show examples of use from various companies and discuss the challenges that were addressed to run these models at such scale. We will finish the presentation with a discussion of some of the open research problems that still need to be addressed in this area.
Event Type
Exhibitor Forum
Time
Tuesday, 14 November 2023
2:30pm - 3pm MST
Location
503-504
Tags
Accelerators
Artificial Intelligence/Machine Learning
Registration Categories
TP
XO/EX