
Presentation

Strong Scaling of State-of-the-Art LLM Inference with Groq Software-Scheduled Deterministic Networks
Description
In this talk, we will demonstrate Groq’s approach to synchronous, software-scheduled AI accelerator networks and showcase how we use it to unlock state-of-the-art performance and latency on Large Language Models (LLMs), including Llama-2 70B, scaled to over 500 GroqChip™ Language Processors™.

Traditional HPC systems and data centers use dynamic time- and space-sharing, where platforms dynamically coordinate the use of compute, memory, and network resources among threads or workloads. This is a natural solution for arbitrary compute workloads, whose unpredictability makes such mediation a prerequisite. Unfortunately, this results in compounding inefficiency and complexity at all layers of the stack: processor architecture, memory, networking, and more. Modern AI workloads, however, have a predictable structure allowing for efficient static scheduling of compute and network resources.
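To make that contrast concrete, the sketch below is illustrative only (not Groq software; the graph, op names, and latencies are invented): because a model's dataflow graph and per-op latencies are fully known ahead of time, every op can be given a fixed start cycle at compile time, with no runtime arbitration.

# Minimal sketch: ASAP list scheduling of a known dataflow graph.
# All names and latencies below are hypothetical.
from graphlib import TopologicalSorter

# Toy graph: op -> (latency_cycles, dependencies)
GRAPH = {
    "load_weights": (4, []),
    "load_activations": (2, []),
    "matmul": (8, ["load_weights", "load_activations"]),
    "bias_add": (1, ["matmul"]),
    "send_result": (3, ["bias_add"]),
}

def static_schedule(graph):
    """Assign each op a deterministic start cycle at compile time."""
    finish = {}
    schedule = {}
    order = TopologicalSorter({op: deps for op, (_, deps) in graph.items()}).static_order()
    for op in order:
        latency, deps = graph[op]
        start = max((finish[d] for d in deps), default=0)  # earliest cycle all inputs are ready
        schedule[op] = start
        finish[op] = start + latency
    return schedule

if __name__ == "__main__":
    for op, start in static_schedule(GRAPH).items():
        print(f"cycle {start:3d}: {op}")

Because nothing in the graph changes at run time, the same start cycles hold for every invocation, which is what allows network traffic to be planned with the same rigidity.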

Groq is turning this theory into practice by making components deterministic from the ground up, standing up large-scale synchronous compute platforms and empowering software to make more orchestration decisions statically. Unlike traditional networks, where packets can collide and congestion can develop, all traffic in the Groq network is completely pre-planned by Groq™ Compiler, with zero network collisions. This maximizes not only the utilization of the links but also the number of minimal paths that can be taken between chips. Deterministic compute and static orchestration do introduce new software and hardware challenges and co-optimization opportunities, which we will discuss in this talk.
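As a rough illustration of what "pre-planned with zero collisions" can mean (a toy sketch under invented assumptions, not Groq Compiler internals), the snippet below assigns each message a deterministic injection slot such that no two messages ever occupy the same link in the same time slot, so collisions are ruled out by construction rather than resolved at run time.

# Minimal sketch: compile-time reservation of (link, time-slot) pairs.
# Messages, paths, and slot granularity are hypothetical.

# Each message is a fixed sequence of directed links (hops) it traverses,
# advancing one hop per time slot.
MESSAGES = {
    "m0": [("A", "B"), ("B", "C")],
    "m1": [("A", "B"), ("B", "D")],
    "m2": [("B", "C"), ("C", "D")],
}

def plan_traffic(messages):
    """Greedily pick the earliest start slot at which every hop's link is free."""
    reserved = set()          # (link, slot) pairs already claimed by earlier messages
    plan = {}
    for name, path in messages.items():
        start = 0
        while any((link, start + hop) in reserved for hop, link in enumerate(path)):
            start += 1        # that start would double-book a link; try the next slot
        for hop, link in enumerate(path):
            reserved.add((link, start + hop))
        plan[name] = start
    return plan

if __name__ == "__main__":
    for name, start in plan_traffic(MESSAGES).items():
        print(f"{name} injected at slot {start}")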

Overcoming these challenges unlocks greater compute and power efficiency on AI workloads. Groq’s software-scheduled networks offer key advantages, including: (1) global network load balancing via compiler-driven network traffic scheduling; (2) high network bandwidth efficiency via low control overhead; and (3) low-latency chip-to-chip communication via a router-less, handshake-less direct topology. We showcase these advantages by demonstrating state-of-the-art performance on LLMs, including Llama-2 70B, scaled to over 500 Language Processors.
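The toy example below (a hypothetical 2D mesh, invented flows, and a simple greedy policy; not Groq's actual topology or routing) illustrates advantage (1): when the full traffic pattern is visible at compile time, each flow can be placed on whichever minimal path keeps the busiest link lightest, balancing load globally without adaptive routing hardware.

# Minimal sketch: compiler-style load balancing of flows over minimal paths.
from itertools import permutations
from collections import Counter

def minimal_paths(src, dst):
    """All minimal (shortest) paths between two nodes of a 2D mesh, as lists of links."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    moves = ["x"] * abs(dx) + ["y"] * abs(dy)
    paths = set()
    for order in set(permutations(moves)):
        pos, links = src, []
        for move in order:
            step = (1 if dx > 0 else -1, 0) if move == "x" else (0, 1 if dy > 0 else -1)
            nxt = (pos[0] + step[0], pos[1] + step[1])
            links.append((pos, nxt))
            pos = nxt
        paths.add(tuple(links))
    return [list(p) for p in paths]

def balance(flows):
    """Greedily route each flow on the minimal path that keeps the busiest link lightest."""
    load = Counter()
    routing = {}
    for flow, (src, dst) in flows.items():
        best = min(minimal_paths(src, dst),
                   key=lambda path: max(load[link] + 1 for link in path))
        for link in best:
            load[link] += 1
        routing[flow] = best
    return routing, load

if __name__ == "__main__":
    # Hypothetical flows on a 3x3 mesh, all crossing the same region of the fabric.
    flows = {"f0": ((0, 0), (2, 2)), "f1": ((0, 0), (2, 2)), "f2": ((0, 1), (2, 2))}
    routing, load = balance(flows)
    print("max link load:", max(load.values()))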
Event Type
Exhibitor Forum
Time
Thursday, 16 November 2023, 11am - 11:30am MST
Location
503-504
Tags
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
Registration Categories
TP
XO/EX