Presentation

Experiences Detecting Defective Hardware in Exascale Supercomputers
Description
In May 2022, the newest supercomputer to top the TOP500 list was Frontier at Oak Ridge National Laboratory, capable of more than 1.1 quintillion (1.1 × 10^18) floating-point calculations per second. Driving this ground-breaking rate of computing are Frontier's more than 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). At this scale, even the smallest margin of error may generate hundreds of errors across the system. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier's 9,408 compute nodes: the first uses the Slurm scheduler to scavenge available compute time to run the node screen; the second enforces a weekly screen of each node. Using June 2023 as a case study, we find that the first scheduling strategy consumed ten times the resources of the second, but successfully detected five hardware defects in Frontier.
Event Type
Workshop
Time
Friday, 17 November 2023, 9:10am - 9:35am MST
Location
503-504
Tags
Applications
Exascale
Heterogeneous Computing
Programming Frameworks and System Software
State of the Practice
Registration Categories
W