Presentation

Using Benford's Law to Identify Unusual Failure Regions
Description
Fault tolerance remains a key challenge for current high performance computing systems. Effective and efficient scheduling of mitigation methods continues to be a critical issue in the face of dynamic and difficult-to-predict error rates found on many systems. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method to determine if a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. While still in its initial stages, this work provides critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the scheduling of failure mitigation may be suboptimal and needs to be reevaluated due to the unusual pattern of errors that are occurring.
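
The abstract does not say which quantity is tested against Benford's Law or which statistic is used, so the following is only an illustrative sketch, not the authors' method. It assumes the tested quantity is the gap between consecutive failure timestamps, that each window is compared with the Benford distribution P(d) = log10(1 + 1/d) using a chi-square statistic, and that a window is flagged when the statistic exceeds the 0.05 critical value for 8 degrees of freedom. The function names, the window size of 50, and the threshold are all hypothetical choices made for this example.

```python
import math
from collections import Counter

# Benford's Law: P(d) = log10(1 + 1/d) for leading digit d in 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

# Chi-square critical value for 8 degrees of freedom at alpha = 0.05.
# Using this threshold is an assumption of the sketch, not taken from the abstract.
CHI2_CRIT_DF8_05 = 15.507


def leading_digit(x):
    """Return the first significant digit of a positive number."""
    # Scientific notation always starts with the first significant digit,
    # e.g. 0.00032 -> '3.200000000000000e-04'.
    return int(f"{x:.15e}"[0])


def benford_chi2(values):
    """Chi-square statistic of the leading digits of `values` against Benford's Law."""
    digits = [leading_digit(v) for v in values if v > 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one positive value")
    counts = Counter(digits)
    return sum((counts.get(d, 0) - n * p) ** 2 / (n * p) for d, p in BENFORD.items())


def unusual_windows(failure_times, window=50):
    """Yield (end_index, chi2, is_unusual) for each sliding window of inter-failure gaps.

    `failure_times` is a sorted sequence of timestamps (hypothetical input format).
    Zero-length gaps are dropped because they have no leading digit.
    """
    gaps = [b - a for a, b in zip(failure_times, failure_times[1:]) if b > a]
    for end in range(window, len(gaps) + 1):
        chi2 = benford_chi2(gaps[end - window:end])
        yield end, chi2, chi2 > CHI2_CRIT_DF8_05
```

Under these assumptions, a scheduler could iterate over `unusual_windows(timestamps)` as new failures arrive and revisit its mitigation schedule whenever the most recent window is flagged.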
Event Type
Workshop
Time
Sunday, 12 November 2023
4:26pm - 4:33pm MST
Location
605
Tags
Fault Handling and Tolerance
Large Scale Systems
Registration Categories
W