
Presentation

Disk Failure Trends in Alpine Storage System
Description
Large-scale HPC systems demand extensive disk-based storage for the data generated by HPC applications, which in turn requires scalable reliability, availability, and failure management. Failure data extracted from HPC storage offers valuable insights for preventing and managing failures: it helps in understanding storage robustness, guiding system design and deployment, and creating durable data protection schemes. This paper introduces a failure dataset from Alpine, the file system of OLCF’s Summit supercomputer, encompassing 4000+ events over 2.75 years from 32000+ disks. Before the analysis, we describe Alpine's components and introduce IBM Spectrum Scale technology; we then assess the collected data for failure distribution and burst correlations. We infer that proximity to enclosure fan modules heightens disk failure rates. Burst failure analysis also shows that one third of failures occur in bursts, with 90% of these not spatially correlated and impacting multiple racks.
Event Type
Workshop
Time
Sunday, 12 November 2023, 4:20pm - 4:26pm MST
Location
605
Tags
Fault Handling and Tolerance
Large Scale Systems
Registration Categories
W