Presentation

Recovery from Silent Data Corruption via Spatial Data Prediction
Description
High-performance computing applications are central to advancement in many fields of science and engineering, and that advancement depends on the assumed reliability of the HPC system. However, as system size grows and hardware components run at near-threshold voltages, transient upset events become more likely. Many works have explored the problem of detecting silent data corruption; however, recovery is often left to checkpoint-restart or application-specific techniques. Recovering from a checkpoint incurs overhead from reading the checkpoint and recomputing lost work. Allowing the application to recover only the corrupted data enables faster and more efficient recovery. This paper explores using spatial similarities to recover from silent data corruption. We explore several reconstruction methods and evaluate their effectiveness at recovering corrupted entries in data arrays. Results show that the Lorenzo 1-Layer prediction method yields the best results, with over half of its reconstructions having less than 1% relative error across all applications.
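The abstract names the Lorenzo 1-Layer prediction method but does not give its formula or implementation. As a rough illustration only, the sketch below assumes the commonly used 2D form of the Lorenzo predictor, which estimates an entry from its upper, left, and upper-left neighbors. The function names `lorenzo_1layer_2d` and `relative_error`, the choice of a 2D array, and the test field are hypothetical and are not taken from the paper, which may use a different dimensionality or neighbor set.

```python
def lorenzo_1layer_2d(grid, i, j):
    """Estimate grid[i][j] from three adjacent entries (assumed 2D Lorenzo form).

    prediction = up + left - up_left
    This is exact whenever the data varies linearly over the 2x2 neighborhood.
    """
    if i < 1 or j < 1:
        raise ValueError("entry needs a neighbor above and to the left")
    return grid[i - 1][j] + grid[i][j - 1] - grid[i - 1][j - 1]


def relative_error(predicted, actual):
    """Relative error of a reconstruction against the known-good value."""
    return abs(predicted - actual) / abs(actual)


# Hypothetical smooth field f(x, y) = 2x + 3y + 1 on a 4x4 grid.
grid = [[2.0 * x + 3.0 * y + 1.0 for y in range(4)] for x in range(4)]

true_value = grid[2][2]   # remember the correct value for comparison
grid[2][2] = 1.0e30       # simulate a silent corruption of one entry

# Recover only the corrupted entry from its spatial neighbors.
recovered = lorenzo_1layer_2d(grid, 2, 2)
grid[2][2] = recovered

print(recovered, relative_error(recovered, true_value))
```

On a linear field like this one the reconstruction matches the original value; on real application data the error depends on how smooth the array is around the corrupted entry, which is what the evaluation in the paper measures.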
Event Type
Workshop
Time
Sunday, 12 November 2023, 3:55pm - 4:20pm MST
Location
605
Tags
Fault Handling and Tolerance
Large Scale Systems
Registration Categories
W