Presentation

Dynamic Selective Protection of Sparse Iterative Solvers via ML Prediction of Soft Error Impacts
Description
Soft errors occur frequently on large computing platforms due to the increasing scale and complexity of HPC systems. Various resilience techniques have been proposed to protect scientific applications from soft errors. Among them, system-level replication often involves duplicating or triplicating the entire computation, resulting in high resilience overhead. This paper proposes dynamic selective protection for sparse iterative solvers, in particular for the Preconditioned Conjugate Gradient (PCG) solver, at the system level to reduce the resilience overhead. We leverage machine learning (ML) to predict the impact of soft errors that strike different elements of a key computation at different iterations of the solver. Based on the result of the prediction, we design a dynamic strategy to selectively protect those elements that result in a large performance degradation if struck by soft errors. An experimental evaluation demonstrates that our dynamic protection strategy reduces the resilience overhead compared to existing algorithms.
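The idea of protecting only selected elements at selected iterations can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the protected "key computation" is the matrix-vector product, uses a dense matrix and a Jacobi preconditioner for brevity, and replaces the ML model with a hypothetical stand-in predictor (`early_iterations_matter`). The names `protected_matvec`, `threshold`, and `fault` are likewise illustrative assumptions.

```python
# Illustrative sketch only. The predictor, threshold, and fault hook are
# assumptions made for this example, not the method from the paper.

def matvec_rows(A, x, rows):
    """Compute the selected rows of A @ x for a dense list-of-lists matrix."""
    return {i: sum(a * b for a, b in zip(A[i], x)) for i in rows}

def protected_matvec(A, x, protect, fault=None):
    """Compute A @ x, duplicating the work only for rows listed in `protect`.

    `fault` is an optional (row, delta) pair that corrupts the first
    computation, standing in for a transient soft error.
    """
    n = len(A)
    y = matvec_rows(A, x, range(n))
    if fault is not None:
        row, delta = fault
        y[row] += delta
    check = matvec_rows(A, x, protect)  # redundant copy, flagged rows only
    for i in protect:
        if y[i] != check[i]:
            # Mismatch detected: recompute the row (assumes the error is transient).
            y[i] = matvec_rows(A, x, [i])[i]
    return [y[i] for i in range(n)]

def early_iterations_matter(k, n):
    """Hypothetical stand-in for an ML model: score every element 1.0
    during the first two iterations and 0.0 afterwards."""
    return [1.0 if k < 2 else 0.0] * n

def pcg(A, b, predict_impact, threshold=0.5, tol=1e-10, max_iter=100):
    """Jacobi-preconditioned CG with per-iteration selective protection."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                  # residual for the zero initial guess
    z = [r[i] / A[i][i] for i in range(n)]    # apply the Jacobi preconditioner
    p = z[:]
    rz = sum(ri * zi for ri, zi in zip(r, z))
    for k in range(max_iter):
        scores = predict_impact(k, n)
        protect = [i for i in range(n) if scores[i] >= threshold]
        q = protected_matvec(A, p, protect)
        alpha = rz / sum(pi * qi for pi, qi in zip(p, q))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        if sum(ri * ri for ri in r) ** 0.5 < tol:
            return x, k + 1
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = sum(ri * zi for ri, zi in zip(r, z))
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter
```

In this sketch the redundant work scales with the number of flagged rows rather than with the whole product, which is the source of the overhead saving compared with duplicating the entire computation.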
Event Type
Workshop
Time
Sunday, 12 November 2023, 4:33pm - 4:40pm MST
Location
605
Tags
Fault Handling and Tolerance
Large Scale Systems
Registration Categories
W