Presentation

Asynchronous Multi-Level Checkpointing: An Enabler of Reproducibility using Checkpoint History Analytics
Description
High-performance computing applications are increasingly integrating checkpointing libraries for reproducibility analytics. However, capturing an entire checkpoint history for a reproducibility study requires high-frequency checkpointing across thousands of processes. The resulting runtime overhead affects application performance, and it can also affect intermediate results when interleaving is introduced during floating-point calculations. In this paper, we extend asynchronous multi-level checkpoint/restart to study the intermediate results generated by scientific workflows. We present an initial prototype of a framework that captures, caches, and compares checkpoint histories from different runs of a scientific application executed using identical input files. We also study the impact of our proposed approach by evaluating the reproducibility of classical molecular dynamics simulations executed using the NWChem software. Experimental results show that our proposed solution improves the checkpoint write bandwidth when capturing checkpoints for reproducibility analysis by a minimum of 30x and up to 211x compared to the default NWChem checkpointing approach.
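To illustrate what comparing checkpoint histories from two runs can look like, here is a minimal Python sketch. It is not the framework presented in the talk: the function name `first_divergence`, the representation of a checkpoint as a flat list of floats, and the tolerance values are all assumptions made for illustration. The sketch reports the first checkpoint at which two runs with identical inputs stop agreeing within a tolerance.

```python
import math


def first_divergence(history_a, history_b, rel_tol=1e-12, abs_tol=0.0):
    """Return the index of the first checkpoint where two runs differ, or None.

    Each history is a list of checkpoints; each checkpoint is assumed to be
    a flat list of floats (a hypothetical stand-in for real checkpoint data).
    """
    for step, (ckpt_a, ckpt_b) in enumerate(zip(history_a, history_b)):
        # Checkpoints of different sizes cannot match.
        if len(ckpt_a) != len(ckpt_b):
            return step
        for x, y in zip(ckpt_a, ckpt_b):
            if not math.isclose(x, y, rel_tol=rel_tol, abs_tol=abs_tol):
                return step
    # One run produced more checkpoints than the other.
    if len(history_a) != len(history_b):
        return min(len(history_a), len(history_b))
    return None


# Two hypothetical runs that drift apart slightly at the second checkpoint.
run_a = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
run_b = [[1.0, 2.0], [1.5, 2.5000001], [2.0, 3.1]]

strict = first_divergence(run_a, run_b)
relaxed = first_divergence(run_a, run_b, rel_tol=1e-6)
```

With the strict default tolerance the small drift at the second checkpoint is flagged; with a looser relative tolerance it is accepted and only the larger difference at the third checkpoint is reported. Choosing the tolerance is a judgment call that depends on how much floating-point variation an application considers acceptable.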
Event Type
Workshop
Time
Sunday, 12 November 2023, 4:15pm - 4:40pm MST
Location
710
Tags
Fault Handling and Tolerance
Registration Categories
W