Presentation

Checkpoint/Restart for CUDA Kernels

Session: Fourth International Symposium on Checkpointing for Supercomputing (SuperCheck-SC23)

Description: In HPC clusters, it has become common to employ Checkpoint/Restart (C/R), that is, saving the execution state of applications so that their computational progress can be restored at a later point in time. The benefits of this technique for clusters include more flexibility when reacting to changing workloads and increased fault tolerance. While many clusters already benefit from C/R tools for traditional CPU applications, comparable tools that enable preemptive and transparent C/R for heterogeneous computing, where applications execute partly on accelerator devices such as GPUs, are lacking. This is despite the increasing use of GPUs as accelerators in HPC clusters. Therefore, we propose a novel C/R tool that saves the execution state of CUDA kernels, thus allowing preemptive C/R of GPU applications. We show that full-featured C/R for NVIDIA GPUs is possible despite the proprietary nature of the hardware and software of these devices.

Author/Presenters

Niklas Eiling

RWTH Aachen University

Stefan Lankes

RWTH Aachen University

Antonello Monti

RWTH Aachen University

Event Type

Workshop

Time: Sunday, 12 November 2023, 3:25pm - 3:50pm MST

Location: 710
