Lightning Talk: Inherent Checkpointing Properties of Nested Parallelism
Description

Our discovery was a complete accident. We worked on nested parallelism in OpenMP for easier work decomposition and better work balancing:

#pragma omp parallel num_threads(N)
{
    #pragma omp parallel num_threads(M)
    {
        #pragma omp parallel num_threads(K)
        {
        }
    }
}

Creating thousands of threads (for real-life N, M, K values) was out of the question, and queuing thousands of tasks at each nesting level had a significant sequential overhead, so we came up with the concept of task groups.

A task group is a descriptor that has information on the entire data range for all tasks and on the number of tasks expected to process that data range (N, M, or K, for each nesting level, respectively).

The task group descriptor can be queued by the main thread, and worker threads will create individual tasks from that descriptor, in parallel. Each individual task is responsible for finding its portion of the data range based on the task’s index within the group.
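
As a rough sketch (the descriptor layout and names below are our illustration, not the runtime's actual data structures), a task group and the per-task range lookup could look like this in C:

#include <stddef.h>

/* Hypothetical task-group descriptor: the full data range plus the
   number of tasks expected to split it (N, M, or K at a given level). */
typedef struct
{
    size_t begin;      /* start of the full data range               */
    size_t end;        /* one past the end of the full data range    */
    int    num_tasks;  /* number of tasks that will split the range  */
} task_group_t;

/* Each task derives its own sub-range from its index within the group,
   so tasks can be created lazily and in parallel by worker threads. */
static void task_range(const task_group_t *g, int task_index,
                       size_t *my_begin, size_t *my_end)
{
    size_t total = g->end - g->begin;
    size_t chunk = (total + (size_t)g->num_tasks - 1) / (size_t)g->num_tasks;
    *my_begin = g->begin + (size_t)task_index * chunk;
    if (*my_begin > g->end) *my_begin = g->end;
    *my_end = *my_begin + chunk;
    if (*my_end > g->end) *my_end = g->end;
}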

That sounded like how GPU kernels typically worked. Hence, we converted nested parallel task groups into GPU kernels at a pre-compilation stage, to let our runtime offload those kernels (or task groups) to GPUs.

Now that we already had individual kernels that could locate their own data, why not compile those kernels into self-contained libraries that can be executed on remote systems? That’s how each OpenMP-like nested parallel region became a task group, a kernel, and a library.
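
On a POSIX system, executing one of those self-contained kernel libraries could look roughly like the sketch below; the library path, the exported symbol name task_group_kernel, and its signature are assumptions made for illustration:

#include <dlfcn.h>
#include <stdio.h>

/* Assumed entry point exported by a task-group library: processes the
   sub-range corresponding to task_index and returns 0 on success. */
typedef int (*task_kernel_fn)(int task_index, int num_tasks,
                              const void *input, void *output);

int run_task_from_library(const char *lib_path, int task_index, int num_tasks,
                          const void *input, void *output)
{
    void *lib = dlopen(lib_path, RTLD_NOW);  /* load the self-contained kernel */
    if (!lib)
    {
        fprintf(stderr, "dlopen failed: %s\n", dlerror());
        return -1;
    }
    task_kernel_fn kernel = (task_kernel_fn)dlsym(lib, "task_group_kernel");
    int status = kernel ? kernel(task_index, num_tasks, input, output) : -1;
    dlclose(lib);
    return status;
}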

And then we ran into all sorts of issues once we tested the runtime on loosely coupled nodes on a local network (some GPUs were not super-reliable either).

More thinking brought the realization that our runtime actually “knows” whether execution on a remote node or a GPU was unsuccessful, so it can report that status back to the main program.

We were able to craft a runtime that enables nested parallel execution on remote systems and GPUs (and GPUs on remote systems) and reports the execution status for each nesting level of parallel work.

What a programmer needs to do now is simply check for an error after the return from a parallel region and restart execution (given that the inputs are separate from the outputs and have not been overwritten):

#pragma omp parallel num_threads(N)
{
    while (1)
    {
        int error = 0;
        #pragma omp parallel num_threads(M) status(&error)
        {
            #pragma omp parallel num_threads(K)
            {
            }
        }
        if (!error)
        {
            break;
        }
    }
}
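
The parenthetical requirement matters. Below is a minimal sketch of a restartable region with inputs and outputs kept in separate buffers; the status() clause is the proposed extension from above, not standard OpenMP:

#include <stddef.h>

/* Restart precondition: read only from 'in', write only to 'out',
   so repeating a failed attempt cannot corrupt the inputs. */
void run_with_restart(const double *in, double *out, size_t n, int M)
{
    int error;
    do
    {
        error = 0;
        /* status(&error) is the proposed extension; a standard OpenMP
           compiler would reject this clause */
        #pragma omp parallel num_threads(M) status(&error)
        {
            #pragma omp for
            for (size_t i = 0; i < n; ++i)
                out[i] = 2.0 * in[i];  /* in -> out, never in place */
        }
    } while (error);
}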

We implemented an N-body simulation (millions of bodies) and ran it on two systems, one significantly more powerful than the other, to ensure active work-stealing over the network. The experiment took roughly 30 minutes, which gave us plenty of time to disrupt it by deleting all intermediate files generated by our runtime and by switching network connections on and off. After comparing the outputs with those from an undisturbed run, we found the results to be identical (within GPU floating-point rounding error).

That is why we now advocate for nested parallelism as a promising solution for remote parallel execution, with “natural” checkpoints and simple restarting capabilities at each nesting level.
Event Type: Workshop
Time: Sunday, 12 November 2023, 5pm - 5:10pm MST
Location: 710
Tags: Fault Handling and Tolerance
Registration Categories: W