Presentation

Lightning Talk: Diaspora – Resilient Event Processing for Irregular, Distributed Scientific Applications
Description
Modern science increasingly requires the coordinated use of advanced computing, networking, instruments, and experimental facilities: collectively, research infrastructure. This infrastructure reaches from HPC systems to high-data-rate instruments and less well-connected edge systems, and also encompasses cloud-hosted services. These resources and their applications can generate many events, and because many science applications span locations, scientists need to consume events from many sources. To meet this need, we are developing Diaspora, a resilient, hierarchical event streaming approach that scales to meet the needs of modern science. Such complex, distributed applications have myriad hard and soft failure modes. The widely used coordinated checkpoint-restart resilience solution simply requires that processes agree on a globally consistent state, which then can be independently captured piece-wise by the processes and restarted from in case of failures. However, such approaches have limited applicability at very large scales that may involve geographically distributed resources, because the problem of agreeing on a globally consistent state is not tractable. Under such circumstances, there is a need to envision new abstractions to achieve resilience. This talk briefly introduces such abstractions that we propose in Diaspora. Notably, we envision the use of an event-streaming backbone that allows both loosely and tightly coupled workflow components to communicate and persist data in a resilient fashion. This context opens new opportunities to apply checkpointing techniques, which we will highlight. Furthermore, we will also describe the scientific applications targeted by the project, including federated learning, astronomical image processing, and x-ray image processing at advanced photon sources.
Event Type
Workshop
Time
Sunday, 12 November 2023, 2:50pm - 3pm MST
Location
710
Tags
Fault Handling and Tolerance
Registration Categories
W