Presentation

Scalable Big Data Processing on High Performance Computing Systems
Description
Several Big Data processing frameworks are in wide use, including Apache Spark and Dask. These frameworks cannot exploit high-speed, low-latency networks such as InfiniBand, Omni-Path, and Slingshot. In the High Performance Computing (HPC) community, Message Passing Interface (MPI) libraries are widely adopted for exactly this purpose: they execute scientific and engineering applications on parallel hardware connected by fast interconnects.

This tutorial introduces MPI4Spark and MPI4Dask, enhanced versions of the Spark and Dask frameworks, respectively, that can use MPI for communication in parallel and distributed settings on HPC systems. MPI4Spark launches the Spark ecosystem with MPI launchers so that it can use MPI communication. It also maintains isolation for application execution by forking new processes using Dynamic Process Management (DPM). Because it can use popular HPC interconnects, MPI4Spark also provides portability and performance benefits. MPI4Dask is a custom MPI-based Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs.

This tutorial provides a detailed overview of the design, implementation, and evaluation of MPI4Spark and MPI4Dask on state-of-the-art HPC systems. It then covers how to write, run, and demonstrate user Big Data applications on HPC systems.
Event Type
Tutorial
Time
Monday, 13 November 2023
1:30pm - 5pm MST
Location
303
Tags
Architecture and Networks
Data Movement and Memory
Message Passing
Registration Categories
TUT