Art of HPC
Posters
TP
W
TUT
XO/EX
Description: The recorded visualization was built with JavaScript using the D3 and Anime.js libraries. Historical run data from the Kestrel supercomputer was queried using SQL from NREL's internal sys admin database and bundled into a JSON file for use by the JavaScript code. The JSON file was organized by minute-long time-steps, each signifying the state of the jobs in a particular case on the supercomputer at a particular point in time.
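As a hedged illustration of the data-preparation step described above, the sketch below groups job records into minute-long time-steps and writes them to a JSON file. The database file, table, and column names (kestrel_jobs.db, job_history, minute, job_id, nodes, state) are invented placeholders, not NREL's actual schema.

```python
# Hypothetical sketch of the SQL-to-JSON bundling step; schema names are placeholders.
import json
import sqlite3
from collections import defaultdict

conn = sqlite3.connect("kestrel_jobs.db")  # placeholder for the sys admin database
rows = conn.execute(
    "SELECT minute, job_id, nodes, state FROM job_history ORDER BY minute"
)

# Group job records into minute-long time-steps, as described in the abstract.
timesteps = defaultdict(list)
for minute, job_id, nodes, state in rows:
    timesteps[minute].append({"job": job_id, "nodes": nodes, "state": state})

with open("kestrel_timesteps.json", "w") as f:
    json.dump(
        [{"minute": m, "jobs": jobs} for m, jobs in sorted(timesteps.items())], f
    )
```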
The video itself is a screen recording of the visualization running on a MacBook Pro with an Intel i9 2.3 GHz 8-core processor and an AMD Radeon Pro 5500 graphics card. The recording was captured with the default QuickTime software and edited for length using Adobe Premiere Pro.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Description: In today’s HPC landscape and especially in tomorrow’s even more complex systems, performance optimization and portability are critical for maximizing computational efficiency, minimizing energy consumption, and ensuring that applications can seamlessly adapt to rapidly evolving heterogeneous architectures. This talk will discuss challenges and solutions for performance optimization and portability of applications in modern HPC systems featuring increasingly heterogeneous architectures. Drawing from recent experiences in optimizing legacy applications, new simulation frameworks, and complex data analysis pipelines, we will examine approaches to effectively leveraging multiple levels of parallelism—both within nodes and across nodes—while maintaining performance portability. Topics will include scheduling libraries and autotuning, scalable domain decomposition, and runtime scheduling of workflows integrating AI, data management, and simulations. The discussion will conclude with recommendations for exploiting the multilevel parallelism and heterogeneity of next-generation accelerated HPC systems.
Art of HPC
Posters
TP
W
TUT
XO/EX
Description: This artifact was created with Pandas, Matplotlib, and NetworkX. Data was gathered from the Kestrel cluster at the National Renewable Energy Laboratory with Slurm via the sacct and sinfo commands.
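A minimal sketch of this kind of data gathering, assuming sacct is available on the cluster; the field list and the final grouping are illustrative choices, not the poster's actual pipeline.

```python
# Sketch: pull Slurm accounting records with sacct and load them into pandas.
import io
import subprocess
import pandas as pd

fields = ["JobID", "Partition", "AllocNodes", "Elapsed", "State"]
out = subprocess.run(
    ["sacct", "--allusers", "--noheader", "--parsable2",
     "--format=" + ",".join(fields)],
    capture_output=True, text=True, check=True,
).stdout

df = pd.read_csv(io.StringIO(out), sep="|", names=fields)
df["AllocNodes"] = pd.to_numeric(df["AllocNodes"], errors="coerce")
print(df.groupby("Partition")["AllocNodes"].sum())  # e.g., allocated nodes per partition
```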
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Edge computing, the notion of moving computational tasks from servers to the data-generating network edge, is an increasingly popular model for data processing. 5G wireless technologies offer an opportunity to enable complex distributed edge computing workflows by minimizing the overhead incurred in transmitting data to peer devices. In this work, we demonstrate the use and performance of edge devices in distributed computation workloads using Hadoop MapReduce on a cluster of six 5G-connected Raspberry Pis. Specifically, we first determine the network capabilities (i.e., latency and throughput) across millimeter wave (mmWave) 5G links and then analyze the scalability and performance of our cluster. Our experiment uses 5G radios at the Agricultural and Rural (ARA) Wireless Living Lab, spanning over six miles in diameter.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Description: One of the most mathematically challenging tasks within the aerospace industry is the design of the aircraft aerodynamics. Indeed, the various aerodynamic physical models involve non-linear partial differential equations of all types: elliptic, parabolic and hyperbolic. These are solved using Computational Fluid Dynamics, and require High-Performance Computing when solving for millions or billions of unknowns. The presentation will cover advances in the fidelity of aerodynamic models and applications to aerothermodynamic models towards ice accretion. In particular, it will highlight the use of the Chapel language in the main holistic solver within Prof. Laurendeau’s aerodynamic laboratory.
Workshop
Codesign
Data Movement and Memory
Facilities
W
Description: Understanding the performance potential and data placement challenges in Non-Uniform Memory Access (NUMA) architectures is crucial for optimizing High-Performance Computing (HPC) systems. We will present a quantitative approach, using simulations and models, that provides essential insights into how system architecture impacts microbenchmarks and real-world applications. We model a NUMA architecture with ARMv8 Neoverse V1 processors, leveraging the gem5 and VPSim simulation platforms. Combining these tools enables us to optimize simulation speed during early-stage exploration while preserving the accuracy necessary to evaluate design performance in later stages. We will present case studies that examine the performance implications of different NUMA node configurations, SLC (System Level Cache) group assignments, and Network-on-Chip (NoC) settings. These case studies reveal critical design trade-offs, offering valuable input for the co-design process, where HPC SoC architects and system integrators collaborate. This work is conducted within the European Processor Initiative (EPI) framework, focusing on developing new, energy-efficient hardware architectures for future exascale systems.
Workshop
Artificial Intelligence/Machine Learning
W
Description: Training large language models is becoming increasingly complex due to the rapid expansion in their size, resulting in significant computational costs. To address this challenge, various model growth methodologies have been proposed to leverage smaller pre-trained models to incrementally build larger models and reduce computational requirements. These methods typically involve mapping parameters from small models to large ones using either static functions or learned mappings. Although these approaches have demonstrated effectiveness, there is a lack of comprehensive comparative evaluations in the literature. Additionally, combining different methodologies could potentially yield superior performance. This study provides a uniform evaluation of multiple state-of-the-art model growth techniques and their combinations, revealing that efficient combination techniques can reduce the training cost (in TFLOPs) of individual methods by up to 80%.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: We are designing an automatic ticket answering service for computing centers such as the Texas Advanced Computing Center (TACC), the National Center for Supercomputing Applications (NCSA), and the San Diego Supercomputer Center (SDSC). In this work, we investigate the capability and feasibility of open-source large language models (LLMs) for the ticket answering task. We compare four open-source LLMs (OPT-6.7B, Falcon-7B, Llama 2-7B, and Llama 3.1-8B) by fine-tuning them on a curated dataset of over 110,000 historical question/answer pairs. Our results show that fine-tuned LLMs are capable of generating reasonable answers. Llama-7B has a lower validation loss and perplexity than OPT-6.7B and Falcon-7B. We also observe that fine-tuning with LoRA introduces non-trivial generalization loss compared with dense fine-tuning. We will design an evaluation dataset and perform a quantitative evaluation of these LLMs in the future.
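A hedged sketch of the LoRA setup that the poster compares against dense fine-tuning; the model identifier, target modules, and hyperparameters below are illustrative assumptions, not values taken from the poster.

```python
# Sketch: attach LoRA adapters to a base model so only a small set of weights is trained.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-hf"                 # one of the compared model families
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                  target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()   # a small fraction of weights vs. dense fine-tuning
```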
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
Description: Exploiting matrix symmetry to halve memory footprint offers an opportunity for accelerating memory-bound computations like Sparse Matrix-Vector Multiplication (SpMV). However, symmetric SpMV incurs data conflicts when concurrently writing the output vector. Previous approaches fail to address this issue efficiently. This paper proposes DCS-SpMV, a Divide-and-Conquer (DC) algorithm for efficient Symmetric SpMV. The key idea is to recursively divide the matrix-induced conflict graph into independent subgraphs for parallel execution, and construct separate subgraphs to avoid data conflicts. Our DC algorithm transforms the input matrix into a low-conflict part and a high-conflict part, which motivates us to design a conflict-aware hybrid solution that executes these two parts using DCS-SpMV and traditional SpMV respectively.
We develop a machine learning model to predict an optimal hybrid implementation for a given matrix and architecture. We evaluate our work on both X86 and ARM CPUs, demonstrating significant performance improvement over the state-of-the-art.
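To make the write conflict concrete, here is a small sequential reference (not DCS-SpMV itself): with only the lower triangle stored, each stored entry updates both y[i] and y[j], so threads handling different rows can race on the same y[j].

```python
# Sequential symmetric SpMV using only the lower triangle; the second update per
# nonzero is the conflicting write that a parallel implementation must guard.
import numpy as np
import scipy.sparse as sp

A = sp.random(1000, 1000, density=0.01, format="csr")
A = (A + A.T) * 0.5                   # make the matrix symmetric
L = sp.tril(A, format="csr")          # store only the lower triangle (half the memory)
x = np.random.rand(1000)

y = np.zeros(1000)
for i in range(L.shape[0]):
    for k in range(L.indptr[i], L.indptr[i + 1]):
        j, a = L.indices[k], L.data[k]
        y[i] += a * x[j]
        if i != j:
            y[j] += a * x[i]          # conflicting write when rows run in parallel

assert np.allclose(y, A @ x)          # matches the full (unsymmetric-storage) product
```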
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
Description: We present ExaDigiT, an open-source framework for developing comprehensive digital twins of liquid-cooled supercomputers. It integrates three main modules: (1) a resource allocator and power simulator, (2) a transient thermo-fluidic cooling model, and (3) an augmented reality model of the supercomputer and central energy plant. The framework enables the study of "what-if" scenarios, system optimizations, and virtual prototyping of future systems. Using Frontier as a case study, we demonstrate the framework's capabilities by replaying six months of system telemetry for systematic verification and validation. Such a comprehensive analysis of a liquid-cooled exascale supercomputer is the first of its kind. ExaDigiT elucidates complex transient cooling system dynamics, runs synthetic or real workloads, and predicts energy losses due to rectification and voltage conversion. Throughout our paper, we present lessons learned to benefit HPC practitioners developing similar digital twins. We envision the digital twin will be a key enabler for sustainable, energy-efficient supercomputing.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Description: As new compute systems are developed, there is still a need to compile Fortran for execution on leading-edge systems.
In order to achieve this, compilers are continuously under development.
Though the specification of Fortran is extensive, it is helpful to prioritize the development of the key features that desired applications use, so that those applications can be executed as soon as possible.
Identifying key features is largely done by querying software experts, who then manually report which features are present.
This is both time-consuming and error-prone.
To automate this, we present a compiler plugin to Flang that operates on a program's parse tree representation and detects key features.
We show the results of our tool on four applications.
We show the discrepancies between our tool and the manual characterization of three of the applications, and we generate a characterization for an application not yet profiled.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
Description: There are significant differences between emerging AI and data analytics workloads and traditional HPC workloads with regard to storage and programming frameworks. We extend DAOS with a queryable global shared low-latency/high-bandwidth cache and a resilient runtime that intercepts calls to popular analytics frameworks and offloads them to worker processes running on the HPC system. The result is a solution that offers bandwidth and latency benefits over vanilla DAOS and that enables ordinary programmers to interactively use popular programming frameworks like Python to solve huge problems on HPC systems without stranding resources.
Workshop
State of the Practice
System Administration
W
Description: Accurate wait time prediction for HPC jobs contributes to a positive user experience but has historically been a challenging task. Previous models lack the accuracy needed for confident predictions, and many were developed before the rise of deep learning.
In this work, we investigate and develop a neural network-based model to accurately predict wait times for jobs submitted to the Anvil HPC cluster. Data was taken from the Slurm Workload Manager on the cluster and transformed before performing additional feature engineering from jobs’ priorities, partitions, and states. We developed a hierarchical model that classifies job queue times into bins before applying regression, outperforming traditional methods. The model was then integrated into a CLI tool for queue time prediction. This study explores which queue time prediction methods are most applicable for modern HPC systems and shows that deep learning-based prediction models are viable solutions.
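An illustrative sketch of the classify-then-regress idea on synthetic data; the features, bin edges, and model choices are placeholders rather than the Anvil model.

```python
# Hierarchical queue-time prediction: pick a wait-time bin first, then refine with a
# per-bin regressor. Synthetic data only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

X = np.random.rand(5000, 4)                    # e.g. priority, partition, nodes, req_time
wait = np.random.exponential(3600, size=5000)  # synthetic wait times in seconds
bins = np.digitize(wait, [600, 3600, 14400])   # <10 min, <1 h, <4 h, longer

clf = RandomForestClassifier().fit(X, bins)
regs = {b: RandomForestRegressor().fit(X[bins == b], wait[bins == b])
        for b in np.unique(bins)}

def predict_wait(x):
    b = clf.predict(x.reshape(1, -1))[0]       # classification stage
    return regs[b].predict(x.reshape(1, -1))[0]  # regression within the bin

print(predict_wait(X[0]))
```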
Paper
Algorithms
Data Movement and Memory
I/O, Storage, Archive
Performance Optimization
Scientific and Information Visualization
Visualization
TP
Description: Multi-resolution methods such as Adaptive Mesh Refinement (AMR) can enhance storage efficiency for HPC applications generating vast volumes of data. However, their applicability is limited, and they cannot be universally deployed across all applications. Furthermore, integrating lossy compression with multi-resolution techniques to further boost storage efficiency encounters significant barriers. To this end, we introduce an innovative workflow that facilitates high-quality multi-resolution data compression for both uniform and AMR simulations. Initially, to extend the usability of multi-resolution techniques, our workflow employs a compression-oriented Region of Interest (ROI) extraction method, transforming uniform data into a multi-resolution format. Subsequently, to bridge the gap between multi-resolution techniques and lossy compressors, we optimize three distinct compressors, ensuring their optimal performance on multi-resolution data. Lastly, we incorporate an advanced uncertainty visualization method into our workflow to understand the potential impacts of lossy compression. Experimental evaluation demonstrates that our workflow achieves significant compression quality improvements.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Description: There has been a healthy growth of heterogeneous programming models that cover different paradigms in the HPC space.
Selecting an appropriate programming model for new projects is challenging: how does one select a model that is both productive and performant?
The same applies for existing projects aiming to leverage heterogeneous offload capabilities.
While characterisation of programming model performance has been abundant and comprehensive, productivity metrics are often reduced to basic measures like Source Line of Code (SLOC).
This study introduces a novel model divergence measure to objectively evaluate productivity.
We cover common aspects of productivity, including syntax, semantics, and optimisation overhead.
We present a productivity analysis framework supporting GCC and Clang, covering models for C/C++ and Fortran.
We evaluate our metric using this framework on mini-apps from SPEChpc and other established mini-apps, and propose a combined productivity and performance probability visualisation for a comprehensive picture.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Description: Microservices architecture is a promising approach for developing reusable scientific workflow capabilities for integrating diverse resources, such as experimental and observational instruments and advanced computational and data management systems, across many distributed organizations and facilities.
In this paper, we describe how the INTERSECT Open Architecture leverages federated systems of microservices to construct interconnected science ecosystems, review how the INTERSECT software development kit eases microservice capability development, and demonstrate the use of such capabilities for deploying an example multi-facility INTERSECT ecosystem.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Federated learning is a privacy-preserving machine learning approach. It allows numerous geographically distributed clients to collaboratively train a large model while maintaining local data privacy. In heterogeneous device settings, limited network bandwidth is a major bottleneck that constrains system performance. In this work, we propose a novel gradient compression method for federated learning that aims to achieve communication efficiency and a low error floor by estimating the prototype of gradients on both the server and client sides and sending only the difference between the real gradient and the estimated prototype. This approach further reduces the total bits required for model updates. Additionally, the memory requirement will be lighter on the client side but heavier on the server side compared to traditional error feedback methods. Experiments on training neural networks show that our method is more communication-efficient with little impact on training and test accuracy.
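A toy numpy sketch of the prototype-difference idea: both sides hold a shared prototype, and only a quantized difference between the true gradient and that prototype is transmitted. The quantization details are illustrative assumptions, not the poster's scheme.

```python
# Prototype-difference gradient compression, toy version.
import numpy as np

def compress(grad, prototype, levels=255):
    diff = grad - prototype
    scale = np.abs(diff).max() / (levels // 2) + 1e-12
    return np.round(diff / scale).astype(np.int8), scale   # 1 byte per value vs. 4

def decompress(q, scale, prototype):
    return prototype + q.astype(np.float32) * scale

rng = np.random.default_rng(0)
prototype = rng.standard_normal(10_000).astype(np.float32)  # shared estimate on both sides
grad = prototype + 0.01 * rng.standard_normal(10_000).astype(np.float32)

q, s = compress(grad, prototype)
print(np.abs(decompress(q, s, prototype) - grad).max())     # small reconstruction error
```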
Workshop
Algorithms
Heterogeneous Computing
W
Description: Domain scientists in the field of computational science often face challenges in developing optimized code for high-performance computing, especially GPUs. Considering the increasing heterogeneity within nodes of HPC computing facilities, there is a demand to develop performance portable solutions for the core computation kernels in a scientific library. We demonstrate a performance portable multi-GPU solution for an implementation of the Euler equations using ProtoX and IRIS. ProtoX is a domain-specific language using a partial differential equation library called Proto as its front end and SPIRAL, a code generation system, in its back end to generate optimized kernels for different architectures. These kernels are orchestrated through the intelligent runtime system, IRIS, to provide portability. Two levels of optimizations within IRIS, namely DAG and task fusion, are explored to efficiently utilize computing resources in a multi-GPU environment. Performance improvement through these optimizations is showcased on AMD and NVIDIA GPUs.
ACM Gordon Bell Climate Modeling Finalist
TP
Description: Ocean general circulation models (OGCMs) are indispensable for studying multi-scale oceanic processes and climate change. High-resolution ocean simulations require immense computational power and thus become a challenge in climate science. We present LICOMK++, a performance-portable OGCM using Kokkos, to facilitate global kilometer-scale ocean simulations. The breakthroughs include:
(1) We enhance cutting-edge Kokkos with the Sunway architecture, enabling LICOMK++ to become the first performance-portable OGCM on diversified architectures, i.e., Sunway processors, CUDA/HIP-based GPUs, and ARM CPUs.
(2) LICOMK++ overcomes the one simulated-years-per-day (SYPD) performance challenge for global realistic OGCM at 1-km resolution. It records 1.05 and 1.70 SYPD with a parallel efficiency of 54.8% and 55.6% scaling on almost the entire new Sunway supercomputer and two-thirds of the ORISE supercomputer.
(3) LICOMK++ is the first global 1-km-resolution realistic OGCM to generate scientific results. It successfully reproduces mesoscale and submesoscale structures that have considerable climate effects.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
TP
Description: Modern scientific software in high performance computing is often complex, and many parallel applications and libraries depend on several other software packages or libraries. Developers and users of such complex software often use package managers for building them. Package managers depend on humans to codify package constraints, and the dependency graph of a software package can often become large. In this paper, we propose a methodology that uses historical build results to assist a package manager in selecting the best versions of package dependencies with an aim to improve the likelihood of a successful build. We train a machine learning (ML) model to predict the probability of build outcomes of different configurations of packages in the Spack package manager. When evaluated on common scientific software stacks, this ML model-based approach is able to achieve a 13% higher success rate in building packages than the default version selection mechanism in Spack.
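A hedged sketch of the general idea: learn from historical build outcomes which dependency configurations are likely to build. The toy features and model below are placeholders, not Spack's solver or the paper's implementation.

```python
# Train a classifier on historical build outcomes and score a candidate configuration.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

hist = pd.DataFrame({
    "package":  ["hdf5", "hdf5", "petsc", "petsc"],      # toy history, not real data
    "version":  ["1.12.2", "1.14.0", "3.19.0", "3.20.1"],
    "compiler": ["gcc@11", "gcc@12", "gcc@11", "gcc@12"],
    "success":  [1, 0, 1, 1],
})
X = pd.get_dummies(hist[["package", "version", "compiler"]])
clf = GradientBoostingClassifier().fit(X, hist["success"])

candidate = pd.get_dummies(hist[["package", "version", "compiler"]].iloc[[1]])
candidate = candidate.reindex(columns=X.columns, fill_value=0)
print(clf.predict_proba(candidate)[:, 1])   # predicted probability this configuration builds
```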
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
Best Student Paper Finalist
Description: FIRAL is a recently proposed deterministic active learning algorithm for multiclass classification using logistic regression. It was shown to outperform the state-of-the-art in terms of accuracy and robustness and comes with theoretical performance guarantees. However, its scalability suffers when dealing with datasets featuring a large number of points $n$, dimensions $d$, and classes $c$, due to its $\mathcal{O}(c^2d^2+nc^2d)$ storage and $\mathcal{O}(c^3(nd^2 + bd^3 + bn))$ computational complexity, where $b$ is the number of points to select. To address these challenges, we propose an approximate algorithm with storage requirements reduced to $\mathcal{O}(n(d+c) + cd^2)$ and a computational complexity of $\mathcal{O}(bncd^2)$. Additionally, we present a parallel implementation on GPUs. We demonstrate the accuracy and scalability of our approach using MNIST, CIFAR-10, Caltech101, and ImageNet. The accuracy tests reveal no deterioration compared to FIRAL. We report strong and weak scaling tests on up to 12 GPUs for a three-million-point synthetic dataset.
Workshop
Artificial Intelligence/Machine Learning
W
Description: AI-based foundation models like FourCastNet and GraphCast are revolutionizing weather and climate predictions but are not yet ready for operational use. Their limitation lies in the absence of a data assimilation system to incorporate real-time Earth system observations, which is crucial for accurately forecasting events like tropical cyclones. To overcome these obstacles, we introduce a generic real-time data assimilation framework and demonstrate its end-to-end performance on the Frontier supercomputer. This framework comprises two primary modules: an ensemble score filter (EnSF), which significantly outperforms the state-of-the-art data assimilation method, and a vision transformer-based surrogate capable of real-time adaptation through the integration of observational data. We demonstrate both the strong and weak scaling of our framework up to 1024 GPUs on the exascale supercomputer Frontier. Our results not only illustrate the framework's exceptional scalability on high-performance computing systems, but also demonstrate the importance of supercomputers in real-time data assimilation for weather and climate predictions.
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Description: Generative artificial intelligence extends beyond its success in image/text synthesis, proving itself a powerful uncertainty quantification (UQ) technique through its capability to sample from complex high-dimensional probability distributions. However, existing methods often require a complicated training process, which greatly hinders their applications to real-world UQ problems. To alleviate this challenge, we developed a scalable, training-free score-based diffusion model for high-dimensional sampling. We incorporate a parallel-in-time method into our diffusion model to use a large number of GPUs to solve the backward stochastic differential equation and generate new samples of the target distribution. Moreover, we also distribute the computation of the large matrix subtraction used by the training-free score estimator onto multiple GPUs available across all nodes. We showcase the remarkable strong and weak scaling capabilities of the proposed method on the Frontier supercomputer, as well as its uncertainty reduction capability in hurricane predictions when coupled with AI-based foundation models.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Description: Scientific workflows and provenance are two sides of the same coin. While the former addresses the coordinated execution of multiple tasks over a set of computational resources, the latter relates to the historical record of data from its original sources. This paper highlights the importance of tracking multi-level provenance metadata in complex, AI-based scientific workflows as a way to (i) foster and (ii) expand documentation of experiments, (iii) enable reproducibility, (iv) address interpretability of the results, (v) facilitate performance bottleneck diagnosis, and (vi) advance provenance exploration and analysis opportunities.
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
Description: Quantum computers are making their way into High Performance Computing centers in the form of accelerators. Due to their physical implementation as mostly large appliances in separate racks, their number in typical data centers is significantly lower than the number of nodes offloading work to them, unlike the case with GPU accelerators. As a consequence, they form large-scale disaggregated infrastructures that pose a number of integration challenges due to their diverse implementation technologies and their need to be used as a shared resource for optimal utilization. Running hybrid High Performance Computing-Quantum Computing (HPCQC) applications in HPC environments, where the quantum portion is offloaded to the quantum processing units, requires sophisticated resource management strategies to optimize resource utilization and performance. In this paper, we present the Munich Quantum Software Stack (MQSS), a Just-In-Time (JIT) compilation and execution software stack tailored for integrating disaggregated quantum accelerators into traditional HPC workflows.
Posters
TP
Description: Knowledge graph (KG) learning offers a powerful framework for generating new knowledge and making inferences. Training KG embeddings can take a significantly long time, especially for larger datasets. Our analysis shows that the gradient computation of embedding and vector normalization are the dominant functions in the KG embedding training loop. We address this issue by replacing the core embedding computation with SpMM (Sparse-Dense Matrix Multiplication) kernels. This allows us to unify multiple scatter (and gather) operations as a single operation, reducing training time and memory usage. Applying this sparse approach in training the TransE model results in up to 5.7x speedup on the CPU and up to 1.7x speedup on the GPU. Distributing this algorithm on 64 GPUs, we observe up to 3.9x overall speedup in each epoch. Our proposed sparse approach can also be extended to accelerate other translation-based models such as TransR and TransH.
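A small numpy/scipy sketch of the core transformation: the scatter-adds of per-triple gradients into the entity table (following the TransE score h + r - t) are replaced by one SpMM against a +/-1 incidence matrix. Sizes and data are arbitrary illustrations, not the poster's implementation.

```python
# Scatter-add vs. SpMM for accumulating per-triple gradients into entity embeddings.
import numpy as np
import scipy.sparse as sp

num_entities, num_triples, dim = 1000, 5000, 64
heads = np.random.randint(0, num_entities, num_triples)
tails = np.random.randint(0, num_entities, num_triples)
G = np.random.rand(num_triples, dim)              # per-triple gradient of the score

# Scatter version: two index-add passes over all triples.
grad_scatter = np.zeros((num_entities, dim))
np.add.at(grad_scatter, heads, G)
np.add.at(grad_scatter, tails, -G)

# SpMM version: one sparse (entities x triples) incidence matrix times G.
rows = np.concatenate([heads, tails])
cols = np.concatenate([np.arange(num_triples)] * 2)
vals = np.concatenate([np.ones(num_triples), -np.ones(num_triples)])
M = sp.csr_matrix((vals, (rows, cols)), shape=(num_entities, num_triples))
grad_spmm = M @ G

assert np.allclose(grad_scatter, grad_spmm)
```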
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
Description: Multiplying two sparse matrices (SpGEMM) is a common computational primitive used in many areas including graph algorithms, bioinformatics, algebraic multigrid solvers, and randomized sketching. Distributed-memory parallel algorithms for SpGEMM have mainly focused on sparsity-oblivious approaches that use 2D and 3D partitioning. Sparsity-aware 1D algorithms can theoretically reduce communication by not fetching nonzeros of the sparse matrices that do not participate in the multiplication.
Here, we present a distributed-memory 1D SpGEMM algorithm and implementation. It uses MPI RDMA operations to mitigate the cost of packing/unpacking submatrices for communication, and it uses a block fetching strategy to avoid excessive fine-grained messaging. Our results show that our 1D implementation outperforms state-of-the-art 2D and 3D implementations within CombBLAS for many configurations, inputs, and use cases, while remaining conceptually simpler.
Invited Talk
TP
Description: Performance. Power. Startup. Sculpture. Music. This might seem like a disparate set of topics to describe one person's research in HPC. PowerPack. The Green500. grano.la. SeeMore. The CSGenome. Do these artifacts help? Maybe not. If life is a journey, so is the story of my research. I can assure you that all the research topics and artifacts I've studied and created along the way have common roots in HPC's sustainability. Another commonality is that early on, some in our community deemed these topics or artifacts a waste of time and resources, a non-problem, soon to be made irrelevant, or further evidence that this researcher is losing his grip on reality. Impact. Impact. Impact. In this talk, I will share key research findings and outcomes that have proven the naysayers wrong over time. I will also describe the inspiration and genesis of the work and the connections among these seemingly incongruous research projects. The goal of every endeavor so far — and the journey is far from over — has been to create sustained change in the way HPC computes and to ensure broad audiences understand the importance of what we do and how it connects to their everyday lives. And, perhaps surprisingly to some, I owe much of the success to the arts.
Workshop
A Study of a Deterministic Networking Framework for Latency Critical Large Scientific Data Transfers
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
Description: Scientific workflows often involve large data transfers, which increasingly require completion-time guarantees. To support these time-sensitive flows, the Energy Sciences Network (ESnet) has implemented on-demand circuits with packet priority, allowing the circuit to be utilized by other traffic when the deadline-sensitive flow is inactive. We explore a deterministic networking framework designed to support large scientific data transfers with completion guarantees. We consider an ideal network where all nodes are time-synchronized and utilize Cyclic Queueing and Forwarding (CQF) to achieve reliable low-latency data transfers. Our results show that the deterministic network architecture achieves performance comparable to the dynamic bandwidth reservation scheme. We believe that a more optimized version of the time-sensitive networking protocol that exploits multi-path routing could offer better completion guarantees than traditional network reservation options while improving overall network bandwidth utilization.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Programs like Girls Who Code (GWC) are pivotal in working to inspire and equip young women with the skills and confidence needed to pursue careers in computing. Understanding the impact of such initiatives is particularly important for addressing the decline in interest among girls aged 13 to 17, a critical period for career decision-making. By evaluating the effectiveness of employing a GWC club at our university, this research aims to uncover strategies that can successfully attract and retain women in computer science (CS) in our region. The goal is not only to reverse the trend of declining female participation but also to sustain their interest in the field of computing.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Description: Persistent Memory (PM) is a promising next-generation storage device, combining features of both volatile memory (like DRAM) and non-volatile memory (like SSDs). Many studies use PM to optimize training to advance deep learning technology. However, these studies have not addressed the issue of multiple copies of training data during deep learning, leading to reduced training efficiency. In this study, we first analyze the characteristics of PM and mainstream file systems. We then explore PM's byte addressability to manage metadata and data efficiently. This approach minimizes multiple I/O operations of tasks involving repeated read-write data accesses, such as machine learning datasets, enabling zero-copy data handling and significant speedups of read-and-write operations.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Description: Automatic code optimization is a complex process that typically involves the application of multiple discrete algorithms that modify the program structure irreversibly. However, the design of these algorithms is often monolithic, and they require repetitive implementation to perform similar analyses due to the lack of cooperation. To address this issue, modern optimization techniques, such as equality saturation, allow for exhaustive term rewriting at various levels of inputs, thereby simplifying compiler design.
In this paper, we propose equality saturation to optimize sequential codes utilized in directive-based programming for GPUs. Our approach simultaneously realizes less computation, less memory access, and high memory throughput. Our fully-automated framework constructs single-assignment forms from inputs to be entirely rewritten while keeping dependencies and extracts optimal cases. Through practical benchmarks, we demonstrate a significant performance improvement on several compilers. Furthermore, we highlight the advantages of computational reordering and emphasize the significance of memory-access order for modern GPUs.
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Modeling and Simulation
Numerical Methods
TP
Description: Simulating emerging resistive switching memory devices, such as memristors, requires modeling frameworks that can treat the motion of point defects across nanoscale domains. Field-driven Kinetic Monte Carlo (d-KMC) methods that simulate the discrete structural evolution of atomic coordinates in the presence of external potential and heat fields can be used for this purpose. While physically similar to conventional KMC methods, field-driven approaches present different computational motifs and introduce global communication. Here, we develop the first scalable d-KMC code for resistive memory arrays at atomistic resolution. We accelerate this latency-sensitive simulation on the GPU partition of the LUMI supercomputer, exploiting the high-speed interconnects between GPUs on the same node. Applied to the technologically relevant HfOx material stack, our code enables the first atomistic simulation of 3x3 arrays of resistive switching memory cells with more than 1 million atoms, matching the dimensions of fabricated structures.
Tutorial
Accelerators
Emerging Technologies
Numerical Methods
Parallel Programming Methods, Models, Languages and Environments
Quantum Computing
TUT
Description: GPU-accelerated quantum simulations are increasingly being adopted in hybrid quantum-classical algorithm development to speed up algorithm run-time, to test and implement future parallel QPU workflows, to scale up the size of quantum research, and to deploy workflows where QPUs and GPUs are tightly coupled. This tutorial guides attendees through examples simulated on their laptops to GPUs on NVIDIA Quantum Cloud. We then focus on running industry-relevant quantum research problems on HPC systems. The tutorial begins with an interactive Jupyter notebook demonstrating parallel quantum simulation using open-source CUDA-Q (introductory material). Next, the tutorial enables attendees to deploy quantum software on large-scale HPC clusters like Perlmutter to run, for example, a 30,000-term Hamiltonian using 100 GPUs across multiple nodes (intermediate and advanced material). The tutorial ends with a presentation on QuEra machines and their capabilities, along with a hands-on example setting up Quantum Reservoir Models on QuEra’s platform (intermediate and advanced material).
This is the software to be used: https://nvidia.github.io/cuda-quantum/latest/index.html
This is the Docker image to be used (or extended to be optimal on Perlmutter): https://catalog.ngc.nvidia.com/orgs/nvidia/teams/quantum/containers/cuda-quantum
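For orientation, a minimal CUDA-Q example in the spirit of the introductory notebook: a two-qubit Bell circuit sampled on a GPU-backed simulator. The target name is an assumption and may differ on a given system.

```python
# Bell-state sampling with CUDA-Q's Python API.
import cudaq

cudaq.set_target("nvidia")            # GPU-accelerated statevector simulator, if available

@cudaq.kernel
def bell():
    qubits = cudaq.qvector(2)
    h(qubits[0])                      # superposition on the first qubit
    x.ctrl(qubits[0], qubits[1])      # entangle with a controlled-X
    mz(qubits)                        # measure both qubits

print(cudaq.sample(bell, shots_count=1000))   # roughly half 00, half 11
```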
Workshop
Algorithms
Heterogeneous Computing
W
Description: Methods to mitigate the kernel launch overhead, one of the drawbacks of GPUs, were implemented in an overhead-sensitive atmospheric model using OpenACC and CUDA and were evaluated. OpenACC enables kernels to run asynchronously in either one or multiple GPU queues. Moreover, CUDA allows different loops to be collocated in one kernel by branching operations based on block indices. While the default synchronous execution on an A100 GPU lagged behind the A64FX CPU in strong scaling, the single-queue asynchronous execution reduced the total model runtime by 37%, and the kernel fusion of the core application component further accelerated the entire model by approximately 10%. In overhead-sensitive applications, the single-queue asynchronous execution is recommended because it can be easily implemented and maintained. If a small number of kernels are executed particularly frequently, it would be worth the effort to eliminate synchronizations and introduce CUDA Graphs, or bundle kernels using CUDA.
Paper
Artificial Intelligence/Machine Learning
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
Description: DLRM is a state-of-the-art recommendation system model that has gained widespread adoption across various industry applications. The large size of DLRM models, however, necessitates the use of multiple devices/GPUs for efficient training. A significant bottleneck in this process is the time-consuming all-to-all communication required to collect embedding data from all devices. To mitigate this, we introduce a method that employs error-bounded lossy compression to reduce the communication data size and accelerate DLRM training. We develop a novel error-bounded lossy compression algorithm, informed by an in-depth analysis of embedding data features, to achieve high compression ratios. Moreover, we introduce a dual-level adaptive strategy for error-bound adjustment, spanning both table-wise and iteration-wise aspects, to balance the compression benefits with the potential impacts on accuracy. We further optimize our compressor for PyTorch tensors on GPUs, minimizing compression overhead. Evaluation shows that our method achieves a 1.38X training speedup with a minimal accuracy impact.
Doctoral Showcase
Posters
TP
Description: Advances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute — such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing, and edge systems, but passing data among computational steps remains a challenge when applications are compositions of multiple distinct software components with differing communication patterns.
This work introduces a new programming paradigm that decouples data flow from control flow by extending the pass-by-reference model to distributed applications. ProxyStore, developed here, implements this paradigm through object proxies that act as wide-area object references with just-in-time resolution. The proxy model enables producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. This decoupling enables the dynamic selection of different data movement methods, depending on what data are moved, where data are moved, or when data are moved — a longstanding challenge in distributed applications.
The efficacy of the proxy paradigm is further understood through four high-level proxy-based programming patterns applied to real-world computational science applications. These high-level patterns — distributed futures, streaming, ownership, and stateful actors — make the power of the proxy paradigm accessible for more complex and dynamic distributed program structures. ProxyStore is evaluated through standardized benchmark suites, introduced here, and meaningful science applications, spanning bioinformatics, federated learning, and molecular design, in which substantial improvements in runtime, throughput, and memory usage are demonstrated.
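A hedged sketch of the proxy pattern using ProxyStore's Python API as I understand it; the connector choice and path are illustrative, and real deployments use wide-area connectors (e.g., Redis or Globus) rather than a local file store.

```python
# Pass-by-reference across workflow steps: the consumer receives a proxy, not the data.
import numpy as np
from proxystore.connectors.file import FileConnector
from proxystore.store import Store

store = Store("demo", FileConnector("/tmp/proxystore-demo"))

data = np.arange(1_000_000)
proxy = store.proxy(data)          # lightweight reference, cheap to pass around

def consumer(x):
    return x.sum()                 # the proxy resolves just-in-time on first access

print(consumer(proxy))
store.close()
```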
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
Description: Deep Learning Recommendation Models (DLRMs) face challenges due to the high memory needs of embedding tables and significant communication overhead in distributed settings. Traditional methods, like Tensor-Train (TT) decomposition, compress these tables effectively but add computational load. Furthermore, existing frameworks for distributed training are inadequate due to the excessive data exchange requirements.
We introduce EcoRec, an advanced library that boosts DLRM training by integrating TT decomposition with distributed training. EcoRec innovates with a unique computation pattern to streamline TT operations and an optimized multiplication approach, drastically cutting computation time. It implements a novel micro-batching method using sorted indices to slash memory use without extra computation. Moreover, EcoRec employs a pioneering pipeline for embedding layers, promoting even data spread and communication efficiency. Built on PyTorch and CUDA, tested on a 32 GPU cluster, EcoRec dramatically surpasses EL-Rec, delivering up to 3.1× faster training and reducing memory needs by 38.5%.
Doctoral Showcase
Posters
TP
Description: Modern high-performance computing (HPC) workflows produce massive datasets, often exceeding 100+ TB per day, driven by instruments collecting data at gigabytes per second. These workflows, executed on advanced HPC systems with heterogeneous storage devices, high-performance microprocessors, accelerators, and interconnects, are increasingly complex and often involve non-deterministic computations. In this context, thousands of processes share computing resources using synchronization for consistency. The intricate process interaction and existing non-deterministic operations challenge explorations of workflow behaviors to ensure reproducibility, optimize performance, and reason about what happens when processes compete for resources. Existing reproducibility analysis frameworks are not well-suited to identify the sources and locations of non-determinism and performance variations, as they often focus on the final workflow results and general statistics about workflow performance.
We address these challenges by introducing scalable techniques that accelerate the comparison of intermediate workflow results using variation-tolerant hashing of floating-point datasets, thus improving result reproducibility. We also capture workflow performance profiles and benchmark various queries to analyze workflow performance reproducibility. We also identify opportunities to optimize the loading process and indexing of performance data to ensure minimal initialization and querying overhead. Using collected performance data, we propose a cache-aware staggering technique that leverages workflow I/O profiles to reduce bottlenecks and resource contention, particularly in workflows that share the same input data. Our evaluations across molecular dynamics, cosmology, and deep learning workflows demonstrate significant speedups in intermediate-result reproducibility analyses compared to state-of-the-art baselines, as well as our ability to propose workflow execution strategies that maximize cache reuse and minimize execution makespan.
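A simplified illustration of variation-tolerant hashing: quantize floating-point values to a tolerance before hashing, so runs that differ only by small non-deterministic jitter produce the same digest. The actual technique in this showcase is more involved.

```python
# Tolerance-aware hashing of floating-point arrays.
import hashlib
import numpy as np

def tolerant_hash(arr, tol=1e-6):
    # Quantize to the tolerance, then hash the integer representation.
    q = np.round(np.asarray(arr, dtype=np.float64) / tol).astype(np.int64)
    return hashlib.sha256(q.tobytes()).hexdigest()

a = np.random.rand(1000)
b = a + 1e-12 * np.random.randn(1000)         # jitter far below the tolerance
print(tolerant_hash(a) == tolerant_hash(b))   # True unless a value sits on a bin boundary
```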
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
Description: Inspired by the success of the first TPU for ML inference deployed in 2015, Google has developed multiple generations of machine learning supercomputers for efficient ML training and serving, enabling near linear scaling of ML workloads. In this talk, we will present how TPU works as a machine learning supercomputer to benefit a growing number of Google services, including Gemini and Ads. Furthermore, we will have a deep dive into our full-stack co-design methodology that spans across model, software and hardware layers, and how it turns accelerator concepts into reality.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
Description: In this paper, we propose using Partitioned Global Address Space (PGAS) GPU one-sided asynchronous small messages to replace the widely used collective communication calls for sparse input multi-GPU embedding retrieval in deep learning recommendation systems. This GPU PGAS communication approach achieves (1) better communication and computation overlap, (2) smoother network usage, and (3) reduced overhead (due to the data unpack and rearrangement steps associated with collective communication calls). We implement a CUDA embedding retrieval backend for PyTorch that supports the proposed PGAS communication scheme and evaluate it on deep learning recommendation inference passes. Our backend outperforms the baseline using NCCL collective calls, achieving 1.97x speedup for the weak scaling test and 2.63x speedup for the strong scaling test in a 4 GPU NVLink-connected system.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Description: Applications are increasingly written as dynamic workflows underpinned by an execution framework that manages asynchronous computations across distributed hardware. However, execution frameworks typically offer one-size-fits-all solutions for data flow management, which can restrict performance and scalability. ProxyStore, a middleware layer that optimizes data flow via an advanced pass-by-reference paradigm, has been shown to be an effective mechanism for addressing these limitations. Here, we investigate integrating ProxyStore with Dask Distributed, one of the most popular libraries for distributed computing in Python, with the goal of supporting scalable and portable scientific workflows. Dask provides an easy-to-use and flexible framework, but is less optimized for scaling certain data-intensive workflows. We investigate these limitations and detail the technical contributions necessary to develop a robust solution for distributed applications and demonstrate improved performance on synthetic benchmarks and real applications.
Exhibitor Forum
Accelerating Scientific Computing: GPU Optimization Strategies, Challenges, and Performance Outcomes
Accelerators
Software Engineering
TP
XO/EX
Description: GPUs are transforming scientific computing by delivering substantial speedups and energy savings. This presentation outlines the development of GPU computing strategies for a leading CFD software. Initially, we identified and optimized computational bottlenecks suitable for GPU acceleration, using an offload model. To overcome limitations imposed by Amdahl's law, we developed a GPU-native solver architecture with streamlined APIs, ensuring seamless integration with existing workflows.
Our optimization strategy also accounts for diverse GPU platforms, implementing platform-specific enhancements for NVIDIA, AMD, and Intel architectures. Scalability was achieved through advanced load-balancing algorithms and improved inter-GPU communication, enabling efficient parallelization for large-scale simulations.
We present results demonstrating significant speedups and energy savings compared to CPU-based methods, highlighting the transformative potential of GPUs in enabling faster, more complex simulations.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionThe RISC-V Instruction Set Architecture (ISA) has enjoyed phenomenal growth in recent years; however, it has yet to gain popularity in HPC. Whilst adopting RISC-V CPU solutions in HPC might be some way off, RISC-V-based PCIe accelerators offer a middle ground where vendors benefit from the flexibility of RISC-V yet fit into existing systems.
In this paper we focus on the Tenstorrent Grayskull PCIe RISC-V based accelerator which, built upon Tensix cores, decouples data movement from compute. Using the Jacobi iterative method as a vehicle, we explore the suitability of stencils on the Grayskull e150. We explore best practice in structuring these codes for the accelerator and demonstrate that the e150 provides similar performance to a Xeon Platinum CPU (albeit BF16 vs FP32) but the e150 uses around five times less energy. Over four e150s we obtain around four times the CPU performance, again at around five times less energy.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionIn this position paper we argue for standardizing how we share and process data in scientific workflows at the network level, to maximize step re-use and workflow portability across platforms and networks in pursuit of a foundational workflow stack. We look to evolve workflows from steps connected point-to-point in a directed acyclic graph (DAG) to steps connected via shared channels in a message system implemented as a network service. To start this evolution, we contribute a preliminary reference model, an architecture, and open tools to implement the architecture today. Our goal is to improve the deployment and operation of complex workflows by decoupling data sharing from data processing in workflow steps. We seek the workflow community's input on the merit of this approach, related research to explore, and initial requirements to inform future research.
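A toy sketch of the shift described above, assuming an in-process queue as a stand-in for a network message service: steps publish to and consume from a shared channel rather than being wired point-to-point.

```python
# Workflow steps decoupled via a shared channel; a message broker would replace
# queue.Queue in a real network service. Illustrative only.
import queue, threading

channel = queue.Queue()      # shared channel any step may publish to or consume from
SENTINEL = None

def producer_step():
    for record in range(5):
        channel.put({"record": record})   # publish without knowing the consumers
    channel.put(SENTINEL)

def consumer_step(name):
    while True:
        msg = channel.get()
        if msg is SENTINEL:
            channel.put(SENTINEL)         # let other consumers terminate too
            break
        print(f"{name} processed {msg}")

threads = [threading.Thread(target=producer_step)] + [
    threading.Thread(target=consumer_step, args=(f"worker-{i}",)) for i in range(2)
]
for t in threads: t.start()
for t in threads: t.join()
```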
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionTraditional scientific visualization pipelines transfer entire data arrays from storage to client nodes for processing into displayable graphics objects. However, this full data transfer is often unnecessary, as many visualization filters operate on only small subsets of a data array. With the rise of computational storage, smart NICs, and smart devices enabling offloaded processing, this paper examines a case where a visualization pipeline is divided into pre-filters that run near the data and post-filters that execute on the client side. Pre-filters preprocess the data where it resides on storage nodes, reducing data volumes before transfer based on downstream pipeline needs, while post-filters complete the processing on the client node. Experiments on two real-world simulation datasets demonstrate that this approach can significantly reduce network transfer volumes, cutting visualization pipeline data load times by up to 2.8X compared to traditional methods, and by up to 11.9X when combined with data compression techniques.
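A minimal sketch of the split described above, assuming a simple threshold pre-filter: the near-data stage shrinks the array before transfer and the client stage finishes the pipeline. The filter choice and sizes are illustrative.

```python
# Near-data pre-filter plus client-side post-filter, as a toy pipeline split.
import numpy as np

def pre_filter_near_data(field: np.ndarray, iso: float):
    """Runs near storage: keep only the cells the downstream filter needs."""
    mask = field > iso
    idx = np.flatnonzero(mask)
    return idx, field.flat[idx]          # a much smaller payload crosses the network

def post_filter_on_client(idx, values, shape):
    """Runs on the client: rebuild a sparse view and finish the pipeline."""
    sparse = np.zeros(shape, dtype=values.dtype).ravel()
    sparse[idx] = values
    return sparse.reshape(shape)         # handed off to the renderer / contour filter

field = np.random.rand(64, 64, 64).astype(np.float32)
idx, vals = pre_filter_near_data(field, iso=0.95)
client_view = post_filter_on_client(idx, vals, field.shape)
print(f"transferred {vals.nbytes} of {field.nbytes} bytes")
```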
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionGraph Neural Networks (GNNs) have been used to solve complex problems in drug discovery, social media analysis, and other domains. Meanwhile, GPUs have become the dominant accelerators for improving deep neural network performance. However, due to the characteristics of graph data, it is challenging to accelerate GNN-type workloads with GPUs alone. GraphSAGE is one representative GNN workload that uses sampling to improve GNN learning efficiency. Profiling GraphSAGE with the PyG library reveals that the sampling stage on the CPU is the bottleneck. Hence, we propose a heterogeneous system architecture in which the sampling algorithm is accelerated on customizable accelerators (FPGAs) and the sampled data is fed into GPU training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA acceleration, for the sampling stage alone, we achieve a speed-up of 2.38X to 8.55X compared with sampling on the CPU.
For end-to-end latency, compared with the traditional flow, we achieve a speed-up of 1.24X to 1.99X.
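For reference, the sketch below shows the kind of CPU-side neighbor-sampling stage that the abstract identifies as the bottleneck and offloads to an FPGA; the CSR layout and fanout are illustrative, and in a real pipeline the sampled blocks would be streamed into GPU training.

```python
# Minimal GraphSAGE-style neighbor sampling on the CPU (the bottleneck stage).
import numpy as np

rng = np.random.default_rng(0)
num_nodes = 10_000
# Adjacency in CSR-like form: offsets into a flat neighbor array.
degrees = rng.integers(1, 20, size=num_nodes)
offsets = np.concatenate(([0], np.cumsum(degrees)))
neighbors = rng.integers(0, num_nodes, size=offsets[-1])

def sample_neighbors(seeds, fanout):
    """For each seed node, draw `fanout` neighbors with replacement."""
    out = np.empty((len(seeds), fanout), dtype=np.int64)
    for i, v in enumerate(seeds):
        nbrs = neighbors[offsets[v]:offsets[v + 1]]
        out[i] = rng.choice(nbrs, size=fanout, replace=True)
    return out

batch = rng.integers(0, num_nodes, size=1024)   # mini-batch of seed nodes
block = sample_neighbors(batch, fanout=10)      # CPU-bound sampling stage
print(block.shape)                               # (1024, 10) -> fed to GPU training
```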
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
DescriptionThe GPU has emerged as the go-to accelerator for HPC workloads; however, its power consumption has become a major limiting factor for further scaling HPC systems. An accurate understanding of GPU power consumption is essential for further improving energy efficiency and, consequently, reducing the associated carbon footprint. Despite the limited documentation and lack of understanding, NVIDIA GPUs' built-in power sensor is widely used in energy-efficient computing research. Our study seeks to elucidate the internal mechanisms behind the power readings provided by nvidia-smi and to assess the accuracy of the measurements. We evaluated over 70 different GPUs across 12 architectural generations and identified several unforeseen problems that can lead to drastic under- or overestimation of the energy consumed; for example, on the A100 and H100 GPUs only 25% of the runtime is sampled. We propose several mitigations that could reduce the energy measurement error by an average of 35% in the test cases we present.
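For context, the sketch below shows the measurement pattern the study scrutinizes, assuming an NVIDIA GPU with nvidia-smi on PATH: poll the reported power draw and integrate it over time. The estimate inherits any sampling-window and averaging artifacts of the underlying sensor, which is exactly the error source the paper quantifies.

```python
# Poll nvidia-smi's power reading and integrate it into an energy estimate.
import subprocess, time

def read_power_watts(gpu: int = 0) -> float:
    out = subprocess.check_output([
        "nvidia-smi", "--query-gpu=power.draw",
        "--format=csv,noheader,nounits", f"--id={gpu}",
    ])
    return float(out.decode().strip())

def estimate_energy_joules(duration_s: float = 10.0, period_s: float = 0.1) -> float:
    samples, stamps = [], []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(read_power_watts())
        stamps.append(time.time())
        time.sleep(period_s)
    # Trapezoidal integration of power over time.
    return sum((samples[i] + samples[i + 1]) / 2 * (stamps[i + 1] - stamps[i])
               for i in range(len(samples) - 1))

print(f"~{estimate_energy_joules():.1f} J over the sampled window")
```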
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionInterconnection networks are key components that condition the performance of current large data center and supercomputer systems. Both topology and routing are critical aspects that must be carefully considered for a competitive system network design. Moreover, when daily failures are expected, this tandem should exhibit resilience and robustness. Low-diameter networks, including HyperX, are cheaper than typical Fat Trees, but to be truly competitive they must employ advanced routing algorithms that both balance traffic and tolerate failures.
In this paper, SurePath, an efficient fault-tolerant routing mechanism for HyperX topology, is introduced and evaluated. SurePath leverages routes provided by standard routing algorithms and a deadlock avoidance mechanism based on an Up/Down escape subnetwork. This mechanism not only prevents deadlock but also allows for a fault-tolerant solution for these networks. SurePath is thoroughly evaluated in the paper under different traffic patterns, showing no performance degradation under extremely faulty scenarios.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionThe need to support a large volume of transactions on shared data is increasing to meet explosive growth in worldwide data and processing demands. Emerging memory architectures such as CXL are increasing in popularity; CXL allows for dynamic demand-sensitive resizing of aggregated memory, support for heterogeneous memory types, and sharing of data amongst supported processors and devices. However, while this new memory architecture alleviates many concerns in datacenter and HPC architectures, data integrity when using memory-based transactions over CXL faces many challenges.
To address these challenges, we describe a novel solution for providing ACID (Atomicity, Consistency, Isolation, Durability) transactions in a CXL-based memory architecture. We call this solution Transactional CXL, or TCXL. It requires no changes to existing processor microarchitectures and is implemented in a software library with a back-end controller that can be embedded in a CXL controller, deployed as a stand-alone CXL device, or implemented on the host.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionActive learning algorithms, integrating machine learning, quantum computing and optics simulation in an iterative loop, offer a promising approach to optimizing metamaterials. However, these algorithms can face difficulties in optimizing highly complex structures due to computational limitations. High-performance computing (HPC) and quantum computing (QC) integrated systems can address these issues by enabling parallel computing. In this study, we develop an active learning algorithm working on HPC-QC integrated systems. We evaluate the performance of optimization processes within active learning (i.e., training a machine learning model, problem-solving with quantum computing, and evaluating optical properties through wave-optics simulation) for highly complex metamaterial cases. Our results showcase that utilizing multiple cores on the integrated system can significantly reduce computational time, thereby enhancing the efficiency of optimization processes. Therefore, we expect that leveraging HPC-QC integrated systems helps effectively tackle large-scale optimization challenges in general.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionArtificial Intelligence, combined with simulations and experiments, has great potential in accelerating scientific discovery, yet bridging the gap between simulations and experiments remains challenging due to time and scale disparities. Our research addresses this issue by developing a deep kernel-based surrogate model that learns from microscopic images to map structural features to energy differences from defect formation. We begin with full training using simulated images to establish optimal settings and create a baseline for active learning. Active learning is then employed to predict structures along simulation trajectories based on uncertainty and stability, reducing data requirements and computational costs. The model shows a low average error margin of approximately 0.03 meV. An autoencoder-decoder was developed as an additional surrogate to enhance feature extraction and reconstruction, achieving a reconstruction loss of around 0.2 and facilitating precise comparisons between simulations and experiments. This approach advances real-time experimental guidance through computational simulations.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionA Fine-grained Asynchronous Bulk Synchronous Parallel (FA-BSP) model is an extended version of the existing BSP model that facilitates fine-grained asynchronous point-to-point messages with automatic message aggregation.
While there are many large irregular applications written with the FA-BSP model that demonstrate promising performance, no existing profiler is aware of the profile-worthy portions of an FA-BSP program or visualizes the results in an intuitive way. This is understandable because an FA-BSP program relies on multiple external libraries, and the runtime frequently switches between different portions of the program, which makes it difficult for well-established profilers such as Score-P, TAU, CrayPat, VTune, and HPCToolkit to profile and visualize these portions in an FA-BSP-friendly manner.
This paper designs and implements a profiling and visualization framework called ActorProf. The framework enables 1) asynchronous point-to-point message-aware profiling with hardware performance counters, 2) overall performance breakdown that is aware of FA-BSP execution, and 3) visualization of these profiling results.
Paper
Accelerators
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Performance Optimization
TP
DescriptionAttention-based models are proliferating in the space of image analytics, including segmentation. The standard method of feeding images to transformer encoders is to divide the images into patches and then feed the patches to the model as a linear sequence of tokens. For high-resolution images, e.g., microscopic pathology images, the quadratic compute and memory cost prohibits the use of an attention-based model. The solution is to either use complex multi-resolution models or approximate attention schemes. We take inspiration from Adaptive Mesh Refinement (AMR) methods by adaptively patching the images based on image detail, reducing the number of patches fed to the model. This method has negligible overhead and works seamlessly as a pre-processing step with any attention-based model. We demonstrate superior segmentation quality over widely used segmentation models for real-world pathology datasets while gaining a geomean speedup of 6.9x for resolutions up to 64K^2.
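A minimal sketch of the AMR-inspired idea, assuming pixel variance as the detail criterion: a patch is split recursively only where detail exceeds a threshold, so smooth regions contribute a few large patches. The threshold, minimum patch size, and toy image are illustrative choices, not the paper's method.

```python
# AMR-inspired adaptive patching driven by local image detail (pixel variance).
import numpy as np

def adaptive_patches(img, x, y, size, min_size=16, var_thresh=5e-3):
    """Return a list of (x, y, size) patches covering img[y:y+size, x:x+size]."""
    tile = img[y:y + size, x:x + size]
    if size <= min_size or tile.var() < var_thresh:
        return [(x, y, size)]            # smooth enough: keep one large patch
    half = size // 2
    patches = []
    for dy in (0, half):
        for dx in (0, half):
            patches += adaptive_patches(img, x + dx, y + dy, half, min_size, var_thresh)
    return patches

img = np.random.rand(256, 256) * np.linspace(0, 1, 256)   # toy image, detail varies by column
patches = adaptive_patches(img, 0, 0, 256)
print(f"{len(patches)} adaptive patches vs. {(256 // 16) ** 2} uniform tokens")
```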
Birds of a Feather
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
Inclusivity
Broader Engagement
Inclusivity
TP
W
TUT
XO/EX
DescriptionAdvanced Computing for Social Change presented by the NSF Leadership-Class Computing Facility is underway! ACSC brings together undergraduate students who use TACC resources to investigate research questions with societal impact. By the end of the week, students present their data analysis and visualization to peers, mentors, and sponsors.
Tutorial
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
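As a hedged illustration of one topic listed above (nonblocking point-to-point communication for stencil halo exchange), the sketch below uses mpi4py; the 1D decomposition, array sizes, and file name are assumptions for the example, not material from the tutorial.

```python
# 1D stencil halo exchange with nonblocking sends/receives, overlapping the
# interior update with communication. Run with: mpiexec -n 4 python stencil.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left  = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

n_local = 1_000
u = np.random.rand(n_local + 2)          # one ghost cell on each side
u_new = np.empty_like(u)

for _ in range(100):
    reqs = [comm.Isend(u[1:2],  dest=left),   comm.Irecv(u[0:1],  source=left),
            comm.Isend(u[-2:-1], dest=right), comm.Irecv(u[-1:],  source=right)]
    # Interior update proceeds while the halos are in flight.
    u_new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])
    MPI.Request.Waitall(reqs)
    u_new[1]  = 0.5 * (u[0]  + u[2])     # boundary cells need the received halos
    u_new[-2] = 0.5 * (u[-3] + u[-1])
    u, u_new = u_new, u
```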
Exhibitor Forum
Artificial Intelligence/Machine Learning
Facilities
TP
XO/EX
DescriptionWith the growth of AI, liquid cooling of IT equipment is no longer optional. Power consumption and heat dissipation have dramatically increased. Liquid cooling is now required to manage heat loads of AI semiconductors with much higher heat fluxes and AI servers with much greater rack densities.
There are two core approaches to liquid cooling: single-phase, using water or a water-glycol mixture as a coolant, and two-phase, using refrigerants. In single-phase cooling, the coolant remains in the liquid state, whereas in two-phase cooling, the refrigerant changes between liquid and gas phases.
In this forum, we will provide a technical assessment of single-phase and two-phase direct-to-chip liquid cooling (DLC) technologies, focusing on system design and operational differences. These differences help explain why single-phase DLC is a mature technology now seeing mainstream adoption. In contrast, two-phase DLC is earlier in the technology adoption cycle and still faces technical and operational hurdles.
Exhibitor Forum
Hardware Technologies
TP
XO/EX
DescriptionThe evolution of the Arm Neoverse Compute Subsystem (CSS) is propelled by recent advancements in upstream open-source projects. As AI and machine learning workloads continue to expand, the demand for custom-built Arm solutions is becoming critical for optimizing resource utilization in high-performance computing (HPC) environments. Cloud-native Arm processors and accelerators, such as AWS Graviton and Microsoft Maia, illustrate the trend toward specialized processors designed to manage large-scale, compute-intensive tasks.
This presentation will explore how open-source firmware serves as a foundational element for Arm Total Design (ATD) partners in developing diverse processors tailored for HPC, AI, and beyond. We will spotlight contributions to the open-source community and the value of leveraging collaborative solutions, and discuss practical approaches for enabling advanced Arm solutions in today's rapidly evolving tech landscape. Attendees will gain insights into how purpose-built processors are driving innovation in HPC and the broader industry.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionWe present a comparison of twenty different Time Series deep learning models for the important but challenging problem of Earthquake nowcasting in Southern California. We find that pattern models where general architectures are trained on this problem outperform foundation models that do not exploit some key features of earthquake time series. A graph neural network expressing spatial locality has the best performance. We introduce a new general approach termed MultiFoundationPattern that combines a bespoke model with other pattern and foundation model results handled as auxiliary streams. In the earthquake case, the resultant MultiFoundationQuake model achieves the best overall performance. This work in progress is being extended in different directions, including the study of different earthquake regions and the use of simulations for better training. Further, we are examining the importance of patterns and the MultiFoundationPattern integration model in other geospatial applications, including the CAMELS and CARAVAN datasets in hydrology.
Invited Talk
TP
DescriptionFord is driving a historic digital transformation. Our movement is toward being an automotive technology company with highly capable software and services. This is evident across the core businesses with new technologies in Ford Blue, Ford Model e, Ford Pro, Ford Plus, and Ford Next. We are focused on working differently by reimagining our cross-functional enterprise business processes, supporting technology systems, and workforce development. Ford is highly leveraging computer-aided engineering, modeling and simulation, and high performance computing to advance product development and deliver outcomes in performance, safety, speed to market, first time quality, and long-term reliability.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionSupercomputers produce big data and operational data analytics (ODA) makes this information useful to users and administrators. ODA data, properly used, can impact energy use, operational costs, high performance computing (HPC) performance, and ultimately science. Each HPC site globally has to wrestle ODA into submission with each new system. As a new system is deployed, dashboards can be highly valuable to system administrators. If the telemetry data schema changes, dashboards must be recreated using different sources, query languages, and metric names. When dashboards need to be reworked, this keeps people from using them during this valuable time. In order to ease this burden across HPC sites, we have created a system dashboard to share between sites as a way to move toward a standard for telemetry data. This paper describes the dashboard and outlines how to use it at your site.
Panel
Distributed Computing
TP
DescriptionDistributed services are a pervasive element of large-scale computing: they aggregate system resources, abstract reusable application functionality, and coordinate work among distributed applications. These services form the backbone of data management, scheduling, in situ analytics, performance instrumentation, AI workflow coupling, integrated research infrastructure, and more across cloud, edge, and HPC environments. The goal of this panel is to bring together leading experts to identify fundamental cross-cutting challenges and opportunities for the distributed services community in HPC. What are the technical hurdles, how can services adapt to the increasing scale of systems and applications, what can we learn from other communities, and how can we advance the state of the art and adoption of distributed services for HPC?
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionMore than half of the global population lives in cities. Climate change impacts have diverse implications on urban life across multiple sectors, including: water supply, health and wellbeing, energy demand and air quality. There is thus a growing need by various stakeholders for an integrated tool to support climate change mitigation and adaptation in cities. We propose to design and develop LLM-based agents aimed at supporting urban climate resilience, by providing tailored solutions to individual cities and communicating climate knowledge and insights to different audiences. We highlight the main challenges involved in the development of such LLM agents and potential approaches to addressing them, including: incorporating high-resolution data better fitting the urban scale and addressing queries for simulations. We hope that this work will open the door to the development of additional AI-based tools for climate and environmental applications.
Birds of a Feather
TP
XO/EX
DescriptionAgriculture worldwide is facing massive challenges in production, distribution, pollution reduction, food security and waste: in a $4 trillion global food production industry, less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is crucial, particularly when supply chains are disrupted by wars and pandemics. This second BoF will discuss how novel supercomputing technologies, AI, and related distributed heterogeneous systems could empower the primary sector so that it no longer operates in a needlessly fragile and inefficient way.
Tutorial
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Portability
Runtime Systems
TUT
DescriptionKubernetes has emerged as the leading container orchestration solution (maintained by the Cloud Native Computing Foundation) and works on resources ranging from on-prem clusters to commercial clouds. The Kubernetes ecosystem has grown to enable batch-type workflows and has developed rich semantics that allow execution of complex scientific computing workflows typically not feasible on batch systems. This growth has been possible thanks to tools like k8s-sig/kueue, the kubeflow/mpi-operator, k8s/scheduler-plugins, and k8s/device-plugin, among many projects created by the community to enable complex workloads leveraging Kubernetes' rich API system.
The tutorial aims to educate AI and computational science researchers on Kubernetes as a resource management system, comparing it with traditional batch systems. It covers I/O and storage options and the use of GPU and MPI operators in Kubernetes to scale workloads over high-performance networks such as InfiniBand. Attendees will receive an overview of Kubernetes architecture and job submission procedures, learn about storage options, run hands-on examples of AI inference, training, and scientific research software using Kubernetes on CPU and GPU resources, and explore MPI examples for scaling out. Theoretical knowledge will be reinforced with hands-on sessions using the PNRP production Kubernetes cluster Nautilus.
Exhibits
Flash Session
TP
XO/EX
DescriptionUnlock the full potential of your unstructured data with Dell Technologies' AI-Ready Data Platform. This comprehensive solution combines industry-leading PowerScale storage, servers, and a modern data Lakehouse architecture to enable seamless AI deployment across all your data sources. Discover how Dell's innovative data platform empowers customers to harness the power of AI to transform their data into valuable insights, wherever it resides.
Exhibits
Flash Session
TP
XO/EX
DescriptionAs AI revolutionizes industries, data storage is no longer just a repository but a critical enabler of AI’s transformative power. In this session, Seagate experts reveal how advanced storage solutions support the expanding demands of AI workloads, from the cloud to the edge. With multimodal large language models and compliance mandates driving AI’s growth, scalable and energy-efficient storage is essential. Learn how innovations in hard drive areal density are tripling storage capacity while reducing power consumption by 60%, accelerating AI performance and data center energy efficiency.
Join us to see how the right storage solution can make all the difference and propel your AI initiatives forward.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionLarge-scale international scientific collaborations, such as ATLAS, generate vast volumes of data, necessitating substantial computational power. Centralized workflow and data management systems are employed to handle these demands, but current decision-making processes for data placement and payload allocation are often heuristic and disjointed. This optimization challenge could be addressed using machine learning methods, such as reinforcement learning, which, in turn, require access to extensive data. We propose a generative surrogate modeling approach to address the lack of training data and concerns about privacy preservation. We collect and process real-world job records, and compare four generative models for tabular data---TVAE, CTAGGAN+, SMOTE, and TabDDPM---to these datasets, thoroughly evaluating their performance. Experiments indicate that SMOTE and TabDDPM generate similar tabular data to ground truth, while SMOTE ranks the lowest in privacy preservation. As a result, we conclude that the probabilistic-diffusion-model-based TabDDPM is the most suitable generative model for managing job record data.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionThis work quantifies the impact of microarchitectural features in modern high-performance Arm CPUs. To combat a parameter space that is too large to traverse naively, we employ a decision tree regression machine learning model to predict the number of execution cycles with 93.38% accuracy compared to the simulated cycles. We build on previous work by specializing our design to real-world HPC workloads and modernize our approach with updated search spaces, improved simulation frameworks, and over 180,000 simulated data points. We find empirically that vector length typically has the largest impact on HPC code performance at 25.91% of our performance weighting, followed by memory performance across all levels of the memory hierarchy, and the size of the reorder buffer and register files. Our results motivate deeper exploration of these parameters in both hardware design and simulation, as well as advancing the modelling of architectural simulation through the use of machine learning.
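As a hedged illustration of the modelling approach described above, the sketch below fits a decision tree regressor that maps microarchitectural parameters to cycle counts; the feature set and the synthetic target are stand-ins, not the paper's 180,000 simulated data points.

```python
# Decision tree regression from microarchitectural parameters to cycle counts.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 5_000
X = np.column_stack([
    rng.choice([128, 256, 512, 1024, 2048], n),   # vector length (bits)
    rng.choice([32, 64, 128, 256], n),            # L1 size (KiB)
    rng.choice([64, 128, 192, 256, 320], n),      # reorder buffer entries
])
# Toy stand-in for simulated cycles, dominated by vector length as in the study.
y = 1e7 / X[:, 0] + 5e5 / X[:, 1] + 2e5 / X[:, 2] + rng.normal(0, 50, n)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = DecisionTreeRegressor(max_depth=10).fit(X_tr, y_tr)
print(f"R^2 on held-out configs: {model.score(X_te, y_te):.3f}")
print(dict(zip(["vector_len", "l1_kib", "rob_entries"], model.feature_importances_)))
```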
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAs high-performance computing (HPC) advances to exascale levels, its role in scientific fields such as medicine, climate research, finance, and scientific computing becomes increasingly critical. However, these large-scale systems are susceptible to performance variations caused by anomalies, including network contention, hardware malfunctions, and shared resource conflicts. These anomalies can lead to increased energy consumption, scheduling inefficiencies, and reduced application performance. Therefore, accurately and promptly diagnosing these performance anomalies is essential for maintaining the efficiency and reliability of HPC systems. Machine learning offers a powerful approach to automating the detection of such anomalies by learning patterns from the vast amounts of complex telemetry data generated by these systems. Our research focuses on increasing the efficiency and resilience of HPC systems through automated telemetry analytics, and this poster presentation will summarize our efforts and findings in this domain.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionEnterprise and high-performance computing data centers are dealing with thousands of sensor metrics and associated data. A top-end target for exascale machines is 10 million data points per second. The escalating volume and speed of data generation are making things more difficult, and outages are increasing. Uptime Institute's Outage Analysis report, published in June 2022, states that 30% of all outages in 2021 lasted more than 24 hours, a disturbing increase from 8% in 2017. While equipment is idle during downtime, it often continues to consume power, especially for cooling systems. This leads to wasted energy and higher operational costs. We propose an AIOps solution that uses advanced data analytics, machine learning, and deep learning methods to develop automated and advanced anomaly detection and predictive tools for data centers. They perform at scale and speed, and improve data center resiliency and energy efficiency, thereby promoting the sustainability of data centers.
Doctoral Showcase
Posters
TP
DescriptionAs heterogeneity becomes commonplace in HPC systems, algorithmic and optimization techniques are needed to address the challenges that come with it, especially for irregular applications. This includes workload balancing, scheduling, latency tolerance, and memory utilization and contention, among others. This showcase covers three works addressing key questions in running complex irregular graph applications on heterogeneous systems: programmability, performance portability, memory efficiency, load balancing, and scalability.
The first work explores the efficacy of utilizing commercial high-level synthesis tools to accelerate two different graph sampling methods on FPGAs. We achieve up to a 40x speedup compared to the baseline OpenCL kernel, and identify key areas for toolchain improvements, such as memory subsystems and latency tolerance.
The second work focuses on improving breadth-first probabilistic traversals (BPTs), as they dominate runtime in some applications. By identifying and exploiting redundancies in edge accesses, we achieve an average of 75x and 135x speedups when deployed on two different frameworks. We also demonstrate strong scaling up to 4,096 nodes on OLCF Frontier enabled by CPU-GPU heterogeneous workload balancing.
The third work is currently in progress, exploring the use of lossy compression to enable training on graph neural networks. We have promising preliminary results, showing a compression ratio of between 6x-20x with minimal accuracy loss on both GCN and GAT. We identify future directions and use cases for this method with an emphasis on systems integration such as larger batch sizes in mini-batch training, compressing feature vector caches, and adaptive compression methods for heterogeneous and dynamic GNNs.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionHigh-performance computing hardware is co-developed with U.S. DOE codes, and proxy applications (apps) based on these codes are critical technologies for iterative innovation. Numerical modeling/simulation proxies have been the most impactful in co-design. To broaden the types of computation available for co-design, we are developing proxy apps based on MetaHipMer (mhm2), a DOE-developed, scalable, de novo metagenome assembler. MetaHipMer is implemented in C++, and offloads several routines to GPU. It has been used to assemble large (>50 Terabase) metagenomes on exascale-class machines (e.g., Summit). Our first proxy focuses on the expensive kmer analysis step. This and subsequent steps are memory-bound computations using CPU shared-memory distributed data structures (e.g., distributed hash tables). These data structures are often larger than inputs, and operations on them account for most of runtime. Our proxies will be implemented in Kokkos, a C++ performance portability programming model for emerging architecture design/testing.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionGraph-based representations are increasingly popular for storing and managing information through knowledge graphs, which capture entities and their relationships. However, these knowledge graphs often suffer from incomplete link information. To address this issue, link classification methods can be used to predict and verify missing connections. Recently, supervised heuristic learning methods have improved link classification accuracy. Specifically, the SEAL framework, as a state-of-the-art supervised heuristic learning tool, excels in learning associativity patterns by analyzing local enclosing subgraphs to classify links. However, DGCNN, a graph neural network model in this framework, lacks the capability to process edge attributes, leading to poor classification accuracy in knowledge graphs. Hence, this paper proposes an Augmented Model of the DGCNN (AM-DGCNN) by replacing GCNs with GATs to better incorporate link information. With extensive experiments, we demonstrate that our AM-DGCNN in the SEAL framework can achieve up to 98% accuracy for classifying links in knowledge graphs.
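A minimal sketch of the substitution described above, assuming PyTorch Geometric is available: a GAT layer configured with an edge dimension can attend over edge attributes, which a plain GCN layer cannot. The dimensions and toy graph are illustrative; this is not the AM-DGCNN model itself.

```python
# GAT layer consuming edge attributes, in place of a GCN layer that cannot.
import torch
from torch_geometric.nn import GATConv

num_nodes, node_dim, edge_dim, hidden = 6, 8, 4, 16
x = torch.randn(num_nodes, node_dim)                   # node features
edge_index = torch.tensor([[0, 1, 2, 3, 4],            # source nodes
                           [1, 2, 3, 4, 5]])           # target nodes
edge_attr = torch.randn(edge_index.size(1), edge_dim)  # per-link attributes

# edge_dim tells the attention layer to fold edge attributes into its scores.
conv = GATConv(node_dim, hidden, heads=2, edge_dim=edge_dim)
out = conv(x, edge_index, edge_attr)
print(out.shape)                                        # [6, 2 * hidden] with concatenated heads
```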
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
Birds of a Feather
TP
XO/EX
DescriptionThe SC24 edition of the Americas HPC Collaboration BoF will focus on improving continental actions since the initial edition at SC15. This year's BoF will showcase three continental Networks of Research and Education (NRES): CANARIE, Internet2 and RedCLARA, and the collaborative experiences in HPC that these NRES have enabled to grow the skills and actions of the continental research communities. Additionally, during the BoF, the guidelines for an MoU between CANARIE, Internet2 and RedCLARA will be proposed to endorse the upcoming actions of the Americas HPC Collaboration, including formation of a continental consortium.
Paper
Accelerators
Algorithms
Data Compression
Linear Algebra
Tensors
TP
Best Paper Finalist
DescriptionAlgebraic multigrid (AMG) methods are efficient for solving diverse sparse linear systems, due to their flexibility and adaptability. Even though modern parallel devices have brought massive parallelism to AMG, the latest major hardware feature, tensor cores, and their low-precision compute power have not been exploited to accelerate AMG.
This paper proposes AmgT, a new AMG solver that utilizes tensor cores and mixed precision. Considering that SpGEMM and SpMV are extensively used in the setup and solve phases, respectively, we propose a novel method based on a unified storage format that leverages tensor cores and their variable precision. To utilize algorithm components in existing libraries, the data format and compute kernels of the AmgT solver are incorporated into Hypre. The experimental results on NVIDIA A100 and H100 GPUs show that AmgT outperforms the GPU version of Hypre by an average factor of 1.46× and 1.32×, respectively.
Posters
TP
DescriptionQuantum computing is an innovative technology that can solve certain problems faster than classical computing. One of its promising applications is in solving partial differential equations (PDEs). However, current PDE solvers that are based on variational-quantum-eigensolver (VQE) techniques suffer from low accuracy, high execution times, and low scalability on noisy-intermediate-scale-quantum (NISQ) devices, especially for multidimensional PDEs.
We introduce a highly accurate and scalable quantum algorithm for solving multidimensional PDEs and present two variants of our algorithm. The first leverages classical-to-quantum (C2Q) encoding, finite-difference-method (FDM), and numerical instantiation, while the second employs C2Q, FDM, and column-by-column decomposition (CCD). To evaluate our algorithm, we have used a multidimensional Poisson equation. Our results demonstrate higher accuracy, higher scalability, and faster execution times compared to VQE-based solvers on noise-free and noisy quantum simulators from IBM. We have also investigated our proposed algorithm on hardware emulators, employing various noise mitigation techniques with encouraging preliminary results.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionLarge-scale graphs with billions and trillions of vertices and edges require efficient parallel algorithms for common graph problems, one of which is single-source shortest paths (SSSP). Bulk-synchronous parallel algorithms such as Delta-stepping experience large synchronization costs at the scale of many nodes, so asynchronous approaches are needed for scalability. However, asynchronous approaches are susceptible to wasteful, speculative execution. We introduce ACIC, a highly asynchronous approach modulated by continuous concurrent introspection and adaptation. Using message-driven concurrent reductions and broadcasts, task-based scheduling, and an adaptive aggregation library, we explore techniques such as evolving windows and generation and prioritized flow of optimal updates, or edge relaxations, aimed at reducing speculative loss without constraining parallelism. Our results, while preliminary, demonstrate the promise of these ideas, with the potential to impact a wider class of graph algorithms.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWe propose a novel approach for executing dynamic applications on GPUs. Different from traditional approaches that use a single kernel, our method allows the GPU to autonomously allocate computational resources at runtime. We decompose a kernel into multiple fragment kernels and dynamically launch an optimal number of them during execution. The input data is partitioned into smaller segments, and each fragment kernel processes one of the partitioned segments. This method is implemented using CUDA graph conditional nodes to determine the number of fragment kernels to launch based on the input size. We compared the proposed method with traditional kernel execution on a Breadth-First Search (BFS) application, a representative dynamic application. Results show comparable performance while reducing utilization of compute resources by up to 19.9%, along with opportunities for further performance improvement by tuning the parameters of our method.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionRust is a type-safe programming language originally developed by Mozilla in 2010. With design goals including guarantees of memory and thread safety, alongside foundations in functional programming, it claims to be highly performant, reliable, and productive for developers. Therefore, the Rust language could be well-suited for use in High Performance Computing (HPC) applications.
We present a functionally verified Rust translation of HPCCG, a proxy application in the Mantevo suite, showing the possibility of applying both shared and distributed memory parallelism using Rayon and MPI bindings in Rust to mature HPC codebases. Performance analysis, within a novel framework empowering reproducibility, empirically shows the translated Rust approaches the original C++ and a Kokkos performance portability framework implementation in scaling characteristics and overall performance for representative HPC workloads. The productivity benefits of Rust in combination with its measured performance characteristics holistically demonstrate it to be an effective tool for use in HPC.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
Workshop
I/O, Storage, Archive
W
DescriptionAs machine learning models rapidly grow in size and complexity, the cost of checkpointing during ML training has become a bottleneck in both storage and time. For example, the GPT-4 model reportedly has on the order of 1.76 trillion parameters, and frequently writing checkpoints containing more than a trillion floating-point values to storage is highly time- and storage-consuming. This work aims to understand and mitigate this problem. First, we characterize the checkpointing interface in a collection of representative large machine learning/language models with respect to storage consumption and performance overhead. Second, we propose two optimizations: i) a periodic cleaning strategy that removes outdated checkpoints to reduce the storage burden; and ii) a data staging optimization that coordinates checkpoints between local and shared file systems to improve performance.
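A minimal sketch of the first optimization described above, assuming a simple keep-last-N retention policy; the directory layout, file naming, and dummy writer are illustrative stand-ins rather than the paper's implementation.

```python
# Periodic checkpoint cleaner: keep only the N most recent checkpoint files.
import os, glob, time

def clean_checkpoints(ckpt_dir: str, keep_last: int = 3) -> None:
    """Delete all but the `keep_last` most recently written checkpoint files."""
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "ckpt_step*.pt")),
                   key=os.path.getmtime)
    for stale in ckpts[:-keep_last]:
        os.remove(stale)

def checkpoint_loop(save_fn, ckpt_dir: str, max_steps: int, every_s: float = 600.0,
                    keep_last: int = 3) -> None:
    """Save periodically, then prune outdated checkpoints to bound storage growth."""
    os.makedirs(ckpt_dir, exist_ok=True)
    for step in range(max_steps):
        path = os.path.join(ckpt_dir, f"ckpt_step{step:08d}.pt")
        save_fn(path)                  # e.g., torch.save(model.state_dict(), path)
        clean_checkpoints(ckpt_dir, keep_last)
        time.sleep(every_s)

if __name__ == "__main__":
    checkpoint_loop(lambda p: open(p, "wb").close(),   # dummy writer standing in for torch.save
                    ckpt_dir="/tmp/ckpts", max_steps=5, every_s=0.1)
    print(sorted(os.listdir("/tmp/ckpts")))            # only the 3 newest remain
```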
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWe present error-bounded lossy compression tailored for particle datasets from diverse scientific applications in cosmology, fluid dynamics, and fusion energy sciences. As today's high-performance computing capabilities advance, these datasets often reach trillions of points, posing significant analysis and storage challenges. While error-bounded lossy compression makes it possible to represent floating-point values with strict pointwise accuracy guarantees, the lack of correlations in particle data's storage ordering often limits the compression ratio. Inspired by quantization-encoding schemes in SZ lossy compressors, we dynamically determine the number of bits to encode particles of the dataset to increase the compression ratio. Specifically, we utilize a k-d tree to partition particles into subregions and generate "bit boxes" centered at particles for each subregion to encode their positions. These bit boxes ensure error control while reducing the bit count used for compression. We evaluate our method against state-of-the-art compressors on cosmology, fluid dynamics, and fusion plasma datasets.
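A small sketch of the quantization idea behind the bit boxes described above, assuming a uniform quantizer relative to a box origin: a box of extent L can encode positions with roughly ceil(log2(L / (2*eb))) bits while keeping the pointwise error within eb. The constants are illustrative.

```python
# Bits needed to encode particle positions inside a "bit box" under an error bound.
import math
import numpy as np

def bits_for_box(extent: float, error_bound: float) -> int:
    """Bits needed to quantize one coordinate inside a box of the given extent."""
    levels = max(1.0, extent / (2.0 * error_bound))
    return max(1, math.ceil(math.log2(levels)))

def quantize(coords: np.ndarray, origin: float, error_bound: float) -> np.ndarray:
    """Uniform quantization of one coordinate axis relative to the box origin."""
    return np.round((coords - origin) / (2.0 * error_bound)).astype(np.int64)

box_extent, eb = 10.0, 1e-3
coords = np.random.uniform(0.0, box_extent, size=1_000_000)
q = quantize(coords, origin=0.0, error_bound=eb)
recon = q * (2.0 * eb)
print(bits_for_box(box_extent, eb), "bits/coordinate;",
      "max error:", float(np.abs(recon - coords).max()))   # stays within eb
```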
Paper
Accelerators
HPC Infrastructure
Performance Evaluation and/or Optimization Tools
State of the Practice
TP
Best Paper Finalist
DescriptionThe Dragonfly is an extensively deployed network topology in large-scale high-performance computing (HPC) due to its cost-effectiveness and efficiency. In comparison to other indirect networks of similar scale, the Dragonfly network has shown a considerable reduction in cable lengths and network costs. Three of the deployed and upcoming exascale supercomputers for leadership-class workloads use Dragonfly networks. It is imperative to evaluate the topology's performance across a broad range of HPC workloads to facilitate optimal system procurement. While previous work has focused on understanding the topology from a capacity-computing workload perspective, this study assesses extreme-scale leadership workloads on a Dragonfly network. To accomplish this, we conduct a comprehensive evaluation of various workload efficiencies using the state-of-the-art Slingshot 11 Dragonfly topology and compare it against the Summit supercomputer's EDR InfiniBand non-blocking fat tree. These evaluations are conducted utilizing resources at the OLCF (Frontier and Summit).
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionData reduction is now frequently used by simulations to reduce the amount of data that needs to be stored. Consequently, several error-bounded lossy data reduction techniques have been developed to help compress scientific datasets while trying to maximize quality. However, their impact on visualization has hardly been studied and is not well understood. In this paper, we present an in-depth analysis of the impact of lossy data reduction on volume rendering, aiming to determine which parameters, such as dataset characteristics, opacity, and color, affect the perceived quality of the reduced data.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionWith the growing complexity in architecture and the size of large-scale computing systems, monitoring and analyzing system behavior and events has become daunting. Monitoring data amounting to terabytes per day are collected by sensors housed in these massive systems at multiple fidelity levels and varying temporal resolutions. In this work, we develop an incremental version of multiresolution dynamic mode decomposition (mrDMD), which converts high-dimensional data into spatial-temporal patterns at varied frequency ranges. Our incremental implementation of the mrDMD algorithm (I-mrDMD) promptly reveals valuable information in the massive environment log dataset, which is then visually aligned with the processed hardware and job log datasets through our generalizable rack visualization, built with D3 and integrated into the Jupyter Notebook interface. We demonstrate the efficacy of our approach with two use scenarios on a real-world dataset from a Cray XC40 supercomputer, Theta.
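For orientation, the sketch below shows a plain (non-incremental) DMD computed with the SVD, the building block that mrDMD applies recursively across frequency bands; the synthetic signal and rank are illustrative, and this is not the I-mrDMD implementation.

```python
# Minimal dynamic mode decomposition: fit a linear operator between
# time-shifted snapshot matrices via the SVD and read off modes/eigenvalues.
import numpy as np

def dmd(X: np.ndarray, rank: int):
    """X: snapshots as columns (sensors x time). Returns DMD eigenvalues and modes."""
    X1, X2 = X[:, :-1], X[:, 1:]                   # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s    # reduced linear operator
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ Vh.conj().T / s @ W               # spatial DMD modes
    return eigvals, modes

t = np.linspace(0, 10, 400)
signal = np.outer(np.sin(2 * np.pi * 0.5 * t), np.ones(32)).T + \
         0.1 * np.random.randn(32, 400)            # 32 "sensors", one dominant mode
eigvals, modes = dmd(signal, rank=4)
print(np.abs(eigvals))                             # |lambda| ~ 1 for persistent modes
```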
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionOne of the challenges in practical quantum computing is the limited number of qubits. Most problems of interest require far more qubits than currently are available -- whether through execution on an existing physical quantum device or through simulation on multiple graphical processing units (GPUs). Circuit cutting (sometimes referred to as circuit knitting) is a divide-and-conquer approach to break down a quantum circuit into smaller pieces, each of which may require fewer qubits than the original circuit. Often these smaller circuits can be executed in parallel before their output is stitched back together for an approximation of the original circuit execution. To our knowledge, no educational materials have been disseminated broadly that provide students the opportunity to actively learn this technique and experiment with running quantum algorithms in parallel.
This self-contained educational module walks students through a visual example of circuit cutting through the Max-Cut problem, using the Quantum Approximate Optimization Algorithm (QAOA). Students will experiment with many of the design decisions that researchers face when implementing circuit cutting. In addition to quantum computing learning objectives, students will gain transferable skills in high performance computing as they simulate large scale quantum algorithms on a GPU using Message Processing Interface (MPI).
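For readers unfamiliar with the Max-Cut objective that QAOA optimizes in this module, the hedged sketch below evaluates it classically by brute force on a toy graph; the edge list and graph size are illustrative assumptions, and no quantum circuit or circuit cutting is involved.

```python
# Hedged sketch: the classical Max-Cut objective, evaluated by brute force on a
# tiny example graph. QAOA searches for the same maximum with a quantum circuit.
from itertools import product

edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]   # illustrative example graph
n = 4

def cut_value(bits):
    # Number of edges whose endpoints fall on opposite sides of the partition.
    return sum(1 for u, v in edges if bits[u] != bits[v])

best = max(product([0, 1], repeat=n), key=cut_value)
print("best partition:", best, "cut size:", cut_value(best))
```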
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionVideo AI and transmission are used in various fields. Video compression relies on many combinatorial optimization problems, which are difficult for conventional computers to solve because their runtime grows exponentially with problem size. To solve such problems, Ising machines specialized for combinatorial optimization have attracted attention. In video encoding, selecting appropriate prediction modes while considering the specific characteristics of the videos is essential. This paper proposes an Ising-based decision method for the intra prediction mode in video coding. The combinatorial optimization problem in intra prediction is formulated as a QUBO model and then solved using an Ising machine. Experiments using the Fixstars Amplify Annealing Engine show that the proposed method runs 70 times faster than conventional computers while choosing the same optimal mode. As a result, problems of sizes previously unsolvable by conventional computers can be solved by the proposed method.
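As a hedged illustration of the kind of QUBO formulation described above, the sketch below encodes a toy one-hot mode-selection problem and solves it by brute force; the per-mode costs and penalty weight are made-up values, not the paper's cost model or the Fixstars Amplify API.

```python
# Hedged sketch of a one-hot mode-selection QUBO: x_m = 1 if mode m is chosen,
# with a per-mode cost and a penalty enforcing that exactly one mode is selected.
import numpy as np

costs = np.array([4.0, 2.5, 3.1, 5.0])   # hypothetical per-mode distortion costs
n = costs.size
penalty = 10.0                            # weight of the one-hot constraint

# Energy is x^T Q x with x in {0,1}^n. The constraint (sum_m x_m - 1)^2 expands to
# -sum_m x_m + 2 * sum_{m<k} x_m x_k + const, giving the diagonal and off-diagonal terms.
Q = np.diag(costs - penalty)
Q += penalty * (np.ones((n, n)) - np.eye(n))

# Brute-force "annealer" for this tiny instance (a real Ising machine samples instead).
best_x, best_e = None, float("inf")
for i in range(2 ** n):
    x = np.array([(i >> b) & 1 for b in range(n)])
    e = x @ Q @ x
    if e < best_e:
        best_x, best_e = x, e
print("selected mode:", int(np.argmax(best_x)), "energy:", best_e)
```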
Exhibitor Forum
Facilities
TP
XO/EX
DescriptionTwo-phase cooling in data center servers is a specific form of liquid cooling wherein a saturated liquid absorbs heat from high-power electronics, such as processors, causing the liquid to boil and change phase to a vapor. This form of liquid cooling has been given much attention recently due to the potential of absorbing incredibly large amounts of heat using a dielectric refrigerant that will not damage the electronics like water-based coolants.
During this session, we will objectively analyze the heat transfer performance and environmental impacts of two-phase liquid cooling compared with single-phase water cooling. We will review each of the fundamental claims that two-phase cooling advertises, along with simulation results illuminating practical engineering opportunities and challenges to deploying two-phase data center cooling.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionPower has been a key constraint for supercomputers, and limitations on power have become increasingly noticeable through the exascale era. Limited power availability pushes facilities to operate under power constraints and develop power management methods, making it crucial to understand applications' power consumption behavior and their performance under power constraints. In this study, we examine the power consumption of MILC, a widely used lattice quantum chromodynamics application, on the Perlmutter GPU system at NERSC. We analyze the power consumption of the Generation and Spectrum applications of MILC using varying parallel concurrencies and input sizes. We then investigate performance under GPU power caps and show that MILC is well suited for GPU power capping: caps as low as 50% of the GPU's TDP can be applied to MILC jobs with less than a 15% decrease in performance.
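A minimal sketch of a power-cap sweep in the spirit of this study is shown below, assuming an NVIDIA GPU, administrator privileges for changing nvidia-smi power limits, and a placeholder launch script (./run_milc.sh) standing in for the actual MILC job.

```python
# Hedged sketch of a GPU power-cap sweep. "./run_milc.sh" is a placeholder for
# whatever launches the application; setting the power limit requires root and a
# value within the device's allowed range.
import subprocess
import time

def run_with_power_cap(watts):
    # Set the power limit on GPU 0, then time one application run under that cap.
    subprocess.run(["nvidia-smi", "-i", "0", "-pl", str(watts)], check=True)
    start = time.time()
    subprocess.run(["./run_milc.sh"], check=True)   # placeholder application launcher
    return time.time() - start

baseline = run_with_power_cap(400)                  # e.g. the GPU's full TDP
for cap in (300, 250, 200):                         # progressively tighter caps
    elapsed = run_with_power_cap(cap)
    print(f"{cap} W: {elapsed / baseline - 1:+.1%} runtime vs. uncapped")
```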
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAlltoall collective operations in MPI are critical in several types of computation, including matrix multiplication and transposition, and in machine learning applications. As a result, it is essential that these operations are performant for large amounts of data. Meanwhile, dragonfly networks are becoming more common in state-of-the-art supercomputers. However, there has been little analysis of the performance of alltoall operations on these networks. The hierarchical and modular nature of dragonfly networks creates distinct challenges for alltoall operations, yet typical alltoall algorithms fail to account for topology. In this poster, we analyze the performance of alltoall algorithms in four scenarios and discuss the conditions under which each algorithm performs best.
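As a hedged illustration of the personalized all-to-all exchange analyzed here, the mpi4py sketch below performs a small alltoall among whatever ranks it is launched with; it does not implement the topology-aware algorithms studied in the poster.

```python
# Hedged mpi4py sketch of a personalized all-to-all exchange: each rank sends one
# distinct item to every other rank. Run with, e.g., `mpirun -n 4 python demo.py`
# (assuming mpi4py is installed).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Item j is destined for rank j; after alltoall, slot j holds what rank j sent us.
send = [f"from {rank} to {dst}" for dst in range(size)]
recv = comm.alltoall(send)
print(rank, recv)
```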
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionWe present a set of tools to support analysis of the utilization of high performance computing (HPC) resources by extending PIKA, a site-level monitoring tool, and Vampir, a well-known performance analysis tool. We show an approach to transforming site-level data into a form that Vampir (and other application-level performance tools) can use to analyze cluster-level data as well. Finally, we show that the end product of this transformation allows useful performance analysis of cluster behavior.
Birds of a Feather
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the above mentioned problem, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Birds of a Feather
TP
XO/EX
DescriptionANARI is a 3D rendering standard for portable 3D visualization using state-of-the-art rendering algorithms and hardware acceleration technologies. ANARI is open, royalty-free, and is developed by the Khronos ANARI Working Group and its Advisory Panel. The ANARI BoF session will be a venue for direct interaction between the ANARI Working Group, ANARI renderer implementers, middleware developers, and HPC visualization application developers. Discussion topics and lightning talks will include visualization application technical requirements relevant to the ANARI SDK and the standard APIs, opportunities for ANARI extensions and future feature standardization, ANARI user experiences and best practices, and ANARI hackathons.
Exhibits
Flash Session
TP
XO/EX
DescriptionFluent® is an industry-leading Computational Fluid Dynamics (CFD) package developed by ANSYS and used worldwide for a variety of applications. While Fluent has been highly optimized for traditional CPU-based High Performance Computing (HPC) systems, the latest release, version 2024R2, has been written to fully harness NVIDIA Tensor Core GPU acceleration for CFD workloads. This talk discusses the advantages offered by Fluent, accelerated by NVIDIA, at Oracle Cloud Infrastructure (OCI).
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionCancer is a family of complex genetic disorders characterized by the accumulation of genetic and epigenetic alterations that drive uncontrolled cell growth and metastasis. It is becoming a leading cause of death worldwide, accounting for 10 million deaths in 2020. In 2024, NIH projects that roughly 2 million people will be diagnosed with cancer in the United States. However, developing novel therapies or selecting a best-suited treatment for a patient poses a challenge to the scientific community due to the individualized nature of the disease.
Cancer research is often hampered by limited data, particularly in the context of rare cancer types and heterogeneous disease mechanisms, restricting the utility of the latest advances in Artificial Intelligence (AI) in this domain. This issue is especially dire for more refined biological models, such as patient-derived xenografts (PDX), which reflect patient treatment responses more accurately than immortalized cell lines but are more costly in terms of time and resources for data generation. Due to the high complexity and cost of PDX experiments, it is unlikely that datasets on the scale of traditional AI domains like computer vision or natural language processing can be gathered. This situation calls for more efficient approaches to learning from data.
Human reasoning heavily relies on the context surrounding the problem. To achieve generalizability, we compare various settings and examples to find key differences in subjects and outcomes. This strategy is critical for efficiently utilizing limited available data for learning. Currently, the closest analog to this strategy in the AI field is Contrastive Learning (CL). This approach leverages the relatively abundant cell line drug screening data for transfer learning to drug response prediction of PDXs. CL utilizes not only positive data (responsive samples) but also negative data (non-responsive samples), which is ubiquitously generated during drug response experiments. The most famous example of CL is Contrastive Language-Image Pretraining (CLIP), which bridges the gap between textual and visual information. We adopt a similar approach to create compatible representations between different biological models, resulting in Contrastive Transfer Learning. This emerging machine learning paradigm improves the performance and stability of AI models across different application domains by emphasizing the learning of more generalizable feature representations. To enhance model explainability, we create a biologically guided neural network based on the KEGG pathways and BRITE hierarchy so we can elucidate the decision process via its activation pass.
We explored the utility of the proposed architecture in drug-specific response modeling, where we constructed an individual response model for each drug based on cancer gene expressions. We used the area under the receiver-operator curve (AUROC) to evaluate prediction performance. We compare our approach to the stand-alone fully connected neural network (FCNN) via 5-fold cross-validation (CV). Our approach improves the average AUROC score and reduces its standard deviation (std) from CV trials, producing more accurate and stable prediction results. Baseline FCNN performance for Selumetinib, Bortezomib and Paclitaxel is AUROC=0.87, std=0.188; AUROC=0.92, std=0.068; AUROC=0.87, std=0.077, respectively; while our approach achieves AUROC=0.99, std=0.003; AUROC=0.99, std=0.008; AUROC= 0.97, std=0.018 for these three drugs, correspondingly.
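A minimal sketch of the evaluation protocol described above (5-fold cross-validated AUROC) is shown below, using synthetic data and a generic scikit-learn classifier as stand-ins for the gene-expression features and the proposed network.

```python
# Hedged sketch of 5-fold cross-validated AUROC for a drug-response classifier.
# Synthetic features stand in for gene expression; any classifier could be swapped in.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=200, n_informative=20,
                           random_state=0)           # stand-in expression matrix
model = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, scoring="roc_auc", cv=cv)
print(f"AUROC = {scores.mean():.3f} +/- {scores.std():.3f}")
```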
Paper
Accelerators
HPC Infrastructure
Performance Evaluation and/or Optimization Tools
State of the Practice
TP
DescriptionBenchmarks are essential in the design of modern HPC installations, as they define key aspects of system components. Beyond synthetic workloads, it is crucial to include real applications that represent user requirements in benchmark suites, to guarantee high usability and widespread adoption of a new system. Given the significant investments in leadership-class supercomputers of the exascale era, this is even more important and necessitates alignment with a vision of Open Science and reproducibility.
In this work, we present the JUPITER Benchmark Suite, which incorporates 16 applications from various domains. It was designed for and used in the procurement of JUPITER, the first European exascale supercomputer. We identify requirements and challenges and outline the project and software infrastructure setup. We provide descriptions and scalability studies of selected applications and a set of key takeaways. The JUPITER Benchmark Suite is released as open-source software with this work.
Birds of a Feather
TP
XO/EX
DescriptionFortran plays a crucial role in numerous applications. This BoF provides a forum for Fortran developers to engage with the language's modern programming features. With features introduced in recent language revisions, Fortran 2023 supports modern programming practices and high-performance computing (HPC). This BoF gathers developers from diverse domains to share experiences and explore Fortran's evolving capabilities. After some brief presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our experts.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionThe High-Performance Computing (HPC) paradigm, which forms the backbone of global cloud infrastructure, has experienced exponential growth through multiple iterations.
Initially used for scientific computing, later for productivity, and more recently for Artificial Intelligence (AI) based services, HPC has seen application growth that further contributes to the complexity of its hardware-software ecosystem, creating new challenges and opportunities in the domain. Because computing is innately distributed, tiered, and evolving, several decisions must be made locally across non-overlapping decision boundaries, often with multiple local and global objectives for optimization. Such optimization goals span energy, cost, performance, and environmental impact. In this paper, we present our most recent work on applications of AI for HPC, with a special focus on federated digital twins, intelligent storage buffer caches, application performance projection, and energy-aware scheduling.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
DescriptionThe growing demands across various scientific fields have led to a significant shift toward applications that consume data at the edge of the computing continuum. These applications require unified programming models for composing components and coordinating the execution of computational workloads, including training machine learning (ML) models on distributed resources. Personalized healthcare, which often leverages data generated from wearable devices to train ML models, can benefit from distributed computing approaches. Stroke care, in particular, can benefit greatly from distributed ML, since modifiable risk factors can be monitored using wearable devices. In this work, we present an implementation that leverages distributed techniques for large-scale ML workflows using electrocardiogram (ECG) recordings for atrial fibrillation (AF) classification. The application was evaluated using the PhysioNet database, showcasing the potential of distributed ML in stroke care and opening the way for future development of more advanced models embedded in edge devices.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThe Flexible Modeling System (FMS) Runtime Environment (FRE) handles post-processing of the National Oceanic and Atmospheric Administration (NOAA) research climate models on a computing cluster called Post Processing and Analysis (PPAN). FRE is currently only used on PPAN, but NOAA wishes to provide a reproducible environment that can be used by collaborators and the climate research community running future models. A containerized version of the post-processing workflow has been developed using Podman, Singularity, and Apptainer. Designed to reduce user input, the container can handle directory setup, experiment configuration, and experiment running with minimal human interaction. Input is mostly passed in the form of a YAML file, from which experiment details are read. The container is currently used solely on NOAA infrastructure, but it will soon be tested on external systems as well as the cloud.
Workshop
Software Engineering
W
Paper
Artificial Intelligence/Machine Learning
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
DescriptionRecently, the sparsely-gated Mixture-Of-Experts (MoE) architecture has garnered significant attention. To benefit a wider audience, fine-tuning MoE models on more affordable clusters, which are typically a limited number of bandwidth-constrained GPU nodes, holds promise. However, it is non-trivial to apply existing cost-effective fine-tuning approaches to MoE models, due to the increased ratio of data to computation.
In this paper, we introduce APTMoE, which employs affinity-aware pipeline parallelism for fine-tuning MoE models on bandwidth-constrained GPU nodes. We propose an affinity-aware offloading technique that enhances pipeline parallelism for both computational efficiency and model size, and it benefits from a hierarchical loading strategy and a demand-priority scheduling strategy. Experiments demonstrate that APTMoE outperforms existing methods in most cases. Particularly, APTMoE successfully fine-tunes a 61.2B MoE model on 4 Nvidia A800 GPUs(40GB) and achieves up to 33% throughput improvement compared to the SOTA method.
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionAs the size of large language models and their processing needs continue to increase, the compute infrastructure must adapt to handle them reliably. In particular, in addition to having a large number of processing units, the platform needs to provide guarantees on fabric and I/O, as well as software strategies to schedule jobs and cache data reliably. In this work, we show how strategic choices in reference design definitions, combined with versatile scheduling, checkpointing, and validation strategies, can help leverage the infrastructure for best performance. We also review how scaling up to extreme scale impacts hardware and software implementation choices for LLMs.
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionLike many other hyperscaler companies, Meta relies heavily on AI in its internal systems for numerous tasks, including recommending personalized content to users; understanding text, speech, and visual content to recognize things such as hate speech; and generating text, image, and video content for users. While these AI workloads share some similarities with traditional scientific computing applications, they are far less structured: their computational requirements, memory access patterns, data I/O approaches, and network communication tend to be highly irregular, posing tremendous challenges for modern supercomputing system architectures. In this talk, I'll discuss some of these architectural challenges and allude to some potential hardware and software solutions that Meta, as well as the broader community, is exploring to address them.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe data used comes from E3SM-MPAS, a global climate model run at Los Alamos National Laboratory. ParaView, an open-source scientific visualization system, was used to transform the raw data to renderable geometry. Artifact-Based Rendering (www.sculpting-vis.org), a research system developed by a collaboration between the University of Minnesota and the Texas Advanced Computing Center at the University of Texas at Austin, was then used to add artist-made data-driven visual attributes and render the scene. No ML programs were used.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionIn modern HPC systems, performance measurements are often disturbed by noise.
Because repeating measurements to increase confidence in their results is costly, alternative noise-resilient techniques are desirable.
Therefore, we implement a logical clock, which does not rely on real-time measurements, in Score-P.
We explore several methods to model computational work with the time stamps, counting OpenMP loop iterations, LLVM basic blocks/statements, or hardware counters.
We demonstrate the strengths and weaknesses of using logical time stamps in a trace analysis workflow with Score-P and Scalasca, by evaluating the performance problems we can find in three MPI+OpenMP mini-apps.
By design, logical measurements reliably show algorithmic issues, such as load imbalance, but cannot capture external aspects of program execution, for example memory contention.
In summary, logical-time based measurements are a specialized but valuable addition to the performance analyst's toolbox.
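As a hedged, simplified illustration of the logical-clock idea (time stamps that advance with the amount of work performed rather than with wall-clock time), consider the toy counter below; it is not Score-P's implementation.

```python
# Hedged sketch of a logical clock: time stamps advance with work performed (here,
# loop iterations), so repeated runs produce identical "timings" regardless of noise.
class LogicalClock:
    def __init__(self):
        self.ticks = 0

    def advance(self, units=1):        # e.g. one unit per loop iteration
        self.ticks += units

    def now(self):
        return self.ticks

clock = LogicalClock()
events = []
for i in range(1000):
    clock.advance()                    # logical work, immune to OS/measurement jitter
    if i % 250 == 0:
        events.append(("checkpoint", clock.now()))
print(events)
```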
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionRecently, stream processing engines (SPEs) have begun to support plugin-based connectors that streamline online indexing and online query serving. Users can seamlessly integrate the online indexing/serving stack with a few code modifications. The SPE transforms massive in-flight datasets into high-dimensional vector embeddings and delegates storage of the embeddings to the vector database; it also queries the vector database to find similar data. However, the loose coupling of the streaming engine and the vector database ignores the internal operations of each engine, which can lead to performance bottlenecks in online data indexing and query-serving scenarios. In a preliminary experiment, we observed high tail latency for query serving when data indexing overlaps with it. Based on these results, we identify the potential performance bottlenecks that cause high query-serving tail latency and present future work to mitigate the problem.
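A small sketch of the tail-latency comparison behind this observation is given below; the latency distributions are synthetic stand-ins for measured query-serving latencies with and without concurrent indexing.

```python
# Hedged sketch of a tail-latency measurement: compare median and 99th-percentile
# query latency with and without overlapped indexing. Values here are synthetic.
import numpy as np

rng = np.random.default_rng(1)
baseline = rng.gamma(shape=2.0, scale=5.0, size=10_000)               # ms, idle index
overlapped = np.concatenate([baseline, rng.gamma(2.0, 40.0, 500)])    # ms, with indexing

for name, lat in (("baseline", baseline), ("indexing overlapped", overlapped)):
    print(f"{name:>20}: p50={np.percentile(lat, 50):6.1f} ms  "
          f"p99={np.percentile(lat, 99):6.1f} ms")
```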
Birds of a Feather
TP
XO/EX
DescriptionThe needs of industrial users are special in that they are characterised by the diversity of their pricing, access timing, security, licensing, and data management requirements.
ETP4HPC of Europe and HPC4EI of the U.S. are in the process of collecting the characteristics of such needs, together with best practices in meeting them.
This BoF aims at sharing the European and U.S. expertise and juxtaposing it against the experiences of other industrial users, service providers, and access programs.
The two organizations will produce a public repository of best practices that will be available on the website of ETP4HPC.
Exhibitor Forum
Hardware Technologies
Network
TP
XO/EX
DescriptionThe Arista EtherLink AI networking platform introduces a modular, scalable architecture with a unified lossless dataplane across distributed, independent components. Each component, equipped with its own control and data planes, is interconnected via high-speed Ethernet links.
By embracing Ethernet, it ensures interoperability and optimizes for currently available, cost-efficient RDMA NICs.
Key technical benefits:
- Scalable AI Networking: EtherLink losslessly connects 32,000 XPUs in one cluster, or 100,000 across multiple data centers at 800 Gbps.
- Proven Reliability and Performance: The system has been field-tested by hyperscalers and is validated through large-scale simulations. Simulations indicate a 10% to 30% improvement in job completion time.
- Fast Failure Recovery: EtherLink uses hardware-accelerated link fault detection and repair for milliseconds-level recovery.
- EOS Integration: Arista's EOS® provides centralized control, telemetry, and quality of service (QoS) management, supporting configuration, monitoring, and debugging of AI/ML workloads from network to NIC.
At its core, EtherLink incorporates Broadcom's Jericho3 packet processor and Ramon3 fabric chips. These components target the requirements of AI/ML, offering advanced load balancing, congestion management, and fault resilience.
This presentation discusses the design, motivations, and technical foundations of the EtherLink solution.
Birds of a Feather
TP
XO/EX
DescriptionWe propose the first BoF on Artificial Intelligence and Machine Learning for HPC Workload Analysis to provide a much-needed opportunity not only for cutting-edge research ideas to be shared, but also to bring together researchers across the disciplines of data science, machine learning, statistics, applied mathematics, systems design, systems monitoring, systems resilience, and hardware architecture to address a shared goal of better and more efficient monitoring and understanding of the usage of large-scale computing machines and facilities through artificial intelligence and machine learning.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAs data from various domains is increasingly shared and processed in the cloud, Homomorphic Encryption (HE) provides a crucial solution for ensuring privacy in the post-quantum era. In this work, we evaluate the performance and accuracy of HE matrix multiplication using SEAL library kernels. Moreover, we compare it against EVA, which offers optimized HE parameters for this operation. Beyond performance results, our poster shows how working with more appropriate parameters reduces execution time while preserving the accuracy of the result.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionWith an ever-growing compute advantage over CPUs, GPUs are often used in workloads with ample BLAS computation to improve performance. However, several factors, including data-to-compute ratio, amount of data re-use, and data structure, can all impact performance. Hence, using a GPU is not a guarantee of better BLAS performance. In this work, we introduce the GPU BLAS Offload Benchmark (GPU-BLOB), a novel and portable benchmark that measures CPU and GPU compute performance of different BLAS kernels and problem configurations. From the GPU offload threshold (a BLAS kernel’s minimum dimensions for a certain configuration where using a GPU is guaranteed to yield improved performance), we evaluate the per-node performance of three in-production HPC systems. We show that the offload threshold for GEMM is highly dependent on problem shape and number of consecutive BLAS calls, and that, contrary to conventional wisdom, GEMV can benefit from GPU acceleration, especially on SoC-based systems.
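As a hedged illustration of how an offload threshold can be located, the sketch below times the same single-precision GEMM with NumPy (CPU BLAS) and CuPy (GPU), including host-device transfers; it assumes CuPy and a CUDA-capable GPU are available and is far simpler than GPU-BLOB itself.

```python
# Hedged sketch of locating a GEMM offload threshold: time the same matrix multiply
# on CPU and GPU across sizes and report where the GPU (including transfers) wins.
import time
import numpy as np
import cupy as cp   # assumes CuPy and a CUDA GPU are available

def time_cpu_gemm(n):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    t0 = time.perf_counter()
    np.matmul(a, b)
    return time.perf_counter() - t0

def time_gpu_gemm(n):
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    t0 = time.perf_counter()
    # Include host-to-device and device-to-host transfers, as in an offload scenario.
    cp.asnumpy(cp.matmul(cp.asarray(a), cp.asarray(b)))
    return time.perf_counter() - t0

for n in (128, 256, 512, 1024, 2048, 4096):
    t_cpu, t_gpu = time_cpu_gemm(n), time_gpu_gemm(n)
    verdict = "offload to GPU" if t_gpu < t_cpu else "stay on CPU"
    print(f"n={n:5d}  cpu={t_cpu*1e3:9.2f} ms  gpu={t_gpu*1e3:9.2f} ms  -> {verdict}")
```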
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn the dynamic landscape of smart cities, it is necessary to harness the potential of real-time traffic data and high-performance computing to optimize traffic flow through dynamic re-routing strategies. Our research contributes to the assessment of how real-time traffic optimization and alternative route computation influence the overall improvement of traffic flow within cities. Our experimental scenarios take place under various traffic modeling and computational conditions, and we deliver a scalability analysis and evaluation of several simulations on high-performance computing. Our approach involves simulations where different portions of vehicles dynamically adjust routes based on real-time traffic information. Scalability tests with varying computational workers and nodes also assess our traffic simulator's capacity for scaling. One of our main findings shows that informed management of live traffic data and selective alternative route computation can have a significant impact on overall driving time and traffic flow within a city.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionContinual pretraining of large language models on domain-specific data has been proposed to enhance performance on downstream tasks. In astronomy, the previous absence of astronomy-focused benchmarks has hindered objective evaluation of these specialized LLM models. Leveraging a recent initiative to curate high-quality astronomical MCQs, this study aims to quantitatively assess specialized LLMs in astronomy. We find that the previously released AstroLLaMA series, based on LLaMA-2-7B, underperforms compared to the base model. We demonstrate that this performance degradation can be partially mitigated by utilizing high-quality data for continual pretraining, such as summarized text from arXiv. Despite the observed catastrophic forgetting in smaller models, our results indicate that continual pretraining on the 70B model can yield improvements. However, the current supervised fine-tuning dataset still constrains the performance of instruct models. In conjunction with this study, we introduce a new set of models, AstroLLaMA-3-8B and AstroLLaMA-2-70B, building upon the previous AstroLLaMA series.
Paper
Accelerators
Algorithms
Data Movement and Memory
Graph Algorithms
TP
DescriptionInfluence maximization (IM) is the problem of finding the k most influential nodes in a graph. We propose distributed-memory parallel algorithms for the two main kernels of a state-of-the-art implementation of one IM algorithm, IMM. The baseline relies on a bulk-synchronous parallel approach and uses replication to reduce communication and achieve approximate load balance, at the cost of synchronization and high memory requirements. By contrast, our method fully distributes the data, thereby improving memory scalability, and uses fine-grained asynchronous parallelism to improve network utilization and offset the cost of the additional communication. We show that our design and implementation can achieve up to 29.6x speedup over the MPI-based state of the art on synthetic and real-world network graphs. Moreover, ours is the first implementation that can run IMM to find influencers in the 'twitter' graph (41M nodes and 1.4B edges) in 200 seconds using 8K CPU cores of the NERSC Perlmutter supercomputer.
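For context, the hedged sketch below shows the two kernels the paper parallelizes, reverse-reachable (RR) set sampling and greedy seed selection, in their simplest sequential form on a toy graph; the graph, edge probability, and sample count are illustrative assumptions, not the paper's distributed implementation.

```python
# Hedged sketch of the core IMM idea: sample reverse-reachable (RR) sets under the
# independent-cascade model, then greedily pick the k nodes covering the most sets.
import random
random.seed(0)

edges = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}   # toy directed graph
# Reverse adjacency: who points *to* each node (needed for reverse reachability).
rev = {v: [] for v in edges}
for u, outs in edges.items():
    for v in outs:
        rev[v].append(u)
p = 0.5                      # edge activation probability (illustrative)
num_sets, k = 2000, 2

def sample_rr_set():
    root = random.choice(list(edges))
    seen, frontier = {root}, [root]
    while frontier:
        v = frontier.pop()
        for u in rev[v]:
            if u not in seen and random.random() < p:
                seen.add(u)
                frontier.append(u)
    return seen

rr_sets = [sample_rr_set() for _ in range(num_sets)]
seeds, covered = set(), set()
for _ in range(k):           # greedy max-coverage over the sampled RR sets
    gains = {v: sum(1 for i, s in enumerate(rr_sets) if i not in covered and v in s)
             for v in edges}
    best = max(gains, key=gains.get)
    seeds.add(best)
    covered |= {i for i, s in enumerate(rr_sets) if best in s}
print("estimated top-k influencers:", seeds)
```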
Exhibits
SCinet
TP
XO/EX
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionThis paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize communication costs, we formulate an Integer Linear Program that rewards simulation of "nearby" gates on "nearby" GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2x on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.
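As a hedged, single-node illustration of Schrödinger-style simulation, the sketch below holds the full state vector in a NumPy array and applies single-qubit gates by tensor contraction; Atlas partitions and schedules this kind of computation across multi-node GPUs, which this toy code does not attempt.

```python
# Hedged sketch of Schrodinger-style state-vector simulation: the state of n qubits
# lives in a (2,)*n array and each gate contracts against one qubit's axis.
import numpy as np

n = 3                                            # toy circuit width
state = np.zeros((2,) * n, dtype=complex)
state[(0,) * n] = 1.0                            # |000>

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
X = np.array([[0, 1], [1, 0]], dtype=complex)

def apply_1q(state, gate, qubit):
    # Contract the 2x2 gate with the chosen qubit axis, then restore axis order.
    out = np.tensordot(gate, state, axes=([1], [qubit]))
    return np.moveaxis(out, 0, qubit)

state = apply_1q(state, H, 0)
state = apply_1q(state, X, 2)
probs = np.abs(state.reshape(-1)) ** 2
print("measurement probabilities:", np.round(probs, 3))
```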
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe general public has access to Large Language Models (LLMs), allowing users to obtain quick answers about any topic. While the generated content tends to be correct, fostering trust in the models, all autoregressive next-token-prediction LLMs can hallucinate. This leads to the spread of misinformation and possibly severe consequences when using LLMs for high-risk applications. Thus, we need proper validation systems for the generated content.
We develop an LLM attribution algorithm using post-generation retrieval, based on Retrieval-Augmented Generation (RAG). This algorithm has a simple implementation and uses a pre-trained LLM, making it accessible to the public.
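A minimal sketch of post-generation retrieval attribution is shown below, using TF-IDF cosine similarity as a stand-in retriever; the sentences and sources are invented examples, and the full system would use an LLM-based retriever or embedding model instead.

```python
# Hedged sketch of post-generation retrieval attribution: for each generated sentence,
# retrieve the most similar source passage and report its similarity as an attribution
# score. TF-IDF similarity stands in for the real retriever/embedding model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sources = [
    "The Kestrel supercomputer is operated by NREL.",
    "Atrial fibrillation can be detected from ECG recordings.",
    "Dragonfly networks reduce cable lengths compared to fat-trees.",
]
generated = [
    "ECG data can reveal atrial fibrillation.",
    "Fat-tree networks need more cabling than Dragonfly topologies.",
]

vec = TfidfVectorizer().fit(sources + generated)
sims = cosine_similarity(vec.transform(generated), vec.transform(sources))
for sent, row in zip(generated, sims):
    best = row.argmax()
    print(f"{sent!r} -> source {best} (score {row[best]:.2f})")
```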
Paper
Cloud Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
State of the Practice
TP
DescriptionCheckpoint/Restart (C/R) has been widely deployed in numerous HPC systems, clouds, and industrial data centers, which are typically operated by system engineers. Nevertheless, no existing approach helps system engineers without domain expertise, or domain scientists without system fault-tolerance knowledge, identify the critical variables that must be checkpointed to correctly restore application execution after a failure. To address this problem, we propose an analytical model and a tool (AutoCheck) that can automatically identify critical variables to checkpoint for C/R. AutoCheck relies, first, on analytically tracking and optimizing data dependencies between variables and other application execution state and, second, on a set of heuristics that identify critical variables for checkpointing from the refined data-dependency graph (DDG). AutoCheck allows programmers to pinpoint critical variables to checkpoint within a few minutes. We evaluate AutoCheck on 13 representative HPC benchmarks, demonstrating that it can efficiently identify the correct critical variables to checkpoint.
Paper
Accelerators
Compilers
Embedded and/or Reconfigurable Systems
Linear Algebra
Performance Evaluation and/or Optimization Tools
TP
DescriptionThis paper presents an open-source library that pushes the limits of performance portability for irregular General Matrix Multiplication (GEMM) computations on the widely-used Arm architectures. autoGEMM generates optimized kernels for various hardware configurations by auto-combining fragments of auto-generated micro-kernels that employ hand-written optimizations to maximize computational efficiency. We optimize the kernel pipeline by tuning the register reuse and the data load/store overlapping. In addition, we use a dynamic tiling scheme to generate balanced tile shapes, based on the shapes of the matrices. We build autoGEMM on top of the TVM framework where our dynamic tiling scheme prunes the search space for TVM to identify the optimal combination of parameters for code optimization. Evaluations on five different classes of Arm chips demonstrate the advantages of autoGEMM. For small matrices, autoGEMM achieves 98% of peak and up to 2.0x speedup over state-of-the-art libraries such as LIBXSMM and LibShalom. autoGEMM is available at:https://github.com/wudu98/autoGEMM.
Paper
Accelerators
Compilers
Embedded and/or Reconfigurable Systems
Linear Algebra
Performance Evaluation and/or Optimization Tools
TP
DescriptionFinite-difference methods based on high-order stencils are widely used in seismic simulations, weather forecasting, and computational fluid dynamics. Recently, multiple research groups have begun exploring the use of dataflow architectures, such as Cerebras' wafer-scale engine, to accelerate stencil computations. However, implementations of stencil computations for dataflow architectures must address unique challenges, such as managing the routing of data communications and accommodating a significantly constrained memory footprint. These make hand-crafting code for a dataflow architecture difficult and time-consuming. This paper describes a framework for developing portable, high-performance implementations of stencil computations for modern node architectures. The paper focuses on code generation strategies for the Cerebras wafer-scale engine, including code generation of router configurations and sequencing of communication for high-order stencils. A 25-point star-shaped stencil written using our tool is 7x shorter than hand-crafted code written in Cerebras Software Language (CSL), and it delivers comparable performance to manually written code.
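As a hedged illustration of the star-shaped stencils targeted by the code generator, the NumPy sketch below applies a radius-3 2D star stencil with periodic boundaries; the coefficients are illustrative, and nothing here reflects CSL code generation or the wafer-scale engine's routing.

```python
# Hedged sketch of a high-order star stencil: each output point is a weighted sum of
# the center and its neighbors along the coordinate axes (radius 3 -> 13 points in 2D).
import numpy as np

def star_stencil(u, weights):
    # weights[0] is the center coefficient; weights[r] multiplies neighbors at
    # offsets +-r along each axis. np.roll gives periodic boundaries for simplicity.
    out = weights[0] * u
    for r, w in enumerate(weights[1:], start=1):
        for axis in (0, 1):
            out += w * (np.roll(u, r, axis=axis) + np.roll(u, -r, axis=axis))
    return out

u = np.random.rand(256, 256)
weights = [-4.0, 1.0, 0.5, 0.25]          # illustrative center + radius 1..3 coefficients
v = star_stencil(u, weights)
print(v.shape, v.mean())
```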
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionRecent trends in HPC systems increasingly emphasize accelerators, particularly GPUs, as autonomous execution units, shifting control of entire program execution to GPUs. In this work, we aim to bridge this gap with a compiler and provide a productive method for writing efficient GPU-first code. We design and develop a code generator that efficiently fuses and schedules persistent kernels, provides high-level abstractions over device resources, and enables GPU-initiated communication within Python code using NVSHMEM to realize autonomous multi-GPU execution. We compare our implementation to other accelerated Python compilers including CuPy, DaCe, and cuNumeric on 22 NPBench kernels. We additionally perform a scaling study of distributed 2D/3D Jacobi and observe a speedup of 6.1x and 30.8x over DaCe and cuNumeric, respectively, on 8 GPUs for the 3D case with a scaling efficiency of 98%.
Birds of a Feather
TP
XO/EX
DescriptionBatch computations solve relatively small, independent problems on HPC architectures. For over a decade, there has been a significant demand for high-performance batch linear algebra (LA) software, especially for uniform batches. However, we might be just scratching the surface of what batch LA software can offer. From non-uniform batches to batch sparse algorithms, and even JIT-compiled linear operators, applications are constantly pushing the boundaries of batch LA software. Interested audience members are encouraged to attend this BoF, listen to short presentations by experts from academia and industry, and share their feedback and experiences with the community.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionWe develop a distributed-memory parallel algorithm for performing batch updates on streaming graphs, where vertices and edges are continuously added or removed. Our algorithm leverages distributed sparse matrices as the core data structures, utilizing equivalent sparse matrix operations to execute graph updates. By reducing unnecessary communication among processes and employing shared-memory parallelism, we accelerate updates of distributed graphs. Additionally, we maintain a balanced load in the output matrix by permuting the resultant matrix during the update process. We demonstrate that our streaming update algorithm is at least 25 times faster than alternative linear-algebraic methods and scales linearly up to 4,096 cores (32 nodes) on a Cray EX supercomputer.
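A small sketch of the linear-algebraic view of batch updates is given below, using SciPy sparse matrices on a single node as a stand-in for the paper's distributed sparse-matrix data structures.

```python
# Hedged sketch of batch edge insertion expressed as sparse-matrix addition, the
# linear-algebraic core of the streaming-update approach described above.
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

n = 6
# Existing adjacency matrix in CSR form.
A = csr_matrix(([1, 1, 1], ([0, 1, 2], [1, 2, 3])), shape=(n, n))

# A batch of new edges arrives as (src, dst) pairs.
batch = [(3, 4), (4, 5), (0, 5)]
rows, cols = zip(*batch)
update = coo_matrix((np.ones(len(batch)), (rows, cols)), shape=(n, n))

A = (A + update.tocsr()).tocsr()          # apply the whole batch in one operation
print(A.toarray())
```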
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionHandling large graphs in a distributed environment requires effective partitioning across processors and efficient management of local partitions. In 2D partitioning, local graphs often become very sparse, making memory-efficient data structures crucial. Using the Compressed Sparse Row (CSR) format wastes space, especially when, as in the sparse graphs studied here, more than 83% of vertices have no edges. This study explores bit-CSR (BCSR), a modified CSR representation, on GPUs to reduce memory usage in graph computations. We achieved 16.67% memory savings on a sparse rmat dataset with 268 million vertices and 357 million edges, without performance degradation, supported by theoretical and experimental storage savings of 33%. However, we observed a 1.7x slowdown in degree lookup times due to bitwise operations on AMD CPUs. This analysis highlights the potential of BCSR on GPUs for improving Graph500 benchmark performance on GPU-accelerated systems, such as the Frontier supercomputer.
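As a hedged illustration of where the savings come from, the sketch below contrasts a standard CSR row-pointer array with a bitmap-plus-compact-offsets layout in the spirit of BCSR; it illustrates the idea only and is not the paper's exact GPU layout.

```python
# Hedged sketch: vertices with no edges cost a full row-pointer entry in standard CSR,
# but only a single 0 bit in a bitmap-based layout, which is where BCSR-style savings
# for very sparse graphs come from.
import numpy as np

num_vertices = 10
edges = {1: [3, 4], 4: [7], 8: [0, 2, 9]}        # most vertices have no edges

# Standard CSR: indptr has num_vertices + 1 entries even for empty rows.
indptr = np.zeros(num_vertices + 1, dtype=np.int64)
for v, nbrs in edges.items():
    indptr[v + 1] = len(nbrs)
indptr = np.cumsum(indptr)
indices = np.concatenate([np.array(edges[v]) for v in sorted(edges)])  # shared columns

# Bitmap variant: one bit per vertex plus offsets only for the non-empty rows.
bitmap = np.zeros(num_vertices, dtype=bool)
bitmap[list(edges)] = True
compact_indptr = np.zeros(bitmap.sum() + 1, dtype=np.int64)
compact_indptr[1:] = np.cumsum([len(edges[v]) for v in sorted(edges)])

print("CSR indptr bytes:    ", indptr.nbytes)
print("bitmap+offsets bytes:", np.packbits(bitmap).nbytes + compact_indptr.nbytes)
```

The trade-off noted in the abstract also shows up here: looking up a vertex's degree now requires testing and ranking bits in the bitmap rather than a single array access.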
Workshop
Message Passing
Network
W
DescriptionBeatnik is a novel open-source mini-application that exercises the complex communication patterns often found in production codes but rarely found in benchmarks or mini-applications. It simulates 3D Rayleigh-Taylor instabilities based on Pandya and Shkoller's Z-Model formulation using the Cabana performance portability framework. This paper presents both the high-level design and important implementation details of Beatnik, along with four benchmark setups for evaluating different aspects of HPC communication system performance. Evaluation results demonstrate Beatnik's scalability on modern accelerator-based systems using weak and strong scaling tests up to 1,024 GPUs, along with Beatnik's ability to expose communication challenges in modern systems and solver libraries.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionHigh-performance computing (HPC) resources are critical for scientific and engineering computations, but they come with significant costs, making optimal utilization essential. While substantial effort is often invested in performance optimization during the initial deployment, regular performance monitoring tends to diminish during the production phase, potentially leading to undetected degradation in software and hardware performance. The XDMoD Application Kernel Performance Monitoring Module addresses this by automatically executing a suite of applications and benchmarks on a daily basis to identify performance degradation proactively. This module is designed to generate meaningful performance data while keeping short wall times to minimize impact on users. It also includes a performance degradation detection algorithm that can trigger regular or problem-specific email notifications. Operational since 2011, this module has successfully monitored the performance of XSEDE and ACCESS resources. This presentation provides an overview of the module’s capabilities and showcases practical use cases illustrating its effectiveness.
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionContinuous performance monitoring is critical for maintaining optimal performance of high-performance computing resources. This is especially important for technological test-bed systems, on which software updates occur often and performance degradation in one place can be masked by performance improvement in another. This paper reports on our experience running continuous performance monitoring on Ookami, an ARM Fujitsu A64FX machine (the first ARM CPU with SVE-512 support), using XDMoD. After more than three years of monitoring, we found that application and numerical-library performance improved the most with the initial release that added support for the new technology, followed by a series of smaller performance gains. Another interesting observation about numerical libraries is that the most invested vendors produce optimized code faster than community codes.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn this poster, we present the Analytics4X (A4X) framework, a workflow framework that enables systematic studies, evaluation, and modeling of HPC/HTC workflow data movement in a flexible and controlled environment. We apply A4X to an in situ molecular dynamics workflow to assess the performance trade-offs of two data management solutions: DYAD and Lustre. Through this assessment, we illustrate the importance of selecting a data solution that optimizes for both data movement and producer-consumer synchronization.
Workshop
I/O, Storage, Archive
W
DescriptionInterconnects have always played a cornerstone role in HPC. Since the inception of the Top500 ranking, interconnect statistics have been dominated by two competing technologies: InfiniBand and Ethernet. However, even though Ethernet has grown in popularity due to its versatility and cost-effectiveness, InfiniBand used to provide higher bandwidth and continues to feature lower latency. Industry seeks a further evolution of the Ethernet standards to enable fast, low-latency interconnects for emerging AI workloads by offering competitive, open-standard solutions. This paper analyzes early results obtained from two systems relying on an HPC Ethernet interconnect, one using 100G and the other 200G Ethernet. Preliminary findings indicate that the Ethernet-based networks exhibit competitive performance, closely aligning with InfiniBand, especially for large message exchanges.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis work compares performance of various QUBO software tools on different platforms available on the Sol supercomputer at Arizona State University. CPU, GPU and the NEC vector engine (VE) card provide various means to implement these computations on simulated qubits while employing various solvers, including simulated quantum annealing. As current quantum hardware cannot reach the scale of many real-world problems, these simulations give a sense of the potential future technology.
Although this particular work is complete, there remains potential to expand on it by researching other types of NP-Hard optimization problems and implementing other solvers. We will present plots illustrating the results of the performance and accuracy benchmarking.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionIn the span of 1.5 years, Intel has launched four families of Xeon processors, with some novel architectural features. First, the Sapphire Rapids generation, which featured a version with on-package HBM; next, the Emerald Rapids generation; and then Intel differentiated by releasing the performance-oriented Granite Rapids and the efficiency-oriented Sierra Forest families. In this work, we evaluate the performance and efficiency of CPUs from each of these generations and variants, with a particular focus on bandwidth-bound high performance computing (HPC) applications. We contrast runtime and energy consumption figures and track trends across generations.
Birds of a Feather
TP
XO/EX
DescriptionBenchmarking is a key factor in ensuring the success of the many rapidly emerging AI services that are captivating the world today, such as large language models. In this session, five experts from diverse perspectives (AMD, NVIDIA, Microsoft Azure, Oracle Cloud, and Snorkel AI) will discuss the usage of benchmarking across the main steps of developing AI infrastructure and applications. Attendees will have an opportunity to engage with the experts about the critical role of benchmarking, from building AI infrastructure, optimizing cloud environments, and ensuring customer performance to continuously improving AI applications through iterative development and feedback.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionAchieving trustworthy AI systems with easy usability for all stakeholders in the healthcare sector is challenging, as trustworthiness has many facets. It has been shown that even when physicians lack knowledge or understanding, patients are usually willing to use drugs that are demonstrably safe and efficient (Boddington, 2017). Reducing the opacity of black-box AI systems is crucial for healthcare AI applications because of the moral and professional responsibility of physicians to provide reasons and explanations for their decisions (Holzinger et al., 2019).
However, black-box models are common in AI and are generally thought to pose a problem for trustworthiness. Despite the fact that robotic surgical systems are as efficient as physicians, many patients still trust a surgeon more than a robotic system (Longoni, 2019).
This paper explores the challenges in healthcare simulations, emphasizing the need for ethical frameworks and adaptive regulatory mechanisms to address data requirements and privacy concerns.
Panel
Heterogeneous Computing
TP
DescriptionWith the recent deployment of the U.S. Department of Energy's first exascale system, the timing couldn't be better for us to delve into the world of exascale and explore the challenges and opportunities in the post-exascale era. During this "golden age of architectures," we are now seeing a Cambrian explosion of new technologies, catalyzed by chiplets, heterogeneous integration, open hardware, AI, and substantial global investments in semiconductors. This new level of heterogeneity is rich with opportunity but fraught with serious challenges that often seem insurmountable. This panel will survey post-exascale technologies, scrutinize enabling catalysts, and discuss their implications for system design, software, and applications.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionIn this talk, we will address misconceptions about the I/O requirements for training large language models (LLMs) at scale. Contrary to popular belief, the read workload is not the biggest I/O challenge in training and does not intrinsically require extreme IOPS or bandwidth. Rather, the write workload, driven by checkpointing, imposes the highest demands on storage. Despite this, the write bandwidth required by even the largest frontier LLMs remains modest at massive scale, and we will demonstrate this quantitatively through a simple performance model. We will also explore strategies to optimize the write performance of training frameworks, further reducing the need for boutique storage systems optimized for write throughput. This presentation aims to provide a clear, realistic view of the I/O patterns in LLM training based on our real-world experience, dispelling myths that may otherwise result in overspecifying and overspending on storage for AI workloads.
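The talk's performance model is not reproduced here; the back-of-envelope sketch below only illustrates the style of estimate involved, and every number in it (parameter count, bytes per parameter, checkpoint interval) is a hypothetical placeholder rather than a figure from the presentation.

    # Hypothetical back-of-envelope estimate of checkpoint write bandwidth.
    # All inputs are illustrative placeholders, not values from the talk.
    params = 1.0e12                    # model parameters (assume a 1T-parameter model)
    bytes_per_param = 16               # weights + optimizer state, assumed 16 B/param
    checkpoint_interval_s = 30 * 60    # one checkpoint every 30 minutes (assumed)

    checkpoint_bytes = params * bytes_per_param
    sustained_bw_gbs = checkpoint_bytes / checkpoint_interval_s / 1e9

    print(f"Checkpoint size: {checkpoint_bytes / 1e12:.1f} TB")
    print(f"Average write bandwidth to keep up: {sustained_bw_gbs:.1f} GB/s")

Even with these deliberately generous placeholder numbers, the averaged write rate comes out in the single-digit GB/s range, which is the sense in which the checkpoint workload can be called modest at scale.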
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionSimulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using ParaView. No ML tools were leveraged in the rendering.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionWe used the Coreform Cubit software to create an Exodus-II tri-mesh with 11,196 points for the blood vessel walls. Red blood cells are placed randomly within the mesh bounds, and then the algorithm from RBC3D, a spectral boundary integral solver for cell-scale flows, initiates Stokes flow through the vessel. This algorithm is parallelized via MPI, and we had to use 192 CPU cores for eight hours to run this simulation to 10,000 timesteps. To visualize the simulation data, we used Kitware's ParaView software. Then, we used two NVIDIA RTX 6000 GPUs to run the OSPRay path tracer algorithm from ParaView's ray-tracing tools. Georgia Tech's PACE Phoenix cluster provided access to CPU and GPU nodes under Spencer Bryngelson's allocation. The ray-tracing step took 16 hours on these nodes. Finally, we combined images of the simulation from each timestep into a video using FFmpeg.
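As a concrete illustration of the final step described above, a numbered frame sequence can be stitched into a video by invoking FFmpeg from Python. The frame naming pattern, frame rate, and output name below are assumptions for the sketch, not the authors' exact command.

    import subprocess

    # Stitch numbered frames (frame_0001.png, frame_0002.png, ...) into an H.264 video.
    # Frame pattern, frame rate, and output name are illustrative assumptions.
    subprocess.run(
        [
            "ffmpeg",
            "-framerate", "30",          # playback rate of the input frames
            "-i", "frame_%04d.png",      # assumed frame naming pattern
            "-c:v", "libx264",           # H.264 encoder
            "-pix_fmt", "yuv420p",       # widely compatible pixel format
            "rbc3d_vessel.mp4",
        ],
        check=True,
    )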
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionSynchrotron light sources have emerged as vibrant hubs of scientific collaboration, fueling breakthroughs across multiple disciplines. As domain scientists increasingly push the boundaries of their research, the demand for customized beamlines has grown in response. Recent successes in experiment orchestration and data management solutions such as Bluesky are being adopted widely, yet a significant challenge remains: generalizing support for data analysis workflows to accommodate the unique needs of diverse user communities. Addressing this hurdle could unlock a new era of experimental innovation, enabling the development of sophisticated techniques that are adaptive, automated, and tailored to specific research domains. This work presents a strategic approach to integrating experiment orchestration and data analysis, aimed at empowering scientists to realize their most ambitious goals while maximizing the potential of modern light sources.
Paper
Distributed Computing
Middleware and System Software
TP
DescriptionExisting disaggregated memory (DM) systems face a problem of underutilized far memory bandwidth, which greatly limits the data throughput when processing data-intensive applications. Specifically, prior works all target runtime design for a single PCIe-based secondary memory device (i.e., single-backend far memory) with low data bandwidth and high system overhead.
In this work, we take the first step to realize a well-crafted, multi-backend DM system with scale-out far memory paths. We propose xDM, a novel DM management scheme that can dynamically build and implicitly select appropriate far memory access paths. As part of xDM, we devise a smart far memory configuration strategy that can further optimize bandwidth usage effectiveness by tuning a wide set of key parameters based on synthesized information of application page data. Our design shows up to 3.9x data swap performance speedup, 2.8x data throughput increase, and 5.1x data center task throughput improvement compared with state-of-the-art works.
ACM Gordon Bell Climate Modeling Finalist
TP
DescriptionWe present the design and scalable implementation of an exascale climate emulator for addressing the computational and storage requirements of high-resolution Earth System Model simulations. We utilize the spherical harmonic transform to stochastically model spatio-temporal variations in climate data, providing tunable resolution and improving the fidelity and granularity of climate emulation, achieving an ultra-high spatial resolution of 0.034° (~3.5 km). Our emulator is trained on 318 billion hourly temperature data points from a 35-year simulation ensemble and 31 billion daily points from an 83-year global simulation ensemble. We extend linear solver software to mixed-precision arithmetic on GPUs, applying different precisions within a single solver to adapt to different correlation strengths, and use the PaRSEC runtime system. Our BLAS3-rich code is optimized for systems with four different families of GPUs, achieving 0.976 EFlop/s (9,025 nodes, Frontier), 0.739 EFlop/s (1,936 nodes, Alps), 0.243 EFlop/s (1,024 nodes, Leonardo), and 0.375 EFlop/s (3,072 nodes, Summit).
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe NSF-funded Anvil supercomputer, built and maintained by the Purdue University Rosen Center for Advanced Computing, enables efficient research computing across a variety of scientific domains nationwide. In addition to the traditional terminal interface, Anvil uses the open-source web portal framework Open OnDemand to provide a low barrier web interface to the Anvil supercomputer. In this poster we describe enhancements that were made to Anvil's Open OnDemand dashboard to provide a clean, well-structured, and extensible interface for researchers to visualize useful information about their utilization of Anvil without accessing the terminal. Our enhancements include various apps that provide detailed statistics about user jobs and their respective performance metrics while also focusing on data query performance. The information in these apps will enable users to identify and debug any jobs that are noticeably resource-inefficient, as well as improve the queue wait time and efficiency of their jobs in the future.
ACM Gordon Bell Finalist
TP
DescriptionThe accurate simulation of complex biochemical phenomena has historically been hampered by the computational requirements of high-fidelity molecular-modeling techniques. Quantum mechanical methods, such as ab initio wave-function (WF) theory, deliver the desired accuracy, but have impractical scaling for modeling biosystems with thousands of atoms. Combining molecular fragmentation with MP2 perturbation theory, this study presents an innovative approach that enables biomolecular-scale ab initio molecular dynamics (AIMD) simulations at WF theory level. Leveraging the resolution-of-the-identity approximation for Hartree-Fock and MP2 gradients, our approach eliminates computationally intensive four-center integrals and their gradients, while achieving near-peak performance on modern GPU architectures. The introduction of asynchronous time steps minimizes time step latency, overlapping computational phases and effectively mitigating load imbalances. Utilizing up to 9,400 nodes of Frontier and achieving 59% (1006.7 PFLOP/s) of its double-precision floating-point peak, our method enables us to break the million-electron and 1 EFLOP/s barriers for AIMD simulations with quantum accuracy.
ACM Gordon Bell Finalist
TP
DescriptionMolecular dynamics (MD) simulations have transformed our understanding of the nanoscale, driving breakthroughs in materials science, computational chemistry, and several other fields, including biophysics and drug design. Even on exascale supercomputers, however, runtimes are excessive for systems and timescales of scientific interest. Here, we demonstrate strong scaling of MD simulations on the Cerebras Wafer Scale Engine. By dedicating a processor core for each simulated atom, we demonstrate a 457-fold improvement in timesteps per second versus the Frontier GPU-based exascale platform, along with a large improvement in timesteps per unit energy. Reducing every year of runtime to less than a day unlocks currently inaccessible timescales of slow microstructure transformation processes that are critical for understanding material behavior and function. Our dataflow algorithm runs embedded-atom method (EAM) simulations at rates over 699k timesteps per second for problems with up to 800k atoms. This demonstrated performance is unprecedented for general-purpose processing cores.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
DescriptionThis paper presents a novel, general-purpose interface for adding interactive human-in-the-loop steering controls to existing simulation codes. The design is agnostic to any specific in situ analysis and visualization library, though our reference implementation is based on Ascent -- a common in situ visualization and analysis library for large-scale simulations. Traditional in situ analysis and visualization workflows are typically automated through trigger mechanisms that execute as simulations reach certain predefined states (e.g. every N timesteps, simulation parameters become unstable, etc.). Although such automated in situ tasks suffice for many real-world applications, we demonstrate that a complementary interactive interface can significantly boost scientific productivity. With two use case simulations, we show how our approach enables scientists to pause and resume simulations, allowing for interactive adjustments to the simulation state between timesteps. This method avoids the overhead of cold restarting, enabling faster case setup, troubleshooting, and exploration.
Birds of a Feather
TP
XO/EX
DescriptionAI needs data to learn and make decisions. Building and using AI requires access to massive amounts of data and computing power. Consequently, democratizing the AI R&D ecosystem requires democratizing access to AI-ready data and the ability to process this data. However, fair, open, and equitable access to these resources remains challenging. This session will discuss the many dimensions of democratizing data. It will then introduce key related efforts and facilitate an open discussion about how these and other efforts can come together as part of an open data ecosystem that ensures all researchers can participate in AI R&D.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionThis session will bring together multiple practitioners in the field of Digital Twins and present live, running examples in a lightning format.
Abstract for Elisabeth Mayer's talk: Digital twins are powerful tools used across various scientific fields. One example is art history: by capturing models containing even small details from the physical space, precise virtual replicas of a room can be used to create art-historic digital twins. These digital twins can be used to recreate situations through simulations that would otherwise be impossible to achieve in the physical model. The digital twin in this presentation encompasses the great hall, or Spiegelsaal, in the castle Rheinsberg. The room measures approximately 12.5 x 11 m with a ceiling height of between 5.0 and 5.5 m. It has three large windows on the east and west sides, interspersed with full-length mirrors. The digital twin was captured via photogrammetry by Hess & Hindmarch (University of Bamberg) and reworked to capture the space of the physical model. The digital twin is used to explore the significance of light through simulations as well as spatial interactions through immersive analytics. Digital twins in art history can assist in the analysis, understanding, interpretation and conservation of historical spaces.
Hess, M., & Hindmarch, J. (2023). Textured 3D Model of the great hall at Castle Rheinsberg, Germany/ Schloss Rheinsberg (Version v1). Otto-Friedrich-Universität Bamberg. https://doi.org/10.48564/unibafd-gbc86-fha62
Abstract for Pieter van Schalkwyk's talk - From Static Twins to Smart Agents: The Evolution of Digital Twin Technology in EV Battery Assembly As the automotive industry seeks real-time adaptability, digital twins are evolving from static models to active, intelligent agents. These Multi-Agent Generative Systems (MAGS) bring advanced capabilities to EV battery assembly, enabling digital twins to monitor, learn, and optimize complex manufacturing processes autonomously.
Abstract for Dimitrios Rovas's talk: Building of the future: using digital building twins for data-smart operation. As climate change and the need for resilient operations continue to drive innovation, digitalisation in energy efficiency is becoming pivotal. This presentation will explore the concept of a digital twin in high-performance building operations, detailing its applications, variations, and how it supports sustainability goals. Attendees will gain insight into digital twins designed for advanced operational efficiency, including examples where agent-based operation can surpass traditional building management systems. We’ll also dive into the technologies, especially semantic web frameworks and integrated data platforms, that enable the scale-up and effective delivery of digital twins, offering valuable perspectives on their construction and deployment.
Abstract for Anuj Kapadia's talk: Digital twins are transforming oncology by enabling patient-specific radiation treatments through precise simulations for personalized planning, real-time monitoring, and outcome prediction. At Oak Ridge National Laboratory, we developed a multiscale computational framework for radiation dosimetry and outcome prediction, integrating GEANT4, TOPAS-nBIO, CompuCell3D, and XCAT phantoms. Simulations were performed for external beam radiotherapy (gamma and proton beams) and radiopharmaceutical therapy (e.g., 18F and 225Ac), modeling effects from whole-body exposure to DNA damage and repair. This approach incorporates molecular interactions and biochemical pathways, enabling cell survival predictions in tumor and healthy tissues, advancing precision in cancer treatment.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe increasing demand for computing in scientific research has given rise to memory contention and performance bottlenecks. Existing solutions often carry high overheads or lack the necessary detail for effective contention mitigation. To tackle these challenges, we are developing a powerful tool, HOME (Hierarchy-Oriented Memory Evaluation), which can efficiently identify contention by capturing detailed load-store traces and passing them to configurable memory hierarchy models.
HOME helps programmers identify the code regions that create contention at various memory hierarchy levels. This enables developers to redesign applications and optimize memory hierarchy designs efficiently. Our preliminary assessments indicate that HOME can save time by up to 50x compared to the state-of-the-art, with an average error rate of 6.51%. We also provide solutions for mitigating the impact of sample drops on contention analysis, a common issue in trace-based analysis.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionIn this talk, we will explore the development of a digital twin for High-Performance Computing (HPC) systems, using the CINECA datacenter in Bologna as a case study. Digital twins provide a virtual replica of the physical system, enabling real-time monitoring, simulation, and optimization. We will discuss the monitoring infrastructure necessary to collect and process system data, as well as how predictive models are used to enhance predictive maintenance. Furthermore, we will examine the development of energy and power prediction models, focusing on optimizing both the system’s energy consumption and workload performance. Attendees will gain insights into how these digital twin models improve system resilience, efficiency, and sustainability.
Birds of a Feather
TP
XO/EX
DescriptionData streaming is a proven, widely used technique. It’s not commonly used in HPC yet, but it can support critical applications with high societal and scientific impact. This BoF examines use cases like Destination Earth, Large Hadron Collider, Square Kilometre Array, Exa-AToW, and Earth systems digital twins by NASA, and discusses how to make streaming of measurement/sensor or simulation data useful in the context of such large, data-oriented initiatives, which all critically rely on HPC infrastructures.
The aim is to start a global HPC Data Streaming Community to examine lessons learnt, adapt data streaming approaches, and drive integration with HPC systems.
Exhibits
Flash Session
TP
XO/EX
DescriptionDDN presents a groundbreaking new data platform for massive-scale AI development and production: a modern software architecture designed from the ground up to enable the next generations of AI model creation by handling distributed data and massive metadata, and by vastly simplifying how organizations take AI models through to production.
Birds of a Feather
TP
XO/EX
DescriptionScience relies on intercontinental collaborations, not least in hot areas such as AI, but for historical reasons many HPC codebases tend to dominate in the region where they were developed. This duplication of effort is a major challenge as we increasingly need large teams to perform co-design to adapt to emerging HPC platforms, where expertise is a worldwide bottleneck. International cooperation is a way to address this, both for training and development of applications. This BoF will share best practices and identify existing intercontinental collaborations in HPC applications, using the recent HANAMI Europe-Japan collaboration as a starting point for discussion.
Exhibits
Flash Session
TP
XO/EX
DescriptionOperationalizing AI initiatives while ensuring compliant, well-governed data is a top challenge for 60% of technology leaders. Today, businesses that want to leverage AI are often forced to bring their data to proprietary model providers’ cloud-based technologies, which undermines control and security, complicates the data ecosystem with costly vendors and tools, and hinders flexibility to adapt to changing market forces.
There is another way: bring AI Models to your Postgres data. Join EDB and guest speaker from Supermicro to learn how to build a GenAI chat application with your existing Postgres infrastructure. This session will unveil four key points for ensuring on-demand deployment of GenAI workloads, offering a flexible multi-modal experience without exposing sensitive data to the cloud or relying on fragmented database solutions. Discover how to harness the power of AI while maintaining complete data control. Don’t miss this chance to lead in the AI transformation while maximizing your database investments.
With EDB and Supermicro, the supercomputing community can leverage Postgres in an enterprise-hardened, high-performance, completely controlled environment.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionBULKI is designed to address the limitations of traditional key-value formats in handling large-scale metadata operations efficiently. It is a self-describing, multi-key structure that can be serialized and deserialized efficiently. Our vision is to create a compact and flexible format that excels in high-throughput, metadata-driven applications.
Workshop
I/O, Storage, Archive
W
DescriptionModern supercomputers host numerous jobs that compete for shared storage resources, causing I/O interference and performance degradation. Solutions based on software-defined storage (SDS) emerged to address this issue by coordinating the storage environment through the enforcement of QoS policies. However, these often fail to consider the scale of modern HPC infrastructures.
In this work, we explore the advantages and shortcomings of state-of-the-art SDS solutions and highlight the scale of current production clusters and their rising trends. Furthermore, we conduct the first experimental study that sheds new insights into the performance and scalability of flat and hierarchical SDS control plane designs.
Our results, using the Frontera supercomputer, show that a flat design with a single controller can scale up to 2,500 nodes with an average control cycle latency of 41 ms, while hierarchical designs can handle up to 10,000 nodes with an average latency ranging between 69 and 103 ms.
Birds of a Feather
TP
XO/EX
DescriptionPython is now one of the most popular programming languages. In HPC, it has predominantly been used to coordinate coarse-grain library components or workflows. However, it is increasingly being used to develop and coordinate applications with dynamic finer-grain components that are challenging to map efficiently onto heterogeneous resources. In this BoF, we discuss this challenge and efforts to design Python-based HPC, production quality codes for HPC leadership platforms. We will discuss issues such as multithreading, GPU kernel development, task-based coordination on heterogeneous systems with a mix of CPUs and GPUs, inter-node interoperability, scalability, portability, and reproducibility.
Workshop
State of the Practice
System Administration
W
DescriptionResearch computing facilitators must balance providing the most up-to-date versions of software while also ensuring that the software ecosystem is stable enough that version changes do not cause performance degradation to existing workflows. Additionally, the data centers where these ecosystems are running are intricately complex systems with many points of failure. These challenges inspire the need for tools that ensure these systems continue to perform at their expected levels. Here we present such a tool in a framework for Cluster Analysis and Node Assessment for Resource Integrity called CANARI. CANARI was developed and used at the Rosen Center for Advanced Computing to monitor the availability of nodes in our clusters as well as their performance against synthetic benchmarks, ingest that performance data into a persistent database, mark nodes displaying performance regression offline, and provide summary reports and real-time alerts to the Slack instance used at RCAC by using Slack's API.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionUsing GFlowNets, we generate porous reticular materials, such as Metal-Organic Frameworks and Covalent Organic Frameworks, for applications in carbon dioxide capture. We introduce a new Python package (matgfn) to train and sample GFlowNets. We use matgfn to generate the matgfn-rm dataset of novel and diverse reticular materials with gravimetric surface area above 5000 m² g⁻¹. We calculate single- and two-component gas adsorption isotherms for the top 100 candidates in matgfn-rm. These candidates are novel compared to the state-of-the-art ARC-MOF dataset and rank in the 90th percentile in terms of working capacity compared to the CoRE2019 dataset. We identify 13 materials with CO2 working capacity outperforming all materials in CoRE2019. After further analysis and structural relaxation, two outperforming materials remain (https://pubs.rsc.org/en/content/articlelanding/2024/dd/d4dd00020j).
Once the .xyz files were created, they were then handed over to the visualisation team, where they were imported into VMD. Once in VMD, the Van Der Waals graphical representation was chosen and then imported into Blender. In Blender, the lighting, materials, textures and environment were all altered to show the MOFs in an artistic way.
Students@SC
TP
W
TUT
XO/EX
DescriptionComplimentary brief coaching sessions following the résumé and pitch workshops will be available to participants. Pre-registration is required.
Paper
Algorithms
Data Movement and Memory
I/O, Storage, Archive
Performance Optimization
Scientific and Information Visualization
Visualization
TP
DescriptionIngestion of data generated by high-performance scientific applications continues to stress available storage resources. Efficient range-based analyses on this data can be enabled by reordering it on attributes of interest, but require expensive post-processing sorts to realize the query benefits of reordering. In-situ indexing techniques, while write-efficient, are orders of magnitude slower at range queries than sorted indices. Range queries are necessary for analyzing continuous physical attributes and tracking phenomena such as energy bands and wave fronts.
We present CARP, a scalable data partitioner for range queries that reorders data in-situ as it is streamed to storage during application I/O. Motivated by our findings that real application distributions tend to be highly skewed and dynamic, CARP dynamically discovers and adapts its data partitions to track these characteristics. As a result, CARP can approximate the query performance of a sort without any ingestion overhead, making it 5X faster than prior work.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionParticle accelerators are physically and technically complex systems. Their operation requires demanding beam control systems. To support the development of new components in a beam phase control system for a synchrotron, we present a CGRA-based hardware/software environment that is capable of simulating the fundamental beam behaviour, leading to a hardware-in-the-loop setup. We show that this setup is able to emulate the longitudinal phase oscillations of the particle bunches in real-time.
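Purely as an illustration of the dynamics being emulated (a toy model, not the CGRA setup), small-amplitude longitudinal phase oscillations behave like a harmonic oscillator at the synchrotron frequency. A few lines of leapfrog integration reproduce that behavior; all numbers below are chosen arbitrarily.

    import numpy as np

    # Toy model: small-amplitude longitudinal phase oscillation,
    # d^2(phi)/dt^2 = -w_s^2 * phi.
    w_s = 2 * np.pi * 1.0e3      # synchrotron angular frequency (arbitrary 1 kHz)
    dt = 1.0e-6                  # time step (arbitrary)
    phi, dphi = 0.05, 0.0        # initial phase offset (rad) and its rate

    trace = []
    for _ in range(5000):        # leapfrog (kick-drift-kick) integration
        dphi += -0.5 * dt * w_s**2 * phi
        phi += dt * dphi
        dphi += -0.5 * dt * w_s**2 * phi
        trace.append(phi)

    print("max |phi|:", max(abs(p) for p in trace))   # stays near the initial amplitude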
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionCompute Energy & Emissions Monitoring Stack (CEEMS) has been designed to report energy usage of compute workloads in real time for HPC and cloud platforms alike. Besides CPU energy usage, it supports reporting energy usage of workloads on NVIDIA and AMD GPU accelerators. CEEMS has been built around the prominent open-source tools like Prometheus and Grafana. This paper explains the architectural overview of CEEMS, data sources that are used to measure energy usage and estimate equivalent emissions and potential use cases of CEEMS from operator and user perspectives. Finally, the paper will conclude by describing how CEEMS deployment on the Jean-Zay supercomputing platform is capable of monitoring more than 1400 nodes that have a daily job churn rate of around 20k jobs.
Exhibits
Flash Session
TP
XO/EX
Description1. Challenge & Technologies in Liquid-Cooling toward High-density and Efficient Applications
2. Liquid Cooling from Component to System
3. Reliability Confirmed from Beginning
4. Efficiency Evaluated from Design
5. Evaluation & Implementation of the Liquid Cooling
6. Design & Qualification Validations
7. Product Implementation & Shipment
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionHPC is undergoing profound changes as AI workloads reshape hardware architectures, directly impacting software development. This shift has led to increased diversity in software stacks, with higher-level languages and new parallel programming interfaces emerging. In this evolving landscape, we aim to highlight the critical role of tools in addressing these shifting demands. Our discussion will start by examining the nature of current HPC tools and then explore how modern software stacks are promoting tools to first-class citizens within their ecosystems. Specifically, we will discuss shifts in runtimes, the motivations behind the MPI ABI for Tools, and the growing importance of Python and Rust leading to 'Interfaces' between languages and components. We will also emphasize the associated challenges in monitoring these increasingly complex systems. We argue that now, more than ever, robust tools are essential for enabling end-users to efficiently harness the capabilities of a given hardware platform.
Birds of a Feather
TP
XO/EX
DescriptionDespite the quantity and quality of existing training materials, acquisition and development of HPC skills is still not straightforward enough to address the needs of the growing and diversifying community. The HPC training ecosystem needs to be easier to navigate for both learners and educators. Many elements required to make the ecosystem more FAIR (findable, accessible, interoperable and reusable) already exist; the challenge is to connect them and identify the gaps in content, tools and approaches. This BoF will interactively explore the challenges in designing learning pathways, including the potential use of AI tools for personalized pathway development.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe packets were captured at a network border on 400 Gbps links. The raw packet data were preprocessed in Python to generate bipartite (pair-wise) TCP connections. Gephi was then used to assemble the TCP connections into a graph and visualize them. The original picture was 16K resolution (15360 × 8640), containing a snapshot of 2.29 M TCP connections visualized by Gephi using the Yifan Hu graph layout algorithm. Further processing was done in PowerPoint to highlight the area of interest. Several people were involved in data collection (Alex Withers), rendering (Bach Hoang), leading the process (Phuong Cao), and providing concept and feedback (Ravi Iyer, Zbigniew Kalbarczyk).
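As a hedged illustration of the preprocessing step described above (the actual scripts are not part of this entry), one common way to turn per-packet records into a pair-wise edge list that Gephi can import is to aggregate on source/destination pairs. The column names and input file below are assumptions, not the original pipeline.

    import pandas as pd

    # Assumed input: one row per captured packet with source and destination
    # addresses (column names are hypothetical, not from the original pipeline).
    packets = pd.read_csv("packets.csv", usecols=["src_ip", "dst_ip"])

    # Aggregate packets into pair-wise (bipartite) TCP connections, one row per
    # src/dst pair, weighted by the number of packets observed.
    edges = (
        packets.groupby(["src_ip", "dst_ip"])
        .size()
        .reset_index(name="Weight")
        .rename(columns={"src_ip": "Source", "dst_ip": "Target"})
    )

    # Gephi can import an edge-list CSV with Source/Target/Weight columns.
    edges.to_csv("tcp_edges.csv", index=False)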
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionSimulating plasma turbulence in the edge region of a magnetic confinement fusion (MCF) device is crucial for identifying optimal operational scenarios for future fusion energy commercialization. GENE-X, an Eulerian electromagnetic gyrokinetic code, can simulate plasma turbulence throughout an MCF device, including the edge region. This work focuses on characterizing GENE-X's performance, such as the elapsed time during the time integration phase, memory usage, and file I/O. Two cases with different MPI decomposition schemes are analyzed using GENE-X's built-in profiler, along with profiling and monitoring tools such as IPM and Darshan. This study aims to provide a preliminary view of the HPC characteristics of the code to assist in future optimization efforts.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionWe present ChatBLAS, the first AI-generated and portable Basic Linear Algebra Subprograms (BLAS) library for different CPU/GPU configurations. The purpose of this study is (i) to evaluate the capabilities of current large language models (LLMs) to generate a portable, HPC-grade library for BLAS operations and (ii) to define the fundamental practices and criteria for interacting with LLMs for HPC targets, to elevate the trustworthiness and performance levels of the AI-generated HPC codes. The generated C/C++ codes must be highly optimized using device-specific solutions to reach high levels of performance. Additionally, these codes are very algorithm-dependent, thereby adding an extra dimension of complexity to this study. We used OpenAI's LLM ChatGPT and focused on vector-vector BLAS level-1 operations. ChatBLAS can generate functional and correct codes, thereby achieving high trustworthiness levels, and can compete with or even outperform vendor libraries.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionWe develop an iterative assistant we call ChatVis that can synthetically generate Python scripts for data analysis and visualization using a large language model (LLM). The assistant allows a user to specify the operations in natural language, attempting to generate a Python script for the desired operations and prompting the LLM to revise the script as needed until it executes correctly. The iterations include an error detection and correction mechanism that extracts error messages from the execution of the script and subsequently prompts the LLM to correct the error. Our method demonstrates correct execution on five canonical visualization scenarios, comparing results with ground truth. We also compared our results with scripts generated by several other LLMs without any assistance. In every instance, ChatVis successfully generated the correct script, whereas the unassisted LLMs failed to do so.
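A minimal sketch of the kind of generate-execute-repair loop the abstract describes is given below. It is not the ChatVis implementation; ask_llm stands in for whatever LLM completion call is actually used, and its name and behavior are assumptions.

    import os
    import subprocess
    import tempfile

    def ask_llm(prompt: str) -> str:
        # Placeholder for an LLM completion call (hypothetical; not the ChatVis API).
        raise NotImplementedError

    def generate_and_repair(task: str, max_iters: int = 5) -> str:
        prompt = f"Write a Python script that {task}. Return only code."
        for _ in range(max_iters):
            script = ask_llm(prompt)
            with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
                f.write(script)
                path = f.name
            result = subprocess.run(["python", path], capture_output=True, text=True)
            os.unlink(path)
            if result.returncode == 0:
                return script  # the script executed cleanly
            # Feed the error message back to the LLM and ask for a corrected script.
            prompt = (
                f"The following script failed:\n{script}\n"
                f"Error:\n{result.stderr}\nReturn a corrected script only."
            )
        raise RuntimeError("no working script produced within the iteration budget")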
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionLarge Language Models (LLMs) are transforming multiple fields, yet verifying their answers remains challenging, especially for complex tasks like summarization, consolidation, and knowledge extraction. We introduce CheckEmbed, a scalable and straightforward approach to LLM verification. CheckEmbed leverages a simple idea: to compare LLM-generated answers with each other or a ground truth, use their answer-level embeddings obtained from models like GPT Text Embedding Large. This method reduces complex textual answers to single embeddings, enabling fast and meaningful verification. Our comprehensive pipeline implements the CheckEmbed methodology, including metrics like embedding heatmaps to assess answer truthfulness. We demonstrate how these metrics can be used to determine whether an LLM answer is satisfactory. Applied to real-world tasks, such as term extraction and document summarization, the CheckEmbed pipeline shows notable improvements in accuracy, cost-effectiveness, and runtime compared to existing methods like BERTScore or SelfCheckGPT.
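The core comparison can be illustrated with a few lines of linear algebra: given answer-level embedding vectors (in practice from an embedding model; the random vectors below are placeholders), pairwise cosine similarities form the kind of heatmap the abstract mentions, and a threshold on them gives a simple consistency check.

    import numpy as np

    # Placeholder embeddings: one row per LLM answer (in practice these would come
    # from an embedding model, as described in the abstract).
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(4, 1536))

    # Normalize rows so dot products become cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    heatmap = unit @ unit.T   # pairwise cosine-similarity matrix

    # A simple decision rule: answers agree if their mean off-diagonal similarity
    # exceeds a chosen threshold (the 0.8 value is an arbitrary illustration).
    off_diag = heatmap[~np.eye(len(heatmap), dtype=bool)]
    print("mean pairwise similarity:", off_diag.mean(),
          "consistent:", off_diag.mean() > 0.8)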
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionThis work considers checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted.
We start with small examples that show the difficulty of the problem: the optimal checkpointing strategy neither always uses periodic checkpoints nor always takes its last checkpoint exactly at the end of the reservation.
Then, we introduce a dynamic heuristic that is periodic and decides the checkpointing frequency based on thresholds for the time left.
Next, we use time discretization and design a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy.
Finally, we report the results of an extensive simulation campaign showing that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations.
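For reference, the Young/Daly baseline mentioned above is the classical first-order checkpointing period (a standard result, not a contribution of this paper), where C is the checkpoint cost and μ the platform MTBF:

    W_{\mathrm{YD}} = \sqrt{2\,C\,\mu}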
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis research explores the utilization of underused computational resources in network switches from Arista and Mellanox by deploying containers for auxiliary tasks in computational cluster networks. We tested five scenarios: 1) running cloud-init services for efficient boot processes, 2) using Telegraf for network monitoring, 3) deploying a caching proxy to reduce latency, 4) setting up an IPv6 DHCP/DNS provider for VLANs, and 5) implementing client detection with Magellan for network topology mapping. Containers were deployed using Podman and Docker on SONiC-based switches and tested both physically and virtually. Results demonstrated the feasibility and benefits of this approach, which optimizes network performance and reduces server load, offering a cost-effective enhancement to HPC clusters. Future work can expand this research to include additional network management and security tasks directly on the switches.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionRecent years have brought about a sense of urgency regarding global environmental health, and this draws attention to the environmental impact of modern computing data centers, which are mostly powered by electricity originating from carbon sources. In addition to the general concern for the environment, modern data centers are only increasing in their thirst for power; the latest data center applications and foundation machine learning models are examples of this. In 2021, the Georgia Institute of Technology transitioned into a consumption-based model for its centralized computing resources and further adopted Slurm for cluster management and job scheduling. This transition resulted in greater utilization, among other metrics, than with the prior system based on Torque/Moab.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis work focuses on performance portability and proposes a methodological approach to assessing and explaining how different kernels behave across various hardware architectures using the RAJA Performance Suite (RAJAPerf). Our methodology leverages metrics from the Intel top-down pipeline and clustering techniques to sort the kernels based on performance characteristics. We assess the methodology on 54 RAJAPerf computational kernels on Intel Xeon and NVIDIA V100 platforms. Our results confirm the effectiveness of our methodology in automatically characterizing performance differentials and speedups, particularly in memory-bound kernels.
Paper
Accelerators
Data Movement and Memory
Emerging Technologies
Hardware Technologies
Heterogeneous Computing
Linear Algebra
Network
TP
DescriptionThe memory system is a major performance determinant for server processors. Ever-growing core counts and datasets demand higher memory bandwidth and capacity. DDR—the dominant processor interface to memory—requires a large number of on-chip pins, which are a scarce resource, thus limiting the processor’s memory bandwidth. With limited bandwidth, multiple concurrent memory requests experience significant queuing delays that often overshadow DRAM’s service time and degrade performance. We present CoaXiaL, a memory system design for throughput-oriented manycore servers that replaces all of the processor’s DDR interfaces with the pin-efficient CXL interface, which offers 4x higher bandwidth per pin. While such replacement incurs a considerable latency overhead, we demonstrate that, for many workloads, and with careful integration, CXL’s higher bandwidth more than offsets its latency premium. Our evaluation shows that CoaXiaL improves the performance of manycore throughput-oriented servers by 1.39x on average and by up to 3x.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image is generated wholly from code written in Python (version 3.11.5) using the visualization library Matplotlib (version 3.8.3) to programmatically and procedurally define the design. This underlying code has, along with the generated output design, been open-sourced under the CC BY 4.0 license and is available to view from a public GitHub repository, "high-res-art", under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/collatz_hi_res.py). High-performance computing was indispensable for refining the parameters encoding the precise design, in particular the translation factors and the set of hex-triplet color codes defining the background and marker colors. Configurations of the parameters, starting from exploratory values, were batch processed, and the generated outcomes were inspected to home in on promising parameter sets over several iterations until this design emerged as a visual favorite. Specifically, the Slurm workload manager was used for such batch computing on the UK's JASMIN supercomputer.
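The batch parameter exploration described above can be pictured with a small sketch. The parameter values, output naming, and plotted design below are placeholders (the real script lives at the linked repository); only the sweep-then-inspect pattern is the point.

    import itertools
    import matplotlib
    matplotlib.use("Agg")            # headless rendering for batch jobs
    import matplotlib.pyplot as plt
    import numpy as np

    # Hypothetical parameter grid: background/marker colors and translation factors.
    backgrounds = ["#1b1b2f", "#0f0f0f"]
    markers = ["#e43f5a", "#f9a828"]
    shifts = [0.1, 0.25]

    points = np.random.default_rng(1).random((500, 2))   # stand-in for the real design

    for i, (bg, mk, dx) in enumerate(itertools.product(backgrounds, markers, shifts)):
        fig, ax = plt.subplots(figsize=(4, 4), facecolor=bg)
        ax.scatter(points[:, 0] + dx, points[:, 1], s=4, color=mk)
        ax.set_axis_off()
        fig.savefig(f"candidate_{i:03d}.png", dpi=300, facecolor=bg)
        plt.close(fig)   # each output is inspected afterwards to pick favorites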
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image is generated wholly from code written in Python (version 3.11.5) using the visualization library "matplotlib" (version 3.8.3) to programmatically and procedurally define the design. This underlying code has, along with the generated output design, been open-sourced under the CC BY 4.0 license and is available to view from a public GitHub repository, "high-res-art", under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/collatz_hi_res.py). High-performance computing was indispensable for refining the parameters encoding the precise design, in particular the translation factors and the set of hex-triplet color codes defining the background and marker colors. Configurations of the parameters, starting from exploratory values, were batch processed, and the generated outcomes were inspected to home in on promising parameter sets over several iterations until this design emerged as a visual favorite. Specifically, the Slurm workload manager was used for such batch computing on the UK's JASMIN supercomputer.
Exhibitor Forum
Software Engineering
TP
XO/EX
DescriptionA recent case study from the Hartree Centre is presented. In this project, we developed a stack that integrates GitLab CI/CD tooling with a Slurm-enabled HPC environment (deployed in AWS ParallelCluster). The stack enables the triggering of computationally expensive tasks automatically from code commits, as well as the ability to pin task outputs to those commits via artifacts.
This case study showcases a template for architectures that enable engineers to work on designs/models independently with familiar tooling and share outputs, which may be dependencies for collaborators, automatically; all without needing to interact directly with an unfamiliar HPC environment. This presentation will focus on the challenges we faced in developing the architecture, and how we overcame those challenges. In particular, we will discuss our approach to defining "templated" computational tasks to run on HPC, and some pitfalls in creating client CLI interfaces for REST APIs.
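As a hedged sketch of the pattern described (not the Hartree Centre stack itself), a CI job that lands on a Slurm-enabled runner might submit a templated task and block until completion before publishing the result. The script name, output path, and wrapper function are assumptions; the sbatch flags used (--parsable, --wait) are standard Slurm options.

    import pathlib
    import subprocess

    def run_templated_task(script="templated_task.sbatch", out="results/output.dat"):
        # Submit the batch script and wait for it to finish.
        # --parsable makes sbatch print just the job id; --wait blocks until completion.
        job_id = subprocess.run(
            ["sbatch", "--parsable", "--wait", script],
            check=True, capture_output=True, text=True,
        ).stdout.strip()
        artifact = pathlib.Path(out)
        if not artifact.exists():
            raise RuntimeError(f"job {job_id} finished but produced no artifact")
        return artifact   # the CI job can then pin this file to the commit as an artifact

    if __name__ == "__main__":
        print(run_templated_task())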
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionFor large-scale matrix-free finite-element PDE solvers, parallel matrix-vector products typically comprise the dominant computational cost. Synchronization steps for input and output, when degrees of freedom (DOFs) lying on inter-process boundaries are communicated, become the dominant serial portion of the program and the main scalability bottleneck. However, the cost of communication can be mitigated if it can be overlapped with local computations which do not require the values of DOFs on process boundaries. In this research, we study the nonlinear Stokes solver of the mantle convection code Rhea, comparing several different methods for overlapping communication with computation during matrix-vector products, including a new dynamic method which automatically adjusts to measured imbalances in communication waiting times. We observe significant improvements in the waiting times, and in the overall computation times, for the matrix-vector products in Rhea.
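A minimal mpi4py sketch of the overlap pattern studied here is shown below (not Rhea's implementation; the halo layout and array sizes are placeholder assumptions): the boundary exchange is posted with nonblocking calls, interior work proceeds while messages are in flight, and the boundary contributions are finished only after the waits.

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size

    # Placeholder local vector with one halo value on each side.
    x = np.full(1000, float(rank))
    send_l, send_r = x[:1].copy(), x[-1:].copy()
    recv_l, recv_r = np.empty(1), np.empty(1)

    # 1. Post the nonblocking halo exchange for boundary degrees of freedom.
    reqs = [
        comm.Isend(send_l, dest=left, tag=0), comm.Isend(send_r, dest=right, tag=1),
        comm.Irecv(recv_l, source=left, tag=1), comm.Irecv(recv_r, source=right, tag=0),
    ]

    # 2. Overlap: apply the local (interior) part of the operator while messages move.
    y = 2.0 * x                      # stand-in for the interior matrix-vector product

    # 3. Finish communication, then add the boundary contributions.
    MPI.Request.Waitall(reqs)
    y[0] += recv_l[0]
    y[-1] += recv_r[0]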
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThe National Oceanic and Atmospheric Administration (NOAA) strives to understand and predict changes in the environment through improving climate modeling. These models can be tested for reproducibility using tools in the Flexible Modeling System (FMS) Runtime Environment (FRE) framework. The FRE ecosystem provides a model compilation tool, FREMAKE, in which the process of checking out source code and compiling the model is automated, and FRERUN, which automates running the compiled model. To make NOAA models more accessible, easier to maintain, and easier to develop, an FRE rewrite is being constructed with portability, flexibility, and simplicity in mind, supporting container builds and runs as well as bare-metal runs. To ensure optimal performance of the FRE container, a comparative analysis was done for container versus bare-metal compilation runs. Here, we analyze how the container runtime set-up affects the total runtime performance of the model compilation process.
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionTo enhance data sharing and reduce access latency in scientific collaborations, high energy physics (LHC CMS experiment) employs regional in-network storage caches. Accurate predictions of cache utilization trends help design new caching policies and improve capacity planning. This study leverages the SoCal cache access trends to improve prediction on the newer caches in Chicago and Boston through transfer learning. We also investigate the impact of doubling the Chicago cache's storage capacity on its cache hit rate.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionHybrid MPI + X models, combining the Message Passing Interface (MPI) with node-level parallel programming models, increase complexity and introduce additional correctness issues. This work addresses the challenges of detecting data races in hybrid CUDA-aware MPI applications due to the asynchronous and non-blocking nature of CUDA and MPI APIs. We introduce CuSan, an LLVM compiler extension and runtime, to track CUDA-specific concurrency, synchronization and memory access semantics. We integrate CuSan with MUST, a dynamic MPI correctness tool, and ThreadSanitizer (TSan), a thread-level data race detector. MUST with TSan can already detect concurrency issues for multi-threaded MPI codes. Together with CuSan, these tools allow for comprehensive correctness checking of concurrency issues in CUDA-aware MPI applications. Our evaluation on two mini-apps reveals runtime overhead ranging from 6× to 36×, depending on the amount of memory tracked by TSan, compared to the uninstrumented version. Memory overhead remains under 1.8×. CuSan is available at https://github.com/ahueck/cusan.
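For readers unfamiliar with this error class, the following hypothetical mpi4py/CuPy sketch shows the kind of race such tools are designed to catch (CuSan itself instruments C/C++/CUDA codes via LLVM): a kernel writes a GPU buffer asynchronously on a stream, and a CUDA-aware non-blocking send may read that buffer before the kernel completes. It assumes a CUDA-aware MPI build, two MPI ranks, and an mpi4py that can pass GPU arrays.

```python
# Hypothetical illustration of a CUDA-aware MPI data race (run with 2 ranks).
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.zeros(1 << 20, dtype=cp.float32)
stream = cp.cuda.Stream(non_blocking=True)

if rank == 0:
    with stream:
        buf += 1.0          # kernel launched asynchronously on `stream`
    # RACE: without stream.synchronize(), the send below may read `buf`
    # before the kernel above has finished writing it.
    req = comm.Isend(buf, dest=1, tag=0)
    req.Wait()
elif rank == 1:
    req = comm.Irecv(buf, source=0, tag=0)
    req.Wait()
```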
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe simulation was written in C/C++ and CUDA. The computationally intensive portions of the code were offloaded to NVIDIA GPUs for acceleration. The graphics were rendered using OpenGL.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe coronal calculation was performed with the open source POT3D code (github.com/predsci/pot3d) using high-resolution surface magnetic field observations from the Helioseismic and Magnetic Imager as a lower boundary condition. The solution was computed on the Stampede2 supercomputer at the Texas Advanced Computing Center using 128 48-core compute nodes. Over 100 million magnetic field lines were then traced through the 6.6 billion cell 3D solution using the open source MapFL code (github.com/predsci/mapfl) to calculate the “magnetic squashing factor” (an indicator of magnetic structure) and determine which magnetic field lines were open (extending out into the heliosphere) or closed (falling back to the Sun). The image is a layered composite of six images of three signed quantities (squashing factor in purple [+] and green [-], magnetic field strength in orange [+] and cyan [-], and whether the field is open [red] or closed [dark blue]). The six images are blended using alpha transparency mapping, but otherwise are not altered from the original raw quantities (no sharpening or artistic enhancements are applied).
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe evaluation of skill and safety in language models (LMs) and foundation models (FMs) within scientific domains is increasingly critical as these models become more integral to research and discovery. We propose a multi-stage evaluation framework that integrates automatic question-answer generation for skill assessment and automated red-teaming for safety evaluation, tailored specifically for scientific applications. The framework includes domain-specific benchmark creation and rigorous validation processes to ensure that the models are both knowledgeable and safe for deployment. Addressing these challenges is essential for developing reliable AI systems that can support scientific innovation responsibly.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionGenerative AI models, in particular large transformers, are increasingly driving HPC system design in science and industry. We analyze performance characteristics of such transformer models and discuss their sensitivity to the transformer type, parallelization strategy, and HPC system features (accelerators and interconnects). We utilize a performance model that allows us to explore this complex design space and highlight its key components. We find that different transformer types demand different parallelism and system characteristics at different training regimes. Large language models are performant with 3D parallelism and amplify network needs only at pre-training scales with reduced dependence on accelerator capacity and bandwidth. On the other hand, long-sequence transformers, representative of scientific foundation models, place a more uniform dependence on network and capacity with necessary 4D parallelism. Our analysis emphasizes the need for closer performance modeling of different transformer types, keeping system features in mind, and demonstrates a path towards this.
Tutorial
I/O, Storage, Archive
Scientific and Information Visualization
TUT
DescriptionLarge-scale numerical simulations, observations, experiments, and AI computations generate or consume very large datasets. Data compression is an efficient technique to reduce scientific datasets and make them easier to analyze, store, and transfer. The first part of this one-day tutorial reviews the motivations, principles, techniques, and error analysis methods for lossy compression of scientific datasets. It details the main compression stages (decorrelation, approximation, coding) and their variations in state-of-the-art generic lossy compressors: SZ, ZFP, MGARD, and SPERR. The second part of the tutorial focuses on lossy compression trustability, hands-on sessions, and customization of lossy compression to respond to user-specific lossy compression constraints. In the third part of the tutorial, we discuss different ways of composing and testing specialized lossy compressors. The tutorial uses examples of real-world scientific datasets to illustrate the different compression techniques and their performance. The tutorial features 2 hours of hands-on sessions on generic compressors and how to compose specialized compressors. Participants are encouraged to bring their data to make the tutorial productive. The tutorial, given by the leading teams in this domain and primarily targeting beginners interested in learning about lossy compression for scientific data, is improved from the highly rated tutorials given at SC17-23.
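As a toy illustration of the approximation and coding stages discussed in the first part (deliberately much simpler than SZ, ZFP, MGARD, or SPERR), the sketch below applies uniform, error-bounded quantization followed by a generic lossless backend; the test data and the absolute error bound are arbitrary assumptions.

```python
# Toy error-bounded lossy compressor: approximation (quantization) + coding (zlib).
import numpy as np
import zlib

def compress(data: np.ndarray, abs_err: float):
    # Approximation: uniform quantization guarantees |x - x'| <= abs_err.
    codes = np.round(data / (2.0 * abs_err)).astype(np.int64)
    # Coding: a generic lossless backend shrinks the highly repetitive codes.
    return zlib.compress(codes.tobytes()), codes.dtype, data.shape

def decompress(blob, dtype, shape, abs_err: float):
    codes = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(shape)
    return codes * (2.0 * abs_err)

field = np.sin(np.linspace(0, 20, 1_000_000)) + 0.01 * np.random.rand(1_000_000)
blob, dtype, shape = compress(field, abs_err=1e-3)
recon = decompress(blob, dtype, shape, abs_err=1e-3)
print("max error:", np.max(np.abs(field - recon)))          # stays within the bound
print(f"compression ratio: {field.nbytes / len(blob):.1f}x")
```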
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionPhotons are essential in many matter-light interaction systems, with applications in physics (e.g., core-collapse supernovae) and engineering (e.g., neutron transport). Although the theory of radiating fluids has been established since the 1980s, developing robust and efficient numerical simulations remains an active research area.
Using the FleCSI framework, we developed a radiation hydrodynamics code called HARD (Hydrodynamics And Radiative Diffusion), which is scalable and portable across various HPC architectures and computing systems. FleCSI provides us with a task-based parallelism framework supported by different backends such as MPI, HPX, and Legion. The on-node parallelism is then handled through Kokkos targeting CPU and GPU architectures.
This poster presents both our achievements in the physics implementation and its adaptation to the task-based model, and also the benchmark of the different backends using FleCSI on Los Alamos National Laboratory supercomputers.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionIn June 2024, the University of Washington’s (UW) Clean Energy Institute and Molecular Engineering and Materials Center, in partnership with the UW high-performance computing (HPC) facilitation team, prepared complimentary training for a group of 25 Research Experience for Undergraduates (REU) participants. Workshop participants had completed zero to four years of post-secondary education and were from 17 universities across eight states, with 29% currently attending two-year programs, and on average, 14 students attended each workshop. The program included four targeted introductory workshop offerings, spanning essential skills in computational science and advanced topics. The program's effectiveness was evaluated with a post-workshop survey. The survey showed that some students agreed with statements indicating that learning objectives were met, but overall scores and open responses indicated areas for improvement. Our reproducible program addressed a wide range of training and education needs within computational science, emphasizing practical skills and interdisciplinary applicability.
Panel
Edge Computing
TP
W
TUT
XO/EX
DescriptionWhat if your data is at the extreme edge and has an HPC requirement—possibly deep in the ocean, on the space station, on a lunar colony, or even on an icy moon in the far reaches of the solar system? Panelists from NASA, NOAA, and industry will discuss current and future solutions to this challenging problem, present visually captivating imagery, and provide insight developed over many years while including fresh approaches from new voices. Examples include operational fault diagnosis and decision-making that require AI on missions to extraterrestrial icy moons with planned data flows to terrestrial capabilities, and collecting data to inform hurricane models. Hear from, and engage with, experts familiar with how recent and planned missions have expanded our concept of computing at the edge for both space-based and terrestrial challenges.
Panel
Energy Efficiency
TP
W
TUT
XO/EX
DescriptionHigh-performance computing (HPC) is critical to addressing and contributing to environmental sustainability challenges. This dual impact highlights the urgent need for collaboration among computer scientists and the broader research community. HPC harnesses the power of computing to enable significant near-term advances across many industries. At the same time, however, HPC and cloud data centers have an increasingly large carbon footprint and require substantial water resources for electricity generation and cooling. The carbon emissions of today's data and HPC centers exceed those of the airline industry, and emerging AI models are estimated to draw up to 21% of the world's electricity supply by 2030. To debate and discuss how to bring about transformational change, our panel will bring together leaders from the computing research and environmental sustainability communities. The panel will discuss achieving sustainability by identifying research gaps and new ways multidisciplinary collaborations can address them.
Invited Talk
TP
DescriptionThe Computing Community Consortium (CCC) at the Computing Research Association plays a vital role in shaping the future of computing through a broad range of visioning activities, including workshops, roundtable discussions, Quad papers, and responses to federal Requests for Information (RFIs). The drivers for CCC visioning are members of the CCC Council, nominated and selected for three-year terms to serve as a diverse representation of the computing research community. Established in 2006, the CCC Council has developed best practices for research visioning and models for evaluating the impact of visioning activities. Recently, CCC has facilitated roundtable discussions focused on identifying the next Grand Challenges in Computing. These discussions have deepened our understanding of what makes a Grand Challenge “Grand” and highlighted potential topics that could drive future research. Every four years, the CCC Council also produces Quad papers, which outline pressing research topics for the upcoming federal administration, released during election years. This presentation will describe our visioning processes and showcase recent computing research visioning topics.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Students@SC
TP
W
TUT
XO/EX
DescriptionThis workshop will explore the impact of impostor syndrome and tools participants can use to better adjust and feel confident in their abilities and surroundings. This workshop will touch on the foundational tenets of psychological safety and emotional intelligence. Additionally, definitions of microaggressions, macroaggressions, microaffirmations and effective methods to recognize the impacts of these in the workplace will be reviewed as well.
Workshop
Software Engineering
W
Tutorial
Accelerators
Broader Engagement
Emerging Technologies
Quantum Computing
TUT
DescriptionQuantum computing has the potential to revolutionize many fields in the 21st century. Over the past decade, numerous quantum computers have been made publicly available. However, the effectiveness of the hardware is heavily reliant on the software ecosystem — a lesson drawn from classical computing's evolution. Unlike classical systems, which benefit from mature Electronic Design Automation (EDA) and High-Performance Computing (HPC) tools for handling complexity and optimizing performance, quantum software is still in its infancy.
One of the goals of this tutorial is to educate the HPC community on quantum computing and to bring these two communities closer together. To this end, the tutorial intends to cover topics such as high-level support for users in realizing applications, as well as efficient methods for the classical simulation, compilation, and verification of quantum circuits. Furthermore, the tutorial showcases how expertise in classical HPC can address key challenges in the quantum software stack, enhancing efficiency, scalability, and reliability.
All of the above is accompanied by hands-on demonstrations based on the Munich Quantum Toolkit (MQT), which is an open-source collection of high-performance software tools for quantum computing developed at the Technical University of Munich (see https://mqt.readthedocs.io).
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image is generated wholly from code written in Python (version 3.11.5) using the visualization library Matplotlib (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available to view in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/inspired_by_le_parc_hi_res.py). High-performance computing was indispensable for refining the parameters encoding the precise design, in particular the radius, width, and transparency of the individual patch forming the element under repeated rotation; the number of patches per side; and the rotational start and end points forming the pair of rotational arrays. The Slurm workload manager was used on the UK's JASMIN supercomputer to batch process configurations of these parameters starting from exploratory values; the generated outcomes were then inspected, and over several iterations the parameter sets producing promising outcomes were homed in on until this design emerged as a visual favorite.
Exhibitor Forum
Network
TP
XO/EX
DescriptionWhat matters most to effective use of HPC systems is achieving performance under load. This session will delve into the limitations of traditional Ethernet to perform effectively on tightly coupled HPC and AI workloads. We will explore how cutting-edge Ethernet technologies like HPE Slingshot overcome these challenges and effectively control congestion and jitter to deliver performance under demanding network conditions. The insights presented will be applicable to those using or managing systems of all sizes, from dozens of nodes up to exascale. We’ll also discuss how supercomputing network technologies are feeding forward into Ethernet-based standards like the Ultra Ethernet Consortium.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionI propose an applications-first approach for adjusting how parallel and distributed computing concepts are incorporated into curricula. By focusing on practical applications that leverage parallelism and distributed systems, this approach aims to make these complex topics more accessible and engaging for both CS and non-CS majors. An applications-first approach would demonstrate the advantages of parallel and distributed computing in solving real-world problems and build some experience and skills using such systems before delving into theoretical concepts, potentially broadening the appeal and retention of parallel and distributed computing concepts. I highlight some example application-centric efforts, and conclude with questions that could be investigated in the service of exploring applications-first approaches.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThe convergence of scientific computing, exascale data, and high-performance computing (HPC) has driven significant advancements across multiple scientific domains. However, the growing complexity of software stacks, coupled with the increased need for reproducibility and efficient resource management, has elevated containerization to a vital solution in HPC environments.
A major challenge is the potential performance overhead introduced by containers. HPC workloads often demand direct hardware access for optimal performance, and the additional layer of abstraction provided by containers can affect computation speed and efficiency. Minimizing this overhead while preserving the advantages of containerization is essential for its adoption in performance-sensitive scientific applications. Additionally, containerization in HPC demands rigorous security measures, balancing user flexibility with system integrity and HPC center policies.
This work delves into these challenges and opportunities by examining the real-time implementation of containerization for NWCHEM (MDT) and OCP (Open Catalyst) scientific data within the Perlmutter supercomputer environment at LBNL.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionHigh-performance computing (HPC) systems are crucial for solving complex scientific problems, but challenges such as resource management, fault tolerance, and maintaining consistent performance across diverse environments can be difficult to address. Container technologies, like NERSC's Shifter and Podman-HPC, offer some solutions to these problems. This study uses Distributed MultiThreaded CheckPointing (DMTCP) technologies to implement robust checkpoint/restart (C/R) mechanisms within containerized environments, addressing challenges with fault tolerance and resource management. This study highlights successful C/R implementations on Perlmutter at NERSC using Shifter, Podman-HPC, and Apptainer, which has broader adoption in the HPC container space, to show where C/R within containers could be used at more HPC centers. Work on MPI-Agnostic Network-Agnostic (MANA) and containerized C/R for GPUs is also being pursued as part of future developments. These insights emphasize the growing importance of resilience in containerization deployments in scientific computing, ultimately accelerating the pace of discovery and innovation.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionIn this paper, we explore containerization for scientific simulation data such as NWChem and the Open Catalyst Project (OCP), and how Singularity containers are used to containerize HPC applications. We run different machine learning frameworks with and without Singularity containers to show their performance and the challenges of integrating containerization technologies within High-Performance Computing (HPC) environments, particularly for applications in computational chemistry and materials science. We also highlight the limitations of traditional containerization tools such as Docker in supercomputing contexts and introduce Singularity as a more suitable alternative tailored to HPC requirements. By analyzing the containerization process of NWChem and the OCP project on the Perlmutter supercomputer at LBNL, we illustrate how Singularity effectively addresses HPC-specific needs while ensuring performance and scalability. Our findings offer valuable insights for optimizing containerization strategies for complex scientific software in advanced computing environments.
Birds of a Feather
TP
XO/EX
DescriptionHigh-performance computing systems that have been traditionally deployed at a single site are expected to significantly expand their reach to include a variety of remote edge systems. These edge systems include computing platforms located near instruments and the instruments themselves. Examples range from interconnected ecosystems of large science instruments to smart energy grids supported by complex analytics and controls. These interconnected systems form a compute and instrument continuum wherein computation is orchestrated in various stages. This BoF will discuss the role of quantum networks in communicating between quantum, conventional, and hybrid systems to enable continuum computing of the future.
Birds of a Feather
TP
XO/EX
DescriptionCloud computing technologies such as elastic scaling, application containerization and orchestration are gaining prevalence in HPC due to their benefits of resource dynamism, automation, reproducibility, and resilience. Similarly, HPC technologies are being integrated into cloud infrastructures to enable traditional HPC workloads and emerging GenAI workloads which have HPC characteristics. This trend is leading to converged computing, an environment that combines the best capabilities from both worlds. In this highly interactive BoF, we invite the broader computing community to discuss its experiences with converged computing and share its views on the future, considering the astronomical growth of GenAI and resource contention.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionJob initialization time of dynamic executables increases as HPC jobs launch on larger numbers of nodes and processes. The processes flood the storage system with a tremendous number of I/O requests for the same files, causing significant performance degradation: nodes remain idle for extended periods waiting for I/O resources, wasting valuable CPU cycles. To remove the I/O bottleneck at job initialization time, a data loader capable of reducing storage-side congestion is necessary. In this paper, we introduce Copper, a read-only cooperative caching layer aimed at enabling scalable data loading on a large number of nodes. We evaluate our data loading solution on the Aurora supercomputer located at Argonne National Laboratory. Our experiments show that Copper is able to load data in near-constant time when scaling from 32 to 8,300 nodes.
Paper
I/O, Storage, Archive
TP
DescriptionA significant drawback of erasure-coded storage is its expensive update traffic. Analysis of real-world production traces shows that partial updates, including partial-block updates and partial-stripe updates, are both common. Existing schemes cannot handle partial updates adequately. The RAID-based scheme coordinates multiple updated entire blocks to update parity, yet it incurs significant network traffic for partial-block updates. The delta-based scheme transmits the updated parts and independently updates parity, yet it cannot share computed delta parts across partial-stripe updates. We propose CoRD, which optimally combines the RAID-based and delta-based schemes to minimize update traffic. It exploits the offset address intersections between multiple updated blocks and only transmits the updated parts to coordinate parity updates. CoRD further addresses cross-block update scenarios by flipping some dedicated blocks to improve performance. Comprehensive evaluations verify the effectiveness of CoRD on the latest traces, reducing update traffic by 37.02%–87.19% and improving performance by 36.54%–231.92% compared to the state of the art.
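For intuition, the following toy sketch (not the CoRD implementation) shows the delta-style parity patch that partial-block updates rely on in a single-parity, RAID-5-like stripe: only the XOR delta of the changed byte range is transmitted and applied to the parity block. The block count, block size, and offsets are illustrative assumptions.

```python
# Toy delta-based parity update for a partial-block write in an XOR-parity stripe.
import numpy as np

block_size = 4096
data = [np.random.randint(0, 256, block_size, dtype=np.uint8) for _ in range(3)]
parity = data[0] ^ data[1] ^ data[2]          # single XOR parity (RAID-5 style)

# Partial update: bytes [100, 356) of block 1 change.
lo, hi = 100, 356
new_part = np.random.randint(0, 256, hi - lo, dtype=np.uint8)
delta = data[1][lo:hi] ^ new_part             # only the updated range is transmitted

data[1][lo:hi] = new_part
parity[lo:hi] ^= delta                        # patch parity in place

assert np.array_equal(parity, data[0] ^ data[1] ^ data[2])
```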
Tutorial
Architecture
Broader Engagement
Performance Evaluation and/or Optimization Tools
Portability
TUT
DescriptionWhile many developers put a lot of effort into optimizing large-scale parallelism, they often neglect the importance of an efficient serial code. Even worse, slow serial code tends to scale very well, hiding the fact that resources are wasted because no definite hardware performance limit (“bottleneck”) is exhausted. This tutorial conveys the required knowledge to develop a thorough understanding of the interactions between software and hardware on the level of a single-CPU core and the lowest memory hierarchy level (the L1 cache). We introduce general out-of-order core architectures and their typical performance bottlenecks using modern x86-64 (Intel Sapphire Rapids) and ARM (Fujitsu A64FX) processors as examples. We then go into detail about x86 assembly code, specifically including vectorization (SIMD), pipeline utilization, critical paths, and loop-carried dependencies. We also demonstrate performance analysis and performance engineering using the Open-Source Architecture Code Analyzer (OSACA) in combination with a dedicated instance of the well-known Compiler Explorer. Various hands-on exercises will allow attendees to make their own experiments and measurements and identify in-core performance bottlenecks. Furthermore, we show real-life use-cases from computational science (sparse solvers) to emphasize how profitable in-core performance engineering can be.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionEnsuring the correctness of MPI+OpenMP programs requires careful consideration of both models and their interactions, as combining them increases complexity and makes development more error-prone. Existing tools primarily focus on a single programming model, offering limited support for hybrid applications. This paper presents four MPI+OpenMP error classes that require a tool to detect concurrent MPI calls within the same process. To detect these concurrent MPI calls, we track OpenMP synchronizations using vector clocks. To enhance the detection of these error classes, we extend the MPI correctness tool MUST with clock-based analyses for MPI+OpenMP checks. We maintain the vector clocks within MUST by utilizing the synchronization information provided by the OpenMP race detector Archer. We evaluate the clock-based analyses on hybrid test cases from MPI-CorrBench and our own test suite. The clock-based analyses correctly classify almost all test cases and significantly improve MUST's ability to detect these MPI+OpenMP errors.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThis animation consists of three primary elements. The first is the evolving render of the 3D structure of the coronal magnetic field. It is created from 776 sequences from the “Live Prediction” HPC simulation developed by Predictive Science Inc. as part of their scientific research and outreach for the April 8, 2024 eclipse. See predsci.com/eclipse2024 for more details on the eclipse simulation, which was created using the MAS code (predsci.com/mas) running on several thousand processors. The volume rendering is used for scientific purposes and is generated using parallelized Fortran tools. These map millions of magnetic field lines from 3D model data and compute a scene by integrating a complexity indicator (the squashing factor) along parallel lines of sight. The only differences between this and the pure science version are the resolution and a very light unsharp masking to enhance contrast. The second is an image of the moon obtained using the LROC Quickmap tool (quickmap.lroc.asu.edu) based on public NASA/LRO data. The third is the slowly moving starfield cropped from a publicly available observation/image from the ESA/Hubble archive [Credit: NASA, ESA and Jesús Maíz Apellániz (Instituto de Astrofísica de Andalucía, Spain), esahubble.org/images/heic1011a]. A gamma correction is applied to the original image data. The three elements are then composited using “screen” blending. The background music is “Fireflies” by Ambient Boy, obtained from uppbeat.io.
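For reference, "screen" blending of channels normalized to [0, 1] follows screen(a, b) = 1 - (1 - a)(1 - b). The short sketch below composites stand-in layers this way; it is purely illustrative of the compositing step, not the production pipeline.

```python
# Illustrative "screen" blend of several normalized image layers.
import numpy as np

def screen(*layers: np.ndarray) -> np.ndarray:
    out = np.zeros_like(layers[0])
    for layer in layers:
        out = 1.0 - (1.0 - out) * (1.0 - layer)
    return out

corona = np.random.rand(512, 512, 3)   # stand-ins for the rendered layers
moon = np.random.rand(512, 512, 3)
stars = np.random.rand(512, 512, 3)
composite = screen(corona, moon, stars)
```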
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe rapid advancement in computing demands and the increasing complexity of modern applications (e.g., image processing, numerical computation, and machine learning) necessitate efficient heterogeneous computing solutions. Project CoVA proposes an MLIR-based compilation flow designed to bridge the gap between high-level algorithm development with Python and diverse hardware architecture development with C++, including CPUs, GPUs, FPGAs, and quantum computing units. By utilizing a high-level MLIR dialect specifically designed for CoVA, we achieve the decoupling of algorithms from backend hardware, facilitating more efficient algorithm development and significantly reducing development cycles. Moreover, our design enhances development efficiency within HPC platforms equipped with heterogeneous accelerators, enabling faster and more streamlined development processes.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge language models (LLMs) are being used increasingly by software developers, researchers, and students to assist them in coding tasks. While newer LLMs have been improving their coding abilities with regards to serial coding tasks, they consistently perform worse when it comes to parallelism and HPC-related coding tasks. Bridging this gap and creating HPC-capable code LLMs could drastically improve the quality and quantity of code research software developers can write. The current poor performance of LLMs on HPC-related problems can be partially attributed to the lack of significant HPC data in their training, which is what we address in this poster. We present HPC Coder v2, a new LLM created by fine-tuning a previous code LLM using HPC synthetic data. We demonstrate that it is one of the most capable open-source LLMs for generating parallel code to date.
Panel
Data-Intensive
Experimental Facility
HPC in Society
Inclusivity
TP
DescriptionJoin us for a panel discussion with experts from diverse scientific domains to explore groundbreaking strategies for democratizing data access and usage. Learn about transformative projects like the Pelican Platform, the National Science Data Fabric, and the High-Performance Data Facility Hub. These initiatives pave the way for an inclusive future in areas such as dark matter research, ecology, wildfire impacts, and new material design, along with technologies like high-throughput computing, information streaming, and remote collaborative access. The discussion will focus on how these efforts advance equity and empower minority-serving institutions, leadership computing facilities, and national experimental setups. This panel offers a unique opportunity for researchers, educators, and technology enthusiasts to gain insights and engage in a national dialogue on the future of scientific collaboration involving massive data ranging from Gigabytes to Petabytes.
Students@SC
TP
W
TUT
XO/EX
DescriptionJoin this interactive career panel moderated by Andrekka “AJ” Lanier from Lawrence Livermore National Laboratory. Students at SC24 will be offered an opportunity to attend this 60-minute session with professionals from national labs, academia, and industry. The intent of this career panel is to inspire the next generation to forge their own paths and encourage them to create a career through HPC.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionPredicting and comparing anti-cancer drug responses using deep learning models across datasets is a challenging modern problem. In this study, we optimized hyperparameters in several novel neural network-based models, including GraphDRP [1], IGTD [2], Paccmann [3], PathDSP [4], and HiDRA [5], and a machine learning model LGBM (built with LightGBM), across multiple public pharmacogenomic datasets: CCLE [6], CTRPv2 [7], gCSI [8], GDSCv1 [9], and GDSCv2 [10]. Our primary objective was to enhance prediction performance and robustness through hyperparameter optimization (HPO) tailored to each dataset. As a result, we have published the HPO results on GitHub for the research community and have started the cross-analysis of these HPO runs. The results are a first effort in the cross-model, cross-dataset HPO analysis (termed “Cross-HPO”) that is now possible. This work will enhance drug discovery candidate evaluation and increase discovery success.
Background: The Innovative Methodologies and New Data for Predictive Oncology Model Evaluation (IMPROVE) project provides a robust framework for this research. We use the Supervisor [11] hyperparameter optimization framework to run the models on ALCF Polaris, a powerful supercomputer with over 2000 NVIDIA A100 GPUs [12].
Finding 1: Dataset-specific HPO: Our findings, illustrated in Fig. 1 for the GraphDRP and IGTD models, emphasize the importance of dataset-specific hyperparameter tuning. Tailoring hyperparameters to individual datasets led to significant performance improvements compared to a uniform approach. This study highlights the inherent complexity and variability in pharmacogenomic data. As shown in Fig. 2, IGTD-gCSI and Pacmann_MCA-gCSI exhibit improved performance with decreasing validation loss over iterations. In contrast, PathDSP-CCLE and HIDRA-CCLE show plateauing losses, indicating possible overfitting or suboptimal hyperparameters. LGBM models consistently achieve lower validation loss, mainly on CCLE, underscoring their potential effectiveness.
Finding 2: Community resources for HPO: This work provides a framework for hyperparameter optimization to enhance model performance and underscores the necessity of dataset-specific tuning for neural network models in cancer drug response prediction. Optimizing hyperparameters can lead to more accurate and reliable predictions, ultimately advancing personalized cancer treatments. We have published the HPO results in an updateable, versioned data structure called the IMPROVE Hall of Fame [13]. The studies were performed over standardized HPO range specifications, published and encoded in a readable JSON format specified for compliance with CANDLE conventions [14]. HPO runs were performed on Polaris but are portable to other systems via Supervisor, and were run at four standard sizes, SMALL, MEDIUM, LARGE, and XL, which specify the number of samples per HPO iteration and number of HPO iterations.
Finding 3: Cross-model behavior analysis for HPO: This data corpus makes many forms of analysis possible. Fig. 3 shows the aggregate performance across all model-dataset runs for loss improvement. This can be used as a reference point for future HPO runs, as the results show an exceptional 56.35% improvement in validation loss. Fig. 4 shows the varying optimal hyperparameters for IGTD across datasets. In Fig. 5, we show model-dataset results in an at-a-glance format, so that a fingerprint of the different behaviors of the combinations can be quickly seen.
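As a schematic of the kind of sweep behind these findings, the sketch below runs a small random-search HPO loop over a JSON range specification in the spirit of the CANDLE-compliant range files described above. The parameter names, ranges, and placeholder training function are assumptions for illustration; the actual sweeps were driven by Supervisor at the SMALL, MEDIUM, LARGE, and XL sizes.

```python
# Illustrative random-search HPO loop over a JSON range specification.
import json
import math
import random

ranges = json.loads("""{
  "learning_rate": {"type": "log_uniform", "low": 1e-5, "high": 1e-2},
  "batch_size":    {"type": "choice", "values": [16, 32, 64, 128]},
  "dropout":       {"type": "uniform", "low": 0.0, "high": 0.5}
}""")

def sample(spec):
    if spec["type"] == "choice":
        return random.choice(spec["values"])
    if spec["type"] == "uniform":
        return random.uniform(spec["low"], spec["high"])
    return 10 ** random.uniform(math.log10(spec["low"]), math.log10(spec["high"]))

def train_and_validate(params):
    # Placeholder for one full model-dataset training run returning validation loss.
    return random.random()

best = None
for _ in range(20):                      # a SMALL-sized sweep, for illustration
    params = {name: sample(spec) for name, spec in ranges.items()}
    loss = train_and_validate(params)
    if best is None or loss < best[0]:
        best = (loss, params)
print("best validation loss and parameters:", best)
```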
Paper
Distributed Computing
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
TP
Best Paper Finalist
DescriptionOrganizing computation as asynchronous tasks with data-driven dependencies is a simple and efficient model for single- and multi-GPU programs. Sequential Task Flow (STF) is such a model, which derives task graphs from data dependencies.
We propose CUDASTF, a C++ library that implements STF over CUDA APIs, fostering easy creation of scalable and composable algorithms. Users may easily elect to use CUDA graphs instead of streams if needed. Structured kernels spanning multiple devices can exercise fine-grained control of affinity.
Implementation-wise, CUDASTF makes a compelling argument for an event-based approach to asynchronous parallel libraries. We obtain up to a 1.8× improvement over the cusolverMg library on Cholesky decomposition. On a small weather simulation task we demonstrate near-optimal scalability of our multi-GPU kernels; also, on a single GPU, CUDA graphs improve performance by up to 30%. Finally, we were able to author the first implementation of the CKKS Fully Homomorphic Encryption scheme over multiple devices.
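To convey the core STF idea in a language-neutral way, the following sketch derives dependency edges from per-task read/write sets, as a sequential task flow does (a simplified model that only tracks the last writer of each data piece). The task names and data pieces are made up; CUDASTF itself is a C++ library layered on CUDA streams, events, and graphs.

```python
# Conceptual sketch: deriving a task graph from data dependencies.
tasks = [
    ("potrf_A", {"reads": set(),      "writes": {"A"}}),
    ("trsm_B",  {"reads": {"A"},      "writes": {"B"}}),
    ("syrk_C",  {"reads": {"B"},      "writes": {"C"}}),
    ("gemm_D",  {"reads": {"A", "B"}, "writes": {"D"}}),
]

edges = set()
last_writer = {}
for name, access in tasks:
    for piece in access["reads"] | access["writes"]:
        if piece in last_writer:           # read-after-write / write-after-write edge
            edges.add((last_writer[piece], name))
    for piece in access["writes"]:
        last_writer[piece] = name

print(sorted(edges))   # the edges a runtime would turn into stream/event dependencies
```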
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe data is processed in ParaView and then transferred to Artifact-Based Rendering, a custom-built visualization system designed for artists that enables one to apply custom artifacts to large multivariate volumetric data. Details about Artifact-Based Rendering can be found at www.sculpting-vis.org.
Prior work within Artifact-Based Rendering has focused on colormaps and glyphs. Here we have turned our attention to the potential of line textures to distinguish different categories of streamlines and enrich their visual impact.
All the visual encodings of the data were hand-generated and applied by the artist via the Artifact-Based Rendering interface. No ML was used to create this image.
Paper
Accelerators
Algorithms
Data Compression
I/O, Storage, Archive
Performance Optimization
TP
Best Student Paper Finalist
DescriptionExisting GPU lossy compressors suffer from expensive data movement overheads, inefficient memory access patterns, and high synchronization latency, resulting in limited throughput. This work proposes cuSZp2, a generic single-kernel error-bounded lossy compressor purely on GPUs designed for applications that require high speed, such as large-scale GPU simulation and large language model training. In particular, cuSZp2 proposes a novel lossless encoding method, optimizes memory access patterns, and hides synchronization latency, achieving extreme end-to-end throughput and optimized compression ratio. Experiments on NVIDIA A100 GPU with nine real-world HPC datasets demonstrate that, even with higher compression ratios and data quality, cuSZp2 can deliver on average 332.42 and 513.04 GB/s end-to-end throughput for compression and decompression, respectively, which is around 2× of existing pure-GPU compressors and 200× of CPU-GPU hybrid compressors.
Exhibitor Forum
Architecture
Hardware Technologies
TP
XO/EX
DescriptionCXL Consortium member companies are developing CXL solutions capable of enabling HPC and AI workloads with increased system scalability and flexibility. To support the ecosystem of CXL devices entering the market and ensure interoperability in the industry, the CXL Consortium has advanced its compliance program by hosting Compliance Test Events and establishing the CXL Integrators List featuring our member’s CXL solutions.
This session will begin with an update from the Consortium and discuss the growth of the CXL Compliance program, as well as the ecosystem of CXL devices including IP, operating platforms, storage, system-level products, and more. The session will also highlight CXL solutions showcased in the CXL Pavilion (booth #1807) and the necessary components to deploy CXL technology.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThe Distributed Asynchronous Object Storage (DAOS) is an open source software-defined high performance scalable storage system that has redefined performance for a wide spectrum of AI and HPC workloads (https://daos.io/). This PDSW24 WiP session presents the developments in the DAOS Project since the formation of the DAOS Foundation in late 2023, and highlights the features in the recently released DAOS 2.6 as well as ongoing research and development activities.
Birds of a Feather
TP
XO/EX
DescriptionDAOS (https://daos.io/) is an open-source scale-out object store that delivers extremely high performance to the most data-intensive HPC and AI workloads. With the creation of the DAOS Foundation in 2023 and the support of PMem-less DAOS servers, DAOS has seen increasing community contributions and growing adoption in both on-prem and cloud environments.
This BoF brings together the DAOS community to share experiences and to brainstorm on future enhancements of DAOS. It targets application programmers and HPC/AI middleware developers wanting to optimize the I/O path of their applications, administrators of DAOS storage systems, and DAOS software developers alike.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionData assimilation (DA) integrates real-world data into coupled climate models, enhancing prediction accuracy and capturing Earth system complexity. NCAR’s Data Assimilation Research Testbed (DART) is an ensemble DA tool for climate predictions with NCAR’s Community Earth System Model (CESM). Traditionally, DART has modified "restart" files written to disk to influence models, involving significant I/O and stop/restart processes, which are computationally expensive, even on large supercomputers.
This work explores using the National Unified Operational Prediction Capability (NUOPC) layer to enable direct in-memory data transfer between DART and CESM. We describe the development of a NUOPC cap for DART, focusing on strategies for full software integration and minimal disruption to existing functionalities. The infrastructure addresses software incompatibilities and includes decisions on tools, frameworks, and workflow optimizations. This approach aims to enhance efficiency and scalability in data assimilation, offering the first prototype for in-memory data transfer between DART and CESM.
Doctoral Showcase
Posters
TP
DescriptionThe performance of tensor applications is often bottlenecked by data movement across the memory subsystem. This dissertation contributes domain-specific programming frameworks (compilers and runtime systems) that optimize data movement in tensor applications. We develop novel execution reordering and data reorganization techniques, achieving performance portability along with improved programmability.
We present BrickDL, a compiler framework that performs "merged execution" of fused deep learning operators as graph-level optimization. We employ fine-grained data blocking with "bricks" — a data layout of small, fixed-size blocks of contiguously packed data that enhance on-chip data locality on GPUs. BrickDL demonstrates up to 18% improved performance and 16% reduced DRAM data movement compared to existing deep learning frameworks for prominent models on NVIDIA and AMD GPUs.
The sequence of layers in neural networks is analogous to the nested hierarchy of grids in the Geometric Multigrid (GMG) iterative solver. The series of stencil calculations in the GMG V-cycle results in its memory-bound performance. We hence extend the optimizations in BrickDL to BrickGMG, a framework for restructuring computations and exploiting inter-operator reuse in the V-cycle. BrickGMG provides performance portability across NVIDIA, AMD, Intel GPUs, achieving 55% speedup over HPGMG and 73% of Roofline performance on average.
We develop MLTT, a compiler optimization pipeline in LLVM MLIR for arbitrary tensor transpositions, which are the primary performance bottleneck in tensor contractions for transforming data layouts. MLTT is portable across various CPU vector instruction sets. We integrate MLTT with COMET, an MLIR-based compiler, and present speedups of >40% for memory-bound tensor contractions.
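As a minimal illustration of the "brick" layout idea (not the BrickDL code), the sketch below repacks a 2D array into small fixed-size tiles so that each tile is contiguous in memory; the array and brick sizes are arbitrary assumptions.

```python
# Toy repacking of a 2D field into contiguous fixed-size bricks.
import numpy as np

BRICK = 8
field = np.arange(64 * 64, dtype=np.float32).reshape(64, 64)

# (64, 64) -> (8, 8, 8, 8): bricks indexed by (bi, bj), elements within a brick by (i, j)
bricked = field.reshape(64 // BRICK, BRICK, 64 // BRICK, BRICK).transpose(0, 2, 1, 3).copy()

# A computation touching one brick now reads a single contiguous 8x8 tile.
tile = bricked[3, 5]
assert np.array_equal(tile, field[24:32, 40:48])
```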
Paper
Accelerators
Algorithms
Linear Algebra
Modeling and Simulation
Numerical Methods
TP
DescriptionThe Sparse Triangular Solver (SPTRSV) plays a critical role in solving structured grid problems. Yet, the commonly used sparse matrix storage formats for structured grid methods do not efficiently support SPTRSV in utilizing the instruction parallelism offered by modern multi-core CPUs. We introduce DBSR, a new sparse storage format that enables SPTRSV to take advantage of SIMD instructions. DBSR promotes contiguous memory access and vectorized computation, while also optimizing memory usage. We evaluate DBSR by applying it within multigrid algorithms and the zero fill-in incomplete LU preconditioner. Our evaluation, conducted on four architectures -- three ARMv8 systems and one x86 system -- demonstrates that DBSR consistently outperforms mainstream storage formats across evaluation workloads and platforms.
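For context, the sketch below is a plain CSR forward substitution for a lower-triangular solve, i.e., the serial access pattern that formats such as DBSR reorganize into contiguous, vectorizable blocks. The random matrix is an arbitrary assumption, and the code is a baseline illustration rather than the DBSR kernel.

```python
# Reference CSR forward substitution for L x = b (serial baseline).
import numpy as np
import scipy.sparse as sp

def sptrsv_csr(L: sp.csr_matrix, b: np.ndarray) -> np.ndarray:
    x = np.zeros_like(b)
    for i in range(L.shape[0]):
        start, end = L.indptr[i], L.indptr[i + 1]
        cols, vals = L.indices[start:end], L.data[start:end]
        on_diag = cols == i
        off_diag = ~on_diag
        x[i] = (b[i] - vals[off_diag] @ x[cols[off_diag]]) / vals[on_diag][0]
    return x

A = sp.random(200, 200, density=0.05, format="csr") + 10 * sp.eye(200, format="csr")
L = sp.tril(A, format="csr")       # lower-triangular part with a nonzero diagonal
b = np.random.rand(200)
x = sptrsv_csr(L, b)
print(np.allclose(L @ x, b))
```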
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionWhenever one considers the possibility of designing ethical artificial intelligence (AI), it is tempting to think that the success of such a project would depend on whether systems could be built to implement the same kinds of ethical decision-making procedures as the ones we regard as appropriate for humans. This paper calls into question the foregoing line of thought. It argues that (i) the appropriateness of a decision procedure for a given moral agent depends on the nature of the agent’s capacities; (ii) AIs and humans possess capacities that differ in their nature; and (iii) if (i) and (ii), then the appropriate decision procedures for AIs are different from the ones that are appropriate for humans. The temptation to design ethical AI that employs the same decision procedures as humans should be resisted, lest we miss out on the benefits that could be gained from AI that utilizes distinct procedures.
Tutorial
Artificial Intelligence/Machine Learning
TUT
DescriptionDeep learning is rapidly and fundamentally transforming the way science and industry use data to solve problems. Deep neural network models have been shown to be powerful tools for extracting insights from data across a large number of domains, from large language models (LLMs) to protein folding. As these models grow in complexity to solve increasingly challenging problems with larger and larger datasets, the need for scalable methods and software to train them grows accordingly.
The Deep Learning at Scale tutorial aims to provide attendees with a working knowledge of deep learning on HPC-class systems, including core concepts, scientific applications, performance optimization, tips, and techniques for scaling. We will provide training accounts on some of the world's largest GPU systems, example code, and datasets to allow attendees to experiment hands-on with optimized, scalable distributed training of deep neural network machine learning models from real scientific computing applications.
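As a taste of the hands-on material, the sketch below shows a minimal PyTorch DistributedDataParallel training loop of the kind typically launched with torchrun; the model, synthetic data, and hyperparameters are placeholder assumptions rather than the tutorial's actual examples.

```python
# Minimal data-parallel training sketch (launch: torchrun --nproc_per_node=N script.py).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
device = f"cuda:{local_rank}"

model = DDP(torch.nn.Linear(1024, 10).to(device), device_ids=[local_rank])
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

for step in range(100):
    x = torch.randn(32, 1024, device=device)            # synthetic batch per rank
    y = torch.randint(0, 10, (32,), device=device)
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()          # gradients are all-reduced across ranks here
    optimizer.step()

dist.destroy_process_group()
```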
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
DescriptionWe present DeLiBA-K, an improved version of the Development of Linux Block I/O Accelerators (DeLiBA) framework. DeLiBA-K operates at the Linux kernel level, bypassing the user-space interactions of DeLiBA-1 and -2 to interact with the block and network I/O kernel stack directly. Another critical feature of DeLiBA-K is the implementation and benchmarking of the modern io_uring Asynchronous I/O (AIO) API within a 16nm AMD Alveo U280 FPGA I/O framework. This allows for better parallelism and reduced latency in I/O operations. Our results show significant performance gains, up to a 3.2x improvement in I/O operations per second (IOPS) and 3.45x higher throughput for synthetic workloads. Real-world applications see a 30% reduction in execution time for data-intensive tasks. DeLiBA-K has been successfully tested in an industrial environment using real workloads, demonstrating its effectiveness in large-scale enterprise environments.
Tutorial
Broader Engagement
Facilities
TUT
DescriptionHPC leadership and management skills are essential to the success of HPC. This includes securing funding, procuring the right technology, building effective support teams, ensuring value for money, and delivering a high-quality service to users.
This tutorial will provide practical, experience-based training on delivering HPC. This includes stakeholder management, requirements capture, market engagement, hardware procurement, benchmarking, bid scoring, acceptance testing, total cost of ownership, cost recovery models, metrics, and value.
The presenters have been involved in numerous major HPC procurements in several countries, over three decades, as HPC managers or advisors. The tutorial is applicable to HPC procurements and service delivery in most countries, public or private sector, and is based on experiences from a diversity of real-world cases.
The lead author (Jones) has become the de-facto international leader in delivering training on these topics, with a desire to improve the best practice of the community, and without a sales focus or product to favor.
The SC tutorials by these authors have been consistently among the most strongly attended and highly rated by attendees for several years.
Birds of a Feather
TP
XO/EX
DescriptionThe National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) supports the development and provisioning of state-of-the-art cyberinfrastructure (CI) resources, including HPC systems, tools, and services to advance science and engineering. The central vision of OAC is to support inclusive and sustainable research workforce development by leveraging CI across domains. A particular focus is on the integration of cyberinfrastructure and artificial intelligence. We continue to facilitate innovation and innovative usage of CI and CI+AI, democratized access, and the development of sustainable CI ecosystems. We seek to engage the community and institutions to obtain feedback on the evolving needs of workforce development.
Birds of a Feather
TP
XO/EX
DescriptionA recent surge in novel AI accelerator hardware (e.g. from Cerebras, Groq, SambaNova, Graphcore, Intel, and Tenstorrent) has sparked intense interest in running HPC applications and driven algorithmic research on these architectures. However, the maturity levels of associated tooling and software stacks vary significantly, with gaps in documentation and support. This BoF aims to explore trends in AI accelerators for HPC, ideal programming models, software stack support, code portability, training resources, and common challenges. We hope to foster a community of users interested or experienced in using AI accelerators for HPC applications.
ACM Gordon Bell Finalist
TP
DescriptionTraining and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve matrix multiplication kernel performance and overlap non-blocking collectives with computation, and performance modeling to choose performance-optimal configurations.
While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore "catastrophic memorization," where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThere is increasing interest from the research community and industry in using containerized workflows on high performance computing (HPC) systems to tackle “real world” problems, which require HPC resources to efficiently compute and scale complex algorithms across thousands of nodes. Unfortunately, many scientists and HPC center staff are not familiar with the unique requirements and characteristics of deploying containerized workflows in secure HPC environments. They usually develop their HPC workflows and policies for non-containerized environments and do not fully appreciate how containerized workflows and policies deviate from their standard mode of operation. This often leads to misunderstandings about how to deploy and execute containers on HPC systems, specifically in maintaining the security of the HPC system when running containers. In this presentation we discuss the issues associated with the deployment of containers in a secure HPC environment and techniques that can potentially make deploying containers easier.
Exhibits
Flash Session
TP
XO/EX
DescriptionThis is a brief introduction to the automation tools and wrappers to simplify the deployment of NVIDIA HPC containers to help accelerate workloads such as NAMD, GROMACS, and RELION, on Oracle Cloud Infrastructure (OCI). These tools aim to enable seamless, scalable deployment for computationally intensive tasks in molecular dynamics, computational chemistry, and cryo-electron microscopy. By integrating NVIDIA GPU-optimized containers with OCI's high-performance compute capabilities, the OCI AI team helps enhance accessibility, reduce setup time, and improve performance for research and enterprise users, empowering them to efficiently leverage GPU-accelerated HPC applications in cloud environments.
Workshop
Message Passing
Network
W
DescriptionGPUs have become the dominant type of accelerators for high-performance computing and artificial intelligence. To support these systems, new communication libraries have emerged, such as NCCL and NVSHMEM, providing stream-based semantics and GPU-initiated communication. Some of the best performing communication libraries are unfortunately vendor-specific, and may use load-store semantics that have been traditionally underused in the application community. Moreover, MPI has yet to define explicit GPU support mechanisms, making it difficult to deploy the message-passing communication model efficiently on GPU-based systems.
MPI 4.0 introduced Partitioned point-to-point communication, which facilitates hybrid-programming models. Partitioned communication is designed to allow GPUs to trigger data movement through a persistent channel. We extend MPI Partitioned to provide intra-kernel GPU-initiated communication and partitioned collectives, augmenting MPI with techniques used in vendor-specific libraries. We evaluate our designs on an NVIDIA GH200 Grace Hopper Superchip testbed to understand the benefits of GPU-initiated communication on NVLink and InfiniBand networks.
Workshop
Applications and Application Frameworks
W
DescriptionJetstream 2 exists as a unique resource within the ACCESS ecosystem, and as such, expressing the value of and justification for building such an outlier among NSF-funded systems is particularly important. Tichenor et al. have published a return-on-investment calculation for HPC systems. In the context of Jetstream 2's unique system design choices, this calculation provides insight into the unique value proposition of the system. The inherently interactive nature and flexibility of an OpenStack-based system minimize training costs and time to launch. High availability and hardware selections reduce administrative time and time to parallelize, respectively. Minimizing these four terms in the ROI calculation maximizes the value of the system.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionFUSE and syscall interception libraries have been used to provide file systems in user space. However, FUSE suffers from performance degradation, and syscall interception libraries have reliability and portability problems.
This study proposes the design and implementation of a reliable and efficient syscall hooking library using zpoline, a syscall hooking mechanism based on binary rewriting. To support POSIX interfaces in user space, it completely replaces all required system calls with file system function calls. The proposed method achieved performance comparable to the native API and performed 5.3 to 6.4 times better than FUSE.
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Modeling and Simulation
Numerical Methods
TP
DescriptionAs biological research demands simulations with increasingly larger cell counts, optimizing these models for large-scale deployment on heterogeneous supercomputing resources becomes crucial. This requires the redesign of fluid-structure interaction tasks written around distributed data structures built for CPU-based systems, where design flexibility and overall memory footprint are key considerations, to instead be performant on CPU-GPU machines. This paper describes the trade-offs of offloading communication tasks to the GPUs and the corresponding changes to the underlying data structures required, along with new algorithms that significantly reduce time-to-solution. At scale performance of our GPU implementation is evaluated on the Polaris and Frontier leadership systems. Real-world workloads involving millions of deformable cells are evaluated. We analyze the competing factors that come into play when designing a communication layer for a fluid-structure interaction code, including code efficiency, complexity, and GPU memory demands, and offer advice to other high performance computing applications facing similar decisions.
Doctoral Showcase
Posters
TP
DescriptionAs supercomputers advance towards exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission experiences exponential growth. Multi-resolution methods, such as Adaptive Mesh Refinement (AMR), have emerged as an effective solution to address these challenges. Concurrently, error-bounded lossy compression is recognized as one of the most efficient approaches to tackle the latter issue. Despite their respective advantages, few attempts have been made to investigate how the multi-resolution method and error-bounded lossy compression can function together.
To address this gap, this dissertation introduces a series of optimizations for data reduction solutions in multi-resolution simulations:
(1) This dissertation first enhances the offline compression quality of multi-resolution data for different state-of-the-art scientific compressors by adaptively preprocessing the data and optimizing the compressor.
(2) This dissertation then presents a novel in-situ lossy compression framework, utilizing HDF5 and enhanced SZ2, specifically tailored for real-world AMR applications. This framework can reduce I/O costs and improve compression quality.
(3) Furthermore, to extend the usability of multi-resolution techniques, this dissertation introduces a workflow for multi-resolution data compression, applicable to both uniform and AMR simulations. Initially, the workflow employs a Region of Interest (ROI) extraction approach to enable multi-resolution methods for uniform data. Subsequently, to bridge the gap between multi-resolution techniques and lossy compressors, we optimize three distinct compressors, ensuring their optimal performance on multi-resolution data. Lastly, we incorporate an advanced uncertainty visualization method into our workflow to help users understand the potential impacts of lossy compression.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionSeveral MPI correctness benchmarks have been proposed to evaluate the quality of MPI correctness tools.
The design of such a benchmark comes with different challenges, which we will address in this paper.
First, an imbalance in the proportion of correct and erroneous codes in the benchmarks requires careful metric interpretation (recall, accuracy, F1 score).
Second, tools that detect errors but do not report additional information, like the affected source line or class of error, are less valuable.
We extend the typical notion of a true positive with stricter variants that consider a tool's helpfulness.
We introduce a new noise metric to consider the amount of distracting error reports.
We evaluate those new metrics with MPI-BugBench, on the MPI correctness tools ITAC, MUST, and PARCOACH.
Third, we discuss the complexities of hand-crafted and automatically generated benchmark codes and the additional challenges of non-deterministic errors.
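As a refresher on the base metrics that the paper's stricter true-positive variants build on, here is a small generic sketch computing precision, recall, accuracy, and F1 from confusion-matrix counts; it is illustrative only and is not MPI-BugBench code, and the counts shown are hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard metrics over true/false positives and negatives.

    In this setting, a correctness tool's report on an erroneous benchmark
    code counts as a true positive, and a report on a correct code counts
    as a false positive.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}

# Hypothetical counts for a tool evaluated on a mixed benchmark suite.
print(classification_metrics(tp=40, fp=5, tn=30, fn=10))
```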
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThis image was created on the Ohio Supercomputer Center’s Pitzer cluster using Stable Diffusion Automatic1111 (open source) with the SD-XL checkpoint and the following parameters and prompt: "Destruction, (one artist:1.3) pulling strings from walls, war, dark, good quality, masterpiece"; negative prompt: "painting". Steps: 50. Sampler: DPM++ 2M SDE Karras. CFG scale: 7. Size: 1920x1080. Model hash: 31e35c80fc. Model: sd_xl_base_1.0. Version: v1.5.1.
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
DescriptionSpatial (dataflow) computer architectures can mitigate the control and performance overhead of classical von Neumann architectures such as traditional CPUs.
Driven by the popularity of Machine Learning (ML) workloads, spatial devices are being marketed as ML inference accelerators. Despite providing a rich software ecosystem for ML practitioners, their adoption in other scientific domains is hindered by the steep learning curve and lack of reusable software, which makes them inaccessible to non-experts.
We present our ongoing project AIEBLAS, an open-source, expandable implementation of Basic Linear Algebra Routines (BLAS) for the AMD AI Engine. Numerical routines are designed to be easily reused, customized, and composed in dataflow programs, leveraging the characteristics of the targeted device without requiring the user to deeply understand the underlying hardware and programming model.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionBuilding an HPC training program, keeping the program fresh, and meeting the evolving needs of users can be challenging. Considerations for beginning a program include the type of training, when to provide training, advertising methods, and how to measure the effectiveness of the program. Planning to ensure the training program is effective as it continues to change involves how often to offer a course, new hardware or software, time needed for instructor preparation, when to advertise the training schedule, and teaching strategies to engage users and provide the best learning environment possible. In this paper we describe the process we have developed over time to continue meeting the needs of our HPC community at Texas A&M University and beyond with our training program. The purpose is to provide one example of program development that could be useful to other institutions who are beginning an HPC training program.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionMany research software engineers (RSEs) come to high performance computing (HPC) after previously pursuing a related degree, and receiving on-the-job experience as part of a PhD or junior staff position. If we are to increase the diversity of RSEs working in HPC, we need to examine different routes for people, particularly those from non-traditional HPC backgrounds, to obtain the formal education and training needed for them to have a successful career. The UNIVERSE-HPC project has been developing personas, career pathways and a proposed master's-level programme for people looking to become RSEs specializing in HPC or large-scale data science. In this lightning talk, we introduce the pillars of this programme, and discuss some of the challenges in developing a balanced programme given the relatively short duration of UK master's courses.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionThe rapid evolution of the RISC-V architecture presents both opportunities and challenges, particularly for systems lacking support for compressed instructions (RV64G). This paper explores the development of a Fedora Linux distribution tailored specifically for the RV64G architecture, providing a comprehensive narrative of the process from inception to implementation. Key milestones include establishing a robust filesystem hierarchy, creating a cross-compiler, preparing and bootstrapping the target image, integrating a native GCC compiler, and leveraging the Koji build system to streamline package rebuilding. Additionally, we introduce a custom Python application to automate the Koji builds, enhancing efficiency and consistency. Our innovative approach not only addresses the immediate needs of RV64G systems but also lays the groundwork for future advancements in High-Performance Computing (HPC) on the RISC-V platform. This work aims to bridge the gap in the current ecosystem, offering a scalable and maintainable solution that promotes the broader adoption of RISC-V technology.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionThis paper describes the development of performance-portable spline-building kernels on top of Kokkos-kernels. We wish to solve a single matrix equation with multiple right-hand sides. This problem is quite unusual, and thus neither Kokkos-kernels (direct methods) nor Ginkgo (iterative methods) is optimized for it. We develop the required solvers in Kokkos-kernels with a batched serial implementation and optimize them using kernel fusion and sparse matrix storage. We demonstrate that our spline solver works efficiently on NVIDIA A100 and AMD MI250X GPUs, while keeping reasonable performance on CPUs. This effort significantly reduces the development and maintenance cost of spline solvers on exascale supercomputing systems.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionEmbedding theories have emerged as a possible solution to the “curse of dimensionality” of quantum-based computational chemistry methods, which limits the systems that can be studied. One affected area is chiral induced spin selectivity (CISS), a phenomenon in which a molecule’s chirality influences the spin of electrons passing through the molecule. To offer a solution to the problem of dimensionality and to study CISS, we are developing the real-time extension of the wavefunction-based projected density matrix embedding theory (RT-pDMET). Previous work has shown that spin-restricted RT-pDMET is able to accurately capture electron dynamics in model systems and converges to an exact solution. Current work focuses on developing a spin non-collinear RT-pDMET to correctly capture spin phenomena, as well as implementing MPI to take advantage of embarrassingly parallel parts of the algorithm. This method will then be applied to investigate the CISS effect but has wide-reaching applications in the study of spin dynamics.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn high performance computing, researchers often work with extremely large time-series datasets. Compression techniques allow them to shrink their data, enabling quicker and less storage-intensive transfers. TEZip is a compression model that utilizes PredNet (a video prediction model) to predict each frame of a time-series dataset, subtracting these predictions from the actual frames and performing further encoding operations. However, TEZip is currently built using TensorFlow and only supports the PredNet model, which trains and predicts slowly. In this work, we rebuilt TEZip to accommodate PyTorch models, adding support for PredNet as well as ConvLSTM, a simpler time-series prediction model. We found that our PyTorch version (specifically with the ConvLSTM model) results in faster compression and decompression times. This work is significant in extending the capabilities of TEZip and suggests that simple prediction models are worth exploring in the realm of prediction-based compression.
Paper
Algorithms
Artificial Intelligence/Machine Learning
Data Movement and Memory
Graph Algorithms
TP
DescriptionModern HPC workflows involve intricate coupling of simulation, data analytics, and artificial intelligence (AI) applications to improve time to scientific insight. However, current tools are not designed to work with an AI-based I/O software stack that requires tracing at multiple levels of the application. To this end, we designed DFTracer to capture data-centric events from workflows and the I/O stack. DFTracer has three novel features: a unified interface to capture tracing data from different layers of the software stack, an analysis-friendly trace format optimized for efficient loading, and the capability to tag events with workflow-specific context to improve analysis. Additionally, we demonstrate that DFTracer has 1.44x smaller runtime overhead and 7.1x smaller trace size compared to state-of-the-art tools. In conclusion, we demonstrate that DFTracer can capture multi-level performance data with a low overhead of 1-5% from the MuMMI and Megatron-DeepSpeed workflows.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionA panel-format discussion of the morning session, also covering the survey questions.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionWes Brewer will provide an overview of the afternoon agenda and introduce the session presenters.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionTitle: Medical Digital Twins and Virtual Human Models
Abstract: The presentation will provide insights into current activities in the development of virtual human models and efforts to create biomedical digital twins. Perspectives from the published report from the First Virtual Human Global Summit will be shared as well as insights from recent activities in the US and internationally supporting digital twin efforts. Challenges and opportunities will help frame future efforts, collaborations and needs to advance biomedical digital twins in health.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionAbstract: As the demand for generative AI and accelerated computing continues to grow, the need for efficient and reliable data center cooling solutions has never been more critical. Traditional cooling designs struggle to keep pace with the increasing power densities and energy demands of modern accelerated compute data center infrastructures. In this talk, we will explore the revolutionary role of digital twins in the design, operation, and optimization of data center cooling systems. By creating virtual replicas of physical data centers, digital twins enable predictive modeling, real-time monitoring, and scenario testing, offering a smarter approach to managing thermal performance and energy efficiency. Central to this innovation is the use of tools such as NVIDIA Omniverse and Modulus, which facilitate the creation and simulation of digital twins with unprecedented precision. These platforms allow for seamless integration of real-time data, advanced physics modeling, and machine learning, making it possible to simulate various cooling strategies, predict future challenges, and optimize configurations. By partnering with industry leaders like Cadence, Vertiv, and Schneider Electric, we are delivering comprehensive digital twin solutions tailored to address the complex cooling requirements of generative AI workloads. These collaborations bring together expertise in hardware, software, and engineering to create more efficient, flexible, and scalable cooling systems. This presentation will also showcase real-world case studies demonstrating how digital twins, powered by NVIDIA Omniverse and Modulus, have been successfully deployed to tackle the challenges of cooling for generative AI and accelerated compute data centers. Attendees will gain insights into best practices, key technologies, and the future potential of digital twins in driving smarter, more sustainable data center operations. As generative AI and accelerated compute workloads continues to evolve, digital twin technology—backed by strategic industry partnerships—will be at the forefront of innovation, enabling data centers to adapt to changing demands with precision and agility.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionDigital Twins: Reporting Past-, Detecting Current-, and Predicting Future-Failures
Birds of a Feather
TP
XO/EX
DescriptionThis BoF is a forum for the discussion of recent topics of research in the field of disaggregated heterogeneous architectures, their operation and use. The term "disaggregated" describes systems that organize heterogeneous resources in "pools" and dynamically compose them according to the application needs. Advances in interconnects (e.g., SmartNICs, memory coherence protocols, optical switches) open new perspectives. The discussion will address the opportunities and challenges that this approach presents for HPC system operators, end-users, and system and application software developers.
Posters
TP
DescriptionWe present the cross-architecture parallel simulation framework DisCostiC (Distributed Cost in Clusters). It predicts the performance of real or hypothetical, massively parallel MPI(+X) programs on current and future supercomputer systems. The novelty of DisCostiC is that it employs analytical, first-principle performance models at full scale, including cores, chips, nodes, and networks, while being aware of bottlenecks such as memory bandwidth. DisCostiC uses application skeletons in a Domain-Specific Embedded Language (DSEL), which encodes inter-process dependencies and any number of system and code properties, enabling flexible design space exploration. As a consequence of the model-based design, DisCostiC requires much less time and resources than traditional simulators because the application code is never actually run. This is in contrast to state-of-the-art solutions, which are based on trace data and/or simulated code execution and may thus need considerable resources. The resulting traces can be visualized using Chromium Browser, ITAC, or Vampir.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
Tutorial
Artificial Intelligence/Machine Learning
Scalable Data Mining
TUT
DescriptionDeep learning (DL) is rapidly becoming pervasive in almost all areas of computer science, and is even being used to assist computational science simulations and data analysis. A key behavior of these deep neural networks (DNNs) is that they reliably scale, i.e., they continuously improve in performance when the number of model parameters and amount of data grow. As the demand for larger, more sophisticated, and more accurate DL models increases, the need for large-scale parallel model training, fine-tuning, and inference has become increasingly pressing. Subsequently, in the past few years, several parallel algorithms and frameworks have been developed to parallelize model training and fine-tuning on GPU-based platforms. This tutorial will introduce and provide the basics of the state-of-the-art in distributed DL. We will use large language models (LLMs) as a running example, and provide hands-on training in performing three essential tasks for working with DNNs: i. training a DNN from scratch, ii. continued training/fine-tuning of a DNN from a checkpoint, and iii. inference on a trained DNN. We will cover algorithms and frameworks that employ data parallelism (PyTorch DDP and DeepSpeed), and model parallelism (AxoNN).
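To give a flavor of the data-parallel portion of such training, here is a minimal PyTorch DistributedDataParallel sketch; it is generic illustration rather than the tutorial's actual exercises, and it assumes a launcher such as torchrun sets the usual RANK/LOCAL_RANK/WORLD_SIZE environment variables. The toy model and synthetic data are hypothetical.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Process group is initialized from environment variables set by torchrun.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(1024, 10).cuda(local_rank)   # toy model
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.CrossEntropyLoss()

    for step in range(10):   # synthetic batches stand in for a real DataLoader
        x = torch.randn(32, 1024, device=local_rank)
        y = torch.randint(0, 10, (32,), device=local_rank)
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()       # gradients are all-reduced across ranks here
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```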
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe rise of LLMs (Large Language Models) has increased the need not only for larger amounts of data, but also for data quality. In order to achieve this, data engineering procedures like deduplication are applied. Deduplication is a very computationally intensive task that typically requires HPC (High Performance Computing) environments when dealing with large datasets. This work presents a large-scale scientific open-source pipeline to perform exact deduplication over a corpus of data. The proposed approach uses the distributed file system as a feasible and simple way to address common limitations and challenges that arise when working with shared SLURM-based HPC environments.
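For illustration, here is a minimal sketch of the core exact-deduplication idea: hash each document and keep only the first occurrence, with hash shards kept as files in a directory that stands in for coordination through a shared file system. It is not the authors' pipeline; the directory name and shard count are hypothetical, and in a real run each SLURM task would own a disjoint set of shards to avoid write conflicts.

```python
import hashlib
import os

SHARED_DIR = "dedup_shards"   # stand-in for a shared-filesystem path
NUM_SHARDS = 256

def doc_hash(text: str) -> str:
    return hashlib.sha256(text.strip().encode("utf-8")).hexdigest()

def is_new(h: str) -> bool:
    """Record hash h in its shard file; return True only on first sighting.

    Shard files live on the shared file system, so independent tasks can
    deduplicate disjoint hash ranges without a central service.
    """
    shard = int(h[:2], 16) % NUM_SHARDS
    path = os.path.join(SHARED_DIR, f"shard_{shard:03d}.txt")
    seen = set()
    if os.path.exists(path):
        with open(path) as f:
            seen = {line.strip() for line in f}
    if h in seen:
        return False
    with open(path, "a") as f:
        f.write(h + "\n")
    return True

if __name__ == "__main__":
    os.makedirs(SHARED_DIR, exist_ok=True)
    corpus = ["a duplicated document", "a unique document", "a duplicated document"]
    kept = [d for d in corpus if is_new(doc_hash(d))]
    print(kept)   # the second copy of the duplicated document is dropped
```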
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionCheckpointing — i.e., regularly saving relevant data in a resilient store — is a common approach to protect programs against hardware failures on clusters. While existing checkpointing libraries, such as VeloC, focus on iterative applications, Asynchronous Many-Task (AMT) programs pose specific requirements that affect the design of this store.
Many AMT runtimes use independent worker processes that balance their load via work stealing. The workers naturally write separate checkpoints autonomously at their respective task boundaries. To keep them in sync, many small write operations are performed at unpredictable time intervals. Reads, on the other hand, are rare. Recovery can be localized, but then involves complicated protocols and transactions.
This talk will elaborate on the specific features that AMT checkpointing and recovery requires from a resilient store. We'll discuss existing storage solutions, of which none seem sufficient yet. We'll argue that a distributed, in-memory, key-value store may be most appropriate.
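As a rough illustration of the access pattern described above (many small autonomous puts at task boundaries, rare reads during recovery), the sketch below shows a toy key-value checkpoint interface backed by an in-process dictionary; it merely stands in for the distributed, in-memory key-value store the talk argues for, and all names are hypothetical.

```python
import pickle

class CheckpointStore:
    """Toy key-value checkpoint store.

    A real store would be distributed and replicated; here a dictionary is
    enough to show the interface: workers put small checkpoints keyed by
    (worker, task) at task boundaries and only read them back on recovery.
    """
    def __init__(self):
        self._data = {}

    def put(self, worker_id: int, task_id: int, state) -> None:
        self._data[(worker_id, task_id)] = pickle.dumps(state)

    def get(self, worker_id: int, task_id: int):
        blob = self._data.get((worker_id, task_id))
        return None if blob is None else pickle.loads(blob)

# A worker checkpoints autonomously after each finished task ...
store = CheckpointStore()
store.put(worker_id=3, task_id=17, state={"partial_sum": 42.0})
# ... and reads back only during recovery after a failure.
print(store.get(3, 17))
```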
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
DescriptionWe consider a sparse matrix-matrix multiplication (SpGEMM) setting where one matrix is square and the other is tall and skinny. This special variant, which we call TS-SpGEMM, has important applications in multi-source breadth-first search, influence maximization, sparse graph embedding, and algebraic multigrid solvers. Unfortunately, popular distributed algorithms like sparse SUMMA deliver suboptimal performance for TS-SpGEMM. To address this limitation, we develop a novel distributed-memory algorithm tailored for TS-SpGEMM. Our approach employs customized 1D partitioning for all matrices involved and leverages sparsity-aware tiling for efficient data transfers. In addition, it minimizes communication overhead by incorporating both local and remote computations. On average, our TS-SpGEMM algorithm attains 5X performance gains over 2D and 3D SUMMA. Furthermore, we use our algorithm to implement multi-source breadth-first search and sparse graph embedding algorithms and demonstrate their scalability up to 512 nodes (or 65,536 cores) on NERSC Perlmutter.
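To make the tall-skinny, 1D-partitioned setting concrete, here is a small SciPy sketch in which the square matrix is split into row blocks, as individual ranks would own them, and each block computes its rows of the product locally. This only illustrates the data layout; it is not the paper's communication-avoiding algorithm, and the sizes and densities are hypothetical.

```python
import numpy as np
import scipy.sparse as sp

rng = np.random.default_rng(0)
n, k, nblocks = 1_000, 16, 4          # square n x n times tall-skinny n x k

A = sp.random(n, n, density=1e-3, format="csr", random_state=rng)
B = sp.random(n, k, density=1e-2, format="csr", random_state=rng)

# 1D row partitioning: block p owns rows [p*n/nblocks, (p+1)*n/nblocks) of A
# and computes the matching row block of C = A @ B locally; the distributed
# algorithm would fetch only the rows of B touched by the local nonzeros.
rows_per_block = n // nblocks
C_blocks = []
for p in range(nblocks):
    A_p = A[p * rows_per_block:(p + 1) * rows_per_block, :]
    C_blocks.append(A_p @ B)

C = sp.vstack(C_blocks)
assert abs(C - A @ B).sum() < 1e-9    # blocked result matches the full product
print(C.shape, C.nnz)
```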
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionWe present an assignment for a full Parallel Computing course. Since 2017/2018, we have proposed a different problem each academic year to illustrate various methodologies for approaching the same computational problem using different parallel programming models. They are designed to be parallelized using shared-memory programming with OpenMP, distributed-memory programming with MPI, and GPU programming with CUDA or OpenCL. The problem chosen for this year implements a brute-force solution for exact DNA sequence alignment of multiple patterns. The program searches for exact coincidences of multiple nucleotide strings in a long DNA sequence. The sequential implementation is designed to be clear and understandable to students while offering many opportunities for parallelization and optimization.
This assignment addresses key basic concepts that many students find difficult to apply in practical scenarios: race conditions, reductions, collective operations, and point-to-point communications. It also covers the problem of parallel generation of pseudo-random sequences and strategies to notify and stop speculative computations when matches are found. This assignment serves as an exercise that reinforces basic knowledge and prepares students for more complex parallel computing concepts and structures. It has been successfully implemented as a practical assignment in a third-year Parallel Computing course of a Computer Engineering degree program. Supporting materials for previous assignments in this series are available at https://gamuva.infor.uva.es/peachy-assignments/
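For context, the following is a minimal sequential reference of the brute-force exact-match kernel that such an assignment starts from, written in Python purely for illustration (the assignment itself targets OpenMP, MPI, and CUDA/OpenCL). The sequence and patterns are hypothetical; the outer loop over candidate positions is the embarrassingly parallel part, and counting matches is a reduction.

```python
import random

def brute_force_matches(sequence: str, pattern: str):
    """Return all start positions where pattern occurs exactly in sequence."""
    hits = []
    m = len(pattern)
    for i in range(len(sequence) - m + 1):   # candidate positions: parallel loop
        if sequence[i:i + m] == pattern:
            hits.append(i)
    return hits

if __name__ == "__main__":
    random.seed(42)   # reproducible pseudo-random DNA sequence
    dna = "".join(random.choice("ACGT") for _ in range(100_000))
    patterns = ["ACGTAC", "TTTTTT", dna[500:520]]
    for p in patterns:
        print(p, len(brute_force_matches(dna, p)))
```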
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionThis study examines gender bias in LLMs by comparing model-generated responses with those of human respondents. A questionnaire based on the Gender Equality Public Opinion Survey was employed, with virtual personas reflecting the demographic distribution of participants, consistent with the human survey. These personas engaged in role-playing scenarios using two distinct LLMs. Statistical analysis identified significant differences between AI models and human survey data, underscoring the regional specificity of gender equality perceptions and the limitations of LLMs in capturing nuanced social dynamics. Furthermore, the study addresses the potential consequences of over-filtering, which may reduce diverse viewpoints, including those of minority groups. These findings highlight the necessity of culturally sensitive bias mitigation strategies and ensuring diversity when applying LLMs in cultural and social contexts.
Paper
Accelerators
Algorithms
Data Movement and Memory
Graph Algorithms
TP
DescriptionBreadth-first search (BFS) is a fundamental building block of various high-performance computing applications beyond graph analysis and also known as a benchmark problem in the Graph500 list. The increasing volume of global data demands efficient distributed BFS, which, however, is hindered by the high communication costs of exchanging vertex data between compute nodes. To address this challenge, this paper introduces four techniques: (i) forest pruning, which reduces the number of vertices by eliminating those unnecessary for the search; (ii) group reordering and (iii) multilevel bitmap compression, which decrease the memory footprint of graph data, thereby enabling fewer nodes to manage larger graphs; and (iv) adaptive parameter tuning, which quickly optimizes the hyperparameters of the BFS algorithm. In the evaluation using 152,064 nodes of the supercomputer Fugaku, our implementation achieved 198 tera-traversed edges per second, doubling the performance reported in the latest Graph500-related study on Fugaku.
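For readers less familiar with the kernel, here is a minimal level-synchronous BFS sketch over an adjacency-list graph; it shows the baseline traversal whose per-level frontier exchange dominates distributed runs, and is not the Fugaku implementation. The example graph is hypothetical.

```python
from collections import deque

def bfs_parents(adj, source):
    """Level-synchronous BFS returning the parent of every reached vertex.

    adj maps a vertex to its neighbor list. In a distributed setting each
    level's frontier must be exchanged between compute nodes, which is the
    communication cost that pruning, reordering, and bitmap compression aim
    to reduce.
    """
    parent = {source: source}
    frontier = deque([source])
    while frontier:
        next_frontier = deque()
        for u in frontier:
            for v in adj.get(u, []):
                if v not in parent:
                    parent[v] = u
                    next_frontier.append(v)
        frontier = next_frontier
    return parent

# Tiny example graph (hypothetical).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
print(bfs_parents(adj, source=0))
```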
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionFoundation models (FMs) have the potential to help drive experiments towards new discoveries, provided they are augmented with proper scientific knowledge and their reasoning is verified to assure self-consistency and compliance with basic physical principles. We are developing a Virtual Foundation Model (VFM) as an Operating System (VFMOS) infrastructure that manages Data, Model and Trust Augmentation Agents hidden under the VFM interface that dynamically retrieve the required knowledge, select suitable FMs and guide the reasoning process for domain specific use cases. We demonstrate benefits for autonomous characterization of defect sites in microscopy of material thin films, and for exploration of new molecules with interesting properties.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionLogin nodes, also known as interactive nodes, are an essential component of high-performance computing (HPC) environments. Covering a range of different use cases, they provide users of a cluster with flexibility when developing and running programs. However, the resource management of these shared machines is an ongoing problem within the community, as each institution has different requirements of these nodes. In most cases, these nodes are shared simultaneously by multiple users, and providing users with freedom and fairness has proved to be a difficult task. To address these challenges, we introduce Arbiter, a software suite for setting and enforcing resource usage policies on Linux nodes via cgroups. In this paper we discuss the previous version of Arbiter, and the updates and improvements we have made in the new Arbiter 3.
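As a rough sketch of the underlying mechanism such a policy engine relies on, the snippet below writes memory and CPU limits into cgroup v2 control files for a per-user group. It is not Arbiter's code; the group naming and limit values are hypothetical, and it requires root on a system with the unified cgroup hierarchy.

```python
from pathlib import Path

CGROUP_ROOT = Path("/sys/fs/cgroup")   # cgroup v2 unified hierarchy

def limit_user(uid: int, mem_max_bytes: int, cpu_quota_us: int,
               cpu_period_us: int = 100_000):
    """Create a per-user cgroup and write memory/CPU limits into it.

    Illustrates the cgroup v2 interface a policy engine sits on top of:
    memory.max takes a byte count, cpu.max takes "<quota> <period>" in
    microseconds.
    """
    cg = CGROUP_ROOT / f"login_user_{uid}"
    cg.mkdir(exist_ok=True)
    (cg / "memory.max").write_text(str(mem_max_bytes))
    (cg / "cpu.max").write_text(f"{cpu_quota_us} {cpu_period_us}")

# Example policy: cap a user at 4 GiB of memory and two cores' worth of CPU.
# limit_user(uid=1000, mem_max_bytes=4 * 2**30, cpu_quota_us=200_000)
```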
Birds of a Feather
TP
XO/EX
DescriptionWith recent exascale systems, power, energy and density have motivated moving to cold plate liquid cooling. However, the control systems are reactive. ORNL's control system takes minutes to deliver cold water, while Frontier instantaneously turns megawatts into waste heat. This highly interactive BoF will describe and discuss recent efforts in unifying power measurement systems at the compute node level with liquid cooling controls and CDUs. The discussion of future directions will include sites, system integrators and CDU vendors to discuss the benefits of first steps and the long-term goals of predictive load systems for proactive rather than passive cooling controls.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
DescriptionReliable and uninterrupted operation is crucial in supercomputers, especially during failures or inconsistencies, i.e., anomalies. In this paper, we present a federated adaptive Digital Twin (DT) framework, with a focus on enhancing anomaly detection -- a critical aspect of modern data center management. Our DT continuously monitors key metrics, detects anomalies using AI, and dynamically adjusts its monitoring parameters to ensure optimal performance. Using a dashboard, our system provides real-time alarms and detailed visualizations of detected anomalies, along with real-time visualization and forecasts for selected metrics. Through a series of experiments, we validate the effectiveness of our approach in maintaining operational reliability and promptly identifying potential anomalies within the data center.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThis paper investigates the applicability of artificial neural networks (ANNs) for developing non-destructive tests (NDT) with non-intrusive load monitoring (NILM) in manufacturing, specifically with gas metal arc welding (GMAW).
For GMAW, we show that the power drawn by the welder is sufficient to accurately identify anomalies with an ANN. ANNs can utilize raw data without requiring subject matter experts for preprocessing or feature engineering. Echo State Networks (ESNs) can learn from only one data point, one example weld, due to their use of the pseudo-inverse matrix method for training. This allows implementation of NILM in a wide range of manufacturing processes where large amounts of training data are unavailable or impractical to collect.
The comparative analysis shows that models trained with backpropagation, such as transformers, demand a large amount of training data to achieve results similar to ESNs, making them unrealistic in scenarios with limited training data availability.
Paper
Architecture
Codesign
Data Movement and Memory
Energy Efficiency
Green Computing
Linear Algebra
TP
DescriptionThis work introduces ECOLIFE, the first carbon-aware serverless function scheduler to co-optimize carbon footprint and performance. ECOLIFE builds on the key insight of intelligently exploiting multi-generation hardware to achieve high performance and lower carbon footprint. ECOLIFE designs multiple novel extensions to Particle Swarm Optimization (PSO) in the context of serverless execution environment to achieve high performance while effectively reducing the carbon footprint.
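To illustrate the base metaheuristic that ECOLIFE extends, here is a plain particle swarm optimization loop minimizing a toy objective; it is not ECOLIFE's scheduler, its extensions, or its carbon/performance cost model, and all parameters are hypothetical defaults.

```python
import numpy as np

def pso_minimize(f, dim, n_particles=30, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    """Plain particle swarm optimization over the box [-5, 5]^dim."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-5, 5, (n_particles, dim))   # positions
    v = np.zeros_like(x)                         # velocities
    pbest = x.copy()                             # per-particle best positions
    pbest_val = np.apply_along_axis(f, 1, x)
    g = pbest[pbest_val.argmin()].copy()         # global best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, -5, 5)
        vals = np.apply_along_axis(f, 1, x)
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = x[improved], vals[improved]
        g = pbest[pbest_val.argmin()].copy()
    return g, pbest_val.min()

# Toy quadratic objective standing in for a (carbon, latency) cost model.
best_x, best_val = pso_minimize(lambda z: float(np.sum(z ** 2)), dim=4)
print(best_x, best_val)
```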
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThere is a growing need to acquire a larger quantity of meteorological data to address climate change. In this paper, we design an improved Automatic Weather Station (AWS) based on a prototype from the National Center for Atmospheric Research (NCAR). We integrate this weather station with IBIS, a platform for adaptable, multi-sensor data collection on edge devices. Our solution utilizes a Raspberry Pi 4 to aggregate sensor data from AWSs over LOng-RAnge (LoRa) radio frequency. A real-time data visualization platform, built with Grafana and InfluxDB and hosted on the Chameleon testbed, is presented. We show how the expanded peripherals allow for the implementation of novel weather forecasting techniques and demonstrate the power efficiency of our solution by comparing the power consumption of our choice of microcontroller to the Raspberry Pi. Lastly, we examine how our implementation can address challenges in big-data weather forecasting.
Panel
Education
Quantum Computing
TP
DescriptionQuantum computing is a promising paradigm for solving complex problems and accelerating progress in many areas of high-performance computing. The convergence of quantum technologies and high-performance computing offers unique opportunities for research and algorithm development, and requires a skilled workforce to exploit its full potential. As a result, there is a growing demand for hybrid and quantum computing education worldwide.
This panel brings together experts from leading supercomputing centers and the quantum computing industry to address the integration of QC into educational frameworks worldwide. The goal is to explore the dynamic interface between QC and HPC education, focusing on evolving user needs, necessary support systems, and innovative strategies to broaden participation in this emerging field. Through interactive discussions, the panel will explore the educational infrastructure and collaborative initiatives essential to cultivating a competent workforce adept at exploiting the capabilities of quantum computing.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionThe advent of chiplet technology introduces cutting-edge opportunities for constructing highly heterogeneous platforms with specialized accelerators. However, the HPC community currently lacks expertise in hardware development, a gap that must be bridged to leverage these advancements. Additionally, technologies like chiplets are cutting-edge, with limited educational resources available. This paper addresses potential hardware specialization directions in HPC and how to cultivate these skills among students and staff, emphasizing the importance of understanding and developing custom hardware (e.g., rapid prototyping and resource estimation). We have been mentoring graduate-level students and new staff in hardware design in a hands-on manner, encouraging them to utilize modern open-source hardware tools for their designs, which facilitates the sharing of research ideas. Additionally, we provide a summary of these tools as part of our approach to prototyping and mentoring.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionAs the capabilities of large language models (LLMs) continue to expand, with more accurate and powerful models being released monthly, researchers and educators are increasingly eager to incorporate these tools. The growing demand for this technology reflects its transformative potential in natural language and its impact on scientific research. However, as more users seek to harness the power of LLMs, the need to provide comprehensive education and scalable support becomes ever more critical. Our institution has recognized this challenge and developed a support framework to educate users through educational events, consultations, and project support. We have implemented several key strategies to address the growing need for LLM support, including deploying Jupyter Lab sessions using Open OnDemand for seamless HPC access and integrating cloud-based solutions via Jetstream2. We provide insights into our approach, detailing how we empower researchers and educators to leverage the capabilities of LLMs in their diverse applications.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionEnergy consumption has become a major cost factor in the procurement and operation of large-scale HPC data centers. In addition, funding bodies and governments are starting to focus on assessing and improving energy efficiency, as well as reducing the overall environmental impact of data centers, such as lowering carbon usage. The goal of the EE-HPC project is to develop targeted, job-specific control and optimization of the hardware to enable more efficient energy usage of HPC systems. The project started at the end of 2022 and builds upon the existing, stable software components ClusterCockpit [1] and LIKWID [2] developed by FAU, which provide a simple, robust, secure, and scalable monitoring and energy-control framework for hybrid HPC cluster management. The EE-HPC project is developing energy-aware software components that will be integrated with ClusterCockpit for power monitoring and for reducing the energy consumption of the system.
Doctoral Showcase
Posters
TP
DescriptionMachine learning is a fundamental tool used in fields across academia and industry. Because of the large amounts of data needed to train machine learning models, compression plays a critical role in storage by reducing the data footprint. Machine learning uses algorithms and models to learn patterns in data, allowing AI to make decisions without explicit programming; compression, on the other hand, uses encoding and decoding techniques to reduce file size. Compression can be lossy or lossless: lossy compression discards some information, while lossless compression preserves the data exactly.
This dissertation explores the accuracy and scalability of machine learning when working with lossy, distorted data. The performance metrics studied capture how accurately the model's inference performs. The challenges for machine learning on lossy data arise at the intersection of machine learning and lossy compression and involve data storage, data transfer bandwidth, and processing. Across these issues, machine learning is examined in different domains, and this work investigates how meaningful patterns are extracted from the distorted data.
The primary focus explores neural network models' ability to manage lossy compressed data and find ways to mitigate loss due to distortion, addressing machine learning across various domains including object detection, semantic segmentation, and image classification, to find the balance between compression ratio and data quality.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDynamic graphs, characterized by their evolving topologies over time, necessitate continuous updates to their associated graph properties, including shortest paths, vertex coloring, and strongly connected components. Traditional static graph algorithms, which re-compute properties following each modification, typically falter in efficiency under such conditions. In this paper, we introduce a suite of methodologies implemented within our software platform, CANDY, designed to efficiently analyze dynamic graphs. We propose a generic framework that supports the parallel updating of graph properties across large networks subject to various types of changes. Our results demonstrate the enhanced performance of these update algorithms in managing large dynamic networks, highlighting significant improvements over conventional approaches.
Tutorial
Accelerators
Numerical Methods
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TUT
DescriptionOver the past decade, GPUs have become ubiquitous in HPC installations around the world, delivering the majority of the performance of some of the largest supercomputers (e.g., Summit, Sierra, JUWELS Booster). This trend continues in the recently deployed and upcoming pre-exascale and exascale systems (JUPITER, LUMI, Leonardo; El Capitan, Frontier, Aurora): GPUs are chosen as the core computing devices to enter this next era of HPC.
To take advantage of future GPU-accelerated systems with tens of thousands of devices, application developers need to have the proper skills and tools to understand, manage, and optimize distributed GPU applications.
In this tutorial, participants will learn techniques to efficiently program large-scale multi-GPU systems. Programming multiple GPUs with MPI is explained in detail, and advanced tuning techniques as well as complementary programming models such as NCCL and NVSHMEM are presented. Analysis tools are shown and used to motivate and implement performance optimizations. The tutorial teaches fundamental concepts that apply to GPU-accelerated systems in general, using the NVIDIA platform as an example. It is a combination of lectures and hands-on exercises, using a development system for JUPITER (JEDI), for interactive learning and discovery.
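For orientation, a minimal sketch of the kind of multi-GPU code involved is shown below. It is illustrative only, not tutorial material; it assumes a CUDA-aware MPI library (so device pointers can be passed to MPI directly), one GPU assigned per rank, and an arbitrary buffer size.

// Minimal sketch: one GPU per MPI rank, exchanging a device buffer in a ring
// via a CUDA-aware MPI implementation (assumption; otherwise the buffers
// would have to be staged through host memory).
#include <mpi.h>
#include <cuda_runtime.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Select one GPU per rank (assumes ranks per node <= GPUs per node).
    int ndev = 0;
    cudaGetDeviceCount(&ndev);
    cudaSetDevice(rank % ndev);

    const int n = 1 << 20;
    double* d_send = nullptr;
    double* d_recv = nullptr;
    cudaMalloc(&d_send, n * sizeof(double));
    cudaMalloc(&d_recv, n * sizeof(double));
    cudaMemset(d_send, 0, n * sizeof(double));

    // Ring exchange: device buffers are handed straight to MPI.
    int right = (rank + 1) % size;
    int left  = (rank - 1 + size) % size;
    MPI_Sendrecv(d_send, n, MPI_DOUBLE, right, 0,
                 d_recv, n, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    if (rank == 0) std::printf("ring exchange of %d doubles complete\n", n);

    cudaFree(d_send);
    cudaFree(d_recv);
    MPI_Finalize();
    return 0;
}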
Doctoral Showcase
Posters
TP
DescriptionGraph-structured data analysis is extensively used in various real-world applications, including biology, social media, and recommendation systems. With the increasing prevalence of real-time data, many graphs become dynamic and evolve over time. Thus, dynamic graph processing systems become a necessary tool to store these real-time updates and continuously run analytic algorithms to provide insights into the data. However, these systems require special design to efficiently support both tasks. As a result, we have seen a growing demand in this research direction in recent years, as numerous low-level data structures and high-level systems have addressed different aspects of dynamic graph processing.
With the high demand for data-intensive systems and the growing volume of data, many emerging storage hardware technologies have been added to the storage hierarchy, including persistent memory. Due to its promising features such as low latency, high density, and byte-addressable accessibility, persistent memory has gained the attention of researchers and developers of high-performance data-intensive applications. As such, it is not surprising that we expect to see persistent memory usage in dynamic graph processing systems due to the need for high performance and capacity. Therefore, our research aims to explore efficient ways of designing and implementing dynamic graph processing systems on persistent memory.
Paper
Accelerators
Data Movement and Memory
Emerging Technologies
Hardware Technologies
Heterogeneous Computing
Linear Algebra
Network
TP
DescriptionDeep learning (DL) models are becoming bigger, easily exceeding the memory capacity of a single accelerator. Recent progress in large DL training uses CPU memory as an extension of accelerator memory and offloads tensors to CPU memory to save accelerator memory. This solution transfers tensors between the two memories, creating a major performance bottleneck. We identify two problems during tensor transfers: (1) coarse-grained tensor transfers make it difficult to hide transfer overhead, and (2) redundant transfers unnecessarily migrate value-unchanged bytes from CPU to accelerator. We introduce a cache-coherence interconnect based on Compute Express Link (CXL) to build a cache-coherence domain between CPU memory and accelerator memory. By slightly extending CXL to support an update cache-coherence protocol and avoiding unnecessary data transfers, we reduce training time by 33.7% (up to 55.4%) without changing model convergence and accuracy, compared with the state-of-the-art work in DeepSpeed.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionThe Barnes-Hut approximation for N-body simulations reduces the time complexity of the naive all-pairs approach from O(N^2) to O(N log N) by hierarchically aggregating nearby particles into single entities using a tree data structure.
This inherently irregular algorithm poses substantial challenges for performance portable implementations on multi-core CPUs and GPUs.
We introduce two portable, fully parallel Barnes-Hut implementation strategies that trade off different levels of GPU support for performance: an unbalanced concurrent octree, and a balanced bounding volume hierarchy sorted by a Hilbert space-filling curve.
We implement these algorithms in portable ISO C++ using parallel algorithms and concurrency primitives like atomics.
The results demonstrate competitive performance on a range of CPUs and GPUs.
Additionally, they highlight the effectiveness of the par execution policy for highly concurrent irregular algorithms, outperforming par_unseq on CPUs and GPUs with Independent Forward Progress.
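As a hedged illustration of that last point, the sketch below (not code from the paper) expresses an irregular, concurrent update with ISO C++ parallel algorithms using the par execution policy and atomics. It assumes a parallel-algorithms backend (e.g., TBB with GCC, or nvc++ -stdpar for GPUs); the cell counts and particle assignment are arbitrary.

// Irregular concurrent update via std::for_each with std::execution::par;
// each iteration synchronizes through an atomic counter, the style of
// concurrency primitive the abstract above refers to.
#include <algorithm>
#include <atomic>
#include <execution>
#include <vector>
#include <cstdio>

int main() {
    const std::size_t n = 1'000'000;
    std::vector<int> cell_of_particle(n);
    for (std::size_t i = 0; i < n; ++i) cell_of_particle[i] = i % 64;

    // Per-cell particle counters, updated concurrently.
    std::vector<std::atomic<int>> counts(64);
    for (auto& c : counts) c.store(0);

    std::for_each(std::execution::par,
                  cell_of_particle.begin(), cell_of_particle.end(),
                  [&](int cell) {
                      counts[cell].fetch_add(1, std::memory_order_relaxed);
                  });

    std::printf("cell 0 holds %d particles\n", counts[0].load());
    return 0;
}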
Paper
Algorithms
Artificial Intelligence/Machine Learning
Data Movement and Memory
Graph Algorithms
TP
DescriptionWeighted matching identifies a maximal subset of edges in a graph with no common vertices. As a prototypical graph problem, matching has numerous applications in multi-level graph algorithms and machine learning. However, challenges arise in developing efficient, parallel graph matching methods on contemporary GPGPU systems due to general graph processing complexities, such as irregular memory access patterns and load imbalances. Furthermore, increasingly massive graph sizes exceed available GPU memory while data dependencies and synchronization costs in multi-GPU massive-graph processing challenge sustainable scalability.
Considering these challenges, we present efficient approximation algorithms for locally-dominant matching and demonstrate performance improvements via batching and distribution across multiple NVIDIA A100/V100 GPUs of an NVIDIA DGX dense-GPU platform. Our pointer-based matching method exhibits 2-45x performance improvements compared to state-of-the-art single-GPU and multithreaded CPU matching implementations on real-world and synthetic graphs. We show competitive quality comparisons and an analysis of GPU data distribution for efficient, practical graph matching on GPUs.
Doctoral Showcase
Posters
TP
DescriptionThe rapid advancement in Artificial Neural Networks (ANNs) has paved the way for Spiking Neural Networks (SNNs), which offer significant advantages in energy efficiency and computational speed, especially on neuromorphic hardware. My research focuses on the development of Efficient, Robust, and Scalable Heterogeneous Recurrent Spiking Neural Networks (HRSNNs) for high-performance computing, addressing key challenges in traditional digital systems, such as high energy consumption due to ADC/DAC conversions and vulnerability to process variations, temperature, and aging.
HRSNNs leverage the diversity in neuronal dynamics and Spike-Timing-Dependent Plasticity (STDP) to improve memory capacity, learn complex patterns, and enhance network performance. By incorporating unsupervised learning models and biologically plausible pruning techniques, we maintain network stability and computational efficiency. A notable contribution of this work is the introduction of Lyapunov Noise Pruning (LNP), which leverages temporal overparameterization to achieve significant reductions in network complexity without compromising accuracy.
Our approach also explores DNN-SNN hybrid models, which combine the strengths of deep neural networks and spiking networks for tasks such as object detection, demonstrating competitive accuracy with lower power consumption. Additionally, we propose a Processing-in-Memory (PIM) hardware platform for on-chip acceleration, further enhancing the scalability of our models.
This research represents a step towards scalable, energy-efficient, and robust SNNs, enabling their deployment in real-time, on-device learning, and inference, crucial for future AI applications in resource-constrained environments.
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionThe High Performance Computing Center Stuttgart (HLRS) has been operating a Philosophy of Computational Sciences group since 2016. Its collaboration with HLRS and external simulation scientists has covered many topics ranging from ethics and epistemology of simulations, to sociological aspects of HPC, modeling for policy, and philosophy of science of simulations. This talk will give a peek into three topics from the past, present and future of the group's work, reflecting on the opportunities and challenges of a highly trans-disciplinary collaboration.
Birds of a Feather
TP
XO/EX
DescriptionIn this exciting time for storage in AI and HPC, new usage models capture the imagination of academics, labs, and vendors with GNNs, RAG, LLM checkpointing, IOPS, security, multitenancy, and DPUs. To spur the creativity of our community and frame emerging opportunities, we’ve gathered renowned experts from academia in usage models, vendor technologists for storage microcontrollers, object/file systems, network storage acceleration, and CSPs that engage in storage innovation to tease the palette with key new technologies and opportunities. The audience will gain new knowledge of emerging tech and new perspectives on topics that they may have recently heard of.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionThis paper explores SYCL as a versatile tool for high-performance computing (HPC), providing practical guidance tailored for educators and students. SYCL's portability across a wide range of hardware platforms positions it as a compelling alternative to CUDA, especially within modern supercomputers featuring diverse accelerators. By developing open-source tutorial modules, this work seeks to democratize HPC education, making these resources accessible in workshops and to underserved communities, including those in Latin America.
Building on the foundational work of UnoAPI \cite{unoapi}, our project explores SYCL's potential to enrich HPC education through three targeted modules: addressing a traditional graph problem, generating volumetric data on particle electron density, and visualizing data with the marching cubes algorithm \cite{lorensen1987marching, nvidia2024marchingcubes}. These modules showcase SYCL's versatility across varied computational tasks, empowering learners with the skills needed to excel in heterogeneous computing environments.
For access to the repository containing the example projects and more information, visit: https://github.com/SYCLTutorials/Intro2024.
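For readers new to SYCL, a minimal, self-contained kernel in the same spirit is sketched below. It is an illustrative vector addition, not one of the tutorial modules above, and the problem size is arbitrary.

// Minimal SYCL 2020 example: device selection, buffers, and a parallel_for.
#include <sycl/sycl.hpp>
#include <vector>
#include <cstdio>

int main() {
    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

    sycl::queue q;  // picks a default device (CPU, GPU, or other accelerator)
    {
        sycl::buffer<float> A(a.data(), sycl::range<1>(n));
        sycl::buffer<float> B(b.data(), sycl::range<1>(n));
        sycl::buffer<float> C(c.data(), sycl::range<1>(n));

        q.submit([&](sycl::handler& h) {
            sycl::accessor ra(A, h, sycl::read_only);
            sycl::accessor rb(B, h, sycl::read_only);
            sycl::accessor wc(C, h, sycl::write_only, sycl::no_init);
            h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                wc[i] = ra[i] + rb[i];
            });
        });
    }  // buffers go out of scope; results are copied back to the host vectors

    std::printf("c[0] = %.1f\n", c[0]);
    return 0;
}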
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe growing volume and complexity of scientific data pose significant challenges in data management, organization, and analysis. Our objective is to enhance the utilization of historical scientific datasets across various disciplines. To address this, we propose integrating large language models (LLMs) with databases to enable natural language queries, streamlined data retrieval, and analysis. Leveraging LangChain, our approach harnesses the capabilities of LLMs and complements them with data visualization and interpretation tools. Initial results using Llama 3.1 70B demonstrate an 88% success rate in searching and summarizing structured text and numerical data, showcasing the potential for LLM-powered tools to accelerate scientific discovery and innovation.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
Paper
Accelerators
Applications and Application Frameworks
Modeling and Simulation
Numerical Methods
Task Parallelism
TP
DescriptionGW approximation is a powerful approach to accurately describe the excited-state of semiconductors. However, GW incurs high computational cost $\mathcal{O}(N^{4})$ and large memory usage $\mathcal{O}(N^{3})$, limiting its applications to thousands of (2,742) atoms even on leadership supercomputers. Herein we present a massively parallel implementation of accurate and efficient cubic-scaling plane-wave GW calculations by using low-rank approximations and high-performance computing on leadership supercomputers. By using a series of low rank approximations, we can reduce the expensive GW calculations to the cubic-scaling computational cost $\mathcal{O}(N^{3})$ and quadratic memory usage $\mathcal{O}(N^{2})$. With the help of parallel and communication optimization, the plane-wave GW calculations gain an overall speedup of over 70x and efficiently scale up to 13,824 atoms within a few minutes using 449,280 cores on new Sunway supercomputer. This accomplishment paves the way for excited-state quantum mechanical material simulations at mesoscopic scale (10K atoms) and for the design of next-generation semiconductor devices.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionBig-tech companies pre-train SOTA LLMs on special-purpose, private HPC systems, while public research centres lack the resources to compete. We advocate a new take on training large models such as LLMs, called xFFL, which leverages federated learning as an enabling technique to exploit geographically distributed computing power and bridge this digital divide. This work introduces a proof-of-concept federated training of LLaMA-3 8B on three EuroHPC Top500 facilities, proving the viability of leveraging cross-facility, publicly available computational power to sustain SOTA LLM workloads.
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionHigh-fidelity physics simulation codes, such as Flash-X, generate large amounts of simulation data. Much of the data written to files is sparse and can be compressed without significantly impacting the accuracy of the simulation or the quality of the visualizations. Reduced file sizes can significantly save storage space and bandwidth, and offer improved performance of visualization tools. Reduction in file size also allows the simulation to output more data for higher resolution/fidelity analysis. We introduced SZ3 and ZFP compression technologies into Flash-X as an effective data reduction strategy. We conducted experiments on the Frontier exascale supercomputer, evaluating both lossless and lossy compression techniques and quantifying their effects. We examined the impact of accuracy variations and chunk size variations for different Flash-X problems. Our study provides valuable insights and guidelines for simulation developers, helping them understand the best ways to adopt compression tailored to their specific problems.
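To make the idea concrete, the sketch below shows generic use of ZFP's fixed-accuracy (error-bounded lossy) mode on a 3D double field. It is a standalone illustration, not the Flash-X integration; the grid size and tolerance are placeholders and error handling is omitted.

// Compress a 3D double array with ZFP under an absolute error tolerance.
#include <zfp.h>
#include <vector>
#include <cstdio>

int main() {
    const size_t nx = 64, ny = 64, nz = 64;
    std::vector<double> field(nx * ny * nz, 0.0);  // mostly-sparse data

    zfp_field* f = zfp_field_3d(field.data(), zfp_type_double, nx, ny, nz);
    zfp_stream* zfp = zfp_stream_open(nullptr);
    zfp_stream_set_accuracy(zfp, 1e-6);            // absolute error tolerance

    size_t max_size = zfp_stream_maximum_size(zfp, f);
    std::vector<unsigned char> buffer(max_size);
    bitstream* bs = stream_open(buffer.data(), buffer.size());
    zfp_stream_set_bit_stream(zfp, bs);
    zfp_stream_rewind(zfp);

    size_t compressed = zfp_compress(zfp, f);      // bytes written (0 on failure)
    std::printf("compressed %zu bytes down to %zu bytes\n",
                field.size() * sizeof(double), compressed);

    zfp_field_free(f);
    zfp_stream_close(zfp);
    stream_close(bs);
    return 0;
}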
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThe convergence of edge computing, big data analytics, and AI with traditional scientific calculations is increasingly being adopted in HPC workflows. Workflow management systems are crucial for managing and orchestrating these complex computational tasks. However, it is difficult to identify patterns within the growing population of HPC workflows. Serverless has emerged as a novel computing paradigm, offering dynamic resource allocation, quick response time, fine-grained resource management and auto-scaling.
In this paper, we propose a framework to enable HPC scientific workflows on serverless. Our approach integrates a widely used traditional HPC workflow generator with an HPC serverless workflow management system to create benchmark suites of scientific workflows with diverse characteristics. These workflows can be executed on different serverless platforms. We comprehensively compare executing workflows on traditional local containers and serverless computing platforms. Our results show that serverless can reduce CPU and memory usage by 78.11% and 73.92%, respectively, without compromising performance.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionGNU Parallel is a versatile and powerful tool for process parallelization widely used in scientific computing. This paper demonstrates its effective application in high-performance computing (HPC) environments, particularly focusing on its scalability and efficiency in executing large-scale high-throughput high-performance computing (HT-HPC) workflows. Through real-world examples, we highlight GNU Parallel's performance across various HPC workloads, including GPU computing, container-based workloads, and node-local NVMe storage. Our results on two leading supercomputers, OLCF's Frontier and NERSC's Perlmutter, showcase GNU Parallel's rapid process dispatching ability and its capacity to maintain low overhead even at extreme scales. We explore GNU Parallel's application in massive parallel file transfers using a scheduled Data Transfer Node (DTN) cluster, emphasizing its broad utility in diverse scientific workflows. GNU Parallel can be employed in conjunction with other workflow systems as a "last-mile" parallelizing driver and as a quick prototyping tool to design and extract parallel profiles from application executions.
Exhibits
Flash Session
TP
XO/EX
DescriptionLearn about a new multi-tenant high performance computing (HPC) capability that is single-handedly solving the low-system-utilization problem plaguing the legacy methods organizations use today. SealingTech Technical Director Spencer Shimko demonstrates how multi-tenant HPC is helping the Department of Defense and industries such as healthcare, finance, and engineering increase utilization in environments subject to stringent regulatory requirements such as PHI, HIPAA, PCI DSS, SOX compliance, and more. See how it helps customers process data that would present a high risk to the nation or their own organization if compromised, by enabling concurrent, secure sharing of HPC resources across tenants without taking systems offline. In this session, you will get a behind-the-scenes look at how SealingTech's proprietary solution transfers data throughout your HPC environment and keeps resources dynamically allocated and dedicated by project, saving your business valuable time and resources.
Workshop
Applications and Application Frameworks
W
DescriptionScientific instruments are increasingly producing large amounts of data. However, instrument time at large- scale user facilities is a highly constrained resource. Research teams need to analyze experimental data in real-time to inform future experimentation, which creates challenges, especially when coupled with High Performance Computing (HPC) systems. Our work explores the use of live collaborative data analysis on HPC systems using the new Jupyter Real-Time Collaboration features to address these issues. We discuss enhancements to the Jupyter platform that were co-developed by our team to support collaboration at HPC centers like NERSC. Our work pays special attention to security and auditing requirements around user traceability. We discuss how users at the National Center for Electron Microscopy collaborated on a data collection and analysis run using this approach. Our work demonstrates real-time interactive, collaborative analysis, a critical part of emerging scientific workflows.
Exhibitor Forum
Facilities
Sustainability
TP
XO/EX
DescriptionLarge data centers around the world consume significant amounts of natural resources, with water receiving particular attention recently due to environmental concerns over consumption rates. It is no secret that data centers consume incredible amounts of water, primarily in evaporative cooling towers that reject heat from servers. However, on average, dissipating a watt of heat by evaporating water requires about 20x less electricity than dissipating that same heat with an axial fan and a dry cooler. Is the war on water consumption justified? What is the alternative to evaporating water in cooling towers? In this session, we will dive deeper into the physics of water evaporation, why water is evaporated in such high quantities today, and why it is the single most environmentally friendly element of modern data center operation.
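For a rough sense of the quantities involved (an illustrative back-of-envelope estimate, not a figure from the session; the latent heat of vaporization near ambient temperature is approximated as 2.4 MJ/kg), the evaporation rate needed to reject a heat load P follows from

P = \dot{m}\, h_{fg} \quad\Rightarrow\quad \dot{m} = \frac{P}{h_{fg}} \approx \frac{1\ \text{MW}}{2.4\ \text{MJ/kg}} \approx 0.42\ \text{kg/s} \approx 1.5\ \text{m}^3/\text{h},

so a 1 MW heat load evaporates on the order of 1.5 cubic meters of water per hour, while the electricity required is mainly that of pumps and low-speed fans rather than the large axial fans of a dry cooler.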
Birds of a Feather
TP
XO/EX
DescriptionThe Department of Defense (DoD) has invested significant time and funding to support a large base of users on a variety of HPC-backed projects. This BoF will use lightning talks about current research, technology acquisition plans, and software development needs and interests to illustrate DoD goals and opportunities for engagement. These lightning talks are intended to help external organizations and researchers connect with DoD users and sites to encourage partnerships and help solve problems. External engagement will help DoD users and HPC sites grow expertise and connect to the larger HPC community.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe visualization pipeline was developed to process the IFF data using ParaView. A blend of surfacic approximation and volumetric multi-scattering was used to generate the two volumes of the brain and the tumor. The particles passing through the voxels of interest were represented as spheres using point Gaussians, oriented by the velocity vector and colored by the path-line density value.
The visualization required the use of MPI and HPC in order to develop the volume rendering of the image stacks. The framework was established by using EGL ParaView in a server-client setup, using a Dell PowerEdge R7525 (2U) with a Dual AMD 7502 (32C/64T) CPU, 512GB RAM (16x32GB), one NVIDIA Ampere A40 GPU (48 GB VRAM), and about 3.2TB of local storage.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionManual labeling for ML tasks is labour intensive, resulting in scarce and limited datasets. Despite the promising evidence of data augmentation, little work investigates the impact and limitations of combinatorial data augmentation. Our work addresses this gap by examining how standard and combinatorial data augmentation of small datasets affects models' label classification performance. We generate datasets augmented with one, two, and four simultaneous augmentation techniques and compare their impact on the performance of three image classification models (DenseNet169, MobileNetV2, ResNet101V2). Our experiments show a non-monotonic relationship between augmentation quantity and classification performance. Our findings suggest the optimal augmentation quantity depends on the domain and use case. We also find that the application order of augmentation techniques impacts model performance, by up to 2.6% in our use case. Our work highlights the limits of data augmentation on small datasets.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionThe rapid advancement in high-performance computing (HPC) poses significant challenges for the HPC community. Current HPC training approaches are often too generic or too customized to local environments, limiting their applicability and impact. Often, these shortcomings are due to the limited accessibility, excessive cost, and specialized support necessary to provide HPC environments for teaching. To address these challenges, we introduce HPC Virtual Cluster, a hardware-agnostic platform designed to provide an easy-to-configure, generalizable, and scalable approach to HPC system management for training and education in computational research alongside production system configurations. We implemented this platform in a vertically integrated project (VIP) course aimed at training undergraduates in HPC cluster building. Drawing from our experience with the VIP course, we advocate for the integration of more comprehensive educational and training approaches, such as HPC Virtual Cluster, to better support HPC.
Doctoral Showcase
Posters
TP
DescriptionThe existing parallel I/O stack is complex and difficult to tune due to the interdependencies among multiple factors that impact the performance of data movement between storage and compute systems. When performance is slower than expected, end-users, developers, and system administrators rely on I/O profiling and tracing information to pinpoint the root causes of inefficiencies. Despite the numerous tools that collect I/O metrics on production systems, it is not obvious (unless one is an I/O expert) where the I/O bottlenecks are, what their root causes are, and what can be done to solve them. Hence, there is a gap between the currently available metrics, the issues they represent, and the application of optimizations that would mitigate performance slowdowns. Streamlining such analysis, investigation, and recommendations could close this gap without requiring a specialist to intervene in every case.
This dissertation explores how this translation gap can be closed by introducing two innovative frameworks that leverage both offline and online analysis and tuning methodologies. The offline framework, named Drishti I/O, provides interactive visualizations that detail an application's I/O behavior. It pinpoints the root causes of I/O bottlenecks and offers actionable recommendations to enhance performance. The runtime framework extends the capabilities of the Recorder I/O tracing tool by incorporating a dynamic I/O prediction and optimization system. This system leverages context-free grammar to optimize I/O behavior in real time during application execution. Together, these frameworks offer a comprehensive approach to improving I/O performance through detailed analysis and real-time optimizations.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn recent years, quantum computing has demonstrated the potential to revolutionize specific algorithms and applications by solving problems exponentially faster than classical computers. However, its widespread adoption for general computing remains a future prospect. In this work we demonstrate the integration of quantum computing within high-performance computing (HPC) environments. We developed a resource management framework that streamlines the use of quantum simulators and enhances HPC/QC hybrid application runtime performance and workflow efficiency.
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionLossy compression is one of the most effective methods for reducing the size of scientific data containing multiple data fields. It reduces information density through prediction or transformation techniques to compress the data. Previous approaches use local information from a single target field when predicting target data points, limiting their potential to achieve higher compression ratios. In this paper, we identified significant cross-field correlations within scientific datasets. We propose a novel hybrid prediction model that utilizes CNN to extract cross-field information and combine it with existing local field information. Our solution enhances the prediction accuracy of lossy compressors, leading to improved compression ratios without compromising data quality. We evaluate our solution on three scientific datasets, demonstrating its ability to improve compression ratios by up to 25% under specific error bounds. Additionally, our solution preserves more data details and reduces artifacts compared to baseline approaches.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionReproducibility is related to achieving consistent performance across multiple runs of the same application in an identical computing environment. From a computational perspective, jobs should run for equal time, with equal performance, repeatedly. Without this, we see significant variations in performance, which can undermine the reliability of scientific results. However, the complexity and scale of these workflows present unique challenges, especially when it comes to achieving consistent performance across repeated runs. We seek to provide researchers with Findable, Accessible, Interoperable and Reusable (FAIR) data. Ensuring the "FAIRness" of (meta)data can reduce barriers to reproducibility by making this information easier to find and interpret, programmatically access, and reuse in new contexts. Therefore, we are exploring the process of analyzing performance data and seek to integrate our findings into the RECUP framework for reproducibility, showing data sources, a repository for saving intermediate results, and user analysis of performance and result reproducibility.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionInfluence Maximization (IM) is vital in viral marketing and biological network analysis for identifying key influencers. Given its NP-hard nature, approximate solutions are employed. This paper addresses scalability challenges in a scale-out shared memory system by focusing on the state-of-the-art Influence Maximization via Martingales (IMM) benchmark. To enhance the work efficiency of the current IMM implementation, we propose EFFICIENTIMM with key strategies including a new parallelization scheme, NUMA-aware memory usage, dynamic load balancing, and fine-grained adaptive data structures. Benchmarking on a 128-core CPU system with 8 NUMA nodes, EFFICIENTIMM demonstrated significant performance improvements, achieving an average 5.9x speedup over Ripples across 8 diverse SNAP datasets when compared to the best execution times of the original Ripples framework. Also, on the Youtube graph, EFFICIENTIMM is shown to have a better memory access pattern, with a 357.4x reduction in L1+L2 cache misses compared to Ripples.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionThe partitioned global address space (PGAS) model offers one-sided communication operations to efficiently access local and remote data through a distributed shared memory model using point-to-point network operations. An extension to the OpenSHMEM PGAS library previously demonstrated how message aggregation could be applied in a minimally intrusive manner to an application, while still achieving a significant portion of the performance possible through manual tuning. However, its primary deficiency was the inability to abstract dependencies between aggregated remote memory accesses and their subsequent uses, which must be managed explicitly by applications. This undermined its goal of preserving algorithmic intent. In this paper, we present a novel directive-based approach for automatically deferring the execution of arbitrary code that depends on aggregated messages, shifting the concern of their efficient management from the application to the implementation. We demonstrate our approach using two applications from the bale 3.0 classic suite on the Frontier supercomputer.
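For context, a plain (unaggregated) OpenSHMEM one-sided access looks like the sketch below. It only illustrates the kind of fine-grained remote operation that the directive-based aggregation described above batches together; it is not the paper's approach, and the communication pattern is arbitrary.

// Each PE writes its rank into the next PE's symmetric variable with a
// one-sided put; symmetric-heap allocation makes the remote address valid
// on every PE.
#include <shmem.h>
#include <cstdio>

int main() {
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    long* dest = (long*) shmem_malloc(sizeof(long));  // symmetric allocation
    *dest = -1;
    shmem_barrier_all();

    long value = me;
    shmem_long_put(dest, &value, 1, (me + 1) % npes); // one-sided remote write

    shmem_barrier_all();
    std::printf("PE %d received %ld\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}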
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis work introduces enhanced benchmark suites — NeoRodinia, RedOptBench and CUDAMicroBench — specifically designed to enrich the educational landscape of parallel programming. By integrating practical examples and detailed optimization processes into traditional benchmarks, these suites could illuminate performance-limiting issues, identify inefficient patterns, and clarify the steps involved in optimization. They aim to demystify the complexities of parallel programming for beginners, fostering a deep and practical understanding of the subject. Serving dual roles as performance evaluators and comprehensive educational tools, these suites effectively demonstrate tangible performance improvements and optimization techniques, thereby enhancing both theoretical knowledge and practical skills in parallel programming.
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
DescriptionScientific productivity can be enhanced through workflow management tools, relieving large High Performance Computing (HPC) system users from the tedious tasks of scheduling and designing the complex computational execution of scientific applications. This paper presents a study on the usage of ensemble workflow tools to accelerate science using the Summit and Frontier supercomputing systems. The research aims to connect science domain simulations using Oak Ridge Leadership Computing Facility (OLCF) platforms with ensemble workflow methods to boost scientific impact. We present the coupling, porting, and optimization of Radical-Cybertools on simulations. The tools augment traditional HPC monolithic runs with a pilot scheduler. We discuss intrinsic limitations of coupling and porting ensemble workflow tools to applications that run on large HPC systems. The origins of technical challenges and their solutions developed during the implementation process are discussed. Data management strategies, OLCF's policies for ensembles, and natively supported workflow tools are also summarized.
Workshop
Architecture
Network
Performance Optimization
Quantum Computing
System Administration
W
DescriptionThroughput is an important performance metric of entangled-qubit distribution quantum networks, characterized by the number of entangled pairs distributed per second, in entanglement bits per second (ebps). It is measured over physical connections using specialized instruments, including photonic entanglement sources and single-photon detectors. Extensive theory has been developed to estimate the entangled-qubit capacity of quantum channels using abstractions of physical connections. These two quantities both characterize aspects of performance, but in different ways, and typically have been hard to relate to each other. We describe measurements on a physical testbed with fiber connections of lengths 0-75 kilometers. We obtain the normalized analytic capacity estimates using the transmissivity approximations derived from single-photon coincidence measurements, and convert them to bounds on throughput (measured in ebps) using a multiplier derived from co-located detector measurements. The results indicate consistent throughput measurements upper-bounded by their analytical capacity estimates across all connections.
Paper
Accelerators
Algorithms
Data Movement and Memory
Graph Algorithms
TP
DescriptionMaximal biclique enumeration (MBE) is crucial in bipartite graph analysis. Recent studies rely on extensive set intersections on static bipartite graphs to solve the MBE problem. However, the computational subgraphs dynamically change during enumeration, leading to redundant memory accesses and degraded set intersection performance. To overcome this limitation, we propose an AdaMBE algorithm. First, we redesign its core operations using local neighborhood information derived from computational subgraphs to minimize redundant memory accesses. Second, we dynamically create computational subgraphs using bitmaps leveraging its fast bitwise operations to accelerate set intersections. Finally, we integrate them in AdaMBE. Our experimental results show that AdaMBE is 1.6×−49.7× faster than its closest CPU-based competitor and successfully enumerates all 19 billion maximal bicliques on the TVTropes dataset, a large task beyond the capabilities of existing algorithms. Notably, on certain datasets, our parallel version, ParAdaMBE, on CPUs even outperforms GMBE on GPUs by up to 5.07×.
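As a hedged illustration of the bitmap-based set intersection mentioned above (a generic building block of such approaches, not the AdaMBE implementation; the toy neighborhoods are arbitrary):

// Count common neighbors of two vertices whose adjacency is stored as
// bitmaps: one AND plus one popcount covers 64 candidate vertices at a time.
#include <algorithm>
#include <bit>
#include <cstdint>
#include <vector>
#include <cstdio>

std::size_t intersect_count(const std::vector<std::uint64_t>& a,
                            const std::vector<std::uint64_t>& b) {
    std::size_t count = 0;
    const std::size_t words = std::min(a.size(), b.size());
    for (std::size_t w = 0; w < words; ++w)
        count += std::popcount(a[w] & b[w]);
    return count;
}

int main() {
    // Two toy neighborhoods over a 128-vertex side of a bipartite graph.
    std::vector<std::uint64_t> u = {0x00000000000000FFULL, 0x1ULL};
    std::vector<std::uint64_t> v = {0x000000000000000FULL, 0x1ULL};
    std::printf("common neighbors: %zu\n", intersect_count(u, v));  // prints 5
    return 0;
}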
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Paper
Algorithms
Data Movement and Memory
I/O, Storage, Archive
Performance Optimization
Scientific and Information Visualization
Visualization
TP
DescriptionThe unprecedented amount of scientific data has introduced heavy pressure on data storage and transmission systems. Progressive compression has been proposed to mitigate this problem, offering data access with on-demand precision. However, existing approaches only consider precision control on raw data, leaving uncertainties on the quantities of interest (QoIs). We present a progressive data retrieval framework with guaranteed error control on derivable QoIs. Our contributions are three-fold. (1) We carefully derive the theories to control QoI errors during progressive retrieval. (2) We develop a general progressive retrieval framework based on the proposed theories and optimize it by exploring feasible progressive representations. (3) We evaluate our framework using five real-world datasets with multiple QoIs. Experiments demonstrate that our framework can respect the QoI error bounds in the evaluated applications. This leads to over 2.02x performance gain in data transfer tasks compared to transferring the raw data while guaranteeing the QoI error.
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionMachine learning (ML) has increasingly been adopted in safety-critical systems such as autonomous vehicles (AVs) and industrial robotics. In these domains, reliability and safety are important considerations, hence it is critical to ensure the resilience of ML systems to faults and errors. This also applies to ML systems deployed in the HPC context. On the other hand, soft errors are becoming more frequent in commodity computer systems due to the effects of technology scaling and reduced supply voltages. Further, traditional solutions for masking hardware faults such as triple-modular redundancy (TMR) are prohibitively expensive in terms of their energy and performance overheads. Therefore, there is a compelling need to provide low-cost error resilience to ML applications on commodity HPC platforms. I will present three directions we have explored in my research group towards this goal.
First, we experimentally assessed the resilience of ML applications to soft errors via fault injection. We found that even a single bit flip due to a soft error can lead to misclassification in deep neural network (DNN) applications. Such misclassifications can result in safety violations. However, not all errors result in safety violations, and so it is sufficient to protect the DNN from the ones that do. Unfortunately, finding all possible errors that result in safety violations is a very compute-intensive task.
Second, we proposed BinFI, a fault injection approach that efficiently injects critical faults that are highly likely to result in safety violations, by leveraging the DNN’s properties.
Finally, we proposed Ranger, an approach to protect DNNs from critical faults without causing any loss in their accuracies, and with minimal performance overheads. I will conclude by presenting some of our ongoing work as well as the future challenges in this area. This is joint work with my students and colleagues at the University of British Columbia, as well as industry collaborators.
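To illustrate the single-bit-flip effect described above, a minimal sketch (not the authors' fault injector; the value and bit position are arbitrary) that flips one bit of an IEEE-754 float:

// Flipping a single exponent bit can turn a benign activation into an
// enormous value, which is how one soft error can change a classification.
#include <cstdint>
#include <cstring>
#include <cstdio>

float flip_bit(float value, int bit) {
    std::uint32_t raw;
    std::memcpy(&raw, &value, sizeof(raw));   // view the float's bits
    raw ^= (1u << bit);                       // flip the chosen bit
    std::memcpy(&value, &raw, sizeof(raw));
    return value;
}

int main() {
    float activation = 0.5f;
    // Flipping exponent bit 30 turns 0.5 into roughly 1.7e38.
    std::printf("original: %g  corrupted: %g\n",
                activation, flip_bit(activation, 30));
    return 0;
}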
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionGraphics Processing Units (GPUs) offer significant potential for accelerating various computational tasks, including Breadth-First Search (BFS). Numerous efforts have been made to deploy BFS on GPUs effectively. To address the dynamic nature of BFS, XBFS, the state-of-the-art work, employs an adaptive strategy that leverages different optimized frontier queue generation designs, accommodating the varying characteristics of levels in BFS. While XBFS demonstrates excellent performance on NVIDIA Quadro P6000 GPUs, it faces challenges when deployed on AMD GPUs. In this work, we present our efforts to implement XBFS's adaptive approach on Frontier, the most powerful supercomputer system, by porting XBFS to AMD MI250X GPUs. Through targeted optimizations tailored to the unique features of AMD GPUs, our implementation achieves an average performance of 43 GTEPS per GCD. Based on these results, we observe potential for surpassing the performance of the official Frontier results from the Graph500 benchmark released in June 2024.
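For reference, the baseline level-synchronous BFS pattern with explicit frontier queues, whose per-level frontier generation the adaptive approach above specializes, can be sketched serially as follows (toy CSR graph, illustration only, no relation to the XBFS code):

// Level-synchronous BFS: expand the current frontier, collect unvisited
// neighbors into the next frontier, and repeat until the frontier is empty.
#include <vector>
#include <cstdio>

int main() {
    // Small CSR graph: offsets and adjacency of a 6-vertex example.
    std::vector<int> offsets = {0, 2, 4, 6, 7, 8, 8};
    std::vector<int> edges   = {1, 2, 0, 3, 0, 4, 5, 5};
    const int n = 6;

    std::vector<int> level(n, -1);
    std::vector<int> frontier = {0};
    level[0] = 0;

    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;               // next-level frontier queue
        for (int u : frontier)
            for (int e = offsets[u]; e < offsets[u + 1]; ++e) {
                int v = edges[e];
                if (level[v] == -1) { level[v] = depth; next.push_back(v); }
            }
        frontier.swap(next);
    }

    for (int v = 0; v < n; ++v)
        std::printf("vertex %d at level %d\n", v, level[v]);
    return 0;
}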
Workshop
State of the Practice
System Administration
W
DescriptionDemocratizing access to the research computing ecosystem is critical for accelerating research progress. However, the gap between a high-level workload, such as Python in a Jupyter notebook, and the resources exposed by HPC systems is significant. Users must securely authenticate, manage network connections, deploy and manage software, provision and configure nodes, and manage workload execution. Globus Compute reduces these barriers by providing a managed, fire-and-forget model that enables execution of Python functions across any resource to which a user has access. In this paper we describe enhancements to Globus Compute that further reduce barriers to use of the research computing ecosystem: an asynchronous, future-based executor interface for submitting and monitoring tasks, shell and MPI-based function types, and a multi-user endpoint that can be deployed by administrators and used by authorized users.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionHPC applications require massive amounts of memory to process large datasets. While data compression is used to avoid bottlenecks in transmission and storage, the data must still be decompressed into memory before it can be used. Inline compressed arrays (ICA) are a method that keeps data compressed in application memory, decompressing blocks of data as needed. The goal is to reduce the memory footprint of big-data applications, allowing them to run on more abundant HPC nodes with less DRAM. This research uses matrix multiplication as a lens for analyzing the effects of various ICA parameters on runtime and memory usage. We construct a model for the minimum number of compressor calls needed to complete the computation and show how careful tuning of ICA parameters achieves this minimum. Finally, we briefly discuss how our lessons learned apply to other computational kernels.
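As a hedged illustration of the access-counting idea (a toy stand-in, not the poster's ICA implementation or its model; here each block fetch is simply tallied as one hypothetical "decompress" call):

// Blocked matrix multiplication where every block access is counted; loop
// ordering that reuses a fetched block across the innermost loops keeps the
// fetch count low, the kind of tuning the work above analyzes.
#include <vector>
#include <cstdio>

int main() {
    const int n = 256, bs = 64;             // matrix size and block size
    const int nb = n / bs;
    std::vector<double> A(n * n, 1.0), B(n * n, 1.0), C(n * n, 0.0);
    long decompress_calls = 0;              // stand-in for compressor calls

    for (int bi = 0; bi < nb; ++bi)
        for (int bj = 0; bj < nb; ++bj)
            for (int bk = 0; bk < nb; ++bk) {
                ++decompress_calls;         // block A(bi,bk) fetched
                ++decompress_calls;         // block B(bk,bj) fetched
                for (int i = bi * bs; i < (bi + 1) * bs; ++i)
                    for (int j = bj * bs; j < (bj + 1) * bs; ++j) {
                        double s = C[i * n + j];
                        for (int k = bk * bs; k < (bk + 1) * bs; ++k)
                            s += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = s;
                    }
            }

    std::printf("block fetches: %ld, C[0] = %g\n", decompress_calls, C[0]);
    return 0;
}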
Birds of a Feather
TP
XO/EX
DescriptionThe implications for HPC technology on society and the environment prompt us as a community to discuss and understand the impacts we are having directly and indirectly. This BoF is highly interactive and aims to be an exchange for the community to discuss and relate ethical behavior and societal norms to the design of HPC solutions and autonomous/intelligent systems, for example, so that they do not intentionally perpetuate global inequality. By furthering this dialogue, we can work to ensure that the HPC community is advancing its commitment to technology for the benefit of all of humanity.
Birds of a Feather
TP
XO/EX
DescriptionHave you ever wished that all the scientific software you use was available on all the resources you have access to without having to go through the pain of installing it the way you want/need? In this session, we introduce to the HPC community the European Environment for Scientific Software Installations (EESSI — pronounced “easy”) as a common stack of scientific software installations for HPC systems and beyond all around the globe. The session is set to be interactive, promoting an open environment to discuss how EESSI could help alleviate the burden of getting scientific software installed.
Birds of a Feather
TP
XO/EX
DescriptionIn recent years, the European HPC ecosystem has undergone profound changes. The objective of this BoF is to give an overview of the current state of European HPC activities, with a particular focus on training and skills activities. We will present and discuss with international HPC stakeholders the current state of play, future plans, and challenges and critically analyze the European HPC skills and training offerings to HPC practitioners in academia and industry. With the ever-evolving technical landscape around HPC and AI, the needs for skill development are more important than ever.
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionThe STX and VEC accelerators being developed by the European Processor Initiative (EPI) are targeting the HPC market with different approaches. The STX is co-designed for algorithms that prevail in HPC simulation applications. It is tuned using features from previous architectures such as BlueGene, IBM Cell, and GPUs, and leverages current open-source developments in hardware and software. To make the computational units easy to use, it implements novel multi-hierarchy offloading features added to OpenMP and automatic DMA instruction injection via LLVM optimization passes. STX is not only a chip but a full-system approach. A company called UNEEC Systems was founded for sales and the development of the next generations of STX systems.
VEC is a highly efficient vector processor positioned as a general purpose HPC accelerator. Using RISC-V vector extensions and cache coherence, even legacy codes are easily ported to this accelerator through its self-booting capability. Additionally, VEC is power- and area-efficient through its use of single-ported register file design and simple ring-based inter-lane connect while featuring the longest vector length (up to 2048 Double Precision FP elements) of any Vector Processor.
This talk will give a brief overview over the concepts behind the two accelerator developments and presents a roadmap.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionWith the increasing use of graphics processing units (GPUs) from various vendors in oil and gas companies, achieving code portability has become essential. This capability allows for evaluating performance across diverse GPU vendors, facilitating well-informed decisions, and promoting competition and innovation in GPU technologies. We have made significant contributions to addressing the challenge of achieving performance portability for Fletcher, an anisotropic wave propagation modeling application, across GPUs from NVIDIA and AMD. Our first contribution involves the implementation of Fletcher on GPUs using ten variations of portable programming models, including HIP, CUDA, Kokkos, and RAJA. Our second contribution is a comprehensive evaluation of these implementations across five generations of GPUs from NVIDIA and AMD, assessing performance and performance portability. A series of extensive experiments demonstrated that HIP outperforms the assessed programming models regarding performance portability across all evaluated GPUs. Specifically, it performs 7.9% better than Kokkos and 8.8% better than RAJA.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionTuning parallel applications on multi-core architectures is an arduous task. Several studies have utilized auto-tuning for OpenMP applications via standardized user-facing features, namely number of threads, thread placement and binding policy. In this paper, we analyze OpenMP application runtime through an exhaustive exploration of all relevant configuration options of the LLVM/OpenMP runtime.
Our findings allow us to identify trends in tuning potential and to make architecture-aware tuning suggestions. We will open-source the 240,000 unique samples collected during the experiments. These runs were conducted on three different CPU architectures vital to the HPC and datacenter community. The applications include popular benchmark suites and microbenchmarks, namely NPB, the Barcelona OpenMP Task Suite, XSBench, RSBench, SU3Bench, and LULESH.
We employ machine learning algorithms to analyze, explain, and form qualitative relations between features comprising the architecture, application, input size, threads, and environment variables. These relations are further used to recommend configurations for a given application type and architecture.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAdaptive optimizers, which adjust the learning rate for individual parameters, have become the standard for training deep neural networks. AdamW is a popular adaptive method that maintains two optimizer state values (momentum and variance) per parameter, doubling the model’s memory usage during training. Many proposed memory efficient optimizers claim to match AdamW’s performance but lack its desirable qualities such as robustness to learning rate changes. This quality is especially desirable when pre-training LLMs, where experimenting with different hyperparameters is infeasible. We propose Eve, a Memory Efficient AdaptiVe Moment Estimation algorithm that saves memory by reducing the variance term while also preserving AdamW’s desirable properties across different training settings. We fine-tune Llama 2 70B on 64 GPUs and show memory savings of 20% compared to AdamW. We also compare our method to a recent well-received memory efficient optimizer called Adam-mini and demonstrate better training stability across various learning rates.
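For context, a bare-bones NumPy sketch of the standard AdamW update shows why it doubles training memory: two full-sized state arrays (momentum and variance) accompany every parameter. This is the baseline cost that Eve reduces; Eve itself is not reproduced here.

```python
import numpy as np

def adamw_step(p, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update. m and v are the per-parameter optimizer states
    that together match the parameter memory footprint twice over."""
    m = b1 * m + (1 - b1) * grad                 # first moment (momentum)
    v = b2 * v + (1 - b2) * grad**2              # second moment (variance)
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)  # decoupled weight decay
    return p, m, v

p = np.zeros(4); m = np.zeros(4); v = np.zeros(4)
p, m, v = adamw_step(p, np.ones(4), m, v, t=1)
print(p)
```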
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionCRISPR-Cas9 has become the gold-standard technology for gene editing due to its simplicity and low cost. By carefully designing the “guide RNA” (gRNA) that directs it to the desired target site, nearly any DNA site can be edited. However, selecting good quality gRNA is a complex task and can only be achieved with high-performance computers, which are valuable resources that not all are equipped with. Here, we describe our use of event-driven cloud computing technologies made available by Amazon Web Services to overcome challenges faced by traditional standalone software for designing gRNA. Using a cloud-native software template, any researcher can deploy our software, named Crackling-Cloud, to their own cloud environment. This enables them to bring the software to their data, rather than sending their data to the software. Crackling-Cloud is available for free on GitHub under the terms of the BSD 3-clause licence: https://github.com/bmds-lab/Crackling-AWS
Exhibits
SCinet
TP
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionOur research combines an Evolutionary Algorithm with a Quantum Approximate Optimization Algorithm (QAOA) to update the ansatz parameters, in place of traditional gradient-based methods, and benchmarks on the Max-Cut problem. We demonstrate that our Evolutionary-QAOA pairing performs on par with or better than a COBYLA-based QAOA in terms of solution accuracy and variance for d=3 regular graphs with between 4 and 26 nodes, using Conditional Value at Risk for fitness function evaluations. Furthermore, we take our algorithm one step further and present a novel multi-population approach distributed across two QPUs, which evolves independent, isolated populations in parallel while classically communicating elite individuals. Experiments were conducted on both simulators and quantum hardware, with investigations into the relative performance, accuracy, and variance.
Birds of a Feather
TP
XO/EX
DescriptionCombining insights from former SC Reproducibility Initiative chairs, thought leaders, and the SC24 community, this BoF aims to sharpen our focus on practical next steps to improve reproducibility in the HPC community. Addressing this field's unique challenges and opportunities, we’ll define practical reproducibility through interactive discussions, building on and contrasting it with computational reproducibility, and examine its real-world applications and constraints. We’ll explore the evolution of HPC reproducibility initiatives, comparing past and present approaches. We’ll also investigate the central role that specialized platforms play in enabling artifact sharing, evaluation, and usage.
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
DescriptionRow-scale Composable Disaggregated Infrastructure (CDI) is a heterogeneous high performance computing (HPC) architecture that relocates the GPUs to a single chassis from which CPU nodes can then request compute resources. This is a distinctly different architecture from rack-scale CDI, as the GPUs are accessed over a network rather than residing in the same PCIe domain as the CPUs. We compare the kernel and data transfer characteristics of two production applications against a slack proxy application, which allowed for the development of a mathematical model to predict the performance penalty generalized applications can face as a result of slack. Our proposed method found that the applications tested would, pessimistically, see less than a 1% performance penalty above the effects of crossing the network in an environment that induced 100 µs of slack, thus demonstrating that row-scale CDI is a viable technology from an application performance perspective.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionDespite the quantity of existing training materials, acquisition and development of the specialist skills required for high performance computing (HPC) is not straightforward enough to address the needs of the growing, diversifying and constantly evolving HPC community. The HPC education and training community is exploring different approaches that could facilitate the uptake and progression of technical skills; one of those new approaches is focused on defining and formalizing learning pathways. In this lightning talk we will briefly present an exercise designed as a starting point for capturing and outlining learning pathways for the HPC community. This exercise was run for the first time during the ISC’24 BoF “Developing a Sustainable Future for HPC and RSE Skills: Training Pathways and Structures,” and was accompanied by a Mentimeter survey to evaluate its effectiveness. The summary of the survey results is also included.
Paper
I/O, Storage, Archive
TP
DescriptionKVM is the dominant VM hypervisor on Linux and relies on QEMU to realize the backends of the virtio family of devices, such as virtio-blk. However, KVM/QEMU-based paravirtualization prolongs the guest I/O path with multiple context switches. As fast NVMe storage devices have become widely used, the software overhead becomes non-negligible. This paper presents EXO, an extension of virtio-blk for efficient KVM/QEMU-based storage paravirtualization. The insight is that no matter how complex the QEMU backend's processing is, to handle a guest I/O request the host storage stack only needs to know the request's guest-to-host address mapping. Therefore, we preserve the original slow I/O path of virtio-blk as a fallback and leverage eBPF to introduce an in-kernel fast path that directly queries the address mapping without switching to the user-space backend processing. Extensive evaluation shows that, compared to existing storage paravirtualization solutions, EXO achieves similar or even higher performance.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionThe OMFIT CAKE workflow for plasma state reconstruction has been automated on the DIII-D National Fusion Facility to run on a combination of computational resources at DIII-D and NERSC, utilizing the emerging DOE Integrated Research Infrastructure. The reconstruction of the plasma state is vital for understanding what occurred in the DIII-D machine during the pulse. This understanding allows informed decisions to be made on how to change the configuration for the next pulse. The initial reconstruction workflow was performed on DIII-D resources for a benchmark case in 62 minutes. The wall-clock time for the benchmark case was reduced to 11 minutes by running on the Perlmutter system at NERSC, which opens the possibility to influence decisions between DIII-D pulses during an experiment. The reconstruction results can be used as inputs for other modeling analyses; the determination of the classification of the microturbulence modes is given as an example.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionParallel and distributed computing (PDC) is increasingly recognized as important undergraduate computer science content that should be included earlier, in introductory-level courses, instead of being reserved for upper-level elective courses. Currently, most commonly used materials for CS0 do not include PDC topics. This experience report describes the experience of including two PDC activities in an existing CS0 course that otherwise used mainstream course materials. Building on prior work suggesting that unplugged activities may be one way to successfully engage students, one of the activities was an in-class group coin-search activity; the other was a more in-depth programming assignment. The results suggest that the unplugged activity engaged students and was a good introduction to the topic in a short time (less than one class period). The programming assignment was enthusiastically received by some students, but many students chose not to participate in it.
Workshop
State of the Practice
System Administration
W
DescriptionThe Triton Shared Computing Cluster (TSCC) [1] is the primary campus research computing system of the XX Supercomputing Center (“Center” in the remaining text). This paper describes the transition from TSCC 1.0 to TSCC 2.0, focusing on the implementation of new high-performance computing (HPC) infrastructure components and management strategies. We detail our approach to overcoming challenges posed by node heterogeneity, enhancing job scheduling efficiency, and improving resource allocation and billing fairness.
The legacy TSCC 1.0 is described first, focusing on some critical issues we want to solve under TSCC 2.0. The HPC tools under TSCC 2.0 are then described. Lastly, the best practices and experiences learned are discussed.
Exhibitor Forum
HPC Infrastructure
Performance Evaluation and/or Optimization Tools
TP
XO/EX
DescriptionThe growth of AI models and their deployment at scale have led to massive distributed environments that serve workflows spanning multiple nodes, storage clusters, and the networks that bind them together.
The challenge is clear: profiling and debugging the entire stack requires a new methodology. A holistic solution should provide visibility into every single component in the stack: switch, NIC, DPU, CPU, GPU, and storage.
In this talk we will show you the features of NVIDIA Nsight Systems to profile and analyze applications running on an NVIDIA DGX cluster attached to a VAST Data Platform. We shall peel the onion, going through every infrastructure component, understanding its role and effect on performance.
We will complete the picture with data-centric insights from the VAST Data Platform to provide a perspective never seen before on how applications use data and the impact on the storage stack, leading to optimizations and increased efficiency.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionFuture exascale systems will feature unprecedented computing power with over 10^18 FLOPS, provided by thousands of heterogeneous computing nodes. To fully harness such potential, applications must scale effectively and efficiently utilize these computing architectures.
Many performance losses on large HPC systems originate from inefficiencies in data movement.
On exascale systems, the involved data set size also increases, leading to challenges in (a) migrating data through memory hierarchies, (b) communicating data between distributed memory components, and (c) storing data in file systems.
This poster shows the ongoing effort in the DaREXA-F project and its goal to address these issues in the plasma turbulence code GENE, by measures such as (a) mixed precision and the usage of novel data formats, (b) data compression, and (c) utilizing new network hardware.
We present our current findings in the areas of mixed-precision and lossy data compression for efficient computation without a reduction in accuracy.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionTEZIP is a (de)compression framework leveraging PredNet, a deep neural network designed for video prediction tasks, to exploit temporal locality in time-evolving data. This study evaluates video super-resolution (VSR) models, which enhance low-resolution images by reconstructing high-resolution ones, under various compression and size reduction techniques. Specifically, we evaluate the VRT and BasicVSR++ models across various compression techniques, including H.264 and H.265, applied to the Vimeo90K dataset. Our results, evaluated using common super-resolution image quality metrics, indicate that the VRT model consistently outperforms BasicVSR++, particularly with H.264 and H.265 compressions. We observe that larger file sizes and lower compression ratios correlate with higher PSNR and SSIM values, highlighting the trade-offs between compression techniques and quality metrics in generating high-resolution images. These findings emphasize the balance needed between compression efficiency and image quality in VSR applications.
Exhibitor Forum
Hardware Technologies
TP
XO/EX
DescriptionCompute express link (CXL) is gaining attention as a practical solution that can reduce total cost of ownership (TCO) of HPC and datacenters while providing reasonable performance. In this presentation, we will introduce how to utilize the recent version of CXL technology, CXL 3.1, to construct a composable HPC server and how the CXL 3.1-enabled composable server can accelerate practical HPC applications.
The composable server consists of CPU and memory nodes. Since each node features a CXL 3.1 switch that supports multi-level switching and port-based routing, it can connect multiple nodes with flexibility. Using this composable server, we will demonstrate the effectiveness of CXL 3.1 technology. To this end, we will show a demo running practical HPC applications (e.g., atomic-level simulation) on the CXL 3.1 composable server. Our demo shows that the CXL-based server outperforms the conventional system by about 1.8 times, through the utilization of memory sharing features.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWe present an experimental evaluation of a burst buffer for a real-time DAQ streaming system designed to transmit instrument data to remote data centers. The system is based on EJ-FAT, a load balancing system capable of Nx 100Gbps streams, distributing data from event sources to processing nodes. We explore applying the DAOS system as a burst buffer to serve a number of purposes: improve resiliency, elasticity and add new functions into the processing pipeline. In the evaluation a sender transmits events over a 100Gbps network to a receiver integrated with DAOS to store the reassembled events using DAOS APIs. We evaluate the system for possible bottlenecks and provide end-to-end evaluation with a burst buffer using DAOS storage abstractions. We show that a receiver node can support 38.1 Gbps. This proves the viability of our approach and allows us to extend this work to investigate scale-out properties and new streaming optimizations.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionDistributed Asynchronous Object Store (DAOS) is a novel software-defined object store leveraging Non-Volatile Memory (NVM) devices, designed for high performance. It provides a number of interfaces for applications to undertake I/O, ranging from a native object storage API to a DAOS FUSE module for seamless compatibility with existing applications using POSIX file system APIs.
In this paper we discuss these interfaces and the options they provide, exercise DAOS through them with various I/O benchmarks, and analyze the observed performance. We also briefly compare the performance with a distributed file system and another object storage system deployed on the same hardware, and showcase DAOS' potential and increased flexibility to support high-performance I/O.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionData scientists can be thought of as modern-day explorers, venturing into the vast unknown of information. However, this exciting journey is not without its hurdles. One of the biggest challenges they face is the sheer immensity of data they encounter. Modern datasets cannot fit in laptop memory, containing terabytes or even petabytes of information. Working with massive data requires specialized tools to extract meaningful insights. As datasets grow ever larger, data science demands interactivity, where scientists can learn while working with the data. Data science demands scalability, where scientists are able to work with datasets in their entirety. Data scientists have naturally been drawn to Python, as it provides interactivity through its read-eval-print loop and performance through its use of libraries written in other languages, like C and Fortran. These libraries typically are not designed for HPC and run into problems when attempting to scale.
Paper
Accelerators
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Performance Optimization
TP
DescriptionThe Fourier Neural Operator (FNO) has been proven to be a universal and effective deep learning framework capable of achieving remarkable accuracy on Partial Differential Equation (PDE) solution problems. However, certain key components of emerging FNO-based models cannot leverage the hardware's potential, which makes them difficult to apply in high-resolution, real-time scenarios. This paper presents a highly optimized model called the Speed Galerkin Transformer, including a multilevel parallel SliceK-SplitK-ReduceK strategy for batched skinny matrix multiplication, memory layout optimization for QKV matrices and positional encodings with multi-head layer normalization fusion, and batched transposition optimization with strided scattering and gathering in 2D FNO; these strategies achieve 10.29x, 4.41x, and 2.38x speedups, respectively, under specific configurations. When solving the Darcy Flow equation at 512x512 resolution, the Speed Galerkin Transformer achieves about a 1.72x speedup and more than 90% parallel efficiency on 8 GPUs.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn modern computing, a challenge is the data bottleneck between the CPU and RAM. This issue arises because the CPU can process data faster than it can be accessed from RAM; this is worsened by the fact that large amounts of RAM are less accessible than a powerful CPU. Furthermore, RAM’s high cost creates a need for a cost-effective solution. Processing in Memory (PIM) offers a potential remedy by reducing data movement, thus alleviating bottlenecks. To optimize the use of this new hardware, developers need to identify when to offload their programs to a PIM device. To address this need, we have developed a solution that enables developers to run Python programs through our pipeline, highlighting the memory-intensive parts of their code.
Paper
Accelerators
HPC Infrastructure
Performance Evaluation and/or Optimization Tools
State of the Practice
TP
DescriptionMulti-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologies, design options, and software layers. This paper comprehensively characterizes three supercomputers — Alps, Leonardo, and LUMI — each with a unique architecture and design. We focus on performance evaluation of intra-node and inter-node interconnects on up to 4096 GPUs, using a mix of intra-node and inter-node benchmarks. By analyzing its limitations and opportunities, we aim to offer practical guidance to researchers, system architects, and software developers dealing with multi-GPU supercomputing. Our results show that there is untapped bandwidth, and there are still many opportunities for optimization, ranging from network to software optimization.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWhile efficient routing in Dragonfly networks presents significant challenges, the advent of Software-defined Networking (SDN) offers new opportunities for routing optimization by providing a global network view. This research proposes an SDN-based adaptive routing approach for Dragonfly interconnects. By leveraging global traffic information from SDN, our approach identifies and avoids persistent congestion points that may occur with conventional UGAL routing, leading to improved resource utilization and enhanced performance. This study addresses a critical gap in the development of efficient adaptive routing solutions using SDN technology.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Abstract
Task Parallelism
W
DescriptionIn this project, we implemented algorithms to construct suffix arrays in Chapel, and used them to compute minimal unique substrings and similarity rankings. We developed these algorithms as building blocks for strain detection in metagenomic analysis. With Chapel, we were able to create parallel programs for each of these tasks in short order, and we are seeing good performance with minimal optimization.
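A naive Python sketch of the building blocks mentioned above, using a simplified notion of uniqueness (the shortest substring occurring exactly once); the Chapel implementation is parallel and far more efficient:

```python
def suffix_array(s):
    """Naive O(n^2 log n) construction: sort suffix start positions by content."""
    return sorted(range(len(s)), key=lambda i: s[i:])

def shortest_unique_substring(s):
    """Return one shortest substring occurring exactly once in s."""
    n = len(s)
    for length in range(1, n + 1):
        counts = {}
        for i in range(n - length + 1):
            counts[s[i:i + length]] = counts.get(s[i:i + length], 0) + 1
        unique = [sub for sub, c in counts.items() if c == 1]
        if unique:
            return unique[0]
    return s

print(suffix_array("banana"))               # [5, 3, 1, 0, 4, 2]
print(shortest_unique_substring("banana"))  # 'b'
```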
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionBatched parallelism with local allocations is an extremely common pattern in HPC, appearing in multi-dimensional FFTs, neural network processing, and split computation of numerical operators.
Its efficient support is especially complex on GPU where memory per work-item is limited and dynamic memory allocations are challenging.
This study investigates whether the native abstractions of SYCL can support performance portability for this pattern. We implement versions of a batched semi-Lagrangian advection kernel using each parallel construct of SYCL. We evaluate them in terms of maintainability, performance portability, and memory footprint on CPUs and GPUs (AMD, Intel, NVIDIA), with two distinct SYCL implementations (AdaptiveCpp and DPC++). Our results demonstrate that no single parallel construct of SYCL emerges as the best solution and that a construct offering a higher level of abstraction would be required to support this common pattern.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionIn the face of surging power demands for exascale HPC systems, this work tackles the critical challenge of understanding the impact of software-driven power management techniques like Dynamic Voltage and Frequency Scaling (DVFS) and Power Capping. These techniques have been actively developed over the past few decades. By combining insights from GPU benchmarking to understand application power profiles, we present a telemetry data-driven approach for deriving energy savings projections. This approach has been demonstrably applied to the Frontier supercomputer at scale. Our findings based on three months of telemetry data indicate that, for certain resource-constrained jobs, significant energy savings (up to 8.5%) can be achieved without compromising performance. This translates to a substantial cost reduction, equivalent to 1438 MWh of energy saved. The key contribution of this work lies in the methodology for establishing an upper limit for these best-case scenarios and its successful application.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionProactive Data Containers (PDC) is an object-centric runtime metadata and data management system designed for transparent, asynchronous, and autonomous data movement, taking advantage of the complex memory and storage hierarchy. We have previously explored PDC's performance and scalability with a traditional Lustre-based storage backend. Understanding PDC's behavior on novel storage solutions like VAST offers the chance to support cross-facility and multi-filesystem deployment for complex scientific analysis while preserving the key characteristics PDC was designed to attain. In this work-in-progress, we share our initial scalability results for PDC operations when running atop a VAST data system.
Workshop
Applications and Application Frameworks
W
DescriptionHPC systems are increasingly connected to various external data sources via high-speed networks, and on-demand execution of urgent jobs in such a "connected supercomputing" environment could be an attractive approach to supporting disaster response and recovery. Indeed, a production system for real-time forecasting of tsunami damage is already in operation. Once a large-scale earthquake occurs, the seismic information is sent to Supercomputer AOBA at Tohoku University to estimate, within several minutes, the tsunami damage most likely caused by the earthquake. The estimation results are then sent to the Japanese government for various kinds of decision-making. However, on-demand job execution brings several technical challenges in resource management. A naive implementation of on-demand job execution would critically decrease system throughput. Moreover, if an urgent simulation with deadline constraints needs a large amount of computing resources, it might be impossible for a single HPC system to immediately secure enough computing resources to meet the deadline. Therefore, we have started a new research project named ExpressHPC to offer an expressway to urgent allocation of computing resources from multiple datacenters. This lightning talk will introduce the ExpressHPC project and then discuss the technical challenges and potential opportunities for research collaboration.
Invited Talk
TP
DescriptionFusion energy holds the promise of a sustainable carbon-free electricity source, but the scientific and technological hurdles are high. Progress is made by iterative development of experimental prototypes, where the cost of each prototype can be measured in units of $100MM. Validated multi-physics, multi-scale simulation codes are critical to reduce the number of such prototypes and mitigate design risk. We eschew low-level HPC coding and instead adopt high-level exascale-capable tools, and we expand their scope into new physics domains. In particular, the WarpX code, which won the 2022 ACM Gordon Bell Prize for first-principles simulations of laser-based electron accelerators, has been extended to implement high-fidelity reduced models of magnetically confined fusion plasmas. A zero-mass electron model has been added to shift the electromagnetic time and space scales by six orders of magnitude. A semi-implicit implementation has been added to reduce electrostatic solver time to solution by two to three orders of magnitude. Multiphysics models of plasma-material interaction are underway. We show multi-GPU performance scaling of 3D simulations of wave-particle interactions in field-reversed configuration (FRC) plasmas. We also comment on human productivity enhancements that have come with the adoption of the exascale tools and describe our IP framework for using open-source software in the commercial setting.
Panel
Performance Evaluation and/or Optimization Tools
Performance Optimization
TP
DescriptionHigh-performance computing platforms are evolving at a rapid pace with the integration of GPUs from three major vendors (NVIDIA, AMD, and Intel). With the rise of AI/ML use cases, software frameworks that target heterogeneous architectures are now trying to adapt to hybrid HPC and AI/ML workloads. Performance evaluation tools that have supported HPC architectures for years must now adapt to this extreme-scale heterogeneity with support for GPU runtimes that target diverse programming models. These programming models include OpenMP target offload directives, DPC++/SYCL, CUDA, HIP, OpenACC, OpenCL, MPI and require distinct performance instrumentation interfaces. This panel will discuss the challenges facing portable performance evaluation tools and will include panelists from national laboratories, industry, as well as academic institutions in the US and Europe. These panelists represent leading performance evaluation tools in the HPC and AI/ML research community as well as industry.
Exhibits
SCinet
TP
XO/EX
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionThe paper introduces ARBALEST-VEC, a redesigned dynamic analysis tool for detecting data inconsistencies in OpenMP offloading applications, specifically targeting GPU environments. Building upon the shortcomings of its predecessor, ARBALEST, the new tool incorporates several improvements. ARBALEST-VEC is implemented on a more recent version of LLVM, addressing issues such as inaccurate data movement modeling and insufficient debugging information in ARBALEST. It introduces a new OpenMP Tool interface (OMPT) event, device mem, to accurately capture data movements, thus reducing runtime overhead and enhancing debugging accuracy. The tool also leverages additional debug information from LLVM 15 to generate more detailed and user-friendly bug reports. Furthermore, ARBALEST-VEC employs a dedicated shadow memory and vectorized dynamic analysis using SIMD instructions to improve the performance and precision of data inconsistency detection. Evaluations demonstrate that ARBALEST-VEC offers improved accuracy, usability, and performance over ARBALEST, with lower time overhead and more predictable memory usage.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionGraph Edit Distance (GED) is a fundamental metric for assessing graph similarity with critical applications across various domains, including bioinformatics, classification, and pattern recognition. However, the exponential computational complexity of GED has hindered its adoption for large-scale graph analysis. This poster presents FAS-GED, a GPU framework for fast and accurate GED computation. FAS-GED achieves significant performance gains by optimizing memory accesses and minimizing data transfer while maintaining high accuracy. FAS-GED shows up to a 300x speedup over its CPU-based implementations on 48-CPU AMD EPYC. Our approach surpasses existing methods in speed and precision, demonstrating up to a 55x speedup over the NetworkX library for small graphs and reaching optimal solutions in 94% of cases. FAS-GED is a step toward unlocking the potential of GED for large-scale graph analysis in real-world applications.
Paper
Cloud Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Resource Management
State of the Practice
TP
DescriptionThe microservice architecture is increasingly popular for flexible, large-scale online applications. However, existing resource management mechanisms incur high latency in detecting Quality-of-Service (QoS) violations and hence fail to allocate resources effectively under commonly observed varying load conditions. This results in over-allocation coupled with a late response, which increases both the total cost of ownership and the magnitude of each QoS violation event. We present SurgeGuard, a decentralized resource controller for microservice applications specifically designed to guard application QoS during surges in load and network latency. SurgeGuard uses the key insight that, for rapid detection and effective management of QoS violations, the controller must be aware of any available slack in latency and communication patterns between microservices within a task graph.
Our experiments show that for the workloads in DeathStarBench, SurgeGuard on average reduces the combined violation magnitude and duration by 61.1% and 93.7%, respectively, compared to the well-known Parties and Caladan algorithms.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionToday’s complex HPC systems are incredibly powerful yet equally likely to experience failures. The scientific applications on these HPC systems are mostly iterative in nature. Iterative solvers have some inherent fault tolerance, but they are still susceptible to errors. One subset of these iterative methods is the Krylov subspace methods. There has been limited research on the fault tolerance of these methods against soft errors. We know the Preconditioned Conjugate Gradient (PCG) method to be self-correcting in nature, but we know little about other Krylov subspace methods. Our goal is to study the error propagation caused by the Sparse Matrix-Vector Multiplication (SpMV) operation in the Lanczos, Bi-Conjugate Gradient (BiCG), and PCG methods. Using the results from these experiments and knowledge from previous work, we will generalize our findings to all Krylov subspace methods.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionLarge-scale DL on HPC systems like Frontier and Summit uses distributed node-local caching to address scalability and performance challenges. However, as these systems grow more complex, the risk of node failures increases, and current caching approaches lack fault tolerance, jeopardizing large-scale training jobs. We analyzed six months of SLURM job logs from Frontier and found that over 30% of jobs failed after an average of 75 minutes. To address this, we propose fault-tolerance strategies that recache data lost from failed nodes using a hash ring technique for balanced data recaching in the distributed node-local caching, reducing reliance on the PFS. Our extensive evaluations on Frontier showed that the hash ring-based recaching approach reduced training time by approximately 25% compared to the approach that redirects I/O to the PFS after node failures and demonstrated effective load balancing of training data across nodes.
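A minimal sketch of the consistent-hashing idea behind a hash ring, assuming MD5 hashing and illustrative node names; when a node fails, only the keys it owned need to be recached:

```python
import bisect
import hashlib

class HashRing:
    """Map keys to nodes on a hash ring; removing a node only remaps
    the keys it owned, which bounds the data that must be recached."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((self._h(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        i = bisect.bisect(self.keys, self._h(key)) % len(self.ring)
        return self.ring[i][1]

    def remove(self, node):
        self.ring = [(h, n) for h, n in self.ring if n != node]
        self.keys = [h for h, _ in self.ring]

ring = HashRing(["node0", "node1", "node2"])
print(ring.node_for("sample_042"))
ring.remove("node1")                     # only node1's cached shards need recaching
print(ring.node_for("sample_042"))
```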
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionNumerical iterative algorithms are struck by multiple error types when deployed on large-scale HPC platforms: fail-stop errors (failures) and silent errors, striking both as computation errors and memory bit-flips. Our novel approach provides efficient fault-tolerant algorithms that are capable of detecting and correcting them simultaneously. Previous works never addressed all the error types simultaneously.
We introduce a hierarchical periodic pattern combining various general-purpose and application-specific techniques and optimize its shape in order to minimize the expected time per iteration. The derivation is intricate because optimizing a resilience period for one error type depends upon other errors possibly striking and slowing down execution progress.
A case study with the preconditioned conjugate gradient algorithm (PCG) demonstrates the good performance and flexibility of our approach, which easily adapts to different application and fault-tolerance parameter costs (e.g. iteration, verification, checkpoint, etc.).
Future work: extension to include more case studies.
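For readers unfamiliar with the case-study solver, a compact NumPy sketch of PCG (with a Jacobi preconditioner as an illustrative choice) shows the per-iteration structure around which verification and checkpointing steps of a resilience pattern would be placed; the pattern itself is not shown:

```python
import numpy as np

def pcg(A, b, M_inv, tol=1e-10, max_iter=500):
    """Preconditioned Conjugate Gradient for SPD A; M_inv applies the preconditioner."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M_inv(r)
    p = z.copy()
    rz = r @ z
    for k in range(max_iter):
        Ap = A @ p                       # SpMV and dot products can be struck by errors
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:      # resilience patterns insert periodic
            return x, k                  # verification/checkpoints around this point
        z = M_inv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x, max_iter

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x, iters = pcg(A, b, M_inv=lambda r: r / np.diag(A))   # Jacobi preconditioner
print(x, iters)
```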
Workshop
Software Engineering
W
DescriptionTrends inside the world of hardware tell us clearly where the future of HPC is heading. Details will emerge later, but for now let’s just say “complexity on steroids.” At the same time, power consumption rules. This means software must map tightly onto hardware features. We “know” how (in principle) to do this, but there are a few “engineering details” to work out. And then, of course, there are the people who are responsible for creating that software. These three — the processors, the people, and the programming — must all come together to make the future of HPC work. Simple. In this talk, we’ll discuss all the messy details missing from this abstract and the challenges they imply. Then I’ll “rain on my own parade” and discuss the technical and sociological forces that may make this whole complex house of cards come tumbling down. It should be a thought-provoking, if not insightful, conversation.
Workshop
Software Engineering
W
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionScientific workflows have become highly heterogeneous, leveraging distributed facilities such as HPC, Artificial Intelligence (AI), Machine Learning (ML), scientific instruments (data-driven pipelines), and edge computing. As a result, Identity and Access Management (IAM) and cybersecurity challenges across the diverse hardware and software stacks are growing. Nevertheless, scientific productivity relies on lowering access barriers via seamless, single sign-on and federated login while ensuring access controls and compliance. We present an implementation of a federated IAM solution, coupled with multiple layers of security controls, multi-factor authentication, cloud-native protocols, and time-limited role-based access controls (RBAC), that has been co-designed and deployed for the Isambard-AI and HPC supercomputing Digital Research Infrastructures (DRIs) in the UK. As a national research resource, the Isambard DRIs are expected to comply with regulatory frameworks. Implementation details for monitoring, alerting, and controls are outlined in the paper, alongside selected user stories demonstrating IAM workflows for different roles.
Doctoral Showcase
Posters
TP
DescriptionModern high-performance computing clusters increasingly rely on GPUs, rather than CPUs, as the source of their computational power. GPUs are tailored for data-parallel algorithms in which many cores perform the same operations on different memory locations. However, making CPU code run within GPU constraints is often a non-trivial task. Firstly, not all algorithms are easy to parallelize. Secondly, there is no single way to program GPUs from different manufacturers, as each promotes its own solutions. To address this issue, a runtime GPU code generation and optimization platform, PfSolve, has been developed during this PhD. Originally based on VkFFT (a Vulkan/CUDA/HIP/OpenCL/Level Zero/Metal Fast Fourier Transform library), PfSolve has been generalized and restructured.
QuiCC is a code under development in our research group designed to solve the equations of magnetohydrodynamics in a full sphere and other geometries. It uses a fully spectral approach to the problem, with the Jones-Worland (JW) polynomials as a radial basis and spherical harmonics (SH) as a spherical basis. The main goal of this dissertation is a GPU implementation of the FFT-based algorithm for their evaluation, which is more accurate and requires less memory than the regular quadrature approach. One of the main building blocks used by it is the efficient tridiagonal GPU solver, developed with the new warp-programming approach.
This work also presents additional algorithms redesigned within the platform, such as finite differences solver and double-double emulation of FP128 calculations on GPUs.
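As background on one of the building blocks mentioned above (not the warp-level GPU implementation itself), a serial Thomas-algorithm sketch for a tridiagonal system in Python:

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c, and right-hand side d (all 1-D arrays)."""
    n = len(d)
    cp = np.zeros(n); dp = np.zeros(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / denom
    x = np.zeros(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# 4x4 example: diagonal 2, off-diagonals -1; solution is [1, 1, 1, 1]
a = np.array([0.0, -1.0, -1.0, -1.0])
b = np.array([2.0, 2.0, 2.0, 2.0])
c = np.array([-1.0, -1.0, -1.0, 0.0])
print(thomas(a, b, c, np.array([1.0, 0.0, 0.0, 1.0])))
```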
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionAs high-performance computing systems continue to advance, the gap between computing performance and I/O capabilities is widening. This bottleneck limits the storage capabilities of increasingly large-scale simulations, which generate data at never-before-seen granularities while only being able to store a small subset of the raw data. Recently, strategies for data-driven sampling have been proposed. However, a thorough analysis of how such intelligent samples can be used for data reconstruction is lacking. We propose a data-driven machine learning approach based on training neural networks to reconstruct full-scale datasets based on a simulation’s sampled output. Compared to current state-of-the-art reconstruction approaches, we demonstrate that our machine learning-based reconstruction has several advantages, including reconstruction quality, time-to-reconstruct, and knowledge transfer to unseen timesteps and grid resolutions. We propose and evaluate strategies that balance the sampling rates with model training (pretraining and fine-tuning) and data reconstruction time to demonstrate its efficacy.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
TP
DescriptionThe rapid progress in Deep Learning (DL) and Large Language Models (LLMs) has exponentially increased demands of computational power and bandwidth. This, combined with the high costs of faster computing chips and interconnects, has significantly inflated High Performance Computing (HPC) construction costs. To address these challenges, we introduce the Fire-Flyer AI-HPC architecture, a synergistic hardware-software co-design framework and its best practices. For DL training, we deployed the Fire-Flyer 2 with 10,000 PCIe A100 GPUs, achieved performance approximating the DGX-A100 while reducing costs by half and energy consumption by 40%. We specifically engineered HFReduce to accelerate allreduce communication and implemented numerous measures to keep our Computation-Storage Integrated Network congestion-free. Through our software stack, including HaiScale, 3FS, and HAI-Platform, we achieved substantial scalability by overlapping computation and communication. Our system-oriented experience from DL training provides valuable insights to drive future advancements in AI-HPC.
Workshop
Software Engineering
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAs HPC models grow increasingly complex, disparities in floating point implementations across hardware platforms begin to pose significant challenges to reproducibility and reliability. This is especially so, given that HPC employs hardware optimized for performance, which quite often deviates from the IEEE Standard. We leverage SMT solvers, particularly Z3, to develop a rigorous framework for analyzing and verifying the behavior of computer arithmetic implementations in emerging hardware realizations. Using bit-vectors to model IEEE non-standard behaviors, we are able to formally reason about intricate deviations in areas such as rounding rules, subnormal number handling, precision, normalization, etc. We demonstrate the framework's utility in two key applications: automating feature-targeted hardware testing for undocumented features and uncovering the degree of conformance to deeper properties such as monotonicity within these non-standard arithmetics. Our work also directly benefits cutting-edge GPU implementations, which is a timely issue underlying trust in scientific computation.
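A small sketch of the kind of query such a framework can issue using z3's floating-point theory; the property checked (monotonicity of addition, with NaN inputs excluded) and the Float32 sort are illustrative assumptions:

```python
from z3 import FP, Float32, RNE, fpAdd, fpLEQ, fpIsNaN, Not, Or, Solver, unsat

x, y, z = FP('x', Float32()), FP('y', Float32()), FP('z', Float32())
rm = RNE()  # round-to-nearest-even rounding mode

s = Solver()
# Look for a counterexample to monotonicity: x <= y  implies  x + z <= y + z
s.add(fpLEQ(x, y))
s.add(Not(fpLEQ(fpAdd(rm, x, z), fpAdd(rm, y, z))))
s.add(Not(Or(fpIsNaN(x), fpIsNaN(y), fpIsNaN(z))))  # exclude NaN inputs

result = s.check()
if result == unsat:
    print("monotonicity holds for all modeled inputs")
else:
    print("counterexample:", s.model())  # e.g. corner cases involving infinities
```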
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDue to its historical popularity, Fortran was used to implement many important scientific applications. The complexity of these applications, along with the transition to modern high performance languages like C++, has made modernization and optimization challenging for these applications. Significant development time is incurred to understand and optimize key algorithms as well as leverage new accelerator systems. To reduce this development effort, we propose FortranX, a compiler framework to discover and optimize key algorithms in Fortran applications without source code modification. FortranX uses a compiler pass to recognize key algorithms, a code generation system to produce architecturally optimized kernels, and a heterogeneous runtime system to execute those kernels on various hardware platforms. We describe the design of FortranX and show initial performance results for a cyclic convolution kernel used in Poisson solvers for Partial Differential Equations (PDEs).
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionHPC is at the forefront of the present and future generations’ biggest challenges in social, economic, epistemological, and ethical issues related to sustainability, equity, accessibility, and justice. At its best, HPC is a powerful tool that can be used for the social good of humanity by governments, for-profit as well as not-for-profit organizations, and teams of dedicated scientists and practitioners. In the words of Phil Roth, SC24 General Chair, “it matters what we do with it.” In this presentation we identify opportunities to exercise moral imagination, conceive value propositions for including ethics in HPC, introduce practical approaches to fostering three types of advocacy (self, individual, system), and demonstrate several techniques and tools that can be used to hone moral imagination and the development of ethical mindsets for the practice of advocacy.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF will focus on provoking community discussion on the impact that revolutionary generative AI technology can have on HPC. Most of the BoF will be devoted to audience interaction, supported by a diverse group of panelists who are recognized experts and pioneers in the use of generative AI technologies for different HPC targets. We expect that this BoF will help to pave the road towards a future AI-assisted HPC era, putting the focus on the challenges and opportunities to be addressed in the upcoming years.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionHigh-fidelity direct numerical simulation of turbulent flows for most real-world applications remains an outstanding computational challenge.
Several machine learning approaches have recently been proposed to alleviate the computational cost even though they become unstable or unphysical for long time predictions.
We identify that Fourier neural operator (FNO) based models combined with a partial differential equation (PDE) solver can accelerate fluid dynamic simulations and thus address the computational expense of large-scale turbulence simulations.
We treat the FNO model on the same footing as a PDE solver and answer important questions about the volume and temporal resolution of data required to build pre-trained models for turbulence.
We also discuss the pitfalls of purely data-driven approaches that need to be avoided by the machine learning models to become viable and competitive tools for long time simulations of turbulence.
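For readers unfamiliar with the building block involved, the following is a minimal, assumed sketch (NumPy only, not the presented models) of the spectral convolution at the core of an FNO layer: transform to Fourier space, apply learned weights to only the lowest modes, and transform back.

```python
import numpy as np

def spectral_conv_1d(u, weights, n_modes):
    """Core FNO building block: filter the lowest Fourier modes with learned weights."""
    u_hat = np.fft.rfft(u)                          # to Fourier space
    out_hat = np.zeros_like(u_hat)
    out_hat[:n_modes] = weights * u_hat[:n_modes]   # act on low frequencies only
    return np.fft.irfft(out_hat, n=u.size)          # back to physical space

rng = np.random.default_rng(0)
n, n_modes = 256, 16
u = np.sin(2 * np.pi * np.arange(n) / n)            # toy periodic input field
w = rng.standard_normal(n_modes) + 1j * rng.standard_normal(n_modes)
print(spectral_conv_1d(u, w, n_modes).shape)        # (256,)
```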
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionSince the advent of software-defined networking (SDN), Traffic Engineering (TE) has been highlighted as one of the key applications that can be achieved through software-controlled protocols. TE problems involve difficult decisions such as allocating flows, either via splitting them among multiple paths or by using a reservation system, to minimize congestion. However, creating an optimized solution is cumbersome and difficult as traffic patterns vary and change with network scale, capacity, and demand. AI methods can help alleviate this by finding optimized TE solutions for the best network performance. In this paper, we leverage Hecate to practically demonstrate TE on a real network, collaborating with PolKA, a source routing protocol tool. With real-time traffic statistics, Hecate uses this data to compute optimal paths that are then communicated to PolKA to allocate flows. This work proves valuable for truly engineered self-driving networks, helping translate theory into practice.
Panel
Artificial Intelligence/Machine Learning
Compilers
TP
DescriptionMajor tech players were already investing heavily in domain-specific compilation for AI, a trend that has since been turbocharged by the LLM revolution triggered by ChatGPT and its derivatives. With Python-based frontends for productivity, and MLIR as a robust foundational infrastructure, it could seem like AI outpaced traditional HPC in domain-specific compilation. Yet with every organization developing its own compilation stacks, and increasingly diverse AI accelerators, reuse is sparse and we are far from reaching generality. The goal of this panel is to exchange wisdom across traditional HPC and AI compilation. What challenges prevent more sharing and standardization of compilation techniques and infrastructure? How do we face the challenge of diverse and ever-evolving hardware? Are we moving from a fire-and-forget system to a more interactive compilation experience? Our expert panelists will give us invaluable insights from their positions in cutting edge compiler research and engineering.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Hardware Technologies
TP
XO/EX
DescriptionThe radical increase in CapEx to integrate specialized hardware, cooling solutions, and power infrastructure has raised the need for manageability solutions that can adapt to this dynamic shift in the ecosystem. This presentation explores the evolving role of manageability in AI-focused environments, contrasting traditional server architectures with those tailored for AI and HPC workloads.
At the data center level, we explore how AI systems integrate into broader management frameworks, with a focus on the critical roles of power and cooling in maintaining optimal performance. We emphasize the growing need for open, cluster-level management solutions that can handle the unique demands of AI workloads, including predictive failure detection and clustering strategies.
This presentation is designed for professionals navigating the complexities of AI system deployment and management, offering insights into the future of data center operations in an AI-driven world.
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionMuch attention is being placed on the use of artificial intelligence and machine learning for application development, system design, and performance analysis. Drawing on our ongoing work on data-driven HPC, including AI- and ML-based approaches for HPC application trace and performance data synthesis, HPC software synthesis and code optimization, and HPC system monitoring, this talk illustrates the promise of principled and automated data-driven approaches across the HPC system design lifecycle.
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionDisk failure data provides valuable insights for preventing failures, enhancing storage robustness, guiding system design and deployment, and ensuring reliable operations at data centers. This paper introduces two disk failure datasets collected from large-scale HPC production environments over the past five years, comprising over 5,000 failure records from more than 40,000 disks. We analyzed these datasets across multiple dimensions, including temporal, spatial, and relational trends, and performed a comprehensive reliability assessment. Our analysis yielded numerous observations and insights that influence various operational aspects of HPC storage systems. We believe this study offers a holistic understanding of disk failure trends likely to interest the HPC storage community.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionDeveloping a sustainable high-performance computing (HPC) workforce pipeline remains a global priority.
South Africa has several workforce training initiatives aimed at developing career HPC system administrators. Apart from the annual South African Student Cluster Competition (SCC), there are no other formal training programmes available for the undergraduate student community.
Each year, the University of the Witwatersrand (“Wits”) has entered at least one team in the SCC. Through the implementation of student-led training approaches, Wits has enjoyed continued success at SCC events. Wits students have been part of numerous teams, achieving six top-three finishes in international Student Cluster Competitions.
This paper provides an overview of the student HPC Special Interest Group (SIG) formed at the University of the Witwatersrand that focuses on delivering HPC training to the undergraduate student community.
The paper outlines the approach towards growing and maintaining the interest group, including teaching and learning strategies to prepare for SCCs.
Workshop
I/O, Storage, Archive
W
DescriptionDuring the past decade, Deep Learning (DL) algorithms, programming systems, and hardware have converged with their High Performance Computing (HPC) counterparts. Nevertheless, the programming methodology of DL and HPC systems is stagnant, relying on highly optimized, yet platform-specific and inflexible, vendor libraries. Such libraries provide close-to-peak performance on the specific platforms, kernels, and shapes thereof to which vendors have dedicated optimization efforts, while underperforming in the remaining use cases, yielding non-portable codes with performance glass jaws. This talk will shed light on abstraction efforts, mainly targeting CPUs and widening to GPUs as the approaches move closer to DSLs/compilers. We will introduce the Tensor Processing Primitives (TPP) as a virtual, software-defined ISA abstraction in the form of ukernels. Subsequently, we will cover programming abstractions on top of TPP, which are carried out in two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), a compact, versatile set of 2D-tensor operators; 2) expressing the logical loops around TPPs in a high-level, declarative fashion, while the exact instantiation (ordering, tiling, parallelization) is determined via simple knobs. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms. We will close the talk by demonstrating how TPP can be the architectural target of a tensor compiler, which in turn is able to generate code that matches hand-coded performance.
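As a rough, hypothetical illustration of the two-step structure described (not the actual TPP API), the sketch below separates a small 2D "ukernel" from a loop nest whose ordering and tile size are expressed as simple knobs and can be retuned without touching the kernel.

```python
import numpy as np
from itertools import product

def tile_gemm(C, A, B):
    """Stand-in for a 2D tensor primitive: accumulate a tile-level matrix product."""
    C += A @ B          # writes through the NumPy view passed in

def blocked_matmul(A, B, tile=64, loop_order=("i", "j", "k")):
    """Outer loop nest expressed separately; order and tile size are the 'knobs'."""
    n = A.shape[0]
    C = np.zeros((n, n))
    blocks = {d: range(0, n, tile) for d in "ijk"}
    for idx in product(*(blocks[d] for d in loop_order)):
        pos = dict(zip(loop_order, idx))
        i, j, k = pos["i"], pos["j"], pos["k"]
        tile_gemm(C[i:i+tile, j:j+tile],
                  A[i:i+tile, k:k+tile],
                  B[k:k+tile, j:j+tile])
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
print(np.allclose(blocked_matmul(A, B, loop_order=("k", "i", "j")), A @ B))
```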
Exhibits
Flash Session
TP
XO/EX
DescriptionFor over 150 years, Valvoline has been the driving force behind keeping the world’s most powerful machines running cool, from high-octane race cars to turbines. We understand that the heat generated by these machines is not unlike the challenges faced by today’s high-performance computing. The same expertise that protects engines under immense pressure now fuels our cutting-edge liquid cooling solutions, ensuring your HPC systems perform at the highest level, no matter the demand.
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionThe performance of the GMRES iterative solver on GPUs is limited by the GPU main memory bandwidth. Compressed Basis GMRES outperforms GMRES by storing the Krylov basis in low precision, thereby reducing memory accesses. An open question is whether compression techniques that are more sophisticated than casting to low precision can enable large runtime savings while preserving the accuracy of the final results. This paper presents the lightweight in-register compressor frsz that can decompress at the bandwidth speed of a modern NVIDIA H100 GPU. In an experimental evaluation, we demonstrate that using frsz instead of low precision for compression of the Krylov basis can bring larger runtime benefits without impacting final accuracy.
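A minimal sketch of the underlying trade-off (illustrative only; frsz is an in-register compressor, whereas here "compression" is simply a cast to half precision): storing a normalized Krylov basis vector more compactly reduces memory traffic at the cost of a small representation error.

```python
import numpy as np

rng = np.random.default_rng(1)
v = rng.standard_normal(1 << 20)              # one Krylov basis vector, float64
v /= np.linalg.norm(v)                        # normalized, as in an Arnoldi basis

v_low = v.astype(np.float16)                  # 4x less data to move than float64
err = np.linalg.norm(v - v_low.astype(np.float64))
print(f"bytes: {v.nbytes} -> {v_low.nbytes}, representation error: {err:.2e}")
```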
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionRIKEN is conducting a feasibility study, commissioned by the Ministry of Education, Culture, Sports, Science and Technology (MEXT), for the next-generation supercomputing infrastructure “Fugaku-Next”. Fugaku-Next is expected to become a platform for achieving the SDGs and Society 5.0 by providing advanced digital twins. We are working as an application group for the RIKEN team. The application group is composed of two subgroups: one that conducts research activities in the fields of computational science, data science, and social science, and the other that conducts research activities mainly in the field of computer science, such as performance modeling and benchmark construction. In this presentation, we will introduce an overview of the recent activities of the application group of the RIKEN team and some results of performance modeling for applications.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionFortran is the lingua franca of HPC code development, and as such it is crucial that we as a community have open source Fortran compilers capable of generating high performance executables. Flang is LLVM's Fortran compiler and leverages MLIR, a reusable compiler infrastructure that, as part of LLVM, has become popular in recent years.
However, whilst Flang leverages MLIR it does not fully integrate with it and instead provides bespoke translation and optimisation passes to target LLVM-IR. In this paper we first explore the performance of Flang against other compilers popular in HPC for a range of benchmarks, before describing a mapping between Fortran and standard MLIR and exploring its performance. The result of this work is a speedup of up to three times compared with Flang's existing approach across the benchmarks and experiments run, demonstrating that the Flang community should seriously consider leveraging standard MLIR.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionWe present novel generative models for accelerating simulations of high-dimensional systems through learning and evolving their effective dynamics. In our Generative Learning of Effective Dynamics (G-LED) framework, instances of high-dimensional data are downsampled to a lower-dimensional manifold that is evolved through an auto-regressive attention mechanism. Subsequently, Bayesian diffusion models are employed that map this low-dimensional manifold back onto its corresponding high-dimensional space. These diffusion models operate simultaneously on batches of data and can incorporate physical constraints using the concept of virtual observables and gradient guidance. We demonstrate unprecedented capabilities in capturing the evolution of benchmark models such as Kuramoto-Sivashinsky as well as simulations of 3D turbulent channel flows.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionThe chemical space of small molecules is vast, making de novo drug design challenging. Traditional methods are slow and costly. While AI advancements have improved this process, we still face limitations in exploring the larger chemical space. In oncological drug discovery, various factors such as selectivity, efficacy, safety, toxicity, and synthesizability must be considered.
We introduce the Generalized Generative Molecular Design (GGMD), an open-source tool that combines generative AI with population-based optimization algorithms for drug design and lead optimization. GGMD’s modular and customizable framework allows users to adjust methods to fit specific research needs, balancing trade-offs like efficacy and synthesizability. Designed for accessibility, GGMD is transportable and provides tools for visualizing results and refining parameters.
We’ve successfully used GGMD to optimize properties such as LogP and toxicity, leading to the discovery of new molecules.
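As an assumed illustration of the kind of property scoring such an optimization loop relies on (shown with RDKit, not GGMD's actual interface), each candidate molecule given as a SMILES string can be scored on LogP and drug-likeness.

```python
from rdkit import Chem
from rdkit.Chem import Crippen, QED

def score(smiles: str):
    """Score one candidate molecule on LogP and drug-likeness (QED)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {"LogP": Crippen.MolLogP(mol), "QED": QED.qed(mol)}

# Ethanol, phenol, and aspirin as toy candidates.
for smi in ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]:
    print(smi, score(smi))
```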
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionExaDigiT is an open-source framework for developing comprehensive digital twins (DTs) of liquid-cooled supercomputers. DTs merge telemetry, simulations, and AI/ML/RL to create a complete virtual representation of a system. This provides an effective tool for testing a variety of system optimizations, determining the impact and outcomes of hypothetical “what-if” scenarios, and creating virtual prototypes of future systems with performance and cost insights. Our framework consists of three primary modules including (1) a resource allocator and power simulator, (2) a thermofluidic cooling model, and (3) an augmented reality model of a supercomputing cluster and its cooling plant. ExaDigiT can predict the power and energy losses of synthetic and real workloads, simulate complex transient dynamics to provide accurate cooling predictions, and provide an interactive means of analyzing relevant data. ExaDigiT is released under Apache and MIT licenses.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionScientific groups are struggling to adapt their codes to rapidly developing GPU-based HPC platforms. The domain of distributed coupled cluster (CC) calculations is not an exception. Moreover, our applications to tiny QED effects require higher-order CC methods that include thousands of tensor contractions, which makes automatic treatment imperative.
The challenge is to allow efficient implementation by capturing key symmetries of the problem, while retaining the abstraction from the hardware. We present the tensor programming framework tenpi, which seeks to find this balance. It features a Python library user interface, global optimization of intermediates, a visualization module and Fortran code generator that bridges the DIRAC package for relativistic molecular calculations to tensor contraction libraries. tenpi brings higher-order CC functionality to the massively parallel module of DIRAC. The architecture and design decision schemes are accompanied by benchmarks and by first production calculations on Summit, Frontier and LUMI along with state-of-the-art tensor contraction software.
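For context, a single representative contraction of the kind such frameworks generate and optimize by the thousands might look as follows (illustrative NumPy einsum with assumed dimensions, not tenpi's generated Fortran).

```python
import numpy as np

# Illustrative dimensions: o = occupied orbitals, v = virtual orbitals.
o, v = 8, 24
t2 = np.random.rand(o, o, v, v)      # doubles amplitudes t_{ij}^{ab}
eri = np.random.rand(v, v, v, v)     # two-electron integrals <ab||cd>

# One representative coupled cluster term: sum_{cd} <ab||cd> * t_{ij}^{cd}
contrib = np.einsum("abcd,ijcd->ijab", eri, t2)
print(contrib.shape)                 # (8, 8, 24, 24)
```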
Students@SC
TP
W
TUT
XO/EX
DescriptionGuided Interest Groups (GIGs) are community learning experiences designed to help student attendees navigate the SC Technical Program while focusing on key topics in high-performance computing (HPC) related to their interests. The GIGs this year span topics ranging from the fundamentals of HPC to state-of-the-art developments in machine learning, artificial intelligence, sustainability, and scientific applications. GIGs are open to all students attending the conference, with priority given to those participating in the Students@SC cohorts. Pre-registration for a specific GIG is required to attend this kickoff event. Refreshments will be provided.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionEfficient resource allocation in high-performance computing (HPC) environments is crucial for optimizing utilization, minimizing make-span, and enhancing throughput. We propose GNN-RL, a novel intelligent scheduler that leverages a hybrid Graph Neural Network and Reinforcement Learning model, learning from historical workload data to implement optimal scheduling policies. Experimental results show that GNN-RL significantly outperforms conventional methods. Compared to the First-Come-First-Served (FCFS) baseline, GNN-RL achieves a 2.1-fold increase in resource utilization (84.25% vs. 39.84%), a 114-fold improvement in throughput (40,061.86 vs. 351.69 jobs/s), and a 114-fold reduction in make-span (4.50s vs. 513.11s). GNN-RL also surpasses EASY Backfilling, with 1.3 times higher resource utilization and 2 times better throughput and make-span. The fairness index remains consistent, indicating that GNN-RL maintains fairness while improving other metrics. Our findings suggest GNN-RL is a significant advancement in intelligent HPC resource management, enabling more efficient and responsive computing environments.
Birds of a Feather
TP
XO/EX
DescriptionThe transition to inherently variable green energy fundamentally impacts HPC centers: energy availability will vary across the day, year, and geographical region. Constant supplies of electricity may become unaffordable, and HPC centers will need to adapt or shift their load accordingly. Such “adaptive capacity computing” (ACC) will enable them to react gracefully to varying power profiles, and achieve the optimal throughput possible. It also offers opportunities for reducing operational costs by leveraging dynamic electricity markets, stabilizing the grid, and making HPC greener.
This BoF discusses how ACC impacts system architecture, hardware, scheduling and resource management, programming models, and applications.
Doctoral Showcase
Posters
TP
DescriptionMPI has become the de facto standard for distributed memory computing since its inception in 1994. While the MPI standard has evolved to include new technologies like RDMA, many applications still rely on the original set of MPI operations.
This thesis investigates the current usage of MPI. We note that developers underutilize modern MPI features, as their implementations often are not optimized. On the other hand, as many users rely on the "old" MPI features, MPI implementation developers have no incentive to optimize implementations for the new features. As a consequence, there is no incentive for MPI users to learn the new features, creating a vicious cycle.
To break this cycle, this thesis explores three main approaches:
1) Facilitating correctness checking tool support,
2) Modernizing MPI codes with compiler based approaches, and
3) Exploiting compiler knowledge to further optimize the implementation of modern MPI features.
In order to facilitate the development and improvement of tools aiding with MPI development, this thesis introduces the correctness benchmark MPI-BugBench as a standardized benchmark to evaluate the real-world applicability of such tools. Further, we show that compiler-based automatic modernization methods can encourage early adoption of new MPI features with minimal programmer effort (for example, partitioned operations).
Lastly, compiler knowledge can be utilized in order to further optimize the performance of MPI implementations (for example, persistent operations). The use of compiler knowledge, in particular, enables modernization of existing MPI codes without the need for application developers to rewrite them.
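As a minimal illustration of one such "modern" feature, the sketch below uses persistent point-to-point communication, shown via mpi4py for brevity; the thesis itself targets compiled MPI codes, and this is an assumed example rather than material from the thesis.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
buf = np.zeros(1024, dtype='d')

# Persistent requests are set up once and restarted every iteration,
# avoiding repeated argument processing inside the communication loop.
if rank == 0 and size > 1:
    buf[:] = 42.0
    req = comm.Send_init(buf, dest=1, tag=0)
    for _ in range(10):
        req.Start()
        req.Wait()
elif rank == 1:
    req = comm.Recv_init(buf, source=0, tag=0)
    for _ in range(10):
        req.Start()
        req.Wait()
```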
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionThe solution of sparse symmetric positive definite linear systems is an important computational kernel in large-scale scientific and engineering modeling and simulation. We will solve the linear systems using a direct method, in which a Cholesky factorization of the coefficient matrix is performed using a right-looking approach and the resulting triangular factors are used to compute the solution. Sparse Cholesky factorization is compute intensive. In this work we investigate techniques for reducing the factorization time in sparse Cholesky factorization by offloading some of the dense matrix operations to a GPU.
We will describe the techniques we have considered. We achieved up to 4x speedup compared to the CPU-only version.
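For illustration, the sketch below shows a dense blocked right-looking Cholesky in NumPy (an assumed simplification of the sparse case); the trailing-matrix update marked in the loop is the dense, compute-intensive kernel that is the natural candidate for GPU offload.

```python
import numpy as np

def blocked_cholesky(A, nb=64):
    """Dense right-looking Cholesky; returns lower-triangular L with A = L @ L.T."""
    A = A.copy()
    n = A.shape[0]
    for k in range(0, n, nb):
        e = min(k + nb, n)
        A[k:e, k:e] = np.linalg.cholesky(A[k:e, k:e])                # factor diagonal block
        A[e:, k:e] = np.linalg.solve(A[k:e, k:e], A[e:, k:e].T).T    # panel triangular solve
        # Right-looking trailing-matrix update: the dense GEMM/SYRK-like step
        # that would be offloaded to the GPU.
        A[e:, e:] -= A[e:, k:e] @ A[e:, k:e].T
    return np.tril(A)

rng = np.random.default_rng(2)
M = rng.standard_normal((512, 512))
A = M @ M.T + 512 * np.eye(512)            # SPD test matrix
L = blocked_cholesky(A)
print(np.allclose(L @ L.T, A))
```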
Awards and Award Talks
TP
W
DescriptionGPU Cluster for High Performance Computing. Authors: Zhe Fan, Feng Qiu, Arie Kaufman, Suzanne Yoakum-Stover (Center for Visual Computing and Department of Computer Science, Stony Brook University, Stony Brook, NY, USA). This paper was published at SC04. This seminal work focuses on the design of a large GPU cluster that demonstrates the usability, scalability and excellent price-to-performance ratio of GPU devices for executing High Performance Computing (HPC) applications at scale. The authors assembled a cluster with 32 dual CPU-GPU computation nodes connected by a 1 Gigabit Ethernet switch. They developed a parallel flow simulation using the Lattice Boltzmann Model from Computational Fluid Dynamics. They simulated the dispersion of airborne contaminants in the Times Square area of New York City, achieving impressive speed-up over a standard CPU-only parallel implementation. Building upon this result, the authors discussed several other potential applications of their GPU cluster, such as cellular automata, PDE solvers, and Finite Element Methods. Several papers had previously advocated the use of a single GPU to speed up numerical computations (such as matrix multiplication). But back in 2004, this work was the first to demonstrate the full potential of GPU devices for large-scale applications. The rest is history: 20 years later, GPU devices have become omnipresent in HPC and represent over 95% of the peak performance of the majority of the fastest supercomputers in the world, including the number-one system, Frontier, at Oak Ridge National Laboratory, TN, USA.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionError-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based compressors, GPU-based compressors exhibit substantially higher throughputs, fitting better for today's HPC applications. To overcome the data challenge, GPU-based scientific lossy compressors have been created. Notably, cuSZ has been proposed as the error-bounded compression framework and has become the design base of the subsequent work. A plethora of derived work has been proposed, leading to the discussion of optimality considering data quality, compression ratio, and data processing speed. This paper covers new research directions: the compressibility study, the new encoding study, and the applicability study.
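As background, here is a minimal sketch of the error-bounded quantization step that such compressors build on (illustrative only, not cuSZ's actual pipeline, which adds prediction and encoding around this step).

```python
import numpy as np

eb = 1e-3                                            # absolute error bound
data = np.random.default_rng(3).standard_normal(1_000_000)
codes = np.round(data / (2 * eb)).astype(np.int64)   # integer quantization codes
recon = codes * (2 * eb)                             # decompression
print(f"max point-wise error: {np.max(np.abs(recon - data)):.2e} (bound {eb:.0e})")
```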
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionWe present gpuFastqLZ, an ultra-fast compression methodology for FASTQ sequence data on GPUs. Leveraging the high parallelism capabilities of GPUs, gpuFastqLZ incorporates several optimizations, including a fast algorithm for field separation, a 2-bit encoding scheme for base fields, and the implementation of Illumina binning and GPULZ compression algorithms.
We evaluate gpuFastqLZ on three datasets, across 324 hyperparameter settings, which shows that gpuFastqLZ outperforms existing compressors, achieving up to a 1300x speedup in compression throughput and a 1.1x improvement in compression ratio compared to GZIP, and exceeding the throughput of the state-of-the-art FASTQ compressor GENOZIP by up to 18x.
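A minimal sketch of the 2-bit base encoding idea (an assumed NumPy illustration, not gpuFastqLZ's GPU implementation): A/C/G/T map to two bits each, so four bases pack into one byte before the LZ stage.

```python
import numpy as np

LUT = np.full(256, 255, dtype=np.uint8)     # 255 marks characters outside A/C/G/T
for i, b in enumerate(b"ACGT"):
    LUT[b] = i

def pack_bases(seq: bytes) -> np.ndarray:
    """Pack an A/C/G/T sequence into 2 bits per base (4 bases per byte)."""
    codes = LUT[np.frombuffer(seq, dtype=np.uint8)]
    codes = np.pad(codes, (0, (-len(codes)) % 4))        # pad to a multiple of 4
    c = codes.reshape(-1, 4).astype(np.uint16)
    packed = c[:, 0] | (c[:, 1] << 2) | (c[:, 2] << 4) | (c[:, 3] << 6)
    return packed.astype(np.uint8)

print(pack_bases(b"ACGTACGTGGCC").nbytes, "bytes for 12 bases")   # 3 bytes
```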
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
DescriptionWith the advent of exascale computing, GPU acceleration has become central to the performance of supercomputers. Even at this extreme scale, most scientific and HPC-scale DNN applications underutilize GPU resources. Existing GPU sharing mechanisms can be used to increase utilization, throughput, and energy efficiency. However, naively co-scheduling workflows often does not yield optimal results. Scheduling multiple high-utilization workloads on the same set of GPUs, for example, leads to performance degradation due to high resource contention. In short, GPU sharing must be granularity- and interference-aware to maximize the benefit. We propose a scheduling approach that optimizes workflow scheduling configurations for given system metrics (i.e., throughput and energy efficiency), uses workload profiling data to right-size GPU resources for combinations of HPC workflows, and collocates workflows using existing concurrency mechanisms. We show that choosing the right arrangement of workflows to collocate can increase throughput by as much as 2x and energy efficiency by 1.6x.
Paper
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Middleware and System Software
Performance Evaluation and/or Optimization Tools
Runtime Systems
TP
DescriptionPerformance variance is one of the nasty pitfalls of large-scale heterogeneous systems, which can lead to unexpected and unpredictable performance degradation for parallel programs. Such performance issues typically arise from various random hardware and software faults, making it exceedingly difficult to pinpoint the exact causes of performance variance in specific instances. In this paper, we propose GVARP, a performance variance detection tool for large-scale heterogeneous systems. GVARP employs static analysis to identify the performance-critical parameters of kernel functions. Additionally, GVARP segments the program execution with external library calls and asynchronous kernel operations. Then GVARP constructs a state transfer graph and estimates the workload of each program segment to identify and cluster instances of similar workloads, facilitating the detection of performance variance. Our evaluation results demonstrate that GVARP effectively detects performance variance at a large scale with acceptable overhead and provides intuitive insights to locate the sources of performance variance.
Tutorial
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Portability
TUT
DescriptionSYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single-code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming using completely standard C++.
In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have a cleaner, portable, and more readable code.
This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionThe authors present and evaluate an unplugged activity to introduce parallel computing concepts to undergraduate students. Students in five CS classrooms used a deck of playing cards in small groups to consider how parallelization can improve performance and how the improvement diminishes with increased parallelization. Before and after the activity, students took a short survey about their solution and their ideas about parallelism. The authors carried out this activity in seven courses at five institutions in the 2023-2024 academic year. The results showed that students gained an increased appreciation for parallelization and for this type of activity.
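The diminishing-returns effect the activity demonstrates can be made concrete with Amdahl's law; the following is a small illustrative computation, not part of the described activity.

```python
def amdahl_speedup(serial_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup = 1 / (s + (1 - s) / p)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / workers)

# With even 10% inherently serial work, gains flatten quickly as workers grow.
for p in (1, 2, 4, 8, 16, 32):
    print(p, round(amdahl_speedup(0.1, p), 2))
```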
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionTo be energy efficient and fully utilize modern hardware, it is important to gain as much insight as possible into the performance and efficiency of an application. Especially in the age of artificial intelligence, it becomes increasingly important to track, for example, the total energy consumption of an application. However, gathering this hardware information in a vendor-independent and portable way is far from trivial.
Therefore, we propose the small, easy-to-use hardware sampling library "hws" for Python and C++, which makes it extremely easy to gather hardware information like CPU/GPU utilization, clock frequencies, power and memory consumption, or temperatures for CPUs as well as GPUs from NVIDIA, AMD, and Intel.
We further demonstrate the usefulness of our sampling library on the example of PLSSVM, a (multi-)GPU LS-SVM implementation.
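Since the abstract does not show hws's interface, the sketch below instead shows the kind of vendor-specific query path (NVIDIA's NVML via the pynvml package) that such a portable sampling library abstracts behind one API; it requires an NVIDIA GPU and driver.

```python
import pynvml

# NVIDIA-only sampling via NVML; a portable library hides this behind one interface.
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

util = pynvml.nvmlDeviceGetUtilizationRates(handle)            # percent
power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0      # milliwatts -> watts
temp_c = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
sm_mhz = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)

print(f"util={util.gpu}% mem={util.memory}% power={power_w:.1f}W "
      f"temp={temp_c}C sm_clock={sm_mhz}MHz")
pynvml.nvmlShutdown()
```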
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDeep learning (DL) thrives on data; however, it inherits a major limitation: training and testing datasets must be fully annotated for supervised deep neural networks (DNNs) training. To address this challenge, we introduce HARVEST-2.0, a high-performance computer-vision framework for end-to-end data preprocessing, training, inference, and visualization of computer vision tasks. HARVEST-2.0 utilizes cutting-edge semi-supervised learning algorithms requiring only a small subset of labeled data samples. HARVEST-2.0 provides an intuitive web-based interface, enabling domain experts with no prior DL or HPC knowledge to preprocess data, geotag images, train DL models on HPC systems, perform inference, and visualize the results. Our evaluations demonstrate accuracies within 3% compared to fully supervised training, utilizing less than 80 labeled samples per class, and near-linearly reducing the execution time. HARVEST-2.0 is an effort toward AI democratization, enabling end-users to carry out preprocessing, interactive labeling, inference, and distributed training in a user-friendly and flexible manner.
Birds of a Feather
TP
XO/EX
DescriptionHDF5 is a critical I/O library that has been used ubiquitously in HPC for over 25 years. HDF5 can be adapted to machine learning (ML) workflows, but may require changes to common HPC I/O usage. This BoF will bring HDF5 developers, machine learning experts, and interested community members together to discuss best practices when using HDF5 for ML on HPC systems. We will begin with a panel of experts, focusing on HDF5 features related to AI/ML and discuss using HDF5 in AI/ML applications. We will then invite the audience to share experiences and ask questions.
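As one example of the usage patterns such a discussion covers, here is a minimal h5py sketch (an assumed illustration, not a BoF recommendation) that stores training samples in a chunked dataset so batch-sized reads align with whole chunks.

```python
import h5py
import numpy as np

# Chunk the dataset so that one training batch maps onto whole chunks.
with h5py.File("train.h5", "w") as f:
    dset = f.create_dataset("images", shape=(10000, 32, 32, 3), dtype="f4",
                            chunks=(256, 32, 32, 3))
    dset[:256] = np.random.rand(256, 32, 32, 3).astype("f4")

with h5py.File("train.h5", "r") as f:
    batch = f["images"][0:256]      # one batch, read as a single chunk
    print(batch.shape)              # (256, 32, 32, 3)
```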
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionModern central processing units (CPUs) feature single-instruction, multiple-data pipelines to accelerate compute-intensive floating-point and fixed-point workloads. Traditionally, these pipelines and corresponding instruction set architectures (ISAs) were designed for vector parallelism. The Scalable Matrix Extension (SME) was announced for the Arm architecture in 2021, and Apple's M4 chip is the first to support SME. This paper presents an in-depth study of SME on M4. Our microbenchmarks determine the maximum floating-point and fixed-point throughput of M4's SME acceleration and study the achievable bandwidth for transfers to and from the matrix registers. Furthermore, we used the insights gained to design a just-in-time code generator for SME-based small matrix multiplications. The results presented show that M4's SME support is FP32-centric, with an achievable throughput of over 2.3 FP32 TFLOPS. Our just-in-time generated small matrix multiplication kernels outperform the vendor-optimized BLAS implementation in almost all tested configurations.
Invited Talk
TP
DescriptionLlama 3.2 is the latest edition of Meta’s Generative AI models. Llama represents a collection of openly available models that rival the top AI models in terms of state-of-the-art capabilities in general knowledge, steerability, math, tool use, and multilingual translation. In this talk, I’ll dive into the infrastructure behind the Llama series of models and explore the scaling challenges, innovative solutions and lessons learned, particularly those related to its compute, network, storage, and software ecosystem.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionWe propose a CPU-GPU heterogeneous computing method for solving time-evolution partial differential equation problems many times with guaranteed accuracy, in short time-to-solution and low energy-to-solution. On a single-GH200 node, the proposed method improved the computation speed by 86.4 and 8.67 times compared to the conventional method run only on CPU and only on GPU, respectively. Furthermore, the energy-to-solution was reduced by 32.2-fold (from 9944 J to 309 J) and 7.01-fold (from 2163 J to 309 J) when compared to using only the CPU and GPU, respectively. Using the proposed method on the Alps supercomputer, a 51.6-fold and 6.98-fold speedup was attained when compared to using only the CPU and GPU, respectively, and a high weak scaling efficiency of 94.3% was obtained up to 1,920 compute nodes. These implementations were realized using directive-based parallel programming models while enabling portability, indicating that directives are highly effective in analyses in heterogeneous computing environments.
Awards and Award Talks
TP
DescriptionHigh-level compiler transformations were developed soon after the first compilers. They are used to manipulate compound statements such as loops and if statements in order to vectorize, parallelize, and tile them to benefit from parallelism and enhance locality. Since high-level compiler transformations were at the core of much of Ken Kennedy’s work and have also been the main area of interest of the speaker, this is the natural topic for this talk.
The first part of the talk will contain a brief history of research in high-level compiler transformations. The results obtained by Ken Kennedy and numerous other researchers have been quite influential, to the point of being ubiquitous in today’s compilers and providing a powerful algebra of program transformations. Despite the important accomplishments, there is still much room for improvement, especially in the area of methodologies for the implementation and application of compiler transformations.
The second part will focus on compiler instability, which is a manifestation of the lack of a good methodology for compiler implementation and for deciding when to apply each program transformation. Given different versions of the same program which are obtained from each other by automatic transformations, a compiler is said to be unstable when the object codes generated for these versions have different performances. Instability makes compilers weak optimization tools.
A tool, called Locus, will be described in the final part of the presentation. Locus contains a language to concisely describe a collection of transformations and facilitate the use of search engines. Locus can be used to convert programs into autotuning systems which, by exploring the space of possible versions, help attenuate the impact of instability.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionWith the rise of data driven workloads and increase in model sizes, it has become increasingly critical to build computationally efficient hardware. Drawing inspiration from the brain to harness the computational advantages of such a neural structure, fundamental blocks of neurons and synapses are built to implement neuromorphic systems. We discuss techniques to demonstrate energy-efficient computing on analog configurable platforms to enable real-time systems on hardware. Further, we show tree-based machine learning models through In-Memory Computing (IMC) primitives such as Analog Content Addressable Memories (ACAMs) that are designed with emerging non-volatile technologies such as memristors. Such systems ultimately pave the path to take physical approaches to build large-scale systems for high performance computing in a holistic manner.
Birds of a Feather
TP
XO/EX
DescriptionThe ExaDigiT BoF invites practitioners to exchange knowledge on the development of data center digital twins, enhancing HPC system efficiency and understanding through modeling, simulation, the effective usage of telemetry, and AI. The ExaDigiT community, comprising over 90 members from 10+ global institutions, is developing a framework for creating and operating digital twins, driven by this wide range of participants. This community-driven approach has led to notable publications and practical tools. This BoF offers an opportunity to learn about ExaDigiT and join in the advancement of digital twins in high-performance computing, improving the understanding and operation of supercomputing systems.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Birds of a Feather
TP
XO/EX
DescriptionThe High Performance Software Foundation (HPSF) is a new umbrella project within the Linux Foundation. It aims to build, promote, and advance a portable core software stack for HPC by increasing adoption, lowering barriers to contribution, and supporting HPC open source projects. HPSF was announced at SC23 and officially formed this May at ISC24. At SC24, we will give attendees a chance to interact with HPSF’s governing board, initial members, and software projects. We will discuss HPSF initiatives such as continuous integration (CI), HPSF events, training, and our near-term roadmap. Join us and get involved in HPSF!
Tutorial
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionHigh-Performance Networking technologies are generating a lot of excitement towards building next-generation High-End Computing (HEC) systems for HPC and AI with GPGPUs, accelerators, and Data Center Processing Units (DPUs), and a variety of application workloads. This tutorial will provide an overview of these emerging technologies, their architectural features, current market standing, and suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, and Omni-Path interconnects. An in-depth overview of the architectural features of these interconnects will be presented with associated hands-on exercises. It will be followed by an overview of the emerging NVLink, NVSwitch, EFA, and Slingshot architectures. We will then present advanced features of commodity high-performance networks that enable performance and scalability. We will then provide an overview of enhanced offload-capable network adapters like DPUs/IPUs (Smart NICs), their capabilities and features. Next, an overview of software stacks for high-performance networks like Open Fabrics Verbs, LibFabrics, and UCX, comparing the performance of these stacks, will be given. Next, challenges in designing MPI libraries for these interconnects, solutions, and sample performance numbers will be presented. Finally, we will have a set of additional hands-on sessions to understand the performance of networking technologies.
HPC Creates Plenary
TP
W
TUT
XO/EX
DescriptionThe SC24 Opening Plenary, moderated by award-winning journalist Miles O’Brien, will bring together a diverse set of thought leaders to explore how high-performance computing (HPC) is shaping the future. Under the theme “HPC Creates,” the panel will delve into four arenas where HPC has made transformative impacts: scientific discovery, engineering for societal benefit, arts and entertainment, and workforce development.
Panelists will share insights on how HPC, along with artificial intelligence (AI), is driving advances in astrophysics, biomedical simulations, and sustainable engineering while revolutionizing creative industries such as film and music. In addition, the panel will discuss how the next generation of professionals is being equipped to tackle future challenges through HPC-driven education and training.
Attendees can expect a dynamic and thought-provoking conversation about HPC’s pivotal role in creating innovations that enhance our understanding of the universe, improve quality of life, and inspire new generations. This session sets the tone for SC24 by spotlighting HPC’s far-reaching influence and its potential to drive future breakthroughs.
Doctoral Showcase
Posters
TP
DescriptionThis doctoral showcase highlights three pivotal works conducted during my PhD that collectively advance the field of high-performance computing (HPC) resilience analysis using large language models (LLMs).
The first work introduces HAPPA, a modular platform for HPC Application Resilience Analysis. HAPPA integrates LLMs to understand long code sequences, employing innovative code representation techniques to predict resilience accurately. Through the DARE dataset, HAPPA demonstrates superior predictive accuracy over existing models, achieving a mean squared error (MSE) of 0.078 in Silent Data Corruption (SDC) prediction, significantly outperforming the PARIS model.
Building on this foundation, the second work investigates the resilience of loops in HPC programs through a semantic approach. By analyzing the computational patterns known as the 13 dwarfs of parallelism, this study quantifies the SDC rates for each pattern. Utilizing LLMs with prompt engineering, the research identifies loop semantics, providing insights into which loops are more error-prone and enhancing the development of resilient HPC applications.
Expanding the scope further, the third work evaluates the capabilities of LLMs in comprehending the syntax and semantics of Intermediate Representation (IR) code. The study conducts a comprehensive analysis using models like GPT-4o, GPT-3.5, and CodeLlama. By performing tasks such as decompiling IR code, generating CFGs, and simulating IR code execution, the research provides insights into the effectiveness of LLMs in handling low-level code analysis and their potential applications in program analysis.
These studies collectively demonstrate the potential of LLMs in enhancing the resilience of HPC applications through innovative analysis techniques and predictive modeling.
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionThis study proposes a high-performance and reliable eigensolver via mixed-precision arithmetic between ordinary and highly-accurate precisions. Eigenvalue decomposition is ubiquitous in simulations. Various eigensolvers for computing approximations have been developed thus far. If eigenvalues are narrowly clustered, the computation of eigenvectors may be ill-posed. Thus, the computed eigenpairs may not be sufficiently accurate and lack reliability. In this study, we introduce mixed-precision iterative refinement methods to improve the accuracy of eigenvectors obtained using numerical methods. This approach contributes to obtaining sufficiently accurate results without arbitrary precision eigensolvers. We construct a high-performance and reliable eigensolver by combining the iterative refinement methods and EigenExa, a modern high-performance solver for large-scale and highly parallel computations. Numerical experiment results demonstrate the accuracy of the results and performance benchmark of the proposed approach.
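To make the mixed-precision idea concrete, the following is a minimal NumPy sketch of one refinement step for a single symmetric eigenpair (a Rayleigh quotient update plus one inverse-iteration step carried out in double precision after an ordinary-precision solve). It is not the EigenExa-based solver described above, and the test matrix and tolerances are illustrative assumptions.

```python
import numpy as np

def refine_eigenpair(A, lam, v, shift_eps=1e-12):
    """One mixed-precision refinement step for a symmetric eigenpair.

    A is held in float64; (lam, v) come from a lower-precision solve.
    lam is refined via the Rayleigh quotient, v via one inverse-iteration step.
    """
    v = v.astype(np.float64)
    v /= np.linalg.norm(v)
    lam = float(v @ (A @ v))                        # Rayleigh quotient in double
    M = A - (lam + shift_eps) * np.eye(A.shape[0])  # small shift avoids exact singularity
    w = np.linalg.solve(M, v)                       # one shifted inverse-iteration step
    v = w / np.linalg.norm(w)
    return lam, v

rng = np.random.default_rng(0)
A = rng.standard_normal((200, 200))
A = (A + A.T) / 2                                        # symmetric test matrix (illustrative)
vals32, vecs32 = np.linalg.eigh(A.astype(np.float32))    # ordinary-precision solve
lam, v = refine_eigenpair(A, vals32[0], vecs32[:, 0])
print("refined residual:", np.linalg.norm(A @ v - lam * v))
```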
Tutorial
Architecture
Emerging Technologies
I/O, Storage, Archive
Scalable Data Mining
TUT
DescriptionRecently we have seen a change in the diversity of applications utilizing high-performance computing (HPC) from primarily computational simulation approaches, to a more varied application mix including machine learning and data analytics. With this diversification in workloads, there has also been a diversification in I/O patterns; the movements in, and requirements on, data storage and access. Data storage technologies in HPC have long been optimized for large scale bulk operations focused on high-bandwidth with relatively low volumes of metadata operations. However, many applications now exhibit non-optimal I/O patterns for large scale parallel filesystems, with large amounts of small I/O operations, non-contiguous data access, and increases in read as well as write I/O loads.
Parallel filesystems, such as Lustre and Storage Scale, have been optimized and extended to provide higher metadata performance and to better handle small I/O operations. However, the underlying approach of POSIX-like I/O, with block-sized read and write operations and file-level data storage, sets limitations on the overall performance and functionality that such approaches can achieve. This tutorial will educate attendees in the design and usage of object stores, as alternatives to filesystems, using Ceph and DAOS as examples, through hands-on exercises and lecture sessions.
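As a taste of the object-store model covered in the hands-on sessions, the sketch below uses the S3-compatible interface that Ceph exposes through its RADOS Gateway, accessed with boto3; the endpoint, credentials, and bucket name are hypothetical, and DAOS and native librados have their own APIs not shown here.

```python
import boto3  # S3-compatible client; Ceph's RADOS Gateway speaks this protocol

# Hypothetical endpoint and credentials for an on-premises object store.
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.example.org:8080",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

bucket = "simulation-output"          # hypothetical bucket name
s3.create_bucket(Bucket=bucket)

# Objects are written and read whole by key: no POSIX paths, offsets, or locks.
s3.put_object(Bucket=bucket, Key="run-042/checkpoint.bin", Body=b"\x00" * 1024)
obj = s3.get_object(Bucket=bucket, Key="run-042/checkpoint.bin")
data = obj["Body"].read()
print(len(data), "bytes retrieved")
```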
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionSimulating quantum systems is a promising application for quantum computing. However, quantum computers of sufficient scale and quality to study scientifically interesting systems beyond the limit of classical computers do not yet exist. It will be necessary in the near term to efficiently partition and distribute quantum workloads in an HPC environment. Circuit knitting provides a path to simulating large systems on quantum devices of a limited size by reconstructing observables from smaller sub-circuits, but this reconstruction comes at an exponential cost. We present an adaptive circuit knitting method which finds efficient partitions of quantum circuits by discovering regions of minimal entanglement between subsystems. We apply this method to simulating the dynamics of strongly-disordered quantum spin chains, and show reductions in the cost of circuit knitting of one to two orders of magnitude.
Paper
Accelerators
Algorithms
Data Compression
Linear Algebra
Tensors
TP
DescriptionHigh-performance sparse matrix–matrix (SpMM) multiplication is paramount for science and industry, as the ever-increasing sizes of data prohibit using dense data structures. Yet, existing hardware, such as Tensor Cores (TC), is ill-suited for SpMM, as it imposes strict constraints on data structures that cannot be met by unstructured sparsity found in many applications. To address this, we introduce (S)parse (Ma)trix Matrix (T)ensor Core-accelerated (SMaT): a novel SpMM library that utilizes TCs for unstructured sparse matrices. Our block-sparse library leverages the low-level CUDA MMA (matrix-matrix-accumulate) API, maximizing the performance offered by modern GPUs. Algorithmic optimizations such as sparse matrix permutation, further improve performance by minimizing the number of non-zero blocks. The evaluation on NVIDIA A100 shows that SMaT outperforms SotA libraries (DASP, cuSPARSE, and Magicube) by up to 125x (on average 2.6x). SMaT can be used to accelerate many workloads in scientific computing, large-model training, inference, and others.
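The block-sparse view that tensor-core kernels operate on can be sketched on the CPU with SciPy's BSR format: the matrix is tiled into small dense blocks and only nonzero blocks are stored and multiplied. This is only an illustration of the data layout, not SMaT's CUDA MMA implementation, and the matrix size, density, and block size are arbitrary assumptions.

```python
import numpy as np
import scipy.sparse as sp

n, density, bs = 1024, 0.01, 16            # bs: tile size a tensor-core fragment might use
A = sp.random(n, n, density=density, format="csr", random_state=0)
B = np.random.default_rng(0).standard_normal((n, n))

# Re-express the unstructured sparse matrix in block-sparse (BSR) form: it is
# tiled into bs x bs blocks and only blocks containing nonzeros are stored,
# each as a small dense tile that dense hardware (e.g. MMA units) can consume.
A_bsr = A.tobsr(blocksize=(bs, bs))
stored_blocks = len(A_bsr.data)            # data holds one bs x bs tile per nonzero block
total_blocks = (n // bs) ** 2
print(f"stored blocks: {stored_blocks} / {total_blocks}")

C = A_bsr @ B                              # block-sparse times dense
print("result shape:", C.shape)
```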
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionWe present a performance study of geometric multigrid (GMG) on NVIDIA, AMD, and Intel GPU-accelerated supercomputers. The approach employs fine-grain data blocking in BrickLib, which reduces data movement in the GMG V-cycle by optimizing storage order for stencil access and communication.
Our GMG attains 73% in a peak performance portability metric, and 87% parallel efficiency when weak scaling to 512 GPUs on all three GPU-accelerated supercomputers.
Analysis shows stencil performance and MPI communication are well correlated with a traditional linear model from which we can extract empirical latency, overhead, bandwidth, and throughput for comparison to theoretical GPU and network limits.
Observations show NVIDIA GPUs provide the lowest overhead and highest throughput per process with AMD and Intel GPUs delivering comparable performance.
Conversely, despite all three platforms employing the same Slingshot network, sustained bandwidth and latency vary widely when each GPU is given a dedicated NIC.
Paper
Accelerators
Algorithms
Data Compression
I/O, Storage, Archive
Performance Optimization
TP
DescriptionError-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. Compared to CPU-based compressors, GPU-based compressors exhibit substantially higher throughputs, fitting better for today's HPC applications. However, the critical limitations of existing GPU-based compressors are their low compression ratios and qualities, severely restricting their applicability. To overcome these, we introduce a novel GPU-based error-bounded scientific lossy compressor named cuSZ-I, with the following contributions: (1) A novel GPU-optimized interpolation-based prediction method significantly improves the compression ratio and decompression data quality. (2) The Huffman encoding module in cuSZ-I is optimized for better efficiency. (3) cuSZ-I is the first to integrate the NVIDIA Bitcomp-lossless as an additional compression-ratio-enhancing module. Evaluations show that cuSZ-I significantly outperforms other latest GPU-based lossy compressors in compression ratio under the same error bound (hence, the desired quality), showcasing a 476% advantage over the second-best. This leads to cuSZ-I's optimized performance in several real-world use cases.
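To make the prediction-plus-quantization idea concrete, here is a minimal 1D NumPy sketch of SZ-style error-bounded compression. It uses a simple previous-value (Lorenzo) predictor rather than cuSZ-I's GPU-optimized multi-level interpolation, and omits the Huffman and Bitcomp stages; the error bound and test signal are arbitrary assumptions.

```python
import numpy as np

def compress_1d(x, eb):
    """SZ-style 1D sketch: predict each value from the previous *reconstructed*
    value and quantize the residual with step 2*eb, so |x[i] - recon[i]| <= eb."""
    codes = np.empty(len(x), dtype=np.int64)
    recon = np.empty(len(x), dtype=np.float64)
    prev = 0.0
    for i, xi in enumerate(x):
        q = int(round((xi - prev) / (2 * eb)))     # integer quantization code
        prev = prev + q * (2 * eb)                 # the decompressor reproduces this exactly
        codes[i] = q
        recon[i] = prev
    return codes, recon

rng = np.random.default_rng(1)
x = np.cumsum(rng.standard_normal(100_000))        # smooth-ish synthetic field
codes, recon = compress_1d(x, eb=1e-2)
print("max pointwise error:", np.abs(x - recon).max())
print("distinct codes:", np.unique(codes).size)    # small alphabet -> entropy-codes well
```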
Panel
Algorithms
Open Problems
TP
W
TUT
XO/EX
DescriptionThis panel speculates on open problems that can, should, or will (if the panelists are sufficiently brave!) set an agenda for HPC research over the next 50 years. The panel's concept is inspired by David Hilbert's 23 problems, several of which came to define 20th-century mathematical research. Panelists will propose, discuss, and debate what such a "problem set" for HPC in the 21st century could be, extrapolating from various contemporary issues, including the end of Moore's Law, the dominance of AI, what programs and programming are and will be, and the continued pervasiveness of computing in the fabric of modern science and engineering. The panelists are highly respected academic researchers, including those with industrial experience and those who, quite literally, wrote "the" textbooks on various aspects of parallel and high-performance computing.
Paper
Accelerators
Compilers
Heterogeneous Computing
Performance Evaluation and/or Optimization Tools
TP
DescriptionData races are egregious concurrency bugs that are especially problematic in performance-oriented GPU codes where large thread counts and multiple shared memory regions tend to exacerbate them. In this work, we present a new dynamic data-race checker called HiRace, whose key novelty is an innovative state machine designed to capitalize on the bulk-synchronous hierarchical GPU programming model. This state machine condenses an arbitrarily long access history into a constant-size state. We evaluate HiRace on a large, calibrated data-race benchmark suite. In over 3,500 studied executions of 580 CUDA kernels, 346 of which contain data races, we found HiRace to detect races missed by other tools without raising false alarms and to be more than 10 times faster on average than the current state of the art with half the memory overhead.
Exhibits
Flash Session
TP
XO/EX
DescriptionRapid growth of cancer research in recent decades has made data discovery and management difficult for many research labs. Cancer researchers are looking to technologies such as machine learning and artificial intelligence as analyses are increasingly becoming multidisciplinary. For the National Cancer Institute, GDIT supports the Cancer Research Data Commons in its mission to democratise access to large cancer research resources. GDIT has made accessible massive genomic data sets, from blockbuster cancer projects such as TCGA and CPTAC, through Google Cloud tooling such as BigQuery and on demand virtual machines. In this discussion, learn how these cloud technologies have enabled analyses that are uniquely inexpensive and rapid even when scaled to petabyte sized inputs. One supported research project calculated 6.6 billion correlations in 2.5 hours with a total cost of about $1. GDIT also supports the Imaging Data Commons in extracting quantitative data from the large existing medical imaging datasets such as MRI and CT scans through automated annotation approaches.
Invited Talk
TP
DescriptionAdvanced data and computing systems are vital to Linac Coherent Light Source (LCLS) operations, data interpretation and overall scientific productivity. The transition to MHz-era operation marks a fundamental change in scale that requires new infrastructure and architectures to link LCLS to the required scale of computing needed for scientific interpretation. The LCLS-II Data System leverages access to high performance compute to reduce time to science, improve the efficiency and quality of acquired data sets, and solve exascale problems that cannot be solved by other means. Feature extracted information generated in the data analysis pipeline — at the edge, local compute, or remote HPC resources — can be used to steer experiments and inform user decisions during beam time. AI/ML presents new opportunities to rapidly analyze large datasets and direct experiments, but creates its own challenges in scaling, adaptability, complexity, and trustworthiness. Collectively these advances are poised to significantly enhance experimental output and enable groundbreaking scientific exploration. We discuss the overall challenges facing LCLS, and explore the opportunities afforded by fully leveraging the remote HPC resources of the DOE complex.
Panel
Artificial Intelligence/Machine Learning
TP
DescriptionThe impact of the democratization and sophistication of AI technologies is being felt across the industry, including traditional HPC applications where we are seeing the convergence of AI and HPC. This convergence of AI and HPC has created a virtuous cycle of innovation where AI is accelerating HPC innovation and vice versa. AI technologies and specialized AI infrastructure, such as GPU, and cloud-based supercomputing services are fostering innovative new ways to solve traditional HPC problems. Conversely, HPC infrastructure and workloads such as simulations are shaping AI infrastructure and technology development. In this highly interactive panel, we invite experts from academia and multiple industry segments to share their experiences with innovation accelerated by converged HPC and AI, discuss adoption and scale challenges, and share their views on its future.
Students@SC
TP
W
TUT
XO/EX
DescriptionIn today's rapidly evolving technological landscape, the field of high-performance computing (HPC) is creating transformative opportunities across disciplines. This talk will explore the unique career pathways available to those pursuing graduate studies in HPC, based on my personal journey. I will share insights into how my experiences in grad school, research, and practical HPC applications shaped my career trajectory and how students can leverage similar opportunities to build their own futures in this field.
Additionally, the talk will feature highlights from the Art of HPC program, showcasing innovative alternative careers within HPC such as visualization, interdisciplinary research, and creative computing. These examples illustrate the diverse ways HPC professionals can apply their skills beyond traditional roles, expanding into areas like art, health, and science communication. Whether you're interested in research, industry, or other interdisciplinary fields, HPC offers a platform to create and thrive.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionRecent advancements in Machine Learning (ML) have substantially improved its predictive and computational abilities, offering promising opportunities for surrogate modeling in scientific applications. By accurately approximating complex functions with low computational cost, ML-based surrogates can accelerate scientific applications by replacing computationally intensive components with faster model inference. However, integrating ML models into these applications remains a significant challenge, hindering the widespread adoption of ML surrogates as an approximation technique in modern scientific computing.
We propose an easy-to-use directive-based programming model that enables developers to seamlessly describe the use of ML models in scientific applications. The runtime support, as instructed by the programming model, performs data assimilation using the original algorithm and can replace the algorithm with model inference. Our evaluation across five benchmarks, testing over 5000 ML models, shows up to 83.6x speed improvements with minimal accuracy loss (as low as 0.01 RMSE).
Invited Talk
TP
DescriptionSupercomputers are powerful scientific instruments, continuously delivering impressive breakthroughs in fields as diverse as materials science, medicine, climate research, and fundamental sciences. We, the HPC community, tend to focus our discussions at workshops and conferences on our own hardware and software achievements, often forgetting the ultimate purpose of our whole endeavor, which is to build and operate systems that advance science and engineering, and ultimately contribute to a better society. It is therefore worth taking a step back to look at some of the major scientific achievements that scientists around the world have been able to achieve thanks to the use of our beloved HPC infrastructures. This talk will showcase a selection of major scientific or engineering highlights recently published by users of current HPC systems around the world, demonstrating by example the major impact of HPC on science and society.
Birds of a Feather
TP
XO/EX
DescriptionOne in five people will be diagnosed with cancer in their lifetime, with nearly 20 million new cases diagnosed worldwide in 2022. While HPC has played a role in cancer research and clinical applications for several decades, recent advancements in HPC, AI, big data, predictive oncology, and digital twins have created a dramatic increase in new opportunities to improve lives affected by cancer. This BoF will bring together attendees who are interested in learning about past and future efforts across government, academia and industry employing HPC to gain new insights and reduce the burden of cancer for all.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF delves into the evolving landscape of converged computing, where high-performance computing (HPC) and cloud resources are being integrated to meet the complex demands of modern scientific and technical computing. The increasing role of AI/ML is accelerating the need for convergence, but additional constraints such as sovereignty, the skill set of staff, and data center capacity limitations require consideration. In this BoF we will look at different personas: supercomputing centers, cloud providers, research computing, and industrial users. We aim to provide an overview of converged computing and to identify key concerns and opportunities in this rapidly changing field.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionThis paper proposes a monitoring system that emails feedback to users about submitted jobs and has the capability to stop and resubmit jobs to a batch scheduler. The proposed system has been implemented for a small supercomputing environment with a mix of high-performance and high-throughput computing jobs. User feedback includes alerts for over- and under-utilization of CPU and physical memory.
This paper also discusses how predefined system thresholds were chosen, and proposes three algorithms: one algorithm for the proposed monitoring system and two algorithms for the prediction of CPU and physical memory utilization. The latter algorithms are based on users' input of the identification string (job ID) of a similar job that should have finished execution without errors. Lastly, a git repository is shared to make the code accessible for review.
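As a rough illustration of the kind of per-job check such a monitor can perform (the paper's own thresholds and algorithms are not reproduced here), the sketch below queries Slurm accounting with sacct for a finished job's allocation and usage; the job ID and the utilization threshold are hypothetical.

```python
import subprocess

FIELDS = ["JobID", "AllocCPUS", "TotalCPU", "Elapsed", "MaxRSS", "ReqMem"]

def job_usage(job_id):
    """Query Slurm accounting for a finished job's allocation and usage records."""
    out = subprocess.run(
        ["sacct", "-j", job_id, "--noheader", "--parsable2",
         "--format=" + ",".join(FIELDS)],
        capture_output=True, text=True, check=True,
    ).stdout
    # One record per job step; MaxRSS typically appears on step records (e.g. .batch).
    return [dict(zip(FIELDS, line.split("|"))) for line in out.splitlines()]

CPU_UNDERUSE_THRESHOLD = 0.5   # hypothetical; the paper derives system-specific values

for rec in job_usage("123456"):   # hypothetical job ID
    print(f"{rec['JobID']}: {rec['AllocCPUS']} CPUs, CPU time {rec['TotalCPU']} "
          f"over wall time {rec['Elapsed']}, peak RSS {rec['MaxRSS']}")
# A real monitor would parse the time strings, compute
# TotalCPU / (AllocCPUS * Elapsed), compare it against CPU_UNDERUSE_THRESHOLD,
# and email the user when a job is under- or over-utilizing its allocation.
```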
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionThe HPC Carpentry project aims to develop workshop training materials to empower novices to effectively leverage HPC to solve scientific problems in their domains.
Modeled after The Carpentries training programs, the project’s goal is to develop foundational HPC skills. The workshop setting provides learners with hands-on experience, and provides sufficient vocabulary to make subsequent self-study more effective.
In a major milestone, the steering committee is leading HPC Carpentry through the formal incubation process to become an official Carpentries lesson program alongside the existing Carpentry programs.
Our most recent focus has been developing materials for a user workshop. We begin with an introduction to the command-line shell, followed by our Introduction to HPC lesson, covering resource management. We end with a lesson on HPC workflow management.
Future plans include building a developer workshop, reconnecting with disparate contributors, and engaging with the community through regular open conference calls and outreach.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThis image was created utilising publicly available photography from the SC21, SC22, and SC23 workshops and networking events. Images were digitally cut and arranged utilising Canva with additional graphics support made in Adobe Illustrator.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionThe DOE Office of Science (SC) mission is to deliver the scientific discoveries and major scientific tools that transform our understanding of nature and advance the energy, economic, and national security of the United States. The Advanced Scientific Computing Research program's goal is to deliver world-leading computational and networking capabilities that extend the frontiers of science and technology. Currently, the research program has indicated interest in advancing research and development efforts in three emerging areas: energy-efficient computing, analog computing, and neuromorphic computing. As progress is achieved in these areas, it will be important to consider cybersecurity and integrity challenges, since that progress involves coordination across potentially highly heterogeneous, interoperating, and co-dependent components of future computing systems such as hardware, algorithms, system software, programming models, data management, and applications. This presentation will highlight potential basic research opportunities for emerging computing technologies and cybersecurity challenges for emerging high performance computing applications.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionThe rapid advancement of new HPC technologies has facilitated the convergence of artificial intelligence (AI), big data analytics, and HPC platforms to solve complex, large-scale, real-time analytics and applications for scientific and non-scientific fields. Given the dynamism of today’s computational environments, the traditional classroom approach for HPC pedagogy does not fit all needs required at various levels of education. The traditional computer science education, which typically mentions briefly the concept of threading, is no longer apt for preparing the future HPC workforce. Additionally, many K-12 and post-college personnel are encountering problems or are involved in projects where high performance computing can make a useful contribution. In recent years, there have been several pedagogical and andragogical approaches initiated by the HPC community to increase the instructional effectiveness for bridging the gaps in HPC knowledge and skills. This work aims to share experiences with educational challenges and opportunities that stimulate the acquisition of high performance computing skills.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWith large HPC systems, users will often jockey for better queue times to get quicker results. Unfortunately, getting accurate estimations of queue times requires understanding complex and abundant data collected from myriad HPC system loggers. To aid with this, researchers are exploring machine learning to shortcut the analysis of these factors and give discrete predictions. Unfortunately, these models are imperfect, expressing varying degrees of accuracy. This imperfection must be conveyed to users in the form of uncertainty quantification. Thus, to provide users with a better understanding of queue wait times on NREL's Eagle HPC system, we developed a visualization that simplifies this complex data and aids decision making. This visualization summarizes uncertainty information associated with a user's specific queue time prediction and places it into the larger context of historical data, encoding job submission variables that users can change to show the impact of their choices on queue wait time.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionWhile RISC-V represents a new approach to ISA standards, it also presents unique opportunities for innovation. This session will explain how RISC-V thinks about HPC, and then explore some ideas for how HPC should think about RISC-V. Is RISC-V just another platform, or can it be more?
Birds of a Feather
TP
XO/EX
DescriptionGovernment agencies, industry and academia are demanding a new generation of tools to efficiently solve large-scale analytics problems in a variety of business, scientific and national security applications. This BoF gathers the community developing high-performance frameworks and workflows for large-scale graph analytics to survey current approaches, identify new challenges and opportunities, and discuss interoperability of emerging infrastructures. A central goal is developing requirements and recommendations for future tools. As in previous editions, this BoF will explore, compare, and contrast conventional implementations as well as algebraic approaches, inviting the GraphBLAS community to discuss its state and evolution.
Exhibits
SCinet
TP
XO/EX
Exhibits
SCinet
TP
XO/EX
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionQuantum accelerated supercomputing, or the integration of quantum computers and quantum emulators with classical supercomputers, allows domain scientists to address complex problems across various disciplines. To design and implement effective hybrid algorithms at scale, practitioners require not only an understanding of quantum computing (QC) and the problem domains, but also the High Performance Computing (HPC) skills to optimize quantum-classical workflows. Current QC curriculum largely overlooks the practical integration and scaling of hybrid algorithms, and often university quantum computing courses do not attract students who are most familiar with the problem domains.
We describe the pedagogical motivation for a module that addresses both of these shortcomings. We survey existing HPC and QC educational literature to create an integrated HPC+QC competency. This informs the design of an educational module, pilot-tested in a master's level applied machine learning course, which introduces students to QC concepts through a hybrid neural network example.
Panel
Quantum Computing
TP
DescriptionQuantum computing is a quickly growing accelerator technology with more and more computer centers investigating the use of such systems as part of their long-term HPC strategy. However, while the potential computational power of quantum computers for targeted problems is generally known, concrete application use cases that specifically leverage both HPC and QC resources together are still a rarity. For this panel, we have invited five experts from different areas, backgrounds and regions. All are investigating the feasibility and opportunities found in such hybrid applications. We will discuss the characteristics, performance and scaling prospects of their application scenarios as well as the hardware advances they require before they can be realized. With this panel, we aim to provide a realistic picture of the field of quantum acceleration in HPC, to discuss realistic timelines and goals, and to offer the audience a well-founded picture based on firsthand experiences.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionAs HPC systems go beyond Exascale, the connectivity between various CPUs, GPUs and other accelerators has to be carefully managed and orchestrated for extracting optimal performance from respective systems. Performant network operations also need to address the scale and heterogeneity of compute elements as well as workloads (simulations, AI/ML etc.) in HPC systems. In this talk, I will share some reasons why Network Digital Twins have become a necessity for efficient HPCOps. We will also discuss requirements and challenges for building a HPC Network Digital Twin.
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
DescriptionStatic RAM FPGAs, with their reconfigurability, yield options to accomplish instruction set metamorphism or dynamic creation of accelerators/coprocessors as needed. In addition, the abundance of matrix multiplications in many HPC problems makes it possible to utilize Machine Learning (ML) support on FPGAs to achieve customized dynamic reconfiguration. Many HPC problems can be solved by Processing-in-Memory, and hence BlockRAMs enhanced with computing can be utilized to accelerate HPC applications. In this talk, I will describe some emerging avenues for reconfigurable HPC considering ML support in FPGAs.
Birds of a Feather
TP
XO/EX
DescriptionOutreach is key to the sustainable growth of fields like HPC, which few people encounter in their daily lives. Effective science communication is essential to convey complex ideas, highlight research, and adapt our messages to resonate with diverse audiences, but many lack training in this crucial area. This session aims to start changing that by sharing experiences and gathering strategies for improvement. We will start with lightning talks before breakout group discussions on engaging with different audiences. Want to master science communication? Start practicing with your community! We invite anyone interested in outreach or science communication to participate.
Exhibits
Flash Session
TP
XO/EX
DescriptionHPC in the public cloud is a great option for many clients for flexibility, technology integration, and job bursting. Yet for some, there are big challenges around cost, security, and return on investment. How can they do more with less funding?
In this flash talk, learn how GDIT’s HPC Platform as a Service offers advantages in cost, security, and reliability compared to traditional on-premises HPC and cloud-based HPC. See operational and pricing differences from traditional or cloud HPC, and notional ROM pricing comparing four configurations. GDIT’s HPC PaaS could reduce costs by 66% to 85%, depending on the configuration, compared to similar public cloud HPC resources. Client data is protected with a zero trust architecture and data encryption, and the platform supports two-factor authentication. HPC services are greatly improved with a robust, highly reliable all-flash filesystem and node health checks before a batch job is started.
GDIT has deployed HPC clusters and services for many large federal agencies for decades. The GDIT HPC PaaS is dedicated to NOAA’s Weather and Climate Operational Supercomputing System with some of the most demanding operational requirements: 99% uptime and 99% on time generation of critical National Weather Service forecast products.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionSupercomputing has been a discipline for at least four decades, but why has HPC security become such a hot topic the past several years? Just seven years ago the first HPC security papers were accepted and presented at SC17, yet efforts at NIST and elsewhere gained little traction. In this talk we explore some of the reasons why security of HPC systems has received so much more attention recently. We will discuss the expansion of scientific computing into new disciplines, changes in enterprise cybersecurity policy driving scientific computing away from general purpose devices to dedicated research computing assets, and the expansion of big data beyond what can be supported by single researcher workstations. We will also discuss how this expansion has increased the variety of codebases, languages, computational frameworks, and parallel computing models bringing new security challenges with them to the HPC space. We will explore why HPC is increasingly becoming a target due to its attractiveness to cybercriminals for cryptocurrency mining, the fact HPC centers host an increasing volume of non-public data, and how insider threat concerns are changing with the expanded userbase. Finally we will look at why HPC security is different than enterprise security, discussing why existing security research and common practices are not automatically usable for HPC operators, and how the feedback and incentive loop for vendors is broken.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionHPC systems used for research run a wide variety of software and workflows. This software is often written or modified by users to meet the needs of their research projects, and rarely is built with security in mind. In this paper we explore several of the key techniques that MIT Lincoln Laboratory Supercomputing Center has deployed on its systems to manage the security implications of these workflows by providing enforced separation for processes, filesystem access, network traffic, and accelerators to make every user feel like they are running on a personal HPC.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionHPC-ED aims to improve discovery and sharing of CyberTraining resources through the combination of the HPC-ED CyberTraining Catalog, an effective and flexible interface, thoughtful metadata design, and active community participation. HPC-ED encourages authors to share training resource information while retaining ownership and allows organizations to enrich their local portals with shared materials. By basing the architecture on an established, flexible framework, HPC-ED can provide a range of solutions people and organizations can employ for sharing and discovering materials. In this paper we describe the initial pilot phase of the project, where we prototyped the HPC-ED catalog, established an initial metadata set, provided documentation, and began using the system to share and discover materials. Based on community feedback we are now planning an implementation phase focused on evolving our architecture and tools to meet community needs and feedback through improved interfaces and tools designed to address a range of preferences.
Workshop
State of the Practice
System Administration
W
DescriptionCloud platforms are increasingly being used to run HPC workloads. Major cloud providers offer a wide variety of virtual machine (VM) types, enabling users to find the optimal balance between performance and cost. However, this extensive selection of VM types can also present challenges, as users must decide not only which VM types to use but also how many nodes are required for a given workload. Although benchmarking data is available for well-known applications from major cloud providers, the choice of resources is also influenced by the specifics of the user's application input. This paper presents the vision and current implementation of HPCAdvisor, a tool designed to assist users in defining their HPC clusters in the cloud. It considers the application's input and utilizes a major cloud provider as a use case for its back-end component.
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionDue to modern hardware's constantly growing energy demands, it is important to consider energy efficiency and power consumption. Especially in the age of AI, where a massive amount of computational power is necessary, energy consumption and the costs involved can become a significant problem. However, gathering this power information in a vendor-independent and portable way is far from trivial.
Therefore, we propose hws, a hardware sampling library for Python and C++ that makes it extremely easy to gather hardware information such as the current power draw or total power consumption, as well as other metrics like clock frequencies, memory consumption, or utilization, for CPUs and GPUs from NVIDIA, AMD, and Intel. In a case study, we use our library to analyze three common hyperparameter optimization algorithms for two neural network architectures and one GPU-accelerated SVM implementation.
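The abstract does not show hws's own API, so the sketch below illustrates the same kind of sampling done directly against a single vendor interface, NVIDIA's NVML via the pynvml bindings; a portable sampler such as hws wraps calls like these (and their AMD and Intel counterparts) behind one interface. The sampling rate and duration are arbitrary.

```python
import time
import pynvml   # NVIDIA Management Library bindings; one of the vendor-specific
                # interfaces a portable sampler has to wrap (AMD and Intel differ)

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

samples = []
for _ in range(50):                                           # ~10 Hz for 5 seconds
    power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0 # NVML reports milliwatts
    util    = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
    mem_mib = pynvml.nvmlDeviceGetMemoryInfo(handle).used / 2**20
    samples.append((power_w, util, mem_mib))
    time.sleep(0.1)

pynvml.nvmlShutdown()
avg_power = sum(s[0] for s in samples) / len(samples)
print(f"average power draw: {avg_power:.1f} W over {len(samples)} samples")
```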
Paper
Architecture
Codesign
Data Movement and Memory
Energy Efficiency
Green Computing
Linear Algebra
TP
DescriptionIntegrating hybrid memories with heterogeneous processors could leverage heterogeneity in both compute and memory domains for better system efficiency. To ensure performance isolation, we introduce Hydrogen, a novel hardware architecture to optimize the allocation of hybrid memory resources to heterogeneous CPU-GPU systems. Hydrogen supports efficient capacity and bandwidth partitioning between CPUs and GPUs in both memory tiers. With the key observation that CPUs and GPUs exhibit distinct preferences to memory capacity and bandwidth, Hydrogen enables decoupled capacity and bandwidth allocation between CPUs and GPUs with flexible partitioning ratios. It also throttles overly excessive data migration from GPUs with a token-based mechanism. To effectively explore the large, multi-dimensional design space and support dynamically varying application behaviors, Hydrogen uses epoch-based online search for optimized configurations, and incorporates lightweight reconfiguration with reduced data movements.
Combining these novel techniques, Hydrogen significantly outperforms existing designs by 1.16× on average, and up to 1.31×.
Paper
Data Compression
Data Movement and Memory
Distributed Computing
Message Passing
Network
TP
DescriptionAs network bandwidth lags behind increasing computing power, efficient collective communication is a major challenge for exascale applications. Traditional approaches use error-bounded lossy compression to accelerate collective operations but suffer from the costly decompression-operation-compression (DOC) workflow. We propose hZCCL, the first homomorphic compression-communication co-design enabling direct operations on compressed data, avoiding expensive DOC overhead. Alongside the co-design framework, we introduce a lightweight, multi-core CPU-optimized compressor and a homomorphic compressor with a runtime heuristic to select efficient compression pipelines dynamically. We evaluate hZCCL with up to 512 nodes and across five application datasets. The experimental results demonstrate that our homomorphic compressor achieves a CPU throughput of up to 379.08 GB/s, surpassing the conventional DOC workflow by up to 36.53X. Moreover, our hZCCL-accelerated collectives outperform two state-of-the-art baselines, delivering speedups of up to 2.12X and 6.77X compared to original MPI collectives in single-thread and multi-thread modes, respectively, while maintaining data accuracy.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWorkflows consist of individual applications such as scientific simulations and data analytics. These applications constitute different stages of the workflow, each comprising heterogeneous characteristics such as run-times and system requirements. The heterogeneity in these workflow stages dictates the need to efficiently characterize them in terms of I/O to provide insights that can lead to informed decisions for their optimization. In this work we have analyzed the run-times of the workflows Montage, 1000 Genome and MuMMI and have categorized their stages as I/O or Non-I/O bound. For the I/O bound stages we perform a detailed analysis of their bandwidth and resource requirements. Our findings conclude that Montage's mBgModel could benefit from Dynamic Resource Scheduling, while Genome's individuals_merge could benefit from data aggregations in the PFS requests and the usage of isolated storage solutions such as node-local storage. These optimizations could aid in serving the bandwidth requirements of this workflow stage.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionThis talk describes the IARPA AGILE program, the first step towards catalyzing a computing revolution by pioneering new hardware and software co-designs tailored for data handling and movement. The goal is accelerating graph analytic applications through efficient, scalable systems balanced for data-intensive and compute-intensive workloads. These new transformative architectures will enable near real-time knowledge discovery and analytics from massive, randomly distributed, heterogeneous data streams and structures.
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionHigh performance computing (HPC) applications using MPI (Message Passing Interface) often face non-determinism (ND) due to asynchronous MPI calls, making ND source identification challenging. Modeling execution as an event graph, where MPI calls are nodes and communication is edges, can be useful. Focusing on Message ND, which involves variability in MPI communication order across runs, we detect potential ND sources by comparing edge sets between event graphs. Accurate comparison requires aligning event graph nodes, but traditional methods like NetAlign, graphlet degree vectors, and Graph Auto-Encoders struggle due to the regularity of event graphs. We propose a meta graph heuristic utilizing structural constraints and a message passing scheme for sparse directed acyclic graphs, achieving up to 70% improvement in alignment accuracy over conventional techniques.
Students@SC
DescriptionCo-sponsored by the IEEE Computer Society and Students@SC, this exciting event is designed specifically for students who are interested in pursuing a career in high performance computing. Its main goal is to connect students with mentors from academia, industry, and national labs who will provide them with valuable insights and advice on career paths in their respective fields.
In this "speed dating" style session, students will meet in small groups with the team of mentors who will move between groups throughout the session to let attendees ask questions and gather insights from the experiences of mentors with varied backgrounds. Pre-registration is required to attend this event. Lunch is provided.
In this "speed dating" style session, students will meet in small groups with the team of mentors who will move between groups throughout the session to let attendees ask questions and gather insights from the experiences of mentors with varied backgrounds. Pre-registration is required to attend this event. Lunch is provided.
Awards and Award Talks
TP
DescriptionWe start with key aspects of the decade-long evolution of Google’s Tensor Processing Systems. These systems have unique requirements arising from serving billions of daily users across multiple products in production environments. Training Large Language ML Models (LLMs) can require 100,000 accelerators working together in synchrony for months. And LLM inference can require exaFLOP/OP speeds with response times of under a second. This has driven adoption of lower-precision numerics across the industry. Finally, since random errors from various sources are likely to occur at this immense scale during months-long training runs, fault tolerance is also a key feature of immense-scale ML systems.
Panel
Architecture
Precision
TP
DescriptionModern HPC architectures, such as GPUs, are making significant performance advances in lower precision math. This panel will highlight the progress being made within industry and academic communities on taking advantage of the low-precision performance opportunities of modern architectures. What effect is this having on established HPC modeling and simulation codes that have historically relied on single or double precision math? What effect is this having on codes that are being written today? How are scientists adapting and evaluating the tradeoffs to capture the advantage of all this new performance?
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionThe limiting factor in the application of high-accuracy quantum molecular simulations to large systems has been the associated high computational costs in terms of both compute power and memory. In this paper we explore the use of various BLAS precision modes (BF16, TF32, and Complex_3M) in DCMESH, a framework utilized for the study of light-matter interaction. On a single stack of the Intel Data Center GPU Max Series 1550, we are able to achieve a speedup of 1.35x while retaining accuracy in key output parameters such as the number of excited electrons, current density, and kinetic energy. For large problem sizes, we observe speed-ups of up to 3.91x for individual BLAS calls. Switching between BLAS precision modes requires no source code changes (only environment variables), and so the approach we demonstrate could be readily applied to other high performance computing (HPC) workloads that spend significant time in BLAS calls.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionRun-by-run variability in parallel programs caused by floating-point non-associativity can affect reproducibility in iterative algorithms due to accumulating errors, and correctness testing for stochastic programs. The sensitivity of deep learning (DL) training and inference to non-determinism can be extreme, and can prevent certification, accurate assessment of robustness and sensitivity, and bug detection. New approaches in scientific computing applications have coupled DL with simulation, leading to an aggravation of debugging and testing challenges. Here we perform an investigation of floating-point non-associativity within HPC programming models, analyze performance and productivity impacts of replacing atomic operations with deterministic alternatives on GPUs, and examine the recently-added deterministic options within the PyTorch framework within the context of GPU deployment. We evaluate the strategy of exploiting automatic determinism provided by deterministic hardware, using the Groq LPU accelerator for inference portions. We demonstrate the benefits that this strategy can provide within reproducibility and correctness efforts.
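A tiny NumPy example shows the root cause: the same float32 values reduced in different orders can give slightly different sums, which is what a varying atomic-update order produces from run to run on GPUs. The PyTorch switch mentioned in the closing comment, torch.use_deterministic_algorithms(True), is the framework's documented determinism option; the rest of the study's methodology is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)

# The same values reduced in three different orders.
s_forward  = np.add.reduce(x)
s_reversed = np.add.reduce(x[::-1])
s_sorted   = np.add.reduce(np.sort(x))

print(s_forward, s_reversed, s_sorted)
print("discrepancy:", max(abs(float(s_forward) - float(s_reversed)),
                          abs(float(s_forward) - float(s_sorted))))

# On GPUs, atomic updates make the reduction order vary from run to run, so a
# single binary can produce run-to-run differences. PyTorch's documented
# torch.use_deterministic_algorithms(True) switch forces deterministic kernels
# (at some performance cost); that option is among those examined in this work.
```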
Birds of a Feather
TP
XO/EX
DescriptionZero Trust (ZT) is the cybersecurity architecture of choice and is now being discussed for supercomputing environments and workflows. ZT is based on a least-privilege per-request approach and has serious implications for HPC centers, application developers, and end-user workflows. Learn what ZT is, the purpose of NIST SP800-223, and about the U.S. Federal mandates; hear an update from the SC24 Security Workshop; and find out the approaches and challenges of HPC centers. Join this discussion to share your experiences and questions.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image was painted using Arteza watercolors and brushes on cold-pressed watercolor paper, and was captured and uploaded via an iPhone.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis poster presents our two-phase solution for improving GPU utilization in NSF-funded ACCESS high-performance computing (HPC) clusters, with a pilot implementation on Pittsburgh Supercomputing Center’s Bridges-2. Our approach addresses the limitations of Open XdMoD, which lacks per-job GPU usage monitoring and experiences delays in data availability. In phase one, we develop a data ingestion layer to collect GPU indices and resource usage data, utilizing existing software tools for efficient data aggregation and analysis. Analyzing 5,717 completed GPU jobs revealed issues such as workflow configuration errors, framework misconfigurations, and low GPU utilization. In phase two we create a user-facing platform with modern web tools. This platform will automatically detect inefficiencies, notify users via email, and provide actionable insights to optimize resource management. By addressing these issues and integrating real-time data presentation, we aim to enhance overall system utilization, reduce GPU job wait times, and enable more efficient use of existing resources.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionIn many “Big Data” problems, the data to be analyzed are stored in files; to solve such problems, an input step reads the data from a file into an array for processing. This input step has traditionally been performed sequentially, causing the time to perform that step to grow linearly with N, the number of values in the file. This paper explores different ways to reduce the time consumed by the input step, including the use of different file formats, as well as parallel I/O via MPI-IO. To make parallel I/O easier for students to use, we have created OO_MPI_IO, a new set of C++ abstractions that hide the complexity of MPI-IO. We also demonstrate how these OO_MPI_IO abstractions can (i) improve the scalability of data-intensive problem solutions, and (ii) provide a means of helping students understand Amdahl’s and Gustafson’s Laws.
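OO_MPI_IO itself is a C++ layer; as a minimal illustration of the underlying MPI-IO pattern it hides, the sketch below uses Python's mpi4py bindings to have each rank read its own contiguous slice of a binary file of float64 values with a single collective call. The file path is hypothetical.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, nprocs = comm.Get_rank(), comm.Get_size()

path = "values.bin"                     # hypothetical file of raw float64 values
fh = MPI.File.Open(comm, path, MPI.MODE_RDONLY)

itemsize = np.dtype(np.float64).itemsize
n_total = fh.Get_size() // itemsize     # number of values in the file

# Split [0, n_total) into near-equal contiguous chunks, one per rank.
base, extra = divmod(n_total, nprocs)
counts = [base + (r < extra) for r in range(nprocs)]
offset = sum(counts[:rank])

local = np.empty(counts[rank], dtype=np.float64)
fh.Read_at_all(offset * itemsize, local)   # collective read at a byte offset
fh.Close()

total = comm.allreduce(local.sum())        # combine partial results
if rank == 0:
    print("global sum:", total)
```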
Workshop
Message Passing
Network
W
DescriptionExascale applications are increasingly being written in modern languages such as Python, Julia, C++, and Rust. The Message Passing Interface (MPI), the de facto standard for parallel computing, only defines interfaces for C and Fortran, languages that are very different from these modern languages, often containing more complex types and representations incompatible with MPI. The existing derived datatype interface is widely used for older applications, but fails to work efficiently for types containing multiple pointers, requiring application-specific initialization, or serialization. Applications written in these languages can still use MPI, but at the cost of complicated address manipulation or high overhead. This work proposes a new datatype interface for MPI giving more control to the application over buffer packing and the wire representation. We built a prototype for this interface, demonstrating it with Rust, Python, and C++, highlighting key concerns of each language and showing the improvements provided.
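The overhead the paper targets can be seen in miniature with mpi4py: a structured Python object falls back to pickle-based send/recv, whereas a flat NumPy buffer can use MPI's typed Send/Recv path. The proposed interface is about giving applications this kind of control over packing and wire representation portably; the payload below is made up for illustration.

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# A "modern-language" object: nested and pointer-rich, awkward to describe with
# the classic MPI derived-datatype interface without custom packing code.
record = {"name": "particles",
          "ids": list(range(1000)),
          "coords": [(i * 0.1, i * 0.2, i * 0.3) for i in range(1000)]}

flat = np.arange(3000, dtype=np.float64)    # the same payload as a flat typed buffer

if rank == 0:
    comm.send(record, dest=1, tag=0)                 # pickled: generic, but extra overhead
    comm.Send([flat, MPI.DOUBLE], dest=1, tag=1)     # typed buffer: no serialization step
elif rank == 1:
    obj = comm.recv(source=0, tag=0)
    buf = np.empty(3000, dtype=np.float64)
    comm.Recv([buf, MPI.DOUBLE], source=0, tag=1)
    print("received", len(obj["ids"]), "ids and", buf.size, "doubles")
```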
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionPolyhedral optimizations have been a cornerstone of kernel optimization for many years. These techniques use a geometric model of loop iterations to enable transformations like tiling, fusion, and fission. The elegance of this approach lies in its ability to produce highly efficient code through fully static optimizations. However, modern kernel schedulers typically avoid the polyhedral model, opting instead for dynamic sampling techniques, such as evolutionary searches, to generate efficient code. The polyhedral model is often bypassed because, being entirely static, it struggles to adapt to the fine details of hardware. In this work, we demonstrate that it is possible to overcome this limitation by combining the polyhedral model with a post-optimization phase based on dynamic coordinate descent, which uses minimal sampling while still achieving excellent performance.
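A toy sketch of the dynamic post-optimization phase (parameter names and the cost function are placeholders standing in for measured kernel times, not the authors' implementation): starting from a statically chosen point, each parameter is varied in turn and only improving moves are kept, so very few samples are needed.

```python
# Coordinate descent over discrete tuning parameters (e.g., tile sizes).
def coordinate_descent(start, candidates, cost):
    best, best_cost = dict(start), cost(start)
    improved = True
    while improved:
        improved = False
        for name, options in candidates.items():   # one coordinate at a time
            for value in options:
                trial = dict(best, **{name: value})
                c = cost(trial)
                if c < best_cost:
                    best, best_cost, improved = trial, c, True
    return best, best_cost

# Placeholder cost; in practice this would time the generated kernel.
cost = lambda p: abs(p["tile_i"] - 64) + abs(p["tile_j"] - 32)
print(coordinate_descent({"tile_i": 8, "tile_j": 8},
                         {"tile_i": [8, 16, 32, 64, 128],
                          "tile_j": [8, 16, 32, 64, 128]},
                         cost))
```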
ACM Student Research Competition: Graduate Poster
Posters
TP
DescriptionSparse Matrix-Matrix Multiplication (SpGEMM) is a key kernel in many scientific applications and graph workloads. SpGEMM is known to suffer from poor performance due to irregular memory access patterns. Gustavson's algorithm, a traditional approach for SpGEMM, involves row/column-wise operations, facing challenges with irregular accesses to the second matrix. Our research focuses on enhancing memory locality through matrix reordering and cluster-wise computation to address this issue.
In this study, we evaluate the effect of 10 different reordering algorithms on SpGEMM performance. Then, we introduce a novel method that employs cluster-wise SpGEMM, merging similar rows into clusters. Our findings show that matrix reordering can improve SpGEMM performance by up to 2.3×, and our cluster-wise approach can further enhance performance by up to 30%.
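For reference, a plain-Python sketch of Gustavson's row-wise SpGEMM using a dict-of-dicts sparse format; the inner accesses to rows of B are the irregular accesses that the reordering and clustering techniques above aim to make more local.

```python
# C = A * B with sparse matrices stored as {row: {col: value}}.
def spgemm(A, B):
    C = {}
    for i, row_a in A.items():
        acc = {}                                   # sparse accumulator for row i
        for k, a_ik in row_a.items():              # nonzeros of A's row i
            for j, b_kj in B.get(k, {}).items():   # row k of B (irregular access)
                acc[j] = acc.get(j, 0.0) + a_ik * b_kj
        if acc:
            C[i] = acc
    return C

A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
print(spgemm(A, B))   # {0: {1: 16.0}, 1: {0: 15.0}}
```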
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThis paper presents an approach to optimize SQL query execution in distributed engines using Object-Based Computational Storage (OCS). Modern analytics platforms like Presto often suffer from excessive data movement between compute and storage nodes, even when only a small subset of data is required. While solutions like S3 SELECT address this by allowing limited operations to be offloaded to storage, they are restricted to simple queries. The OCS system overcomes these limitations by enabling offloading of more complex, platform-independent query plans via Substrait. This work introduces a multi-layered offloading strategy, where query plans are decomposed and distributed between the OCS Front-End (OCSFE) and OCS Array (OCSA), enhancing resource utilization and reducing query latency. Moreover, this paper presents an integration with Presto, which allows seamless query offloading, and a heuristic algorithm that dynamically manages query distribution across OCS layers to ensure efficient execution and scalability.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionBlockchain technologies enable the success of digital currencies by providing security, decentralization, and trustless operation. Two dominant consensus algorithms, Bitcoin’s Proof-of-Work and Ethereum’s Proof-of-Stake, balance security, scalability, and energy efficiency, though PoW is energy-intensive and PoS faces centralization risks. Chia’s Proof-of-Space (PoSpace) offers a middle ground, using storage (instead of computation) for validation in the network while maintaining decentralization. PoSpace turns the compute-intensive problem into a data-intensive one, known as plotting. However, Chia’s plotting process stresses hardware, requiring expensive setups and shortening the lifespan of solid-state drives. This work takes a clean-slate approach to implementing an efficient PoSpace system that is lightweight and operates on systems ranging from small nodes (e.g., Raspberry Pis with 4 cores and 2 GB RAM) to large systems (an HPC server with 192 cores, 790 GB RAM, and multiple NVMe storage devices). Our C and Rust implementations achieve significantly higher performance than Chia in plot generation and lookup efficiency across all system sizes.
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionFour years have passed since the United States Government mandated federal agencies to complete the transition to Internet Protocol version 6 (IPv6). Despite the IPv4 address shortage and IPv6 mandate, Federally Funded Research and Development Centers (FFRDCs) are still struggling to sunset IPv4. As demonstrated on Supercomputing 2023’s SC23v6 wireless network, newer tooling such as RFC8925 allows clients to disable their IPv4 protocol stack while retaining legacy IP connectivity. However, SC23v6 wireless clients without RFC8925 support or a disabled IPv6 stack would continue to receive internet access via legacy IPv4. This paper introduces a method of using poisoned IPv4 DNS records to gracefully inform IPv4-only clients at Supercomputing 2024’s SC24v6 wireless network about their inability to use the current version of the Internet Protocol, with the goal of minimal impact on RFC8925 and dual-stack clients. When implemented, this method may improve supportability and user experience of IPv6-only deployments at FFRDCs.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
Inclusivity
Inclusivity
TP
W
TUT
XO/EX
DescriptionDr. Lanier will discuss how to integrate and apply psychological safety in our workplace. This interactive session aims to provide an overview of psychological safety, its impact on students and staff, and why it is essential not only for employee retention but also for the success of diverse workplaces.
Inclusivity
Inclusivity
TP
W
TUT
XO/EX
DescriptionCreating and increasing diversity takes a targeted approach. Outreach strategies vary among academia, non-profit companies, industry, and the government. Learn about the best practices on how outreach can help build communities that reflect and support the needs for diversity.
Inclusivity
Inclusivity
TP
W
TUT
XO/EX
DescriptionUnderrepresentation in the STEM fields may lead to missing the voices and perspectives of many brilliant people (whether through never entering the field, prematurely leaving a STEM career, or other factors which discourage active participation). Moreover, many barriers exist that may inhibit people from underrepresented groups who are in STEM careers from realizing their full potential. The combination of these factors implies a lot of missed opportunities for more creative and productive teams and for more innovative solutions to critical problems we face in this world. In this panel, panelists will address questions relevant to their experiences, best practices, and outcomes from engaging in efforts to broaden participation from different communities and groups. These efforts range across a variety of settings and impact diverse communities. The panelists will discuss ways to get involved with service and outreach work at their institutions, about how to promote a culture of belonging, and about how such efforts benefit the work environment for everyone.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionWith the cost of tape-based storage being a fraction of the price of disk-based storage, using tape is an appealing way to drive costs downward, but this is not without challenges. NCAR’s Campaign Storage (CS) is a 120 PB disk-based IBM Storage Scale system that is projected to be outgrown by storage demands, but it has enough cold data that there is potential to offload to tape when data retrieval time requirements permit doing so. One notable challenge is reconciling users’ desire for a single namespace with the need to prevent issues that can occur when users have unfettered ability to read files. In this submission we introduce an architecture for solving issues including (1) a utility that enables users to self-request file migrations/recalls, (2) Storage Scale policies that can be used to handle migration/recall requests in a constrained way, and (3) a cost incentive for encouraging use of tape.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionThe growing demand for HPC necessitates significant energy consumption, posing a sustainability challenge for HPC centers, users, and society, especially under stricter environmental regulations. While efforts exist to reduce overall system energy consumption, workload-specific energy-efficiency optimization for GPU-based workloads has received insufficient attention. This work addresses this issue by proposing dynamic approaches that increase energy efficiency by controlling the GPU frequency through code instrumentation. We further investigate the energy-performance trade-off by comparing both static and dynamic GPU frequency scaling strategies as well as DVFS within SPH-EXA, a newly developed, open-source, GPU-centric simulation framework specializing in astrophysical simulations. Our findings demonstrate that code instrumentation enables detailed energy-consumption measurement beyond traditional HPC system monitoring, while dynamic frequency scaling of computational kernels achieves energy reduction with limited performance loss. This approach empowers researchers to develop more sustainable large-scale scientific simulations running mainly on GPUs.
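A heavily hedged sketch of the kind of instrumentation-driven frequency control described above (not the authors' implementation); it assumes the nvidia-ml-py (pynvml) bindings, an NVIDIA GPU, and sufficient privileges to lock clocks.

```python
# Cap the GPU core clock around a selected (e.g., memory-bound) kernel,
# then restore the default clocks afterwards.
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def run_with_capped_clock(kernel, max_mhz=1000):
    pynvml.nvmlDeviceSetGpuLockedClocks(gpu, 0, max_mhz)  # lock clock range
    try:
        kernel()                                          # instrumented region
    finally:
        pynvml.nvmlDeviceResetGpuLockedClocks(gpu)        # restore defaults

run_with_capped_clock(lambda: print("kernel runs at a capped frequency"))
```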
Workshop
Software Engineering
W
ACM Student Research Competition: Graduate Poster
Posters
TP
DescriptionQuantum computing is an emerging field that has had an impact on various domains. This poster focuses on the advantageous neutral atom technology for quantum computing. In neutral atom technology, a significant challenge arises in the measurement of quantum output, where the need to eject atoms physically leads to substantial time wastage in reloading arrays. To address this issue, we introduce a novel technique that leverages the probabilistic nature of quantum programs to reduce qubit ejections and atom array reloads.
Panel
Artificial Intelligence/Machine Learning
TP
W
TUT
XO/EX
DescriptionWith the acceleration of AI and HPC technologies, industry standards bodies are essential to guarantee interoperability and facilitate faster deployment. Collaboration among industry standards groups fosters the development and deployment of advanced HPC solutions, driving innovation, collaboration, and efficiency.
Leaders from CXL, DMTF, NVMe, Open Fabrics Alliance (OFA), PCI-SIG, SNIA, UALink, UCIe and Ultra Ethernet Consortium (UEC) will discuss how these standards are collaborating to foster innovation as technology trends accelerate.
Attendees will learn:
• What the new standards-based technologies mean now and for the future
• Why new standards bodies form
• How collaboration is fostering new innovations in the marketplace:
o UEC transport and OFA LibFabric
o SNIA Swordfish and NVMe
o DMTF Redfish, SNIA Swordfish, OFA Sunfish, and CXL
o PCIe, CXL, and UCIe as load-store interconnects
• How standards bodies cooperate instead of duplicating:
o Computational storage in SNIA and NVMe
o Redfish and Swordfish cooperation
Birds of a Feather
TP
XO/EX
DescriptionThe convergence of AI and high-performance computing (HPC) is already demonstrating significant reductions in time-to-insight.
InfiniBand, an industry interconnect standard established by the InfiniBand Trade Association, continues to evolve with faster speeds, enhanced capabilities, and improved performance and scalability. It serves as the core of supercomputers, connecting compute and storage elements for parallel and synchronized computing.
This session will focus on the future innovations of InfiniBand, its newly announced roadmap of GDR, and LDR generations, highlighting the next-generation capabilities being developed to bridge scientific computing and simulation techniques with mainstream generative transformer models, natural language processing, and data analytics.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThe storage subsystem of the Aurora platform at Argonne National Laboratory offers the potential for unprecedented application I/O performance. Its hardware stack combines NVMe drives, persistent memory devices, a large collection of storage servers, and a unified RDMA-capable network fabric. Its software stack is based on the DAOS object storage system, thereby putting into practice decades of I/O research into minimizing overhead, maximizing access concurrency, and streamlining application interfaces. Taken together, these elements combine to present up to 230 PiB of projected storage capacity and up to 31 TiB/s of projected I/O throughput to applications executing on Aurora.
We must revisit fundamental issues such as hardware topology, network concurrency, and I/O APIs to understand their impact on real-world I/O performance. In this paper we present our initial experiences with the DAOS storage system on Aurora and characterize its sensitivity to these parameters at small scale.
Exhibitor Forum
Facilities
Sustainability
TP
XO/EX
DescriptionThe world is increasingly reliant on HPC, AI and machine learning (ML) to power everything from weather forecasting to self-driving cars. HPC solutions and AI applications are particularly demanding due to their compute-intensive workloads. With data centers accounting for 0.9%–1.3% of global electricity consumption, according to the International Energy Agency (2021), it is of growing importance to balance data center priorities with energy-conscious solutions.
In this presentation, Penguin Solutions will:
- Give insight into the pressures data center operators face to reduce the environmental footprint while remaining cost and performance competitive.
- Provide an overview of the different cooling options including immersion, direct to chip, augmented immersion, and negative pressure, and design scenarios best suited for each option.
- Offer recommendations to improve sustainability without compromising performance.
Attendees will gain a rich understanding of various cooling methods, when to consider deploying each, and insights for balancing business and sustainability goals.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionGPUs are known to be power-hungry, and due to the boom in artificial intelligence, they are the major contributors to the high power demands of datacenters. Most GPU usage in these popular workloads consists of large general matrix-matrix multiplications (GEMMs), which have therefore been optimized to achieve high utilization of hardware resources.
We show that modifying the input data to GEMMs, while maintaining the matrix shapes and sizes, can notably change the power consumption of these kernels. We experiment with four kinds of input variations: value distribution, bit similarity, placement, and sparsity, across different data types. Our findings indicate that these variations can change GPU power usage during GEMM by almost 40%.
We hypothesize that input-dependent power usage variations occur due to changes in the number of bit flips in the GPUs. We propose leveraging this property through compiler and scheduler optimizations to manage power and reduce energy consumption.
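A rough sketch of how such an input-dependent effect could be observed (array sizes, the sparsity threshold, and the sampling scheme are assumptions, and polling board power this coarsely is only indicative): run GEMMs of identical shape on dense versus mostly-zero inputs while reading power through NVML.

```python
# Compare mean board power for identical-shape GEMMs on different inputs.
import pynvml
import torch

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

def mean_power_watts(a, b, iters=200):
    samples = []
    for _ in range(iters):
        torch.matmul(a, b)
        samples.append(pynvml.nvmlDeviceGetPowerUsage(gpu))  # milliwatts
    torch.cuda.synchronize()
    return sum(samples) / len(samples) / 1000.0

n = 4096
dense = torch.rand(n, n, device="cuda")
mostly_zero = dense * (torch.rand(n, n, device="cuda") > 0.9)  # ~90% zeros

print("dense input :", mean_power_watts(dense, dense), "W")
print("sparse input:", mean_power_watts(mostly_zero, mostly_zero), "W")
```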
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionWe aim to identify the differences in Input/Output (I/O) behavior between multiple user programs through the inspection of system calls (i.e., requests made to the operating system). A typical program issues a large number of I/O requests to the operating system, thereby making the process of inspection challenging. In this paper, we address this challenge by presenting a methodology to synthesize I/O system call traces into a specific type of directed graph, known as the Directly-Follows-Graph (DFG). Based on the DFG, we present a technique to compare the traces from multiple programs or different configurations of the same program, such that it is possible to identify the differences in the I/O behavior. We apply our methodology to the IOR benchmark, and compare the contentions for file accesses when the benchmark is run with different options for file output and software interface.
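As a concrete illustration of the Directly-Follows-Graph construction (a generic sketch, not the paper's tooling), assuming a trace has already been reduced to an ordered list of system-call names:

```python
import networkx as nx

def build_dfg(trace):
    """Edge u -> v records how often call v directly follows call u."""
    g = nx.DiGraph()
    for u, v in zip(trace, trace[1:]):
        if g.has_edge(u, v):
            g[u][v]["count"] += 1
        else:
            g.add_edge(u, v, count=1)
    return g

dfg = build_dfg(["openat", "read", "read", "lseek", "read", "close"])
print(list(dfg.edges(data=True)))
```

Comparing two traces then reduces to comparing the edge sets and counts of their DFGs, which is far more tractable than inspecting the raw system-call logs.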
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionA Captive Portal (CP) represents an interface for network connectivity where access to specific resources is restricted until certain conditions are satisfied. It provides a lightweight authentication mechanism in which the access requirements are displayed, typically in a web-page interface, e.g., viewing advertisements, accepting usage policies, or providing some form of credentials. Some HPC systems use this approach to validate users and grant access to resources (e.g., gateways to specialized portals, system policy disclosures, etc.).
In this work, we present an educational project aimed to introduce the technology behind CP infrastructures. For this, we developed a series of modules to emphasize each of the different aspects and features of this technology. The project is based on an open-source implementation widely used in computer network courses, making it well-suited for instructors and practitioners in this field.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF session aims to explore the integration of confidential computing in HPC environments, focusing on leveraging Trusted Execution Environments (TEEs) for secure, high-performance cloud computing. The session will include brief presentations, a panel discussion, and an interactive Q&A to discuss practical applications, performance trade-offs, and future directions. Intended for HPC professionals, researchers, and industry experts, this session seeks to foster community engagement and collaboration in advancing secure computing technologies.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionHigh performance computing (HPC) systems have become essential for solving complex scientific problems, particularly in the context of deep learning (DL). This extended abstract presents a novel system that uses a multiobjective evolutionary algorithm (EA) to optimize hyperparameters for a deep learning model, AtomAI, to minimize validation training loss and energy use. We will be using the parallel and distributed computing capabilities of Dask and the scalable provenance features of FlowCept to measure CPU and GPU resource usage as proxies for energy consumption. Our approach focuses on integrating multiple software components to operate efficiently on large-scale HPC systems, specifically targeting the OLCF's Frontier supercomputer, but should be generalizable to other HPC environments.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionHPCToolkit enables users to gather detailed information about application performance. Users can capture fine-grained measurement data, which may include instruction-level samples on CPUs and GPUs. Collected data can be huge, making manual inspection using GUI tools difficult and time-consuming. We explored existing tools Hatchet and Thicket for programmatic analysis of performance data to automate this process. However, they were not designed to handle data as large as HPCToolkit's. HPCToolkit's calling context trees are difficult to interpret and visualize using these tools because of their overwhelming detail. Moreover, importing multiple trees into Thicket can be slow, as unifying trees is costly when trees are large. To reduce the size of large trees, we implemented heuristics that would automatically detect and remove specific code regions. After creating smaller trees that we believe contain all the meaningful information about the program's behavior, we used Thicket to analyze multiple performance profiles measured by HPCToolkit.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionWe explore the development of a performance-portable CPU/GPU ecosystem to integrate the Oak Ridge Leadership Computing facility and the Spallation Neutron Source (SNS), both of which are housed at Oak Ridge National Laboratory. We select a relevant data reduction workflow calculating the differential scattering cross-section from data collected by SNS's CORELLI and TOPAZ instruments. We compare the current CPU-only production implementation using the Garnet package against our proposed CPU/GPU implementation that uses the Julia scientific language and the JACC.jl performance-portable package. Two proxy apps were developed: (i) an app for extracting relevant Mantid kernels (MDNorm) in C++ and (ii) the Julia MiniVATES.jl miniapp. We present performance results for NVIDIA A100 and AMD MI100 GPUs and AMD EPYC 7513 and 7662 CPUs. The results provide insights for future generations of data reduction software that can embrace performance portability for an integrated research infrastructure across DOE's experimental and computational facilities.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionThe interest in high-performance computing (HPC) has grown due to the need for skills to leverage advanced computing, such as managing large volumes of data, using complex algorithms, and developing AI applications. HPC ecosystems are essential for addressing complex scientific and engineering challenges, but integrating diverse stakeholders requires a nuanced understanding of collaboration and innovation within HPC environments. This proposal discusses how different actors have been integrated at different levels, including students from various disciplines, scientists, and decision-makers, through the introduction of new formal courses, the inclusion of relevant topics in existing courses, and other activities.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
DescriptionModern high-end systems are increasingly becoming heterogeneous, providing users options to use general purpose Graphics Processing Units (GPU) and other accelerators for additional performance. High Performance Computing (HPC) and Artificial Intelligence (AI) applications are often carefully arranged to overlap communications and computation for increased efficiency on such platforms. This has led to efforts to extend popular communication libraries to support GPU awareness and more recently, GPU-initiated operations. In this paper, we present Intel SHMEM, a library that enables users to write programs that are GPU aware, in that API calls support GPU memory, and also support GPU-initiated communication operations by embedding OpenSHMEM style calls within GPU kernels. We also propose thread-collaborative extensions to the OpenSHMEM standard that can enable users to better exploit the strengths of GPUs. Our implementation adapts to choose between direct load/store from GPU and the GPU copy engine based transfer to optimize performance on different configurations.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis work presents an automated, reproducible, ML-based performance modeling workflow for HPC systems. The proposed workflow fully automates data generation, preprocessing, ML model training and validation. Since the proposed approach is generic and not tailored to a specific application, our workflow can be utilized for performance modeling across a wide range of performance domains. The prototype implementation is based on the JUBE workflow environment, through which a user-friendly interactive console is realized. The effectiveness of the automated workflow is demonstrated with a case study on I/O bandwidth modeling and prediction.
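A minimal sketch of the model-training step only, with placeholder features and synthetic stand-in measurements (the actual workflow generates, preprocesses, and validates data automatically, e.g. through JUBE):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Placeholder features: (nodes, processes per node, transfer size);
# in a real workflow each row would come from an automated benchmark run.
rng = np.random.default_rng(0)
X = rng.integers(1, 64, size=(200, 3)).astype(float)
y = X[:, 0] * X[:, 1] * np.log1p(X[:, 2])      # stand-in for measured bandwidth

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out runs:", r2_score(y_te, model.predict(X_te)))
```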
Workshop
Applications and Application Frameworks
W
DescriptionSince 2013, LUNARC has aimed to provide an interactive HPC environment for its resource users. Several different architectures have been used, but since 2015, we have been using a remote desktop environment based on Cendio's ThinLinc combined with a custom backend framework, GfxLauncher, supporting hardware-accelerated graphics applications and Jupyter Notebooks submitted to the backend cluster.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionThe rapid pace of hardware innovation and the quest for performance in HPC workloads demand that compilers deliver their best. To support this effort, we introduce ClangIR (CIR), a new intermediate representation (IR) for Clang that captures higher-level semantics for C, C++, and extensions. ClangIR streamlines domain-specific code transformations and analysis by eliminating the need to reconstruct semantics from lower-level IRs like LLVM. Additionally, it enables the HPC community to more seamlessly integrate C/C++ language extensions and custom backends into their Clang based compilers. Built on MLIR, ClangIR is an active open-source project on LLVM's GitHub repository. This talk will cover its design principles, benefits, and ongoing development efforts, as well as opportunities for collaboration and contribution from the community.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionThis paper describes how we use n-body simulations as an interesting and visually compelling way to teach efficient, parallel, and distributed programming. Our first course targets bachelor's students, introducing them to algorithmic complexity and its implications for real-world problems, as well as to state-of-the-art tools like Git, remote development, and C++, by simulating the collision of two galaxies. This project teaches the mapping of mathematical functions into code, efficient implementations, and the pitfalls around complexity and scaling.
Our second course targets master students introducing them to intra- and inter-node parallelization. Here, the students simulate our solar system using OpenMP and MPI. The master students further deepen their knowledge of parallelization and scientific computing by choosing custom projects like simulating water molecules or implementing an interactive live visualization using GPUs.
Our courses are structured such that they can easily be adapted by other instructors. All our material is publicly available at https://github.com/orgs/SCTeaching-NBody.
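To illustrate the kind of kernel students write before parallelizing it with OpenMP or MPI, a direct-sum O(N^2) gravitational update in Python (a teaching sketch with an assumed softening parameter, not the course's reference code):

```python
import numpy as np

def nbody_step(pos, vel, mass, dt, G=1.0, eps=1e-3):
    """One step of a direct-sum N-body simulation (pos, vel: (N, 3) arrays)."""
    acc = np.zeros_like(pos)
    for i in range(len(mass)):
        d = pos - pos[i]                          # vectors to all other bodies
        r2 = (d * d).sum(axis=1) + eps * eps      # softened squared distances
        inv_r3 = r2 ** -1.5
        inv_r3[i] = 0.0                           # no self-interaction
        acc[i] = G * (d * (mass * inv_r3)[:, None]).sum(axis=0)
    vel = vel + dt * acc                          # kick
    pos = pos + dt * vel                          # drift
    return pos, vel
```

The outer loop over bodies is embarrassingly parallel, which is what makes the problem a natural vehicle for teaching OpenMP worksharing and MPI domain decomposition.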
Tutorial
Emerging Technologies
Quantum Computing
TUT
DescriptionQuantum computing offers the potential to revolutionize high-performance computing by providing a means to solve certain computational problems faster than any classical computer. Relatively recently, quantum computing has advanced from a theoretical possibility to engineered reality, with commercial entities offering early prototype quantum processors representing a variety of qubit technologies and computational paradigms. The media have been showcasing each new development and implicitly conveying the message that quantum-computing ubiquity is nigh. Here, we will respond to this hype and provide an overview of the exciting but still early state of the field.
We introduce participants to the computational models underlying quantum computing. We work through examples of its immense computational power while highlighting what the quantum computing community still does not know in terms of quantum algorithms and where the power of quantum computing comes from. We examine the thought processes that programmers use to map problems to circuit-model quantum computers, quantum annealers, measurement-based quantum systems, analog Rydberg atom arrays, and other recent inventions in the quantum-computing space. We conclude with an overview of the hardware and algorithmic challenges that must be overcome before quantum computing becomes a component of the HPC developer's repertoire.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionSimulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using the VMD software package. No ML tools were leveraged in the rendering.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
DescriptionAgile Network Management: The Impact of Automation and Orchestration
Workshop
Architecture
Network
Performance Optimization
System Administration
W
DescriptionThe internet community has a long history of designing new protocols to solve current (and sometimes urgent) problems in network engineering. Not many of these protocols, however, end up widely deployed, and those that are widely deployed often end up being used in ways completely different from the original intent. Why? The thesis of this talk is that the phenomenon of failed and misdeployed protocols is directly related to complexity. Simple protocols succeed and complex protocols fail. This might seem obvious, but it leaves open a lot of questions. What is a “simple protocol?” Do successful protocols always remain “simple,” or do they accrue complexity over time? What is the impact of accrued complexity on network operations and protocol extensibility? This talk will attempt to create a framework, and suggest research directions, leading to answers for these questions.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThe democratization of the digital continuum necessitates an AI-integrated ecosystem where data and research infrastructure are seamlessly integrated and universally accessible. This talk overviews the imperative of bridging the gaps between these components through robust services, facilitating an inclusive AI landscape that empowers diverse research communities and domains. The National Data Platform (NDP) aims to lower the barriers to entry for AI research and applications through an integrated services approach to streamlining research and education workflows. This approach underscores the importance of open, extensible, and equitable systems in driving forward the capabilities of AI, ultimately contributing to the resolution of grand scientific and societal challenges. By examining real case studies leveraging open data platforms and scalable research infrastructure, the talk will highlight the role of composable systems and services in NDP to catalyze a platform to empower users from all backgrounds to engage in meaningful research, learning, and discovery.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionSparse tensor contraction (SpTC) is a crucial operation in high-performance applications, particularly in computational chemistry, high-order tensor decompositions, and quantum sciences. This talk will explore the performance challenges associated with SpTC and review current state-of-the-art solutions. We will focus on hash-table approaches, discussing the key features of hash table design that significantly impact performance. Additionally, a novel hash method will be introduced, featuring a fast hash function with guaranteed collision-free operations to efficiently support SpTC computations.
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionHPC systems are increasing in complexity, with the use of accelerators, complex interconnects, and a growing number of nodes. These systems are critical to solving scientific challenges in areas such as materials design, climate modeling, computational biology, and the use of AI to advance scientific discovery. Generally, the performance of such systems is measured by metrics such as runtime or throughput. However, the power requirement is becoming a major issue, with current systems requiring tens of megawatts. Future HPC data centers are even considering gigawatts of power demand. This talk will focus on the importance of curriculum that exposes students to energy efficiency in terms of system design, application development, and monitoring.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionWe are generating data in larger amounts and at higher speeds than ever before. Data compression is able to mitigate the resulting storage and transmission problems, but only if the compression ratio is high enough to obtain a meaningful benefit and the throughput is sufficient to not introduce a new bottleneck. Machine learning can help by automatically synthesizing effective compression algorithms. In our work, we go a step further by employing such synthesis tools to extract valuable insights, which have enabled us to iteratively create more and more powerful compression algorithms. Ultimately, this has resulted in GPU-based compressors for scientific data that outperform the state of the art in throughput and compression ratio, both for lossless compression and for lossy compression with guaranteed point-wise error bounds.
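A toy example of what a point-wise error-bound guarantee means (uniform scalar quantization; a stand-in for the synthesized GPU codecs, not their actual algorithm):

```python
import numpy as np

def quantize(data, abs_err):
    """Map each value to an integer code; reconstruction error <= abs_err."""
    return np.round(data / (2.0 * abs_err)).astype(np.int64)

def dequantize(codes, abs_err):
    return codes * (2.0 * abs_err)

x = np.random.rand(100_000)
x_hat = dequantize(quantize(x, 1e-3), 1e-3)
assert np.all(np.abs(x - x_hat) <= 1e-3)   # guaranteed point-wise bound
```

In a full codec, the integer codes would then be fed to a lossless entropy or dictionary stage, which is where most of the compression ratio comes from.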
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionAs traditional technology drivers of computing performance level off, the use of accelerators with various levels of specialization is growing in importance. At the same time, data movement continues to dominate running time and energy costs, making communication cost reduction the primary optimization criterion for compilers and programmers. This requires new ways of thinking about algorithms to minimize and hide communication, expose fine-grained parallelism, and manage communication. These changes will affect the theoretical models of computing, the analysis of performance, the design of algorithms, and the practice of programming.
In this talk I will discuss prior work and open problems in optimizing communication, avoiding synchronization, and tolerating nondeterminism, using data analysis and statistical learning problems from biology as driving examples. I will discuss distributed data structures and communication optimizations in large-scale genome analysis, including metagenome assembly, protein clustering, and more. The algorithms represented data analysis “motifs” including hashing, alignment, generalized n-body, and sparse matrices. I will describe two parallelization approaches, one based on asynchronous one-sided communication and another based on bulk-synchronous collectives using GraphBLAS. I will give an overview of these approaches, describe the GPU parallelizations, and highlight some of the resulting scientific insights, including the discovery of new microbial species and new protein functional dark matter.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionThe growth of AI/deep learning and data analytics has created many of the most challenging HPC workloads in recent years. HPC/AI applications typically drive the need for better memory and storage performance and capacity, and despite significant advancements, memory and storage in HPC/AI still face several challenges. SK hynix has pursued continuous innovation and technological breakthroughs to address these challenges in memory and storage. As part of these efforts, this talk will highlight the key roles that advanced memory and storage play in the HPC/AI ecosystem and the potential benefits of “Processing Near Data with CXL/HBM/SSD” and “CXL Pooled Memory's Data Sharing” for HPC/AI systems.
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionNeuromorphic computing is a popular technology for the future of computing. Neuromorphic systems have the opportunity to impact computing from the edge to HPC-scale systems. In this talk, I will overview the field of neuromorphic computing with a particular focus on challenges and opportunities in using neuromorphic computers. I will discuss neuromorphic applications for both machine learning and non-machine learning use cases, with applications in physics, transportation, nuclear engineering, and graph learning.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionDave Ditzel has been a driving force behind many hardware innovations and products, and based upon this wealth of experience and expertise founded Esperanto Technologies. Esperanto is one of the world's first companies to develop solutions to accelerate AI and HPC codes based upon RISC-V. With the first generation, ET-SoC-1, packing 1000 RISC-V cores per chip, and chips then being combined together in PCIe form factor, a wide range of solutions can be built using this technology to suit a plethora of applications. Not only has Esperanto technologies demonstrated significant performance benefits, but furthermore this solution is also highly energy efficient which is becoming of increasing importance to the AI and HPC communities.
Dave has been a proponent of RISC-V since the early days, and in this talk will reflect on how the technology has developed so far, why he leveraged RISC-V in Esperanto, where he sees the technology going in the future and why in the AI & HPC communities we should be excited about the possibilities unlocked by RISC-V.
Birds of a Feather
TP
XO/EX
DescriptionThe IO500 is the de facto benchmarking standard for HPC storage. We have released official lists at ISC and SC events since SC17 and now have over 200 entries. The purpose is to foster the IO500 community and ensure forward progress towards the common goals of creating, sharing, and benefiting from a large corpus of shared storage data. The IO500 also serves as the largest repository of detailed HPC storage information for researchers and system designers to analyze and evaluate over time. A key highlight is the presentation of the latest Research and Production IO500 lists.
Birds of a Feather
TP
XO/EX
DescriptionThe IPv6 landscape is getting easier to navigate and there’s no looking back. As the global transition from IPv4 continues, many ISPs are seeing approximately 50% of their traffic via IPv6. This BoF continues discussions from SC22 and SC23, provides a brief update on global IPv6 transition efforts, and demonstrates real-time IPv6 usage from SCinet! The HPC community continues to embrace IPv6 and learn about current status, cybersecurity concerns and transition implications. Join our discussion on transitioning HPC, data centers and networks. Ask questions, provide updates, and meet others with real-world experience — together we can advocate for IPv6.
Birds of a Feather
TP
XO/EX
DescriptionDOE has recently launched the Integrated Research Infrastructure (IRI) program, which is designed to enable new modes of integrated science across DOE user facilities. Common or unified interfaces are needed for these workflows to seamlessly orchestrate resources across high-performance computing, data, and network providers. These interfaces could be REST APIs for programmable workflows, expansive UIs like JupyterLab, or deep integration with external workflow orchestrators.
This BoF will summarize the current efforts of the IRI interfaces working group and individual ASCR user facilities to develop new interfaces. We invite the community to provide feedback to help guide these IRI efforts.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionThe diversity of accelerators in computer systems poses significant challenges for software developers, such as managing vendor-specific compiler toolchains, code fragmentation requiring different kernel implementations, and performance portability issues. To address these, the Intelligent Runtime System (IRIS) was developed. IRIS works across various systems, from smartphones to supercomputers, enabling automatic performance scaling based on available accelerators. Although IRIS simplifies system details, optimal dynamic scheduling still requires user input to understand workload structures. To address this, we introduce a new scheduling policy for IRIS, termed IRIS-GNN, which is the first IRIS hybrid policy that operates in conjunction with the dynamic policies. This policy employs a Graph-Neural Network (GNN) to conduct Graph Classification of any task graphs submitted to IRIS. This GNN analyzes the structure and attributes of the task graph, categorizing it as either locality, concurrency, or mixed. This classification subsequently guides the selection of the dynamic policy used by IRIS.
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptioniSeeMore is a kinetic cluster of 256 Raspberry Pi (RPi) computers that visually realizes supercomputing concepts (parallelism, data flow, algorithms) through servo- and LED-driven movement. In this poster, we describe the design of iSeeMore, the first large-scale cluster to combine movement and light in the service of educating audiences on the parallel algorithms and systems that form the underpinnings of everyday technologies we use today. We discuss core design decisions, software features for synchronizing LED hats to computation/movement, the approach to visually demonstrating parallel AI/ML concepts (e.g., LLMs), and the plan to showcase iSeeMore to large audiences.
Birds of a Feather
TP
XO/EX
DescriptionC++ was named Tiobe Programming Language of the Year for 2022 by the Tiobe Index of language popularity. C/C++ is used in 79.4% of parallel programming languages (based on Hyperion Research's HPC Market Update Briefing at ISC 2021). C++26 already has many improvements relevant for the HPC C++ developer community, including SIMD, linear algebra, structured concurrency, and submdspan, among others. This BoF will pull together important leaders within ISO C++ Standard that are co-authors in key C++23 and C++26 features such as ML, executors, mdspan, library, concurrency, parallelism and GPU support.
Posters
TP
DescriptionWe present JACC (Julia for ACCelerators), the first high-level, meta-programming, and performance-portable model for the just-in-time and LLVM-based Julia language. JACC provides a unified and lightweight front end across different back ends available in Julia, enabling the same Julia code to run efficiently on many CPU and GPU targets. We evaluated the performance of JACC for common HPC kernels as well as for the most computationally demanding kernels used in applications such as HARVEY, a blood flow simulator to assist in the diagnosis and treatment of patients suffering from vascular diseases. We carried out the performance analysis on the most advanced U.S. DOE supercomputers: Aurora, Frontier, and Perlmutter. Overall, we show that JACC has a negligible overhead versus vendor-specific solutions, reporting GPU speedups over the CPU implementations with no extra cost.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionWe present JACC (Julia for ACCelerators), the first high-level, metaprogramming, and performance-portable model for the just-in-time and LLVM-based Julia language. JACC provides a unified and lightweight front end across different back ends available in Julia, enabling the same Julia code to run efficiently on many CPU and GPU targets. We evaluated the performance of JACC for common HPC kernels as well as for the most computationally demanding kernels used in applications, such as MiniFE, a proxy application for unstructured implicit finite element codes, HPCCG, a supercomputing benchmark test for sparse domains, and HARVEY, a blood flow simulator to assist in the diagnosis and treatment of patients suffering from vascular diseases. We carried out the performance analysis on the most advanced US DOE supercomputers: Aurora, Frontier, and Perlmutter. Overall, we show that JACC has a negligible overhead versus vendor-specific solutions, reporting GPU speedups over the CPU implementations with no extra cost.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionHardware is becoming increasingly heterogeneous in modern high-performance computing clusters. However, computing environments for developing tools to harness these technologies are not easily available to researchers. This work showcases the need for a new high-pace, heterogeneous I/O research cluster and presents a novel software deployment framework named Jarvis to manage its hardware diversity. Jarvis is an extensible Python framework that allows users to create packages that deploy, manage, and monitor software, including complex applications (e.g., scientific simulations), support tools (e.g., Darshan, GDB), and storage systems (e.g., Lustre, DAOS). These packages can be combined to form complex deployment pipelines. To ensure pipelines are portable across hardware, Jarvis defines a novel resource graph schema file, which is a snapshot of a cluster's machine-specific information. This schema can be queried by Jarvis packages to deploy software across diverse hardware compositions with minimal user effort.
Birds of a Feather
TP
XO/EX
DescriptionThe Julia for HPC BoF provides a place for the HPC community with interests in the Julia programming language as an LLVM front-end for science to close the gap between high-productivity languages and the performance of compiled languages. We invite participants from industry, government and academia to discuss their experiences, identify and learn about opportunities and gaps. Topics include: community, adoption and support in HPC facilities, the Julia ecosystem, programming models and packages. The proposed third consecutive BoF continues the Julia for HPC working group’s engagement with the SC community, and complements the tutorial on Julia for HPC at SC24.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionComputational performance, e.g. CPU or GPU utilization, is crucial for analyzing machine learning (ML) applications and their resource-efficient deployment. However, the ML community often lacks accessible tools for holistic performance engineering, especially during exploratory programming such as implemented by Jupyter. Therefore, we present JUmPER, a Jupyter kernel that supports coarse-grained performance monitoring and fine-grained analysis tasks of user code in Jupyter.
JUmPER collects system metrics and stores them alongside executed user code. Additionally, code instrumentation can be enabled to collect performance events using Score-P. Built-in Jupyter magic commands provide visualizations of the monitored performance data directly in Jupyter. In addition, JUmPER preserves the exploratory programming experience by seamlessly integrating with Jupyter and reducing kernel runtime overhead through in-memory (pipe) communication and parallel marshalling of Python's interpreter state for the Score-P execution.
JUmPER thus provides a low-hurdle infrastructure for performance engineering in Jupyter and supports resource-efficient ML applications.
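A generic sketch of coarse-grained monitoring alongside user code (not JUmPER's implementation or API), sampling system metrics with psutil in a background thread:

```python
import threading
import time
import psutil

samples = []
stop = threading.Event()

def monitor(interval=0.5):
    """Record (timestamp, CPU %, memory %) until stop is set."""
    while not stop.is_set():
        samples.append((time.time(),
                        psutil.cpu_percent(),
                        psutil.virtual_memory().percent))
        time.sleep(interval)

t = threading.Thread(target=monitor, daemon=True)
t.start()
sum(range(50_000_000))     # stands in for an executed notebook cell
stop.set()
t.join()
print(f"collected {len(samples)} samples alongside the cell")
```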
Workshop
Applications and Application Frameworks
W
DescriptionComputational performance, e.g., CPU or GPU utilization, is crucial for analyzing machine learning (ML) applications and their resource-efficient deployment. However, the ML community often lacks accessible tools for holistic performance engineering, especially during exploratory programming as practiced in Jupyter. Therefore, we present JUmPER, a Jupyter kernel that supports coarse-grained performance monitoring and fine-grained analysis of user code in Jupyter. JUmPER collects system metrics and stores them alongside executed user code. Built-in Jupyter magic commands provide visualizations of the monitored performance data directly in Jupyter. Additionally, code instrumentation can be enabled to collect performance events using Score-P. JUmPER preserves the exploratory programming experience by seamlessly integrating with Jupyter and reducing kernel runtime overhead through in-memory (pipe) communication and parallel marshalling of Python's interpreter state for the Score-P execution. JUmPER thus provides a low-hurdle infrastructure for performance engineering in Jupyter and supports resource-efficient ML applications.
Workshop
Architecture
Network
Performance Optimization
Security
System Administration
W
DescriptionOpen-science collaboration using Jupyter Notebooks may expose expensively trained AI models, high-performance computing resources, and training data to security vulnerabilities, such as unauthorized access, accidental deletion, or misuse. The ubiquitous deployment of Jupyter Notebooks (≈ 11 million public notebooks on GitHub) has transformed collaborative scientific computing by enabling reproducible research.
This paper describes a network-based attack taxonomy of Jupyter Notebooks. The open nature of Jupyter (direct data access, arbitrary code execution in kernels for multiple programming languages) and its vast attack surface (terminal, file browser, untrusted cells) also attract attacks attempting to misuse supercomputing resources and steal state-of-the-art research artifacts (CVE-2024-22415). We envisage that even more sophisticated AI-driven attacks can be adapted to target Jupyter, where defenders have limited visibility. To the best of our knowledge, this is the first paper to systematically describe the threat model against Jupyter Notebooks and lay out the design of auditing Jupyter to gain better visibility into such attacks.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Abstract
Task Parallelism
W
DescriptionFortran 2023, with its "do concurrent" and coarray parallel programming features, displaces many uses of extra-language parallel programming models such as MPI, OpenMP, and OpenACC. The Cray, Intel, LFortran, LLVM, and NVIDIA compilers automatically parallelize do concurrent in shared memory. The Cray, Intel, and GNU compilers support coarrays in shared- and distributed-memory, while the NAG compiler supports coarrays in shared memory. Thus, language-based parallelism is emerging as a portable alternative to MPI+X.
This talk will present experiences with automatic "do concurrent" parallelization in the deep learning library Inference-Engine and coarray communication in the Intermediate Complexity Atmospheric Research (ICAR) model.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionWe present K-Foundry, a framework that enables the integration of the Simple Linux Utility for Resource Management (SLURM) with Kubernetes (K8s) via a Kubernetes-like Control Plane (KCP). Our implementation seamlessly provides a unified communication and scheduling layer for a fleet of multiple diverse computing platforms. While SLURM and K8s traditionally support distinct scheduling models, they have recently started aligning their objectives by moving towards a more converged execution environment; for example, SLURM, like K8s, has begun providing container execution support. In this work, we aim to provide support for high-performance computing (HPC) workloads in containerized environments in a seamless fashion for a fleet of distinct schedulers. The goal is to enhance the way we interact with the underlying infrastructure to meet growing computational demands.
Paper
Distributed Computing
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
TP
DescriptionThe Message-Passing Interface (MPI) and C++ form the backbone of high-performance computing, but MPI only provides C and Fortran bindings.
While this offers great language interoperability, high-level programming languages like C++ make software development quicker and less error-prone.
We propose novel C++ language bindings that cover all abstraction levels from low-level MPI calls to convenient STL-style bindings, where most parameters are inferred from a small subset of parameters, by bringing named parameters to C++.
This enables rapid prototyping and fine-tuning runtime behavior and memory management.
A flexible type system and additional safety guarantees help to prevent programming errors.
By exploiting C++'s template metaprogramming capabilities, this has (near) zero overhead, as only required code paths are generated at compile time.
We demonstrate that our library is a strong foundation for a future distributed standard library using multiple application benchmarks, ranging from textbook sorting algorithms to phylogenetic inference.
Workshop
I/O, Storage, Archive
W
DescriptionIn 2020, Meta changed the way we did AI training. We moved to a synchronous training approach to power our recommendation systems. This pivot required us to build high-speed, low-latency RDMA networks to interconnect GPUs. Over the years, Meta has built some of the largest AI clusters in the world to train increasingly complex models that support rich user experiences. We initially built with Ethernet as our interconnect and later also onboarded InfiniBand into production. Model complexity and scale have recently increased by an order of magnitude with the evolution of GenAI, highlighted by our Llama series of foundation models. In this talk, we will take you through the evolution of Meta's AI network and communication library software over the past five years. We will talk about the problems we ran into as we scaled this infrastructure and how we customized our training systems software stack to work through them. We will highlight the changes we made to the scheduling, collective communication, sharding, and network transport layers to keep our clusters performant from a communication perspective.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
DescriptionKeynote at the INDIS Workshop
Workshop
Applications and Application Frameworks
W
DescriptionUsing the UK's national exascale AI research resource, Isambard-AI, as an example, this talk examines motivating user stories for interactive and urgent HPC with large language models (LLMs) and AI for science to uncover potential synergies. Adopting a cloud-native approach for resource management, scheduling, virtualization, and software packaging, delivery, and deployment has helped reduce access barriers for non-traditional HPC or AI users, while preserving the familiar interface of classic HPC platforms. Isambard-AI is based on the HPE Cray EX4000 system and housed in a new, energy-efficient Modular Data Centre in Bristol, UK. With nearly 5,500 NVIDIA Grace Hopper GH200 superchips, it delivers over 21 exaFLOP/s of 8-bit AI performance and over 250 petaFLOP/s of 64-bit performance, for under 5 MW.
Birds of a Feather
TP
XO/EX
DescriptionThe open-standard SYCL programming model provides a portable way to program heterogeneous systems. SYCL's abstractions and features for HPC have driven its increased use in application domains that rely on GPU-accelerated Top500 machines such as Aurora and Dawn, including fusion energy, molecular dynamics, aerospace, and AI.
In this session, we will bring together the community of everyone using and developing SYCL applications and implementations. We will discuss future directions and seek feedback on priorities for the next version of SYCL. A panel of SYCL experts, runtime/compiler implementers, and application specialists will lead an audience discussion and Q&A.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionThis lightning talk will explore our approach to reallocating compute nodes in an HPC system managed by Slurm, converting them from "batch" nodes to "cloud" nodes within a Kubernetes resource. We developed and implemented these methods on the Anvil ACCESS resource to support large-scale educational and training workshops. By temporarily shifting batch nodes to cloud nodes, we overcame capacity limitations on Anvil's Kubernetes infrastructure, enabling one workshop to scale up to 75 computing sessions, a 3.5x increase over what was possible with their allocation. We will give an overview of Anvil's xCAT + masterless puppet configuration management stack and introduce a script that facilitates the conversion of HPC batch nodes to Anvil Kubernetes nodes and back, providing flexible resource management through command-line options. While each institution's software stack is unique, our experience and guidelines offer a foundation that others can adapt to achieve similar success in their own environments.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge language model (LLM) deployment necessitates high inference throughput due to the increasing demand for text generation. To accelerate inference, the prefill mechanism avoids repeated computations by introducing a KV cache (in HBM). However, the KV cache size increases with the input and generated text length, causing insufficient GPU memory and slow KV fetching. To address these issues, existing approaches compress the KV cache using prune-based mechanisms that keep only the important KV vectors in the cache. However, their compression ratio is limited because it is necessary to preserve inference accuracy in the accuracy-compression ratio tradeoff. To improve the compression ratio, we introduce KVSort, a novel framework that utilizes error-bounded lossy compression on sorted KV vectors. The evaluation shows that KVSort achieves up to 52x compression ratio and 6.8x end-to-end inference performance improvement, compared to a state-of-the-art approach that achieves 20x compression ratio and 5.5x end-to-end inference throughput.
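The following toy sketch illustrates the underlying idea of compressing sorted vectors with an error-bounded method, using plain uniform quantization in NumPy as a stand-in; KVSort's actual compressor, GPU integration, and cache management are not represented.

```python
# Toy illustration: sort values, then apply error-bounded uniform quantization.
# KVSort's real compressor, GPU kernels, and cache management are not shown.
import numpy as np

def compress_sorted(values, error_bound):
    order = np.argsort(values)                 # remember original positions
    sorted_vals = values[order]
    # Uniform quantization with step 2*error_bound keeps |x - x'| <= error_bound.
    codes = np.round(sorted_vals / (2.0 * error_bound)).astype(np.int32)
    return order, codes

def decompress_sorted(order, codes, error_bound):
    sorted_vals = codes.astype(np.float32) * (2.0 * error_bound)
    restored = np.empty_like(sorted_vals)
    restored[order] = sorted_vals              # undo the sort permutation
    return restored

kv = np.random.randn(1 << 16).astype(np.float32)
order, codes = compress_sorted(kv, error_bound=1e-2)
restored = decompress_sorted(order, codes, 1e-2)
print(float(np.max(np.abs(kv - restored))))    # stays within the error bound
```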
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionMany practical turbulent flow phenomena are naturally studied using a Lagrangian approach that treats the fluid medium as a collection of infinitesimal fluid particles. We present a GPU-accelerated algorithm for tracking particles in direct numerical simulations of isotropic turbulence, scaling up to 32768^3 using the world's first exascale computer (Frontier). Cubic spline interpolation is used to compute the particle velocity as the particles wander among sub-domains held by different parallel processes. We use a programming model that minimizes host-device data transfer by leveraging memory parity between the CPU and GPU, reduces communication costs through a local decomposition for the particles, and uses OpenMP offloading on the GPU to accelerate the computation of cubic spline coefficients. The result is an algorithm shown to attain good weak scaling and strong scaling at problem sizes close to the capacity supported by the machine, at a cost nearly independent of the particle count.
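As a conceptual aside, cubic spline interpolation of particle velocities from grid samples can be sketched in one dimension with SciPy; the paper's implementation is three-dimensional, distributed, and GPU-resident, so this is only an illustration of the interpolation step.

```python
# 1D conceptual sketch of evaluating particle velocities by cubic spline
# interpolation of grid values; the actual solver is 3D and GPU-resident.
import numpy as np
from scipy.interpolate import CubicSpline

grid_x = np.linspace(0.0, 2.0 * np.pi, 64)        # grid coordinates
grid_u = np.sin(grid_x)                            # velocity samples on the grid

spline = CubicSpline(grid_x, grid_u)

particle_x = np.random.uniform(0.0, 2.0 * np.pi, size=1000)
particle_u = spline(particle_x)                    # interpolated particle velocities

print(float(np.max(np.abs(particle_u - np.sin(particle_x)))))  # small error
```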
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
DescriptionThe discussion around "safe" programming languages has significantly increased in recent years. The White House Office of the National Cyber Director released a report in February 2024 calling on the technical community to work towards proactively reducing attack surfaces in cyberspace, in part, specifically by adopting memory-safe programming languages.
We introduce Lamellar, an asynchronous tasking and PGAS runtime system for HPC written in Rust, one such "memory-safe" language. We describe the entire Lamellar stack, from network interfaces to safe abstractions such as distributed LamellarArrays and Active Messages. The goal of our runtime is to enable end-users to develop entirely safe Rust code in their applications, limiting the use of any "unsafe" code blocks to rigorously tested code blocks within the runtime itself. We conclude by showing comparable performance against several C, C++, and Chapel implementations of a subset of the BALE kernel suite while maintaining strong memory safety principles.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionThis paper presents Laminar 2.0, an enhanced serverless framework for running dispel4py streaming workflows. Building on Laminar, this version introduces improved dependency management, client-server functionality, and advanced deep learning models for semantic search. Key innovations include a structural code-to-code search using simplified parse syntax trees (SPTs) for detecting similar Processing Elements (PEs) or workflows, even from incomplete code. Additionally, Laminar 2.0 optimizes text-to-code search through better pre-processing of PEs. Our evaluation shows significant performance improvements over the previous version.
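A minimal sketch of structural code-to-code matching, assuming Python source and a crude node-type signature built with the ast module; Laminar 2.0's simplified parse syntax trees and learned similarity models are substantially more sophisticated.

```python
# Toy structural comparison of two code snippets via node-type multisets
# derived from their parse trees; a stand-in for Laminar 2.0's SPT search.
import ast
from collections import Counter

def node_type_profile(source):
    """Multiset of AST node types, a crude 'simplified parse tree' signature."""
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

def structural_similarity(src_a, src_b):
    a, b = node_type_profile(src_a), node_type_profile(src_b)
    shared = sum((a & b).values())
    total = sum((a | b).values())
    return shared / total if total else 1.0

pe_a = "def scale(xs, k):\n    return [x * k for x in xs]\n"
pe_b = "def shift(values, d):\n    return [v + d for v in values]\n"
print(round(structural_similarity(pe_a, pe_b), 3))   # structurally near-identical
```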
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis study explores hyperparameter optimization for encoder-only genomics large language models, balancing machine learning performance, hardware resource usage and power consumption. Multiple search techniques and objective functions were implemented, and from the results we obtained two types of models: a Compact and an Optimal model. Comprehensive tests were carried out to rank these models based on their model performance and resource usage.
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Modeling and Simulation
Numerical Methods
TP
DescriptionAnomaly detection in computational workflows is critical for ensuring system reliability and security. However, traditional rule-based methods struggle to detect novel anomalies. This paper explores leveraging large language models (LLMs) for workflow anomaly detection by exploiting their ability to learn complex data patterns. Two approaches are investigated: 1) supervised fine-tuning (SFT), where pre-trained LLMs are fine-tuned on labeled data for sentence classification to identify anomalies, and 2) in-context learning (ICL), where prompts containing task descriptions and examples guide LLMs in few-shot anomaly detection without fine-tuning. The paper evaluates the performance, efficiency, and generalization of SFT models, and explores zero-shot and few-shot ICL prompts and interpretability enhancement via chain-of-thought prompting. Experiments across multiple workflow datasets demonstrate the promising potential of LLMs for effective anomaly detection in complex executions.
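To make the in-context learning setup concrete, the sketch below assembles a few-shot prompt from labeled workflow records; the record fields, labels, and wording are invented for illustration and are not taken from the paper.

```python
# Illustrative few-shot prompt construction for workflow anomaly detection.
# Record fields and labels are made up for the example.
examples = [
    ("job=genome_align runtime=512s exit=0 io_wait=3%", "normal"),
    ("job=genome_align runtime=9870s exit=137 io_wait=61%", "anomalous"),
]

def build_icl_prompt(query_record):
    lines = ["Classify each workflow execution record as normal or anomalous.", ""]
    for record, label in examples:
        lines.append(f"Record: {record}")
        lines.append(f"Label: {label}")
        lines.append("")
    lines.append(f"Record: {query_record}")
    lines.append("Label:")
    return "\n".join(lines)

prompt = build_icl_prompt("job=variant_call runtime=8450s exit=1 io_wait=58%")
print(prompt)   # this string would be sent to the LLM for completion
```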
Exhibitor Forum
Parallel Programming Methods, Models, Languages and Environments
TP
XO/EX
DescriptionWe describe an effort to compute pi to high precision using a distributed memory approach on a compute cluster. The work is an example of how one can use parallelism and the aggregate RAM of a cluster to greatly accelerate a computation that would otherwise use external storage to hold state. The cost of our approach is some programming complexity and the need for significant hardware resources (albeit for short periods of time) to realize the run-time saving on large-scale problems. The main goal motivating this work is the creation of a holistic benchmark that can be used to test a supercomputer in several critical dimensions: compute cores, memory, network, and storage. We expect that the code produced by this effort leads to fruitful interaction with the many talented people engaged in scientific computing efforts across the country.
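As a toy illustration of splitting a long series across workers (not the authors' high-precision algorithm or distributed-memory design), the sketch below partitions the terms of a simple series for pi across processes with Python's multiprocessing module.

```python
# Toy illustration: partition the terms of a series for pi across processes
# and sum the partial results. Not the authors' high-precision algorithm.
from multiprocessing import Pool

def partial_sum(bounds):
    start, stop = bounds
    # Leibniz series: pi/4 = 1 - 1/3 + 1/5 - ...
    return sum((-1.0) ** k / (2 * k + 1) for k in range(start, stop))

def estimate_pi(num_terms, workers=4):
    chunk = num_terms // workers
    ranges = [(i * chunk, (i + 1) * chunk) for i in range(workers)]
    with Pool(workers) as pool:
        return 4.0 * sum(pool.map(partial_sum, ranges))

if __name__ == "__main__":
    print(estimate_pi(4_000_000))   # approx 3.14159...
```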
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge, diverse datasets of executable programs are required for training and running machine learning models to find insights in program performance. While many open-source code repositories exist freely on popular software development websites such as GitHub, the safety and executability of such programs cannot be guaranteed. To bridge this gap, this study proposes LLMRPG (Large Language Model-based Randomized Program Generator), a program generator that harnesses open-source large language models (LLMs) fine-tuned for code generation to generate error-free, executable, and human-like programs on demand. The performance of LLMRPG was evaluated across popular open-source LLMs using heuristics such as the semantic similarity between programs, and the proportion of compilable and executable programs generated by LLMRPG. Analysis on the programs generated by LLMRPG demonstrates that these programs have satisfactory compilability and executability, as well as high diversity.
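A compilability and executability screen of the kind described can be sketched for Python candidates as follows; LLMRPG's actual checks, target languages, and sandboxing are not shown, so treat the helpers here as assumptions.

```python
# Sketch of screening generated programs for compilability and executability.
# Real pipelines would add sandboxing, resource limits, and language-specific steps.
import os
import subprocess
import sys
import tempfile

def is_compilable(source):
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def is_executable(source, timeout=5):
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        os.unlink(path)

candidate = "def fib(n):\n    return n if n < 2 else fib(n-1) + fib(n-2)\nprint(fib(10))\n"
print(is_compilable(candidate), is_executable(candidate))
```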
Exhibitor Forum
Emerging Technologies
Hardware Technologies
TP
XO/EX
DescriptionDespite significant advancements in high-performance computing, certain NP-hard problems remain computationally infeasible, even for the most advanced supercomputers. While heuristics and approximation algorithms offer faster, often near-optimal solutions, many applications demand improved accuracy or real-time processing capabilities.
LightSolver introduces a novel quantum-inspired computing paradigm designed to address the scalability and performance limitations of NP-hard problems. Its Laser Processing Unit™ (LPU) employs lasers for parallel processing at the speed of light, rivaling the performance and scalability of current supercomputers.
In this talk, we will showcase the integration of LightSolver with Ansys’ LS-DYNA multi-physics simulation software to optimize reordering for sparse matrix factorization, significantly improving the efficiency and scalability of implicit mechanical simulations.
We will also discuss its application in feature selection for machine learning, where our platform was benchmarked against classical approximation methods. LightSolver demonstrated superior accuracy, even with small data sets, thereby enhancing predictive performance.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image was created using ParaView, from data computed on the Aurora supercomputer by the HACC collaboration.
Paper
Heterogeneous Computing
Linear Algebra
Network
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TP
DescriptionPerformance modeling is an essential tool in many areas, including performance characterization/optimization, design space exploration, and resource allocation problems, to name a few. However, existing performance modeling approaches have limitations, such as high computational cost for discrete-event simulators, narrow flexibility of hardware emulators, or restricted accuracy/generality of analytical/data-driven models.
To address these limitations, this paper proposes PerfVec, a novel deep-learning-based performance modeling framework that learns high-dimensional and independent/orthogonal program and microarchitecture representations. Once learned, a program representation can be used to predict its performance on any microarchitecture, and likewise, a microarchitecture representation can be applied in the performance prediction of any program. Additionally, PerfVec yields a foundation model that captures the performance essence of instructions, which can be directly used by developers in numerous performance modeling-related tasks without incurring its training cost. The evaluation demonstrates that PerfVec is more general and efficient than previous approaches.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionMany High-Performance Computing (HPC) applications have large code bases written in legacy Fortran. Porting these applications to C++ enables us to leverage GPU programming frameworks such as Kokkos and AMReX but can be time-intensive. We present our use of three available LLM-powered code converters (ChatGPT, CodeConvert, and DeepAI) to expedite this process. Our findings indicate that CodeConvert produces superior results compared to ChatGPT and DeepAI, requiring only minor adjustments by the user. However, we note that particular care must be taken with preprocessing directives, as all the converters tend to omit them when converting longer functions. Finally, we demonstrate that CodeConvert produces bit-for-bit identical simulation results when porting MYNN-EDMF, a widely used climate subgrid model, to C++. By showcasing the effectiveness of this approach, we highlight that readily available LLM converters can be effectively used to accelerate the optimization of Fortran applications.
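A bit-for-bit comparison of simulation outputs from the original and ported codes can be done by hashing the binary output files, as in the sketch below; the file names are placeholders, not artifacts from the paper.

```python
# Sketch: verify that two simulation runs produced bit-identical output files.
# File paths are placeholders for the Fortran and C++ runs' outputs.
import hashlib

def file_digest(path, chunk_size=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def bit_for_bit_identical(path_a, path_b):
    return file_digest(path_a) == file_digest(path_b)

# Example usage (hypothetical output files):
# print(bit_for_bit_identical("output_fortran.bin", "output_cpp.bin"))
```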
Workshop
Applications and Application Frameworks
W
DescriptionIn this talk, I will explore how the DOE Joint Genome Institute (JGI) advances large-scale genomics data generation while connecting the scientific community beyond sequence production. JGI can enable real-time exploration of vast datasets through collaborations with platforms such as the National Microbiome Data Collaborative (NMDC) and KBase. This is made possible by the infrastructure we are building to ensure that workflow analysis is portable, reproducible, and shareable.
The JGI Analysis Workflow Service (JAWS), a scalable, centralized framework, is at the core of this infrastructure. JAWS was developed to streamline the execution and management of computational workflows across DOE resources by simplifying complex workflows in DOE high-performance computing (HPC) clusters and cloud environments.
This approach empowers scientists to rapidly identify patterns or filter results, leveraging HPC to expedite time-sensitive analyses. JGI’s infrastructure is becoming increasingly relevant to support urgent scientific efforts, from bioenergy research to addressing climate resilience.
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionLoad balancing (LB) is a challenge for parallel applications in High Performance Computing (HPC). Depending on various constraints, LB is an optimization problem. This paper focuses on the context of a given task distribution in distributed memory systems, where load imbalance might happen at runtime due to a weak performance model. In this imbalance context, LB refers to the Load Rebalancing Problem (LRP). Tasks should be migrated from one machine to another to improve the load. Our paper presents a formulation of LRP to be solved in a hybrid classical-quantum approach. We compare the quantum-based methods with the classical methods using heuristic algorithms. The experiments revolve around imbalance ratio and speedup based on the results of the applied methods, where the number of migrated tasks is a concern because task migration overhead is expensive. The quantum-based methods show positive performance gain even better than classical methods.
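For contrast with the quantum formulation, one classical greedy heuristic for load rebalancing can be sketched as follows; the task costs and migration rule are illustrative assumptions rather than the heuristics evaluated in the paper.

```python
# Classical greedy heuristic for load rebalancing: repeatedly migrate a task
# from the most loaded machine to the least loaded one while it helps.
def rebalance(loads, max_migrations=10):
    """loads: list of per-machine task-cost lists; returns migration count."""
    migrations = 0
    for _ in range(max_migrations):
        totals = [sum(m) for m in loads]
        src = max(range(len(loads)), key=totals.__getitem__)
        dst = min(range(len(loads)), key=totals.__getitem__)
        if not loads[src]:
            break
        task = min(loads[src])         # cheapest task: low migration overhead
        if not (0 < task < totals[src] - totals[dst]):
            break                      # migration would not reduce the imbalance
        loads[src].remove(task)
        loads[dst].append(task)
        migrations += 1
    return migrations

machines = [[8, 7, 6, 5], [2, 1], [3, 2]]
moved = rebalance(machines)
print(moved, [sum(m) for m in machines])   # e.g. 2 migrations, loads [15, 8, 11]
```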
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionProgrammable data planes have provided great flexibility in defining the behaviors of packet-forwarding switches, routers, and network interface cards (NICs). The In-band Network Telemetry (INT) technology has further increased network operators' potential to manage packet flows by enabling real-time and customizable monitoring of packets without creating much overhead on the network. These recent advancements in networking technology have generated significant research interest and activity, including studies on INT-based DDoS detection and mitigation mechanisms. However, in practice, INT technology has not been fully realized yet, especially in detecting network anomalies in real time. In this paper, we aim to implement a holistic real-time INT-based DDoS detection mechanism. The proposed mechanism will retrieve INT data from the network, analyze it using machine learning (ML) models in real time, and send the information to the control plane. We will also compare the performance of using INT to detect DDoS attacks against sFlow-based detection.
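The machine-learning detection step can be illustrated with a small scikit-learn sketch over synthetic per-flow features; the feature set, data, and model are invented for this example and do not represent the paper's INT pipeline.

```python
# Toy classifier over synthetic per-flow features (packet rate, mean size,
# distinct sources); a stand-in for the paper's INT-driven detection models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
normal = np.column_stack([rng.normal(200, 50, 500),     # packets/s
                          rng.normal(800, 100, 500),    # mean packet size
                          rng.normal(20, 5, 500)])      # distinct sources
attack = np.column_stack([rng.normal(5000, 800, 500),
                          rng.normal(120, 30, 500),
                          rng.normal(900, 100, 500)])
X = np.vstack([normal, attack])
y = np.array([0] * 500 + [1] * 500)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
print(clf.predict([[4500.0, 130.0, 850.0]]))   # -> [1], flagged as DDoS-like
```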
Workshop
Broader Engagement
Education
Inclusivity
W
DescriptionWe document an interactive half-day tutorial in which participants explore the advanced applications of National Science Data Fabric (NSDF) services and strategies for comprehensive scientific data analysis. Targeting researchers, students, developers, and scientists, the tutorial provides valuable insights into managing and analyzing large datasets, particularly those exceeding 100TB. Participants gain hands-on experience by constructing modular workflows, leveraging public and private data storage and streaming solutions, and deploying sophisticated visualization and analysis dashboards.
Our tutorial includes an overview of NSDF’s capabilities, addressing common data analysis challenges, and intermediate hands-on exercises using NSDF services for Earth science data. Advanced applications cover handling and visualizing massive datasets requiring high-resolution data management. Attendees gain a deeper understanding of integrating NSDF services into their research workflows, enhancing data accessibility, sharing, and collaborative scientific discovery. This tutorial advances knowledge in data-intensive computing and empowers participants to harness the full potential of NSDF in their respective fields.
Exhibitor Forum
I/O, Storage, Archive
TP
XO/EX
DescriptionOpen standards, such as the NFS and NAS protocols and ANSI T10's SCSI object-based storage device (OSD) command sets, facilitate interoperability, community support, and vendor neutrality. We propose a standardized object-based computational storage (OCS) system amid the growing usage of object storage, the increasing bottlenecks from excessive data movement, the absence of standardization for data reduction functions within storage, and the shift from the SCSI interface to NVMe. The OCS system, a result of collaboration with Los Alamos National Laboratory, performs computation in the storage itself and reduces data movement by transmitting only the results of the computation. This system supports a high-level OCS interface for object management and query pushdown, alongside a low-level OCS device (OCSD) command set for device-level object store and query processing. This talk will discuss the rationale behind the OCS architecture, its integration with existing analytics systems, the current prototype implementation, and early analytics acceleration results based on real-world HPC workloads.
Tutorial
Accelerators
Architecture
Emerging Technologies
Network
TUT
DescriptionThe past few years have witnessed increased support for programmable network adapters, known as "SmartNICs", that offer additional functionalities beyond standard packet processing capabilities. These devices often feature programmable lightweight processing cores, FPGAs, and even CPU- and GPU-based platforms capable of running separate operating systems. Their primary target has been data center operations, such as infrastructure management, packet filtering, and I/O acceleration, but they are increasingly being explored for high-performance computing (HPC) application acceleration.
This tutorial offers an in-depth exploration of the state of the art for SmartNICs and the emerging software ecosystems supporting them. Attendees will engage in hands-on exercises to better understand how to take advantage of SmartNICs for application acceleration, including MPI collective operation offloading, OpenMP offload, system security, file I/O, and algorithmic modifications to maximize on-board processing power. Participants will have the opportunity to execute these exercises using cutting-edge SmartNICs like NVIDIA's BlueField-3 Data Processing Unit (DPU). The tutorial presenters will discuss additional techniques for optimizing applications to harness SmartNICs as communication accelerators in HPC systems.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionThe rapid evolution of quantum hardware is propelling quantum computing to new frontiers. Nonetheless, the potential of natural language processing in the quantum paradigm (QNLP) is yet to be explored, including for Noisy Intermediate-Scale Quantum (NISQ) machines. To explore the QNLP frontier, we introduce LexiQL, a novel noise-aware QNLP technique for text classification on NISQ quantum machines. LexiQL employs an incremental data injection approach to process textual data in a quantum circuit. It also develops new and effective training methods, such as leveraging a diverse mix of expressible and shallow quantum circuits for the QNLP task of text classification. Our extensive evaluation using Yelp, IMDB, and Amazon datasets (along with synthetic QNLP datasets) demonstrates the effectiveness of LexiQL's noise-aware design in both ideal and noisy environments.
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionScientific communities are increasingly using geographically distributed computing platforms. The current methods of compute placement predominantly use logically centralized controllers such as Kubernetes (K8s) to match tasks to available resources. However, this centralized approach is unsuitable in multi-organizational collaborations. Furthermore, workflows often need to use manual configurations tailored for a single platform and cannot adapt to dynamic changes across infrastructure.
Our work introduces a decentralized control plane for placing computations on geographically dispersed compute clusters using semantic names. We assign semantic names to computations to match requests with named Kubernetes (K8s) service endpoints. We show that this approach provides multiple benefits. First, it allows placement of computational jobs to be independent of location, enabling any cluster with sufficient resources to execute the computation. Second, it facilitates dynamic compute placement without requiring prior knowledge of cluster locations or predefined configurations.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
Workshop
Lightning Talk: In-Situ Temperature Profiling for Determining Detonation in White Dwarf Environments
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
DescriptionThe temperature profile in white dwarf (WD) mergers plays a critical role in determining where and how detonations occur in stellar simulations. Existing methods, which primarily focus on maximum and ambient temperatures, often lack the precision needed for comprehensive detonation analysis. To address this, we developed a novel temperature profiling method that enhances accuracy by incorporating additional data, such as the size and average distance of hotspots on the star’s surface. This approach enables a more detailed understanding of detonation conditions. By implementing real-time, in-situ temperature profiling, we minimize the computational overhead typically caused by input/output (I/O) operations and data modeling, with the overhead ranging from 0.11% to 2.28% of total simulation time. In testing with Castro's wdmerger simulation, our method produced results with a significant speed-up of 8.55x in execution time of temperature profiling over Seitenzahl's method while maintaining the same level of accuracy among multiple WD detonation cases.
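The hotspot statistics mentioned above (region sizes and average separation) can be sketched with NumPy and SciPy on a synthetic 2D temperature field; the threshold and data are assumptions, and the real implementation operates in situ on simulation state.

```python
# Sketch: label connected hot regions in a temperature field and report their
# sizes and mean pairwise centroid distance. Synthetic data, not Castro output.
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(1)
temp = rng.normal(1e8, 1e7, size=(128, 128))          # background temperatures
temp[20:24, 30:34] = 3.0e9                             # two synthetic hotspots
temp[90:93, 100:103] = 2.5e9

hot = temp > 1e9                                        # illustrative threshold
labels, count = ndimage.label(hot)
sizes = ndimage.sum(hot, labels, index=range(1, count + 1))
centroids = np.array(ndimage.center_of_mass(hot, labels, range(1, count + 1)))

pair_dists = [np.linalg.norm(centroids[i] - centroids[j])
              for i in range(count) for j in range(i + 1, count)]
print(count, sizes.tolist(), float(np.mean(pair_dists)) if pair_dists else 0.0)
```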
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionTriangle counting and enumeration are commonly used in real-world applications on directed graphs. However, the performance of triangle counting algorithms is usually benchmarked on undirected graphs. As such, many of these algorithms and formulations are not suitable for identifying the types of directed triangles in directed graphs. In this work, we show how algorithms for counting each type of directed triad (directed triangle) can be formulated using linear algebra. Leveraging the FLAME methodology, we show that provably correct counting and enumeration algorithms for directed triads can be derived from the linear algebraic formulation. These algorithms can be used to count individual triads or together to count all possible triads. We show that despite being designed for individual use, the combined use of these algorithms yields a geometric mean speedup of 92.69x and 2.86x over the implementations in NetworkX and GraphBLAS (SuiteSparse 7.6), respectively, on various workloads from real-world directed graphs.
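One of the linear-algebraic counts is easy to state directly: for an adjacency matrix A with no self-loops, the number of cyclic directed triangles is trace(A^3)/3. The NumPy check below verifies this identity on a tiny graph; the paper's FLAME-derived algorithms cover all triad types and avoid dense matrix powers.

```python
# Count cyclic directed triangles (i -> j -> k -> i) via trace(A^3) / 3.
# Dense and purely illustrative; real triad-counting kernels use sparse algebra.
import numpy as np

A = np.zeros((4, 4), dtype=np.int64)
edges = [(0, 1), (1, 2), (2, 0),        # one cyclic triangle
         (0, 3), (3, 1)]                # extra edges that close no 3-cycle
for i, j in edges:
    A[i, j] = 1

cyclic_triangles = np.trace(np.linalg.matrix_power(A, 3)) // 3
print(int(cyclic_triangles))   # -> 1
```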
Exhibitor Forum
Facilities
Green Computing
TP
XO/EX
DescriptionThe current and next generation of high-performance CPUs and GPUs required to meet the demanding needs of AI and machine learning will generate a significant amount of heat, creating enormous challenges for hyperscalers already struggling to control their heat, energy consumption and footprint. With GPUs surpassing the 2,800 watt range, traditional air cooling methods have reached their limits, making way for innovative liquid cooling solutions such as immersive and direct-to-chip. This Exhibitor Forum presentation will provide an overview of two-phase direct-to-chip liquid cooling, highlighting how this technology can not only enhance the performance and reliability of high-performance AI servers, but also significantly contribute to sustainable data center operations and the industry-wide march towards net-zero carbon emissions. A case study of using a two-phase direct-to-chip monolithic cold plate designed to cool GB200 will be introduced.
Paper
Heterogeneous Computing
Linear Algebra
Network
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TP
Best Paper Finalist
DescriptionThe shift towards high-bandwidth networks driven by AI workloads in data centers and HPC clusters has unintentionally aggravated network latency, adversely affecting the performance of communication-intensive HPC applications. Given the significant difference in network latency tolerance among MPI applications, accurately determining an application's latency resilience is crucial. Traditional methods for assessing this metric, relying on specialized hardware or simulators, tend to be inflexible and time-consuming. In response, we introduce LLAMP, a novel toolchain utilizing the LogGPS model and linear programming to analytically evaluate HPC applications' network latency tolerance. LLAMP equips software developers and network architects with essential insights for optimizing HPC infrastructures and strategically deploying applications to minimize latency impacts. We validate our toolchain across various applications, such as MILC, LULESH, and LAMMPS. Additionally, we include a case study with the ICON climate model, underscoring LLAMP's utility in improving the design and optimization of future HPC systems and applications.
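For readers unfamiliar with the underlying model family, the basic LogGP estimate for delivering a single k-byte message is L + 2o + (k-1)G, evaluated in the sketch below; LLAMP's LogGPS-based analysis and linear programs go well beyond this single-message formula, and the parameter values shown are illustrative.

```python
# Basic LogGP single-message cost: latency L, per-message overhead o (sender
# and receiver), and gap per byte G. LLAMP's LogGPS analysis is richer than this.
def loggp_message_time(k_bytes, L, o, G):
    """Estimated end-to-end time to deliver one k-byte message."""
    return L + 2 * o + (k_bytes - 1) * G

# Illustrative parameters (seconds): 1 us latency, 0.5 us overhead, 0.05 ns/byte.
print(loggp_message_time(64 * 1024, L=1e-6, o=5e-7, G=5e-11))
```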
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionLarge language models (LLMs) have propelled groundbreaking advancements across several domains and are commonly used for text generation applications. However, the computational demands of these complex models pose significant challenges, requiring efficient hardware acceleration. Benchmarking the performance of LLMs across diverse hardware platforms is crucial to understanding their scalability and throughput characteristics. We introduce LLM-Inference-Bench, a comprehensive benchmarking suite to evaluate the hardware inference performance of LLMs. We thoroughly analyze diverse hardware platforms, including GPUs from Nvidia and AMD, and specialized AI accelerators Intel Habana and SambaNova. Our evaluation includes several LLM inference frameworks and models from LLaMA, Mistral, and Qwen families with 7B and 70B parameters. Our benchmarking results reveal the strengths and limitations of various models, hardware platforms, and inference frameworks. We provide an interactive dashboard to help identify configurations for optimal performance for a given hardware platform.
Paper
Algorithms
Artificial Intelligence/Machine Learning
Data Movement and Memory
Graph Algorithms
TP
DescriptionAs Large Language Models (LLMs) are rapidly growing in popularity, LLM inference services must be able to serve requests from thousands of users while satisfying performance requirements. The performance of an LLM inference service is largely determined by the hardware onto which it is deployed, but understanding of which hardware will deliver on performance requirements remains challenging. In this work we present LLM-Pilot - a first-of-its-kind system for characterizing and predicting performance of LLM inference services. LLM-Pilot performs benchmarking of LLM inference services, under a realistic workload, across a variety of GPUs, and optimizes the service configuration for each considered GPU to maximize performance. Finally, using this characterization data, LLM-Pilot learns a predictive model, which can be used to recommend the most cost-effective hardware for a previously unseen LLM. Compared to existing methods, LLM-Pilot can deliver on performance requirements 33% more frequently, whilst reducing costs by 60% on average.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionLarge Language Models (LLMs) have advanced natural language processing but present challenges in training due to the complexity and resource demands. A significant issue is the inconsistency in checkpoint formats across pre-training libraries, complicating the use of pre-trained weights for continued training. To address this, we introduce llm-recipes, an open-source framework that streamlines the continual pre-training process by enabling direct use of Hugging Face Transformers checkpoints without conversion. This framework supports multi-node distributed training using PyTorch Fully Sharded Data Parallel (FSDP), enhancing scalability for large-scale models. Unlike existing tools, llm-recipes offers broader support for various model architectures and flexible training configurations, making it an adaptable solution for researchers and developers. Our experiments demonstrate its effective scalability, with high training throughput up to 64 GPUs, confirming its suitability for large-scale distributed training of LLMs.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionLarge Language Models (LLMs) are evolving and have revolutionized the landscape of software development. If done right, they can significantly accelerate the software development cycle. At the same time, the community is very cautious of models being trained on biased or sensitive data, which can lead to biased outputs along with the inadvertent release of confidential information. Additionally, the carbon footprint and the un-explainability of these "black box" models continue to raise questions about the usability of LLMs.
With the abundance of opportunities LLMs have to offer, this paper explores the idea of "judging" tests used to evaluate compiler implementations of directive-based programming models, as well as probing into the "black box" of LLMs. Based on our results, utilizing an agent-based prompting approach and setting up a validation pipeline structure drastically increased the quality of DeepSeek Coder, the LLM chosen for evaluation purposes.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge language models (LLMs) have achieved remarkable success in various natural language processing tasks. However, LLM inference is highly computational and memory-intensive, creating extreme deployment challenges. Tensor offloading, combined with tensor quantization and asynchronous task execution, provides a feasible solution by utilizing host memory to enable large-scale LLM inference with a limited number of GPUs. However, existing approaches struggle to fully utilize all available computational and memory resources due to a lack of consideration for (1) whether to use quantization effectively, and (2) managing thread-level parallelism within and across tasks. As a result, these approaches provide suboptimal solutions. In this paper, we introduce LM-Offload, a framework that addresses the above challenges by leveraging performance modeling and parallelism control. Experimental results demonstrate that LM-Offload outperforms FlexGen and ZeRO-Inference, two state-of-the-art systems for LLM inference, by up to 2.95× (2.34× on average) and 2.88× (1.57× on average), respectively, in inference throughput.
Paper
Algorithms
Artificial Intelligence/Machine Learning
Heterogeneous Computing
Performance Optimization
TP
DescriptionThe adaptation of pre-trained LLMs to diverse downstream tasks through fine-tuning is essential for numerous applications. However, the inefficiency of parameter-efficient fine-tuning (PEFT) techniques presents significant challenges regarding time investments and operational costs. In this paper, we first introduce a nuanced form of sparsity, termed Shadowy Sparsity, which is distinctive in fine-tuning and has not been adequately addressed for acceleration. Under Shadowy Sparsity, we propose Long Exposure, an efficient system to accelerate PEFT for LLMs. Long Exposure comprises three key components: Shadowy-sparsity Exposer employs a prolonged sensing range to capture more sparsity details under shadowy sparsity; Sequence-oriented Predictor provides efficient yet accurate predictions to handle large-sequence inputs and constantly evolving parameters; and Dynamic-aware Operator facilitates more structured computational patterns and coalesced memory accesses to address dynamic sparse operations. Comprehensive evaluations demonstrate that Long Exposure outperforms state-of-the-arts with up to 2.49x speedup in end-to-end fine-tuning, offering promising advancements in PEFT acceleration.
Exhibitor Forum
Facilities
TP
XO/EX
DescriptionThe recent rise in high-power AI processors has largely been made possible through advances in cooling technology, with liquid solutions leading the way. Despite the extraordinary heat dissipation potential that liquid cooling provides to microprocessors, the global demand for compute performance has already surpassed the capabilities of entry-level liquid cooling technologies due to the compounding challenge of higher heat flux coinciding with lower allowable processor temperatures. One response has been a growing effort to decrease facility water temperatures using less efficient methods such as mechanical chillers. As processor thermal resistance targets plummet, the coolant distribution unit (CDU) is becoming a significant contributor to the temperature drop between the processor and facility water. In this session we will examine the cost-benefit analysis of the CDU along with the opportunities and challenges of removing it from the full data center heat transfer path.
Exhibits
Flash Session
TP
XO/EX
DescriptionAre you in the race to design and build AI-ready data centers with speed and precision to capture the market?
The impact of increasingly high compute processing is driving fundamental changes in data center designs, especially for Gen AI training data centers. Discover Schneider Electric's end-to-end data center solution offerings to support the global deployment of AI anywhere and at scale.
Our solutions span three key areas: energy strategy, procurement, and management; digital infrastructure across high power densities, including low- and medium-voltage switchgear, racks, UPSs, PDUs, cooling systems, building management systems, and management software; and our unique combination of sustainability leadership, sustainability consulting expertise, and data center domain expertise, which means we can support a holistic environmental sustainability strategy, program execution, and reporting.
Join us to discover more.
Paper
Accelerators
Algorithms
Data Compression
Linear Algebra
Tensors
TP
DescriptionStencil computations play a pivotal role in numerous scientific and industrial applications, yet their efficient execution on specialized hardware accelerators like Tensor Core Units (TCUs) remains a challenge. This paper introduces LoRAStencil, a novel stencil computing system designed to mitigate memory access redundancies on TCUs through low-rank adaptation. We first identify a nuanced form of this redundancy, dimension residue, specific to TCUs. LoRAStencil then leverages orchestrated mathematical transformations to decompose stencil weight matrices into smaller rank-1 matrices, facilitating efficient data gathering along residual dimensions. It comprises three key components: memory-efficient Residual Dimension Gathering to facilitate more data reuse, compute-saving Pyramidal Matrix Adaptation to exploit the inherent low-rank characteristics, and performance-boosting Butterfly Vector Swapping to circumvent all data shuffles. Comprehensive evaluations demonstrate that LoRAStencil addresses dimension residues effectively and outperforms state-of-the-art approaches with up to a 2.16x speedup, offering promising advancements for efficient tensorized stencil computation on TCUs via low-rank adaptation.
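For readers unfamiliar with the low-rank idea behind this line of work, the sketch below is a minimal NumPy illustration, not LoRAStencil's TCU implementation: the 3x3 stencil weights are hypothetical, and the point is only that an SVD splits a 2D stencil into rank-1 terms, each of which can be applied as two 1D sweeps.

```python
import numpy as np

# Hypothetical 3x3 stencil weights, chosen only for illustration.
W = np.array([[0.05, 0.10, 0.05],
              [0.10, 0.40, 0.10],
              [0.05, 0.10, 0.05]])

# SVD splits the weight matrix into rank-1 terms: W = sum_k s_k * u_k v_k^T.
U, S, Vt = np.linalg.svd(W)
rank = int(np.sum(S > 1e-12 * S[0]))   # numerical rank of the stencil

rng = np.random.default_rng(0)
grid = rng.random((256, 256))

# Reference: direct 2D stencil application with periodic boundaries.
ref = sum(W[i, j] * np.roll(np.roll(grid, 1 - i, axis=0), 1 - j, axis=1)
          for i in range(3) for j in range(3))

# Low-rank form: each rank-1 term becomes two 1D sweeps (one per dimension),
# which is the kind of data gathering/reuse a rank-1 decomposition exposes.
out = np.zeros_like(grid)
for k in range(rank):
    col = sum(S[k] * U[i, k] * np.roll(grid, 1 - i, axis=0) for i in range(3))
    out += sum(Vt[k, j] * np.roll(col, 1 - j, axis=1) for j in range(3))

print(np.allclose(ref, out))   # True: both forms compute the same stencil
```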
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
DescriptionWe have seen great advances in the hardware and software ecosystem for FPGAs in recent years, with the release of ever more powerful FPGAs containing specialised hardened components such as AI Engines for accelerating compute, and large investment in tooling, high-level synthesis, and libraries. However, there is still a disconnect between HPC developers, many of whom still write their codes in Fortran, and these architectures: running codes on them effectively currently requires redevelopment that demands significant time and expertise. In this talk I will describe our work leveraging MLIR to identify and seamlessly offload key computational components of a programmer's code to FPGAs and AIEs based upon the underlying algorithmic pattern. With the objective of requiring no code-level modifications by the programmer, our approach connects frontends such as Flang to AMD's MLIR-AIE dialects and HLS LLVM backend to deliver optimised execution on FPGAs and AIEs.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Birds of a Feather
TP
XO/EX
DescriptionLustre is the leading open-source and open-development file system for HPC. Approximately two thirds of the top 100 supercomputers use Lustre. It is a community-developed technology with contributors from around the world. Lustre currently supports many HPC infrastructures beyond scientific research, such as financial services, energy, manufacturing and life sciences. Lustre clients are available for broadly deployed instruction set architectures such as x86, POWER, and Arm.
At this BoF, Lustre developers, administrators, and solution providers will gather to discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in cloud environments.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThis paper examines the current status of Lustre’s support for the NVIDIA GH200 Grace Hopper Superchip, focusing on the performance implications of its 64KiB page size. Experimental results show notable improvements in I/O performance when utilizing larger page sizes.
Paper
Architecture
Codesign
Data Movement and Memory
Energy Efficiency
Green Computing
Linear Algebra
TP
DescriptionBeyond the high-profile artificial intelligence and machine learning (AI/ML) workloads, the demand for high-performance matrix operations on standard and complex floating-point numbers remains strong but underserved. However, the widely adopted low-precision matrix processing units (MXUs) only serve AI/ML workloads and sit underutilized or idle when running applications outside their target domains.
This paper presents M3XU, a multi-mode matrix processing unit that supports IEEE 754 single-precision and complex 32-bit floating-point numbers. M3XU does not rely on more precise but costly multipliers. Instead, it proposes a multi-step approach that extends existing MXUs for AI/ML workloads. The resulting M3XU can seamlessly upgrade existing systems without programmer effort and without increasing the bandwidth demand on existing memory subsystems. This paper evaluates M3XU with full-system emulation and hardware synthesis. M3XU achieves a 3.89x speedup for 32-bit matrix multiplications and a 3.8x speedup for complex number operations compared with conventional vector processing units.
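The abstract does not spell out the multi-step scheme, so purely as a rough illustration of the general idea behind such approaches (splitting each fp32 operand into low-precision pieces and combining several low-precision matrix products with fp32 accumulation), here is a hedged NumPy sketch using an fp16 split. The matrices, the split, and the three-product combination are illustrative assumptions, not the paper's method.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256)).astype(np.float32)
B = rng.standard_normal((256, 256)).astype(np.float32)

def split_fp16(X):
    """Split an fp32 matrix into a high fp16 part and an fp16 residual."""
    hi = X.astype(np.float16)
    lo = (X - hi.astype(np.float32)).astype(np.float16)
    return hi, lo

A_hi, A_lo = split_fp16(A)
B_hi, B_lo = split_fp16(B)

def mxu_matmul(X16, Y16):
    """Stand-in for a low-precision MXU: fp16 inputs, fp32 accumulation."""
    return X16.astype(np.float32) @ Y16.astype(np.float32)

# Multi-step emulation: three low-precision products approximate one fp32 GEMM
# (the lo*lo term is dropped as negligible).
C_multi = mxu_matmul(A_hi, B_hi) + mxu_matmul(A_hi, B_lo) + mxu_matmul(A_lo, B_hi)
C_naive = mxu_matmul(A_hi, B_hi)                 # single low-precision pass
C_ref = A.astype(np.float64) @ B.astype(np.float64)

def rel_err(C):
    return np.linalg.norm(C - C_ref) / np.linalg.norm(C_ref)

print(f"single-pass error: {rel_err(C_naive):.2e}, multi-step error: {rel_err(C_multi):.2e}")
```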
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe Advanced Particle-astrophysics Telescope (APT) is an orbital mission concept designed to contribute to multi-messenger observations of transient phenomena in deep space. APT will be uniquely able to detect and accurately localize short-duration gamma-ray bursts (GRBs) in the sky in real time. Current detection and analysis systems require resource-intensive ground-based computations; in contrast, APT will perform on-board analysis of GRBs, demanding analytical tools that deliver accurate results under severe size, weight, and power constraints.
In this work, we describe a neural network approach in our computation pipeline for GRB localization, demonstrating the capabilities of two neural networks: one to discard signals from background radiation, and one to estimate the uncertainty of GRB source direction constraints associated with individual gamma-ray photons. We validate the accuracy and computational efficiency of our networks using a physical simulation of GRB detection in the Antarctic Demonstrator for APT (ADAPT), a high-altitude balloon-borne prototype for APT.
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionOvarian cancer (OC) significantly impacts women's health, and despite its prevalence, remains without a definitive cure. Early detection is crucial for improving treatment outcomes and reducing mortality rates and healthcare system costs. Leveraging advancements in machine learning, our study seeks to empower physicians with tools for more confident and timely diagnosis. This study introduces a novel approach using machine learning to enhance early-stage OC diagnosis. We propose the Data Driven Diagnosis Framework (DDD), a new feature extraction and ensemble method that improves classification accuracy. Using models such as Random Forest, Logistic Regression, Decision Tree, Gradient Boosting Machine, Extreme Gradient Boosting Machine, and language models, our approach shows accuracy improvements of up to 14%-28% over state-of-the-art methods.
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionDeep Learning Recommendation Models (DLRMs) are widely deployed in industry, demanding memory capacities at the terabyte scale. Tiered memory architectures offer a cost-effective solution but introduce complexities in embedding-vector placement due to intricate access patterns. In this talk, we introduce RecMG, a machine learning (ML)-guided system for vector caching and prefetching in tiered memory environments. RecMG tackles the unique challenges of data labeling and navigates the vast search space for embedding-vector placement, making ML practically feasible for DLRM inference. By leveraging separate ML models for caching and prefetching, along with a novel differentiable loss function, RecMG dramatically narrows the prefetching search space and minimizes on-demand fetches. RecMG effectively reduces end-to-end DLRM inference time by up to 43% in industrial-scale DLRM inference scenarios.
Tutorial
Applications and Application Frameworks
Embedded and/or Reconfigurable Systems
Emerging Technologies
Portability
TUT
DescriptionAre you new to the world of HPC and trying to find an affordable and accessible way that you can learn, practice and experiment? Do you miss the days when learning about HPC was connecting a few grey boxes together and configuring a cluster? Do you wish you could transfer all the complexity inherent in production HPC systems into an accessible sandbox environment, designed to facilitate teaching and experimental development? Stop wishing and come explore Magic Castle with this tutorial!
Magic Castle is open-source software that replicates the HPC infrastructure experience using community or commercial cloud resources. It is easy to deploy, and a cluster can be created in minutes. Once the cluster is deployed, users are provided with a complete HPC cluster software environment including the scheduler, a data-transfer node, JupyterHub, and thousands of software applications compiled by experts and accessible via CVMFS. Since its initial public release in 2018, Magic Castle has been used for thousands of workshops and tutorials worldwide.
In this tutorial you will learn how to deploy a virtual HPC cluster on your preferred cloud resource in minutes, and fully customize your environment to suit your application, whether that be training, development, or practice.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe simulation data was produced using the AthenaPK code (https://github.com/parthenon-hpc-lab/athenapk) using the Frontier supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The data is a time snapshot from the AthenaPK simulation of a galaxy cluster with a virial mass of ~6.6e14 solar masses and a central supermassive black hole of ~1.1e9 solar masses. The data was then visualized on the Andes cluster at OLCF using VisIt (https://github.com/visit-dav/visit). VisIt was used to generate an isosurface of the jet and the streamlines of the magnetic field. Both the isosurfaces and streamlines were exported to OBJ files to then be further visualized on Frontier using Blender (https://www.blender.org/). Blender was used to import both OBJ files and make the final PNG render of the objects using the Cycles render engine.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF brings together participants to build a community and share information and ideas around exploring the challenges in sustainability of HPC software, focusing on 1) defining and measuring research software sustainability metrics and 2) enhancing research software stewardship practices. Participants will discuss software project community health, engineering practices, and funding stability. The session aims to foster collaboration, share insights, and develop actionable strategies for the long-term viability of HPC software projects.
Tutorial
Parallel Programming Methods, Models, Languages and Environments
Portability
System Administration
TUT
DescriptionModern scientific software stacks include thousands of packages, from C, C++, Fortran, and Rust libraries to interpreted packages written in Python and R. HPC applications depend on hundreds of packages spanning all of these ecosystems. To achieve high performance, they must leverage low-level libraries like MPI, BLAS, and LAPACK. Many also make use of rapidly changing and equally complex AI packages. Integrating all of the software necessary for modern HPC/AI workloads is extremely challenging, and the complexity can be an obstacle for users, administrators, support staff, and developers alike.
Spack is an open source package management tool that simplifies building, installing, customizing, and sharing software stacks. Its adoption has grown rapidly: it is used by end-users, developers, companies, clouds, and the world's largest HPC centers. Spack provides a powerful and flexible dependency model, a simple Python syntax for writing build recipes, and a repository of over 7,900 packages maintained by a community of over 1,300 contributors. This tutorial provides an introduction to Spack: installing and authoring packages, user and developer workflows, and HPC facility-wide deployment. Attendees will learn foundational skills for automating day-to-day tasks, as well as more advanced use cases.
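To give a flavor of the package-authoring portion of the tutorial, here is a minimal, hypothetical Spack recipe; Spack recipes are ordinary Python classes, and the package name, URL, and checksum below are placeholders rather than a real package.

```python
# package.py in a Spack repository, e.g. under packages/mysolver/
from spack.package import *


class Mysolver(CMakePackage):
    """Hypothetical example solver used only to illustrate a Spack build recipe."""

    homepage = "https://example.org/mysolver"
    url = "https://example.org/downloads/mysolver-1.2.0.tar.gz"

    # Placeholder checksum; a real recipe records the tarball's sha256.
    version("1.2.0", sha256="0000000000000000000000000000000000000000000000000000000000000000")

    variant("mpi", default=True, description="Build with MPI support")

    depends_on("cmake@3.20:", type="build")
    depends_on("mpi", when="+mpi")
    depends_on("blas")

    def cmake_args(self):
        # Translate the variant into a CMake option.
        return [self.define_from_variant("ENABLE_MPI", "mpi")]
```

With such a recipe on disk, a command like `spack install mysolver +mpi` would build the package and its entire dependency tree with the chosen compilers and MPI implementation.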
Birds of a Feather
TP
XO/EX
DescriptionMemory heterogeneity refers to memory architectures with multiple memory components that have diverse characteristics (such as latency and bandwidth). Heterogeneous memory (HM) is now common in supercomputers. With the emergence of processing-in-memory and resource disaggregation, there will be more memory components with increasingly different features. Managing HM is challenging: the programmer often has to take care of memory allocation, decide data placement and migration, and make the best use of fast memory. HM also introduces complexity in programming models and new classes of performance bugs. We will discuss how memory heterogeneity will impact the HPC ecosystem.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionWhen large earthquakes happen, first responders need fast and accurate information regarding their impact. UCIS4EQ is an urgent computing platform that estimates ground shaking based on high-performance parallel 3D simulations. In this work, we present the PyCOMPSs implementation of UCIS4EQ towards urgent high-performance computing with a particular focus on providing a malleability mechanism. This allows UCIS4EQ to scan for data updates during runtime and dynamically incorporate said data into its execution. This involves, in particular, leveraging several HPC jobs. The implementation has been validated with data from the Samos Izmir earthquake. The malleable version of the workflow can reduce the use of computational resources by 40% while reducing the variability of the results and keeping the same time-to-solution as a non-malleable run.
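As a rough sketch of how a malleable workflow can be expressed with PyCOMPSs task annotations (the function names and task bodies below are hypothetical stand-ins, not UCIS4EQ code):

```python
from pycompss.api.task import task
from pycompss.api.api import compss_wait_on

# Hypothetical stand-ins for workflow stages, only to illustrate the pattern.
@task(returns=1)
def run_simulation(source_params, region):
    # ... launch/perform one 3D ground-shaking simulation for this region ...
    return {"region": region, "peak_ground_accel": 0.42}

@task(returns=1)
def check_for_updates(event_id):
    # ... poll external services for revised source parameters ...
    return None  # or a dict with updated parameters

def urgent_workflow(event_id, initial_params, regions):
    futures = [run_simulation(initial_params, r) for r in regions]
    # Malleability idea: while simulations run, scan for data updates and
    # dynamically add work for the updated source description.
    update = compss_wait_on(check_for_updates(event_id))
    if update is not None:
        futures += [run_simulation(update, r) for r in regions]
    return compss_wait_on(futures)
```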
Paper
Accelerators
Algorithms
Linear Algebra
Modeling and Simulation
Numerical Methods
TP
DescriptionThis paper presents the formulation and implementation of a high performance algorithm to compute the many-body electronic correlation energy via the random-phase approximation within density functional theory. Our approach circumvents computational inefficiencies inherent in direct approaches which exhibit quartic scaling with respect to system size. Our formulation requires solving block linear systems whose coefficient matrices are complex symmetric; these systems are of widely-varying numerical difficulty. We develop a short-term recurrence block Krylov subspace solver for these systems and leverage a dynamic block size selection to mitigate load imbalances. This selection balances the increased cost per linear solver iteration with a reduction in the number of iterations for slowly-converging systems. Numerical experiments show that our implementation exhibits good parallel scalability, achieves faster solution times than direct approaches on even the smallest chemical system tested, and scales to larger systems and processor counts due to its cubic scaling and greater computational locality.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWe present the recent efforts for MatRIS, the performance portable math library of IRIS runtime for multi-device heterogeneity. MatRIS provides dense linear algebra — BLAS (Basic Linear Algebra Subprograms) and LAPACK (Linear Algebra PACKage) — capabilities across different back ends available in the IRIS runtime, enabling the same MatRIS code to run efficiently on multi-device heterogeneous targets. MatRIS provides standard BLAS/LAPACK APIs. The motivation (philosophy) of MatRIS is: Implement once, deploy anywhere. The algorithms implemented in MatRIS are serial-like and architecture-agnostic, elevating the programming productivity in heterogeneous systems. While ensuring portability, MatRIS provides competitive or even better performance than state-of-the-art open-source and vendor solutions, such as DPLASMA, Chameleon, or NVIDIA cuSolverMG libraries.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionX-ray light source facilities such as the Linac Coherent Light Source (LCLS) at SLAC National Accelerator Laboratory generate massive amounts of data that need to be analyzed quickly to inform ongoing experiments. However, the high repetition rate and dimensionality of these data streams make their analysis challenging in both scalability and interpretability. In this work, we propose an image-monitoring and classification framework that follows a three-stage process: dimensionality reduction by matrix sketching, visualization using UMAP, and clustering using OPTICS. In the dimensionality reduction step, we combine the Priority Sampling algorithm with a modified Frequent Directions algorithm to produce an accelerated rank-adaptive matrix sketching (ARAMS) algorithm, wherein practitioners specify the target error of the sketch as opposed to the rank. Furthermore, the framework is parallel, enabling real-time analysis of the underpinning structure of the data. We explore its effectiveness on both beam profile data and diffraction data from recent LCLS experiments.
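A compact single-process sketch of the three-stage pipeline might look as follows, assuming the umap-learn and scikit-learn packages; it uses a fixed-rank Frequent-Directions-style sketch rather than the paper's rank-adaptive ARAMS, and synthetic data stands in for LCLS images.

```python
import numpy as np
import umap                      # pip install umap-learn
from sklearn.cluster import OPTICS

def frequent_directions(X, sketch_rows=64):
    """Very small Frequent-Directions-style sketch (fixed rank, not ARAMS)."""
    n, d = X.shape
    B = np.zeros((2 * sketch_rows, d))
    filled = 0
    for row in X:
        if filled == 2 * sketch_rows:
            _, s, Vt = np.linalg.svd(B, full_matrices=False)
            delta = s[sketch_rows] ** 2
            s = np.sqrt(np.maximum(s ** 2 - delta, 0.0))   # shrink spectrum
            B = s[:, None] * Vt
            filled = sketch_rows
        B[filled] = row
        filled += 1
    # Fold the most recent rows into the sketch and return the top rows.
    _, s, Vt = np.linalg.svd(B[:filled], full_matrices=False)
    return s[:sketch_rows, None] * Vt[:sketch_rows]

# Hypothetical detector images flattened to vectors (rows = shots).
rng = np.random.default_rng(0)
images = rng.random((2000, 4096))

# 1) Dimensionality reduction: project each shot onto the sketch's row space.
B = frequent_directions(images, sketch_rows=64)
coords = images @ B.T                       # (2000, 64) reduced representation

# 2) Visualization: embed into 2D with UMAP.
embedding = umap.UMAP(n_components=2).fit_transform(coords)

# 3) Clustering: density-based grouping with OPTICS.
labels = OPTICS(min_samples=10).fit_predict(embedding)
print(np.unique(labels))
```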
Paper
Accelerators
Applications and Application Frameworks
Graph Algorithms
Modeling and Simulation
Numerical Methods
TP
DescriptionFast and accurate numerical simulations are crucial for designing large-scale geological carbon storage projects that ensure safe long-term CO2 containment as a climate change mitigation strategy. These simulations involve solving numerous large and complex linear systems arising from the implicit Finite-Volume (FV) discretization of the PDEs governing subsurface fluid flow. Compounded with highly detailed geo-models, solving these linear systems is computationally and memory expensive, and accounts for the majority of the simulation computing time. The intricate memory hierarchies of modern systems are insufficient to overcome the challenges of large-scale numerical simulations. Therefore, exploring algorithms that can leverage alternative and balanced paradigms, such as dataflow and in-memory computing, is crucial. This work introduces a matrix-free algorithm to solve FV-based linear systems using a dataflow architecture to significantly minimize memory bottlenecks. Our implementation achieves a two-orders-of-magnitude speedup compared to a GPGPU-based reference implementation, and up to 1.2 PFlops on a single dataflow device.
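To make the matrix-free idea concrete, here is a small NumPy sketch in which a conjugate-gradient solver only ever calls the operator's action on a vector and never touches an assembled matrix; a 2D five-point operator stands in for the paper's FV systems, and nothing here reflects the dataflow implementation itself.

```python
import numpy as np

def apply_fv_operator(p, dx=1.0):
    """Matrix-free action of a 2D five-point operator (homogeneous Dirichlet
    boundaries), standing in for an implicit FV system."""
    Ap = 4.0 * p
    Ap[1:, :] -= p[:-1, :]
    Ap[:-1, :] -= p[1:, :]
    Ap[:, 1:] -= p[:, :-1]
    Ap[:, :-1] -= p[:, 1:]
    return Ap / dx**2

def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=2000):
    """CG that only needs the operator's action, never its assembled matrix."""
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = np.vdot(r, r)
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / np.vdot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = np.vdot(r, r)
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

n = 64
b = np.ones((n, n))                       # hypothetical right-hand side
x = conjugate_gradient(apply_fv_operator, b)
print(np.linalg.norm(apply_fv_operator(x) - b))   # residual check
```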
Exhibits
Flash Session
TP
XO/EX
DescriptionDiscover how Oracle Cloud Infrastructure (OCI) and NVIDIA's accelerated computing platform are boosting performance and improving TCO for HPC workloads such as computational fluid dynamics and structural mechanics. We will hear from a joint customer about their experience and benefits of using Oracle’s NVIDIA GPU-accelerated bare-metal compute instances. We will also dive into OCI’s current NVIDIA GPU portfolio and explore OCI’s future with NVIDIA GB200 NVL72 featuring Quantum-2 InfiniBand networking and how this next generation platform will benefit HPC workloads.
Paper
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Middleware and System Software
Performance Evaluation and/or Optimization Tools
Runtime Systems
TP
DescriptionModern high-performance computing (HPC) systems play a fundamental role in driving scientific research, as they execute computationally intensive jobs from diverse domains. However, HPC jobs are characterized by conflicting computational requirements, which may cause inefficiency in resource usage, system throughput and energy consumption. One approach to tackle this problem is to distinguish between memory/compute-bound jobs at submission time, to make informed decisions about their execution. In this paper, we present MCBound, the first online data-driven framework to classify HPC jobs as memory/compute-bound before execution. We propose a systematic memory/compute-bound job characterization technique, and we use it to analyze the data of 2.2 million jobs run on the Supercomputer Fugaku. We implement MCBound for Fugaku and classify the jobs executed during February 2024. Our approach is proven effective, as it obtains an F1-macro average score of at least 0.89, while incurring a negligible overhead on the system's operations.
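The general recipe (train a classifier on submission-time features and evaluate it with an F1-macro score) can be sketched as follows; the features, labels, and model choice below are synthetic placeholders for illustration, not MCBound's actual pipeline or the Fugaku data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical submission-time job features.
rng = np.random.default_rng(0)
n_jobs = 5000
X = np.column_stack([
    rng.integers(1, 4096, n_jobs),        # requested nodes
    rng.integers(60, 86400, n_jobs),      # requested wall time (s)
    rng.random(n_jobs),                   # fraction of past jobs that were memory-bound
    rng.random(n_jobs),                   # historical memory bandwidth utilization
])
# Synthetic label: 1 = memory-bound, 0 = compute-bound.
y = (0.6 * X[:, 2] + 0.4 * X[:, 3] + 0.1 * rng.standard_normal(n_jobs) > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("F1-macro:", f1_score(y_te, clf.predict(X_te), average="macro"))
```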
Paper
Accelerators
Compilers
Heterogeneous Computing
Performance Evaluation and/or Optimization Tools
TP
DescriptionOperator fusion enhances data locality and reduces GPU memory bandwidth pressure, but it struggles with chains of compute-intensive operators due to saturated computation throughput. Variability in tensor sizes can make these operators memory-bound, necessitating efficient fused-kernel generation, which is challenged by limited scheduling spaces, redundant accesses, and long tuning times.
We present MCFuser, a framework that efficiently generates high-performance fused kernels for memory-bound compute-intensive (MBCI) operator chains. MCFuser uses high-level tiling expressions for expansive search space delineation and Directed Acyclic Graph (DAG) analysis to cut redundant memory access, optimizing kernel performance. It prunes the search space with specific guidelines and combines an analytical performance model with heuristic search, significantly speeding up tuning. In tests with NVIDIA A100 and RTX3080 GPUs, MCFuser outperformed leading compilers like Ansor, delivering up to 5.9x kernel speedup and reducing tuning time by over 70-fold, proving its effectiveness and efficiency in enhancing kernel performance.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionScalable data management is essential for processing large scientific datasets on HPC platforms for distributed deep learning. In-memory distributed storage is preferred for its speed, enabling the rapid, random, and frequent data access required by stochastic optimizers. Processes use one-sided or collective communication to fetch remote data, with the optimal choice depending on (i) dataset characteristics, (ii) training scale, and (iii) the interconnection network. Empirical analysis shows collective communication excels with larger mini-batch sizes and/or fewer processes, whereas one-sided communication outperforms it at larger scales.
We propose MDLoader, a hybrid in-memory data loader for distributed graph neural network training. MDLoader features a model-driven performance estimator that dynamically selects between one-sided and collective communication at the beginning of training using Tree of Parzen Estimators (TPE). Evaluations on NERSC Perlmutter and OLCF Summit show MDLoader outperforms single-backend loaders by up to 2.83x and predicts the suitable communication method with 96.3% (Perlmutter) and 94.3% (Summit) success rate.
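As a hedged illustration of using a Tree of Parzen Estimators to pick a communication backend (the cost model below is made up for the example; MDLoader's estimator is model-driven and measured at the start of training), one could write the following with the hyperopt package:

```python
from hyperopt import fmin, tpe, hp, Trials

# Hypothetical cost model of per-mini-batch fetch time (seconds).
def fetch_time(args):
    backend, batch = args["backend"], args["batch_size"]
    if backend == "collective":
        return 0.8 + 0.0006 * batch      # higher fixed cost, cheaper per item
    return 0.2 + 0.0011 * batch          # one-sided: low latency, worse scaling

space = {
    "backend": hp.choice("backend", ["one_sided", "collective"]),
    "batch_size": hp.quniform("batch_size", 256, 4096, 256),
}

trials = Trials()
best = fmin(fn=fetch_time, space=space, algo=tpe.suggest,
            max_evals=50, trials=trials)
print(best)   # index of the chosen backend and the batch size TPE preferred
```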
Birds of a Feather
TP
XO/EX
DescriptionCome and learn from the leaders of the professional societies focused on HPC — ACM, IEEE, and SIAM! Your SIGHPC, TCPP, TCHPC, SIAG-SC, and SIAG-CSE representatives invite SC24 participants to join this cross-society BoF to learn about the opportunities these societies provide. Each organization recognizes outstanding achievements in HPC with society awards, offers travel grants to students and early-career professionals, supports initiatives focused on education and outreach, and promotes diversity, equity, and inclusion. These representatives are also seeking feedback from the community to help improve their initiatives and to learn from each other.
Paper
I/O, Storage, Archive
TP
DescriptionLarge-scale data analytics, scientific simulation, and deep learning codes in HPC perform massive computations on data greatly exceeding the bounds of main memory. These out-of-core algorithms suffer from severe data movement penalties, programming complexity, and limited code reuse. To solve this, HPC sites have steadily increased DRAM capacity. However, this is not sustainable due to financial and environmental costs. A more elegant, low-cost, and portable solution is to expand memory to distributed multi-tiered storage. In this work, we propose MegaMmap: a software distributed shared memory (DSM) that enlarges effective memory capacity through intelligent tiered DRAM and storage management. MegaMmap provides workload-aware data organization, eviction, and prefetching policies to reduce DRAM consumption while ensuring speedy access to critical data. A variety of memory coherence optimizations are provided through an intuitive hinting system. Evaluations show that various workloads can be executed with a fraction of the DRAM while offering competitive performance.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionArtificial intelligence is transforming scientific computing with deep neural network surrogates that approximate solutions to partial differential equations (PDEs). Traditional off-line training methods face issues with storage and I/O efficiency, as the training dataset has to be computed with numerical solvers up-front. Our previous work, the Melissa framework, enables data to be created "on the fly" and streamed directly into the training process. In this paper, we introduce a new active learning method to enhance the data-efficiency of surrogate training in the on-line context. The surrogate is trained to predict a time-step directly for different initial and boundary condition parameters. Our approach uses Adaptive Multiple Importance Sampling guided by training loss statistics in order to focus NN training on the difficult areas of the parameter space. Preliminary results for a 2D heat PDE demonstrate the potential of this method, called Breed, to improve the generalization capabilities of surrogates while reducing computational overhead.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn recent years, the slowing of advancements in memory technology and applications’ increasing demand for memory have resulted in high performance computation becoming bottlenecked by availability of memory. One existing solution, far memory, involves swapping pages to a remote machine rather than a local disk. Function-as-a-service (FaaS) platforms have also become more prevalent, allowing the remote execution of workloads. We first explore the viability of integrating one FaaS tool, Globus Compute, with a remote swap system for far memory, FastSwap. Then, we investigate the performance of the combined system on various workloads to determine which ones can incorporate remote memory without excessive overhead cost. We find that for certain workloads, including breadth-first search and minimum spanning tree, it is possible to use up to 30% remote memory without significant slowdowns. In the poster session, we will present our approach, findings, limitations, and potential generalizations.
Students@SC
TP
W
TUT
XO/EX
DescriptionThe Mentor–Protégé Matching program is designed to foster the growth of the high-performance computing (HPC) community by pairing students with experienced mentors. This initiative supports students by aligning them with mentors based on research interests, career goals, and general interests, helping to build meaningful connections. Early sign-ups offer additional virtual activities to deepen these relationships before the conference. Participants are encouraged to engage in pre-conference exchanges and attend organized events during the conference to maximize the benefits of this mentoring experience. Pre-registration for the Mentor–Protégé Matching program is required to attend this social event prior to the keynote. Please note: space is limited. Mentor-Protégé participants must RSVP separately prior to the event. Breakfast will be provided.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionSignificant advances in weather prediction have stemmed primarily from combining observational data, sophisticated modeling techniques, and analysis of simulated or historical weather data. Additionally, the applicability of machine learning-based applications on edge devices has expanded, addressing a variety of use cases, including weather prediction. This study builds on prior research using an Extreme Learning Machine (ELM) approach to detect weather anomalies in real time, enhancing existing predictive systems. Our model is implemented on IBIS, an adaptable edge computing framework for multi-sensor data collection. The ELM model, applied to detect real-time weather anomalies, offers fast training and operational efficiency. Data on atmospheric phenomena, including pressure and wind, was generated and stored in a time series database. The model was then trained on 80% of 550,000 records. Our experiments demonstrated a 92% R² score, supporting its effectiveness. Our work within IBIS represents a cost-effective and scalable solution for collecting, monitoring, and predicting hazardous atmospheric conditions.
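For context, an Extreme Learning Machine is small enough to sketch in a few lines of NumPy: the hidden layer is random and fixed, and only the output weights are solved by least squares, which is what makes training fast. The data below is synthetic, not the IBIS sensor stream.

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine: random hidden layer, least-squares output."""
    def __init__(self, n_hidden=128, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        self.W = self.rng.standard_normal((d, self.n_hidden))
        self.b = self.rng.standard_normal(self.n_hidden)
        H = np.tanh(X @ self.W + self.b)                     # random feature expansion
        self.beta, *_ = np.linalg.lstsq(H, y, rcond=None)    # only trained part
        return self

    def predict(self, X):
        return np.tanh(X @ self.W + self.b) @ self.beta

# Hypothetical atmospheric features (pressure, wind speed, ...) and target.
rng = np.random.default_rng(1)
X = rng.random((5000, 4))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.3 * np.sin(6 * X[:, 2]) + 0.05 * rng.standard_normal(5000)

model = ELM(n_hidden=256).fit(X[:4000], y[:4000])
pred = model.predict(X[4000:])
ss_res = np.sum((y[4000:] - pred) ** 2)
ss_tot = np.sum((y[4000:] - y[4000:].mean()) ** 2)
print("R^2:", 1 - ss_res / ss_tot)
```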
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionWith NVIDIA’s release of the Grace Superchip, all three big semiconductor companies in HPC (AMD, Intel, NVIDIA) are currently competing in the race for the best CPU. In this work we analyze the performance of these state-of-the-art CPUs and create an accurate in-core performance model for their microarchitectures (Zen 4, Golden Cove, and Neoverse V2), extending the Open Source Architecture Code Analyzer (OSACA) tool and comparing it with LLVM-MCA. Starting from the peculiarities and upsides/downsides of a single core, we extend our comparison by a variety of microbenchmarks and the capabilities of a full node. The “write-allocate (WA) evasion” feature, which can automatically reduce the memory traffic caused by write misses, receives special attention; we show that the Grace Superchip has a next-to-optimal implementation of WA evasion, and that the only way to avoid write allocates on Zen 4 is the explicit use of non-temporal stores.
Posters
TP
DescriptionEmerging applications in machine learning and personalized medicine introduce new challenges and requirements for secure computing. While the exclusive allocation of resources to a single tenant provides necessary isolation, it also comes at the cost of hardware underutilization. While solutions like containers allow for secure sharing of CPUs, new techniques are needed to efficiently co-locate applications on GPUs. We propose a new approach that merges the elasticity of Function-as-a-Service (FaaS) with the physical GPU partitioning of NVIDIA MIG. In MIGnificient, we provide spatial isolation through concurrent execution on different device partitions, preventing side-channel attacks and performance interference. We employ local API remoting that controls kernel scheduling and memory transfers, enabling compute-communication overlap and improved resource management in virtualized APIs. MIGnificient overcomes the limitations of state-of-the-art solutions that rely on slower network-based API remoting and insecure NVIDIA MPS, creating a unifying model for optimized serverless GPU functions.
Paper
Accelerators
Algorithms
Linear Algebra
Modeling and Simulation
Numerical Methods
TP
DescriptionWe propose Mille-feuille for accelerating the conjugate gradient (CG) and biconjugate gradient stabilized (BiCGSTAB) methods on GPUs. We analyze the two methods and report three findings related to mixed precision, kernel synchronization costs, and convergence awareness during the iterations. Then, (1) to enable tile-grained mixed precision, we develop a tiled sparse format; (2) to reduce synchronization costs, we leverage atomic operations that consolidate the solving procedure into a single GPU kernel; (3) to support a convergence-aware mixed-precision strategy, we enable tile-wise on-chip dynamic precision conversion within the single kernel. Experimental results on an NVIDIA A100 and an AMD MI210 show that the Mille-feuille solver outperforms baseline implementations using the vendor-supported cuSPARSE/hipSPARSE, as well as two state-of-the-art libraries, PETSc and Ginkgo, by a factor of on average 3.03x/2.68x, 5.37x, 4.36x (up to 8.77x/7.14x, 16.54x, 15.69x) in CG, and on average 2.65x/2.32x, 3.57x, 3.78x (up to 7.51x/6.63x, 16.64x, 11.73x) in BiCGSTAB, respectively.
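A much simpler mixed-precision CG than Mille-feuille's tile-wise, convergence-aware scheme, shown only to convey the basic idea: the sparse matrix is stored in fp32 (so the SpMV is cheap), while all vector arithmetic stays in fp64. The Poisson test matrix is a stand-in; nothing here reflects the GPU kernels described in the paper.

```python
import numpy as np
import scipy.sparse as sp

def mixed_precision_cg(A32, b, tol=1e-5, max_iter=1000):
    """CG with an fp32 matrix (cheap SpMV) and fp64 vector arithmetic."""
    x = np.zeros_like(b)
    r = b - (A32 @ x.astype(np.float32)).astype(np.float64)
    p = r.copy()
    rs = r @ r
    for it in range(max_iter):
        Ap = (A32 @ p.astype(np.float32)).astype(np.float64)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, it
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, max_iter

# Hypothetical SPD test matrix: 2D Poisson problem.
n = 64
I = sp.identity(n)
T = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n))
A = (sp.kron(I, T) + sp.kron(T, I)).tocsr()
b = np.ones(A.shape[0])

x, iters = mixed_precision_cg(A.astype(np.float32), b)
print(iters, np.linalg.norm(A @ x - b))   # iterations used and true fp64 residual
```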
ACM Student Research Competition: Undergraduate Poster
Posters
TP
DescriptionTransformer-based large language models have advanced natural language processing with their ability to generate fluent text. However, these models exhibit and amplify toxicity and bias learned from training data, posing new ethical challenges. We build upon the Attention Lens framework to allow for scalable decoding of attention-mechanism information. We then use this decoded information to implement a pipeline that generates and removes toxic memories from pre-trained language models in a way that is both human-interpretable and effective while retaining model performance.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
DescriptionBulk synchronous programming (in distributed-memory systems) and the fork-join pattern (in shared-memory systems) are often used for problems where independent processes must periodically synchronize. Frequent synchronization can greatly undermine the performance of software designed to solve such problems. We use the actor model of concurrent computing to balance the load of hundreds of thousands of short-lived tasks, and mitigate synchronization bottlenecks by buffering communication via actor batching. The actor model is becoming increasingly popular in scientific and high-performance computing because it can handle heterogeneous tasks and computing environments with enhanced programming flexibility and ease relative to conventional paradigms like MPI. For a hydrologic simulation of continental North America with over 500,000 elements, the proposed buffering approach is approximately 4 times faster than no buffering, outperforms MPI on single and multiple nodes, and remains competitive with OpenMP on a single node and MPI+OpenMP on multiple nodes.
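A toy illustration of the buffering idea, using ordinary Python threads and queues (nothing here reflects the authors' actor runtime): each actor accumulates outgoing messages and delivers them in batches, so many short-lived tasks trigger only a handful of communications.

```python
import queue
import threading

class BufferedActor(threading.Thread):
    """Toy actor that batches outgoing messages before delivering them."""
    def __init__(self, mailbox, batch_size=64):
        super().__init__()
        self.inbox = queue.Queue()
        self.mailbox = mailbox          # downstream actor's queue
        self.batch_size = batch_size
        self.buffer = []
        self.flushes = 0

    def send(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.mailbox.put(list(self.buffer))   # one delivery for many messages
            self.flushes += 1
            self.buffer.clear()

    def run(self):
        while True:
            task = self.inbox.get()
            if task is None:            # shutdown sentinel
                self.flush()
                return
            self.send(("result", task))

downstream = queue.Queue()
actor = BufferedActor(downstream, batch_size=64)
actor.start()
for i in range(10_000):                 # many short-lived tasks
    actor.inbox.put(i)
actor.inbox.put(None)
actor.join()
print("deliveries to downstream:", actor.flushes)   # 157 instead of 10,000
```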
Birds of a Feather
TP
XO/EX
DescriptionWhat if we have been oversolving in computational science and engineering for decades? Are low-precision arithmetic formats only for AI workloads? How can HPC applications exploit mixed-precision hardware features? This BoF invites the HPC community at large interested in applying mixed precisions into their workflows and discussing the impact on time-to-solution, memory footprint, data motion, and energy consumption. Experts from scientific applications/software libraries/hardware architectures will briefly provide the context on this trendy topic, share their own perspectives, and mostly engage with the audience via a set of questions, while gathering feedback to define a roadmap moving forward.
Paper
Algorithms
Artificial Intelligence/Machine Learning
Heterogeneous Computing
Performance Optimization
TP
DescriptionMixed-precision quantization has been shown to be a promising method for enhancing the efficiency of LLMs. This technique boosts computational efficiency by processing most values with low-precision, high-throughput compute units and maintains accuracy by processing outliers in high precision. However, due to the dynamic, irregular, and sparse nature of outliers, this approach is far from using the hardware efficiently.
In this work, we propose MIXQ, an efficient mixed-precision quantization system. Through an in-depth analysis of outlier distributions, we introduce a locality-based outlier prediction algorithm that can predict all outliers for 95.8% of tokens. Based on this accurate prediction, we propose a quantization-ahead-of-detection (QAD) technique that can verify the correctness of the prediction, along with a new data structure for efficient outlier processing. Evaluation shows that MIXQ achieves 1.52x and 1.78x speedups over FP16 and Bitsandbytes on 8-bit quantization, and 1.48x, 1.93x, and 6x speedups over QUIK, FP16, and AWQ on 4-bit quantization.
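The core split between a low-precision bulk and high-precision outliers can be sketched generically as below; thresholding by a quantile is an assumption made for illustration, and MIXQ's actual contribution (predicting outliers ahead of detection and verifying the prediction) is not reproduced here.

```python
import numpy as np

def mixed_precision_quantize(x, n_bits=8, outlier_quantile=0.995):
    """Split a tensor into an int-quantized bulk plus fp16 outliers."""
    threshold = np.quantile(np.abs(x), outlier_quantile)
    outlier_mask = np.abs(x) > threshold

    bulk = np.where(outlier_mask, 0.0, x)
    scale = np.abs(bulk).max() / (2 ** (n_bits - 1) - 1)
    q_bulk = np.clip(np.round(bulk / scale),
                     -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1).astype(np.int8)

    outliers = x[outlier_mask].astype(np.float16)     # kept in high precision
    return q_bulk, scale, outlier_mask, outliers

def dequantize(q_bulk, scale, outlier_mask, outliers):
    x = q_bulk.astype(np.float32) * scale
    x[outlier_mask] = outliers.astype(np.float32)
    return x

rng = np.random.default_rng(0)
act = rng.standard_normal(4096).astype(np.float32)
act[rng.choice(4096, 8, replace=False)] *= 50.0       # a few large-magnitude outliers

parts = mixed_precision_quantize(act)
rel_err = np.linalg.norm(dequantize(*parts) - act) / np.linalg.norm(act)
print(f"relative reconstruction error: {rel_err:.4f}")
```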
Birds of a Feather
TP
XO/EX
DescriptionSince it was first developed by Google in 2020, MLIR has become highly popular and adopted by some of the world’s largest companies and technologies. MLIR significantly lowers the barrier to entry in developing compilers and Domain Specific Languages (DSLs) by providing a framework for reuse, where Intermediate Representation (IR) dialects and transformations can be developed, integrated and reused, with many already provided by the community. There is significant potential for MLIR in HPC, and the aim of this BoF is to discuss how to drive further adoption of this technology in HPC and shape MLIR to suit our needs.
Birds of a Feather
TP
XO/EX
DescriptionMachine learning applications are rapidly expanding into scientific domains and challenging the hallmarks of traditional high performance computing workloads. We present MLPerf, a community-driven system performance benchmark which spans a range of machine learning tasks. The speakers at this BoF are experts in the fields of HPC, science applications, machine learning, and computer architecture, representing academia, government research organizations, and private industry. In this session, we will cover the past year’s developments within the MLPerf organization and provide an update on the latest round of submissions to MLPerf-HPC benchmark suite to solicit input from interested parties within the HPC community.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
Workshop
I/O, Storage, Archive
W
DescriptionIn the last decade, DL training has emerged as an HPC-scale workload running on large clusters. The dominant communication pattern in distributed data-parallel DL training is allreduce, which is used to sum the model gradients across processes during the backpropagation phase. Various allreduce algorithms have been developed to optimize communication time in DL training. Given the scale of DL workloads, it is crucial to evaluate the scaling efficiency of these algorithms on a variety of system architectures. We have extended the Structural Simulation Toolkit (SST) to simulate allreduce and barrier algorithms: the Rabenseifner, ring, and dissemination algorithms. We performed a design space exploration (DSE) study with three allreduce algorithms and two barrier algorithms running on six system network topologies for various message sizes. We quantified the performance benefits of allreduce algorithms that preserve locality between communicating processes. In addition, we evaluated the scaling efficiency of centralized and decentralized barrier algorithms.
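For intuition about why the best algorithm depends on message size and scale, the classic alpha-beta cost models (in the style of Thakur et al., not SST's packet-level simulation) can be coded in a few lines; the latency and bandwidth numbers below are hypothetical.

```python
import math

def ring_allreduce_time(p, n_bytes, alpha, beta):
    """Alpha-beta model of ring allreduce (reduce-scatter + allgather):
    2(p-1) latency terms and 2n(p-1)/p bytes moved per process."""
    return 2 * (p - 1) * alpha + 2 * n_bytes * (p - 1) / p * beta

def rabenseifner_allreduce_time(p, n_bytes, alpha, beta):
    """Alpha-beta model of Rabenseifner's algorithm: logarithmic latency,
    same 2n(p-1)/p bandwidth term (p assumed to be a power of two)."""
    return 2 * math.log2(p) * alpha + 2 * n_bytes * (p - 1) / p * beta

# Hypothetical network parameters: 2 us latency, 25 GB/s per-link bandwidth.
alpha, beta = 2e-6, 1 / 25e9
for n in (1 << 10, 1 << 20, 1 << 30):          # 1 KiB, 1 MiB, 1 GiB gradients
    ring = ring_allreduce_time(1024, n, alpha, beta)
    rab = rabenseifner_allreduce_time(1024, n, alpha, beta)
    print(f"{n:>12} B  ring {ring*1e3:8.3f} ms   rabenseifner {rab*1e3:8.3f} ms")
```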
Workshop
Message Passing
Network
W
DescriptionThe MPI specification provides a restricted form of persistence in point-to-point and collective communication operations that purportedly enables libraries to amortize precomputation and setup costs over longer sequences of identical communication operations. Because of the way MPI has chosen to represent semantics and modes of communication, further additions and modifications to the specification have often come at the cost of a (combinatorial) blow-up in the number of interface functions.
We discuss how to exploit orthogonality and separation of concerns more thoroughly to prevent the proliferation of concrete interface functions while still providing essentially the same persistence as current MPI and without any additional burden on library implementers. We introduce new variants of persistence, which we call pairwise and relaxed persistence. Our concrete proposals contribute to the discussion about why MPI is so huge and what could or should be done about that.
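For reference, the persistence that already exists in MPI today looks like the following mpi4py sketch of a plain ring exchange: the communication pattern is set up once and started many times. The pairwise and relaxed persistence variants proposed in the paper are not shown.

```python
# Run with e.g.: mpiexec -n 2 python persistent_ring.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
right, left = (rank + 1) % size, (rank - 1) % size

sendbuf = np.zeros(1 << 20, dtype=np.float64)
recvbuf = np.empty_like(sendbuf)

# Set up the communication pattern once; the setup cost is amortized over
# all later iterations -- this is the persistence the abstract refers to.
send_req = comm.Send_init(sendbuf, dest=right, tag=0)
recv_req = comm.Recv_init(recvbuf, source=left, tag=0)

for step in range(100):
    sendbuf[:] = step            # refill the same buffer each iteration
    MPI.Prequest.Startall([send_req, recv_req])
    MPI.Request.Waitall([send_req, recv_req])

send_req.Free()
recv_req.Free()
if rank == 0:
    print("done:", recvbuf[0])
```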
Paper
Accelerators
Compilers
Embedded and/or Reconfigurable Systems
Linear Algebra
Performance Evaluation and/or Optimization Tools
TP
DescriptionStencil computation is one of the most universal computation motifs in scientific applications such as weather prediction. Due to the complexity of scientific simulation, a stencil computation can contain a set of complex stencil operations that form a directed acyclic graph (referred to as a composite stencil). Unfortunately, most existing stencil optimizations and compilers focus only on intra-stencil operations and cannot fully explore the performance improvement potential of composite stencils in today's applications. To this end, we propose Moirae, a framework that explores a novel optimization space and generates high-performance code for composite stencils. We first propose a lightweight cost model with a fine-grained analysis of memory access behavior to predict performance. Based on the cost model, we propose an evolutionary search method to find a high-performance optimization, leveraging a search space pruning method with stencil domain knowledge. Experimental results show that Moirae can outperform the state-of-the-art composite stencil compilers.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionWith the gap between computing power and I/O performance growing ever wider on HPC systems, it is becoming crucial to optimize how applications perform I/O on storage resources. To achieve this, a good understanding of application I/O behavior is an essential preliminary step. In this paper, we introduce MOSAIC, a method for categorizing applications according to their I/O behavior. We first propose an abstraction for characterizing I/O operations in terms of periodicity, temporality and metadata access. We then present a set of segmentation-based techniques for quickly and automatically detecting meaningful data access patterns. In the end, MOSAIC is able to characterize a full set of real-world I/O traces from the Blue Waters supercomputer with 92% accuracy.
Workshop
Message Passing
Network
W
DescriptionThe progression of communication in the Message Passing Interface (MPI) is not well defined, yet it is critical for applications to achieve effective computation and communication overlapping. The opaque nature of MPI progress poses significant challenges in advancing MPI within HPC practices.
First, the lack of clarity hinders the development of explicit guidelines for enhancing computation and communication overlap in applications. Second, it prevents MPI from seamlessly integrating with contemporary programming paradigms. Third, it limits the extension of MPI functionalities from user space.
In this paper, we examine the role of MPI progress by analyzing the implementation of MPI messaging. We generalize the asynchronous communication pattern and identify key factors influencing application performance. We propose a set of MPI extensions designed to enable users to construct and manage an efficient progress engine explicitly. We compare our approach to previous efforts in the field, highlighting its reduced complexity and increased effectiveness.
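The kind of explicit progress management the paper argues for can today only be approximated by hand, for example by polling the library between slices of local computation, as in this mpi4py sketch (illustrative only; the proposed MPI extensions are not shown).

```python
# Run with e.g.: mpiexec -n 2 python overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
peer = (rank + 1) % size

msg = np.full(1 << 22, rank, dtype=np.float64)
out = np.empty_like(msg)

recv = comm.Irecv(out, source=peer, tag=7)
send = comm.Isend(msg, dest=peer, tag=7)

# "Manual progress": poke the MPI library between pieces of local work so the
# transfer can advance, since the standard does not say when progress happens.
work = np.zeros(1)
while not MPI.Request.Testall([send, recv]):
    work += np.sin(work + 1.0)          # a slice of local computation
MPI.Request.Waitall([send, recv])       # no-op if Testall already completed them

if rank == 0:
    print("overlap loop finished; received from", int(out[0]))
```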
Birds of a Feather
TP
XO/EX
DescriptionMPICH is a widely used open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for users of MPICH as well as developers of MPI implementations derived from MPICH to discuss experiences and issues in using and porting MPICH. Future plans for MPICH will be discussed. Representatives from MPICH-derived implementations will provide brief updates on the status of their efforts. MPICH developers will also be present for an open forum discussion.
ACM Gordon Bell Finalist
TP
DescriptionWe present a scalable, end-to-end workflow for protein design. By augmenting protein sequences with natural language descriptions of biochemical properties, we train generative models to preferentially align with protein fitness landscapes. Through complex experimental- and simulation-based observations, we integrate these measures as preferred parameters for generating new protein variants and demonstrate our workflow on five diverse supercomputers. We achieve >1 ExaFLOPS sustained performance in mixed precision on each supercomputer and a maximum sustained performance of 4.11 ExaFLOPS and peak performance of 5.57 ExaFLOPS. We establish the performance of our model on two tasks: (1) across a predetermined benchmark dataset of deep mutational scanning experiments to optimize the fitness-determining mutations in the yeast protein HIS7, and (2) in optimizing the design of the enzyme malate dehydrogenase to achieve lower activation barriers (and therefore increased catalytic rates) using simulation data. Our implementation thus sets high watermarks for multimodal protein design workflows.
Exhibits
SCinet
TP
XO/EX
Birds of a Feather
TP
XO/EX
DescriptionWe now live in a world of multi-architecture HPC systems, with multiple providers of Arm CPUs to facilitate these, alongside x86. This BoF session gathers the Arm HPC community to share insights on running Arm-based HPC systems alongside other architectures. Unlike previous sessions that focused on tool maturity and porting challenges, this discussion will emphasize recent developments in open-source availability and deployment best practices. It aims to address how different architectures can coexist effectively in complex heterogeneous operational scenarios and collaborative research, even with diverse starting points. Audience participation is highly encouraged.
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
DescriptionDistributed cluster applications, including machine learning tasks, database applications, and HPC workloads, often rely on NVMe-oF using RDMA for fast, block-level access to storage devices over a network. However, RDMA solutions add extra latency by requiring software on the critical path. In this paper, we present a distributed NVMe driver for sharing NVMe storage devices across hosts in a PCIe cluster. By building on PCIe shared memory capabilities, we demonstrate disaggregation of NVMe controllers at the I/O queue level, allowing them to be used in parallel by remote hosts without relying on RDMA. Our experimental results prove that our PCIe-based solution reduces network latency and is comparable to local access.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionHigh-end ARM processors are emerging in data centers and HPC systems, posing as a strong contender to x86 machines. Memory-centric profiling is an important approach for dissecting an application’s bottlenecks on memory access and guiding optimizations. Many existing memory profiling tools leverage hardware performance counters and precise event sampling, such as Intel PEBS and AMD IBS, to achieve high accuracy and low overhead. In this workshop, we present a multi-level memory profiling tool for ARM processors, leveraging the Statistical Profiling Extension (SPE). We evaluate the tool using both HPC and cloud workloads on the ARM Ampere processor. Our results provide the first quantitative assessment of time overhead and sampling accuracy of the ARM SPE for memory-centric profiling at different sampling periods and aux buffer sizes.
Workshop
Distributed Computing
Education
Emerging Technologies
W
Keynote
TP
W
TUT
XO/EX
DescriptionIn her keynote address, Dr. Fox will explore the transformative power of NASA science. From her unique perspective leading NASA’s diverse portfolio of science and research missions, she’ll share insights into the role of advanced supercomputing and data analytics in enabling the agency’s vision. This presentation will underscore the importance of HPC, AI, and ML technologies in knowledge discovery, fostering collaboration, and finding solutions to complex global challenges. Along the way, Dr. Fox will share some of NASA’s inspirational achievements, recent discoveries, and exciting upcoming missions.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionIn this paper, we address the challenges in achieving sustainable data-driven efficiency by providing a detailed exploration of the end-to-end operational data analytics (ODA) framework that evolved through two generations of supercomputer systems at the Oak Ridge Leadership Computing Facility (OLCF). This framework handles large data streams ingested from a heavily instrumented HPC environment, accumulating multiple terabytes per day. We outline the multifaceted data life cycle across HPC procurement, operations, and research & development, identifying key obstacles and design decisions that shape effective strategies for building and supporting data pipelines end-to-end.
By sharing key insights and lessons learned from our experience, we offer recommendations for the HPC community on enabling sustainable operational data analytics and beyond.
Our contributions aim to bridge the gap between potential and real benefits of operational data, guiding future efforts towards integrated and sustainable operational intelligence in high-performance computing environments.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe design was created with Adobe Creative Suite and purchased design assets from iStock. It was designed by NCSA staff with no AI or ML use.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionCreated with Adobe Creative Suite and purchased design assets from iStock. Designed by NCSA staff. No AI or ML use.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionGraph-based approximate nearest neighbor (ANN) algorithms produce high-quality representations of neighborhood structure, and NN-Descent is a widely known graph-based ANN algorithm. However, graph-based approaches are memory- and time-consuming. To address these drawbacks, we develop a scalable distributed NN-Descent. Our NEO-DNND (neighbor-checking efficiency optimized distributed NN-Descent) is built on top of MPI and designed to utilize network bandwidth efficiently. NEO-DNND reduces duplicate elements, increases intra-node data sharing, and leverages available DRAM to replicate data that may be sent frequently. NEO-DNND showed remarkable scalability up to 256 nodes and was able to construct neighborhood graphs from billion-scale datasets. Compared to a leading shared-memory ANN library, NEO-DNND achieved competitive performance even on a single node and exhibited 41.7X better performance when scaling up to 32 nodes. Furthermore, NEO-DNND outperformed a state-of-the-art distributed NN-Descent implementation, achieving up to a 6.0X speedup.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis project introduces an enhanced solution for accessing and processing NetCDF data, a widely used standard in geosciences for storing multidimensional data. Existing tools often compromise on performance or lack full workflow support. The proposed system integrates machine learning, specifically a CatBoost classifier, with a modern web application to improve the speed and accuracy of data querying and visualization. It provides a user-friendly interface for uploading NetCDF files and extracting metadata efficiently. Experimental results demonstrate a 64% F1-score in selecting optimal parameters and up to 80% improvement in processing time, significantly aiding scientific analysis.
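As a point of reference only, the sketch below (plain C against the standard netCDF library, with a placeholder file name; it is not the poster's system) shows the kind of metadata extraction such a tool automates:

/* Minimal sketch: list NetCDF metadata with the standard netCDF C library.
 * "sample.nc" is a placeholder file name. Build with -lnetcdf. */
#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

int main(void) {
    int ncid, ndims, nvars, ngatts, unlimdim;
    if (nc_open("sample.nc", NC_NOWRITE, &ncid) != NC_NOERR) {
        fprintf(stderr, "could not open file\n");
        return EXIT_FAILURE;
    }
    /* Global counts of dimensions, variables, and attributes. */
    nc_inq(ncid, &ndims, &nvars, &ngatts, &unlimdim);
    printf("dims=%d vars=%d global atts=%d\n", ndims, nvars, ngatts);

    /* Per-variable name, type, and rank: the kind of metadata a query
     * front end might index before any data are read. */
    for (int v = 0; v < nvars; ++v) {
        char name[NC_MAX_NAME + 1];
        nc_type type;
        int vardims;
        nc_inq_var(ncid, v, name, &type, &vardims, NULL, NULL);
        printf("  var %-20s type=%d rank=%d\n", name, (int)type, vardims);
    }
    nc_close(ncid);
    return EXIT_SUCCESS;
}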
Paper
Distributed Computing
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
TP
DescriptionThe emergence of programmable data planes (PDPs) has paved the way for in-network computing (INC), a paradigm wherein networking devices actively participate in distributed computations. However, PDPs are still a niche technology, mostly available to network operators, and rely on packet-processing DSLs like P4. This necessitates great networking expertise from INC programmers to articulate computational tasks in networking terms and reason about their code. To lift this barrier to INC, we propose a unified compute interface for the data plane. We introduce C/C++ extensions that allow INC to be expressed as kernel functions processing in-flight messages, and APIs for establishing INC-aware communication. We develop a compiler that translates kernels into P4, and thin runtimes that handle the required network plumbing, shielding INC programmers from low-level networking details. We evaluate our system using common INC applications from the literature.
Paper
Data Compression
Data Movement and Memory
Distributed Computing
Message Passing
Network
TP
DescriptionIn the Fully Sharded Data Parallel (FSDP) training pipeline, collective operations can be interleaved to maximize the communication/computation overlap. In this scenario, outstanding operations such as Allgather and Reduce-Scatter can compete for the injection bandwidth and create pipeline bubbles. To address this problem, we propose a novel bandwidth-optimal Allgather collective algorithm that leverages hardware multicast. We use multicast to build a constant-time reliable Broadcast protocol, a building block for constructing an optimal Allgather schedule. Our Allgather algorithm achieves 2x traffic reduction on a 188-node testbed. To free the host side from running the protocol, we employ SmartNIC offloading. We extract the parallelism in our Allgather algorithm and map it to a SmartNIC specialized for hiding the cost of data movement. We show that our SmartNIC-offloaded collective progress engine can scale to the next generation of 1.6 Tbit/s links.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionReal-time object detection is an important and computationally intensive task that is gaining more attention in the field of autonomous systems. Recently, a novel object detection algorithm called RT-DETR has emerged, demonstrating superior speed compared to the popular YOLO series. In recent years, many edge devices optimized for artificial intelligence have been developed, allowing for faster model inference. Our study uses NVIDIA TensorRT to optimize models for object tracking and detection on NVIDIA’s Orin device. Our best-performing model is the FP16 model with DLA, with an average inference time of 19.9416 milliseconds and a throughput of 50.1465 frames per second. This is a five-fold improvement over the standard unoptimized PyTorch FP32 model, with practically no accuracy sacrifice. Our study shows that applying TensorRT and quantization to object tracking and detection on NVIDIA’s Orin device is effective in reducing prediction time, allowing for faster detection.
Exhibitor Forum
Network
TP
XO/EX
DescriptionAs large language models rapidly grow, compute demands have surged, generating staggering amounts of data and requiring high-speed data movement between AI accelerators, memory and storage. Conventional interconnects and optics fail to scale to the cost and throughput that AI/HPC systems will require, causing major bottlenecks, poor hardware use, higher power consumption, and increased costs. New connectivity solutions are needed for efficient, high-performance AI/HPC architectures.
In this session, Fujitsu will highlight current AI/HPC architecture challenges and future needs. Ayar Labs will present a chiplet-based optical I/O solution offering significantly better bandwidth, power efficiency, and latency compared to traditional methods. They will also share new analysis on how optical I/O improves system performance and total cost of ownership (TCO) for AI inference and training. Attendees will learn how these new optical fabrics meet the performance and TCO requirements of next-gen AI/HPC systems.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionTime-explicit fully-kinetic simulation of magnetically confined fusion plasmas is out of reach of existing supercomputers due to the multi-scale nature of the system. Physics approximations and time-implicit methods are typically used to simulate such plasmas. A new energy-conserving, semi-implicit Poisson solver has been developed and added to the open-source particle-in-cell code WarpX. The new solver enables electrostatic simulations to be performed using orders of magnitude fewer computational resources than previously possible, without reducing accuracy. The accuracy and computational speedup of the model are demonstrated by simulating plasma expansion into vacuum and spoke-mode formation during a Penning discharge (relevant to the plasma processing industry). The reduction in time-to-solution allows researchers to tackle problems that were previously computationally infeasible. Performance on GPUs and issues of scaling to fusion plasma conditions are discussed.
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionNew architectures and topologies continue to be investigated to address the growing demands and challenges faced by Data Center Networks (DCN), the communications backbone of the data center. We believe that novel techniques that leverage the high redundancy and symmetry in DCN topologies can significantly simplify DCN protocols and operations and improve DCN performance. We adopted the folded-Clos topology to investigate a novel Multi-Root Meshed Tree Protocol (MR-MTP) and compared its performance to the popular protocol suite adopted in folded-Clos topologies, namely the Border Gateway Protocol (BGP) with Equal Cost Multipath Protocol (ECMP), with and without Bidirectional Forwarding Detection (BFD). We studied the convergence time, packet loss, control overhead, and blast radius after interface failures introduced at multiple points. Our studies conducted on the FABRIC testbed provide strong validation that novel techniques that leverage DCN structures can indeed simplify DCN protocol operations and improve performance.
Birds of a Feather
TP
XO/EX
DescriptionWith the rapid growth of electric vehicles and Net Zero targets, battery simulation is crucial. HPC facilitates accurate simulations of battery electrochemistry and thermal behavior, allowing improved predictions of cell performance, cyclic life, and safety to inform R&D. This BoF will address the challenges of developing exascale battery modelling capabilities and foster connections between HPC and battery science communities. The BoF will consist of talks discussing gaps in the state-of-the-art and the application of AI for multiscale battery modelling, followed by an audience discussion and panel session sharing insights, obstacles, and solutions to accelerate the collaborative development of batteries.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionThis paper presents the transformation of Arizona State University's (ASU) user-facing high-performance computing (HPC) monitoring capabilities, specifically a significantly updated cluster node status dashboard deployed on ASU's HPC systems. The dashboard leverages modern JavaScript frameworks and direct Slurm API integration to address the growing needs of ASU's diverse and expanding HPC user community. We explore the motivations behind the dashboard's development, detail its technical implementation, and discuss how it accommodates the growing complexity of HPC environments while simplifying user and administrator interactions. The resulting system enhances the overall user experience, lowers the barrier to entry for HPC, and better positions ASU's HPC resources for future growth, marking a significant milestone in the progression of HPC resource visualization. The source code is provided online with an open-source license.
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
DescriptionScientific research increasingly relies on distributed computational resources, storage systems, networks, and instruments, ranging from HPC and cloud systems to edge devices. Event-driven architecture (EDA) benefits applications targeting distributed research infrastructures by enabling the organization, communication, processing, reliability, and security of events generated from many sources. To support the development of scientific EDA, we introduce Octopus, a hybrid, cloud-to-edge event fabric designed to link many local event producers and consumers with cloud-hosted brokers. Octopus can be scaled to meet demand, permits the deployment of highly available triggers for automatic event processing, and enforces fine-grained access control. We identify requirements in self-driving laboratories, scientific data automation, online task scheduling, epidemic modeling, and dynamic workflow management use cases, and present results demonstrating Octopus’ ability to meet those requirements. Octopus supports producing and consuming events at a rate of over 4.2 M and 9.6 M events per second, respectively, from distributed clients.
Workshop
Message Passing
Network
W
DescriptionMessage matching is a critical process ensuring the correct delivery of messages in distributed and HPC environments. The advent of SmartNICs presents an opportunity to develop offloaded message-matching approaches that leverage this on-NIC programmable accelerator, retaining the flexibility of software-based solutions (e.g., tailoring to application matching behaviors or specialization for non-MPI matching semantics) while freeing up CPU resources. This can be especially beneficial for I/O-intensive systems, such as those protected with PQC.
In this work, we propose a bin-based MPI message approach, Optimistic Tag Matching, explicitly designed for the lightweight, highly parallel architectures typical of on-path SmartNICs. We analyze several MPI applications, showing how most of them present a matching behavior suitable for offloading with the proposed strategy (i.e., low queue depths). Additionally, we show how, in those scenarios, offloaded optimistic matching maintains message rates comparable to traditional on-CPU MPI message matching while freeing up CPU resources.
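To make the bin-based idea concrete, the following is a hypothetical C sketch of generic bin-based matching, in which posted receives are hashed by (source, tag) into short per-bin lists; it illustrates the data structure only and is not the paper's Optimistic Tag Matching (wildcards and unexpected-message queues are omitted):

/* Hypothetical sketch of bin-based tag matching. */
#include <stdio.h>
#include <stdlib.h>

#define NUM_BINS 64

typedef struct recv_entry {
    int source;               /* expected sender rank       */
    int tag;                  /* expected tag               */
    struct recv_entry *next;  /* next posted receive in bin */
} recv_entry;

static recv_entry *bins[NUM_BINS];

static unsigned bin_index(int source, int tag) {
    /* Simple hash; a real implementation would tune this to the
     * application's matching behavior. */
    return ((unsigned)source * 31u + (unsigned)tag) % NUM_BINS;
}

static void post_receive(int source, int tag) {
    recv_entry *e = malloc(sizeof *e);
    e->source = source;
    e->tag = tag;
    unsigned b = bin_index(source, tag);
    e->next = bins[b];
    bins[b] = e;
}

/* Returns 1 and removes the entry if a posted receive matches. */
static int match_incoming(int source, int tag) {
    unsigned b = bin_index(source, tag);
    for (recv_entry **p = &bins[b]; *p; p = &(*p)->next) {
        if ((*p)->source == source && (*p)->tag == tag) {
            recv_entry *hit = *p;
            *p = hit->next;
            free(hit);
            return 1;
        }
    }
    return 0; /* unexpected message: would be queued elsewhere */
}

int main(void) {
    post_receive(3, 42);
    printf("match: %d\n", match_incoming(3, 42)); /* prints 1 */
    printf("match: %d\n", match_incoming(7, 9));  /* prints 0 */
    return 0;
}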
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionOpenMP® is a highly relevant parallelization standard in high-performance computing, and all major compiler vendors support it. The standard defines the OpenMP Tool Interface (OMPT) as a mechanism for third-party tools to obtain information on dedicated runtime events. However, the implementation status differs across compilers. Since many correctness tools and profilers rely on OMPT, being able to judge the status of implementation across compilers is important. We created a test suite that provides unit-test-sized tests suitable for evaluating the support of OMPT in AMD ROCm™ and other compilers. The test suite consists of 30 test cases covering both host-side and device-side events. While the test suite is not complete, it provides a useful initial vehicle to evaluate the OMPT implementation status. In our evaluation of various compiler versions, we identified a trend toward full OMPT support in the AMD ROCm toolchain, which passes all device-side event tests.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionOpenMP® is a widely used API in high-performance computing that enables parallelization on the host as well as offloading work to an accelerator, such as a GPU. The OpenMP specification defines an OpenMP Tool Interface (OMPT), which allows a third-party tool to be notified of OpenMP runtime events. Ensuring that the runtime correctly reports such events is thus important. We propose a unit testing framework for testing OpenMP implementations, like the ROCm™ compiler. It offers a simple-to-use framework that allows a tester to check for OMPT events in addition to the regular unit testing code. It also facilitates writing concise tests while bridging the semantic gap between the unit under test and the OMPT-event testing. Our experimental results show that for the ROCm compiler, ompTest provides coverage similar to its existing test cases, with better readability, and a compile and runtime speedup close to 2.5× for the ROCm OMPT test suite.
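For readers unfamiliar with OMPT, a minimal tool as defined by the OpenMP specification looks roughly like the C sketch below (a generic illustration, not part of ompTest): the runtime discovers ompt_start_tool at startup and the tool registers callbacks for events such as thread_begin, which a test harness could then assert against.

/* Minimal OMPT tool sketch; requires an OMPT-capable OpenMP runtime. */
#include <stdio.h>
#include <omp-tools.h>

static void on_thread_begin(ompt_thread_t thread_type, ompt_data_t *thread_data) {
    /* A unit test could record this event and later assert it was observed. */
    (void)thread_data;
    printf("OMPT event: thread_begin (type=%d)\n", (int)thread_type);
}

static int my_initialize(ompt_function_lookup_t lookup, int initial_device_num,
                         ompt_data_t *tool_data) {
    (void)initial_device_num; (void)tool_data;
    ompt_set_callback_t set_callback =
        (ompt_set_callback_t)lookup("ompt_set_callback");
    set_callback(ompt_callback_thread_begin, (ompt_callback_t)on_thread_begin);
    return 1; /* non-zero keeps the tool active */
}

static void my_finalize(ompt_data_t *tool_data) { (void)tool_data; }

/* The OpenMP runtime looks up this symbol at startup to activate the tool. */
ompt_start_tool_result_t *ompt_start_tool(unsigned int omp_version,
                                          const char *runtime_version) {
    (void)omp_version; (void)runtime_version;
    static ompt_start_tool_result_t result = {&my_initialize, &my_finalize, {0}};
    return &result;
}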
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionWe investigated the computational capabilities of FABRIC, a nationwide research infrastructure with nearly 40 sites, for scaling neuroscience simulations. From the hardware standpoint, single-site characterization showed that FABRIC is a promising alternative to conventional neuroscience setups, particularly due to the availability of powerful graphics processing units (GPUs). While multi-site simulations are affected by network latency, it becomes less critical for larger networks. From the software perspective, we found that in the popular CoreNEURON library, cell distribution strategy (for parallel execution) does not affect the simulation time for biologically realistic networks, while other cases can be addressed with a minimum k-cut graph partitioning algorithm. Overall, scalability experiments revealed that FABRIC can be used to simulate networks of up to twenty-five thousand cells, with the limiting issue being GPU memory.
ACM Student Research Competition: Graduate Poster
Posters
TP
DescriptionWe study two algorithmic approaches to approximate triangle counting and compare their accuracy and efficiency. The first one is based on randomized matrix-matrix multiplication, which can be faster, simpler, and more parallelizable on modern processors. The second is based on trace estimation, which produces estimates with lower variance and greater accuracy.
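Both estimators rest on the standard identity linking triangles to powers of the adjacency matrix A of an undirected graph; a brief sketch of the trace-based variant, assuming Rademacher probe vectors, is:

T = \tfrac{1}{6}\,\mathrm{tr}\!\left(A^{3}\right), \qquad \widehat{T} = \frac{1}{6s}\sum_{i=1}^{s} z_i^{\top} A\bigl(A(A z_i)\bigr),

where z_1, ..., z_s have i.i.d. ±1 entries, so that E[\widehat{T}] = T. Each probe needs only three sparse matrix-vector products rather than forming A^3 explicitly, which is what makes the trace-estimation approach competitive with randomized matrix multiplication.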
Paper
Post-Moore Computing
Quantum Computing
TP
Best Student Paper Finalist
DescriptionIn this paper, we investigate the resilience of various implementations of state-of-the-art quantum error correction (QEC) codes to radiation-induced faults. We report data from over 400 million fault injections and correlate hardware faults with the logical error observed after decoding the code output, extrapolating physical-to-logical error rates. We compare the codes' radiation-induced logical error rate across code distance, the number and role in the QEC of physical qubits, the underlying quantum computer topology, and particle energy spread in the chip. We show that, simply by properly selecting and tuning the surface code, and thus without introducing any overhead, the probability of correcting a radiation-induced fault is increased by up to 10%. Finally, we provide indications and guidelines for the design of future QEC codes to further increase their effectiveness against radiation-induced events.
Birds of a Feather
TP
XO/EX
DescriptionOpen MPI continues to drive the state of the art in HPC. This year, we've added new features, fixed bugs, improved performance, and collaborated with many across the HPC community. We'll discuss what Open MPI has accomplished over the past year and present a roadmap for the next year.
Open MPI's strength lies in its diversity: we represent many different viewpoints across the HPC ecosystem. To that end, many developers from the community will be present to discuss and answer your questions both during and after the BoF.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF is meant to be an open discussion to guide the future roadmap for Open OnDemand (openondemand.org), by getting feedback from the community on the prioritization of the various tasks planned for the next few years. OOD is extremely relevant to ongoing discussions within the HPC community about user interfaces and science gateways. The session leaders, all part of the OOD development team, will jointly develop the content for the presentation in advance to ensure a wide range of viewpoints and topics are presented. We will also consult with our user advisory group in advance for their suggestions.
Birds of a Feather
TP
XO/EX
DescriptionAs demand for specialized architectures and innovative approaches grows in the post-Moore era of HPC and scientific edge computing, this BoF session explores the potential of domain-specific accelerators and open-source tools in architecture research and chip prototyping. Goals include fostering collaboration among professionals, identifying trends and challenges, and exploring AI-assisted design. Topics cover ongoing research, experiences with open-source tools and gap analysis. Additionally, we discuss computational demands and multi-physics simulation opportunities from chip prototyping as a new type of HPC workload. Expected outcomes include actionable insights, a collaborative network, and inspiration for new initiatives in architecture research and chip prototyping.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionGPUs are the heart of the latest generations of supercomputers. We accomplish efficient acceleration of a compressible multiphase flow solver via OpenACC on NVIDIA and AMD Instinct GPUs. Optimization is accomplished by specifying the directive clauses gang vector and collapse. Further speedups of six and ten times are achieved by packing user-defined types into coalesced multidimensional arrays and manual inlining via metaprogramming. Additional optimizations yield a seven-times speedup of array packing and a thirty-times speedup of select kernels on Frontier. Weak scaling efficiencies of 97% and 95% are observed when scaling to 50% of Summit and 87% of Frontier. Strong scaling efficiencies of 84% and 81% are observed when increasing the device count by a factor of 8 and 16 on V100 and MI250X hardware. The strong scaling efficiency of AMD’s MI250X increases to 92% when increasing the device count by a factor of 16 when GPU-aware MPI is used for communication.
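As a generic C illustration of the clauses named above (hypothetical loop and array names; the solver's actual Fortran code is not shown here), a triple loop nest can be collapsed and mapped to gangs and vector lanes as follows:

/* Generic OpenACC sketch: collapse three nested loops into one parallel
 * iteration space and distribute it across gangs and vector lanes. */
#include <stdio.h>
#include <stdlib.h>

#define NX 64
#define NY 64
#define NZ 64
#define NCELLS (NX * NY * NZ)

static void advance(double *q, const double *q_old, const double *rhs, double dt) {
    #pragma acc parallel loop gang vector collapse(3) copy(q[0:NCELLS]) copyin(q_old[0:NCELLS], rhs[0:NCELLS])
    for (int k = 0; k < NZ; ++k)
        for (int j = 0; j < NY; ++j)
            for (int i = 0; i < NX; ++i) {
                size_t idx = ((size_t)k * NY + j) * NX + i;
                q[idx] = q_old[idx] + dt * rhs[idx];
            }
}

int main(void) {
    double *q = calloc(NCELLS, sizeof *q);
    double *q_old = calloc(NCELLS, sizeof *q_old);
    double *rhs = calloc(NCELLS, sizeof *rhs);
    advance(q, q_old, rhs, 1.0e-3);
    printf("q[0] = %f\n", q[0]);
    free(q); free(q_old); free(rhs);
    return 0;
}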
Birds of a Feather
TP
XO/EX
DescriptionThe OpenACC organization helps researchers and developers advance science by expanding their parallel computing skills and supporting a directive-based, high-level parallel programming model on CPUs, GPUs, and more. OpenACC supports over 25 global hackathons annually and has facilitated the acceleration of over 200 applications on multiple platforms (e.g., Frontier, Perlmutter, JUWELS, Summit, LUMI, and Piz Daint). This BoF serves as a forum for OpenACC users, implementers, and organization officers to openly discuss the status of OpenACC and its community. Presentations will be given by OpenACC officers, compiler implementers, and invited users, followed by an open-mic discussion with the audience.
Birds of a Feather
TP
XO/EX
DescriptionOpenHPC provides an open-source, community-driven stack of common ingredients to deploy and manage Linux-based HPC clusters. Formed in November 2015 and formalized as a Linux Foundation project in June 2016, OpenHPC continues to see rapid growth in adoption. It is used by thousands of organizations worldwide, including academic institutes, non-profit organizations, government labs, and commercial entities. At this BoF, speakers from the Technical Steering Committee will provide technical updates and near-term roadmaps. We then invite open discussion, allowing attendees to provide feedback on OpenHPC conventions and packaging, request additional components and configurations, and discuss future trends.
Birds of a Feather
TP
XO/EX
DescriptionThe OpenMP Architecture Review Board (ARB) will have released the OpenMP API version 6.0 shortly before SC24. Attendees will receive first-hand information about the new features of the OpenMP API version 6.0 directly from the language designers and implementers. This BoF will include short lightning talks, and discussion rounds will give participants ample opportunity to interact with OpenMP experts, ask questions, and provide community feedback. Sub-committee leaders of the OpenMP ARB will provide insight into new features of OpenMP version 6.0 as well as plans for versions 6.x and 7.0. Vendor representatives will discuss support and timelines for OpenMP features.
Birds of a Feather
TP
XO/EX
DescriptionOpenSHMEM is a PGAS API for single-sided asynchronous scalable communications in HPC applications. OpenSHMEM is a community-driven standard for this API across multiple architectures/implementations. This BoF brings together the OpenSHMEM community to present the latest accomplishments since the release of the 1.5 specification, and discuss future directions for the OpenSHMEM community as we develop version 1.6 and beyond. The BoF will consist of talks from end-users, implementers, middleware and tool developers to discuss their experiences and plans for using OpenSHMEM. We will then open the floor for discussion of the specification and our mid- to long-term goals.
Birds of a Feather
TP
XO/EX
DescriptionOperational data analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. However, making the data work for HPC operations is not straightforward and HPC sites are duplicating efforts to develop methods and tools to analyze and leverage the data. AI-based analysis methods are appealing, but certainly not the only option. This BoF aims to bring together practitioners in HPC operations to share use cases for ODA, discuss problems and provide feedback.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDue to the heterogeneity of resources and data, client selection plays a paramount role in the efficacy of Federated Learning (FL) systems. The time taken by a training round is determined by the slowest client. Also, energy consumption and carbon footprint are seen as primary concerns. In this context, we propose two optimal time- and energy-aware client selection algorithms for FL: MEC and ECMTC. To the best of our knowledge, this work is the first to propose algorithms that make an optimal selection of clients with heterogeneous resources by jointly optimizing the execution time and energy consumption while defining how much data each client should use locally.
During the presentation, I will discuss the challenges of selecting clients in FL systems, present our approach based on an illustrative example, and then show the experimental evaluation carried out on an HPC platform and the key takeaways of our investigation.
Workshop
Applications and Application Frameworks
W
DescriptionAs scientific experiments generate increasingly larger and more complex datasets, the need for efficient and scalable computing solutions becomes critical. This study explores the CMS experiment's implementation of machine learning inference-as-a-service to optimise hardware utilisation and meet rising computational demands within budget constraints. CMS is planning to dynamically offload ML tasks to the Perlmutter supercomputer using the Services for Optimised Network Inference on Coprocessors (SONIC) approach, leveraging the open-source Nvidia Triton inference server software. This talk will present the current status, performance metrics, challenges, and future directions, demonstrating how Perlmutter's interactive capabilities enhance the efficiency, scalability, and responsiveness of CMS workflows.
Tutorial
Applications and Application Frameworks
Emerging Technologies
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionWith Exaflop systems already here, the application communities are eager to leverage these large and complex systems. The complexity is further increased by the applications' need to combine different aspects beyond traditional HPC solvers and simulators, with artificial intelligence (AI) and data analytics (DA). The eFlows4HPC project proposed a software stack and the HPC Workflows-as-a-Service (HPCWaaS) methodology to provide tools to simplify the development, deployment, execution, and reuse of workflows. These results are leveraged in the DT-GEO and CAELESTIS projects. These tools also aim to support the reproducibility, portability and ease of use of complex workflows.
The tutorial will focus on a set of tools and methodologies for managing the whole application workflow lifecycle. In particular, the tutorial will cover aspects of developing computational workflows with PyCOMPSs and new extensions to better integrate with AI and DA with examples from DT-GEO and CAELESTIS projects. The tutorial will also describe how to automatically record workflow provenance with PyCOMPSs to share FAIR workflows in public repositories, enabling their reproducibility. Finally, we will explain how to generate specific containers that leverage HPC systems features and use them in the workflow deployment phase. The tutorial will include hands-on sessions on different aspects.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionThe speaker will introduce a comprehensive approach to optimizing the sustainability of data centers using digital twins and machine learning (ML) techniques. The talk will focus on how the frontiers of digital twins can be extended beyond real-time monitoring and prediction to recommend and control multiple aspects of the data center, like cooling and workload scheduling, enabling enhanced management of data centers. It will discuss how hierarchical multi-agent reinforcement learning can optimize multiple geographically distributed data centers simultaneously and dynamically. At the same time, consortiums like ExaDigiT and benchmark platforms can democratize and enable more involvement by the AI research community to develop their own solutions to advance the sustainability of data centers that are having explosive growth with AI adoption.
Paper
Artificial Intelligence/Machine Learning
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
DescriptionMachine learning models are distributed across multiple nodes using numerous parallelism strategies. The resulting collective communication is often on the critical path due to a lack of independent coarse-grain computation kernels available to execute.
In this work, we propose fusing computation with its subsequent collective communication and leverage GPUs' massive parallelism, along with GPU-initiated communication, to overlap communication and computation. Specifically, thread-blocks/workgroups (WGs) immediately communicate their results to remote GPUs after completing their computation, while other WGs within the same kernel continue computing. We developed three prototype fused operators (embedding+All-to-All, GEMV+AllReduce, and GEMM+All-to-All) to address the communication overheads in DLRM, Transformer, and MoE model architectures. We expose the fused kernels as new PyTorch operators and extend the Triton framework to demonstrate their practicality. Our evaluations show that our approach effectively overlaps communication with computation, reducing combined execution time by 12% to 31% across all three operators.
Exhibitor Forum
Architecture
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionEarly adopters of AI, particularly in large language model (LLM) applications, are transforming industries like pharmaceuticals and high-performance computing (HPC). This session will explore key considerations for integrating AI with HPC, focusing on both hardware and software optimization.
We’ll discuss specialized hardware setups, such as GPUs and TPUs, which are essential for managing AI workloads, including thermal management strategies demonstrated by Europe’s first Exascale system, JUPITER. On the software side, we’ll cover the importance of tailored AI models and efficient resource management.
Finally, we'll present a case study from the European Centre for Medium-Range Weather Forecasts (ECMWF), showcasing how AI-driven optimization of their Tripleclouds model improved performance and accuracy. Featuring expert speakers and interactive polling, the session will provide practical strategies and real-world examples from leading projects in the HPC-AI arena. Participants will also have access to a downloadable white paper and customer case study.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionMILC-Dslash is a benchmark derived from the MILC code which simulates lattice-gauge theory on a four-dimensional hypercube. This paper outlines a gradual progression in increasing the parallelism of the MILC-Dslash kernel using SYCL, transitioning from a simple to a fully parallel implementation. This investigation encompasses different work-item index orders, work-group sizes, and memory access patterns arising from these strategies. Examples of components intertwined with the parallel strategies include atomic memory operations, shared variables, divergent instructions, and versions with and without using the SYCL complex library and the SYCLomatic tool. The best parallel strategy is twice as fast as the simplest strategy and shows a 10% improvement over the QUDA baseline, thanks to enhanced parallelism and the use of work-group local memory. This, along with other findings — such as optimizing GPU resource utilization at the expense of concurrency — could guide researchers and developers seeking to optimize parallel computing applications.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionThe Quantum Fourier Transformation (QFT) sits at the heart of many applications in quantum computing. Existing work leverages SAT solvers or heuristics to generate a hardware-compliant circuit for QFT by inserting SWAP gates to remap logical qubits to physical qubits. However, these approaches can suffer from long compilation times and suboptimal results in terms of the number of cycles needed to finish all operations. In this paper, we propose a domain-specific hardware mapping approach for QFT. We unify our insights on relaxed ordering and unit exploration in QFT to search for a qubit mapping solution with the help of program synthesis tools. Our method is the first to guarantee linear-depth QFT circuits for Google Sycamore, IBM heavy-hex, and the lattice surgery, with respect to the number of qubits. Compared with state-of-the-art approaches, our method saves up to 53% in SWAP gates and 92% in depth.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionCurrently, the Weather Research and Forecasting model (WRF) utilizes shared-memory (OpenMP) and distributed-memory (MPI) parallelism. To take advantage of GPU resources on the Perlmutter supercomputer at NERSC, we port parts of the computationally expensive routines of the Fast Spectral Bin Microphysics (FSBM) microphysical scheme to NVIDIA GPUs using OpenMP device offloading directives. To facilitate this process, we explore an optimization workflow that uses both runtime profilers and the static code inspection tool Codee to refactor the subroutine. We observe a 2.08x overall speedup for the CONUS-12km thunderstorm test case.
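For orientation, a minimal C sketch of an OpenMP device-offloading directive of the kind used in this port is shown below (generic arrays, not the FSBM Fortran code):

/* Generic OpenMP target-offload sketch: map arrays to the device and
 * distribute the loop across teams and threads on the GPU. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 1.0; b[i] = 2.0; }

    #pragma omp target teams distribute parallel for map(tofrom: a[0:N]) map(to: b[0:N])
    for (int i = 0; i < N; ++i)
        a[i] += 0.5 * b[i];

    printf("a[0] = %f\n", a[0]); /* expect 2.0 */
    return 0;
}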
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
DescriptionScientific visualizations (SciVis) convert numerical and spatial data into images, enabling deeper insights into complex phenomena. Recent advancements in machine learning, particularly Deep Learning (DL), have significantly enhanced SciVis. By combining classical numerical techniques with DL, we can achieve the accuracies needed for real-world applications. However, DL models often exhibit undue confidence, especially with out-of-distribution (OOD) inputs, leading to misclassification with high confidence scores. To address this, we enhance DL models by quantifying uncertainty and enabling selective classification, allowing models to abstain from predictions when uncertainty is high. This approach outputs a prediction distribution, guiding users on when to seek human intervention. We evaluate this method across three tasks: airfoil pressure and velocity prediction using a Reynolds-Averaged Navier-Stokes (RANS) model, image classification with the ImageNet1K dataset, and digit recognition using the MNIST dataset.
ACM Gordon Bell Climate Modeling Finalist
TP
DescriptionEarth system predictability is challenged by the complexity of environmental dynamics and variables. Current AI foundation models, although advanced by large and heterogeneous data, are constrained by their size and data integration, limiting their effectiveness in addressing the full range of Earth system prediction challenges. To overcome these limitations, we introduce the Oak Ridge Base Foundation Model for Earth System Predictability (ORBIT), an advanced vision transformer model that scales up to 113 billion parameters using a novel hybrid tensor-data orthogonal parallelism technique. As the largest model of its kind, ORBIT surpasses the current climate AI foundation model size by a thousandfold. Scalability tests on the Frontier supercomputer demonstrate that ORBIT achieves 684 petaFLOPS to 1.6 exaFLOPS sustained throughput, with scaling efficiency maintained at 41% to 85% across 49,152 AMD GPUs. These breakthroughs establish new advances in AI-driven climate modeling and demonstrate promise to significantly improve Earth system predictability.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionHeterogeneous platforms demand tailored data structures, algorithm mappings, and efficient execution, often leading to numerous, hard-to-maintain variants of source codes. ORCHA, our performance portability orchestration system, addresses these challenges through abstractions and code generation, streamlining application development across diverse hardware.
Designed to adapt the FLASH multiphysics software for heterogeneous HPC platforms, ORCHA reduces code duplication and maintenance burdens by separating data management and parallelism from arithmetic logic. Key tools include CG-Kit for optimizing implementations, a macroprocessor for flexible arithmetic specialization, and Milhoja, a runtime for efficient graph execution.
This poster highlights performance evaluations of a shock hydrodynamics application across various hardware configurations, showcasing significant GPU performance improvements with ORCHA. We will also outline ongoing efforts to extend ORCHA's compatibility to other physics solvers, aiming to provide broader flexibility and enhanced performance in diverse computational environments.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionInstrumentation is a widely used technique for gathering performance data. However, excessive instrumentation can lead to significant runtime overheads, potentially skewing performance analysis results. In this work, we propose a novel approach to automatically generate and refine instrumentation configurations (ICs) to maximize measurement coverage while adhering to a user-defined overhead budget. Our approach formulates the problem of selecting instrumented functions as a binary knapsack problem, integrating dynamic profile data and static call-graph information to estimate costs. We implement this approach within the PIRA profiling infrastructure and demonstrate its effectiveness with the LULESH, AMG2013, MILC and ASTAR proxy applications, achieving relevant hot spot coverage while staying within the specified overhead limit.
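As an illustration of the binary knapsack formulation (the costs, values, and budget below are made up, not taken from the paper), a standard 0/1 knapsack dynamic program in C selects the highest-coverage set of functions that fits the overhead budget:

/* Illustrative 0/1 knapsack: pick functions to instrument so that total
 * estimated overhead stays within the budget while coverage is maximized. */
#include <stdio.h>
#include <string.h>

#define NFUNCS  4
#define BUDGET 10   /* overhead budget in arbitrary cost units */

int main(void) {
    int cost[NFUNCS]  = {3, 4, 5, 6};   /* estimated instrumentation overhead */
    int value[NFUNCS] = {4, 5, 7, 9};   /* estimated measurement coverage     */
    int best[BUDGET + 1];
    memset(best, 0, sizeof best);

    /* Classic 0/1 knapsack dynamic program over the budget. */
    for (int f = 0; f < NFUNCS; ++f)
        for (int b = BUDGET; b >= cost[f]; --b)
            if (best[b - cost[f]] + value[f] > best[b])
                best[b] = best[b - cost[f]] + value[f];

    printf("best coverage within budget %d: %d\n", BUDGET, best[BUDGET]);
    return 0;
}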
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionP-MoVE is a modern, open-source framework designed to monitor and visualize live and/or recorded performance data, with the ultimate goal of being a digital twin for HPC systems. Leveraging a Knowledge Base (KB), built upon an HPC-specific ontology with an intuitive encoding for comprehending system performance, it rigorously manages telemetry samplers, databases, and visualization frameworks. The KB is generated through an in-depth probing of the system. It enables the configuration and monitoring of performance metric samplers, the generation of real-time visualizations, the establishment of linked-data connections, and the generation of queries for advanced analysis. Furthermore, with an Abstraction Layer, P-MoVE can be used for low-level profiling even on components from different vendors. It is equipped with modern profiling capabilities, including live cache-aware roofline modeling, crafted to provide real-time insights without impeding system performance. P-MoVE's capabilities have been demonstrated on various architectures using microbenchmarks and a common kernel, sparse matrix-vector multiplication.
Paper
Middleware and System Software
Programming Frameworks and System Software
Resource Management
TP
DescriptionLarge-scale computing systems are increasingly using GPUs to enable peta- and exa-scale levels of compute to meet the needs of modern applications. Given the widespread and growing use of ML, including in scientific applications, optimizing clusters for ML workloads is important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability, leading to resource under-utilization and load imbalance. In this work we focus on how clusters schedulers can embrace performance variability to mitigate its effects. We design a novel cluster scheduler, PAL, which uses application-specific variability profiles to improve job performance and resource utilization. PAL also balances performance variability with locality. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workloads with a variety of variability profiles, PAL improves geomean job completion time by 42% and cluster utilization by 28% over existing state-of-the-art schedulers.
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
DescriptionThis panel will showcase groundbreaking advancements in high-speed networking, as demonstrated by this year’s Network Research Exhibition (NRE) participants. Bringing together experts from academia, research institutions, and industry, the session will explore cutting-edge innovations in network architectures, ultra-high-speed data transfers, and real-time scientific collaborations. Panelists will highlight key findings and challenges from the live NRE experiments, including advancements in AI, quantum technologies, testbeds, and global-scale data-intensive applications. The discussion will focus on the future of high-performance networking and its critical role in driving scientific discovery, addressing emerging trends and the evolving demands of data-driven research.
Piotr Rydlichowski
prydlich@man.poznan.pl
Dr Mariam Kiran
kiranm@ornl.gov
Hyunsuk Bang
hbang3@hawk.iit.edu
Joe Mambretti
j-mambretti@northwestern.edu
Harvey Newman
newman@hep.caltech.edu
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionScientific computing is on the tipping point of a paradigm change. For decades, scientific computing has been largely performed on isolated HPC systems with tightly coupled communication, computation, and storage resources. However, we are moving towards a new paradigm that includes distributed resources such as experimental facilities, cloud infrastructure, and edge devices, in addition to HPC, coordinated into an Integrated Research Infrastructure (IRI). The goal of IRI is to accelerate scientific discovery through automated real-time experimental analysis and steering. A critical piece of making this new paradigm a reality is efficient management and movement of the vast amounts of data between the geographically distributed source and compute resources. In this panel, we will discuss the major challenges for managing data in this new paradigm, how current approaches can potentially be adapted to meet the challenges, and remaining gaps that must be addressed.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionThe discussion will focus on innovation, collaboration, and the critical role of future leaders in driving breakthroughs at the intersection of AI and healthcare.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionAs a unifying theme in 2024, the special topic “What is Healthy AI?” will focus on the intersection of technology and innovation with cancer biology. “What is Healthy AI” embarks on the journey to craft the correct components for data and algorithms to seamlessly coexist with the complex intricacies of living systems, ultimately fostering advancements in healthcare and, specifically, cancer. Healthy AI in computational biology transcends technological innovation—it embodies a holistic approach to harnessing the power of AI for the advancement and betterment of cancer research.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionAmong different quantum computing technologies, neutral atom quantum computers have several advantageous features, such as multi-qubit gates, application-specific topologies, movable qubits, homogenous qubits, and long-range interactions. However, existing compilation techniques for neutral atoms fall short of leveraging these advantages in a practical and scalable manner. This paper introduces Parallax, a zero-SWAP, scalable, and parallelizable compilation and atom movement scheduling method tailored for neutral atom systems, which reduces high-error operations by 25% and increases the success rate by 28% on average compared to the state-of-the-art technique.
Tutorial
Broader Engagement
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TUT
DescriptionThis tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available.
The tutorial surveys basic parallel computing concepts, using examples selected from multiple engineering, scientific, and machine learning problems. These examples illustrate using MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. It discusses numerous parallelization and load balancing approaches; software engineering and performance improvement tools; and gives a brief overview of recent developments such as exascale systems and the use of parallelism in LLMs.
The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are most suitable for. Extensive pointers to web-based resources are provided to facilitate follow-up studies.
Exhibitor Forum
Debugging and Correctness Tools
Software Engineering
TP
XO/EX
DescriptionToday’s HPC applications utilize multiple hardware and software technologies to maximize application parallelism and scalability. Finding and fixing problems in these applications can be difficult without leveraging advanced debugging features of a parallel debugger.
This interactive session will highlight parallel debugging features and techniques for effectively finding and solving challenging application problems that utilize MPI, OpenMP, and GPU technologies. You will learn:
* The latest OpenMP/OMPD debugging advancements
* About AMD GPU Asynchronous Wave Control debugging advancements
* How to combine advanced debugging features to efficiently solve tough parallel problems
Taking full advantage of TotalView’s parallel debugging features will help you be more productive by reducing the time and effort required to identify and fix bugs. The ability to identify and resolve hard-to-find errors will result in more robust, reliable HPC applications.
Tutorial
I/O, Storage, Archive
TUT
DescriptionI/O on HPC systems is a black art. This tutorial sheds light on the state-of-the-art in parallel I/O and provides the knowledge necessary for attendees to best leverage the I/O resources available to them. We cover the entire I/O software stack, including storage and parallel file systems at the lowest layer, the role of NVRAM devices, intermediate layers (such as MPI-IO), and high-level I/O libraries (such as HDF5). We emphasize ways to use these interfaces that result in high performance and tools for generating insight into these stacks.
Our first third of the tutorial covers parallel I/O fundamentals. We discuss storage technologies, both present and near-future, and the major parallel and distributed file systems. We focus on applications in our second third, connecting storage to our examination of the upper library layers of the I/O stack, covering MPI-IO, Parallel netCDF, and HDF5. Finally, we discuss tools for understanding I/O behavior.
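As an illustration of the MPI-IO layer mentioned above, the following minimal sketch (not taken from the tutorial; file name and sizes are assumptions) performs a collective write in which each rank targets its own region of a shared file.

```cpp
// Collective MPI-IO write: each rank writes a contiguous block of doubles
// to a non-overlapping offset in one shared file.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                           // doubles per rank (assumed)
    std::vector<double> data(n, static_cast<double>(rank));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "output.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Each rank's region starts at rank * n * sizeof(double).
    MPI_Offset offset = static_cast<MPI_Offset>(rank) * n * sizeof(double);
    MPI_File_write_at_all(fh, offset, data.data(), n, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```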
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionFortran compilers that provide support for Fortran's native parallel features often do so with a runtime library that depends on details of both the compiler implementation and the communication library, while others provide limited or no support at all. This paper introduces a new generalized interface that is both compiler- and runtime-library-agnostic, providing flexibility while fully supporting all of Fortran's parallel features. The Parallel Runtime Interface for Fortran (PRIF) was developed to be portable across shared- and distributed-memory systems, with varying operating systems, toolchains and architectures. It achieves this by defining a set of Fortran procedures corresponding to each of the parallel features defined in the Fortran standard that may be invoked by a Fortran compiler and implemented by a runtime library. PRIF aims to be used as the solution for LLVM Flang to provide parallel Fortran support. This paper also briefly describes our PRIF prototype implementation: Caffeine.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionNeural network verification provides model robustness guarantees in the presence of noise. We generate verification specifications for medical imaging models based on the U-Net architecture and solve pixel-by-pixel verification problems on a massive scale. The efficiency of solving this NP-complete problem is studied using α,β-CROWN. We implement per-pixel parallelization and demonstrate orders-of-magnitude speedup, allowing faster characterization or increased timeout values for greater solving capability.
ACM Student Research Competition: Undergraduate Poster
Posters
Parallelization of the Finite Element-Based Mesh Warping Algorithm Using Hybrid Parallel Programming
TP
DescriptionWarping large volume meshes is computationally expensive and has applications to biomechanics, aerodynamics, image processing, and cardiology. Existing parallel implementations of mesh warping algorithms do not take advantage of shared-memory and one-sided communication features available in MPI-3. In this poster, we describe our parallelization of the finite element-based mesh warping algorithm for tetrahedral meshes. Our implementation takes advantage of shared memory and one-sided communication and deforms a mesh by solving a linear system with multiple right-hand sides based on the solution of a Poisson boundary value problem. Our results demonstrate excellent efficiency and strong scalability on up to 32 cores on a single node. Furthermore, we show a 90% increase in speedup with 256 cores distributed uniformly across 64 nodes versus our largest single node speedup while observing sublinear speedups overall.
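A minimal sketch of the MPI-3 shared-memory feature the poster builds on, with details (sizes, layout) that are assumptions rather than taken from the abstract: ranks on one node allocate a shared window and access it with direct loads and stores.

```cpp
// MPI-3 node-local shared memory: split the communicator per node,
// allocate one shared segment, and write to it directly.
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    // Per-node communicator so the window is backed by shared memory.
    MPI_Comm node_comm;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    int node_rank = 0;
    MPI_Comm_rank(node_comm, &node_rank);

    const MPI_Aint local_count = 1024;               // illustrative size
    double* base = nullptr;
    MPI_Win win;
    // Every rank on the node contributes a slice of one shared segment.
    MPI_Win_allocate_shared(local_count * sizeof(double), sizeof(double),
                            MPI_INFO_NULL, node_comm, &base, &win);

    MPI_Win_lock_all(0, win);
    for (MPI_Aint i = 0; i < local_count; ++i)
        base[i] = node_rank;                         // direct load/store access
    MPI_Win_unlock_all(win);

    MPI_Win_free(&win);
    MPI_Comm_free(&node_comm);
    MPI_Finalize();
    return 0;
}
```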
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionCommon Workflow Language (CWL) is a widely adopted language for defining and sharing computational workflows. It is designed to be independent of the execution engine on which workflows are executed. Here, we describe our experiences integrating CWL with Parsl, a Python-based parallel programming library designed to manage execution of workflows across diverse computing environments.
We propose a new method that converts CWL CommandLineTool definitions into Parsl apps, enabling Parsl scripts to easily import and use tools represented in CWL. We describe a Parsl runner capable of executing a CWL CommandLineTool directly. We also describe a proof-of-concept extension to support inline Python in a CWL workflow definition, enabling seamless use in Parsl's Python ecosystem. We demonstrate benefits of this integration by presenting example CWL CommandLineTool definitions that show how they can be used in Parsl, and comparing performance of executing an image processing workflow using Parsl-CWL and other CWL runners.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Cloud Computing
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
DescriptionIn cloud environments, GPU-based deep neural network (DNN) inference servers are required to meet the Service Level Objective (SLO) latency for each workload under a specified request rate, while also minimizing GPU resource consumption. However, previous studies have not fully achieved this objective. In this paper, we propose ParvaGPU, a technology that facilitates spatial GPU sharing for large-scale DNN inference in cloud computing. ParvaGPU integrates NVIDIA's Multi-Instance GPU (MIG) and Multi-Process Service (MPS) technologies to enhance GPU utilization, with the goal of meeting the diverse SLOs of each workload and reducing overall GPU usage. Specifically, ParvaGPU addresses the challenges of minimizing underutilization within allocated GPU space partitions and external fragmentation in combined MIG and MPS environments. We conducted our assessment on multiple A100 GPUs, evaluating 11 diverse DNN workloads with varying SLOs. Our evaluation revealed no SLO violations and a significant reduction in GPU usage compared to state-of-the-art frameworks.
Invited Talk
TP
DescriptionIn an era of rapid AI progress, leveraging accelerated computing and big data has unlocked new possibilities to develop general-purpose AI models. As AI systems like OpenAI's ChatGPT showcase excellent performance in the digital realm, we are compelled to ask: How can such success be translated into the physical world to create generalist robots capable of everyday tasks? In this talk, I will present our data-centric research principles and approaches toward building general-purpose robot autonomy in the open world. Specifically, I will discuss our recent works using large-scale computing and GPU-accelerated physics simulation to generate high-quality training data for robotic foundation models. By combining these advances with cutting-edge developments in humanoid robotics, we are laying the foundation for the next generation of autonomous robots.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Exhibits
SCinet
TP
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionPerformance inefficiencies in software can severely impact application quality and resource utilization. Addressing these issues often requires significant developer effort, yet the lack of large-scale, open-source performance datasets hinders the development of effective mitigation strategies. To fill this gap, we present PcMINER, a tool that mines performance inefficiency-related commits from GitHub at scale. PcMINER uses PcERT-KD, a transformer model that classifies these commits with accuracy comparable to 7B parameter LLMs but with reduced computational costs, making it ideal for CPU cluster deployment. By mining GitHub repositories with a 50-node CPU cluster, PcMINER has generated a dataset of 162K performance-related commits in C++ and 103.8K in Python. This dataset promises to enhance data-driven approaches to detecting performance inefficiencies.
In the poster session, I will present the problem, motivation, methodology, and results, with additional details that may be accessible through a QR code, and will provide a brief oral overview.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionPDSW 2024 Welcome Presentation
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn scientific computing, artificial intelligence and machine learning (AI/ML) appear increasingly often in scientific workflows on supercomputers, due to their ability to solve more complex problems. The nature of these workflows creates new requirements for performance analysis tools, in particular incentivizing lower integration costs and support for more diverse codes.
In response, we introduce PerfFlowAspect, which approaches this problem via reduced instrumentation costs, support for C/C++ and Python code bases, and multiple trace formats that support multiple workflow components. To evaluate its effectiveness, we consider the use case of AMS, a complex application for simplifying machine learning surrogate model integration in HPC codes.
PerfFlowAspect is an open-source tool under active research and development. At the poster session, I will present my work with the aid of the poster by elaborating on the data, text, and figures in it.
Workshop
I/O, Storage, Archive
W
DescriptionIn this paper we evaluate multiple parallel programming models with respect to both ease of expression and resulting performance. We do this by implementing the mathematical algorithm known as the "power method" in a variety of ways, using modern C++ techniques.
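For illustration, a minimal sketch of one such variant, assuming a dense row-major matrix and the standard-algorithms style (the paper itself compares several programming models):

```cpp
// One power-method iteration: y = A*x, normalize y, return ||A*x|| which
// converges to the magnitude of the dominant eigenvalue.
#include <algorithm>
#include <cmath>
#include <execution>
#include <numeric>
#include <vector>

double power_iteration(const std::vector<double>& A,     // n*n, row-major
                       std::vector<double>& x, std::vector<double>& y,
                       std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        const double* row = &A[i * n];
        // Dot product of matrix row with the current vector.
        y[i] = std::transform_reduce(std::execution::unseq,
                                     row, row + n, x.begin(), 0.0);
    }
    // Normalize y so the iteration does not overflow or vanish.
    double norm = std::sqrt(std::transform_reduce(std::execution::unseq,
                                                  y.begin(), y.end(),
                                                  y.begin(), 0.0));
    std::for_each(y.begin(), y.end(), [norm](double& v) { v /= norm; });
    x.swap(y);
    return norm;
}
```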
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionIn current discrete GPU systems, the penalty of data movement between host and device memory is inevitable, forcing many large-scale applications to include optimizations that amortize this cost. On systems like the AMD Instinct™ MI300A series accelerators, based on the accelerated processing unit (APU) architecture, host and device memories are unified into a single physical storage. On an APU, the GPU can access memory in the same way the CPU does, thus avoiding the need for additional data movement (zero-copy). To inform developers of MI300A's expected advantages and potential overheads, we follow an experimental approach to study our OpenMP implementation that leverages MI300A zero-copy. Performance results show that zero-copy is faster than the legacy “copy” implementation by a ratio of 1.2X-2.3X for a production-ready application, but that it incurs up to 11% penalty for one SPECaccel 2023 benchmark.
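A minimal sketch of the zero-copy idea, assuming an OpenMP toolchain that honors unified shared memory (as on an APU): with the requires directive, the target region dereferences host pointers directly, so no map clauses or explicit host-device copies are needed.

```cpp
// Zero-copy style OpenMP offload: the GPU works on host-allocated memory.
#include <vector>

#pragma omp requires unified_shared_memory

void scale(std::vector<double>& v, double a) {
    double* p = v.data();
    const std::size_t n = v.size();
    // No map clauses: the target region accesses the host allocation directly.
    #pragma omp target teams distribute parallel for
    for (std::size_t i = 0; i < n; ++i)
        p[i] *= a;
}
```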
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionThe NSF-funded Frontera system, the fastest academic supercomputer in the US, has supported numerous scientific applications in the HPC, big data, and machine learning domains over the last five years. Applications have leveraged the computational capabilities of the system to achieve breakthroughs in science and engineering. Frontera enables researchers to tackle more significant and complex challenges than ever before. As Frontera is nearing the end of its production life, it will be replaced by a new system, "Horizon". An intermediate system, "Vista", will bridge the gap by enabling researchers to access updated software and hardware technologies before Horizon is available. This paper summarizes early experiences on Vista by reporting on the performance of key applications and presenting its design and architecture. Early results are presented for the CPU architecture of Vista, "Grace-Grace", and compared with other processor technologies at TACC from Intel and AMD.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionLarge-scale Computational Fluid Dynamics (CFD) simulations are typical HPC applications that require both high memory bandwidth and large memory capacity. However, it is difficult to achieve high performance for such applications on modern high-performance processors due to their low memory bandwidth compared to their high computational power. Near-memory computing can overcome this problem by placing on-chip memory near arithmetic units and reducing off-chip accesses. MN-Core is a distributed memory SIMD processor with each core having its own addressable memory, realizing a near-memory computing processor. MN-Core can be an attractive platform for executing bandwidth-demanding HPC applications. This paper reports the performance of MN-Core for three kernels from the NICAM benchmark, taken from the NICAM global climate model. The evaluation results show that MN-Core realizes 986 GFLOPS at maximum, which is 13.4% of its peak performance. This efficiency is comparable to those obtained on CPUs with high memory bandwidth, such as Fujitsu A64FX.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionThe rapid advancement of machine learning (ML) technologies has driven the development of specialized hardware accelerators designed to facilitate more efficient model training. This paper introduces the CARAML benchmark suite, which is employed to assess performance and energy consumption during the training of transformer-based large language models and computer vision models on a range of hardware accelerators, including systems from NVIDIA, AMD, and Graphcore. CARAML provides a compact, automated, extensible, and reproducible framework for assessing the performance and energy of ML workloads across various novel hardware architectures. The design and implementation of CARAML, along with a custom power measurement tool called jpwr, are discussed in detail.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionAs HPC systems move into the exascale era, an increasing diversity of hardware is deployed. The last decade saw the ascendance of NVIDIA GPU-accelerated systems among the largest-scale HPC systems and spurred the need for application developers to consider approaches to performance portability that preserve developer productivity. This challenge has been compounded in the last several years by the introduction of the first two exascale systems, Frontier and Aurora. These systems utilize new GPUs, with Frontier utilizing the AMD MI250X and Aurora the Intel Max 1550. In addition, these systems introduce new programming models for applications. This study investigates the performance portability of 12 HPC/ML applications on three large-scale HPC systems that utilize GPUs from different vendors: Frontier (AMD), Aurora (Intel), and Polaris (NVIDIA). The performance and portability of these applications were investigated at single-GPU, single-node, and multi-node scales on each of the three systems.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionUnderstanding performance and provenance of task-based workflows poses significant challenges, particularly in distributed configurations where resources are shared by multiple applications. Task-based workflow management systems further complicate performance predictability because of their dynamic behavior, which subtly alters task execution order from run to run.
In this paper we propose a layered characterization framework for performance and task provenance for Dask.distributed workflows running on high-performance computing platforms. It collects data from jobs, the workflow management system, and the operating system to aid in understanding the performance of these workflows. Our approach encompasses three main contributions: first, an extension of Dask.distributed to capture high-fidelity task provenance using Mochi data services; second, the adaptation of the established HPC I/O characterization tool Darshan to gather high-fidelity I/O data, thereby enhancing the granularity of our analysis; and third, a framework to combine and process the collected data and provide helpful insights into performance characterization and reproducibility.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWind farm simulations require data from mesoscale atmospheric simulations as initial and boundary conditions for the microscale turbine environments. The Energy Research and Forecasting (ERF) code bridges this scale gap and provides an efficient GPU-enabled parallel implementation with adaptive mesh refinement through the underlying AMReX framework. This poster outlines strategies that reduce the communication overhead among parallel processes through run time settings or systemic changes in the memory management. This includes using shared-memory parallelism over CPUs, enabling direct GPU-GPU data transfers, and implementing a separate memory pool on the GPU for communication buffers. I will present the performance scaling and improvements from these techniques as part of the poster. We are currently developing an in-memory coupling of the compressible flow ERF code with the incompressible turbine solver ExaWind, which will allow holistic wind farm simulations.
Tutorial
Numerical Methods
Performance Evaluation and/or Optimization Tools
Portability
TUT
DescriptionThis tutorial covers code analysis, performance modeling, and optimization for sparse linear solvers on CPU and GPU nodes. Performance Engineering is often taught using simple loops as instructive examples for performance models and how they can guide optimization; however, full, preconditioned linear solvers comprise multiple back-to-back loops enclosed in an iteration scheme that is executed until convergence is achieved. Consequently, the concept of “optimal performance” has to account for both hardware resource efficiency and iterative solver convergence. We convey a performance engineering process that is geared towards linear iterative solvers. After introducing basic notions of hardware organization and storage for dense and sparse data structures, we show how the Roofline performance model can be applied to such solvers in predictive and diagnostic ways and how it can be used to assess the hardware efficiency of a solver, covering important corner cases such as pure memory boundedness. Then we advance to the structure of preconditioned solvers, using the Conjugate Gradient Method (CG) algorithm as a leading example. Hotspots and bottlenecks of the complete solver are identified followed by the introduction of advanced performance optimization techniques like the use of mixed precision and cache blocking. Hands-on exercises in Python complement the lectures.
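For orientation, a schematic of the back-to-back kernels inside one CG iteration that such an analysis targets (the tutorial's hands-on material is in Python; this C++ sketch is only illustrative and assumes a user-supplied sparse matrix-vector product):

```cpp
// Plain CG: each iteration is an SpMV, two dot products, and several AXPYs,
// which is exactly the loop structure whose hardware efficiency and
// convergence behavior the tutorial's performance models address.
#include <cmath>
#include <functional>
#include <vector>

using Vec = std::vector<double>;

void cg(const std::function<void(const Vec&, Vec&)>& spmv,   // y = A*x
        const Vec& b, Vec& x, int max_it, double tol) {
    const std::size_t n = b.size();
    Vec r(b), p(b), Ap(n);                      // assumes x = 0 initially
    double rs = 0.0;
    for (std::size_t i = 0; i < n; ++i) rs += r[i] * r[i];

    for (int it = 0; it < max_it && std::sqrt(rs) > tol; ++it) {
        spmv(p, Ap);                            // memory-bound SpMV kernel
        double pAp = 0.0;
        for (std::size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];   // dot
        const double alpha = rs / pAp;
        for (std::size_t i = 0; i < n; ++i) {   // two AXPY updates
            x[i] += alpha * p[i];
            r[i] -= alpha * Ap[i];
        }
        double rs_new = 0.0;
        for (std::size_t i = 0; i < n; ++i) rs_new += r[i] * r[i];
        const double beta = rs_new / rs;
        for (std::size_t i = 0; i < n; ++i) p[i] = r[i] + beta * p[i];
        rs = rs_new;
    }
}
```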
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionAlthough recent supercomputers have been improving their computational performance, achieving performance scaling with respect to the number of nodes is not easy due to long inter-node communication latency. Many attempts have been made to hide communication latency and maintain strong scalability even for dense matrix multiplication. Matrix multiplication is an ideal candidate for benchmarking the performance of supercomputers. The Cerebras CS-2 system is an accelerator for deep learning with the world’s largest chip, the wafer-scale engine 2 (WSE-2). The WSE-2 can be considered a distributed memory system that comes with 745500 processing elements connected in a low-latency 2D mesh topology. This paper presents the maximum performance, weak and strong scaling performance, and proposes a performance model for single-precision matrix multiplication on CS-2. We observed the maximum performance of 349.0 TFlops/s (matrix size: 33000x33000) and a weak scaling efficiency of 1.00. The mean absolute percentage error of the model was 4.7%.
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
DescriptionModern computer processors improve their computing power by having multiple cores. Traditionally, these cores were homogeneous: many identical cores with the same capabilities. Instead, it is possible to create processors that have heterogeneous (or hybrid) cores, where the various cores have differing capabilities. This can lead to energy savings and other efficiencies, but complicates performance analysis. Heterogeneous cores have been common for years in embedded ARM processors; recently, support has appeared in x86 desktop processors as well. It is likely that, before long, server and high-performance systems will also gain hybrid cores.
We look at current Linux support for heterogeneous processors and detail the various problems encountered when adding support for them to the PAPI performance measurement library.
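A minimal sketch of typical PAPI usage, assuming a PAPI build where the preset event below is available; on hybrid processors, the native events behind such presets can differ between core types, which is part of the challenge the paper describes.

```cpp
// Count retired instructions around a small kernel with PAPI.
#include <papi.h>
#include <cstdio>

int main() {
    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

    int evset = PAPI_NULL;
    PAPI_create_eventset(&evset);
    // PAPI_TOT_INS is a preset event; on hybrid cores the underlying
    // native event may differ between performance and efficiency cores.
    PAPI_add_named_event(evset, "PAPI_TOT_INS");

    PAPI_start(evset);
    volatile double x = 0.0;
    for (int i = 0; i < 1000000; ++i) x += i * 0.5;   // work being measured
    long long count = 0;
    PAPI_stop(evset, &count);

    std::printf("instructions: %lld\n", count);
    return 0;
}
```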
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionBioinformatics workloads differ significantly from traditional scientific computing and AI workloads because they consist primarily of integer-only operations and string comparisons rather than floating-point operations. The underlying algorithms usually have low arithmetic intensity, irregular memory access patterns, and non-deterministic workloads. Local Assembly is an essential step in large-scale genome assembly software and is typically implemented using de Bruijn graphs. This paper examines the performance, portability, and productivity of a local assembly GPU kernel from a metagenome assembly pipeline implemented using hash table data structures on NVIDIA, AMD, and Intel GPUs. We focus on the challenges of achieving portability while maintaining performance for a complex bioinformatics GPU kernel that relies on hardware-specific optimizations. In this paper, we evaluate the local assembly kernel's performance and portability across different GPU architectures, identify performance bottlenecks, and propose modifications in existing tools and methods for performance modeling and analysis of integer-heavy bioinformatics application kernels.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe volume of data required for high performance computing (HPC) jobs is growing faster than the memory storage available to store the required data, leading to performance bottlenecks. Hence the need for inline data compression, which reduces the amount of allocated memory needed by storing all data in its compressed format and decompressing/recompressing single variables as needed. We apply inline compression to HPC application pySDC, a framework that solves collocation problems iteratively using parallel-in-time methods. We introduce a new version of pySDC that has a compression manager to add inline compression functionality, along with a software cache that stores the decompressed state of the most frequently used variables. We use lossy compressor ZFP and test our model with varying software cache sizes. Results show that having no cache has the best compression ratio, but having a cache of size 16 improves the timing while also slightly improving the memory footprint.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLAMMPS (Large-scale Atomic/Molecular Massively Parallel Simulator) is a widely used molecular dynamics simulator. It is used here to simulate the high-pressure BC8 phase of carbon using the Spectral Neighbor Analysis Potential (SNAP). This simulation employs the Kokkos C++ performance portability layer for its inter-atomic potential calculations in SNAP on GPUs. We evaluate LAMMPS’ performance across different programming environments and MPI implementations on two leadership-class supercomputers, Perlmutter at NERSC and Frontier at OLCF. Additionally, we analyze performance trends within containerized environments. Our systematic empirical study assesses various configurations on these systems to provide insights and recommendations for optimizing application performance. This study aims to guide users in selecting the most effective setup for their application.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe NERSC-10 Benchmark Suite is a collection of tests which are designed to evaluate various aspects of system architecture and performance. This study utilizes the suite to evaluate the CPU compute nodes of two leadership-class supercomputers, Perlmutter at NERSC and Summit at OLCF, using two N10 benchmarks. We compare different BLAS implementations, including vendor-provided options and open-source alternatives. Our analysis, leveraging two benchmarks from this suite, reveals performance differences between vendor-provided and open-source BLAS implementations. The results offer valuable insights for users, highlighting which BLAS implementations may be optimal for compute-bound applications on each system.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionTo fully leverage the computing power available in GPU-based HPC systems, it is crucial to balance performance and portability across GPU vendors. This work discusses the restructuring and porting of several key kernels in the GAMESS quantum chemistry software to AMD, Intel, and NVIDIA GPUs via OpenMP API. By leveraging OpenMP, the same code is made portable across GPU vendors. However, due to vendor-specific implementation of OpenMP, GPU code generation varies and can result in large variations in performance even on the same hardware. This work highlights the challenges faced in GPU offloading via OpenMP, which are likely to be encountered in other porting efforts, including memory limitations, the need for substantial restructuring, and differences in compiler optimizations. Also presented are strategies and approaches to address these challenges, along with performance results across supercomputing systems such as Summit, Aurora, Frontier and Perlmutter, using a range of vendor software stacks.
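A minimal, GAMESS-independent sketch of the vendor-neutral offload pattern in question: an OpenMP target region with explicit data mapping and a reduction, whose generated GPU code (and thus performance) may differ across vendor OpenMP implementations.

```cpp
// Offloaded dot product: explicit map clauses move the arrays to the device,
// and the reduction result is returned to the host.
#include <cstddef>

double dot(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    #pragma omp target teams distribute parallel for reduction(+:sum) \
            map(to: a[0:n], b[0:n])
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i] * b[i];
    return sum;
}
```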
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionIn this paper, we present GPU optimizations for an ice-sheet modeling code known as MPAS-Albany Land Ice (MALI). MALI is a C++ template code that leverages the Kokkos programming model for portability and the Trilinos library for data structures and nonlinear and linear solvers. Performance of the most expensive kernel is assessed via the Roofline model to highlight the potential for code improvement according to the underlying GPU architecture. We perform optimizations consisting of loop fusion, loop optimizations, and local accumulation to productively and portably attain an overall speedup of 3× on both NVIDIA and AMD GPUs. We analyze the performance gains using a time-oriented performance portability model based on time per invocation and GPU data movement. Results show an improvement of between 20% and 50% in the performance portability metric from improved data locality, and highlight the importance of optimizing GPU-ported scientific applications to maximize memory bandwidth and minimize data movement on modern supercomputers.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
ACM Student Research Competition: Graduate Poster
Posters
TP
DescriptionMany parallel applications rely on iterative stencil operations, whose performance is dominated by communication costs at large scales. Several MPI optimizations, such as persistent and partitioned communication, reduce overheads and improve communication efficiency through amortized setup costs and reduced synchronization of threaded sends. This paper presents the performance of stencil communication in the Comb benchmarking suite when using non-blocking, persistent, and partitioned communication routines. The impact of each optimization is analyzed at various scales. Further, the paper presents an analysis of the impact of process count, thread count, and message size on partitioned communication routines. Measured timings show that persistent MPI communication can provide a speedup of up to 37% over the baseline MPI communication, and partitioned MPI communication can provide a speedup of up to 68%.
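A minimal sketch of the persistent-communication optimization the poster measures (not Comb itself; neighbors and message sizes are assumptions): the requests are set up once and restarted each iteration, amortizing setup cost across the stencil's repeated halo exchanges.

```cpp
// Persistent point-to-point requests for a repeated halo exchange.
#include <mpi.h>
#include <vector>

void exchange_loop(int left, int right, int iters) {
    std::vector<double> send_l(1024), recv_r(1024);   // illustrative halo buffers

    MPI_Request reqs[2];
    // Set up the communication pattern once ...
    MPI_Send_init(send_l.data(), 1024, MPI_DOUBLE, left, 0,
                  MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(recv_r.data(), 1024, MPI_DOUBLE, right, 0,
                  MPI_COMM_WORLD, &reqs[1]);

    for (int it = 0; it < iters; ++it) {
        // ... then restart it every iteration.
        MPI_Startall(2, reqs);
        // (compute on interior cells here, overlapping communication)
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
}
```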
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionWith unprecedented demand for GenAI inference, acceleration of primitives that dominate GenAI, such as GEMV, is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable PIM prototypes that attain a bandwidth boost over the processor by augmenting memory banks with compute capabilities and broadcasting the same command to all banks. While proposed PIM designs stand to accelerate GEMV, we observe that a key impediment to harnessing PIM acceleration is deducing the optimal data placement of the matrix in memory banks. To this end, we tease out the factors that impact data placement and propose PIMnast, which, like a gymnast, balances these factors to identify data placements that deliver GEMV acceleration. Across a spectrum of GenAI models, PIMnast, along with additional orchestration knobs we identify, delivers up to a 6.86x speedup for GEMVs (of the available 7x roofline speedup), leading to up to a 5x speedup for per-token latencies.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDecision trees are popularly used in statistics and machine learning. Piecewise linear trees, a type of model-based decision tree, employ linear models to evaluate splits and predict outcomes at the leaf nodes. While they can offer high accuracy, they are computationally expensive, and no scalable implementations currently exist that do not sacrifice accuracy.
We introduce PINE, an efficient yet effective approach for training piecewise linear trees, incorporating various algorithmic and system optimizations. These optimizations enable fast training on multicore CPUs without sacrificing model accuracy. We also present PINEBoost, which applies gradient boosting to PINE, and compare its performance with existing frameworks. Experimental results demonstrate that PINE and PINEBoost achieve superior accuracy and faster convergence rates across general datasets in regression tasks compared to state-of-the-art gradient boosting decision trees.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Cloud Computing
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
DescriptionInference of Large Language Models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce bottlenecks associated with memory bandwidth, but also increase end-to-end latency per inference run, requiring high speculation acceptance rates to improve performance. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization for single-request scenarios, while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15x improvement in generation speed over standard speculative inference.
PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation, the former improving latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs, even in the middle of inference.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionInference of large language models (LLMs) across computer clusters has become a focal point of research in recent times, with many acceleration techniques taking inspiration from CPU speculative execution. These techniques reduce memory bandwidth requirements, but also increase latency per inference run, requiring high speculation acceptance rates to improve performance. As a remedy, we propose PipeInfer, a pipelined speculative acceleration technique to reduce inter-token latency and improve system utilization while also improving tolerance to low speculation acceptance rates and low-bandwidth interconnects. PipeInfer exhibits up to a 2.15× improvement in generation speed over standard speculative inference. PipeInfer achieves its improvement through Continuous Asynchronous Speculation and Early Inference Cancellation; the former improves latency and generation speed by running single-token inference simultaneously with several speculative runs, while the latter improves speed and latency by skipping the computation of invalidated runs.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionIntel Data Center GPU Max 1550, known as Ponte Vecchio (PVC), is a new Intel GPU architecture for high-performance computing. It is the basis of two systems on the June 2024 TOP500 list, Dawn (#51) and Aurora (#2).
This work provides micro-benchmarking data on PVCs from which application developers may benefit, shows how the micro-benchmarking results are indicative of mini-app performance on PVC, and demonstrates real applications on large-scale Intel GPU systems.
We quantify the obtainable performance from PVC systems through micro-benchmarking fundamental architectural properties. We evaluate the performance of four mini-apps with known performance characteristics, and two full applications, comparing performance on a node of Aurora and Dawn with a node of NVIDIA H100 GPUs and a node of AMD MI250 GPUs. We show the figure-of-merit of the mini-apps on a single PVC ranges from 0.6–1.8X the performance of an H100, and 0.8–7.5X of an MI250.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionThere is a continuing interest in using standard language constructs for accelerated computing in order to avoid (sometimes vendor-specific) external APIs. For Fortran codes, the do concurrent (DC) loop has been successfully demonstrated on the NVIDIA platform. However, support for DC on other platforms has taken longer to implement. Recently, Intel has added DC GPU offload support to its compiler, as has HPE for AMD GPUs. In this paper, we explore the current portability of using DC across GPU vendors using the in-production solar surface flux evolution code HipFT. We discuss implementation and compilation details, including when/where using directive APIs for data movement is needed/desired compared to using a unified memory system. The performance achieved on both data center and consumer platforms is shown.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionWe describe portable software workflows for processing ptychography data from large-scale experimental facilities. The Advanced Photon Source (APS) and Linac Coherent Light Source (LCLS) are examples of facilities which generate large amounts of high-energy x-rays used for scientific research. Both facilities recently underwent upgrades that increased data output by orders of magnitude. The increased data rates necessitate processing pipelines that provide timely feedback to beamline scientists and scientific users. X-ray ptychography, a data-intensive computational imaging technique, stands to benefit from robust software workflows that facilitate the use of large computing facilities. Here we describe a file-based workflow that can be deployed at any ptychography beamline. We detail its use during two experiments: the first reconstructed 98 experimental scans sent from APS to the Argonne Leadership Computing Facility (ALCF); the second demonstrated cross-facility capabilities by reconstructing 15 experimental scans sent from users at LCLS to ALCF.
Tutorial
Broader Engagement
Parallel Programming Methods, Models, Languages and Environments
Portability
TUT
DescriptionIn this tutorial, you’ll discover the portable parallelism and concurrency features of the ISO C++23 standard and learn to accelerate HPC applications on modern, heterogeneous GPU-based systems from all three main vendors (AMD, Intel, NVIDIA), without any non-standard extensions. We’ll show you how to parallelize classic HPC patterns like multi-dimensional loops and reductions, and how to solve common problems like overlapping MPI communication with GPU computation. The material is supplemented with numerous hands-on exercises and illustrative HPC mini-applications. All exercises will be done on cloud GPU-instances directly in your web-browser; no setup required. The tutorial synthesizes practical techniques acquired from our professional experience to show how the C++23 standard programming model applies to real-world HPC workloads, and which thoughts went into implementing and designing the programming model itself. You'll also receive links to additional resources and a preview of upcoming C++ features.
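A minimal sketch in the spirit of the tutorial, using only ISO C++ facilities (execution policies and ranges); whether these run on a GPU depends on the compiler and toolchain used, as the tutorial discusses.

```cpp
// Standard-parallelism building blocks: a parallel reduction and a
// parallel index-based update, with no vendor-specific extensions.
#include <algorithm>
#include <execution>
#include <numeric>
#include <ranges>
#include <vector>

double dot(const std::vector<double>& a, const std::vector<double>& b) {
    // Parallel transform-reduce over both vectors.
    return std::transform_reduce(std::execution::par_unseq,
                                 a.begin(), a.end(), b.begin(), 0.0);
}

void saxpy(double alpha, const std::vector<double>& x, std::vector<double>& y) {
    // Iterate over an index range in parallel, a common pattern for
    // multi-dimensional loops when combined with nested index mappings.
    auto idx = std::views::iota(std::size_t{0}, y.size());
    std::for_each(std::execution::par_unseq, idx.begin(), idx.end(),
                  [&](std::size_t i) { y[i] += alpha * x[i]; });
}
```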
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionOcean simulation models often underperform on modern high-performance computing (HPC) architectures, necessitating costly and time-consuming code rewrites.
We introduce Poseidon, an HPC-oriented source-to-source translator for Fortran-based fluid dynamics solvers used in ocean and weather models with regular grid structures. Poseidon aims to recover high-level information and semantics lost during the process of converting numerics to source code.
We demonstrate Poseidon's approach using a research code implementing the 2D fast barotropic solver of full 3D ocean simulation models, which involves over 20 stencil-like kernels. Kernel fusion-based code optimization can already lead to a high combinatorial complexity.
Preliminary results include various performance studies with and without data flow graph-based modifications based on an exhaustive search for kernel fusion. Measurements show that Poseidon can generate optimized Fortran code.
In future work, Poseidon's automatic code rewriting should help to port existing code to GPUs, hide process communication latency, and apply automatic differentiation.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAs HPC applications become more I/O intensive, understanding their power consumption patterns is necessary to develop energy-saving solutions. Here, we evaluate the energy consumption of I/O operations on two popular HPC parallel file systems: Lustre and DAOS. We develop models to predict the energy usage of sequential writes and evaluate their accuracy against our gathered benchmarks. Our models can be used to enhance the accuracy of energy-predicting frameworks by allowing them to consider storage configuration when estimating total energy consumption.
Exhibits
Flash Session
TP
XO/EX
DescriptionDiscuss how Intel®’s cutting-edge Xeon® CPUs and Gaudi® 3 AI accelerators, when paired with Hypertec’s innovative immersion-born technology, are transforming data centers into energy-efficient powerhouses for AI and high-performance computing.
Birds of a Feather
TP
XO/EX
DescriptionWriting good parallel programs is painful. Very painful.
This is because high-performance compute systems have evolved from simple single-core machines, strung together with Ethernet, into multi-core, multi-accelerator, multi-level monsters. As a consequence, programming such systems means dealing with synchronization and communication overhead, load imbalance, and a multitude of programming models and languages.
In this BoF we bring together application people to share their pain in programming parallel systems, with people working on programming frameworks and models. We hope this can lead to insights and solutions to alleviate the pain.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionThe Zig programming language, which is designed to provide performance and safety as first-class concerns, has become popular in recent years. Given that Zig is built upon LLVM, and so enjoys many of the benefits provided by its ecosystem, including access to a rich set of backends, Zig has significant potential for high-performance workloads. However, it has yet to gain acceptance in HPC, and one of the reasons for this is that support for pragma-driven shared-memory parallelism is missing.
In this paper we describe enhancing the Zig compiler to add support for OpenMP loop directives, and we then explore performance using NASA's NAS Parallel Benchmark (NPB) suite. We demonstrate that not only does our integration of OpenMP with Zig scale comparably to the Fortran and C reference implementations of NPB, but furthermore Zig provides up to a 1.25 times performance increase compared to Fortran.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionAs high-performance computing (HPC) systems advance towards Exascale computing, their size and complexity increase, introducing new maintenance challenges. Modern HPC systems feature data monitoring infrastructures that provide insights into the system's state. This data can be leveraged to train machine learning models to anticipate anomalies that require compute nodes to undergo maintenance procedures. This paper presents a novel approach to predicting such anomalies by creating a graph per measurement that encodes current and past sensor readings and information related to the compute node sensors. The experiments were performed with data collected from Marconi 100, a tier-0 production supercomputer at CINECA in Bologna, Italy. Our results show that the machine learning model can accurately predict anomalies and surpass current State-Of-The-Art (SOTA) models regarding the quality of predictions and the time horizon considered to forecast them.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionIn High Energy Physics (HEP), large-scale experiments generate massive amounts of data that are distributed globally. To reduce redundant data transfers and improve analysis efficiency, a disk caching system named XCache is used to manage data accesses. By analyzing 11 months of access logs (4.5 million requests), we identified patterns in dataset usage and developed a predictive model to forecast the popularity of frequently accessed datasets.
Based on extensive exploratory data analysis, we found that pinning the most popular datasets in the cache could significantly improve access efficiency, and we implemented an LSTM model to predict dataset accesses and optimize cache policies.
The model demonstrates strong predictive performance with a low mean relative error of 0.779 across training and test datasets. Future work will incorporate anomaly detection techniques to improve robustness. This study highlights the potential of LSTM models in optimizing distributed content caching in HEP.
Workshop
I/O, Storage, Archive
W
DescriptionPredicting the structure of proteins has been a grand challenge for over 60 years. Google's DeepMind team leveraged artificial intelligence in 2020 to develop AlphaFold, achieving an accuracy above 90 for two-thirds of the proteins in CASP's competition. AlphaFold has been very successful in biology and medicine. However, a lack of training code and expansive computational requirements motivated an open-source implementation, OpenFold. OpenFold is fast, memory-efficient, and provides an OpenProtein dataset with five million MSAs. MLCommons added OpenFold to their HPC benchmarks suite in 2023, where it was evaluated by four institutions on NVIDIA GPU architectures. This work presents our endeavours to port, run and tune OpenFold on Intel's Ponte Vecchio (PVC) GPUs. To the best of our knowledge, this is the first large-scale study of a distributed implementation of the OpenFold application on Intel PVC GPUs, presenting the challenges, opportunities and performance of the application on Intel's Max series architecture.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionAuroraGPT seeks to test the hypothesis that a model trained on additional science data and text will improve performance on scientific tasks. Consider that existing models such as PaLM, the predecessor to Google's Gemini model family, were trained on 770B tokens, of which only ∼1.9% was scientific text. To meet our goal, we seek to incorporate substantially more scientific text.
In this presentation, we will share the recent progress of the AuroraGPT Data Team, how we contribute to the project of building a science-focused LLM with AuroraGPT, how we collaborate with the other teams, and what topics we see as open questions. As the data team, we are responsible for identifying, preparing, and deduplicating scientific data and text. We will talk about the systems and data quality challenges that our team tackles to prepare terabytes of scientific data and text to produce high-quality text and data for training.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionIn recent years, interest in RISC-V computing architectures has moved into the mainstream, especially in the field of High Performance Computing. The first single-board RISC-V CPUs implementing the finalized, ratified vector specification are being released. This family of vector processors supports variable-length array processing, as opposed to the fixed-length processing offered by SIMD. Vector processors also offer opportunities for vector chaining, which allows temporary results to be used without the need to resolve memory references.
In this work, we use the Octo-Tiger astrophysics application to study these early RISC-V chips with vector machine support. We report on our experience in porting this modern C++ code (built upon several open-source libraries such as HPX and Kokkos). In addition, we show the impact of the RISC-V Vector extension on a RISC-V single-board computer. We also compare the application's performance, scalability, and power consumption on a desktop-grade RISC-V computer to an A64FX system.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionSimulation data was computed on resources of the Argonne Leadership Computing Facility, and rendered using ParaView. No ML tools were leveraged in the rendering.
Exhibits
Flash Session
TP
XO/EX
DescriptionWe will explore best practices for maintaining fluid cleanliness within the Technical Cooling System piping and Coolant Distribution Units. Corrosion products can lead to reduced heat transfer at critical locations on the CPU or GPU chip, as well as cooling flow restrictions due to fine strainer clogging. The effects of fluid types and temperatures will also be discussed.
Tutorial
Artificial Intelligence/Machine Learning
Network
TUT
DescriptionRecent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including TensorFlow, PyTorch, and cuML enable high-performance training, inference, and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL Training and Inference, and Hyperparameter Optimization with special focus on parallelization strategies for large models such as GPT, LLaMA, BERT, ViT, and ResNet. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain first-hand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionLarge-scale systems collect enormous amounts of data and metadata about users located around the globe. Access to such systems faces regulatory, political, economic, and technological barriers, while privacy constraints prevent data sharing. We discuss operational practices for redesigning private and equitable HPC flows where privacy can serve utility. We explore anonymization techniques in the HPC context and discuss challenges in incorporating them into large-scale systems. We refer to current data protection regulations to address data transparency, processing, and management responsibilities for systems administrators.
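As one concrete example of the kind of technique discussed, the sketch below applies keyed pseudonymization to a user identifier in a job record; the key handling and field names are hypothetical, not operational guidance.

```python
# Sketch of one anonymization building block for HPC job logs: keyed pseudonymization
# of user identifiers. The secret key and record fields are hypothetical.
import hmac, hashlib

SECRET_KEY = b"site-held secret, rotated per retention policy"   # assumption

def pseudonymize(user_id: str) -> str:
    # HMAC keeps identifiers linkable within a study but not reversible without the key.
    return hmac.new(SECRET_KEY, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

record = {"user": "jdoe", "nodes": 128, "elapsed_s": 3600}
record["user"] = pseudonymize(record["user"])
print(record)
```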
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionThe HPC programming ecosystem is mostly based on the sequential C/C++/Fortran languages. These fundamental languages are then augmented with frameworks such as MPI and OpenMP to enable different types of parallelism. The increased prevalence of GPUs in HPC and AI further complicates the landscape with the addition of frameworks like CUDA, HIP, SYCL, OpenACC, and Kokkos. Combining one programming model for GPU parallelism with others, like MPI, for other forms of parallelism can make programming clusters and supercomputers difficult.
Chapel is a programming language where parallelism and locality are first-class concepts. In this paper, we introduce Chapel's support for GPU programming. We demonstrate that Chapel's language features for expressing parallelism and locality can make programming GPUs significantly easier. We share our experience with porting an image analysis application for coral reef biodiversity analysis, and show our initial experimental results on Frontier.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge language models (LLMs) have shown they can perform scientific tasks. They are capable of assisting researchers in data interpretation, instrument operation, knowledge synthesis, and hypothesis generation. However, LLMs must first be trained on a large dataset of scientific tasks and data. Training these models requires a substantial amount of time, energy, and computational resources, as altering a model's parameters on each iteration is expensive. Researchers have developed optimizations that can speed up the process of training LLMs with new data. In our research, we profile LLMs during fine-tuning with these optimizations to identify runtime bottlenecks and improvements. The optimizations we utilized include Low-Rank Adaptation (LoRA), BitFit, and Adapters. From our visual diagrams and runtime charts, we gain a better understanding of their performance and profiling breakdown during training and fine-tuning.
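A hedged sketch of one such setup follows: wrapping a small Hugging Face model with LoRA via the peft library and profiling a training step with torch.profiler. The model name and target modules are placeholders, not the configurations studied here.

```python
# Hedged sketch: LoRA fine-tuning step profiled with torch.profiler.
# "gpt2" and target_modules=["c_attn"] are placeholders, not this work's setup.
import torch
from torch.profiler import profile, ProfilerActivity
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
lora = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"], lora_dropout=0.05)
model = get_peft_model(model, lora)        # only adapter weights remain trainable

inputs = torch.randint(0, model.config.vocab_size, (2, 64))
with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    loss = model(input_ids=inputs, labels=inputs).loss
    loss.backward()
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```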
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionTraining large language models (LLMs) efficiently requires addressing the communication overhead introduced by parallelism strategies like Tensor, Pipeline, and Data Parallelism. This work profiles the communication patterns in LLM pretraining using the Polaris supercomputer, highlighting the impact of Tensor Parallelism, which suffers from significant overhead as parallelism scales. To mitigate this, we apply hZCCL, a homomorphic compression technique that reduces communication costs by eliminating decompression-operation-compression cycles. Our results show hZCCL accelerates training, achieving up to 6.77× speedup in multi-threaded modes while maintaining data accuracy. These improvements allow for more efficient scaling of LLM pretraining across distributed nodes.
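For illustration only (this is not hZCCL), the sketch below times the all-reduce operations that tensor parallelism generates, the kind of measurement such a profiling study depends on; the launch command, backend, and message sizes are assumptions.

```python
# Illustrative sketch (not hZCCL): timing all-reduce traffic with torch.distributed.
import time
import torch
import torch.distributed as dist

def timed_allreduce(numel: int = 1 << 20, iters: int = 10) -> float:
    tensor = torch.ones(numel)
    dist.barrier()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)
    dist.barrier()
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    # Launch with e.g. `torchrun --nproc_per_node=4 this_script.py` (assumed setup).
    dist.init_process_group("gloo")
    avg = timed_allreduce()
    if dist.get_rank() == 0:
        print(f"avg all-reduce time: {avg * 1e3:.2f} ms")
    dist.destroy_process_group()
```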
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionPagosa is a hydrodynamics computer code designed for massively parallel environments, operating within an Eulerian framework using a fixed Cartesian mesh. This project investigates the performance of Pagosa on Sapphire Rapids nodes, which feature many-core architectures and high bandwidth memory. By conducting strong and weak scaling studies, we aim to evaluate the impact of hyper-threading and propose modifications to Pagosa’s MPI environment, including a hybrid MPI and OpenMP parallel decomposition. The anticipated outcomes will inform strategies for optimizing Pagosa, ensuring it remains capable of tackling complex computational problems efficiently.
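As a small aside on the metrics such a study reports, the helper below converts measured runtimes into strong- and weak-scaling efficiencies; the timings shown are hypothetical, not Pagosa results.

```python
# Illustrative helpers (not from the Pagosa study) for scaling-efficiency calculations.
def strong_scaling_efficiency(t_base: float, t_n: float, n_base: int, n: int) -> float:
    # Ideal strong scaling: time shrinks in proportion to the resources used.
    return (t_base * n_base) / (t_n * n)

def weak_scaling_efficiency(t_base: float, t_n: float) -> float:
    # Ideal weak scaling: time stays constant as work and resources grow together.
    return t_base / t_n

# Hypothetical timings: 1 node vs. 8 nodes.
print(strong_scaling_efficiency(t_base=800.0, t_n=110.0, n_base=1, n=8))  # ~0.91
print(weak_scaling_efficiency(t_base=800.0, t_n=870.0))                   # ~0.92
```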
Tutorial
Accelerators
Artificial Intelligence/Machine Learning
Broader Engagement
TUT
DescriptionScientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. There are specialized hardware accelerators designed and built to run AI applications efficiently. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand the differences between these accelerators, their capabilities, programming approaches, and how they perform, particularly for scientific applications. In this tutorial, we will cover an overview of the AI accelerators landscape focusing on SambaNova, Cerebras, Graphcore, Groq, and Habana systems along with architectural features and details of their software stacks. We will have hands-on exercises to help attendees understand how to program these systems by learning how to refactor codes and compile and run the models on these systems. The tutorial will provide the attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.
Visit the Tutorial Website
Tutorial
Accelerators
Message Passing
Parallel Programming Methods, Models, Languages and Environments
Portability
TUT
DescriptionIf you are an HPC programmer, you know OpenMP. Alongside MPI, OpenMP is the open, cross-vendor foundation of HPC. As hardware complexity has grown, OpenMP has grown as well, adding GPU support in OpenMP 4.0 (2013). With a decade of evolution since then, OpenMP GPU technology is now a mature option for programming any GPU you are likely to find on the market.
While there are many ways to program a GPU, the best way is through OpenMP. Why? Because the GPU does not exist in isolation. There are always one or more CPUs on a node. Programmers need portable code that fully exploits all available processors. In other words, programmers need a programming model, such as OpenMP, that fully embraces heterogeneity.
In this tutorial, we explore GPU programming with OpenMP. We assume attendees already know the fundamentals of multithreading with OpenMP, so we will focus on the directives that define how to map loops onto GPUs and optimize data movement between the CPU and GPU. Students will use their own laptops (with Windows, Linux, or macOS) to connect to remote servers with GPUs and all the software needed for the tutorial.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLarge language models (LLMs) often require well-designed prompts for effective responses, but optimizing prompts is challenging due to prompt sensitivity, where small changes can cause significant performance variations. This study evaluates prompt performance across all permutations of independent phrases to investigate prompt sensitivity and robustness. We used two datasets: GSM8k, for mathematical reasoning, and a custom prompt for summarizing database metadata. Performance was assessed using the llama3-instruct-7B model on Ollama and parallelized in a high-performance computing environment. We compared phrase indices in the best and worst prompts and used Hamming distance to measure performance changes between phrase orderings. Results show that prompt phrase ordering significantly affects LLM performance, with Hamming distance indicating that changes can dramatically alter scores, often by chance. This supports existing findings on prompt sensitivity. Our study highlights the challenges in prompt optimization, indicating that modifying phrases in a successful prompt does not guarantee another successful prompt.
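A simplified sketch of this evaluation scaffolding follows (assumed, not the authors' harness): it enumerates phrase orderings of a prompt and computes the Hamming distance between two orderings; the phrases and the scoring call are placeholders.

```python
# Sketch: enumerate prompt phrase orderings and compare two orderings by Hamming distance.
from itertools import permutations

phrases = ["State the problem.", "Show your reasoning.", "Give only the final answer."]

def hamming(order_a, order_b):
    # Number of positions at which the two orderings place different phrases.
    return sum(a != b for a, b in zip(order_a, order_b))

orderings = list(permutations(range(len(phrases))))
for order in orderings:
    prompt = " ".join(phrases[i] for i in order)
    # score = evaluate_llm(prompt)   # placeholder: call the model and score the output

best, worst = orderings[0], orderings[-1]          # stand-ins for measured best/worst
print(len(orderings), "orderings; Hamming(best, worst) =", hamming(best, worst))
```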
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionSoftware auto-tuning (AT) is a technology that parameterizes factors affecting performance as "performance parameters" and tunes them automatically. In AT, the tool searches for better values of the performance parameters by repeatedly running the program. Therefore, if the target is a program with long execution times, such as machine learning, AT takes a very long time. To address this problem, we have tried to reduce the time by running the target program in parallel. However, simple parallelization does not always take full advantage of the parallelism of the computer system.
In this study, we proposed the system-resource-based search. This method increases the number of concurrently executing targets so that the system has no idle computing resources. The system-resource-based search was independent of the size of the search space and made the best use of the computational resources available on the supercomputer.
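A minimal sketch of the idea as we read it follows (not the authors' tool): keep as many parameter configurations running as there are free workers, refilling the pool as runs finish. The candidate configurations and cost function are placeholders.

```python
# Sketch: keep all workers busy evaluating performance-parameter configurations.
import random
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

def run_target(config):
    # Placeholder for launching the target program with one parameter setting.
    return config, sum(config) + random.random()

def resource_based_search(candidates, workers=4):
    best = None
    with ProcessPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(run_target, candidates.pop())
                   for _ in range(min(workers, len(candidates)))}
        while pending:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                cfg, cost = fut.result()
                if best is None or cost < best[1]:
                    best = (cfg, cost)
            while candidates and len(pending) < workers:   # refill idle workers
                pending.add(pool.submit(run_target, candidates.pop()))
    return best

if __name__ == "__main__":
    print(resource_based_search([(i, j) for i in range(4) for j in range(4)]))
```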
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
Workshop
I/O, Storage, Archive
W
DescriptionIn the microservice paradigm, monolithic applications are decomposed into finer-grained modules invoked independently in a data-flow fashion. The different modules communicate through remote procedure calls (RPCs), which constitute a critical component of the infrastructure. To ensure portable passage of RPC metadata, arguments, and return values between different microservices, RPCs involve serialization/deserialization activities, part of the RPC data center tax. We demonstrate how RPC server logic, including serialization/deserialization, can be offloaded to Data Processing Units (DPUs). This effectively reduces the RPC data center tax on the host, where applications' business logic runs. While we focus on offloading Protocol Buffers deserialization used by the popular gRPC framework, our findings can be applied to other RPC infrastructures. Our experimental results demonstrate that RPC offloading performs similarly to traditional methods while significantly reducing CPU usage.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionEffective environmental monitoring traditionally required physical presence at research sites. IoT technologies now enable remote data collection, revolutionizing this field. This work presents a prototype of smart buoys designed for coastal and marine ecosystems. Located in the IBIS testbed, these buoys are equipped with various sensors and Single Board Computers (SBCs) that not only collect real-time data but also offer significant growth potential. As the number of deployed buoys increases, they can integrate with supercomputing resources for advanced data processing and analysis. Initial tests demonstrate their ability to monitor environmental parameters accurately, enhancing weather forecasting, storm tracking, and maritime safety. The results underscore the potential of IoT-based smart buoys to advance remote monitoring, improve decision-making, and drive innovation in marine research and conservation.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionThis paper addresses a key challenge with using serverless computing for machine learning (ML) inference, which is cold starts that occur during initial invocations and container inactivity. Fixed keep-alive policies, like the commonly adopted 10-minute strategy, have been implemented by cloud providers to alleviate cold start issues. However, the substantial size of ML models poses a significant hurdle, leading to elevated keep-alive costs and potential strain on system resources. In response to these challenges, we introduce PULSE, a dynamic keep-alive mechanism that employs ML model variants to optimize the balance between keep-alive costs, accuracy, and service time while avoiding peaks in keep-alive memory consumption. Our evaluation, using real-world serverless workloads and commonly used machine learning models, demonstrates reduced keep-alive costs compared to the fixed policy. Additionally, we observe that integrating PULSE improves the performance of existing state-of-the-art serverless function warm-up strategies.
ACM Gordon Bell Finalist
TP
DescriptionRaman spectroscopy offers invaluable insights into the chemical composition and structural characteristics of various materials, making it a powerful tool for structural analysis. However, accurate quantum mechanical simulations of Raman spectra for large systems, such as biological materials, have been limited due to immense computational costs and technical challenges. In this study, we developed efficient algorithms and optimized implementations on heterogeneous computing architectures to enable fast and highly scalable ab initio simulations of Raman spectra for large-scale biological systems with up to 100 million atoms. Our simulations have achieved nearly linear strong and weak scaling on two cutting-edge high-performance computing systems, with peak FP64 performances reaching 400 PFLOPS on 96,000 nodes of the new Sunway supercomputer and 85 PFLOPS on 6,000 nodes of the ORISE supercomputer. These advances provide promising prospects for extending quantum mechanical simulations to biological systems.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionPython is the most popular programming language. OpenMP is the most popular parallel programming API. Projecting OpenMP into Python will help expand the HPC community. We call our Python-based OpenMP system PyOMP. In this short paper we describe PyOMP and its use for parallel programming for CPUs and GPUs. We describe its implementation through the well-known Numba just-in-time (JIT) compiler and how to install PyOMP on your own systems. We provide some performance results suggesting performance on par with that from C and OpenMP, but our focus here is not detailed benchmarking; we leave that to other papers. Our goal here is to show how to use PyOMP so we can grow the PyOMP community.
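In the style the paper describes, directives appear as strings inside Python with-statements and are JIT-compiled by Numba; the sketch below follows published PyOMP examples, and the import path is an assumption about the installed package rather than a tested recipe.

```python
# Sketch in the PyOMP style described above: an OpenMP directive as a string inside a
# Python "with" statement, compiled by Numba. The numba.openmp import path is assumed
# from published PyOMP examples and requires the PyOMP distribution, not stock Numba.
import numpy as np
from numba import njit
from numba.openmp import openmp_context as openmp   # assumption: provided by PyOMP

@njit
def parallel_sum(a):
    total = 0.0
    with openmp("parallel for reduction(+:total)"):
        for i in range(a.shape[0]):
            total += a[i]
    return total

x = np.arange(1_000_000, dtype=np.float64)
print(parallel_sum(x))
```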
Tutorial
Applications and Application Frameworks
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionPeople know Python. In many cases, it is their primary programming language. When they move to HPC, however, they have to translate everything into a low-level language such as C. This creates a barrier to entry for HPC. Wouldn’t it be nice if people could “keep everything in Python”?
In this tutorial we present a system for parallel programming in Python. We use OpenMP directives (identical to those used in C/Fortran) expressed as strings inside Python with-statements, coupled to the Numba JIT compiler. Numba drops us into LLVM with built-in OpenMP support, resulting in performance on par with that from C/OpenMP. We call our system PyOMP.
In this tutorial we present PyOMP for both the CPU and the GPU. We describe how to install PyOMP on a system and walk through the core design patterns from HPC. We run the tutorial in an interactive “demo-mode” as we write PyOMP programs together. Then, we will explore how our PyOMP programs map onto other well-known approaches to parallel programming in Python. By looking at how core patterns in PyOMP move between different programming models, we learn a great deal about those different programming models and greatly deepen our understanding of PyOMP.
Doctoral Showcase
Posters
TP
DescriptionQuantum computing has achieved significant milestones in recent years, underscoring its potential benefits for NP-hard applications both currently and in the future. Despite these advancements, contemporary quantum computers are hindered by noise and a limited number of qubits. These limitations pose significant challenges for quantum applications, noise management, resource scheduling, and optimization. This research addresses the gap between quantum algorithms and hardware characteristics by examining quantum applications from high-level circuit definitions to noise model generation, fault injection, resource management, and optimization.
This study highlights the quantum advantage over classical computing, explores early quantum neural networks, evaluates quantum metrics, and constructs a noise model based on quantum hardware using fault injection techniques to investigate the vulnerabilities of quantum algorithms, operations, and qubits. Our research considers the uncertainty factors of qubits, including random faults and single and double fault injections with circuit cutting and its limitations. The final stage evaluates the performance of quantum jobs submitted to backends under controlled noise, uncontrolled errors, and established metrics.
The proposed Quantum NFSO (Noise, Fault, and Scheduling Optimization) model presents a comprehensive approach to scheduling and optimization, accounting for noise, random errors, resource management, job scheduling, and circuit optimization. While quantum computers hold immense potential, it is essential to calibrate expectations appropriately. In the NISQ (Noisy Intermediate-Scale Quantum) era, scientists and researchers must align their perspectives with the technology's limitations. Understanding these constraints is crucial for advancing future computational technologies, including quantum computing.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis poster introduces QDD, a multi-node implementation of a decision diagram (DD)-based quantum circuit simulator. DD-based simulators offer faster simulation of algorithms like Shor's compared to statevector (SV) simulators by compressing the quantum state using a graph representation. However, parallelizing DD-based simulators has been challenging due to their dynamic data structures.
QDD addresses this by distributing the quantum state across multiple nodes and using ring communication to minimize communication overhead. Automatic SWAP gate insertion further optimizes communication. Experiments show QDD significantly outperforms the SV-based simulator in simulating Shor's algorithm. With 256 nodes, QDD achieves up to 10x faster runtime compared to a single-node implementation. The experiment also examined the number of processes per node and found that one process per node was preferable unless rack-to-rack communication occurred.
The poster explains the background of quantum simulation and decision diagrams, and then explains the multi-node method and experimental results in detail.
Posters
TP
DescriptionThis work extends Quantum Framework (QFw) by integrating it with Northwest Quantum Simulator (NWQ-Sim) and by introducing a lightweight Python library (qfw_backend) that allows multiple frontends (e.g., Qiskit) to interact with QFw. This extension enables QFw to flexibly decouple frontends from backends (e.g., NWQ-Sim). We demonstrate this capability by executing a Greenberger-Horne-Zeilinger (GHZ) circuit using Qiskit and Pennylane with different backends. Also, QFw enables easy scaling to multiple nodes. We showcase this by running GHZ scaling tests up to 32 qubits for different numbers of nodes on Frontier. To demonstrate the use of QFw for real-world problems, we solve a metamaterial optimization problem which uses a Quantum Approximate Optimization Algorithm (QAOA). We observe that QFw over NWQ-Sim marginally improves Qiskit-aer's accuracy. These additions prepare QFw to run hybrid applications in a hybrid resource environment since it treats actual quantum hardware and simulators alike.
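For reference, a frontend-side sketch of the GHZ circuit mentioned above, built with Qiskit and run on the Qiskit Aer simulator; swapping in QFw's qfw_backend is not shown here, and the shot count is arbitrary.

```python
# Frontend-side sketch: a three-qubit GHZ circuit in Qiskit, run on Qiskit Aer.
from qiskit import QuantumCircuit, transpile
from qiskit_aer import AerSimulator

ghz = QuantumCircuit(3, 3)
ghz.h(0)            # put qubit 0 into superposition
ghz.cx(0, 1)        # entangle qubit 1 with qubit 0
ghz.cx(1, 2)        # entangle qubit 2, completing the GHZ state
ghz.measure(range(3), range(3))

backend = AerSimulator()
result = backend.run(transpile(ghz, backend), shots=1024).result()
print(result.get_counts())   # expect roughly equal counts of '000' and '111'
```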
Workshop
Architecture
Network
Performance Optimization
Quantum Computing
System Administration
W
DescriptionThis paper investigates the design of a regional Quantum Network in Tennessee (QNTN) that will connect three quantum local area networks in different cities. We explore two approaches for achieving this interconnection: deploying a satellite constellation in the space layer and employing high-altitude platforms (HAPs) in the aerial layer. Our comparison reveals that a space-ground architecture that uses 108 satellites provides 55.17% coverage of the day and handles 57.75% of entanglement distribution requests with an average fidelity of 0.96. In contrast, the air-ground architecture delivers full-day coverage, fulfills 100% of requests, and achieves a higher average fidelity of 0.98. However, HAPs face significant challenges such as limited operational time, sensitivity to vibrations and weather conditions, and the need for continuous maintenance. This paper contributes to the understanding of optimal architecture for regional quantum networks, highlighting the trade-offs between satellite-based and air-ground approaches.
Invited Talk
TP
DescriptionIn the last decade, the quantum computing field has advanced from theory and university lab experiments to operational computing machines developed by several companies and startups. Even in these early development stages, quantum computers are becoming increasingly popular among several scientific and industrial users who want to test their real performance and adapt their solutions to this new programming paradigm. To control and operate quantum devices, one needs traditional computation: the quantum operations instructions and the quantum processing unit (QPU) readout are orchestrated by a "classical" computer. On top of that, there are no purely quantum algorithms, since several pre- and post-processing algorithmic subroutines require traditional computation. All these facts lead us to integrate quantum computers with classical computers, extending the supercomputer's capabilities by enabling a new chip technology. This talk will address the quantum-HPC integration roadmap: the motivation, methods, challenges, and opportunities. It will also summarize the experience that Barcelona Supercomputing Center is gaining with the installation, operation, and integration of two quantum computers on-premise.
Tutorial
Artificial Intelligence/Machine Learning
Emerging Technologies
Quantum Computing
TUT
DescriptionLeveraging quantum algorithms holds the potential for exponential speedup and disruptive new approaches in machine learning. In this SC tutorial, we present an introduction to quantum computing with a focus on its application to machine learning. It is aimed at computer- and data scientists and HPC users of any scientific domain. Our focus is to show how quantum computers are programmed today, while providing the essential math and physics background required to develop quantum algorithms. Our tutorial contains two parts. The first part introduces basic ideas and examples of gate-based quantum computing designed for participants with only basic knowledge of quantum computing. The second part illustrates quantum classifiers, such as the Variational Quantum Classifier, as an introduction to Quantum Machine Learning. The tutorial builds on our successful tutorials at ISC by refining the parts that were challenging for participants. This tutorial will use the Qiskit framework to show how quantum algorithms can be constructed from available building blocks. As participants progress through this tutorial, they will acquire practical insights into the complexities of quantum computing, equipping them with the fundamental skills to explore the potential of quantum machine learning in their respective fields.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionQuantum computing simulators via classical computing are essential for small-scale comprehension and testing of quantum computing algorithms. Quantum Volume (QV) is a well-established benchmark for comparing Quantum Processing Units (QPUs) in the NISQ era. However, there is no QV benchmark of a large variety of current quantum simulators. This poster compares quantum computing simulators running on a single CPU or GPU-based HPC node using the Quantum Volume benchmark. As simulators do not suffer from noise, the metric used in the comparison is the time required to simulate a set Quantum Volume. In the poster session, we will provide further differentiating information about each simulator and can provide more detail on how the proof of HOP (heavy output probability) convergence, a key component in QV testing with noiseless simulators, works.
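A sketch of the kind of timing comparison described above: simulate a Quantum Volume model circuit on one simulator and record the wall-clock time. The qubit counts, depth, and shot count are illustrative choices, not the poster's settings.

```python
# Sketch: time the simulation of Quantum Volume model circuits on Qiskit Aer.
import time
from qiskit import transpile
from qiskit.circuit.library import QuantumVolume
from qiskit_aer import AerSimulator

def time_qv(num_qubits: int, shots: int = 1000, seed: int = 7) -> float:
    circuit = QuantumVolume(num_qubits, depth=num_qubits, seed=seed)
    circuit.measure_all()
    backend = AerSimulator()
    compiled = transpile(circuit, backend)
    start = time.perf_counter()
    backend.run(compiled, shots=shots).result()
    return time.perf_counter() - start

for n in (4, 8, 12):
    print(f"QV circuit on {n} qubits: {time_qv(n):.3f} s")
```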
Invited Talk
TP
DescriptionOver approximately the last decade, quantum computing has evolved from purely laboratory research to enabling a community of quantum computational scientists with tools for the exploration and development of quantum algorithms and applications. More recently, as quantum systems have become more capable and mature, the technology has started to reach the broad community of computational scientists, with integration of quantum systems in HPC datacenters becoming more and more frequent. This new stage of development constitutes a first step towards our vision of quantum-centric supercomputing: integrated quantum and classical computing resources working together in parallel to run computations beyond what was possible before.
This talk will present some of the efforts along that vision, and will show how quantum computing will naturally interplay with supercomputing to increase the computational reach on heterogeneous quantum and classical systems. Furthermore, it will show how supercomputing can enable quantum computations hitherto only thought possible in a fault-tolerant scenario. These results open a very promising path towards extracting value from quantum computers before the maturity of quantum error correction. This talk will discuss not only how quantum computing can help define the next evolution of supercomputing, but also how supercomputing can have a critical role at different stages of a quantum computation and how classical developers can already actively engage with heterogeneous workflows in integrated quantum and classical systems.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionMaintaining performant code in a world of fast-evolving computer architectures and programming models poses a significant challenge to scientists. Typically, benchmark codes are used to model some aspects of a large application code's performance and are easier to build and run. Such benchmarks can help assess the effects of code or algorithm changes, system updates, and new hardware. However, most performance benchmarks are not written in a wide range of GPU programming models. The RAJA Performance Suite (RAJAPerf) provides a comprehensive set of computational kernels implemented in a variety of programming models. We integrated the performance measurement and analysis tools Caliper and Thicket into RAJAPerf to facilitate performance comparison across kernel implementations and architectures. This paper describes RAJAPerf, the performance metrics that can be collected, and experimental analysis with case studies.
Paper
Accelerators
Applications and Application Frameworks
Graph Algorithms
Modeling and Simulation
Numerical Methods
TP
DescriptionComputational Pangenomics is an emerging field that studies genetic variation using a graph structure encompassing multiple genomes. Visualizing pangenome graphs is vital for understanding genome diversity. Yet, handling large graphs can be challenging due to the high computational demands of the graph layout process.
In this work, we conduct a thorough performance characterization of a state-of-the-art pangenome graph layout algorithm, revealing significant data-level parallelism, which makes GPUs a promising option for compute acceleration. However, irregular data access and the algorithm's memory-bound nature present significant hurdles. To overcome these challenges, we develop a solution implementing three key optimizations: a cache-friendly data layout, coalesced random states, and warp merging. Additionally, we propose a quantitative metric for scalable evaluation of pangenome layout quality.
Evaluated on 24 human whole-chromosome pangenomes, our GPU-based solution achieves a 57.3x speedup over the state-of-the-art multithreaded CPU baseline without layout quality loss, reducing execution time from hours to minutes.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionPerformance of all-purpose communication libraries like MPI is fundamentally limited by the all-in-one approach these libraries use. RAPIDS (Reduced API Data-transfer Specifications) divides the functionality of these libraries into separate, more focused APIs, enabling library and application developers to avoid costly overhead of functionality they don’t use. This approach is highly adaptive and will evolve alongside modern GPUs, DPUs, and other accelerators.
Birds of a Feather
TP
XO/EX
DescriptionIn view of DOE’s Integrated Research Infrastructure (IRI), we are looking at a future where experimental or observational facilities routinely send their data to HPC for real-time computing to help steer experiments. The performance variability and latency introduced by buffering data on shared file systems are not acceptable for these multi-facility workflows — data should instead be streamed directly into HPC compute nodes over WAN.
This BoF will explore policies, strategies, and tools for direct, real-time streaming workflows in the HPC environment, aiming to raise awareness and foster a community effort to routinely and systematically support these workflows in the future.
Paper
Cloud Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Heterogeneous Computing
Performance Optimization
State of the Practice
TP
DescriptionHigh-dimensional grid-based simulations serve as both a tool and a challenge in researching various domains.
The main challenge of these approaches is the well-known curse of dimensionality, amplified by the need for fine resolutions in high-fidelity applications.
The combination technique (CT) provides a straightforward way of performing such simulations while alleviating the curse of dimensionality.
Recent work demonstrated the potential of the CT to join multiple systems simultaneously to perform a single high-dimensional simulation.
This paper shows an extension to three or more systems and addresses some remaining challenges: load balancing on heterogeneous hardware; utilizing compression to maximize the communication bandwidth; efficient I/O management through hardware mapping; improving memory utilization through algorithmic optimizations.
Combining these contributions, we demonstrate the CT for extreme-scale Superfacility scenarios of 46-trillion DOF on two systems and 35-trillion DOF on three systems.
Scenarios at these resolutions would be intractable with full-grid solvers (>1,000-nonillion DOF each).
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionData representation in quantum state space offers an alternative function space for machine learning tasks. However, benchmarking these algorithms at a practical scale has been limited by ineffective simulation methods. We develop a quantum kernel framework using a Matrix Product State (MPS) simulator and employ it to perform a classification task with 165 features and 6400 training data points, well beyond the scale of any prior work. We make use of a circuit ansatz on a linear chain of qubits with increasing interaction distance between qubits. We assess the MPS simulator performance on CPUs and GPUs and, by systematically increasing the qubit interaction distance, we identify a crossover point beyond which the GPU implementation runs faster. We show that quantum kernel model performance improves as the feature dimension and training data increases, which is the first evidence of quantum model performance at scale.
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionThis paper reviews recent enhancements to the Linux kernel that impact network throughput, and their potential impact on Data Transfer Node (DTN) performance. In particular, we explore the benefits of MSG ZEROCOPY and BIG TCP in controlled testbed environments at AmLight and ESnet. We compare performance on three different Linux kernel versions, on Intel vs AMD processors, and over multiple round trip times. Our results indicate that MSG ZEROCOPY, in conjunction with packet pacing, provides up to a 35% improvement in throughput, and that Linux 6.8 provides an increase in throughput of up to 38% on WAN and 30% on LAN compared to the 5.15 kernel. We conclude with recommendations for both host benchmarking and production-ready DTN configurations.
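For orientation, a rough, Linux-specific sketch of the sender-side knobs under discussion follows; the numeric socket constants are taken from recent kernel headers and are assumptions, since the Python socket module may not expose them by name, and this is not the paper's test harness.

```python
# Rough, Linux-specific sketch of SO_ZEROCOPY plus socket pacing on the sender side.
# Numeric constants come from Linux kernel headers (assumptions; verify for your kernel).
import socket

SO_ZEROCOPY = 60                 # from <asm-generic/socket.h> (assumption)
MSG_ZEROCOPY = 0x4000000         # from <linux/socket.h> (assumption)
SO_MAX_PACING_RATE = 47          # pacing cap, in bytes per second

def configure_sender(sock: socket.socket, pacing_bytes_per_s: int) -> None:
    sock.setsockopt(socket.SOL_SOCKET, SO_ZEROCOPY, 1)
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, pacing_bytes_per_s)

def send_zerocopy(sock: socket.socket, payload: bytes) -> int:
    # Completion notifications arrive on the socket error queue; handling them is omitted.
    return sock.send(payload, MSG_ZEROCOPY)

if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    configure_sender(s, pacing_bytes_per_s=1_250_000_000)   # ~10 Gbps pacing
    # s.connect(("dtn.example.org", 5001)); send_zerocopy(s, b"x" * (1 << 20))
```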
Paper
Accelerators
Artificial Intelligence/Machine Learning
Cloud Computing
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
DescriptionIndustrial recommendation models typically involve numerous feature fields. The embedding computation workloads are heterogeneous across these fields, thus requiring varied optimal code schedules. While existing solutions apply basic fusion optimization for embedding operations, they inefficiently treat all feature-fields with identical schedules, leading to suboptimal performance. In this paper, we introduce RecFlex, which generates fused kernels with distinct schedules for different feature-fields. RecFlex employs the interference-aware schedule tuner to tune schedules and the heterogeneous schedule fusion compiler to generate fused kernels, addressing two major challenges. To determine optimal schedules of different feature-fields within the fused kernel, RecFlex proposes a two-stage interference-simulated tuning strategy. To handle dynamic workloads that challenge tuning and fusion, RecFlex combines compile-time schedule tuning with runtime kernel thread mapping. RecFlex surpasses state-of-the-art libraries and compilers, achieving average speedups of 2.64×, 20.77×, and 11.31× over TorchRec, HugeCTR, and RECom, respectively. RecFlex is publicly available at https://github.com/PanZaifeng/RecFlex.
Exhibits
SCinet
TP
XO/EX
Exhibitor Forum
Hardware Technologies
TP
XO/EX
DescriptionThe ongoing disruption of generative AI, machine learning and disaggregated computing forces HPC system architects to rethink next-gen computing topologies. Traditional server, storage and networking implementations are giving way to GPU clustering, cache coherency and extended reach cabling within the rack, rack-to-rack, and across supercomputers.
Traditional copper interconnects have latency and distance limitations that constrain scaling. Mid-board or on-board optical transceivers offer improved performance and form-factor advantages while solving distance, latency, and bandwidth challenges.
In this presentation, Matthew Burns will detail Samtec’s growing portfolio of mid-board optical transceivers. Additionally, he will discuss a real-world CXL® over optics demonstration featuring Samtec, Rambus, and VIAVI technology making remote cache coherency a reality. Lastly, he will update the latest performance capabilities and use cases of Samtec’s Halo™ Next-Gen Optical Transceivers.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThe High-Velocity AI Cache (HVAC) mechanism promises to improve I/O performance by 12% while pretraining large foundation models for climate.
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionIn the post-Moore era, the quest for enhanced performance and reproducibility is more critical than ever. For researchers and engineers in high-performance computing (HPC) and scientific computing, reimagining key areas such as algorithms, hardware architecture, and software is essential to drive progress. In this talk, we will explore how performance engineering is evolving, focusing on checkpointing and the management of intermediate data in scientific workflows.
We will first discuss the shift from traditional low-frequency checkpointing techniques to modern high-frequency approaches that require complete histories and efficient memory use. By breaking data into chunks, using hash functions to store only modified data, and leveraging Merkle-tree structures, we improve efficiency, scalability, and GPU utilization while addressing challenges like sparse data updates and limited I/O bandwidth.
We will also examine the balance between performance and data persistence in workflows, where cloud infrastructures often sacrifice reproducibility for speed. To overcome this, we propose a persistent, scalable architecture that makes node-local data shareable across nodes. By rethinking checkpointing and cloud data architectures, we show how innovations in algorithms, hardware, and software can significantly advance both performance and reproducibility in the post-Moore era.
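A toy sketch of the chunk-hashing idea mentioned above follows (assumed, not the presented system): hash fixed-size chunks, keep only chunks whose hashes changed, and fold the hashes into a Merkle-style root for cheap comparison between checkpoints.

```python
# Toy sketch: chunk-level change detection plus a Merkle-style root over chunk hashes.
import hashlib

CHUNK = 4096

def chunk_hashes(data: bytes):
    return [hashlib.sha256(data[i:i + CHUNK]).digest() for i in range(0, len(data), CHUNK)]

def merkle_root(hashes):
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:                      # duplicate the last node on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

old = bytes(1_000_000)
new = bytearray(old); new[123456] = 1           # a sparse update touches one chunk
changed = [i for i, (a, b) in enumerate(zip(chunk_hashes(old), chunk_hashes(bytes(new)))) if a != b]
print(len(changed), "chunk(s) to store; root", merkle_root(chunk_hashes(bytes(new))).hex()[:16])
```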
Birds of a Feather
TP
XO/EX
DescriptionThe time when the experimental results of a scientific paper consist of just a plot or some numbers on a table is ending. Computational experiments need to provide mechanisms so that other peers can re-execute and verify that the presented results are true, and thus, reproducible. This session will start by exploring past experiences of former Reproducibility Initiative committee members, to trigger discussion. We will then review ongoing workflow and performance reproducibility initiatives that try to tackle the problem through provenance recording. We expect attendees to share their own experiences in achieving reproducibility, and offer opinions on future directions.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionIt can be difficult for medical imaging researchers to use machine learning (ML) models developed by others because usage conventions vary greatly between developers. Additionally, many ML training pipelines do not support the DICOM standard for medical imagery. These factors can make it difficult for cancer researchers, other than the original developers, to use and extend open-source ML/deep learning models. We suggest these factors may even be slowing the adoption of AI in the medical research community. In this presentation, we will present our own deep learning results from a lung cancer radiology study and a rhabdomyosarcoma pathology study and discuss how we used the MHub.ai architecture for inferencing with both our own models and open-source models in these projects.
During this presentation, we will briefly describe the MHub.ai architecture and demonstrate its use through the analysis of a cohort of CT and PET imagery from lung cancer patients treated at Orlando Health using SBRT (Stereotactic Body RadioTherapy). We will also discuss our recent experience porting pathology segmentation models to MHub, including an NCI-developed rhabdomyosarcoma (RMS) whole slide segmentation model [4], and processing pathology imagery in the DICOM-WSI format. The focus of the talk will be on our lung cancer and RMS research results, how we used deep learning models on these projects, and how we have utilized and created MHub.ai models. Attendees will learn about our radiology and pathology imaging analysis workflows and how our work benefited from the use of MHub and will understand how MHub might, in turn, help their own research.
MHub.ai [1] is a project being developed by a consortium including the Harvard Artificial Intelligence in Medicine Program [2] and supported by the National Cancer Institute. The MHub.ai project has released open-source software, documentation, and processes to ease the packaging and deployment of previously published high-quality ML models. A set of curated segmentation and analysis models are already available in the MHub model gallery and a 3D Slicer extension allows execution of MHub models directly from inside 3D Slicer.
References
[1] MHub main website: https://mhub.ai/, visited Jul 31, 2024
[2] AIM Artificial Intelligence in Medicine Program, website: https://aim.hms.harvard.edu/, visited Jul 31, 2024
[3] 3D Slicer home page: https://www.slicer.org/, visited Jul 31, 2024
[4] David Milewski, Hyun Jung, G Thomas Brown, Yanling Liu, Ben Somerville, Curtis Lisle, et al., Predicting molecular subtype and survival of rhabdomyosarcoma patients using deep learning of H&E images: a report from the children's oncology group, Clinical Cancer Research, Vol 29, Issue 2, pgs. 364-378, https://doi.org/10.1158/1078-0432.CCR-22-1663
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
DescriptionThe increase in demand for compute performance at scale has never been greater, and the current silicon supply chain, which delivers monolithic and generic SoCs, can no longer provide the generation-over-generation increase in performance needed to satisfy this demand. Chiplets offer the promise of diversity to better match workloads to computational infrastructure, creating large-scale high-performance computers with pools of heterogeneous processors that can be dynamically composed to provide virtual compute nodes specialized for particular workloads. Unfortunately, chiplet technology today is mostly used in proprietary settings by a few large SoC suppliers, limiting the ability of the market to innovate. The Open Compute Project (OCP) Community believes that opening the silicon supply chain, enabling smaller companies to innovate in specialized silicon processing elements, is necessary to meet the future demands of high-performance computing. This talk will cover the ongoing work at the OCP Open Chiplet Economy Project focused on enabling a new silicon supply chain with an open chiplet marketplace, intended to foster the innovation and emerging market for specialized chiplet-based System in Package (SiP) SoCs, enabling composable systems.
Paper
Accelerators
Applications and Application Frameworks
Modeling and Simulation
Numerical Methods
Task Parallelism
TP
DescriptionHigh energy physics experiments produce petabytes of data annually that must be reduced to gain insight into the laws of nature. Early-stage reduction executes long-running, high-throughput workflows across thousands of nodes spanning multiple facilities to produce shared datasets. Later stages are typically written by individuals or small groups and must be refined and re-run many times for correctness. Reducing iteration times of later stages is key to accelerating discovery. We demonstrate our experience reshaping late-stage analysis applications on thousands of nodes. It is not enough merely to increase scale: it is necessary to make changes throughout the stack, including storage systems, data management, task scheduling, and application design. We demonstrate these changes when applied to two analysis applications built on open source data analysis frameworks (Coffea, Dask, TaskVine). We evaluate the performance of the applications on opportunistic campus clusters, showing effective scaling up to 7200 cores, thus producing significant speedup.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Abstract
Task Parallelism
W
DescriptionTraditional static resource allocation in supercomputers (jobs retain a fixed set of resources) leads to inefficiencies. Resource adaptivity (jobs can change resources at runtime) significantly increases supercomputer efficiency.
This work exploits Asynchronous Many-Task (AMT) programming, which is particularly well suited to adaptivity, thanks to its transparent resource management. The AMT runtime system dynamically assigns user-defined small tasks to workers to achieve load balancing and adapt to resource changes.
Contributions of this work include techniques for malleability and evolving capabilities, allowing programs to dynamically change resources without interrupting computation. Heuristics for automatic load detection determine when to start or terminate processes, which is particularly beneficial for unpredictable workloads. Practicality is demonstrated by adapting the GLB library. A generic communication interface enables interaction between programs and resource managers. Evaluations with a prototype resource manager show significant improvements in batch makespan, node utilization, and job turnaround time for both malleable and evolving jobs.
Paper
Data Movement and Memory
Performance Evaluation and/or Optimization Tools
Resource Management
State of the Practice
TP
DescriptionIn the field of computational science, effectively supporting researchers necessitates a deep understanding of how they utilize computational resources. Building upon a decade-old survey that explored the practices and challenges of research computation, this study aims to bridge the understanding gap between providers of computational resources and the researchers who rely on them. This study revisits key survey questions and gathers feedback on open-ended topics from over a hundred interviews. Quantitative analyses of present and past results illuminate the landscape of research computation. Qualitative analyses, including careful use of large language models, highlight trends and challenges with concrete evidence. Given the rapid evolution of computational science, this paper offers a toolkit with methodologies and insights to simplify future research and ensure ongoing examination of the landscape. This study, with its findings and toolkit, guides enhancements to computational systems, deepens understanding of user needs, and streamlines reassessment of the computational landscape.
Panel
Architecture
Hardware Technologies
TP
DescriptionRISC-V is an open standard Instruction Set Architecture (ISA) that enables the open development of CPUs and a shared common software ecosystem. With over 15 billion RISC-V cores deployed and adoption accelerating rapidly, we are seeing a revolution driven by open hardware. Nonetheless, for all the successes RISC-V has enjoyed, it has yet to become popular in HPC. Recent advances, however, such as the vectorisation standard and data-centre-class RISC-V hardware, make this technology a more realistic proposition for our workloads.
This panel brings together a range of experts in RISC-V with the HPC community to explore the advantages, challenges, and opportunities that the openness of RISC-V can bring to HPC, as well as opportunities for us to mould RISC-V to suit our needs. Led by the RISC-V HPC SIG, interested attendees can then join us and participate in one of the most exciting open-source technological activities of our time.
Students@SC
TP
W
TUT
XO/EX
DescriptionIn this workshop, participants will be provided an overview of the different types of elevator pitches. There will be tips on posture, presence, and perspective. This workshop will be coupled with the résumé workshop to allow participants a chance to hone their interviewing skills. Participants will be provided with a worksheet to sketch out their ideas and, of course, practice their pitch! Workshop attendees should plan to bring their résumés to this session as contributors will be providing feedback as time permits.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionGraph neural networks (GNNs) have demonstrated significant success in modeling graphs; however, they encounter challenges in efficiently scaling to large graphs. To address this, we propose the SanQus system, advancing our previous work, Sancus. SanQus reduces the need for expensive communication among distributed workers by utilizing Staleness and Quantization-Aware broadcasting. SanQus manages embedding staleness, skips unnecessary broadcasts, and treats decentralized GNN processing as sequential matrix operations. To further reduce communication, SanQus caches historical embeddings and performs quantization-aware broadcast. Theoretically, SanQus demonstrates bounded approximation errors and optimal convergence rates. Extensive experiments on big graphs with common GNN models show that SanQus reduces communication by up to 86% and triples throughput without sacrificing accuracy, outperforming state-of-the-art systems.
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionTransferring large amounts of data poses a significant challenge, often resulting in poor performance and slow speeds. With the upgrade of the core national South African National Research Network (SANReN) backbone capacity, 100 Gbps DTNs running the Globus data transfer software have been implemented in Cape Town and Johannesburg, South Africa. perfSONAR functionality, used for monitoring, troubleshooting, and diagnosing network issues, is implemented alongside the DTNs on a single hardware platform. Leveraging the infrastructure deployed as part of the AmLight Express and Protect (AmLight-Exp) collaboration using the South Atlantic Cable System (SACS) connecting Africa and the US, the SANReN 100 Gbps combined DTN/perfSONAR nodes were demonstrated at the SC23 conference in Denver, Colorado, in 2023. Several tests were subsequently conducted from December 2023 to July 2024, which are presented in this paper.
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
DescriptionThis work develops a distributed graph neural network (GNN) methodology for mesh-based modeling applications using a consistent message passing layer. As the name implies, the focus is on enabling scalable operations that satisfy physical consistency via halo nodes at sub-graph boundaries. Here, consistency refers to the fact that a GNN trained and evaluated on one rank (one large graph) is arithmetically equivalent to evaluations on multiple ranks (a partitioned graph). This concept is demonstrated by interfacing GNNs with NekRS, a GPU-capable exascale CFD solver developed at Argonne National Laboratory. It is shown how the NekRS mesh partitioning can be linked to the distributed GNN training and inference routines, resulting in a scalable mesh-based data-driven modeling workflow. We study the impact of consistency on the scalability of mesh-based GNNs, demonstrating efficient scaling in consistent GNNs for up to O(1B) graph nodes on Frontier.
Tutorial
Emerging Technologies
Scalable Data Mining
TUT
DescriptionThere are several popular Big Data processing frameworks, including Apache Spark and Dask. These frameworks are not capable of exploiting high-speed, low-latency networks like InfiniBand, Omni-Path, Slingshot, and others. In the High-Performance Computing (HPC) community, the Message-Passing Interface (MPI) libraries are widely adopted to tackle this issue by executing scientific and engineering applications on parallel hardware connected via fast interconnects.
This tutorial introduces MPI4Spark and MPI4Dask, enhanced Spark and Dask frameworks, respectively, that are capable of utilizing MPI for communication in parallel and distributed settings on HPC systems. MPI4Spark can launch the Spark ecosystem using MPI launchers to utilize MPI communication. It also maintains isolation for application execution by forking new processes using Dynamic Process Management (DPM). MPI4Spark also provides portability and performance benefits as it can utilize popular HPC interconnects. MPI4Dask is an MPI-based custom Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs.
This tutorial provides a detailed overview of the design, implementation, and evaluation of MPI4Spark and MPI4Dask on state-of-the-art HPC systems. Later, we also cover writing, running, and demonstrating user Big Data applications on HPC systems.
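For reference, the snippet below is a minimal, generic Dask program of the kind the tutorial targets; it uses the standard distributed scheduler, whereas MPI4Dask replaces the communication layer with MPI on HPC interconnects (the launch mechanics for MPI4Spark and MPI4Dask are covered in the tutorial itself and are not assumed here).

import dask.array as da
from dask.distributed import Client

client = Client()                       # local cluster stand-in for an HPC deployment
x = da.random.random((20000, 20000), chunks=(2000, 2000))
result = (x @ x.T).mean().compute()     # distributed matrix product and reduction
print(result)
client.close()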
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
DescriptionLarge-scale scientific simulations present significant challenges in data processing efficiency. This paper addresses the critical issue of I/O and data processing performance bottlenecks within the domain of extreme-scale Smoothed-Particle Hydrodynamics (SPH) and gravity simulations. We present a novel I/O software architecture implemented in the scalable SPH-EXA framework, incorporating a variety of in-situ and post-hoc data analysis pipelines and facilitating rapid analysis and visualization of extreme-scale physical datasets. The performance of our I/O architecture is evaluated through comprehensive benchmarking across a wide range of data scales, conducted on the Piz Daint supercomputer.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionTo deliver advanced services with high performance and flexibility, we are developing a computing platform that integrates various hardware accelerators (HWAs). Our goal is to build customized services for each user by combining application functions. To achieve this, we propose Hardware Function Chaining (HFC) technology that enables sharing a stateful function among multiple users and low-latency data transfer between HWAs. HFC uses a chain control circuit that allows the HWA to autonomously manage the destinations of multiple data flows (function chains). This method avoids CPU bottlenecks. We compared our HFC-based system with a look-aside configuration, where chain control is handled by the CPU, and evaluated the performance involving up to eight NOP functions in scenarios with multiple different function chains. The results showed that our approach reduces latency to up to 1/13 that of the look-aside configuration and maintains a stable latency and throughput as the system scales.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionMotifs, small subgraphs of k vertices such as triangles and cliques, are well studied for static networks. They are used to characterize different biological networks and to align networks. Counting motifs reveals insights into the topological structure of a network, such as an MPI event graph. However, for large networks and motifs, computing these frequencies is computationally expensive. Recent advances in counting motifs of size k or less show promise but have difficulty scaling. Moreover, methods for counting the frequencies of all motifs of size k or less on dynamic networks are still lacking.
We present a scalable method to compute the global edge-based frequencies of motifs of size 4 vertices or less on a fully dynamic network. Instead of recomputing the counts from scratch, we update the frequencies based on only the changed edges. Our results show that our method is scalable and outperforms the benchmark results by 10 times.
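As a minimal sketch of the edge-based incremental idea, the code below maintains a global count of the simplest motif (triangles) under edge insertions and deletions, updating only around the changed edge; the class and method names are illustrative and this is not the authors' implementation.

# Incrementally maintain a global triangle count on a fully dynamic graph
# by updating only around the changed edge (illustrative sketch).
class DynamicTriangleCounter:
    def __init__(self):
        self.adj = {}        # vertex -> set of neighbors
        self.triangles = 0   # global triangle count

    def _neighbors(self, v):
        return self.adj.setdefault(v, set())

    def add_edge(self, u, v):
        if v in self._neighbors(u):
            return  # edge already present
        # Every common neighbor of u and v closes one new triangle.
        self.triangles += len(self._neighbors(u) & self._neighbors(v))
        self.adj[u].add(v)
        self.adj[v].add(u)

    def remove_edge(self, u, v):
        if v not in self._neighbors(u):
            return
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        # Triangles that used edge (u, v) disappear with it.
        self.triangles -= len(self._neighbors(u) & self._neighbors(v))

counter = DynamicTriangleCounter()
for edge in [(1, 2), (2, 3), (1, 3), (3, 4)]:
    counter.add_edge(*edge)
print(counter.triangles)  # 1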
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
DescriptionEarth observation and earth system models are sources of vast, multi-modal datasets that are invaluable for advancing climate and environmental research. However, their scale and complexity pose challenges for processing and analysis. In this paper we discuss our experiences in developing a scientific research application using an automated multi-facility workflow that orchestrates data collection, preprocessing, artificial intelligence (AI) inferencing, and data movement across diverse computational resources, leveraging the Advanced Computing Ecosystem Testbed at the Oak Ridge Leadership Computing Facility (OLCF). We demonstrate that our AI application workflow can be seamlessly integrated and orchestrated across research facilities to extract new scientific insights from climate datasets using data intensive computational methods. The results indicate that the multi-facility workflow reduces processing time, enhances scalability, and maintains high efficiency across varying workloads. Our workflow processes 12,000 high-resolution satellite images in 44 seconds using 80 workers distributed across 10 nodes on the OLCF systems.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThe Scalable Performance and Accuracy analysis for Distributed and Extreme-scale systems (SPADE) project focuses on advancing monitoring, optimization, evaluation, and decision-making capabilities for extreme-scale systems. This poster presents efforts targeting several advanced monitoring capabilities, including developing support for AMD's new RocProfiler SDK to enable the analysis of hardware performance counters on AMD APUs, such as the MI300, which will be integrated into El Capitan. Another effort involves extending the PAPI library for heterogeneous CPU support, allowing users to simultaneously monitor the performance of chips that support both high-end and low-end processors, enabling more effective tuning between various cores. Additionally, the project includes the development of a Python version of PAPI (cyPAPI), specifically for use with frameworks and tools being developed for Python in HPC environments. This effort extends to exploring beta versions of cyPAPI with PyTorch to advance decision-making capabilities for mixed-precision tuning of machine learning applications.
Doctoral Showcase
Posters
TP
DescriptionEdge accelerators, such as NVIDIA Jetson, are enabling rapid inference of deep neural network (DNN) models and computer vision algorithms through low-end graphics processing unit (GPU) modules integrated with ARM-based processors. Their compact form factor allows integration with mobile platforms, such as unmanned aerial vehicles (UAVs) with onboard cameras, facilitating real-time execution of diverse scientific workflows, from wildfire monitoring to disaster management. The limited compute resources of mobile edge accelerators necessitate collaboration with remote servers in the cloud for processing compute-intensive workloads. These remote servers can include high-performance computers, serverless cloud platforms offering Functions-as-a-Service (FaaS), or private GPU servers.
In my PhD dissertation, the work proposes and implements a scalable platform designed to support multiple mobile devices (UAVs) with edge accelerators, collaborating with remote servers to provide real-time performance for a wide range of spatio-temporal autonomous applications. The platform incorporates deadline-driven scheduling heuristics, strategies for preemptively dropping tasks based on their earliest deadlines, migration of tasks from edge to cloud, work stealing from cloud back to edge, and adaptation to network variability, all while ensuring quality of service (QoS). Outputs from the servers can be used by other mobile devices or the planning platform itself to orchestrate the next set of tasks in the workflow. Evaluations against baseline algorithms and multiple workloads demonstrate that the proposed heuristics achieve an optimal balance between task completion and accrued utility.
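A simplified sketch of the deadline-driven scheduling heuristic described above, assuming tasks are characterized only by a deadline and an estimated duration: tasks run earliest-deadline-first, and a task that can no longer meet its deadline is dropped preemptively rather than wasting edge compute time.

# Earliest-deadline-first execution with preemptive dropping (illustrative).
import heapq

def edf_schedule(tasks, now=0.0):
    """tasks: list of (deadline, duration, name). Returns (completed, dropped)."""
    heap = list(tasks)
    heapq.heapify(heap)                    # earliest deadline first
    completed, dropped, clock = [], [], now
    while heap:
        deadline, duration, name = heapq.heappop(heap)
        if clock + duration > deadline:
            dropped.append(name)           # cannot finish in time: drop early
            continue
        clock += duration
        completed.append(name)
    return completed, dropped

tasks = [(5.0, 2.0, "detect"), (4.0, 3.0, "track"), (8.0, 4.0, "map")]
print(edf_schedule(tasks))                 # (['track', 'detect'], ['map'])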
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionThe growing demand for large-scale AI applications brings performance challenges to parallel file systems. The performance of parallel file systems depends on both hardware components and software architecture. In parallel file systems, the performance of remote function execution is critical because most operations on a parallel file system require network communication and remote I/O. This work-in-progress paper introduces a new RPC layer that employs an architecture optimized for many-core CPUs and high-speed network devices. The RPC layer adopts a scalable task-stealing model that offers fairness in task execution and leverages the parallel performance of many-core CPUs. Our preliminary implementation indicates that the RPC layer can process more than four million RPC operations per second on a single server. This paper introduces the design of the RPC layer and several performance evaluation results.
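The following is a schematic, thread-based illustration of the task-stealing execution model mentioned above (each worker owns its own task deque and steals from a random victim when idle); it is not the paper's RPC-layer implementation, which targets many-core CPUs and high-speed networks.

# Task-stealing worker pool: own queue first, then steal from a peer's tail.
import random
import threading
from collections import deque

class StealingPool:
    def __init__(self, num_workers):
        self.queues = [deque() for _ in range(num_workers)]
        self.locks = [threading.Lock() for _ in range(num_workers)]
        self.num_workers = num_workers

    def submit(self, worker_id, task):
        with self.locks[worker_id]:
            self.queues[worker_id].append(task)

    def _get_task(self, worker_id):
        with self.locks[worker_id]:
            if self.queues[worker_id]:
                return self.queues[worker_id].popleft()    # own work first
        victim = random.randrange(self.num_workers)        # then try to steal
        with self.locks[victim]:
            if self.queues[victim]:
                return self.queues[victim].pop()           # steal from the tail
        return None

    def run_worker(self, worker_id):
        while True:
            task = self._get_task(worker_id)
            if task is None:
                break                                      # nothing local, nothing stolen
            task()

pool = StealingPool(num_workers=4)
for i in range(20):
    pool.submit(0, lambda i=i: print(f"task {i}"))
threads = [threading.Thread(target=pool.run_worker, args=(w,)) for w in range(4)]
for t in threads: t.start()
for t in threads: t.join()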
Paper
Accelerators
Applications and Application Frameworks
Graph Algorithms
Modeling and Simulation
Numerical Methods
TP
DescriptionPhysical phenomena such as chemical reactions, bond-breaking, and phase transition require molecular dynamics (MD) simulation with ab initio accuracy ranging from milliseconds to microseconds. However, previous state-of-the-art neural network-based MD packages such as DeePMD-kit can only reach 4.7 nanoseconds per day on the Fugaku supercomputer. In this paper, we present a novel node-based parallelization scheme to reduce communication by 81%, then optimize the computationally intensive kernels with sve-gemm and mixed precision. Finally, we implement intra-node load balance to further improve the overall performance. Numerical results on the Fugaku supercomputer show that our work has significantly improved the time-to-solution of the DeePMD-kit by a factor of 31.7x, reaching 149 nanoseconds per day on 12,000 computing nodes. This work has opened the door for millisecond simulation with ab initio accuracy within one week for the first time.
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
DescriptionTraining GNNs on billion-edge graphs faces significant memory and data transfer bottlenecks, especially with GPU-based sampling. Traditional methods struggle with CPU-GPU data transfer bottlenecks or high data shuffling and synchronization overheads in multi-GPU setups.
To overcome these challenges in GNN training on large-scale graphs, we introduce HyDRA, a pioneering framework that elevates mini-batch, sampling-based training. HyDRA transforms cross-GPU sampling by seamlessly integrating sampling and data transfer into a single kernel. It employs a hybrid pointer-driven technique for efficient neighbor retrieval, utilizes targeted replication for high-degree vertices to cut communication overhead, and adopts dynamic cross-batch data orchestration with pipelining to decrease redundant transfers. Tested on systems with up to 64 A100 GPUs, HyDRA significantly surpasses existing methods, offering 1.4x to 5.3x quicker training speeds than DSP and DGL-UVA, and showing up to a 42x boost in multi-GPU scalability. This establishes HyDRA as a benchmark for high-performance, large-scale GNN training.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThis workshop session will focus on scientific applications of large-scale AI across multiple disciplines.
Birds of a Feather
TP
XO/EX
DescriptionSoftware has become central to all aspects of modern science and technology. Especially in high-performance computing (HPC) and computational science and engineering (CSE), it is becoming ever larger and more complex while computer platforms evolve and become more diverse. Simultaneously, the teams behind the software are becoming larger, more technically diverse, and more geographically distributed.
This BoF provides an opportunity for people concerned about these topics to share existing experiences and activities, discuss how we can improve on them, and share the results. Presentations and discussion notes will be made available at the BoF series website, http://bit.ly/swe-cse-bof.
Tutorial
Applications and Application Frameworks
Emerging Technologies
TUT
DescriptionMemory-to-memory data streaming between scientific instruments and remote high-performance computing (HPC) nodes has emerged as a key requirement to enable online processing of high-volume and high-velocity data for feature detection, experiment steering, and other purposes. In contrast to file transfer between scientific facilities, for which a well-defined architecture exists in the form of the science DMZ, data transfer nodes (DTNs), and the associated tools, there is no well-defined infrastructure to enable efficient and secure memory-to-memory data streaming between scientific instruments and HPC nodes. This is especially important as both scientific instruments and HPC nodes lack direct external network connectivity. SciStream establishes a well-defined architecture and control protocols, with an open-source implementation, to enable distributed scientific workflows to use their choice of data streaming tools to move data from scientific instruments' memory to HPC nodes' memory. In this tutorial, we will start by motivating the need for SciStream; describe the architecture and protocols it uses to establish authenticated and transparent connections between producers and consumers; and discuss design considerations, our implementation approach, and evaluation results. We will show a live demo of SciStream followed by hands-on exercises. We will also discuss our experience integrating and running real-world scientific applications with SciStream.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThis work presents SciTrust, a comprehensive framework for assessing the trustworthiness of large language models (LLMs) in scientific contexts, with a focus on truthfulness, accuracy, hallucination, and sycophancy. The framework introduces four novel open-ended benchmarks in Computer Science, Chemistry, Biology, and Physics, and employs a multi-faceted evaluation approach combining traditional metrics with LLM-based evaluation. SciTrust was applied to five LLMs, including one general-purpose and four scientific models, revealing nuanced strengths and weaknesses across different models and benchmarks. The study also evaluated SciTrust's performance and scalability on high-performance computing systems. Results showed varying performance across models, with Llama3-70B-Instruct performing strongly overall, while Galactica-120B and SciGLM-6B excelled among scientific models. SciTrust aims to advance the development of trustworthy AI in scientific applications and establish a foundation for future research on model robustness, safety, and ethics in scientific contexts. We have open-sourced our framework, including all associated scripts and datasets, at https://github.com/herronej/SciTrust.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionWith the surge in symbolic data across various fields, efficient pattern matching and regular expression processing have become crucial. Non-deterministic Finite Automata (NFA), commonly used for pattern matching, face memory bottlenecks on general-purpose platforms. This has driven interest in Domain-Specific Architectures (DSAs), like FPGA and ASICs, for their efficiency. Modern applications require identifying the best match path, such as in DNA sequence alignment. This work aims to enhance FPGA-based automata processors to report the best sequence alignment score by integrating weights into automata transitions. Challenges include increased state-space complexity and memory requirements. The proposed NAPOLY+ design incorporates score values and arithmetic components to manage scores, balancing performance and resource use. Evaluation on the Zynq Ultrascale+ FPGA showed high device utilization and scalability, with preliminary results focusing on end-to-end design evaluation.
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
DescriptionCheckpoint/Restart (C/R) saves the running state of the programs periodically, which consumes considerable time and system resources. We observe that not every piece of data is involved in the computation in typical HPC applications; such unused data should be excluded from checkpointing for better storage and compute efficiency. We propose a systematic approach that leverages automatic differentiation (AD) to scrutinize every element within variables (e.g., arrays) necessary for checkpointing. This allows us to identify critical and uncritical elements and eliminate uncritical elements from checkpointing. Specifically, we inspect every single element within a variable necessary for checkpointing with an AD tool to determine whether the element has an impact on the application output or not. We validate our approach with all benchmarks from the NPB suite. We visualize the distribution of critical and uncritical elements within a variable with respect to its binary impact (yes or no) on the application output.
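A minimal sketch of the underlying idea, with a simple perturbation probe standing in for the automatic-differentiation tool and a made-up toy application: an array element is marked critical, and therefore checkpointed, only if changing it affects the application output.

# Identify which array elements actually influence the output, and
# checkpoint only those (illustrative stand-in for the AD-based analysis).
import numpy as np

def application(state):
    # Toy application: only the first half of the state affects the output.
    return np.sum(state[: state.size // 2] ** 2)

def critical_mask(state, eps=1.0):
    base = application(state)
    mask = np.zeros(state.shape, dtype=bool)
    for idx in np.ndindex(state.shape):
        probe = state.copy()
        probe[idx] += eps
        mask[idx] = application(probe) != base   # element influences the output
    return mask

state = np.arange(8, dtype=float)
mask = critical_mask(state)
checkpoint = state[mask]   # only critical elements need to be written out
print(mask)                # first half True, second half False
print(checkpoint)          # [0. 1. 2. 3.]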
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionPython is a powerful and user-friendly programming language widely adopted by researchers and scientists. As data sizes and computational complexities grow, CPU-based Python struggles to meet the speed and scale demanded by cutting-edge research. Distributed accelerated computing offers an infrastructure to efficiently solve and test hypotheses in data-driven problems, whether that means analyzing data generated by recording the scattering of high-energy electron beams, building new methodology to solve complex CFD problems, or building machine learning (ML) models. Researchers are increasingly seeking ways to effortlessly scale their programs. Our upcoming demonstration will provide a comprehensive walkthrough of how to use cuNumeric and Legate to seamlessly scale your Python programs from a single CPU core to multi-GPU, multi-node supercomputers without any modifications to your code.
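A minimal illustration of that "no code changes" workflow, assuming cuNumeric is installed and the operations used are among those it implements: an ordinary NumPy program becomes a distributed, GPU-accelerated Legate program by swapping the import and launching with the Legate driver (exact launcher flags vary by installation, e.g., something like "legate --nodes 4 --gpus 8 script.py").

import cunumeric as np   # drop-in replacement for "import numpy as np"

a = np.linspace(0.0, 1.0, 4096 * 4096).reshape(4096, 4096)
b = np.ones((4096, 4096))
c = a @ b                # executed by Legate across available CPUs/GPUs/nodes
print(float(c.sum()))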
Acknowledgment and potential co-presenter:
Jason R. Green, Professor, Department of Chemistry, Department of Physics, University of Massachusetts, Boston
Pat McCormick, Senior Computer Scientist, Team Leader, LANL
Tutorial
Security
TUT
DescriptionSecuring your network is not enough! Every service that you deploy is a window into your data center from the outside world, and a window that could be exploited by an attacker. Our goal is to increase the number of people in the workforce who can act as defenders of our HPC and data infrastructure. In this tutorial we cover weaknesses from the most recent "Stubborn Weaknesses in the CWE Top 25" list from MITRE. These weaknesses are the ones most present in real-world security exploits, and also the ones that have consistently stayed in the top 25 for at least five years. Attendees will learn how to recognize these weaknesses and code in a way that avoids them. Another issue affecting the security of our cyber-infrastructure is that its software depends upon a myriad of packages and libraries, and those come from different sources. Dependency analysis tools can catch flaws in those packages and libraries, and that affects the safety of the application. The more programmers are exposed to training in addressing security issues and the more they learn how to use dependency analysis tools, the bigger the impact that we can make on the security of our cyber-infrastructure.
For the hands-on exercises, we will be using two web applications that we recommend downloading in advance.
For Windows machines:
In VMware run this virtual machine image: https://research.cs.wisc.edu/mist/SoftwareSecurityCourse/Exercises/software-security-web.ova
For Mac OS machines:
0) Prerequisites: JDK, mongodb, Postman
1) Create a directory for the exercises.
2) Download to that directory the three tar files located at https://research.cs.wisc.edu/mist/SoftwareSecurityCourse/Exercises/tar_files/
3) For each of the three tar files run "tar xvf file-name.tar"
If you have any questions or issues, please contact elisa@cs.wisc.edu
Workshop
Architecture
Data-Intensive
Network
Performance Optimization
System Administration
W
DescriptionAccording to the European Union Aviation Safety Agency (EASA), AI-based algorithms, combined with extensive fleet data, could enable early detection of potential engine failures, leading to proactive predictive maintenance in air travel. At a global level, the Independent Data Consortium for Aviation (IDCA) recognizes the potential of collaborative data sharing in the airline industry. However, data ownership-related issues, such as privacy, intellectual property, and regulatory compliance, pose significant obstacles to realizing the vision of combining fleet data to improve predictive maintenance algorithms. In this paper, we use NASA’s Turbofan Jet Engine Dataset (N-CMAPSS) to demonstrate how airlines could leverage the power of Federated Learning (FL) and microservices, to collaboratively train a global Machine-Learning (ML) model that can enable airline companies to utilize their data for predictive maintenance, while maintaining control.
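A hedged sketch of the federated-averaging step such an approach builds on: each airline trains on its own fleet data locally, and only model weights leave the premises, aggregated in proportion to local sample counts. The two-layer toy model and sizes below are made up.

# Federated averaging (FedAvg): aggregate client weights without moving raw data.
import numpy as np

def federated_average(client_weights, client_sizes):
    total = sum(client_sizes)
    layers = len(client_weights[0])
    return [
        sum(w[l] * (n / total) for w, n in zip(client_weights, client_sizes))
        for l in range(layers)
    ]

# Three "airlines", each with a two-layer toy model and different data sizes.
clients = [[np.ones((2, 2)) * i, np.ones(2) * i] for i in (1.0, 2.0, 3.0)]
sizes = [100, 300, 600]
global_model = federated_average(clients, sizes)
print(global_model[0])   # weighted toward the clients with more data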
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionIn this talk, we will provide a brief overview of the National Science Foundation, followed by a more in-depth discussion of selected programs of interest to researchers at the intersections of security, privacy, performance, and high-end computing.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionPreempting attacks targeting supercomputing systems before damage occurs remains the top security priority. The main challenge is that noisy attack attempts and unreliable alerts often mask real attacks. This paper describes a security testbed embedded in the live traffic of a supercomputer at the National Center for Supercomputing Applications (NCSA). The objective is to demonstrate attack preemption, i.e., stopping system compromise and data breaches at petascale supercomputers. Deployment of our testbed at NCSA enables the following key contributions: 1) insights from characterizing unique attack patterns found in real security logs of more than 200 security incidents curated in the past two decades at NCSA; 2) deployment of an attack visualization tool to illustrate the challenges of identifying real attacks in HPC environments and to support security operators in interactive attack analyses; and 3) demonstration of the testbed's utility by running novel models, such as factor-graph-based models, to preempt a real-world ransomware family.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionModern batch schedulers in HPC environments enable the shared use of available computational resources via provisioning discrete sets of resources matching user requirements. The lack of elasticity in such scenarios is often addressed using a Pilot job model where multiple separate requests are pooled. In this work, we explore computational elasticity in a popular Python-based workflow system: Parsl. We identify limitations in existing scaling logic and propose a new resource-aware scheduler. We show a significant improvement in the efficiency of compute resources consumed with minimal loss in time to solution.
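For context, the snippet below shows the standard Parsl elasticity knobs that any such scaling logic works against: a HighThroughputExecutor backed by a batch provider whose min/max block counts bound how far the scheduler may grow or shrink the pilot-job pool. The partition name and limits are illustrative, and the proposed resource-aware scheduler itself is not reproduced here.

# Standard Parsl configuration with elastic scaling bounds (illustrative values).
import parsl
from parsl import python_app
from parsl.config import Config
from parsl.executors import HighThroughputExecutor
from parsl.providers import SlurmProvider

config = Config(executors=[
    HighThroughputExecutor(
        label="htex",
        provider=SlurmProvider(
            partition="compute",     # hypothetical partition name
            nodes_per_block=1,
            init_blocks=1,
            min_blocks=0,            # scale down to zero when idle
            max_blocks=8,            # upper bound for the scaling logic
            walltime="01:00:00",
        ),
    )
])
parsl.load(config)

@python_app
def simulate(x):
    return x * x

print(sum(f.result() for f in [simulate(i) for i in range(64)]))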
Workshop
Algorithms
Heterogeneous Computing
W
DescriptionWe study the performance behavior of the sparse matrix-vector product operation in distributed computing environments, in the case of very large non-diagonal matrices where the nonzero elements are placed irregularly across the matrix. In particular, we focus on the distributed storage of the result vector in cases where it becomes too large to be stored fully on each process, and its redistribution between the iterations of a sequence of SpMV operations. We propose two methods to this effect; one aims at minimizing the memory requirements of storing the result vector, the other optimizes the communications required for the redistribution. We perform large-scale experiments on the Fugaku supercomputer, in order to show the importance of taking into account the network topology to correctly identify and address communication bottlenecks. The results show that the most efficient proposed method has a runtime several times faster than a non-optimal one on this topology.
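As a baseline illustration of the setting being optimized, the mpi4py sketch below performs a row-partitioned SpMV sequence in which the result vector is naively gathered onto every process before the next iteration; this full replication is exactly the memory and communication cost the two proposed methods avoid. Run with, e.g., "mpiexec -n 4 python spmv.py".

# Row-partitioned SpMV with naive full redistribution of the result vector.
import numpy as np
from mpi4py import MPI
from scipy.sparse import random as sparse_random

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

n = 8 * size                       # global dimension, divisible by the rank count
rows = n // size                   # rows owned by this rank
A_local = sparse_random(rows, n, density=0.05, format="csr", random_state=rank)
x = np.ones(n)                     # full input vector (replicated here)

for _ in range(3):                 # a short sequence of SpMV iterations
    y_local = A_local @ x          # local slice of the result
    x = np.empty(n)
    comm.Allgather(y_local, x)     # naive redistribution: every rank stores all of y
print(rank, x[:4])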
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionContainers have become an important component for scientific workflows, enhancing reproducibility, portability, and isolation when coupled with workflow management systems. However, integrating containers with these systems can be complex, potentially hindering wider adoption. Serverless platforms offer a solution by providing a layer of abstraction over container orchestrators, simplifying management while introducing event-driven capabilities.
This paper presents a novel integration of serverless with workflow management systems to optimize scientific workflow execution. Our approach leverages serverless functions to dynamically provision containers for workflow tasks, resulting in up to 30% faster execution. We found that performance can be further improved by reusing containers between multiple different tasks that were provisioned by the serverless platform. These findings demonstrate the utility of combining specialized container orchestration with established workflow management to streamline scientific computing, improve resource utilization, and accelerate time-to-results. Serverless' event-driven architecture enables efficient resource scaling, aligning with the dynamic nature of scientific workloads.
Panel
Cloud Computing
Serverless
TP
DescriptionWith the ongoing convergence of high-performance computing and cloud, HPC gains a chance to transform and improve its runtimes and programming models. HPC systems can increase their efficiency and accessibility by adapting elastic cloud paradigms, with the prime example being serverless computing. Serverless abstracts away resource management and introduces fine-grained allocations, allowing system operators to improve their efficiency with elastic containers (CaaS), functions (FaaS), and acceleration (XaaS). However, adopting serverless technologies brings challenges that have not been treated adequately in HPC: security of multi-tenant deployments, portability, and performance isolation in shared resources.
In this interactive panel, experts from academia and national labs will debate how serverless can support the rigorous demands of HPC applications. They will share experiences of introducing elastic programming models into the rigid world of high-performance systems and outline predictions for the future: Will serverless schedulers become first-class citizens on HPC systems?
Exhibits
Flash Session
TP
XO/EX
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionMessage aggregation is widely used to reduce communication cost in HPC applications. The orders-of-magnitude gap between the overhead of sending a message and the cost per byte transferred motivates message aggregation for irregular, fine-grained messaging applications such as graph algorithms and parallel discrete event simulation (PDES). While the benefit of message aggregation is often analyzed in terms of reducing overhead, specifically the per-message cost, we also analyze different schemes that can help reduce message latency, i.e., the time from when a message is sent to the time when it is received. Message latency can affect applications such as PDES with speculative execution, where reducing message latency could result in fewer rollbacks. Specifically, in our work we demonstrate the effectiveness of node-aware message aggregation schemes for a range of proxy applications with respect to messaging overhead and latency.
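A schematic of per-destination message aggregation and the overhead/latency trade-off it creates: small messages are buffered by destination and flushed either when a buffer fills (amortizing per-message overhead) or when a timeout expires (bounding added latency). The send callback and thresholds are placeholders, not any particular runtime's API.

# Buffer small messages per destination; flush on size or age thresholds.
import time
from collections import defaultdict

class Aggregator:
    def __init__(self, send, max_msgs=64, max_delay=0.001):
        self.send = send                    # callable(dest, list_of_msgs)
        self.max_msgs = max_msgs
        self.max_delay = max_delay
        self.buffers = defaultdict(list)
        self.first_enqueue = {}

    def put(self, dest, msg):
        buf = self.buffers[dest]
        if not buf:
            self.first_enqueue[dest] = time.monotonic()
        buf.append(msg)
        if len(buf) >= self.max_msgs:
            self.flush(dest)                # amortize per-message overhead

    def poll(self):
        # Called from the progress loop: flush buffers that waited too long.
        now = time.monotonic()
        for dest in list(self.buffers):
            if self.buffers[dest] and now - self.first_enqueue[dest] >= self.max_delay:
                self.flush(dest)            # bound the added latency

    def flush(self, dest):
        if self.buffers[dest]:
            self.send(dest, self.buffers[dest])
            self.buffers[dest] = []

agg = Aggregator(send=lambda dest, msgs: print(f"to node {dest}: {len(msgs)} msgs"))
for i in range(200):
    agg.put(dest=i % 3, msg=i)
for dest in list(agg.buffers):
    agg.flush(dest)                         # drain remaining partial buffers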
Workshop
Shepherd: Seamless Integration of Service Workflows into Task-Based Workflows through Log Monitoring
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionTraditional workflow managers focus on coordinating discrete tasks: actions that run to completion. However, emerging workflows require persistent services that must be managed alongside traditional tasks. We introduce Shepherd, a local workflow manager that runs services as a task, enabling them to be seamlessly integrated into larger distributed workflows. By inferring service states through log outputs and file creations, Shepherd enables the coordinated startup and shutdown of dependent services without modifying their original code. We demonstrate Shepherd's effectiveness in large-scale drone simulations, where it enhances workflow flexibility, reliability, and comprehensive logging and visualization.
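A minimal sketch of the log-inference idea, with a hypothetical service command and readiness marker (this is not Shepherd's actual interface): the service is treated as ready once a known line appears in its log, after which dependent tasks start, and shutdown is coordinated when they finish.

# Infer a service's "ready" state from its log, then run a dependent task.
import subprocess
import time

def start_service_and_wait(cmd, logfile, ready_marker, timeout=60.0):
    proc = subprocess.Popen(cmd, stdout=open(logfile, "w"), stderr=subprocess.STDOUT)
    deadline = time.time() + timeout
    while time.time() < deadline:
        with open(logfile) as f:
            if ready_marker in f.read():
                return proc                      # service reached its ready state
        time.sleep(0.5)
    proc.terminate()
    raise RuntimeError("service did not become ready in time")

# Hypothetical example: a database service whose readiness line is known.
db = start_service_and_wait(["./run_database.sh"], "db.log", "accepting connections")
try:
    subprocess.run(["python", "simulation_task.py"], check=True)   # dependent task
finally:
    db.terminate()                                                 # coordinated shutdown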
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionThis work proposes a compression-enabled roofline model that uses data compression techniques to balance and transform between computational and memory demands. This model enables applications to adjust in response to the specific strengths and limitations of the underlying hardware and system to optimize resource utilization. The effectiveness of this approach is demonstrated with matrix multiplication kernels on different input sizes, with various compression techniques turned on or off, including 1) low-precision floating point; 2) sparse matrix formulation; and 3) compressed arrays with ZFP. By reducing memory transfer volumes and cache misses and increasing data locality and computational intensity through compression, the resulting roofline model can transform between compute and memory bounds to align more efficiently with system capabilities. This advancement not only improves overall performance but also maximizes adaptability in diverse computing environments.
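A small worked example of how compression moves a kernel on the roofline: reducing the bytes that actually cross memory raises effective arithmetic intensity, so a memory-bound kernel can shift toward the compute bound. The peak numbers and kernel figures below are illustrative only.

# Attainable performance = min(peak compute, effective intensity * bandwidth).
PEAK_FLOPS = 10e12        # 10 TFLOP/s, hypothetical
PEAK_BW = 200e9           # 200 GB/s memory bandwidth, hypothetical

def attainable(flops, bytes_moved, compression_ratio=1.0):
    effective_bytes = bytes_moved / compression_ratio     # fewer bytes cross memory
    intensity = flops / effective_bytes                   # FLOP per byte actually moved
    return min(PEAK_FLOPS, intensity * PEAK_BW), intensity

# A kernel doing 1 FLOP per 8-byte word is memory-bound...
perf, ai = attainable(flops=1e9, bytes_moved=8e9)
print(f"no compression: AI={ai:.3f} flop/byte, {perf/1e9:.0f} GFLOP/s")   # 0.125, 25

# ...with 4x compression (e.g., ZFP or lower precision) it moves up the roof.
perf, ai = attainable(flops=1e9, bytes_moved=8e9, compression_ratio=4.0)
print(f"4x compression: AI={ai:.3f} flop/byte, {perf/1e9:.0f} GFLOP/s")   # 0.5, 100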
Birds of a Feather
TP
XO/EX
DescriptionThe annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. All of the elected officers and many other volunteers will be present to answer your questions about SIGHPC. Representatives from our chapters will also be available. We will also be discussing upcoming plans for the year.
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionHigh-performance computing (HPC) applications, such as Nyx, QMCPACK, and Montage, depend on parallel file systems (PFS) like Lustre, BeeGFS, and PVFS for reliable and efficient data management and access. However, a PFS can fail due to hardware faults, software bugs, or power outages. These failures are generally categorized as fail-stop failures, which render the PFS unmountable or inaccessible, and partial failures, which compromise specific PFS components, allowing the system to remain functional but potentially causing unnoticed damage or silent errors. Many studies have analyzed data corruption caused by both fail-stop behaviors and partial failures; however, they overlook potentially more complicated corruption in special data areas, particularly the metadata area of parallel file systems, which is the focus of this study.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionAzure Cloud offers a wide range of resources for running HPC workloads, requiring users to configure their deployment by selecting VM types, number of VMs, and processes per VM. Suboptimal decisions may lead to longer execution times or additional costs for the user. We are developing an open-source tool to assist users in making these decisions by considering application input parameters, as they influence resource consumption. The tool automates the time-consuming process of setting up the cloud environment, executing the benchmarking runs, handling output, and providing users with resource selection recommendations as high-level insights on run times and costs across different VM types and number of VMs. In this work, we present initial results and insights on reducing the number of cloud executions needed to provide such guidance, leveraging data analytics and optimization techniques with two well-known HPC applications: OpenFOAM and LAMMPS.
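A toy illustration of the recommendation step, under the simplifying assumptions that run time scales with VM count times a parallel-efficiency factor and that cost is price per VM-hour times VM-hours; the prices, efficiency curve, and deadline are made up.

# Pick the cheapest VM count that still meets a user deadline.
def recommend(runtime_1vm_hours, price_per_vm_hour, efficiency, deadline_hours):
    candidates = []
    for n_vms in (1, 2, 4, 8, 16):
        runtime = runtime_1vm_hours / (n_vms * efficiency(n_vms))
        cost = runtime * n_vms * price_per_vm_hour
        if runtime <= deadline_hours:
            candidates.append((cost, runtime, n_vms))
    return min(candidates) if candidates else None

# Hypothetical OpenFOAM case: 10 h on one VM, efficiency dropping with scale.
eff = lambda n: 1.0 - 0.0125 * (n - 1)
best = recommend(runtime_1vm_hours=10, price_per_vm_hour=3.5,
                 efficiency=eff, deadline_hours=4)
print(best)   # (cost, runtime, n_vms) of the cheapest configuration within the deadline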
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionSimPoint has had a wide-ranging impact on computer architecture research by automatically finding a small set of simulation points to represent the complete execution of a program for efficient and accurate simulations. While many studies have used SimPoint as part of their methodology, there has been little consideration of whether the set of simulation points that SimPoint provides is as small as possible. We propose SimPoint++, which replaces the BIC method by combining WCSS and silhouette scores to find the optimal cluster number. The new Python framework of SimPoint++ also provides a dimension-reduction pipeline for effective clustering and supports multi-threaded application analysis.
We evaluate SimPoint++ with Spec CPU 2017 benchmarks. SimPoint++ achieves comparable or higher accuracy with significantly fewer simulation points, resulting in a 5x speed-up in simulation time compared to state-of-the-art solutions.
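A hedged, generic scikit-learn sketch of the cluster-selection idea (not SimPoint++ itself): sweep the candidate cluster count, record the WCSS (KMeans inertia) and silhouette score for each, and keep the best-scoring k; the random vectors stand in for dimension-reduced basic-block vectors.

# Choose the number of clusters by silhouette score, reporting WCSS alongside.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def choose_k(bbv_matrix, k_max=10):
    best_k, best_score = None, -1.0
    for k in range(2, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(bbv_matrix)
        score = silhouette_score(bbv_matrix, km.labels_)
        print(f"k={k}  WCSS={km.inertia_:.1f}  silhouette={score:.3f}")
        if score > best_score:
            best_k, best_score = k, score
    return best_k

# Random stand-in for the (dimension-reduced) basic-block vectors of intervals.
rng = np.random.default_rng(0)
bbvs = np.vstack([rng.normal(loc=c, scale=0.2, size=(50, 16)) for c in (0, 3, 6)])
print("chosen k:", choose_k(bbvs))   # expected to pick k=3 for this toy data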
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionUntil now, most LLMs have been run on GPUs or dedicated accelerators. However, their cost, combined with their limited availability on the market and their level of energy consumption, is prompting us to turn to other solutions. In this context, SiPearl's high-performance, energy-efficient processor with built-in High Bandwidth Memory (HBM), Rhea, will be the ultimate solution for LLM workloads. The LLM workflow can be divided into three steps: 1) sanitizing data and extracting features, 2) building/training foundation models, and 3) fine-tuning and using models. While collecting, identifying, and extracting relevant features from the raw data (first step) is already done on CPUs, the other steps are still performed on GPUs.
This talk describes why and how the other tasks can be carried out more advantageously on SiPearl's processor with built-in HBM. It covers inference, fine-tuning, and training, and demonstrates, among other things, the resilience of Rhea, which is more flexible to model changes than solutions currently in use.
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe dramatic progress of Large Language Models (LLMs) in the past 3-4 years opens the potential to use them for scientific applications. To be applicable for scientific research, the skills, trustworthiness, and safety of LLMs must be tested. While several frameworks/benchmarks have emerged as de facto standards for evaluating general-purpose LLMs (e.g., Eleuther AI Harness [2] and HELM [3] for skills, DecodingTrust [5] for trustworthiness), few of them are specifically related to science. In this extended abstract, we report the discussions of the "Skills, Safety, and Trust Evaluation of Large Language Models" break-out session of the Trillion Parameter Consortium workshop in Barcelona (June 2024), which exposed the gaps in the evaluation method that must be addressed before using LLMs broadly in scientific contexts.
Workshop
SLICES: A Scientific Large Scale Infrastructure for Computing and Communication Experimental Studies
Cloud Computing
Distributed Computing
W
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionAI and ML engineers are often unfamiliar with traditional HPC environments. They rely on cloud native systems like Kubernetes to abstract complex system management. These platforms lack the fine-grained resource control and advanced scheduling features crucial for HPC/AI/ML workloads. Slurm, an open-source HPC workload manager, excels in allocating resources, managing parallel jobs, and handling task queues for workloads. In this talk, we introduce our new project, Slinky, which bridges the gap between HPC and cloud native worlds by running Slurm in Kubernetes. By combining Slurm's robust capabilities with the Kubernetes user-friendly interface, Slinky creates a powerful solution; it delivers HPC-level performance and scheduling within an accessible cloud native platform. This integration empowers AI and ML engineers to harness the full potential of their resources without requiring extensive systems expertise.
Birds of a Feather
TP
XO/EX
DescriptionSlurm is an open-source workload manager used on many of the TOP500 systems; it provides a rich set of features, including topology-aware optimized resource allocation, cloud bursting, hierarchical bank accounts with fair-share job prioritization, and many resource limits.
The meeting will consist of three parts: the Slurm development team will present details about the newly released Slurm 24.11, discuss planned changes for the upcoming 25.05 and future releases, and solicit user feedback. Everyone interested in Slurm use and development is encouraged to attend.
Paper
Distributed Computing
Middleware and System Software
TP
DescriptionThe deployment of ML serving applications, featuring multiple inference functions on serverless platforms, has gained substantial popularity, leading to the development of numerous new systems. However, these systems often focus on optimizing resource provisioning and cold-start management separately, ultimately resulting in higher monetary costs.
This paper introduces SMIless, a highly efficient serverless system tailored for serving DAG-based ML inference in heterogeneous environments. SMIless effectively co-optimizes resource configuration and cold-start management in the context of dynamic invocations. This is achieved by seamlessly integrating adaptive pre-warming windows, striking an effective balance between performance and cost. We have implemented SMIless on top of OpenFaaS and conducted extensive evaluations using real-world ML serving applications. The experimental results demonstrate that SMIless can achieve up to a 5.73× reduction in the overall costs while meeting the SLA requirements for all user requests, surpassing the performance of state-of-the-art solutions.
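As a rough illustration of the pre-warming idea described above (and not SMIless's actual policy), the following Python sketch sizes a keep-warm window from recent inter-arrival times and weighs keep-alive cost against the cold-start penalty; all constants and names are hypothetical.

```python
import statistics

# Hypothetical illustration of an adaptive pre-warming window: keep a
# function's container warm for a window derived from recent inter-arrival
# times, trading keep-alive cost against cold-start latency.

COLD_START_S = 2.0              # assumed cold-start penalty (seconds)
KEEP_WARM_COST_PER_S = 1e-5     # assumed cost of an idle warm container

def prewarm_window(inter_arrivals, safety=1.5):
    """Size the keep-warm window from recent inter-arrival times."""
    if len(inter_arrivals) < 2:
        return COLD_START_S
    mean = statistics.mean(inter_arrivals)
    stdev = statistics.stdev(inter_arrivals)
    return mean + safety * stdev

def expected_cost(window, next_gap, invocation_cost=1e-4):
    """Cost of one invocation given a keep-warm window and the actual gap."""
    keep_alive = KEEP_WARM_COST_PER_S * min(window, next_gap)
    cold = 0.0 if next_gap <= window else COLD_START_S * invocation_cost
    return keep_alive + cold + invocation_cost

if __name__ == "__main__":
    gaps = [10.0, 12.0, 9.5, 11.0, 30.0]
    w = prewarm_window(gaps[:-1])
    print(f"window={w:.1f}s, cost for next gap {gaps[-1]}s: "
          f"{expected_cost(w, gaps[-1]):.6f}")
```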
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionIn this talk, we introduce a new programming system for Partitioned Global Address Space (PGAS) applications, where point-to-point remote operations can be expressed as fine-grained asynchronous active messages (or equivalently, as remote asynchronous tasks). One of the major benefits of this approach is that it enables asynchronous movement of computation to data as opposed to traditional approaches of more synchronous movement of data to computation. This approach can be viewed as extending the classical Bulk Synchronous Processing (BSP) model to a Fine-grained-Asynchronous Bulk-Synchronous Parallelism (FA-BSP) model. We will discuss an actor-based programming system to realize the FA-BSP execution model, and present recent results illustrating the benefits of this approach on current HPC systems.
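As a toy illustration of the fine-grained asynchronous active-message idea (and not the actual programming system presented in this talk), the sketch below models each rank as a Python asyncio task with a mailbox: instead of fetching remote data, a sender posts a small task to the rank that owns the data, which applies it to its local partition.

```python
import asyncio

NRANKS = 4
data = {r: [0] * 8 for r in range(NRANKS)}     # each rank owns a partition

def owner(global_index):
    """Map a global index to (owning rank, local index)."""
    return global_index % NRANKS, global_index // NRANKS

async def rank_actor(rank, mailbox):
    """Drain the mailbox, executing each active message on local data."""
    while True:
        msg = await mailbox.get()
        if msg is None:                        # termination marker
            break
        local_idx, increment = msg
        data[rank][local_idx] += increment     # computation moved to the data

async def main():
    mailboxes = [asyncio.Queue() for _ in range(NRANKS)]
    actors = [asyncio.create_task(rank_actor(r, mailboxes[r]))
              for r in range(NRANKS)]
    # Each update is sent as a tiny task to whichever rank owns the target.
    for g in range(32):
        dest, local = owner(g)
        await mailboxes[dest].put((local, g))
    for q in mailboxes:
        await q.put(None)
    await asyncio.gather(*actors)
    print(data[0])       # rank 0 owns global indices 0, 4, 8, ..., 28

asyncio.run(main())
```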
Looking to the future, we will also discuss ongoing work on hardware support of the FA-BSP execution model being undertaken in the Flow-Optimized Reconfigurable Zones of Acceleration (FORZA) project led by Georgia Tech that is supported by the IARPA AGILE program. The FORZA project is pursuing a software-hardware co-design approach to address the significant disruptions currently under way in HPC hardware and software. In hardware, there is a Pandora's box of new architectural approaches being proposed to sustain performance improvements beyond the end of Moore’s Law. In software, there is an increased urgency for enabling large-scale data analytics applications for societal benefits. To address these challenges, the FORZA project is focusing on large-scale graph analytics as an important exemplar of the challenges that need to be addressed by future HPC systems.
We would like to acknowledge all participants in the FORZA project from Georgia Tech, Cornelis Networks, Tactical Computing Labs, UC Santa Barbara, and U. Notre Dame. The opinions in this talk are solely those of the speaker and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of any of these organizations, the ODNI, IARPA, or U.S. Government.
Birds of a Feather
TP
XO/EX
DescriptionSpack is a package manager for scientific computing with a rapidly growing open-source community of over 1,400 contributors from academia, industry, and laboratories around the world. This session will open with updates on the latest Spack release, including new features around compilers and binary caches. We'll talk about our project roadmap and our move to open governance as a Linux Foundation project. Afterwards, we'll open the floor for questions and poll the audience for their thoughts on future Spack directions. All are invited to provide feedback, request features, and engage with the Spack team. Help us make HPC software simpler!
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
PAW-Full
Task Parallelism
W
DescriptionCurrent programming models face challenges in dealing with modern supercomputers' growing parallelism and heterogeneity. Emerging programming models, like the task-based programming model found in the asynchronous many-task HPX programming framework, offer new ways to express parallelism, enhance scalability, and mask synchronization and communication latency on multi-core and distributed systems.
Regular high-performance computing benchmarks are often unsuitable for comparing different programming models due to their limited code complexity. However, real-world scientific applications are usually too complex. As a middle ground, proxy applications model the behavior of actual scientific problems, while reducing code complexity.
In our research on using HPX to program machines with heterogeneous compute units (e.g., GPU and FPGA/AI Engines), we have also substantially optimized a pure HPX-based software baseline of the LULESH proxy application. This paper discusses the techniques we applied, yielding single-node speed-ups of 1.33x to 2.25x for different problem sizes relative to the LULESH OpenMP reference implementation.
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionNote any next steps and follow-up possibilities for the community.
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionOur invited speakers address this year's charge question; then our audience and panelists will dig deeper in a moderated discussion.
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionWelcome and brief word of introduction to the goals of the symposium.
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionA sampling of some of the latest thoughts and advances.
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionHPC system administrators and user support teams spend a considerable amount of time on software installation, because packages that come prebuilt in OS package managers are often not optimized for our compute resources and networks, or lack desired compilation features. These installations can be complex due to specific versions of compilers, dependencies, and MPI libraries, which together create disorganization and are inflexible to manage. The issue is even bigger at smaller institutions with limited resources that support heterogeneous clusters and have to install software for different hardware configurations to achieve the best utilization and optimization. Moreover, many researchers want the freedom to manage their own software stacks and only use package managers of their choosing.
To lower the learning barriers for our users, enable ease of software installation for administrators and end users, bring a structured directory tree, and have a hierarchical structure of modules, we introduce SStack.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionARM-based multicore CPUs, such as NVIDIA Grace and Fujitsu A64FX, dominate contemporary HPC, featuring 32-256 cores with cache hierarchies and up to 1 TB/s memory bandwidth. While benchmarks like STREAM show similar performance across these systems, diverse applications, particularly graph and nearest-neighbor (e.g., stencils), reveal distinct performance profiles. Analyzing these profiles with low-level performance data can uncover system bottlenecks. We propose a template focusing on stalls and memory accesses to identify bottlenecks efficiently by studying key CPU/memory performance events using Linux perf. Our approach engages all cores (144 for Grace, 48 for A64FX) with platform-specific compilers (ARMClang 24.04 for Grace, Fujitsu 4.10 for A64FX). This method effectively categorizes application scenarios by analyzing stalls and memory accesses, enabling quick identification of corner cases.
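A minimal sketch of the kind of template described above, assuming a Linux system with `perf` available: it collects a few CPU/memory events with `perf stat` and derives a stall fraction and a miss rate. The event list is a placeholder; event names and availability differ between Grace, A64FX, and other CPUs.

```python
import subprocess
import sys

# Placeholder event set; adjust per platform (some events may be unsupported).
EVENTS = ["cycles", "instructions", "stalled-cycles-backend", "cache-misses"]

def perf_counts(cmd):
    """Run `perf stat` in CSV mode around a command and return {event: count}."""
    proc = subprocess.run(
        ["perf", "stat", "-x", ",", "-e", ",".join(EVENTS), "--"] + cmd,
        capture_output=True, text=True,
    )
    counts = {}
    for line in proc.stderr.splitlines():
        parts = line.split(",")
        # CSV fields: value, unit, event-name, ...  Skip "<not supported>" rows.
        if len(parts) >= 3 and parts[0].strip().isdigit():
            counts[parts[2]] = int(parts[0])
    return counts

if __name__ == "__main__":
    c = perf_counts(sys.argv[1:] or ["sleep", "1"])
    if c.get("cycles") and c.get("instructions"):
        print("backend stall fraction:",
              c.get("stalled-cycles-backend", 0) / c["cycles"])
        print("cache misses per kilo-instruction:",
              1000 * c.get("cache-misses", 0) / c["instructions"])
```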
Paper
Accelerators
Compilers
Heterogeneous Computing
Performance Evaluation and/or Optimization Tools
TP
DescriptionIncreasing heterogeneity in HPC architectures and compiler advancements have led to OpenMP being frequently used to enable computations on heterogeneous devices. However, the efficient movement of data on heterogeneous computing platforms is crucial for achieving high utilization. Programmers must explicitly map data between the host and connected accelerator devices to achieve efficient data movement. Ensuring efficient data transfer requires programmers to reason about complex data flow. This can be a laborious and error-prone process since the programmer must keep a mental model of data validity and lifetime spanning multiple data environments. We present a static analysis tool, OMPDart (OpenMP Data Reduction Tool), for OpenMP programs that models data dependencies between host and device regions and applies source code transformations to achieve efficient data transfer. Our evaluations on nine HPC benchmarks demonstrate that OMPDart is capable of generating effective data mapping constructs that substantially reduce data transfer between host and device.
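For readers unfamiliar with the underlying problem, the toy sketch below models the host/device data-validity reasoning such a tool has to automate; it is an illustration of the analysis problem, not OMPDart's actual algorithm. A transfer is planned only when the consuming region's copy of an array is stale.

```python
# Each region runs on the host ("H") or device ("D") and reads/writes arrays.
regions = [                      # (where, reads, writes) in program order
    ("H", {"a"}, {"b"}),
    ("D", {"b"}, {"c"}),
    ("D", {"c"}, {"c"}),
    ("H", {"c"}, set()),
]

def plan_transfers(regions):
    valid = {}                       # array -> set of sides holding fresh data
    transfers = []
    for i, (side, reads, writes) in enumerate(regions):
        for arr in reads:
            holders = valid.get(arr, {"H"})   # assume data starts on the host
            if side not in holders:
                transfers.append((i, arr, f"copy to {side}"))
                holders.add(side)
                valid[arr] = holders
        for arr in writes:
            valid[arr] = {side}               # writer now holds the only fresh copy
    return transfers

print(plan_transfers(regions))
```

On this toy sequence the planner emits only two transfers (b to the device before region 1, c back to the host before region 3), because the device-to-device reuse of c needs no movement.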
Exhibits
SCinet
TP
XO/EX
Paper
Distributed Computing
Middleware and System Software
TP
Best Student Paper Finalist
DescriptionDeep reinforcement learning (DRL) has gained immense success in many applications, including gaming AI, scientific simulations, and large-scale (HPC) system scheduling. DRL training, which involves a trial-and-error process, demands considerable time and computational resources. To address this, distributed DRL algorithms and paradigms have been developed to expedite training using extensive resources.
However, existing distributed DRL solutions rely on synchronous learning with serverful infrastructures, suffering from low training efficiency and overwhelming training costs.
This paper proposes Stellaris, the first to introduce a generic asynchronous learning paradigm for distributed DRL training with serverless computing.
We devise an importance sampling truncation technique to stabilize DRL training and develop a staleness-aware gradient aggregation method tailored to the dynamic staleness in asynchronous serverless DRL training.
Experiments on AWS EC2 regular testbeds and HPC clusters show that Stellaris outperforms existing state-of-the-art DRL baselines, achieving 2.2X higher rewards (i.e., training quality) and reducing training costs by 41%.
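As a hedged sketch of what staleness-aware gradient aggregation can look like, the snippet below down-weights gradients computed from older model versions; the 1/(1+staleness) damping is a common heuristic and is not claimed to be Stellaris's exact rule.

```python
import numpy as np

def aggregate(gradients, versions, current_version):
    """Weighted average of gradients, down-weighting stale contributions."""
    weights = np.array([1.0 / (1 + current_version - v) for v in versions])
    weights /= weights.sum()
    stacked = np.stack(gradients)
    return np.tensordot(weights, stacked, axes=1)   # weighted sum over workers

if __name__ == "__main__":
    grads = [np.ones(4), 2 * np.ones(4), 4 * np.ones(4)]
    # Workers report the model version their gradient was computed against.
    print(aggregate(grads, versions=[10, 9, 7], current_version=10))
```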
Students@SC
TP
W
TUT
XO/EX
DescriptionThis event, which will take place from 8:30am to 3:30pm on Sunday, November 17, in person at SC24 in Room B216, is open to anyone interested in learning more about high-performance computing (HPC). Participants will receive an overview of HPC programming environments, parallel programming models, job schedulers, and job launchers. Afterward, they will be directed to self-guided HPC challenges covering basic parallel programming, AI, and GPU programming topics. These challenges will be performed on Oak Ridge Leadership Computing Facility’s (OLCF) Frontier Exascale system and Purdue’s Anvil supercomputer. Frontier is currently the most powerful supercomputer in the world. Students will have access to Frontier during the workshop and to Anvil afterward to complete the exercises required for an HPC Crash Course certificate. No pre-registration is required.
Pre-Workshop Session:
We will host a virtual help session on November 7 at 10:00am (EST) to review requirements, what to expect, and why you should know about HPC.
To attend the help session, please complete this web form: https://forms.gle/jPRxGebRW3HSLHSZ8
We will send you the link to join the session before November 7.
Frontier Access Requirements:
Eligible participants will be provided access tokens and usernames for Frontier during the workshop. To gain access to Frontier:
1. Bring a government-issued photo ID to the workshop for quick access vetting.
2. Bring an internet-ready laptop to the event.
3. Note that foreign nationals from countries listed in section 15 CFR 740.7 License Exceptions for Computers (including Cuba, Iran, North Korea, Sudan, and Syria) may require a lengthy approval process for access to DOE supercomputers. If approval cannot be obtained in time for the HPC Crash Course, affected participants can apply to work on Anvil.
Anvil Access Requirements:
Participants who need more time to complete exercises after the workshop or who cannot gain access to Frontier can apply for access to Anvil. It is strongly recommended that participants apply for Anvil access ahead of the workshop.
To apply for Anvil access, follow these steps:
1. For participants who do not have an ACCESS ID already, please go to https://identity.access-ci.org/new-user and follow the instructions listed here: https://drive.google.com/file/d/1G9fGTFN8Mk-CaL_EW8O0gean3haayBza/view?usp=sharing
2. Once you have your ACCESS ID, please complete this form: https://docs.google.com/forms/d/e/1FAIpQLSesTnC7UF6B5Nr8paGaLwQu4QKxu6K8KC8lO1Gm0El1h0duXg/viewform
Once we have your ACCESS ID, Anvil admins will grant you access.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionWe evaluate the performance of the baseline and optimized reductions in OpenMP on an NVIDIA Grace-Hopper system. We explore the impacts of the number of teams, the number of elements to sum per loop iteration, and simultaneous execution on the central-processing unit (CPU) and the GPU in the unified memory (UM) mode upon the reduction performance. The experimental results show that the optimized reductions are 6.120X to 20.906X faster than the baselines on the GPU, and their efficiency ranges from 89% to 95% of the theoretical GPU memory bandwidth. Depending on where an input array is allocated in the program when co-running the reduction on the CPU and GPU in the UM mode, the average speedup over the GPU-only execution is approximately 2.484 or 1.067, and the speedup of the optimized reductions over the baseline reductions ranges from 0.996 to 10.654 or from 0.998 to 6.729.
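For reference, the bandwidth-efficiency metric quoted above can be reproduced from first principles: a sum reduction reads each element once, so efficiency is the achieved read bandwidth divided by the theoretical peak. The numbers in the sketch below are made up for illustration and are not measurements from this paper.

```python
# Worked example of reduction bandwidth efficiency (all values hypothetical).
n = 1 << 30                 # number of elements
bytes_per_elem = 8          # double precision
time_s = 0.0023             # measured kernel time (hypothetical)
peak_bw = 4.0e12            # theoretical GPU memory bandwidth, bytes/s (hypothetical)

achieved_bw = n * bytes_per_elem / time_s
print(f"achieved {achieved_bw / 1e12:.2f} TB/s, "
      f"efficiency {100 * achieved_bw / peak_bw:.1f}% of peak")
```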
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
Birds of a Feather
TP
XO/EX
DescriptionA hardware management service that provides dynamic application resource matching for Composable Disaggregated Infrastructure (CDI) can potentially increase runtime performance and energy efficiency and improve resource utilization for large-scale computing systems by providing granular control of network-connected pooled resources. The OpenFabrics Alliance (OFA) is developing Sunfish, a vendor-agnostic CDI framework to augment nodes with additional disaggregated components.
This BoF is an update on the status of Sunfish and an invitation to new members to join with us to further develop Sunfish. Speakers will pose questions, directing the focus of Sunfish development towards problems and features that matter to the audience.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe two CFD simulations to compute the wake from the wind turbines were performed using the high-order finite-difference flow solver XCompact3D (https://www.incompact3d.com). The simulations were run on ARCHER2, the UK's national HPC service (https://www.archer2.ac.uk). A precursor simulation was run to generate the neutral atmospheric boundary layer. The wind turbine simulations were run with a billion mesh points for 20,000 iterations before collecting the flow field.
We used ParaView for the initial postprocessing of the CFD simulation's velocity data to compute the Q-criterion. The final video was made using Blender, including the simulation of the ocean. The representation of the sky uses an HDRi from Poly Haven, "The Sky Is On Fire" by Greg Zaal and Rico Cilliers (see https://polyhaven.com/a/the_sky_is_on_fire).
The video has a total of 590 frames and took 14 hours to render on a workstation with an AMD Ryzen Threadripper PRO 7985WX (64 cores), six NVIDIA GeForce RTX 4090 GPUs (512GB DDR5), and four Samsung 990 PRO 4TB NVMe SSDs.
Birds of a Feather
TP
XO/EX
DescriptionMembers of underrepresented groups often lack access to role models within their minority. The HPC community is still predominantly male, making it difficult for young women to find female "superheroes" to identify with. Such role models are crucial for career planning and guidance. This session aims to provide women in particular with the opportunity to meet influential, well-recognized female HPC "superheroes" from academia, research labs, HPC centers and industry. Join us to be inspired and find relatable role models as we work together to build a more inclusive and connected HPC community.
Tutorial
Applications and Application Frameworks
Architecture
Broader Engagement
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionHigh-performance computing (HPC) resources, and the underlying mathematics that has enabled their application, have been the basis for much of the technical progress in the modern world. Engineering design, weather forecasting, robotics, process automation, and many other advances have been made possible by HPC. HPC allows scientists and engineers to start asking questions they have only dreamed of asking before.
HPC is not easy to utilize. Developers have to address resource consumption for the first time, along with different languages and parallel application programming interfaces. It requires knowledge of the underlying hardware architecture. It requires a different and much deeper method of programming. This tutorial will provide an introduction to HPC for those who are new to the field. It will provide the big picture. It will cover how HPC hardware differs from traditional computing hardware. It will cover parallelization and the different methods of programming. It will provide an understanding of the starting steps, costs, and the future of HPC.
Doctoral Showcase
Posters
TP
DescriptionQuantum computing applications require expert knowledge to perform complex steps: (1) selecting a suitable quantum algorithm, (2) generating the quantum circuit, (3) compiling/executing the quantum circuit, and (4) decoding the results. This creates high entry barriers for end users with limited expertise who need solutions for domain-specific problems. This poster highlights methods developed to assist end users, resulting in multiple open-source tools in the Munich Quantum Toolkit (MQT) on GitHub.
The poster focuses on three main tasks:
1) End-User Workflow (MQT ProblemSolver): Providing a workflow from classical input to a quantum solution, then returning it to classical format.
2) Quantum Device Selection and Compilation (MQT Predictor): Selecting and efficiently compiling the most suitable quantum device.
3) Benchmark Suite (MQT Bench): Offering representative test cases in a benchmark suite of quantum applications.
These tools simplify quantum computing, making it accessible to non-experts.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionQuantum computational superiority promises rapid computation and high energy efficiency. Despite recent advances in classical algorithms aimed at refuting the milestone claim of Google's Sycamore, challenges remain in generating uncorrelated samples of random quantum circuits.
In this paper, we present a groundbreaking large-scale system technology that leverages optimization at the global, node, and device levels to achieve unprecedented scalability for tensor networks. This enables the handling of large-scale tensor networks with memory capacities reaching tens of terabytes, surpassing the memory constraints of a single node. Our techniques scale to 2304 GPUs with a peak computing power of 718.8 PFLOPS at half precision. Notably, our most remarkable result is a time-to-solution of 17.18 seconds with an energy consumption of only 0.29 kWh, outperforming Google's quantum processor Sycamore in both speed and energy efficiency, which recorded 600 seconds and 4.3 kWh, respectively.
Workshop
Survey of Technologies for Developers of Parallel Applications — Task-Based and Scale-Free Computing
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Panel
Energy Efficiency
Sustainability
TP
DescriptionWhat does it mean for HPC to be sustainable? The largest supercomputers today are consuming more than 20 megawatts, and those built to support AI training and inference are even larger. We have made significant improvements to operational efficiency in HPC. We now need to consider a broader scope of environmental impacts across the life cycle of our systems and data centers. This includes design, manufacturing, transportation, operations, and end-of life. How do we manage water as a resource and trade off data center energy efficiency and water consumption? Is there a way to improve sustainability by operating supercomputer and data centers more dynamically without adversely affecting users? Can there be effective heat re-use for district heating and greenhouse food production? Reducing the carbon impact of ever larger HPC and AI clusters is going to require better and more consistent reporting. What are the right metrics for really driving accountability?
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionThe use of Artificial Intelligence (AI) and Machine Learning (ML) as part of scientific workloads is becoming increasingly widespread. It is imperative to understand how to configure AI and ML applications on HPC systems to optimise their performance and energy efficiency, thereby minimising their environmental impact. In this study, we use MLPerf HPC's DeepCAM benchmark to assess and explore the energy efficiency of ML applications on different hardware platforms. We highlight the challenges that, despite growing popularity, ML frameworks still present in a traditional HPC environment, as well as the challenges of measuring power and energy on a variety of HPC and cloud-like virtualised systems. We conclude our study by proposing recommendations that will improve and encourage best practices around sustainable AI and ML workloads on HPC systems.
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionCurrent (centralized) resource management strategies typically require a global view of distributed HPC systems, relying on a cluster-wide resource manager for scheduling, with static, expert-tuned rules. This centralized decision-making approach suffers from resilience, efficiency and scalability issues. In this work, we describe our initial progress in the SWARM project that takes a novel decentralized multi-agent approach leveraging Swarm Intelligence (SI) and consensus strategies for enhanced robustness, resilience, and fault tolerance. We present our foundational SWARM system model to improve network overlays, enhance job selection using multi-agent consensus algorithms, and design SI-inspired scheduling approaches.
Paper
Accelerators
Data Movement and Memory
Emerging Technologies
Hardware Technologies
Heterogeneous Computing
Linear Algebra
Network
TP
DescriptionExisting high-performance computing (HPC) interconnection architectures are based on high-radix switches, which limits the injection/local performance and introduces latency/energy/cost overhead. The new wafer-scale packaging and high-speed wireline technologies provide high-density, low-latency, and high-bandwidth connectivity, thus promising to support direct-connected high-radix interconnection architecture.
In this paper, we propose a wafer-based interconnection architecture called Switch-Less-Dragonfly-on-Wafers. By utilizing distributed high-bandwidth networks-on-chip-on-wafer, costly high-radix switches of the Dragonfly topology are eliminated while increasing the injection/local throughput and maintaining the global throughput. Based on the proposed architecture, we also introduce baseline and improved deadlock-free minimal/non-minimal routing algorithms with only one additional virtual channel. Extensive evaluations show that the Switch-Less-Dragonfly-on-Wafers outperforms the traditional switch-based Dragonfly in both cost and performance. Similar approaches can be applied to other switch-based direct topologies, thus promising to power future large-scale supercomputers.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionIn this paper, we characterize symmetric locality. In designing algorithms, compilers, and systems, data movement is a common bottleneck in high-performance computation, and we address it by improving cache and memory performance. We study a special type of data reuse in the form of repeated traversals, or re-traversals, based on the symmetric group. The cyclic and sawtooth traces are previously known results in symmetric locality, and in this work we generalize this result to arbitrary re-traversals. We also provide an abstract framework for applications in compiler design and machine learning models to improve the memory performance of certain programs.
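The cyclic and sawtooth traces mentioned above can be illustrated with a short reuse-distance computation; this is a textbook-style example, not the paper's general construction for arbitrary re-traversals.

```python
def reuse_distances(trace):
    """LRU stack distance of each access (None for a cold access)."""
    stack, out = [], []
    for x in trace:
        if x in stack:
            i = stack.index(x)
            out.append(len(stack) - 1 - i)   # distinct elements since last use
            stack.pop(i)
        else:
            out.append(None)
        stack.append(x)                       # x becomes most recently used
    return out

n = 5
cyclic = list(range(n)) + list(range(n))                 # 0..4 then 0..4
sawtooth = list(range(n)) + list(range(n - 1, -1, -1))   # 0..4 then 4..0
print("cyclic  :", reuse_distances(cyclic))    # second pass: all distances n-1
print("sawtooth:", reuse_distances(sawtooth))  # second pass: 0, 1, ..., n-1
```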
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionHPC system architects routinely use application profiling and performance modeling to evaluate hardware and software performance trade-offs. However, the focus on individual applications leaves gaps in the understanding of system utilization because it is impractical to collect profiles and models for every application. In this paper, we use hardware activity metrics gathered from NERSC’s Perlmutter system to perform a roofline performance analysis of a diverse scientific workload and provide quantitative empirical evidence for widely held beliefs that had previously been inferred from scattered analyses of individual applications. Specifically, we confirm the predominance of double-precision operations. The arithmetic intensity distribution suggests that near equal fractions of the workload are compute-bound and bandwidth-bound on Perlmutter GPUs. These results stand in worrisome contrast to hardware performance trends, where artificial intelligence applications driving processors emphasize the performance of reduced-precision operations, and gains in memory bandwidth are not keeping pace with peak processing rates.
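As a reminder of the model behind this analysis, the roofline bound for an application with arithmetic intensity I (flop/byte) on a machine with peak compute P (flop/s) and memory bandwidth B (byte/s) is min(P, I·B). The sketch below uses placeholder machine numbers, not Perlmutter's published figures.

```python
def roofline(intensity, peak_flops, peak_bw):
    """Attainable performance under the roofline model."""
    return min(peak_flops, intensity * peak_bw)

PEAK_FLOPS = 9.7e12      # hypothetical FP64 peak, flop/s
PEAK_BW = 1.6e12         # hypothetical HBM bandwidth, byte/s
ridge = PEAK_FLOPS / PEAK_BW   # intensity where the two limits meet

for i in (0.5, ridge, 64.0):
    bound = roofline(i, PEAK_FLOPS, PEAK_BW)
    kind = "bandwidth-bound" if i < ridge else "compute-bound"
    print(f"I={i:6.2f} flop/B -> {bound / 1e12:5.2f} Tflop/s ({kind})")
```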
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
DescriptionError-bounded lossy compression has been a critical technique to significantly reduce the sheer volume of simulation datasets for high-performance computing (HPC) scientific applications while effectively controlling the data distortion based on a user-specified error bound. In many real-world use cases, users must perform computational operations on the compressed data. However, none of the existing error-bounded lossy compressors support such operations, inevitably resulting in undesired decompression costs. In this paper, we propose a novel error-bounded lossy compressor (called SZOps), which supports not only error-bounding features but also efficient computations (e.g., negation, scalar addition, scalar multiplication, mean, variance) on the compressed data without the complete decompression step, which is the first attempt to the best of our knowledge. We develop several optimization strategies to maximize the overall compression ratio and execution performance. We evaluate SZOps against other state-of-the-art lossy compressors on multiple real-world scientific application datasets.
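A conceptual sketch of operating on compressed data without full decompression follows; the block layout (a per-block base value plus quantized residuals) is illustrative only and is not SZOps's actual format.

```python
import numpy as np

BLOCK = 256
QUANT = 1e-3                       # quantization step tied to the error bound (assumed)

def compress(x):
    """Store each block as a base value plus quantized residuals."""
    blocks = []
    for i in range(0, len(x), BLOCK):
        chunk = x[i:i + BLOCK]
        base = chunk.mean()
        residual = np.round((chunk - base) / QUANT).astype(np.int32)
        blocks.append({"base": base, "n": len(chunk),
                       "res": residual, "res_sum": int(residual.sum())})
    return blocks

def mean_compressed(blocks):
    """Global mean from block metadata alone, without decompressing residuals."""
    total = sum(b["base"] * b["n"] + QUANT * b["res_sum"] for b in blocks)
    return total / sum(b["n"] for b in blocks)

def scalar_add_compressed(blocks, c):
    """Scalar addition touches only the per-block base values."""
    for b in blocks:
        b["base"] += c
    return blocks

x = np.random.default_rng(0).normal(size=10_000)
blocks = compress(x)
print(mean_compressed(blocks), x.mean())   # agree to within the error bound
```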
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionWe present a training program named T3-CIDERS, a "Train-The-Trainer Approach to Fostering CI (cyberinfrastructure)- and Data-Enabled Research in CyberSecurity." T3-CIDERS is a train-the-trainer program for advanced CI skills that is designed to be synergistic with research, teaching, and learning activities in cybersecurity and cyber-related disciplines. The participants, termed "future trainers" (FTs), are trained in effective instructional design and hands-on CI materials. T3-CIDERS aims to enhance cybersecurity research and education through broader adoption of advanced CI techniques such as artificial intelligence, big data, parallel programming, and platforms like high-performance computing (HPC) systems. T3-CIDERS includes pre-training, a weeklong summer institute, ongoing learning engagements, and local training activities. The FTs conduct local training tailored to the needs at their respective home institutions. Community building is integral to T3-CIDERS as its overarching goal. The first cohort of FTs who took the 2024 summer institute comprises faculty members, researchers, and students representing multiple states.
Paper
Accelerators
Algorithms
Data Compression
I/O, Storage, Archive
Performance Optimization
TP
DescriptionAs simulation-based scientific discovery advances to exascale, a major question that the community is striving to answer is how to co-design data storage and complex physics-rich analytics in a way that the time to knowledge can be minimized for post-processing. This paper aims to address the issue of I/O interference for data analytics over local ephemeral storage, which is shared by multiple applications in a non-exclusive node usage scenario. At the core of this work is a coordinated cross-layer approach that reacts to storage interference from both storage and application layers. We evaluated three real-world data analytics, XGC, GenASiS, and CFD, on Chameleon, and quantitatively demonstrated that the I/O performance can be vastly improved, e.g., by 52% versus no adaptivity and 36% versus single layer adaptivity, while maintaining acceptable outcomes of data analysis.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionDeterminacy races are concurrent programming hazards that occur when two accesses to the same memory address are not ordered and at least one of them is a write.
Their presence hints at a correctness error, particularly under asynchronous task-based parallel programming models.
This paper introduces Taskgrind: a Valgrind tool for memory access analysis of parallel programming models such as Cilk or OpenMP.
We illustrate the tool's capabilities with a determinacy-race analysis and compare it with state-of-the-art tools.
Results show fewer false negatives and lower memory overhead on a set of microbenchmarks and LULESH, with meaningful error reports that assist programmers in parallelizing programs.
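A toy checker for the definition used above (not Taskgrind's implementation): two accesses race if they touch the same address, at least one is a write, and neither task happens-before the other in the task dependency graph.

```python
from itertools import combinations

edges = {"A": {"B", "C"}, "B": {"D"}, "C": {"D"}}   # task DAG: A -> {B, C} -> D
accesses = [("A", 0x10, "w"), ("B", 0x10, "r"),
            ("C", 0x10, "w"), ("D", 0x10, "r")]     # (task, address, read/write)

def reaches(src, dst, edges):
    """True if src happens-before dst (or src == dst) in the task DAG."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def races(accesses, edges):
    found = []
    for (t1, a1, op1), (t2, a2, op2) in combinations(accesses, 2):
        if a1 == a2 and "w" in (op1, op2):
            if not reaches(t1, t2, edges) and not reaches(t2, t1, edges):
                found.append(((t1, op1), (t2, op2), hex(a1)))
    return found

print(races(accesses, edges))   # B's read and C's write are unordered -> race
```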
Exhibits
SCinet
TP
XO/EX
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionAs scientific codes are ported between GPU platforms, continuous testing is required to ensure numerical robustness and identify numerical differences. Compiler-induced numerical differences occur when a program is compiled and run on different GPUs, and the numerical outcomes are different for the same input. We present a study of compiler-induced numerical differences between NVIDIA and AMD GPUs. Our approach uses Varity to generate thousands of short numerical tests in CUDA and HIP, and their inputs; then, we use differential testing to check if the program produced a numerical inconsistency when run on these GPUs. We also use the HIPIFY tool to convert CUDA tests into HIP and check if there are numerical inconsistencies induced by HIPIFY. We generated more than 600,000 tests and found subtle numerical differences that come from (1) math library calls, (2) differences in floating-point precision (FP64 versus FP32), and (3) converting code with HIPIFY.
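The core differential-testing check can be sketched as follows; since running on real NVIDIA and AMD GPUs is out of scope here, two stand-in functions with different summation orders play the role of the two platforms.

```python
import math

def run_on_platform_a(x):
    return math.fsum([x * 1e-8] * 100_000)    # exactly rounded summation

def run_on_platform_b(x):
    return sum([x * 1e-8] * 100_000)          # naive left-to-right summation

def inconsistent(a, b, rel_tol=1e-12):
    """Flag results whose relative difference exceeds the tolerance."""
    denom = max(abs(a), abs(b), 1e-300)
    return abs(a - b) / denom > rel_tol

for x in (1.0, 3.14159, 1e30):
    a, b = run_on_platform_a(x), run_on_platform_b(x)
    print(x, a, b, "FLAG" if inconsistent(a, b) else "ok")
```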
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionWe present a randomized differential testing approach to test OpenMP implementations. In contrast to previous work that manually creates dozens of verification and validation tests, our approach is able to randomly generate thousands of tests, exposing OpenMP implementations to a wide range of program behaviors. We represent the space of possible random OpenMP tests using a grammar and implement our method as an extension of the Varity program generator. By generating 1,800 OpenMP tests, we find various performance anomalies and correctness issues when we apply it to three OpenMP implementations: GCC, Clang, and Intel. We also present several case studies that analyze the anomalies and give more details about the classes of tests that our approach creates.
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
DescriptionSupercomputers get faster and more complex every year. MPI, long the dominant model for distributed computation, has adapted by combining with models for intra-node parallelism (e.g., OpenMP, CUDA). These MPI+X hybrids offer performance but demand significant programmer effort to write, debug, and tune applications.
Alternatives to MPI+X are worth exploring as programmer productivity becomes a major component of the time to science. Alternatives include parallel programming languages (e.g., Chapel, Regent, Fortran 2018), general-purpose libraries (e.g., Charm++, COMPSs, HPX, Legion, UPC++), and domain-specific libraries (e.g., Arkouda, Dask, Spark). With many options to choose from, it is hard for programmers to know which alternative models are appropriate for their application and for programming model developers to understand the opportunities for improvement.
Through discussion of specific applications, PAW-ATM brings together application experts and programming model developers to improve applications and models.
Tutorial
Architecture
Embedded and/or Reconfigurable Systems
TUT
DescriptionThis tutorial presents Chameleon (www.chameleoncloud.org), an open experimental platform providing access to state-of-the-art infrastructure for projects in Computer Science research, education, and emergent applications. Chameleon consists of three main operating sites -- located at University of Chicago, TACC, and NCAR – each providing access to innovative architecture configurations including, for example, Fujitsu FX700 (“Fugaku nodes”), Liqid and GigaIO disaggregated hardware, a range of GPUs, and others. The hardware is reconfigurable at bare metal level to support research on topics like power management and performance variability, where control over the full software stack is important. In addition to datacenter hardware, Chameleon also supports experiments using edge hardware allowing for a range of edge to cloud explorations.
This tutorial will introduce attendees to the Chameleon platform and explain how to best use it for HPC research. We will first explain the basic system capabilities in the context of a typical experiment, and then progress to more advanced features including construction of complex experimental environments such as virtual clusters; experimenting in the edge to cloud continuum, illustrated with examples using autonomous vehicles; and best practices and digital artifacts supporting packaging experiments for reproducibility, highlighting Chameleon's role as the default platform for SC24 Artifact Evaluation.
Exhibits
SCinet
TP
XO/EX
Invited Talk
TP
DescriptionThis talk will outline three revolutions that happened in Earth system modelling in the past decades. The quiet revolution has leveraged better observations and more compute power to allow for constant improvements in prediction quality over the last decades; the digital revolution has enabled us to perform kilometer-scale simulations on modern supercomputers that further increase the quality of our models; and the machine learning revolution has now shown that machine-learned weather models are often competitive with physics-based weather models for many forecast scores while being easier, smaller and cheaper. This talk will summarize past developments, explain current challenges and opportunities, and outline what the future of Earth system modelling will look like — in particular, regarding machine-learned foundation models in a physical domain such as Earth system modelling.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionThe Digital Twin Consortium: Accelerating Industry Transformation through Collaborative Innovation
Abstract: This session presents an overview of the Digital Twin Consortium, including its mission, structure, work products, and strategic direction. We will cover insights into developments in the evolution of digital twins and highlight the role of AI in enhancing capabilities and applications. Current initiatives, research directions, use cases, and opportunities for cross-industry collaboration will be discussed. A Q&A session will conclude the presentation.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionIn a series of related works developing an ensemble consistency testing approach for multiple popular global climate models (GCMs), one test scenario has repeatedly stood out. Why does the use of the Fused Multiply-Add (FMA) operation result in model configurations getting flagged as failures, while changes to compiler choice, optimization level, processor type and number, etc. are passed as expected? This work explores the impacts of FMA on GCM simulation output from a distributional perspective and provides directions for future work to enable model developers and users to use numerical optimization techniques with confidence.
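To make the mechanism concrete: a fused multiply-add rounds a*b+c once, while the unfused form rounds a*b first. The sketch below reproduces that difference using float32 as the working precision and float64 as a stand-in for the exact fused result; it illustrates the mechanism only, not the GCM experiments above.

```python
import numpy as np

# Inputs chosen so that the single rounding (FMA-like) and the double
# rounding (unfused) disagree in float32.
a = np.float32(1.0000001)
b = np.float32(1.0000001)
c = np.float32(-1.0000002)

unfused = np.float32(a * b) + c          # a*b rounded to float32, then add
fused_like = np.float32(np.float64(a) * np.float64(b) + np.float64(c))

print("unfused   :", unfused)            # the rounded product cancels c exactly
print("fused-like:", fused_like)         # the residual of the exact product survives
print("difference:", float(fused_like) - float(unfused))
```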
Birds of a Feather
TP
XO/EX
DescriptionAs supercomputing welcomes new workflows of simulations, data science and artificial intelligence in the exascale era, the goal of this session is to pose, engage, debate, and address the question: "How should the SC community evolve performance benchmarks?"
This session will be organized as presentations and panel discussions with audience participation that will invite active members of the TOP500, HPCG, MLPerf, TeraSort, and key personnel from industry, academia, and government to discuss the value, need and desire for evolving the benchmark suite that is inclusive and accommodative of emerging applications to guide future supercomputing system design and architecture.
Workshop
Message Passing
Network
W
DescriptionThe panel will discuss the future of MPI, and how AI is playing a role. Panelists are still being determined.
Birds of a Feather
TP
XO/EX
DescriptionThe National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss OAC's vision, strategic and national priorities, as well as the latest funding opportunities across all aspects of the research cyberinfrastructure ecosystem. Substantial time will be devoted to audience Q&A between attendees and NSF staff and unstructured time to meet informally with NSF staff.
Exhibitor Forum
Facilities
Sustainability
TP
XO/EX
DescriptionThe escalating demands of HPC are pushing data center cooling to its limits and driving the need for more sustainable solutions. Gartner’s recent Data Center Infrastructure Hype Cycle highlights the growing importance of energy efficiency and sustainability in data centers, with advances in power, cooling, processing and automation technology gaining traction. This presentation will explore how an innovative hybrid approach to cooling, combining air and liquid cooling methodologies, can address the thermal challenges of HPC environments while aligning with the industry’s shift towards more sustainable and efficient data center operations. Real-world case studies will demonstrate the impact of this approach on increased performance, higher densities, energy savings, reduced total cost of ownership and overall sustainability in AI/HPC environments.
Birds of a Feather
TP
XO/EX
DescriptionWith power being a first-order design constraint on par with performance, it is important to measure and analyze energy efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, TOP500, and Energy Efficient HPC Working Group have been working together on improving power measurement methodology, and this BoF presents recommendations for changes that will improve ease of submission without compromising accuracy.
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionPredictive oncology can be defined as the branch of precision medicine focused on improving cancer treatment outcomes by customizing therapeutic decisions for each patient based on all available information – genetic, molecular, cellular, and clinical. The rapid evolution of machine learning has led to a proliferation of sophisticated predictive oncology models. While many of these models show promise in research settings, clinical adoption has moved slowly for several reasons. One key challenge lies in generalizability; models trained on preclinical datasets often fail to translate to patient data. This limitation primarily arises from limited access to data and disparities between preclinical training and real-world contexts, compounded by the inherent heterogeneity of patient populations and the dynamic nature of disease status. Another major challenge relates to model transparency and interpretability – the ability to scrutinize the inner workings of a model and explain the biomolecular factors that underlie each of its predictions. The lack of model interpretation has been recognized as one of the most important barriers to building trustworthy AI systems in high-stakes clinical applications. In addition to challenges in model development, the successful clinical application of predictive oncology models also faces infrastructure and regulatory hurdles. The financial, computational, and regulatory resources needed to run both retrospective and prospective studies are rarely available outside major biomedical research campuses, especially in low-income regions. These challenges, among others, highlight the need for structured recommendations for model development, which clearly enumerate the methodological and clinical utility risks.
To address these fundamental challenges, we propose seven hallmarks all predictive oncology models should strive to address. These are: 1) Data Relevance and Actionability, ensuring the model's input is both pertinent and actionable; 2) Expressive Architecture, denoting the model's ability to capture complex biological interactions; 3) Standardized Benchmarking, for consistent model evaluation; 4) Demonstrated Generalizability, to ensure model performance across diverse settings; 5) Mechanistic Interpretability, for understanding the biological basis of model predictions; 6) Accessibility and Reproducibility, guaranteeing user-friendly model use; and 7) Fairness, to promote equitable model application across different patient demographics and resource-constrained communities. In addition, we consider how ethical principles apply to each of the seven hallmarks to maximize the societal benefits of therapy response models. We illustrate the systematic evaluation of a predictive oncology model via a scorecard. We also formulate a hallmarks-based checklist for model developers to succinctly enumerate the advances and risks associated with a model. We hope that the broader community – not only cancer researchers but regulators, clinicians, and lawmakers – will engage in shaping these guidelines, leading to the adoption of a concise set of standards.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThis image was created using a Canon EOS 7, a Google Pixel phone, Adobe Photoshop, Adobe Premiere, NCAR stock video, and ThingLink.
Workshop
Cloud Computing
Distributed Computing
W
Panel
Data Management
TP
DescriptionIn this panel, we explore the transformative convergence of high-performance computing and enterprise data systems into a unified data management platform. Traditionally, organizations have relied on distinct systems for varying computing demands. However, recent innovations enable seamless integration into a single, scalable solution. In this session, moderated by Tommy Minyard of Texas Advanced Computing Center (TACC) with panelists from Brown University, CINECA, the University of Utah, PacBio and VAST Data, audience members will hear insights on merging data infrastructures in academic, government, and corporate organizations. This session will discuss the benefits of this convergence, including simplification of data architecture, cost reductions, and enhanced data utility for diverse applications including AI.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Facilities
TP
XO/EX
DescriptionArtificial intelligence (AI) and its growing demands have significantly impacted datacenter solutions. AI workloads require immense computational power, resulting in increased energy consumption and heat generation. This shift necessitates a comprehensive re-evaluation of several critical areas: power management, thermal management, footprint optimization, datacenter-level control (the next step for datacenters), and designing for liquid-cooling failure scenarios. This presentation will guide the audience through our journey as we navigate these areas, covering where we are now and where we are going next.
Tutorial
Accelerators
Applications and Application Frameworks
Broader Engagement
Emerging Technologies
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionWe propose a 3-hour, hands-on tutorial on the use of the Julia language for high-performance computing (HPC) applications. Julia empowers both expert and novice HPC users by providing a high-productivity, high-performance ecosystem powered by just-in-time (JIT) compiled code via LLVM. We will showcase Julia's support for parallel programming models: CPU threads and vendor GPUs (e.g., NVIDIA, AMD), distributed-memory parallelism using its Message Passing Interface wrapper (MPI.jl), and CPU/GPU performance-portable layers. Attendees will have access to National Energy Research Scientific Computing Center (NERSC) resources for running CPU and (NVIDIA) GPU simulations and will learn how to use Julia to write end-to-end applications: computation, communication, and data analysis. In addition, all codes will be publicly available, and we will follow up with support via a Slack channel to address participants' questions after the tutorial. Instructions for applying for training access on a DOE supercomputer are available at: https://github.com/JuliaParallel/julia-hpc-tutorial-sc24
Tutorial
Parallel Programming Methods, Models, Languages and Environments
TUT
DescriptionLegion is a programming model designed for portable, scalable and high-performance applications on heterogeneous supercomputers. In this tutorial, participants will be introduced to Regent, a high-level programming language for the Legion programming model. The tutorial will be organized around teaching Regent “from the ground up”, beginning with the motivation for task-based programming, simple examples and hands-on exercises, and working up to advanced programming concepts. The examples and exercises using Regent will allow participants to progress quickly from being introduced to the basics of task-based programming to writing parts of and thoroughly understanding a non-trivial, self-contained Regent application at the end of the tutorial. The hands-on exercises will allow participants to run experiments on a cluster of machines with GPUs. A performance profiler will show participants the effect of different choices in mapping tasks and data onto complex machines. Overall, the tutorial will provide participants with an overview of task-based programming and performance tuning, as well as a starting point for developing their own task-based applications.
Birds of a Feather
TP
XO/EX
DescriptionThe Message Passing Interface (MPI) API is the dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. Last year, the MPI Forum published the latest version of the standard, MPI 4.1. We will take a look at the new features and discuss what they mean for users of MPI. We will also discuss new features targeted at upcoming versions of the MPI standard, in particular an ABI as well as new features for fault tolerance, malleability, and collective communication.
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
DescriptionAchieving performance generality for workloads across a wide variety of hardware is not a solved problem. With Mojo, we have attempted and proven at least one viable approach that combines compiler and programming language design to enable expert programmers to productively and consistently achieve performance goals for linear algebra workloads across CPU and GPU architectures. This talk will describe some of the compiler and language design principles behind this approach and programming techniques for building a library-based approach for achieving performance. This talk will discuss how Mojo's design allows all Mojo programmers to leverage the LLVM and MLIR ecosystems and how this can be composed with metaprogramming techniques to build high-performance libraries.
Birds of a Feather
TP
XO/EX
DescriptionThe NAIRR Pilot enables U.S. researchers to access resources needed for AI projects. The Pilot is a collaboration between federal agencies and public and private partners to establish a demonstration of a future NAIRR deployment by soliciting applications for various research and educational resources. SC24 attendees will get to know the Pilot's offerings in four brief segments, featuring experts who will give short remarks followed by Q&A from the audience. We will cover: (i) What is the NAIRR Pilot? (ii) What AI compute resources are available? (iii) What AI model APIs and datasets are provided? And (iv) community building and support.
Workshop
Distributed Computing
Education
Emerging Technologies
W
Exhibitor Forum
Emerging Technologies
Hardware Technologies
TP
XO/EX
DescriptionRISC-V is a fascinating new idea in CPU design. With a free instruction set specification, hardware designers can explore new ideas in CPU design and provide flexibility in accelerators of all kinds, without incurring costs to license a modifiable architecture. The standardized instruction set provides a solid base for compiler, operating system, and standard library development that can be leveraged by every hardware design. Initial CPUs based on the RISC-V instruction set were targeted at embedded computing. But recently, high-performance CPUs have become commercially available and truly high-performance clusters are now being planned.
This session, led by an HPC/AI expert, will invite researchers, innovators, and users to discuss their experiences with RISC-V, share where they see the market and design going as implementations from Asian and European companies soar, and present a realistic assessment of the state of products and the timeline for HPC- and AI-performant hardware and software.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionThis paper documents the development of a web-based tool designed to organise and present visual representations of performance, portability, and productivity (P3) data from previously published scientific studies. The P3 Explorer operates as both an open repository of scientific data and a data dashboard, providing visual heuristic analyses of performance portability and developer productivity, created using the Intel P3 Analysis library. The aim of the project is to create a community-led database of P3 studies to better inform application developers of alternative approaches to developing new applications targeting high performance on diverse hardware, with consideration of developer productivity.
Panel
Artificial Intelligence/Machine Learning
TP
Description"A tidal wave of papers looms on the horizon. Inquiries for the Panel: Are LLMs heralding a paradigm shift in the academic discourse, challenging conventional notions of idea expression, communication, and the evaluation of individuals primarily through paper and citation metrics? Do the current policies set forth by ACM/IEEE demonstrate prudence, or have they preemptively acted on this evolving landscape? With the proliferation of counterfeit or plagiarized papers, do they not only present a risk to corporate integrity but also potentially jeopardize national security? If the 'publish or perish' ethos is indeed waning, what new criteria will emerge to gauge scholarly impact and advancement? How must the review process evolve, both in the immediate and distant future, to accommodate these shifting tides? Finally, in light of these transformations, is the traditional format of academic papers still indispensable, or are alternative avenues for disseminating knowledge on the horizon?" — by ChatGPT-3.5
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionThe Parallel Research Kernels (PRK) were created to be simple yet still interesting implementations of fundamental algorithms in high-performance computing that could be used to evaluate and improve hardware and software systems. In this talk, I will describe the design methodology of the PRK and their use in multiple contexts. First, we consider the viability of alternative distributed programming models as compared to multiple flavors of MPI, especially their sensitivity to message granularity. Second, we demonstrate the use of the PRK to evaluate programming languages, from Python and C++17 to Rust and Julia. Finally, we use the PRK to measure the behavior of accelerators and heterogeneous memory systems.
The PRK were created by Tim Mattson and Rob Van der Wijngaart; this talk is based on the collective efforts of more than a dozen contributors.
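For readers unfamiliar with the PRK, the sketch below shows the flavor of one of the simplest kernels, an nstream-style triad, written here in plain NumPy purely as an illustration; it is not taken from the PRK distribution, and the official kernels define their own timing and validation rules.

```python
import time
import numpy as np

def nstream(n=10_000_000, iterations=10, scalar=3.0):
    """Triad-style kernel in the spirit of the PRK 'nstream' benchmark:
    A[i] += B[i] + scalar * C[i], with a rough bandwidth estimate."""
    A = np.zeros(n)
    B = np.full(n, 2.0)
    C = np.full(n, 2.0)
    t0 = time.perf_counter()
    for _ in range(iterations):
        A += B + scalar * C
    elapsed = time.perf_counter() - t0
    moved = 4.0 * A.nbytes * iterations   # read A, B, C and write A each pass
    print(f"nstream rate: {moved / elapsed / 1e9:.1f} GB/s")

nstream()
```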
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Birds of a Feather
TP
XO/EX
DescriptionArtificial intelligence (AI), including machine learning, is increasingly contributing to most areas of scholarship. The AI lifecycle includes gathering and preparing data, selecting a method, training a model (creating and evaluating it), validating and using it, and storing and sharing it for reproducibility and reuse, all in the context of scholarly goals and ethics, privacy, and fairness. HPC centers have a key role to play in terms of the data, the models, and the tasks. This BoF will bring together HPC center staff and users to discuss how centers support AI today and how they might in the future.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionDifferential Privacy has become the go-to approach for protecting sensitive information in data releases and learning tasks that are used for critical decision processes. For example, census data is used to allocate funds and distribute benefits, while several corporations use machine learning systems for financial predictions, hiring decisions, and more. While differential privacy provides strong guarantees, we will show that it may also induce biases and fairness issues in downstream decision processes. In this talk, we delve into the intersection of privacy, fairness, and decision processes, with a focus on understanding and addressing these fairness issues. We first provide an overview of Differential Privacy and its applications in data release and learning tasks. Next, we examine the societal impacts of privacy through a fairness lens and present a framework to illustrate what aspects of the private algorithms and/or data may be responsible for exacerbating unfairness. Finally, we propose a path to partially mitigate the observed fairness issues and discuss challenges that require further exploration.
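As a minimal, illustrative companion to this overview (not material from the talk), the sketch below releases a counting query under epsilon-differential privacy via the Laplace mechanism; the dataset, predicate, and epsilon value are assumptions chosen only for the example.

```python
import numpy as np

def laplace_count(data, predicate, epsilon, rng=None):
    """Release a counting query under epsilon-differential privacy.
    A counting query has L1 sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace noise of scale 1/epsilon
    suffices."""
    rng = rng or np.random.default_rng()
    true_count = sum(1 for record in data if predicate(record))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Illustrative use: a private count of incomes above a threshold.
incomes = [32_000, 54_000, 71_000, 18_000, 95_000]
print(laplace_count(incomes, lambda x: x > 50_000, epsilon=0.5))
```

Noisy counts like this one are exactly the kind of private statistic whose downstream use (e.g., in fund allocation) can introduce the fairness effects the talk examines.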
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionThe UpDown system’s goal is to reduce programming complexity AND improve scalability on graph computations. UpDown is codesigned for fine-grained parallelism and efficient global communication; early performance studies using graph kernels indicate a single UpDown node can outperform a multicore CPU by up to 100x. The 16,384-node UpDown system achieves strong scaling on small graphs, with projected performance 1,000x that of today’s supercomputers and clouds for PageRank, Triangle Counting, and more. Iso-power comparisons are even more favorable.
The UpDown system architecture is a significant departure. First, UpDown’s 1-cycle thread creation and management, combined with hardware scheduling, enables trillions of fine-grained computations (<25 instructions, MIMD) to achieve high hardware efficiency. Second, UpDown provides efficient short messages. Features include 1-cycle message sends, a NIC-less design (for scalability), and split-transaction memory access that enables software-controlled, intelligent data movement. Third, UpDown has >4 TB/s per node of all-to-all network bandwidth and a global memory access latency of 1.1 µs. Radically higher network capability opens new spaces for graph algorithms and data structures, as the system can be programmed as a flat, global-memory machine. Finally, UpDown has massive memory bandwidth (10 TB/s per node and 150 PB/s system-wide). Together, these capabilities enable high-level programming of vertex and edge parallelism for scalable high performance. The UpDown project is part of IARPA’s AGILE research program.
Workshop
Distributed Computing
Education
Emerging Technologies
W
DescriptionHigh-performance computing (HPC) clusters are powerful tools that can support a wide range of research projects across all disciplines. However, HPC clusters can be complex and difficult to use, limiting their accessibility to researchers without a strong technical background. This study used a mixed-methods approach to investigate ways to make HPC clusters more accessible to researchers from all disciplines on a university campus.
A usability study of 19 university researchers was conducted to understand the needs of HPC users and identify areas where the user experience could be improved. Our findings reveal the need to build a customized graphical-user-interface HPC management portal to serve users’ needs, and to invest in workforce development by introducing an academic credit-based high-performance computing course for students and, in partnership with other faculties, special programs (e.g., Student Cluster Competitions) that would draw more student interest.
Birds of a Feather
TP
XO/EX
DescriptionThe substantial and continuous public funding for the European HPC ecosystem has produced technologies ready to lead the global HPC efforts in the areas of dynamic modular supercomputing architecture (dMSA) as well as workflow and data/storage management. These areas present ample opportunities for fruitful international collaboration.
This BoF, organized by the European HPC Technology Platform (ETP4HPC), will discuss the lessons learned in developing these technologies, reason about how to leverage synergies to benefit the global HPC ecosystem, and brainstorm specific projects to start a mutually beneficial collaboration.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF brings together a panel of international HPC software development experts to discuss tools and techniques for enhancing HPC software developer productivity. Key themes include DevOps, programming models (OpenMP, RAJA, Kokkos, etc.), and development tools. Additionally, the integration of AI across these themes will be explored (e.g., is GitHub Copilot useful in HPC software development?). Attendees will gain insights into overcoming productivity challenges unique to HPC. This session is ideal for HPC software developers seeking to improve their productivity through innovative tools and shared expert experiences. Attendees will be able to vote on questions for the panel. LLNL-ABS-866156.
Tutorial
Debugging and Correctness Tools
Emerging Technologies
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Numerical Methods
TUT
DescriptionFloating-point arithmetic is central to HPC and ML, with the variety of number formats, hardware platforms, and compilers exploding in this era of heterogeneity. This unfortunately increases the incidence of numerical issues including exceptions such as Infinity and NaN that can render the computed results unreliable or change control-flows, introduces excessive rounding that breaks the assumptions made in the numerical algorithm in use, and overall causes result non-reproducibility when code is optimized or ported across platforms. In this tutorial, we present three novel tools: (1) GPU-FPX, which exposes silent exceptions in NVIDIA GPU computations, (2) Ciel, which pinpoints where compilers silently over-optimize and cause non-reproducibility, and (3) Herbie, which improves the accuracy of a programmer-written expression, significantly reducing rounding error or eliminating exceptions. This half-day tutorial will consist of (1) presentations of floating-point basics, (2) demos of all our numerical debugging tools, presenting their principle of operation and ideal usage contexts, and (3) plenty of time for Q/A, especially on using these tools within the organization of the attendees. New and emerging technologies such as Tensor Cores will be introduced by showing how to test for non-portability of codes across them.
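To make the class of problems these tools target concrete, here is a small, tool-independent example of the kind of rounding issue a Herbie-style rewriter fixes; it is not drawn from the tutorial materials.

```python
import math

x = 1e16
# Naive form: sqrt(x + 1) - sqrt(x) suffers catastrophic cancellation,
# because x + 1 rounds back to x in double precision, so this prints 0.0.
naive = math.sqrt(x + 1.0) - math.sqrt(x)

# Algebraically equivalent rewrite that avoids the cancellation and
# returns the correct value, roughly 5e-9.
stable = 1.0 / (math.sqrt(x + 1.0) + math.sqrt(x))

print(naive, stable)
```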
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionModern out-of-order RISC-V CPUs have complex mechanisms, making microarchitecture-level performance analysis challenging. Despite the growing number of Performance Monitoring Unit (PMU) counters, interpreting their data requires deep architectural knowledge. This paper introduces a Top-down Microarchitecture Analysis (TMA) approximation to analyze application performance on RISC-V CPUs. TMA classifies performance issues into four categories by calculating metrics that reflect their proportions using predefined formulas and PMU events. We present the results of applying this method to analyze SPEC CPU2006 benchmarks on a SiFive RISC-V processor. This work is an initial step in analyzing RISC-V CPUs using TMA Level 1. The contributions of this research are threefold: (1) designing and implementing TMA for a RISC-V CPU with clear metric definitions; (2) proposing test cases and methods to verify TMA metrics and the PMU implementation; and (3) enabling software developers to profile workloads without requiring extensive microarchitecture knowledge.
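For orientation, the Level-1 breakdown that such an approximation targets follows the classic Top-down formulation; the sketch below shows those textbook formulas with generic placeholder event counts, not the SiFive PMU events or the metric definitions used in the paper.

```python
def tma_level1(total_slots, slots_issued, slots_retired,
               fetch_bubbles, recovery_bubbles):
    """Classic Top-down Level-1 breakdown: total_slots = issue_width * cycles.
    Returns the fraction of pipeline slots in each of the four categories."""
    frontend_bound = fetch_bubbles / total_slots
    bad_speculation = (slots_issued - slots_retired + recovery_bubbles) / total_slots
    retiring = slots_retired / total_slots
    backend_bound = 1.0 - frontend_bound - bad_speculation - retiring
    return {"frontend_bound": frontend_bound,
            "bad_speculation": bad_speculation,
            "retiring": retiring,
            "backend_bound": backend_bound}

# Illustrative numbers for a 4-wide core over 1,000 cycles.
print(tma_level1(total_slots=4_000, slots_issued=2_600,
                 slots_retired=2_200, fetch_bubbles=600,
                 recovery_bubbles=100))
```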
Birds of a Feather
TP
W
TUT
XO/EX
DescriptionThe TOP500 list of supercomputers serves as a “Who’s Who” in the field of High Performance Computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved into a major source of information about trends in HPC. The 64th TOP500 list will be published in November 2024, just in time for SC24.
This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. This BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.
Paper
Accelerators
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Performance Optimization
TP
DescriptionGraph Transformer is a new architecture that surpasses GNNs in graph learning. While inspiring algorithmic advances have emerged, their practical adoption is still limited, particularly on real-world graphs involving up to millions of nodes. We observe that existing graph transformers fail on large-scale graphs mainly due to heavy computation, limited scalability, and inferior model quality. Motivated by these observations, we propose TorchGT, the first efficient, scalable, and accurate graph transformer training system. TorchGT optimizes training at different levels. At the algorithm level, by harnessing graph sparsity, TorchGT introduces a Dual-interleaved Attention that is computation-efficient while maintaining accuracy. At the runtime level, TorchGT scales training across workers with a communication-light Cluster-aware Graph Parallelism. At the kernel level, an Elastic Computation Reformation further optimizes the computation by dynamically reducing memory access latency. Extensive experiments demonstrate that TorchGT boosts training by up to 62.7x and supports graph sequence lengths of up to 1M.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionPerformance-portable programming frameworks provide abstractions for parallel execution to allow easily porting an application to multiple backend programming models, such as CUDA, HIP, and OpenMP. However, programs may still have portability bugs that manifest only on specific backends. Traditional testing is ineffective in discovering these bugs, as it would require concrete execution on all supported hardware configurations for a potentially infinite set of inputs. To mitigate this issue, we focused on a specific programming framework, Kokkos, and identified several categories of common portability bugs. We then developed Klokkos, a static analysis approach based on symbolic execution that can run on commodity hardware, before execution on supercomputers. As a proof-of-concept, we ran Klokkos on examples encoding the identified bugs. Our results show that Klokkos is effective, efficient, and precise: it detected all the considered bugs, quickly, and without any false positives. Although preliminary, the results motivate further research in this direction.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionFloating-point precision tuning (FPPT) searches target programs for computations amenable to reduced-precision, thereby trading accuracy for performance. FPPT does so by searching the mixed-precision design space for program variants maximizing performance constrained by some correctness criteria. Given their computational intensity and complexity, weather and climate models present prime FPPT targets. However, past attempts at FPPT in this domain are limited by manual efforts of domain experts (tedious) and low-precision emulation (obscures speedup). Automated and performance-guided techniques are naturally of interest but have not been explored at this scale. Facilitated by a bespoke Fortran transformation tool, this paper presents a first-of-its-kind case study: based on the varied results of applying FPPT to computational hotspots in three real-world weather and climate models (MPAS-A, ADCIRC, and MOM6), we identify and discuss important lessons learned and offer insights into best practices for feasible FPPT that targets large programs in complex domains such as this.
ACM Gordon Bell Finalist
TP
DescriptionWe exploit the widening margin in tensor-core performance between FP64/FP32/FP16/INT8 on NVIDIA Ampere GPUs and FP64/FP32/FP16/FP8/INT8 on NVIDIA Hopper GPUs to boost the performance of output accuracy-preserving mixed-precision computation of Genome-Wide Association Studies (GWAS) of 305,000 patients from the UK Biobank, the largest-ever GWAS cohort studied for genetic epistasis using a multivariate approach. Tile-centric adaptive-precision linear algebraic techniques, motivated by reducing data motion, gain enhanced significance with low-precision GPU arithmetic. At the core of Kernel Ridge Regression (KRR) techniques for GWAS lie compute-bound cubic-complexity matrix operations that inhibit scaling to aspirational dimensions of the population, genotypes, and phenotypes. We accelerate KRR matrix generation by redesigning the computation for Euclidean distances to engage INT8 tensor cores while exploiting symmetry. We accelerate solution of the regularized KRR systems by deploying a new four-precision Cholesky-based solver, which, at 1.805 mixed-precision ExaOp/s on a nearly full Alps system, outperforms the state-of-the-art CPU-only REGENIE GWAS software by five orders of magnitude.
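For context, the core KRR operation being accelerated can be written in a few lines of plain double-precision NumPy; the sketch below is only a uniform-precision illustration of the kernel generation and Cholesky solve, not the paper's tile-centric mixed-precision solver, and the data shapes are arbitrary.

```python
import numpy as np

def krr_fit(X, y, lam=1e-2, gamma=0.1):
    """Gaussian-kernel ridge regression: solve (K + lam*I) alpha = y
    via a Cholesky factorization."""
    sq_norms = np.sum(X * X, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    K = np.exp(-gamma * np.maximum(sq_dists, 0.0))   # clamp tiny negatives
    L = np.linalg.cholesky(K + lam * np.eye(len(y)))
    return np.linalg.solve(L.T, np.linalg.solve(L, y))   # alpha

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))    # 500 samples, 64 features
y = rng.standard_normal(500)
print(krr_fit(X, y).shape)
```

Both the quadratic-cost distance/kernel generation and the cubic-cost factorization visible here are the compute-bound steps that the paper moves onto low-precision tensor cores.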
Paper
Distributed Computing
Middleware and System Software
Programming Frameworks and System Software
Resource Management
TP
DescriptionThe primary bottleneck of blockchain is shifting from consensus to execution due to recent advances in DAG-based consensus algorithms supporting over 100k TPS. Many blockchain systems segregate execution from ordering, missing the opportunity to harness potential parallelism in consensus-produced batches.
In this paper, we propose a new deterministically orderable concurrency control algorithm, OptME, which improves the performance of the execution phase by exploiting inherent parallelism among transactions. The algorithm analyzes transaction dependencies to extract parallelism and determines the total order of transaction execution.
OptME consists of three steps: (1) building a transaction dependency graph, (2) generating a parallel execution schedule, and (3) executing transactions based on the schedule. We employ several optimizations, including parallel dependency graph construction, early abort detection, and efficient reordering with an optimistic assumption. Our evaluation demonstrates that OptME achieves up to 350k TPS and outperforms a state-of-the-art concurrency control algorithm, even under high contention scenarios.
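The general shape of steps (1) and (2) can be sketched in a few lines; the following toy schedule builder is only an illustration of dependency-graph construction and level-wise parallel scheduling over read/write sets, not the OptME algorithm or its optimizations.

```python
from collections import defaultdict

def build_schedule(txns):
    """txns: ordered list of (read_set, write_set) pairs.
    Adds a dependency i -> j whenever an earlier transaction i conflicts
    with j on some key (at least one of the two writes it), then layers
    transactions into levels that can execute in parallel."""
    level = [0] * len(txns)
    touched = defaultdict(list)                 # key -> earlier transaction ids
    for j, (reads, writes) in enumerate(txns):
        for key in reads | writes:
            for i in touched[key]:
                if key in txns[i][1] or key in writes:   # conflict
                    level[j] = max(level[j], level[i] + 1)
            touched[key].append(j)
    levels = defaultdict(list)
    for j, lvl in enumerate(level):
        levels[lvl].append(j)
    return [levels[l] for l in sorted(levels)]

# T0 writes x, T1 reads x (must follow T0), T2 writes y (independent).
print(build_schedule([({"a"}, {"x"}), ({"x"}, set()), (set(), {"y"})]))
# -> [[0, 2], [1]]
```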
Birds of a Feather
TP
XO/EX
DescriptionLarge Language Model (LLM) based coding assistants have already proven to be useful tools for aiding software developers. Adapting these tools in HPC software development will greatly improve the quality and time-to-development of scientific codes. This will create an environment where researchers can devote more attention to scientific challenges and less to software development intricacies, driving scientific progress forward. However, LLM-based tools can be difficult to use and present many dangers in the form of over-reliance, inaccuracies, and intellectual property. This BoF provides a place for the community to discuss the use of LLMs for HPC software development.
Doctoral Showcase
Posters
TP
DescriptionAchieving *performance*, *portability*, and *productivity* for data-parallel computations (e.g., MatMul and convolutions) has emerged as a major research challenge. The complex hardware design of contemporary parallel architectures, including GPUs and CPUs, requires advanced program optimizations to fully exploit the performance potential of these architectures. Furthermore, due to the diverse hardware landscape, it has proven challenging to achieve (performance) portability: different architectures require different kinds of optimizations, thereby posing challenging, often even contradictory, requirements on code optimization. Also, the complexity of achieving performance and portability must be hidden behind a user-productive programming interface to make programming modern architectures tractable.
This thesis introduces a novel approach to code *generation* and *optimization* for data-parallel computations targeting modern parallel architectures. The ultimate goal of our approach is to simultaneously achieve *performance*, *portability*, and *productivity*, in one combined approach, which is identified as a major research challenge.
The first part of this thesis introduces the algebraic formalism of Multi-Dimensional Homomorphisms (MDH) — a novel approach to generating code that can be fully automatically optimized (auto-tuned) for a particular target architecture and characteristics of the input and output data (such as size and memory layout); our code generation approach is hidden behind a productive user interface that expresses a wide range of data-parallel computations.
The second part of this thesis introduces the Auto-Tuning Framework (ATF) for automatically optimizing parameterized program code (as generated by our MDH approach). In contrast to existing auto-tuners, ATF supports so-called constrained tuning parameters which are ubiquitous in modern parallel programming.
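To illustrate what "constrained tuning parameters" means in practice, the sketch below enumerates a toy search space in which the valid values of one parameter depend on another; the parameter names, constraints, and cost function are invented for the example and do not reflect ATF's actual interface.

```python
def constrained_space(matrix_size):
    """Yield (tile, threads) configurations with interdependent constraints:
    the tile size must divide the matrix size, and the thread count must
    not exceed the number of tiles."""
    for tile in range(1, matrix_size + 1):
        if matrix_size % tile:
            continue
        for threads in (1, 2, 4, 8, 16):
            if threads <= matrix_size // tile:
                yield {"tile": tile, "threads": threads}

def auto_tune(matrix_size, cost):
    """Exhaustive search over the constrained space; a real auto-tuner
    applies smarter search strategies to the same kind of space."""
    return min(constrained_space(matrix_size), key=cost)

# Toy cost model standing in for a measured runtime.
print(auto_tune(64, cost=lambda c: abs(c["tile"] - 8) + 1.0 / c["threads"]))
```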
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
Best Student Paper Finalist
DescriptionThis paper describes the deployment and operational experience of a novel incentive-based power-control strategy on the Fugaku supercomputer. Our incentive-based program, termed Fugaku Points, provides knobs to users to apply power control functions to improve the overall power efficiency of the supercomputer toward achieving HPC sustainability in terms of its environmental implications. We also discuss new operational opportunities, challenges, and future directions.
Exhibits
SCinet
TP
XO/EX
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionScientific discoveries increasingly depend on leveraging computation and data at scale among and across ecosystems. Scientific workflow tools provide a construct to manage the computation and data over distributed and large-scale infrastructure. In the panel discussion, we will detail how the WORKS community should take a synergistic approach bringing together workflows, data, artificial intelligence, and humans, grounded in transparency and trust, to advance scientific discoveries. We propose two ideas for discussion: comprehensive adoption of UX methods within the workflow community and, in tandem, the application of higher level UI design patterns to streamline the creation of science-tailored workflow UIs while preserving known successful interaction patterns and design principles. We detail our current efforts in the next sections and aim to use the panel to foster discussion in the community.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionThis paper addresses the challenges of optimizing task scheduling for a distributed, task-based execution model in OpenMP for cluster computing environments. This work extends the OpenMP Cluster (OMPC) task scheduling mechanisms and presents three key contributions: first, the refactoring of the OMPC runtime to unify task scheduling across devices and hosts; second, the optimization of the HEFT-based scheduling algorithm to ensure efficient task execution in distributed environments; and third, an extensive evaluation of the Work Stealing and HEFT scheduling mechanisms on real-world clusters. This work provides a significant step toward improving distributed task scheduling in cluster computing, offering insights and incremental advancements that support the development of scalable, high-performance applications. Results show up to 24% improvement in scheduling time while opening the door to further extensions of the scheduling methods.
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
DescriptionThe performance of large-scale graph analytics is limited by the capacity and performance of the memory subsystem on the platforms on which they execute. In this paper, we first discuss the limitations of existing approaches to scaling graph processing, and describe how they can be addressed via the use of disaggregated solutions with near-data processing (NDP) capabilities. Using observations from experimental analysis of the tradeoffs for different types of graphs and analytics kernels, we then identify the systems-level mechanisms that will be required by future graph analytics frameworks for disaggregated NDP architectures.
Paper
Accelerators
Applications and Application Frameworks
Modeling and Simulation
Numerical Methods
Task Parallelism
TP
DescriptionExperimental development of gate-all-around silicon nanowire field-effect transistors (NWFETs), a viable replacement for FinFETs, can be complemented by technology computer-aided design. This requires the availability of advanced device simulators relying on a quantum transport (QT) approach without any empirical parameters as inputs. Concretely, all material properties should be described from first-principles, and the whole physics at play should be accurately modeled, particularly the strong electron-electron interactions occurring in highly confined structures such as NWFETs. To shed light on these many-body effects, we implement them within the self-consistent GW approximation into an ab initio QT solver called QuaTrEx, based on density functional theory and the Non-equilibrium Green's Function formalism. We then simulate transistors made of up to 10,560 atoms on the LUMI supercomputer's GPU partition, reaching a parallel efficiency of 74% (60%) in weak (strong) scaling and an overall computational performance of 69.3 Pflop/s in double precision on 1,800 nodes.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
DescriptionTo increase the dependability and portability of scientific data analysis workflows (DAWs), recent work has proposed contract-driven design of DAWs, providing verifiable expectations and obligations to ensure that tasks run in a proper environment and produce correct results.
However, the specification of suitable contracts is still left to the discretion of DAW developers, imposing labor-intensive manual work that likely hampers the widespread adoption of contracts in scientific practice. We report on work in progress toward a pipeline, empowered by Large Language Models, for automatically generating code contracts from logical workflow descriptions. We instantiate this pipeline within the workflow system Nextflow and evaluate its contract generation capabilities in an experiment using real-world Nextflow modules. Our findings indicate that we generate a substantial number of contracts that serve as a starting point for DAW developers. Our approach demonstrates potential in assisting domain scientists with contract-driven design of DAWs, laying the groundwork for its future adoption.
Paper
Middleware and System Software
Parallel Programming Methods, Models, Languages and Environments
Programming Frameworks and System Software
Resource Management
TP
DescriptionScientific workflows on High-Performance Computing (HPC) consist of multiple data processing and computing tasks with dependencies. Efficiently scheduling computing resources and multi-tier storage across workflow tasks is crucial for optimizing performance. Existing solutions often fall short in achieving the co-scheduling of computing and I/O resources and lack compatibility with HPC system software. In this paper, we introduce a performance model for scheduling workflows on HPC systems to enhance the understanding of workflow scheduling and facilitate the testing of the scheduling algorithm. Additionally, we propose THman, an open-source scientific workflow scheduler featuring our heuristic scheduling algorithm, Highest Contribution First (HCF). THman achieves online co-scheduling of computing and I/O resources for workflows and is designed to work with traditional HPC batch schedulers for high compatibility. We evaluate THman using simulated workloads and real-world workflow applications. Experimental results show that THman reduces workflow makespan by up to 30.9% compared to alternative methods.
Exhibitor Forum
Cloud Computing
Portability
TP
XO/EX
DescriptionCloud-based HPC empowers practitioners with more flexibility than traditional on-premise systems; on-demand, customizable compute resources can be placed close to big data, thus minimizing I/O bottlenecks and allowing users to explore cutting-edge hardware. However, deploying HPC applications to different environments while maintaining the same performance remains a significant barrier for the community. Here, we discuss an approach for automating application and performance portability that is based on infrastructure-as-code provisioning of cloud clusters coupled with “compute environment as code” for configuring package managers and containerized applications to fully utilize cloud resources. We benchmark two common MPI applications across different cloud service providers. For each case, we present how lessons learned and best practices at several levels of this HPC stack, including the configuration of high-performance networking and storage options, have enabled both application and performance portability.
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionQuantum computing promises many benefits to the computing world, given its efficiency over classical computers in certain problem spaces. However, the high noise and low logical qubit counts of contemporary noisy intermediate-scale quantum (NISQ) devices make the development and execution of quantum algorithms difficult. As such, developers and researchers have used classical simulation to prototype and validate their algorithms, often utilizing specialized classical hardware such as GPUs and FPGAs for their parallelization capabilities. We propose an optimized and scalable method for quantum simulation using the complex general matrix-vector product (GEMV) operation on the Cerebras wafer-scale engine (WSE) architecture. We experimentally determine the scalability of our method for differing qubit counts. Finally, we demonstrate viability and scalability by simulating practical quantum image processing circuits using our method.
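To show why GEMV is a natural primitive here, the sketch below applies a single-qubit gate to a statevector by forming the full operator and performing one complex matrix-vector product; this is a NumPy illustration only, and real simulators (including wafer-scale ones) avoid materializing the full matrix.

```python
import numpy as np

def apply_gate(state, gate, target, n_qubits):
    """Apply a single-qubit gate to qubit `target` of an n-qubit statevector
    via one complex GEMV on the full 2^n x 2^n operator."""
    op = np.array([[1.0]], dtype=complex)
    for q in range(n_qubits):
        op = np.kron(op, gate if q == target else np.eye(2))
    return op @ state

H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
n = 3
state = np.zeros(2**n, dtype=complex)
state[0] = 1.0                                   # |000>
print(np.round(apply_gate(state, H, target=0, n_qubits=n), 3))
```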
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionAs computing systems approach the limits of traditional silicon technology, the diminishing returns in performance per watt present a significant barrier to sustaining growth in HPC. From a large-scale scientific supercomputing facility point of view, we propose a multifaceted strategy toward specialized hardware and architectures that are optimized for energy efficiency in specific applications. We also emphasize the need for integrating energy-aware practices across all levels of HPC, from system design and software development to operational policies. We discuss strategic opportunities such as the adoption of application-specific accelerators, the development of energy-efficient algorithms, and the implementation of data-driven operational analytics. Our goal is to develop a comprehensive roadmap ensuring that future leadership systems at OLCF can meet scientific demands while operating within stringent energy budgets, thereby supporting sustainable computing growth.
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
DescriptionOpen science has been a central priority of U.S. federal research and policy goals in the 21st century. "Open science" is understood as an umbrella term covering various issues of "openness" in scientific practice and knowledge sharing, including democratic participation in scientific research (e.g., citizen science), equitable science communication, and fair intellectual property laws for digitized artifacts. This study will situate federal research agendas and policy frameworks for open science alongside the evolution and nationwide adoption of HPC resources. We also pay significant attention to AI/ML as a disruptive technology in both science/technology research and policy-making. The resulting contribution will be an analysis of the societal impacts of today's open science movement, grounded in an evaluation of risks and benefits posed by evolving scientific practices, paradigms, communities, and cultures.
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
DescriptionDensity Functional Theory (DFT) is used extensively in the computation of electronic properties of matter, with various applications. Approximating the exchange-correlation (XC) functional is the key to the Kohn-Sham DFT approach, the basis of most DFT calculations. The choice of this density functional approximation (DFA) depends crucially on the particular system under study, which has resulted in the development of hundreds of DFAs. Though the exact density functional is not known, researchers have discovered analytical properties of this exact functional. Furthermore, these exact conditions are used when designing DFAs. We present XCVerifier, the first approach for verifying whether a DFA implementation satisfies the DFT exact conditions. XCVerifier was evaluated on five DFAs from the popular Libxc library and seven exact conditions from recent work. XCVerifier was able to verify or find violations for a majority of the DFA/condition pairs, demonstrating the feasibility of using formal methods to verify DFA implementations.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionEmerging ML/AI hardware accelerators, like the 850,000 processor Cerebras Wafer-Scale Engine (WSE), hold great promise to scale up the capabilities of evolutionary computation. However, challenges remain in maintaining visibility into underlying evolutionary processes while efficiently utilizing these platforms' large processor counts. Here, we focus on the problem of extracting phylogenetic history. We present a tracking-enabled asynchronous island-based genetic algorithm (GA) framework for WSE hardware. Emulated and on-hardware GA benchmarks with a simple tracking-enabled agent model clock upwards of 1 million generations per minute for population sizes reaching 16 million. We validate phylogenetic reconstructions from these trials and demonstrate their suitability for inference of underlying evolutionary conditions. In particular, we demonstrate extraction of clear phylometric signals that differentiate adaptive dynamics. Kernel code implementing the island-model GA supports drop-in customization to support any fixed-length genome content and fitness criteria, benefiting further explorations within the evolutionary biology and evolutionary computation communities.
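As a point of reference for readers new to island-model GAs, here is a minimal synchronous version in plain Python; it is illustrative only, with an invented one-max fitness function, and omits the asynchrony, phylogeny tracking, and per-processing-element mapping that the poster's WSE framework provides.

```python
import random

def island_ga(n_islands=4, pop_size=32, genome_len=16, generations=200,
              migrate_every=20, seed=0):
    """Minimal synchronous island-model GA on a one-max (bit-counting) fitness."""
    rng = random.Random(seed)
    fitness = lambda g: sum(g)
    islands = [[[rng.randint(0, 1) for _ in range(genome_len)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(generations):
        for pop in islands:
            # Tournament selection followed by per-bit mutation.
            parents = [max(rng.sample(pop, 3), key=fitness) for _ in pop]
            pop[:] = [[b ^ (rng.random() < 0.01) for b in p] for p in parents]
        if gen % migrate_every == 0:
            # Ring migration: each island receives its neighbor's best genome.
            for i, pop in enumerate(islands):
                migrant = max(islands[(i + 1) % n_islands], key=fitness)
                pop[rng.randrange(pop_size)] = list(migrant)
    return max((max(pop, key=fitness) for pop in islands), key=fitness)

print(sum(island_ga()), "of 16 bits set in the best genome found")
```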
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionVision Transformer (ViT) is a powerful AI architecture for computer vision that is used by most imaging foundation models due to its effectiveness in discerning complex visual patterns across many tasks. However, training large-scale ViT foundation models requires considerable computing resources, leading to a significant energy footprint for training. For example, OpenAI’s Sora video generator model was trained on more than 10,000 NVIDIA H100 GPUs, and the training took more than a month on a supercomputer. The energy consumption for training Sora was equivalent to the total annual energy consumption of 300 US households. This project aims to co-design the scaling algorithm and the ViT architecture to achieve hardware-, modality-, and energy-conscious computing for ViT foundation models. We anticipate that our proposed training approaches can not only significantly improve energy efficiency and reduce carbon footprint, but also significantly improve computing efficiency and scalability, fostering an accelerated AI development cycle.
Birds of a Feather
TP
XO/EX
DescriptionThis BoF will bring together cyberinfrastructure (CI) professionals and CI professional leaders, largely from academia and national labs, to discuss how to cultivate the cyberinfrastructure professional workforce.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Description-
ACM Student Research Competition: Graduate Poster
Posters
TP
DescriptionHigh performance computing (HPC) clusters have traditionally relied on proprietary provisioning and management infrastructure. This can be problematic, especially with regard to ongoing security and maintenance for vendored systems.
As an alternative to this, the Los Alamos National Laboratory (LANL) leads development of the Open Composable Heterogeneous Application Management Infrastructure (OpenCHAMI) stack, which provides a modular suite of size- and platform-independent cluster management tools. A major barrier to the full deployment of OpenCHAMI at LANL is its lack of authentication for access to sensitive data, such as private SSH keys or service tokens. To resolve this, we implement and integrate a node authentication system, under which secret configuration data may be requested only by system processes or authorized users.
We present a containerized microservice-based authentication system for post-boot compute node configuration, based on the Canonical cloud-init platform. This system is optimized to minimize its impact on cluster boot speed.
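A minimal sketch of the idea of token-gated access to node secrets is shown below, using only the Python standard library; the endpoint layout, token handling, and secret names are hypothetical and are not the OpenCHAMI or cloud-init interfaces.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical per-node bearer tokens and secrets; a real deployment would
# issue tokens at boot and verify them against an identity service.
NODE_TOKENS = {"node01": "example-token"}
NODE_SECRETS = {"node01": {"ssh_host_key": "<redacted>", "service_token": "<redacted>"}}

class SecretsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        node = self.path.strip("/")
        token = self.headers.get("Authorization", "").removeprefix("Bearer ")
        if NODE_TOKENS.get(node) != token:
            self.send_response(401)          # unknown node or bad token
            self.end_headers()
            return
        body = json.dumps(NODE_SECRETS.get(node, {})).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), SecretsHandler).serve_forever()
```

In this toy setup, a node would fetch its secrets during post-boot configuration with a request such as curl -H "Authorization: Bearer example-token" http://127.0.0.1:8080/node01.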
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionDespite advancements in distributed computing libraries, performance challenges, such as data serialization and transfer, still persist. We focus on understanding data limitations within Dask, a versatile and popular Python library designed for distributed and parallel computing, and then investigate the potential of the pass-by-proxy paradigm implemented by ProxyStore to address these inefficiencies. By integrating ProxyStore, we streamline data flow in Dask applications, reducing the overheads associated with data serialization and scheduling.
Our approach evaluates the impact of proxies on data transfer times and overall computational efficiency. We find that our integration reduces task overheads by 5-6x on a real machine learning application.
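The kind of overhead being targeted can be seen with Dask alone: embedding a large object in every task versus storing it once and passing a lightweight reference. The sketch below uses Dask's own scatter for the reference-passing side purely as an illustration; it is not the ProxyStore integration described in the poster.

```python
import numpy as np
from dask.distributed import Client

def row_sums(arr):
    return arr.sum(axis=1)

if __name__ == "__main__":
    client = Client(processes=False)     # small local cluster for illustration
    big = np.random.default_rng(0).standard_normal((2_000, 2_000))

    # Pass-by-value: the array is serialized into the task itself.
    by_value = client.submit(row_sums, big)

    # Store once, pass a lightweight handle; pass-by-proxy generalizes this
    # idea across processes, nodes, and storage backends.
    handle = client.scatter(big)
    by_reference = client.submit(row_sums, handle)

    print(np.allclose(by_value.result(), by_reference.result()))
```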
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
DescriptionThe Mixture of Experts (MoE) model has emerged as a scalable solution for large-scale machine learning tasks, thanks to its dynamic expert selection. However, the gating mechanism that controls this selection, together with the all-to-all collectives it requires, can create significant computation and communication bottlenecks. In this talk, we present TurboMoE, a novel approach to accelerating MoE model training. TurboMoE employs innovative kernel-fusion and data-layout transformations to streamline the gating process, along with a new parallelization layout that minimizes communication overhead. We also present a re-engineered MoE architecture, employed in Snowflake’s Arctic model, that enables overlapping communication with parallel computation, leading to a more efficient training process.
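For readers unfamiliar with MoE gating, the sketch below shows a generic top-k gate in NumPy; it illustrates the routing decision that drives the all-to-all dispatch and is not TurboMoE's fused implementation (the value of k, the shapes, and the weights are arbitrary assumptions).

```python
import numpy as np

def top_k_gate(x, w_gate, k=2):
    """Score each token against every expert, keep the k best experts per
    token, and renormalize their weights with a softmax over the kept scores.
    The resulting routing table is what the all-to-all dispatch realizes."""
    logits = x @ w_gate                          # [tokens, experts]
    top = np.argsort(-logits, axis=1)[:, :k]     # k best experts per token
    rows = np.arange(x.shape[0])[:, None]
    scores = np.exp(logits[rows, top])
    gates = np.zeros_like(logits)
    gates[rows, top] = scores / scores.sum(axis=1, keepdims=True)
    return gates, top

rng = np.random.default_rng(0)
tokens, d_model, n_experts = 8, 16, 4
gates, routing = top_k_gate(rng.standard_normal((tokens, d_model)),
                            rng.standard_normal((d_model, n_experts)))
print(routing)    # which experts each token is routed to
```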
Birds of a Feather
TP
XO/EX
DescriptionThis Birds of a Feather session, “Two Worlds Collide: Trustworthiness and Energy Efficiency for Coupled HPC+AI Simulation,” is the third installment of a series started in 2021 aimed at discussing and brainstorming solutions for a new paradigm in HPC: the coupling of simulation with artificial intelligence (AI). In this installment, we continue our discussions on needs, use cases, testing, and reproducibility, and add a new focus on energy efficiency: energy reduction from speedups must be assessed together with energy costs of training campaigns, which can be costly. How can we provide transformative scientific discoveries, while delivering efficiency and correctness assurance?
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionLLM inference includes two phases: a prefill phase and a decode phase. The prefill phase processes all input tokens simultaneously to generate the first token. The decode phase then generates the subsequent tokens one after another until generation either meets a termination condition or reaches the maximum length. To avoid recomputation, the Key-Value (KV) cache has become the standard approach for storing previously computed keys and values. Throughout LLM inference, the KV cache memory footprint grows linearly with context length and batch size, easily exhausting the GPU memory of an instance. State-of-the-art LLM inference systems usually use recomputation or swapping to handle KV cache overflow, and both strategies introduce overhead. However, the overhead of these strategies and the resource utilization over time during LLM inference have not been explored. This work aims to fill this gap by quantifying the overhead of recomputation and swapping and by analyzing resource utilization during LLM inference to derive insights.
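For intuition, the linear growth of the KV cache can be estimated directly from model and batch parameters; the numbers below are illustrative and not taken from this work:
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    # Keys and values (factor of 2) are cached per layer, per head, per token.
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class model: 32 layers, 32 KV heads, head dimension 128, fp16 cache
size = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=8)
print(f"{size / 2**30:.0f} GiB")  # 16 GiB, growing linearly with context length and batch size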
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionOn high-performance computing systems, multiple concurrent workloads may read and write vast amounts of data through shared storage servers, so competition for I/O resources between workloads is inevitable. Previous work has thoroughly recognized the impact of such competition-induced resource contention, highlighting its potential to significantly degrade the performance of individual applications. However, no prior work has investigated the quantitative impact of inter-application I/O contention on individual applications. In this work, we first illustrate how the dynamics of I/O interference depend on I/O patterns and system status. We then propose a framework for collecting fine-grained I/O traces from applications together with concurrent server-side metrics, and we train a neural network to accurately predict the existence of I/O interference and its potential impact. Our results show that our model can accurately predict the impact of I/O interference, with F1 scores exceeding 90% for both synthetic benchmarks and real-world applications.
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
DescriptionModern GPU systems are constantly evolving to meet the needs of computing-intensive applications in scientific and machine learning domains. However, there is typically a gap between hardware capacity and achievable application performance. This work aims to provide a better understanding of the Infinity Fabric interconnects on AMD GPUs and CPUs. We propose a test and evaluation methodology for characterizing performance of data movements on multi-GPU systems, stressing different communication options on AMD MI250X GPUs, including point-to-point and collective communication, and memory allocation strategies between GPUs, as well as the host CPU. In a single-node setup with four GPUs, we show that direct peer-to-peer memory accesses between GPUs and utilization of the RCCL library outperform MPI-based solutions in terms of memory/communication latency and bandwidth. Our test and evaluation method serves as a base for validating memory and communication strategies on a system and improving applications on AMD multi-GPU computing systems.
Paper
Data Movement and Memory
Performance Evaluation and/or Optimization Tools
Resource Management
State of the Practice
TP
DescriptionScientific experiments are producing unprecedented volumes of data with real-time High-Performance Computing (HPC) needs. Understanding and ensuring efficient data movement in these emerging data-intensive workloads is becoming critical for successful workflow execution. The need for end-to-end infrastructure that integrates compute, network, and storage resources across facilities is giving rise to a new integrated-infrastructure paradigm. In this paper, we present an extensive analysis of three years of network traffic data from NERSC, identify critical data-movement trends, and detect bottlenecks that significantly curtail transfer performance. Our results show that data movement patterns have shifted over the three years and that the current infrastructure cannot sufficiently handle competing transfers, leading to up to 30% throughput degradation for individual flows. In addition, we provide design recommendations for data movement management in future integrated research infrastructures that aim to reduce data transfer latency and, with it, the overall time to scientific results.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionPower is a critical limiting factor in supercomputing as systems scale to exascale levels. To advance scientific computing, supercomputers must operate efficiently under limited power budgets. Power-aware scheduling can help by enforcing power management strategies, but this requires a deep understanding of application power behavior, especially on modern GPU-centric supercomputers. This study examines the power behavior of VASP, a leading HPC application, on the Perlmutter A100 GPU system at NERSC. We explore how VASP’s power usage changes with various inputs and parallelism and assess its response to power capping. We find that VASP’s power usage varies significantly with different workloads, more so than with parallel concurrency. Additionally, power capping GPUs to 50% of their Thermal Design Power can be applied to VASP workloads with less than a 10% performance loss.
These findings shed light on the feasibility and effectiveness of power-aware scheduling based on application power profiles on HPC systems.
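As a concrete illustration of the capping level studied, the per-GPU power limit of an A100 (400 W TDP) could be halved with nvidia-smi, e.g. from a job prolog; this is a generic sketch, not a setting recommended by the paper:
import subprocess

TDP_WATTS = 400        # NVIDIA A100 thermal design power
cap = TDP_WATTS // 2   # 50% cap, the level reported to cost VASP less than 10% performance

for gpu in range(4):   # assuming four GPUs per node
    # Requires administrative privileges; typically issued from a scheduler prolog.
    subprocess.run(["nvidia-smi", "-i", str(gpu), "-pl", str(cap)], check=True)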
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe image is generated wholly from code written in Python (version 3.11.5) using the visualization library "matplotlib" (version 3.8.3) to programmatically and procedurally define the design. The underlying code, along with the generated output design, has been open-sourced under the CC BY 4.0 license and is available in the public GitHub repository "high-res-art" under the artist's personal space (https://github.com/sadielbartholomew/high-res-art/blob/main/inspired_by_le_parc_hi_res.py). High-performance computing was indispensable for refining the parameters encoding the precise design, in particular the radii of the two aligned wedges forming the element under repeated rotation, the number of patches per side, and the rotational array. The Slurm workload manager on the UK's JASMIN supercomputer was used to batch-process configurations of these parameters, starting from exploratory values; over several iterations of inspecting the generated outcomes and homing in on parameter sets that produced promising results, this design finally emerged as the visual favorite.
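For readers curious about the general technique, a heavily simplified sketch in the same spirit might look like the following; the parameter values are illustrative and are not the artist's published ones:
import matplotlib.pyplot as plt
from matplotlib.patches import Wedge

n = 12  # patches per side (illustrative)
fig, ax = plt.subplots(figsize=(8, 8))
for i in range(n):
    for j in range(n):
        angle = (i * n + j) * 7.5  # a steadily increasing rotational offset
        # Two aligned wedges of different radii form the repeated element.
        ax.add_patch(Wedge((i, j), 0.45, angle, angle + 180, width=0.12, color="navy"))
        ax.add_patch(Wedge((i, j), 0.30, angle + 180, angle + 360, width=0.12, color="crimson"))
ax.set_xlim(-1, n)
ax.set_ylim(-1, n)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("wedge_grid.png", dpi=300)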
Birds of a Feather
TP
XO/EX
DescriptionIn order to exploit the capabilities of new HPC systems and to meet their demands in scalability, communication software needs to scale on millions of cores and support applications with adequate functionality. Unified Communication X (UCX) is a collaboration between industry, national labs, and academia that provides a unified open-source framework.
The UCX project is managed by the UCF Consortium (http://www.ucfconsortium.org/) and includes members from LANL, ANL, The Ohio State University, AMD, ARM, NVIDIA and more. This session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Paper
Unlocking High-Performance with Low-Bit NPUs and CPUs for Highly Optimized HPL-MxP on Cloud Brain II
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
TP
DescriptionMixed-precision computation is crucial for artificial intelligence and scientific computing applications. However, as novel chips with innovative architectures emerge, harnessing their computational capabilities presents significant challenges. While existing algorithms for the HPL-MxP LU factorization excel on homogeneous systems, they often encounter difficulties on specialized heterogeneous architectures. This deficiency arises from inadequate optimization for computation, memory access, and communication, hindering effective mixed-precision acceleration. This work introduces an algorithm-hardware co-optimization approach for LU factorization on specialized NPUs and CPUs, leveraging their unique architectures. A novel multi-iteration fusion method for general matrix multiplication is introduced, strategically designed to maximize on-chip L1 buffer utilization and effectively overcome the notorious "memory wall". Additionally, a multi-stage, multi-level heterogeneous pipeline for LU factorization in an accelerator-CPU cloud environment is presented, where compute-intensive matrix multiplications are offloaded to NPUs while CPUs handle the remaining tasks. The co-optimization approach fosters deep collaboration between CPUs and accelerators, thereby unlocking enhanced performance.
Exhibits
Flash Session
TP
XO/EX
DescriptionIn 2023, global electric vehicle (EV) sales reached a record 14 million and are projected to rise to 17 million by the end of 2024, with one in five new cars sold worldwide expected to be an EV. This shift is essential, as transportation accounts for about 20% of global CO2 emissions, and innovations like Urban Air Mobility (UAM) will enhance electric transport's role in combating climate change. However, advancing electric transportation hinges on overcoming two major battery-related challenges.
First, the exploration of battery materials has been limited; the industry has studied only around 1,000 unique small molecules over the past 30 years, despite the existence of approximately 10^11 potential candidates. To identify ideal electrolyte materials, we must leverage advanced in-silico chemistry techniques like DFT calculations and molecular dynamics simulations.
Second, ensuring manufacturing quality and safety at scale is a daunting task, as gigafactories produce up to 1 million cells daily but often conduct thorough quality checks on only 1% of them. This may allow defective cells to go unnoticed, compromising safety. Managing trillions of cycles of lifecycle data requires advanced GPUs like the H100 for computation-intensive tasks, such as real-time CT rendering, ensuring high standards of quality assurance in EV battery production. Addressing these challenges is crucial for the successful scaling of electric transportation and achieving a sustainable future.
Invited Talk
TP
DescriptionAs artificial intelligence (AI) advances at a rapid pace, it continues to capture news headlines and the public imagination, all while sparking debate about its transformative potential and implications for the future. This talk will cover the strategy that the U.S. National Science Foundation (NSF) is pursuing to cultivate a trustworthy, inclusive future for AI in America. Efforts such as the National AI Research Institutes, ExpandAI, EducateAI, the National AI Research Resource pilot, and Regional Innovation Engines, among others, will be discussed as key initiatives that are striving to cultivate groundbreaking research, expand participation, and develop talent in AI and provide the research community access to critical computing and data resources. NSF-led efforts will be contextualized in broader federal initiatives underway, spurred by the Executive Order on the Safe, Secure and Trustworthy Development and Use of Artificial Intelligence and related Administration efforts.
Paper
Data Compression
Data Movement and Memory
Distributed Computing
Message Passing
Network
TP
DescriptionRemote Memory Access (RMA) enables direct access to remote memory to achieve high performance for HPC applications. However, most modern parallel programming models lack schemes for the remote process to detect the completion of RMA operations. Many previous works have proposed programming models and extensions to notify the communication peer, but they did not solve the multi-NIC aggregation, portability, hardware-software co-design, and usability problems. In this work, we propose a Unified Notifiable RMA (UNR) library for HPC to address these challenges. In addition, we demonstrate the best practice of utilizing UNR within a real-world scientific application, PowerLLEL. We deployed UNR across four HPC systems, each with a different interconnect. The results show that PowerLLEL powered by UNR achieves up to a 36% acceleration on 1728 nodes of the Tianhe-Xingyi supercomputing system.
Workshop
Data Movement and Memory
Emerging Technologies
W
DescriptionGlobal shared memories of petabytes are increasingly useful for applications, but traditional page-based techniques do not scale (limited reach), and scalable techniques such as segments fail to provide the needed locality control. We propose a novel two-level translation scheme, UpDown, that provides compact, efficient access control to distributed segments of arbitrary size along with data layout control, solving the problems of reach and data locality. UpDown's novel two-level structure separates access control from data layout, allowing applications to manage locality cheaply, without privileged operations.
We evaluate UpDown against page-based systems using big-data computations. Our results show that UpDown is scalable and provides effective global data locality management. UpDown's total translation state is on average ∼620 billion times smaller, and its local translation state is ∼130 billion times smaller. Simulations with synthetic traces show that the two-level translation scheme enables fast, user-level management of data locality in a scalable parallel machine.
Tutorial
Artificial Intelligence/Machine Learning
Portability
TUT
DescriptionThe use of containers has revolutionized the way in which industries and enterprises develop and deploy computational software and distributed systems. This containerization model has gained traction within the HPC community as well, with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC container runtimes that have emerged, including Singularity, Shifter, Sarus, Podman, and others.
This hands-on tutorial aims to train users on the use of containers for HPC use cases. We will provide a detailed background on Linux containers, along with an introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O-intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.
Birds of a Feather
TP
XO/EX
DescriptionIncreased demand for AI applications highlights the “memory wall” obstacle — a capacity and bandwidth memory transfer bottleneck. CXL facilitates memory sharing between accelerators and GPUs while enabling direct-attached memory (i.e. DRAM) to any node, improving memory bandwidth, performance, and capacity for AI language models.
This session will explore the advantages of memory sharing and DRAM improvements for CPU, GPU, and CPU plus GPU-based memory applications utilizing AI language models, such as RAG and LlaMA. Attendees will learn about performance, cost, and power consumption benefits of DRAM and CXL memory modules.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
DescriptionHPC systems face security and compliance challenges, particularly in preventing waste and misuse of computational resources by unauthorized or malicious software that deviates from allocation purpose. Existing methods to classify applications based on job names or resource usage are often unreliable or fail to capture applications that have different behavior due to different inputs or system noise. This research proposes an approach that uses similarity-preserving fuzzy hashes to classify HPC application executables. By comparing the similarity of SSDeep fuzzy hashes, a Random Forest Classifier can accurately label applications executing on HPC systems including unknown samples. We evaluate the Fuzzy Hash Classifier on a dataset of 92 application classes and 5333 distinct application samples. The proposed method achieved a macro f1-score of 90% (micro f1-score: 89%, weighted f1-score: 90%). Our approach addresses the critical need for more effective application classification in HPC environments, minimizing resource waste, and enhancing security and compliance.
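The general idea can be sketched with the ssdeep Python bindings and scikit-learn; the feature construction below is a simplification rather than the paper's pipeline, and reference_hashes, train_paths, and train_labels are placeholders for a labeled corpus:
import ssdeep
from sklearn.ensemble import RandomForestClassifier

def features(path, reference_hashes):
    # Represent an executable by its similarity score (0-100) to each reference hash.
    h = ssdeep.hash_from_file(path)
    return [ssdeep.compare(h, ref) for ref in reference_hashes]

# reference_hashes, train_paths, train_labels: placeholders for the labeled application corpus
X = [features(p, reference_hashes) for p in train_paths]
clf = RandomForestClassifier(n_estimators=200).fit(X, train_labels)
print(clf.predict([features("unknown_binary", reference_hashes)]))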
Workshop
Codesign
Data Movement and Memory
Facilities
W
DescriptionIncreasing system complexity and component costs mean that designing supercomputers and other HPC systems requires significant architectural compromises to be made. As costs have increased dramatically, system architects are being forced to make ever more significant tradeoffs, where increasing one set of resources requires a reduction in another. Achieving the right resource balance is crucial for maximizing performance of the target workloads the system is designed for. To guide these decisions, it is first necessary to understand what the resource requirements of the workloads are. At ORNL we have been investigating the feasibility of using telemetry collected from existing systems to better understand how those systems are being used by users and their applications. We hope to be able to use this data to develop an understanding of resource usage to prioritize the various components in planning for future system procurements. In this talk I will give an overview of this effort, and the challenges we have faced along the way.
Awards and Award Talks
TP
DescriptionA US imperative is to deliver a Fusion Pilot Plant to accelerate the fusion energy development timeline. This will rely heavily on validated scientific and engineering advances driven by HPC together with advanced statistical methods featuring artificial intelligence/deep learning/machine learning (AI/DL/ML) that must properly embrace verification, validation, and uncertainty quantification (VVUQ). Especially time-urgent is the need to predict and avoid large-scale “major disruptions” in tokamak systems.
This talk highlights the deployment of recurrent and convolutional neural networks in Princeton's Deep Learning Code—"FRNN"—that enabled the first adaptable predictive DL model for carrying out efficient "transfer learning" while delivering validated predictions of disruptive events across prominent tokamak devices. Moreover, the AI/DL capability can provide not only the “disruption score,” as an indicator of the probability of an imminent disruption but also a “sensitivity score” in real-time to indicate the underlying reasons for the predicted disruption. A real-time prediction and control capability has recently been significantly advanced with a novel surrogate model/HPC simulator ("SGTC")—a first-principles-based prediction and control surrogate necessary for projections to future experimental devices (e.g., ITER, FPP's) for which no "ground truth" observational data exist.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionE4 Computer Engineering designs and manufactures cutting-edge solutions for HPC clusters, the Cloud, Data Analytics, Artificial Intelligence, and hyper-converged infrastructures, made for industry and academia alike. Building on more than 20 years of developing and integrating innovative technologies, E4 announced its first RISC-V-based server, RSV10, a cluster aimed at enabling the co-design of high-performance scientific and engineering applications and the supporting software stack on the RISC-V ISA.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionInspireSemi provides revolutionary high-performance, energy-efficient accelerated computing solutions for High-Performance Computing (HPC), AI, graph analytics, and other compute-intensive workloads. The Thunderbird “supercomputer-cluster-on-a-chip” is a disruptive, next-generation datacenter accelerator designed to address multiple underserved and diversified industries, including financial services, computer-aided engineering, energy, climate modeling, and life sciences & drug discovery. Based on the open standard RISC-V instruction set architecture, InspireSemi’s solutions leverage an established and thriving software ecosystem.
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionTenstorrent brings together experts in computer architecture, ASIC design, advanced systems, and neural network compilers to build the next generation of computing. The company sells a family of PCIe accelerator products built around its Tensix core technology, with five RISC-V CPUs per Tensix core. These accelerators are not only designed for AI and ML workloads; with the release of the open-source Metalium framework, they can also be programmed directly for other high-performance codes. With the hardware widely available at a reasonable price, Tenstorrent brings RISC-V-powered accelerated computing to the masses!
Workshop
Energy Efficiency
HPC Infrastructure
Sustainability
W
DescriptionPower management and energy efficiency are critical research areas for exascale computing and beyond, necessitating reliable telemetry and control for distributed systems. Despite this need, existing approaches present several limitations precluding their adoption in production. These limitations include, but are not limited to, lack of portability due to vendor-specific and closed-source solutions, lack of support for non-MPI applications, and lack of user-level customization.
We present a job-level power management framework based on Flux. We introduce flux-power-monitor and demonstrate its effectiveness on the Lassen (IBM Power AC922) and Tioga (HPE Cray EX235A) systems with a low average overhead of 0.4%. We also present flux-power-manager, where we discuss a proportional sharing policy and introduce a hierarchical FFT-based dynamic power management algorithm (FPP). We demonstrate that FPP reduces energy by 1% compared to proportional sharing, and by 20% compared to the default IBM static power capping policy.
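As a rough illustration of what a proportional sharing policy does (a generic sketch, not the flux-power-manager implementation), a node power budget can be split among jobs in proportion to their requests:
def proportional_share(node_cap_w, requests_w):
    """Split a node power cap among jobs in proportion to their requested power."""
    total = sum(requests_w.values())
    if total <= node_cap_w:
        return dict(requests_w)  # everything fits; no throttling needed
    scale = node_cap_w / total
    return {job: req * scale for job, req in requests_w.items()}

# Example: a 3000 W node budget shared by three jobs
print(proportional_share(3000, {"job-a": 1500, "job-b": 1500, "job-c": 1000}))
# {'job-a': 1125.0, 'job-b': 1125.0, 'job-c': 750.0}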
Paper
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Middleware and System Software
Performance Evaluation and/or Optimization Tools
Runtime Systems
TP
DescriptionWith the ongoing reduction in technology sizes and voltage levels, modern microprocessors are increasingly susceptible to soft errors, corrupting datapath units during program execution. While these error types have received considerable attention recently, existing solutions either confine themselves to limited scopes or incur massive overheads in performance and power consumption, hindering practical usage. In this work, we propose CONDA, a novel error detection technique based on code transformation and static program analysis, achieving versatile datapath protection at low cost. At compile time, CONDA analyzes program characteristics and transforms the original program code without complicating its control-flow and memory access patterns. At runtime, CONDA detects datapath errors with low overhead and latency. The evaluation of 38 benchmarks and a parallel HPC simulation reveals that CONDA only incurs 57.79% runtime overhead, which is 41.84% faster than existing state-of-the-art, with the same level of error detection effectiveness and low detection latency.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThis analysis requires a large amount of raw data: detailed models of the 3D topography of the sea floor, satellite imagery to monitor glacier change over time, and physical samples of the ice and sediment. These datasets need to be mapped to a single consistent geographic framework for comparison. For example, lower resolution (100m) satellite-derived data of the entire glacier research site needed to be superimposed onto the higher resolution (5m) bathymetry ocean bottom multibeam (sonar) ship tracks close to the glacier terminus. Physical samples of ice and sediment will also be collected during the expedition and will need to be represented in this common georeferenced space.
The conglomerate topography/bathymetry of Greenland and glacier terminus endpoints over time were imported into a 3D visualization application (ParaView). The coverage for the high-resolution data is only a fraction of the total area, so care must be taken to represent the data appropriately and color-maps constructed to represent the various categories (ocean, land, ice) represented. The portions of interest are exported for 3D exploration in Unity via the Artifact-Based Rendering (ABR) plugins running in Unity and ParaView. The ABR engine allows for the complex scientific visualizations from ParaView to be piped to the Unity engine in real time and gives the user control over key color and textural representations. The hand-tracking/gestural navigation developed in Unity enables users to easily explore the data and approach areas of scientific interest for closer inspection. Users can export the exploration path as an animation for collaboration.
Art of HPC
Posters
TP
W
TUT
XO/EX
DescriptionThe simulation data was generated from a three-hour simulation at 7.5 m grid-spacing (domain size of 10 km x 10 km x 8 km) of a precipitating cumulus congestus cloud using Cloud Model 1 (CM1) with Lagrangian microphysics on NSF NCAR's Derecho supercomputer. Model output, in the form of NetCDF files, was converted to OpenVDB volume files, which were then read directly into Blender 3D animation software.
Materials, lighting, and camera motion for the pan/zoom sequence were all applied in Blender. The animation was rendered on an NVIDIA GPU using NVIDIA OptiX via the Blender Cycles rendering engine. The Derecho supercomputer model was also rendered in Blender from a textured mesh. Post-production, including adding text and logos, was performed in Adobe Premiere.
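A bare-bones sketch of the conversion step, assuming the netCDF4 and pyopenvdb packages and a hypothetical cloud-water variable named "qc" (the production pipeline is more involved):
import numpy as np
import pyopenvdb as vdb
from netCDF4 import Dataset

ds = Dataset("cm1out_000300.nc")          # hypothetical CM1 output file name
qc = np.asarray(ds.variables["qc"][0])    # assumed variable: cloud water mixing ratio

grid = vdb.FloatGrid()
grid.copyFromArray(qc.astype(np.float32)) # voxelize the 3D field
grid.name = "density"                     # Blender's volume shader looks for a 'density' grid
vdb.write("cloud_000300.vdb", grids=[grid])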
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionThe Dragon telemetry service is an easy-to-use, scalable means for users to visualize both hardware and custom metrics for complex workflows. We discuss in-depth the Dragon runtime, the architecture and capabilities of the telemetry service, and how the telemetry service compares to existing tools. Use of the telemetry service is demonstrated for a multi-language AI-in-the-loop workflow where both built-in hardware metrics and custom user metrics are visualized in a Grafana dashboard.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionMonitoring large data centers is critical for optimizing the performance of the hosted computing systems. By collecting and analyzing data, one can enhance the system’s efficiency. Spatial visualization of the current system status and historical data provides additional insights into its behavior and possible anomalies.
This approach improves maintenance efficiency by reducing the time needed to identify and fix problems. It also aids in localizing defective components, temperature hotspots, and network issues. While the virtual environment allows for remote inspection, employing augmented reality will fully exploit these capabilities for on-site service as well.
We present a comprehensive 3D model of the HPC system, detailing everything from racks to individual servers and their components. We use photorealistic rendering and parametric materials whose properties are modified in real time based on monitoring data. This provides both service-oriented and public-relations-oriented materials. Our solution is built on open-source tools such as Blender, Grafana, and InfluxDB.
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
DescriptionMastering computational architectures is essential for developing fast and power-efficient programs. Our advanced simulator empowers both IT students and professionals to grasp the fundamentals of superscalar processors and HW/SW co-design. With customizable processor architecture, full C compiler support, and detailed performance statistics, this tool offers a comprehensive learning experience.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
DescriptionUnlock the power of superscalar processor design with our cutting-edge RISC-V simulator! Tailored for IT students, researchers, and HPC professionals, this web-based tool brings complex architectures to life with an intuitive, customizable interface. Explore processor components, tweak configurations, and benchmark code snippets—all from your browser.
The simulator offers seamless support for C and assembly programs, built-in performance metrics, and full GCC compiler integration for various optimization levels. Whether you're learning or innovating, this tool enables you to experiment with different architectural setups, analyze results, and export configurations for sharing.
Designed to deepen your understanding of processor design and HW-SW co-design, the simulator supports both interactive exploration and batch processing via command-line. Perfect for those aiming to optimize RISC-V processors and HPC codes, it’s more than just a learning tool—it’s a powerful platform for research and development. Get ready to elevate your skills and performance optimization with this advanced simulator!
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
DescriptionWelcome & Opening Remarks
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
Exhibitor Forum
Broader Engagement
TP
XO/EX
DescriptionHPC is built on open source, and often new technologies are shared under open source licenses, but building and managing a community around an open source project isn't easy. Often, the skills that are needed for engaging with users of a project are completely different from the skills that were needed to create the project in the first place.
While study after study proves the importance of having a diversity of participation and thought for success, opening your project up to disagreement and discussion can be challenging and intimidating. Intentionally opening the door to disagreement can feel like a bad idea.
This talk will help you understand if diversity is right for your project, and then help you set your project up to let diversity in with a minimum of trolls and bikeshedding.
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
DescriptionIn the era of big data, where vast amounts of real-time data are generated, stream processing engines (SPEs) are widely utilized across various industries. In particular, SPEs that process stateful queries utilize historical data to execute current queries. We address the problems that arise when a Log-Structured Merge-Tree-based Key-Value Store (LSM-KVS) is used as an SPE's state store. We find that LSM operations occurring synchronously within an SPE degrade the SPE's performance, and we additionally characterize the I/O operations that result from those LSM operations.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionIn this session, we will invite our attendees to engage in a dynamic activity where they will share visions and strategies for building community in groups of three people. Using the Troika Consulting technique, the attendees will be guided to provide feedback on others' practical or imaginative questions about community building and networking strategies. At the end of this session, we hope our attendees will gain different perspectives on the problems they face daily and create new connections.
Workshop
Artificial Intelligence/Machine Learning
Biology
Broader Engagement
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
DescriptionThe diagnosis and grading of cancer rely on the examination of abnormal tissue and the morphology of the cells within. For example, the clinical evaluation of prostate cancer requires the assessment of glandular and cellular morphology from histopathology images. However, prostate cancer patients suffer from high rates of inter-observer variability among pathologists in the clinic. Additionally, recent studies have shown that the angle and depth of slide sectioning also contribute to significant variation in tumor grading, further illustrating the need for a quantitative, 3-dimensional, volumetric approach to prostate cancer whole biopsy imaging. We present the development of a propagation-based phase-contrast micro-CT approach that produces volumetric images of whole prostate needle core biopsies without the addition of contrast-enhancing stain. We cross-validate these images by comparing diagnostic features visible via x-ray imaging with those observed by clinically trained pathologists using conventional histopathology slides collected from matching samples. We then adapt techniques from topological data analysis (TDA) to quantify the variation in glandular architecture associated with depth within the sample, as well as age and comorbidity of the patient. Formalin-fixed, paraffin-embedded (FFPE), unstained prostate cancer samples containing phenotypes from each Gleason pattern were imaged in 3D at Lawrence Berkeley National Lab (LBNL). Hematoxylin and eosin-stained histopathology slides were obtained and scanned from each of these samples as a control reference for the X-ray images. All images scanned and analyzed in these experiments will be de-identified and made publicly available via a customized version of the open-source web viewer Neuroglancer. We aim to democratize the results from this work and subsequent similar experiments such that other scientists and clinicians might use our data to develop and train new models for the measurement of tumor phenotype and heterogeneity. In summary, this study reports the identification of reproducible imaging parameters for the non-destructive 3D reconstruction of soft-tissue tumor biopsies at cellular resolution without the addition of contrast-enhancing stain – a significant step towards advancing the clinical diagnosis of prostate cancer. Further, we also report an interpretable computational model for the quantification of glandular shape and its variation – the key diagnostic feature in prostate cancer and a crucial marker for disease severity. This advancement of 3D histopathology and computational topology will serve public health needs by improving the diagnosis of prostate cancer and other soft-tissue malignancies.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionIn this lightning talk, we motivate the creation of a student-led chapter of the professional organization, Women in High-Performance Computing (WHPC), at one of the largest academic institutions by enrollment, Arizona State University (ASU). We believe that the strategic objectives of WHPC are best served in our community by empowering student HPC researchers with extracurricular events and leadership opportunities that increase the visibility of women in HPC, raise awareness of their underrepresentation, and provide women with opportunities to develop their social and professional networks.
Workshop
Broader Engagement
HPC in Society
Inclusivity
W
DescriptionArtificial Intelligence (AI) has become one of the most transformative technologies in this century and permeates nearly every facet of our existence. Despite its rapid growth and deep societal impact, the field of AI faces a significant gender disparity. Women are underrepresented in AI, both in industry and academia. This gender gap is not just a social issue -- it has a profound impact on the development of AI systems, potentially leading to biased applications that can exacerbate existing inequalities. The current status of women in AI reflects broader systemic issues in STEM fields, including gender biases in education systems, the lack of role models, and a work culture that often marginalizes women’s contributions. This presentation explores the challenges faced by women in AI and presents approaches that the University of Florida’s Women in High Performance (WHPC) Chapter took to overcome the barriers.
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
DescriptionToday, Python developers typically access GPUs either from deep learning frameworks or through tools that haven't kept up with modern CUDA practices. In this work-in-progress update, the CUDA Python team will demonstrate new interfaces using the CUDA Core Compute Libraries and an updated Pythonic object model. Additionally, these tools are able to link to device-side code with Numba and link-time optimization (LTO). They also work with the new nvmath-python library, which removes much of the difficulty of picking the correctly optimized CUDA library from the Python runtime. The team is eager to get early feedback and help incorporate ideas into these tools as they launch in the next year.
Paper
Heterogeneous Computing
Linear Algebra
Network
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TP
DescriptionAs next-generation experimental and observational instruments for scientific research are being deployed with higher resolutions and faster data capture rates, the fundamental demands of producing high-quality scientific throughput require portability and performance to meet the high productivity goals. Understanding such a workflow's end-to-end performance on HPC systems is formidable work. In this paper, we address this challenge by introducing a Workflow Roofline model, which ties a workflow's end-to-end performance to peak node- and system-level performance constraints. We analyze four workflows: LCLS, a time-sensitive workflow that is bound by system external bandwidth; BerkeleyGW, a traditional HPC workflow that is bound by node-local performance; CosmoFlow, an AI workflow that is bound by CPU preprocessing; and GPTune, an autotuner that is bound by the data control flow. We demonstrate the ability of our methodology to understand various aspects of performance and performance bottlenecks on workflows and systems, and to motivate workflow optimizations.
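For readers new to roofline reasoning, the node-level formula that such a model generalizes to whole workflows is simply the minimum of the compute and bandwidth ceilings (a generic illustration, not the paper's exact formulation):
def roofline_gflops(arithmetic_intensity, peak_gflops, peak_bandwidth_gbs):
    # Attainable performance: the lower of the compute-bound and bandwidth-bound ceilings.
    return min(peak_gflops, peak_bandwidth_gbs * arithmetic_intensity)

# Example: a stage doing 0.25 FLOP per byte on a node with 20 TF/s peak and 1.6 TB/s bandwidth
print(roofline_gflops(0.25, 20_000, 1_600))  # 400.0 GF/s: firmly bandwidth-bound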
Birds of a Feather
TP
XO/EX
DescriptionThis session will focus on the integration and scalability of AI-driven scientific workflows across facilities. Building on vibrant discussions from our previous SC BoF sessions, this session will address the challenges and opportunities inherent in multi-facility workflows. Key themes will include the coordination among various computing and experimental facilities, near real-time data processing, and enhancing infrastructure resilience. Participants will engage in collaborative brainstorming sessions to identify innovative solutions for data representation and storage challenges. This session aims to foster an environment of collaboration, driving the development of efficient and scalable workflows that support modern scientific research’s growing complexity and scale.
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Birds of a Feather
TP
XO/EX
DescriptionWhile sparse matrix computations are at the heart of many scientific and engineering applications, there exists no widely adopted interface standard. A reason for this may be the plethora of optimization options relevant to today’s accelerator architectures. At the same time, many vendors already provide support for sparse matrix computations in proprietary libraries, but due to diverging architectural constraints, these libraries have different execution models, APIs, and formats supported. We started a cross-institutional effort involving academia and industry to define an API for sparse linear algebra operations. In this BoF, we present a blueprint and discuss considerations motivating design choices.
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
DescriptionThis talk will describe our efforts in analysing our system workloads, with the eventual aim of discovering fingerprints for common applications or application classes. Our system workload analysis includes categorising applications, investigating the causes for changes in usage/performance over time, as well as the impact of system changes on applications. The talk will also discuss how application fingerprints might be used to improve system efficiency.
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
DescriptionContention within storage systems noticeably impacts runtimes, and shared bandwidth-type resources such as Lustre pose challenges for high-performance computing cluster schedulers. Additionally, accurately estimating job resource requirements, particularly for I/O operations, remains a significant challenge for users. In response to these challenges, we have developed a prototype that facilitates I/O-aware scheduling in Slurm without imposing additional burdens on users. Accounting for the specific properties of this bandwidth-type resource, our system monitors real-time Lustre bandwidth utilization, estimates job I/O requirements, and dynamically adjusts to the demands placed on the file system. Our workload-adaptive scheduler aims to maintain bandwidth utilization at a level that reflects the resource requirements of the job queue. We further enhance the efficacy of our approach by introducing a "two-group" approximation technique that ensures efficient performance regardless of the availability of zero-throughput jobs. We demonstrate the effectiveness of our approach through evaluation on a real cluster.
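In spirit, the admission decision reduces to a bandwidth-budget check like the toy example below (not the prototype's actual logic):
def can_start(job_io_gbs, current_lustre_gbs, lustre_capacity_gbs, target_util=0.9):
    # Admit a job only if its estimated I/O demand fits under the target utilization.
    return current_lustre_gbs + job_io_gbs <= target_util * lustre_capacity_gbs

print(can_start(job_io_gbs=40, current_lustre_gbs=600, lustre_capacity_gbs=700))
# False: starting this job now would push Lustre past 90% of its bandwidth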
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Workshop
Artificial Intelligence/Machine Learning
W
DescriptionThe predictive power of Large Language Models (LLMs) has increasingly made them go-to methods across many scientific domains. This presents scientists with new challenges but even greater opportunities as they explore various approaches and drive changes in data protection, data usage, and scientific understanding. The international Trillion Parameter Consortium (TPC) aims to bring together groups interested in collaborating around important areas including building, training, and using large-scale AI models as well as building and operating large-scale computing systems. TPC convenes individuals from three broad and overlapping communities: (1) those working on AI methods development, natural language processing/multimodal approaches and architectures, full stack implementations, scalable libraries and frameworks, AI workflows, data aggregation, cleaning and organization, training runtimes, model evaluation, downstream adaptation, alignment, etc.; (2) those who design and build hardware and software systems; and (3) those who will ultimately use the resulting AI systems to explore a range of challenges in science, engineering, medicine, and other domains. This workshop aims to provide an update on projects around the world, with emphasis on international collaborative teams working in areas ranging from evaluation and safety to training and performance to discipline-specific applications.
Workshop
Performance Optimization
Programming Frameworks and System Software
W
DescriptionWrap up at the end of the workshop
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionHPC and cloud computing have historically evolved independently, each specializing in performance or productivity. But what if we could combine the best of both worlds? The recently proposed Acceleration as a Service (XaaS) presented a new vision for the future — a unified architecture providing transparent access to high-performance computing resources across any underlying cloud or HPC provider. By bridging advancements from both domains with performance-portable containers, XaaS opens up new research directions: How can XaaS enable the seamless execution of scientific software across many systems and architectures? How much will XaaS have to change containers to achieve performance portability between different HPC systems? What new serverless and flexible resource utilization models could XaaS enable?
Join our panel, where the authors of the XaaS proposal will explore these questions and discuss the potential of cloud technologies to change the HPC landscape.
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
DescriptionGraph Neural Networks (GNNs) have been used in a variety of challenging applications. However, training GNN models is time-consuming, as it incurs a high volume of irregular data accesses due to its graph-structured input data; this challenge is further exacerbated in real-world applications, which often involve large-scale graphs with billions of edges. Most existing GNN accelerators cannot scale to billion-scale graphs due to memory limitations. We propose xBS-GNN, an accelerator optimized for billion-scale GNN training. To achieve high training throughput, xBS-GNN jointly exploits several optimizations, including (1) a novel data placement policy, along with (2) a vertex-renaming technique and a memory-efficient lookup table design for fast data retrieval, and (3) a feature quantization mechanism to reduce memory traffic. We evaluate xBS-GNN on three large datasets. xBS-GNN achieves up to 8.39x speedup over a widely used GPU baseline and up to 5.13x speedup over a state-of-the-art GNN training accelerator.
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
DescriptionDo Linux distribution package managers need the privileged operations they request to actually happen? Apparently not, at least when building container images for HPC applications. We use this observation to implement a root emulation mode using a Linux seccomp filter that intercepts some privileged system calls, does nothing, and returns success to the calling program. This approach provides no consistency whatsoever but appears sufficient to build a wide selection of Dockerfiles, including one that Docker itself cannot build, simplifying fully-unprivileged workflows needed for HPC application containers.
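A minimal sketch of the trick using the libseccomp Python bindings (assuming the seccomp module is available; the actual filter covers a wider set of privileged system calls):
import os
import seccomp

# Allow everything by default, but make chown(2) do nothing and report success.
f = seccomp.SyscallFilter(defaction=seccomp.ALLOW)
f.add_rule(seccomp.ERRNO(0), "chown")  # intercept, skip the syscall, return 0 to the caller
f.load()

open("/tmp/somefile", "w").close()
os.chown("/tmp/somefile", 0, 0)        # appears to succeed even for an unprivileged user
print("chown 'succeeded' without privileges")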
Sessions
Workshop
Distributed Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
W
Workshop
Algorithms
Heterogeneous Computing
W
Workshop
Performance Optimization
Programming Frameworks and System Software
W
Workshop
Debugging and Correctness Tools
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Software Engineering
W
Workshop
Artificial Intelligence/Machine Learning
W
Paper
Accelerators
Applications and Application Frameworks
Graph Algorithms
Modeling and Simulation
Numerical Methods
TP
Workshop
Artificial Intelligence/Machine Learning
W
Paper
Data Movement and Memory
Performance Evaluation and/or Optimization Tools
Resource Management
State of the Practice
TP
ACM Student Research Competition: Undergraduate Poster
Posters
TP
Workshop
Cloud Computing
Middleware and System Software
State of the Practice
W
Inclusivity
Childcare
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Childcare
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Childcare
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Childcare
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Childcare
Inclusivity
TP
W
TUT
XO/EX
Workshop
Artificial Intelligence/Machine Learning
Codesign
W
Paper
Artificial Intelligence/Machine Learning
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
Workshop
I/O, Storage, Archive
W
Paper
Accelerators
Compilers
Embedded and/or Reconfigurable Systems
Linear Algebra
Performance Evaluation and/or Optimization Tools
TP
Paper
Accelerators
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Modeling and Simulation
Numerical Methods
TP
Workshop
Broader Engagement
Education
Inclusivity
W
Paper
Accelerators
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Performance Optimization
TP
Workshop
Distributed Computing
Education
Emerging Technologies
W
Workshop
Accelerators
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
W
Paper
Architecture
Codesign
Data Movement and Memory
Energy Efficiency
Green Computing
Linear Algebra
TP
Workshop
Artificial Intelligence/Machine Learning
Broader Engagement
HPC in Society
W
Exhibits
Exhibit Floor Ribbon Cutting
TP
XO/EX
Exhibits
Exhibitor Pre-Gala Dinner
XO/EX
Reception
Exhibitor Reception
XO/EX
Exhibitor Forum
Hardware Technologies
TP
XO/EX
Workshop
Applications and Application Frameworks
W
Workshop
Codesign
Data Movement and Memory
Facilities
W
Exhibits
Flash Session
TP
XO/EX
Paper
Accelerators
Artificial Intelligence/Machine Learning
Cloud Computing
Distributed Computing
Heterogeneous Computing
Performance Optimization
TP
Paper
Accelerators
Algorithms
Data Movement and Memory
Graph Algorithms
TP
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Parallel Programming Methods, Models, Languages and Environments
W
Paper
Accelerators
Algorithms
Data Compression
I/O, Storage, Archive
Performance Optimization
TP
Paper
Accelerators
Algorithms
Linear Algebra
Modeling and Simulation
Numerical Methods
TP
Inclusivity
HPC Around the World Showcase
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
HPC Around the World Showcase
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
HPC Around the World Showcase
Inclusivity
TP
W
TUT
XO/EX
Paper
Accelerators
Applications and Application Frameworks
Modeling and Simulation
Numerical Methods
Task Parallelism
TP
Workshop
HPC Infrastructure
State of the Practice
System Administration
W
Workshop
State of the Practice
System Administration
W
Workshop
Graph Algorithms
Heterogeneous Computing
Programming Frameworks and System Software
W
IndySCC
IndySCC
TP
XO/EX
IndySCC
IndySCC
TP
XO/EX
IndySCC
IndySCC
TP
XO/EX
IndySCC
IndySCC Kickoff
TP
XO/EX
IndySCC
Posters
IndySCC Poster Display
TP
XO/EX
IndySCC
Posters
IndySCC Poster Display
TP
XO/EX
IndySCC
Posters
IndySCC Poster Display
TP
XO/EX
IndySCC
Posters
IndySCC Poster Display
TP
XO/EX
Workshop
Architecture
Embedded and/or Reconfigurable Systems
Performance Optimization
Resource Management
W
Workshop
Accelerators
Emerging Technologies
Hardware Technologies
W
Workshop
Distributed Computing
Performance Evaluation and/or Optimization Tools
Scientific and Information Visualization
W
Paper
Accelerators
HPC Infrastructure
Performance Evaluation and/or Optimization Tools
State of the Practice
TP
Workshop
Compilers
Parallel Programming Methods, Models, Languages and Environments
Performance Optimization
W
Paper
Accelerators
Applications and Application Frameworks
Distributed Computing
Graph Algorithms
Heterogeneous Computing
Tensors
TP
Workshop
Artificial Intelligence/Machine Learning
Graph Algorithms
Scalable Data Mining
W
Paper
Accelerators
Algorithms
Data Compression
Linear Algebra
Tensors
TP
Workshop
Data Movement and Memory
Emerging Technologies
W
Paper
Accelerators
Compilers
Heterogeneous Computing
Performance Evaluation and/or Optimization Tools
TP
Paper
Distributed Computing
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
TP
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Parents Room
Inclusivity
TP
W
TUT
XO/EX
Exhibits
Flash Session
Paving the Way to Fault Tolerance: The Future of Reliable Quantum Computing
TP
XO/EX
Workshop
Data Movement and Memory
I/O, Storage, Archive
W
Paper
Fault-Tolerance, Reliability, Maintainability, and Adaptability
Middleware and System Software
Performance Evaluation and/or Optimization Tools
Runtime Systems
TP
Paper
Heterogeneous Computing
Linear Algebra
Network
Parallel Programming Methods, Models, Languages and Environments
Performance Evaluation and/or Optimization Tools
TP
Workshop
Accelerators
Modeling and Simulation
Performance Evaluation and/or Optimization Tools
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Reception
TP
Paper
Accelerators
Energy Efficiency
Facilities
Resource Management
State of the Practice
TP
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Prayer Room
Inclusivity
TP
W
TUT
XO/EX
Exhibits
Flash Session
TP
XO/EX
Workshop
Debugging and Correctness Tools
Performance Evaluation and/or Optimization Tools
W
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Quiet Room
Inclusivity
TP
W
TUT
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
TP
XO/EX
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
State of the Practice
System Administration
TP
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Inclusivity
Satellite Parents Room
Inclusivity
TP
W
TUT
XO/EX
Awards and Award Talks
SC24 Awards Ceremony
TP
XO/EX
Keynote
SC24 Keynote Address Overflow 1
TP
W
TUT
XO/EX
Keynote
SC24 Keynote Address Overflow 2
TP
W
TUT
XO/EX
Paper
Data Compression
Data Movement and Memory
Distributed Computing
Message Passing
Network
TP
Paper
Accelerators
Data Movement and Memory
Emerging Technologies
Hardware Technologies
Heterogeneous Computing
Linear Algebra
Network
TP
Paper
Cloud Computing
Fault-Tolerance, Reliability, Maintainability, and Adaptability
State of the Practice
TP
Paper
Middleware and System Software
Programming Frameworks and System Software
Resource Management
TP
Paper
Algorithms
Data Movement and Memory
I/O, Storage, Archive
Performance Optimization
Scientific and Information Visualization
Visualization
TP
Workshop
Debugging and Correctness Tools
Hardware Technologies
Resource Management
State of the Practice
W
Workshop
Applications and Application Frameworks
Algorithms
Performance Evaluation and/or Optimization Tools
W
Paper
Distributed Computing
Middleware and System Software
TP
Paper
Algorithms
Artificial Intelligence/Machine Learning
Graph Algorithms
Linear Algebra
TP
Paper
Algorithms
Artificial Intelligence/Machine Learning
Heterogeneous Computing
Performance Optimization
TP
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition
TP
XO/EX
Student Cluster Competition
Student Cluster Competition Kickoff
TP
XO/EX
Posters
Student Cluster Competition
Student Cluster Competition Posters Display
TP
XO/EX
Posters
Student Cluster Competition
Student Cluster Competition Posters Display
TP
XO/EX
Posters
Student Cluster Competition
Student Cluster Competition Posters Display
TP
XO/EX
Posters
Student Cluster Competition
Student Cluster Competition Posters Display
TP
XO/EX
Students@SC
TP
W
TUT
XO/EX
Students@SC
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Wrap-Up
TP
W
TUT
XO/EX
Workshop
Cloud Computing
Distributed Computing
W
Reception
Technical Program Reception
TP
Workshop
Artificial Intelligence/Machine Learning
Biology
Education
Emerging Technologies
Medicine
Modeling and Simulation
W
Workshop
Embedded and/or Reconfigurable Systems
Heterogeneous Computing
W
Workshop
Data Compression
Data Movement and Memory
Middleware and System Software
W
Workshop
Architecture
Network
Performance Optimization
System Administration
W
Workshop
Applications and Application Frameworks
Distributed Computing
Middleware and System Software
W
Workshop
Emerging Technologies
Modeling and Simulation
Scientific and Information Visualization
W
Workshop
Heterogeneous Computing
Parallel Programming Methods, Models, Languages and Environments
Task Parallelism
W
Workshop
Applications and Application Frameworks
Artificial Intelligence/Machine Learning
Security
W
Tutorial
Tutorial Lunch
TUT
Tutorial
Tutorial Lunch
TUT
Exhibits
Flash Session
Unlocking New Possibilities: How IQM Quantum Cloud Transforms Innovation
TP
XO/EX
Paper
Algorithms
Artificial Intelligence/Machine Learning
Data Movement and Memory
Graph Algorithms
TP
Workshop
Distributed Computing
Experimental Facility
HPC Infrastructure
W