Students@SC
Description: This workshop will explore the definitions of microaggressions, macroaggressions, and microaffirmations, along with effective methods for recognizing their impacts in the workplace. The workshop will consist of understanding and defining biases and reviewing subtle remarks that may seem commonplace but can be harmful. The objective is for participants to gain an understanding of what microaggressions are, how harmful they can be, and how to counter them, for example with microaffirmations, to promote a positive culture.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Description: The complexity of node architectures in supercomputers increases as we cross milestones on the way toward exascale and beyond. Increasing levels of parallelism in multi- and many-core chips and emerging heterogeneity of computational resources coupled with energy and memory constraints force a reevaluation of our approaches towards operating systems and runtime environments.
The International Workshop on Runtime and Operating Systems for Supercomputers (ROSS) provides a forum for researchers to exchange ideas and discuss research questions that are relevant to upcoming supercomputers and cloud environments for high-performance computing. In addition to typical workshop publications, we encourage novel and possibly immature ideas, provided that they are interesting and on-topic. Well-argued position papers are also welcome.
Workshop
Education
State of the Practice
W
Description: This paper describes an assignment in the Chapel programming language for creating a 1D heat equation solver. Two methods are used to solve the problem, exposing a variety of parallel programming concepts. The first portion of the assignment uses high-level parallel constructs, namely Chapel's forall loop and Block distribution, to create a simple distributed-memory solver. Here, students are asked to think about what it means for an array to be split across the memory in multiple compute nodes while relying on the language to handle the details of communication and synchronization. The second portion of the assignment uses low-level parallelism, like barriers and explicit communication. Here, the goal is to reduce overhead, while introducing students to the ideas of explicit communication and synchronization. In both parts, students are provided with a non-distributed version of the solver and are asked to create a modified version that runs across multiple compute nodes.
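The assignment itself is written in Chapel; as an illustration only, the explicit stencil update that both versions of the solver distribute can be sketched serially in Python (the function name and parameters here are hypothetical, not from the assignment):

```python
import numpy as np

def heat_1d(u, alpha, dt, dx, steps):
    """Explicit finite-difference update for the 1D heat equation.

    Each interior point is updated from its two neighbors; this is the
    loop a Chapel forall over a Block-distributed array parallelizes.
    Boundary values are held fixed.
    """
    u = u.copy()
    for _ in range(steps):
        nxt = u.copy()
        nxt[1:-1] = u[1:-1] + alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])
        u = nxt
    return u

# Example: a unit "hot spike" in the middle of a cold rod diffuses outward.
u0 = np.zeros(11)
u0[5] = 1.0
u1 = heat_1d(u0, alpha=1.0, dt=0.1, dx=1.0, steps=1)
```

In the distributed version, the only extra concern is that each node needs the boundary ("halo") values owned by its neighbors before every update, which is exactly the communication the assignment asks students to reason about.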
Birds of a Feather
Performance Measurement, Modeling, and Tools
TP
XO/EX
Description: Data-intensive supercomputer applications are increasingly important workloads, especially for “Big Data” problems, but are ill suited to most of today’s computing platforms (at any scale!). The Graph500 list has grown to over 357 entries and has demonstrated the challenges of even simple analytics. The new SSSP kernel introduced at SC17 has increased the benchmark’s overall difficulty. This BoF will unveil the latest Graph500 lists, provide in-depth analysis of the kernels and machines, and highlight the new energy metrics of the Green Graph500. It will offer a forum for the community and provide a rallying point for data-intensive supercomputing problems.
Paper
Exascale
Large Scale Systems
State of the Practice
TP
Description: HPL-MxP is an emerging high-performance benchmark used to measure the mixed-precision computing capability of leading supercomputers. This work presents our efforts on the new Sunway supercomputer, linearly scaling the benchmark to over 40 million cores, sustaining an overall mixed-precision performance exceeding 5 ExaFlop/s, and achieving over 85% of peak performance, the highest efficiency reached among all heterogeneous systems on the HPL-MxP list. The optimizations in our HPL-MxP implementation include: (1) a Two-Direction Look-Ahead and Overlap algorithm that overlaps all communication with computation; (2) a multi-level process-mapping and communication-scheduling method that uses the network as efficiently as possible while maintaining a conflict-free algorithm flow; and (3) a CG-Fusion computing framework that eliminates up to 60% of inter-chip communication and removes the memory-access bottleneck while serving both computation and communication simultaneously. This work also provides useful insights for tuning cutting-edge applications on Sunway supercomputers as well as other heterogeneous supercomputers.
Paper
Accelerators
Applications
Modeling and Simulation
TP
Description: A highly scalable and fully optimized earthquake model is presented based on the latest Sunway supercomputer. Contributions include:
1) the curvilinear grid finite-difference method (CGFDM) and flexible model applying perfectly matched layer (PML) and enabling more accurate and realistic terrain descriptions;
2) a hybrid and non-uniform domain decomposition scheme that efficiently maps the model across different levels of the computing system; and
3) sophisticated optimizations that largely alleviate or even eliminate bottlenecks in memory, communication, etc., obtaining a speedup of over 140x.
Combining all innovations, the design fully exploits the hardware potential of all aspects and enables us to perform the largest CGFDM-based earthquake simulation ever reported (69.7 PFlops using over 39 million cores).
Based on our design, the Turkey earthquakes (February 6, 2023) and the Ridgecrest earthquake (July 4, 2019) are successfully simulated at a maximum resolution of 12 m. Precise hazard evaluations for hazard reduction in earthquake-stricken areas are also conducted.
Exhibits
Flash Session
TP
XO/EX
Description: This session will discuss the latest generation of Nokia’s PSE (Photonic Switch Engine), which provides up to 1.2 Tb/s per wavelength and helps close the gap to Shannon’s limit.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The complexity and parameter counts of mainstream large models are increasing rapidly. For example, the increasingly popular large language models (e.g., ChatGPT) have billions of parameters. While this has led to performance improvements, for simple tasks the additional cost may not justify the performance gains. We apply residual networks of three different depths and evaluate them extensively on the MedMNIST pneumonia dataset. Experimental results show that smaller models can achieve satisfactory performance at significantly lower cost than larger models.
Birds of a Feather
Quantum Computing
TP
XO/EX
Description: Integrating quantum computing (QC) test beds into scientific computing environments presents challenges in software interfaces and system familiarity. High-performance computing (HPC) centers are taking on this task, but selecting suitable test bed technologies is complex due to the many providers with varying maturity levels and the risk associated with single-vendor systems.
A component-based approach is promising but faces challenges with the lack of standardized benchmarks and the need for device-specific calibrations. This discussion addresses the challenges of component-based approaches and explores unifying access to diverse QC technologies, leveraging HPC for optimization, and fulfilling researcher needs.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Description: This paper presents an adaptive continuum synchronisation method for data science pipelines deployed on edge-fog-cloud infrastructures. In a diagnostic phase, a model based on the Bernoulli principle is used as an analogy to create a global representation of bottlenecks in a pipeline. In a supervision phase, a watchman/sentinel cooperative system monitors and captures the throughput of the pipeline stages to create a bottleneck-stage scheme. In a rectification phase, this system produces replicas of stages identified as bottlenecks to mitigate workload congestion using implicit parallelism and load-balancing algorithms. This method is invoked automatically and transparently to produce a steady continuum dataflow at runtime. To test our proposal, we conducted a case study on the processing of medical and satellite data on fog-cloud infrastructures. The evaluation revealed that this method creates continuum dataflows, without characterising workloads or knowing infrastructure details, that yield performance competitive with state-of-the-art solutions.
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
Description: Take a deep dive into the latest developments in NVIDIA software for high performance computing applications, including a comprehensive look at what’s new in programming models, compilers, libraries, and tools. We'll cover topics of interest to HPC developers, targeting traditional HPC modeling and simulation, quantum computing, HPC+AI, scientific visualization, and high-performance data analytics.
Workshop
Programming Frameworks and System Software
W
Description: Insights about applications and user environments can help HPC center staff make data-driven decisions about cluster operations. In this paper, we present a fast and responsive web-based visualization framework for analyzing HPC application usage. By leveraging XALT, a powerful tool for tracking application and library usage, we collected tens of millions of data points on a national supercomputer. The portable visualization framework, created with Plotly Dash, can be easily launched as a container and accessed from a web browser. The presented visualizations take a deep dive into the XALT data, analyzing application usage, compiler usage, library usage, and even user-specific usage. Our analysis codes can distinguish between centrally installed applications and user-installed applications and can generate plots based on different metrics (number of jobs or CPU-hours). Initial insights gained from this visualization framework have helped our support staff identify several goals for improving the software stack and proactively helping users.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: In this work, we explore how to replicate the behavior of undocumented hardware units (in this case, NVIDIA's Tensor Cores) and reason about them.
While prior work has employed manual testing to identify hardware behavior, we show that SMT can be used to generate inputs that can discriminate between different hardware implementation choices. We argue that SMTLIB, the language specification for SMT solvers, is well suited for exposing hardware implementations.
Using our method, we create a formal specification of the tensor cores on NVIDIA's Volta architecture. We confirm many of the findings of previous studies on tensor cores, but also identify two discrepancies: we find that the hardware does not use IEEE-754 round-to-zero for accumulation and that the 5-term accumulator requires 3 extra bits for carry out since it does not normalize intermediate sums.
The work will be presented in person using the poster as a visual aid.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
Description: Modern tasking models define applications in a fine-grained manner that necessitates lower overhead per segment of computation. While previous work has seen implementations of hardware support for tasking models, many lack the support required by heterogeneity and fall short of expanding memory interfaces for data-centric needs and memory utilization. In this paper, we propose and implement a hardware support scheme for the Sequential Codelet Model (SCM). The hardware support makes it possible to demonstrate SCM’s potential advantage on heterogeneous workloads and its capability of supporting the expanding software memory interface. The gem5 implementation of the Sequential Codelet Model functions as a foundation to demonstrate the benefits offered by the SCM program execution model by moving hardware support closer to program semantics. We compare the overhead with DARTS, a software implementation of the Codelet Model that has been shown to be useful for fine-grained execution, and show a 20x reduction in overhead.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
Description: Automated computational steering automatically guides simulations toward productive states by combining data analysis with predefined control-flow paths. Interactive computational steering achieves a similar goal but relies on manual human intervention instead. Existing in situ libraries can fulfill some computational steering use cases, but not all of them. This paper presents a general-purpose interface for instrumenting existing simulation codes with interactive computational steering capabilities. Common use cases are presented, summarized from informal interviews with seven research scientists who use large-scale simulations in their work. Preliminary support for bidirectional communication via simulation callbacks and shell commands has been implemented in Ascent, a software library which provides simulations with in situ analysis and visualization infrastructure. Finally, a proof-of-concept instrumentation is provided, demonstrating that the proposed interface is sufficiently flexible to enable any interactive computational steering use case within Ascent-instrumented simulations.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Detecting strongly connected components (SCCs) is an important step in various graph computations. The fastest GPU and CPU implementations from the literature work well on graphs where most of the vertices belong to a single SCC and the vertex degrees follow a power-law distribution. However, these algorithms can be slow on the mesh graphs used in certain radiative transfer simulations, which have a nearly constant vertex degree and can have significant variability in the number and size of SCCs. We introduce ECL-SCC, an SCC detection algorithm that addresses these shortcomings. Our approach is GPU-friendly and employs innovative techniques such as maximum ID propagation and edge removal. On an A100 GPU, ECL-SCC performs on par with the fastest prior GPU code on power-law graphs and outperforms it by 7.8x on mesh graphs. Moreover, ECL-SCC running on the GPU outperforms fast parallel CPU code by three orders of magnitude on meshes.
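ECL-SCC's GPU-specific techniques (maximum ID propagation, edge removal) go beyond what fits in a short sketch, but the classic forward-backward idea that SCC detection builds on can be illustrated in Python (all names here are illustrative, not from the paper):

```python
def reach(start, adj):
    """Set of vertices reachable from `start` in the adjacency map `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def scc_of(pivot, fwd, bwd):
    """Forward-backward method: the SCC containing `pivot` is the
    intersection of its forward-reachable set (edges as given) and its
    backward-reachable set (edges reversed)."""
    return reach(pivot, fwd) & reach(pivot, bwd)

# Cycle 0 -> 1 -> 2 -> 0 with a tail edge 2 -> 3.
fwd = {0: [1], 1: [2], 2: [0, 3]}
bwd = {1: [0], 2: [1], 0: [2], 3: [2]}
```

The GPU variants in the literature replace the per-pivot traversals with bulk label propagation so that many components are peeled off in parallel.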
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: The field of in silico cellular modeling has made notable strides in the number of cells that can be simultaneously modeled. While computational capabilities have grown exponentially, I/O performance has lagged behind. To address this issue, we present an in-transit approach to enable in situ visualization and analysis of large-scale fluid-structure-interaction models on leadership-class systems. We delineate the proposed framework and demonstrate the feasibility of this approach by measuring the overhead it introduces. The proposed framework provides a valuable tool for both at-scale debugging and enabling scientific discovery that would be difficult to achieve otherwise.
Posters
Research Posters
TP
XO/EX
Description: In traditional deep learning workflows, AI applications (producers) train DNN models offline using fixed datasets, while inference serving systems (consumers) load the trained models to serve real-time inference queries. In practice, AI applications often operate in a dynamic environment where data is constantly changing. Compared to offline learning, continuous learning frequently (re)trains models to adapt to the ever-changing data. This demands regular deployment of the DNN models, increasing the model update frequency between producers and consumers. Typically, producers and consumers are connected via model repositories such as a parallel file system (PFS), which may result in high model update latency due to the I/O bottleneck of the PFS. To address this, our work introduces a high-performance I/O framework that speeds up model updates between producers and consumers. It employs a cache-aware model handler to minimize latency and an intelligent performance predictor to maintain a balance between training and inference performance.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
Description: Finding a minimum spanning tree (MST) is a fundamental graph algorithm with applications in many fields. This paper presents ECL-MST, a fast MST implementation designed specifically for GPUs. ECL-MST is based on a parallelization approach that unifies Kruskal's and Borůvka's algorithm and incorporates new and existing optimizations from the literature, including implicit path compression and edge-centric operation. On two test systems, it outperforms leading GPU and CPU codes from the literature on all of our 17 input graphs from various domains. On a Titan V GPU, ECL-MST is, on average, 4.6 times faster than the next fastest code, and on an RTX 3080 Ti GPU, it is 4.5 times faster. On both systems, ECL-MST running on the GPU is roughly 30 times faster than the fastest parallel CPU code.
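ECL-MST's GPU implementation is far more involved, but the Borůvka half of the unified approach is easy to sketch: every component picks its cheapest outgoing edge each round, and the picked edges merge components. A minimal serial version in Python (names illustrative; assumes distinct edge weights, as classic Borůvka does):

```python
def boruvka_mst(n, edges):
    """Borůvka's algorithm over `edges` = [(u, v, weight), ...] on n vertices.
    Components are tracked with a union-find structure using path halving."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    mst, total, changed = [], 0, True
    while changed:
        changed = False
        cheapest = {}  # component root -> lightest edge leaving it
        for u, v, w in edges:
            ru, rv = find(u), find(v)
            if ru == rv:
                continue
            for r in (ru, rv):
                if r not in cheapest or w < cheapest[r][2]:
                    cheapest[r] = (u, v, w)
        for u, v, w in cheapest.values():
            ru, rv = find(u), find(v)
            if ru != rv:  # re-check: an earlier merge may have joined them
                parent[ru] = rv
                mst.append((u, v, w))
                total += w
                changed = True
    return mst, total

# Square-ish example: MST keeps edges of weight 1, 2, and 4.
mst, total = boruvka_mst(4, [(0, 1, 1), (1, 2, 2), (0, 2, 3), (2, 3, 4)])
```

The per-round "cheapest outgoing edge" scan is what maps naturally onto edge-centric GPU kernels.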
Posters
Research Posters
TP
XO/EX
Description: Numerical simulations require solving linear systems with large sparse matrices that have high condition numbers. LDU factorization with a pivoting strategy provides a robust solver for such systems. The computational complexity of the factorization solver is high and cannot be reduced within the framework of a direct solver, but using lower-precision arithmetic can reduce computational cost and memory usage. LDU factorization generates Schur complement matrices recursively, and generation of the last one can be replaced by an iterative method. Here, decomposition of the whole matrix into a union of moderate and hard parts during factorization with threshold pivoting plays a key role. A new algorithm uses factorization in lower precision as a preconditioner for an iterative solver in higher precision to generate the last Schur complement. True mixed-precision arithmetic is used in the forward/backward substitution, with the factorized matrix in lower precision and the RHS vectors in higher precision.
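The poster's LDU factorization with threshold pivoting is considerably more elaborate, but the core idea, a low-precision factorization preconditioning a higher-precision iterative solve, can be sketched with NumPy. This sketch uses a dense float32 solve as a stand-in for the low-precision LDU factorization; it is an illustration, not the poster's algorithm:

```python
import numpy as np

def mixed_precision_solve(A, b, iters=20):
    """Iterative refinement: 'factorize' once in float32 (cheap, inexact),
    then repeatedly correct the solution using residuals in float64."""
    A32 = A.astype(np.float32)
    # Initial solve with the low-precision operator.
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x  # residual computed in full float64 precision
        dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
        x += dx
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = mixed_precision_solve(A, b)
```

For well-conditioned blocks (the "moderate" part in the poster's terminology) each correction step shrinks the residual by roughly the float32 unit roundoff, so full float64 accuracy is recovered after a handful of cheap iterations.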
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
Modeling and Simulation
Visualization
TP
XO/EX
Description: The Advanced Visualization Lab at NCSA created a cinematic scientific visualization showing a flight through the Milky Way galaxy to the galactic center, where stars orbit a supermassive black hole. The tour summarizes results from Andrea Ghez's Galactic Center Group: their study of the motions of stars around the Milky Way's central black hole reveals a rich and surprising environment, with hot young stars (coded as purple) where few were expected to be, many orbiting in a common plane; a paucity of cooler old stars (yellow); a population of unexpected "G-object" dusty stars (red); and an eclipsing binary star (teal). The black hole itself, shrouded in mystery, is seen only as a tiny faint twinkling radio source. But the movement of these nearby stars, especially the S0-2 "hero" (pale blue ellipse), probes the black hole's gravity, exposing its massive presence.
Posters
Research Posters
TP
XO/EX
Description: Identifying genetic mutations is pivotal to enabling clinicians to prescribe personalized therapies for their patients. The Genome Analysis ToolKit's HaplotypeCaller, relying on the Pair Hidden Markov Model (PairHMM) algorithm, is one of the most widely used applications for identifying such variants. However, the PairHMM is the bottleneck of this tool. Deploying the algorithm on hardware accelerators is a valuable solution. Nevertheless, state-of-the-art designs lack the flexibility to support the length variability of the input sequences and are not usable in real-life application scenarios. For these reasons, this work presents a GPU accelerator for the PairHMM capable of supporting sequences of any length, thanks to a dynamic memory swap methodology, overcoming the limitations of literature solutions. Our accelerator achieves an 8154× speedup over the software baseline, surpassing the most performant state-of-the-art design by up to 1.6×.
Birds of a Feather
Cloud Computing
Distributed Computing
TP
XO/EX
Description: We are building a National Science Data Fabric (NSDF) that introduces a novel trans-disciplinary approach for integrated data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can democratize data-driven scientific discovery across the growing data science community. In this BoF, we want to engage the data science community to discuss the challenges and opportunities of the NSDF project and other similar efforts to connect an open network of institutions, including resource-disadvantaged institutions, and develop a federated testbed configurable for individual and shared scientific use.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: We present a simple performance model to estimate the qubit-count and runtime associated with large-scale error-corrected quantum computations. Our estimates extrapolate current usage costs of quantum computers and show that computing the ground state of the 2D Hubbard model, which is widely believed to be an early candidate for practical quantum advantage, could start at a million dollars. Our model shows a clear cost advantage of up to four orders of magnitude for quantum processors based on superconducting technology compared to ion trap devices. Our analysis shows that usage costs, while substantial, will not necessarily block the road to practical quantum advantage. Furthermore, the combined effects of algorithmic improvements, more efficient error correction codes, and R&D cost amortization are likely to lead to orders of magnitude reductions in cost.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Description: The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches.
To this end, this work evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
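Performance-portability scores like the 0.96 quoted here are commonly computed with Pennycook et al.'s metric: the harmonic mean of the application's efficiency on each platform in the evaluation set. A sketch of that metric (an illustration of the standard definition, not this paper's code):

```python
def performance_portability(efficiencies):
    """Harmonic mean of per-platform efficiencies (each in [0, 1]).
    Returns 0 if the application fails to run on any platform, since a
    single failure makes the harmonic mean collapse to zero."""
    if any(e == 0 for e in efficiencies):
        return 0.0
    return len(efficiencies) / sum(1.0 / e for e in efficiencies)

# Near-equal efficiency on three GPUs yields a score near each value.
score = performance_portability([0.95, 0.97, 0.96])
```

Because the harmonic mean is dominated by the worst platform, a high score like 0.96 implies the SYCL code runs efficiently on all three vendors' GPUs, not just on average.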
Posters
Research Posters
TP
XO/EX
Description: A software tool, called SPEL, has been developed to port and optimize the ultrahigh-resolution ELM (uELM) code for GPUs within a functional unit test framework. To promote widespread adoption of this approach for community-based uELM development, this poster presents a portable software environment that enables efficient development of the uELM code on GPUs. The standalone software environment, which utilizes Docker, contains all the code, libraries, and system software required for uELM development using SPEL. The process includes identifying a Docker image that supports GPUs, configuring and simulating ELM at the site level, capturing reference solutions, testing uELM functional units, and generating and optimizing GPU-compatible code. The effectiveness of this methodology is demonstrated through a case study.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Description: Memory disaggregation has recently been adopted in major data centers to improve resource utilization, driven by cost and sustainability. Meanwhile, studies on large-scale HPC facilities have also highlighted memory under-utilization. A promising and non-disruptive option for memory disaggregation is rack-scale memory pooling, where node-local memory is supplemented by shared memory pools. This work outlines the prospects and requirements for adoption and clarifies several misconceptions. We propose a quantitative method for dissecting application requirements on the memory system in three levels, moving from general, to multi-tier memory, and then to memory pooling. We also provide tools to facilitate the quantitative approach. We evaluated a set of representative HPC workloads on an emulated platform. Our results show that interference in memory pooling has varied application impact, depending on access ratio and arithmetic intensity. Finally, our method is applied in two case studies to show benefits at both the application and system level.
Keynote
TP
W
TUT
XO/EX
Description: Dr. Hakeem Oluseyi grew up in some of the roughest neighborhoods in the country. As a result, he spent a lot of time inside, reading encyclopedias and watching PBS nature shows. At a young age, he discovered a love of science and space that was inspired by his role model, Albert Einstein. Throughout his childhood and into young adulthood, he was repeatedly faced with circumstances that would make most people give up: a lack of supervision at home, attending his state's lowest-rated school, falling in with the wrong crowd, and failing physics exams when he ultimately made his way to Stanford. But Hakeem never gave up.
Today, as a world-renowned astrophysicist and the former Space Science Education Lead at NASA, Hakeem inspires audiences around the world to chase impossible dreams, fight for what they want, refuse to listen to naysayers, and reach out and lend a hand up to those around them. Hilarious, honest, and inspiring, Hakeem wows audiences with a look at his mind-bending scientific research while motivating them with his personal life story.
Workshop
Quantum Computing
Software Engineering
W
Description: Practical applications of quantum computing are currently limited by the number of qubits that can operate with reasonable fidelity in each system. A distributed quantum computing system, with multiple quantum computers coherently connected, is therefore in high demand. To realize internode communication of quantum information, a software interface, the Quantum Message Passing Interface (QMPI), was proposed; it leverages the framework built for classical MPI but takes advantage of quantum teleportation to communicate between quantum nodes. In this project, we develop QMPI with point-to-point and collective operations in Qiskit and characterize its performance through application implementations. Moreover, we developed a new technique for optimizing collective communication of distributed quantum programs with Multi-Controlled Toffoli gates. This technique beats the state of the art in terms of fidelity and the number of remote EPR pairs consumed, in both simulations and experiments.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: High-performance computing (HPC) systems are essential for various scientific fields, and effective job scheduling is crucial for their performance. Traditional backfilling techniques, such as EASY backfilling, rely on user-submitted runtime estimates, which can be inaccurate and lead to suboptimal scheduling. This poster presents RL-Backfiller, a novel reinforcement learning (RL) based approach to improving HPC job scheduling. Our method uses RL to make better backfilling decisions, independent of user-submitted runtime estimates. We trained RL-Backfiller on the synthetic Lublin-256 workload and tested it on the real SDSC-SP2 1998 workload. We show how RL-Backfiller can learn effective backfilling strategies and outperform traditional EASY backfilling and other heuristic combinations via trial and error on existing job traces. Our evaluation results show up to 17x better scheduling performance (based on average bounded job slowdown) compared to EASY backfilling.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionHPC systems employ a scheduling technique called “backfilling”, wherein low-priority jobs are scheduled earlier to use resources that would otherwise sit idle waiting for pending high-priority jobs. Backfilling relies on job runtimes to calculate the start time of ready-to-schedule jobs and avoid delaying them. It is a common belief that better estimates of job runtime will lead to better backfilling and more effective scheduling. However, our experiments show a different conclusion: there is an overlooked trade-off between prediction accuracy and backfilling opportunities. To learn how to achieve the best trade-off, we believe reinforcement learning (RL) can be effectively leveraged. Based on this idea, we designed RLBackfilling, a reinforcement-learning-based backfilling algorithm. Our evaluation results show up to 17x better scheduling performance compared to EASY backfilling using user-provided job runtimes, and 4.7x better performance compared with EASY using ideally predicted job runtimes (the actual job runtimes).
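As context for the trade-off discussed above, here is a minimal sketch of the classic EASY backfilling baseline (illustrative only, not the RLBackfilling algorithm; the job fields and helper names are hypothetical):

```python
# Simplified EASY backfilling: start the head job if it fits, otherwise
# backfill later jobs that will not delay the head job's reservation.
from dataclasses import dataclass

@dataclass
class Job:
    id: int
    nodes: int          # nodes requested
    estimate: float     # user-provided runtime estimate (seconds)

def easy_backfill(queue, free_nodes, running, now):
    """Return jobs to start now without delaying the head-of-queue job.

    queue: jobs in priority order; running: list of (finish_time, nodes).
    Simplified: starts at most the head, or backfills behind a blocked head.
    """
    if not queue:
        return []
    head = queue[0]
    if head.nodes <= free_nodes:
        return [head]                    # head fits: start it immediately
    # Head is blocked: find its reservation ("shadow") time, the earliest
    # point at which enough running jobs finish to free head.nodes nodes.
    avail, shadow = free_nodes, now
    for finish, nodes in sorted(running):
        avail += nodes
        if avail >= head.nodes:
            shadow = finish
            break
    # Backfill: start any later job that fits now and is estimated to
    # finish before the shadow time (the extra-nodes case is omitted).
    started = []
    for job in queue[1:]:
        if job.nodes <= free_nodes and now + job.estimate <= shadow:
            started.append(job)
            free_nodes -= job.nodes
    return started
```

Note how the user-supplied `estimate` gates the backfill decision: an overestimate forfeits backfilling opportunities, which is exactly the trade-off the abstract examines.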
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionDistributed scientific applications run on a complex stack of software and network technologies. Each layer has configuration options for tuning performance, ranging from protocol thresholds to algorithmic changes for collectives. Micro-benchmarks are a common methodology for evaluating the communication stack and are relatively easy to tune; however, they are not representative of application behavior. Proxy applications, in contrast, offer a simplified but realistic representation of the main computational and communication methods in scientific programs. Because these proxy applications contain realistic message passing patterns, the correlation between micro-benchmark and proxy application performance is not obvious. We present a statistical analysis of the impact of tuning. Our results show how tuned micro-benchmark performance correlates with tuned proxy application performance.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionAn entire ecosystem of methodologies and tools revolves around scientific workflow management. They cover crucial non-functional requirements that standard workflow models fail to target, such as interactive execution, energy efficiency, performance portability, Big Data management, and intelligent orchestration in the Computing Continuum. Characterizing and monitoring this ecosystem is crucial to develop an informed view of current and future research directions. This work conducts a systematic mapping study of the Italian workflow research community, collecting and analyzing 25 tools and 10 applications from several scientific domains in the context of the "National Research Centre for HPC, Big Data, and Quantum Computing" (ICSC). The study aims to outline the main current research directions and determine how they address the critical needs of modern scientific applications. The findings highlight a variegated research ecosystem of tools, with a prominent interest in advanced workflow orchestration and still immature but promising efforts toward energy efficiency.
Posters
Research Posters
TP
XO/EX
DescriptionTriangle counting is a cornerstone operation in large-graph analytics. It has historically been a challenging problem, owing to the irregular and dynamic nature of the algorithm, which not only inhibits compile-time optimizations but also requires runtime optimizations such as message aggregation and load-imbalance mitigation. Popular triangle counting algorithms are either inherently slow, fail to take advantage of the vectorization available in modern processors, or involve sparse matrix operations. With its support for fine-grained asynchronous messages, the Partitioned Global Address Space (PGAS) model with the Actor model has been identified as efficient for irregular applications. However, few triangle counting implementations have been optimally implemented on top of PGAS Actor runtimes. To address the above-mentioned challenges, we propose a set-intersection-based implementation of a distributed triangle counting algorithm atop the PGAS Actor runtime. Evaluation of our approach on the PACE Phoenix cluster and the Perlmutter supercomputer shows encouraging results.
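The set-intersection kernel at the heart of such algorithms can be sketched serially as follows (an illustrative sketch only; the distributed PGAS/Actor machinery described in the abstract is beyond this snippet):

```python
# Serial set-intersection triangle counting: for each edge (u, v), the
# triangles through that edge are the common neighbors of u and v.
def count_triangles(adj):
    """adj: dict mapping vertex -> set of neighbors (undirected graph)."""
    total = 0
    for u, nbrs in adj.items():
        for v in nbrs:
            if u < v:  # visit each edge once, oriented low -> high
                # Only count common neighbors above v, so each triangle
                # (u < v < w) is counted exactly once.
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total
```

The set intersections are independent per edge, which is what makes the algorithm amenable to fine-grained asynchronous distribution.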
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionGPU-aware collective communication has become a major bottleneck for modern computing platforms as GPU computing power rapidly rises. To address this issue, traditional approaches integrate lossy compression directly into GPU-aware collectives, which still suffer from serious issues such as underutilized GPU devices and uncontrolled data distortion.
In this poster, we propose GPU-LCC, a general framework that designs and optimizes GPU-aware, compression-enabled collectives with well-controlled error propagation. To validate our framework, we evaluate the performance on up to 512 NVIDIA A100 GPUs with real-world applications and datasets. Experimental results demonstrate that our GPU-LCC-accelerated collective computation (Allreduce), can outperform NCCL as well as Cray MPI by up to 4.5X and 20.2X, respectively. Furthermore, our accuracy evaluation with an image-stacking application confirms the high reconstructed data quality of our accuracy-aware framework.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
DescriptionAdvances in networks, accelerators, and cloud services encourage programmers to reconsider where to compute---such as when fast networks make it cost-effective to compute on remote accelerators despite added latency. Workflow and cloud-hosted serverless computing frameworks can manage multi-step computations spanning federated collections of cloud, high-performance computing (HPC), and edge systems, but passing data among computational steps via cloud storage can incur high costs. Here, we overcome this obstacle with a new programming paradigm that decouples control flow from data flow by extending the pass-by-reference model to distributed applications. We describe ProxyStore, a system that implements this paradigm by providing object proxies that act as wide-area object references with just-in-time resolution. This proxy model enables data producers to communicate data unilaterally, transparently, and efficiently to both local and remote consumers. We demonstrate the benefits of this model with synthetic benchmarks and real-world scientific applications, running across various computing platforms.
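The proxy idea can be sketched in a few lines (a hedged illustration of pass-by-reference with just-in-time resolution; the class and method names here are hypothetical, not ProxyStore's actual API):

```python
# A proxy carries only a key into a shared store; the real object is
# fetched ("resolved") lazily, on first use by the consumer.
class Store:
    def __init__(self):
        self._data = {}
    def put(self, key, obj):
        self._data[key] = obj
    def get(self, key):
        return self._data[key]

class Proxy:
    def __init__(self, store, key):
        self._store, self._key, self._obj = store, key, None
    def resolve(self):
        if self._obj is None:           # just-in-time resolution
            self._obj = self._store.get(self._key)
        return self._obj

store = Store()
store.put("result:42", [1, 2, 3])       # producer writes unilaterally
p = Proxy(store, "result:42")           # only the lightweight reference
# ... the proxy travels the control-flow path (cheap to pass around) ...
assert p.resolve() == [1, 2, 3]         # consumer pays the data cost on use
```

The point of the pattern is that control flow moves only the small reference, while the bulk data moves directly from store to consumer, and only if actually needed.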
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionIn recent years, we have seen unprecedented growth of data in our daily lives, ranging from health data from an Apple Watch and financial stock price data to volatile cryptocurrency data and diagnostic data from nuclear/rocket simulations. The increase in high-precision, high-sample-rate time-series data is a challenge to existing database technologies. We have developed a novel technique that utilizes sparse-file support to achieve O(1) time complexity in create, read, update, and delete (CRUD) operations while supporting time granularity down to one second. We designed and implemented XStore to be lightweight and offer high performance without the need to maintain an index of the time-series data. We conducted a detailed evaluation of XStore against existing best-of-breed systems such as MongoDB, using synthetic data spanning 20 years at second granularity and totaling over 5 billion datapoints. In empirical experiments against MongoDB, XStore achieves 2.5X better latency and delivers up to 3X improvement in throughput.
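The sparse-file idea behind index-free O(1) CRUD can be sketched as follows (an illustrative sketch assuming one fixed-size record per second; the record layout and names are hypothetical, not XStore's actual format):

```python
# Derive the file offset directly from the timestamp, so every read and
# write is a single O(1) seek, and unwritten seconds occupy no disk
# blocks on filesystems with sparse-file support.
import os, struct, tempfile

RECORD = struct.Struct("<d")   # one float64 sample per second
EPOCH = 0                      # series start time (assumed)

def offset(ts):
    return (ts - EPOCH) * RECORD.size

def write_sample(f, ts, value):
    f.seek(offset(ts))         # O(1): no index lookup needed
    f.write(RECORD.pack(value))

def read_sample(f, ts):
    f.seek(offset(ts))
    data = f.read(RECORD.size)
    return RECORD.unpack(data)[0] if len(data) == RECORD.size else None

path = os.path.join(tempfile.mkdtemp(), "series.dat")
with open(path, "w+b") as f:
    write_sample(f, 10, 1.5)
    write_sample(f, 1_000_000, 2.5)   # large gap: a sparse hole, not data
    assert read_sample(f, 10) == 1.5
    assert read_sample(f, 1_000_000) == 2.5
```

Because the mapping from timestamp to offset is arithmetic, no secondary index needs to be maintained or consulted, which is the property the abstract highlights.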
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionHPC not only performs complex calculations at high speed but also processes large amounts of data. HPC systems separate compute nodes and storage nodes to process both effectively: all computation is performed on compute nodes, and all data is stored on storage nodes. To perform data analytics, compute nodes have to read large amounts of data from storage nodes because simulation output is large. Compute nodes must have enough memory to hold extremely large data sets, and bandwidth from storage can become a bottleneck as well. However, the data actually required for analytics is only a small part of the total. One solution to this problem is computational storage. Because computational storage processes data where it resides and transfers only results to compute nodes, it can reduce data movement and increase performance. SK hynix is researching computational storage technologies with Los Alamos National Laboratory. We propose Object-based Computational Storage (OCS) as a new computational storage platform for data analytics in HPC. OCS is not only highly scalable but also data-aware; this data-aware design enables OCS to perform analytics independently, without help from compute nodes. We intend to leverage the Apache analytics ecosystem, including Arrow and Substrait, enhancing that ecosystem with the advantages that computing near storage enables. Systems that use Arrow can transfer query results using a common transfer format, and Substrait provides a standard, open representation of query plans, enabling pushdown of query portions to computational storage. SK hynix's key technology for OCS is the Object-based Computational Storage Array (OCSA), used as backend storage. With OCSA, OCS will provide flexible query pushdown and analytics acceleration with less software overhead. This talk will introduce the OCS architecture and discuss why we propose OCS as a future direction for computational storage in HPC.
Exhibits
Flash Session
TP
XO/EX
DescriptionJoin us as we delve into real-world customer experiences with data ingestion and discover how a high-performance, highly scalable SMB server accelerates and eases the process. Learn how SMB can help you address complex data ingestion challenges, and how Fusion File Share by Tuxera enhances the efficiency of the process.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionEarthquake early warning systems use synthetic data from simulation frameworks like MudPy to train models for predicting the magnitudes of large earthquakes. MudPy, although powerful, has limitations: a lengthy simulation time to generate the required data, lack of user-friendliness, and no platform for discovering and sharing its data. We introduce FakeQuakes DAGMan Workflow (FDW), which utilizes Open Science Grid (OSG) for parallel computations to accelerate and streamline MudPy simulations. FDW significantly reduces runtime and increases throughput compared to a single-machine setup. Using FDW, we also explore partitioned parallel HTCondor DAGMan workflows to enhance OSG efficiency. Additionally, we investigate leveraging cyberinfrastructure, such as Virtual Data Collaboratory (VDC), for enhancing MudPy and OSG. Specifically, we simulate using Cloud bursting policies to enforce FDW job-offloading to VDC during OSG peak demand, addressing shared resource issues and user goals; we also discuss VDC’s value in facilitating a platform for broad access to MudPy products.
Exhibits
Flash Session
TP
XO/EX
DescriptionIn the modern business landscape, AI-driven initiatives are inhibited by an overly complicated data management ecosystem. Organizations are struggling to integrate various databases, distributed object stores, filesystems, and divergent data migration techniques. Learn about DDN's next generation approach to resolve the complexities of diverse infrastructures and unlock AI-driven digital transformation.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionHyperparameter optimization (HPO) of neural networks is a computationally expensive procedure that has the potential to benefit from novel accelerator capabilities. This paper investigates the performance of three popular HPO algorithms, in terms of achieved speed-up and model accuracy, utilizing early-stopping, Bayesian, and genetic optimization approaches in combination with the mixed-precision functionality of NVIDIA A100 GPUs with Tensor Cores. The benchmarks are performed on 64 GPUs in parallel on three datasets: two from the vision domain and one from the CFD domain. The results show that, depending on the algorithm, larger speed-ups can be achieved for mixed precision compared to full-precision HPO if the checkpoint frequency is kept low. In addition to the reduced runtime, small gains in generalization performance on the test set are also observed.
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionThe relatively slow data transfer speeds that cause I/O bottlenecks in scientific simulations are one of the critical challenges in exascale computing. Simulations generate large data volumes, and analysis applications consume this data to provide time-critical insights. The limited capacity and high power consumption of Dynamic Random Access Memory (DRAM) leave slow storage devices as the primary option for large-scale data transfers. Non-volatile memory (NVM) devices such as Intel Optane bridge the gap between storage and volatile memory by providing DRAM-comparable performance and persistence. We present PQueue, a data transfer library for in situ analysis of simulation output using persistent memory. PQueue leverages NVM and provides an API that resembles high-level parallel I/O libraries such as PnetCDF, enabling a seamless transition for application developers. We achieved up to a 7X improvement in write times and up to a 10X improvement in read times compared to PnetCDF.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionNVIDIA Grace Hopper Superchips are a scale-up architecture ideal for scientific computing workflows involving CPUs and GPUs. Building on a decade of GPU acceleration, Grace-Hopper realizes NVIDIA NVLink C2C, a 900 GB/s interconnect between the Grace CPU and the Hopper H100 GPU. C2C enables coherent memory at 7x the bandwidth of PCIe across Hopper’s 96GB HBM3 and Grace’s up to 480GB LPDDR5X. This removes the conceptual CPU/GPU memory divide and lowers barriers for scientists accelerating their applications with ever faster GPUs, e.g., H100 delivering up to 67 FP64 teraflops and 4 TB/s memory bandwidth. With more application code executing on GPUs, workload performance becomes increasingly susceptible to non-GPU limiters like data movement and CPU performance (Amdahl’s Law). C2C and the Grace CPU, ideal for single-thread or multi-core CPU workloads, restore the required balance. Grace combines 72 Arm Neoverse-V2 cores with NVIDIA Scalable Coherency Fabric, a distributed cache and mesh fabric with 3.2 TB/s bi-section bandwidth. This high-bandwidth mesh enables one NUMA node for all 72 CPU cores, simplifying multi-core programming. Each core implements a 512-bit SVE2 SIMD pipeline for a total CPU FP64 theoretical peak of 7.1 teraflops. When combined with the up to 500 GB/s memory bandwidth of the LPDDR5X DRAM, Grace delivers twice the performance-per-Watt of conventional x86-64 CPUs. This session presents HPC and AI workload performance results with a technical deep-dive into the specific features of Grace-Hopper that accelerate each workload. We discuss how Grace-Hopper's distinctive coupling of the CPU/GPU hardware and the accompanying software stack create a platform which increases developer productivity, accelerates existing applications, and facilitates new standard programming models in C++, Fortran, and Python.
Attendees will gain a deeper understanding of how to extract the performance offered by Grace-Hopper and realize the potential of this innovative, energy-efficient platform for science and industry.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionStorage IO is becoming more of a bottleneck, especially for a new generation of AI-based workloads that are accelerated by GPUs. This session will provide a brief overview of key trends, available solutions presented as lightning talks, and illustrative application performance gains in this space. The majority of the session will engage in an open, forward-looking discussion with the gathered community on promising areas for investigation. Presenters will include those from academia and industry with new and challenging applications, storage partners with characterization, and innovators with new solutions in GPU-initiated storage and greater security. Join us for an exciting exchange!
Workshop
State of the Practice
W
DescriptionThe existing HPC I/O stack struggles with the growing demands of HPC scientific workloads. Consider the latency bottleneck: a deeply layered kernel hierarchy translates HPC I/O requests into actual storage operations, and this layered architecture adds significant overhead along the entire I/O request path. Measurements have shown that it takes between 18,000 and 20,000 instructions to send and receive a single fundamental 4KB I/O request. Our novel hardware/software framework, named DeLiBA, aims to bridge this gap by facilitating the development of software components of the HPC I/O stack in user space rather than kernel space, and leverages a proven 16 nanometer (nm) FPGA framework to quickly deploy FPGA-based HPC I/O accelerators. Our initial results achieve a 10% increase in throughput and demonstrate up to 2.3 times the I/O operations per second compared to conventional methods.
Exhibits
Flash Session
TP
XO/EX
DescriptionIn the world of high-performance computing, networks serve as vital conduits for data transmission, security, and application delivery. However, the ever-evolving nature of network traffic demands adaptive solutions. This presentation describes the challenges in optimizing and evolving network application performance and the unique role FPGAs can play in accelerating next-generation HPC networks.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionHeterogeneous Intellectual Property (IP) hardware acceleration engines have emerged as a viable path forward to improving performance in the waning of Moore’s Law and Dennard scaling. In this study, we design, prototype, and evaluate the HPC-specialized ZHW floating-point compression accelerator as a resource on a System on Chip (SoC). Our full hardware/software implementation and evaluation reveal inefficiencies at the system level that significantly throttle the potential speedup of the ZHW accelerator. By optimizing data movement among the CPU, memory, and accelerator, a 6.9X speedup over a RISC-V64 core is possible, and 2.9X over a Mac M1 ARM core.
Birds of a Feather
Distributed Computing
State of the Practice
TP
XO/EX
DescriptionThe ACCESS Resource Providers (RPs) will give an overview of the available resources and their unique characteristics. These resources are open to a broad audience of computational researchers. Individuals can apply for allocations by submitting a request to ACCESS. Once this request is approved, they can exchange their awarded service units for resources at one or several of the providers (e.g., node hours, GPU hours, storage).
The presentations will highlight the variety of resources and will be followed by a discussion with the community, allowing the audience to directly interact with the RPs.
Visit https://app.meet.ps/attendee/fcqctplo to submit questions beforehand
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionThe accurate and efficient determination of hydrologic connectivity has garnered significant attention from both academic and industrial sectors due to its critical implications for environmental management. While recent studies have leveraged the spatial characteristics of hydrologic features, the use of elevation models for identifying drainage paths can be influenced by flow barriers. To address these challenges, our focus in this study is on detecting drainage crossings through the application of advanced convolutional neural networks (CNNs). In pursuit of this goal, we use neural architecture search to automatically explore CNN models for identifying drainage crossings. Our approach not only attains high accuracy (over 97% for average precision) in object detection but also excels in efficiently inferring correct drainage crossings within a remarkably short time frame (0.268 ms). Furthermore, we perform a detailed profiling of our approach on GPU systems to analyze performance bottlenecks.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionSustainability in HPC is a major challenge not only for HPC centers and their users, but also for society. A lot of effort went into reducing the energy consumption of systems, but most efforts propose solutions targeting CPUs. As HPC systems shift more to GPU-centric architectures, simulation codes increasingly adopt GPU-programming models, leading to an urgent need to increase the energy-efficiency of GPU-enabled codes. However, studies for reducing the energy consumption of large-scale simulations executing on CPUs and GPUs have received insufficient attention.
In this work, we enable accurate energy measurements using an open-source toolkit across CPU+GPU architectures. We use this approach in SPH-EXA, an open-source GPU-centric astrophysical and cosmological simulation framework showing that with code instrumentation, users can accurately measure energy consumption of their application, beyond the data provided by HPC systems. The accurate energy data provide significant insights to users for conducting energy-aware computational experiments and code development.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionPerformance variability in complex computer systems is a major challenge for accurate benchmarking and performance characterization, especially for tightly-coupled large-scale high-performance computing systems. Point summaries of performance may be both uninformative, if they do not capture the full richness of its behavior, and inaccurate, if they are derived from an inadequate sample set of measurements. Determining the correct sample size requires balancing tradeoffs of computation, methodology, and statistical power.
We treat the performance distribution as the primary target of the performance evaluation, from which all other metrics can be derived. We propose and evaluate a meta-heuristic that dynamically characterizes the performance distribution, determining when enough samples have been collected to approximate the true distribution. Compared to fixed stopping criteria, this adaptive method can be more efficient in resource use and more accurate. Importantly, it requires no advance assumptions about the system under test or its performance characteristics.
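The general flavor of such an adaptive stopping rule can be sketched as follows (a hedged illustration, not the paper's meta-heuristic: sample in batches until the empirical distribution stops changing, here measured by the Kolmogorov-Smirnov distance between successive empirical CDFs):

```python
# Adaptive sampling: grow the sample set in batches and stop once the
# empirical distribution has stabilized, instead of using a fixed count.
import random

def ks_distance(a, b):
    """Max |ECDF_a(x) - ECDF_b(x)| over all observed points."""
    pts = sorted(set(a) | set(b))
    sa, sb = sorted(a), sorted(b)
    def ecdf(s, x):                      # fraction of s that is <= x
        lo, hi = 0, len(s)
        while lo < hi:                   # binary search for rank of x
            mid = (lo + hi) // 2
            if s[mid] <= x:
                lo = mid + 1
            else:
                hi = mid
        return lo / len(s)
    return max(abs(ecdf(sa, x) - ecdf(sb, x)) for x in pts)

def sample_until_stable(draw, batch=50, eps=0.05, max_samples=10_000):
    samples = [draw() for _ in range(batch)]
    while len(samples) < max_samples:
        prev = list(samples)
        samples.extend(draw() for _ in range(batch))
        if ks_distance(prev, samples) < eps:   # distribution stabilized
            break
    return samples

random.seed(0)
s = sample_until_stable(lambda: random.gauss(0, 1))
```

The stopping decision here needs no prior assumptions about the underlying distribution, which mirrors the assumption-free property the abstract emphasizes.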
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionGlobal ocean data assimilation is a crucial technique for estimating the actual oceanic state by combining numerical model outcomes and observation data, and it is widely used in climate research. Due to the imbalanced distribution of observation data in the global ocean, the parallel efficiency of recent methods suffers from workload imbalance. When massive numbers of GPUs are applied to global ocean data assimilation, the workload imbalance becomes more severe, resulting in poor scalability. In this work, we propose a novel adaptive workload-balancing scheduling strategy for assimilation, which successfully estimates the total workload prior to execution and ensures a balanced workload assignment. Further, we design a parallel dynamic programming approach to accelerate the scheduling decision, and develop a factored dataflow to exploit the parallel potential of GPUs. Evaluation demonstrates that our algorithm outperforms the state-of-the-art method by up to 9.1x. This work is the first to scale global ocean data assimilation to 4,000 GPUs.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionThis lightning talk will highlight how several aspects of sustainability can frame the programming themes for a senior-level parallel computing class. We used the shallow water equation as a theme in our assignments, from serial C and MPI, through OpenMP/PThreads, to CUDA. By framing the problem sets in the setting of sustainability, both in terms of power usage/performance and in motivating the problem we are solving (the shallow water equation) in terms of sustainability/environmental impact, our goal is to help stimulate the students to really get excited about our field and “spread the word”.
The inspiration for the core idea of this work came from attending the CDER 2022 PDC training workshop. It led to an ongoing related miniproject sponsored by Norway's national Excited Centre of Excellent IT Education.
Tutorial
Data Analysis, Visualization, and Storage
I/O and File Systems
Large Scale Systems
Performance Measurement, Modeling, and Tools
TUT
DescriptionAs concurrency and complexity continue to increase on high-end machines, storage I/O performance is rapidly becoming a fundamental challenge to scientific discovery. At the exascale, online analysis will become a dominant form of data analytics, and thus scalable in situ workflows will become critical, along with high performance I/O to storage. The many components of a workflow running simultaneously pose another challenge of evaluating and improving the performance of these workflows. Therefore, performance data collection needs to be an integral part of the entire workflow.
In this tutorial, we present ADIOS-2 which allows for building in situ and file-based data processing workflows for extreme scale systems, including interactive, on-demand, in situ visualization of the data, and including performance profiling of the entire workflow. Half of this tutorial will be hands-on sessions, where we provide access to the software, and build together a complete MiniApp with in situ analytics and performance analysis that users can run on their laptop and supercomputers at large scale. We will show how ADIOS-2 is fully integrated into three popular visualization and performance tools: Jupyter Notebook, ParaView and TAU, creating a software ecosystem for in situ processing of both performance and scientific data.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionSZ is a lossy floating-point data compressor that excels in compression ratio and throughput for high-performance computing (HPC), time series databases, and deep learning applications. However, SZ performs poorly for small chunks and has slow decompression. We pinpoint the Huffman tree in the quantization factor encoder as the bottleneck of SZ. In this paper, we propose ADT-FSE, a new quantization factor encoder for SZ. Based on the Gaussian distribution of quantization factors, we design an adaptive data transcoding (ADT) scheme to map quantization factors to codes for better compressibility, and then use finite state entropy (FSE) to compress the codes. Experiments show that ADT-FSE improves the quantization factor compression ratio, compression and decompression throughput by up to 5x, 2x and 8x, respectively, over the original SZ Huffman encoder. On average, SZ_ADT is over 2x faster than ZFP in decompression.
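The transcoding idea can be illustrated with a simple zigzag remap (a hedged sketch of the general technique only; ADT's actual code tables and the FSE stage are not reproduced here): because quantization factors cluster around a center value, mapping the signed deviation to a small unsigned code concentrates the symbol distribution for the entropy coder.

```python
# Zigzag mapping: interleave signed deviations so that values near the
# center of the (roughly Gaussian) distribution get the smallest codes.
def zigzag(delta):
    # 0, -1, 1, -2, 2, ... -> 0, 1, 2, 3, 4, ...
    return (delta << 1) if delta >= 0 else (-(delta << 1)) - 1

def transcode(factors, center):
    """Map quantization factors to entropy-coder-friendly codes."""
    return [zigzag(f - center) for f in factors]

# Example: factors hugging a center of 128 collapse onto tiny codes,
# a skewed symbol distribution that an entropy coder compresses well.
codes = transcode([128, 127, 129, 128, 126], 128)
assert codes == [0, 1, 2, 0, 3]
```

After this remap, the code stream is dominated by a few small symbols, which is precisely the situation in which a finite state entropy coder approaches the distribution's entropy.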
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionTestbeds play a vital role in assessing the readiness of novel architectures for upcoming supercomputers for the exascale and post-exascale era. These testbeds also act as co-design hubs, enabling the collection of application operational requirements, while identifying critical gaps that need to be addressed for an architecture to become viable for HPC. Various research centers are actively deploying testbeds, and our aim is to build a community that facilitates the sharing of information, encouraging collaboration and understanding of the available evaluation resources. This BoF will facilitate the exchange of best practices, including testbed design, benchmarking, system evaluation, and availability.
Tutorial
Algorithms
Message Passing
Performance Optimization
TUT
DescriptionThe vast majority of production parallel scientific applications today use MPI and run successfully on the largest systems in the world. Parallel system architectures are evolving to include complex, heterogeneous nodes comprising general-purpose CPUs as well as accelerators such as GPUs. At the same time, the MPI standard itself is evolving to address the needs and challenges of future extreme-scale platforms as well as applications. This tutorial will cover several advanced features of MPI that can help users program modern systems effectively. Using code examples based on scenarios found in real applications, we will cover several topics including efficient ways of doing 2D and 3D stencil computation, derived datatypes, one-sided communication, hybrid programming (MPI + threads, shared memory, GPUs), topologies and topology mapping, neighborhood and nonblocking collectives, and some of the new performance-oriented features in MPI-4. Attendees will leave the tutorial with an understanding of how to use these advanced features of MPI and guidelines on how they might perform on different platforms and architectures.
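As a taste of the stencil material, the neighbor bookkeeping that MPI's Cartesian topology routines (MPI_Cart_create/MPI_Cart_shift) perform for a 2D halo exchange can be sketched in plain Python. This is an illustrative reimplementation, not part of any MPI binding; the function name and the row-major rank ordering are assumptions.

```python
def cart_neighbors(rank, dims, periodic=True):
    """Neighbor ranks of `rank` in a row-major 2D process grid, in the
    spirit of MPI_Cart_create/MPI_Cart_shift (plain-Python sketch)."""
    py, px = dims
    y, x = divmod(rank, px)

    def wrap(c, n):
        if periodic:
            return c % n
        return c if 0 <= c < n else None  # stands in for MPI_PROC_NULL

    def to_rank(yy, xx):
        return None if yy is None or xx is None else yy * px + xx

    return {
        "north": to_rank(wrap(y - 1, py), x),
        "south": to_rank(wrap(y + 1, py), x),
        "west": to_rank(y, wrap(x - 1, px)),
        "east": to_rank(y, wrap(x + 1, px)),
    }

# Rank 0 of a periodic 3x4 grid wraps around on its top and left edges
assert cart_neighbors(0, (3, 4)) == {"north": 8, "south": 4,
                                     "west": 3, "east": 1}
```

In real MPI code, each rank would post nonblocking sends/receives to these four neighbors to exchange halo regions; the tutorial covers how neighborhood collectives can express the same pattern more directly.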
Tutorial
Accelerators
Heterogeneous Computing
Performance Optimization
TUT
DescriptionWith the increasing prevalence of multicore processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Developers usually find OpenMP easy to learn. However, they are often disappointed with the performance and scalability of the resulting code. This disappointment stems not from shortcomings of OpenMP, but rather from the lack of depth with which it is employed. Our “Advanced OpenMP Programming” tutorial addresses this critical need by exploring the implications of possible OpenMP parallelization strategies, both in terms of correctness and performance.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We focus on performance aspects, such as data and thread locality on NUMA architectures, false sharing, and exploitation of vector units. All topics are accompanied by extensive case studies, and we discuss the corresponding language features in-depth. Continuing the emphasis of this successful tutorial series, we focus solely on performance programming for multi-core architectures. Throughout all topics, we present the recent additions of OpenMP 5.2 and comment on developments targeting OpenMP 6.0.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionFPGAs have gone from niche components to being a central part of many data centers worldwide. The last year has seen tremendous advances in FPGA programmability and technology, especially in the shift to reconfigurable architectures that are heterogeneous and/or based on CGRAs or other AI engines. This BoF has two parts. The first is a series of lightning talks presenting advances in tools, technologies, and use-cases for these emerging architectures. The second part of the BoF will be a general discussion driven by the interests of the attendees, potentially including additional topics.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionAs supercomputers become larger, with powerful Graphics Processing Units (GPUs), traditional direct eigensolvers struggle to keep up with the hardware evolution and to scale efficiently due to communication and synchronization demands. Subspace eigensolvers, like the Chebyshev Accelerated Subspace Eigensolver (ChASE), have a simpler structure and can overcome communication and synchronization bottlenecks. ChASE is a modern subspace eigensolver that uses Chebyshev polynomials to accelerate the computation of extremal eigenpairs of dense Hermitian eigenproblems. In this work, we show how we have modified ChASE by rethinking its memory layout, introducing a novel parallelization scheme, switching to a higher-performing communication-avoiding algorithm for one of its inner modules, and substituting the MPI library with the vendor-optimized NCCL library. The resulting library can tackle dense problems of size up to N=O(10^6), and scales effortlessly up to the full 900 nodes---each powered by 4x A100 NVIDIA GPUs---of the JUWELS Booster hosted at the Jülich Supercomputing Centre.
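The Chebyshev acceleration at the heart of such filters rests on the three-term recurrence T_{k+1}(x) = 2x·T_k(x) − T_{k−1}(x): mapping the unwanted part of the spectrum onto [−1, 1] damps it, while extremal eigencomponents grow rapidly with the polynomial degree. A minimal dense sketch of the idea (illustrative only, a long way from ChASE's distributed GPU implementation; the matrix and interval below are made up):

```python
def matvec(A, v):
    # Dense matrix-vector product (plain Python, illustration only)
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def chebyshev_filter(A, v, degree, a, b):
    """Apply a Chebyshev polynomial in A to v, with the interval [a, b]
    mapped onto [-1, 1] so components whose eigenvalues lie inside it are
    damped while extremal components grow rapidly.
    Uses the recurrence T_{k+1}(x) = 2x T_k(x) - T_{k-1}(x)."""
    e = (b - a) / 2.0  # half-width of the damped interval
    c = (b + a) / 2.0  # its center
    y = [(t - c * s) / e for t, s in zip(matvec(A, v), v)]  # T_1 term
    prev = v                                                # T_0 term
    for _ in range(degree - 1):
        y, prev = ([2 * (t - c * s) / e - p
                    for t, s, p in zip(matvec(A, y), y, prev)], y)
    return y

# Eigenvalues 0.1 and 0.5 lie in the damped interval [0, 1]; 3.0 does not,
# so its eigencomponent dominates after filtering.
A = [[0.1, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 3.0]]
y = chebyshev_filter(A, [1.0, 1.0, 1.0], degree=4, a=0.0, b=1.0)
assert abs(y[2]) > 100 * abs(y[0]) and abs(y[2]) > 100 * abs(y[1])
```

Because the filter is built entirely from matrix-vector (in practice, matrix-block) products, it maps naturally onto GPUs and communication-avoiding kernels.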
Birds of a Feather
Applications
TP
XO/EX
DescriptionAgriculture worldwide is facing massive challenges in production, distribution, pollution reduction, and food security and waste: less than 40% of any crop is actually marketed. The farm, the oldest human-engineered system, produces the vast majority of human sustenance and consumes the majority of global freshwater. Its efficient operation is of vital importance, particularly when supply chains are disrupted by wars and pandemics. This BoF will discuss how novel supercomputing technologies and related distributed heterogeneous systems at scale could empower the primary sector so that, as a result, it no longer operates in a needlessly fragile and inefficient way.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionGenerative AI is quickly becoming mainstream and everyone wants a slice of the AI pie. Led by an AI and HPC expert from Penguin Solutions, this exhibitor forum will explore what it takes to deliver AI to the masses – from cost to management of running AI architectures. The speaker will discuss options available for companies to scale their AI infrastructure, including renting AI factories in the cloud with a pay-as-you-go model versus building an AI factory of your own.
Two of the most important questions, without a doubt, concern the cost and management of running AI architectures. Audience members will come away from this forum with real-world insights they can apply directly to their current AI setup. This technical deep dive will also cover the tools you can implement, like Penguin Computing TrueHPC, which can be used with AI solutions to easily build complex, high-performance environments across the many facets of your IT infrastructure.
Want to learn about the pros and cons of building an AI factory in the cloud with a pay-as-you-go model? Or are you more interested in buying or building your very own AI factory? How do cost and performance factor into all of this? This forum will answer those questions and more, leaving audience members with actionable takeaways that can positively impact current AI operations. We’re throwing away the notion that you have to be an established enterprise with deep pockets to run AI models and empower the supercomputing community.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionRecent advances in artificial intelligence show the enormous potential of AI methods. The underlying concept is embedding spaces that represent real-world information. Such embedding spaces have been used to represent, transform, and work with complex information in large language models, but also in many other domains such as climate science or automated driving systems. In this talk, we focus on embedding spaces for programs and use them primarily to assess, analyze, and improve program performance. We start by deriving a first embedding from textual LLVM intermediate representation (IR) and show that it successfully predicts GPU execution times of programs. We then show that textual representations bear the danger of missing context and of being overly sensitive to specific strings. Using a graph-based representation, we improve the embedding to capture relationships such as data dependencies and flows in LLVM IR. Finally, we discuss DaCe's performance metaprogramming capabilities and its programmable graph-based IR. We then demonstrate how a graph neural network (GNN)-based embedding can capture general performance properties. These properties form the concept of Performance Embeddings for Transfer Tuning and can be used to select optimization metaprograms to apply when transforming the IR graph.
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionThe SC23 edition of the Birds of a Feather "Americas High-Performance Computing Collaboration: Global Actions" seeks to showcase collaborations that have resulted from the partnerships formed since the first edition at SC19, presenting opportunities and experiences between different HPC networks and laboratories from countries in North, Central, and South America and those on other continents, mainly Europe. In the BoF, different aspects of the expectations and experiences of collaboration in HPC will be discussed, to feed the continental roadmap. This BoF is a crucial step toward supporting the signature of an MoU to start the formalization of the Americas HPC Collaboration.
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionAs supercomputers advance toward exascale capabilities, computational intensity increases significantly, and the volume of data requiring storage and transmission grows exponentially. Adaptive Mesh Refinement (AMR) has emerged as an effective solution to both challenges, while error-bounded lossy compression is recognized as one of the most efficient approaches to the latter. Despite their respective advantages, few attempts have been made to investigate how AMR and error-bounded lossy compression can function together. To this end, this study presents AMRIC, a novel in-situ lossy compression framework that employs the HDF5 filter to both reduce I/O costs and boost compression quality for AMR applications. We implement our solution in the AMReX framework and evaluate it on two real-world AMR applications, Nyx and WarpX, on the Summit supercomputer. Experiments with 512 cores demonstrate that AMRIC improves the compression ratio by 81x and the I/O performance by 39x over AMReX's original compression solution.
Workshop
State of the Practice
W
DescriptionAs high-performance computing approaches the exascale era, analyzing the vast amount of monitoring data generated by supercomputers has become increasingly challenging for data analysts. The detection of change points, which plays a critical role in anomaly detection, performance optimization, and root-cause analysis of problems and failures, has grown beyond human capacity for manual review. To address this issue, we focus on developing an effective model capable of identifying anomalous behavior, and to achieve this we introduce an online adaptive sampling algorithm. To evaluate the model's performance across various use cases, we test it on complex datasets to detect change points. Overall, we observe that the model successfully captures key features of normal behavior, and we believe it opens promising avenues for further research, particularly in assisting with tasks related to anomaly detection and performance optimization in high-performance computing environments.
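A minimal online detector conveys the flavor of the task: flag a sample that deviates sharply from a rolling baseline, then re-baseline. The sketch below is a toy z-score rule with made-up window and threshold parameters, not the adaptive sampling model described above.

```python
from collections import deque
import statistics

def detect_change_points(stream, window=30, threshold=4.0):
    """Toy z-score change-point detector: flag samples that deviate from a
    rolling baseline by more than `threshold` standard deviations, then
    restart the baseline (illustrative only)."""
    hist = deque(maxlen=window)
    changes = []
    for i, x in enumerate(stream):
        if len(hist) == window:
            mu = statistics.fmean(hist)
            sd = statistics.pstdev(hist) or 1e-12
            if abs(x - mu) / sd > threshold:
                changes.append(i)
                hist.clear()  # re-baseline after a detected shift
        hist.append(x)
    return changes

# Deterministic "noise" around level 10, then a jump to level 20 at index 100
signal = [10 + 0.1 * ((2 * i) % 13 - 6) for i in range(100)]
signal += [20 + 0.1 * ((2 * i) % 13 - 6) for i in range(100)]
assert detect_change_points(signal) == [100]
```

Real monitoring data is far messier (drift, seasonality, multivariate signals), which is exactly why adaptive approaches are needed over fixed windows and thresholds.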
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionIn conventional multi-GPU configurations, the host manages execution, kernel launches, communication, and synchronization, incurring unnecessary overhead. To mitigate this, we present a CPU-free model that delegates control to the devices themselves, which especially benefits communication-intensive applications. Utilizing techniques such as persistent kernels, specialized thread blocks, and device-initiated communication, we create autonomous multi-GPU code that drastically reduces communication overhead. Our approach is demonstrated with popular solvers, including 2D/3D Jacobi stencils and Conjugate Gradient (CG). We are currently developing compiler support and debugging/profiling tools for the model, and applying it to a broader set of applications.
Posters
Research Posters
TP
XO/EX
DescriptionResource disaggregation is prevalent in datacenters since it provides high resource utilization when compared to servers dedicated to either compute, memory, or storage. NVMe-over-Fabrics (NVMe-oF) is the standardized protocol used for accessing disaggregated storage over the network. Currently, the NVMe-oF specification lacks any semantics to prioritize I/O requests based on different application needs. Since applications have varying goals — latency-sensitive or throughput-critical I/O — we need to design efficient schemes in order to allow applications to specify the type of performance they wish to achieve. Furthermore, with additional tenants, we need to provide the respective specified performance optimizations that each application requests, regardless of congestion. This is a challenging problem, as the current NVMe specification lacks semantics to support multi-tenancy. Our research poster brings awareness to the ways in which we can bring multi-tenancy support to the NVMe-oF specification.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this work, we perform one of the first in-depth, empirical comparisons of the Arm and RISC-V instruction sets. We compare a series of benchmarks compiled with GCC 9.2 and 12.2, targeting the scalar subsets of Arm's Armv8-A and RISC-V's rv64g. We analyze instruction counts, critical paths, and windowed critical paths to estimate performance differences between the two instruction sets, determining where each has advantages and disadvantages. The results show the instruction sets are relatively closely matched on the metrics we evaluated for the benchmarks we considered, indicating that neither ISA has a large, inherent architectural advantage over the other.
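The critical-path metric reduces to a longest-path computation over the instruction dependence graph. A simplified sketch (the latency model and dependence encoding are illustrative assumptions, not the paper's exact methodology):

```python
def critical_path(latency, deps):
    """Longest dependency chain, in cycles, through a straight-line
    instruction sequence; `deps[i]` lists the earlier instructions that
    instruction i waits on (toy model, each entry of `latency` in cycles)."""
    finish = []
    for i, lat in enumerate(latency):
        start = max((finish[j] for j in deps.get(i, ())), default=0)
        finish.append(start + lat)
    return max(finish, default=0)

# i2 consumes i0 and i1; i3 consumes i2 (all single-cycle operations),
# so the longest chain is i0 -> i2 -> i3: three cycles.
assert critical_path([1, 1, 1, 1], {2: (0, 1), 3: (2,)}) == 3
```

Comparing this chain length across ISAs, rather than raw instruction counts alone, captures how much of the extra work one ISA emits can be hidden by instruction-level parallelism.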
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionHigh-Performance Computing (HPC) centers demand large amounts of power, and that demand continues to grow through the exascale era. This work establishes the need for a multi-tiered, feedback-driven power management framework that follows dynamic power objectives while maximizing job performance, highlighting the need to respond to both external factors (e.g., power constraints) and internal factors (e.g., performance variation). We present a practical implementation of this framework on a real-world cluster, in addition to conducting simulations for larger data centers. We accurately track a moving power target for demand response while reacting to incomplete or inaccurate prior knowledge about job power and performance properties. We demonstrate that online performance feedback from a job runtime enables a cluster power management policy to recover most of the performance degradation introduced by job-type misclassification.
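The feedback idea can be illustrated with a toy proportional controller that nudges a node's power cap toward a target. The gain, cap limits, and readings below are invented for illustration; this is not the paper's multi-tiered framework.

```python
def track_power_target(target, readings, cap0, gain=0.5,
                       cap_min=50.0, cap_max=300.0):
    """Toy proportional feedback loop: adjust a node power cap so that
    measured power tracks a (possibly moving) target (illustrative only)."""
    cap, caps = cap0, []
    for measured in readings:
        cap += gain * (target - measured)  # lower the cap on overshoot
        cap = max(cap_min, min(cap_max, cap))
        caps.append(cap)
    return caps

caps = track_power_target(200.0, [260.0, 230.0, 210.0, 200.0], cap0=250.0)
assert caps[0] < 250.0        # reacts to the initial overshoot
assert caps[-1] == caps[-2]   # settles once the target is met
```

A production policy layers job-level performance feedback on top of this loop, so that caps are tightened on the jobs that suffer the least from them.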
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionThe MPI 4.0 standard introduced the concept of partitioned point-to-point communication. One facet that may help in encouraging application developers to use this new concept in their programs is the availability of proper tool support in a timely manner. We therefore propose nine new events extending the OTF2 event model to accurately represent the runtime behavior of partitioned point-to-point communication. We then demonstrate the suitability of these extensions with three different use cases in the context of performance analysis. In particular, we showcase a prototype implementation of an extended waitstate analysis in the Scalasca trace analyzer, and discuss further potential use cases in the realm of trace visualization and simulation.
Workshop
Quantum Computing
Software Engineering
W
DescriptionA crucial step in compiling a quantum algorithm involves addressing a layout problem to meet the device's layout constraints. The Qubit Mapping and Routing (QMR) problem aims to minimize the number of SWAP gates added to the circuit to fulfill NISQ hardware's connectivity constraints. Although this problem is NP-hard, finding solutions quickly is vital as it is part of the compilation process.
In this research, we formulate the QMR problem as a Quadratic Unconstrained Binary Optimization (QUBO) problem and utilize specialized hardware, the Fujitsu Digital Annealer, to solve it faster. We conduct experiments on various benchmarks, comparing our approach to popular methods such as Qiskit and tket. Remarkably, our method achieves optimal solutions for almost all instances in the QUEKO benchmark, significantly outperforming other solvers. Furthermore, we demonstrate our approach's superior performance on a variety of application-specific quantum circuits.
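The QUBO formulation can be made concrete on a toy instance: encode each hard constraint (here, a one-hot selection) as a quadratic penalty and minimize x^T Q x over binary vectors. The two-variable instance below is purely illustrative and brute-forced, whereas the paper offloads the search to the Digital Annealer.

```python
from itertools import product

def solve_qubo(Q, n):
    """Brute-force minimizer of x^T Q x over binary x. Only viable for tiny
    instances; specialized annealing hardware searches far larger spaces."""
    best_x, best_e = None, float("inf")
    for x in product((0, 1), repeat=n):
        e = sum(q * x[i] * x[j] for (i, j), q in Q.items())
        if e < best_e:
            best_x, best_e = x, e
    return best_x, best_e

# Pick exactly one of two placements; placement 1 is cheaper (cost 1 vs 2).
# The one-hot constraint (x0 + x1 == 1) expands to the quadratic penalty
# P * (x0 + x1 - 1)^2, folded into the diagonal and off-diagonal of Q.
P = 10.0
Q = {(0, 0): 2.0 - P, (1, 1): 1.0 - P, (0, 1): 2 * P}
x, e = solve_qubo(Q, 2)
assert x == (0, 1)  # satisfies the constraint and picks the cheaper option
```

In the real QMR encoding, the binary variables select qubit placements and SWAP insertions, and the objective counts the added SWAP gates.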
Workshop
Education
State of the Practice
W
DescriptionThis work presents an overview of an NSF Research Experience for Undergraduate Site on Trust and Reproducibility of Intelligent Computation, delivered by faculty and graduate students in the Kahlert School of Computing at University of Utah. The chosen themes bring together several concerns for the future in producing computational results that can be trusted: secure, reproducible, based on sound algorithmic foundations, and developed in the context of ethical considerations. The research areas represented by student projects include machine learning, high-performance computing, algorithms and applications, computer security, data science, and human-centered computing. In the first four weeks of the program, the entire student cohort spent their mornings in lessons from experts in these crosscutting topics, and used one-of-a-kind research platforms operated by the University of Utah, namely NSF-funded CloudLab and POWDER facilities. This program can serve as a model for preparing a future workforce to integrate ML into trustworthy reproducible applications.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionIn the high performance computing (HPC) domain, performance variability is a major scalability issue for parallel computing applications with heavy synchronization and communication. We present an experimental performance analysis of OpenMP benchmarks regarding the variation of execution time, and determine the potential factors causing performance variability.
Our work offers some understanding of performance distributions and directions for future work on how to mitigate variability for OpenMP-based applications. Two representative OpenMP benchmarks from the EPCC OpenMP micro-benchmark suite and BabelStream are run across two x86 multicore platforms featuring up to 256 threads. From the obtained results, we characterize and explain the execution time variability as a function of thread-pinning, simultaneous multithreading (SMT) and core frequency variation.
Workshop
Accelerators
Applications
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWith the advent of GPUs in parallel computing, several languages, tools, and compilers are being developed. Many impactful applications can benefit from the performance these GPUs provide, but moving large, complex code bases to GPU execution often poses many hurdles and growing pains as developers adapt to unfamiliar programming models and interface with increasingly complex, but powerful, hardware. Our work discusses experiences using OpenACC to bring GPU acceleration to MURaM, a state-of-the-art solar physics application, including various problems we have explored and overcome to bring better performance portability to the code within the limitations of the programming model. We then provide strong- and weak-scaling results and findings from transitioning to current-generation GPU architectures on up to 512 NVIDIA A100 GPUs, observing that one A100 GPU is comparable to 90-100 CPU cores, and that the GPU runs scale much further than the CPU runs can.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionParallel I/O performance can be a critical bottleneck for applications, yet users are often ill-equipped to identify and diagnose I/O performance issues. Increasingly complex hierarchies of storage hardware and software deployed on many systems only compound this problem. Tools that can effectively capture, analyze, and tune I/O behavior for these systems empower users to realize performance gains for many applications.
In this BoF, we form a community around best practices in analyzing parallel I/O and cover recent advances to help address the above-mentioned problem, drawing on the expertise of users, I/O researchers, and administrators in attendance.
Workshop
Distributed Computing
Security
W
DescriptionBoth 3rd-generation Xeon Scalable processors and Gramine 1.0, which potentially improves the performance of Intel SGX, were released in 2021. In this paper, we provide the first performance analysis of HPC workloads with Gramine and SGX on 3rd-generation Xeon Scalable processors. Our analysis starts with microbenchmarks and is then extended to various HPC workloads. Our experimental results show that Gramine+SGX incurs a small performance overhead (4-17%) for both compute-intensive and memory-bandwidth-sensitive workloads, but a considerably larger overhead (up to 170%) for a memory-latency-sensitive workload. In addition, we show that the combination of Gramine and a 3rd-generation Xeon Scalable processor shows a slowdown of 1.5x on average (up to 4.4x) for many HPC workloads. This number is an order of magnitude smaller than that reported in previous work using the combination of the former-generation SGX toolchain and processor.
Paper
ANT-MOC: Scalable Neutral Particle Transport Using 3D Method of Characteristics on Multi-GPU Systems
Accelerators
Applications
Modeling and Simulation
TP
Best Paper Finalist
Best Student Paper Finalist
DescriptionThe Method of Characteristics (MOC) for solving the Neutron Transport Equation (NTE) is the core of full-core simulation for reactors. High resolution is enabled by discretizing the NTE through massive numbers of tracks that traverse the 3D reactor geometry. However, 3D full-core simulation is prohibitively expensive because of high memory consumption and severe load imbalance. To deal with these challenges, we develop ANT-MOC. Specifically, we build a performance model for memory footprint, computation, and communication, based on which a track management strategy is proposed to overcome the resolution bottlenecks caused by limited GPU memory. Furthermore, we implement a novel multi-level load mapping strategy to ensure load balancing among nodes, GPUs, and CUs. ANT-MOC enables a 3D full-core reactor simulation with 100 billion tracks on 16,000 GPUs, with 70.69% and 89.38% parallel efficiency for strong and weak scalability, respectively.
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionPerformance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low-rank tensor decomposition for modeling application performance. We discretize the input and configuration domains of an application using regular grids. Application execution times mapped within grid-cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We further show that this decomposition enables accurate extrapolation of unobserved regions of an application's parameter space. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times. We consider alternative piecewise/grid-based models and supervised learning models for six applications and demonstrate that CP decomposition optimized using tensor completion offers higher prediction accuracy and memory-efficiency for high-dimensional performance modeling.
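The CP model itself is simple: a rank-R decomposition represents each tensor entry as a sum of R products of factor-matrix entries. A small sketch of the reconstruction step, with invented factor values (the paper additionally optimizes the factors via tensor completion from sparse observations):

```python
def cp_reconstruct(factors):
    """Rebuild a 3-way tensor from rank-R CP factor matrices A, B, C:
    T[i][j][k] = sum_r A[i][r] * B[j][r] * C[k][r]."""
    A, B, C = factors
    R = len(A[0])
    return [[[sum(A[i][r] * B[j][r] * C[k][r] for r in range(R))
              for k in range(len(C))]
             for j in range(len(B))]
            for i in range(len(A))]

# Rank-1 toy: execution time assumed separable in problem size,
# thread count, and block size (purely illustrative numbers).
size_f = [[1.0], [2.0], [4.0]]
thread_f = [[1.0], [0.5]]
block_f = [[3.0], [1.0]]
T = cp_reconstruct((size_f, thread_f, block_f))
assert T[1][1][0] == 2.0 * 0.5 * 3.0
```

The memory savings come from storing only the factor matrices (sum of mode sizes times R) instead of the full grid of averaged execution times, while extrapolation follows from evaluating the same product at unobserved grid cells.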
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF provides a forum for Fortran developers to engage with its modern programming features. Fortran continues to play a crucial role in numerous legacy applications, but with features introduced in recent standards, the language also supports modern programming practices and high-performance computing. As Fortran 2023 approaches, this BoF brings together developers from various domains to share experiences and explore the language's evolving capabilities. After some brief panelist presentations, the session will focus on an interactive discussion where audience members will be encouraged to share their own experiences and ask questions of our panelists.
Posters
Research Posters
TP
XO/EX
DescriptionType Ia supernovae are highly luminous thermonuclear explosions of white dwarfs that serve as standardizable distance markers for investigating the accelerating expansion of our Universe. Most existing supernova simulation codes are designed to run only on homogeneous, CPU-only systems and do not take advantage of the increasing shift toward heterogeneous architectures in HPC. To address this, we present Ares, the first performance-portable, massively parallel code for simulating thermonuclear burn fronts. By creating multi-physics modules using the Kokkos and Parthenon frameworks, we are able to scale supernova simulations to distributed HPC clusters on any of the CUDA, HIP, SYCL, HPX, OpenMP, and serial backends. We evaluate our application by conducting weak- and strong-scaling studies on both CPU and GPU clusters, showing the efficiency of our method for a diverse set of targets.
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThis BoF brings together the Arm HPC community to discuss experiences and lessons learnt in delivering and operating Arm-based HPC systems. The topic of Arm HPC ecosystem maturity has been extensively discussed, focusing especially on the upper part of the stack (compiler, libraries, applications). This BoF focuses instead on the other side of the coin with a focus on administration and management of systems. Primed by a short opening session from well-recognized experts in the community, the host and panel will engage attendees to share and ask probing questions. Audience participation is strongly encouraged.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionAs modern High-Performance Computing (HPC) systems reach exascale performance, their power consumption becomes a serious threat to environmental and energy sustainability. Efficient power management in HPC systems is crucial for optimizing workload management, reducing operational costs, and promoting environmental sustainability. Accurate prediction of job power consumption plays an important role in achieving these goals. We apply a technique combining Machine Learning (ML) algorithms with Natural Language Processing (NLP) tools to predict job power consumption. The solution predicts a job's maximum and average power consumption per node, leveraging only information available at the time of job submission. The prediction is performed in an online fashion, and we validate the approach using batch system logs extracted from Supercomputer Fugaku. The experimental evaluation shows promising results, outperforming classical techniques while obtaining an R2 score of more than 0.53 for our two prediction tasks.
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionTechnological advancements have increased the importance of teaching the fundamentals of robotics and autonomous systems, a subject that relies on strong hands-on practical experimentation. National Science Foundation (NSF)-supported testbeds have opened the doors for experimentation and support in the next era of computing platforms and large-scale cloud research.
We present an open-source educational module that broadens access to education, aiming to prepare learners for technological career paths. Our educational module is developed with the motivation of bringing hands-on sessions to students and allowing them to attain knowledge in a comprehensive manner. Specifically, we present AutoLearn: Learning in the Edge to Cloud Continuum, an educational module that integrates a collection of educational artifacts, based on an open-source, small-scale self-driving platform, and leverages the Chameleon Cloud testbed to teach cloud computing concepts, edge device technology, and artificial-intelligence-driven applications.
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
DescriptionIn a parallel and distributed application, a mapping is a selection of a processor for each computation or task and memories for the data collections that each task accesses. Finding high-performance mappings is challenging, particularly on heterogeneous hardware with multiple choices for processors and memories. We show that fast mappings are sensitive to the machine, application, and input. Porting to a new machine, modifying the application, or using a different input size may necessitate re-tuning the mapping to maintain the best possible performance.
We present AutoMap, a system that automatically tunes the mapping to the hardware used and finds fast mappings without user intervention or code modification. In contrast, hand-written mappings often require days of experimentation. AutoMap utilizes a novel constrained coordinate-wise descent search algorithm that balances the trade-off between running computations quickly and minimizing data movement. AutoMap discovers mappings up to 2.41x faster than custom, hand-written mappers.
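AutoMap's constrained coordinate-wise descent is not specified in detail here, but the general shape of such a search can be sketched as improving one task-to-processor decision at a time while rejecting infeasible mappings. The cost model, memory constraint, and numbers below are invented for illustration:

```python
def coordinate_descent(cost, choices, start, feasible, max_sweeps=10):
    """Constrained coordinate-wise descent: improve one mapping decision
    at a time, keeping only feasible mappings, until no sweep improves."""
    best = list(start)
    best_cost = cost(best)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(best)):
            for v in choices[i]:
                cand = best[:i] + [v] + best[i + 1:]
                if not feasible(cand):
                    continue
                c = cost(cand)
                if c < best_cost:
                    best, best_cost, improved = cand, c, True
        if not improved:
            break
    return best, best_cost

# Toy mapping problem: 3 tasks, each on CPU (0) or GPU (1).
# GPUs run tasks faster, but moving a task's data to the GPU costs extra,
# and at most 2 tasks fit in GPU memory.
run_time = [(9, 3), (8, 2), (4, 3)]   # (cpu_time, gpu_time) per task
move_cost = [1, 1, 5]                 # data-movement penalty if on GPU

def cost(mapping):
    return sum(run_time[i][m] + (move_cost[i] if m else 0)
               for i, m in enumerate(mapping))

def feasible(mapping):
    return sum(mapping) <= 2          # GPU memory holds 2 tasks

mapping, total = coordinate_descent(cost, [(0, 1)] * 3, [0, 0, 0], feasible)
print(mapping, total)  # → [1, 1, 0] 11
```

The search balances the same trade-off the abstract names: task 2's large data-movement penalty keeps it on the CPU even though the GPU runs it faster.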
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionWe introduce a novel energy-efficient job scheduling approach for High-Performance Computing (HPC) environments. Its primary objective is to bridge the gap between research and production in energy-efficient scheduling models for HPC. The proposed architecture and program decouple scheduling heuristics from the HPC scheduler SLURM into a Python application, enabling adaptability for production setups. The implementation demonstrates a potential 11% energy saving on the High-Performance Conjugate Gradients (HPCG) benchmark, highlighting the practicality of the approach in a single-node HPC cluster. This work serves as a foundation for integrating research in the area into production, offering a realistic example of energy-efficient HPC in practice. It also opens possibilities for more advanced applications, such as automatically scheduling jobs during periods of low-cost and renewable energy, as already done by companies employing HPC. This contribution showcases a practical, energy-efficient solution for HPC job scheduling and identifies potential for future enhancements in this area.
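One of the "advanced applications" mentioned — scheduling jobs into low-cost or renewable-energy periods — can be sketched with a simple greedy toy. This is an illustration of the idea only, not the paper's SLURM-integrated implementation; the jobs and prices are invented:

```python
def schedule(jobs, prices):
    """Greedy illustration of energy-aware scheduling: place the most
    energy-hungry jobs (kWh) into the cheapest hourly price slots."""
    cheap_hours = sorted(range(len(prices)), key=lambda h: prices[h])
    plan = {}
    for energy, hour in zip(sorted(jobs, reverse=True), cheap_hours):
        plan[hour] = energy               # biggest jobs get cheapest hours
    total_cost = sum(prices[h] * e for h, e in plan.items())
    return plan, total_cost

# Hourly electricity prices (e.g., cheap overnight or renewable surplus)
prices = [0.30, 0.10, 0.20, 0.25]
plan, cost = schedule([10, 4, 7], prices)
print(plan)  # hour → job energy; the 10 kWh job lands in the 0.10 slot
```

A production scheduler would also respect deadlines, node availability, and job dependencies, which this sketch omits.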
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionWhile considerable research has been directed at automatic parallelization for shared-memory platforms, little progress has been made in automatic parallelization schemes for distributed-memory systems. We introduce an innovative approach to automatically produce distributed-memory parallel code for an important sub-class of affine tensor computations common to Coupled Cluster (CC) electronic structure methods, neuro-imaging applications, and deep learning models.
We propose a novel systematic approach to modeling the relations and trade-offs of mapping computations and data onto multi-dimensional grids of homogeneous nodes. Our formulation explores the space of computation and data distributions across processor grids. Tensor programs are modeled as a non-linear symbolic formulation accounting for the volume of data communication and per-node capacity constraints induced under specific mappings. Solutions are found, iteratively, using the Z3 SMT solver, and used to automatically generate efficient MPI code. Our evaluation demonstrates the effectiveness of our approach over Distributed-Memory Pluto and the Cyclops Tensor Framework.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn this paper, we propose and evaluate several optimized implementations of the general matrix multiplication (Gemm) on two different RISC-V architecture cores implementing the RISC-V vector extension (RVV): C906 and C910 from T-HEAD. Specifically, we address the performance portability problem across these processor cores by means of an automatic assembly code generator, written in Python, capable of emitting RVV code for high performance computing (HPC), with a variety of combinations of specific and general optimizations.
Our experimental results, using a number of automatically generated micro-kernels for Gemm on both RISC-V architectures, reveal a different impact for each optimization depending on the target architecture, and highlight the importance of automatically generating HPC RVV code to achieve performance portability while reducing developer effort. In addition, these optimizations show important performance gains with respect to a state-of-the-art tuned BLAS library (OpenBLAS), reaching 3x and 1.3x speed-ups on the C910 and C906, respectively.
Workshop
Performance Optimization
W
DescriptionThe rapid development in machine learning (ML) has prompted demand for low-precision arithmetic hardware that can deliver faster computing speed. Weather simulation applications typically exhibit high sensitivity to small perturbations in the input data, but their inherent uncertainty paves the way for mixed-precision computing (MPC), trading accuracy for performance. Additional challenges of balancing lower computational cost against accuracy requirements must be addressed before MPC can be applied successfully to weather modeling applications. Determining an acceptable precision allocation for variables involves an exponential search space of mixed-precision configurations. We propose a mixed-precision code-tuning framework that automatically searches for suitable precision configurations for weather modeling applications using black-box optimization algorithms. The search results achieve up to a 30% performance gain while staying within the tolerance level, offering a workflow that facilitates the identification of variables sensitive to precision change.
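A minimal sketch of black-box precision tuning in this spirit: randomly sample per-variable precision configurations and keep the fastest one whose error stays within tolerance. The error/speedup models below are invented stand-ins for running the weather code; the paper's actual optimization algorithms are more sophisticated:

```python
import random

def tune_precision(variables, error, speedup, tol, trials=200, seed=0):
    """Black-box search over per-variable precision configurations:
    keep the fastest configuration whose error stays within tol."""
    rng = random.Random(seed)
    best = {v: "fp64" for v in variables}      # all-double baseline
    best_gain = speedup(best)
    for _ in range(trials):
        cand = {v: rng.choice(["fp64", "fp32"]) for v in variables}
        if error(cand) <= tol and speedup(cand) > best_gain:
            best, best_gain = cand, speedup(cand)
    return best, best_gain

# Toy model: demoting variable "a" ruins accuracy; "b" and "c" are safe.
def error(cfg):
    base = 0.02 if cfg["a"] == "fp32" else 0.0
    return base + 0.001 * sum(cfg[v] == "fp32" for v in "bc")

def speedup(cfg):
    return sum(p == "fp32" for p in cfg.values())

best, gain = tune_precision(["a", "b", "c"], error, speedup, tol=0.005)
print(best, gain)  # "a" stays fp64; "b" and "c" are safely demoted
```

The search itself also identifies precision-sensitive variables as a by-product: any variable that is never demoted in surviving configurations (here, "a") is flagged as sensitive.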
Posters
Research Posters
TP
XO/EX
DescriptionThe increasing demand for processing power on resource-constrained edge devices necessitates efficient techniques for optimizing High Performance Computing (HPC) applications. We propose HPEE (HPC Parameter Exploration on Edge), a novel approach that formulates the parameter search space problem as a pure-exploration multi-armed bandit (MAB) problem. By efficiently exploring the search space using the MAB framework, we achieve significant performance improvements while respecting the limited computational resources of edge devices. Experimental results, based on an HPC application, demonstrate the effectiveness of our approach in optimizing parameter search on edge devices, offering a promising solution for enhancing HPC performance in resource-constrained environments.
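In a pure-exploration MAB, the entire evaluation budget is spent identifying the best arm rather than maximizing cumulative reward along the way. A minimal round-robin sketch, with an invented noisy reward model (not HPEE's actual algorithm, which would allocate the budget more adaptively):

```python
import random

def best_arm(pull, n_arms, budget, seed=0):
    """Pure-exploration multi-armed bandit: spend the evaluation budget
    uniformly across arms, then return the arm with the best mean reward."""
    rng = random.Random(seed)
    totals, counts = [0.0] * n_arms, [0] * n_arms
    for t in range(budget):
        arm = t % n_arms                 # uniform round-robin exploration
        totals[arm] += pull(arm, rng)
        counts[arm] += 1
    means = [totals[a] / counts[a] for a in range(n_arms)]
    return max(range(n_arms), key=lambda a: means[a])

# Toy search space: each "arm" is a candidate parameter setting whose
# noisy reward is observed speed; arm 2 is truly the best.
true_speed = [1.0, 1.5, 2.0, 0.5]

def pull(arm, rng):
    return true_speed[arm] + rng.gauss(0, 0.1)

print(best_arm(pull, 4, budget=400))  # → 2
```

On an edge device, each "pull" would be one (expensive) run of the HPC application under a candidate parameter setting, which is why budget-efficient exploration matters.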
Posters
Research Posters
TP
XO/EX
DescriptionDistributed large-model inference still faces a dilemma in balancing latency and throughput, or rather cost and effectiveness. Tensor parallelism, while capable of optimizing latency, entails substantial expenditure. Conversely, pipeline parallelism excels in throughput but falls short in minimizing execution time.
To address this challenge, we introduce a novel solution: interleaved parallelism. This approach interleaves computation and communication across requests. Our runtime system harnesses GPU scheduling techniques to overlap communication and computation kernels, enabling this new parallelism for distributed large-model inference. Extensive evaluations show that our proposal outperforms existing parallelism approaches across models and devices, delivering the best latency and throughput in most cases.
Workshop
Programming Frameworks and System Software
W
DescriptionDuring software development, many aspects of the system and user state can change. Significant time can be spent tracking down the causes of these differences, rather than focusing on the main task of software development. This paper describes a tool to record the state at build-time and at runtime of an application to more easily investigate the cause(s) of differences in behavior. The added logging enables better software quality assurance by tracking code changes and their effects on runtime behavior. At a minimum, this tool only requires prepending one command at build-time and another at runtime. Project-level configurations can be set to enable the collection of additional information.
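A minimal sketch of the record-and-diff idea: capture a few state fields at "build" and at "runtime" and report what changed. The fields chosen here are illustrative; the tool described in the paper records far more and is driven by prepended commands, not a library call:

```python
import json
import platform
import sys

def snapshot(label):
    """Record system/user state that commonly differs between builds."""
    return {
        "label": label,
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "argv": sys.argv[:],
    }

def diff(a, b):
    """Report keys whose recorded values differ between two snapshots."""
    return {k: (a[k], b[k]) for k in a if k != "label" and a[k] != b[k]}

build_state = snapshot("build")
run_state = snapshot("runtime")
print(json.dumps(diff(build_state, run_state)))  # same process → {}
```

In practice the two snapshots come from different machines, times, or users, and the diff directly names the state that changed between a working and a misbehaving run.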
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionSimulations of Lattice Quantum Chromodynamics (LQCD) are an important application (a double-digit percentage of cycles) on major High Performance Computing (HPC) installations, including systems at and near the top of the Top500 list. In the rapidly changing hardware landscape of HPC, dedicating workforce to optimizing simulation software for every architecture becomes a sustainability issue.
In this work, we explore the feasibility of using performance-portable parallel code for an important LQCD kernel. Fusing the Kokkos C++ Performance Portability EcoSystem with MPI allows us to scale on massively parallel machines while still targeting a plenitude of different architectures with the same simple code. We report benchmarking results for a range of currently deployed and recently introduced systems, including AMD EPYC 7742, AMD MI250, Fujitsu A64FX, Nvidia A100 and Nvidia H100 components, with mostly encouraging results.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionTransformer models suffer from high computational complexity. Habana GAUDI architecture offers a promising solution to tackle these issues. GAUDI features a Matrix Multiplication Engine (MME) and a cluster of fully programmable Tensor Processing Cores (TPC). This paper explores the untapped potential of using GAUDI processors to accelerate Transformer-based models, addressing key challenges in the process. First, we provide a performance comparison between the MME and TPC components, illuminating their relative strengths and weaknesses. Second, we explore strategies to optimize MME and TPC utilization, offering practical insights to enhance computational efficiency. Third, we evaluate the performance of Transformers on GAUDI, particularly in handling long sequences and uncovering performance bottlenecks. Last, we evaluate the end-to-end performance of two Transformer-based large language models (LLM) on GAUDI. The contributions of this work encompass practical insights for practitioners and researchers alike. We delve into GAUDI's capabilities for Transformers through systematic profiling, analysis, and optimization exploration.
Workshop
Education
State of the Practice
W
Tutorial
Cloud Computing
Software Engineering
TUT
DescriptionHigh Performance Computing in the cloud has grown significantly over the last five years. Weather, computational fluid dynamics (CFD), genomic analysis and more are workloads that leverage the elasticity and the broad compute choices of the cloud to innovate faster and deliver faster results. The large choice of compute, storage and network options and the dynamic nature of cloud can make the first experience a daunting proposition. Cloud technologies also provide new capabilities to scientists, engineers, and HPC specialists; however, how to use them may not be immediately clear.
This tutorial provides intermediate and advanced content for running and managing HPC in the cloud. It is organized as four series of progressive lectures and labs that provide a hands-on learning experience. It starts with a primer on cloud foundations and how they map to common HPC concepts, dives deeper into cloud core components, and presents best practices for running HPC in the cloud.
This tutorial uses a combination of lectures and hands-on labs on provided temporary Amazon Web Services (AWS) accounts to provide both conceptual and hands-on learning.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionMachine Learning (ML) has become an increasingly popular tool to accelerate traditional workflows. Critical to the use of ML is the process of splitting datasets into training, validation, and testing subsets to develop and evaluate models. Common practice is to assign these subsets randomly. Although this approach is fast, it only measures a model's capacity to interpolate. These testing errors may be overly optimistic on out-of-scope data; thus, there is a growing need to easily measure performance on extrapolation tasks. To address this issue, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging splits. This poster focuses on use cases within cheminformatics; however, astartes operates on arbitrary vectors, so its principles and workflow are generalizable to other ML domains as well. astartes is available via the Python package managers pip and conda and is publicly hosted on GitHub (github.com/JacksonBurns/astartes).
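A minimal example of a distance-based split in the spirit of such algorithms (an illustrative sketch, not astartes' API): hold out the points farthest from the data centroid, so the test set probes extrapolation rather than interpolation:

```python
import math

def extrapolation_split(X, test_frac=0.25):
    """Distance-based split: the points farthest from the data centroid
    form the test set, so testing probes extrapolation, not interpolation."""
    dim = len(X[0])
    centroid = [sum(x[d] for x in X) / len(X) for d in range(dim)]
    order = sorted(range(len(X)),
                   key=lambda i: math.dist(X[i], centroid))
    n_test = max(1, int(round(test_frac * len(X))))
    train_idx, test_idx = order[:-n_test], order[-n_test:]
    return train_idx, test_idx

# Toy 1-D "chemical descriptor" data: seven interior points and one outlier.
X = [[0.0], [0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [5.0]]
train, test = extrapolation_split(X, test_frac=0.25)
print(sorted(test))  # → [0, 7]: the two most extreme points are held out
```

A model that scores well on a random split but poorly on a split like this one is interpolating, not extrapolating, which is exactly the distinction the abstract draws.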
Tutorial
Applications
Software Engineering
TUT
DescriptionProducing scientific software is a challenge. The high-performance modeling and simulation community, in particular, faces the confluence of disruptive changes in computing architectures and new opportunities (and demands) for greatly improved simulation capabilities, especially through coupling physics and scales. Simultaneously, computational science and engineering (CSE), as well as other areas of science, are experiencing an increasing focus on scientific reproducibility and software quality. Code coupling requires aggregate team interactions including integration of software processes and practices. These challenges demand large investments in scientific software development and improved practices. Focusing on improved developer productivity and software sustainability is both urgent and essential.
Attendees will learn about practices, processes, and tools to improve the productivity of those who develop CSE software, increase the sustainability of software artifacts, and enhance trustworthiness in their use. We will focus on aspects of scientific software development that are not adequately addressed by resources developed for industrial software engineering. Topics include the design, refactoring, and testing of complex scientific software systems; collaborative software development; and software packaging. The second half of this full-day tutorial will focus on reproducibility, and why and how to keep a lab notebook for computationally-based research.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe classical simulation of quantum computers is in general a computationally hard problem. To emulate the behavior of realistic devices, it is sufficient to sample bitstrings from circuits. Recently, Ref. [5] introduced the so-called gate-by-gate sampling algorithm to sample bitstrings and showed it to be computationally favorable in many cases. Here we present bgls, a Python package which implements this sampling algorithm. bgls has native support for several states and is highly flexible for use with additional states. We show how to install and use bgls, discuss optimizations in the algorithm, and demonstrate its utility on several problems.
ACM Gordon Bell Finalist
Awards
TP
DescriptionReal-time, 30-second-refresh numerical weather prediction (NWP) was performed with exclusive use of 11,580 nodes (~7%) of the supercomputer Fugaku during the Tokyo Olympics and Paralympics in 2021. A total of 75,248 forecasts were disseminated over the one-month period, mostly stably, with a time-to-solution of less than 3 minutes for a 30-minute forecast. Japan’s Big Data Assimilation (BDA) project developed the novel NWP system for precise prediction of hazardous rains, contributing toward solving the global climate crisis. Compared with typical 1-hour-refresh systems, the BDA system offered a two-orders-of-magnitude increase in problem size and revealed the effectiveness of 30-second refresh for highly nonlinear, rapidly evolving convective rains. To achieve the required time-to-solution for real-time 30-second refresh with high accuracy, the core BDA software incorporated single precision and enhanced parallel I/O with properly selected configurations of 1000 ensemble members and a 500-m-mesh weather model. The massively parallel, I/O-intensive real-time BDA computation demonstrated a promising future direction.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionDynamic graph networks are widely used for learning time-evolving graphs, but prior work on training these networks is inefficient due to communication overhead, long synchronization, and poor resource usage. Our investigation shows that communication and synchronization can be reduced by carefully scheduling the workload, and that the execution order of operators in GNNs can be adjusted without hurting training convergence.
We propose a system called BLAD to consider the above factors, comprising a two-level load scheduler and an overlap-aware topology manager. The scheduler allocates each snapshot group to a GPU, alleviating cross-GPU communication.
The snapshots in a group are then carefully allocated to processes on a GPU, enabling overlap of compute-intensive NN operators and memory-intensive graph operators. The topology manager adjusts the operators' execution order to maximize the overlap. Experiments show that BLAD achieves a 27.2% average speedup in training time without affecting final accuracy, compared to state-of-the-art solutions.
Invited Talk
Applications
Biology
Medicine
TP
DescriptionNeuroscience has become a highly interdisciplinary research field, including among others purely experimental studies, applied technology development, mathematical theory, computational models and simulations, AI, visualization and data analysis. However, neuroscience is relatively new to the usage of High Performance Computing. Within the European Flagship project, the Human Brain Project, scientists from all around Europe have made substantial progress in consolidating the computational requirements and usage patterns of this heterogeneous field. In parallel with the evolution of the European HPC landscape, neuroscience has also helped co-design the federated access to HPC, cloud, and data resources through the ICEI project, in collaboration with the FENIX-RI – a European effort to provide federated access to some of the largest HPC centers in Europe.
In this talk, I will provide a general overview of the evolving relationships between neuroscience and HPC. I will also present some examples of scientific highlights which have been made possible by this interaction. Finally, I will provide a perspective of how neuroscience can contribute to future technology co-design keeping in focus societal impact. I will complement this talk with my personal story and international experiences.
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
DescriptionMosaic Flow is a novel domain decomposition method designed to scale physics-informed neural PDE solvers to large domains. Its unique approach leverages pre-trained networks on small domains to solve partial differential equations on large domains purely through inference, resulting in high reusability. This paper presents an end-to-end parallelization of Mosaic Flow, combining data parallel training and domain parallelism for inference on large-scale problems. By optimizing the network architecture and data parallel training, we significantly reduce the training time for learning the Laplacian operator to minutes on 32 GPUs. Moreover, our distributed domain decomposition algorithm enables scalable inferences for solving the Laplace equation on domains 4096x larger than the training domain, demonstrating strong scaling while maintaining accuracy on 32 GPUs. The reusability of Mosaic Flow, combined with the improved performance achieved through the distributed-memory algorithms, makes it a promising tool for modeling complex physical phenomena and accelerating scientific discovery.
Workshop
Education
State of the Practice
W
DescriptionThe convergence of quantum technologies and high-performance computing offers unique opportunities for research and algorithm development, demanding a skilled workforce to harness the quantum systems' potential. In this lightning talk, we address the growing need to train experts in quantum computing and explore the challenges in training these individuals in quantum computing, including the abstract nature of quantum theory, or the focus on specific frameworks. To overcome these obstacles, we propose self-guided learning resources that offer interactive learning experiences and practical framework-independent experimentation for different target audiences.
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionMany real-world computations involve sparse data structures in the form of sparse matrices. A common strategy for optimizing sparse matrix operations is to reorder a matrix to improve data locality. However, it is not always clear whether reordering will provide benefits over the unordered matrix, as its effectiveness depends on several factors, such as structural features of the matrix, the reordering algorithm, and the hardware that is used. This paper aims to establish the relationship between matrix reordering algorithms and the performance of sparse matrix operations. We thoroughly evaluate six different matrix reordering algorithms on 490 matrices across eight multicore architectures, focusing on the commonly used sparse matrix-vector multiplication (SpMV) kernel. We find that reordering based on graph partitioning provides better SpMV performance than the alternatives for a large majority of matrices, and that the resulting performance is explained through a combination of data locality and load balancing concerns.
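To make the reordering idea concrete, the sketch below applies a symmetric permutation to a small CSR matrix and checks that SpMV produces the same values in permuted order — reordering changes data locality, not the mathematics. The matrix and permutation are toy data, unrelated to the paper's benchmark suite:

```python
def spmv(indptr, indices, data, x):
    """Sparse matrix-vector product y = A @ x in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        for k in range(indptr[row], indptr[row + 1]):
            y[row] += data[k] * x[indices[k]]
    return y

def permute_csr(indptr, indices, data, perm):
    """Symmetrically reorder a CSR matrix: row/col i moves to perm[i]."""
    n = len(indptr) - 1
    rows = [[] for _ in range(n)]
    for row in range(n):
        for k in range(indptr[row], indptr[row + 1]):
            rows[perm[row]].append((perm[indices[k]], data[k]))
    new_indptr, new_indices, new_data = [0], [], []
    for r in rows:
        for col, val in sorted(r):
            new_indices.append(col)
            new_data.append(val)
        new_indptr.append(len(new_indices))
    return new_indptr, new_indices, new_data

# A = [[2,0,1],[0,3,0],[1,0,4]] in CSR form.
indptr, indices, data = [0, 2, 3, 5], [0, 2, 1, 0, 2], [2, 1, 3, 1, 4]
y = spmv(indptr, indices, data, [1, 1, 1])
p_indptr, p_indices, p_data = permute_csr(indptr, indices, data, [2, 0, 1])
y2 = spmv(p_indptr, p_indices, p_data, [1, 1, 1])
print(y, y2)  # → [3.0, 3.0, 5.0] [3.0, 5.0, 3.0] (same values, permuted)
```

A good reordering (e.g., from graph partitioning) clusters nonzeros so that accesses to `x` hit cache; this tiny example only demonstrates the mechanics of applying the permutation.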
Invited Talk
Education
HPC in Society
TP
DescriptionAchievements in high-performance computing (HPC) ─ including computational and data-enabled science, analytics, learning, and artificial intelligence (AI) ─ drive progress in science and technology throughout our world. For example, collaborators in the U.S. Department of Energy (DOE) Exascale Computing Project (ECP) are pushing advances across a compelling range of scientific and engineering disciplines by pioneering a robust ecosystem of software technologies that exploit cutting-edge exascale computer architectures.
In order for the HPC community to address the most urgent scientific and societal challenges of the 21st century, the HPC workforce must embody a wide range of skills and perspectives … fully reflecting the diversity of society, including traditionally underrepresented communities — Black or African American, Hispanic/Latinx, Native American, Alaska Native, Native Hawaiian, Pacific Islanders, women, persons with disabilities, and first-generation scholars.
Each of us can make important contributions to broadening participation in HPC. This presentation will provide an overview of a variety of workforce efforts throughout the HPC community and opportunities for involvement. We will discuss the contributions of DOE lab staff who are working as part of the ECP Broadening Participation Initiative to address DOE workforce challenges through a lens that considers the distinct needs and culture of high-performance computing. Activities focus on three complementary thrusts: (1) Establishing an HPC Workforce Development and Retention Action Group to foster a supportive and inclusive culture in DOE labs and communities; (2) expanding the Sustainable Research Pathways (SRP) internship and workforce development program as a multi-lab cohort of students from underrepresented groups (and faculty working with them), who collaborate with DOE lab staff on world-class R&D projects; and (3) creating the Intro to HPC Bootcamp, an immersive program designed to engage students in energy justice using project-based pedagogy and real-life science stories to teach foundational skills in HPC, scalable AI, and analytics while exposing students to the excitement of DOE mission-driven team science. The presentation will highlight the first bootcamp (a collaboration among staff from advanced computing facilities at Argonne, Lawrence Berkeley, and Oak Ridge National Labs, Sustainable Horizons Institute, the DOE Office of Economic Impact and Diversity, and academic partners), which took place in August 2023 and featured a variety of HPC energy justice projects inspired by the DOE Justice40 Initiative. We will also consider challenges and opportunities for future work to broaden participation in HPC.
Exhibits
Flash Session
TP
XO/EX
Description: Join speakers from NVIDIA and Arc Compute as they discuss solutions to the everyday challenges organizations face when building AI infrastructure, and learn how Arc Compute's turnkey, end-to-end AI solutions, powered by NVIDIA GPUs and networking, are game changers that help decision-makers design, procure, and deploy their AI infrastructure.
Workshop
Data Movement and Memory
Heterogeneous Computing
Programming Frameworks and System Software
W
Description: We propose a new framework called CachedArrays and a set of APIs to address the data tiering problem in large-scale heterogeneous and disaggregated memory systems. The framework operates at a variable-size object granularity and lets the programmer specify semantic hints about the future use of data via a Policy API. A Data Manager uses these hints to choose when and where to place a particular data object through a data management API, bridging the semantic gap between the programmer and the platform-specific hardware details and optimizing overall performance. We evaluate the proposed framework on a real hardware platform with terabytes of memory consisting of NVRAM and DRAM, on large-scale ML training workloads such as CNNs, DNNs, and DLRM that exhibit different data access and usage patterns.
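As a rough illustration of the hint-driven placement idea, the toy sketch below separates a programmer-supplied hint from a data manager that decides between DRAM and NVRAM. The class and method names are invented for this example; they are not the CachedArrays API.

```python
# Hypothetical sketch of a CachedArrays-style policy / data-manager split.
# All names (Obj, DataManager, "hot") are illustrative, not the real API.

from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    size: int   # bytes
    hot: bool   # semantic hint: the programmer expects reuse soon

class DataManager:
    """Places objects in fast DRAM while it fits, spills cold data to NVRAM."""
    def __init__(self, dram_capacity):
        self.dram_capacity = dram_capacity
        self.dram_used = 0
        self.placement = {}

    def place(self, obj):
        # Honor the programmer's hint: hot objects prefer DRAM.
        if obj.hot and self.dram_used + obj.size <= self.dram_capacity:
            self.dram_used += obj.size
            self.placement[obj.name] = "DRAM"
        else:
            self.placement[obj.name] = "NVRAM"
        return self.placement[obj.name]

mgr = DataManager(dram_capacity=100)
print(mgr.place(Obj("weights", 60, hot=True)))      # DRAM
print(mgr.place(Obj("archive", 50, hot=False)))     # NVRAM (cold data)
print(mgr.place(Obj("activations", 50, hot=True)))  # NVRAM (DRAM full)
```

The real system additionally decides *when* to migrate objects, which this sketch omits.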
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
Description: This paper presents a parameterized analytical performance model of transformer-based Large Language Models (LLMs) for guiding high-level algorithm-architecture codesign studies. This model derives from an extensive survey of performance optimizations that have been proposed for the training and inference of LLMs; the model's parameters capture application characteristics, the hardware system, and the space of implementation strategies. With such a model, we can systematically explore a joint space of hardware and software configurations to identify optimal system designs under given constraints, like the total amount of system memory. We implemented this model and methodology in a Python-based open-source tool called Calculon. Using it, we identified novel system designs that look significantly different from current inference and training systems, showing quantitatively the estimated potential to achieve higher efficiency, lower cost, and better scalability.
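The flavor of such a parameterized model can be conveyed with a deliberately simplified sketch: given model size and per-device memory, derive the minimum device count. The memory formula and constants below are illustrative assumptions, not Calculon's actual model.

```python
# Back-of-the-envelope analytical model in the spirit of Calculon.
# The formula (fp16 weights + Adam optimizer state) is a common rough
# estimate and is NOT the tool's real parameterization.

def llm_train_memory_gb(params_b, bytes_per_param=2, optimizer_factor=8):
    """Rough memory for weights plus optimizer state, in GB,
    for a model with params_b billion parameters."""
    return params_b * (bytes_per_param + optimizer_factor)

def min_devices(params_b, device_mem_gb):
    """Smallest device count whose aggregate memory holds the model state."""
    need = llm_train_memory_gb(params_b)
    return -(-need // device_mem_gb)  # ceiling division

# A 70B-parameter model under these assumptions:
print(llm_train_memory_gb(70))  # 700 GB of model state
print(min_devices(70, 80))      # at least 9 devices with 80 GB each
```

A full model like Calculon layers many more such constraints (bandwidth, parallelism strategy, interconnect) on top of this kind of closed-form reasoning.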
Workshop
W
Description: The ongoing revolution enabled by containerization, virtualization, and new orchestration models has dramatically changed how applications and services are delivered and managed across the computing industry. This revolution has established a new ecosystem of tools and techniques with new, flexible, and agile approaches, and continues to gain traction in the HPC community. In addition to HPC-optimized container runtimes, emerging technologies like Kubernetes create a new set of opportunities and challenges. While adoption is growing, questions regarding best practices, foundational concepts, tools, and standards remain. Our goal is to promote the adoption of these tools and examine the impact of this new ecosystem on HPC use cases. This workshop serves as a key venue for presenting late-breaking research, sharing experiences and best practices, and fostering collaboration in this field. Our fifth workshop iteration will continue to emphasize real-world experiences and challenges in adopting and optimizing these new approaches for HPC.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Description: Extending Linux through kernel modules offers immense potential benefits and capabilities for HPC. Deployment is also more likely since Linux is typically the only supported vendor OS. However, because Linux is monolithic, kernel modules are free to access any address with maximum permissions. A poorly written---or untrustworthy---module can wreak havoc. This makes it hard to justify including custom kernel modules in production HPC systems. We address this limitation using the previously developed compiler- and runtime-based address translation (CARAT) model and toolchain, which injects guards around memory accesses. The accesses are then allowed/disallowed according to a policy. We share our results regarding the guard injection and address validation process. Our CARAT-based Kernel Object Protection (CARAT KOP) prototype is able to transform a substantial production kernel module from the kernel tree (a NIC driver comprising ~19,000 lines of code). The transformed module runs with minimal effect on its performance.
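The guard concept can be sketched in a few lines: every memory access is funneled through a policy check before it is allowed. This is a conceptual toy (CARAT injects such guards at compile time around native loads and stores); all names and address ranges here are invented.

```python
# Conceptual sketch of a CARAT-style guard: an access is permitted only
# if its address falls within a region the policy allows. Illustrative only.

ALLOWED = [(0x1000, 0x2000), (0x8000, 0x9000)]  # regions the module may touch

class PolicyViolation(Exception):
    pass

def guarded_access(addr, regions=ALLOWED):
    """The injected guard: allow the access only for permitted regions."""
    if any(lo <= addr < hi for lo, hi in regions):
        return True
    raise PolicyViolation(hex(addr))

assert guarded_access(0x1800)   # inside the first allowed region
try:
    guarded_access(0x5000)      # outside every allowed region
except PolicyViolation:
    print("blocked")
```

The engineering challenge the paper addresses is making such checks cheap enough that a ~19,000-line driver runs with minimal overhead.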
Panel
Energy Efficiency
Green Computing
Sustainability
TP
Description: What does it mean for computer systems to be sustainable? We have made significant improvements to the operational efficiency of HPC systems. We now need to consider a broader scope of environmental impacts across the life cycle of our systems: how they are designed and manufactured, how they are transported, how they are operated, and how we tear them down, reuse, and recycle them after they are no longer useful. These considerations may not be obvious. For example, manufacturing costs dominate the life-cycle carbon footprint of systems, and that trend is on the rise. How can we start to consider the carbon footprint across the end-to-end life cycle of our systems? We have many capabilities for understanding the performance, power, and energy of our systems, but the same cannot be said for carbon footprint. Should carbon footprint be a first-order optimization target?
Early Career Program
Inclusivity
Inclusivity
TP
Description: Finding the right career path early may be one of the most rewarding discoveries in a young professional's life. This panel discussion will feature insightful stories and kernels of wisdom from four panelists whose diverse careers span start-ups to large companies, non-profit organizations to universities, and government labs to government agencies. They offer their practical wisdom to paint a broader picture of the different workplaces in the HPC community, helping young individuals better match their strengths and objectives to the challenges and rewards of those workplaces.
Workshop
Programming Frameworks and System Software
W
Description: We present a new methodology and tool that speeds up the process of optimizing science and engineering programs. The tool, called CaRV (Capture, Replay, and Validate), enables users to experiment quickly with large applications, comparing individual program sections before and after optimizations in terms of efficiency and accuracy. Using language-level checkpointing techniques, CaRV captures the necessary data for replaying the experimental section as a separate execution unit after the code optimization and validating the optimization against the original program. The tool reduces the amount of time and resources spent on experimentation with long-running programs by up to two orders of magnitude, making program optimization more efficient and cost-effective.
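A toy version of the capture/replay/validate loop is shown below, with simple deep-copy snapshots standing in for CaRV's language-level checkpointing; the function names are illustrative, not the tool's API.

```python
# Minimal sketch of the capture / replay / validate idea behind CaRV.
# Deep copies stand in for real checkpoints; names are invented.

import copy
import math

def capture(state):
    """Checkpoint the data the experimental section needs."""
    return copy.deepcopy(state)

def validate(reference, candidate, tol=1e-9):
    """Compare original and optimized outputs within a tolerance."""
    return all(math.isclose(a, b, rel_tol=tol)
               for a, b in zip(reference, candidate))

def original_section(xs):
    return [x * x for x in xs]

def optimized_section(xs):
    return [x ** 2 for x in xs]   # stand-in for a hand-optimized rewrite

snap = capture([1.0, 2.0, 3.0])
ref = original_section(snap)
out = optimized_section(capture(snap))  # replay from the same checkpoint
print(validate(ref, out))               # True: the optimization is faithful
```

The point of replaying from a checkpoint is that the optimized section can be tested in isolation, without re-running the long program that produced its inputs.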
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: Preparing for the deployment of large scientific and engineering codes on GPU-dense exascale systems is made challenging by the unprecedented diversity of vendor hardware and programming model alternatives for offload acceleration. To leverage the exaflops of GPUs from Frontier (AMD) and Aurora (Intel), users of high performance computing (HPC) legacy codes originally written to target NVIDIA GPUs will have to make decisions with implications regarding porting effort, performance, and code maintainability. To facilitate HPC users navigating this space, we have established a pipeline that combines generalized GPU performance models with proxy applications to evaluate the performance portability of a massively parallel computational fluid dynamics (CFD) code in CUDA, SYCL, HIP, and Kokkos with backends on current NVIDIA-based machines as well as testbeds for Aurora (Intel) and Frontier (AMD). We demonstrate the utility of predictive models and proxy applications in gauging performance bounds and guiding hand-tuning efforts.
Workshop
Software Engineering
W
Workshop
Education
State of the Practice
W
Description: CDER Announcements and Closing Announcements
Workshop
Programming Frameworks and System Software
W
Description: Local support for large language models (LLMs) in a research community can address unique technological and procedural challenges that arise in an academic setting. Platforms providing multi-GPU nodes, typically found in a centralized computational resource, such as a university datacenter, can manage the large memory footprint of the open-source LLMs. Customizations employing peripheral frameworks help extend the capabilities of these models. Further, the local implementation addresses the protection of researcher IP and control of restricted data sources. This report describes recent efforts toward provisioning this popular new tool and provides guidance for recreating our approach at Arizona State University.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
Description: GPU-centric accelerated supercomputing remains the mainstream for HPC and AI applications. In next-generation systems, however, we will need to consider a wider variety of accelerators in different styles of systems at the architecture level. On such complicated systems, what is the best way of programming that balances programmability/productivity with performance? We have been working on a multi-hetero accelerated environment that combines GPU and FPGA in a single platform to support complicated multiphysics applications with 360-degree utilization of accelerating devices. There are several approaches, ranging from naive implementation to a high-level directive-based approach. In this talk, I will present the programming model, the supporting language system, and target applications with their implementation on a real system.
Workshop
Architecture and Networks
Hardware Technologies
W
Description: The RISC-V "V" extension introduces vector processing to the RISC-V architecture. Unlike most SIMD extensions, it supports long vectors, which can yield significant improvements for many applications. We present our ongoing research to implement and optimize a vectorized Winograd algorithm, used in convolutional layers, on RISC-V Vector (RISC-VV) processors. Our study identifies effective techniques for optimizing the Winograd kernels on RISC-VV using intrinsic instructions, and showcases how certain instructions offer better performance. Our co-design findings suggest that the Winograd algorithm benefits from vector lengths up to 2048 bits and cache sizes up to 64MB.
We use our experience with Winograd to highlight potential enhancements to the standard that would simplify code generation and aid low-level programming. Finally, we share our experience from experimenting with forks of gem5 for RISC-VV and stress the importance of a mature software ecosystem to facilitate design space exploration and architectural optimization.
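For readers unfamiliar with the kernel being vectorized: the 1D Winograd F(2,3) transform computes two outputs of a 3-tap convolution with four multiplications instead of six. The scalar sketch below shows that arithmetic; the intrinsics-based RISC-VV kernels vectorize exactly this kind of computation across many tiles.

```python
# 1D Winograd F(2,3): two outputs of a 3-tap convolution (correlation form)
# using 4 multiplies instead of 6 -- the scalar core of the vectorized kernel.

def winograd_f23(d, g):
    """d: 4 input values, g: 3 filter taps -> 2 outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference: straightforward sliding-window correlation."""
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d, g = [1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0]
print(winograd_f23(d, g), direct(d, g))   # both [6.0, 9.0]
```

The 2D version used in convolutional layers nests this transform over input tiles, which is where long vectors pay off.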
Posters
Research Posters
TP
XO/EX
Description: The IceCube Neutrino Observatory is a cubic-kilometer neutrino telescope located at the geographic South Pole. Understanding detector systematic effects is a continuous process, requiring the Monte Carlo simulation to be updated periodically to quantify potential changes and improvements in science results with more detailed modelling of the systematic effects. IceCube's largest systematic effect comes from the optical properties of the ice in which the detector is embedded. Over the last few years there have been considerable improvements in the understanding of the ice, which require a significant processing campaign to update the simulation. In winter 2023, the NRP project offered to provide the needed GPU compute to IceCube in support of this activity. Given the mostly uniform nature of such a simulation campaign, we have enough statistics to properly characterize the relative performance of the dozen GPU models present in the NRP in the context of IceCube.
Posters
Research Posters
TP
XO/EX
Description: OpenSHMEM is a widely used Partitioned Global Address Space (PGAS) programming model in the HPC community. The latest OpenSHMEM Specification v1.5 introduced the team concept and team-based collective communication, similar to communicators and collective communication in the Message Passing Interface (MPI) programming model. However, typical OpenSHMEM collective designs rely on one-sided communication such as Put and Get to move data, unlike the two-sided communication used in MPI collectives. In this work, we compare OpenSHMEM collective designs using native one-sided communication and MPI-based two-sided communication on an HPC cluster. We characterize two aspects (synchronization and collective algorithms) that can influence the performance of these two designs and use benchmarks to show the performance differences. Through our evaluation, we find that the MPI-based design is faster in most cases, while the one-sided design can be faster in certain cases.
Posters
Research Posters
TP
XO/EX
Description: Optimizing iPIC3D, an implicit Particle-in-Cell (PIC) code, for large-scale 3D plasma simulations is crucial for space and astrophysical applications. This work focuses on characterizing iPIC3D's communication efficiency through strategies such as optimal node placement, overlap of communication and computation, and load balancing. Profiling and tracing tools are employed to analyze iPIC3D's communication efficiency and provide practical recommendations. Implementing optimized communication protocols enables more precise simulations of the Geospace Environmental Modeling (GEM) magnetic reconnection challenge in plasma physics. This approach captures the complexities of 3D plasma simulations, particularly magnetic reconnection, advancing space and astrophysical research.
Workshop
Accelerators
Artificial Intelligence/Machine Learning
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
Description: In recent years, we have seen an emergence of novel spatial architectures to accelerate domain-specific workloads like machine learning. There is a need to investigate their performance characteristics on traditional HPC workloads for tighter integration with current and future heterogeneous compute resources. In this work, we implement, optimize, and evaluate a parallel triangle counting algorithm for graphs in the Bulk Synchronous Parallel (BSP) model on Graphcore's IPU architecture, and discuss lessons learned. This study demonstrates the IPU's competency in handling such irregular workloads, providing an average speedup of up to 5.3x over an NVIDIA A100 GPU on real-world datasets.
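The algorithmic core being mapped to the IPU is standard set-intersection triangle counting; the serial sketch below shows that core (the workshop paper's BSP decomposition across IPU tiles is not reproduced here).

```python
# Set-intersection triangle counting for an undirected edge list.
# Serial illustration of the kernel; the BSP/IPU mapping is omitted.

def count_triangles(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    total = 0
    for u, v in edges:
        # Common neighbors of an edge's endpoints close a triangle.
        total += len(adj[u] & adj[v])
    return total // 3  # each triangle is discovered once per each of its 3 edges

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(count_triangles(edges))  # 1 triangle: (0, 1, 2)
```

The irregularity the study highlights comes from the wildly varying set sizes in the intersection step, which is what makes real-world graphs hard to load-balance.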
Doctoral Showcase
Posters
Accelerators
Applications
TP
Description: The reconstruction of the trajectories of charged particles through detector experiments is a core computational task in high-energy physics. Upcoming upgrades to accelerators such as the Large Hadron Collider, as well as to experiments like ATLAS, threaten to render existing CPU-based approaches to track reconstruction insufficient, and the use of massively parallel systems - GPGPUs in particular - is an important opportunity to meet future data processing requirements. In my thesis, I investigate the feasibility of GPGPU-based track reconstruction from a performance engineering perspective: I focus on structured analysis of application performance, the development of statistical and analytical performance models, methods for mitigating the challenges of GPGPU programming, and the design and implementation of novel track reconstruction algorithms. The key contributions of my thesis include novel algorithms for hit clustering, seed finding, and combinatorial Kalman filtering, key parts of the track reconstruction process. These algorithms suffer from significant load imbalance and thread divergence, and I have developed a novel statistical method for estimating the resulting performance effects and for guiding optimization through thread refinement and coarsening. I have developed a method for automated design space exploration of data storage methods for magnetic fields, which play a crucial role in track reconstruction. Furthermore, I have developed an evolutionary method for finding layouts for multi-dimensional arrays in hierarchical memory systems. My thesis will conclude with a comprehensive study of track reconstruction performance, guided by the aforementioned research.
Workshop
W
Description: A popular approach to deploying scientific applications in high performance computing (HPC) is Linux containers, which package an application and all its dependencies as a single unit. This image is built by interpreting instructions in a machine-readable recipe, which is faster with a build cache that stores instruction results for re-use. The standard approach (used e.g. by Docker and Podman) is a many-layered union filesystem, encoding differences between layers as tar archives.
We describe a new approach, implemented in Charliecloud: store changing images in a Git repository. Our experiments show this performs similarly to layered caches on both build time and disk usage, with a considerable advantage for many-instruction recipes. Our approach also has structural advantages: better diff format, lower cache overhead, and better file de-duplication. These results show that a Git-based cache for layer-free container implementations is not only possible but may outperform the layered approach on important dimensions.
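The cache-keying idea common to both approaches can be modeled in a few lines: each instruction's cache key hashes the instruction together with the state it builds on, much as a Git commit hashes its parent. This toy stores states in a dict; Charliecloud actually stores image states in a Git repository, which is the paper's point.

```python
# Toy model of a recipe build cache keyed like a commit DAG.
# Hypothetical sketch; not Charliecloud's implementation.

import hashlib

cache = {}  # state-id -> materialized image state

def build(recipe):
    """'Build' a recipe, re-using any prefix already in the cache."""
    state_id, state, hits = "empty", (), 0
    for instr in recipe:
        # Key = hash of previous state-id plus this instruction.
        state_id = hashlib.sha256((state_id + instr).encode()).hexdigest()
        if state_id in cache:
            state, hits = cache[state_id], hits + 1  # re-use, skip execution
        else:
            state = state + (instr,)                 # "execute" the instruction
            cache[state_id] = state
    return state, hits

img1, hits1 = build(["FROM base", "RUN make"])
img2, hits2 = build(["FROM base", "RUN make", "COPY app"])
print(hits1, hits2)   # first build misses everything; the second re-uses 2 steps
```

Because keys chain on the parent state, any shared recipe prefix is found in the cache automatically, which is what makes many-instruction recipes benefit most.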
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
Description: Addressing performance portability across diverse accelerator architectures has emerged as a major challenge in developing applications and programming systems for high-performance computing (HPC) environments. Although recent performance portability programming systems have significantly improved productivity in meeting this challenge, the problem becomes notably intricate within computing nodes equipped with multiple accelerator types, each distinguished by unique performance attributes, optimal data layouts, and binary formats. To navigate the intricacies of multi-accelerator programming, we propose CHARM-SYCL, an extension of our multi-accelerator execution environment CHARM. This environment combines our SYCL-based performance portability programming front-end with an extreme-heterogeneity runtime back-end built on IRIS from Oak Ridge National Laboratory. We present the architecture of CHARM-SYCL, delving into the compilation flow and the SYCL-IRIS runtime integration. Our preliminary evaluation indicates a potential productivity boost while providing reasonable performance compared to platform-specific programming systems and runtimes.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: In autonomous driving, computational resources are strained by inference models, and the viability of offloading inference to the cloud, given the latency between the car and the data center, is in question. We introduce a Cloud-Aided Real-time Inferencing Framework that integrates with Donkeycar and distributes the computational load between cloud and edge. Using a Raspberry Pi 4 for edge inferencing and an NVIDIA Triton Inference Server in the cloud, we demonstrate the framework's advantages, particularly for RNN performance, which achieved 90% autonomy. Our study includes a scaled car navigating obstacles, assessing factors like speed, resources, latency, and autonomy score. The system achieves faster inference time, eliminating bottlenecks and processing 42 frames per second in the cloud, 11 times faster than on the edge. The poster will detail the strengths, limitations, and potential of leveraging cloud resources in real-time edge environments, focusing on autonomy scores and latency trade-offs.
Panel
Artificial Intelligence/Machine Learning
Codesign
Heterogeneous Computing
TP
Description: Chiplets have become a compelling approach to incorporating specialization and massive bandwidth into the compute and memory devices used in HPC, but there are many challenges in realizing the vision of affordable modular HPC using advanced packaging technology. We bring together a diverse panel of experts to discuss whether an ecosystem or marketplace of chiplets will emerge for system developers to build next-generation devices, and to weigh the pros and cons of off-the-shelf versus custom-designed chiplets. Chiplets could be processors, GPUs, networking interfaces, optical engines, memory controllers, or FPGAs.
Paper
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
Description: Graph analytics has become a major workload in recent years. The underlying core algorithms tend to be irregular and data dependent, making them challenging to parallelize. Yet, these algorithms can be implemented and parallelized in many ways for CPUs and even more ways for GPUs. We took 6 key graph algorithms and created hundreds of parallel CUDA, OpenMP, and parallel C++ versions of each of them, most of which have never been described or studied. To determine which parallelization and implementation styles work well and under what circumstances, we evaluated the resulting 1106 programs on 2 GPUs and 2 CPUs using 5 input graphs. Our results show which styles and combinations thereof work well and which ones should be avoided. We found that choosing the wrong implementation style can yield over a 10x performance loss on average. The worst combinations of styles can cost 6 orders of magnitude in performance.
Inclusivity
Inclusivity
Description: Cinematic Scientific Visualization (CSV) is a growing subfield of visualization which aims to reach diverse audiences through films, immersive experiences, and social media. CSVs make complex scientific datasets accessible by using Hollywood-style computer graphics techniques and cinematography. In this session, hear from NCSA's Advanced Visualization Lab on its pioneering efforts in this field.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Description: Tracking hemodynamic responses to treatment and stimuli for long periods is a grand challenge. Moving from established single-heartbeat technology to longitudinal profiles would require continuous data reflecting a patient's evolving state, methods to extend the temporal domain that could be feasibly computed, and high-throughput resources. Although personalized models can accurately measure 3D hemodynamics over single heartbeats, state-of-the-art methods would require centuries of runtime on leadership-class systems to simulate one day of activity. We are establishing the Longitudinal Hemodynamic Mapping Framework (LHMF), which combines patient-specific models, wearables, and cloud computing to enable the first digital twins that capture longitudinal hemodynamic maps (LHMs). We demonstrate validity through comparison with ground truth data for 750 beats. We applied LHMF to generate the first LHM of coronary arteries spanning 4.5 million heartbeats. LHMF relies on an initial fixed set of representative simulations to enable the computationally tractable creation of LHM over heterogeneous systems.
Paper
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
TP
Description: Tracking hemodynamic responses to treatment and stimuli over long periods remains a grand challenge. Moving from established single-heartbeat technology to longitudinal profiles would require continuous data describing how the patient's state evolves, new methods to extend the temporal domain over which flow is sampled, and high-throughput computing resources. While personalized digital twins can accurately measure 3D hemodynamics over several heartbeats, state-of-the-art methods would require hundreds of years of wallclock time on leadership scale systems to simulate one day of activity. To address these challenges, we propose a cloud-based, parallel-in-time framework leveraging continuous data from wearable devices to capture the first 3D patient-specific, longitudinal hemodynamic maps. We demonstrate the validity of our method by establishing ground truth data for 750 beats and comparing the results. Our cloud-based framework is based on an initial fixed set of simulations to enable the wearable-informed creation of personalized longitudinal hemodynamic maps.
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
TP
Description: This paper presents a solution to the challenge of mitigating carbon emissions from hosting large-scale machine learning (ML) inference services. ML inference is critical to modern technology products, but it is also a significant contributor to carbon footprint. We introduce Clover, a carbon-friendly ML inference service runtime system that balances performance, accuracy, and carbon emissions through mixed-quality models and GPU resource partitioning. Our experimental results demonstrate that Clover substantially reduces carbon emissions while maintaining high accuracy and meeting service-level agreement (SLA) targets.
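One way to picture the mixed-quality trade-off is a selector that picks the most accurate model variant fitting a carbon budget while respecting an accuracy floor. The variants, numbers, and policy below are hypothetical illustrations, not Clover's actual mechanism.

```python
# Illustrative greedy knob in the spirit of a carbon-aware inference runtime.
# All variant names and cost numbers are made up for this sketch.

variants = [  # (name, accuracy, gCO2 per 1k queries)
    ("full",      0.95, 12.0),
    ("distilled", 0.93,  5.0),
    ("int8",      0.90,  2.0),
]

def pick_variant(carbon_budget, min_accuracy):
    """Most accurate variant meeting both the carbon budget and the accuracy floor."""
    feasible = [v for v in variants
                if v[2] <= carbon_budget and v[1] >= min_accuracy]
    return max(feasible, key=lambda v: v[1])[0] if feasible else None

print(pick_variant(carbon_budget=6.0, min_accuracy=0.9))  # distilled
print(pick_variant(carbon_budget=2.5, min_accuracy=0.9))  # int8
```

A real runtime like Clover also decides how to partition GPU resources among the chosen variants and must track latency SLAs, which this one-dimensional sketch ignores.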
Workshop
Large Scale Systems
Performance Optimization
State of the Practice
W
Description: Configuring HPC nodes is an important aspect of maintaining any HPC cluster. Our flagship HPE/Cray EX supercomputer, Derecho, comprises approximately 2,500 compute nodes and is susceptible to power interruptions from external factors such as lightning-induced power sags and utility mishaps. These events challenged us to find an acceptable mean time to recovery. Ansible is our configuration management system of choice, but it struggles with single large-scale configuration runs despite per-run optimizations such as tuning the fork count and enabling pipelining. We needed a way to apply a large blast of configuration within a short time period to return the system to a functional state or apply some level of remediation such as security updates. We therefore wrote a utility, Clushible, which wraps Ansible with ClusterShell's Python API to scale out Ansible execution, effectively reducing our standard full-system run from multiple hours to minutes.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
Description: Vector search has emerged as the foundation for large-scale information retrieval and machine learning systems, with search engines like Google and Bing processing tens of thousands of queries per second on petabyte-scale document datasets by evaluating vector similarities between encoded query texts and web documents. As performance demands for vector search systems surge, accelerated hardware offers a promising solution in the post-Moore's Law era. We introduce FANNS, an end-to-end and scalable vector search framework on FPGAs. Given a user-provided recall requirement on a dataset and a hardware resource budget, FANNS automatically co-designs hardware and algorithm, subsequently generating the corresponding accelerator. The framework also supports scale-out by incorporating a hardware TCP/IP stack in the accelerator. FANNS attains up to 23.0x and 37.2x speedup compared to FPGA and CPU baselines, respectively, and demonstrates superior scalability to GPUs, achieving 5.5x and 7.6x speedup in median and 95th percentile latency within an eight-accelerator configuration.
Birds of a Feather
State of the Practice
TP
XO/EX
Description: New HPC technologies offer new opportunities but also bring challenges for users in a fast-developing HPC ecosystem. To better understand and prepare tailored offerings for industrial/commercial HPC users, the EC funded the HPC-GIG project to organize three market studies: on current HPC offerings for industry, on the current and future needs of industrial and commercial HPC users, and on the legal and business requirements for industrial/commercial use. In this BoF, we will present the highlights of the market studies and discuss the outlook for future services with both industrial users and HPC experts.
Birds of a Feather
Education
TP
XO/EX
Description: The National Science Foundation's Office of Advanced Cyberinfrastructure (OAC) supports the development and provisioning of state-of-the-art cyberinfrastructure resources, including HPC systems, tools, and services essential to the advancement of science and engineering. A critical vision and investment plan of OAC is to support inclusive and sustainable workforce development that will lead to transformative research leveraging such cyberinfrastructure. We seek to engage with the community and institutions to obtain feedback on preparing a workforce to address the evolving needs of research communities, including facilitating the invention and usage of CI, promoting democratized access, and fostering sustainable CI ecosystems.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Description: We explore the performance of the Intel Xeon MAX CPU Series, the most significant new variation on the classical CPU architecture since the Xeon Phi. With a large on-package high-bandwidth memory, the bandwidth-to-compute ratio has shifted significantly compared to other CPUs on the market. Since a large fraction of HPC workloads are sensitive to available bandwidth, we explore how this architecture performs on a selection of mostly bandwidth-sensitive HPC proxies and applications, and how it compares to the previous 3rd-generation Xeon processors (Ice Lake) and an EPYC 7003 Series processor with 3D V-Cache technology. We explore performance with different parallel implementations (MPI, MPI+OpenMP, MPI+SYCL), compiled with different compilers and flags, and executed with or without hyperthreading. We show how performance bottlenecks shift from bandwidth to communication latencies for some applications, and demonstrate speedups of 2.0x-4.3x over the previous generation.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionLarge scientific collaborations often have many users accessing the same data files, creating repeated file transfers over long distances. Accesses to distant data sources impose long latencies on applications and can be further delayed by limited network bandwidth. An XCache-based in-network regional data caching system stores scientific data and can reduce network traffic and access latency. We examine the established Southern California Petabyte Scale Cache (So Cal Cache) and the newly deployed Chicago Regional Cache (Chicago Cache) for a high-energy physics experiment to analyze cache utilization trends and compare regional data access patterns. The cache utilization trends show that the caches served a majority of the data accesses, and regional differences can be explained by the comparative study. Additionally, predictions of cache behavior show low error values in both regions, providing a useful tool for future resource planning.
Workshop
Accelerators
Algorithms
Applications
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionN-body algorithms calculate the interactions between n different bodies in order to obtain their trajectories. Algorithms that solve the n-body problem can leverage significant amounts of parallelism. Today, GPUs are commonly used alongside CPUs to execute parallel algorithms. However, targeting several hardware platforms at once often requires using different programming languages. In this work, we have implemented the naive and the tree-based Barnes-Hut n-body algorithms using SYCL to target CPUs and GPUs with the same programming language. We compare both algorithms on heterogeneous hardware platforms as well as across different SYCL implementations, with respect to their runtime behavior and their support for several performance optimizations. Our results show that some optimizations behave unexpectedly under different SYCL implementations, and even though data center GPUs have a clear performance advantage for the naive algorithm, consumer GPUs surprisingly offer competitive runtimes for the Barnes-Hut algorithm.
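The naive algorithm can be sketched in a few lines; this is an illustrative pure-Python version (the paper's implementations use SYCL on CPUs and GPUs), with the function name, Euler integrator, and softening term chosen for clarity rather than taken from the paper:

```python
import math

def naive_nbody_step(pos, vel, mass, dt, G=6.674e-11, eps=1e-9):
    """One step of the naive O(n^2) n-body algorithm: every body
    accumulates the gravitational acceleration from every other body."""
    n = len(pos)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dx = [pos[j][k] - pos[i][k] for k in range(3)]
            r2 = sum(d * d for d in dx) + eps  # softening avoids division by zero
            inv_r3 = 1.0 / (math.sqrt(r2) * r2)
            for k in range(3):
                acc[i][k] += G * mass[j] * dx[k] * inv_r3
    # simple explicit Euler update; production codes use leapfrog or Verlet
    for i in range(n):
        for k in range(3):
            vel[i][k] += acc[i][k] * dt
            pos[i][k] += vel[i][k] * dt
    return pos, vel
```

The two nested loops over all pairs are what make the naive algorithm so amenable to GPU parallelism; Barnes-Hut replaces the inner loop with a tree traversal.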
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionPower is a limiting factor for supercomputers, constraining both their scale and their operation. Characterizing the power signatures of different application types can enable data centers to operate efficiently, even when power constrained. This paper investigates power profiles of diverse scientific applications, spanning both traditional simulations and modern machine learning (ML), running on the Perlmutter supercomputer at the National Energy Research Scientific Computing Center (NERSC). Our findings indicate that traditional simulations typically consume more power on average than ML workloads. Furthermore, ML applications exhibit periodic power fluctuations attributed to epoch transitions during training. Finally, we discuss the potential implications of the research insights toward automatic demand response (ADR) and considerations for designing future systems.
Exhibitor Forum
Architecture and Networks
Cloud Computing
TP
XO/EX
DescriptionAfter our very lively panel last year at SC22, “Smackdown: Does HPC Need Composability Now?”, where 40% of attendees agreed with the premise, and over 50% of attendees noted it was either “quite” or “extremely” relevant to the problems they are currently trying to solve, we are following up this year with a user perspective from the trenches.
While composability vendors literally promise the impossible, from “impossible servers” to “software-defined hardware”, the reality of implementing these systems in the wild can be sobering, even though they promise many benefits for the HPC user.
In this talk, Sean Taylor will share his perspective as a Senior Linux HPC Engineer at Oak Ridge National Laboratory after having evaluated, deployed, dismantled, and ultimately adopted various composable systems.
This technical deep dive will parse out scenarios where composable solutions work well versus those where they are merely helpful, and will outline the challenges of implementation. Sean will break down the key factors potential users need to consider in order to take advantage of what composability offers without falling into its pitfalls.
Particular attention will be paid to the key benefits of a composable versus a static architecture, notably the fact that nodes are nondeterministic and malleable, configured based on job need, which in ORNL’s experience affords “the most efficient use of resources and substantially better cluster efficacy.”
Workshop
Education
State of the Practice
Sustainability
W
DescriptionThe current landscape of HPC courses and training materials heavily emphasizes foundational concepts, especially those related to MPI/OpenMP and CUDA. However, there seems to be a gap when it comes to mid-level topics like sparse linear algebra using HPC. Given the breadth of subject matter, such courses are less commonly developed.
A proposed solution leans on the UNIX philosophy of 'doing one thing and doing it well,' combined with the collaborative ethos and tools that have propelled open-source projects, such as the Linux kernel, to success. By adopting these practices, we can enhance the distribution of HPC training, especially in relatively underrepresented topics. The end goal is to produce HPC materials focused on specific subjects such as sparse linear algebra numerics, along with coverage of rapidly evolving HPC topics.
In this lightning talk, we'll delve into the advantages and challenges of this fresh approach to HPC training material development.
Tutorial
Algorithms
Applications
Data Compression
I/O and File Systems
TUT
DescriptionLarge-scale numerical simulations, observations, experiments, and AI computations are generating or consuming very large datasets that are difficult to analyze, store, and transfer. Data compression is an attractive and efficient technique to significantly reduce scientific datasets. This tutorial reviews the motivations, principles, techniques, and error analysis methods for lossy compression of scientific datasets. It details the main compression stages (e.g., decorrelation, approximation, and coding) and their variations through a presentation of state-of-the-art lossy compressors: SZ, ZFP, TThresh, MGARD, and SPERR. Special attention is paid to the trustworthiness of lossy compression. The tutorial addresses the following questions: Why lossy compression? How does compression work? How can compression error be measured and controlled? What are the current use cases? The tutorial uses examples of real-world scientific datasets to illustrate the different compression techniques and their performance. From a participant perspective, the tutorial will detail how to use compression software both as executables and as modules integrated in parallel I/O libraries (ADIOS, HDF5). This half-day tutorial, given by two of the leading teams in this domain and targeting primarily beginners interested in learning about lossy compression for scientific data, is improved from the highly rated tutorials given at ISC17-22 and SC17-22.
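To illustrate the decorrelation (prediction) and approximation (quantization) stages the tutorial covers, here is a toy sketch of SZ-style error-bounded compression; real compressors add a coding stage and much better predictors, and all names here are illustrative:

```python
def compress(data, error_bound):
    """SZ-style sketch: predict each value from its reconstructed
    predecessor and quantize the residual so that the pointwise
    reconstruction error never exceeds error_bound."""
    codes, prev = [], 0.0
    for x in data:
        q = round((x - prev) / (2 * error_bound))  # integer quantization code
        codes.append(q)
        prev = prev + q * 2 * error_bound          # track the reconstructed value
    return codes

def decompress(codes, error_bound):
    """Invert the prediction/quantization to recover approximate values."""
    out, prev = [], 0.0
    for q in codes:
        prev = prev + q * 2 * error_bound
        out.append(prev)
    return out
```

For smooth data the codes cluster around zero, which is what makes the subsequent coding stage effective; the guarantee is that every reconstructed value stays within the user-set error bound of the original.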
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionCompute Express Link™ (CXL™) – an open industry standard interconnect – offers coherency and memory semantics using high-bandwidth, low-latency connectivity between the host processor and devices such as accelerators, memory buffers, and smart I/O devices. CXL advances memory expansion and fabric management capabilities to increase system scalability and flexibility across multiple compute domains, enabling resource sharing for higher performance, reduced software stack complexity, and lower overall system cost. The fabric enhancements and memory expansion features included in CXL 3.0 deliver the new levels of composability required in HPC and modern data center environments.
Earlier this year, the CXL Consortium hosted the first official testing of CXL 1.1 products, Pre-FYI testing for CXL 2.0, and rolled out the first iteration of the CXL Integrators List. The HPC community will get a sneak peek at CXL products entering the market in the CXL Consortium booth (#1301).
This session will kick off with an update from the Consortium and introduce enhancements in the latest release of the CXL specification. The presentation will also highlight the challenges in the HPC industry and explore how CXL can disaggregate resources and enable composable systems. Additionally, the presentation will introduce CXL technology demos from our member companies showcasing CXL solutions, multi-vendor demos, and proofs-of-concept to highlight beneficial use cases for HPC and modern data center environments. Attendees will have the opportunity to ask CXL expert questions about the technology and gain insight into how CXL can help address challenges in the HPC community.
Invited Talk
Education
HPC in Society
TP
DescriptionWhat do culture and identity have to do with computing, including HPC? This talk will initiate a conversation about the ways in which an awareness of and attention to individual identities and workplace culture can positively impact your creativity, innovation, and productivity. Computing was created to be taught to everyone, but it has become narrower. In addition to computational systems and algorithmic processes, advanced computing capabilities also involve societal impact. The decisions about what is worthy of investigation, what problems get addressed (and funded), what counts as acceptable evidence of success, and so forth are influenced by who we are. These lived experiences represent a rich repository from which to draw phenomenal ideas. With the aim of urging reflection, this talk is intended to spur discussion about what Tissenbaum and colleagues (2021) refer to as “more diverse, equitable, and meaningful endpoints.”
Panel
Artificial Intelligence/Machine Learning
Edge Computing
IoT
TP
DescriptionNASA’s space missions have captured the imagination of those around the world for generations. From the International Space Station to Artemis, there is a need for HPC, data movement, analytics, and AI capabilities delivered as efficient pipelines. For example, future missions, such as the Dragonfly mission to Titan and other icy moon missions, may require AI at the extreme edge, with planned data flows to the core. Demand is skyrocketing with use cases spanning operational decision-making at the edge, ensuring the health and safety of our astronauts, and advancing scientific discovery. In fact, this edge capability, with AI/ML, is changing the business models for the evolving space and climate economies and the way we architect HPC systems. Hear from and engage with our panel of experts about how recent missions have expanded our concept of computing at the edge – for both space-based and terrestrial challenges.
Workshop
W
DescriptionHigh-Performance Computing (HPC) provides significant advantages to researchers across diverse fields due to its capability to handle complex and data-intensive computations beyond conventional systems. However, not all researchers possess the expertise to leverage HPC effectively. Sandia's Computing as a Service (CaaS) project aims to democratize HPC by offering its performance without requiring researchers to become HPC experts.
CaaS first developed a prototype for the DetNet team, focusing on delivering simulation-as-a-service for the detonator community. This presentation highlights the prototype's components (UI, Cloud, HPC), demonstrating their use of containerization for scalability and portability. The creation of containers optimized for HPC resource utilization is discussed, covering performance and security challenges. Additionally, the deployment of frontend containers via Kubernetes is outlined, including challenges linked to integrating frontend and HPC cluster containers. This initiative bridges the gap between HPC capabilities and researchers' accessibility, fostering a collaborative environment for advanced computational research.
Exhibits
Flash Session
TP
XO/EX
DescriptionOptimizing AI/ML networking involves balancing speed and performance while managing the inherent complexity of data transfer and computation. Efficient data pipelines, low-latency communication, and robust hardware configurations are crucial for enabling fast, high-performing AI/ML applications within intricate network infrastructures.
Birds of a Feather
Cloud Computing
Distributed Computing
TP
XO/EX
DescriptionHigh-Performance Computing systems that have been traditionally deployed at a single site are expected to significantly expand their reach to include a variety of remote edge systems. These edge systems include computing platforms located near instruments as well as instruments themselves. Examples range from interconnected ecosystems of large science instruments to smart energy grids supported by complex analytics and control. These interconnected systems form a compute and instrument continuum wherein computation is orchestrated in various stages. This BoF will discuss the aggregation and synthesis of previously distinct techniques and tools (including HPC, AI/ML, and digital twins) to enable continuum computing.
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
HPC in Society
Modeling and Simulation
Visualization
TP
XO/EX
DescriptionThe 2019 Ridgecrest earthquakes occurred in a complex system of fault lines in the Mojave desert. Separated by 34 hours, the earthquakes were caused by ruptures in separate but nearby faults. In this study of the geophysical processes underlying these events the surface, known faults and the volumetric subsurface are modeled on HPC systems. Visualization techniques are used to analyze the simulation results in their three-dimensional context.
Doctoral Showcase
Posters
Cloud Computing
Distributed Computing
TP
DescriptionTo achieve the resource-agnostic flexibility of compute described by the computing continuum, we combined our work in workload profiling and cost estimation with task provisioning to present DELTA, a framework for serverless workload placement across a computing ecosystem. To address the dynamic availability of modern computing resources as well as the multiple costs involved in computing, we presented extensions of our framework as DELTA+, which adds resource provisioning and accounts for multidimensional compute costs.
To bring this idea of resource abstraction via serverless into the rapidly growing field of federated learning, we developed and released FLoX: Federated Learning on funcX. This framework was built from the ground up around a serverless computing paradigm with experimentation and usability in mind. Extending the lessons learned from DELTA around self-adaptive systems, we began exploring the potential of automating tradeoffs found in FLoX and federated learning in general.
Looking ahead, we are developing FLoX into a much more robust framework to enable the use of a wide range of computing resources while abstracting away the difficulties of configuring and optimizing a federated learning experiment. Additionally, we are actively working on a re-release of DELTA with all extensions combined into one framework with updated cost and execution time predictors and complete resource provisioning ability. Finally, we are designing an integration between FLoX and DELTA that will enable serverless-based FL to automatically place each component of an FL flow and move data as necessary to best use the available resources.
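As a rough illustration of the placement problem DELTA addresses, the following toy chooser minimizes a weighted combination of predicted execution time and monetary cost; the field names, cost model, and weights are hypothetical and not the actual DELTA API:

```python
def place_task(task, resources, weights):
    """Toy multidimensional placement in the spirit of DELTA:
    choose the resource minimizing a weighted sum of predicted
    execution time and monetary cost. All names are illustrative."""
    def score(r):
        est_time = task["work"] / r["throughput"]   # predicted execution time
        est_cost = est_time * r["price_per_s"]      # predicted monetary cost
        return weights["time"] * est_time + weights["cost"] * est_cost
    return min(resources, key=score)
```

Shifting the weights moves the placement between a fast, expensive cluster and a slow, cheap one, which is the tradeoff a self-adaptive system would tune automatically.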
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionLarge language models (LLMs) are becoming increasingly popular for a variety of AI services, such as chatbots and virtual assistants. However, serving LLMs can be challenging due to their high operating costs and long service latency. The main challenge is the memory bandwidth bottleneck: LLMs require large amounts of memory to store their parameters, and the bandwidth to that memory can limit the speed of inference. As LLM models continue to grow in size, this problem will only get worse.
We propose a new solution to the memory bandwidth bottleneck for serving LLMs. Our solution, called AiM (Accelerator-in-Memory), is an SK hynix processing-in-memory (PIM) device specialized for serving LLMs. AiM exploits the abundant memory bandwidth available inside the memory device to accelerate GEMV operations, the most computationally expensive operations in LLM inference. We evaluated AiM on a variety of LLM models and tasks. Our results show that AiM can significantly improve the performance and energy efficiency of LLM inference. For example, on the GPT-3 model, AiM achieves up to a 10x speedup at lower cost and energy consumption than state-of-the-art GPU systems.
We believe that AiM is a promising solution to the memory bandwidth bottleneck for serving LLMs. AiM can significantly improve the performance and energy efficiency of LLM inference, making it possible to deploy LLMs in real-world applications.
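Why GEMV is bandwidth-bound can be seen from its arithmetic intensity; this toy Python version (not SK hynix code) reads each weight exactly once per output vector, so performance is limited by how fast weights can be streamed from memory rather than by compute:

```python
def gemv(W, x):
    """Matrix-vector product y = W @ x, the core operation of
    autoregressive LLM inference. Each of the len(W)*len(x) weights
    is read exactly once and used in a single multiply-add, so
    arithmetic intensity is roughly one flop per weight byte read:
    exactly the profile that processing-in-memory designs target."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
```

By contrast, GEMM (batched matrix-matrix products, as in training) reuses each weight many times, which is why GPUs serve training well but hit the memory wall at inference batch size one.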
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionAnalysis of a High-Performance Computing cluster’s external network traffic provides the opportunity to identify security issues, cluster misuse, or configuration problems without reducing performance. This project captured the external network traffic to and from a Cray EX40 cluster over three months and analyzed it utilizing two open-source intrusion detection tools, Suricata and Zeek. The tool alerts were sent to Splunk via rsyslog for parsing and analysis. Several security concerns were identified, including excessive failed authentication attempts and the use of four invalid certificates. Multiple cluster configuration issues were also identified, including recurrent anomalous Domain Name Service (DNS) queries which comprised 97% of all DNS traffic and incorrectly routed outbound Hypertext Transfer Protocol traffic. The port mirror architecture combined with network intrusion detection tools offered valuable insight into security concerns and several configuration issues. Excessive failed authentication attempts and a switch DNS configuration issue were both resolved by this project.
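A minimal sketch of the kind of aggregation used to spot the anomalous DNS queries, assuming Zeek's TSV log format with a `#fields` header; the sample log content in the usage below is invented for illustration:

```python
from collections import Counter

def top_dns_queries(zeek_dns_log_lines, n=5):
    """Tally query names from a Zeek dns.log (tab-separated values
    with a '#fields' header line naming the columns)."""
    fields, counts = None, Counter()
    for line in zeek_dns_log_lines:
        if line.startswith("#fields"):
            # header names the columns, e.g. "#fields\tts\t...\tquery\t..."
            fields = line.rstrip("\n").split("\t")[1:]
        elif line.startswith("#") or not line.strip():
            continue  # skip other metadata lines and blanks
        elif fields:
            row = line.rstrip("\n").split("\t")
            counts[row[fields.index("query")]] += 1
    return counts.most_common(n)
```

In a deployment like the one described, a single query name accounting for 97% of the tally would immediately flag the misconfiguration.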
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionInstrument-computing ecosystems that support automated electrochemical workflows typically require the integration of disparate instruments such as a syringe pump, a fraction collector, and a potentiostat, all connected to an electrochemical cell. These specialized instruments, with custom software and interfaces, are typically not designed for network integration and remote automation. We developed a networked ecosystem of these instruments and computing platforms, including software that enables automated workflow orchestration from remote computers. Specifically, we developed Python wrappers of APIs and custom Pyro client-server modules to support remote operation of these instruments over the ecosystem network. Herein, we describe a specific workflow for generating and validating voltammogram (I-V) measurements of an electrolyte solution pumped into the electrochemical cell. We demonstrate the orchestration of this workflow, which is composed using a Jupyter notebook and executed on a remote computer.
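The orchestration pattern can be sketched with local stand-ins for the instrument proxies; in the real ecosystem each class would be a Pyro remote object driven over the network, and all method names and the synthetic I-V ramp here are illustrative, not the actual instrument APIs:

```python
class SyringePump:
    """Local stand-in for a remote instrument wrapper; the real
    system exposes an object like this through a Pyro server."""
    def dispense(self, volume_ml):
        return f"dispensed {volume_ml} mL"

class Potentiostat:
    """Stand-in potentiostat returning a synthetic linear I-V ramp
    in place of a measured voltammogram."""
    def sweep(self, v_start, v_end, steps):
        dv = (v_end - v_start) / steps
        return [(v_start + i * dv, 1e-6 * (v_start + i * dv))
                for i in range(steps + 1)]

def run_voltammetry(pump, stat, volume_ml, v_start, v_end, steps=10):
    """Orchestrate one workflow step: fill the cell, then sweep.
    In the described system this script runs in a Jupyter notebook
    on a remote computer, calling the instruments through proxies."""
    pump.dispense(volume_ml)
    return stat.sweep(v_start, v_end, steps)
```

Because the orchestration code only sees method calls, swapping the local stand-ins for network proxies leaves the workflow script unchanged, which is the point of the wrapper design.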
Workshop
Education
State of the Practice
W
DescriptionThe Cross-Institutional Research Engagement Network (CIREN) is a collaborative project between the University of Tennessee, Knoxville (UTK) and Arizona State University (ASU). This project’s purpose is to fill critical gaps in the development and retention of cyberinfrastructure (CI) facilitators via training, mentorship, and research engagement. Engagements may include research projects at the CI facilitator’s local institution, between CIREN partner institutions, and through NSF’s ACCESS program. This lightning talk will detail the training curriculum and mentorship activities the project has implemented in its first year as well as plans for its future research engagements. Feedback is welcome from the community with respect to project directions, best practices, and challenges experienced in implementing this or similar programs at academic institutions.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionThe proliferation of artificial intelligence applications has underscored the need for increased portability among graphics processing units (GPUs) from different vendors. With CUDA as one of the most popular GPU programming languages, CuPBoP (CUDA for Parallelized and Broad-range Processors) aims to provide NVIDIA's proprietary CUDA language support to a variety of GPU and CPU platforms by translating CUDA programs at the LLVM/NVVM IR level. Our work extends CuPBoP to AMD GPUs as CuPBoP-AMD, a CUDA translator that translates CUDA programs at the NVVM IR level to HIP-compatible IR that can run on AMD GPUs. Currently, CuPBoP-AMD translates a broader range of applications in the Rodinia benchmark suite than HIPIFY, the existing state-of-the-art AMD-developed translator, while maintaining approximately equal performance and requiring no programmer intervention.
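For contrast with CuPBoP-AMD's IR-level approach, source-to-source translators like HIPIFY rely on API correspondences such as `cudaMalloc` → `hipMalloc`; this toy string rewriter (not HIPIFY's actual implementation) shows the idea and hints at its fragility:

```python
# A few real CUDA-to-HIP API correspondences (the full map is much larger)
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def hipify_line(line):
    """Naive textual rewrite in the spirit of source-to-source
    translation. Translating at the LLVM/NVVM IR level, as
    CuPBoP-AMD does, avoids this kind of string matching and the
    corner cases (macros, aliases, generated code) it mishandles."""
    for cuda, hip in CUDA_TO_HIP.items():
        line = line.replace(cuda, hip)
    return line
```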
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionStorage is an important part of HPC environments, especially with the explosion of data that comes with increasing computational power. But there are a number of evolving options and tradeoffs for storage (POSIX/S3, SSD/HDD/tape, on-premises/public cloud, management policies, etc.). The goal of this BoF is to facilitate a discussion about storage environments, and to share and hear plans and ideas from the audience and from the BoF leaders. Ultimately, we hope to help each other and the community better understand the options and best practices in the storage landscape.
Tutorial
Accelerators
Programming Frameworks and System Software
TUT
DescriptionOpen FPGA Stack (OFS) is the first complete hardware and software infrastructure that is fully open source, comprising composable hardware code and kernel code upstreamed to Linux.org, to enable a collaborative community of FPGA developers. The intention of OFS is to provide an efficient approach to developing a custom FPGA-based platform or solution by providing a framework of synthesizable code, a simulation environment, and scripts that developers can use as-is or modify. OFS source code can be used for development of an Intel, third-party, or custom FPGA solution. This hands-on tutorial will spotlight Open FPGA Stack, as well as oneAPI (supported by OFS), by providing FPGA developers the opportunity to do basic FPGA workload development using the open-source OFS infrastructure, source code, and documentation we provide on GitHub at www.github.com/OFS. Attendees will modify the Accelerator Functional Unit Region (AFU Region) to create their own FPGA workload using both RTL and C++ (enabled by oneAPI).
Paper
Accelerators
Data Analysis, Visualization, and Storage
Data Compression
TP
DescriptionModern scientific applications and supercomputing systems are generating large amounts of data in various fields, leading to critical challenges in data storage footprints and communication times. To address this issue, error-bounded GPU lossy compression has been widely adopted, since it can reduce the volume of data within a customized threshold on data distortion. In this work, we propose an ultra-fast error-bounded GPU lossy compressor cuSZp. Specifically, cuSZp computes the linear recurrences with hierarchical parallelism to fuse the massive computation into one kernel, drastically improving the end-to-end throughput. In addition, cuSZp adopts a block-wise design along with a lightweight fixed-length encoding and bit-shuffle inside each block such that it achieves high compression ratios and data quality. Our experiments on NVIDIA A100 GPU with 6 representative scientific datasets demonstrate that cuSZp can achieve an ultra-fast end-to-end throughput (95.53x compared with cuSZ) along with a high compression ratio and high reconstructed data quality.
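The block-wise fixed-length encoding idea can be sketched as follows; the block size, header size, and sign handling are illustrative simplifications, not cuSZp's actual on-disk format:

```python
def block_bits(codes, block_size=32):
    """Sketch of block-wise fixed-length encoding: each block of
    quantization codes is stored with just enough bits for its
    largest magnitude, so blocks of near-zero residuals (common in
    smooth scientific data) compress very well."""
    total = 0
    for i in range(0, len(codes), block_size):
        block = codes[i:i + block_size]
        width = max(abs(q) for q in block).bit_length() + 1  # +1 sign bit
        total += 8 + width * len(block)  # 8-bit per-block width header
    return total  # encoded size in bits
```

Because each block is independent, blocks map naturally onto GPU thread blocks, which is part of how a design like this sustains very high throughput.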
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionIn the landscape of High-Performance Computing (HPC), the quest for efficient and scalable memory solutions remains paramount. The advent of Compute Express Link (CXL) introduces a promising avenue with its potential to function as a Persistent Memory (PMem) solution in the context of disaggregated HPC systems. We present a comprehensive exploration of CXL memory’s viability as a candidate for PMem, supported by physical experiments conducted on cutting-edge multi-NUMA nodes equipped with CXL-attached memory prototypes. Our study not only benchmarks the performance of CXL memory but also illustrates the seamless transition from traditional PMem programming models to CXL, reinforcing its practicality.
To substantiate our claims, we establish a tangible CXL prototype using an FPGA card embodying CXL 1.1/2.0 compliant endpoint designs. Performance evaluations, executed through the STREAM and STREAM-PMem benchmarks, showcase CXL memory’s ability to mirror PMem characteristics in App-Direct and Memory Mode while achieving impressive bandwidth metrics.
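For reference, the STREAM triad kernel used in such evaluations is just `a[i] = b[i] + scalar*c[i]`; this pure-Python sketch shows the kernel and how its byte traffic is counted (real bandwidth measurements require the compiled C STREAM benchmark, as pure Python is far too slow to stress memory):

```python
import time

def stream_triad(n=1_000_000, scalar=3.0):
    """STREAM triad: a[i] = b[i] + scalar * c[i]. Returns the result
    array and an apparent bytes/s figure; the traffic count follows
    STREAM's convention of two array reads plus one array write."""
    b = [1.0] * n
    c = [2.0] * n
    t0 = time.perf_counter()
    a = [b[i] + scalar * c[i] for i in range(n)]
    elapsed = time.perf_counter() - t0
    traffic_bytes = 3 * n * 8  # read b, read c, write a (8-byte doubles)
    return a, traffic_bytes / elapsed
```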
Exhibitor Forum
Architecture and Networks
Data Movement and Memory
Hardware Technologies
TP
XO/EX
DescriptionThe Compute Express Link (CXL) shows a characteristic of composability by nature, which enables the disaggregation of memory resources via CXL.mem transactions. In this forum, we focus on the demonstration of two powerful use cases - memory pooling and sharing - from which users can get benefits that have never been experienced before.
Memory Pooling Case: A key to alleviate a memory stranding issue
The memory utilization of each host server in a compute cluster varies from time to time, forcing system operators to provision each server with DRAM capacity sized for its peak utilization by real-time or interactive applications. Unused memory in one server can never be utilized by another, leaving it stranded. SK hynix’s Niagara, a CXL-based pooled memory solution, addresses this stranded-memory issue. Our FPGA-based pooled memory solution can be connected to four host servers and supports four DDR DIMM channels with a maximum capacity of 1 TB. In our exhibition booth, we will demonstrate how Niagara can alleviate memory stranding with its Elastic Memory feature.
Memory Sharing Case: A key to realize zero-copy distributed computing framework
Conventional distributed computing frameworks such as Spark and Ray suffer from heavy network traffic when distributing data and tasks to the computing nodes in a cluster. To address this issue, we have implemented a memory sharing feature in Niagara so that multiple host servers can directly access the same shared data without transferring it over a network. In this forum, we demonstrate the effectiveness of memory sharing with a real workload in the Ray framework, which is notably used in ChatGPT.
Workshop
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
W
DescriptionThe Distributed Asynchronous Object Storage (DAOS) is an open source scale-out storage system that is designed from the ground up to support Storage Class Memory (SCM) and NVMe storage in user space. Until now, the DAOS storage stack has been based on Intel Optane Persistent Memory (PMem) and the Persistent Memory Development Kit (PMDK). With the discontinuation of Optane PMem, and no persistent CXL.mem devices in the market yet, DAOS continues to support PMem-based servers but now also supports server configurations where its Versioning Object Store (VOS) is held in DRAM. In this case, the VOS data structures are persisted through a synchronous Write-Ahead-Log (WAL) combined with asynchronous checkpointing to NVMe SSDs.
This contribution summarizes the recently accepted "DAOS beyond Persistent Memory" IXPUG-ISC23 workshop paper (M. Hennecke et al., https://doi.org/10.1007/978-3-031-40843-4_26; not live yet, see PDF upload), which describes the new non-PMem DAOS architecture and reports first performance results.
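The WAL-plus-checkpoint pattern described above can be sketched generically. This toy key-value store (a hypothetical illustration, not DAOS code) appends each mutation to a synchronous log before applying it to volatile memory, so the in-DRAM state can be rebuilt after a crash:

```python
import json
import os
import tempfile

class TinyStore:
    """Volatile in-memory state made durable by a write-ahead log.
    A real design (like the one DAOS describes) also checkpoints the
    state asynchronously so the log can be truncated."""

    def __init__(self, log_path):
        self.state = {}
        self.log_path = log_path
        self.log = open(log_path, "a")

    def put(self, key, value):
        # 1. persist the mutation synchronously in the WAL
        self.log.write(json.dumps({"k": key, "v": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. only then apply it to the volatile in-memory state
        self.state[key] = value

    def replay(self):
        # crash recovery: rebuild the state from the log alone
        state = {}
        with open(self.log_path) as f:
            for line in f:
                rec = json.loads(line)
                state[rec["k"]] = rec["v"]
        return state

path = os.path.join(tempfile.mkdtemp(), "wal.log")
s = TinyStore(path)
s.put("a", 1)
s.put("b", 2)
assert s.replay() == {"a": 1, "b": 2}
```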
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionDAOS (https://docs.daos.io/) is an open-source scale-out object store that delivers extremely high performance to the most data-intensive HPC/AI workloads. With growing adoption, DAOS has seen significant community contributions like domain-specific container types, additional hardware support beyond x86_64 (e.g. ARM64), and enabling DAOS in the cloud.
This BoF brings together the DAOS community to discuss, share experiences, and brainstorm on future enhancements of DAOS. Topics include practical experiences with on-prem and cloud deployments, application use cases, and the software roadmap. This session targets end users, middleware developers, system administrators, DAOS core software developers, and vendors of DAOS-based hardware/software/cloud offerings.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
DescriptionIn situ workflows are essential to fully leverage exascale architectures. They can, however, be complex to build, as simulation and data analytics come from two different software ecosystems with their own paradigms. We extend deisa by introducing the concept of external tasks to support describing analytics graphs that span multiple timesteps ahead of time, while improving scalability. This new approach leads to straightforward support for contracts that limit the data transferred to what is actually analyzed in a given execution. We implement this approach using Dask and MPI and evaluate it using an in-transit workflow that uses an unsupervised ML model. We compare our work to plain Dask and to the previous version of deisa. Our work performs better, up to 7× for the simulation and 3× for the analytics compared to deisa, and is up to 18× less costly than plain Dask, all with similar development effort.
Paper
Algorithms
Linear Algebra
Post-Moore Computing
TP
DescriptionSparse matrix-vector multiplication (SpMV) plays a key role in computational science and engineering, graph processing, and machine learning applications. Much SpMV work has been devoted to resolving problems such as random accesses to the vector x and unbalanced load. However, we have found experimentally that computing the inner products still accounts for a large share of the overhead in the SpMV operation, which has been largely ignored in existing work.
In this paper, we propose DASP, a new algorithm using specific dense MMA units for accelerating the compute part of general SpMV. We analyze the row-wise distribution of nonzeros and group the rows into three categories. We then organize them into small blocks of proper sizes to meet the requirement of MMA computation. For the three categories, DASP offers different strategies to complete SpMV. The experimental results on the latest NVIDIA Ampere and Hopper GPUs show that our DASP brought significant speedups over state-of-the-art SpMV work.
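As a minimal illustration of the row-wise analysis (the thresholds and category names here are hypothetical, not DASP's actual parameters), CSR rows can be bucketed by nonzero count so each bucket gets its own MMA-friendly strategy:

```python
def categorize_rows(row_ptr, short_max=4, medium_max=32):
    """Bucket CSR rows by nonzero count. row_ptr is the standard CSR
    row-pointer array; thresholds here are arbitrary placeholders."""
    cats = {"short": [], "medium": [], "long": []}
    for r in range(len(row_ptr) - 1):
        nnz = row_ptr[r + 1] - row_ptr[r]
        if nnz <= short_max:
            cats["short"].append(r)
        elif nnz <= medium_max:
            cats["medium"].append(r)
        else:
            cats["long"].append(r)
    return cats

row_ptr = [0, 2, 6, 6, 40, 45]   # 5 rows with 2, 4, 0, 34, 5 nonzeros
cats = categorize_rows(row_ptr)
assert cats == {"short": [0, 1, 2], "medium": [4], "long": [3]}
```

In DASP's scheme, rows in each category are then packed into small blocks sized to fit the dense MMA units.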
Workshop
Education
State of the Practice
W
DescriptionStudents in community colleges are typically interested in either a quick degree or a skill that allows them to enter a career area while minimizing debt. Attending a four-year university can be a challenge, and acceptance can be competitive.
The job market struggles to hire and retain diverse staff, especially within High Performance Computing (HPC) and government laboratories; higher salaries, potentially better benefits, and opportunities for remote work elsewhere in industry are contributing factors.
To encourage interest in HPC, NERSC partnered with Laney College to create a Data Analytics Program. Once Laney faculty learn how to teach the classes in the certificate program, they fill a need for their students: building data analytics skills toward a career, or continuing toward a four-year degree as transfer students.
We describe how NERSC partners with Laney to create a pipeline toward a data analytics career.
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
DescriptionA critical performance challenge in distributed scientific workflows is coordinating tasks and data flows on distributed resources. To guide these decisions, this paper introduces data flow lifecycle analysis. Workflows are commonly represented using directed acyclic graphs (DAGs). Data flow lifecycles (DFL) enrich task DAGs with data objects and properties that describe data flow and how tasks interact with that flow. Lifecycles enable analysis from several important perspectives: task, data, and data flow. We describe representation, measurement, analysis, visualization, and opportunity identification for DFLs. Our measurement is both distributed and scalable, using space that is constant per data file. We use lifecycles and opportunity analysis to reason about improved task placement and reduced data movement for five scientific workflows with different characteristics. Case studies show improvements of 15×, 1.9×, and 10–30×. Our work is implemented in the DataLife tool.
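A DFL can be pictured as a task DAG annotated with the data objects each task reads and writes. The tiny sketch below (task and file names are invented for illustration) derives the per-object producer/consumer view that placement analysis builds on:

```python
# A task DAG enriched with data objects: which files each task
# reads and writes (hypothetical workflow, not DataLife's format).
tasks = {
    "simulate": {"reads": [],           "writes": ["raw.h5"]},
    "filter":   {"reads": ["raw.h5"],   "writes": ["clean.h5"]},
    "render":   {"reads": ["clean.h5"], "writes": ["frame.png"]},
}

def lifecycle(data_obj):
    """Per-object view of the flow: who produces and who consumes a file."""
    producers = [t for t, io in tasks.items() if data_obj in io["writes"]]
    consumers = [t for t, io in tasks.items() if data_obj in io["reads"]]
    return producers, consumers

assert lifecycle("raw.h5") == (["simulate"], ["filter"])
# Co-locating a producer/consumer pair avoids moving the file at all,
# which is the kind of opportunity the lifecycle analysis exposes.
```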
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionThe Parallel and Distributed Computing community has been interested in integrating PDC content into early CS curriculum to prime the students for more advanced materials and build a workforce able to leverage advanced computing infrastructure. To deploy this strategy at scale, it is important to identify anchor points in early CS courses where we can insert PDC content.
We present an analysis of CS courses that primarily focuses on CS1 and Data Structure courses. We collected data on course content through in-person workshops, where instructors of courses classified their course materials against standard curriculum guidelines.
Using these classifications, we make sense of how computer science is being taught. We highlight different types of CS1 and Data Structures courses, and we reflect on how PDC experts can use that knowledge to identify anchoring points for PDC content while remaining sensitive to the needs of instructors.
Tutorial
Artificial Intelligence/Machine Learning
Performance Optimization
TUT
DescriptionDeep learning is rapidly and fundamentally transforming the way science and industry use data to solve problems. Deep neural network models have been shown to be powerful tools for extracting insights from data across a large number of domains, from large language models (LLMs) to protein folding. As these models grow in complexity to solve increasingly challenging problems with larger and larger datasets, the need for scalable methods and software to train them grows accordingly.
The Deep Learning at Scale tutorial aims to provide attendees with a working knowledge of deep learning on HPC-class systems, including core concepts, scientific applications, performance optimization, tips, and techniques for scaling. We will provide training accounts on some of the world's largest GPU systems, example code, and datasets to allow attendees to experiment hands-on with optimized, scalable distributed training of deep neural network machine learning models from real scientific computing applications.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
DescriptionSupercomputing architectures based on GPU acceleration have greatly improved our scientific computing workflows and applications over the past decade. Quantum computing has recently been proposed as a potential addition to this heterogeneous compute architecture, serving as another node-level accelerator to continue problem scalability in domains such as quantum many-body physics and artificial intelligence. As stand-alone quantum processing units (QPUs) continue to evolve and improve, the applied computational science community is left to wonder - how do we build, program, and deploy large-scale quantum-classical heterogeneous architectures that incorporate both GPUs and QPUs? In this talk, we will demonstrate how NVIDIA is leveraging its current suite of multi-GPU platforms to define and deploy the NVIDIA quantum platform. We will highlight three components specifically that together constitute this quantum platform: (1) the cuQuantum multi-GPU quantum computer simulation libraries, (2) the CUDA Quantum programming model and compilation platform, and (3) the DGX Quantum tightly-coupled quantum-classical compute node. This talk will present the NVIDIA vision for quantum computing and how it fits into existing heterogeneous computing, how we are accelerating quantum algorithms research and development today with NVIDIA GPU platforms, and our vision for GPU-accelerated error correction and fault-tolerance.
Posters
Research Posters
TP
XO/EX
DescriptionThe training of new and existing HPC practitioners is recognized as a priority in the HPC community. Traditionally, HPC system administrator training has been delivered through physical face-to-face workshops, using cloud-based services or remote hardware to provide compute resources that emulate an HPC system. There are several challenges associated with this approach, including class size limits, available compute resources, and disrupting work hours to attend training. By following lessons learned from MOOC methodology in developing HPC training, we have produced a reproducible, accessible, self-paced HPC virtual training lab that emulates a basic three-node compute cluster on a trainee's local machine, without the need for any high-end computing resources or cloud infrastructure.
Our poster will provide an overview of the project, inter alia the delivery platforms, components and features of the lab, lessons learned and future improvements, as well as future plans for extended HPC training modules following this delivery format.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionRules-based workflow scheduling is a recently developed method for constructing an analysis structure in a far more dynamic manner than traditional graph-based systems. However, rules-based workflows are still in their relative infancy and lack the breadth of features available in traditional scientific workflow systems. We address some of these missing features by introducing the new meow_base library for generic construction of rules-based systems that meet the requirements of a scientific workflow management system. We also present two example workflows, showing how rules-based systems can better enable analysis loops and human-in-the-loop interactions than more traditional workflow systems.
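The pattern/recipe flavor of a rules-based workflow can be sketched in a few lines of Python (an illustrative toy, not meow_base's actual API): a rule pairs a file pattern with a recipe, and recipes fire whenever a matching event arrives, rather than following a fixed DAG:

```python
import fnmatch

rules = []

def rule(pattern):
    """Register a recipe to fire whenever an event matches `pattern`."""
    def register(recipe):
        rules.append((pattern, recipe))
        return recipe
    return register

@rule("*.csv")
def analyze(path):
    return f"analyzed {path}"

def trigger(path):
    """Deliver an event (e.g. a new file) to every matching rule."""
    return [recipe(path) for pat, recipe in rules if fnmatch.fnmatch(path, pat)]

assert trigger("run1.csv") == ["analyzed run1.csv"]
assert trigger("notes.txt") == []
```

Because rules re-fire on every matching event, the same recipe can run repeatedly as files reappear, which is what makes analysis loops and human-in-the-loop steps natural in this model.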
Workshop
Data Movement and Memory
State of the Practice
W
DescriptionAccessing HPC storage remotely can be cumbersome and involve out-of-band tools (e.g., NFS, SCP, or SSHFS on Windows, or SMB on Linux). We have begun to provide users access to our HPC storage using a tool that enables a familiar interface and behavior - like OneDrive or Dropbox - and unifies access to university-wide storage pools. We will walk through the software, configuration, lessons learned, and next steps in offering this service to our researchers. We are also exploring the use of built-in file tagging and other internal automation to provide a sensitive-data workflow for HIPAA-aligned data security. Efficacy and lessons learned from this approach will be discussed.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionIn February and April 2023, live, at-scale data processing demonstrations were conducted between the Advanced Photon Source (APS), a synchrotron light source, and the Argonne Leadership Computing Facility (ALCF). These tests were run as part of a novel beamline technique: coded-aperture Laue micro-diffraction. This technique requires a significant amount of compute to decode aperture patterns embedded in the detector stream. An autonomous system was able to send data to the ALCF during an experiment, utilize 50 nodes of the Polaris supercomputer to process 6-12 hour scans, and return the data to the APS within 12-15 minutes of acquisition. With scan points arriving every 72 seconds, the system kept up with the beamline, potentially enabling in-experiment analysis. The data processing system utilizes Globus infrastructure and an on-demand queue to dynamically acquire nodes on Polaris. The underlying reconstruction algorithms were parallelized via MPI and accelerated with custom CUDA kernels.
Workshop
Applications
Distributed Computing
Large Scale Systems
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWith the largest datasets to date and a diverse set of discoveries to be made, the current generation of scientific analyses are well poised to utilize artificial intelligence (AI) and machine learning (ML) on high performance computing (HPC) resources. Like never before, these workflows can be written in one portable language, Python, which thanks to highly-optimized ML libraries achieves excellent cross-platform performance with little to no intervention by the user. In this demonstration, we explore the performance of several scientific AI/ML applications across leading HPC resources and highlight best practices for portable performance.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
DescriptionSoft errors are prevalent in modern High-Performance Computing (HPC) systems, resulting in silent data corruptions (SDCs), compromising system reliability. Instruction duplication is a widely used software-based protection technique against SDCs. Existing instruction duplication techniques are mostly implemented at LLVM level and may suffer from low SDC coverage at assembly level. In this paper, we evaluate instruction duplication at both LLVM and assembly levels. Our study shows that existing instruction duplication techniques have protection deficiency at assembly level and are usually over-optimistic in the protection. We investigate the root-causes of the protection deficiency and propose a mitigation technique, Flowery, to solve the problem. Our evaluation shows that Flowery can effectively protect programs from SDCs evaluated at assembly level.
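Conceptually, instruction duplication executes each protected computation twice and compares the results; a mismatch exposes a corruption that would otherwise be silent. A language-level sketch of the idea (real duplication is inserted by the compiler at the IR or assembly level, not written by hand):

```python
def duplicated(op, *args):
    """Run a computation twice and compare, mirroring how duplicated
    instruction streams detect silent data corruptions (SDCs)."""
    r1 = op(*args)
    r2 = op(*args)          # the duplicated "instruction"
    if r1 != r2:            # a mismatch signals a soft-error-induced SDC
        raise RuntimeError("SDC detected")
    return r1

assert duplicated(lambda a, b: a + b, 2, 3) == 5
```

The paper's point is that duplication proven sound at the LLVM level can still leave assembly-level instructions unprotected, which this kind of source-level picture cannot capture.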
Exhibits
Flash Session
TP
XO/EX
DescriptionLambda is introducing one of the first NVIDIA GH200 GPU clusters in its cloud, featuring NVIDIA’s new ARM-based Grace CPU for enhanced efficiency and coherent NVIDIA NVLink-C2C interconnect that provides 900 GB/s of bandwidth between the Grace CPU and Hopper GPU. The presentation will discuss Lambda's cluster design compared to x86-based GPU clusters.
Workshop
Applications
Architecture and Networks
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionThe HPX asynchronous many-task runtime system has been using TCP and MPI as its communication backends (parcelports). We developed a new HPX parcelport using a new communication library, the Lightweight Communication Interface (LCI) that was designed to better match the needs of systems such as HPX. We evaluate its performance with various microbenchmarks and a real-world astrophysics application, Octo-Tiger. Compared to the best configuration of the MPI parcelport, microbenchmarks show that the new LCI parcelport improves the message rate by up to 30x and decreases latencies by up to 5x. It also reduces the total execution time of Octo-Tiger by up to 1.175x compared to the best configuration of the MPI parcelport and up to 13.6x compared to the same configuration of the MPI parcelport. We discuss the performance impacts of different design choices.
Doctoral Showcase
Posters
Quantum Computing
TP
DescriptionQuantum computing promises to solve problems beyond the reach of today’s machines, but it requires efficient and reliable software tools to realize its potential. This poster gives an overview of various contributions towards design automation methods and software for quantum computing that leverage existing knowledge and expertise in classical circuit and system design. It focuses on three major tasks: simulation, compilation, and verification of quantum circuits. The proposed solutions demonstrate significant improvements in efficiency, scalability, and reliability for all tasks and constitute the backbone of the Munich Quantum Toolkit (MQT), a collection of open-source tools for quantum computing. The respective solutions advance the state of the art in quantum computing and illustrate the benefits of design automation methods for this emerging field.
Paper
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
TP
DescriptionMulti-level erasure coding (MLEC) has seen large deployments in the field, but there is no in-depth study of design considerations for MLEC at scale. In this paper, we provide comprehensive design considerations and analysis of MLEC at scale. We introduce the design space of MLEC in multiple dimensions, including various code parameter selections, chunk placement schemes, and various repair methods. We quantify their performance and durability, and show which MLEC schemes and repair methods can provide the best tolerance against independent/correlated failures and reduce repair network traffic by orders of magnitude. To achieve this, we use various evaluation strategies including simulation, splitting, dynamic programming, and mathematical modeling. We also compare the performance and durability of MLEC with other EC schemes such as SLEC and LRC and show that MLEC can provide high durability with higher encoding throughput and less repair network traffic over both SLEC and LRC.
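The two-level idea can be illustrated with toy XOR parity codes (real MLEC deployments use Reed-Solomon-style codes with configurable parameters): each node protects its own drives locally, so a single drive failure is repaired without any network traffic, while a second code across nodes guards against whole-node loss:

```python
def xor_parity(chunks):
    """Single-erasure XOR code: any one missing chunk can be rebuilt
    by XOR-ing the survivors with the parity."""
    p = 0
    for c in chunks:
        p ^= c
    return p

# "local" level: parity across each node's own drives
node_drives = [[3, 5, 7], [2, 4, 6], [1, 8, 9]]
local_parity = [xor_parity(d) for d in node_drives]
# "network" level: parity across the nodes themselves
network_parity = xor_parity(local_parity)

# a single lost drive is repaired locally, with no network traffic:
lost = node_drives[0][1]
repaired = xor_parity([node_drives[0][0], node_drives[0][2], local_parity[0]])
assert repaired == lost
```

The paper's design space covers exactly these choices at scale: which code parameters to use at each level, how to place chunks, and when a repair can stay local versus crossing the network.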
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionThis paper presents preliminary work on a new in situ tool design that allows flexibility in the type of in situ processing. The design goal is to be able to specify at run time whether to run in-line in situ or in transit, and to allow switching dynamically between in-line in situ processing and in transit processing. By allowing the run to switch dynamically between these two modes, as dictated by the current compute requirements of both the simulation and the in situ tool, the machine resources can be utilized most efficiently. The design uses a framework, or inversion-of-control, approach in which the allocation of MPI processes to the simulation and the in situ tool is controlled by the framework. Initial work demonstrates the ability to switch from an in transit paradigm to an in-line in situ paradigm with two separate simulation codes, using ParaView Catalyst as the in situ engine.
Birds of a Feather
Education
TP
XO/EX
DescriptionHPC Outreach is essential to enthusing young minds about computational science, informing the public and growing the HPC community, and yet many institutions do not have sufficient funding or staff effort to support the outreach activities. Effective outreach requires well designed activities that are suitable to the target audience and event type. Different activities are needed for different age groups, scientific backgrounds or venues. Each activity also has its own lifecycle and cannot be reused indefinitely. The goal of this session is to design several new activities that the community would be able to develop over the coming year.
Posters
Research Posters
TP
XO/EX
DescriptionIn cancer biology, large amounts of high-dimensional data (genomic, transcriptomic, proteomic, phenotypic, etc.) are required for any computationally relevant work. The problem is further complicated by the sheer size of the human genome, roughly three billion base pairs long. Therefore, computation is time-consuming and data-intensive. To solve this problem for human colorectal cancer, we are implementing a machine learning engine based on inverse reinforcement learning that includes several different kinds of neural networks to perform data preparation, training, and prediction. Our work aims to reconstruct the progression of tumor development in a sample and predict the next steps of its evolution, to aid in diagnosis and treatment. This poster will be presented as a work-in-progress methodology.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Posters
Research Posters
TP
XO/EX
DescriptionDensity functional theory based codes are significant users of HPC resources, often ranking among the top users of core hours on these systems. However, despite their popularity and resource usage, they are not very well optimised for current HPC architectures - and are not easily adapted. We present DFToy, a new proxy-app for DFT codes that is accessible, easy to understand and FOSS. DFToy's accessibility makes it an excellent platform for benchmarking, experimentation and development - allowing developers to research novel algorithms for DFT codes.
We will show DFToy's use and capabilities in its current state, compare its behavior to a state-of-the-art DFT code, and discuss where we will take the code going forward - including the development of a self-tuning parallel model.
Paper
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionDynamic graphs have grown in importance for numerous real-world applications. To accommodate this, graph frameworks, particularly their internal data structures, must support both persistent graph updates and rapid graph analysis simultaneously. Emerging persistent memory technologies, such as Optane DCPMM, offer a promising choice to simplify the designs by providing data persistence, low latency, and high IOPS together. We propose DGAP, a framework for efficient dynamic graph analysis on persistent memory. DGAP utilizes mutable Compressed Sparse Row (CSR) with new designs for persistent memory to construct the framework. Specifically, DGAP introduces a per-section edge log to reduce write amplification; a per-thread undo log to enable high-performance, crash-consistent rebalancing operations; and a data placement schema to minimize in-place updates. Our extensive evaluation results demonstrate that DGAP can achieve up to 3.2x better graph update performance and up to 3.77x better graph analysis performance compared to state-of-the-art dynamic graph frameworks for persistent memory.
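The core of a mutable CSR can be sketched as per-vertex edge sections that accept in-place inserts (a toy model only; DGAP's actual design adds per-section edge logs, per-thread undo logs, and a data placement schema for crash consistency on persistent memory):

```python
import bisect

class MutableCSR:
    """Per-vertex edge sections that admit in-place inserts, unlike a
    packed CSR whose arrays must be rebuilt on every update."""

    def __init__(self, num_vertices):
        self.sections = [[] for _ in range(num_vertices)]

    def add_edge(self, u, v):
        # keep each neighbor list sorted so analysis scans stay fast
        bisect.insort(self.sections[u], v)

    def neighbors(self, u):
        return self.sections[u]

g = MutableCSR(3)
for u, v in [(0, 2), (0, 1), (1, 2)]:
    g.add_edge(u, v)
assert g.neighbors(0) == [1, 2]
```

The tension the paper addresses is exactly this one: supporting fast inserts like the sketch above while keeping the layout compact, persistent, and crash-consistent on persistent memory.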
Exhibitor Forum
Artificial Intelligence/Machine Learning
Fault Handling and Tolerance
Large Scale Systems
Programming Frameworks and System Software
TP
XO/EX
DescriptionDigital Twins have emerged as one of the hot new modeling concepts in HPC as we enter the post-exascale era. Originally conceived as a modeling tool for manufacturing and Product Life Cycle Management, Digital Twins are evolving in HPC with the convergence of simulation, machine learning, and live data. The introduction of machine learning into HPC workflows has been a critical component in the evolution of the Digital Twin for science, where for the first time models with fidelity at the atomic level can scale to the full scope of a physical object or system.
The talk will include examples from physics, biology, and climate science, and describe how the NVIDIA platform can address the requirements for first-principles simulation using the HPC SDK, the RAPIDS SDK, and Modulus to develop robust machine learning models; Holoscan, our toolset for data acquisition from live data sources; and Omniverse, our SDK to aggregate the workflow components and visualize them.
Workshop
W
DescriptionDigital twins are physically accurate virtual representations of real-world systems, providing beneficial information in actionable time by combining sensor data with surrogate models. Recent shifts in HPC combining simulation, AI, and edge computing have not only given us the opportunity to apply digital twins in science, but have also magnified their impact on global public policy and institutions, in domains including climate change, renewable energy, Industry 4.0, and global healthcare. Increasingly accurate simulations become virtual sources of truth, capable of multi-physics synchrony in real-world time, at scales ranging from the subatomic to the interstellar. Digital twins in HPC are crucial to enabling breakthroughs in computational biomedicine, nuclear fusion, and building automation. This workshop will bring like minds together to identify challenges and opportunities in establishing digital twins as a common HPC practice and will highlight key principles for their use in high-performance computing.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionLarge-scale HPC systems demand extensive disk-based storage for data generated by HPC applications, necessitating scalable reliability, availability, and failure management. Failure data extracted from HPC storage offers valuable insights for preventing and managing failures, spanning understanding storage robustness, guiding system design and deployment, and creating durable data protection schemes. This paper introduces a failure dataset from the file system of OLCF's Summit supercomputer, Alpine, encompassing 4000+ events over 2.75 years from 32000+ disks. Before the analysis, we delve into Alpine's components and introduce IBM Spectrum Scale technology, then assess the collected data for failure distribution and burst correlations. We infer that proximity to enclosure fan modules heightens disk failure rates. Also, burst-failure analysis highlights that one-third of failures occur in bursts, with 90% not spatially correlated, impacting multiple racks.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Workshop
Distributed Computing
Security
W
DescriptionHPC systems are designed to meet peak performance and scalability goals, but today's security guidance and tools are designed for enterprise infosec. This makes it quite difficult to secure HPC resources without impacting performance goals. In this talk, we will examine the key security differences between enterprise systems and the common features of HPC environments. We will also discuss a new NIST publication on HPC security (currently in draft) and touch on how secure 'open science' research really needs to be. From there, we will explore emerging trends to keep track of, such as scientific workflows that span multiple security domains, and whether trusted computing and zero-trust models can be adapted to HPC. Finally, we will demonstrate one example of a zero-day vulnerability found on a previous #1 Top500 system (disclosed and patched in 2018) to help motivate broader action to put an 'S' in HPC.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionScheduling tasks close to their associated data is crucial in distributed systems to minimize network traffic and latency. Some Big Data frameworks, like Apache Spark, employ locality functions and job-allocation algorithms to minimize network traffic and execution times. However, these frameworks rely on centralized mechanisms, where the master node determines data locality by allocating tasks to available workers with minimal data transfer time, ignoring variances in worker configurations and availability. To address these limitations, we propose a decentralized approach to locality-driven scheduling that grants workers autonomy in the job-allocation process while factoring in workers' configurations, such as differences in network and CPU speed. Our approach is developed and evaluated on Crossflow, a distributed stream-processing platform with data-aware independent worker nodes. Preliminary evaluation experiments indicate that our approach can yield up to 3.57x faster execution times when compared to the baseline centralized approach, where the master controls data locality.
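The decentralized idea above can be sketched as a simple bidding scheme. The Python fragment below is an illustration under stated assumptions (the field names and the linear cost model are invented, not Crossflow's API): each worker autonomously estimates its own completion time from its cache contents, network speed, and CPU speed, and the task goes to the lowest bid.

```python
# Hypothetical sketch of decentralized locality-driven scheduling:
# each worker bids its estimated completion time; the lowest bid wins.

def estimate_cost(worker, task):
    # Data already cached locally transfers for free.
    remote_bytes = 0 if task["data_id"] in worker["cached"] else task["bytes"]
    transfer_time = remote_bytes / worker["net_bps"]
    compute_time = task["work_units"] / worker["cpu_speed"]
    return transfer_time + compute_time

def assign(task, workers):
    # Each worker computes its bid autonomously; picking the
    # minimum needs no central knowledge of worker configurations.
    bids = {w["id"]: estimate_cost(w, task) for w in workers}
    return min(bids, key=bids.get)
```

Note that a worker with the data cached but a slower CPU can still win over a faster worker that must fetch the data, which is exactly the variance a purely locality-based central scheduler ignores.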
Workshop
Quantum Computing
Software Engineering
W
DescriptionWe consider the automated compilation of quantum circuits to heterogeneous networks of quantum computing modules, sparsely connected via Bell states. A circuit too large to be implemented on any one module alone requires the insertion of operations, typically gate teleportation or qubit teleportation, consuming Bell states. Here we focus on the use of simultaneous teleportation of CZ gates. Inter-module operations constitute a computational bottleneck and are likely to add more noise to the computation than intra-module operations. In this work we introduce techniques for distributing quantum circuits in a way that minimizes the number of Bell states required to do so. We present pytket-dqc, a Python library containing implementations of our techniques, which simplifies and automates quantum circuit distribution and makes it accessible.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionMemory-based Temporal Graph Neural Networks (TGNNs) are powerful tools in dynamic graph representation learning and have demonstrated superior performance in many real-world applications. However, their node memory favors smaller batch sizes to capture more dependencies in graph events and needs to be maintained synchronously across all trainers. As a result, existing frameworks suffer from accuracy loss when scaling to multiple GPUs. Even worse, the tremendous overhead of synchronizing the node memory makes it impractical to deploy to distributed GPU clusters.
In this work, we propose DistTGL, an efficient and scalable solution to train memory-based TGNNs on distributed GPU clusters.
DistTGL has three improvements over existing solutions: an enhanced TGNN model, a novel training algorithm, and an optimized system. In experiments, DistTGL achieves near-linear convergence speedup, outperforming the state-of-the-art single-machine method by 14.5% in accuracy and 10.17x in training throughput.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionFrequency scaling is a well-known energy-saving power-management technique that modulates the device frequency to explore the trade-off between energy and performance. Higher energy savings require a frequency-tuning phase, since different applications can have different energy and time behavior depending on the frequency setting. Machine learning models can be used to predict the optimal frequency configuration based on static or dynamic features extracted from the target application. While general-purpose energy models can be very accurate on a wide range of applications, their accuracy can be limited by the specific input of the target application. We present an energy characterization that spans the fields of drug discovery and magnetohydrodynamics, using two real-world applications as case studies: LiGen and Cronos. To overcome the limitations of general-purpose approaches, we define two domain-specific energy models, which enhance the general-purpose energy models by leveraging the target application's input parameters to increase accuracy.
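Why the optimal frequency depends on the application can be seen with a minimal analytic sketch. The model below is illustrative only (the cubic dynamic-power law and all coefficients are assumptions, not the paper's models): the compute-bound part of execution time shrinks with frequency while the memory-bound part does not, so the energy-minimizing frequency shifts with the compute/memory mix.

```python
# Minimal analytic sketch of the energy/performance trade-off behind
# frequency tuning (coefficients are illustrative, not measured).

def exec_time(f, t_cpu, t_mem, f_max=2.0e9):
    # Compute-bound work scales with frequency; memory-bound work does not.
    return t_cpu * (f_max / f) + t_mem

def power(f, p_static=20.0, c=5e-27):
    # Assumed model: dynamic power grows roughly cubically with frequency.
    return p_static + c * f ** 3

def best_frequency(freqs, t_cpu, t_mem):
    # Pick the candidate frequency minimizing energy = power * time.
    return min(freqs, key=lambda f: power(f) * exec_time(f, t_cpu, t_mem))
```

Under these assumed coefficients, a memory-bound workload prefers the lowest frequency while a compute-bound one prefers a mid-range setting, which is the behavior a per-application (or per-input) model must capture.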
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionProgramming heterogeneous computing systems is a daunting task that is becoming even more challenging with the advent of emerging, non-von Neumann computer architectures. Innovation in programming abstractions and compilers is thus badly needed to cope with the current golden age of computer architecture. This talk discusses domain-specific abstractions and languages as a promising avenue to hide system complexity from non-expert programmers while passing richer information to compilers. The high-level semantics of DSLs improve productivity while enabling coarser-grained optimization and safer code generation. Examples are provided from the domains of big data, physics simulations, and machine learning, targeting modern reconfigurable hardware, emerging memory technologies, and emerging in-memory computing.
Paper
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
TP
DescriptionMaximizing performance under a power budget is essential for HPC systems and has inspired the development of many power-management frameworks. These can be broadly characterized into two groups: model-based and stateless. Model-based frameworks achieve good performance under a power budget but are highly dependent on the quality of the model and the data used to train it. Stateless frameworks are more robust and require no training, but generally deliver lower performance. In this paper, we propose a new framework that does not require a model, but does track state in the form of recent power dynamics. We implement this idea and test it on a public cloud running both Spark and HPC jobs. We find that when total power demand is low, our framework achieves performance equivalent to prior work, but when power demand is high it achieves a mean 8% performance improvement (with no reliance on a learned model).
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionDPUs as network co-processors are an emerging trend in our community. These have generally been used as domain-specific accelerators transparent to application developers. In the HPC field, DPUs have been used as MPI accelerators, but also to offload some tasks from the general-purpose processor. However, the latter requires application developers to deploy MPI ranks on the DPUs, as if they were remote (weak) compute nodes, considerably hindering programmability. The wide adoption of OpenMP as the threading model in the HPC arena, along with that of GPU accelerators, is making OpenMP offloading to GPUs a common pattern for HPC applications. In this paper we introduce, for the first time in the literature, OpenMP offloading support for network co-processor DPUs. We present our design in LLVM to support OpenMP standard offloading semantics and discuss the programming-productivity advantages with respect to the existing MPI-based programming model.
Workshop
Applications
Distributed Computing
Large Scale Systems
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWe present a novel method for obtaining proxy access to remote instances of the Dragon distributed runtime. Dragon is a composable distributed runtime for managing dynamic processes, high-performance communication objects, memory, and data at scale, based on an abstraction of a distributed system. Proxy access allows a client Dragon runtime to issue any command that could be run directly on a remote Dragon runtime, with the command executing on the remote runtime. Commands to be run on a remote Dragon runtime are mediated by a Python object that acts as a proxy for the remote runtime, which we call a proxy runtime. These proxy runtimes, combined with the ability to start and tear down remote Dragon runtimes both programmatically and via the command-line interface, make a number of challenging workflows simple to program.
Exhibits
Flash Session
TP
XO/EX
DescriptionWe will discuss the growth of AI, especially LLMs and generative AI, and the supercomputing that makes it possible. Azure HPC provides purpose-built supercomputing infrastructure to support the training and tuning of foundational AI models, plus HPC infrastructure to support inferencing as consumers in all industries use AI models to assist their everyday productivity.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionInteractive parallel programs have varying responsiveness requirements for tasks of differing urgency, which has been addressed by using thread priorities to determine tasks' allocation of processor time. Previous priority-based language models limit entire threads to a single priority. Given an approaching real-time deadline, tasks are unable to shift to a higher priority to match the changing requirements. We design a type system that enforces thread priorities and allows dynamic prioritization, treating priorities as first-class values to reduce code complexity. We create a dependency-graph-based cost model for our system and define strong well-formedness to exclude unwanted priority inversions. We then prove that programs under our type system produce strongly well-formed graphs.
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionDisaggregated memory intends to break the rigid boundaries between node memory hierarchies by providing memory as a pooled resource. The resource manager allocates the system's memory at job submission time, but it is hard for users to know a job's precise peak memory footprint, and prior work has shown users have an incentive to overestimate. This leads to significant overallocation, and most of the physical memory in the system is wasted. We present a way to reclaim much of this overallocated memory. We extend the Slurm job scheduler to dynamically reallocate memory according to the job's current memory footprint. We enhance an existing Slurm simulator to model this situation and combine publicly available traces to model an HPC system of up to 1490 nodes. We show that the dynamic memory provisioning approach increases throughput per dollar by up to 38%, compared to a system with static allocation of disaggregated memory.
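The core of such a dynamic-provisioning policy can be sketched in a few lines. The fragment below is a hypothetical illustration, not the authors' Slurm extension: it periodically shrinks each job's allocation toward its observed footprint plus an assumed 10% safety headroom, returning the reclaimed memory to the pool.

```python
# Illustrative sketch of dynamic memory provisioning: shrink a job's
# allocation toward its observed footprint plus a safety headroom.
# (The policy and the 10% headroom are assumptions for illustration.)

def reclaim(jobs, headroom=0.10):
    """Shrink overallocated jobs in place; return total bytes reclaimed."""
    freed = 0
    for job in jobs:
        # Never shrink below the current footprint plus headroom.
        target = int(job["footprint"] * (1 + headroom))
        if job["allocated"] > target:
            freed += job["allocated"] - target
            job["allocated"] = target
    return freed
```

A real scheduler would also need to grow allocations again when a footprint approaches its shrunken limit; the interesting trade-off is how much headroom to keep versus how much pooled memory to recover.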
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionSoft errors occur frequently on large computing platforms due to the increasing scale and complexity of HPC systems. Various resilience techniques have been proposed to protect scientific applications from soft errors. Among them, system-level replication often involves duplicating or triplicating the entire computation, resulting in high resilience overhead. This paper proposes dynamic selective protection for sparse iterative solvers, in particular for the Preconditioned Conjugate Gradient (PCG) solver, at the system level to reduce the resilience overhead. We leverage machine learning (ML) to predict the impact of soft errors that strike different elements of a key computation at different iterations of the solver. Based on the result of the prediction, we design a dynamic strategy to selectively protect those elements that result in a large performance degradation if struck by soft errors. An experimental evaluation demonstrates that our dynamic protection strategy reduces the resilience overhead compared to existing algorithms.
Posters
Research Posters
TP
XO/EX
DescriptionIn the realm of natural language processing, Large Language Models (LLMs) have emerged as powerful tools for tasks such as language translation, text generation, and sentiment analysis. However, the immense parameter size and complexity of LLMs present significant challenges. This work delves into the exploration and characterization of high-performance interconnects in the distributed training of various LLMs. Our findings reveal that high-performance network protocols, notably RDMA, significantly outperform other protocols like IPoIB and TCP/IP in training performance, offering improvements of 2.51x and 4.79x, respectively. Additionally, we observe that LLMs with more parameters tend to demand higher interconnect utilization. Despite these findings, our study suggests potential for further optimization of overall interconnect utilization. This research contributes to a deeper understanding of the performance characteristics of LLMs over high-speed interconnects, paving the way for more efficient training methodologies.
Workshop
W
DescriptionContainers have become an increasingly important part of the research software ecosystem for enabling portability and reproducibility, especially for cloud-native workflows. However, challenges remain for facilitators working to bring the success of containers in the cloud to their local HPC clusters, due to strict security and performance requirements. Established container runtimes such as Singularity sacrifice features in order to meet these requirements, leaving some researchers' needs unmet. Charliecloud, a newer container runtime, offers a more complete container experience for HPC by being fully unprivileged at all stages of container development, testing, and production. TAMU HPRC, an early adopter of Charliecloud, reports on its experiences in supporting Charliecloud for HPC to provide guidance and inspiration for other institutions considering supporting Charliecloud. Lessons learned about installation, usage, best practices, applications, and user training are described, as well as recommendations for further development.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionDistributed synchronized GPU training is commonly used for deep learning. The resource constraint of using a fixed number of GPUs makes large-scale training jobs suffer from long queuing times for resource allocation and lowers cluster utilization. Adapting to resource elasticity can alleviate this, but often introduces inconsistent model accuracy, due to the lack of a capability to decouple the model-training procedure from resource allocation. We propose EasyScale, an elastic training system that achieves consistent model accuracy under resource elasticity for both homogeneous and heterogeneous GPUs. EasyScale preserves data-parallel training behaviors strictly, traces the consistency-relevant factors carefully, and utilizes deep learning characteristics for its EasyScaleThread abstraction and fast context switching. To utilize heterogeneous clusters, EasyScale dynamically assigns workers based on intra-/inter-job schedulers, minimizing load imbalance and maximizing aggregate job throughput. Deployed in an online serving cluster, EasyScale powers the training jobs to utilize idle GPUs opportunistically, improving overall cluster utilization by 62.1%.
Workshop
W
DescriptionHighly optimized systems face the challenge of meeting the computational demands of domain-specific simulations and workflow applications. These range from low- to high-level implementations, differ widely in optimization and requirements, and increasingly affect the centers' energy efficiency. With today's heterogeneous clusters, containerization allows arbitrary applications to bypass architectural differences and focus mainly on core functions instead of deployment issues.
Our framework determines general performance characteristics of containerized HPC applications with unknown behavior, and satisfies computational, memory, and interconnect requirements independent of the container technology. Based on kernel-level measurements with eBPF, it allows developers and administrators to evaluate runtime parameters of black-box applications without inspecting the inside of any container. We derived first algorithms for the Container-Fingerprint, a quantified and comparable runtime characteristic that enables optimized mapping to target systems.
For evaluation, we investigate benchmark applications on different architectures typical of supercomputing centers. Measurements indicate that the derived fingerprints are suitable for distinguishing the performance of containers and systems, allowing an optimized allocation of HPC containers. By applying the scheme, we work towards a twofold improvement: increased efficiency in system usage and energy consumption, and deployment optimization of containers through streamlined, requirements-oriented allocations balancing resource usage and time-to-solution.
Early Career Program
Inclusivity
TP
DescriptionThis is a succession of conversations between a mentor and a small group of mentees. Mentees rotate among the tables, giving them an opportunity to talk to several mentors, each for a specified amount of time, asking questions and trying to establish a connection.
Workshop
Education
State of the Practice
W
DescriptionEduHPC23 – Welcome and Introductory Message
Workshop
Education
State of the Practice
W
DescriptionThe EduHPC workshop brings together stakeholders from industry (developers, hardware and software vendors), national labs, and academia in the context of SC, to hear the pedagogical challenges others are facing, share approaches to meeting such challenges, and generally exchange ideas related to high-performance computing, parallel and distributed computing, distributed data science, scalable AI and IoT/Edge computing in undergraduate and graduate education. In addition to paper presentations, this workshop will feature invited keynotes, panels (e.g., reproducibility in HPC education and training, inclusive pedagogy and efforts in broadening participation in HPC), special sessions such as “Peachy Assignments,” and invited talks on opportunities for collaboration, resource sharing, educator training, internships, and other means of increasing cross-fertilization between industry, government, and academia.
Workshop
Education
State of the Practice
W
DescriptionThe first generation of exascale computing systems is coming online along with new application capabilities and system software. At the same time, demands for high performance computing continue to grow for more powerful simulations, adoption of machine learning methods, and huge data analysis problems arising from new instruments and increasingly ubiquitous devices. In its broadest sense, computational science research is expanding beyond physical and life sciences into social sciences, public policy, and even the humanities.
Concurrent with these trends, chip technology is facing scaling limits, making it increasingly difficult to meet these new demands. Disruptions in the computing marketplace, which include supply chain limitations, a shrinking set of system integrators, and the growing influence of cloud providers are changing underlying assumptions about how to acquire and deploy future supercomputers. At the same time, there are discussions around the role of AI/ML and quantum computing.
How do we educate students for a post-exascale world? A finite set of computational motifs represents much of the parallel computing workload in modeling and simulation. Should the HPC community focus on those, or should the set be expanded to include data analytics and machine learning approaches? Finally, what are the workforce needs for the future of high-end computing?
Workshop
Education
State of the Practice
W
DescriptionPanel Q&A for EduHPC23 paper session II
Workshop
Education
State of the Practice
W
DescriptionPanel Q&A for EduHPC23 Lightning Talks.
Workshop
Education
State of the Practice
W
DescriptionPanel Q&A for EduHPC23 Peachy Assignment Session
Workshop
Education
State of the Practice
W
DescriptionPanel Q&A for EduHPC23 paper session I
Posters
Research Posters
TP
XO/EX
DescriptionThe energy consumption of HPC data centers is a decisive factor in the procurement and operation of these systems. EE-HPC achieves more efficient energy use of HPC systems through targeted, job-specific control and optimization of the hardware. The project started at the end of 2022 and builds on the existing stable software components ClusterCockpit and LIKWID. It provides a simple, robust, secure, and scalable monitoring and energy-control solution for hybrid HPC clusters. The job-specific performance and monitoring framework ClusterCockpit is already used in production at several large HPC computing centers. The energy manager and node controller are implemented in a Python-based prototype and will be ported to Golang and integrated into ClusterCockpit. The framework will be evaluated with a set of relevant HPC applications from molecular dynamics, engineering, and climate research.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionProcess malleability can be defined as the ability of a distributed MPI parallel job to change its number of processes on the fly, reallocating the compute resources originally assigned to the job without stopping its execution and without storing application data to disk. MPI malleability consists of four stages: resource reallocation, process management, data redistribution, and execution resumption. Among them, data redistribution is the most time-consuming and determines the reconfiguration time.
In this work, we compare different implementations of this stage using point-to-point and collective MPI operations, and discuss the impact of overlapping computation and communication. We then combine these strategies with different methods to expand/shrink jobs, using a synthetic application to emulate MPI-based codes and their malleable counterparts, in order to evaluate the effect of different malleability methods on parallel distributed applications. The results show that the use of asynchronous techniques speeds up execution by factors of 1.14 and 1.21, depending on the network used.
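Whatever MPI operations carry the data, the redistribution stage reduces to computing which index ranges of a distributed array must move between the old and new process counts. The pure-Python sketch below illustrates that bookkeeping for a standard block distribution (it is an illustration, not the paper's implementation; the actual transfers would be issued as MPI point-to-point or collective calls from this plan).

```python
# Sketch of the data-redistribution bookkeeping when a block-distributed
# array is remapped from old_np to new_np processes (index arithmetic only).

def block_range(rank, nprocs, n):
    # Standard block distribution: the first (n % nprocs) ranks get one extra.
    base, extra = divmod(n, nprocs)
    lo = rank * base + min(rank, extra)
    return lo, lo + base + (1 if rank < extra else 0)

def transfer_plan(old_np, new_np, n):
    """List of (old_rank, new_rank, lo, hi) chunks that must move.

    Chunks whose old and new rank ids coincide are assumed to stay in
    place and are omitted from the plan."""
    plan = []
    for old in range(old_np):
        olo, ohi = block_range(old, old_np, n)
        for new in range(new_np):
            nlo, nhi = block_range(new, new_np, n)
            lo, hi = max(olo, nlo), min(ohi, nhi)
            if lo < hi and old != new:
                plan.append((old, new, lo, hi))
    return plan
```

For example, expanding a 10-element array from 2 to 4 processes moves three chunks; the per-rank chunk lists are exactly what an implementation would post as sends and receives (eagerly, or asynchronously overlapped with computation).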
Tutorial
Accelerators
Exascale
Heterogeneous Computing
Performance Optimization
TUT
DescriptionOver the past decade, GPUs became ubiquitous in HPC installations around the world, delivering the majority of performance of some of the largest supercomputers (e.g. Summit, Sierra, JUWELS Booster). This trend continues in the recently deployed and upcoming Pre-Exascale and Exascale systems (JUPITER, LUMI, Leonardo; El Capitan, Frontier, Perlmutter): GPUs are chosen as the core computing devices to enter this next era of HPC. To take advantage of future GPU-accelerated systems with tens of thousands of devices, application developers need to have the proper skills and tools to understand, manage, and optimize distributed GPU applications.
In this tutorial, participants will learn techniques to efficiently program large-scale multi-GPU systems. While programming multiple GPUs with MPI is explained in detail, also advanced tuning techniques and complementing programming models like NCCL and NVSHMEM are presented. Tools for analysis are shown and used to motivate and implement performance optimizations. The tutorial teaches fundamental concepts that apply to GPU-accelerated systems in general, taking the NVIDIA platform as an example. It is a combination of lectures and hands-on exercises, using one of Europe’s fastest supercomputers, JUWELS Booster, for interactive learning and discovery.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionMaximal biclique enumeration (MBE) in bipartite graphs is an important problem in data mining with many real-world applications. All existing solutions for MBE are designed for CPUs. Parallel MBE algorithms for GPUs are needed to accelerate MBE by leveraging their many computing cores. However, enumerating maximal bicliques on GPUs faces three main challenges: large memory requirements, thread divergence, and load imbalance. In this paper, we propose GMBE, the first highly efficient GPU solution for the MBE problem. To overcome these challenges, we design a stack-based iteration approach to reduce GPU memory usage, a proactive pruning method using each vertex’s local neighborhood size to alleviate thread divergence, and a load-aware task scheduling framework to achieve load balance among threads within GPU warps and blocks. Our experimental results show that GMBE on an NVIDIA A100 GPU can achieve a 70.6× speedup over the state-of-the-art parallel MBE algorithm ParMBE on a 96-core CPU machine.
Workshop
Performance Optimization
W
DescriptionEnsemble forecasting techniques are gaining popularity in the weather and renewable energy communities, thanks to their ability to produce accurate predictions while also providing a measure of the uncertainty in the forecast. Analog Ensemble techniques are a class of computationally efficient ensemble forecasting methods that predict future weather events based on historical similar cases (i.e., analogs). The definition of "similar" depends on the type of predictors used for searching the historical dataset, and on how relevant they are for identifying a similar weather event that happened in the past. For a given geographical location, determining the relevance of a predictor for identifying good analogs requires a long tuning process, usually performed via brute force. In this work, we provide several probabilistic alternatives to the tuning process, based on the dataset size, the computational cost of a single evaluation, and the number of predictors.
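The analog-selection step the abstract describes, with per-predictor weights as the quantities being tuned, can be illustrated in a few lines. This is a generic sketch with hypothetical names, not the authors' code:

```python
import numpy as np

def analog_forecast(hist_predictors, hist_observations, current_predictors,
                    weights, k=10):
    """Toy analog-ensemble step: find the k historical cases whose predictor
    vectors are closest (weighted Euclidean distance) to the current
    predictors, and use their observed outcomes as the ensemble."""
    diffs = hist_predictors - current_predictors            # (n_hist, n_pred)
    dists = np.sqrt(((diffs * weights) ** 2).sum(axis=1))   # weighted distance
    analog_idx = np.argsort(dists)[:k]                      # k best analogs
    ensemble = hist_observations[analog_idx]
    return ensemble.mean(), ensemble.std()                  # forecast + spread

# Example with three historical cases and two predictors:
hist_p = np.array([[0.0, 0.0], [1.0, 1.0], [10.0, 10.0]])
hist_o = np.array([1.0, 2.0, 30.0])
mean, spread = analog_forecast(hist_p, hist_o, np.array([0.5, 0.5]),
                               weights=np.ones(2), k=2)
# mean == 1.5: the two nearest analogs (obs 1.0 and 2.0) form the ensemble
```

The brute-force tuning the abstract mentions amounts to re-running such a search over many candidate `weights` vectors, which is what the proposed probabilistic alternatives aim to avoid.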
Workshop
Software Engineering
W
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionThe Message-Passing Interface (MPI) requires implementations that are able to adapt to new hardware and architectures while ensuring correctness and usability. The most widely used MPI implementations, however, are written in older programming languages that can lead to memory-unsafe code with poor isolation between modules, and complicated interfaces that can lead to serious bugs, all of which makes testing, debugging, and checking for correctness difficult. To improve the development of MPI implementations, we posit that new components, and key existing code segments, may benefit from being written in the Rust programming language. In this work, we re-implement a core component of Open MPI used for intra-node communication in Rust and show that it achieves performance approaching that of the existing, highly optimized C code, demonstrating that Rust is able to provide performance while allowing for better testing, memory safety guarantees, and correctness.
Workshop
Distributed Computing
State of the Practice
W
DescriptionAs the adoption of Kubernetes continues to grow, there is an increasing demand for performing larger-scale batch processing using Kubernetes. Many of the initial workloads are around machine learning, but there is also interest in converging traditional HPC and Kubernetes clusters for operational efficiencies.
In this talk, we want to look at how we can leverage both traditional HPC workload partitioning as well as features of Kubernetes to achieve a hybrid system that can be used for all types of workloads. We will show how we isolate the orchestration and user processes in Kubernetes allowing for maximum use of the hardware for running batch workloads with benchmark comparisons against a traditional HPC cluster.
It is important to note that this area of research is still in its early stages and through our exploration of this topic we hope that this will continue to foster discussion in the HPC community.
Paper
Distributed Computing
Message Passing
Programming Frameworks and System Software
TP
DescriptionYGM is a general-purpose asynchronous distributed computing library for C++/MPI, designed to handle the irregular data access patterns and small messages of graph algorithms and data science applications. It uses data serialization to give an easily usable active message interface and message aggregation to maximize application throughput. Our design philosophy makes a tradeoff that increases network bandwidth utilization at the cost of added latency. We provide a suite of benchmarks showcasing YGM’s performance. Compared to similar distributed active message benchmark implementations that do not provide message buffering, we are able to achieve over 10x throughput on thousands of cores at a latency cost that can be as small as 2x or as large as 100x, depending on the machine being used. For applications that can be written to be latency-tolerant, this represents a significant potential performance improvement through using YGM.
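The throughput-versus-latency tradeoff YGM makes can be sketched abstractly: messages are queued per destination and the network sees only full batches, which improves bandwidth utilization while delaying individual messages. A toy model follows (a hypothetical class, not YGM's actual C++/MPI API):

```python
from collections import defaultdict

class AggregatingMailbox:
    """Toy model of message aggregation: messages to each destination are
    buffered and handed to the network as one batch once the buffer fills,
    trading per-message latency for fewer, larger sends."""

    def __init__(self, send_fn, batch_size=1024):
        self.send_fn = send_fn            # callable(dest, list_of_messages)
        self.batch_size = batch_size
        self.buffers = defaultdict(list)  # dest -> pending messages

    def async_send(self, dest, msg):
        buf = self.buffers[dest]
        buf.append(msg)
        if len(buf) >= self.batch_size:   # flush only when the batch is full
            self.send_fn(dest, buf)
            self.buffers[dest] = []

    def barrier(self):
        """Drain all remaining buffers, as an aggregating runtime would do
        at a collective synchronization point."""
        for dest, buf in self.buffers.items():
            if buf:
                self.send_fn(dest, buf)
        self.buffers.clear()

# Example: 2,500 messages to rank 0 produce two full batches and a remainder.
sent_batches = []
box = AggregatingMailbox(lambda dest, batch: sent_batches.append(len(batch)))
for i in range(2500):
    box.async_send(0, i)
box.barrier()   # sent_batches is now [1024, 1024, 452]
```

A message enqueued just after a flush waits for up to `batch_size - 1` successors before it travels, which is the latency cost the abstract quantifies as between 2x and 100x depending on the machine.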
Workshop
Education
State of the Practice
W
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionLarge supercomputing facilities are critical to research in many areas that inform decisions on how to address the current climate emergency, including renewable energy facility design and new battery technologies. However, these systems are themselves a source of large amounts of emissions, due to the embodied emissions associated with their construction and decommissioning and the power consumption associated with running the facility. Recently, we have been analyzing the impact of a UK national HPC facility (ARCHER2) in terms of energy and emissions. Based on this work, we have made changes to the operation of the service that give a saving of more than 20% in the power draw of the computational resources, with all application benchmarks showing reduced power to solution. We describe our analysis and the changes made to the operation of the service to improve its energy efficiency and thereby reduce its climate impacts.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionThis paper explores the challenges and solutions for managing the vast amount of data generated by the Advanced Photon Source (APS), a synchrotron light source producing ultra-bright x-rays for diverse scientific domains. With 68 experimental beamlines, the APS serves a wide user base across academia, government, and industry. The ongoing upgrade of the APS storage ring and new instruments will amplify data generation and processing demands. This paper discusses the approach to address these demands through automated, standardized workflows for faster scientific insights. The APS Data Management System coordinates data-related tasks and interfaces with Globus. Through integration with the Argonne Leadership Computing Facility (ALCF), APS users can efficiently access HPC resources. Standardized workflows have led to reduced computational burdens on scientists and greater accessibility of HPC resources. We demonstrate how standardization and collaboration enable scientists to rapidly produce meaningful scientific results, establishing a streamlined path from collection to publication.
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionModern scientific applications utilize numerous software and hardware layers to efficiently access data. This approach poses a challenge for I/O optimization because of the need to instrument and correlate information across those layers. The Darshan characterization tool seeks to address this challenge by providing efficient, transparent, and compact runtime instrumentation of many common I/O interfaces. It also includes command-line tools to generate actionable insights and summary reports. However, the extreme diversity of today's scientific applications means that not all applications are well served by one-size-fits-all analysis tools.
In this work, we present PyDarshan, a Python-based library that enables agile analysis of I/O performance data. PyDarshan caters to both novice and advanced users by offering ready-to-use HTML reports as well as a rich collection of APIs to facilitate custom analyses. We present the design of PyDarshan and demonstrate its effectiveness in four diverse real-world analysis use cases.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThe traditional drive to increase parallelism for individual jobs in HPC systems is constrained by the diversity and dynamics of their resource demands at runtime. Malleability techniques can help to dynamically adapt resource usage to achieve maximum efficiency. Malleable HPC systems, however, present a series of fundamental research challenges in the fields of resource management, scheduling, malleability control, flexibilization of application structures, and data movement. All of the aforementioned issues will be discussed in this Birds of a Feather session, which aims to build a community of developers and users around the topic of malleability in high-performance computing, networking, and storage.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionThe significance of studying cellular systems in silico is underscored by persistent innovation in computational models. These models can now capture hundreds of millions of cells to recapitulate physiological behavior. The growing scale of models, however, poses challenges not only for visualization and analysis of the data but also for the process of simulation maintenance. Without proper, flexible analysis routines in place, the deployment of these models continues to lag. This paper presents an approach to enable in situ visualization and analysis of large-scale fluid-structure-interaction models for real-time data interrogation and visualization on leadership class systems in preparation for increasing scale. We demonstrate the feasibility and explore the flexibility of this pipeline on a complex cell model with millions of components running on the Summit supercomputer.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
DescriptionDynamic neural networks (DyNNs) enable high computational efficiency and strong representation capability. However, training a DyNN can face a memory capacity problem because of increasing model sizes or limited GPU memory capacity. Managing tensors to save GPU memory is challenging because of the dynamic structure of DyNNs. We introduce DyNN-Offload, a memory-management runtime system for training DyNNs. DyNN-Offload uses a learned approach (a neural network called the pilot model) to increase the predictability of tensor accesses and facilitate memory management. The key to DyNN-Offload is enabling fast inference of the pilot model, in order to reduce its performance overhead while providing high inference (or prediction) accuracy. DyNN-Offload reduces the input feature space and model complexity of the pilot model based on a new representation of DyNNs. DyNN-Offload enables 8× larger DyNN training on a single GPU compared with using PyTorch alone (unprecedented with any existing solution). Evaluating with AlphaFold (a production-level, large-scale DyNN), we show that DyNN-Offload outperforms unified virtual memory (UVM) and dynamic tensor rematerialization (DTR), the most advanced solutions for saving GPU memory for DyNNs, by 3× and 2.1×, respectively, in terms of maximum batch size.
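The kind of decision a tensor-access predictor enables can be illustrated with a Belady-style eviction rule: offload the resident tensor whose next access is predicted to lie furthest in the future. This is a deliberately minimal sketch with hypothetical names, far simpler than DyNN-Offload itself:

```python
def pick_tensor_to_offload(resident_tensors, predicted_next_access):
    """Belady-style choice: offload (to host memory) the tensor that the
    predictor says will be needed furthest in the future.

    resident_tensors: ids of tensors currently in GPU memory.
    predicted_next_access: tensor id -> predicted step of next access.
    """
    return max(resident_tensors, key=lambda t: predicted_next_access[t])

# Example: 'b' is predicted to stay idle the longest, so it goes first.
victim = pick_tensor_to_offload(['a', 'b', 'c'], {'a': 3, 'b': 10, 'c': 5})
```

For static graphs such predictions are trivial; the point of a learned pilot model is to supply them cheaply when the graph structure changes between iterations.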
Workshop
W
DescriptionContainers based on NVIDIA GPU Cloud (NGC) images have become increasingly popular for deploying optimized software on NVIDIA GPUs, particularly in the context of ML/AI frameworks and models. However, it's important to note that the software stack within NGC images lacks the components necessary to interact with the HPE Slingshot 11 interconnect, which is a high-speed network utilized in some of the world's most powerful supercomputers. This limitation adds to the challenge of efficiently running containers for this noteworthy combination of systems and use cases.
This presentation aims to share insights into the process of enabling NGC-based containers to leverage Slingshot 11. The discussion will cover key elements for optimizing application performance, including the NCCL communication collectives, the libfabric communication framework, and GPUDirect RDMA. The presentation will also feature quantitative results from synthetic benchmarks that measure communication bandwidth and deep learning performance using the PyTorch framework.
Workshop
Quantum Computing
Software Engineering
W
DescriptionQuantum computer simulators play an essential role in advancing the field of quantum computing, serving as indispensable tools for quantum computer verification, debugging, and quantum algorithm prototyping. Among these simulators, Google's state vector quantum simulator qsim has gained significant popularity for its high performance and support for AVX512 vector instructions, OpenMP, and NVIDIA GPUs. However, the lack of support for AMD GPUs presents a limitation in qsim's widespread applicability to current large-scale supercomputers with AMD GPUs, including the current fastest supercomputer in the world, Frontier. This work addresses this gap by developing a qsim software backend that leverages the AMD HIP (Heterogeneous-Compute Interface for Portability) programming interface and tools. We discuss the efficiency and effectiveness of the newly introduced support for AMD GPUs through performance evaluations and comparisons with the existing NVIDIA GPU backend.
Paper
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
TP
DescriptionMolecular dynamics (MD) simulation provides an affordable way to inspect microscopic phenomena and is a powerful complement to real-world experiments. However, the spatial scale of MD simulations is usually orders of magnitude smaller than that of experimental systems. In this paper, we present our work redesigning the widely used inter-layer potential in structural superlubricity. By employing a specialized neighbor list for the inter-layer potential computation, the total memory access volume is reduced significantly. In addition, a simple but efficient vectorization strategy is implemented based on the new neighbor list. In the extreme case, our work scales to 38 million cores, achieving a sustained performance of 61 PFLOPS and enabling a simulation of a superlubricity system of 32 um^2 with 7.2 billion atoms at 4.75 ns/day, which is 11,834 times the contact area of the largest reported superlubricity simulation and almost ten times faster in time-to-solution.
Doctoral Showcase
Posters
Reproducibility
TP
DescriptionScientific communities across fields like earth science, biology, and materials science increasingly run complex workflows for their scientific discovery. We work closely with these communities to leverage high-performance computing (HPC), big data analytics, and artificial intelligence/machine learning (AI/ML) to increase and accelerate their workflows’ productivity. Our work addresses the new challenges brought about by this optimization process.
We identify three main challenges in these workflows: i) they integrate AI/ML methods with limited transparency and include many interoperable components (data and applications) that are hard to trace and reuse to reproduce results; ii) they hide the complexity of large intermediate data and their overall execution can be affected by the I/O bandwidth of the underlying infrastructure; and iii) they run on heterogeneous and distributed infrastructure with data and application dependencies that require efficient data management and resource allocation.
To address these challenges, we provide solutions that leverage the convergence between high-performance and cloud computing. First, we design and develop fine-grained containerized environments that enable data traceability and results explainability by automatically annotating and seamlessly attaching provenance information. Second, since the workflows are already containerized, we integrate them in HPC and native-cloud infrastructure and tune the storage technology to enable better I/O and data scalability. Finally, we orchestrate the end-to-end execution of workflows, ensuring efficient allocation of infrastructure resources and intermediate data management, and supporting reproducibility and reusability of workflows’ executions.
Workshop
Quantum Computing
Software Engineering
W
DescriptionLarge-scale simulations of quantum circuits pose significant challenges, especially in the context of quantum chemistry, due to the number of qubits, circuit depth, and the number of circuits needed per problem. High-performance computing (HPC) systems offer massive computational capabilities that could help overcome these obstacles. We developed a high-performance quantum circuit simulator, called NWQ-Sim, and demonstrate its capability to simulate large quantum chemistry problems on NERSC's Perlmutter supercomputer. Integrating NWQ-Sim with XACC, we have executed QPE and VQE algorithms for downfolded quantum chemistry systems at unprecedented scales. Our work demonstrates the potential of leveraging HPC resources to advance quantum chemistry and other applications of near-term quantum devices.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThis poster presents the DYnamic and Asynchronous Data Streamliner (DYAD) middleware that provides an efficient and transparent method for data movement in scientific workflows based on the producer-consumer paradigm. We develop DYAD on top of Flux, a fully hierarchical HPC workload manager, and Unified Communication X (UCX), a unified framework for networking on HPC systems. We measure DYAD's performance with a suite of mini-apps and show how it outperforms traditional methods for data transfer while providing a higher level of transparency.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionCurrent scientific workflow systems do not typically integrate simulation-centric and data-centric aspects due to their very different software/infrastructure requirements. A transparent integration of such components into a single end-to-end workflow would lead to a more efficient and automated way of generating insights from large simulation data. This work presents a complex case study on extreme-events analysis of future climate data that integrates numerical simulations, Big Data analytics, and Machine Learning models in the same workflow. The case study is being implemented in the context of the eFlows4HPC project, using the project's software stack for deployment and orchestration of the workflow. The solution implemented in the project has been shown to simplify the development and execution of end-to-end climate workflows with heterogeneous software requirements. Moreover, such an approach can, in the long term, increase the reuse of workflows by scientists and their portability across different HPC infrastructures.
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionHigh-Performance Computing (HPC) systems today are gradually increasing in size and complexity due to the corresponding demand for ever-increasing computing power to handle more complicated tasks at higher accuracy. The growing energy demands of HPC systems necessitate the urgent adoption of green HPC approaches to mitigate environmental impact and promote energy-efficient computing.
This paper explores a monitoring solution for the energy values detected during the execution of two parallel algorithms for the solution of linear systems: the Inhibition Method and Gaussian Elimination from the ScaLAPACK library. The main goal is to profile their execution from the energy consumption perspective. Moreover, it also collates the energy and power values for different rank, node, and socket configurations. The monitoring tools employed to track the energy consumption of these algorithms are PAPI and RAPL, integrated with the parallel execution of the algorithms managed with the Message Passing Interface (MPI).
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionClassical simulations are essential for the development of quantum computing, and their exponential scaling can easily fill any modern supercomputer. In this paper we consider the performance and energy consumption of large Quantum Fourier Transform (QFT) simulations run on ARCHER2, the UK's National Supercomputing Service, with the QuEST toolkit. We take into account CPU clock frequency and node memory size, and use cache-blocking to rearrange the circuit, which minimizes communication. We find that running at 2.00 GHz instead of 2.25 GHz can save as much as 25% of energy at a 5% increase in runtime. Higher node memory also has the potential to be more efficient, and to cost the user fewer CUs, but at a higher runtime penalty. Finally, we present a cache-blocking QFT circuit, which halves the required communication. All our optimizations combined result in 40% faster simulations and 35% energy savings in 44-qubit simulations on 4,096 ARCHER2 nodes.
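The headline numbers above follow from energy-to-solution being average power times runtime, so a roughly 25% energy saving at a 5% longer runtime implies the lower clock draws about 29% less average power. A quick sanity check (the helper name is illustrative, not from the paper):

```python
def energy_ratio(power_ratio, runtime_ratio):
    """Energy = average power x runtime, so the energy ratio between two
    clock-frequency settings is the product of the two ratios."""
    return power_ratio * runtime_ratio

# Reported: ~25% less energy (ratio 0.75) at 5% longer runtime (ratio 1.05)
# when dropping from 2.25 GHz to 2.00 GHz.
implied_power_ratio = 0.75 / 1.05            # energy ratio / runtime ratio
assert abs(energy_ratio(implied_power_ratio, 1.05) - 0.75) < 1e-12
power_saving = 1 - implied_power_ratio       # about 0.29, i.e. ~29% less power
```

The asymmetry (a small runtime penalty buying a large power reduction) is what makes frequency downscaling attractive for memory- and communication-bound workloads like these simulations.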
Tutorial
Accelerators
Energy Efficiency
TUT
DescriptionEnergy efficiency has become a critical concern in High Performance Computing (HPC) and supercomputing, especially with the advent of exascale systems. The increasing demand for computational power and the associated energy consumption have led to a growing need for optimization techniques to reduce power consumption. GPUs, now the primary source of compute power in exascale supercomputers, contribute significantly to the overall energy expenditure of these systems. Consequently, the development and implementation of energy-efficient strategies for GPU applications are essential to reduce the environmental impact and operational costs of HPC facilities.
This tutorial offers a comprehensive introduction to energy-efficient computing in the context of HPC, focusing on GPU applications. As a participant, you will gain insight into code optimization techniques that improve energy efficiency, automatically explore performance-energy trade-offs using Kernel Tuner, dive into mixed-precision techniques, and learn how to write clean code for reduced-precision arithmetic on GPUs.
Finally, the tutorial addresses GPU clock frequency optimization as a means to improve energy efficiency, including how to find the optimal core clock frequency range. The hands-on approach of this tutorial enables participants to acquire valuable knowledge and practical experience in energy-efficient computing, essential for advancing environmentally sustainable and cost-effective HPC and supercomputing solutions.
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionThe DoD has invested significant time and funding to acquire HPC systems, software, and networking to support a large base of DoD and industry users on HPC-backed projects. This BoF will use interactive lightning talks about current and future research, technology acquisition plans, and software development needs that align with DoD goals. These interactive talks are intended to help external organizations and researchers connect with DoD HPC leadership, encourage partnerships, strengthen diversity, and collaborate with problem solving. External engagement will help DoD users and HPC sites enhance expertise and connect to the larger HPC community.
Paper
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
TP
DescriptionPhysical phenomena such as protein folding require simulation up to microseconds of physical time, which directly corresponds to the strong scaling of molecular dynamics (MD) on modern supercomputers. In this paper, we present a highly scalable implementation of the state-of-the-art MD code LAMMPS on Fugaku, exploiting the 6D mesh/torus topology of the TofuD network. Based on our detailed analysis of the MD communication pattern, we first adopt coarse-grained peer-to-peer ghost-region communication with the uTofu interface, then further improve the scalability via a fine-grained thread pool. Finally, remote direct memory access (RDMA) primitives are utilized to avoid buffer overhead. Numerical results show that our optimized code reduces communication time by 77%, improving the performance of baseline LAMMPS by factors of 2.9x and 2.2x for Lennard-Jones and embedded-atom method potentials when scaling to 36,846 computing nodes. Our optimization techniques can also benefit other applications that use stencil or domain decomposition methods.
Paper
Applications
Modeling and Simulation
DescriptionSimulations of cancer cell transport require accurately modeling mm-scale and longer trajectories through a circulatory system containing trillions of deformable red blood cells, whose intercellular interactions require submicron fidelity. Using a hybrid CPU-GPU approach, we extend the advanced physics refinement (APR) method to couple a finely-resolved region of explicitly-modeled red blood cells to a coarsely-resolved bulk fluid domain. We further develop algorithms that: capture the dynamics at the interface of differing viscosities, maintain hematocrit within the cell-filled volume, and move the finely-resolved region and encapsulated cells while tracking an individual cancer cell. Comparison to a fully-resolved fluid-structure interaction model is presented for validation. Finally, we use the advanced APR method to simulate cancer cell transport over a mm-scale distance while maintaining a local region of RBCs, using a fraction of the computational power required to run a fully-resolved model.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
ACM Gordon Bell Finalist
Awards
TP
DescriptionPeople are increasingly concerned about how tectonic processes affect climate and vice versa. We establish a cross-sphere modeling system for volcanic eruptions and atmospheric circulation on a new Sunway supercomputer, with a spatial resolution from 10m locally to 3km globally, using an improved multi-medium, multiphase smoothed particle hydrodynamics (SPH) method combined with a fully coupled meteorology-chemistry global atmospheric modeling scheme. We achieve 400 billion particles and 80% parallel efficiency using 39,000,000 processor cores. The simulation captures the whole dynamic process of the Tonga eruption, from shock waves, earthquakes, tsunamis, and mushroom clouds to the following 6-7 days of transport and diffusion of ash and water vapor, and provides a preliminary assessment of the fully coupled influence of volcano, earthquake, ocean, and atmosphere. This work is significant for understanding the interaction between tectonic processes and climate change, and for establishing an early-warning simulation system for similar global hazard events.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionThe convergence of HPC and AI entails an explosion in the number of nodes/cores, in data volume, and in data movement. In the coming years, the deployment of AI networks, ranging from rack scale to datacenter scale, is set to accelerate, necessitating the evolution of networking technology to effectively support and accommodate these needs.
Eviden Vision and solution:
The existing HPC & AI networking choices each have their virtues in continuing to support evolving simulation workloads. However, depending on specific requirements, e.g., workload characteristics, performance considerations, and budget constraints, customers should have the flexibility to choose an open, interoperable, high-performance, full-communications-stack architecture that meets growing network demands at scale, whether on-premises or in the cloud.
As one of the founding members of the UEC (Ultra Ethernet Consortium), Eviden supports Ethernet-based, multivendor-interoperable, scalable, and cost-effective high-performance networking for HPC and AI workloads, focusing on:
• Performance and reliability: extremely low latency for efficient communication between nodes for both HPC and AI workloads, supporting massive data transfer and high-throughput inter-node communication; effective management of packet processing, network congestion, and message-handling protocols, enabling low-latency, substantial data transfers with RDMA over Ethernet while ensuring data integrity.
• Interoperability and simplicity: leveraging Ethernet's ubiquity, mature ecosystem, vendor support, and interoperability with a wide range of software and toolsets, an Ethernet-based next-generation networking fabric offers seamless integration with the popular operating systems, cluster management software, storage solutions, and distributed file systems commonly used in data centers.
• Scalability and cost-effectiveness: the affordability of Ethernet hardware and software, and its widespread adoption in commercial networks, make it a cost-effective and easy-to-scale networking choice.
With the new scale and complexity of workloads, optimizing the interconnect network for HPC and AI systems becomes a major contributor to overall performance. Ethernet-based high-performance networking technology provides a new avenue to tackle these technical and economic challenges.
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionIn recent years, the European HPC ecosystem has undergone profound changes. EuroHPC JU, a joint initiative between the EU, European countries, and private partners to develop a world-class supercomputing ecosystem in Europe, was created. PRACE is in the process of transforming itself into a European HPC User and Centre Association.
The objective of this BoF is to give an overview of the current state of European HPC activities. We will present and discuss with the different European HPC stakeholders the current state of play, future plans, and challenges, and critically analyze the European HPC offerings and services.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionThis BoF aims to foster discussion on RISC-V accelerators, led by efforts on European accelerators for HPC, and to build community interest in these projects. There are several accelerator efforts around the HPC community in Europe, many of them leveraging and strengthening the RISC-V ecosystem. We will start with a short presentation (15 minutes) giving a brief overview of current efforts and a quick insight into EUPILOT (part of the European Processor Initiative, EPI) to start the conversation. A Q&A session and open discussion with audience members will follow the introduction.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionIn recent years, computers based on the RISC-V architecture have raised broad interest in the high-performance computing (HPC) community. As the RISC-V community develops the core instruction set architecture (ISA) along with ISA extensions, the HPC community has been actively ensuring that HPC applications and environments are supported. In this context, assessing the performance of asynchronous many-task runtime systems (AMTs) is important. We describe our experience porting a full 3D adaptive mesh-refinement, multi-scale, multi-model, and multi-physics application, Octo-Tiger, which is based on the HPX AMT, and we explore its performance characteristics on different RISC-V systems. The demonstrated results confirm that Octo-Tiger shows good scaling behavior on all tested systems. We expect, however, that dedicated hardware support based on ISA extensions (such as single-cycle context switches, extended atomic operations, and direct support for HPX's global address space) would allow for even better performance.
Posters
Research Posters
Heterogeneous Computing
Performance Measurement, Modeling, and Tools
TP
DescriptionMaintaining a single codebase that achieves good performance on a range of accelerator-based supercomputing platforms is of extremely high value for productive scientific application development. However, the large number of programming models that claim to provide performance portability leaves developers with a complex choice when picking a model, potentially requiring an intensive effort to test each available model with kernels from their application. To better understand the current state of performance-portable programming models, this project evaluates seven of the most popular programming models using two memory-bound mini-applications on two leadership-class supercomputers, Summit and Perlmutter. These results provide a useful evaluation of how well each programming model delivers true performance portability in real-world usage for memory-bound applications.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionA deep neural network library (DNNL) is an optimized library of low-level computational primitives for deep neural networks. In this study, we choose the softmax function, a primitive commonly used in new computing models for DNNs, as a case study for evaluating the distinct programming models adopted by vendors' DNNLs (cuDNN, MIOpen, and oneDNN) and the performance and portability of DNNLs on NVIDIA and AMD GPUs. We find that cuDNN selects different compute kernels to execute based on the primitive's problem size, which can have a significant performance impact. oneDNN successfully enables functional portability of the primitive across vendors' platforms, but its performance portability needs improvement. In addition, the performance of a primitive in the DNNLs may be suboptimal compared to a custom implementation.
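As a reference point for the primitive under study, the numerically stable softmax can be sketched in a few lines of NumPy; this is the textbook formulation, not any vendor DNNL's kernel:

```python
import numpy as np

def softmax(x, axis=-1):
    # subtract the row max for numerical stability: exp() of large
    # logits would otherwise overflow to inf
    shifted = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=axis, keepdims=True)

logits = np.array([1.0, 2.0, 3.0])
p = softmax(logits)
print(p, p.sum())  # probabilities in (0, 1) summing to 1
```

The max-subtraction trick is one reason different library kernels for the same primitive can diverge in performance: it adds a reduction pass over the input, and how that pass is fused or tiled differs per implementation.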
Workshop
Data Movement and Memory
Hardware Technologies
Heterogeneous Computing
Performance Measurement, Modeling, and Tools
W
DescriptionWe evaluate the 3rd generation of Intel's Optane non-volatile memory technology, assess the performance it can provide, and investigate the modes of use that can be beneficial in high-performance computing, both for application performance and for system architecture. We demonstrate sustained performance and functionality improvements from the latest hardware, along with I/O and memory performance and functionality not available from other memory or storage hardware. We show that leveraging Optane can provide significant reductions in the volatile memory required by applications, with minimal performance impact, given appropriate memory-hierarchy designs and considerations.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionAs high-performance GPU computing becomes the trend, GPU-initiated one-sided communication becomes a viable solution for multi-GPU scaling. It also draws attention to the use of one-sided communication on CPUs. However, the lack of a deep understanding of one-sided communication performance and its impact on an application's performance remains a hurdle. In this paper, we overcome this hurdle by proposing a Message Roofline model, which characterizes an application's sustained messaging performance (GB/s) as a function of its message size, number of messages per synchronization, peak network bandwidth, and network latency. We use three benchmarks to demonstrate the potential of one-sided communication on CPUs and GPUs: Stencils, Sparse Triangular Solve, and Distributed HashTable. Our evaluation provides insights for practically understanding two-sided and one-sided communication in MPI applications, and can also guide hardware vendors with design principles so that the potential performance of one-sided communication is not left under-utilized.
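A latency-bandwidth reading of such a roofline model can be sketched as follows; the formula, the function name, and the network parameters below are illustrative assumptions, not the paper's exact formulation:

```python
def sustained_msg_bw(msg_size, msgs_per_sync, peak_bw, latency):
    """Roofline-style estimate of sustained messaging bandwidth (bytes/s).

    One synchronization of cost `latency` seconds is amortized over
    `msgs_per_sync` messages injected at `peak_bw` bytes/s, so small,
    rarely-amortized messages are latency-bound and large transfers
    approach the peak-bandwidth roof.
    """
    bytes_moved = msg_size * msgs_per_sync
    return bytes_moved / (latency + bytes_moved / peak_bw)

# hypothetical network: 25 GB/s peak, 2 microsecond latency
for size in (8, 64 * 1024, 1 << 20):
    gbps = sustained_msg_bw(size, 1, 25e9, 2e-6) / 1e9
    print(f"{size:>8} B messages: {gbps:7.3f} GB/s sustained")
```

The qualitative behavior matches the model's premise: batching more messages per synchronization, or using larger messages, moves an application off the latency floor toward the bandwidth roof.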
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionWe evaluate the portability of the SYCL programming model on some of the latest CPUs and GPUs from a wide range of vendors, utilizing the two main compilers: DPC++ and hipSYCL/OpenSYCL. Both compilers currently support GPUs from all three major vendors; we evaluate performance on the Intel(R) Data Center GPU Max 1100, the NVIDIA A100 GPU, and the AMD MI250X GPU. Support on CPUs currently is less established, with DPC++ only supporting Intel CPUs through OpenCL, however, OpenSYCL does have an OpenMP backend capable of targeting all modern CPUs; we benchmark the Intel Xeon Platinum 8360Y Processor (Ice Lake), the AMD EPYC 7V73X (Milan-X), and the Ampere Altra platforms. We study a range of primarily bandwidth-bound applications implemented using the OPS and OP2 DSLs, evaluate different formulations in SYCL, and contrast their performance to “native” programming approaches where available (CUDA/HIP/OpenMP).
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionIt is generally assumed that elastic parallel applications, with the ability to dynamically resize their process count, would provide numerous benefits to high-performance computing (HPC) systems and applications. Supporting this capability, however, requires significant effort at several layers of the HPC software stack: at a minimum, the resource management system, the distributed communication libraries, and the distributed applications themselves would have to explicitly support elasticity. With this level of widespread support required, there must be significant motivation for developers to commit to adding this capability. We aim to determine whether there are practical benefits to supporting elasticity by simulating HPC systems with support for elastic jobs using real-world job data. Our simulations show significant benefits from adding elastic jobs, with up to 35.34% higher system utilization, 75.3% lower runtime, 99.76% lower wait time, and 75.22% lower total turnaround time.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionIEEE-754 is the de facto standard for implementing floating-point number systems in hardware, although recently, posits have been proposed as a drop-in replacement. Recent work has suggested that posits can offer greater numerical accuracy and reproducibility than IEEE-754-compliant floating-point numbers at a comparable architectural cost. Several studies have considered the use of posits and other floating-point implementations in hardware and software, but there is limited work examining this new number system from a reliability perspective. In this paper, we evaluate the resiliency of posits to inform hardware design for fault-tolerant systems. Our analysis breaks down the impact of bit flips on the various fields within both floating-point standards. After examining the patterns and quirks of bit-flip errors in posits, we conclude that posits offer superior resiliency to the IEEE-754 standard in the majority of cases.
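The field-by-field sensitivity of IEEE-754 to bit flips, the baseline the paper compares posits against, can be observed directly by flipping individual bits of a double; this is a self-contained sketch of the IEEE-754 side only (the posit side would require a posit library):

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64: bit 0 is the mantissa LSB,
    bits 52-62 are the exponent, and bit 63 is the sign."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    return struct.unpack("<d", struct.pack("<Q", bits ^ (1 << bit)))[0]

x = 1.0
print(flip_bit(x, 63))  # sign bit      -> -1.0
print(flip_bit(x, 62))  # exponent MSB  -> inf (all-ones exponent)
print(flip_bit(x, 0))   # mantissa LSB  -> ~2e-16 relative error
```

A single upset in the exponent field changes the value by many orders of magnitude (or produces inf/NaN), while a mantissa-LSB flip is nearly harmless; posits redistribute this sensitivity through their variable-width regime field.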
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionIn this presentation, we outline the results of a project to evaluate the total climate/carbon impact of a digital research infrastructure for a defined snapshot period. We outline the carbon model used to calculate the impact and the data collected to quantify that impact for a defined set of resources. We discuss the variation in potential impact across both the active and embodied carbon for computing hardware and produce a range of estimates on the amount of carbon equivalent climate impact for the snapshot period.
Awards
TP
DescriptionEmerging data-driven scientific workflows are seeking to leverage distributed data sources to understand end-to-end phenomena, drive experimentation, and facilitate important decision making. Despite the exponential growth of available digital data sources at the edge, and the ubiquity of non-trivial computational power for processing this data, realizing such science workflows remains challenging. In this talk, I will explore a computing continuum that is everywhere and nowhere -- one spanning resources at the edges, in the core, and in between, providing abstractions that can be harnessed to support science. I will also introduce recent research in programming abstractions that can express what data should be processed and when and where it should be processed, and autonomic middleware services that automate the discovery of resources and the orchestration of computations across these resources.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionThe aim of this workshop is to bring together researchers and developers to present and discuss innovative algorithms and concepts in the message passing programming model and to create a forum for open and potentially controversial discussions on the future of MPI in the exascale era and beyond.
Birds of a Feather
Energy Efficiency
State of the Practice
Sustainability
TP
XO/EX
DescriptionEfficient energy usage in data centers has attracted attention locally, nationally, and globally, and many data centers are increasingly interested in waste heat reuse. Two organizations, CSC and NREL, will provide an overview of their cooling and heat reuse processes, with lessons learned from design, construction, and operations.
The session will outline the metrics used (ERF, ERE, CoP, etc.) and foster discussion of standards, gaps, and the different approaches. Both sites will highlight metrics and methodologies and how differences affect the calculations.
Audience discussion and Q&A are aimed at engaging the community to understand the potential for new waste heat reuse projects.
Workshop
Education
State of the Practice
W
DescriptionTo thrive in the context of high-end computing, data-centric cognitive computing, and simulation, computational scientists require a diverse set of competences encompassing technical skills, domain knowledge, and soft skills. As a computational applied mathematician in the high-performance computing community, the speaker has long-term experience in the efficient use of HPC for simulation and modeling, especially in the areas of systems biology and physical phenomena. He has worked on performance analysis of advanced computer architectures and investigated methods that exploit these architectures in computational science research. He will discuss the urgent pressures for educating and upskilling computational scientists in the fast-changing environment of exascale and beyond, quantum computing, and generative AI.
ACM Gordon Bell Finalist
Awards
TP
DescriptionENRICO is a coupled application developed under the US Department of Energy's Exascale Computing Project (ECP) targeting the modeling of advanced nuclear reactors. It couples radiation transport with heat and fluid simulation, including the high-fidelity, high-resolution Monte Carlo code Shift and the computational fluid dynamics code NekRS. NekRS is based on rapidly convergent high-order spectral element discretizations that feature minimal numerical dissipation and dispersion.
On Frontier, NekRS has recently achieved an unprecedented milestone, surpassing 1 billion spectral elements and 350 billion degrees of freedom. Shift has demonstrated the capability to transport upwards of 1 billion particles per second in full-core nuclear reactor simulations featuring complete temperature-dependent, continuous-energy physics on Frontier. Shift achieved a weak-scaling efficiency of 97.8% on 8,192 nodes of Frontier and calculated 6 reactions in 214,896 fuel pin regions to below 1% statistical error, yielding first-of-a-kind resolution for a Monte Carlo transport application.
Panel
Exascale
Heterogeneous Computing
Software Engineering
TP
DescriptionThis panel brings together experts and leads from national exascale initiatives around the globe, focusing on stacks encompassing algorithms to system-level software, to share their insights and experiences and to identify synergies for collaboration. Exascale systems being deployed and on the horizon feature diversity and heterogeneity not only of hardware but of software ecosystems. On one hand, the variety of accelerator technologies, alongside processor, memory, networking, and storage configurations, poses challenges for algorithm developers, domain-specific language and library architects, and performance engineers. On the other hand, there are expectations for supporting modern software development and delivery tools for reproducibility, portability, efficiency, and security to fulfill the edge-to-cloud-to-supercomputing continuum requirements for workflows. Against this backdrop, national initiatives are prioritizing and funding a diverse portfolio of initiatives to address the programmatic needs, which the panel will reflect on as a SWOT (strengths, weaknesses, opportunities, and threats) analysis.
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
Exascale
HPC in Society
Modeling and Simulation
Visualization
TP
XO/EX
DescriptionThe objective of the ExaWind component of the Exascale Computing Project is to deliver many-turbine blade-resolved simulations in complex terrain. These simulations bring new challenges to both compute and analysis of the resulting data. In this paper/video, we visually explore the impact of ExaWind on wind simulations through two studies of a small wind farm under two atmospheric conditions. We then turn to analysis and review tools that visualization researchers at NREL use to answer the challenges that ExaWind brings.
Workshop
Education
State of the Practice
W
DescriptionHigh-performance computing (HPC) is an important tool for research, development, and industry. With the recent expansion of machine learning applications, the need for HPC is increasing even further. However, in developing countries with limited access to the HPC ecosystem, the lack of infrastructure, expertise, and access to knowledge represents a major obstacle to the expansion of HPC. The adoption of HPC by communities presents several challenges. The HPC Summer Schools, a CyberColombia initiative running over the past five years, aim to develop the critical skills, strategic planning, and networking required to disseminate and maintain knowledge of high-performance computing and its applications in Colombia. Here we report the results of this series of summer schools. The events have proven successful, with over 200 participants from more than 20 institutions, spanning different levels of expertise, from undergraduate and graduate students to professionals.
Workshop
Applications
Exascale
Heterogeneous Computing
Programming Frameworks and System Software
State of the Practice
W
DescriptionIn May 2022, the newest supercomputer to top the TOP500 list was Frontier at Oak Ridge National Laboratory, capable of computing more than 1.1 quintillion (10^18) floating-point calculations every second. Driving this ground-breaking rate of computing are Frontier's more than 37,000 graphics processing units (GPUs) and 9,408 central processing units (CPUs). At this scale, the smallest margin of error may generate hundreds of errors across the system. In this work, we describe and evaluate two strategies for finding hardware-level faults in Frontier's 9,408 compute nodes: the first uses the Slurm scheduler to scavenge available compute time to run the node screen; the second enforces a weekly screen of each node. Using June 2023 as a case study, we find that the first scheduling strategy consumed ten times the resources of the second, but successfully detected five hardware defects in Frontier.
Paper
Exascale
Large Scale Systems
State of the Practice
TP
DescriptionThe advent of exascale computing invites an assessment of existing best practices for developing application readiness on the world’s largest supercomputers. This work details observations from the last four years in preparing scientific applications to run on the Oak Ridge Leadership Computing Facility's (OLCF) Frontier system. This paper addresses a range of topics in software including programmability, tuning, and portability considerations that are key to moving applications from existing systems to future installations. A set of representative workloads provides case studies for general system and software testing. We evaluate the use of early access systems for development across several generations of hardware. Finally, we discuss how best practices were identified and disseminated to the community through a wide range of activities including user-guides and trainings. We conclude with recommendations for ensuring application readiness on future leadership computing systems.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionAmong the various types of quantum computers, photonic quantum computers have shown great potential due to their high degree of scalability. However, the development of photonic quantum computers is still in its infancy, and the characterization of their performance is of critical importance to guide further improvements. In this work, we present the first characterization and insights derived from Xanadu's X8 photonic quantum computer. Our work represents an important step toward the development of practical and scalable photonic quantum computers.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionSelf-Driving Labs (SDLs), which combine automation of experimental procedures with autonomous decision making, are gaining popularity as a means of increasing the throughput of scientific workflows. The task of identifying a mix of supplied colored pigments that matches a target color, the color matching problem, has emerged as a simple and flexible test case for these labs, as it requires experiment proposal, sample creation, and sample analysis, three common components in automated discovery applications. We present a modular, easily retargetable robotic solution to the color matching problem that allows for fully autonomous execution of a color matching protocol, with feedback from pluggable optimization approaches allowing for continuous refinement, and automated publication of results facilitating experiment tracking and post-hoc analysis.
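The propose-measure-refine loop at the heart of the color matching problem can be sketched with a toy linear color model and random search standing in for the robot and the pluggable optimizer; every name and the color model itself are illustrative assumptions, not the presented system:

```python
import numpy as np

rng = np.random.default_rng(0)

def mix(fractions, pigments):
    """Toy color model: linear mix of pigment RGB values (a stand-in
    for the lab's real sample-creation and camera-measurement steps)."""
    return fractions @ pigments

def color_match(target, pigments, iters=2000):
    """Random-search optimizer: propose a mix, score it, keep the best.
    A real SDL would plug in Bayesian optimization or similar here."""
    best, best_err = None, np.inf
    for _ in range(iters):
        f = rng.dirichlet(np.ones(len(pigments)))  # fractions sum to 1
        err = np.linalg.norm(mix(f, pigments) - target)
        if err < best_err:
            best, best_err = f, err
    return best, best_err

pigments = np.array([[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]])  # R, G, B
target = np.array([0.5, 0.3, 0.2])
f, err = color_match(target, pigments)
print(f, err)
```

Swapping `color_match` for a smarter proposal strategy without touching `mix` mirrors the modularity the abstract describes: the experiment-proposal component is pluggable while sample creation and analysis stay fixed.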
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionConverged compute infrastructure refers to a trend where HPC clusters are set up for both AI and traditional HPC workloads, allowing these workloads to run on the same infrastructure and potentially reducing under-utilization. Here, we explore opportunities for converged compute with GroqChip™, an AI accelerator optimized for running large-scale inference workloads with high throughput and ultra-low latency. GroqChip features a Tensor Streaming architecture optimized for the matrix-oriented operations common in AI, but it can also efficiently run other applications, such as linear algebra-based HPC workloads.
We consider two opportunities for using the Groq AI accelerator for converged HPC. The first example is a structured grid solver for computational fluid dynamics (CFD). This solver can run in a classical implementation as a direct numerical solver (DNS) using the pressure projection method. In a hybrid AI implementation, the same DNS solver is augmented with CNN-based downscaling and upscaling steps. This enables a reduction of grid size from 2048 to 64, significantly reducing the amount of compute necessary while maintaining a similar quality of results after upscaling. A speedup of three orders of magnitude is made possible by the combination of reducing the number of compute steps through introducing AI and accelerating both the CNN and DNS stages with GroqChip. The second example is using HydraGNN for materials science and computational chemistry. These problems are typically solved with density functional theory (DFT) algorithms, but recently, graph neural networks (GNNs) have been explored as an alternative. For example, GNNs can be used to predict the total energy, charge density, and magnetic moment for various atom configurations, identifying molecules with desired reactivity. The computation requires many parallel walks of HydraGNN with low batch sizes, and can be solved on GroqChip 30-50x faster than an A100 graphics processor.
Posters
Research Posters
TP
XO/EX
DescriptionCryptographic hash functions are fundamental for ensuring data security and integrity in all consensus algorithms in blockchains. While SHA256 has been widely used in many blockchain implementations, its throughput and efficiency limitations have led to the rise of BLAKE3, a modern, lightweight, and faster implementation. We compared and contrasted SHA256 and BLAKE3 with a focus on blockchain workloads with small inputs and outputs. We explored different compilers and optimizations, different ways to parallelize using multi-threading and multi-processing, and systems of different sizes, from a small Raspberry Pi 4 to a modern AMD EPYC server. We found that BLAKE3 is superior from a performance perspective. To showcase its strengths, we integrated BLAKE3 into a basic Proof-of-Space implementation that uses advanced data indexing and search, and compared our results to the Chia blockchain plotting mechanism. Our approach offers one to two orders of magnitude higher hash generation and storage rates.
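A minimal version of the small-input throughput comparison can be run with the standard library; BLAKE3 itself requires the third-party `blake3` package, so stdlib BLAKE2b stands in for the BLAKE family here, and the harness is ours, not the poster's:

```python
import hashlib
import time

def throughput(hash_ctor, n=100_000, size=64):
    """Hash n small fixed-size messages and return hashes/second,
    mimicking blockchain-style workloads with small inputs."""
    msg = b"\x00" * size
    t0 = time.perf_counter()
    for _ in range(n):
        hash_ctor(msg).digest()
    return n / (time.perf_counter() - t0)

# BLAKE3 needs the third-party `blake3` package; BLAKE2b from the
# stdlib serves as a stand-in for the BLAKE family in this sketch.
print(f"sha256 : {throughput(hashlib.sha256):,.0f} hashes/s")
print(f"blake2b: {throughput(hashlib.blake2b):,.0f} hashes/s")
```

Note that a single-threaded loop like this measures only per-hash latency; BLAKE3's largest wins come from its tree structure, which also parallelizes across cores and SIMD lanes for larger inputs.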
Posters
Research Posters
TP
XO/EX
DescriptionWe evaluate the use of Julia as a single language and ecosystem paradigm powered by LLVM for the development of high-performance computing (HPC) workflow components. A Gray-Scott 2-variable diffusion-reaction application using a memory-bound 7-point stencil kernel is run on Frontier, the first exascale supercomputer. We evaluate the feasibility, performance, scaling, and trade-offs of (i) the computational kernel on AMD's MI250x GPUs, (ii) weak scaling up to 4,096 MPI processes/GPUs or 512 nodes, (iii) parallel I/O write using the ADIOS2 library bindings, and (iv) Jupyter Notebooks for interactive data analysis.
We will discuss our results which show that although Julia generates a reasonable LLVM-IR kernel, there is nearly a 50% performance difference with native AMD HIP stencil codes on GPU. We observed near-zero overhead when using MPI and parallel I/O bindings to system-wide installed implementations. Consequently, Julia emerges as a compelling high-performance and high-productivity workflow composition strategy as measured on Frontier.
Posters
Research Posters
TP
XO/EX
DescriptionHPC systems, driven by the rise of workloads with significant data requirements, face challenges in I/O performance. To address this, a thorough I/O analysis is crucial to identify potential bottlenecks. However, the multitude of metrics makes it difficult to pinpoint the causes of low I/O performance. In this work, we analyze three scientific workloads using three widely accepted I/O metrics. We demonstrate that different metrics uncover different I/O bottlenecks, highlighting the importance of considering multiple metrics for comprehensive I/O analysis.
Workshop
State of the Practice
W
DescriptionTwo-sided MPI has become the de facto standard for communication on distributed memory systems. As high-performance GPU computing becomes the trend, some numerical methods with relatively simple communication patterns adhering to the BSP model find that MPI and its CUDA-aware variant can satisfy their performance requirements. Conversely, DAG-like computations have more complex communication patterns and are hard to scale to multiple GPUs. Thus, GPU-initiated one-sided communication becomes a viable solution for multi-GPU scaling. However, the lack of a deep understanding of GPU-initiated one-sided communication and its real impact on application performance remains a hurdle. In this work, we use multi-GPU SpTRSV, which is used in conjunction with sparse LU for solving sparse linear systems, either as a direct solver or as a preconditioner, to demonstrate that our multi-GPU SpTRSV implementation using NVSHMEM achieves up to 3x speedup on Perlmutter compared to a single-GPU implementation.
ACM Gordon Bell Finalist
Awards
TP
DescriptionWe detail our developments in the high-fidelity spectral-element code Neko that are essential for unprecedented large-scale direct numerical simulations of fully developed turbulence. Major innovations are a modular multi-backend design enabling performance portability across a wide range of GPUs and CPUs, a GPU-optimized preconditioner with task overlapping for the pressure-Poisson equation, and in-situ data compression. We carry out initial runs of Rayleigh–Bénard Convection (RBC) at extreme scale on the LUMI and Leonardo supercomputers. We show how Neko is able to strongly scale to 16,384 GPUs and obtain results that are not possible without careful consideration and optimization of the entire simulation workflow. These developments in Neko will help resolve the long-standing question regarding the ultimate regime in RBC.
Posters
Research Posters
TP
XO/EX
DescriptionMemory-bound applications like graph processing often require memory capacity beyond a single node. Current HPC systems over-provision compute and memory resources to meet the requirements of diverse workloads. In this work, we explore using network-attached memory to disaggregate memory from compute nodes and satisfy the demand of memory-intensive workloads. We provide a library that enables applications to access network-attached memory as if it were in their main memory, and exposes critical controls to userspace, including concurrency level and page-level data compression. Our preliminary results show that the flexibility of tuning concurrency and compression is important for improving performance and reducing data movement. Also, our results on 12 scientific data sets indicate that DPU compression offloading could significantly speed up compression and is important for future optimizations.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionThe presence of GPUs and accelerators in recent supercomputing systems, so-called heterogeneous architectures, has led to increased complexity in execution environments and programming models, as well as deeper memory hierarchies on these systems. In this work we discuss challenges that arise in in situ code coupling on heterogeneous architectures. We present data and execution model extensions to the SENSEI in situ framework targeted at effective use of systems with heterogeneous architectures. We use the new data and execution model extensions in SENSEI to investigate a number of in situ placement and execution configurations and analyze the impact these choices have on overall performance.
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionWith the rapidly increasing size and complexity of DNNs, equally sophisticated methods are needed to train them efficiently, including distributed training and various model/hybrid parallelism approaches. Even though developers heavily rely on state-of-the-art frameworks such as PyTorch and TensorFlow, these provide little insight into an application's training behavior at scale, leading to latent performance bottlenecks and inefficient training configurations. We propose Extra-Deep, an automated empirical performance modeling approach for distributed deep learning. We leverage the created models to analyze a training task's performance, scalability, efficiency, and cost. Using an efficient sampling strategy that reduces the profiling time for the required empirical measurements by, on average, about 94.9%, we can identify cost-effective training configurations even for large-scale applications. We evaluated our approach on three parallelization strategies, with four DNN models and five datasets. The results show that Extra-Deep has an average prediction accuracy of 93.6% when compared to empirical results.
Workshop
Education
State of the Practice
W
DescriptionParallel and Distributed Computing (PDC) has become pervasive and is now exercised on a variety of platforms. Most students in computer science (CS) and computer engineering (CE) programs are still introduced to computational problem solving using an old model, in which all processing is serial and synchronous, with input and output via text using a terminal interface or a local file system.
Teaching a range of PDC knowledge and skills at multiple levels in CS and related CE curricula is essential. The authors of this paper conducted a series of week-long faculty training workshops on the integration of PDC topics in CS1 and CS2 classes, and this paper provides an experience report on the impact and effectiveness of these workshops. Our survey results indicate such faculty development workshops can be effective in gradual inclusion of PDC in early computing curricula.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Paper
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
TP
DescriptionConducting long-timescale simulations of small molecules using Molecular Dynamics (MD) is crucial in drug design. However, traditional methods to accelerate the process, including ASICs or GPUs, have limitations. ASIC solutions are not always generally available, while GPU solutions may not scale when processing small molecules. FPGAs are both communication processors and accelerators, with tight coupling between these capabilities, and so could be used to address strong scaling in this domain.
We present FASDA, the first FPGA-based MD accelerator available for community development. FASDA enables the use of FPGA-enhanced clusters and clouds to execute range-limited MD, which is the most resource-intensive and computation-demanding component of MD. FASDA is built from a series of pluggable components that are adjustable based on user requirements, and demonstrates nearly linear scaling on an eight-FPGA cluster. It outperforms the state-of-the-art GPU solution by 4.67x, with the resulting prospect of significantly reducing lead evaluation time.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThe frequency of checkpoint creation in large language models is limited by the write bandwidth to a parallel file system. In this study, we aim to reduce the checkpoint creation time by writing to the Intel Optane Persistent Memory installed on the compute nodes.
We propose TensorStore CHFS, a storage driver that adds an ad hoc parallel file system CHFS to the TensorStore. The proposed method succeeded in increasing the checkpoint creation bandwidth of the T5 1.1 model by 4.5 times on 32 nodes.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionIn modern scientific computing and machine learning systems, data movement has overtaken compute as the performance bottleneck, thus motivating the wider adoption of lossy data compression. Unfortunately, state-of-the-art floating-point array compressors such as SZ and ZFP require decompression before operations can be performed on the data. In this work, our contribution is to show that compression methods can be designed to allow efficient operations on compressed arrays without having to first decompress. In particular, compression methods that consist of only linear transformations and quantization allow certain operations on compressed arrays without decompression. We develop such a compression method, called PyBlaz, the first compression method we know that can compress arbitrary-dimensional arrays and directly operate on the compressed representation, with all stages running on GPUs.
In the poster session, I will provide details about each compression step, several compressed-space operations, and our ongoing performance and application experiments.
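The property that makes this work, namely that a pipeline of linear transforms plus quantization commutes (approximately) with linear operations, can be illustrated with a toy sketch. The identity transform and step size below are illustrative stand-ins, not the actual PyBlaz pipeline, which uses an orthonormal block transform:

```python
# Toy model of "linear transform + quantization" compression.
STEP = 0.01  # quantization step size (illustrative)

def compress(block):
    # A linear transform (identity here, for brevity) followed by
    # uniform quantization to integer codes.
    return [round(x / STEP) for x in block]

def decompress(codes):
    return [c * STEP for c in codes]

def add_compressed(a_codes, b_codes):
    # By linearity, adding quantized coefficients approximates
    # compressing the sum -- no decompression needed.
    return [a + b for a, b in zip(a_codes, b_codes)]

a = [0.12, -0.50, 3.14159]
b = [1.00, 0.25, -2.00]

approx_sum = decompress(add_compressed(compress(a), compress(b)))
exact_sum = [x + y for x, y in zip(a, b)]
errors = [abs(s - e) for s, e in zip(approx_sum, exact_sum)]
```

The error of the compressed-space sum is bounded by the two quantization errors, i.e. at most one step size per element; a nonlinear compressor (entropy coding, prediction) would not admit this shortcut.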
Workshop
Quantum Computing
Software Engineering
W
DescriptionUntil high-fidelity quantum computers with a large number of qubits become widely available, classical simulation remains a vital tool for algorithm design, tuning, and validation. We present a simulator for the Quantum Approximate Optimization Algorithm (QAOA). Our simulator is designed with the goal of reducing the computational cost of QAOA parameter optimization and supports both CPU and GPU execution. Our central observation is that the computational cost of both simulating the QAOA state and computing the QAOA objective to be optimized can be reduced by precomputing the diagonal Hamiltonian encoding the problem. We reduce the time for a typical QAOA parameter optimization by eleven times for n = 26 qubits compared to a state-of-the-art GPU quantum circuit simulator based on cuQuantum. Our simulator is available on GitHub: https://github.com/jpmorganchase/QOKit
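The central observation, precomputing the diagonal Hamiltonian once and reusing it for both phase application and objective evaluation, can be sketched on a toy MaxCut instance (a 3-qubit triangle; an illustrative simulation, not the QOKit implementation):

```python
import cmath

# Precompute the diagonal Hamiltonian for MaxCut on a triangle.
# diag[z] = number of cut edges for bitstring z; computed once, reused everywhere.
n = 3
edges = [(0, 1), (1, 2), (0, 2)]
diag = []
for z in range(2 ** n):
    bits = [(z >> q) & 1 for q in range(n)]
    diag.append(sum(1 for u, v in edges if bits[u] != bits[v]))

dim = 2 ** n
state = [1 / dim ** 0.5] * dim  # uniform superposition |+>^n

def expectation(state, diag):
    # <psi|H|psi> for diagonal H is a single weighted sum over amplitudes.
    return sum(abs(a) ** 2 * d for a, d in zip(state, diag))

def apply_phase(state, diag, gamma):
    # exp(-i*gamma*H) for diagonal H is elementwise: O(2^n), no matrix products.
    return [a * cmath.exp(-1j * gamma * d) for a, d in zip(state, diag)]

e0 = expectation(state, diag)        # average cut value of the uniform state
state = apply_phase(state, diag, 0.7)
e1 = expectation(state, diag)        # diagonal phases leave probabilities intact
```

The mixer layers (omitted here) do change probabilities; the point is that every phase-separator application and every objective evaluation during parameter optimization reuses the same precomputed `diag` array.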
Workshop
State of the Practice
W
DescriptionThe primary objective of this work is to conduct an evaluation of the acceleration of NLP training for the task of text classification on legal documents. The dataset used is AsyLex, a dataset of refugee claims from Canada. We implement fast AsyLex (fAsylex) and scale it across up to 64 GPUs. Through systematic experimentation, we seek to address the following research questions: How does the training time differ between single-GPU and multi-GPU setups for two commonly used PLMs? Does the choice of training approach (single-GPU vs. multi-GPU) influence the classification performance on the chosen dataset? We offer an investigation into the practical implications of employing single-GPU and multi-GPU training; we compare two of the most commonly used masked language models, RoBERTa and DeBERTa, and reduce runtime out of the box by 49% and 37%, respectively; and we demonstrate that there is a trade-off in terms of NLP metrics and distributed training.
Tutorial
Algorithms
Data Movement and Memory
Fault Handling and Tolerance
TUT
DescriptionResilience is a critical issue for large-scale platforms. This tutorial provides a comprehensive survey of fault-tolerant techniques for high-performance and big data applications, with a fair balance between theory and practice. This tutorial is organized across four main topics:
(i) Overview of failure types (software/hardware, transient/fail-stop), and typical probability distributions (Exponential, Weibull, Log-Normal);
(ii) General-purpose techniques, which include several checkpoints and rollback recovery protocols, replication, prediction, and silent error detection;
(iii) Application-specific techniques, such as user-level in-memory checkpointing, data replication (map-reduce), or fixed-point convergence for iterative applications (back-propagation);
(iv) Practical deployment of fault tolerance techniques with User Level Fault Mitigation (MPI standard extension). Relevant examples will include widely used routines such as Monte-Carlo methods, SPMD stencil, map-reduce, and back-propagation in neural networks.
A step-by-step approach will show how to protect these routines and make them fault-tolerant, using a variety of techniques, in a hands-on session.
The tutorial is open to all SC23 attendees who are interested in the current status and expected promise of fault-tolerant approaches for scientific and big data applications. There are no audience prerequisites: background will be provided for all protocols and probabilistic models. However, basic knowledge of MPI will be helpful for the hands-on session.
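The checkpoint-and-rollback-recovery protocols covered in topic (ii) can be sketched for a simple iterative kernel (illustrative only; a production deployment would persist checkpoints to stable storage, e.g. via a library such as VELOC):

```python
import copy

def run(n_iters, checkpoint_every=10, fail_at=None):
    """Iterative kernel with periodic checkpointing and rollback recovery."""
    state = {"iter": 0, "x": 1.0}
    checkpoint = copy.deepcopy(state)          # initial checkpoint
    while state["iter"] < n_iters:
        try:
            if fail_at is not None and state["iter"] == fail_at:
                fail_at = None                 # inject a single fail-stop error
                raise RuntimeError("injected failure")
            # One Newton step toward sqrt(2) -- stands in for real work.
            state["x"] = 0.5 * (state["x"] + 2.0 / state["x"])
            state["iter"] += 1
            if state["iter"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)   # commit a checkpoint
        except RuntimeError:
            state = copy.deepcopy(checkpoint)       # rollback and recover
    return state["x"]

no_fault = run(25)
with_fault = run(25, fail_at=17)   # rolls back to iteration 10, then replays
```

Because the replayed iterations are deterministic, the faulty run reproduces the failure-free result exactly; the cost of a failure is the lost work since the last checkpoint, which is the quantity the optimal checkpoint-period analyses in the tutorial trade off against checkpointing overhead.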
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionAurora is an exascale supercomputer in the final stages of assembly at the Argonne Leadership Computing Facility (ALCF) in the U.S. This talk will focus on the Aurora hardware and software architectures with emphasis on the interconnect and programming models, and their impact on application performance and scalability.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionFFTX-IRIS is a dynamic system to efficiently utilize novel heterogeneous platforms. This system links two next generation frameworks, FFTX and IRIS, to navigate the complexity of different hardware architectures. FFTX provides a runtime code generation framework for high performance Fast Fourier Transform kernels. IRIS runtime provides portability and multi-device heterogeneity, allowing computation on any available compute resource. Together, FFTX-IRIS enables code generation, seamless portability, and performance without user involvement. We show the design of the FFTX-IRIS system along with an evaluation of various small FFT benchmarks. We also demonstrate multi-device heterogeneity of FFTX-IRIS with a larger stencil application.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionCheckpointing serves numerous functionalities in modern-day HPC systems and applications. In recent years, synchronous checkpointing, which blocks the application until checkpoints are persisted to external storage, has suffered rising synchronization overheads at scale, resulting in little forward progress by the application. Therefore, asynchronous checkpointing has become more popular by quickly capturing checkpoints locally and flushing them in the background concurrently alongside the application. State-of-the-art solutions like VELOC utilize a file-per-process strategy, which is difficult for users and parallel file systems to manage. We implement a tunable N-to-M aggregation strategy within VELOC, obtaining 2.5x greater throughput than the state-of-the-art aggregation library ADIOS2 and 1.5x higher throughput than the naive N-to-1 aggregation currently supported by VELOC.
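A minimal sketch of the N-to-M idea, assuming a simple round-robin placement of writer ranks onto aggregate files (the actual VELOC strategy is tunable and more sophisticated):

```python
def aggregation_plan(n_ranks: int, m_files: int):
    """Map each of N writer ranks to one of M aggregate files.

    N-to-M sits between file-per-process (M = N, too many files for the
    parallel file system) and N-to-1 (M = 1, a serialization bottleneck).
    """
    return {rank: rank % m_files for rank in range(n_ranks)}  # round-robin

plan = aggregation_plan(n_ranks=1024, m_files=16)
writers_per_file = {f: sum(1 for v in plan.values() if v == f)
                    for f in range(16)}
```

Tuning M lets the user balance file-system metadata pressure against per-file write contention, which is where the reported throughput gains over both extremes come from.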
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionMany high-performance computing applications reach millions of code lines and hundreds of code regions. Analyzing all code regions for parallelization with OpenMP is neither efficient nor necessary. To facilitate this task and minimize the effort by the user, the code regions of the application need to be filtered and ranked. We provide a simple filtering method to detect the critical code regions by clearly defining a hotspot. Afterward, we identify parallelizable loops by analyzing their data dependencies using an automatic tool. As the number of parallel opportunities can be high and the users must verify these parallel suggestions, we suggest a ranking strategy based on parallelization overhead to help them prioritize their endeavors and present a set of OpenMP microbenchmarks for overhead analysis. We calculate optimistic expected benefits using overhead estimations as ranking metrics and show how our ranking provides an improvement on the ranking based on serial runtime.
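The ranking idea, estimating an optimistic per-loop benefit from its serial time, a thread count, and a measured parallelization overhead, can be sketched as follows (all loop names, timings, and the overhead value are hypothetical):

```python
def expected_benefit(serial_time, threads, fork_join_overhead):
    """Optimistic benefit: serial time minus ideal parallel time plus overhead."""
    parallel_time = serial_time / threads + fork_join_overhead
    return serial_time - parallel_time

loops = {  # hypothetical hotspot profile: loop name -> serial seconds
    "assemble": 12.0,
    "smooth":    3.0,
    "reduce":    0.002,   # too small: parallelization overhead dominates
}
OVERHEAD = 0.005  # fork/join cost per region, from microbenchmarks (assumed)

ranking = sorted(loops,
                 key=lambda l: expected_benefit(loops[l], 16, OVERHEAD),
                 reverse=True)
```

Loops whose estimated benefit is negative (like the tiny reduction above) fall to the bottom of the list, which is precisely how overhead-aware ranking improves on ranking by serial runtime alone.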
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionFunction-as-a-service (FaaS) is a promising execution environment for high-performance computing (HPC) and machine learning (ML) applications, as it offers developers a simple way to write and deploy programs. Nowadays, GPUs and other accelerators are indispensable for HPC and ML workloads. However, we have observed that state-of-the-art FaaS frameworks usually treat accelerators as a single device to run a single workload and have little support for multiplexing accelerators.
In this work, we have presented techniques to multiplex GPUs with Parsl, a popular FaaS framework. With our enhancements, we show up to 60% lower task completion time and 250% improvement in the throughput of a large language model when multiplexing a GPU vs running without multiplexing. We plan to extend the support for GPU multiplexing in FaaS platforms by tackling the challenges of changing compute resources in the partition and approximating how to right-size a GPU partition for a function.
Paper
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
TP
DescriptionA burst buffer is commonly deployed on large-scale supercomputers to bridge the performance gap between the shared file system and the I/O needs of modern supercomputing applications. Existing I/O sharing methods either require resource isolation, offline profiling, or repeated execution that significantly limit the utilization and applicability of these systems. Here we present ThemisIO, a policy-driven I/O sharing framework for a remote-shared burst buffer. ThemisIO can accurately and efficiently allocate I/O cycles among applications purely based on real-time I/O behavior, without requiring user-supplied information or offline-profiled application characteristics. By exploiting a statistical token-based strategy, ThemisIO can precisely balance I/O cycles between applications via time slicing to enforce processing isolation, enabling a variety of fair sharing policies. Our experiments show that ThemisIO sustains 13.5–13.7% higher I/O throughput and 19.5–40.4% lower performance variation than existing algorithms. For applications, ThemisIO significantly reduces or nearly eliminates the slowdown caused by I/O interference.
Workshop
Programming Frameworks and System Software
State of the Practice
W
DescriptionThis workshop brings together HPC researchers, practitioners, and vendors from around the globe to present and discuss state-of-the-art HPC system testing methodologies, tools, benchmarks, tests, procedures, and best practices. The increasing complexity of HPC architectures requires a larger number of tests in order to thoroughly evaluate the status of the system after its installation or a software upgrade before it is transitioned to production users. Therefore, HPC centers and vendors use different methodologies to evaluate their systems throughout their lifetime, not only during installation and acceptance, but also regularly during maintenance windows. This workshop will provide a venue to present and discuss the latest HPC system test technologies. The event will include a keynote focused on current HPC system testing topics, followed by a series of paper presentations from peer-reviewed accepted submissions, and will conclude with a panel discussion.
Birds of a Feather
Energy Efficiency
State of the Practice
Sustainability
TP
XO/EX
DescriptionLiquid cooling mitigates the effects of heat density, reduces energy consumption and increases performance. It is now a requirement to stay on the chip technology roadmap. After a decade's experience with liquid cooling in large-scale supercomputing centers, many data centers are still facing challenges with adoption. Building on deep expertise from major supercomputing centers, this BoF will present recommendations for initial adoption of direct liquid cooling (DLC). See https://sites.google.com/lbl.gov/ee-hpc-wg-liquid-cooling/home. There will be presentations on experiences from sites that have just adopted DLC. We are expecting a lot of audience discussion and networking that extends beyond the BoF.
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
TP
DescriptionEnterprise-grade permissioned blockchain systems provide a promising infrastructure for data sharing and cooperation between different companies. However, performance bottlenecks seriously hinder the adoption of these systems in many industrial applications that process complex business logic and huge transaction volumes.
In this paper, we present FISCO-BCOS, an enterprise-grade permissioned blockchain system with high performance. We conducted experiments on two popular test platforms and compared FISCO-BCOS with state-of-the-art platforms in academia and industry such as BIDL and Hyperledger Fabric (HLF). The results show that FISCO-BCOS achieves 7.4 times and 28.4 times the throughput of BIDL and HLF, respectively, at half their latency. FISCO-BCOS has already been used in over 300 different large-scale industrial scenarios and has become one of the most popular permissioned blockchains.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionThe current era of exascale supercomputing and the emergence of a computing continuum present several significant resource management challenges. These include, but are not limited to, management of complex scientific workflows, diverse resources such as power, elasticity in user jobs, and converged environments. The resource models that underpin today's job scheduling frameworks reflect the node- (or core-) centric system architectures prevalent when the frameworks were designed. Consequently, they are not suited to capturing resource relationships or dynamism. This greatly limits their applicability to the emerging multifaceted challenges in high-performance computing (HPC) and other converged environments. We propose a scalable graph-based resource model to overcome these challenges, which allows for representation of complex, changing resource relationships and multiple containment hierarchies. We implement this model, Fluxion, in a production-quality framework, and evaluate its performance. Additionally, we present emerging and advanced scheduling use cases that are enabled by our model.
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
DescriptionLarge language models (LLMs) are poised to revolutionize the way we conduct scientific research, yet their complexity and cost hinder adoption by the wider science community. Identifying suitable scientific use cases, optimizing model and data sizes, and scaling up training are among the most pressing issues. Here we provide practical solutions for building and using LLM-based foundation models targeting scientific use cases. We present an end-to-end examination of the effectiveness of LLMs in scientific research, including their scaling behavior and computational requirements on Frontier, the first exascale supercomputer. We have also developed for release to the scientific community a suite of open foundation models called FORGE with up to 26B parameters using 257B tokens from over 200M scientific articles. We have demonstrated the use and effectiveness of FORGE on scientific downstream tasks. Our research establishes best practices that can be applied across various fields to utilize LLMs for scientific discovery.
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionMLIR has become popular since it was open sourced in 2019. A sub-project of LLVM, MLIR offers the flexibility to represent Intermediate Representations (IR) as dialects at different abstraction levels, to mix these, and to leverage transformations between dialects, providing opportunities for automated program optimisation and parallelisation. In addition to general purpose compilers built upon MLIR, domain specific abstractions have also been developed.
In this paper, we explore complementing the Flang MLIR general purpose compiler by combining it with the domain specific Open Earth Compiler's MLIR stencil dialect. By developing transformations to discover and extract stencils from Fortran, this specialisation delivers between a 2- and 10-times performance improvement for our benchmarks on a Cray supercomputer compared to using Flang alone. Furthermore, by leveraging existing MLIR transformations we develop an auto-parallelisation approach targeting multi-threaded and distributed memory parallelism, and optimised execution on GPUs, without any modifications to the serial Fortran source code.
Workshop
State of the Practice
W
DescriptionAchieving a diverse and inclusive workforce requires focus and commitment, both at the organizational level and the individual level. Last year, Google achieved its most diverse, representative workforce yet. Getting there required targeted actions, along with following the data to understand which initiatives were delivering on their potential and which were not. We will share our perspectives and stories on DEI in a large technical firm, along with some ideas on how to create and expand opportunities for underrepresented groups.
Inclusivity
Inclusivity
DescriptionIn the pursuit of advancing research computing and data (RCD) Team Science, it is crucial to embrace inclusivity and broaden engagement across diverse academic institutions. This CASC-shared SC23 Inclusivity session aims to facilitate discussions and knowledge sharing among all participants on how to foster greater involvement among Minority-Serving Institutions (MSIs), Historically Black Colleges and Universities (HBCUs), and Tribal Colleges and Universities (TCUs), and the CASC higher education institutions and high performance computing centers traditionally connected to the RCD enterprise.
This session will bring together case studies from MSIs, HBCUs and TCUs and CASC member partners. These case studies will not only showcase achievements but also delve into the strategies employed to replicate these successes. Attendees will explore best practices for building replicable, sustainable, scalable collaborations without imposing an undue burden on the partner institutions involved.
Discussions will center around defining what success looks like in the context of inclusivity and the tangible goals that should be pursued to achieve it. Participants will discuss potential funding opportunities to support these institutional partnerships to enhance research computing infrastructure and support.
While significant strides have been made to integrate inclusivity into research computing, this session will also explore areas where gaps still exist. Attendees will have the opportunity to identify challenges, potential barriers, and strategies for overcoming them. By collectively exploring these issues, the session will work towards a more comprehensive and sustainable approach to broadening engagement in research computing and data sciences.
We welcome participation from all SC23 attendees and all institutions with an interest in the topic.
Workshop
Quantum Computing
Software Engineering
W
DescriptionQuantum computing is emerging as a remarkable technology that promises to achieve major scientific breakthroughs. This includes solving complex problems whose solution lies well beyond contemporary and even future supercomputers based on conventional technologies. Interacting with these quantum computers, including noisy intermediate-scale quantum devices, for both basic and applied research will require a unique collection of software tools.
The purpose of this workshop is to explore the innovative software needed to make quantum computing practical and accessible. The workshop will focus heavily on the tools and software for quantum computing with a particular emphasis on realized implementations.
Topics of interest for this workshop include but are not limited to: Languages, Compilers/Profilers, Quantum Machine Learning Software, Numerical Simulators, Workflows, Debugging/Verification, and Optimal Quantum Control Software.
Topics that are not relevant to the workshop include domain-specific applications of quantum computing, development of quantum computing hardware or devices, and benchmarking of quantum computers.
Workshop
Data Movement and Memory
Heterogeneous Computing
W
DescriptionHeterogeneous memory architectures have recently emerged and revolutionized the traditional memory hierarchy. Today’s architectures may comprise multiple memory technologies next to DRAM, such as 3D-stacked memory, high-bandwidth multi-channel RAM, persistent memory, or Compute Express Link (CXL)-based architectures.
Even though heterogeneous memory architectures can benefit applications in terms of improved performance, energy-efficiency, and cost trade-offs, exploiting the full potential of such complex architectures poses significant challenges. Since heterogeneous memory architectures introduce dramatic disruptions to the usual memory hierarchy assumptions that have guided decades of system and software design, we need to rethink solutions across all the layers of the system and software stack to embrace the new era of memory heterogeneity and satisfy modern applications’ demands.
As in previous years, the workshop on Heterogeneous Memory systems (HMEM) will serve as a forum to bring together researchers from the HPC community to present and discuss ongoing research around heterogeneous memory systems.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Fault Handling and Tolerance
Large Scale Systems
Programming Frameworks and System Software
TP
XO/EX
DescriptionDebugging today’s complex HPC applications can be a challenge, often requiring multiple hardware technologies, different software libraries to facilitate parallelism, applications built with multiple languages, handling issues of scale, and working on remote clusters. This all creates a complex environment that makes it difficult to find and fix problems in code.
This interactive session highlights the important debugging technologies and techniques for effectively finding and solving challenging issues in HPC applications. You will learn:
• The advantages of parallel debuggers over traditional debuggers
• How to simultaneously debug CPU and either NVIDIA GPU or AMD GPU code
• How to easily debug hybrid MPI and OpenMP applications
• How to combine advanced debugging features to efficiently tackle tough parallel problems
• How to leverage powerful tools such as reverse debugging and memory debugging to solve elusive bugs
Taking full advantage of the TotalView debugger's many features will help you improve your productivity by streamlining the debugging process and reducing the time and effort required to identify and fix bugs. You'll also enhance the scalability of your application by gaining insights into the parallel execution of your program.
Being able to identify and resolve hard-to-find errors will result in more robust, reliable HPC applications.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionGroqChip™ is an AI accelerator optimized for running large-scale inference workloads with high throughput and ultra-low latency. It features a Tensor Streaming architecture optimized for matrix-oriented operations commonly found in AI, but the chip can also efficiently compute other applications such as HPC workloads that can be expressed as large-scale matrix multiplication. GroqChip uses a deterministic dataflow execution model that results in predictable and repeatable performance without runtime variation, and its RealScale™ chip-to-chip interconnect technology makes it possible to scale applications across cards in a node, or nodes in a rack, without hitting the bottlenecks of PCIe or the network.
Here, we explore how GroqChip and its architecture can be used to deliver high performance for linear algebra-based applications in HPC. Seismic imaging typically involves a 3D finite difference solver, which involves 3D stencil computations on a volume of data. The original stencil algorithm is not well-suited to run on a tensor-based architecture, but we outline how the stencil operation can be transformed into tensor operations by decomposing the stencil and recomposing it into matrices. The finite difference step can then be solved by matrix multiplications and matrix transpositions. A single GroqChip can run the finite difference step for a sub-cube of data which is fully kept in on-chip memory, while larger volumes are computed by mapping the computation to a full rack or several racks. Halo data is exchanged between GroqChip processors via the RealScale interconnect, enabling the scaling of the application’s domain size without PCIe or internode communication becoming the bottleneck. The deterministic dataflow model supports efficient orchestration of data movements within the chip and between chips without ever stalling the compute units. Finally, numerical analysis and optimization allows us to leverage Groq TruePoint™ arithmetic to satisfy the numerical requirements of seismic imaging.
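The stencil-to-matrix idea above can be illustrated in one dimension: applying a 3-point stencil to a vector is exactly a banded matrix-vector product, so the same update can be expressed with the matrix operations a tensor architecture accelerates. The plain-Python sketch below is our illustration of that equivalence, not Groq's actual 3D decomposition:

```python
def stencil_direct(u, c):
    """Apply a 3-point stencil c = (c_lo, c_mid, c_hi) with zero boundaries."""
    n = len(u)
    out = [0.0] * n
    for i in range(n):
        lo = u[i - 1] if i > 0 else 0.0
        hi = u[i + 1] if i < n - 1 else 0.0
        out[i] = c[0] * lo + c[1] * u[i] + c[2] * hi
    return out

def stencil_as_matmul(u, c):
    """The same stencil recast as a banded matrix-vector product A @ u."""
    n = len(u)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        if i > 0:
            A[i][i - 1] = c[0]   # sub-diagonal
        A[i][i] = c[1]           # main diagonal
        if i < n - 1:
            A[i][i + 1] = c[2]   # super-diagonal
    return [sum(A[i][j] * u[j] for j in range(n)) for i in range(n)]
```

In the 3D case described above, the same recomposition yields dense matrices large enough to keep the matrix units busy.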
Tutorial
Architecture and Networks
Middleware and System Software
TUT
DescriptionArm technology has increasingly become a compelling choice for HPC due to its promise of higher efficiency, density, scalability, and a broad ecosystem of software. Arm’s expansion in the datacentre started in 2018 with Arm Neoverse, a set of infrastructure CPU IPs designed for high-end computing. The Arm-based Fugaku supercomputer, the first of its kind to implement the Arm SVE instruction set, entered the Top 500 in June 2020 at the top spot and has retained a leadership position over the years, not only in HPL but also in HPCG (where it is still unbeaten). This event has been a wake-up call for the HPC community. The datacentre and HPC space have long been dominated by x86 CPUs, and there is a growing interest in diversifying and exploring new architectures to re-create the vibrant and diverse ecosystem of architectures that existed more than a decade ago. Arm technology is at the forefront of this wave of change. This tutorial welcomes scientists and engineers interested in running a variety of workloads on an Arm-based system, either on-premises or in the cloud. The tutorial will guide attendees through compiling, executing, profiling, and optimizing codes for Arm, demystifying the claim that changing CPU architecture is hard.
Paper
Exascale
Large Scale Systems
State of the Practice
TP
Best Paper Finalist
DescriptionAs the US Department of Energy (DOE) computing facilities began deploying petascale systems in 2008, DOE was already setting its sights on exascale. In that year, DARPA published a report on the feasibility of reaching exascale. The report authors identified several key challenges in the pursuit of exascale including power, memory, concurrency, and resiliency. That report informed the DOE's computing strategy for reaching exascale. With the deployment of Oak Ridge National Laboratory's Frontier supercomputer, we have officially entered the exascale era. In this paper, we discuss Frontier's architecture, how it addresses those challenges, and describe some early application results from Oak Ridge Leadership Computing Facility's Center of Excellence and the Exascale Computing Project.
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionIn recent years, High Performance Computing (HPC) has become increasingly important for many industries and research areas besides ‘classic’ applications. As new domains emerge, applications, implementations and frameworks become more diverse. Generic performance analysis tools often cannot keep up with the development speed of new approaches for workload distribution, offloading, and communication. Some of the new approaches employ their own performance monitoring, which is difficult to integrate into generic tools designed for traditional HPC. Performance measurements often result in a collection of separate performance logs that logically form a unit but cannot intuitively be investigated together with established performance tools. We present a tool library that can be used to combine separate performance logs and separately recorded metrics into one single performance log, enabling investigation of such performance data as a unit. Use cases from Big Data processing and AI show the broad applicability of our approach.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionThank you for attending FTXS 2023. See you next year!
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionIntroduction and welcome to FTXS 2023.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionQuantum computing is a new computational paradigm expected to revolutionize the computing field in the next few years. Qubits, the atomic units of a quantum circuit, exploit the properties of quantum physics to increase the parallelism and speed of computation. Unfortunately, qubits are both intrinsically noisy and highly susceptible to external sources of faults, such as ionizing radiation. The reported qubit error rates are so high that researchers are questioning the large-scale adoption of quantum computers, forcing impractical mitigation solutions such as installing the quantum computer in underground caves. Innovative solutions to improve the reliability of quantum applications are therefore highly necessary.
In the talk, after providing the information and background needed to understand quantum computing basics and an overview of the vulnerabilities of the available quantum technologies, we will present the available hardening solutions and the open challenges that need to be addressed. We will consider both the intrinsic noise, which has a predictable and incremental effect, and radiation-induced transient faults, which are stochastic and modify the qubit in an unpredictable way. Based on the latest studies and radiation experiments performed on real quantum machines, we will show how to model the transient faults in a qubit and how to inject such a fault in a quantum circuit to track its propagation. We will discuss the vulnerability of qubits and of circuits, identifying the most critical parts and the main causes of output corruption. Finally, we will provide an overview of the open (reliability) challenges in quantum computing to stimulate further studies and solutions.
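As a toy model of the fault injection described above, a single qubit can be represented by its two complex amplitudes, and a radiation-induced bit flip by a Pauli-X applied at an unintended point in the circuit. This sketch is illustrative only and is not the speakers' fault model:

```python
import math

def apply_x(state):
    """Bit-flip fault (Pauli-X): swap the |0> and |1> amplitudes."""
    a0, a1 = state
    return (a1, a0)

def probabilities(state):
    """Measurement probabilities |a0|^2 and |a1|^2."""
    return tuple(abs(a) ** 2 for a in state)

# A fault on |0> flips the measurement outcome entirely ...
print(probabilities(apply_x((1.0, 0.0))))  # → (0.0, 1.0)
# ... while a balanced superposition masks the same fault:
plus = (1 / math.sqrt(2), 1 / math.sqrt(2))
print(probabilities(apply_x(plus)) == probabilities(plus))  # → True
```

Tracking how such a flip propagates through the subsequent gates of a circuit is the injection-based analysis discussed in the talk.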
Paper
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionThe current hardware landscape and application scale is driving performance engineers toward writing bespoke optimizations. Verifying such optimizations, and generating minimal failing cases, is important for robustness in the face of changing program conditions, such as inputs and sizes. However, isolating minimal test cases from existing applications and generating new configurations are often difficult due to side effects on the system state, mostly related to dataflow. This paper introduces FuzzyFlow: a fault localization and test case extraction framework designed to test program optimizations. We leverage dataflow program representations to capture a fully reproducible system state and area-of-effect for optimizations to enable fast checking for semantic equivalence. To reduce testing time, we design an algorithm for minimizing test inputs, trading off memory for recomputation. We demonstrate FuzzyFlow on exemplary use cases in real-world applications where the approach provides up to 528 times faster optimization testing and debugging compared to traditional approaches.
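FuzzyFlow's own minimization works over a dataflow representation and trades memory for recomputation; as a generic point of reference, test-input minimization is often done with a delta-debugging-style loop that greedily drops chunks of the input while a failure predicate still holds. A sketch with a hypothetical predicate:

```python
def minimize_failing_input(data, fails):
    """Greedy delta-debugging-style reduction: repeatedly try removing
    chunks while the reduced input still triggers the failure."""
    chunk = len(data) // 2
    while chunk >= 1:
        i = 0
        while i < len(data):
            candidate = data[:i] + data[i + chunk:]
            if candidate and fails(candidate):
                data = candidate      # keep the smaller failing input
            else:
                i += chunk            # this chunk was needed; move on
        chunk //= 2
    return data

# Hypothetical failure predicate: the bug triggers whenever 3 and 7 co-occur.
fails = lambda xs: 3 in xs and 7 in xs
print(minimize_failing_input(list(range(10)), fails))  # → [3, 7]
```

The predicate here stands in for re-running an optimized program region and checking semantic equivalence, which is the expensive step the paper accelerates.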
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionWe use genome assembly as a representative case to showcase the use of the ‘actor model’, a novel programming system for high-performance data-intensive workloads. The actor version of the 𝑘-mer counting kernel shows on average 1.6× speedup over similar MPI implementation. We provide a novel parallel algorithm that leverages the actor model to traverse de Bruijn graphs in a non-blocking, one-directional manner. Our findings highlight the potential of the actor model for writing simple and efficient parallel programs for data-heavy workloads.
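For context on the kernel being measured, k-mer counting tallies every length-k substring of a sequence; the poster's contribution is the actor-based parallelization, while a minimal sequential version looks like this:

```python
from collections import Counter

def count_kmers(seq, k):
    """Tally every length-k substring (k-mer) of a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
```

For example, count_kmers("ACGTACGT", 3) finds ACG and CGT twice each. How such partial counts are distributed across actors is not shown here.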
Posters
Research Posters
TP
XO/EX
DescriptionIn this poster, we show how to leverage NVIDIA's BlueField Data Processing Unit (DPU) in geospatial systems. Existing work in the literature has explored DPUs in the context of machine learning, compression, and MPI acceleration. We show our designs for integrating DPUs into existing high performance geospatial systems like MPI-GIS. The workflow of a typical spatial computing workload consists of two phases: filter and refine. First, we used the DPU as a target to offload spatial computations from the host CPU and show the performance improvements due to offload. Next, we used the DPU for network I/O processing: the query data first comes to the DPU for filtering, and the query then goes to the CPU for refinement. A DPU-based filter-and-refine system can be useful in other domains, such as physics, where an FPGA is used to perform the filter to handle big data.
Birds of a Feather
Middleware and System Software
TP
XO/EX
DescriptionThe increasing reliance on inherently variable Green energy is poised to impact HPC centers fundamentally: they cannot count on a guaranteed supply of grid power, yet could play a significant role in stabilizing the Grid by quickly adapting their load.
“Adaptive Capacity Computing” touches on system architecture, hardware, scheduling and resource management, programming models, and applications with the objective of enabling future HPC centers to react gracefully to varying power profiles, achieving optimal throughput and avoiding loss of computational state wherever possible.
This BoF discusses challenges and approaches to support this paradigm, should it become necessary to do so.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionTo achieve maximum performance on current heterogeneous architectures, applications have to be tailored to the available hardware by using special APIs to interact with the hardware resources, such as the CUDA APIs for NVIDIA GPUs. Simultaneously, unikernels emerge as a solution for the increasing overhead introduced by the complexity of modern operating systems and their inability to optimize for specific application profiles. Despite this, there is a lack of support for using GPUs in unikernels. We propose using Cricket GPU virtualization to introduce GPU support to the unikernels RustyHermit and Unikraft. To interface with Cricket, we implement a generic library for using ONC RPCs in Rust. With Cricket and our RPC library, unikernels are able to use GPU resources, even when they are installed in remote machines. This way, we enable the use of unikernels for applications that require the high parallel performance of GPUs to achieve manageable execution times.
Workshop
Applications
Data Movement and Memory
Heterogeneous Computing
I/O and File Systems
Large Scale Systems
Middleware and System Software
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionIn GPU graph analytics, the use of external memory such as the host DRAM and solid-state drives is a cost-effective approach to processing large graphs beyond the capacity of the GPU onboard memory. This paper studies the use of Compute Express Link (CXL) memory as alternative external memory for GPU graph processing in order to see if this emerging memory expansion technology enables graph processing that is as fast as using the host DRAM. Through analysis and evaluation using FPGA prototypes, we show that representative GPU graph traversal algorithms involving fine-grained random access can tolerate an external memory latency of up to a few microseconds introduced by the CXL interface as well as by the underlying memory devices. This insight indicates that microsecond-latency flash memory may be used as CXL memory devices to realize even more cost-effective GPU graph processing while still achieving performance close to using the host DRAM.
Posters
Research Posters
TP
XO/EX
DescriptionLarge-scale parallel computing is crucial in Gaussian regressions to reduce the complexity of spatial statistics applications. The log-likelihood function is utilized to evaluate the Gaussian model for a set of measurements in N geographical locations. Several studies have shown a utilization of modern hardware to scale the log-likelihood function for handling large numbers of locations. ExaGeoStat is an example of software that allows parallel statistical parameter estimation from the log-likelihood function. However, generating a covariance matrix is mandatory and challenging when estimating the log-likelihood function. In ExaGeoStat, the generation process was performed on CPU hardware due to missing math functions in CUDA libraries, e.g., the modified Bessel function of the second kind. This study aims to optimize the generation process using GPU with two proposed generation schemes: pure GPU and hybrid. Our implementations demonstrate up to 6X speedup with pure GPU and up to 1.5X speedup with the hybrid scheme.
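For illustration, spatial covariance generation of the kind ExaGeoStat performs commonly uses the Matern kernel, whose general form requires the modified Bessel function of the second kind mentioned above; for smoothness nu = 1/2 it reduces to an exponential, which a short CPU sketch can compute with the standard library alone (an illustration, not the proposed GPU scheme):

```python
import math

def matern_half(locations, length_scale=1.0):
    """Covariance matrix for the Matern kernel with smoothness nu = 1/2,
    which reduces to exp(-d / length_scale); the general-nu form needs the
    modified Bessel function K_nu noted in the abstract."""
    n = len(locations)
    def dist(p, q):
        return math.hypot(p[0] - q[0], p[1] - q[1])
    return [[math.exp(-dist(locations[i], locations[j]) / length_scale)
             for j in range(n)] for i in range(n)]
```

Each of the n^2 entries depends only on the pair of locations, which is what makes the generation step a natural candidate for the GPU offload studied in the poster.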
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionThis paper presents a portable and performance-efficient approach to solve a batch of linear systems of equations using Graphics Processing Units (GPUs). Each system is represented using a special type of matrices with a band structure above and/or below the diagonal. Each matrix is factorized using an LU factorization with partial pivoting for numerical stability. Subsequently, the factors are used to find the solution for as many right hand sides as needed. The width of the band is often small enough that performing a fully dense LU factorization results in poor performance. We follow the standard LAPACK specifications for addressing this type of problems and develop a dedicated solver that runs efficiently on GPUs. No similar solver is currently available in the vendor's software stack, so performance results are shown on both NVIDIA and AMD GPUs relative to a parallel CPU solution utilizing OpenMP for thread-level parallelization.
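To make the problem class concrete: for the narrowest band (one sub- and one super-diagonal), the factor-and-solve sequence the paper generalizes reduces to the classic Thomas algorithm. The sketch below is a plain-Python, single-system version without the partial pivoting and batching that the GPU solver provides:

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: LU-style forward elimination followed by back
    substitution for a tridiagonal system (no pivoting, so the diagonal
    must be well-conditioned)."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - lower[i - 1] * c[i - 1]   # eliminated pivot
        c[i] = upper[i] / m if i < n - 1 else 0.0
        d[i] = (rhs[i] - lower[i - 1] * d[i - 1]) / m
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x
```

Solving the system [[2,1,0],[1,2,1],[0,1,2]] x = [3,4,3] with this routine returns x close to [1, 1, 1].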
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
DescriptionGPUs pose an attractive opportunity for delivering high-performance applications. However, GPU codes are often limited by memory contention, resulting in overall performance degradation. Since GPU scheduling is transparent to the user, and GPU memory architectures are very complex compared to those on CPUs, finding such bottlenecks is a very cumbersome process.
In this paper, we present a novel method of systematically detecting the root cause of frequent memory performance bottlenecks on NVIDIA GPUs, which we call GPUscout. It connects three approaches to performance analysis: static CUDA SASS code analysis, sampling of warp stalls, and kernel performance metrics. By connecting these approaches, GPUscout can identify the problem, locate the code segment where it originates, and assess its importance.
This paper illustrates the capabilities and the design of our implementation of GPUscout. We show its applicability based on three commonly-used kernels, yielding promising results in terms of accuracy, efficiency, and usability.
Posters
Research Posters
TP
XO/EX
DescriptionPerformance anomaly detection can aid in discovering algorithmic inefficiencies or hardware issues in an application’s environment. The Chimbuko framework monitors large-scale workflow applications in real-time and identifies function executions which deviate from accumulated statistics (performance anomalies). Performance anomalies across runs correlate with variation in execution times of an application; quicker resolution of performance anomalies caused by hardware issues improves cluster performance. Anomalous and normal executions are stored as events in Chimbuko. In this study, we investigate the applicability of graph-based deep learning methods for anomaly classification. We hypothesize that transforming data into a graph will allow correlations to be modeled, thus allowing graph-based methods to learn embeddings that can improve the effectiveness of downstream anomaly classification tasks. Our evaluations demonstrate that the graph-based methods yield up to 95% accuracy and outperform a state-of-the-art gradient-based method. Moreover, we provide an explanation of the classification model’s decision-making process through explainable AI techniques.
Paper
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
TP
DescriptionObject cloud storage systems are deployed with diverse applications that have varying latency service level objectives (SLOs), posing challenges for supporting quality of service with limited storage resources. Existing methods provide prediction-based recommendations for dispatching requests from applications to storage devices, but the prediction accuracy can be affected by complex system topology. To address this issue, Graph3PO is designed to combine storage device queue information with system topological information for forming a temporal graph, which can accurately predict device queue states. Additionally, Graph3PO contains the urgency degree model and cost model for measuring SLO violation risks and penalties of scheduling requests on storage device queues. When the urgency degree of a request exceeds a threshold, Graph3PO determines whether to schedule it in the queue or initiate a hedge request to another storage device. Experimental results show that Graph3PO outperforms its competitors, with SLO violation rates 2.8 to 201.1 times lower.
Paper
Post-Moore Computing
Quantum Computing
TP
Best Paper Finalist
DescriptionMultiple technologies for realizing quantum computing are currently under development. Neutral atom quantum computing is one such promising technology; it offers advantages such as the ability to perform long-distance interactions and gates consisting of more than two qubits. A particular advantage it provides is the flexibility to arrange the qubits in different topologies by customizing atom layouts. We design GRAPHINE, which, to the best of our knowledge, is the first technique to leverage this flexibility to design application-specific topologies for different quantum algorithms based on the structural characteristics of the algorithm circuits. This enables GRAPHINE to improve key performance metrics like the number of gates and pulses by up to 56% and the probability of error by up to 42% on average over widely-used topology designs.
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionGraph mining is of critical use in a number of fields such as social networks, knowledge graphs, and fraud detection. Because graph mining is an NP-complete problem, improving computational performance is the main target of current optimizations. Owing to their excellent performance, state-of-the-art graph mining systems mainly rely on pattern-aware algorithms. Despite previous efforts, the complex control flows introduced by pattern-aware algorithms bring large overhead and also impede further acceleration on heterogeneous hardware.
To address these challenges, we propose a set-based equivalent transformation approach for the optimization of pattern-aware graph mining applications, which can leverage set properties to eliminate most control flows and reduce computation overhead exponentially. We implement a high-performance pattern-aware graph mining system supporting both CPU and GPU, namely GraphSet, to automatically apply these transformations. Evaluation results show that GraphSet outperforms state-of-the-art cross-platform and hardware-specific graph mining frameworks by up to 3384.1x and 243.2x (18.0x and 10.2x on average), respectively.
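The set-based transformation can be made concrete with a classic pattern-aware kernel. The sketch below is not GraphSet's code; it illustrates, in plain Python, how matching a triangle pattern reduces to set intersections over neighbor lists, replacing the branchy nested loops that the abstract identifies as the main overhead:

```python
# Illustrative sketch (not GraphSet itself): triangle counting expressed
# as set operations over neighbor lists. Each candidate edge contributes
# one intersection instead of a nested, branch-heavy inner loop.

def count_triangles(adj):
    """adj: dict mapping vertex -> set of neighbor vertices (undirected)."""
    total = 0
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each edge once
                # common neighbors w with w > v close a unique triangle u < v < w
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total

graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {1, 2}}
print(count_triangles(graph))  # 2 triangles: (0,1,2) and (1,2,3)
```

Because the pattern constraints become pure set expressions, the same formulation maps naturally onto both CPU and GPU backends.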
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
TP
DescriptionNetwork Function Virtualization (NFV) platforms consume significant energy, introducing high operational costs in edge and data centers. This paper presents a novel framework called GreenNFV that optimizes resource usage for network function chains using deep reinforcement learning. GreenNFV optimizes resource parameters such as CPU sharing ratio, CPU frequency scaling, last-level cache (LLC) allocation, DMA buffer size, and packet batch size. GreenNFV learns the resource scheduling model from the benchmark experiments and takes Service Level Agreements (SLAs) into account to optimize resource usage models based on the different throughput and energy consumption requirements. Our evaluation shows that GreenNFV models achieve high transfer throughput and low energy consumption while satisfying various SLA constraints. Specifically, GreenNFV with a Throughput SLA can achieve 4.4X higher throughput and 1.5X better energy efficiency than the baseline settings, whereas GreenNFV with an Energy SLA can achieve 3X higher throughput while reducing energy consumption by 50%.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionLarge-scale language models have become increasingly challenging and expensive to train. Among various methods addressing this issue, Pipeline Parallelism has been widely employed to accommodate massive model weights within limited GPU memory. This paper introduces Hanayo, a wave-like pipeline parallelism strategy that boasts a concise structure and practical applicability, alongside a high-performance pipeline execution runtime to tackle the challenges of pipeline strategy implementation. Hanayo mitigates the issues of pipeline bubbles and excessive memory consumption prevalent in existing schemes, without resorting to model duplicates as in Chimera. Our evaluation, conducted on four distinct computing clusters and involving both GPT-like and BERT-like architectures with up to 32 GPUs, demonstrates up to a 30.4% increase in throughput compared to the state-of-the-art approach.
Tutorial
Accelerators
Applications
Heterogeneous Computing
Quantum Computing
TUT
DescriptionSYCL is a programming model that lets developers support a wide variety of devices (CPUs, GPUs, and more) from a single code base. Given the growing heterogeneity of processor roadmaps, moving to an open standard, platform-independent model such as SYCL is essential for modern software developers. SYCL has the further advantage of supporting a single-source style of programming from completely standard C++.
In this tutorial, we will introduce SYCL and provide programmers with a solid foundation they can build on to gain mastery of this language. The main benefit of using SYCL over other heterogeneous programming models is the single programming language approach, which enables one to target multiple devices using the same programming model, and therefore to have a cleaner, portable, and more readable code.
This is a hands-on tutorial. The real learning will happen as students write code. The format will be short presentations followed by hands-on exercises. Hence, attendees will require their own laptop to perform the hands-on exercises.
Tutorial
Accelerators
Applications
Heterogeneous Computing
Performance Optimization
TUT
DescriptionThis tutorial presents state-of-the-art performance tools for leading-edge HPC systems founded on the community-developed Score-P instrumentation and measurement infrastructure, demonstrating how they can be used for performance engineering of effective scientific applications based on standard MPI, OpenMP, hybrid combination of both, and increasingly common usage of accelerators. Parallel performance tools from the Virtual Institute – High Productivity Supercomputing (VI-HPS) are introduced and featured in hands-on exercises with Score-P, Scalasca, Vampir, and TAU. We present the complete workflow of performance engineering, including instrumentation, measurement (profiling and tracing, timing and PAPI hardware counters), data storage, analysis, tuning, and visualization. Emphasis is placed on how tools are used in combination for identifying performance problems and investigating optimization alternatives. Using their own notebook computers, participants will conduct exercises on a contemporary HPC system where remote access will be provided for the hands-on sessions through AWS running an E4S [http://e4s.io] image containing all of the necessary tools. This image supports NVIDIA GPUs using CUDA 12 and Python. This will help to prepare participants to locate and diagnose performance bottlenecks in their own parallel programs.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionHardware specialization is one of the promising directions in the post-Moore era. It is imperative to understand how hardware specialization paradigms can benefit HPC. An essential question revolves around estimating the theoretical performance of an optimally specialized architecture without requiring extensive hardware development expertise and efforts.
Focusing on the Monte Carlo cross-section lookup kernel, known for its notably low resource utilization, we develop a workflow to simulate a specialized architecture's timing and estimate resource usage to answer these questions, leveraging open-source hardware tools. We implement building blocks of the kernel pipeline in the Chisel construction language and generate Verilog code for resource estimation. Our late-breaking results show that the kernel latency is 46 cycles per lookup while the optimized CPU code takes 680 cycles, and that roughly 15k pipeline copies could fit within a 698 mm² die, reflective of the Intel Xeon Platinum 8180 dimensions.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionHDF5 is a critical I/O library for scientific applications. It has been 25 years since its first release in November 1998. HDF5’s sustainability and adaptation to today’s computational and storage environment would not be possible without feedback and contributions from the HDF5 community. We will begin with panelists who will present case studies on how they use or would like to use HDF5 in current and emerging computational environments. We will then invite our community members to discuss the roadmap, how to contribute to HDF5, and what is required to sustain HDF5 for another 25 years.
Paper
Distributed Computing
Message Passing
Programming Frameworks and System Software
TP
Best Student Paper Finalist
DescriptionAllreduce is one of the most commonly used collective operations. Its latency and bandwidth can be improved by offloading the calculations to the network. However, no way exists to conduct such offloading securely; in state-of-the-art solutions, the data is passed unprotected into the network. Security is a significant concern for High-Performance Computing applications, but achieving it while maintaining performance remains challenging. We present HEAR, the first high-performance system for securing in-network compute and Allreduce operations based on homomorphic encryption. HEAR implements carefully designed and modified encryption schemes for the most common Allreduce functions and leverages communication domain knowledge in MPI programs to obtain decryption and encryption routines with high performance. HEAR operates on integers and floats with no changes to the code base and little or no hardware changes. We design and evaluate HEAR, showing its minimal overhead, and open-source our implementation. HEAR represents a first step toward achieving confidential HPC.
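To see why homomorphic schemes suit in-network reduction, consider a toy additively homomorphic masking scheme (not HEAR's actual construction; all names and parameters here are illustrative). Each rank adds a pre-shared random mask modulo 2^32 chosen so the masks cancel in aggregate; the network can then sum ciphertexts without ever seeing a plaintext value:

```python
import random

M = 2 ** 32  # ring modulus for a 32-bit integer Allreduce sum

def make_masks(n_ranks, seed=0):
    """Generate per-rank masks that sum to zero mod M (shared-key setup)."""
    rng = random.Random(seed)
    masks = [rng.randrange(M) for _ in range(n_ranks - 1)]
    masks.append((-sum(masks)) % M)  # last mask cancels the rest
    return masks

def encrypt(value, mask):
    # Additively homomorphic: the sum of ciphertexts decrypts to the sum.
    return (value + mask) % M

def network_sum(ciphertexts):
    # The switch only ever sees masked values.
    return sum(ciphertexts) % M

values = [3, 10, 29, 100]
masks = make_masks(len(values))
cts = [encrypt(v, m) for v, m in zip(values, masks)]
print(network_sum(cts))  # 142 — the masks cancel in the aggregate
```

Real schemes must also handle floats, non-sum reductions, and key management, which is where HEAR's modified encryption schemes come in.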
Workshop
Artificial Intelligence/Machine Learning
Data Analysis, Visualization, and Storage
State of the Practice
W
DescriptionHeterogeneous test-bed clusters present a unique challenge in identifying system hardware failures and anomalies as a result of the variation in the ways that errors and warnings are reported through the system log. We present a novel approach for the real-time classification of syslog messages, generated from a heterogeneous test-bed cluster, to proactively identify potential hardware issues and security events. By integrating machine learning models with high-performance computing systems, our system facilitates continuous system health monitoring.
The paper introduces a taxonomy for classifying system issues into actionable categories of problems, while filtering out groups of messages that system administrators would consider unimportant "noise". Finally, we experiment with using newly available large language models as message classifiers, and share our results and experience with doing so. Results demonstrate promising performance and more explainable results compared to currently available techniques, but the computational costs may offset the benefits.
Birds of a Feather
Applications
TP
XO/EX
DescriptionThe growth of climate and earth data brings a pressing need to enhance techniques for its handling with HPC. This is crucial for our understanding of the coupling between the solid Earth, atmosphere, hydrology, and oceans, enabling proactive responses to extremes through improved forecasting of weather, climate change, and sudden disasters like earthquakes. This BoF will discuss the HPC community’s interface with earth data and related engagements of stakeholder communities in climate, environmental, and Earth sciences, including the mathematics and spatial statistics that are involved. It endeavors to begin a targeted approach to the democratization of HPC for earth sciences.
Doctoral Showcase
Posters
Applications
TP
DescriptionModern radiation therapy relies heavily on computational methods to design optimal treatment plans (control parameters for the treatment machine) for individual patients. These parameters are determined by constructing and solving a mathematical optimization problem. Ultimately, the goal is to create treatment plans for each patient such that a high dose is delivered to the tumor, while sparing surrounding healthy tissue as much as possible. Solving the optimization problem can be computationally expensive, as it requires both a method to compute the delivered dose in the patient and an algorithm to solve a (in general) constrained and nonlinear optimization problem.
The goal of this thesis project has been to investigate the use of HPC hardware and methods to accelerate the computational workflow in radiation therapy treatment planning. First, we propose two methods to bring the optimization to HPC hardware, using GPU acceleration and distributed computing for dose summation and objective function calculation, respectively. We show that our methods achieve competitive performance compared to state-of-the-art libraries and scale well, up to the limit imposed by Amdahl’s law.
Then, we investigate methods to accelerate interior point methods, a popular algorithm for constrained optimization. We investigate the use of iterative Krylov subspace linear solvers to solve Newton systems from interior point methods and show that we can compute solutions in reasonable time for our problems, in spite of extreme ill-conditioning. This approach presents one avenue by which constrained optimization solvers for radiation therapy could be ported to GPU accelerators.
Workshop
Large Scale Systems
Programming Frameworks and System Software
W
DescriptionThis workshop aims to connect researchers, developers, and Python practitioners to share their experiences scaling Python applications and codes on supercomputers. The goal is to provide a platform for topical discussion of best practices, hands-on demonstrations, and community engagement via open-source contributions to new libraries, runtimes, and frameworks. Based on keynote talks that survey and summarize the best practices and recent success stories, panel sessions that discuss details of implementation and live demo sessions for hands-on enthusiasts – the workshop will serve as a requirement gathering exercise for the future of Python in HPC and science.
Doctoral Showcase
Posters
Cloud Computing
TP
DescriptionFunction-as-a-Service (FaaS) computing brought a fundamental shift in resource management. It allowed for new and better solutions to the problem of low resource utilization, an issue that has been known in data centers for decades. The problem persists as the frequently changing resource availability cannot be addressed entirely with techniques such as persistent cloud allocations and batch jobs. The elastic fine-grained tasking and largely unconstrained scheduling of FaaS create new opportunities. Still, modern serverless platforms struggle to achieve the high performance needed for the most demanding and latency-critical workloads. Furthermore, many applications cannot be “FaaSified” without non-negligible loss in performance, and the short and stateless functions employed in FaaS must be easy to program, debug, and optimize. By solving the fundamental performance challenges of FaaS, we can build a fast and efficient programming model that brings innovative cloud techniques into HPC data centers, allowing users to benefit from pay-as-you-go billing and helping operators to decrease running costs and their environmental impact. My PhD research attempts to bridge the gap between high-performance programming and modern FaaS computing frameworks. I have been working on tailored solutions for different levels of the FaaS computing stack: from computing and network devices to high-level optimizations and efficient system designs.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionRecent years have seen a surge in deep learning approaches to accelerate numerical solvers, which provide faithful but computationally intensive simulations of the physical world. These deep surrogates are generally trained in a supervised manner from limited amounts of data slowly generated by the same solver they intend to accelerate. We propose an open-source framework that enables the online training of these models from a large ensemble run of simulations. It leverages multiple levels of parallelism to generate rich datasets. The framework avoids I/O bottlenecks and storage issues by directly streaming the generated data. A training reservoir mitigates the inherent bias of streaming while maximizing GPU throughput. Experiments on training a deep surrogate for the heat equation show the proposed approach enables training on 8 TB of data in 2 hours, with accuracy improved by 47% and batch throughput multiplied by 13 compared to a traditional offline procedure.
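The training reservoir idea can be approximated with classic reservoir sampling (Algorithm R); the framework's actual reservoir may differ, so treat this as an illustrative sketch of how a fixed-size buffer de-biases a stream of simulation samples:

```python
import random

class TrainingReservoir:
    """Fixed-size reservoir (Algorithm R): every streamed sample has an
    equal chance of residing in the buffer, mitigating the recency bias
    of consuming simulation data strictly in generation order."""

    def __init__(self, capacity, seed=42):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(sample)
        else:
            # Keep the new sample with probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = sample

    def batch(self, size):
        """Draw a training batch uniformly from the reservoir."""
        return self.rng.sample(self.buffer, min(size, len(self.buffer)))

res = TrainingReservoir(capacity=100)
for step in range(10_000):
    res.add(step)  # stand-in for a streamed (state, target) training pair
print(len(res.buffer))  # 100
```

In the online setting, the trainer repeatedly calls `batch()` while the ensemble keeps feeding `add()`, so the GPU never idles waiting for the solver.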
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionIn recent years, a new scientific software design pattern has emerged that pairs a Python interface with high-performance kernels in lower-level languages. The rise of general-purpose GPUs necessitates the rewriting of many such kernels, posing challenges in GPU programming and ensuring future portability and flexibility.
This paper investigates the use of high-level frameworks that abstract system architecture details, aiming for straightforward, portable yet performant GPU code. We focus on TOAST, a cosmology software framework designed to take full advantage of a supercomputer, and compare using the JAX Python library with OpenMP target offload compiler directives as porting strategies. While JAX allows kernel code to be written in pure Python, OpenMP target offload is a directive-based strategy that integrates seamlessly with our existing OpenMP-accelerated C++ kernels.
We port a dozen kernels, analyzing development cost, performance, and the viability of using either framework for complex numerical Python applications.
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
DescriptionGraph attention models (A-GNNs), a type of Graph Neural Networks (GNNs), have been shown to be more powerful than simpler convolutional GNNs (C-GNNs). However, A-GNNs are more complex to program and difficult to scale. To address this, we develop a novel mathematical formulation, based on tensors that group all the feature vectors, targeting both training and inference of A-GNNs. The formulation enables straightforward adoption of communication-minimizing routines, it fosters optimizations such as vectorization, and it enables seamless integration with established linear algebra DSLs or libraries such as GraphBLAS. Our implementation uses a data redistribution scheme explicitly developed for sparse-dense tensor operations used heavily in GNNs, and fusion optimizations that further minimize memory usage and communication cost. We ensure theoretical asymptotic reductions in communicated data compared to the established message-passing GNN paradigm. Finally, we provide excellent scalability and speedups of >5x over modern libraries such as Deep Graph Library.
Posters
Research Posters
Architecture and Networks
I/O and File Systems
TP
DescriptionCollective I/Os are widely used to transform small non-contiguous accesses into large contiguous accesses for parallel I/O optimization. The existing collective I/O techniques assume that computer memory is volatile. They are limited both by the size of the buffer, which must be small so data is not lost during a crash, and the communication overhead that occurs during collective I/O. PMIO is a proposed framework to utilize persistent memory (PMEM) for collective I/O, as opposed to DRAM. First, we utilize a log-structured buffer to take advantage of the non-volatility of PMEM. Second, we utilize larger buffers to take advantage of the larger space available on less expensive PMEM. Finally, we implement a two-phase merging algorithm to eliminate the communication overhead. The poster provides an overview of collective I/O and its problems, an introduction to PMEM, an outline of PMIO, and a brief discussion of PMIO's performance.
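The core transformation behind collective I/O — coalescing many small non-contiguous requests into a few large contiguous extents — can be sketched as a simple interval merge over (offset, length) pairs. This illustrates the general technique only, not PMIO's two-phase merging algorithm:

```python
def merge_requests(requests, gap=0):
    """Coalesce (offset, length) I/O requests into contiguous extents.

    Requests whose start lies within `gap` bytes of the previous
    extent's end are merged, turning many small accesses into few
    large ones -- the core aggregation step of collective I/O.
    """
    merged = []  # list of [start, end) extents
    for off, length in sorted(requests):
        if merged and off <= merged[-1][1] + gap:
            merged[-1][1] = max(merged[-1][1], off + length)
        else:
            merged.append([off, off + length])
    return [(start, end - start) for start, end in merged]

reqs = [(0, 4), (4, 4), (16, 8), (8, 4), (32, 4)]
print(merge_requests(reqs))  # [(0, 12), (16, 8), (32, 4)]
```

With a persistent-memory log buffer, merges like this can be deferred and batched because buffered data survives a crash, which is the opportunity PMIO exploits.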
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionWe will demonstrate how the parallelism and expressiveness of the Chapel programming language are used to achieve an enormous improvement in computational speed for a problem related to coral reef conservation. Chapel’s concise syntax and versatile data structures enable this problem to be solved in under 300 lines of code, while reducing the time to solution from days down to the order of seconds. This improvement is so substantial that it represents a paradigm shift in the way biodiversity can be measured at scale, providing a wealth of novel information for marine ecosystem managers and opening up brand new avenues for scientific inquiry. This paper will review the solution strategy and data structures in Chapel that allowed these improvements to be realized, and will preview future extensions of this work that have been made possible by this drastic speedup.
Paper
Algorithms
Linear Algebra
Post-Moore Computing
TP
DescriptionWe introduce a new singular value decomposition (SVD) solver based on the QR-based Dynamically Weighted Halley (QDWH) algorithm for computing partial spectrum SVD (QDWHpartial-SVD) problems. By optimizing the underlying rational function only in the desired part of the spectrum, the QDWHpartial-SVD algorithm efficiently computes a fraction (say 1-20%) of the most significant singular values/vectors. We develop a high-performance implementation of QDWHpartial-SVD on distributed-memory manycore systems and demonstrate its numerical robustness. We perform a benchmarking campaign against counterparts from state-of-the-art numerical libraries across various matrix sizes using up to 36K MPI processes. Experimental results show performance speedups for QDWHpartial-SVD up to 6X and 2X against PDGESVD from ScaLAPACK and KSVD, respectively. We also report energy consumption for these algorithms and demonstrate how QDWHpartial-SVD can further outperform PDGESVD in that regard by performing fewer memory-bound operations.
Workshop
Data Movement and Memory
Heterogeneous Computing
W
DescriptionWelcome to the HMEM Workshop 2023
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionCentralized machine learning techniques have caused privacy concerns for users. Federated Learning (FL) mitigates this as a decentralized training system in which no raw data are communicated across the network to a centralized server. Instead, the machine learning model is trained locally on each device, and only the locally trained model weights are sent to a central server for aggregation. However, FL faces critical challenges. Security issues plague FL, such as model poisoning via label flipping. Additionally, privacy concerns remain via data leakage through reconstruction of weights. In this work, we apply differential privacy (which adds noise to the model weights before they are sent across the network) as an added privacy measure to protect sensitive data from being reconstructed. Through this research, we study the effects of differential privacy on FL with respect to security and privacy trade-offs.
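The noise-before-send step can be sketched with the Gaussian mechanism: clip the norm of each local update, then perturb every weight before it leaves the device. The `clip` and `sigma` values below are illustrative placeholders, not calibrated for a formal privacy budget:

```python
import random

def privatize(weights, clip=1.0, sigma=0.5, seed=7):
    """Gaussian-mechanism sketch: bound each update's L2 norm, then add
    per-coordinate Gaussian noise before the weights are transmitted.
    clip/sigma are illustrative, not tuned for a specific epsilon."""
    rng = random.Random(seed)
    norm = sum(w * w for w in weights) ** 0.5
    scale = min(1.0, clip / norm) if norm > 0 else 1.0
    clipped = [w * scale for w in weights]          # bound sensitivity
    return [w + rng.gauss(0.0, sigma * clip) for w in clipped]

local_update = [0.8, -0.6, 0.3]
noisy = privatize(local_update)
print(len(noisy))  # same shape as the original update, values perturbed
```

The server aggregates many such noisy updates, so individual contributions are obscured while the averaged signal remains usable for training.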
Early Career Program
Inclusivity
TP
DescriptionMentorship is a dynamic, career-long phenomenon spanning many different relationships that support our personal and professional development. A wealth of scholarship on mentorship practices has emerged across many disciplines studying how mentorship happens in the workplace, its benefits, and what companies can do to foster those relationships. Of note, numerous studies have linked mentorship with diversity and inclusion; mentorship can support the growth and retention of workers from underrepresented and marginalized groups by “bringing them into the fold” and empowering them. As a software engineering researcher, Reed has actively been investigating the instrumental role that mentorship can play in the careers of women and LGBTQIA+ individuals in tech. In this talk, he will make the case for how we can leverage these insights to build stronger mentor-mentee relationships and to foster more inclusive and equitable communities.
Paper
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
DescriptionThe end of Dennard scaling and the slowdown of Moore's law led to a shift in technology trends toward parallel architectures, particularly in HPC systems. To continue providing performance benefits, HPC should embrace Approximate Computing (AC), which trades application quality loss for improved performance. However, existing AC techniques have not been extensively applied and evaluated in state-of-the-art hardware architectures such as GPUs, the primary execution vehicle for HPC applications today.
This paper presents HPAC-Offload, a pragma-based programming model that extends OpenMP offload applications to support AC techniques, allowing portable approximations across different GPU architectures. We conduct a comprehensive performance analysis of HPAC-Offload across GPU-accelerated HPC applications, revealing that AC techniques can significantly accelerate HPC applications (1.64x LULESH on AMD, 1.57x NVIDIA) with minimal quality loss (0.1%). Our analysis offers deep insights into the performance of GPU-based AC that guide the future development of AC algorithms and systems for these architectures.
Posters
Research Posters
TP
XO/EX
DescriptionClimate models cannot perfectly represent the complex climate system, but by running them multiple times with small variations in input parameters, it's possible to estimate uncertainties and explore different climate scenarios. Generating these ensembles demands significant computational resources and time, which can be crucial for risk assessments and decision-making. This study utilizes generative adversarial networks (GANs) and deep diffusion models (DDMs) to produce low-resolution ensemble runs trained on data provided by climate model simulations with low computational expense. Additionally, convolutional neural networks (CNNs) are employed for downscaling as well as parallelization techniques to enhance performance and reduce computation time. This approach allows for time-efficient exploration of high-resolution ensemble members, facilitating climate modeling investigations that were previously challenging due to resource constraints.
Panel
Artificial Intelligence/Machine Learning
Cloud Computing
Heterogeneous Computing
TP
DescriptionThe end of Dennard scaling and tapering of Moore’s law has led to economic conditions that favor cloud hyperscalers. Consequently, cloud is projected to be the largest sector of computing by revenue by 2025. The tremendous growth translates into substantial investment in research and development to manage the complexity of emerging systems. Cloud technologies such as elasticity, containerization and orchestration, and automation are gaining prevalence in HPC due to their abilities to manage new composite scientific workflows. Similarly, HPC techniques for performance optimization, scheduling, and fine-grained resource management are being integrated into the cloud to improve performance. The trend of integrating technologies from each community into the other leads to Converged Computing, an environment that combines the best capabilities from both worlds. In this highly interactive panel, we invite experts from industry, national laboratories, and academia to discuss their experiences with converged computing and share their views on its future.
Workshop
Education
State of the Practice
W
DescriptionThe HPC Carpentry lesson program is a highly interactive, hands-on approach to getting users up to speed on HPC cluster systems. It is motivated by the increasing availability of cluster resources to a wide range of user groups, many of whom come from communities that have not traditionally used HPC systems.
We adopt the Carpentries approach to pedagogy, which consists of a workshop setting where learners type along with instructors while working through the instructional steps, building up "muscle memory" of the tasks, further reinforced by challenge exercises at critical points within the lesson.
We review the development of the HPC Carpentry Lesson Program as it becomes the first entrant into phase 2 of The Carpentries Lesson Program Incubator. This incubator is the pathway for HPC Carpentry to become an official lesson program of The Carpentries.
Workshop
W
DescriptionWhile containerization revolutionized the delivery and execution of software, it introduces new challenges: the usual practice of one big software file-system with a subsequent module load to rule all environments is not feasible with containers. This lightning talk introduces the 'HPC Container Conformance' project, which aims to provide guidance on how to build container images and how to annotate them so that end users and system admins can integrate them in their workflows. The talk also briefly introduces the MetaHub Registry, an OCI-compliant container registry that serves environment- and hardware-specific images and reduces the overall complexity of herding container images, as a practical implementation of the HPC Container Conformance project.
Birds of a Feather
Algorithms
TP
XO/EX
DescriptionGovernment agencies, industry, and academia are demanding a new generation of tools to efficiently solve large-scale analytics problems in a variety of business, scientific, and national security applications. This BoF gathers the community developing high-performance frameworks and workflows for large-scale graph analytics to survey current approaches, identify new challenges and opportunities, and discuss interoperability of emerging infrastructures. A central goal is developing requirements and recommendations for future tools. As in previous editions, this BoF will explore, compare, and contrast conventional implementations as well as algebraic approaches, inviting the GraphBLAS community to discuss its state and evolution.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionRISC-V is an open instruction set standard which is experiencing extraordinary growth and has the potential to revolutionize supercomputing. There are a growing number of RISC-V activities by the HPC community, and the goal of this BoF is to continue the discussion with the community about the RISC-V ecosystem and how it can best support HPC research and development. Beginning with a short overview on the status of the RISC-V HPC ecosystem, this will be followed by a Q&A with the panel and audience. There will be directed questions, as well as ad hoc questions and discussions with the audience.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
DescriptionTraditional HPC systems rely on balanced soft scaling, which adjusts the compute-to-memory ratio according to the workload. However, this approach is challenged by Machine Learning applications, especially Large Language Model (LLM) workloads, which demand much more memory than compute. This leads to wasted compute resources and excessive data movement in the system. To address this issue, we propose to use CXL 3.0 Global Fabric Attached Memory (GFAM), which enables independent scaling of compute and memory and reduces data movement. In this talk, we will explore how GFAM architectures require changes in memory and compute placement, as well as software stacks, to optimize performance for LLM workloads.
Workshop
Programming Frameworks and System Software
State of the Practice
W
Workshop
Programming Frameworks and System Software
W
DescriptionThe HPC User Support Tools (HUST) workshop has become a key forum to promote new and innovative user support tools such as XALT, Spack, EasyBuild, and ReFrame to the HPC community. Many of the HPC user tools presented at earlier HUST workshops have matured to the point of becoming the community standard and are integral to user support at HPC centers around the world. The HUST workshop is a forum for system administrators, user support members, tool developers, policy makers, and end users to learn about new and innovative tools. Its central aim is to serve as a publication venue for current and ongoing support tool developments and to promote the uptake of these tools. Identifying and supporting best practices, novel tools, and novel ideas to help streamline user support efforts within the new technology ecosystems at HPC centers are all in scope for the HUST workshop.
Posters
Research Posters
TP
XO/EX
DescriptionTypical GPU programs consist of four steps: (1) data preparation, (2) host CPU-to-GPU data transfers, (3) execution of one or more GPU kernels, and (4) transfer of results back to CPU. While the kernel is running on the GPU, the CPU cores often remain idle, waiting on the GPU to finish kernel execution.
In recent years, several frameworks have been presented that perform automated distribution of workload to both CPU and GPU. While the aforementioned frameworks offer techniques for CPU+GPU workload distribution for regular applications, identifying a performant CPU+GPU workload distribution for irregular applications remains a difficult problem due to workload imbalance and irregular memory access patterns.
This work evaluates a hybrid CPU+GPU implementation of an irregular workload -- graph link prediction using the Jaccard similarity. For the graphs that benefit the most from our hybrid CPU-GPU approach, our implementation delivers a 16.4-28.4% improvement over the state-of-the-art Jaccard similarity implementation.
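The scoring kernel at the heart of this workload can be illustrated with a minimal, single-threaded Python sketch: score non-adjacent node pairs by the Jaccard similarity of their neighborhoods. The graph and neighbor sets below are toy data, not the paper's benchmark graphs, and the hybrid CPU+GPU workload distribution is omitted.

```python
# Toy illustration of Jaccard-similarity link prediction.

def jaccard(neigh_u, neigh_v):
    """Jaccard similarity of two neighborhood sets: |A & B| / |A | B|."""
    inter = len(neigh_u & neigh_v)
    union = len(neigh_u | neigh_v)
    return inter / union if union else 0.0

# Adjacency stored as neighbor sets for a small example graph.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}

# Score every non-adjacent pair; high scores suggest likely future links.
scores = {
    (u, v): jaccard(adj[u], adj[v])
    for u in adj for v in adj
    if u < v and v not in adj[u]
}
print(scores)  # {(1, 3): 1.0}
```

The irregularity the abstract refers to shows up here as the data-dependent sizes of the neighbor sets, which is what makes balanced CPU+GPU partitioning hard.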
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
DescriptionAs quantum computing (QC) matures and scales, its placement and alignment as part of the high-performance computing (HPC) realm comes more clearly into focus. Qubit-based calculations promise an additional acceleration capability and the possibility to address previously intractable computational science challenges. Upcoming quantum-enabled HPC systems are leveraging many best practices garnered from decades of development in supercomputing, including workflows, standards, and programming tools. At the same time, necessary divergences and augmentations are under development to bridge bit-qubit synergies, including run-time compilation, long optimization times, statistical evaluation of results, hybrid scheduling and resource management, and the need to work with few centralized resources.
In this exhibitor presentation, the Leibniz Supercomputing Centre of the Bavarian Academy of Sciences will present and highlight its multi-dimensional efforts to provide, merge, and optimize various forms of quantum accelerators into its HPC systems. Our efforts drive the hybrid software development of the Munich Quantum Software Stack – the unifying software stack of the Munich Quantum Valley for its regionally developed quantum modalities for superconducting materials, ions, and atoms – and its mission to fold them into current and upcoming HPC systems. Our efforts expand from MQV through national efforts, including Germany's first quantum demonstrator, Q-Exa, built with the Finnish/German company IQM and currently being installed at LRZ. Additionally, with the upcoming placement of the EuroHPC Joint Undertaking superconducting system, Euro-Q-Exa, this effort expands to the European landscape, aligning with several HPC centers across the continent to advance quantum-HPC integration.
This talk describes our vision and research for an integrated ecosystem that combines existing HPC and evolving quantum software stacks into a single system to enable a common and continuous user experience for the benefit of next-generation science and industry results.
I Am HPC Plenary
TP
W
TUT
XO/EX
DescriptionThe panel focuses on the people and social impact of the HPC community. We explore some of the major impacts of HPC on scientific discovery and society as well as some of its future technical and applications directions. The discussion will be engaging and exciting, including many different perspectives on these issues.
High-performance computing has significantly impacted scientific discovery in areas such as climate modeling, materials design, cosmology, biology, and computing, to name a few. It has significantly reduced the time to scientific discoveries via simulations that provide new insights and new directions for experiments. HPC is critical to the training and inference for AI, which is used in many areas including the humanities, life sciences, physical sciences, sociology, and more. Further, HPC has had societal impacts on our daily lives, including faster drug design, addressing threats posed by COVID-19, and climate modeling for better predictions and the implications for proactive responses to mitigate and adapt to climate change.
The collective impact of the people in the HPC community is significant, and the community is growing to bring about additional directions for impact.
Doctoral Showcase
Posters
Artificial Intelligence/Machine Learning
I/O and File Systems
TP
DescriptionMy research focuses on systems optimizations for machine learning, specifically on I/O efficient model storage and retrieval.
The first part of my work focuses on efficient inference serving of tree ensemble models. Tree structures are inherently not cache-friendly and their traversal incurs random I/Os. We developed two systems: Blockset (Block Aligned Serialized Trees) and T-REX (Tree Rectangles).
Blockset improves inference latency in the scenario where the model doesn't fit in memory. It introduces the concept of selective access for tree ensembles, in which only the parts of the model needed for inference are deserialized and loaded into memory. It uses principles from external memory algorithms to rearrange tree nodes in a block-aligned format to minimize the number of I/Os needed for inference. T-REX optimizes inference latency both for in-memory inference and for inference when the model doesn't fit in memory. T-REX reformulates decision tree traversal as hyperrectangle enclosure queries, using the fact that decision trees partition the space into convex hyperrectangles. The test points are then queried for enclosure inside the hyperrectangles. In doing so, random I/O is traded for additional computation.
The second part of my work focuses on efficient deep learning model storage. We implemented a deep learning model repository that requires fine-grained access to individual tensors in models. This is useful in applications such as transfer learning, where individual tensors in layers are transferred from one model to another. We’re currently working on caching and prefetching popular tensors based on application level hints.
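The hyperrectangle view of tree inference can be sketched in a few lines of Python: each leaf of a decision tree corresponds to an axis-aligned box defined by the split thresholds on its root-to-leaf path, and prediction becomes an enclosure query. The tree, boxes, and labels below are hypothetical toy values, and none of T-REX's I/O optimizations are shown.

```python
# Toy sketch of decision-tree inference as hyperrectangle enclosure.
import math

# Each leaf is (lower_bounds, upper_bounds, label) over 2 features,
# derived from a tree that splits on x0 < 0.5, then x1 < 0.3 on the right.
leaves = [
    ((-math.inf, -math.inf), (0.5, math.inf), "A"),   # x0 < 0.5
    ((0.5, -math.inf), (math.inf, 0.3), "B"),         # x0 >= 0.5, x1 < 0.3
    ((0.5, 0.3), (math.inf, math.inf), "C"),          # x0 >= 0.5, x1 >= 0.3
]

def predict(point):
    """Return the label of the unique hyperrectangle enclosing the point."""
    for lo, hi, label in leaves:
        if all(l <= x < h for l, x, h in zip(lo, point, hi)):
            return label
    raise ValueError("leaves must partition the space")

print(predict((0.2, 0.9)))  # A
print(predict((0.8, 0.1)))  # B
```

Because the boxes can be stored and scanned sequentially, the pointer-chasing of tree traversal (and its random I/O) is replaced by extra comparisons, which is the trade the thesis describes.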
Paper
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
TP
DescriptionLarge-scale parallel applications can face significant I/O performance bottlenecks, making efficient I/O crucial. This work presents a comparative study of several parallel I/O implementations in the Weather Research and Forecasting model, including PnetCDF blocking and non-blocking I/O options, netCDF4, HDF5 Log VOL, and ADIOS. For I/O methods creating files in a canonical data layout, PnetCDF's non-blocking option offers up to 2x improvement over its blocking option and up to 4.5x over HDF5 via netCDF4, demonstrating the effectiveness of the write request aggregation technique. The HDF5 Log VOL outperforms ADIOS with a 4x improvement in write performance when creating files in the log layout, although both require non-negligible time to convert the file back to canonical order for post-run analysis. From these results, we extract some observations that can guide I/O strategies for modern parallel codes.
Workshop
Resource Management
State of the Practice
W
DescriptionThe Partnership for an Advanced Computing Environment (PACE) at Georgia Tech (GT) has been running two campus-wide cluster resources available for academic courses and workshops for five years. The initial design focused on creating a federated resource for a wide range of educational topics, based on a PACE and College of Computing (COC) partnership. Due to funding, this took the form of separate resources, one funded by PACE, and another by COC. These "Instructional Cluster Environments", PACE-ICE and COC-ICE, became very popular with instructors at GT but led to a high maintenance cost due to the split nature of the environments. With the transition to the Slurm scheduler, PACE collaborated with COC to merge the two clusters into one, ICE. This work details the strategies used to sensibly merge the two production systems, including the storage architecture, shared system policies, and scheduler priority configurations that honor funding complexities.
Birds of a Feather
Post-Moore Computing
Quantum Computing
TP
XO/EX
DescriptionThe IEEE Quantum-HPC Working Group BoF is the second community-building session targeting academic and enterprise stakeholders in HPC and hybrid HPC-QCS (Quantum Computing and Simulation). Launched at IEEE Quantum Week 2023, the Quantum-HPC Working Group addresses the challenges and opportunities of interfacing HPC and QCS through a full-stack approach across infrastructure, system software, programming tools and use cases. The BoF brings together attendees who are interested in the role of QCS in the HPC ecosystem to chart a sustainable path forward to interface the two technologies by collaborating on the focus areas and technical working structure of Quantum-HPC.
Panel
Artificial Intelligence/Machine Learning
Energy Efficiency
Hardware Technologies
TP
DescriptionAs the world increasingly relies on a new era of high-wattage CPU and GPU platforms to deliver HPC and AI breakthroughs, deploying and cooling these systems within traditional data centers presents a problem. This panel will discuss the need to create more sustainable deployment environments and how immersion cooling is a critical piece to this puzzle.
Join expert panelists in a discussion about the following three considerations: 1) Cost, 2) Available Options, and 3) Service/Support/Warranty Implications. The panel will discuss why out with the old (cooling with fans) and in with the immersive (immersing HPC and AI servers into high-tech non-conductive fluid) is a sustainable option for the future of modern data centers. Most people can agree on one thing – driving rack density and cooling of high-wattage processors presents a new set of challenges. The good news? We have options! Join our SC23 panel to learn more.
Workshop
Algorithms
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Quantum Computing
Task Parallelism
Tensors
W
DescriptionExact diagonalization is a well-established method for simulating small quantum systems. Its applicability is limited by the exponential growth of the Hamiltonian matrix that needs to be diagonalized. Physical symmetries are usually utilized to reduce the matrix dimension, and distributed-memory parallelism is employed to explore larger systems. This paper focuses on an implementation of the core distributed algorithms, with a special emphasis on the matrix-vector product. Instead of the conventional MPI+X paradigm, Chapel is chosen as the language in this work.
We provide a comprehensive description of the algorithms and present performance and scalability tests. Our implementation outperforms the state-of-the-art MPI-based solution by a factor of 7--8 on 32 compute nodes or 4096 cores and scales well through 256 nodes or 32768 cores. The implementation has 3 times fewer software lines of code than the current state of the art, but is still able to handle generic Hamiltonians.
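The core operation the paper parallelizes, the Hamiltonian matrix-vector product, can be sketched on a single node as an ordinary sparse matvec. The 3x3 matrix below is a toy example; the paper's Chapel implementation distributes the basis states and matrix rows across nodes, which this sketch omits.

```python
# Toy single-node sketch of the sparse Hamiltonian matrix-vector product
# at the core of exact diagonalization.

def matvec(rows, x):
    """y = H x for H stored as one list of (col, value) pairs per row."""
    return [sum(v * x[c] for c, v in row) for row in rows]

# A small tridiagonal "Hamiltonian" in sparse row form.
H = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0)],
]
print(matvec(H, [1.0, 1.0, 1.0]))  # [1.0, 0.0, 1.0]
```

In the distributed setting, the hard part is that the column indices reference vector entries owned by other nodes, so the matvec requires communication; that is the portion the paper's Chapel algorithms focus on.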
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionZero-Trust is the cybersecurity architecture of choice and is now being discussed in supercomputing environments. Zero-Trust is based on a least-privilege per-request approach - and it has serious implications for HPC centers, application developers, and end-user workflows. Join this discussion with US Federal CIOs to discuss their expectations and with HPC leaders on their approach.
Posters
Research Posters
Programming Frameworks and System Software
TP
DescriptionAccelerators based on reconfigurable devices are becoming popular for data analytics in high performance computing and cloud computing systems. However, designing these accelerators is a hard problem. High-Level Synthesis tools can help by generating RTL designs from high-level languages, but they tend to optimize the computational part of the kernel, often not considering data movement and memory accesses. For many applications, instead, memory operations take a significant part of the overall execution time and can be the actual bottleneck limiting performance, especially when accessing large, possibly remote, memories.
We propose an approach based on the generation and integration of highly customizable accelerator caches to reduce the latency of external memory accesses by HLS-generated accelerators by exploiting spatial and temporal locality. We integrate it in a state-of-the-art open-source HLS tool and show how our approach makes it easy to explore tradeoffs between performance and resource utilization with minimal user effort.
Tutorial
Data Analysis, Visualization, and Storage
I/O and File Systems
TUT
DescriptionScientific visualization and analysis are key ingredients in HPC simulation workflows. For decades, the dominant paradigm has been post-hoc visualization; simulation codes iterate and save files to disk, giving the domain scientists the opportunity to read the data back at a later time for analysis. In recent years though, this paradigm has been stressed by an ever-diverging rate of growth between I/O and compute speeds. In-situ processing helps mitigate these I/O bottlenecks, enabling simulation and visualization calculations to run in-memory, at higher spatial and temporal resolution, avoiding the transfer of raw data to disks. Even in cases where I/O bottlenecks do not dominate, in-situ processing is well suited for batch-focused analysis, allowing simulation users to obtain distilled results without additional workflow steps.
This half-day tutorial introduces the in-situ visualization paradigm along with Ascent and ParaView Catalyst, two open-source in-situ processing libraries. Both libraries leverage a common interface, Conduit, which provides an intuitive model for describing hierarchical scientific data in C++, C, Fortran, and Python. Attendees will gain hands-on experience learning how to describe simulation data with Conduit and how to use Ascent and Catalyst to transform data, render images, and export results.
Birds of a Feather
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionCXL’s advanced memory expansion and fabric management capabilities can be used to increase system scalability and flexibility across multiple compute domains, enabling resource sharing for higher performance, reduced software stack complexity, and lower overall datacenter memory cost. The fabric enhancements and memory expansion features included in CXL 3.0 deliver new levels of composability required by the large models used in HPC and AI in the modern datacenter. Expert representatives from CXL Consortium member companies who are implementing the specification will explore the CXL 3.0 features, new use case enablement, and ROI examples when implementing CXL attached memory.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionWe develop a distributed memory graph clustering algorithm to find clusters in a graph where new nodes and edges are being added incrementally. At each stage of the algorithm, we maintain a summary of the clustered graph computed from all incremental batches received thus far. As we receive a new batch of nodes and edges, we cluster the new graph and merge new clusters with the previous summary clusters. We use sparse linear algebra to perform these operations. Our algorithm would make it possible to find clusters in very large graphs for which regular graph clustering algorithms could not run due to computation/communication bottlenecks.
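The summary-maintenance step can be pictured as building a quotient graph: if S assigns nodes to clusters, the summarized adjacency is S^T A S. Below is a plain-Python sketch of that collapse using dicts; the edges and cluster assignment are toy data, and the poster's sparse-linear-algebra formulation and distributed-memory machinery are omitted.

```python
# Toy sketch of collapsing a clustered graph onto its cluster summary.
from collections import defaultdict

def summarize(edges, cluster_of):
    """Collapse a weighted edge list onto clusters, summing edge weights."""
    summary = defaultdict(float)
    for (u, v), w in edges.items():
        cu, cv = cluster_of[u], cluster_of[v]
        summary[(min(cu, cv), max(cu, cv))] += w
    return dict(summary)

edges = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0, (3, 4): 1.0}
cluster_of = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}

# Intra-cluster weight lands on "diagonal" pairs; (0, 1) is the cut.
print(summarize(edges, cluster_of))
# {(0, 0): 2.0, (0, 1): 1.0, (1, 1): 1.0}
```

When a new incremental batch arrives, only the summary and the batch need to be combined, which is what keeps the memory footprint bounded as the graph grows.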
Workshop
Architecture and Networks
W
Workshop
Architecture and Networks
W
Workshop
INDIS Paper 2: Experimental Study of TCP Throughput Profiles and Dynamics Over Dedicated Connections
Architecture and Networks
W
Workshop
Architecture and Networks
W
Workshop
Architecture and Networks
W
Workshop
Architecture and Networks
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionIn-situ processing has been widely recognized as an effective approach for the visualization and analysis of large-scale simulation outputs from modern HPC systems. However, traditional batch-based in-situ visualization can produce large amounts of rendering results, which can make it difficult to gain rapid insight into the simulation results during post-hoc visual analysis. As an alternative to accelerate the process of obtaining scientific knowledge, we have worked on a smart visualization approach, focusing on extracting a set of images that may facilitate rapid understanding of the underlying simulated phenomena. In this work, we present a method for automatically adjusting the camera focus point and zoom level during in-situ visualization. We integrated the proposed approach with an existing in-situ smooth camera path estimation method for evaluation purposes and used two CFD simulation codes and two HPC systems (an x86 server and the Arm-based Fugaku supercomputer) for the evaluations.
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionWe have developed a software infrastructure for testing multi-threaded programs that implement the fork-join concurrency model. The infrastructure employs several key ideas: The student solutions use print statements to trace the execution of the fork-join phases. The test writer provides a high-level specification of the problem-specific aspects of the traces, which is used by the infrastructure to handle the problem-independent and low-level details of processing the traces. During performance testing, trace output is disabled automatically. During functionality testing, fine-grained feedback is provided to identify the correct and incorrect implementation of the various fork-join phases. Tests written using our infrastructure have been used in an instructor-training workshop as an instructor agent clarifying requirements and checking in-progress work. The size of the code to check the concurrency correctness of final and intermediate results was far smaller than the code to check the serial correctness of such results.
Birds of a Feather
Cloud Computing
TP
XO/EX
DescriptionAs cloud environments deploy HPC capable infrastructure, large scale supercomputing and HPC centers are exploring how to integrate these resources into their ecosystems. This BoF will provide an opportunity for these centers to share their experiences and insights as well as provide a venue to establish collaborative efforts and develop broader strategies across the community. This BoF will provide a forum for discussion between supercomputing facility operators, cloud service providers, and the user community that will cover strategies and approaches for integrating cloud resources into existing HPC facility environments.
Posters
Research Posters
TP
XO/EX
DescriptionLCLS-II at SLAC, SNS at Oak Ridge National Laboratory, and other instruments use software written in C and C++, producing huge volumes of time-evolving data at high rate. Data compression can decrease the volume of data we need to move and store. TEZIP is a neural network (NN) based compressor designed for high-quality compression of time-evolving data. However, TEZIP is written in Python and is not easily usable from or ported to C++. In this work, we develop new components in LibPressio that allow us to integrate with TEZIP and other external compressors efficiently and evaluate them with a systematic approach. We find that TEZIP's compression ratio (error bound 1e-06) for Hurricane Isabel is 128, which is 2.4 times greater than the leading SZ3's, 52.8. Our basic integration of TEZIP into LibPressio sets a precedent for the integration of non-C/C++ compressors into LibPressio.
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionMany HPC systems are managed using batch queues; however, not all HPC applications and workflows are best served by batch queue systems. Interactive prototyping, urgent streaming data analysis, application steering, and in-situ visualization are among the workflows that require interactive and urgent capabilities to be effective. After three successful SC BoFs and seven successful workshops at SC and ISC, the interactive and urgent HPC community is writing a position paper during the summer of 2023 to document progress and cast future research foci. In this BoF, we will present the state of the draft paper and solicit discussion and feedback.
Doctoral Showcase
Posters
Data Analysis, Visualization, and Storage
TP
DescriptionLarge distributed volume data are routinely produced in numerical simulations and experiments. In-situ visualization, the visualization of simulation or experiment data as it is generated, enables simulation steering and experiment control, which helps scientists gain an intuitive understanding of the studied phenomena. Such data exploration requires interactive visualization with smooth viewpoint changes and zooming to convey depth perception and spatial understanding. As data sizes increase, this becomes increasingly challenging.
This thesis presents an end-to-end solution for interactive in-situ visualization on distributed computers based on novel extensions to the Volumetric Depth Image (VDI) representation. VDIs are view-dependent, compact representations of volume data that can be rendered faster than the original data.
We propose the first algorithm to generate VDIs on distributed 3D data, using sort-last parallel compositing to scale to large data sizes. Scalability is achieved by a novel compact in-memory representation of VDIs that exploits sparsity and optimizes performance. We also propose a low-latency architecture for sharing data and hardware resources with a running simulation. The resulting VDI is streamed for remote interactive visualization.
We provide a novel raycasting algorithm for rendering streamed VDIs, significantly outperforming existing solutions. We exploit properties of perspective projection to minimize calculations in the GPU kernel and leverage spatial smoothness in the data to minimize memory accesses.
The quality and performance of the approach are evaluated on multiple datasets, showing that the approach outperforms state-of-the-art techniques for visualizing large distributed volume data. The contributions are implemented as extensions to established open-source tools.
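The core of rendering a view-dependent representation like a VDI is compositing each ray's depth-sorted segments front to back, with early termination once a ray is effectively opaque. The sketch below is schematic: the function name, scalar colors, and the termination threshold are assumptions, not the thesis implementation.

```python
def composite_ray(segments):
    """Front-to-back compositing of one ray's (color, opacity)
    segments, sorted near to far, as in volume/VDI rendering.
    Stops early once the ray is effectively opaque."""
    color_acc, alpha_acc = 0.0, 0.0
    for color, alpha in segments:
        # remaining transparency scales this segment's contribution
        weight = (1.0 - alpha_acc) * alpha
        color_acc += weight * color
        alpha_acc += weight
        if alpha_acc >= 0.999:  # early ray termination
            break
    return color_acc, alpha_acc
```

Sort-last parallel compositing applies the same idea across nodes: each node produces segments for its portion of the data, and partial results are merged in depth order.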
Paper
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
DescriptionA common strategy for improving efficiency in deep learning training entails multiplexing tasks on a single GPU. To mitigate the interference caused by multiplexing, existing approaches primarily employ kernel-level solutions to regulate GPU kernel execution, or harness hardware-level techniques to explicitly restrict GPU streaming multiprocessors and memory. Nevertheless, none of them performs satisfactorily in optimizing the completion time of tasks.
In this paper, we present IADeep, a middleware solution designed to significantly improve multiplexing efficiency. The core concept is the co-optimization of task assignments within a cluster and interference mitigation on each device. IADeep coordinates the configuration of all co-located tasks in a less fine-grained fashion, effectively reducing interference and enhancing task training performance. Across the entire cluster, IADeep intelligently selects applications suitable for multiplexing to further amplify the advantages of optimizing task configurations. Evaluations on a 20 RTX 3090-GPU cluster demonstrate that IADeep can significantly outperform state-of-the-art multiplexing solutions.
Workshop
Software Engineering
W
Workshop
Education
State of the Practice
W
DescriptionThe US Department of Energy is a long-standing leader in HPC for science. However, we face daunting challenges in fostering a robust and diverse HPC workforce. Basic HPC is not typically taught at early stages of students’ academic careers, and the capacity and knowledge of HPC at many institutions are limited. Even so, such topics are prerequisites for advanced training programs, internships, graduate school, and ultimately for careers in HPC. To help address this challenge, we launched a training and workforce pipeline program.
We describe the Intro to HPC Bootcamp, an immersive program designed to engage students from underrepresented groups as they learn foundational HPC skills. The program takes a novel approach to HPC training by turning the traditional curriculum upside down. Instead of focusing on technology and its applications, the bootcamp focuses on energy justice to motivate the training of HPC skills through project-based pedagogy and real-life science stories.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThe Message Passing Interface (MPI) API is the dominant programming approach for HPC environments. Its specification is driven by the MPI Forum, an open forum consisting of MPI developers, vendors, and users. Just before SC23, the MPI Forum published the latest version of the standard, MPI 4.1. We will take a look at the new features and discuss what this means for users of MPI. However, MPI 4.1 is not the end of the MPI standard – the forum is already working toward MPI 5.0, and we will discuss ideas and directions and seek feedback from the community.
Workshop
Programming Frameworks and System Software
W
DescriptionOne of the issues with HPC clusters is that the prerequisite knowledge required to use them is substantial, making the learning curve steep for novice users. Moreover, it is desirable to run graphical user interface applications with interactive operations on the compute nodes, but the procedure for doing so is complicated.
We describe how we introduced Open OnDemand, a web portal that enables easy use of the computing resources of an HPC cluster, to Fugaku, a Japanese flagship supercomputer. To make these resources accessible to new users, we developed an adapter that enables the job scheduler used on Fugaku to be used from Open OnDemand. In addition, to further improve user convenience, we developed applications that enable data sharing between Open OnDemand and external storage systems. This paper describes the various features we have added to Open OnDemand for Fugaku and the development of the data-sharing applications.
Posters
Research Posters
TP
XO/EX
DescriptionReverse Time Migration (RTM) poses substantial computational challenges, demanding large memory and extended processing times. Our RTM implementation processes three-dimensional fields on multiple NVIDIA GPUs, using the Revolve algorithm for checkpointing. However, transferring data between host and GPU memory introduces a bottleneck.
To overcome this, we introduced a checkpoint prefetching mechanism that anticipates memory transfers from host to GPU. Additionally, we integrated GPU data compression using the cuZFP library to further reduce transfer sizes. The experimental results demonstrated significant performance improvements, achieving a speedup of 1.98x - 2.53x on our benchmark dataset. Together, prefetching and compression could reduce host-to-GPU memory transfers by up to 16x.
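The prefetching idea, overlapping checkpoint decompression and transfer with the reverse-pass computation, can be sketched in plain Python. This is a minimal analogy only: zlib stands in for cuZFP, a background thread stands in for asynchronous CUDA transfers, and all function names are assumptions.

```python
import pickle
import queue
import threading
import zlib


def save_checkpoint(state):
    # compress before storing, shrinking host<->device-style transfers
    return zlib.compress(pickle.dumps(state))


def run_reverse_pass(checkpoints, recompute):
    """Walk checkpoints in reverse, decompressing the next one in a
    background thread while `recompute` consumes the current state."""
    prefetched = queue.Queue(maxsize=2)  # bounded lookahead

    def prefetcher():
        for blob in reversed(checkpoints):
            prefetched.put(pickle.loads(zlib.decompress(blob)))
        prefetched.put(None)  # sentinel: no more checkpoints

    threading.Thread(target=prefetcher, daemon=True).start()
    results = []
    while (state := prefetched.get()) is not None:
        results.append(recompute(state))
    return results
```

The bounded queue keeps at most a couple of decompressed checkpoints in flight, mirroring how a real implementation limits prefetch memory while hiding transfer latency behind compute.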
Tutorial
Distributed Computing
Software Engineering
TUT
DescriptionA majority of HPC system users utilize scripting languages such as Python to prototype their computations, coordinate their large executions, and analyze the data resulting from their computations. Python is great for these many uses, but it frequently falls short when significantly scaling up the amount of data and computation, as required to fully leverage HPC system resources. In this tutorial, we show how example computations such as heat diffusion, k-mer counting, file processing, and distributed maps can be written to efficiently leverage distributed computing resources in the Chapel, UPC++, and Fortran parallel programming models.
The tutorial is targeted for users with little-to-no parallel programming experience, but everyone is welcome. A partial differential equation example will be demonstrated in all three programming models. That example and others will be provided to attendees in a virtual environment. Attendees will be shown how to compile and run these programming examples, and the virtual environment will remain available to attendees throughout the conference, along with Slack-based interactive tech support.
Come join us to learn about some productive and performant parallel programming models!
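For readers unfamiliar with the heat diffusion example mentioned above, a serial 1D sketch in Python conveys the computation that the tutorial then parallelizes in Chapel, UPC++, and Fortran. The variable names and the fixed-boundary choice are illustrative assumptions, not the tutorial's actual code.

```python
def heat_step(u, alpha=0.1):
    """One explicit finite-difference step of 1D heat diffusion
    with fixed (Dirichlet) boundary values."""
    new = u[:]  # boundaries u[0] and u[-1] stay fixed
    for i in range(1, len(u) - 1):
        new[i] = u[i] + alpha * (u[i - 1] - 2 * u[i] + u[i + 1])
    return new


def diffuse(u, steps, alpha=0.1):
    """Apply `steps` diffusion steps to the initial temperatures u."""
    for _ in range(steps):
        u = heat_step(u, alpha)
    return u
```

In the parallel versions, the interior loop is what gets distributed: each process or locale owns a block of the array and exchanges only its boundary values with neighbors each step.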
Tutorial
Algorithms
Post-Moore Computing
Quantum Computing
TUT
DescriptionQuantum computing offers the potential to revolutionize high-performance computing by providing a means to solve certain computational problems faster than any classical computer. Relatively recently, quantum computing has advanced from a theoretical possibility to engineered reality, with commercial entities offering early prototype quantum processors representing a variety of qubit technologies and computational paradigms. The media have been showcasing each new development and implicitly conveying the message that quantum-computing ubiquity is nigh. Here, we will respond to this hype and provide an overview of the exciting but still early state of the field.
We introduce participants to the computational models underlying quantum computing. We work through examples of its immense computational power while highlighting what the quantum computing community still does not know in terms of quantum algorithms and where the power of quantum computing comes from. We examine the thought processes that programmers use to map problems to circuit-model quantum computers, quantum annealers, measurement-based quantum systems, analog Rydberg atom arrays, and other recent inventions in the quantum-computing space. We conclude with an overview of the hardware and algorithmic challenges that must be overcome before quantum computing becomes a component of the HPC developer's repertoire.
Workshop
Accelerators
Artificial Intelligence/Machine Learning
Applications
Distributed Computing
Compilers
Exascale
Heterogeneous Computing
Message Passing
Performance Optimization
Programming Frameworks and System Software
Software Engineering
Sustainability
Task Parallelism
W
DescriptionAs supercomputers become more and more powerful, the number and diversity of applications that can be tackled with these machines grows. Unfortunately, the architectural complexity of these supercomputers grows as well, with heterogeneous processors, multiple levels of memory hierarchy, and many ways to move data and synchronize between processors. The MPI+X programming model, use of which is considered by many to be standard practice, demands that a programmer be expert in both the application domain and the low-level details of the architecture(s) on which that application will be deployed, and the availability of such superhuman programmers is a critical bottleneck. Things become more complicated when evolution and change in the underlying architecture translate into significant re-engineering of the MPI+X code to maintain performance.
Numerous alternatives to the MPI+X model exist, and by raising the level of abstraction on the application domain and/or the target architecture, they offer the ability for “mere mortal” programmers to take advantage of the supercomputing resources that are available to advance science and tackle urgent real-world problems. However, compared to the MPI+X approach, these alternatives generally lack two things. First, they aren’t as well known as MPI+X and a domain scientist may simply not be aware of models that are a good fit to their domain. Second, they are less mature than MPI+X and likely have more functionality or performance “potholes” that need only be identified to be addressed.
PAW-ATM is a forum for discussing HPC applications written in alternatives to MPI+X. Its goal is to bring together application experts and proponents of high-level languages to present concrete example uses of such alternatives, describing their benefits and challenges.
Posters
Research Posters
TP
XO/EX
DescriptionAs compute clusters used for running batch jobs continue to grow in scale and complexity, the frequency of anomalies increases significantly. Timely detection of anomalous events has become vital to maintaining system efficiency and availability. Our study presents an attention-based graph neural network (GNN) to detect anomalies in clusters at the compute-node level and provide detailed root cause analysis to pinpoint issues. Evaluated on real-world datasets, the attention-based GNN accurately detects and localizes anomalies.
Workshop
State of the Practice
W
DescriptionThis work is a contribution to the advancement of linear solvers for the Exascale Computing Project. It focuses on direct sparse linear solvers using High-Performance Computing (HPC) for large-scale power systems, resembling the United States power grids. This paper explores supercomputers at Oak Ridge Leadership Computing Facility, Summit and Frontier, comparing both performance and optimization strategies. The project encompasses a comprehensive test bench for Trilinos Amesos2 CPU-based solvers, KLU2 and ShyLUBasker, and the testing of GPU-based solvers from NVIDIA cuSolver to AMD rocSolver on distinct architecture configurations. The challenges of power flow analysis are addressed through optimization techniques, like matrix symmetry and GPU acceleration, and by evaluating accuracy and stability of linear solvers through residual analysis. Beyond technical gains, this work underscores the significance of collaboration and diverse expertise in HPC for innovative analysis of power grid systems, critical for resilient infrastructure against burgeoning threats like climate change and cyberattacks.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionThe US Department of Energy (DOE) Exascale Computing Project (ECP) is coming to an end. But the impact of the project is just beginning. ECP has produced dozens of GPU-enabled, scalable application codes and dozens of GPU-capable libraries and tools that underpin these applications. The experiences and software capabilities coming out of ECP are ready for further leveraging within DOE and beyond. ECP outcomes demonstrate that the opportunity for impact on science and engineering computations is exceptional. Across the board, ECP applications realized improvements of 100 times or more in performance and scalability, leading to similar orders of scientific impact. Furthermore, this impact comes primarily from adapting algorithms and software to exploit GPU devices from NVIDIA, AMD, and Intel through performance portability layers. While ECP focused on expanding capabilities for large systems, the same technical advances are directly translatable to migrating existing applications to smaller systems. For example, a problem that presently requires a CPU-based cluster can realize similar performance on desktop systems leveraging GPUs, greatly reducing energy and infrastructure costs. As the energy costs of HPC projects become an increasing concern, accelerated (GPU) devices become the path to increased efficiency, but only if our HPC software stacks can deliver the performance potential. The true legacy of ECP will be how it advanced the development of application codes and established a stack of libraries and tools to be leveraged by the broader HPC community. ECP provides a foundation for transforming HPC applications to realize their performance potential on GPUs and future accelerated devices.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionScientific workflows are now a common tool used by domain scientists in a number of disciplines. They are appealing because they enable users to think at a high level of abstraction, composing complex applications from individual application components. Workflow management systems (WMSs), such as Pegasus (http://pegasus.isi.edu), automate the process of executing these workflows on modern cyberinfrastructure. They take these high-level, resource-independent descriptions and map them onto the available heterogeneous resources: campus clusters, high-performance computing resources, high-throughput resources, clouds, and the edge. WMSs can select the appropriate resources based on their architecture, availability of key software, performance, reliability, availability of cycles, and storage space, among other factors. Using algorithms like those used in compilers, they can determine what data to save during execution and which are no longer needed. Similarly to compiler solutions, they can generate an executable workflow that is tailored to the target execution environment, taking into account reliability, scalability, and performance. WMSs use workflow execution engines to run the executable workflows on the target resources, while the jobs within the workflow are managed by the host runtime system. This talk will describe the key concepts used in the Pegasus WMS and pose the question of how to improve workflow management systems to be more dynamic and resilient.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionOne of the most dramatic differences between the brain and modern computing systems is the ubiquitous stochasticity of neural circuits. The brain leverages noise in its biophysics to make its computations more powerful and efficient, whereas today’s computers are designed, at great expense, to be deterministic from the transistor up. Such determinism is assumed to be necessary for microelectronics, but it leads to high costs in both fabrication and in the design of probabilistic computing applications.
In this talk, I will describe how modern neuromorphic computing are approaching this level of widespread stochasticity, enabling the development of a new class of probabilistic neuromorphic applications. The talk will highlight our results on implementing Monte Carlo random walk applications and stochastic optimization on the Intel Loihi, SpiNNaker, and IBM TrueNorth systems, showing how today’s neuromorphic systems are increasingly competitive with CPUs and GPUs. Finally, I will describe how future neuromorphic systems that leverage true random number generation from stochastic “coinflip” devices may prove critical for realizing the full potential for neuromorphic computing for scientific applications.
In this talk, I will describe how modern neuromorphic computers are approaching this level of widespread stochasticity, enabling the development of a new class of probabilistic neuromorphic applications. The talk will highlight our results on implementing Monte Carlo random walk applications and stochastic optimization on the Intel Loihi, SpiNNaker, and IBM TrueNorth systems, showing how today’s neuromorphic systems are increasingly competitive with CPUs and GPUs. Finally, I will describe how future neuromorphic systems that leverage true random number generation from stochastic “coinflip” devices may prove critical for realizing the full potential of neuromorphic computing for scientific applications.
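To make the Monte Carlo random walk workload concrete, a toy discrete walker illustrates the class of computation such stochastic hardware accelerates; on a neuromorphic system each walker would be a spiking process driven by on-chip randomness. The function names, domain size, and walker count below are illustrative assumptions.

```python
import random


def walk_exit_time(n_sites, start, rng):
    """Step a single unbiased walker on sites 0..n_sites-1 until it
    leaves the domain; return the number of steps taken."""
    pos, steps = start, 0
    while 0 <= pos < n_sites:
        pos += rng.choice((-1, 1))  # unbiased left/right step
        steps += 1
    return steps


def mean_exit_time(n_sites, start, n_walkers=2000, seed=0):
    """Monte Carlo estimate of the expected exit time, averaging
    many independent walkers."""
    rng = random.Random(seed)
    total = sum(walk_exit_time(n_sites, start, rng) for _ in range(n_walkers))
    return total / n_walkers
```

For a walker starting at the center of a 9-site domain, the classical gambler's-ruin result gives an expected exit time of 25 steps, which the estimate should approach as the walker count grows.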
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionSupercomputing has been changing dramatically in recent years. Integration/convergence of Simulation/Data/Learning (S+D+L) is important for Society 5.0, proposed by the Japanese Government, which enables the integration of cyber space and physical space. In 2015, we started the BDEC project (Big Data & Extreme Computing) to develop supercomputers and software for integration of (S+D+L). In May 2021, we started operation of Wisteria/BDEC-01, the first BDEC system, which consists of computing nodes for computational science and engineering with A64FX (Odyssey) and nodes for data analytics/AI with NVIDIA A100 GPUs (Aquarius). We also developed a software platform, "h3-Open-BDEC", for integration of (S+D+L) on Wisteria/BDEC-01, designed to extract the maximum performance of the supercomputers with minimum energy consumption, focusing on (1) innovative methods for numerical analysis with adaptive precision, accuracy verification, and automatic tuning; (2) a hierarchical data-driven approach based on machine learning; and (3) software for heterogeneous systems. Integration of (S+D+L) by h3-Open-BDEC enables significant reduction of computation and power consumption compared to conventional simulations. In January 2025, together with the University of Tsukuba, we will start operating the Oakforest-PACS II (OFP-II) system, which will consist of NVIDIA H100 nodes with a total peak performance of 100+ PFLOPS. This is our next platform for integration of (S+D+L). Since October 2022, we have been supporting our users in migrating their applications to OFP-II with GPUs, in collaboration with NVIDIA. In this talk, our activities in integration of (S+D+L) and efforts toward OFP-II will be described.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionQuantum machine learning is a rapidly growing field of quantum computing, and many deep learning models and methods have been adapted into quantum analogues using gate-based or annealing-based platforms. These methods have been essential for uncovering subtleties in quantum learning dynamics, and a growing number of examples can be found in the literature, implemented in simulation or on actual quantum hardware. The maturity of quantum technology presents opportunities for building and training larger quantum machine learning models. But with increasing circuit depth and width, when working with real-world classical datasets, the field still faces several obstacles, namely how to pre-process data efficiently and effectively for quantum machine learning, and how to post-process the outcomes of measurements.
In this talk, I will present an overview of several ongoing research projects at Oak Ridge National Laboratory in the fields of high energy physics and natural language processing. I will highlight and discuss the challenges and advantages we have encountered when building, training, and deploying quantum generative models, quantum natural language processing models, and quantum classifiers, either as standalone models or as components of hybrid workflows.
Workshop
Distributed Computing
Security
W
DescriptionThe SABSA (Sherwood Applied Business Security Architecture) model is a useful generic means of exploring users’ preferences for reducing residual risks to acceptable levels given budgetary constraints (financial, resource, time frames, etc.) while traceably supporting business objectives.
This talk presents why and how SABSA can be used in the HPC context to optimise the selection of controls addressing mandatory security requirements (e.g. those pursuant to the USA's National Strategic Computing Initiative, established by Presidential Executive Order 13702) and discretionary ones.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionTraining artificial intelligence (AI) models involves repeatedly loading large datasets. Data loading and transfer can become a significant bottleneck. AI training has different Input/Output (I/O) patterns compared with traditional scientific simulations: it is read intensive, involving heavy metadata operations, complex data formats, random access patterns, multithreaded asynchronous I/O, etc. It is crucial to capture the I/O behavior and understand the storage and I/O requirements of these workloads. In this talk, we will present our efforts in developing profiling tools and benchmark suites to address this need. Our study sheds light on how to better design storage hardware and I/O software to support AI workloads.
Workshop
Accelerators
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionFollowing the joint effort by the Spanish, Portuguese, and Turkish governments, together with the EuroHPC JU (EC), the new supercomputer MareNostrum 5 will enter operation in the coming weeks. This highly heterogeneous supercomputer, with an aggregated peak performance above 300 PFlop/s, will include a world-class accelerated partition based on NVIDIA Hopper cards and the largest x86 general-purpose partition in the world, as well as the new NVIDIA Grace CPU. To fully exploit this new research infrastructure, the investment includes a strict access mechanism based on scientific excellence criteria and a complete user support programme.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionIn recent years, there has been a renewed interest in near-memory processing (NMP) architectures as a workaround for the performance and energy issues of frequent and irregular memory accesses. However, effective use of NMP architectures requires rethinking data structures and their algorithms, especially as these data structures scale up in size well beyond the size of last level caches. In this talk, I will focus on cache-optimized data structures, such as skiplists and B+ trees, often used in online transaction processing (OLTP) systems to enable fast key-based lookups. I will present a hardware/software co-design solution of NMP-aware algorithms for these concurrent data structures and show that our approach can improve performance by more than 2X compared to the state-of-the-art.
Workshop
Distributed Computing
Security
W
DescriptionMulti-host clusters built with CXL memory modules (enabled by the CXL 3.0 standard) provide an opportunity for power-efficient computing. Enabling near-data computing, however, is not without its security challenges. This presentation identifies several security issues worthy of architectural consideration and approaches for mitigating them.
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionNowadays, powerful optimizing compilers are needed to transform and specialize software for a particular machine, for both performance and energy reasons. For example, compilers for high-level synthesis (HLS) can greatly facilitate the description of complex hardware implementations by raising the level of abstraction to a classical imperative language such as C/C++, usually augmented with vendor-specific pragmas and APIs. Software is thus being used to describe hardware, but despite these productivity improvements, attaining high performance for the final designs remains a challenge: many crucial optimizations require substantial changes in control-flow structure, I/O approach, on-chip buffer management, function boundaries, exposed concurrency, etc.
In this talk, we discuss techniques and tools to assist with the development of optimized software, and optimized hardware using HLS. By specializing the compilation process to a specific class of programs, those whose control flow and dataflow can be exactly computed by means of interpretation at compile-time (e.g., many deep learning applications), we can for instance develop advanced code generation techniques for a class of sparse computation, powerful source-to-source transformations for better hardware designs via HLS, and verify the correctness of these optimized programs automatically.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionAs efficient I/O becomes increasingly critical to reaching peak computing performance, IO500 has become the de facto standard for measuring HPC storage performance. First developed in 2017, the IO500 has released biannual lists at SC and ISC ever since. This BoF’s highlight is the presentation of the new IO500 list.
This BoF’s goal is to foster the IO500 community in advancing the common goals of creating, sharing, and benefiting from a large corpus of shared storage performance data. We are also building a detailed repository of high-performance production storage systems as they evolve, providing a knowledge base for HPC researchers and system designers.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
Exhibits
Flash Session
TP
XO/EX
DescriptionLearn how IQM is continuously improving quantum computing technology for superconducting qubits, moving toward quantum utility for our customers. We explain our technology roadmap and our product offering, ranging from research, education, and HPC environments to industry customers.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionThe Sophon SG2042 is the world's first commodity 64-core RISC-V CPU for high-performance workloads. An important question is whether the SG2042 has the potential to encourage the HPC community to embrace RISC-V.
We undertake a performance exploration of the SG2042 against existing RISC-V hardware and the high-performance x86 CPUs used in modern supercomputers. Leveraging the RAJAPerf benchmarking suite, we find that, on average, the SG2042 delivers, per core, between five and ten times the performance of the nearest widely available RISC-V hardware. We also find that, on average, the high-performance x86 CPUs under test outperform the SG2042 by between four and eight times for multi-threaded workloads, although some individual kernels do run faster on the SG2042. The result of this work is a performance study that not only contrasts this new RISC-V CPU with existing technologies, but also shares performance best practices.
Invited Talk
Artificial Intelligence/Machine Learning
HPC Infrastructure
TP
DescriptionThis talk summarizes the policies, actors, and institutions that contributed to the development of HPC in Brazil over the last 40 years. It surveys related activities in academia, professional societies, industry, and federal and state governments. It emphasizes actions that could be useful for other developing countries that are willing to invest in HPC.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionPresenting the best paper of ISAV23.
Closing remarks.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionIntroduction to the ISAV23 workshop and presentations.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionExascale computers are becoming a playground for scientific discovery. Using the extreme-scale kinetic fusion PIC code XGC as a proxy, this presentation will demonstrate the challenges and opportunities of in situ analysis, reduction, and visualization in our high-performance computing ecosystem, along with the new contributions we have made therein. The first topic is enabling HPC science studies that have been difficult due to the gap in memory size relative to FLOPS. Often, first-principles-level scientific analysis requires deep-level identification and indexing of high-dimensional simulation objects, which amplifies memory requirements to an impractically high level. Developing in situ approaches for our time and phase-space analysis and visualization, which consider specific features, minimizes the node-memory requirement and enables such studies. Another concern is the growing gap between compute speed and I/O bandwidth. Our data is analyzed, visualized, and compressed while being generated, without first being stored to a file system. This enables faster scientific discovery and quicker feedback into next-day experimental or simulation inputs. We also consider the potential for increased accuracy, where fine temporal and phase-space sampling of transient analysis might expose complex behavior missed by the coarse sampling often necessitated by an off-line approach. There is also the possibility of assessing the error and uncertainty in the predictability of the target science in parallel with the simulation, which could enable automated/AI-assisted simulation steering. Finally, we discuss building more complete databases for AI/ML training via automated identification and healing of scientific simulation data sets in the phase spaces where previous simulation or experimental data do not exist.
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
DescriptionThis paper introduces Itoyori, a task-parallel runtime system designed to tackle the challenge of scaling task parallelism (more specifically, nested fork-join parallelism) beyond a single node. The partitioned global address space (PGAS) model is often employed in task-parallel systems, but naively combining them can lead to poor performance due to fine-grained and redundant remote memory accesses. Itoyori addresses this issue by automatically caching global memory accesses at runtime, enabling efficient cache sharing among parallel tasks running on the same processor. As a real-world case study, we ported an existing task-parallel implementation of the Fast Multipole Method (FMM) to distributed memory with Itoyori and achieved a 7.5x speedup when scaled from a single node to 12 nodes and up to 6.0x faster performance than without caching. This study demonstrates that global-view fork-join programming can be made practical and scalable, while requiring minimal changes to the shared-memory code.
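The nested fork-join pattern that Itoyori scales beyond a single node can be pictured with a toy shared-memory stand-in. The sketch below is our own illustration, not Itoyori's API: Python threads play the role of distributed tasks, and there is no PGAS or caching layer.

```python
from concurrent.futures import ThreadPoolExecutor

# Enough workers to hold every outstanding forked task in this toy
# (ceil(64/8) - 1 = 7 tasks), so blocked parents cannot exhaust the pool.
pool = ThreadPoolExecutor(max_workers=8)

def fork_join_sum(data, grain=8):
    """Recursively fork a child task for the left half and join its result."""
    if len(data) <= grain:                                 # leaf: serial work
        return sum(data)
    mid = len(data) // 2
    left = pool.submit(fork_join_sum, data[:mid], grain)   # fork
    right = fork_join_sum(data[mid:], grain)               # continue inline
    return left.result() + right                           # join

print(fork_join_sum(list(range(64))))  # -> 2016
```

In Itoyori the forked tasks migrate across nodes and touch global memory; the paper's contribution is making those remote accesses cache-friendly, which this local sketch does not model.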
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionDetecting and correcting Silent Data Corruption (SDC) is of high interest for many HPC applications due to the dramatic consequences such undetected computation errors can have. Additionally, going into the exascale era of computing, SDC error rates are only increasing with growing system sizes. State-of-the-art methods based on instruction duplication suffer from partial error coverage, significant synchronization overhead, and strong coupling of computation and validation.
This work proposes a novel communication-avoiding approach of detecting and mitigating SDCs at the job level within the workload manager, assuming a directed acyclic graph (DAG) job model. Each job only communicates a locally generated output data hash. Computation and validation are decoupled as separately schedulable jobs and dependency stalling is avoided with a special error recovery method. The implementation of this project within the SLURM workload manager is in progress and key design aspects are outlined.
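The job-level scheme can be pictured with a toy sketch. The code below is our own illustration, not the poster's design (function names and the hash choice are assumptions): a computation job communicates only a hash of its output, and a separately schedulable validation job re-executes the work and compares digests.

```python
import hashlib

def compute_job(inputs):
    """The computation job: produces output plus a hash of that output."""
    out = sum(x * x for x in inputs)
    digest = hashlib.sha256(str(out).encode()).hexdigest()
    return out, digest                 # only the small digest is communicated

def validation_job(inputs, reported_digest):
    """Separately schedulable validation: re-execute and compare hashes."""
    _, digest = compute_job(inputs)
    return digest == reported_digest   # mismatch -> suspected SDC

out, h = compute_job([1, 2, 3])
print(validation_job([1, 2, 3], h))    # -> True (clean run)
print(validation_job([1, 2, 4], h))    # -> False (corruption detected)
```

In the proposed system the two functions would be distinct SLURM jobs in the DAG, so validation need not stall the jobs that depend on the computation's output.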
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionWe evaluate Julia as a single language and ecosystem paradigm powered by LLVM to develop workflow components for high-performance computing. We run a Gray-Scott, 2-variable diffusion-reaction application using a memory-bound, 7-point stencil kernel on Frontier, the US Department of Energy's first exascale supercomputer. We evaluate the performance, scaling, and trade-offs of (i) the computational kernel on AMD's MI250x GPUs, (ii) weak scaling up to 4,096 MPI processes/GPUs or 512 nodes, (iii) parallel I/O writes using the ADIOS2 library bindings, and (iv) Jupyter Notebooks for interactive analysis. Results suggest that although Julia generates a reasonable LLVM-IR, a nearly 50% performance difference exists vs. native AMD HIP stencil codes when running on the GPUs. As expected, we observed near-zero overhead when using MPI and parallel I/O bindings for system-wide installed implementations. Consequently, Julia emerges as a compelling high-performance and high-productivity workflow composition language, as measured on the fastest supercomputer in the world.
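For readers unfamiliar with the benchmark, a toy version of the Gray-Scott update conveys the stencil structure. The sketch below is our own pure-Python illustration, not the Julia code under study: it uses a 5-point stencil on a small periodic 2D grid (the paper's kernel is a 7-point 3D stencil on GPUs), and all constants are arbitrary.

```python
# Gray-Scott 2-variable diffusion-reaction: one explicit-Euler step.
Du, Dv, F, k, dt = 0.2, 0.1, 0.04, 0.06, 1.0   # illustrative parameters
N = 8
U = [[1.0] * N for _ in range(N)]
V = [[0.0] * N for _ in range(N)]
V[N // 2][N // 2] = 0.5                        # seed a perturbation

def lap(A, i, j):
    """5-point Laplacian with periodic boundaries."""
    return (A[(i - 1) % N][j] + A[(i + 1) % N][j]
            + A[i][(j - 1) % N] + A[i][(j + 1) % N] - 4 * A[i][j])

def step(U, V):
    Un = [[0.0] * N for _ in range(N)]
    Vn = [[0.0] * N for _ in range(N)]
    for i in range(N):
        for j in range(N):
            uvv = U[i][j] * V[i][j] ** 2       # reaction term
            Un[i][j] = U[i][j] + dt * (Du * lap(U, i, j) - uvv + F * (1 - U[i][j]))
            Vn[i][j] = V[i][j] + dt * (Dv * lap(V, i, j) + uvv - (F + k) * V[i][j])
    return Un, Vn

U, V = step(U, V)
```

The kernel is memory-bound because each update reads several neighbors and writes one cell; this is what makes it a useful probe of Julia's generated LLVM IR versus hand-written HIP.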
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThe “Julia for HPC” birds-of-a-feather (BoF) session provides a gathering place for members of the high-performance computing (HPC) community with an interest in the Julia programming language. Julia proposes an integrated, end-to-end co-design development model as an LLVM front-end for science, to close the gap between high-productivity languages and the desired performance of traditional compiled languages on extremely heterogeneous systems.
We invite participants from academia, government, and industry to share and discuss their experiences, identify and learn about current opportunities and gaps. Potential topics include: community, adoption and support in leadership facilities, the Julia ecosystem, programming models and packages targeting HPC workflows.
Workshop
Quantum Computing
Software Engineering
W
DescriptionWe introduce JuliQAOA, a simulation package specifically built for the Quantum Alternating Operator Ansatz (QAOA). JuliQAOA does not require a circuit-level description of QAOA problems, or another package to simulate such circuits, instead relying on a more direct linear algebra implementation. This allows for increased QAOA-specific performance enhancements, as well as improved flexibility and generality. JuliQAOA is the first QAOA package designed to aid in the study of both constrained and unconstrained combinatorial optimization problems, and can easily include novel cost functions, mixer Hamiltonians, and other variations. JuliQAOA also includes robust and extensible methods for learning optimal angles. Written in the Julia language, JuliQAOA outperforms existing QAOA software packages and scales well to HPC-level resources.
Workshop
Education
State of the Practice
W
DescriptionWe introduce the sixth example in a series of assignments used in a Parallel Computing course to teach approaches to solving the same problem with different parallel programming models. This assignment is based on the K-means clustering algorithm. The program is intentionally designed to be straightforward and easily understandable for students, while also providing specific parallelization and optimization opportunities. It is a simpler example than the previously presented assignments in this series, focusing mainly on key base concepts that many students find complex to apply in a practical case: race conditions, reductions, and collective operations. It proposes a clear and guided parallelization and optimization strategy across the different programming models, and can be used to establish a solid foundation before tackling more advanced concepts or parallel structures. It was successfully used as a practical assignment in a Parallel Computing course in the third year of a Computer Engineering degree.
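The parallelization pattern the assignment targets can be sketched as follows. This is our own minimal illustration, not the course handout: each conceptual worker accumulates private partial sums and counts, which a final reduction merges, avoiding the race condition of updating shared centroid accumulators.

```python
def assign(point, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def kmeans_step(chunks, centroids):
    """One K-means iteration over data split into per-worker chunks."""
    K, D = len(centroids), len(centroids[0])
    partials = []                              # one private accumulator per worker
    for chunk in chunks:                       # conceptually parallel
        sums = [[0.0] * D for _ in range(K)]
        counts = [0] * K
        for p in chunk:
            c = assign(p, centroids)
            counts[c] += 1
            for d in range(D):
                sums[c][d] += p[d]
        partials.append((sums, counts))
    # Reduction: merge private accumulators, then recompute centroids.
    tot_s = [[sum(ps[c][d] for ps, _ in partials) for d in range(D)] for c in range(K)]
    tot_n = [sum(pc[c] for _, pc in partials) for c in range(K)]
    return [[tot_s[c][d] / tot_n[c] if tot_n[c] else centroids[c][d]
             for d in range(D)] for c in range(K)]

chunks = [[(0.0, 0.0), (0.0, 1.0)], [(5.0, 5.0), (5.0, 6.0)]]
print(kmeans_step(chunks, [(0.0, 0.0), (5.0, 5.0)]))
# -> [[0.0, 0.5], [5.0, 5.5]]
```

In OpenMP the merge maps naturally onto a reduction clause, and in MPI onto a collective such as Allreduce, which is why the assignment exercises all three base concepts at once.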
Workshop
Education
State of the Practice
W
DescriptionThis is the summary of a peachy parallel assignment centered on classifying objects based on a database of pre-classified objects; in particular, this assignment uses the k-nearest neighbors method. With the increasing popularity of data science and machine learning, data science assignments have become more engaging for students. In this particular case, we rely on existing databases of machine learning problems to provide real-world applications of the k-nearest neighbors algorithm. Because the databases are fairly large, the runtime of the algorithm is fairly slow, which makes the consideration of parallel computing natural. This incarnation of the assignment uses MapReduce MPI and was used in an upper-division parallel computing class. However, the assignment can be adapted as a CS1/CS2 assignment or as a Data Structures assignment.
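The core of the assignment can be sketched in a few lines. This is our own illustrative serial version, not the MapReduce MPI implementation used in class; the names and data are invented.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """train: list of (features, label); returns the majority label of the
    k training points nearest to query."""
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.9), "B"), ((4.9, 5.2), "B")]
print(knn_classify(train, (4.8, 5.0)))  # -> B
```

The distance computations over a large database are embarrassingly parallel (map), while selecting the k nearest and voting is a merge (reduce), which is what makes the MapReduce formulation natural.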
Workshop
Applications
Architecture and Networks
Data Movement and Memory
Heterogeneous Computing
Large Scale Systems
Middleware and System Software
W
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionThe rapid deployment of machine learning systems has raised various challenges, such as high computation cost and privacy/security concerns. In this talk, we will first discuss the current challenges and advances in efficient machine learning. We will present several machine learning accelerations achieved through algorithm-hardware co-design on various computing platforms such as GPUs, MCUs, and ReRAM. On the other hand, Machine-Learning-as-a-Service (MLaaS) provides cloud-based tools to mitigate the cost and risk of building individual ML platforms. Privacy-preserving machine learning (PPML) serves as a good solution to protect sensitive user data. However, the crypto-primitives it introduces come with high extra computation and communication overhead, potentially hindering the adoption of machine learning. We will present a systematic acceleration framework that enables low latency, high energy efficiency and accuracy, and security-guaranteed machine learning.
Workshop
Data Movement and Memory
Heterogeneous Computing
W
DescriptionThe size of large artificial intelligence (AI) models has increased by at least 100x in the past few years, which leads to memory consumption at the scale of hundreds of GBs and even TBs. The recent advance of heterogeneous memory (HM) provides a cost-effective approach to increasing memory capacity. Using external memory (e.g., CXL memory expansion or a GPU-like accelerator's memory) as an extension to GPU memory, we can build an HM to enable large-scale AI model inference and training without using extra GPUs to accommodate large memory consumption. However, not only does HM impose challenges on tensor allocation and migration, but it is also unclear how HM affects training/inference throughput. AI model workloads possess unique memory access patterns and data structures, which place demands on the promptness of data migration, load balancing, and tensor redundancy on the GPU. In this talk, I will discuss the work we have done to optimize the management of HM for large language models and graph neural networks. The key insight in our designs is to leverage AI domain knowledge to reconcile the tensions between multiple design targets (e.g., minimizing tensor migration volume and maintaining high system throughput). Finally, I will discuss the opportunities and challenges for future HM management in the era of large generative models.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionLeadership supercomputing increasingly requires consideration of sustainability (carbon emissions) and power cost (opex) beyond traditional architecture, system software, and application concerns. Power extrapolation of Top500 systems suggests that next-generation leadership systems will approach ~100 MW, with even higher power levels beyond.
The past decade has shown available power supply to be a real limit, with power restrictions already a reality in Japan, Germany, China, and the UK. These challenges affect the cloud as well, with numerous datacenter reductions stemming from power shortages. Further, the power opex of supercomputer systems has become a significant part of TCO/lifetime cost, limiting lifetime delivered capability. A concomitant problem, as power consumption grows and climate change accelerates, is growing pressure to reduce operational carbon footprint.
Leadership supercomputing in the modern era (2025+) requires large quantities of green power at low cost. How is this possible? We discuss three dimensions that must be considered:
1. Location: where supercomputing resources are sited shapes the power opportunity, and thus system design and operation
2. Power: the dynamics of the local power grid, both generation and local competing load, dictate power infrastructure design and how systems should be operated and applications scheduled/managed
3. Flexibility: the new challenge for operations and applications is flexibility in the time, shape, and location of computing use
We will describe the opportunities that are key to sustainable leadership supercomputing in the modern era (2025+). These opportunities enable a 90% reduction in power cost and even larger reductions in Scope 2 carbon emissions, and we describe the changes in supercomputer design as well as facilities design and operation required to achieve them. We are working with leaders in Japan and the US DOE to capture these opportunities.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThe SYCL programming model provides an open standard way to program heterogeneous systems in modern C++. Since the major SYCL2020 release, which added abstractions and features for HPC, SYCL has seen increased use in application domains needing large exascale-class machines, including fusion energy, molecular dynamics, and aerospace.
In this Birds of a Feather session, we will bring together the community of everyone using and developing SYCL applications and implementations. We will discuss future directions and seek feedback on priorities for SYCL-Next. A panel of SYCL experts, runtime/compiler implementers, and application specialists will lead an audience discussion and Q&A.
Birds of a Feather
Applications
TP
XO/EX
DescriptionDiverse big data, interdisciplinary science, ML/AI applications and in-situ computations necessitate knowledge representation. Knowledge, organized for machine understanding in graph form known as knowledge graphs, augments large-scale science. For example, biology and semantic web utilize large knowledge graphs. Utilizing AI, knowledge graphs enable natural language querying of linked information, semantic recommendation systems, and knowledge completion. HPC challenges abound including parallelizing queries, retrieval-efficient knowledge representation, and knowledge graph context-exploiting AI. This BoF will introduce big ideas, as lightning talks followed by discussion, and engage a general audience in a discussion of emerging research topics aiming to seed a community for collaboration.
Workshop
W
DescriptionDevelopment platforms tailored to domain sciences have the potential to improve users' productivity on an HPC cluster by smoothing its steep learning curve. These platforms also help abstract away certain practices the user must implement to get optimal performance out of the allocated resources. These objectives require pre-work, both on the systems side and at the application level. The presentation discusses first experiences prototyping Kubeflow and deploying it as a service shared by multiple users. The deployment was designed with HPC clusters or multi-node cloud instances as the target computational resources. Kubeflow is an open-source platform that makes deploying ML/DL workloads easy; it depends on Kubernetes. Kubeflow offers a simple UI for interactive computing, orchestration of workflows using Kubeflow Pipelines, and an intuitive interface for hyperparameter tuning experiments using Katib. These are attractive features when considering the ease of deploying software environments for model and workflow development for users in academic research settings, on the cloud and on university HPC clusters.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionThis paper introduces Laminar, a novel serverless framework based on dispel4py, a parallel stream-based dataflow library. Laminar efficiently manages streaming workflows and components through a dedicated registry, offering a seamless serverless experience. Leveraging large language models, Laminar enhances the framework with semantic code search, code summarization, and code completion. This contribution enhances serverless computing by simplifying the execution of streaming computations, managing data streams more efficiently, and offering a valuable tool for both researchers and practitioners.
Exhibitor Forum
Artificial Intelligence/Machine Learning
Fault Handling and Tolerance
Large Scale Systems
Programming Frameworks and System Software
TP
XO/EX
DescriptionIn this talk, Intel will describe how the oneAPI Rendering Toolkit (RenderKit) was enhanced for multi-architecture, multi-platform, large-scale workloads. We will showcase some of our work deployed on Argonne's Aurora supercomputer, one of the first exascale machines on the planet, and the challenges we met along the way. The high-performance capabilities of the machine's 20,000 Sapphire Rapids HBM CPUs and 60,000 Intel Data Center GPU Max (aka Ponte Vecchio) devices, including the GPUs' accelerated ray tracing hardware, are exercised through RenderKit's use of Intel's oneAPI Data Parallel C++ (SYCL) implementation.
This talk will also introduce the newly updated architecture of OSPRay, the Open, Scalable, and Portable Ray Tracing Engine. It will highlight OSPRay's multi-GPU capabilities on workloads including surface and volume rendering of LANL's Deepwater Impact asteroid data and Argonne's stellar radiation data sets, which were performance-tuned in collaboration with Argonne. This will include OSPRay's performance within tools like Kitware's ParaView and LLNL's VisIt.
Furthermore, we will share our approach to solving problems in large-scale rendering and the future performance opportunities we will continue to pursue. We will also discuss how our efforts in rendering at scale have enabled emerging industry segments like digital twins on HPC infrastructure.
ACM Gordon Bell Finalist
Awards
TP
DescriptionAb initio electronic structure has remained dichotomous between achievable accuracy and length scale. Quantum many-body (QMB) methods realize quantum accuracy but fail to scale. Density functional theory (DFT) scales favorably but remains far from quantum accuracy. We present a framework that breaks this dichotomy through three interconnected modules:
(i) invDFT: a methodological advance in inverse DFT linking QMB methods to DFT;
(ii) MLXC: a machine-learned density functional trained with invDFT data, commensurate with quantum accuracy;
(iii) DFT-FE-MLXC: an adaptive higher-order spectral finite-element (FE) based DFT implementation that integrates MLXC with efficient solver strategies and HPC innovations in FE-specific dense linear algebra, mixed-precision algorithms, and asynchronous compute-communication.
We demonstrate a paradigm shift in DFT that not only provides an accuracy commensurate with QMB methods in ground-state energies, but also attains an unprecedented performance of 659.7 PFLOPS (43.1% peak FP64 performance) on 619,124 electrons using 8,000 GPU nodes of Frontier supercomputer.
Paper
Accelerators
Applications
Modeling and Simulation
TP
DescriptionStructural dynamics simulation plays an important role in research on reactor design and complex engineering. The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method combined with Newmark method is an efficient way to solve large-scale structural dynamics problems. However, the sparse direct solver and the load imbalance caused by inconsistent density models are two critical issues limiting the performance and the scalability of structural dynamics computing. For the former, we propose an efficient variable-size batched method to accelerate SpMV on GPUs. For the latter, we establish an online performance prediction model, based on which we then design a novel inter-cluster subdomain fine-tuning algorithm to balance the workload of HTFETI parallel computing. We are the first to achieve the high-fidelity structural dynamics simulation of China Experimental Fast Reactor core assembly with up to 53.4 billion grids. The weak and strong scalability efficiencies reach 91.77% and 86.13% on 12,800 GPUs, respectively.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionAs a rule, Top 500 class supercomputers are extensively benchmarked as part of their acceptance testing process. However, barring publicly posted LINPACK / HPCG results, most benchmark results are often inaccessible outside the hosting institution. Moreover, these higher level benchmarks do not provide easy answers to common questions such as “What is the realizable memory bandwidth?” or “What is the launch latency on the accelerator?” To partially address these issues, we executed selected single-node micro-benchmarks — focused on latencies and memory bandwidth — on every US Department of Energy system above rank 150 of the June 2023 Top 500 list, excepting NERSC’s Cori and ORNL’s Frontier TDS (now decommissioned or repurposed). We hope to provide an easy “first stop” reference for users of current Top 500 systems and inspire users and administrators of other Top 500 systems to similarly compile and make available benchmark results for their systems.
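A "realizable memory bandwidth" micro-benchmark of the kind described can be sketched as follows. This is a crude host-side illustration of the measurement idea (time a large copy, count read plus write traffic, keep the best of several repetitions), not the benchmark suite used in the study:

```python
import time

def copy_bandwidth_gbs(n_bytes=64 * 1024 * 1024, reps=5):
    """Estimate memory-copy bandwidth by timing bytes -> bytearray copies.

    Each copy reads the source and writes the destination, so the traffic
    counted is 2x the buffer size. The best (fastest) repetition is kept,
    as is conventional for bandwidth micro-benchmarks.
    """
    src = bytes(n_bytes)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        dst = bytearray(src)          # one full read + one full write
        best = min(best, time.perf_counter() - t0)
        del dst
    return 2 * n_bytes / best / 1e9

print(f"~{copy_bandwidth_gbs():.1f} GB/s")
```

Real micro-benchmarks (STREAM-style triads, accelerator launch-latency loops) control for caching, NUMA placement, and compiler optimization far more carefully than this sketch does.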
Invited Talk
Artificial Intelligence/Machine Learning
HPC Infrastructure
TP
DescriptionThe National AI Initiative Act of 2020 established the National AI Research Resource (NAIRR) Task Force to investigate establishing a national infrastructure for AI research. After 18 months, the task force produced a report detailing a vision for the NAIRR as a widely accessible US national infrastructure comprising a set of federated resources, including high-performance computing, cloud computing, testbeds, software, and datasets, with accompanying user support and educational and training materials. The overarching goal of the envisioned NAIRR is to strengthen and democratize the U.S. AI innovation ecosystem by spurring innovation, increasing the diversity of talent in AI, improving US AI R&D capacity, and advancing trustworthy AI, including increasing research opportunities in critical areas such as testing and evaluation, bias mitigation, AI safety and privacy.
Today, a US Government Interagency Working Group led by OSTP and NSF is deploying a NAIRR pilot to demonstrate the value, capabilities, and impact of the NAIRR concept, reach broad communities, expose technical issues early, and test drive the proposed NAIRR governance structure. This talk will describe the latest status and plans for the NAIRR pilot.
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
DescriptionThe sparse module of the popular SciPy Python library is widely used across applications in scientific computing, data analysis, and machine learning. The standard implementation of SciPy is restricted to a single CPU and cannot take advantage of modern distributed and accelerated computing resources. We introduce Legate Sparse, a system that transparently distributes and accelerates unmodified sparse matrix-based SciPy programs across clusters of CPUs and GPUs, and composes with cuNumeric, a distributed NumPy library. Legate Sparse uses a combination of static and dynamic techniques to performantly compose independently written sparse and dense array programming libraries, providing a unified Python interface for distributed sparse and dense array computations. We show that Legate Sparse is competitive with single-GPU libraries like CuPy and the industry-standard PETSc library on up to 1280 CPU cores and 192 GPUs of the Summit supercomputer, while offering the productivity benefits of idiomatic SciPy and NumPy.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionIPv6 is quickly becoming the dominant protocol on the internet. As the global transition from IPv4 to IPv6 continues, many ISPs now see over 50% of their traffic carried via IPv6. SCinet22 saw wireless IPv6 traffic ranging from 35% to 55%. This BoF continues the engagement from SC22 with discussions centered on international migration efforts, cybersecurity, HPC, IPAM, and real-time IPv6 usage from SCinet23! Join our discussion on the efforts, implications, and challenges of transitioning HPC, data centers, and networks. Ask questions, provide updates, and hear from others about their real-world experience - learn all the ways you can embrace IPv6.
Workshop
Education
State of the Practice
W
DescriptionThe training of new and existing HPC practitioners is recognized as a priority in the HPC community. Traditionally, HPC system administrator training has been delivered through in-person workshops, using cloud-based services or remote hardware to provide the compute resources needed to emulate an HPC system.
We have identified several challenges associated with the reliance on cloud-based services for hosting HPC training workshops, including: class size is limited by the available compute resources provided on the hosted resource; the training is a non-starter without available cloud resources; the hosted resources are temporary.
To address these fundamental problems with the traditional cloud-hosted HPC labs, and by following lessons learned from MOOC and educational methodology for developing HPC training, we have produced a reproducible, offline-capable, self-paced HPC virtual training lab that emulates a basic 3-node compute cluster on a trainee’s local machine without the need for any high-end computing resources or cloud infrastructure.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionThe recent development of large language models (LLMs) with multi-billion parameters, coupled with the creation of user-friendly application programming interfaces (APIs), has paved the way for automatically generating and executing code in response to straightforward human queries. This paper explores how these emerging capabilities can be harnessed to facilitate complex scientific workflows, eliminating the need for traditional coding methods. We present initial findings from our attempt to integrate Phyloflow with OpenAI's function-calling API, and outline a strategy for developing a comprehensive workflow management system based on these concepts.
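A function-calling integration of this kind registers a JSON-schema description of each workflow step with the LLM API; the model then returns the chosen function name plus structured arguments for the workflow system to execute. A rough sketch of such a tool description follows; the `submit_alignment_job` name and its parameters are hypothetical illustrations, not Phyloflow's actual API:

```python
# An OpenAI-style function/tool description expressed as a JSON schema.
# The model picks a function and emits JSON arguments matching
# `parameters`; the workflow engine validates and runs the call.
submit_alignment_job = {
    "name": "submit_alignment_job",          # hypothetical step name
    "description": "Align a set of sequences and return a job handle.",
    "parameters": {
        "type": "object",
        "properties": {
            "fasta_path": {"type": "string",
                           "description": "Input FASTA file"},
            "threads": {"type": "integer", "minimum": 1},
        },
        "required": ["fasta_path"],
    },
}

print(sorted(submit_alignment_job["parameters"]["required"]))  # ['fasta_path']
```

One schema per workflow step lets a plain-English query be mapped onto a chain of typed, executable calls without hand-written glue code.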
Tutorial
Accelerators
Applications
TUT
DescriptionThe past few years have witnessed a surge in the number of advanced network adapters, known as "SmartNICs", that offer additional functionalities beyond standard packet processing capabilities. These devices often feature programmable lightweight processing cores, FPGAs, and even CPU- and GPU-based platforms capable of running separate operating systems. Though primarily aimed at data center operations, such as infrastructure management, packet filtering, and I/O acceleration, SmartNICs are increasingly being explored for high-performance computing (HPC) application acceleration.
This tutorial offers an in-depth exploration of the state-of-the-art for SmartNICs and the emerging software ecosystems supporting them. Attendees will engage in hands-on exercises to better understand how to use SmartNICs for HPC application acceleration, including MPI collective operation offloading, OpenMP offload, and algorithmic modifications to maximize on-board processing power. Participants will have the opportunity to execute these exercises using cutting-edge SmartNICs like NVIDIA's BlueField-3 Data Processing Unit (DPU). The tutorial presenters will discuss additional techniques for optimizing applications to harness SmartNICs as communication accelerators in HPC systems.
Paper
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
TP
DescriptionThis paper presents the core concepts of the widely-distributed combination technique, which allows us to use the compute power and memory of more than one HPC system for the same simulation. We apply the sparse-grid combination technique to a six-dimensional advection problem serving as a proxy for plasma simulations. The full-grid solution approximated by the combination technique would contain ≈5ZB if computed with conventional grid-based methods. The combination-technique simulation operates on ≈988GB plus the supporting sparse grid data structures. We propose a new approach to divide the compute load, requiring only 76GB to be exchanged. Based on this, we have realized the first synchronous grid-based simulation using two HPC systems, the Tier-0 supercomputers Hawk and SuperMUC-NG. The simulation is computed at an average overhead of ≈35% (108s per combination step) for file-I/O and transfer. The presented concepts apply to any pair of HPC systems if high-speed data transfer is possible.
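For reference, the classical sparse-grid combination technique underlying this approach builds its approximation from solutions on many anisotropic full grids of multi-index level $\vec{l}$; in one common convention (the paper generalizes and distributes this scheme), the combined solution at level $n$ in $d$ dimensions is

```latex
f^{(c)}_{n}(\vec{x}) \;=\; \sum_{q=0}^{d-1} (-1)^{q} \binom{d-1}{q}
  \sum_{\lvert \vec{l} \rvert_{1} \,=\, n + (d-1) - q} f_{\vec{l}}(\vec{x}),
```

so each component grid is far coarser than the full grid, which is what reduces the ≈5 ZB full-grid footprint to the ≈988 GB reported above.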
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionLoop modifications are critical steps of code optimization. They can allow other techniques to be used, improve performance by reducing memory traffic with better data locality, or reduce loop overhead. Unfortunately, the error-prone index arithmetic calculations involved can make these modifications cumbersome for a human programmer. These modifications also cannot be implemented using a text-based tool, such as sed, because they require semantic knowledge about the keywords and symbols used. Compilers also may not implement all of these transformations, or may not always apply them if correctness or profitability analyses are inconclusive.
Previously, we presented an approach to code rewriting, MARTINI, which exposes complex, semantics-driven rewrite capabilities, based on the program's abstract syntax tree (AST), to users in a simple and natural way. In this paper, we show how this approach can be used to implement source-based loop optimizations with only before-and-after code samples written in the source language.
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionPerformance portability is a major concern on current architectures. One way to achieve it is through autotuning. In this paper, we present how we extended a just-in-time compilation infrastructure with autotuning capabilities triggered at run-time. When a function is executed, the first iterations optimize it; once the best variant has been found, it is used for all subsequent calls to the function. This just-in-time autotuning infrastructure is well suited to computation kernels that are called many times with similar parameters during execution: it re-optimizes kernels when they are called with different parameters, and the programmer can retrieve the optimal parameters to reuse them for other kernels. We present an experimental performance evaluation of our approach. Compiling the code introduces an overhead during the first iterations, and this overhead is compensated for during subsequent iterations. We also observed that the optimum found seems stable and accurate.
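The run-time strategy described (explore variants during the first calls, then pin the fastest for all later calls) can be sketched as a dispatcher; this is a generic illustration of the idea, not the paper's JIT compiler infrastructure:

```python
import time

def autotuned(variants):
    """Return a callable that times each candidate implementation once
    during the first calls, then always dispatches to the fastest one."""
    state = {"best": None, "trials": []}

    def call(*args):
        if state["best"] is not None:        # tuning finished: fast path
            return state["best"](*args)
        variant = variants[len(state["trials"])]
        t0 = time.perf_counter()
        result = variant(*args)              # measure this variant once
        state["trials"].append((time.perf_counter() - t0, variant))
        if len(state["trials"]) == len(variants):
            state["best"] = min(state["trials"], key=lambda t: t[0])[1]
        return result

    return call

# Two functionally equivalent kernels with very different costs.
fast = lambda n: n * (n - 1) // 2
slow = lambda n: sum(range(n))
triangular = autotuned([slow, fast])
print([triangular(100000) for _ in range(3)])  # same value on every call
```

A JIT setting additionally recompiles each variant with different optimization parameters, and re-enters the exploration phase when call arguments change significantly.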
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionHPC developers often develop domain-specific languages and libraries to improve productivity. The software implementing these languages and libraries employs advanced C++ language techniques such as template metaprogramming. HPC-oriented libraries employing template metaprogramming techniques permit a substantial level of customization and portability across multiple environments. Although applications developed using these libraries can be performant, they may exhibit performance regressions. These regressions can be challenging for the compiler to identify and correct. Without an understanding of the compiler underlying the HPC-aligned library in use, or of the target hardware, such issues may remain undetected and unresolved.
META, a portable static analysis infrastructure, addresses these challenges by extending the LLVM compiler toolchain so that it can not only detect performance regressions but also make concrete suggestions about how best to modify an application written with C++ parallel template metaprogramming libraries.
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionThe rise of serverless computing introduced a new class of scalable, elastic, and highly available parallel workers in the cloud. Many systems and applications benefit from offloading computations and parallel tasks to dynamically allocated resources. However, developers of C++ applications find it difficult to integrate serverless functions due to complex deployment, lack of compatibility between client and cloud environments, and loosely typed input and output data. To enable single-source and efficient serverless acceleration in C++, we introduce Cppless, an end-to-end framework for implementing serverless functions that handles their creation, deployment, and invocation. Cppless is built on top of LLVM and requires only two compiler extensions to automatically extract C++ function objects and deploy them to the cloud. We demonstrate that offloading parallel computations from a C++ application to serverless workers can provide up to 30x speedup, requiring only minor code modifications and costing less than one cent per computation.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionMonte Cimone is a fully operational multi-blade computer prototype and hardware-software test-bed based on E4's RV007 blade system, which comprises the SiFive Freedom U740 SoC, a double-precision-capable, multi-core, 64-bit RISC-V CPU. In this talk, I will provide an overview of the current-generation Monte Cimone HPC system and describe our work preparing for our next-generation RISC-V HPC system.
Workshop
Architecture and Networks
Hardware Technologies
W
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionSophgo introduced the industry's first server-grade RISC-V CPU, the SG2042, which is helping RISC-V make strides in the high-performance computing arena. With a 9-12 stage pipeline design, out-of-order execution support, and a clock speed of up to 2GHz, the SG2042 features 16 clusters, each with a maximum of 4 cores, resulting in a single SoC chip boasting 64 cores and 64MB of shared L3 cache. With the collaboration of industry partners, research institutions, and the open-source community, SG2042 is accelerating the construction of the entire RISC-V software stack with high-performance hardware, ranging from operating systems to mathematical computing libraries.
Based on the SG2042, Sophgo will soon release the SG2044, which supports the RISC-V Vector 1.0 extension, as well as a high-performance RISC-V CPU based on SiFive IP cores. These offerings will accelerate the adoption of RISC-V in high-performance computing, data centers, and AI/ML scenarios, with a focus on delivering high performance, low power consumption, and affordability, thereby contributing to the growth of the RISC-V HPC ecosystem.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionDoug will give an overview of Austin-based InspireSemi’s disruptive next-generation Thunderbird compute accelerator targeting HPC and graph analytics applications. This RISC-V based “supercomputer-cluster-on-a-chip” packs 1,536 high-performance CPU cores (all FP64 double-precision, of InspireSemi's own design) onto a single SoC, and the initial product will be a PCIe card with 4 Thunderbird chips, delivering over 6,000 FP64 cores. For maximum, predictable performance and low latency, the CPU cores are all interconnected with a high-speed mesh network fabric that can connect up to 256 Thunderbird chips. After more than three years of customer-driven development, the chip is in final verification and will tape out at TSMC in November.
Workshop
W
DescriptionContainerization approaches based on namespaces offered by the Linux kernel have seen an increasing popularity in the HPC community both as a means to isolate applications and as a format to package and distribute them. However, their adoption and usage in HPC systems faces several challenges. These include difficulties in unprivileged running and building of scientific application container images directly on HPC resources, increasing heterogeneity of HPC architectures, and access to specialized networking libraries available only on HPC systems. These challenges of container-based HPC application development closely align with the several advantages that a new universal intermediate binary format called WebAssembly (Wasm) has to offer. These include a lightweight userspace isolation mechanism and portability across operating systems and processor architectures. This talk highlights the potential of using Wasm for packaging and distributing HPC applications and describes MPIWasm, a novel Wasm embedder that enables the high-performance execution of MPI-based HPC applications compiled to Wasm.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionThe Dynamic PicoProbe at Argonne National Laboratory is undergoing upgrades that will enable it to produce up to hundreds of gigabytes of data per day. While this data is highly important for both fundamental science and industrial applications, there is currently limited on-site infrastructure to handle these high-volume data streams. We address this problem by providing a software architecture capable of supporting large-scale data transfers to the neighboring supercomputers at the Argonne Leadership Computing Facility. To prepare for future scientific workflows, we implement two instructive use cases for hyperspectral and spatiotemporal datasets, which include: (i) off-site data transfer, (ii) machine learning/artificial intelligence and traditional data analysis approaches, and (iii) automatic metadata extraction and cataloging of experimental results. This infrastructure supports expected workloads and also provides domain scientists the ability to reinterrogate data from past experiments to yield additional scientific value and derive new insights.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionLLVM, winner of the 2012 ACM Software System Award, has become an integral part of the software-development ecosystem for optimizing compilers, dynamic-language execution engines, source-code analysis and transformation tools, debuggers and linkers and a whole host of programming-language and toolchain-related components. Now heavily used in both academia and industry, where it allows for rapid development of production-quality tools, LLVM is increasingly used in work targeted at high-performance computing. Research in, and implementation of, program analysis, compilation, execution, and profiling have clearly benefited from the availability of a high-quality, freely-available infrastructure on which to build. This workshop will focus on recent developments, from both academia and industry, that build on LLVM to advance the state-of-the-art in high-performance computing.
Workshop
Performance Optimization
W
DescriptionMulti-messenger astrophysics combines observations from multiple instruments to study transient astrophysical phenomena, many occurring at seconds-level timescales. To identify and precisely localize these events in the sky, current systems often search through extensive sensor data, requiring resource-intensive computation to achieve results on the timescale of the events themselves. We seek to reduce computational requirements so as to perform real-time event localization with limited computational resources suitable for an orbital platform.
This work studies the performance of a computational pipeline for real-time gamma-ray burst (GRB) detection and localization aboard the Antarctic Demonstrator for the Advanced Particle-astrophysics Telescope (ADAPT), a balloon-borne prototype for a space-based gamma-ray observatory supporting multi-messenger observations. ADAPT observes gamma-ray Compton scattering, then uses the pipeline to combine information from multiple photons to identify a GRB's source direction. We identify, model, and measure key uncertainties, then deploy instrumentation and computational improvements to reduce them, substantially improving localization accuracy.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionOptical Coherence Tomography (OCT) can be used as a fast and non-destructive technology for bacterial biofilm imaging. However, OCT generates approximately 100 GB per flow cell, which complicates storage and data sharing. Data reduction can ease these complications by reducing the overhead and the amount of data transferred. This work leverages the similarities between layers of OCT images to minimize the data and thereby improve compression. This paper evaluates five lossless and two lossy state-of-the-art compressors on the OCT data. The reduction techniques are evaluated to determine which compressor achieves the highest compression ratio while maintaining strong bandwidth and minimal image distortion. Results show that SZ with frame before pre-processing achieves the highest compression ratio of 204.6x at its higher error bounds. The maximum compression bandwidth of SZ at higher error bounds is ~41 MB/s, and in decompression bandwidth it is able to outperform ZFP.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionLustre is the leading open-source and open-development file system for HPC. Around two thirds of the top 100 supercomputers use Lustre. It is a community developed technology with contributors from around the world. Lustre currently supports many HPC infrastructures beyond scientific research, such as financial services, energy, manufacturing and life sciences. Lustre clients are available for broadly deployed instruction set architectures such as x86, POWER, and ARM.
At this BoF, Lustre developers, administrators, and solution providers will gather to discuss recent Lustre developments and challenges, including the role of Lustre in AI and its use in Cloud environments.
Birds of a Feather
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionThis BoF will spotlight the underemphasized role of inputs and data in machine learning (ML), contrasting the prevalent focus on hardware aspects. It invites the SC community to contribute insights in these areas: 1) the value proposition for data-centric AI in scientific computing; 2) foundation models for the long tail of science; 3) the role of benchmarks in data-centric AI. To foster interactive dialogue, we will facilitate discussions, conduct live polling, and arrange short breakout sessions. These activities will enable participants to delve into the practical implications of data-centric AI, benchmarking, and contributing to scientific foundation models.
Tutorial
Cloud Computing
Middleware and System Software
TUT
DescriptionAre you new to the world of HPC and are trying to find an affordable and accessible way that you can learn, practice and experiment? Do you miss the days when learning about HPC was connecting a few grey boxes together and configuring a cluster? Do you wish you could transfer all the complexity inherent in production HPC systems into an accessible sandbox environment, designed to facilitate teaching and experimental development? Stop wishing and come explore Magic Castle with this tutorial!
Magic Castle is open-source software that replicates the HPC infrastructure experience on community or commercial cloud resources. It is easy to deploy, and a cluster can be created in minutes. Once their cluster is deployed, the user is provided with a complete HPC cluster software environment including the scheduler, a data-transfer node, JupyterHub, and thousands of software applications compiled by experts and accessible via CVMFS. Since its initial public release in 2018, Magic Castle has been used for thousands of workshops and tutorials world-wide.
In this tutorial, you will learn how to deploy a virtual HPC cluster on your preferred cloud resource in minutes, and fully customize your environment to suit your application, whether that be training, development, or practice.
Workshop
Quantum Computing
Software Engineering
W
DescriptionWe present a work-in-progress to create a software toolchain that links the Quantum Intermediate Representation (QIR) to the hardware-agnostic execution framework XACC. The novelty of this work is an implementation of the QIR specification for use in the XACC framework by translating the QIR to an XACC intermediate representation (XIR) and then illustrating how this toolchain will be able to represent both quantum and classical computations. The anticipated impact includes the expansion of quantum programming languages that can be executed on quantum computing hardware and the utilization of QPU hardware agnostic properties of XACC framework.
Tutorial
Cloud Computing
Distributed Computing
TUT
DescriptionThe modern scientific software stack includes thousands of packages, from C, C++, and Fortran libraries, to packages written in interpreted languages like Python and R. HPC applications may depend on hundreds of packages spanning all of these ecosystems. To achieve high performance, they must also leverage low-level and difficult-to-build libraries such as MPI, BLAS, and LAPACK. Integrating this stack is extremely challenging. The complexity can be an obstacle to deployment at HPC sites and deters developers from building on each other's work.
Spack is an open source tool for HPC package management that simplifies building, installing, customizing, and sharing HPC software stacks. Its adoption has grown rapidly: it is used by end-users, developers, clouds, and the world's largest HPC centers. Spack provides a powerful and flexible dependency model, a simple Python syntax for writing package build recipes, and a repository of over 7,000 packages maintained by a community of over 1,100 contributors. This tutorial provides an introduction to Spack's capabilities: installing and authoring packages, integrating Spack with development workflows, and deploying software at HPC facilities. Attendees will learn foundational skills for automating day-to-day tasks, as well as deeper knowledge of Spack for advanced use cases.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionIn recent history, GPUs became a key driver of compute performance in HPC. With the installation of the Frontier supercomputer, they became the enablers of the exascale era; further large-scale installations are in progress (Aurora, El Capitan, JUPITER). But the early dominance of NVIDIA and its CUDA programming model has changed: the current HPC GPU landscape features three vendors (AMD, Intel, NVIDIA), each with native and derived programming models. The choices are ample, but not all models are supported on all platforms, especially if support for Fortran is needed; in addition, some restrictions might apply. It is hard for scientific programmers to navigate this abundance of choices and limits. This presentation gives a guide by matching the GPU platforms with supported programming models, presented in a concise table and further elaborated in detailed comments. An assessment is made regarding the level of support of a model on a platform.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionDesigning large-scale geological carbon capture and storage projects and ensuring safe long-term CO2 containment - as a climate change mitigation strategy - requires fast and accurate numerical simulations. These simulations involve solving complex PDEs governing subsurface fluid flow using implicit finite-volume schemes widely based on Two-Point Flux Approximation (TPFA). This task is computationally and memory expensive, especially when performed on highly detailed geomodels. In most current HPC architectures, memory hierarchy and data management mechanisms are insufficient to overcome the challenges of large-scale numerical simulations. Therefore, it is crucial to design algorithms that can exploit alternative and more balanced paradigms, such as dataflow and in-memory computing. This work introduces an algorithm for TPFA computations that exploits a dataflow architecture, such as the Cerebras CS-2, which helps significantly minimize memory bottlenecks. Our implementation achieves a two-orders-of-magnitude speedup compared to multiple reference implementations running on the latest generations of NVIDIA GPUs.
Students@SC
DescriptionIn today's fast-paced world, your ability to communicate your value and make a lasting impression in a short time can be a game-changer. Join our workshop, “Finding the Right Elevator Pitch and Practicing It,” and gain the skills to captivate your audience and seize opportunities. The workshop covers crafting a powerful elevator pitch and tailoring it to your audience, and includes interactive practice sessions.
Tutorial
Accelerators
Software Engineering
Task Parallelism
TUT
DescriptionWith the increasing prevalence of multi-core processors, shared-memory programming models are essential. OpenMP is a popular, portable, widely supported, and easy-to-use shared-memory model. Since version 3.0, released in 2008, OpenMP offers tasking to support the creation of composable parallel software blocks and the parallelization of irregular algorithms. Developers usually find OpenMP easy to learn. However, mastering the tasking concept of OpenMP requires a change in the way developers reason about the structure of their code and how to expose its parallelism. Our tutorial addresses this critical aspect by examining the tasking concept in detail and presenting patterns as solutions to many common problems.
We assume attendees understand basic parallelization concepts and know the fundamentals of OpenMP. We present the OpenMP tasking language features in detail and focus on performance aspects, such as introducing cut-off mechanisms, exploiting task dependencies, and preserving locality. All aspects are accompanied by extensive case studies. If accepted as a full-day tutorial, we will include hands-on sessions. Throughout all topics, we present the recent additions of OpenMP 5.1 and 5.2 and comment on the developments targeting OpenMP 6.0.
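The cut-off mechanism the tutorial highlights — spawn tasks near the top of the recursion, fall back to serial execution below a threshold — can be sketched language-agnostically. The snippet below is plain Python with `concurrent.futures`, not OpenMP; it only illustrates the control-flow pattern that `#pragma omp task` plus a cut-off condition would express in C/C++/Fortran:

```python
# Sketch of the task "cut-off" pattern: spawn parallel tasks only above
# a depth threshold, run serially below it. Plain Python, not OpenMP;
# this illustrates the control flow, not the OpenMP syntax.
from concurrent.futures import ThreadPoolExecutor

CUTOFF_DEPTH = 1  # below this recursion depth, stop creating tasks

def fib_serial(n):
    return n if n < 2 else fib_serial(n - 1) + fib_serial(n - 2)

def fib_task(executor, n, depth=0):
    if n < 2:
        return n
    if depth >= CUTOFF_DEPTH:
        return fib_serial(n)          # cut-off: no more task creation
    # Analogous to "#pragma omp task": submit the two subproblems.
    left = executor.submit(fib_task, executor, n - 1, depth + 1)
    right = executor.submit(fib_task, executor, n - 2, depth + 1)
    return left.result() + right.result()  # analogous to "taskwait"

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        print(fib_task(pool, 10))  # 55
```

Without the cut-off, each tiny subproblem would become its own task and scheduling overhead would dominate; the threshold trades exposed parallelism against per-task cost.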
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionVendor libraries are tuned for a specific architecture and are not portable to others. Moreover, they lack support for heterogeneity and multi-device orchestration, which is required for efficient use of contemporary HPC and cloud resources. To address these challenges, we introduce MatRIS, a multilevel math library abstraction for scalable and performance-portable sparse/dense BLAS/LAPACK operations using the IRIS runtime. The MatRIS-IRIS co-design introduces three levels of abstraction to make the implementation completely architecture agnostic and to provide highly productive programming. We demonstrate that MatRIS is portable without any change in source code and can fully utilize multi-device heterogeneous systems, achieving high performance and scalability on Summit, Frontier, and a CADES cloud node equipped with four NVIDIA A100 GPUs and four AMD MI100 GPUs. A detailed performance study demonstrates MatRIS's multi-device scalability. MatRIS provides competitive, and often better, performance than vendor and other third-party libraries.
Workshop
Applications
Distributed Computing
Large Scale Systems
Programming Frameworks and System Software
Runtime Systems
W
DescriptionLarge-scale HPC workflows are increasingly implemented in dynamic languages such as Python, which allow for more rapid development than traditional techniques. However, the cost of executing Python applications at scale is often dominated by the distribution of common datasets and complex software dependencies. As the application scales up, data distribution becomes a limiting factor that prevents scaling beyond a few hundred nodes. To address this problem, we present the integration of Parsl (a Python-native parallel programming library) with TaskVine (a data-intensive workflow execution engine). Instead of relying on a shared filesystem to provide data to tasks on demand, Parsl is able to express advance data needs to TaskVine, which then performs efficient data distribution at runtime. This combination provides a performance speedup of 1.48x over the typical method of on-demand paging from the shared filesystem, while also providing an average task speedup of 1.79x with 2048 tasks and 256 nodes.
Paper
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
TP
DescriptionThe challenge of executing extensive graph analyses in-memory intensifies with growing graph sizes. This has given rise to disk-based external graph analytics systems that prioritize cost-effective HDDs/SSDs over pricier memory solutions. In response to this issue, our paper introduces and assesses the MBFGraph external graph system. This system leverages millions of Bloom filters within 1KB or 2KB graph data blocks to diminish graph analysis execution delays. Through our innovative MBF-query and MBF-construct algorithms, MBFGraph utilizes these Bloom filters as approximate indices, enabling the reading of only pertinent sections of dynamic graph data, thereby facilitating scalable analytics. Our tests revealed that, on a 475GB graph, MBFGraph cut down the execution duration of BFS and PageRank by 24% and 60%, respectively, using a mere 4GB of memory. This is in comparison to a sequential, tailored-for-workload, disk-based external graph analytics system.
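The "Bloom filters as approximate indices" idea rests on a standard property: a Bloom filter can say "definitely absent" (safe to skip a data block) or "possibly present" (must read it), never a false negative. A minimal textbook Bloom filter in plain Python — a generic sketch, not the paper's MBF-query/MBF-construct implementation — looks like this:

```python
# Minimal Bloom filter: an approximate set-membership index, as used
# (at much larger scale, one per data block) by systems like MBFGraph
# to skip irrelevant blocks. Generic textbook sketch, not the paper's code.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive k independent bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # False => definitely absent (skip the block);
        # True  => possibly present (must read the block).
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))

if __name__ == "__main__":
    bf = BloomFilter()
    bf.add("vertex:42")
    print(bf.might_contain("vertex:42"))   # True
    print(bf.might_contain("vertex:999"))  # almost certainly False
```

The false-positive rate only costs extra block reads, never missed data, which is what makes the structure safe as an I/O-skipping index.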
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionCome and learn from the leaders of the professional societies focused on HPC from ACM, IEEE, and SIAM! Your SIGHPC, TCPP, and SIAG-SC representatives invite SC23 participants to join this cross-society BoF to learn about joint societies' efforts to promote collaborations, discuss the status of HPC as a community, and engage the audience to address common challenges.
Workshop
Accelerators
Compilers
Data Movement and Memory
Heterogeneous Computing
Performance Optimization
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWe provide an automated framework that utilizes complex hardware links while preserving the simplified abstraction level for the user. Through the decomposition of user-issued memory operations into architecture-aware sub-tasks, we automatically exploit generally underused connections of the system. The operations we support include moving, distribution, and consolidation of memory across the node. For each of them, our Auto-Strategyzer framework proposes a task graph that transparently improves performance, in terms of latency or bandwidth, compared to naive strategies. For our evaluation, we integrated Auto-Strategyzer as a C++ library into the LLVM-OpenMP runtime infrastructure. We demonstrate that some memory operations can be improved by a factor of 5x compared to naive versions. Integrated into LLVM/OpenMP, Auto-Strategyzer accelerates cross-device memory movement by a factor of 1.9x for large transfers, resulting in an approximately 6% end-to-end execution-time decrease for a scientific proxy application.
Workshop
Quantum Computing
Software Engineering
W
DescriptionWe introduce a highly memory-efficient state vector simulation of quantum circuits premised on data compression, harnessing the capabilities of both CPUs and GPUs. We have elucidated the inherent challenges in architecting this system, while concurrently proposing our tailored solutions. Moreover, we have delineated our preliminary implementation and deliberated upon the potential for integration with other GPU-oriented simulators. In forthcoming research, we aim to present a more comprehensive set of results, bolstering the assertion of the efficacy and performance of our approach.
Posters
Research Posters
TP
XO/EX
DescriptionScientific workflows execute a series of tasks where each task may consume data as an input and produce data as an output. Within these workflows, tasks often produce intermediate results that may serve as inputs to subsequent tasks within the workflow. These results can vary in size and may need to be transported to another worker node. Data movement can become the primary bottleneck for many scientific workflows; thus, minimizing the cost of data movement can provide a significant performance benefit for a given workflow. Distant futures enable transfers between worker nodes, eliminating the need for intermediate results to pass through a centralized manager for future task invocations. Additionally, asynchronous transfers enable increased concurrency by preventing the blocking of task invocations. This poster shows the performance benefit received from the implementation of distant futures within a workflow that produces numerous intermediate results.
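The futures pattern described above can be sketched with Python's standard library: a downstream task receives the upstream *future* rather than the materialized data, so the intermediate result never has to round-trip through the submitting manager before the next task can be scheduled. This is a generic `concurrent.futures` sketch of the idea, not the poster's workflow system:

```python
# Generic sketch of passing futures between tasks so an intermediate
# result need not flow back through a central manager before the next
# task starts. Plain concurrent.futures, not the poster's system.
from concurrent.futures import ThreadPoolExecutor

def produce():
    # Stand-in for a task that creates a large intermediate result.
    return list(range(1000))

def consume(upstream_future):
    # The consumer resolves the future itself (the "distant future"
    # idea), rather than the manager collecting and re-sending data.
    data = upstream_future.result()
    return sum(data)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        intermediate = pool.submit(produce)
        # Submit the consumer immediately; the manager never blocks on
        # or touches the intermediate result.
        total = pool.submit(consume, intermediate)
        print(total.result())  # 499500
```

In a distributed setting the future would additionally carry the location of the data so the transfer happens worker-to-worker; the control-flow shape, however, is the same.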
Paper
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
TP
DescriptionAccommodating long-running deep learning (DL) training and inference jobs is challenging on GPU clusters that use traditional batch schedulers, such as Slurm. Given fixed wall clock time limits, DL researchers usually need to run a sequence of batch jobs and experience long interruptions on overloaded machines. Such interruptions significantly lower the research productivity and QoS for services that are deployed in production. To mitigate the issues from interruption, we explore a set of machine learning and reinforcement learning techniques to design a proactive provisioner. We examine the generality of the method using production job traces from three GPU clusters. We validate the effectiveness and generality of our proactive provisioner using the validation trace of each cluster. Our experiments show that the proposed resource provisioner safeguards 23%-76% of jobs with zero interruption across varying load levels on the three clusters.
Paper
Post-Moore Computing
Quantum Computing
TP
DescriptionWe introduce a technique for the suppression of state-dependent and correlated measurement errors, which are commonly observed on modern superconducting quantum devices. Our method leverages previous results, establishing that correlated errors tend to be physically localized on quantum devices to perform characterizations over the coupling map of the device, and to join overlapping measurement calibrations as a series of sparse matrices. We term this "Coupling Map Calibration". We quantitatively demonstrate the advantages of our proposed error mitigation system design across a range of current IBM quantum devices. Our experimental results on common benchmark circuits demonstrate up to a 41% reduction in the error rate without increasing the number of executions of the quantum device required when compared to conventional error mitigation methods.
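The baseline the paper improves on — conventional measurement-error mitigation — inverts a measured calibration (confusion) matrix and applies it to observed counts. The single-qubit sketch below illustrates only that baseline idea with a hypothetical confusion matrix; the paper's contribution, joining many sparse overlapping calibrations over the device coupling map, is not shown here:

```python
# Generic readout-error mitigation on one qubit: invert a 2x2
# confusion (calibration) matrix and apply it to measured counts.
# Illustrative baseline only; the confusion-matrix entries are
# hypothetical, and the paper's method joins many such sparse
# calibrations across the device's coupling map.
def invert_2x2(m):
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical calibration: rows are prepared states, columns measured.
confusion = [
    [0.97, 0.03],   # prepared |0> -> read 0 with prob 0.97
    [0.05, 0.95],   # prepared |1> -> read 1 with prob 0.95
]

def mitigate(counts):
    # counts = [n_measured_0, n_measured_1].
    # Measured = confusion^T @ true, so solve for the true counts.
    inv = invert_2x2([[confusion[0][0], confusion[1][0]],
                      [confusion[0][1], confusion[1][1]]])
    return [inv[0][0] * counts[0] + inv[0][1] * counts[1],
            inv[1][0] * counts[0] + inv[1][1] * counts[1]]

if __name__ == "__main__":
    # 1000 shots of a state that is truly |0>:
    print(mitigate([970, 30]))  # ~[1000.0, 0.0]
```

Inverting one dense matrix over all n qubits scales as 2^n; characterizing errors only where qubits are physically coupled keeps the calibration sparse, which is the motivation for the coupling-map approach.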
Birds of a Feather
Applications
TP
XO/EX
DescriptionWhat if we have been oversolving in computational science and engineering for decades? Are low precision arithmetic formats only for AI workloads? How can HPC applications exploit mixed-precision hardware features? This BoF invites the HPC community at large interested in applying mixed precisions into their workflows and discussing the impact on time-to-solution, memory footprint, data motion, and energy consumption. Experts from scientific applications/software libraries/hardware architectures will briefly provide the context on this trendy topic, share their own perspectives, and mostly engage with the audience via a set of questions, while gathering feedback to define a roadmap moving forward.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionThis course introduces high-school students to machine learning and NLP concepts using high-performance computing resources. Age-appropriate teaching strategies and a rapid shift to self-directed learning are emphasized. The current course diverges from undergraduate coursework in both its format and intended audience. First, the course occurs within the context of a two-week science and technology summer program called the Summer Institute (SI), sponsored by the Ohio Supercomputer Center to expand educational opportunities surrounding high-performance computing and STEM fields more generally. The compressed timeline requires teaching adaptations and a limited scope. Secondly, students are high-school aged (9-12th grades), and the educational aims are engagement and recruitment as much as content learning. Extra attention is placed on making the course age-appropriate and age-accessible; applicability of these strategies to undergraduate-level pedagogy is discussed. Positive student feedback and well-executed oral presentations demonstrate successful learning outcomes in both engagement and content learning.
Birds of a Feather
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionMachine learning applications are rapidly expanding into scientific domains and challenging the hallmarks of traditional high performance computing workloads. We present MLPerf, a community-driven system performance benchmark which spans a range of machine learning tasks. The speakers at this BoF are experts in the fields of HPC, science applications, machine learning, and computer architecture, representing academia, government research organizations, and private industry. In this session, we will cover the past year’s development within the MLPerf organization and provide an update on the latest round of submissions to MLPerf-HPC benchmark suite to solicit input from interested parties within the HPC community.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionOne of the novel features of the Fujitsu A64FX CPU is the sector cache. This feature enables hardware-supported partitioning of the L1 and L2 caches and allows the programmer control of which partition is used to place data in. This paper performs an in-depth study of how to apply the sector cache to a frequently used sparse matrix-vector multiplication (SpMV) kernel. A performance model based on reuse analysis is used to better understand situations where the sector cache leads to improved reuse and to predict the cache behavior. The model correctly predicts the number of L2 cache misses within 2–3% for sequential and parallel SpMV with 48 threads using a collection of 490 sparse matrices. Further experiments show the effect of various sector cache configurations on performance. A median speedup of about 1.05× is achieved, whereas the maximum speedup is about 1.6×.
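For readers unfamiliar with the kernel under study, a plain CSR-format SpMV looks as follows; the sector-cache placement itself is an A64FX hardware feature and is not representable in portable code, but the reuse pattern the paper analyzes (streaming matrix data once, repeatedly reusing the input vector) is visible here:

```python
# Plain CSR sparse matrix-vector multiply, y = A @ x — the kernel
# studied in the paper. The matrix entries stream through cache once
# per SpMV; x is the reused operand a sector-cache partition would
# aim to keep resident.
def spmv_csr(row_ptr, col_idx, values, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y

if __name__ == "__main__":
    # 3x3 matrix [[2,0,1],[0,3,0],[4,0,5]] in CSR form:
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [2.0, 1.0, 3.0, 4.0, 5.0]
    print(spmv_csr(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```

Because the accesses to `x` follow the column-index pattern of the matrix, whether `x` stays cached depends on the sparsity structure — which is why a reuse-analysis model over many matrices is needed to predict when partitioning helps.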
Posters
Research Posters
TP
XO/EX
DescriptionIn the past year, a large number of large language model (LLM)-based tools for software development have been released. These tools have the capability to assist developers with many of the difficulties that arise from the ever-growing complexity in the software stack. As we enter the exascale era, with a diverse set of emerging hardware and programming paradigms, developing, optimizing, and maintaining parallel software is becoming burdensome for developers. While LLM-based coding tools have been instrumental in revolutionizing software development, mainstream models are not designed, trained, or tested on High Performance Computing (HPC) problems. We present an LLM fine-tuned on HPC data and demonstrate its effectiveness in HPC code generation, OpenMP parallelization, and performance modeling.
Doctoral Showcase
Posters
Heterogeneous Computing
Software Engineering
TP
DescriptionModern HPC hardware is becoming increasingly heterogeneous and diverse in the exascale era. The diversity of hardware and software stacks adds additional development challenges to high performance simulations. One common development approach is to re-engineer the code for each new target architecture in order to maximize performance. However, this re-engineering effort is no longer practical due to increasingly heterogeneous hardware. Adding support for a single family of GPUs alone poses a significant challenge. Supporting each major vendor's hardware and software stacks takes valuable developer time away from optimizing and enhancing simulation capabilities. Moving forward, the community must modernize the code development process in order to achieve the greatest scientific output.
In this work, we examine the challenges posed by emerging heterogeneous hardware, including developing performance-portable code, leveraging hardware features targeting AI/ML for HPC applications, and managing limited I/O resources while checkpointing. To address these challenges, we present a modernization approach for scientific software that attains high performance and portability across architectures using the Kokkos portability framework, together with optimizations to memory layout, sorting algorithms, and vectorization; leverages alternative number formats such as half-precision and fixed-point to maximize usage of the limited memory on GPUs and enable larger simulations; and reduces I/O overhead and storage requirements by identifying and eliminating spatial-temporal redundancy in application data.
Birds of a Feather
Energy Efficiency
State of the Practice
Sustainability
TP
XO/EX
DescriptionModular and container-based industrial structures for HPC buildings are now common. Resulting CapEx reductions include shorter design-build schedules and commodity pricing of the structural envelope, while flexibility for expansion and upgradability is enhanced. Typical HPC life cycles for power, cooling, and compute machinery are highly varied and require constant modification and renovation of facilities. Commodity structures can reduce this problem. Replacing concrete with steel and creating vertically stacked compute racks might allow 3-D cube compute architectures with low-latency communication and high accessibility for servicing. The transition from air to liquid cooling will drive this change.
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionThe lattice Boltzmann method is a highly scalable Navier-Stokes solver that has been applied to flow problems in a wide array of domains. However, the method is bandwidth-bound on modern GPU accelerators and has a large memory footprint. In this paper, we present new 2D and 3D GPU implementations of two different regularized lattice Boltzmann methods, which are not only able to achieve an acceleration of ~1.4× w.r.t. reference lattice Boltzmann implementations but also reduce the memory requirements by up to 35% and 47% in 2D and 3D simulations respectively. These new approaches are evaluated on NVIDIA and AMD GPU architectures.
Posters
Research Posters
TP
XO/EX
DescriptionVlasiator is a popular and powerful massively parallel code for accurate magnetospheric and solar wind plasma simulations. This work provides an in-depth analysis of Vlasiator, focusing on MPI performance using the Integrated Performance Monitoring (IPM) tool. We show that MPI non-blocking point-to-point communication accounts for most of the communication time. The communication topology shows a large number of MPI messages exchanging data in a six-dimensional grid. We also show that relatively large messages are used in MPI communication, reaching up to 256MB. As a communication-bound application, we found that using OpenMP in Vlasiator is critical for eliminating intra-node communication. Our results provide important insights for optimizing Vlasiator for the upcoming Exascale machines.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionMessage Passing Interface (MPI) plays a crucial role in distributed memory parallelization across multiple nodes. However, parallelizing MPI code manually, and specifically, performing domain decomposition, is a challenging, error-prone task. In this paper, we address this problem by developing MPI-RICAL, a novel data-driven, programming-assistance tool that assists programmers in writing domain-decomposition-based distributed memory parallelization code. Specifically, we train a supervised language model to suggest MPI functions and their proper locations in the code on the fly. We also introduce MPICodeCorpus, the first publicly available corpus of MPI-based parallel programs, created by mining more than 15,000 open-source repositories on GitHub. Experiments were conducted on MPICodeCorpus and, more importantly, on a compiled benchmark of MPI-based parallel programs for numerical computations that represent real-world scientific applications. MPI-RICAL achieves F1 scores between 0.87 and 0.91 on these programs, demonstrating its accuracy in suggesting correct MPI functions at appropriate code locations.
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
DescriptionThe evolution of high-performance computing toward diverse accelerators, including NVIDIA, AMD, Intel GPUs, and Habana Gaudi Accelerators, demands a user-friendly and efficient utilization of these technologies. While both GPU-aware MPI libraries and vendor-specific communication libraries cater to communication requirements, trade-offs emerge based on library selection across various message sizes. Thus, prioritizing usability, we propose MPI-xCCL, a Message Passing Interface-based runtime with cross-accelerator support for efficient, portable, scalable, and optimized communication performance. MPI-xCCL incorporates vendor-specific libraries with GPU-aware MPI runtimes, ensuring multi-accelerator compatibility while adhering to MPI standards. The proposed hybrid designs leverage the benefits of MPI and xCCL algorithms transparently to the end user. We evaluated our designs on various HPC systems using OSU Micro-Benchmarks and the TensorFlow deep learning framework with Horovod. On NVIDIA-GPU-enabled ThetaGPU, our designs outperformed Open MPI by 4.6x. On emerging Habana Gaudi-based systems, MPI-xCCL also delivered performance similar to vendor-provided communication runtimes.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionMPICH is a widely used, open-source implementation of the MPI message passing standard. It has been ported to many platforms and used by several vendors and research groups as the basis for their own MPI implementations. This BoF session will provide a forum for users of MPICH as well as developers of MPI implementations derived from MPICH to discuss experiences and issues in using and porting MPICH. Future plans for MPICH will be discussed. Representatives from MPICH-derived implementations will provide brief updates on the status of their efforts. MPICH developers will also be present for an open forum discussion.
Workshop
Programming Frameworks and System Software
W
DescriptionPerformance tuning of High-Performance Computing (HPC) applications depends on sophisticated tuning of parameters on diverse architectures. These parameters are made available by vendors through low-level dials such as Model-Specific Registers (MSRs). While the MSRs themselves provide a powerful mechanism for users to monitor and control processor features, accessing them is laborious due to a lack of standard interfaces and clear documentation. As a result, the burden of determining which MSRs to consider and how to fine-tune them for an application lies on the end user.
We present MSR-genie, an efficient and extensible query tool that reduces this user-level burden, allowing users to query bidirectionally across MSR lists as well as processor families and models, and providing guidance on appropriate bitmasks. The MSR-genie tool is open source and easily extensible, and we demonstrate its effectiveness with over thirty Intel processor models and over two thousand unique MSRs.
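The "guidance on appropriate bitmasks" above boils down to decoding and encoding bit fields within a 64-bit MSR value. The helpers below show the generic mask arithmetic; the register value and field positions are illustrative placeholders, not real MSR definitions or MSR-genie output:

```python
# Generic helpers for decoding/encoding a bit field in a 64-bit MSR
# value — the kind of bitmask arithmetic such a tool guides users
# through. The field positions and values below are illustrative
# placeholders, not real MSR definitions.
def read_field(msr_value, low_bit, high_bit):
    """Extract bits [low_bit, high_bit] (inclusive) from a 64-bit value."""
    width = high_bit - low_bit + 1
    mask = (1 << width) - 1
    return (msr_value >> low_bit) & mask

def write_field(msr_value, low_bit, high_bit, field):
    """Return msr_value with bits [low_bit, high_bit] replaced by field."""
    width = high_bit - low_bit + 1
    mask = ((1 << width) - 1) << low_bit
    return (msr_value & ~mask) | ((field << low_bit) & mask)

if __name__ == "__main__":
    msr = 0x1A00  # hypothetical raw register value
    print(hex(read_field(msr, 8, 15)))        # 0x1a
    print(hex(write_field(msr, 0, 7, 0x3F)))  # 0x1a3f
```

On Linux the raw values would typically come from the `msr` kernel module (`/dev/cpu/*/msr`), which requires root; the field layout for a given register and processor model is exactly the documentation gap such a tool addresses.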
Students@SC
DescriptionIn today's dynamic job market, discovering the ideal career path can be exciting and challenging. Our diverse panel of experts is here to guide you through the process, regardless of your background or career stage. Join us for a fascinating discussion and gain insights into your career journey across diverse industries!
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionWith increasing demand for AI in HPC, there has been an explosion in architectures, programming models, and AI frameworks. The already-daunting task of programming for heterogeneous systems has become even more challenging. This BoF, organized by IXPUG but not limited to Intel technology, will focus on portable programming across a wide variety of architectures running a diverse set of HPC and AI workloads.
This BoF will explore challenges, state-of-the-art approaches, and emergent best practices for programming across heterogeneous systems and novel architectures, identifying common principles and practices that enable development and maintenance of software across sites, architectures, and applications.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThe COVID-19 pandemic has highlighted the power of using computational methods for virtual drug screening. However, the molecular search space is enormous, and common protein docking methods are still computationally intractable without access to the world’s largest supercomputers. Instead, researchers are using AI methods to provide a powerful new tool to help guide docking campaigns. In such approaches, a lightweight surrogate model is trained and then used to identify promising candidates for screening. We present ParslDock, a Python-based pipeline using the Parsl parallel programming library and the K-Nearest Neighbors machine learning model to screen a huge space of candidate molecules against arbitrary receptors. We achieved a 38x speedup with ParslDock compared to a brute-force docking approach.
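The surrogate idea can be sketched with a toy k-nearest-neighbors predictor: estimate a candidate's docking score from the scores of the most similar already-docked molecules, then spend expensive docking only on the promising ones. The feature vectors and scores below are synthetic placeholders, and this pure-Python sketch stands in for the scaled-out Parsl pipeline, not ParslDock itself:

```python
# Toy k-NN surrogate: predict a docking-like score for a candidate
# from the scores of the most similar already-docked molecules.
# Features and scores are synthetic placeholders; the real pipeline
# applies the same idea at scale with Parsl.
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def knn_predict(train, candidate, k=3):
    # train: list of (feature_vector, score) pairs; lower score = better.
    nearest = sorted(train, key=lambda t: euclidean(t[0], candidate))[:k]
    return sum(score for _, score in nearest) / k

if __name__ == "__main__":
    docked = [([0.0, 0.0], 1.0), ([0.1, 0.0], 1.2),
              ([0.0, 0.1], 0.8), ([5.0, 5.0], 9.0)]
    # A candidate near the low-scoring cluster gets a low prediction,
    # so it would be prioritized for real docking.
    print(knn_predict(docked, [0.05, 0.05]))  # ~1.0
```

The speedup comes from the asymmetry: the surrogate evaluation is orders of magnitude cheaper than a docking simulation, so even a rough ranking prunes most of the search space.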
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionEfficient reduce and allreduce communication collectives are crucial building blocks in many workloads, including deep learning training, and have been optimized for various architectures. We provide the first systematic investigation of the reduce operation on the Cerebras Wafer-Scale Engine (WSE) using the Cerebras SDK. We improve upon existing reduce implementations by up to 5x in certain settings. We show that using at most three different implementations we can achieve performance at most 1.38x slower than an optimal reduction tree. Finally, we provide an allreduce that outperforms patterns like ring or butterfly by up to 2x.
We will (a) cover unique features of the Cerebras WSE, (b) introduce a model to accurately predict performance on the hardware, (c) discuss different reduce implementations, (d) analyze the results of running them using an accurate simulator and compare them against an optimal reduction tree, (e) show how to extend them to an efficient allreduce.
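The butterfly pattern mentioned above (also called recursive doubling) completes an allreduce among p ranks in log2(p) rounds: in round r, rank i exchanges its partial sum with rank i XOR 2^r. The simulation below is a generic textbook sketch of that pattern, not the Cerebras-specific implementation:

```python
# Simulated butterfly (recursive-doubling) allreduce among p ranks:
# in round r, rank i exchanges its partial sum with rank i XOR 2^r.
# Generic textbook pattern, not the WSE implementation.
def butterfly_allreduce(values):
    p = len(values)
    assert p & (p - 1) == 0, "p must be a power of two"
    vals = list(values)
    step = 1
    while step < p:                     # log2(p) rounds
        new_vals = list(vals)
        for rank in range(p):
            partner = rank ^ step       # pairwise exchange partner
            new_vals[rank] = vals[rank] + vals[partner]
        vals = new_vals
        step *= 2
    return vals  # every rank now holds the global sum

if __name__ == "__main__":
    print(butterfly_allreduce([1, 2, 3, 4]))  # [10, 10, 10, 10]
```

Unlike a ring allreduce, which needs O(p) steps, the butterfly finishes in O(log p) rounds at the cost of longer-distance exchanges — a trade-off whose balance differs on a 2D wafer-scale mesh, which is what makes the comparison in the poster interesting.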
Posters
Research Posters
TP
XO/EX
DescriptionNeoRodinia is a comprehensive benchmark suite developed from Rodinia, containing 23 real-world applications and 5 microbenchmarks. It addresses the limitations of Rodinia by optimizing OpenMP GPU offloading programs and introducing OpenACC variants. The evaluation involves thorough performance assessments on various hardware architectures and compilers, measuring execution time and memory usage. These evaluations offer valuable insights into parallel programming models and compiler choices, guiding optimization efforts and helping developers, especially beginners, make informed decisions.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionNetCDF's original design included a portable file format and an intuitive application programming interface (API). However, the current NetCDF framework and its derived libraries lack efficient support for querying and visualizing data subsets with low memory use and time cost. Therefore, a full-stack solution to handle and display multidimensional data frames in NetCDF must be developed to meet the research needs. In this project, a next-generation full-stack tool, “NetCDFaster,” was developed to accelerate the reading and viewing of NetCDF data. This tool was derived from serial and parallel interfaces built on MPI-IO. The test results showed that processing time and memory usage were significantly improved compared to conventional methods.
Tutorial
Architecture and Networks
Distributed Computing
TUT
DescriptionInfiniBand (IB), High-speed Ethernet (HSE), RoCE, Omni-Path, EFA, Tofu, and Slingshot technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems including clusters, datacenters, filesystems, storage, cloud computing, Big Data (Spark) and AI (Deep Learning and Machine Learning) environments. This tutorial will provide an overview of these emerging technologies, their offered architectural features, their current market standing, and their suitability for designing HEC systems. It will start with a brief overview of IB, HSE, RoCE, Omni-Path, EFA, Tofu, and Slingshot. In-depth overview of the architectural features of IB, HSE (including iWARP and RoCE), and Omni-Path, their similarities and differences, and the associated protocols will be presented. An overview of the emerging NVLink2, NVSwitch, AMD Infinity Fabric, Slingshot, and Tofu architectures will also be given. Next, an overview of the OpenFabrics stack and Libfabrics software stack to support a range of different interconnects will be provided. Hardware/software solutions and the market trends behind these networking technologies will be highlighted. Sample performance numbers of these technologies and protocols for different environments will be presented. Finally, hands-on exercises will be carried out for the attendees to gain first-hand experience of running experiments with high-performance networks.
Posters
Research Posters
TP
XO/EX
DescriptionThe computational bottleneck in many fluid simulations arises from solving the variable coefficient Poisson equation. To tackle this challenge, we propose a novel neural domain decomposition algorithm to accelerate its solution. Our approach hinges on two key ideas: first, using neural PDE solvers to approximate the solutions within subdomains, and second, ensuring continuity across subdomain boundaries by solving a Schur complement system derived from the cell-centered discretized Poisson equation. A distinct advantage of our approach lies in generating a large dataset consisting only of small-scale problems to train the subdomain solver. This trained model can subsequently be applied to problems with large and complex geometries. Moreover, by batching the independent subdomain solves, we achieve high GPU utilization with neural solvers compared to state-of-the-art numerical methods. In contrast to neural domain decomposition algorithms that rely on Schwarz overlapping methods, our optimization-based approach, coupled with neural PDE solvers, improves accuracy and performance.
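The interface coupling described above can be sketched as a plain linear-algebra Schur complement on a small 1D Poisson problem. Here direct solves stand in for the trained neural subdomain solver, and all function names are illustrative, not the authors' API.

```python
import numpy as np

def poisson_1d(n):
    """Tridiagonal 1D Poisson matrix with Dirichlet boundaries."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def schur_solve(A, b, idx1, idx2, idxg):
    """Solve A x = b with two subdomain interiors (idx1, idx2) and an
    interface (idxg) by eliminating the interiors via a Schur complement.
    The np.linalg.solve calls on A11/A22 are where a trained neural
    subdomain solver would be batched in."""
    A11 = A[np.ix_(idx1, idx1)]
    A22 = A[np.ix_(idx2, idx2)]
    Agg = A[np.ix_(idxg, idxg)]
    A1g, A2g = A[np.ix_(idx1, idxg)], A[np.ix_(idx2, idxg)]
    Ag1, Ag2 = A[np.ix_(idxg, idx1)], A[np.ix_(idxg, idx2)]
    # Interface system: S x_g = g  (continuity across subdomain boundaries)
    S = Agg - Ag1 @ np.linalg.solve(A11, A1g) - Ag2 @ np.linalg.solve(A22, A2g)
    g = b[idxg] - Ag1 @ np.linalg.solve(A11, b[idx1]) \
                - Ag2 @ np.linalg.solve(A22, b[idx2])
    x = np.empty_like(b, dtype=float)
    x[idxg] = np.linalg.solve(S, g)
    # Back-substitute into each (independent, hence batchable) subdomain.
    x[idx1] = np.linalg.solve(A11, b[idx1] - A1g @ x[idxg])
    x[idx2] = np.linalg.solve(A22, b[idx2] - A2g @ x[idxg])
    return x
```

Because the two subdomain solves are independent, they can be batched on a GPU, which is the source of the utilization advantage the abstract describes.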
Birds of a Feather
State of the Practice
TP
XO/EX
DescriptionExascale computing (EC) can process larger quantities of data faster than ever before and the technologies being developed can help accelerate innovation across the economy. Quantum-classical hybrid solutions have already gone beyond research environments into the business spheres. The first-generation EC projects in the USA and UK are soon ending.
Which tools and environments are emerging as most sought-after? How ready are we to answer the skills needs of computational researchers and business users? Do we have a clear competence framework? What are the needed skills to harness the promise and potential of emerging technologies?
Workshop
W
DescriptionCharliecloud, LANL’s lightweight unprivileged container implementation, has a new root emulation mode as of version 0.32. We use this mode to tell programs (usually distro package managers) that they have real root privileges, even though they are running as a normal, though containerized, user. The new mode uses the kernel’s seccomp(2) system call filtering: it first constructs a BPF program that specifies the allowed system calls, then intercepts certain privileged system calls, does absolutely nothing, and returns success to the calling program.
The advantages of this new mode are that it is simpler, faster, completely neutral to libc, and mostly neutral to distributions. The disadvantage is that even the most cursory consistency checks will fail, although most programs appear to do no checks at all. For the few programs that do check, such as apt/apt-get, the mode offers a hook to work around those checks.
This lightning talk will discuss how this new root emulation mode uses the kernel’s seccomp filter to create a new fully unprivileged container build approach, along with its advantages and disadvantages.
Exhibitor Forum
Architecture and Networks
Data Movement and Memory
Hardware Technologies
TP
XO/EX
DescriptionFujitsu, with over 60 years of processor development history, developed A64FX, which was employed in the Supercomputer Fugaku. Fugaku has significantly contributed to accelerating HPC simulations with its high performance and energy efficiency. However, as Artificial Intelligence enters practical use, there is an increasing need for high computing power to process various workloads in both cloud and edge environments. Additionally, the need for customer data protection is also increasing due to the use of AI. To address these issues, Fujitsu is leveraging its experience in A64FX development and has started developing FUJITSU-MONAKA, a many-core CPU based on the Arm instruction set architecture with Scalable Vector Extension version 2 (SVE2). FUJITSU-MONAKA aims to deliver high AI inference performance with superior energy efficiency, achieved through Fujitsu's own technologies. The goal is to achieve 10 times the power performance of A64FX. Furthermore, FUJITSU-MONAKA will support confidential computing, protecting customer data in memory through processor hardware. In this presentation, we will discuss Fujitsu's cutting-edge technologies that will be applied to FUJITSU-MONAKA, which will solve challenges facing the AI era.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionComputer and computational science are pivotal within the evolving STEM landscape. The projected growth of STEM careers, especially in computing, underscores their significance. However, the underrepresentation of minorities and women in computing fields remains a challenge. Oak Ridge National Laboratory (ORNL) hosted one of five U.S. Department of Energy Workforce Development for Teachers and Scientists Pathways Summer Schools in Summer 2023. The Next Generation Pathways to Computing (NGP) program brings high school students to ORNL to learn about careers in computing and work towards inspiring diverse participation. The five-week program curriculum imparted foundational coding skills, practical insights into HPC, and allowed experiential learning under ORNL staff guidance. NGP reached out to underserved schools in the East Tennessee region by offering resources for equitable access. NGP exemplified a comprehensive approach to bridging the diversity gap in computing, nurturing a future generation of STEM leaders equipped with essential skills and diverse perspectives.
Paper
Applications
Modeling and Simulation
DescriptionNeural network quantum state (NNQS) has emerged as a promising candidate for quantum many-body problems, but its practical applications are often hindered by the high cost of sampling and local energy calculation. We develop a high-performance NNQS method for ab initio electronic structure calculations. The major innovations include:
(1) A transformer based architecture as the quantum wave function ansatz;
(2) A data-centric parallelization scheme for the variational Monte Carlo (VMC) algorithm which preserves data locality and well adapts for different computing architectures;
(3) A parallel batch sampling strategy which reduces the sampling cost and achieves good load balance;
(4) A parallel local energy evaluation scheme which is both memory and computationally efficient;
(5) Study of real chemical systems demonstrates both the superior accuracy of our method compared to state-of-the-art and the strong and weak scalability for large molecular systems with up to 120 spin orbitals.
Tutorial
Accelerators
Performance Optimization
TUT
DescriptionThe gap between peak performance and application performance continues to widen. Paradoxically, poor node-level performance leads to highly scalable code, but at the price of increased overall time to solution. Consequently, valuable resources are wasted, often on a massive scale. If the user cares about time to solution on any scale, optimal performance on the node level is often the key factor. We convey the architectural features of current processor chips, multiprocessor nodes, and accelerators, as far as they are relevant for the practitioner. Peculiarities like SIMD vectorization, shared vs. separate caches, data transfer bottlenecks, and ccNUMA characteristics are introduced, and the influence of system topology and affinity on the performance of typical parallel programming constructs is demonstrated. Performance engineering and performance patterns are suggested as powerful tools that help the user understand the bottlenecks at hand and to assess the impact of possible code optimizations. A cornerstone of these concepts is the roofline model, which is described in detail, including useful case studies, limits of its applicability, and possible refinements. We also show how simple performance tools can support node-level performance analysis by providing the developer with useful information about the bottlenecks of their code.
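The roofline model mentioned above reduces to a one-line formula: attainable performance is the minimum of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch, using hypothetical machine numbers:

```python
def roofline(peak_flops, mem_bw_bytes, intensity_flops_per_byte):
    """Attainable performance (flop/s) under the naive roofline model."""
    return min(peak_flops, mem_bw_bytes * intensity_flops_per_byte)

# Hypothetical node: 3 Tflop/s peak, 200 GB/s memory bandwidth.
PEAK, BW = 3.0e12, 200.0e9

# Stream triad a[i] = b[i] + s * c[i] in double precision:
# 2 flops per 24 bytes of traffic (load b and c, store a).
triad = roofline(PEAK, BW, 2.0 / 24.0)   # memory-bound regime

# A compute-heavy kernel (intensity 100 flop/byte) hits the flat roof.
dense = roofline(PEAK, BW, 100.0)        # compute-bound regime
```

For this hypothetical node the ridge point sits at peak/bandwidth = 15 flop/byte: kernels below it are bandwidth-limited, kernels above it are compute-limited.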
Workshop
Programming Frameworks and System Software
W
DescriptionPower has recently become a significant limiting factor in supercomputing. It is imperative that future high-performance computing (HPC) systems are energy-efficient. Efficient system designs necessitate understanding the power signature of current supercomputing workloads. Additionally, that understanding will enable power efficiency improvements in existing systems. With advancements in power measurement and data collection techniques, many computing centers, including NERSC, collect a vast amount of data every second, including power usage and other performance metrics. However, accessing this data is not straightforward and can be time-consuming.
We propose NPAT, a power analysis tool that aims to provide easy access to power usage data on the web. It provides a quick and accessible way to view the power usage data of NERSC systems, jobs, and applications. Being implemented in PHP, Javascript, and Python with open-source libraries and modules, it promises effortless portability to other sites.
Exhibitor Forum
Architecture and Networks
Data Movement and Memory
Hardware Technologies
TP
XO/EX
DescriptionCompute Express Link (CXL), introduced in 2019, manages the Host-Accelerator coherency and the Host-Memory interface. CXL fabric further enables the disaggregated memory architecture. Most of the CXL developments are on the memory interface and not on the storage interface. In this paper, Wolley evaluates the impact of CXL on the storage interface.
NVMe protocol moves the data in a block form from a Device to a Host memory utilizing the PCIe as the transport. There are several attempts to minimize such Host-Device data movement which is an important factor of performance bottleneck and power consumption. One such effort is Computational Storage that moves the compute from the Host to the Device, and the Device only sends the computed result back to the Host at a much lower data rate.
Wolley proposes using NVMe over CXL (NVMeoC) to optimize the Host-Device data movement. In most applications, the Host only accesses a small portion of the entire block data retrieved from the Device. With NVMeoC, the Device keeps a CXL staging area that is managed as a part of the Host memory. Once the block data is moved to the CXL staging memory through NVMe operation, the actual Host-Device data movement using CXL.mem is just a fraction of the total block data. Wolley will compare in detail the latency and power consumption of NVMe over PCIe and NVMe over CXL across several different applications.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionGraph Neural Networks (GNNs) are powerful machine learning models that learn on graph data by extracting embeddings that represent vertex and edge features, as well as graph topology. With graph data scale increasing, and high memory pressure generated from GNN feature data, we turn to out-of-core training methods on many real world graphs. Current state-of-the-art methods for large-graph GNN training leverage mini-batches, distributed or parallel environments, and memory-aware partitioning and sampling. These methods however require custom training architectures and pipelines. Here, we propose Kirin, a framework for large-graph out-of-core training on a single machine with a single GPU on pre-sampled graphs. Kirin leverages Dragon-direct, allowing for NVMe-backed tensors for out-of-core training through driver managed allocations. Building on UVM, Dragon-direct utilizes a page-based unified memory system, resulting in memory-management that is largely invisible to the user. We showcase Kirin and analyze its performance and effectiveness for GNN workloads.
Birds of a Feather
Cloud Computing
TP
XO/EX
DescriptionCloud-native methods are increasingly used for HPC infrastructure. The advantages claimed include agility in system management and flexible support of new and evolving workflows.
In the last ten years, open cloud infrastructure has become widespread in scientific computing and OpenStack is the dominant open source cloud solution. The OpenStack Scientific SIG represents this community.
This session brings together leading practitioners of OpenStack and related technologies for open solutions in production operations. The session will present current use cases of cloud-native open infrastructure. The advantages and challenges of this approach will be presented. Attendees will be invited to share experiences.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionOpen MPI continues to drive the state of the art in HPC. This year, we've added new features, fixed bugs, improved performance, and collaborated with many across the HPC community. We'll discuss what Open MPI has accomplished over the past year and present a roadmap for the next year.
One of Open MPI's strengths lies in its diversity: we represent many different viewpoints across the HPC ecosystem. To that end, many developers from the community will be present to discuss and answer your questions both during and after the BoF.
Birds of a Feather
Middleware and System Software
TP
XO/EX
DescriptionThis BoF is meant to be an open discussion to guide the future roadmap for Open OnDemand (openondemand.org), by getting feedback from the community on the prioritization of the various tasks planned for the next few years. OOD is extremely relevant to ongoing discussions within the HPC community about user interfaces and science gateways. The session leaders, all part of the OOD development team, will jointly develop the content for the presentation in advance to ensure a wide range of viewpoints and topics are presented. We will also consult with our user advisory group in advance for their suggestions.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe last session of the workshop will be an open session for discussion about all presentations during the event.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThe OpenACC organization helps researchers and developers advance science by expanding their parallel computing skills and supporting a directive-based, high-level parallel programming model on CPUs, GPUs, and more. OpenACC supports over 25 global hackathons annually and has facilitated the acceleration of over 200 applications on multiple platforms (e.g., Frontier, Perlmutter, JUWELS, Summit, and Piz Daint). This BoF serves as a forum for OpenACC users, implementers, and the organization's officers to openly discuss the status of OpenACC and its community. Presentations will be given by OpenACC officers, compiler implementers, and invited users, followed by an open-mic discussion with the audience.
Workshop
State of the Practice
W
DescriptionIn the era of Natural Language Processing with Large Language Models, the OpenGPT-X project brings forth a platform for researching, producing, and using language models. The project is a German initiative with ten collaborative partners, focusing their efforts to contribute a multilingual Open Source language model. Models trained within the project will also be used for pilot cases by industry partners and commercialized through Gaia-X federation. The memory and compute for training large language models efficiently demands high performance computing systems like JUWELS Booster. This paper outlines the advancements and challenges in the project from the perspective of Jülich Supercomputing Centre and showcases the results of exploration of novel hardware architecture conducted within the scope of the project.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionOpening remarks for the AI4DEV Workshop
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionOpening remarks by the organizers
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionThis BoF is highly interactive and provides attendees with first-hand information from OpenMP implementers and language designers on the future of the OpenMP API. Lightning talks and discussion rounds will give BoF participants ample opportunity to learn and interact with OpenMP experts, ask questions, and provide community feedback. Sub-committee leaders of the OpenMP ARB will provide insight into the future of OpenMP, focusing on the upcoming release of the OpenMP API version 6.0 in November 2024 and the progress that has been made. Vendor representatives will discuss support and timelines for OpenMP features, and expert users will describe their journey.
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionIn this work, we introduce extensions to LLVM OpenMP, transforming it into a versatile and performance portable kernel language for GPU programming. These extensions allow for the seamless porting of programs written in kernel languages to high-performance OpenMP GPU programs with minimal modifications. To evaluate our extension, we implemented a proof-of-concept prototype that contains a subset of extensions we proposed. We ported six established CUDA proxy and benchmark applications and evaluated their performance on both AMD and NVIDIA platforms. By comparing with native versions (HIP and CUDA), our results demonstrate that OpenMP, augmented with our extensions, can not only match but also in some cases exceed the performance of kernel languages, thereby offering performance portability with minimal effort from application developers.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Birds of a Feather
Energy Efficiency
State of the Practice
Sustainability
TP
XO/EX
DescriptionOperational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. However, making the data work for HPC operations is not straightforward, and effort is duplicated at many HPC sites to develop methods and tools to analyze the data and leverage it for operations. There is a clear demand within the community to collaborate on this, but because standards for the semantics and naming of monitoring data are currently missing, such collaboration is severely hampered.
Workshop
State of the Practice
W
DescriptionForecasting space weather conditions in the Earth’s ionosphere is critical to protecting key infrastructure, such as satellite-based positioning and navigation systems, high-frequency radio communications, and the electric power grid. Variations in space weather are caused by coronal mass ejections from the Sun’s surface, energizing electrons in the ionosphere to produce disturbances in communications and electrical systems, as well as spectacular aurorae.
We present a system for operationalizing HPC tasks for data assimilation in space weather forecasting using Celery and Django. Celery is used to execute and distribute tasks asynchronously, while Django is a popular web framework. Our system integrates these tools to automate running space weather simulations on an HPC cluster for data assimilation and presenting outputs on a website in near real-time. Our system applies to a wide range of HPC tasks in research software, and we believe this is useful for researchers to operationalize similar workflows.
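The asynchronous-task pattern described above can be sketched with the standard library alone (Celery and Django themselves are not assumed here); the function and variable names, and the toy "assimilation" arithmetic, are purely illustrative, not the authors' API.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a results store that a web frontend would read from.
latest_results = {}

def run_assimilation(region, observations):
    """Toy data-assimilation step: nudge a fixed prior halfway toward
    the mean of the observations (illustrative arithmetic only)."""
    prior = 10.0
    analysis = prior + 0.5 * (sum(observations) / len(observations) - prior)
    latest_results[region] = analysis  # published for near-real-time display
    return analysis

# Submit simulation tasks asynchronously, as a Celery worker pool would.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = {region: pool.submit(run_assimilation, region, obs)
               for region, obs in {"europe": [12.0, 14.0],
                                   "pacific": [8.0]}.items()}
    results = {region: f.result() for region, f in futures.items()}
```

In the system the abstract describes, Celery plays the role of the executor (distributing tasks to workers on the cluster) and Django serves the accumulated results on a website.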
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionWe describe our experience porting FUN3D's CUDA-optimized kernels to Intel oneAPI SYCL. We faced several challenges, including the suboptimal performance of the oneAPI code on Intel's new data center GPU. The suboptimal performance of the oneAPI code was due to high register spills, memory latency, and poor vectorization. We addressed these issues by implementing the kernels using Intel oneAPI's Explicit SIMD SYCL extension (ESIMD) API. The ESIMD API enables the writing of explicitly vectorized kernel code, gives more precise control over register usage and prefetching, and better handles thread divergence compared to SYCL. The ESIMD code outperforms the optimized SYCL code by up to a factor of 3.6, depending on the kernel. We also compared the performance with the CUDA-optimized version on NVIDIA V100 and A100 GPUs. We found that the performance of a single tile of the Intel GPU using ESIMD exceeds that of the NVIDIA V100 and is similar to that of the NVIDIA A100.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionScientific workflows bridge scientific challenges with computational resources. While dispel4py, a stream-based workflow system, offers mappings to parallel enactment engines like MPI or Multiprocessing, its optimization primarily focuses on dynamic process-to-task allocation for improved performance. An efficiency gap persists, particularly with the growing emphasis on conserving computing resources. Moreover, the existing dynamic optimization lacks support for stateful applications and grouping operations.
To address these issues, our work introduces a novel hybrid approach for handling stateful operations and groupings within workflows, leveraging a new Redis mapping. We also propose an auto-scaling mechanism integrated into dispel4py's dynamic optimization. Our experiments showcase the effectiveness of auto-scaling optimization, achieving efficiency while upholding performance. In the best case, auto-scaling reduces dispel4py's runtime to 87% compared to the baseline, using only 76% of process resources. Importantly, our optimized stateful dispel4py demonstrates a remarkable speedup, utilizing just 32% of the runtime compared to the contender.
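An auto-scaling rule of the kind described can be illustrated in a few lines. The thresholds and the doubling/halving policy below are invented for illustration; they are not dispel4py's actual mechanism.

```python
def autoscale(pending, workers, min_workers=1, max_workers=16,
              scale_up_at=8, scale_down_at=2):
    """Illustrative auto-scaling rule for a stream-processing worker pool:
    grow the pool when the backlog per worker is high, shrink it when
    workers are mostly idle, and leave it alone in between."""
    per_worker = pending / max(workers, 1)
    if per_worker > scale_up_at:
        workers = min(workers * 2, max_workers)   # backlog high: scale up
    elif per_worker < scale_down_at:
        workers = max(workers // 2, min_workers)  # mostly idle: scale down
    return workers
```

Applied periodically against a queue-length metric (e.g. from a Redis list), such a rule trades a small amount of runtime for a large reduction in held processes, which is the efficiency/performance trade-off the abstract quantifies.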
Workshop
Performance Optimization
W
DescriptionDeep Learning models frequently produce high-confidence softmax outputs for out-of-distribution (OOD) inputs, which would ideally be classified as "I don't know". To enhance our model's trustworthiness, we incorporate selective classification, which entails abstaining from predictions in situations of doubt. This approach requires initial uncertainty estimation. Subsequently, instead of offering a singular prediction, we provide a distribution over predictions, enabling users to discern if the model is trustworthy or if consultation with a human expert is necessary. In this paper, we assess uncertainty in two baseline models: a Convolutional Neural Network (CNN) and a Vision Transformer (ViT). Leveraging these uncertainty values, we minimize errors by refraining from predictions during high uncertainty. Additionally, we evaluate these models across various distributed architectures.
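The abstention step described above (predict only when the model is sufficiently confident) can be sketched independently of the CNN/ViT backbones. The 0.9 threshold here is an arbitrary illustrative choice, not a value from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def predict_or_abstain(logits, threshold=0.9):
    """Selective classification: return the predicted class index when
    the top softmax probability clears the threshold, otherwise None
    ("I don't know", deferring to a human expert)."""
    p = softmax(np.asarray(logits, dtype=float))
    top = int(p.argmax())
    return top if p[top] >= threshold else None
```

More elaborate uncertainty estimates (ensembles, MC dropout) plug into the same interface: compute an uncertainty score, then abstain above a threshold.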
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionConvolution kernels are widely seen in deep learning workloads and are often responsible for performance bottlenecks. Recent research has demonstrated that a direct convolution approach can outperform the traditional convolution implementation based on tensor-to-matrix conversions. However, existing approaches for direct convolution still have room for performance improvement. We present NDIRECT, a new direct convolution approach that targets ARM-based multi-core CPUs commonly found in smartphones and HPC systems. NDIRECT is designed to be compatible with the data layout formats used by mainstream deep learning frameworks but offers new optimizations for the computational kernel, data packing, and parallelization. We evaluate NDIRECT by applying it to representative convolution kernels and demonstrating its performance on four distinct ARM multi-core CPU platforms. We compare NDIRECT against state-of-the-art convolution optimization techniques. Experimental results show that NDIRECT gives the best overall performance across evaluation scenarios and platforms.
Paper
Accelerators
Algorithms
Linear Algebra
TP
DescriptionWe detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform latency-sensitive factorization phases. We detail novel performance improvements such as a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, as well as several optimizations which hide MPI communication. We present some performance results of this implementation of the HPL benchmark on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as scaling to multiple nodes.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionIrregular communication limits both performance and scalability of parallel applications. Typically, it is implemented as point-to-point, and optimizations are integrated into the application, lacking portability. Optimization of point-to-point messages within MPI is difficult, as the interface only provides information on a piece of overall communication. However, persistent neighbor collectives expose a suitable interface for such optimizations.
This paper presents methods for implementing existing optimizations for irregular communication within neighborhood collectives, analyzes the impact of neighborhood collectives in Hypre BoomerAMG, and shows up to a 1.38x speedup on sparse matrix-vector multiplication using optimized neighbor collectives. The authors analyze three implementations of neighborhood collectives for Alltoallv: an unoptimized wrapper of standard point-to-point communication, and two locality-aware aggregating methods. The second exposes a non-standard interface to perform additional optimization for an additional 0.07x speedup.
Optimizations are available open-source in MPI Advance which wraps MPI, allowing use with any MPI installation.
Paper
Distributed Computing
Message Passing
Programming Frameworks and System Software
TP
Best Student Paper Finalist
DescriptionCollective communication operations, such as broadcasting and reductions, often contribute to performance bottlenecks in Message Passing Interface (MPI) programs. As the number of processor cores integrated into CPUs increases, running multiple MPI processes on shared-memory machines to leverage hardware parallelism is becoming increasingly common. In this context, optimizing MPI collective communications for shared-memory execution is crucial. This paper identifies two primary limitations of existing MPI collective implementations on shared-memory systems. The first is the extensive redundant data movement when performing reduction collectives, and the second is the ineffective use of non-temporal instructions for streamed data processing. To address these challenges, we propose two optimization techniques designed to minimize data movement and enhance the use of non-temporal instructions. We integrate our optimizations into OpenMPI and evaluate their performance through micro-benchmarks and real-world application tests on two multi-core clusters. Experiments show that our approach outperforms existing techniques by 1.2-6.4x.
Paper
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
TP
DescriptionReconfigurable optical topologies are a promising new technology to improve datacenter network performance and cope with the explosive growth of traffic. In particular, these networks make it possible to adaptively connect racks that currently exchange heavy traffic, making optimal use of the bandwidth by avoiding multi-hop forwarding.
This paper studies the dynamic optimization of such reconfigurable topologies, adapting to the traffic in an online manner. The underlying algorithmic problem can be described as an online maximum weight b-matching problem, a generalization of maximum weight matching where each node has at most b>=1 incident matching edges.
We make the case for a randomized approach to matching optimization. Our main contribution is an O(log b)-competitive algorithm, and we show that it is asymptotically optimal. This algorithm is exponentially better than the best possible deterministic online algorithm.
We complement our theoretical results with trace-driven simulations, based on real-world datacenter workloads.
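The problem setting can be illustrated with a simple greedy online b-matching in Python (a sketch of the problem only, not the paper's randomized O(log b)-competitive algorithm): each arriving request is matched if both endpoints still have fewer than b incident matching edges.

```python
# Illustrative greedy online b-matching: process edge requests (u, v, weight)
# in arrival order; accept a request only if both endpoints have fewer
# than b incident matching edges (the degree bound of b-matching).

from collections import defaultdict

def greedy_b_matching(requests, b):
    degree = defaultdict(int)
    matching = []
    for u, v, w in requests:
        if degree[u] < b and degree[v] < b:
            matching.append((u, v, w))
            degree[u] += 1
            degree[v] += 1
    return matching

# Rack-to-rack demands arriving online, with at most b=2 links per rack.
requests = [("r1", "r2", 5), ("r1", "r3", 3), ("r2", "r3", 4), ("r1", "r4", 2)]
m = greedy_b_matching(requests, b=2)
```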
Posters
Research Posters
TP
XO/EX
DescriptionDeep Learning (DL) methods have shown substantial efficacy in computer vision (CV) and natural language processing (NLP). Despite their proficiency, the inconsistency in input data distributions can compromise prediction reliability. This study mitigates this issue by introducing uncertainty evaluations in DL models, thereby enhancing dependability through a distribution of predictions. Our focus lies on the Vision Transformer (ViT), a DL model that harmonizes both local and global behavior. We conduct extensive experiments on the ImageNet-1K dataset, a vast resource with over a million images across 1,000 categories. ViTs, while competitive, are vulnerable to adversarial attacks, making uncertainty estimation crucial for robust predictions.
Our research advances the field by integrating uncertainty evaluations into ViTs, comparing two significant uncertainty estimation methodologies, and expediting uncertainty computations on high-performance computing (HPC) architectures, such as the Cerebras CS-2, SambaNova DataScale, and the Polaris supercomputer, utilizing the MPI4PY package for efficient distributed training.
Posters
Research Posters
TP
XO/EX
DescriptionDistributed scientific workflows are becoming data-intensive, and the data movement through storage systems often causes bottlenecks. Therefore, it is critical to understand data flow. Many scientific datasets incorporate domain semantics with formats like HDF and NetCDF, enhancing the interpretability and context of the data for analysis. We shed new light on workflow bottlenecks by understanding how semantic data sets flow through storage. We unveil a fresh perspective with careful runtime measurement, recovery of the mapping of domain semantics to low-level I/O operations, and effective visualization and analysis of semantic flows.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionThe widening gap between compute and I/O performance on modern HPC systems means that writing checkpoints to a parallel file system for fault tolerance is fast becoming a bottleneck to high performance. It is therefore vital that software is engineered such that it can achieve the highest proportion of available performance on the underlying hardware; this burden is often carried by I/O middleware libraries. In this paper, we outline such an I/O library based on a Log-structured Merge Tree (LSM-Tree), not just for metadata, but also for scientific data. We benchmark its performance using the IOR benchmark, demonstrating 2.4 to 76.7x better performance than alternative file formats such as ADIOS2 and HDF5, as well as the IOR baseline, when running on a Lustre parallel file system. We further demonstrate that when our LSM-Tree I/O library is used as a storage layer for ADIOS2, the resulting I/O library still outperforms the default ADIOS2 implementation by 1.5x.
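The core LSM-Tree idea — buffering writes in a sorted in-memory table and flushing it as an immutable sorted run — can be sketched as follows (an illustrative toy only, not the paper's I/O library; the class name and limit are invented for the example):

```python
# Minimal LSM-tree sketch: writes land in an in-memory "memtable"; when it
# fills up it is flushed as an immutable sorted run, turning random writes
# into sequential, log-structured ones.

import bisect

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}
        self.runs = []                     # immutable sorted (key, value) runs
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.runs.append(sorted(self.memtable.items()))
            self.memtable = {}             # flush: one sequential write

    def get(self, key):
        if key in self.memtable:           # newest data first
            return self.memtable[key]
        for run in reversed(self.runs):    # then newest run to oldest
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

db = TinyLSM(memtable_limit=4)
for i in range(5):
    db.put(f"k{i}", i)                     # 4th put triggers a flush
```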
Workshop
State of the Practice
W
DescriptionRemote work has been widely adopted by tech companies since the lockdown more than three years ago. However, some companies are now asking their employees to return to the office, with the promise of replacing formal meetings with chats next to the coffee machine.
In this talk, I'll show what we have done since the company was founded 11 years ago to build and hold together a team whose members are separated by many kilometers.
Workshop
Security
State of the Practice
W
DescriptionReliable authentication is a key component of all HPC systems. This paper discusses an approach that bypasses systemic authentication problems experienced by the authors, providing a simple and reliable way to manage service accounts and user groups for HPC centers using plain-text caches and alternatives to passwords.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionThe largest performance bottleneck and energy usage in neural network acceleration is the fetching of weight and activation values prior to general matrix-vector (GEMV) or general matrix-matrix (GEMM) computation. Traditional von Neumann architectures, even with large on-chip caches, consume as much as 90% of their energy in data movement and only 10% for actual calculations, which limits their energy efficiency to low single-digit TOPs/W in most cases. Analog in-memory compute, where the memory cell is used as part of the MAC calculation, suffers from accuracy issues and requires additional support circuitry, such as analog-to-digital and digital-to-analog converters and compensation, which negates the inherent low-power advantages, limiting the state of the art to 3 TOPs/W.
The novel Untether AI at-memory compute architecture stores all weights directly on-chip in specially designed low-power SRAM using high-density bit cells that are tuned to directly feed the processing elements (PEs) using minimal energy. Because the PEs are directly adjacent to the SRAM cells, it only uses 2 femtojoules per bit-access. This innovation represents an order of magnitude improvement over compiled memory cells, and three orders of magnitude compared to fetching weights from external DRAM.
Doctoral Showcase
Posters
Accelerators
TP
DescriptionThe imbalance between compute and memory bandwidth has been a long-standing issue. Despite efforts to address it, the gap between them is still widening. This has led to the categorization of many applications as memory-bound kernels.
This dissertation centers on memory-bound kernels, with a particular emphasis on Graphics Processing Units (GPUs), given their rising prevalence in High-Performance Computing (HPC) systems.
In this dissertation, we initially focus on the evolution trend of GPU development in the last decades. Examples include cooperative groups (i.e., device-wide barriers), asynchronous copy of shared memory (i.e., hardware prefetching), low(er) latency of operations, and larger volume of on-chip resources (register files and L1 cache).
This dissertation seeks to utilize the latest GPU features to optimize memory-bound kernels. Specifically, we propose extending the kernel's lifetime across the time steps and taking advantage of the large volume of on-chip resources (i.e., register files and scratchpad memory) in reducing or eliminating traffic to the device memory. Furthermore, we champion a minimum level of parallelism to maximize the available on-chip resources.
Based on these strategies, we propose a general execution model for running memory-bound iterative GPU kernels, PERsistent KernelS (PERKS), and a novel temporal blocking method, EBISU. Evaluations show outstanding performance on the latest GPU architectures compared with state-of-the-art counterpart implementations.
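The temporal-blocking idea underlying such approaches — applying several time steps to a tile while its data stays in fast memory, using a halo wide enough to keep the results correct — can be sketched for a 1D 3-point stencil (a CPU illustration of the concept only; PERKS and EBISU are GPU implementations with many optimizations not shown here):

```python
# Temporal blocking of a 1D 3-point stencil: each tile is loaded once with a
# halo of width t, then advanced t time steps locally, cutting round trips
# to slow memory. Cells at distance >= t from a tile cut remain exact.

def step(u):
    """One Jacobi-style smoothing step with fixed boundary values."""
    return [u[0]] + [(u[i - 1] + u[i] + u[i + 1]) / 3.0
                     for i in range(1, len(u) - 1)] + [u[-1]]

def run(u, t):
    """Reference: t global steps over the whole array."""
    for _ in range(t):
        u = step(u)
    return u

def blocked(u, t, tile):
    """Temporally blocked: t steps per tile, halo of width t per side."""
    n = len(u)
    out = list(u)
    for start in range(1, n - 1, tile):
        end = min(start + tile, n - 1)
        lo, hi = max(start - t, 0), min(end + t, n)   # halo-extended tile
        local = u[lo:hi]
        for _ in range(t):
            local = step(local)                       # stays "on chip"
        out[start:end] = local[start - lo:end - lo]   # keep valid interior
    return out

u0 = [float(i % 7) for i in range(32)]
```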
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionPanel discussion centering on the state of LLVM for HPC.
Paper
Accelerators
Algorithms
Linear Algebra
TP
Best Paper Finalist
DescriptionSparse direct solvers play a vital role in large-scale high performance computing in science and engineering. Existing distributed sparse direct methods employ multifrontal/supernodal patterns to aggregate columns of nearly identical forms and to exploit dense basic linear algebra subprograms (BLAS) for computation. We propose a new sparse direct solver called PanguLU. Our work relies on simpler regular 2D blocking and stores blocks in their sparse forms to avoid any extra fill-ins. Based on sparse patterns of blocks, a variety of block-wise sparse BLAS methods are developed and selected for higher efficiency on local GPUs. To make PanguLU more scalable, we also adjust mapping of blocks to processes for overall more balanced workload, and propose a synchronization-free communication strategy to reduce overall latency overhead. Experiments on two distributed heterogeneous platforms consisting of 128 A100 GPUs and 128 MI50 GPUs demonstrate that PanguLU achieves up to 11.70x and 17.97x speedups over SuperLU_DIST.
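A regular 2D block mapping can be sketched as follows (a common block-cyclic scheme shown for illustration; PanguLU's actual mapping is additionally adjusted for workload balance and may differ):

```python
# Illustrative 2D block-cyclic mapping: block (i, j) of a blocked matrix is
# owned by process (i mod P, j mod Q) on a P x Q process grid, spreading
# work across processes in both dimensions.

def owner(i, j, P, Q):
    return (i % P) * Q + (j % Q)       # flatten grid coordinates to a rank

# 4 x 4 blocks mapped onto a 2 x 2 process grid:
layout = [[owner(i, j, 2, 2) for j in range(4)] for i in range(4)]
```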
Posters
Research Posters
TP
XO/EX
DescriptionPanSim, a specialized agent-based model, was developed to analyze interventions against COVID-19. Implemented in C++ and Thrust, it is a highly performant and portable code. Here we focus on different algorithmic formulations for calculating cumulative values, like infectiousness, at different locations. A detailed comparison of time and efficiency on different CPUs and GPUs was conducted, revealing suboptimal parallel efficiency. The time to execute 704 simulations on each platform was evaluated, emphasizing overall throughput instead of latency for more taxing workloads. We benchmarked modern CPU and GPU architectures, revealing the superior performance of the NVIDIA A100 and AMD Genoa-X platforms. Additionally, the monetary cost associated with executing the simulations was analyzed, presenting a contrasting landscape in on-demand and spot pricing; the Ampere Altra platform emerged as the most cost-effective. The findings contribute to understanding the efficiency, time, and cost dynamics in modeling and provide insights for the practice of pandemic response planning.
Tutorial
Algorithms
Heterogeneous Computing
Message Passing
TUT
DescriptionThis tutorial provides a comprehensive overview of parallel computing, emphasizing those aspects most relevant to the user. It is suitable for new users, students, managers, and anyone seeking an overview of parallel computing. It discusses software and hardware/software interaction, with an emphasis on standards, portability, and systems that are widely available.
The tutorial surveys basic parallel computing concepts, using examples selected from multiple engineering, scientific, and machine learning problems. These examples illustrate using MPI on distributed memory systems; OpenMP on shared memory systems; MPI+OpenMP on hybrid systems; and CUDA and compiler directives on GPUs and accelerators. It discusses numerous parallelization and load balancing approaches, and software engineering and performance improvement aspects, including the use of state-of-the-art tools.
The tutorial helps attendees make intelligent decisions by covering the primary options that are available, explaining how the different components work together and what they are most suitable for. Extensive pointers to web-based resources are provided to facilitate follow-up studies.
Tutorial
Architecture and Networks
I/O and File Systems
TUT
DescriptionI/O on HPC systems is a black art. This tutorial sheds light on the state-of-the-art in parallel I/O and provides the knowledge necessary for attendees to best leverage I/O resources available to them. We cover the entire I/O software stack including storage and parallel file systems at the lowest layer, the role of NVRAM devices, intermediate layers (such as MPI-IO), and high-level I/O libraries (such as HDF-5). We emphasize ways to use these interfaces that result in high performance and tools for generating insight into these stacks.
Our first third of the tutorial covers parallel I/O fundamentals. We discuss storage technologies, both present and near-future and the major parallel and distributed file systems. We focus on application in our second third, connecting storage to our examination of the upper library layers of the I/O stack, covering MPI-IO, Parallel netCDF, and HDF5. Finally, we discuss tools for understanding I/O behavior.
Posters
Research Posters
TP
XO/EX
DescriptionDirect numerical simulation (DNS) is a technique that directly solves the fluid Navier-Stokes equations with high spatial and temporal resolutions. However, its utility in studying the high Reynolds number (Re) wall turbulence of particular interest is limited by the grid size (i.e., the memory and computation requirements), which grows rapidly as Re^3.
We present PowerLLEL, a high-performance finite difference solver tailored for the challenging DNS of incompressible wall turbulence at extreme scales. An adaptive multi-level parallelization strategy is proposed to fully exploit the multi-level parallelism of various architectures and enhance computational performance. The communication performance of global transpose and halo exchange is significantly improved by a tridiagonal solver based on the parallel diagonal dominant (PDD) algorithm and three RDMA-implemented communication optimizations. Strong scaling tests on the Tianhe-2A supercomputer show that PowerLLEL achieves nearly 92% parallel efficiency with up to 31,104 cores on a grid size of 143.3 billion.
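The local building block of PDD-style tridiagonal solvers is a sequential tridiagonal solve on each rank's sub-system; the classic Thomas algorithm can be sketched as follows (the cross-rank reduction step of PDD itself is not shown):

```python
# Sequential Thomas algorithm for a tridiagonal system Ax = d, where A has
# sub-diagonal a, main diagonal b, and super-diagonal c (a[0] and c[-1]
# unused). This is the per-rank kernel on which PDD-like schemes build.

def thomas(a, b, c, d):
    n = len(b)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Example: the classic [2, -1] Laplacian-like system with solution (1, 1, 1).
x = thomas([0, -1, -1], [2, 2, 2], [-1, -1, 0], [1, 0, 1])
```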
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionWe present a hybrid sequential/parallel symbolic Cholesky factorization algorithm that computes the sparsity pattern of the symbolic factors in parallel. We evaluate the performance on a large subset of the SuiteSparse matrix collection and multicore CPUs as well as flagship GPUs by AMD and NVIDIA, achieving speedups of an order of magnitude compared to a state-of-the-art sequential symbolic Cholesky factorization.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionThe top-K problem is an essential part of many important applications in scientific computing, information retrieval, etc. As data volume grows rapidly, high-performance parallel top-K algorithms become critical. We propose two parallel top-K algorithms for GPUs: AIR top-K (Adaptive and Iteration-fused Radix top-K) and GridSelect. AIR top-K employs an iteration-fused design to minimize CPU-GPU communication and device data access. Its adaptive strategy automatically eliminates unnecessary device memory traffic under various data distributions. GridSelect can process data on the fly. It adopts a shared queue and parallel two-step insertion to decrease the frequency of costly operations. We comprehensively compare 8 open-source GPU implementations and our methods for a wide range of problem sizes and data distributions. For batch sizes 1 and 100, respectively, AIR top-K shows 1.98-21.48x and 8.01-574.78x speedup over the previous radix top-K algorithm, and 1.44-7.34x and 1.38-31.91x speedup over state-of-the-art methods. GridSelect shows up to 882.29x speedup over its baseline.
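The basic radix-select idea behind such top-K algorithms can be sketched in Python for non-negative integers (a simplified CPU illustration only; AIR top-K's iteration fusion and adaptivity are GPU-specific and not shown):

```python
# Radix top-K sketch: bucket values on their current most-significant digit,
# keep whole buckets that fit in the top-k, and recurse only into the one
# "boundary" bucket that straddles the k-th largest value.

def radix_top_k(values, k, bits=8):
    result = []
    candidates = list(values)
    shift = max(v.bit_length() for v in candidates)
    shift = ((shift + bits - 1) // bits - 1) * bits    # top digit position
    while shift >= 0 and len(result) < k:
        buckets = {}
        for v in candidates:
            buckets.setdefault((v >> shift) & ((1 << bits) - 1), []).append(v)
        for digit in sorted(buckets, reverse=True):
            b = buckets[digit]
            if len(result) + len(b) <= k:
                result.extend(b)           # whole bucket fits in the top-k
            else:
                candidates = b             # boundary bucket: refine further
                break
        else:
            break
        shift -= bits
    # fill any leftover slots from the final boundary bucket
    result.extend(sorted(candidates, reverse=True)[: k - len(result)])
    return sorted(result, reverse=True)
```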
Workshop
Education
State of the Practice
W
DescriptionThe Nagel-Schreckenberg model is a stochastic one-dimensional traffic model. In this assignment, we guide students through the process of implementing a shared-memory parallel and reproducible version of an existing serial code that implements this model, and to analyze its scaling behavior.
One of the key elements in this traffic model is the presence of randomness, without which it would lack realistic phenomena such as traffic jams. Its implementation thus requires techniques associated with Monte Carlo simulations and pseudo-random number generation (PRNG). PRNGs are notoriously tricky to deal with in parallel when combined with the requirement of reproducibility.
This assignment was created for the graduate course PHY1610 Scientific Computing for Physicists at the University of Toronto, which had its origin in the training program of the SciNet HPC Consortium, and is also very suitable for other scientific disciplines. Several variations of the assignment have been used over the years.
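A minimal serial step of the Nagel-Schreckenberg model — the kind of code students start from before parallelizing — might look like this (a sketch with the standard model parameters vmax and braking probability p; the course's actual serial code may differ):

```python
# One Nagel-Schreckenberg update on a circular road: accelerate, brake to
# the gap behind the car ahead, randomly slow down with probability p,
# then move. Positions/velocities are read from the old state only.

import random

def nasch_step(pos, vel, road_len, vmax=5, p=0.3, rng=random):
    order = sorted(range(len(pos)), key=lambda i: pos[i])
    new_pos, new_vel = list(pos), list(vel)
    for idx, i in enumerate(order):
        ahead = order[(idx + 1) % len(order)]
        gap = (pos[ahead] - pos[i] - 1) % road_len   # free cells ahead
        v = min(vel[i] + 1, vmax)                    # accelerate
        v = min(v, gap)                              # brake: no collisions
        if v > 0 and rng.random() < p:               # random slowdown
            v -= 1
        new_vel[i] = v
        new_pos[i] = (pos[i] + v) % road_len
    return new_pos, new_vel

# Seeded PRNG makes the run reproducible, the assignment's key requirement.
rng = random.Random(42)
pos, vel = [0, 7, 14], [0, 0, 0]
for _ in range(20):
    pos, vel = nasch_step(pos, vel, road_len=20, rng=rng)
```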
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
DescriptionEmbedded devices, constrained by limited memory and processors, require deep learning models to be tailored to their specifications. This research explores customized model architectures for classifying drainage crossing images. Building on the foundational ResNet-18, this paper aims to maximize prediction accuracy, reduce memory size, and minimize inference latency. Various configurations were systematically probed by leveraging hardware-aware neural architecture search, accumulating 1,717 experimental results over six benchmarking variants. The experimental data analysis, enhanced by nn-Meter, provided a comprehensive understanding of inference latency across four different predictors. Significantly, a Pareto front analysis with three objectives of accuracy, latency, and memory resulted in five non-dominated solutions. These standout models showcased efficiency while retaining accuracy, offering a compelling alternative to the conventional ResNet-18 when deployed in resource-constrained environments. The presentation concludes by highlighting insights drawn from the results and suggesting avenues for future exploration.
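The three-objective Pareto-front analysis can be illustrated with a small filter over (accuracy, latency, memory) triples, where accuracy is maximized and latency and memory are minimized (an illustrative sketch, not the study's tooling; the sample values are invented):

```python
# Pareto-front filter: a model is non-dominated if no other model is at
# least as good on all three objectives and strictly better on one.

def pareto_front(models):
    def dominates(a, b):
        better_or_eq = a[0] >= b[0] and a[1] <= b[1] and a[2] <= b[2]
        strictly = a[0] > b[0] or a[1] < b[1] or a[2] < b[2]
        return better_or_eq and strictly
    return [m for m in models
            if not any(dominates(o, m) for o in models if o is not m)]

# (accuracy, latency_ms, memory_MB) for four hypothetical candidates:
models = [(0.92, 20, 11), (0.90, 12, 9), (0.88, 30, 15), (0.91, 12, 9)]
front = pareto_front(models)
```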
Posters
Research Posters
TP
XO/EX
DescriptionThe Leiden algorithm has demonstrated superior efficacy compared to traditional Louvain algorithms in the field of community detection. However, parallelizing the Leiden algorithm while imposing community size limitations brings significant challenges in big-data processing scenarios. We present ParLeiden, a pioneering parallel Leiden strategy designed for distributed environments. Using thread locks and efficient buffers, we effectively resolve community-joining conflicts and reduce communication overheads. We can run the Leiden algorithm on large-scale graphs and achieve speedups of up to 9.8x over the baselines.
Birds of a Feather
Education
TP
XO/EX
DescriptionDespite the quantity of existing training materials, acquisition and development of HPC skills is not straightforward enough to address the needs of the growing and diversifying HPC community. To address this, the HPC teaching and training ecosystem must mirror the growth and diversification of the HPC community and technologies. This BoF creates an opportunity to gather the user/learner community perspectives and explore new requirements in order to identify new entry points and build well-defined learning pathways that more accurately represent the aims of the user/learner community and changing technology landscape. We encourage those interested in HPC training to attend.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionPatterns and Anti-Patterns in Migrating from Legacy Workflows to Workflow Management Systems
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionDeveloping large scientific applications is challenging for many reasons, and alternative programming models can provide better support for their implementation. These applications need to incorporate the latest domain-specific scientific information, be applicable to real-world problems, and be robust across a wide variety of inputs. For many models, parallelization is seen only as a necessary hardship, and once developed, the parallelization logic is left untouched, sometimes for decades. Alternative parallel programming implementations such as Coarray Fortran, Chapel, or UPC++ promise to make implementing and maintaining such parallel logic easier; however, their alternative nature often means they lack the support in the operational HPC community needed to make that dream a reality. Here we discuss some of the problems that arise when implementing the parallelization logic for two different models in Coarray Fortran. We highlight an example from the Intermediate Complexity Atmospheric Research (ICAR) model in which the Partitioned Global Address Space (PGAS) programming model of Coarray Fortran made the parallel generation of a massive lookup table almost trivial to implement. We then discuss issues that have arisen since the initial implementation due to inconsistencies in compiler implementations. Improving the support for such parallel frameworks is a bit of a chicken-and-egg problem: compiler writers do not wish to devote resources to features that are not widely used, and developers do not want to use features without robust support.
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
W
DescriptionThe panel will discuss different aspects of the workshop, along with questions from the moderator and the audience.
Workshop
Programming Frameworks and System Software
W
DescriptionIn the context of the expanding landscape of contemporary High-Performance Computing (HPC) applications from petascale to exascale, performance optimization emerges as a significant challenge in software development endeavors. In the meantime, the escalating intricacies inherent in parallel architectures and systems compound the challenges associated with performance enhancement.
Here, we introduce PEAK (Performance Evaluation and Analysis Kit), a lightweight profiling tool developed with a specific focus on large-scale HPC applications. Using Dynamic Binary Instrumentation, PEAK is able to profile large-scale multi-threaded, multi-process applications with low overhead and high accuracy. We analyzed the overhead and accuracy of PEAK using synthetic benchmarks and real applications, and compared it against other widely used HPC profiling tools. Our demonstration underscores that PEAK exhibits comparable overhead and accuracy to alternative profiling tools while preserving its inherent simplicity.
Paper
Accelerators
Algorithms
Graph Algorithms and Frameworks
TP
DescriptionThe 𝐾 shortest path (KSP) algorithm, which finds the top 𝐾 shortest simple paths from a given source to a target vertex, has a wide range of real-world applications. While the top 𝐾 shortest simple paths offer invaluable insights, computing them is time-consuming. In this work, we observe that existing works search for the 𝐾 shortest paths in the original graph, while the top 𝐾 shortest paths only cover a meager portion of it. This paper devises PeeK. It first applies 𝐾 upper bound pruning to prune the vertices and edges that will not appear in any of the 𝐾 shortest paths. Second, PeeK adaptively compacts the graph, which not only removes the pruned vertices and edges but also speeds up the downstream computation. We compare PeeK with five algorithms. For parallel computation with 32 threads, PeeK achieves 5.1x and 28.8x speedups over the state of the art for 𝐾 = 8 and 128, respectively.
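The upper bound pruning idea can be sketched as follows (a simplified illustration assuming an undirected weighted graph; PeeK's actual implementation is parallel and considerably more sophisticated): a vertex v can lie on one of the K shortest s-t paths only if dist(s, v) + dist(v, t) does not exceed an upper bound on the K-th shortest path length.

```python
# Sketch of K-upper-bound pruning: run Dijkstra from the source and from
# the target, then drop every vertex whose through-distance exceeds the
# bound on the K-th shortest path length.

import heapq

def dijkstra(adj, src):
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def prune(adj, s, t, kth_upper_bound):
    ds, dt = dijkstra(adj, s), dijkstra(adj, t)   # undirected graph assumed
    keep = {v for v in adj
            if ds.get(v, float("inf")) + dt.get(v, float("inf"))
               <= kth_upper_bound}
    return {v: [(u, w) for u, w in adj[v] if u in keep] for v in keep}

# Toy graph: the detour through b (length 10) cannot beat the bound 4.
adj = {"s": [("a", 1), ("b", 5)], "a": [("s", 1), ("t", 1)],
       "b": [("s", 5), ("t", 5)], "t": [("a", 1), ("b", 5)]}
pruned = prune(adj, "s", "t", kth_upper_bound=4)
```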
Workshop
Education
State of the Practice
W
DescriptionHPC relies on experts to design, implement, and tune (computational science) applications that can efficiently use current (super)computing systems. As such, we strongly believe we must educate our students to ensure their ability to drive these activities, together with the domain experts. To this end, in 2018 we designed a performance engineering course that, inspired by several conference-like tutorials, covers the principles and practice of performance engineering: benchmarking, performance modeling, and performance improvement. We describe the goals, learning objectives, and structure of the course, share student feedback and evaluation data, and discuss the lessons learned. After teaching the course five times, our results show that the course is tough (as expected) but very well received, with high scores and several students continuing on the path of performance engineering during and after their master's studies.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionPreparing for the deployment of large scientific and engineering codes on upcoming exascale systems with GPU-dense nodes is made challenging by the unprecedented diversity of device architectures and heterogeneous programming models. In this work, we evaluate the process of porting a massively parallel, fluid dynamics code written in CUDA to SYCL, HIP, and Kokkos with a range of backends, using a combination of automated tools and manual tuning. We use a proxy application along with a custom performance model to inform the results and identify additional optimization strategies. At-scale performance of the programming model implementations is evaluated on pre-production GPU node architectures for Frontier and Aurora, as well as on current NVIDIA device-based systems Summit and Polaris. Real-world workloads representing 3D blood flow calculations in complex vasculature are assessed. Our analysis highlights critical trade-offs between code performance, portability, and development time.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionIn this new era where multiple GPU vendors are leading the supercomputing landscape, and multiple programming models are available to users, the drive to achieve performance portability across platforms faces new challenges. Consider stencil algorithms, where architecture-specific solutions are required to optimize for the parallelism hierarchy and memory hierarchy of emerging systems. In this work, we analyze performance portability of the BrickLib domain-specific library and vector code generator for stencils. BrickLib employs fine-grain data blocking to reduce the large amount of data movement associated with stencils. We compare different GPUs (NVIDIA, AMD and Intel) and their associated programming models (CUDA, HIP and SYCL). By testing a wide range of stencil configurations, we show that overall, BrickLib achieves good performance independent of machine or programming model. Moreover, we introduce correlation models as a new tool for comparing architectures and programming models from Roofline model data.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionMoore’s Law is a techno-economic model that has enabled the IT industry to double the performance and functionality of digital electronics roughly every 2 years within a fixed cost, power and area. This expectation has led to a relatively stable ecosystem (e.g. electronic design automation tools, compilers, simulators and emulators) built around general-purpose processor technologies, such as the x86, ARM and Power instruction set architectures. However, the historical improvements in performance offered by successive generations of lithography are waning while costs for new chip generations are growing rapidly. In the near term, the most practical path to continued performance growth will be architectural specialization in the form of many different kinds of accelerators. New software implementations, and in many cases new mathematical models and algorithmic approaches, are necessary to advance the science that can be done with these specialized architectures. This trend will not only continue but also intensify as the transition from multi-core systems to hybrid systems has already caused many teams to re-factor and redesign their implementations. But the next step to systems that exploit not just one type of accelerator but a full range of heterogeneous architectures will require more fundamental and disruptive changes in algorithm and software approaches. This applies to the broad range of algorithms used in simulation, data analysis and learning. New programming models or low-level software constructs that hide the details of the architecture from the implementation can make future programming less time-consuming, but they will not eliminate nor in many cases even mitigate the need to redesign algorithms. Future software development will not be tractable if a completely different code base is required for each different variant of a specialized system.
The aspirational desire for “minimizing the number of lines of code that must be changed to migrate to different systems with different arrangements of specialization” is encapsulated in the loaded phrase “Performance Portability.” However, performance portability is likely not an achievable goal if we attempt to do it using imperative languages like Fortran and C/C++. There is simply not enough flexibility built into the specification of the algorithm for a compiler to do anything other than what the algorithm designer explicitly stated in their code. Making this future of diverse accelerators usable and accessible will require the co-design of new compiler technology and domain-specific languages (DSLs) designed around the requirements of the target computational motifs. The higher levels of abstraction and declarative semantics offered by DSLs enable more degrees of freedom to optimally map the algorithms onto diverse hardware than traditional imperative languages that over-prescribe the solution. Because this will drastically increase the complexity of the mapping problem, new mathematics for optimization will be developed, along with better performance introspection (both hardware and software mechanisms for online performance introspection) through extensions to the roofline model. Use of ML/AI technologies will be essential to enable analysis and automation of dynamic optimizations.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionTo better advise HPC application developers, we have implemented Faces, a nearest-neighbor microbenchmark that quantifies performance trade-offs. The Faces experiments presented here explore the following design choices: 1) fewer dependent messages versus more independent messages, 2) fewer fused GPU kernels versus more simple kernels, 3) number of GPU streams, 4) size of GPU thread blocks, and 5) linear versus blocked ordering of MPI ranks. We present weak-scaling performance of a latency-sensitive "small'' per-rank domain and of a bandwidth-sensitive "large'' per-rank domain, and we compare results for two high-performance computers with contrasting CPU, GPU, and interconnect architectures: Summit and Frontier. We find that using more independent messages tends to give better performance than using few dependent messages. We identify performance-portability recommendations for GPU streams and synchronization, but other aspects of performance show complicated dependence on problem size and computer.
Tutorial
Accelerators
Heterogeneous Computing
Performance Measurement, Modeling, and Tools
Performance Optimization
Software Engineering
TUT
DescriptionThe Roofline performance model offers an insightful and intuitive method for extracting the key execution characteristics of HPC applications and comparing them against the performance bounds of modern CPUs and GPUs. Its ability to abstract the complexity of memory hierarchies and identify the most profitable optimization techniques has made Roofline-based analysis increasingly popular in the HPC community. Although different flavors of the Roofline model have been developed to deal with various definitions of memory data movement, there remains a need for a systematic methodology when applying them to analyze applications running on multicore and accelerated systems. The tutorial aims to bridge this gap on both CPUs and GPUs by both exposing the fundamental aspects behind different Roofline modeling principles as well as providing several practical use case scenarios that highlight their efficacy for application optimization. This tutorial presents a unique combination of an introduction to Roofline by its creator, hands-on instruction in using Roofline within Intel’s, NVIDIA’s, and AMD’s production performance tools, and discussions of real-world Roofline use cases at ALCF, NERSC, and OLCF computing centers. The tutorial presenters have a long history of collaborating on the Roofline model and have presented several Roofline-based tutorials.
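At its core, the Roofline model is a one-line bound: attainable throughput is the minimum of the peak compute ceiling and arithmetic intensity times memory bandwidth. A minimal sketch (the peak and bandwidth numbers in the usage note are placeholders, not figures for any specific device):

```python
def roofline(peak_gflops, bw_gbs, arithmetic_intensity):
    """Attainable GFLOP/s: the minimum of the compute ceiling
    and the sloped memory-bandwidth roof."""
    return min(peak_gflops, bw_gbs * arithmetic_intensity)

def ridge_point(peak_gflops, bw_gbs):
    """Arithmetic intensity (FLOP/byte) at which a kernel stops
    being memory-bound and becomes compute-bound."""
    return peak_gflops / bw_gbs
```

For a hypothetical machine with a 1000 GFLOP/s peak and 100 GB/s of bandwidth, a kernel at 2 FLOP/byte is memory-bound at 200 GFLOP/s, while anything above the ridge point of 10 FLOP/byte can reach the full 1000 GFLOP/s.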
Workshop
Accelerators
Applications
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionWe present the steps followed to GPU-offload parts of the core solver of EFIT-AI, an equilibrium reconstruction code suitable for tokamak experiments and burning plasmas. For this work, we will focus on the fitting procedure that consists of a Grad–Shafranov (GS) equation inverse solver that calculates equilibrium reconstructions on a grid. We will show profiling results of the original code (CPU-baseline), as well as the directives used to GPU-offload the most time-consuming function, initially to compare OpenACC and OpenMP on NVIDIA and AMD GPUs and later on to assess OpenMP performance portability on NVIDIA, AMD and Intel GPUs. We will make a performance comparison for different grid sizes and show the speedup achieved on NVIDIA A100 (Perlmutter-NERSC), AMD MI250X (Frontier-OLCF) and Intel PVC GPUs (Sunspot-ALCF). Finally, we will draw some conclusions and recommendations to achieve high-performance portability for an equilibrium reconstruction code on the new HPC architectures.
Posters
Research Posters
TP
XO/EX
DescriptionNumerical methods such as the Finite Element Method (FEM) have successfully leveraged the computational power of GPU accelerators. However, much of the effort around FEM on GPUs has focused on high order discretizations due to their higher arithmetic intensity and order of accuracy. For applications such as the simulation of geologic reservoirs, high levels of heterogeneity result in high-resolution grids characterized by highly discontinuous (cell-wise) material property fields. Additionally, the significant uncertainties typical of geologic reservoirs reduce the benefits of high order accuracy, and low order methods are typically employed. In this study, we present a strategy for implementing highly performant low-order matrix-free FEM operator kernels in the context of the conjugate gradient method. Performance results of the operator kernel are presented and are shown to compare favorably to matrix-based SpMV operators on V100, A100, and MI250X GPUs.
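The matrix-free idea can be illustrated in miniature: the conjugate gradient method only ever needs the action of the operator on a vector, never its assembled matrix. A toy 1D Laplacian in NumPy (an illustration of the pattern, not the authors' GPU kernels):

```python
import numpy as np

def apply_laplacian(u):
    """Matrix-free action of the 1D [-1, 2, -1] stencil (Dirichlet BCs):
    the tridiagonal matrix is never assembled."""
    v = 2.0 * u
    v[1:] -= u[:-1]
    v[:-1] -= u[1:]
    return v

def cg(apply_op, b, tol=1e-10, maxiter=500):
    """Plain conjugate gradients driven by an operator callback."""
    x = np.zeros_like(b)
    r = b - apply_op(x)
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = apply_op(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

Swapping `apply_laplacian` for a cell-wise FEM operator kernel changes nothing in the CG driver, which is what makes the matrix-free formulation attractive on GPUs.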
Workshop
Data Movement and Memory
Fault Handling and Tolerance
Heterogeneous Computing
Security
W
DescriptionPersistent Memory (PM) has been proposed and commercially used as a novel generation of storage devices capable of competing with both primary and secondary memory, attaining features such as data persistency and byte addressability.
These devices paved the way for researchers to develop Transactional Memories (TMs) that, because this memory can also be persistent, deliver durable transactions in addition to atomic transactions in main memory. Unfortunately, combining PM and TM is challenging, as the most efficient implementations of TM, i.e., Hardware Transactional Memories (HTMs), operate at the level of volatile CPU caches.
We present our early-stage work on PSI, the first durable Persistent Hardware Transactional Memory for IBM's POWER systems. Our work builds on SI-HTM, which is a volatile HTM solution, and expands it with durability. We show that PSI imposes a relatively low overhead of 23% when compared with a volatile solution.
Workshop
Programming Frameworks and System Software
State of the Practice
W
Workshop
W
DescriptionContainers are becoming essential to support the diversity of scientific computing workloads at academic computing centers. Here, we offer perspectives and experiences from the Texas Advanced Computing Center on: the installation, configuration, and support of select containerization platforms; incorporation of containers into the module system to improve their discoverability and usability; facilitation of advanced use cases including MPI containers, GPU containers, and support for multiple instruction set architectures; and finally instruction on best practices to end users through workshops and university courses. We will briefly discuss case studies that highlight the importance of supporting containers for research computing.
Paper
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionMemory performance is a bottleneck in graph analytics acceleration. Existing Machine Learning (ML) prefetchers struggle with phase transitions and irregular memory accesses in graph processing. We propose MPGraph, an ML-based Prefetcher for Graph analytics using domain specific models. MPGraph introduces three novel optimizations: soft detection for phase transitions, phase-specific multi-modality models for access delta and page predictions, and chain spatio-temporal prefetching (CSTP) for prefetch control.
Our transition detector achieves 34.17–82.15% higher precision compared with Kolmogorov–Smirnov Windowing and decision tree. Our predictors achieve 6.80–16.02% higher F1-score for delta and 11.68–15.41% higher accuracy-at-10 for page prediction compared with LSTM and vanilla attention models. Using CSTP, MPGraph achieves 12.52–21.23% IPC improvement, outperforming state-of-the-art non-ML prefetcher BO by 7.58–12.03% and ML-based prefetchers Voyager and TransFetch by 3.27–4.58%. For practical implementation, we demonstrate that MPGraph, using compressed models with reduced latency, shows significantly superior accuracy and coverage compared with BO, leading to 3.58% higher IPC improvement.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionA parallel program together with the parallel hardware it runs on is not only a vehicle to solve numerical problems, it is also a complex system with interesting dynamical behavior: resynchronization and desynchronization of parallel processes, propagating phases of idleness, and the peculiar effects of noise and system topology are just a few examples. We propose a physical oscillator model (POM) to describe the dynamics of interacting parallel processes. A process with its regular compute-communicate cycles is modeled as an oscillator which is coupled to other oscillators (processes) via an interaction potential. Instead of the simple all-to-all connectivity of the well-known Kuramoto model, we employ a sparse topology matrix mapping the communication structure of the parallel program onto the oscillator setup. We show that the POM with appropriate potentials can mimic the propagation of delays and the synchronization and desynchronization behavior of scalable and bottlenecked parallel programs, respectively.
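A minimal version of such a model can be written in a few lines: Kuramoto-style oscillators coupled through a sparse topology matrix rather than all-to-all (a generic sketch with a sinusoidal potential, not the POM's actual interaction potentials):

```python
import numpy as np

def simulate(theta0, omega, A, K=1.0, dt=0.01, steps=5000):
    """Forward-Euler integration of d(theta_i)/dt =
    omega_i + K * sum_j A[i, j] * sin(theta_j - theta_i),
    where A is the (sparse) coupling topology matrix."""
    theta = theta0.copy()
    for _ in range(steps):
        diff = theta[None, :] - theta[:, None]  # diff[i, j] = theta_j - theta_i
        theta += dt * (omega + K * (A * np.sin(diff)).sum(axis=1))
    return theta

def order_parameter(theta):
    """Kuramoto order parameter r in [0, 1]; r -> 1 means synchronization."""
    return abs(np.exp(1j * theta).mean())
```

With identical natural frequencies and a ring topology (each process coupled only to its two neighbors, as in a nearest-neighbor halo exchange), the oscillators pull each other into synchrony, which is the resynchronization behavior the abstract describes.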
Invited Talk
Compilers
Hardware Technologies
Quantum Computing
TP
DescriptionQuantum software can be a force multiplier that can significantly shorten the timeline for utility-scale results from quantum hardware. In particular, several key research directions will help realize practical quantum advantage. Physics-aware, cross-layer optimizations will continue to yield important efficiencies to allow applications to make the most of quantum resources. Software-directed error mitigation, in particular, will be key to increasing gate depths and maintaining acceptable output fidelity. Pulse-level optimizations and specialized native gates will also be key enablers. Additionally, applications will be hybrid computations involving high-performance classical resources as well as quantum hardware serving as special-purpose accelerators. Effectively partitioning computations between these classical and quantum resources will be necessary to support realistic applications. Additionally, deep compiler optimization and classical simulation of Clifford and near-Clifford circuits can also be important classical investments toward more efficient quantum computations. Finally, defining abstractions that control compiler complexity yet selectively expose key physical machine properties will also be a key area of research.
Posters
Research Posters
TP
XO/EX
DescriptionPer-process per-thread traces enable in-depth analysis of parallel program execution to identify various kinds of performance issues. Oftentimes, trace collection tools provide a graphical tool to analyze the trace output. However, these GUI-based tools only support specific file formats, are difficult to scale when the data is large, limit data exploration to the implemented graphical views, and do not support automated comparisons of two or more datasets. In this poster, we present a pandas-based Python library, Pipit, which can read traces in different file formats (OTF2, HPCToolkit, Projections, Nsight, etc.) and provide a uniform data structure in the form of a pandas DataFrame. Pipit provides operations to aggregate, filter, and transform the events in a trace to present the data in different ways. We also provide several functions to quickly identify performance issues in parallel executions.
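The events-as-DataFrame idea can be illustrated with plain pandas on a toy trace (the column names and the enter/leave matching below are illustrative; Pipit's actual schema and API may differ):

```python
import pandas as pd

# Hypothetical trace fragment: timestamped enter/leave events per function.
events = pd.DataFrame({
    "timestamp": [0.0, 1.0, 4.0, 6.0],
    "event":     ["enter", "enter", "leave", "leave"],
    "name":      ["main", "mpi_send", "mpi_send", "main"],
})

# Match each leave to its enter by function name (non-nested calls here)
# and compute per-function inclusive time by index-aligned subtraction.
enters = events[events["event"] == "enter"].set_index("name")["timestamp"]
leaves = events[events["event"] == "leave"].set_index("name")["timestamp"]
inclusive = (leaves - enters).rename("inclusive_time")
```

Once events live in a DataFrame, the usual pandas operations (filtering, grouping, joining two traces for comparison) replace the fixed views of a GUI tool, which is the point the poster makes.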
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
DescriptionThe power requirements of modern High-Performance Computing (HPC) systems pose environmental and financial challenges, given their carbon emissions and the strain they place on power grids. Optimizing power consumption together with system performance has thus become crucial. As jobs running on a system contribute to the whole system's power usage, predicting their power requirements before execution would allow forecasting the overall power consumption and applying techniques like power capping. Such predictive studies need quality data, which is limited due to the inherent complexity of collecting structured data in a production system. This paper aims to fill the lack of resources for job power prediction and provide (i) a methodology to create a job power consumption dataset from workload manager data and node power metrics logs, and (ii) a novel dataset comprising around 230K jobs and their corresponding power consumption values. The dataset is derived from M100, a holistic dataset extracted from a production supercomputer.
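The join the methodology describes, attaching node power samples to the job occupying that node at that time, can be sketched with pandas on toy data (column names and values are illustrative, not the M100 schema):

```python
import pandas as pd

# Hypothetical workload manager records: one row per job allocation.
jobs = pd.DataFrame({
    "job_id": [1, 2],
    "node":   ["n01", "n01"],
    "start":  [0.0, 10.0],
    "end":    [10.0, 20.0],
})

# Hypothetical node power telemetry: periodic wattage samples per node.
power = pd.DataFrame({
    "node":      ["n01"] * 4,
    "timestamp": [2.0, 6.0, 12.0, 18.0],
    "watts":     [200.0, 300.0, 400.0, 500.0],
})

# Join on node, keep samples falling inside each job's time window,
# then aggregate to one power value per job.
merged = jobs.merge(power, on="node")
in_window = merged[(merged["timestamp"] >= merged["start"])
                   & (merged["timestamp"] < merged["end"])]
avg_power = in_window.groupby("job_id")["watts"].mean()
```

A production pipeline would additionally handle multi-node jobs and misaligned sampling intervals, but the merge-filter-aggregate shape stays the same.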
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionThe PMBS23 workshop is concerned with the comparison of high-performance computing systems through performance modeling, benchmarking or through the use of tools such as simulators. We are particularly interested in research which reports the ability to measure and make tradeoffs in software/hardware co-design to improve sustained application performance. We are also keen to capture the assessment of future systems.
The aim of this workshop is to bring together researchers, from industry and academia, concerned with the qualitative and quantitative evaluation and modeling of high-performance computing systems. Authors are invited to submit novel research in all areas of performance modeling, benchmarking and simulation, and we welcome research that brings together current theory and practice. We recognize that the term 'performance' has broadened to include power consumption and reliability, and that performance modeling is practiced through analytical methods and approaches based on software tools and simulators.
Paper
Applications
Modeling and Simulation
DescriptionQuantum perturbation theory is pivotal in determining the critical physical properties of materials. The first-principles computations of these properties have yielded profound and quantitative insights in diverse domains of chemistry and physics.
In this work, we propose a portable and scalable OpenCL implementation for quantum perturbation theory, which can be generalized across various high-performance computing (HPC) systems. Optimal portability is realized through the utilization of a cross-platform unified interface and a collection of performance-portable heterogeneous optimizations. Exceptional scalability is attained by addressing major constraints on memory and communication, employing a locality-enhanced task mapping strategy and a packed hierarchical collective communication scheme. Experiments on two advanced supercomputers demonstrate that the quantum perturbation calculation exhibits remarkable performance on various material systems, scaling the system to 200,000 atoms with all-electron precision. This research enables all-electron quantum perturbation simulations on substantially larger molecular scales, with a potentially significant impact on progress in material sciences.
Tutorial
Accelerators
Applications
Software Engineering
TUT
DescriptionThis hands-on tutorial teaches how to parallelize and optimize HPC applications for multi-core CPUs and GPUs using the portable parallelism and concurrency features of the ISO C++23 standard without any language or vendor extensions. We further show how to integrate this approach with MPI to target large multi-node homogeneous and heterogeneous HPC systems. The attendees learn problem-solving strategies for parallelizing classic HPC patterns (multi-dimensional loops, map-reduce, scans) and concurrency problems, e.g., to hide the latency of MPI communication behind computation. The tutorial provides attendees zero-setup web access to Jupyter Lab running on modern multi-GPU accelerated systems, enabling attendees to solve the hands-on exercises directly in their web browser. These hands-on exercises transfer the above-mentioned techniques to produce a portable multi-node, heterogeneous, and asynchronous 2D unsteady heat-equation mini-application. Finally, we synthesize practical techniques acquired from our professional experience applying the portable ISO C++23 parallel and asynchronous programming models to port large real-world HPC applications to heterogeneous supercomputers and refer to further learning resources.
Workshop
Accelerators
Applications
Compilers
Heterogeneous Computing
Modeling and Simulation
Programming Frameworks and System Software
Runtime Systems
W
DescriptionThis paper presents the results of our efforts to port Meso-NH, an atmospheric non-hydrostatic research model, to AMD MI250X GPUs using OpenACC on the ADASTRA Machine, a technology similar to the Frontier system [1]. Meso-NH is a versatile model that covers a wide range of resolutions from synoptic to turbulent scales, and is designed for studies of physics and chemistry. Numerical simulation of the atmosphere is crucial for understanding and predicting weather and climate extremes. Current numerical weather prediction codes are limited to specific resolutions on global and regional scales. The Meso-NH code, however, tackles scales and complexities beyond what is typically used in operational forecasting.
We collaborated with GENCI, CINES, HPE, and AMD on the "progress contract" for the ADASTRA machine, aiming to achieve simulations at hectometric resolution for recent storms in the Atlantic and Mediterranean regions, characterized by extreme wind gusts.
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
DescriptionBatched linear solvers play a vital role in computational sciences, especially in the fields of plasma physics and combustion simulations. With the imminent deployment of the Aurora Supercomputer and other upcoming systems equipped with Intel GPUs, there is a compelling demand to expand the capabilities of these solvers for Intel GPU architectures.
We present our efforts in porting and optimizing the batched iterative solvers on Intel GPUs using the SYCL programming model. These new solvers achieve impressive performance on the Intel GPU Max 1550s (Ponte Vecchio GPUs), surpassing our previous CUDA implementation on NVIDIA H100 GPUs by an average of 2.4x for the PeleLM application inputs. The batched solvers are ready for production use in real-world scientific applications through the Ginkgo library, complementing the performance portability of the batched functionality of Ginkgo.
Workshop
State of the Practice
W
DescriptionCryogenic electronics have great potential to advance computing capabilities and quantum information processing. We explore two categories: Superconducting Electronics (SCE) and Cryogenic semiconductor electronics (Cryo-Semi). Taking advantage of the inherent phenomenon of superconductivity with zero resistance and Josephson junctions, SCE presents notable advantages in energy efficiency, minimal power dissipation, and gigahertz processing speed. Similarly, Cryo-Semi electronics offer compelling advantages over their room-temperature counterparts; these include lower noise, higher operating speed, increased efficiency, and a wide operating temperature range. Both SCE and Cryo-Semi exhibit compatibility with quantum technologies and deep space applications. Both have the capacity to operate at deep cryogenic temperatures, making them appealing candidates for quantum computing. Their integration enables classical and quantum resources to work seamlessly in a shared cryogenic environment, improving quantum error correction and operating temperature compatibility.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionPower has become a key limiting factor in supercomputing. Understanding the power signatures of current production workloads is essential to address this limit and continue to advance scientific computing at scale. This paper analyzes the power characteristics of NERSC production workloads at the system and application levels. Our system-level analysis revealed a large gap between the average and peak power usage distribution, indicating a significant power swing from running various applications on the system. On the application level, we select four workflow benchmarks representing NERSC's production workloads to analyze the power characteristics of applications and attempt to correlate the observed power timeline patterns with GPU performance metrics and application profiling data. We found that different applications have distinct power usage patterns and a wide spread in average and peak power usage. We discuss how these findings may help improve the current system's operational power efficiency and the implications for future system procurement.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionSince the “Good Old Times” of petascale, HPC computer centers have had to evolve a lot. As a result, HPC centers require huge amounts of power to run, and TCO has gone through the roof, in particular in times of rising energy prices.
This BoF proposes a “short production system” compute model, where the compute, storage and network systems collaborate to execute applications and workflows in a small, compact and contiguous part of the system, exploit locality of compute and data resources, and thereby reduce energy usage and cost and avoid spreading applications and data across the whole system.
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
DescriptionWith the advent of GPU computing, executing large program sections on accelerators has become increasingly important. Efforts are being made to support the C standard library, LIBC, on GPUs via LLVM machinery. Therefore, the C standard math library, LIBM, must be supported on GPUs. So far, LLVM frontends, such as Clang, have relied on GPU vendor implementations of LIBM functionality wrapped into (mostly) LIBM-compatible forwarding functions.
We propose a novel LIBM for GPUs reusing a collection of LLVM target-agnostic implementations and built-ins alongside vendor implementations of most single and double-precision floating point math functions. Our approach allows selecting between individual implementations based on the GPU target as opposed to the current approach, which serves only the single third-party library implementation. Our extensive numerical analysis highlights the various implementations' differences in performance and precision. Our solution allows users to choose the implementation that maximizes speed while meeting their specific precision requirements.
Doctoral Showcase
Posters
Artificial Intelligence/Machine Learning
Security
TP
DescriptionThe problem of preempting attacks before damage occurs remains the top security priority. The gap between alerts and early detection remains wide open because noisy attack attempts and unreliable alerts mask real attacks from humans. This dissertation brings together: 1) attack-pattern mining driven by real security incidents, 2) probabilistic graphical models linking patterns with runtime alerts, and 3) an in vivo testbed that embeds a honeypot in a live Science DMZ network for realistic assessment. Traditional techniques that seek specific attack signatures or anomalies are ineffective because defenders see only a partial view of ongoing attacks while having to wrestle with unreliable alerts and heavy background noise of attack attempts. In contrast, our principal objective is to reinforce scant, incomplete evidence of potential attacks with the ground truth of past security incidents. We evaluated the accuracy and performance of our system, Cyborg, in three experiments at the National Center for Supercomputing Applications at the University of Illinois. Our deployment stops 8 out of 10 replayed attacks before system integrity violation and all ten before data exfiltration. In addition, we discovered and stopped a family of ransomware attacks before the data breach. During the deployment period, this thesis produced a honeypot that collected 15 billion attack attempts (the world's largest publicly analyzed dataset) for analytics. In the future, we plan to integrate AI techniques such as large language models to build intelligent honeypot systems that are indistinguishable from real systems, in order to collect attack intelligence and educate security operators.
Workshop
W
DescriptionContainers provide a new paradigm for building, packaging, deploying, and managing applications consistently across varying infrastructures. However, the adoption of containers in HPC has been more difficult due to the combination of security and performance requirements. High resource utilization across GPU-intensive workloads is a crucial requirement for HPC clusters. Container orchestration platforms such as Kubernetes enable efficient management of HPC infrastructure for use by researchers who need access to scalable high-performance facilities. However, the resource utilization of such orchestration frameworks with GPU-intensive HPC workloads remains relatively unexplored. In this paper we present kube-criu-scheduler, a Kubernetes scheduler that builds on a recently introduced container checkpointing feature to enable preemptive scheduling of GPU-accelerated HPC applications. Our results show that the resulting efficiency and reliability gains are achieved with negligible impact on application performance.
Posters
Research Posters
Artificial Intelligence/Machine Learning
Post-Moore Computing
Quantum Computing
TP
DescriptionIn classical machine learning, the convolution operation is leveraged in the eponymous class of convolutional neural networks (CNNs) capturing the spatial and/or temporal locality of multidimensional input features. Preserving data locality allows CNN models to reduce the number of training parameters, and hence their training time, while achieving high classification accuracy. However, contemporary methods of quantum machine learning do not possess effective methods for exploiting data locality, due to the lack of a generalized and parameterizable implementation of quantum convolution. In this work, we propose variational quantum classification techniques that leverage a novel multidimensional quantum convolution operation with arbitrary filtering and unity stride. We provide the quantum circuits for our techniques alongside corresponding theoretical analysis. We also experimentally demonstrate the advantage of our method in comparison with existing quantum and classical techniques for image classification in staple multidimensional datasets using state-of-the-art quantum simulations.
Tutorial
Artificial Intelligence/Machine Learning
TUT
DescriptionRecent advances in Machine and Deep Learning (ML/DL) have led to many exciting challenges and opportunities. Modern ML/DL frameworks including TensorFlow, PyTorch, and cuML enable high-performance training, inference, and deployment for various types of ML models and Deep Neural Networks (DNNs). This tutorial provides an overview of recent trends in ML/DL and the role of cutting-edge hardware architectures and interconnects in moving the field forward. We will also present an overview of different DNN architectures, ML/DL frameworks, DL Training and Inference, and Hyperparameter Optimization with special focus on parallelization strategies for model training. We highlight new challenges and opportunities for communication runtimes to exploit high-performance CPU/GPU architectures to efficiently support large-scale distributed training. We also highlight some of our co-design efforts to utilize MPI for large-scale DNN training on cutting-edge CPU/GPU/DPU architectures available on modern HPC clusters. Throughout the tutorial, we include several hands-on exercises to enable attendees to gain first-hand experience of running distributed ML/DL training and hyperparameter optimizations on a modern GPU cluster.
Workshop
Applications
Exascale
Heterogeneous Computing
Programming Frameworks and System Software
State of the Practice
W
DescriptionThe diversity in processor technology used by High Performance Computing (HPC) facilities is growing, and so applications must be written in such a way that they can attain high levels of performance across a range of different CPUs, GPUs, and other accelerators. Measuring application performance across this wide range of platforms becomes crucial, but doing so rigorously and time-efficiently, while ensuring results are scientifically meaningful, reproducible, and actionable, poses significant challenges. We present a methodology for measuring and analyzing the performance portability of a parallel application and share a software framework which combines and extends adopted technologies to provide a usable benchmarking tool. We demonstrate the flexibility and effectiveness of the methodology and benchmarking framework by showcasing a variety of benchmarking case studies which utilize a stable of supercomputing resources at a national scale.
Paper
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
TP
DescriptionPerformance variations caused by anomalies in modern High Performance Computing (HPC) systems lead to decreased efficiency, impaired application performance, and increased operational costs. While machine learning (ML)-based frameworks for automated anomaly detection (often based on time series telemetry data) are gaining popularity in the literature, practical deployment challenges are often overlooked. Some ML-based frameworks require extensive customization, while others need a rich set of labeled samples, none of which are feasible for a production HPC system.
This paper introduces a variational autoencoder-based anomaly detection framework, Prodigy, that outperforms the state-of-the-art alternatives by achieving a 0.95 F1-score when detecting performance anomalies. The paper also provides a real system implementation of Prodigy that enables easy integration with monitoring frameworks and rapid deployment. We deploy Prodigy on a production HPC system and demonstrate 88% accuracy in detecting anomalies. Prodigy includes an interface that provides job- and node-level analysis and explanations for anomaly predictions.
Workshop
Education
State of the Practice
W
DescriptionDesigned for the master's degree program in "Computational and Data Science," the Faculty of Mathematics and Computer Science at Friedrich Schiller University Jena, Germany, offers a course that introduces students to distributed processing on massive datasets. Within that course, there is a three-week programming project where students learn to design, construct, and improve data analysis and machine learning pipelines using Hadoop, MapReduce, and Spark on the university’s central compute cluster. This short note sketches the main idea of the programming project, gives an example of a project instance, and reports on classroom experiences.
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionI will discuss the multi-stream based execution environment of Habana/Gaudi systems that is exposed to deep learning frameworks, and I will show how one can combine compute, networking, and DMA at high performance and with low run-time overheads. I will highlight the performance of the Habana Collective Communication Library at scale in terms of bandwidth and message rate, and demonstrate its impact on the deep learning training and inference performance of a few neural network models, including vision and Large Language Models. In the second part of the talk, I will highlight the challenges in communication scaling, especially the associated congestion that we observe between leaf and spine switches in certain conditions. I will highlight solutions that we are currently deploying, including congestion control algorithms and packet/message spraying techniques at the endpoint, and share our results.
Tutorial
Accelerators
Artificial Intelligence/Machine Learning
TUT
DescriptionScientific applications are increasingly adopting Artificial Intelligence (AI) techniques to advance science. There are specialized hardware accelerators designed and built to run AI applications efficiently. With a wide diversity in the hardware architectures and software stacks of these systems, it is challenging to understand the differences between these accelerators, their capabilities, programming approaches, and how they perform, particularly for scientific applications. In this tutorial, we will cover an overview of the AI accelerators landscape with a focus on SambaNova, Cerebras, Graphcore, Groq, and Habana systems along with architectural features and details of their software stacks. We will have hands-on exercises that will help attendees understand how to program these systems by learning how to refactor codes written in standard AI framework implementations and compile and run the models on these systems. The tutorial will enable the attendees with an understanding of the key capabilities of emerging AI accelerators and their performance implications for scientific applications.
Tutorial
Accelerators
Algorithms
Task Parallelism
TUT
DescriptionIf you are an HPC programmer, you know OpenMP. Alongside MPI, OpenMP is the open, cross-vendor foundation of HPC. As hardware complexity has grown, OpenMP has grown as well, adding GPU support in OpenMP 4.0 (2013). With a decade of evolution since then, OpenMP GPU technology is now a mature option for programming any GPU you are likely to find on the market.
While there are many ways to program a GPU, the best way is through OpenMP. Why? Because the GPU does not exist in isolation. There are always one or more CPUs on a node. Programmers need portable code that fully exploits all available processors. In other words, programmers need a programming model, such as OpenMP, that fully embraces heterogeneity.
In this tutorial, we explore GPU programming with OpenMP. We assume attendees already know the fundamentals of multi-threading with OpenMP, so we use our time on the directives that define how to map loops onto GPUs and optimize data movement between the CPU and GPU. Students will use their own laptops (with Windows, Linux, or macOS) to connect to remote servers we will provide with GPUs and all the software needed for the tutorial.
Awards
Test of Time
Applications
Architecture and Networks
Codesign
TP
W
DescriptionIn 2009, we presented a paper at SC09 reporting on the design, construction, and use of Anton 1, a special-purpose supercomputer designed for molecular dynamics (MD) simulations of biomolecular systems. The machine’s specialized hardware dramatically increased the speed of MD calculations, making possible for the first time the simulation of biological molecules at an atomic level of detail for periods on the order of a millisecond -- about two orders of magnitude beyond the previous state of the art. This enabled biomolecular simulations on a timescale at which many critically important, but poorly understood phenomena were known to occur, allowing the observation of biological phenomena that were previously inaccessible to both computational and experimental study.
The following year, we published a paper in the journal Science that reported on our use of Anton 1 to answer longstanding fundamental questions regarding the nature of protein folding and other large-scale structural changes in proteins. Some of our results were derived from an Anton 1 simulation roughly 100 times longer than the longest simulation that had previously been reported.
Over the past 10 years, we have developed and deployed two further generations of the Anton supercomputer. Anton 3 is dramatically faster than Anton 1, and capable of efficiently handling far larger molecular systems. Although general-purpose supercomputers have also become much faster over that period, the performance gap between Anton and general-purpose supercomputers has grown over time -- to a factor of more than 400 for biomolecular systems in a size range of considerable interest within the research and drug discovery communities.
In 2010, we made an Anton 1 machine (later upgraded to an Anton 2) available without cost for noncommercial research use by universities and other nonprofit institutions. Anton time is allocated annually by the National Academies, and a total of 239 outside research groups have thus far conducted independent research projects on these machines.
At D. E. Shaw Research, we have continued using Anton machines both for fundamental research and for internal and collaborative drug discovery, yielding six drugs that are currently in human clinical trials.
This talk will review the contents of our original paper and related progress since its publication.
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
Workshop
Quantum Computing
Software Engineering
W
DescriptionState-of-the-art quantum circuit simulators have mostly focused on scaling the number of qubits. However, we argue that studying current noisy quantum computers and variational quantum algorithms benefits from high-throughput simulation of intermediate-scale quantum circuits. We present the first implementation and evaluation of a batched quantum simulator on the NEC Vector Engine (VE), a vector processor with massive memory bandwidth ideal for memory-intensive state vector simulation. To take advantage of the long-vector architecture of VE, we design a parallelization strategy and memory layout suited for batched state vector simulation. Our preliminary evaluation shows that the performance of our simulator on VE Type 20B outperforms a dual-socket CPU system by 12x. Furthermore, the performance of VE is identical to that of cuStateVec on A100 40 GB, matching the peak bandwidth of the two processors. This suggests the latest VE with higher memory bandwidth is expected to outperform A100.
Birds of a Feather
Distributed Computing
TP
XO/EX
DescriptionThis BoF session will address user experience challenges that arise from geographically dispersed computing resources, such as when an organization operates multiple HPC clusters or wishes to combine on-premises and cloud-based compute services. A series of speakers will provide an overview of current perspectives on and solutions for making dispersed computing resources available to user communities. We invite participants to engage in a facilitated follow-up discussion to identify key unresolved hurdles and document emerging community best practices for providing the best possible user experience in geographically dispersed HPC settings.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionA novel streaming approach is introduced for Python, leveraging the ProxyStore system to facilitate the exchange of stream references across distributed systems. This approach utilizes generators to efficiently publish and consume messages from streams. The extensible backend connector interface of ProxyStore enables support for diverse communication mechanisms, such as transitioning from ZMQ to RDMA. Performance results highlight the capability to perform data PUT and GET operations on streams with minimal overhead and high efficiency.
Students@SC
DescriptionThe research is clear – psychological safety is more critical than any other factor for making a team work. A shared belief held by individuals that their team is safe for interpersonal risk-taking isn’t just “nice to have” but a necessity for company growth, employee retention, and long-term success.
In this 1.5-hour session, participants will discuss the challenges and successes they’ve encountered while working to foster a greater sense of psychological safety in their teams and organizations. Based on Dr. Clark’s 4-Stage framework, the host will share specific actions people leaders can take to create environments that welcome questions, invite unique perspectives, and allow people to safely bring their authentic selves to work.
Workshop
Programming Frameworks and System Software
W
DescriptionModern supercomputing applications are complex programs built on optimized frameworks and accelerated on GPUs. As such, dedicated tools for profiling GPU kernel utilization and performance are needed to support development of these applications, which in turn accelerates progress for the scientific computing and machine learning communities.
This paper presents the Oneprof and Onetrace tools from the Intel PTI-GPU framework. These tools are capable of profiling applications and different levels of the runtime stack executing on Intel GPUs. To demonstrate the features and utility of these tools, we examine one HPC and one AI application.
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Runtime Systems
Task Parallelism
W
DescriptionPure is a new programming model and runtime system explicitly designed to take advantage of shared memory within nodes in the context of a mostly message passing interface enhanced with the ability to use tasks to make use of idle cores. We use microbenchmarks to evaluate Pure’s key messaging and collective features and also show application speedups up to 2.1x on the CoMD molecular dynamics application. Overall, Pure offers improved performance by aggressively leveraging modern shared memory nodes with a programming model that will be familiar to MPI programmers.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe current era of quantum computing has yielded several algorithms that promise high computational efficiency. While the algorithms are sound in theory, there is little guidance on how to design proper quantum circuits to realize the appropriate unitary transformation to be applied to the input quantum state. We present QArchSearch, an AI-based quantum architecture search package with the QTensor library as a backend, that provides a principled and automated approach to finding the best model given a task and input quantum state. We show that the search package is able to efficiently scale the search to large quantum circuits and enables exploration of more complex models for different quantum applications. QArchSearch runs at scale and high efficiency on high-performance computing systems using a two-level parallelization scheme on both CPUs and GPUs, which has been demonstrated on the Polaris supercomputer.
Posters
Research Posters
TP
XO/EX
DescriptionHigh-performance reconfigurable computers (HPRCs) make use of Field-Programmable Gate Arrays (FPGAs) for efficient emulation of quantum algorithms. Generally, algorithm-specific architectures are implemented on the FPGAs, and there is very little flexibility. Moreover, mapping a quantum algorithm onto its equivalent FPGA emulation architecture is challenging. In this work, we present an automation framework for converting quantum algorithms/circuits to their equivalent FPGA emulation architectures. The framework processes quantum circuits represented in Quantum Assembly Language (QASM) and derives high-level descriptions of the hardware emulation architectures for High-Level Synthesis (HLS) on HPRCs. Experimental results show that the framework-generated architectures deployed on an HPRC perform faster than a state-of-the-art software simulator.
Workshop
Quantum Computing
Software Engineering
W
DescriptionIn the field of quantum computing, transpilation plays a crucial role in converting high-level quantum circuits into versions that are specific to the underlying quantum devices. This process necessitates consideration of a range of factors, such as the basis gate set, device topology, error profile, and other constraints. Yet the efficiency of transpilation remains a significant bottleneck, particularly when dealing with very large assembly or QASM-level input files. In this paper, we present QASMTrans, a C++ based high-performance quantum transpiler. QASMTrans has demonstrated significant efficiency improvements compared to widely adopted approaches such as Qiskit. Built on comprehensive transpilation principles and efficient computation techniques, QASMTrans demonstrates 8-368X speedups on average compared to the internal transpiler of Qiskit. Such tremendous speedups make the exploration of a much larger design space, as well as more comprehensive compiler optimizations, feasible, especially for large circuits. QASMTrans will be released at http://github.com/pnnl/qasmtrans.
Panel
Codesign
Quantum Computing
TP
DescriptionQuantum computing is quickly maturing and has started to enter the area of High-Performance Computing. As a consequence, we are seeing more and more work on quantum computing in the SC program and also more and more exhibitors focusing on this new technology and its relationship to HPC. This, however, comes with many challenges, especially for new companies in this field, as they have to bridge the gap between physics and computer science, both from a technology and a community point of view. In this panel, we will discuss this topic with five quantum computing companies covering hardware, software, and workflow aspects: their take on the impact of HPC on them as well as their impact on HPC, special challenges, and the future prospects of quantum computing as a new accelerator technology for HPC.
Posters
Research Posters
TP
XO/EX
DescriptionWith the demise of Moore’s empirical law, we cannot expect a dramatic improvement in computer performance in the future, but the need for supercomputing at JAXA for numerical simulation, data processing, and other workloads continues to rise. Until now, general-purpose CPUs have been used exclusively, but there is an urgent need to seriously consider the use of dedicated computers and new architectures. One candidate is a quantum computer.
In order to study the feasibility of a quantum computer as a candidate for a new architecture, the Gate-Model Quantum Computer Study Group was established with users of JSS3 (JAXA Supercomputer System generation 3) as its main members. The group examined the possibility of applying gate-model quantum computing technology to JAXA's technical problem areas, and will assist management in making mid- to long-term decisions regarding computing resources.
The Group organized use cases created in workshops and gained insight into the effects of utilizing quantum technology.
Invited Talk
Compilers
Hardware Technologies
Quantum Computing
TP
DescriptionThe expansion of several quantum computing platforms has already demonstrated some of the superior performance previously predicted. However, the ultimate goal of one day reaching end users will require solving a multitude of difficult engineering problems. This talk focuses on some of the current microelectronics research aimed at making the necessary advances to improve quantum computing hardware. The next several generations of quantum computers must solve issues such as scalable, fault-tolerant hardware, microdevices with lower energy consumption, standardization of hardware across qubit types, and development of devices that are less sensitive to environmental noise. Simultaneously, software platforms must make hardware accessible, while workforce development must train users.
Posters
Research Posters
TP
XO/EX
DescriptionMost of the widely used quantum programming languages and libraries are not designed for the tightly coupled nature of hybrid quantum-classical algorithms, which run on quantum resources that are integrated on-premise with classical HPC infrastructure. We propose a programming model using the API provided by OpenMP to target quantum devices, which provides an easy-to-use and efficient interface for HPC applications to utilize quantum compute resources. We have implemented a variational quantum eigensolver using the programming model, which has been tested using a classical simulator. We are in the process of testing on the quantum resources hosted at LRZ.
Exhibits
Flash Session
TP
XO/EX
DescriptionNetwork engineers are closely watching the downside of quantum computer development: the emergence of a cryptographically relevant quantum computer (CRQC). This talk will offer a better understanding of quantum-safe networks and show how to protect against attacks both immediately and well into the future.
Workshop
State of the Practice
W
DescriptionHigh Performance Computing systems play a critical role in advancing scientific research. They use schedulers to allocate resources to queued jobs. Waiting time can vary, even among jobs with similar characteristics, making it difficult to accurately estimate how long a job will wait in the queue. Knowing how long a job will wait is beneficial for adequate planning and avoids the frustration that arises when a user's expectation of waiting time is not met. Efficient job wait time estimation is also crucial for optimizing resource allocation.
This work investigates the impact of job characteristics and user behaviors on job wait time on leadership-class HPC systems. The paper evaluates the performance of different supervised learning algorithms for job wait time estimation. While this study focuses on the workload and hardware characteristics of Theta CrayXC40, the processes and tools developed in this study can be applied to any other leadership-class machine.
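The supervised-learning setup described above can be sketched as a simple regression over job features. The features, coefficients, and least-squares model below are illustrative stand-ins; the paper's actual feature set and learning algorithms are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic job records: (nodes requested, requested walltime, queue depth).
n_jobs = 500
X = np.column_stack([
    rng.integers(1, 4096, n_jobs),      # nodes requested
    rng.uniform(0.5, 24.0, n_jobs),     # requested walltime (hours)
    rng.integers(0, 200, n_jobs),       # jobs ahead in the queue
])
# Hypothetical ground truth: wait grows with all three features, plus noise.
y = 0.002 * X[:, 0] + 0.3 * X[:, 1] + 0.05 * X[:, 2] + rng.normal(0, 1, n_jobs)

# Fit an ordinary least-squares wait-time estimator (with a bias term).
A = np.column_stack([X, np.ones(n_jobs)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict_wait(nodes, hours, queue_depth):
    """Estimate queue wait time (hours) for a new job."""
    return float(np.array([nodes, hours, queue_depth, 1.0]) @ coef)

est = predict_wait(1024, 8.0, 50)
```

In practice one would compare several such models (trees, gradient boosting, neural networks) on historical scheduler logs, as the abstract describes.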
Posters
Research Posters
TP
XO/EX
DescriptionThe soaring demand for AI has led to a surge in specialized computation hardware, which poses challenges in sharing resources through conventional virtualization methods among end users. Moreover, the extensive data required by AI often cannot be conveniently co-located with the compute resources, resulting in costly and unsuitable migration attempts. To address these issues, Radium offers a userspace framework employing process virtualization, thread execution migration, and distributed shared memory. By leveraging Radium, an unmodified application binary operates in an encapsulated virtualized environment and its execution can be transparently distributed among the nodes where resources are located. Radium enables resource aggregation with little performance penalty even over high-latency network connections. By choosing syscalls as the virtualization boundary, Radium natively supports novel hardware without modifying existing infrastructure or applications.
Workshop
Applications
Exascale
Heterogeneous Computing
Programming Frameworks and System Software
State of the Practice
W
DescriptionReproducibility and replicability are extremely important components of scientific computing, and of any computational research. The ability to replicate a set of experiments aids many other computational use cases, such as systems acceptance, where a compute center requires the ability to execute the same experiments as a hardware vendor. Several test harnesses and frameworks exist that attempt to increase the replicability of these experiments.
We introduce Ramble, a new Python-based experimentation framework. Ramble provides a domain-specific language for abstracting how experiments can be created from applications, and a flexible templating engine for creating experiments. Ramble can be used for automating system tests, scientific parameter studies, performance-focused benchmarking, and many other software experiments. We describe Ramble's architecture and give some concrete use cases where it can be applied to HPC application experimentation.
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
DescriptionAtmospheric data assimilation is essential for numerical weather prediction. Ensemble-based data assimilation connects multiple instances of an atmospheric model through a Kalman-filter-based algorithm, which is regarded as a challenging computing task today. In this work, we present our efforts to build a fast, low-cost, and scalable atmospheric data assimilation prototype for the new-generation Sunway supercomputer, including (1) a UNet-neural-network-based surrogate model for atmospheric dynamic simulation that generates the full background ensemble with both satisfactory accuracy and reasonable robustness; (2) a batched LETKF with an efficient eigenvalue decomposition implementation and a data staging strategy to hide the observation I/O time; and (3) a framework that flexibly deploys these components to reach maximum resource efficiency. Experimental evaluations show that our AI-integrated ensemble data assimilation prototype can finish an hour-cycle assimilation in minutes, maintains linear scalability, and saves an order of magnitude of computing resources compared with the traditional scientific method.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionToday's supercomputers are more heterogeneous than ever before. As the share of AI workloads in data centers continues to grow, the share of GPUs and AI-specific hardware grows with it. AI accelerators are different from traditional hardware, affecting all aspects of system design, from data-center scale to single-chip scale. AI accelerators are much more efficient than CPUs or GPUs for some HPC workloads, especially in AI for Science. They also add complexity to system architecture, management, and programming. Although runtime frameworks are critical to reducing system complexity, there is little literature describing AI accelerator runtimes. In this paper, we introduce RDARuntime - an AI-specific OS tailored for the development and operation of SambaNova's reconfigurable dataflow architecture. We introduce the architecture, our design decisions, and some of the results we have achieved, along with some lessons we have learned while helping to deploy the Reconfigurable Dataflow Unit (RDU) to production environments.
Posters
Research Posters
TP
XO/EX
DescriptionThe uniform sampling of molecular dynamics (MD) simulations may not accurately capture crucial scientific events. Deep learning approaches are being developed to detect these events within streaming data but can take significant resources on large datasets (PB+). To address these limitations, we propose a solution based on streaming manifold learning, specifically the Kernel CUSUM (KCUSUM) algorithm. By leveraging KCUSUM, we can overcome the limitations of uniform sampling in MD simulations, as it compares incoming data with samples from a reference distribution. It utilizes a statistic derived from the Maximum Mean Discrepancy (MMD) non-parametric testing framework. This algorithm has been tested in various use cases, demonstrating its ability to significantly reduce data rates without missing important scientific events.
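The change-detection idea above can be sketched with a CUSUM-style statistic whose increments are single-pair MMD estimates. This is a simplified sketch of the KCUSUM approach, not the published algorithm; the kernel bandwidth, drift, and synthetic "event" are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_k(a, b, bw=1.0):
    """Gaussian (RBF) kernel between scalar samples."""
    return np.exp(-((a - b) ** 2) / (2 * bw ** 2))

def kcusum_scores(stream, ref_sampler, delta=0.2):
    """CUSUM statistic with unbiased single-pair MMD^2 increments.

    Each step consumes two stream samples (x, x2) and two reference
    samples (y, y2); h estimates MMD^2 between the stream and reference
    distributions, and the drift delta keeps the statistic near zero
    before a distribution change.
    """
    s, scores = 0.0, []
    for i in range(0, len(stream) - 1, 2):
        x, x2 = stream[i], stream[i + 1]
        y, y2 = ref_sampler(), ref_sampler()
        h = gauss_k(x, x2) + gauss_k(y, y2) - gauss_k(x, y2) - gauss_k(x2, y)
        s = max(0.0, s + h - delta)   # reflect at zero, CUSUM-style
        scores.append(s)
    return scores

# Reference distribution N(0,1); the stream drifts to N(5,1) halfway
# through, mimicking a rare "event" in an MD trajectory.
ref = lambda: rng.normal(0.0, 1.0)
stream = np.concatenate([rng.normal(0.0, 1.0, 400), rng.normal(5.0, 1.0, 400)])
scores = kcusum_scores(stream, ref)
```

Before the change the statistic hovers near zero; after it, the positive expected increment makes it grow steadily, which is what allows non-uniform, event-triggered sampling of the stream.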
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionImproving CPU power/energy efficiency without degrading performance requires an accurate application characterization. Rather than characterizing the application as a whole, we find that dividing the application into individual regions is much more effective. This fine-grained approach gives us the opportunity to save power/energy during memory-bound regions and MPI slack regions (time spent waiting on other processes) by lowering core frequency and during compute-bound regions by lowering uncore frequency. We propose an intuitive, lightweight, and portable algorithm for identifying these regions at runtime which relies only on the IPS (instructions per second) metric, rather than on performance counters that can differ across platforms. At the same time, we meet a user-specified level of acceptable performance degradation by adapting core and uncore frequencies to the application, achieving additional CPU power/energy savings. We evaluate our approach on the SPEC 2017 benchmarks and various MPI applications.
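A minimal sketch of the IPS-only classification idea follows: sample instructions per second over fixed intervals and label each interval relative to the peak IPS observed so far. The 40% threshold and the synthetic trace are illustrative choices, not values from the paper.

```python
def classify_regions(ips_trace, low_frac=0.4):
    """Label each interval 'compute' or 'memory/slack' from IPS alone."""
    peak = 0.0
    labels = []
    for ips in ips_trace:
        peak = max(peak, ips)
        # Low IPS relative to the observed peak suggests the cores are
        # stalled on memory or waiting in MPI, so core frequency can drop.
        labels.append('compute' if ips >= low_frac * peak else 'memory/slack')
    return labels

# Synthetic trace: compute-bound phases (high IPS) interleaved with
# memory-bound or MPI-slack phases (low IPS).
trace = [9.0e9, 9.2e9, 2.0e9, 1.8e9, 9.1e9, 0.5e9, 9.0e9]
labels = classify_regions(trace)
```

A runtime built on this would lower core frequency in the 'memory/slack' intervals and uncore frequency in the 'compute' intervals, while monitoring the resulting slowdown against the user's degradation budget.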
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionHigh-performance computing applications are central to advancement in many fields of science and engineering. This advancement rests on the assumed reliability of the HPC system. However, as system size grows and hardware components run with near-threshold voltages, transient upset events become more likely. Many works have explored the problem of detecting silent data corruption; however, recovery is often left to checkpoint-restart or application-specific techniques. Recovering from a checkpoint incurs overhead due to reading a checkpoint and recomputing lost work. Allowing the application to recover just the corrupted data enables faster and more efficient recovery. This paper explores using spatial similarities to recover from silent data corruption. We explore several reconstruction methods and evaluate their effectiveness at recovering corrupted entries in data arrays. Results show that the Lorenzo 1-Layer prediction method yields the best results, with over half of its reconstructions having less than 1% relative error across all applications.
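The 1-layer Lorenzo predictor mentioned above reconstructs an entry from its three already-correct neighbors: f[i,j] ≈ f[i-1,j] + f[i,j-1] - f[i-1,j-1]. The sketch below applies it to a corrupted entry of a smooth synthetic field; the field and corruption value are illustrative.

```python
import numpy as np

def lorenzo_recover(a, i, j):
    """Replace a[i, j] with its 1-layer Lorenzo prediction (in place)."""
    a[i, j] = a[i - 1, j] + a[i, j - 1] - a[i - 1, j - 1]
    return a[i, j]

# Smooth field, as typical of many HPC state arrays.
x, y = np.meshgrid(np.linspace(0, 1, 256), np.linspace(0, 1, 256))
field = np.sin(2 * np.pi * x) * np.cos(2 * np.pi * y)
true_val = field[30, 30]

field[30, 30] = 1e30          # simulate a silent bit-flip corruption
recovered = lorenzo_recover(field, 30, 30)
rel_err = abs(recovered - true_val) / abs(true_val)
```

For smooth data the prediction error is a second-order term in the grid spacing, which is why spatially similar neighbors can recover corrupted entries far more cheaply than a checkpoint restart.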
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionAs the energy cost continues to rise, HPC centers need to reduce their energy footprint. We examine a French national machine hosted at CINES in Montpellier, Adastra, based on AMD MI250X GPUs and ranked #3 on the Green500. As a base for the study, we define a set of applications representative of our current HPC/AI production workload. In this parametric study, we characterize our diverse workload by applying a range of frequency or power capping policies at the node level in order to build an efficiency profile of each application. Based on the collected results, we produce guidelines trading between pure energy savings and pure performance for each application and for the production workload as a whole. We hope the results of this study will be of help to accelerator-enabled HPC centers seeking to reduce their energy footprint by applying capping policies on their accelerators at the node level.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionHigh Performance Computing (HPC) has benefited from many improvements during the last decades, especially in hardware platforms that provide more processing power while keeping power consumption at a reasonable level. The Intelligence Processing Unit (IPU) is a new type of massively parallel processor, designed to speed up parallel computations with a huge number of processing cores and on-chip memory components connected with high-speed fabrics. IPUs mainly target machine learning applications; however, due to the architectural differences between GPUs and IPUs, especially the significantly smaller memory capacity of an IPU, methods for reducing model size by sparsification have to be considered. Butterfly factorizations are well-known replacements for fully-connected and convolutional layers. We examine how butterfly structures can be implemented on an IPU and study their behavior and performance compared to a GPU.
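The memory saving behind butterfly factorizations can be sketched quickly: an n x n transform (n = 2^k) is expressed as k sparse factors, each holding only 2n nonzeros, so the total parameter count is 2n·log2(n) instead of n². Random diagonal blocks below stand in for trained weights; this is an illustration of the structure, not the paper's IPU implementation.

```python
import numpy as np

def butterfly_factor(n, block, rng):
    """One butterfly factor: block-diagonal of 2x2 grids of diagonals."""
    f = np.zeros((n, n))
    half = block // 2
    for start in range(0, n, block):
        d = rng.normal(size=(4, half))  # four diagonals per block
        i = np.arange(half)
        f[start + i,        start + i]        = d[0]
        f[start + i,        start + half + i] = d[1]
        f[start + half + i, start + i]        = d[2]
        f[start + half + i, start + half + i] = d[3]
    return f

rng = np.random.default_rng(2)
n = 16
# Factors with shrinking block sizes, as in an FFT-like decomposition.
factors = [butterfly_factor(n, b, rng) for b in (16, 8, 4, 2)]
dense = np.linalg.multi_dot(factors)

params = sum(np.count_nonzero(f) for f in factors)   # 2n per factor
```

For n = 16 this stores 128 parameters instead of 256, and the gap widens rapidly with n, which is what makes the structure attractive for the IPU's limited on-chip memory.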
Paper
Algorithms
Linear Algebra
Post-Moore Computing
TP
DescriptionResistive random access memory (ReRAM) is a promising technology that can perform low-cost and in-situ matrix-vector multiplication (MVM) in the analog domain. Scientific computing requires high-precision floating-point (FP) processing. However, performing floating-point computation in ReRAM is challenging because of the high hardware cost and execution time due to the large FP value range. In this work we present ReFloat, a data format and an accelerator architecture, for low-cost and high-performance floating-point processing in ReRAM for iterative linear solvers. ReFloat matches the ReRAM crossbar hardware and represents a block of FP values with reduced bits and an optimized exponent base for a wide dynamic representation range. Thus, ReFloat achieves less ReRAM crossbar consumption and fewer processing cycles and overcomes the nonconvergence issue of a prior work. The evaluation on the SuiteSparse matrices shows that ReFloat achieves 5.02x to 84.28x improvement in terms of solver time compared to a state-of-the-art ReRAM based accelerator.
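The block floating-point idea underlying formats like ReFloat can be sketched as follows: a block of FP values shares one exponent, and each value keeps only a few fraction bits, so fixed-point arithmetic (crossbar-friendly) can be used within the block. The bit widths and the encoding below are illustrative, not ReFloat's exact format.

```python
import numpy as np

def block_encode(vals, frac_bits=8):
    """Quantize a block of floats to shared-exponent fixed point."""
    shared_exp = int(np.floor(np.log2(np.max(np.abs(vals)))))
    scale = 2.0 ** (frac_bits - shared_exp)
    ints = np.round(vals * scale).astype(np.int64)
    return ints, shared_exp

def block_decode(ints, shared_exp, frac_bits=8):
    """Recover approximate floats from the shared-exponent integers."""
    return ints.astype(np.float64) / 2.0 ** (frac_bits - shared_exp)

rng = np.random.default_rng(3)
block = rng.uniform(0.1, 4.0, size=64)      # one crossbar-sized block
ints, e = block_encode(block)
approx = block_decode(ints, e)
max_rel_err = np.max(np.abs(approx - block) / np.abs(block))
```

Because values within one block of a matrix often share a similar magnitude, the shared exponent loses little accuracy while removing per-value exponent handling from the analog MVM path.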
Workshop
Programming Frameworks and System Software
W
DescriptionThe paper presents improvements in Remora (REsource Monitoring for Remote Applications), a user-oriented, lightweight system monitoring tool designed for modern HPC systems. Assessing application performance can be complicated, and gathering metrics from various components may overwhelm end users. Hence, some HPC users might be able to make their applications more performant if there were easy-to-use monitoring tools for non-HPC experts. Remora addresses this by providing simple tools for quick diagnostic assessments of an application’s resource usage, and offering flexible and adaptable workflow support. The new release, Remora v2, provides performance updates and new features. Other improvements include RemoraPy, a Python wrapper, and RP-Stats, a JupyterLab-based GUI, enhancing data collection, visualization, and analysis capabilities.
Workshop
Data Movement and Memory
Fault Handling and Tolerance
I/O and File Systems
State of the Practice
W
DescriptionThe current generation of Research Data Store (RDS) at The University of Sydney comprises a pair of peta-scale data storage systems. We implemented a disaster recovery (DR) solution for data protection against catastrophic failure at either storage system. To handle the large volume of data transactions into RDS, we took an open-source approach to design an adaptable DR solution that enables parallelized data replication between the pair of storage systems. Over the last three years of operation, the DR solution has gone through a few iterations, which have improved its efficiency. In this paper, we present the findings and outcomes from our DR implementation.
Workshop
Software Engineering
W
DescriptionResearch software engineers (RSEs) are critical to the impact of HPC, data science, and the larger scientific community. They have existed for decades, though often not under that name. The past several years, however, have seen the development of the RSE concept, common job titles, and career paths; the creation of professional networks to connect RSEs; and the emergence of RSE groups at universities, national laboratories, and industry.
This workshop will bring together RSEs and allies involved in HPC, from all over the world, to grow the RSE community by establishing and strengthening professional networks of current RSEs and RSE leaders. We will hear about successes and challenges that RSEs and RSE groups have experienced and discuss ways to increase awareness of RSE opportunities and improve support for RSEs.
The workshop will be highly interactive, featuring breakout discussions and panels, as well as invited addresses and submitted talks.
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
TP
DescriptionServerless computing commonly adopts strong isolation mechanisms for deploying functions, which may bring significant performance overhead because each function needs to run in a completely new environment, i.e., the “one-to-one” model. To accelerate function computation, prior work has proposed using sandbox sharing to reduce the overhead, i.e., the “many-to-one” model. Nonetheless, both process-based true parallelism and thread-based pseudo-parallelism prevent its adoption for latency-sensitive web services.
To achieve optimal performance and resource efficiency for serverless workflows, we argue for an “m-to-n” deployment model that manipulates multiple granularities of computing abstractions (e.g., processes, threads) and sandboxes to amortize overhead. We propose wrap, a new deployment abstraction that balances the tradeoffs between interaction overhead, startup overhead, and function execution. We further design Chiron, a wrap-based deployment manager that can automatically perform the orchestration of multiple computing abstractions based on performance prioritization. Our comprehensive evaluation indicates Chiron outperforms state-of-the-art systems by 1.3x-21.8x in system throughput.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionThis talk will include information about Open Standards, who is using RISC-V for HPC, how we get to application and system software portability, what we've done this year and what is coming up. It is a basic introduction and should help people considering RISC-V prepare to take next steps in creating a product.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionThis work focuses on the design of scheduling algorithms for independent jobs that are submitted to a platform whose resource capacity varies over time. Jobs are submitted online and assigned to a target machine by the scheduler, which is agnostic to the rate and amount of resource variation. The optimization objective is the goodput. We introduce several novel algorithms that: (i) decide which fraction of the resources can be used safely; (ii) maintain a risk index associated with each machine; and (iii) achieve a global load balance while mapping longer jobs to safer machines. We assess the performance of these algorithms using one set of actual workflow traces together with three sets of synthetic traces. The goodput achieved by our algorithms increases by up to 10% compared to standard first-fit approaches, while we never observe any loss in complementary metrics such as the maximum or average stretch.
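The mapping idea in step (iii) can be sketched simply: each machine carries a risk index (an estimated chance of losing capacity mid-job) and longer jobs are steered toward safer machines. The risk values, jobs, and round-robin policy below are made up for illustration; the paper's algorithms also adapt the usable fraction of each machine online.

```python
def map_jobs(jobs, machine_risk):
    """Assign (name, length) jobs, steering longest jobs to lowest risk."""
    safe_first = sorted(machine_risk, key=machine_risk.get)
    assignment = {}
    # Longest jobs first, round-robin over machines from safest to riskiest.
    for idx, (name, _) in enumerate(sorted(jobs, key=lambda j: -j[1])):
        assignment[name] = safe_first[idx % len(safe_first)]
    return assignment

machine_risk = {'m0': 0.05, 'm1': 0.30, 'm2': 0.60}
jobs = [('j1', 10), ('j2', 240), ('j3', 45), ('j4', 600)]
assignment = map_jobs(jobs, machine_risk)
```

The intuition is that a long job on a risky machine wastes the most work when capacity is withdrawn, so goodput improves when job length correlates inversely with machine risk.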
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionSelf-driving cars can potentially improve transportation efficiency and reduce human fatalities – provided they have access to significant processing power and large amounts of data. One popular approach for actualizing autonomous vehicles is using end-to-end learning, in which a machine learning model is trained on a large data set of real human driving. This poster shows how self-driving consistency can be improved using a Convolutional Neural Network (CNN) to predict current velocity. Our approach first reproduces an end-to-end learning result and then extends it with real-time speed data as additional model input.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionToday’s state-of-the-art scientific high-performance computing (HPC) applications generate extensive data in diverse domains, placing a significant strain on data transfer and storage systems. More sophisticated compression algorithms require more processing power and time to compress and decompress data; however, they tend to achieve higher compression ratios, resulting in smaller compressed data sizes. Real-time streaming applications demand high data throughput. Therefore, striking the right balance between compression efficiency and computational complexity is essential. This poster explores two key aspects: the interpolation method of the 'sz3' algorithm for data reconstruction, and the application of the 'szx' algorithm to a 'Region of Interest (ROI)', where lower data distortion is needed. We perform a thorough evaluation using the NYX scientific dataset. Experiments show that the compression ratio is improved by ~2x, and compression and decompression rates are improved by ~5-7x when a contiguous ROI is preserved and only certain recursive levels of sz3 are processed.
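The ROI idea can be sketched with a toy error-bounded quantizer: entries inside the region of interest get a tight error bound and the rest a loose one, trading compression ratio for fidelity where it matters. The uniform-quantization scheme here only illustrates the principle; sz3 and szx use prediction-based encodings.

```python
import numpy as np

def quantize(data, roi_mask, eb_roi=1e-4, eb_out=1e-2):
    """Error-bounded uniform quantization with a per-entry bound."""
    eb = np.where(roi_mask, eb_roi, eb_out)
    codes = np.round(data / (2 * eb)).astype(np.int64)  # bin width 2*eb
    recon = codes * (2 * eb)                            # |recon - data| <= eb
    return codes, recon

rng = np.random.default_rng(4)
data = rng.normal(size=(32, 32))
roi = np.zeros((32, 32), dtype=bool)
roi[8:16, 8:16] = True            # hypothetical region of interest

codes, recon = quantize(data, roi)
err = np.abs(recon - data)
```

Coarser bins outside the ROI produce smaller integer codes that entropy-code better, which is the source of the ratio/throughput gains the poster reports for ROI-preserving configurations.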
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
DescriptionThe proposition for the panel is that specialized (lightweight) OS kernel architectures have become the dominant OS architecture for large-scale HPC systems, with the caveat that these specialized OSes are located in the opaque firmware blobs running on hardware accelerators (primarily GPUs). It is likely that this trend will continue with the majority (if not all) of the performance on the systems being managed by black-box firmware that is only accessible via a work-queue interface implemented by an often proprietary driver stack. Projecting into the future, performance will likely no longer be the primary concern for the open/modifiable components of supercomputing OS architectures, and so the community's research focus will instead need to shift to new capabilities and features that we can bring to the HPC environments. These features could include, for example, multi-tenancy capabilities, security partitioning and confidential computing, support for on-demand workloads with real-time constraints, and integration with edge resources and scientific instruments. An alternative viewpoint is that the research community should instead shift to custom/open hardware solutions that are either designed specifically for research or developed as part of a co-design effort with hardware architects. The purpose of this panel is to foster a conversation amongst the community about how we as a community should address the current landscape of HPC architectures; specifically, whether we should shift the focus of OS/R research away from performance-oriented approaches, and what new potential research opportunities are emerging in an accelerator-dominated ecosystem.
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Workshop
Software Engineering
W
Panel
Software Engineering
TP
DescriptionResearch Software Engineering (RSEng) as a professional designation has grown over the last 10+ years in industry, academia, and government sectors. Within HPC centers, Research Software Engineers (RSE) fill the role of combining software engineering expertise with the in-depth process of participating in and applying research. In this panel, we invite practicing RSEs, funders, university, and HPC center leaders who are experienced and dedicated to Research Software Engineering to present their varying perspectives on funding, managing, and doing RSEng within worldwide HPC centers. The moderator is Daniel S. Katz (Chief Scientist, NCSA; co-founder, US-RSE), and panelists are Gabrielle Allen (Director, School of Computing, University of Wyoming), Neil Chue Hong (EPCC, University of Edinburgh; Director, Software Sustainability Institute), Alison Kennedy (Strategic Advisor, UK Research and Innovation), Fabio Kon (Special Advisor, São Paulo Research Foundation), and Miranda Mundt (RSE, Sandia National Laboratories; Steering Committee Member, US-RSE).
Paper
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionDependence between iterations in sparse computations causes inefficient use of memory and computation resources. This paper proposes sparse fusion, a technique that generates efficient parallel code for the combination of two sparse matrix kernels, where at least one of the kernels has loop-carried dependencies. Existing implementations optimize individual sparse kernels separately. However, this approach leads to synchronization overheads and load imbalance due to the irregular dependence patterns of sparse kernels, as well as inefficient cache usage due to their irregular memory access patterns. Sparse fusion uses a novel inspection strategy and code transformation to generate parallel fused code optimized for data locality and load balance. Sparse fusion outperforms the best of unfused implementations using ParSy and MKL by an average of 4.2× and is faster than the best of fused implementations using existing scheduling algorithms, such as LBC, DAGP, and wavefront by an average of 4× for various kernel combinations.
Panel
Codesign
Heterogeneous Computing
Runtime Systems
TP
DescriptionExtreme heterogeneity is defined as one of the most important priority research directions today. Additionally, the applications are expected to grow in complexity to enable progress in multiple areas of science, technology and engineering. This urges consideration of hardware/software co-design to facilitate the adoption of emerging technologies. With such a scenario in mind, the opportunities lie in designing new features in runtimes and workflows. This panel aims to debate how future systems will look. Advances in this matter are key to executing science workflows and understanding their results, enabling efficient execution on diverse platforms, ensuring scalability of high-level descriptions of analytics workflows, and increasing user productivity and system utilization. In other words, how easily and rapidly a science team can develop or port a workflow to a new platform, and how well the resulting implementation makes use of the platform and its resources.
Students@SC
DescriptionThis event, which will take place from 9:00-3:00 on November 13 in person at SC in rooms 705-707, is open to anyone interested in learning more about High-Performance Computing (HPC). Participants will receive an overview of HPC programming environments, parallel programming models, job schedulers, and job launchers. Afterward, they will be directed to self-guided HPC challenges covering basic parallel programming, AI, and GPU programming topics. These challenges will be performed on OLCF’s Frontier Exascale system and Purdue’s Anvil Supercomputer. Frontier is currently the most powerful supercomputer in the world. Students will have access to Frontier during the workshop and to Anvil afterward until December 13 to complete the exercises required for an HPC Crash Course certificate.
Pre-Workshop Session:
We will host pre-workshop help sessions on November 7 at 1:00 p.m. ET and 7:00 p.m. ET to review requirements, what to expect, and why you should know about HPC.
To attend one of the help sessions, please put your email in this sheet and indicate which time works for you. We will send you the joining link before November 7, 2023.
Signup sheet for Pre-workshop Help session: https://docs.google.com/spreadsheets/d/1ApA3vfa_jvzzmgNXu3LMPQHhHtx7foZPzD49MWs8FFk/edit?usp=sharing.
Frontier Access Requirements: Eligible participants will be provided access tokens and usernames for Frontier during the workshop. To gain access to Frontier:
1. Bring a government-issued photo ID to the workshop for quick access vetting.
2. Bring an internet-ready Laptop to the event.
3. Note that foreign nationals from countries listed in section 15 CFR 740.7 License Exceptions for Computers (including Cuba, Iran, North Korea, Sudan, and Syria) may require a lengthy approval process for access to DOE supercomputers. If approval cannot be obtained in time for the Crash Course, affected participants can apply to work on Anvil.
Anvil Access Requirements: Participants who need more time to complete exercises after the workshop or who cannot gain access to Frontier can apply for access to Anvil. It is strongly recommended that participants apply for Anvil access ahead of the workshop.
To apply for Anvil access, follow these steps:
1. If you do not already have an ACCESS ID, please go to https://identity.access-ci.org/new-user and follow the instructions listed here: https://drive.google.com/file/d/1LA9MjC__fow7Yr-NEmGTCTqLs0ZV26_Z/view?usp=share_link.
2. Once you have your ACCESS ID, please enter it here: https://docs.google.com/spreadsheets/d/1ApA3vfa_jvzzmgNXu3LMPQHhHtx7foZPzD49MWs8FFk/edit?usp=sharing. Anvil admins will then add your access.
Posters
Research Posters
TP
XO/EX
DescriptionThe SCALABLE project aims to enhance an industrial Lattice Boltzmann Method (LBM)-based computational fluid dynamics (CFD) solver for current and future extreme-scale architectures, while ensuring accessibility for end-users and developers. This is accomplished by transferring technology and knowledge between academic code waLBerla and industrial code LaBS.
This poster introduces both software packages and the technology transfer process, resulting in improved CPU and GPU performance and increased interest in energy efficiency.
LBM is a trustworthy alternative to conventional CFD, showing roughly an order-of-magnitude performance advantage over Navier-Stokes approaches in comparable scenarios.
SCALABLE unites the waLBerla and LaBS developers to improve both solvers in terms of portability (targeting GPUs, for example) and energy efficiency, transferring techniques between the two to achieve high performance, scalability, and energy efficiency.
Posters
Research Posters
TP
XO/EX
DescriptionAs a dynamic network’s topology undergoes temporal alterations, its associated graph properties must be updated to ensure their accuracy. Addressing this requirement efficiently in large dynamic networks led to the proposal of a generic framework, CANDY (Cyberinfrastructure for Accelerating Innovation in Network Dynamics). This paper expounds on the development of algorithms and the subsequent performance improvements facilitated by CANDY.
Panel
Artificial Intelligence/Machine Learning
TP
DescriptionAI/Machine Learning usage is exploding in both application and model size. Predictive analytics, physics, modeling, and new use cases for generative AI/ML are increasing model sizes by 10x every 18 months. The custom processors and accelerators used for AI/ML require continually higher I/O bandwidth to address this model growth. How, then, does one deploy a high-performance architecture that is scalable and adaptable through time to address this phenomenon? The panel will discuss the architectures, I/O, and large-scale system topologies that will be needed to grow well beyond 200 billion parameters. You will gain insights into system concepts, scaled across workload sizes, that are both cost-effective, thanks to new configurability options, and energy-efficient. Is there a new Billion Parameters per Watt metric? These are the topics the panel will discuss and debate.
Tutorial
Architecture and Networks
Data Movement and Memory
Message Passing
TUT
DescriptionThere are several popular Big Data processing frameworks including Apache Spark and Dask. These frameworks are not capable of exploiting high-speed and low-latency networks like InfiniBand, Omni-Path, Slingshot, and others. In the High Performance Computing (HPC) community, the Message Passing Interface (MPI) libraries are widely adopted to tackle this issue by executing scientific and engineering applications on parallel hardware connected via fast interconnect.
This tutorial introduces MPI4Spark and MPI4Dask, enhanced versions of the Spark and Dask frameworks, respectively, that are capable of utilizing MPI for communication in a parallel and distributed setting on HPC systems. MPI4Spark launches the Spark ecosystem using MPI launchers to utilize MPI communication. It also maintains isolation for application execution by forking new processes using Dynamic Process Management (DPM). MPI4Spark also provides portability and performance benefits as it can utilize popular HPC interconnects. MPI4Dask is an MPI-based custom Dask framework targeted at modern HPC clusters built with CPUs and NVIDIA GPUs.
This tutorial provides a detailed overview of the design, implementation, and evaluation of MPI4Spark and MPI4Dask on state-of-the-art HPC systems. Later, we also cover writing, running, and demonstrating user Big Data applications on HPC systems.
Posters
Research Posters
TP
XO/EX
DescriptionThe demand for interactivity on HPC systems is increasing, primarily driven by new HPC users from the AI/ML research area. Traditional HPC users are accustomed to waiting for job execution on a batch scheduling system, while new users prefer an interactive terminal such as a Jupyter Notebook. To address these evolving requirements, enhancing interactivity is essential. Fine-grained gang scheduling is one potential solution to this problem. This poster presents a scalable inter-node synchronization mechanism that facilitates precisely time-aligned delivery of synchronization messages through broadcast communication for fine-grained gang scheduling in HPC systems. The mechanism improved application performance by 2.7x compared to the existing implementation when simultaneously executing two parallel applications on 128 compute nodes with a 500 ms time slice.
Workshop
State of the Practice
W
DescriptionParallel computing plays a pivotal role in the efficient processing of vast-scale graphs. Complex network analysis stands as a captivating frontier of research, holding promise across diverse scientific domains such as sociology, biology, online media, and recommendation systems. In this era, Machine Learning (ML) and Deep Learning (DL) emerge as indispensable tools, underpinning remarkable technological achievements. Within this dynamic landscape, my research revolves around the advancement of parallel algorithms tailored for large-scale graph operations. To achieve this, I harness the power of cutting-edge technologies including OpenMP, MPI, HIP, and CUDA on High-Performance Computing (HPC) platforms to unlock optimal performance. I also apply ML/DL techniques to HPC operational data to streamline the monitoring and maintenance of supercomputers, alleviating the complexities associated with their upkeep and enhancing user support. My research reflects the synergy between parallel computing, large-scale graph analysis, and ML/DL, enhancing computational efficiency and user experience.
Posters
Research Posters
TP
XO/EX
DescriptionA neural-network-based reduced-order modeling method for three-dimensional turbulent flow simulation is proposed and implemented as scalable distributed learning on Fugaku. Our method combines dimensionality reduction using a convolutional-autoencoder-like neural network with time-evolution prediction using long short-term memory networks. The time evolution of turbulent three-dimensional flow (e.g., Re=2.8×10^6) could be simulated at a significantly lower cost (approximately four orders of magnitude) without a major loss in accuracy. Using a single core memory group, our implementation shows 370 GFLOPS (24.28% of peak performance) for the entire training loop and 753 GFLOPS (24.28% of peak performance) for the convolution kernel. Our implementation scales up to 25,250 compute nodes (1,212,000 cores), showing 72.9% weak-scaling performance (7.8 PFLOPS) for the entire training loop, while the convolution routine shows 100.8% weak-scaling performance (113 PFLOPS).
Paper
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
Best Paper Finalist
DescriptionHPC is a heterogeneous world in which host and device code are interleaved throughout the application. Given the significant performance advantage of accelerators, device code execution time is becoming the new bottleneck. Tuning the accelerated parts is consequently highly desirable but often impractical due to the large overall application runtime which includes unrelated host parts.
We propose a Record-Replay (RR) mechanism to facilitate auto-tuning of large (OpenMP) offload applications. RR dissects the application, effectively isolating GPU kernels into independent executables. These comparatively small code-lets are amenable to various forms of post-processing, including elaborate auto-tuning. By eliminating the resource requirements and application dependencies, massively parallel and distributed auto-tuning becomes feasible.
Using RR, we run scalable Bayesian Optimization to determine optimal kernel launch parameters. LULESH showcases an end-to-end speedup of up to 1.53x, while RR enables 102x faster tuning compared to existing approaches using the entire application.
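The launch-parameter tuning that RR makes cheap can be illustrated with a toy search loop. This sketch substitutes a plain random search for the Bayesian Optimization the paper uses, and the kernel timing function is a mock stand-in, not part of the described system:

```python
import random

def mock_kernel_time(num_teams, threads_per_team):
    """Mock timing model for an isolated offload kernel (illustrative only):
    it favors 256 teams and 128 threads per team."""
    return abs(num_teams - 256) / 256 + abs(threads_per_team - 128) / 128 + 1.0

def tune(trials=200, seed=0):
    """Randomly sample launch configurations, keeping the fastest one.
    In the RR setting, each trial would replay a recorded kernel code-let."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cfg = (rng.choice([64, 128, 256, 512, 1024]),   # num_teams
               rng.choice([32, 64, 128, 256]))          # threads_per_team
        t = mock_kernel_time(*cfg)
        if best is None or t < best[0]:
            best = (t, cfg)
    return best

best_time, best_cfg = tune()
```

Because each trial replays only the recorded kernel rather than the whole application, many such trials can run independently in parallel, which is what makes the 102x tuning speedup plausible.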
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionDue to various restrictions in serverless computing, developers face significant challenges when pipelining multiple Backend-as-a-Service (BaaS) services: they are restricted by the maximum size of the serverless function’s deployment package and by throughput and concurrency limits for functions and BaaS services. To bridge this gap, we introduce a methodology for coding scalable and composite BaaS services in the form of serverless workflows. Using the Abstract Function Choreography Language (AFCL), we develop and characterize two scalable and composite BaaS services: (i) pdf2SpeechDE, which translates a PDF file written in English and converts the extracted text into an audio file in German; and (ii) Speech2SpeechDE, which translates audio files from English into a single audio file in German. We composed pdf2SpeechDE and Speech2SpeechDE as sequences of three BaaS services each, including split-merge functions for scalability. The characterization showed that there is no dominating provider for individual BaaS services.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionComputational Fluid Dynamics (CFD) demands immense memory and computational power, leading to reliance on high-end computing platforms. Traditional methods, like checkpointing, are inefficient, often slowing simulations when data is saved. As technology advances towards exascale and GPU-powered High-Performance Computing (HPC), the dilemma emerges: either compromise data accuracy or decrease resolution. Addressing this, our research promotes in situ analysis and visualization techniques. This approach allows for more regular data snapshots directly from memory, bypassing the pitfalls of checkpointing. We delve into our application of NekRS, a GPU-centric thermal-fluid simulation code, showcasing diverse in situ strategies. To demonstrate real-world implications, we conducted experiments on the Polaris and JUWELS Booster supercomputers. These tests offer crucial insights, suggesting that with careful methodology, one can achieve efficient data management without compromising accuracy, even in the most demanding computational scenarios.
Doctoral Showcase
Posters
Data Compression
I/O and File Systems
TP
DescriptionFor scientists and engineers, large-scale computer systems are one of the most powerful tools to solve complex high-performance computing (HPC) and Deep Learning (DL) problems. With the ever-increasing computing power such as the new generation of exascale (one exaflop or a billion billion calculations per second) supercomputers, the gap between computing power and limited storage capacity and I/O bandwidth has become a major challenge for scientists and engineers. Large-scale scientific simulations on parallel computers can generate extremely large amounts of data that are highly compute and storage intensive. This study will introduce data reduction techniques as a promising solution to significantly reduce the data sizes while maintaining high data fidelity for post-analyses in HPC applications. This study can be categorized into mainly four scenarios: (1) A ratio-quality model that makes lossy compression predictable; (2) advanced parallel write solution with async-I/O; (3) in-situ data reduction for scientific applications; and (4) in-situ data reduction for large-scale machine learning.
Workshop
Education
State of the Practice
W
DescriptionThroughout the cyberinfrastructure community there is a large range of resources available to train faculty and young scholars about successful utilization of computational resources for research. The challenge that the community faces is that training materials abound, but they can be difficult to find, and often have little information about the quality or relevance of offerings. Building on existing software technology, we propose to build a way for the community to better share and find training and education materials, through a federated training repository. In this scenario, organizations and authors retain physical and legal ownership of their materials by sharing only catalog information, organizations can refine local portals to use the best and most appropriate materials from both local and remote sources, and learners can take advantage of materials that are reviewed and described more clearly.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThis poster discusses efficient system designs for Large Language Model (LLM) scaling to up to 128 trillion parameters. We use a comprehensive analytical performance model to analyze how such models could be trained on current systems while maintaining 75% Model FLOPS Utilization (MFU). We first show how tensor offloading alone can be used to dramatically increase the size of trainable LLMs. We analyze performance bottlenecks when scaling on systems with up to 16,384 GPUs and models of up to 128T parameters. Our findings suggest that current H100 GPUs with 80 GiB of HBM, augmented with 512 GiB of tensor offloading capacity, allow scaling to 11T-parameter LLMs; getting to 128T parameters requires 120 GiB of HBM and 2 TiB of offloading memory, yielding 75%+ MFU, which is uncommon even when training much smaller LLMs today.
Posters
Research Posters
TP
XO/EX
DescriptionK-Path centrality is based on the flow of information in a graph along simple paths of length at most K. This work addresses the computational cost of estimating K-path centrality in large-scale graphs by introducing the random neighbor traversal graph (RaNT-Graph). The distributed graph data structure employs a combination of vertex delegation partitioning and rejection sampling, enabling it to sample massive amounts of random paths on large scale-free graphs. We evaluate our approach by running experiments which demonstrate strong scaling on large real-world graphs. The RaNT-Graph approach achieved a 56,544x speedup over the baseline 1D partition implementation when estimating K-path centrality on a graph with 89 million vertices and 1.9 billion edges.
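The sampling idea behind the RaNT-Graph estimator can be sketched serially: draw random simple paths of length at most K and count how often each vertex is traversed. This toy version (all names and the graph are illustrative; the actual system distributes the work with vertex delegation and rejection sampling over massive graphs) shows the core loop:

```python
import random
from collections import defaultdict

def estimate_kpath(adj, k, samples=2000, seed=1):
    """Toy K-path-style estimator: sample random simple paths of length <= k
    and count traversals per vertex.  Serial and unoptimized, unlike the
    distributed RaNT-Graph approach it is meant to illustrate."""
    rng = random.Random(seed)
    counts = defaultdict(int)
    nodes = list(adj)
    for _ in range(samples):
        v = rng.choice(nodes)           # uniform random start vertex
        visited = {v}
        for _ in range(rng.randint(1, k)):
            # Restrict to unvisited neighbors to keep the path simple.
            candidates = [u for u in adj[v] if u not in visited]
            if not candidates:
                break                   # dead end: path cannot be extended
            v = rng.choice(candidates)
            visited.add(v)
            counts[v] += 1
    return counts

# A small path graph 0-1-2-3: the middle vertices are traversed most often.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
counts = estimate_kpath(adj, k=3)
```

Rejection-style sampling matters here because naively extending a path can land on an already-visited vertex; restricting (or resampling) to unvisited neighbors keeps every sampled path simple.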
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionWhen running at scale, modern scientific workflows require middleware to handle allocated resources, distribute computing payloads, and guarantee a resilient execution. While individual steps might not require sophisticated control methods, bringing them together as a whole workflow requires advanced management mechanisms. In this work, we used RADICAL-EnTK (Ensemble Toolkit), one of the SDK components of the ECP ExaWorks project, to implement and execute the novel Exascale Additive Manufacturing (ExaAM) workflows on up to 8,000 compute nodes of the Frontier supercomputer at the Oak Ridge Leadership Computing Facility. EnTK allowed us to address challenges such as varying resource requirements (e.g., heterogeneity, size, and runtime), a different execution environment per workflow, and fault tolerance. The native portability of the developed EnTK applications allowed us to adapt them promptly for Frontier runs while ensuring an expected level of resource utilization (up to 90%).
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionAI accelerator processing and memory constraints largely dictate the scale at which machine learning workloads (training and inference) can be executed within a desirable time frame. Training a transformer-based model requires the utilization of HPC, from the inherent parallelism embedded in processor design to deliberate modification of neural networks to increase concurrency during training and inference. Our model is the culmination of different performance tests seeking the ideal combination of frameworks and configurations for training a 13-billion-parameter translation model for foreign languages. We performed ETL over the corpus, which involved creating a balanced interleaved dataset. We investigated the impact of batch size, learning rate, and different forms of precision on model training time, accuracy, and memory consumption. We use DeepSpeed stage 3 and Hugging Face Accelerate to parallelize our model. Our model, based on the mT5 architecture, is trained on the mC4 and language-specific datasets, enabling question-answering in the fine-tuning process.
ACM Gordon Bell Finalist
Awards
TP
DescriptionWe exploit the high memory bandwidth of AI-customized Cerebras CS-2 systems for seismic processing. Through low-rank matrix approximation, memory-hungry seismic applications fit onto memory-austere SRAM waferscale hardware, addressing a challenge arising in many wave-equation-based algorithms that rely on multi-dimensional convolution (MDC) operators. Exploiting sparsity inherent in seismic data in the frequency domain, we implement embarrassingly parallel tile low-rank matrix-vector multiplications (TLR-MVM), which account for most of the elapsed time in MDC operations, to solve the Multi-Dimensional Deconvolution (MDD) inverse problem. By reducing memory footprint along with arithmetic complexity, we fit a standard seismic benchmark dataset into the local memories of Cerebras processing elements. TLR-MVM on 48 CS-2 systems in support of MDD delivers a sustained memory bandwidth of 92.58 PB/s on 35,784,000 processing elements, a significant milestone that highlights the capability of AI-customized architectures to enable a new generation of seismic algorithms that will empower multiple technologies of our low-carbon future.
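The core TLR-MVM idea can be sketched with NumPy (an illustrative serial sketch, not the Cerebras implementation): partition the matrix into tiles, store each tile as a low-rank factorization U·V, and apply each tile as two skinny products:

```python
import numpy as np

def tlr_mvm(tiles, x, tile_size):
    """Tile low-rank matrix-vector multiply (illustrative sketch).
    tiles[(i, j)] = (U, V) holds a rank-r factorization of tile (i, j),
    so the dense tile is approximately U @ V.  Applying U @ (V @ x_j)
    costs O(r * tile_size) flops and memory per tile instead of
    O(tile_size**2), which is what lets the data fit in small SRAM."""
    n_rows = (max(i for i, _ in tiles) + 1) * tile_size
    y = np.zeros(n_rows)
    for (i, j), (U, V) in tiles.items():
        xj = x[j * tile_size:(j + 1) * tile_size]
        y[i * tile_size:(i + 1) * tile_size] += U @ (V @ xj)
    return y

# Build a 2x2-tile matrix whose tiles are exactly rank 1, so TLR-MVM is exact.
rng = np.random.default_rng(0)
ts = 4
tiles, dense = {}, np.zeros((2 * ts, 2 * ts))
for i in range(2):
    for j in range(2):
        U = rng.standard_normal((ts, 1))
        V = rng.standard_normal((1, ts))
        tiles[(i, j)] = (U, V)
        dense[i*ts:(i+1)*ts, j*ts:(j+1)*ts] = U @ V
x = rng.standard_normal(2 * ts)
y = tlr_mvm(tiles, x, ts)
```

Each tile's two small products are independent of every other tile's, which is what makes the scheme embarrassingly parallel across processing elements.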
ACM Gordon Bell Finalist
Awards
TP
DescriptionThis work brings the leading accuracy, sample efficiency, and robustness of deep equivariant neural networks to the extreme computational scale. This is achieved through a combination of innovative model architecture, massive parallelization, and models and implementations optimized for efficient GPU utilization. The resulting Allegro architecture bridges the accuracy/speed tradeoff of atomistic simulations and enables description of dynamics in structures of unprecedented complexity at quantum fidelity. To illustrate the scalability of Allegro, we perform nanoseconds-long stable simulations of protein dynamics and scale up to a 44-million atom structure of a complete, all-atom, explicitly solvated HIV capsid on the Perlmutter supercomputer. We demonstrate excellent strong scaling up to 100 million atoms and 70% weak scaling to 5120 A100 GPUs.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionThis technical deep dive will demonstrate scaling an application to up to 32 accelerators on a single node, which until now was only possible on a supercomputer. This is achieved without modifying the application software for HPC or AI workloads, saving users considerable time and effort in porting software.
This new capability was made possible by a deep integration between the engineering teams of AMD and GigaIO. It utilizes off-the-shelf servers and GPUs connected over GigaIO’s native PCIe memory fabric, which provides the same performance and latency as if those accelerators were housed within the server sheet metal.
This talk will cover the steps to create this first-of-its-kind server, the GigaIO SuperNODE, including how to identify and resolve issues that prevent the enumeration of large numbers of GPUs, such as hardcoded limits within ROCm, physical address bit inconsistencies between CPUs (Milan, Genoa) and GPUs, and memory address issues in the VBIOS.
GigaIO will demonstrate how frameworks such as PyTorch and TensorFlow “just work” when run on this all-AMD system, without changing a single line of code. The plug-and-play nature of this solution opens new possibilities for generative AI and machine learning workloads, especially given the current availability constraints on GPUs.
Limitations encountered include the need for server vendors to be willing to modify their server BIOS to accommodate the unexpected number of PCIe end-points and to support dynamic allocation of resources. As such, this solution is only available on selected platforms from those server vendors who have undertaken that effort. Other limiting factors include the total number of BUS IDs and MMIO space.
Tutorial
Applications
Cloud Computing
Performance Measurement, Modeling, and Tools
TUT
DescriptionKubernetes has emerged as the leading container orchestration solution, running on resources ranging from on-prem clusters to commercial clouds. Developed at Google and now maintained by the Cloud Native Computing Foundation, it sports a diverse and active development community. At SDSC, Kubernetes capabilities are available on the Expanse, Voyager, and Prototype National Research Platform (PNRP) Nautilus (multi-site distributed resource) clusters. The ability to run services in Kubernetes enables execution of non-traditional workloads, including complex scientific workflows that are difficult to handle through traditional batch scheduling on HPC clusters.
Kubernetes does not have a traditional batch interface, but the concepts are similar enough to allow for porting of existing batch-focused workloads to it. Users can customize their software environment in containers. Kubernetes provides significantly richer semantics, including explicit storage and network provisioning, that allow execution of scientific computing workflows typically not feasible on batch systems.
In this tutorial, the attendees will get an overview of the Kubernetes architecture, typical job and workflow submission procedures, learn how to use various storage options, and will learn how to run their software using Kubernetes. Theoretical information will be paired with hands-on sessions operating on the PNRP production Kubernetes cluster Nautilus.
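The mapping from batch jobs to Kubernetes rests on the Job resource. A minimal batch-style manifest looks like this (a generic sketch; the names, image, and resource values are illustrative, not specific to Expanse, Voyager, or Nautilus):

```yaml
# Minimal batch-style Kubernetes Job (illustrative example).
apiVersion: batch/v1
kind: Job
metadata:
  name: example-sim
spec:
  backoffLimit: 2            # retry a failed pod up to twice
  template:
    spec:
      restartPolicy: Never   # Jobs require Never or OnFailure
      containers:
      - name: sim
        image: python:3.11-slim
        command: ["python", "-c", "print('hello from the cluster')"]
        resources:
          requests: {cpu: "1", memory: 1Gi}
          limits:   {cpu: "1", memory: 1Gi}
```

Submitted with `kubectl apply -f job.yaml`, this plays the role a batch script plays under a traditional scheduler: the `resources` block is the analogue of a resource request, and `backoffLimit` of a retry policy.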
Birds of a Feather
Programming Frameworks and System Software
Software Engineering
TP
XO/EX
DescriptionSoftware has become central to all aspects of modern science and technology. Especially in high-performance computing (HPC) and computational science and engineering (CSE), it is becoming ever-larger and more complex while computer platforms evolve and become more diverse. Simultaneously, the teams behind the software are becoming larger, more technically diverse, and more geographically distributed.
This BoF provides an opportunity for people concerned about these topics to share existing experiences and activities, discuss how we can improve on them, and share the results. Presentations and discussion notes will be made available at the BoF series website, http://bit.ly/swe-cse-bof.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionRISC-V is an open standard Instruction Set Architecture (ISA) which enables the open development of CPUs and a shared common software ecosystem. There are already approximately 10 billion RISC-V cores, a number expected to grow rapidly. Nonetheless, for all the success that RISC-V has enjoyed, it is yet to become popular in HPC. Recent advances, however, such as the vectorisation standard and data-center RISC-V-based CPUs, mean that this technology is becoming a more realistic proposition for our workloads.
This workshop aims to connect those currently involved in RISC-V with the wider HPC community. We look to bring together RISC-V experts with scientific software developers, vendors, and supercomputing center operators to explore the advantages, challenges, and opportunities that RISC-V can bring to HPC. Furthermore, we aim to further expand the RISC-V HPC SIG, enabling interested attendees to participate in one of the most exciting open-source technological activities of our time.
Tutorial
Performance Measurement, Modeling, and Tools
Software Engineering
TUT
DescriptionHPC increasingly involves the development and deployment of network and cloud services. Unique to the HPC field is the large amount of software that we develop to drive these services. These services must assure data integrity and availability, while providing access to a global scientific and engineering community.
Securing your network is not enough! Every service that you deploy is a window into your data center from the outside world, and a window that could be exploited by an attacker.
This tutorial is relevant to anyone wanting to learn about minimizing security flaws in the software they develop or manage. We share our experiences gained from performing vulnerability assessments of critical middleware. You will learn skills critical for software developers and analysts.
Dependency analysis tools, which find weaknesses in the software supply chain, are the first line of defense in assessing the security of a software project. These tools can catch flaws in the packages and libraries a program depends upon that affect the safety of the application. This tutorial is also relevant to anyone wanting to learn how to use these automated dependency analysis tools to minimize security flaws in the software they develop or manage.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionHigh performance computing (HPC) filesystems are extremely large, complex, and difficult to manage with existing tools. It is challenging for HPC administrators to describe the current structure of their filesystems, predict how they will change over time, and anticipate the requirements for future filesystems as they continue to evolve. Previous studies of filesystem characteristics largely predate the modern HPC filesystems of the last decade. The Grand Unified File Index (GUFI) was used to collect the data used to compute the characteristics of six HPC filesystem indexes from Los Alamos National Laboratory (LANL), representing 2.8 PB of data and containing 36 million directories and 600 million files. We present a methodology using GUFI to characterize the shape of HPC filesystems to help system administrators understand their key characteristics.
This document has been approved for public release under LA-UR-23-28958.
Exhibits
Flash Session
TP
XO/EX
DescriptionSegment Routing is a powerful and proven technology for simplifying IP network operations and enabling network programmability. Learn how others are using Segment Routing to improve efficiency and achieve unparalleled network control and visibility. https://www.nokia.com/industries/research-and-education-networks/
Workshop
Cloud Computing
Resource Management
State of the Practice
W
DescriptionCorrectly using the compute capacity of an HPC or OpenStack cluster is often a stumbling block for users, especially those from non-traditional domains where a cluster is only a tool, not the subject of their research.
This paper describes TrailblazingTurtle, a web portal built for HPC and OpenStack clusters that lets users view the resources used and wasted by their jobs without having to modify their workflow. Metrics are collected from various data sources on the cluster to enable monitoring at the job and VM level, and are presented to users and staff members as a simple web application. This platform makes it easy for newer users to request the correct quantity of computing resources for their work, see their impact on the shared file system, and follow the evolution of their group's priority in Slurm.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionCompression ratio estimation is an important optimization of I/O workflows processing terabytes of data. Applications such as compression auto-tuning or lossy compressor selection require a high-throughput, accurate estimation. Prior works that utilize sampling are fast but inaccurate, while approaches which do not use sampling are highly accurate but slow. Through sensitivity analysis we show that sampling a small number of moderately sized data blocks maintains fast data transfer and yields superior estimation accuracy compared to existing sampling approaches, and we use this to construct a new fast and accurate sampling method. In relation to non-sampling techniques, our method results in less than 10% degradation in estimation accuracy, while still maintaining the high throughput of the less accurate sampling methods.
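The paper's specific sampling scheme is not reproduced here, but the general idea of estimating a compression ratio from a few moderately sized blocks can be sketched as follows (zlib stands in for an HPC compressor; block size and sample count are illustrative choices, not values from the paper):

```python
import zlib
import numpy as np

def estimate_ratio(data: bytes, block_size: int = 64 * 1024,
                   n_blocks: int = 16, seed: int = 0) -> float:
    """Estimate the compression ratio of `data` by compressing only a
    random sample of fixed-size blocks instead of the whole buffer."""
    rng = np.random.default_rng(seed)
    total_blocks = max(1, len(data) // block_size)
    picks = rng.choice(total_blocks, size=min(n_blocks, total_blocks),
                       replace=False)
    raw = compressed = 0
    for b in sorted(picks):
        block = data[b * block_size:(b + 1) * block_size]
        raw += len(block)
        compressed += len(zlib.compress(block))
    return raw / compressed  # ratio > 1 means the data compresses
```

Sampling keeps the cost proportional to `n_blocks * block_size` rather than to the full data size; accuracy then hinges on choosing blocks large enough to expose the compressor's behavior, which is the trade-off the sensitivity analysis above explores.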
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionWe describe shmem4py, a Python wrapper for the OpenSHMEM application programming interface (API) which follows a design similar to that of the well-known mpi4py package. OpenSHMEM is a descendant of the one-sided communication library for the Cray T3D and it is known for its uncompromising performance for low-latency and high-throughput use cases involving one-sided and collective communication. OpenSHMEM is arguably one of the most efficient and portable abstractions for modern network architectures. Thanks to tight interoperability with NumPy, shmem4py provides a convenient parallel programming framework leveraging both the high-productivity NumPy feature set and the high-performance networking capabilities of OpenSHMEM. This paper discusses the design and performance characteristics of shmem4py in a variety of communication patterns relative to lower-level languages (C) as well as MPI and mpi4py.
Workshop
Architecture and Networks
Hardware Technologies
W
DescriptionFor years, SIMD/vector units have enhanced the capabilities of modern CPUs in High-Performance Computing (HPC) and mobile technology. Typical commercially-available SIMD units process up to 8 double-precision elements with one instruction. The optimal vector width and its impact on CPU throughput due to memory latency and bandwidth remain challenging research areas. This study examines the behavior of four computational kernels on a RISC-V core connected to a customizable vector unit, capable of operating up to 256 double precision elements per instruction. The four codes have been purposefully selected to represent non-dense workloads: SpMV, BFS, PageRank, FFT. The experimental setup allows us to measure their performance while varying the vector length, the memory latency, and bandwidth. Our results not only show that larger vector lengths allow for better tolerance of limitations in the memory subsystem but also offer hope to code developers beyond dense linear algebra.
Birds of a Feather
Data Analysis, Visualization, and Storage
TP
XO/EX
DescriptionThe annual business meeting of SIGHPC is your opportunity to hear about and discuss the status of SIGHPC and its chapters. All of the elected officers and many of the other volunteers will be present to answer your questions about SIGHPC, and representatives from our chapters will also be available. Finally, we will discuss upcoming plans for the year.
Posters
Research Posters
TP
XO/EX
DescriptionDistributed-memory graph applications are dominated by communication and synchronization overheads. For such applications, the communication pattern comprises variable-sized data exchanges between process neighbors in a process graph topology, which, unlike the process grid of rectangular problems, is difficult to optimize for locality in a sustainable fashion.
Process assignment or remapping can improve communication performance; however, existing solutions mostly cater to Cartesian process topologies rather than graph topologies. In this work, we propose application- and topology-agnostic process remapping strategies for graph applications. For two communication-intensive distributed-memory graph applications (graph clustering and triangle counting), we demonstrate about 30% improvement in overall execution time through various remapping methodologies via SST-based packet-level simulations on Dragonfly and Fat-Tree network topologies.
Posters
Research Posters
TP
XO/EX
DescriptionQuantum computation is an emerging technology that promises to be able to solve certain tasks that are out of reach of classical machines alone. However, the limited number and quality of qubits poses a challenge for practical usage of near-term quantum computation. Circuit cutting is a technique to decrease the size of circuits at the cost of an additional sampling overhead. This can enable executing problems larger in size and with higher-quality outcomes than what available quantum hardware would otherwise support.
Here, we use the Circuit Knitting Toolbox (CKT) to demonstrate two applications of circuit cutting. To scale these workloads up to hundreds of qubits, we use Quantum Serverless – a new framework for distributing computationally expensive workloads in the cloud.
Workshop
State of the Practice
W
DescriptionSimulating chemistry with quantum mechanical accuracy is a challenging task, while at the same time crucial to describing certain molecular interactions. On a classical machine, the scaling of quantum chemistry methods ranges from cubic for popular but less accurate methods to exponential for the highest quality methods. Quantum computers may be able to reduce the exponential scaling for high-accuracy methods, but while current quantum devices remain noisy, it is important to fully leverage modern high performance computing hardware. Doing so will enable pushing past present limits to study or even design new chemistry. In this work, we present efforts to scale quantum chemistry simulations to heterogeneous HPC systems, simulate quantum circuits efficiently on classical hardware, and provide an outlook on hybrid quantum/classical approaches.
Posters
Research Posters
TP
XO/EX
DescriptionNWQ-Sim is a cutting-edge quantum system simulation environment designed to run on classical multi-node, multi-CPU/GPU heterogeneous HPC systems. In this work, we provide a brief overview of NWQ-Sim and its implementation in simulating quantum circuit applications, such as the transverse field Ising model. We also demonstrate how NWQ-Sim can be used to examine the effects of errors that occur on real quantum devices, using a combined device noise model. Moreover, NWQ-Sim is particularly well-suited for implementing variational quantum algorithms where circuits are dynamically generated. Therefore, we also illustrate this with the variational quantum eigensolver (VQE) for the Ising model. In both cases, NWQ-Sim's performance is comparable to or better than alternative simulators. We conclude that NWQ-Sim is a useful and flexible tool for simulating quantum circuits and algorithms, with performance advantages and noise-aware simulation capabilities.
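NWQ-Sim's engine is far more sophisticated, but the core of classical statevector simulation can be illustrated in a few lines (a hypothetical two-qubit example preparing a Bell state; qubit 0 is the leftmost bit of the state index):

```python
import numpy as np

# Single-qubit Hadamard and two-qubit CNOT (control = qubit 0)
H = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]], dtype=float)

def bell_state() -> np.ndarray:
    """Simulate |00> -> H on qubit 0 -> CNOT, yielding (|00>+|11>)/sqrt(2)."""
    state = np.zeros(4)
    state[0] = 1.0                         # start in |00>
    state = np.kron(H, np.eye(2)) @ state  # apply H to qubit 0 only
    return CNOT @ state

probs = bell_state() ** 2  # measurement probabilities for |00>,|01>,|10>,|11>
```

The statevector doubles with every added qubit, which is why scaling such simulations to large circuits requires multi-node, multi-GPU systems of the kind NWQ-Sim targets.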
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThis work comprehensively analyzes the overhead when implementing fault-checking algorithms for sparse preconditioned conjugate gradient (PCG) solvers on many-core and GPU-accelerated systems. Our objective is to selectively utilize GPUs for duplicate calculations based on the numerical properties of the sparse matrices to enhance the reliability and performance of linear system solutions. Enabling the ability to rely on the relatively underutilized CPU for fault detection improves scientific applications' ability to efficiently manage their resources on large-scale systems. By leveraging existing fault-checking techniques, we validate calculations and address potential numerical instabilities or precision-related issues during iterative solving. Through extensive experimentation on real hardware, we demonstrate the effectiveness of the conjugate gradient algorithm in providing accurate and reliable solutions for large linear systems.
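The poster's specific duplicate-calculation scheme is not shown here; as a related, classic illustration, an algorithm-based fault-tolerance (ABFT) check for the matrix-vector products inside CG compares the sum of the computed product against a precomputed column checksum (a sketch with illustrative names, using a dense matrix as a stand-in for the sparse one):

```python
import numpy as np

def matvec_with_check(A, x, col_checksum=None, tol=1e-8):
    """Compute y = A @ x and verify sum(y) against (1^T A) x.
    A corrupted product or matrix fails this O(n) consistency test."""
    if col_checksum is None:
        col_checksum = A.sum(axis=0)  # 1^T A, precomputable once
    y = A @ x
    ok = abs(y.sum() - col_checksum @ x) <= tol * (np.abs(y).sum() + 1.0)
    return y, ok
```

In an iterative solver the checksum is computed once up front, so validation adds only O(n) work per product and can run on an otherwise-underutilized device, mirroring the resource-management idea above.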
Workshop
Quantum Computing
Software Engineering
W
DescriptionQuantum Hamiltonian simulation is one of the most promising applications of quantum computing. Recent experimental results suggest that Hamiltonian-oriented analog quantum simulation would be advantageous over circuit-oriented digital quantum simulation in the NISQ era. We design and implement SimuQ, the first domain-specific language for quantum Hamiltonian simulation that supports pulse-level compilation to heterogeneous analog quantum simulators. Specifically, in SimuQ, front-end users specify the target quantum system with Hamiltonian Modeling Language, and the Hamiltonian-level programmability of analog quantum simulators is specified through a new abstraction called the abstract analog instruction set (AAIS) and programmed in AAIS Specification Language by hardware providers. Through a solver-based compilation, SimuQ generates executable pulse schedules for real devices to simulate the evolution of desired quantum systems, which is demonstrated on superconducting (IBM), neutral-atom (QuEra), and trapped-ion (IonQ) quantum devices.
Birds of a Feather
Middleware and System Software
TP
XO/EX
DescriptionSlurm is an open-source workload manager used on many Top500 systems. It provides a rich set of features, including topology-aware optimized resource allocation, cloud bursting, hierarchical bank accounts with fair-share job prioritization, and many resource limits. The meeting will consist of three parts: the Slurm development team will present details about the newly released version 23.11 and changes in the upcoming version 24.08, describe the Slurm roadmap, and solicit user feedback. Everyone interested in Slurm use and/or development is encouraged to attend.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionSmartNIC availability has rapidly increased in recent years due to wider adoption in the cloud. Leveraging these emerging devices in HPC can provide the infrastructure needed to develop new offloading capabilities that go beyond the traditional packet processing to support HPC optimizations. This BoF aims at building a community to discuss SmartNIC use-cases to accelerate applications, improve storage, enable software-defined infrastructures, address operational aspects of HPC centers and more. It also aims to serve as the state-of-the-union for SmartNICs within the HPC audience, acting as a central hub for sharing information on this emerging technology.
Posters
Research Posters
TP
XO/EX
DescriptionWe present a practical approach for the acceleration of an industrial and scientific application using graphics processing units (GPUs). Our original application is a computational stratigraphy codebase that couples fluid flow and sediment deposition submodels. The application uses domain decomposition and a halo exchange to split the workload among multiple workers in a distributed system. Our methodology abstracts and conserves the host data structures while re-writing computational elements in the GPU programming language CUDA. Utilizing high performance GPU machines in the Azure cloud, we show a minimum 90x speedup compared to a high-end CPU based cluster. In this poster, we give a brief description of the original algorithm, followed by a discussion of required software changes and additions. Although this case study focuses on a specific example, we hope this approach inspires similar efforts in other applications.
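The CUDA port itself is not shown, but the domain-decomposition pattern the abstract describes, splitting a field among workers and exchanging one-cell halos before each stencil update, can be sketched serially (function names are illustrative; plain array copies stand in for the MPI messages of the real distributed code):

```python
import numpy as np

def split_with_halos(field, n_workers):
    """Split a 1D field into per-worker chunks, each padded with a
    one-cell halo on both sides (zeros model the domain boundary)."""
    return [np.concatenate(([0.0], c, [0.0]))
            for c in np.array_split(field, n_workers)]

def halo_exchange(chunks):
    """Fill each chunk's halo cells from its neighbors' boundary cells.
    In the distributed code these copies are send/receive message pairs."""
    for i, c in enumerate(chunks):
        if i > 0:
            c[0] = chunks[i - 1][-2]   # left halo <- left neighbor's last cell
        if i + 1 < len(chunks):
            c[-1] = chunks[i + 1][1]   # right halo <- right neighbor's first cell

def smoothing_step(chunks):
    """Exchange halos, then update every interior cell from its neighbors."""
    halo_exchange(chunks)
    return [np.concatenate(([0.0],
                            0.5 * c[1:-1] + 0.25 * (c[:-2] + c[2:]),
                            [0.0]))
            for c in chunks]
```

Because each worker only needs its neighbors' edge cells, the exchange volume stays constant as the per-worker domain grows, which is what makes this decomposition scale.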
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionEffective software testing plays a critical role in guaranteeing the performance, correctness, and reproducibility of applications and software. When it comes to testing high-performance computing (HPC) software and applications, unique requirements arise due to factors such as massive parallelism, concurrency and heterogeneity, the scale of target platforms, lack of oracles, and application-specific verification and validation techniques. In this BoF session, we aim to foster insightful discussions among a panel of expert speakers and the audience, focusing on methodologies and challenges in HPC software testing, and deepen our understanding in this crucial part of HPC software development.
Tutorial
Algorithms
Quantum Computing
TUT
DescriptionOptimization problems are among the most promising quantum applications that combine the use of quantum and classical processors. This tutorial aims to teach participants to solve optimization problems using two distinct quantum computing paradigms: (i) hybrid quantum-classical algorithms on gate-based systems, and (ii) analog Hamiltonian simulation on a neutral-atom device. In gate-based systems, a parameterized quantum circuit is designed, used to compute the value of an objective function, and iteratively optimized via classical optimization algorithms. Such hybrid algorithms rely on rapid iterative computations by quantum and classical processors, requiring regular sharing of data between them. The analog Hamiltonian simulation device comprises an array of two-level neutral atoms, each with a ground state and an excited Rydberg state. The atoms can be arranged in any 1D or 2D geometry and are initially prepared in the ground state. The parameters of the driving Hamiltonian are then adiabatically varied, and the state of each atom is measured, representing the final solution. The tutorial will provide an introduction to quantum computing and demonstrate the aforementioned solutions in hands-on sessions with free cloud access to quantum hardware.
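The structure of the hybrid loop described above can be sketched with a classical stand-in for the quantum objective (the function, starting point, and learning rate are all hypothetical; on real hardware each objective evaluation is a batch of circuit executions):

```python
import math

def quantum_objective(theta: float) -> float:
    """Stand-in for an expectation value measured on a quantum processor;
    this classical surrogate has its minimum (0) at theta = pi/2."""
    return 1.0 - math.sin(theta) ** 2

def hybrid_optimize(objective, theta=0.3, lr=0.4, iters=100, eps=1e-4):
    """Classical outer loop: query the objective, estimate the gradient by
    finite differences, and update the circuit parameter."""
    for _ in range(iters):
        grad = (objective(theta + eps) - objective(theta - eps)) / (2 * eps)
        theta -= lr * grad
    return theta, objective(theta)

theta, value = hybrid_optimize(quantum_objective)
```

Each iteration is a round trip between the two processors, which is why the tutorial stresses rapid, regular data sharing between them.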
Awards
TP
DescriptionParallel programming started around 1970, so as a discipline, it is now more than 50 years old. What lessons have we learned during that time about parallel programming? What problems remain to be solved? What can young researchers learn from the successes and failures of our discipline? This talk presents a personal point of view about these and other questions regarding the state of parallel programming.
Posters
Research Posters
TP
XO/EX
DescriptionHPC software must offer tool support for productive programming of the scientific applications run on supercomputers, especially for the sophisticated activities of performance analysis and auto-tuning. Given the emergence of performance-portable programming libraries with abstractions for parallelism, new tool support is needed to handle these library abstractions over multiple backends; addressing this will support the software sustainability of performance-portable libraries. Considering Kokkos, a representative C++-based performance-portable library, we focus on (1) a community-driven tool-connector subset of Kokkos Tools that offers capability for such sophisticated activities, along with (2) an associated tool infrastructure, including common interfaces and utilities, that enables such tools. Our showcase demonstrates that this part of Kokkos Tools is capable, lightweight, easy to use, and a viable alternative to tools supporting specific low-level programming models.
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
DescriptionDedicated accelerator hardware has become essential for processing AI-based workloads, leading to the rise of novel accelerator architectures. Furthermore, fundamental differences in memory architecture and parallelism have made these accelerators targets for scientific computing. The sequence alignment problem is fundamental in bioinformatics; we have implemented the X-Drop algorithm, a heuristic method for pairwise alignment that reduces the search space, on the Graphcore Intelligence Processing Unit (IPU) accelerator. The X-Drop algorithm has an irregular computational pattern, which makes it difficult to accelerate due to load imbalance.
Here, we introduce a graph-based partitioning and queue-based batch system to improve load balancing. Our implementation achieves 10x speedup over a state-of-the-art GPU implementation and up to 4.65x compared to CPU. In addition, we introduce a memory-restricted X-Drop algorithm that reduces memory footprint by 55x and efficiently uses the IPU's limited low-latency SRAM. This optimization further improves the strong scaling performance by 3.6x.
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionSpack is a package manager for scientific computing, with a rapidly growing open-source community. Spack has over 1000 contributors from academia, industry, and laboratories across the world, and is used to manage software releases for the U.S. Exascale Computing Project. Spack developers will give updates on the community, new features, and the roadmap for future development. We will poll the audience to gather valuable information on how Spack is being used, and we will open the floor for questions. All are invited to provide feedback, request features, and discuss future directions. Help us make installing HPC software simple!
Workshop
State of the Practice
W
DescriptionInternal waves below the ocean's surface significantly contribute to heat transfer in the global climate system, and are often studied with laboratory experiments like the Stratified Inclined Duct (SID). These experiments generate large amounts of data, incurring expensive storage costs. This work is an effort to reduce the volume of data by developing a machine learning model that can accurately classify and predict mixing events in real time, enabling researchers to record and save particular moments of an experiment.
The model, a convolutional neural network, is trained on 107 experimental shadowgraph videos and achieves nearly 97% accuracy in classifying turbulence on roughly 7,000 shadowgraph frames. Preliminary work indicates promising results for predictive spatiotemporal modeling, as well as the implementation of the curvelet transform in pre-processing to reduce the model size and improve training times.
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionWe assess fundamental performance, power, and energy characteristics of the SPEChpc 2021 benchmark suite on two clusters based on Intel Ice Lake and Sapphire Rapids CPUs using MPI only. We use memory bandwidth, data volume, and scalability metrics in order to categorize the benchmarks and pinpoint relevant performance and scalability bottlenecks on the node and cluster levels. Common patterns such as memory bandwidth limitation, dominating communication and synchronization overhead, MPI serialization, superlinear scaling, and alignment issues could be identified, in isolation or in combination, showing that SPEChpc 2021 is representative of many HPC workloads. Power dissipation and energy measurements indicate that the modern Intel server CPUs have such a high idle power level that race-to-idle is the paramount strategy for energy to solution and energy-delay product minimization. On the chip level, only memory-bound code shows a clear advantage of Sapphire Rapids compared to Ice Lake in terms of energy.
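The race-to-idle conclusion follows from simple energy arithmetic: idle power is paid for the entire runtime, so on chips with high idle power a faster run at higher dynamic power usually wins. A sketch with hypothetical numbers (the wattages and runtimes below are illustrative, not measurements from the paper):

```python
def energy_to_solution(p_idle: float, p_dynamic: float, runtime: float) -> float:
    """Energy (J) for a job drawing p_idle + p_dynamic watts for `runtime` seconds."""
    return (p_idle + p_dynamic) * runtime

# Fast, power-hungry run vs. a slower, "efficient" one on a high-idle-power chip
fast = energy_to_solution(p_idle=200.0, p_dynamic=120.0, runtime=100.0)  # 32000 J
slow = energy_to_solution(p_idle=200.0, p_dynamic=60.0, runtime=180.0)   # 46800 J
```

With `p_idle` near zero the comparison can flip, which is why the abstract ties race-to-idle specifically to the high idle power levels of modern Intel server CPUs.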
Workshop
Accelerators
Compilers
Heterogeneous Computing
Performance Optimization
Programming Frameworks and System Software
Runtime Systems
W
DescriptionProgramming models for general-purpose GPU (GPGPU) computing include grid and non-grid languages. Grid languages like CUDA and HIP map directly to the GPU hardware and can extract high performance from applications. However, this low-level programming approach makes them more difficult to program than non-grid languages such as C, C++, and Fortran with OpenMP target offload. Furthermore, grid languages often have more portability issues than non-grid languages. On the other hand, code generated from non-grid languages using automatic compiler and runtime techniques often incurs higher overhead when generating GPU kernels.
This presentation discusses compiler and runtime techniques to generate specialized, high-performance kernels for OpenMP target regions in certain common situations. We outline conditions under which specialized kernels are generated for OpenMP target regions, both with and without reduction clauses. Experimental results on AMD GPUs indicate that a large percentage of OpenMP target regions are amenable to specialization and consequent performance improvement.
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Workshop
Software Engineering
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionVisualization of dynamic processes in scientific high-performance computing is an immensely data intensive endeavor. Application codes have recently demonstrated scaling to full-size Exascale machines, and generating high-quality data for visualization is consequently on the machine-scale, easily spanning 100s of TBytes of input to generate a single video frame. In situ visualization, the technique to consume the many-node decomposed data in-memory, as exposed by applications, is the dominant workflow. Although in situ visualization has achieved tremendous progress in the last decade, scaling to system-size together with the application codes that produce its data, there is one important question that we cannot skip: is what we produce insightful and inspiring?
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
DescriptionIn this paper we demonstrate direct data streaming from instruments and detectors at a large-scale experimental facility to a supercomputer for real-time data processing and feedback. Streaming data to supercomputers introduces the potential for novel scientific applications and workflow models, including the ability to provide real-time feedback from very large datasets during an experiment and the integration of real-time ML training and inference at scale. We discuss a successful demonstration for real-time processing of data from the Advanced Photon Source (APS) on the Polaris supercomputer using an EPICS-based streaming framework. We describe the capabilities of the streaming framework itself, and outline the architecture that allows us to process experimentally derived data on a supercomputer without file-based data transfers. We present throughput measurements that are indicative of system performance capable of sustaining the expected data production rates of the facility, as well as discuss some outstanding challenges and our future directions.
Exhibitor Forum
Strong Scaling of State-of-the-Art LLM Inference with Groq Software-Scheduled Deterministic Networks
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
DescriptionIn this talk, we will demonstrate Groq’s approach to synchronous, software-scheduled AI accelerator networks and showcase how we use it to unlock state-of-the-art performance and latency on Large Language Models (LLMs), including Llama-2 70B, scaled to over 500 GroqChip™ Language Processors™.
Traditional HPC systems and data centers use dynamic time- and space-sharing, where platforms dynamically coordinate the use of compute, memory, and network resources among threads or workloads. This is a natural solution for arbitrary compute workloads, whose unpredictability makes such mediation a prerequisite. Unfortunately, this results in compounding inefficiency and complexity at all layers of the stack: processor architecture, memory, networking, and more. Modern AI workloads, however, have a predictable structure allowing for efficient static scheduling of compute and network resources.
Groq is turning this theory into practice by making components deterministic from the ground up to stand up large-scale synchronous compute platforms and to empower software to make more orchestration decisions statically. Unlike traditional networks, where packets can collide and congestion can develop, all traffic in the Groq network is completely pre-planned by Groq™ Compiler with zero network collisions. This maximizes not only the utilization of the links but also the number of minimal paths that can be taken between chips. Deterministic compute and static orchestration do introduce new software and hardware challenges and co-optimization opportunities, which we will discuss in this talk.
Overcoming these challenges unlocks the opportunity for greater compute and power efficiency on AI workloads. Groq’s software-scheduled networks offer key advantages including: (1) a global network load balancing via compiler-driven network traffic scheduling; (2) high network bandwidth efficiency via low control overhead; and (3) low latency chip-to-chip communication via a router-less, handshake-less direct topology. We showcase these advantages by demonstrating state-of-the-art performance on LLM models, including Llama-2 70B, scaled to over 500 Language Processors.
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
DescriptionThe advent of High Performance Computing has led to the adoption of Convolutional Neural Networks (CNNs) in safety-critical applications such as autonomous vehicles. However, CNNs are vulnerable to DRAM errors corrupting their parameters, thereby degrading their accuracy. Existing techniques for protecting CNNs from DRAM errors are either expensive or fail to protect from large-granularity, multi-bit errors, which occur commonly in DRAMs.
We propose a software-implemented coding scheme, Structural Coding (SC), for protecting CNNs from large-granularity memory errors. SC achieves a three-orders-of-magnitude reduction in the Silent Data Corruption (SDC) rate of CNNs compared to no protection. Its average error correction coverage is also higher than that of other software techniques for protecting CNNs from memory faults. Further, its average performance, memory, and energy overheads are 3%, 15.71%, and 4.38%, respectively, much lower than those of other software protection techniques.
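The general idea behind software-implemented coding of CNN parameters can be illustrated with a much simpler scheme than SC's actual construction: store one XOR parity word per group of weights, so a multi-bit error confined to one known-bad word can be reconstructed from the survivors. This is only a conceptual sketch, not Structural Coding itself.

```python
# Illustrative sketch (not SC's actual code): protect a group of weights,
# stored as integers, with one XOR parity word. A large-granularity error
# confined to one word at a known position is fully recoverable.
from functools import reduce

def add_parity(group):
    """Append the XOR of all words as a parity word."""
    return group + [reduce(lambda a, b: a ^ b, group)]

def repair(group_with_parity, bad_index):
    """Reconstruct the word at bad_index from all the other words."""
    others = [w for i, w in enumerate(group_with_parity) if i != bad_index]
    return reduce(lambda a, b: a ^ b, others)

weights = [0b1010, 0b0110, 0b1111, 0b0001]
protected = add_parity(weights)

protected[2] = 0b0000  # multi-bit corruption of one whole word
assert repair(protected, 2) == 0b1111  # original value recovered
```

Real schemes must also locate the error and keep encode/decode costs low, which is where the overhead numbers above come in.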
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionServerless computing platforms use containers to create custom, isolated execution environments. The time to serve a function in the Function-as-a-Service (FaaS) paradigm therefore depends on the time to load the necessary container. FaaS platforms try to avoid "cold starts" by pre-loading containers to serve workloads. We focus on the problem of rapidly loading Python environments in the Globus Compute (previously funcX) platform. Globus Compute is unique in that it is deployed on HPC systems and thus suffers the costs of shared file systems. We evaluate containers and microVMs (Docker and Firecracker) and propose a new approach using lightweight Python unikernels. We show considerable speedup in cold-start times using unikernels.
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionWe have seen a substantial increase in the use of Oracle Cloud Infrastructure (OCI) for training large language models (LLMs), as more and more startups and established companies seek to gain an edge with increasingly large and more accurate models. These models share a need for efficient GPU cluster computing – that is, the ability to scale training to hundreds or thousands of GPUs for an extended period of time while maintaining performance and efficiency. Performance is crucial both at the level of the individual GPU and in scaling efficiently across the network. Scaling these large training models can be very complex and difficult to tune, requiring a cost-effective infrastructure that can provide availability, resiliency, and performance at scale.
In this talk, we will discuss our approach to supporting the needs of these large language models, building on years of experience running HPC on bare-metal instances with a very low-latency network. We will present Oracle’s “SuperCluster”, which scales to thousands or tens of thousands of NVIDIA A100 and H100 GPUs with low latency and inter-node bandwidth of up to 3,200 Gbps. This time-tested bare-metal instance platform is combined with intelligent job placement, locality awareness, and additional tuning to make ML work at the largest scales. Oracle’s SuperClusters have been rigorously tested on well-known public benchmarks such as Megatron, where they reach very high throughput, as well as on proprietary cutting-edge models commonly used in machine learning. We will show examples of use from various companies and discuss the challenges that were addressed to run these models at such scale. We will finish with a discussion of some of the open research problems that still need to be addressed in this area.
Birds of a Feather
Cloud Computing
TP
XO/EX
DescriptionThe SuperCompCloud series of panels, workshops, and BoFs aims to bring together experts and practitioners from academia, national labs, and industry to discuss technologies, use cases, and best practices, and to share a vision and direction for leveraging high-performance, extreme-scale computing and on-demand cloud ecosystems in light of increasing software complexity, narrowing on-premise infrastructure options, and cloud-only architectures. The session will continue the discussion of the latest challenges and plans, with interactive polling to engage the community at a level of interactivity distinct from the workshop series.
Panel
Codesign
Hardware Technologies
TP
DescriptionSuperconducting digital computing (SDC) has significant potential to preserve performance scaling for a wide range of HPC applications due to its tens-to-hundreds-of-GHz operating frequencies coupled with low dynamic energy. The technology's current limitations, such as device density, EDA tools, data movement, and cooling, are active areas of research with promising directions. This, combined with studies that have designed SDC accelerators for compute-intensive applications, hints that SDC may play an important role in HPC, though significant work remains to show the best strategy for integration with HPC systems and on-sensor processing. In this panel, we invite experts from the superconducting community to discuss SDC’s ecosystem, how SDC may be used in practice in future systems, and the positive impact SDC can have on the performance and efficiency of key HPC applications.
Workshop
W
DescriptionContainers offer an array of advantages that benefit research reproducibility and portability. As container tools mature, container security improves, and high-performance computing (HPC) and cloud system tools converge, supercomputing centers are increasingly integrating containers into their workflows. Despite this, most research into containers remains focused on cloud environments.
We consider an adaptive containerization architecture, in which each component chosen represents the tool best adapted to the given system and site requirements, with a focus on accelerating the deployment of applications and workflows on HPC systems using containers. To this end, we discuss the HPC-specific requirements for container tools and analyze the entire containerization stack, including container engines and registries, in depth. Finally, we consider various orchestrator and HPC workload manager integration scenarios, including Workload Manager (WLM) in Kubernetes, Kubernetes in WLM, and bridged scenarios. We present a proof-of-concept approach to a Kubernetes Agent in a WLM allocation.
Workshop
Accelerators
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionThis talk will highlight distributed and GPU computing using the Julia ecosystem. The Julia language proposes an integrated, end-to-end co-design model as an LLVM front-end for science, closing the gap between high-productivity languages and the performance of traditional compiled languages on extremely heterogeneous systems. This talk will demonstrate how to develop a large-scale HPC application: from low-level hardware accelerator optimizations (using LLVM) to a task-based distributed parallel execution framework (based on Distributed.jl).
Workshop
Accelerators
Applications
Distributed Computing
Compilers
Data Analysis, Visualization, and Storage
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionThis talk is about Legate, a framework for scalable and composable distributed software. Legate enables cuNumeric, a GPU accelerated distributed NumPy library, to grow rapidly by providing high-productivity abstractions on top of a scalable runtime system. In this talk, we will explain how Legate enabled the productive development of cuNumeric features and talk about our vision for an ecosystem of composable Legate libraries.
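The high-productivity abstraction the talk describes is that cuNumeric presents NumPy's own API, so porting is (per the cuNumeric documentation) a one-line import swap. The sketch below runs on plain NumPy so it is self-contained; the commented import is the assumed distributed drop-in.

```python
# With cuNumeric/Legate one would write `import cunumeric as np` instead,
# and the same array operations would be partitioned across GPUs.
import numpy as np

def jacobi_step(grid):
    """One Jacobi relaxation sweep; a typical array-level kernel that a
    distributed NumPy implementation can partition automatically."""
    interior = 0.25 * (grid[:-2, 1:-1] + grid[2:, 1:-1]
                       + grid[1:-1, :-2] + grid[1:-1, 2:])
    new = grid.copy()
    new[1:-1, 1:-1] = interior
    return new

g = np.zeros((5, 5))
g[0, :] = 1.0          # hot boundary
g = jacobi_step(g)
assert abs(g[1, 2] - 0.25) < 1e-12  # interior cell averages its neighbors
```

Because the code contains no explicit communication or partitioning, the same source scales from a laptop to a cluster under the Legate runtime.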
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionThe speakers will respond to Q&A for the technologies they presented in the session.
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionOpenSHMEM was introduced more than a decade ago to standardize SHMEM, a library-based communications interface that was originally developed as a proprietary application interface by Cray for their T3D systems. An alternative to MPI that implements a Partitioned Global Address Space (PGAS) programming model, OpenSHMEM combines the look and feel of MPI with the benefits of PGAS programming.
Workshop
Applications
Distributed Computing
Compilers
Data Analysis, Visualization, and Storage
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionI will describe how Swift/T, an automatically parallel scripting language, serves as an alternative to MPI by creating a higher-level programming model for workflow-like applications. Swift/T essentially translates a functional description of a workflow into an MPI program runnable at the largest scales through the use of dataflow analysis and an MPI-based task distributor. I will also highlight interesting use cases such as optimizing deep learning models for cancer problems and fitting parameters to the observed behavior of contagious diseases like COVID-19. These examples will show how MPI can be used as an implementation layer for a completely different programming model, and as a complementary model via communicator subdivision and handoff.
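The dataflow translation described above can be illustrated with a toy executor: tasks declare their inputs and outputs, and a task becomes runnable only once all of its inputs exist. This is a serial sketch of the idea only; Swift/T's actual implementation distributes ready tasks over MPI.

```python
# Minimal dataflow executor: run each task once its inputs are available.
def run_dataflow(tasks, initial):
    """tasks: list of (inputs, output, fn); returns the final data store."""
    data = dict(initial)
    pending = list(tasks)
    while pending:
        ready = [t for t in pending if all(i in data for i in t[0])]
        assert ready, "cyclic or unsatisfiable dependencies"
        for task in ready:           # these could run in parallel
            inputs, output, fn = task
            data[output] = fn(*(data[i] for i in inputs))
            pending.remove(task)
    return data

result = run_dataflow(
    [(("x", "y"), "sum", lambda x, y: x + y),   # waits for x and y
     ((), "x", lambda: 2),
     ((), "y", lambda: 3)],
    {})
assert result["sum"] == 5
```

Note that task order in the list does not matter: readiness is derived from the data dependencies, which is what lets a dataflow runtime extract parallelism automatically.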
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionTackling climate change by reducing and eventually eliminating carbon emissions is a significant milestone on the path toward establishing an environmentally sustainable society. As we transition into the exascale era, marked by an increasing demand and scale of HPC resources, the HPC community must embrace the challenge of reducing carbon emissions from designing and operating modern HPC systems. We describe challenges and highlight different opportunities that can aid HPC sites in reducing the carbon footprint of modern HPC systems.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
DescriptionProviding a sustainable path for supercomputing is a pressing topic for our community, industry, and governments. Supercomputing has an insatiable appetite for computational cycles, while we face increasing challenges in delivering performance-per-watt advances with silicon technology trends, all within the context of climate change, the drive toward net zero, and economic pressures driven by geopolitical challenges.
Improving the sustainability of supercomputing presents many opportunities when the end-to-end cycle is considered: from the design of computational circuits and systems, to the power and cooling used to operate them, to the suite of software tools used to administer, maintain, and raise the operational efficiency of HPC systems. All elements of the system must be considered, from compute nodes and interconnects to the I/O and storage components.
This workshop will gather users, researchers, and hardware and software developers to address the opportunities and challenges of sustainability in the supercomputing context.
Workshop
Energy Efficiency
Green Computing
Sustainability
W
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Linear Algebra
Message Passing
Programming Frameworks and System Software
Task Parallelism
Tensors
W
DescriptionSparse symmetric positive definite systems of equations are ubiquitous in scientific workloads and applications. Parallel sparse Cholesky factorization is the method of choice for solving such linear systems. Therefore, the development of parallel sparse Cholesky codes that can efficiently run on today’s large-scale heterogeneous distributed-memory platforms is of vital importance. Modern supercomputers offer nodes that contain a mix of CPUs and GPUs. To fully utilize the computing power of these nodes, scientific codes must be adapted to offload expensive computations to GPUs.
We present symPACK, a GPU-capable parallel sparse Cholesky solver that uses one-sided communication primitives and remote procedure calls provided by the UPC++ library. We also utilize the UPC++ "memory kinds" feature to enable efficient communication of GPU-resident data. We show that on a number of large problems, symPACK outperforms comparable state-of-the-art GPU-capable Cholesky factorization codes by up to 14x on the NERSC Perlmutter supercomputer.
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
TP
DescriptionEnergy-efficient computing uses power management techniques such as frequency scaling to save energy. Implementing energy-efficient techniques on large-scale computing systems is challenging. While most modern architectures, including GPUs, are capable of frequency scaling, these features are often not available on large systems.
We propose SYnergy, a novel energy-efficient approach that spans languages, compilers, runtimes, and job schedulers to achieve unprecedented fine-grained energy savings on large-scale heterogeneous clusters. SYnergy defines an extension to the SYCL programming model that allows programmers to define a specific energy goal for each kernel. Through compiler integration and a machine learning model, each kernel is statically optimized for the specific target. The methodology is inherently portable and has been evaluated on both NVIDIA and AMD GPUs. Experimental results show unprecedented improvements in energy and energy-related metrics on real-world applications, as well as scalable energy savings on a 64-GPU cluster.
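The per-kernel energy-goal idea can be sketched as a frequency-selection problem: given a model of runtime and power versus GPU frequency, choose the frequency that minimizes energy while respecting a user-specified slowdown bound. The model numbers below are invented for illustration and are not SYnergy's actual ML model or SYCL extension.

```python
# Sketch: pick a per-kernel GPU frequency that minimizes predicted energy
# (runtime * power) subject to a maximum allowed slowdown.

def pick_frequency(model, max_slowdown):
    """model: {freq_mhz: (runtime_s, power_w)}; returns the best frequency."""
    base_time = min(t for t, _ in model.values())   # fastest configuration
    feasible = {f: (t, p) for f, (t, p) in model.items()
                if t <= base_time * max_slowdown}
    return min(feasible, key=lambda f: feasible[f][0] * feasible[f][1])

# hypothetical per-kernel model (frequency -> runtime, power)
kernel_model = {1410: (1.00, 300.0),   # fastest, most power hungry
                1200: (1.08, 230.0),
                 900: (1.40, 160.0)}

# allow up to 10% slowdown: 900 MHz (40% slower) is excluded,
# and 1200 MHz wins on energy (248.4 J vs 300 J)
assert pick_frequency(kernel_model, 1.10) == 1200
```

In the real system this decision is made statically per kernel by the compiler and ML model, so no runtime search is needed.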
Posters
Research Posters
TP
XO/EX
DescriptionHPC systems are getting ever more powerful, but this comes at the price of increasing system complexity. In order to use HPC systems efficiently, one has to be aware of their architectural details, in particular details of their hardware topology, which is increasingly affected by dynamic runtime settings.
sys-sage is a novel approach providing an infrastructure for storage, correlation, and provision of HW-related system information. It uses information from various well-known sources as well as use-case-specific solutions, and correlates the particular pieces together to provide a full view of a system. The novelty of our approach lies in the ability to capture dynamic environments as well as systems’ complexities, and in enabling greater flexibility in its usage.
sys-sage is publicly available and can be used by many applications. It integrates widely used approaches, such as hwloc or dynamic counter information, and offers user-integration of all other user-specific data sources.
Birds of a Feather
Post-Moore Computing
TP
XO/EX
DescriptionAs Quantum Computing, QC, systems mature and make their way out of laboratories into HPC centers as accelerators, we also must rethink the role of system software. We require stable software environments targeted at broad, non-physics end-user communities that are directly integrated into HPC system software as well as HPC schedulers. In this BoF, we will highlight recent developments relating to first QC system installations in HPC centers and discuss open questions and challenges. We aim to establish an international discussion on this emerging, critical issue and to help clear the road for the next steps towards efficient quantum acceleration.
Paper
Artificial Intelligence/Machine Learning
TP
DescriptionGraph Neural Networks (GNNs) are rapidly gaining popularity since they hold state-of-the-art performance for various critical graph-related tasks. While quantization is a primary approach to accelerating GNN computation, quantized training faces remarkable challenges. We observe that current quantized GNN training systems often experience longer training time than their full-precision counterparts for two reasons: (i) addressing the accuracy challenge introduces too much overhead, and (ii) the optimization opportunities exposed by quantization are not well leveraged. This paper introduces Tango, which rethinks the quantization challenges and opportunities of graph neural network training on GPUs, with the following contributions: first, we introduce lightweight rules to meet the accuracy requirements of quantized GNN training; second, we design and implement quantization-aware primitives and inter-primitive optimizations to accelerate GNN training; third, we integrate Tango with the mainstream Deep Graph Library (DGL) and demonstrate that Tango outperforms the state of the art across all evaluated GNN models and datasets.
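For background, the primitive that quantized training systems build on is the quantize/dequantize pair itself; the accuracy question is how much information the round trip loses. The symmetric int8 scheme below is a generic illustration, not Tango's rules.

```python
# Symmetric fixed-point quantization: map floats to [-qmax, qmax] integers
# with a per-tensor scale, then reconstruct. Round-trip error is at most
# about half the scale per element.

def quantize(values, bits=8):
    qmax = 2 ** (bits - 1) - 1                 # 127 for int8
    scale = max(abs(v) for v in values) / qmax or 1.0   # avoid scale == 0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

vals = [0.5, -1.0, 0.25]
q, s = quantize(vals)
restored = dequantize(q, s)
assert all(abs(a - b) < s for a, b in zip(vals, restored))
```

Training systems must apply such conversions on every primitive's inputs and outputs, which is exactly where the overhead the paper targets comes from.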
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
DescriptionWe investigate a new task-based implementation of the polar decomposition on massively parallel systems augmented with multiple GPUs using SLATE. We implement the iterative QR Dynamically-Weighted Halley (QDWH) algorithm, whose building blocks mainly consist of compute-bound matrix operations, allowing for high levels of parallelism to be exploited on various hardware architectures, such as NVIDIA, AMD, and Intel GPU-based systems. To achieve both performance and portability, we implement our QDWH-based polar decomposition in the SLATE library, which uses efficient techniques in dense linear algebra, such as 2D block cyclic data distribution and communication-avoiding algorithms, as well as modern parallel programming approaches, such as dynamic scheduling and communication overlapping, and uses OpenMP tasks to track data dependencies.
We report numerical accuracy and performance results. The benchmarking campaign reveals up to an 18-fold performance speedup of the GPU accelerated implementation compared to the existing state-of-the-art implementation for the polar decomposition.
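QDWH is a dynamically weighted acceleration of the fixed-weight Halley iteration, which is simple enough to sketch directly: iterating X ← X(3I + XᵀX)(I + 3XᵀX)⁻¹ from a scaled A drives the singular values to 1, converging to the orthogonal polar factor U of A = UH. The small NumPy example below shows the unweighted variant only, not the talk's QDWH/SLATE implementation.

```python
# Fixed-weight Halley iteration for the polar decomposition A = U H.
import numpy as np

def polar_halley(A, iters=30):
    X = A / np.linalg.norm(A, 2)          # scale so singular values <= 1
    I = np.eye(A.shape[1])
    for _ in range(iters):
        G = X.T @ X
        X = X @ (3 * I + G) @ np.linalg.inv(I + 3 * G)
    return X                               # approximately orthogonal

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
U = polar_halley(A)
H = U.T @ A                                # the symmetric factor

assert np.allclose(U.T @ U, np.eye(4), atol=1e-8)   # U is orthogonal
assert np.allclose(H, H.T, atol=1e-8)                # H is symmetric
```

QDWH replaces the fixed weights (3, 1, 3) with weights recomputed each step from an estimate of the smallest singular value, which is what keeps the iteration count low (and the work dominated by compute-bound QR kernels) on ill-conditioned matrices.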
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionMany scientific applications are expressed as high-throughput workflows that consist of large graphs of data assets and tasks to be executed on large parallel and distributed systems. A challenge in executing these workflows is managing data: both datasets and software must be efficiently distributed to cluster nodes; intermediate data must be conveyed between tasks; output data must be delivered to its destination. Scaling problems result when these actions are performed in an uncoordinated manner on a shared filesystem. To address this problem, we introduce TaskVine: a system for exploiting the aggregate local storage and network capacity of a large cluster. TaskVine tracks the lifetime of data in a workflow, from archival sources to final outputs, making use of local storage to distribute and re-use data. We describe the architecture and novel capabilities of TaskVine, and demonstrate its use with applications in genomics, high energy physics, molecular dynamics, and machine learning.
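The local-storage re-use idea at TaskVine's core can be sketched as a content-keyed cache on each worker: a dataset fetched once is served from local disk for every later task that names the same content, instead of every task hitting the shared filesystem. The class and names below are illustrative, not TaskVine's API.

```python
# Sketch: a worker-local, content-addressed store. Repeated consumers of
# the same input trigger only one remote fetch.
import hashlib

class LocalStore:
    def __init__(self):
        self.blobs = {}      # content hash -> bytes held on local storage
        self.fetches = 0     # how often we had to go to the remote source

    def get(self, content_hash, fetch_remote):
        if content_hash not in self.blobs:
            self.fetches += 1
            self.blobs[content_hash] = fetch_remote()
        return self.blobs[content_hash]

data = b"reference genome"
key = hashlib.sha256(data).hexdigest()
store = LocalStore()

# two tasks on the same node consume the same input: one remote fetch
for _ in range(2):
    assert store.get(key, lambda: data) == data
assert store.fetches == 1
```

Keying by content rather than by filename is what makes re-use safe across tasks and workflow runs.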
Birds of a Feather
Education
TP
XO/EX
DescriptionThis BoF will be in the form of a panel consisting of representatives from the industry, national labs, and academia with a background in HPC. The panel will share advice on different career options in HPC, and their experiences in their respective career trajectories. The primary audience for this event is current, preferably ABD, graduate students. The format will include a brief introduction by each speaker, followed by a moderated discussion based on a set of previously submitted questions and ending with further questions from the audience.
Workshop
Education
Heterogeneous Computing
Reproducibility
State of the Practice
W
DescriptionIn this paper, we describe our experience teaching Heterogeneous and Parallel Computing with Google Colab and Raspberry Pi clusters in a senior elective course in Spring 2023. After introductory lectures, while the whole class learned CUDA on Google Colab for five and a half weeks, a team of two students in parallel spearheaded a pilot project, as their undergraduate research project, to build, configure, and test a cluster of four Raspberry Pis. The rest of the class then followed suit, building their own clusters in teams using the tutorials developed through the pilot project. Thanks to these clusters, over the next seven weeks the class went on to learn OpenMP and MPI at various scales. Students’ performance on the labs and assignments, their end-of-semester evaluations, and three anonymous surveys were collected as data to produce an evaluation of the course, which is presented at the end of the paper.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionIncreasing performance in data workflows can cause non-deterministic communication. Non-determinism can seriously affect software correctness and compromise reproducibility in scientific discovery. We design and implement tutorial modules to demonstrate the impact of non-determinism in data science workflows. We use ANACIN-X, a framework of test cases and tools for analytics and visualization. By completing our modules, students, researchers, and data science professionals will understand non-determinism, how it affects their applications, how to quantify it, and how to identify its root sources.
Posters
Research Posters
TP
XO/EX
DescriptionUmpire, a data and memory management API created at LLNL, provides memory pools that enable less expensive ways to allocate large amounts of memory in HPC environments. Memory pools commonly contain both allocations that persist for only a portion of the program (temporary) and those that persist for the entire program (permanent). However, too great a mix of both allocation types can fragment the pool and cause it to perform poorly. The Umpire team created a tool that uses a machine learning model to perform temporal classification and categorize allocations as either temporary or permanent. We conducted experiments using trace files from two LLNL applications to study how much memory can be saved when those allocations are separated into distinct pools. We found that our ML tool accurately classifies memory allocations and that separating these allocation types into distinct pools reduces overall memory usage significantly (up to 29.5%).
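The pooling decision the tool automates can be sketched with a simple stand-in for the ML model: classify each allocation by the fraction of the program's lifetime it survives, and route the two classes to separate pools so short-lived blocks cannot fragment the long-lived pool. The threshold and trace records below are illustrative, not Umpire's classifier.

```python
# Sketch: separate temporary and permanent allocations into distinct pools
# based on observed lifetime (a stand-in for the poster's ML classifier).

def classify(alloc_start, alloc_end, program_end, threshold=0.9):
    """An allocation alive for >= threshold of the run is 'permanent'."""
    lifetime = (alloc_end - alloc_start) / program_end
    return "permanent" if lifetime >= threshold else "temporary"

# hypothetical allocation trace: (name, alloc time, free time)
trace = [("weights", 0, 100), ("scratch", 10, 12), ("halo", 40, 45)]

pools = {"permanent": [], "temporary": []}
for name, start, end in trace:
    pools[classify(start, end, program_end=100)].append(name)

assert pools["permanent"] == ["weights"]
assert pools["temporary"] == ["scratch", "halo"]
```

With the classes separated, the temporary pool can be compacted or released without disturbing permanent allocations, which is where the reported memory savings come from.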
Workshop
Education
State of the Practice
W
DescriptionThe inherent wide distribution, heterogeneity, and dynamism of the current and emerging high-performance computing and software environments increasingly challenge cyberinfrastructure facilitators, trainers, and educators. The challenge is how to support and train the current diverse users and prepare the future educators, researchers, developers, and policymakers to keep pace with the rapidly evolving HPC environments to advance discovery and economic competitiveness for many generations.
The tenth annual full-day workshop on HPC training and education is an ACM SIGHPC Education Chapter coordinated effort, aimed at fostering more collaborations among the practitioners from traditional and emerging fields to explore educational needs in HPC, to develop and deploy HPC training, and to identify new challenges and opportunities for the latest HPC platforms. The workshop will also be a platform for disseminating results and lessons learned in these areas and will be captured in a Special Edition of the Journal of Computational Science Education.
Workshop
Accelerators
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
Workshop
Accelerators
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
DescriptionHeterogeneous node architectures are becoming omnipresent in today’s HPC systems. Extracting the maximum compute capability from such systems, while also maintaining code portability and maintainability, necessitates accelerator programming approaches such as OpenMP offloading, OpenACC, standard C++/Fortran parallelism, SYCL, DPC++, Kokkos, and RAJA. However, the use of these programming approaches remains a research activity, and there are many possible trade-offs between performance, portability, maintainability, and ease of use that must be considered for optimal use of accelerator-based HPC systems.
Toward this end, the workshop will highlight improvements over the state of the art through the accepted papers and talks. In addition, the event will foster discussion with a keynote and panel to draw the community’s attention to key areas that will facilitate the transition to accelerator-based HPC. The workshop aims to showcase all aspects of innovative high-level language features, lessons learned while using directives and abstractions to migrate scientific legacy code, and experiences using novel accelerator architectures, among others.
Posters
Research Posters
TP
XO/EX
DescriptionHigh-performance computing applications running on modern-day supercomputers frequently encounter performance and portability challenges, especially when using multiple programming models, languages, and compilers. In this work, we explore the proposed C++26 standard model for asynchronous parallelism, std::execution (stdexec), together with stdpar, std::mdspan, and other C++23 features, to port and analyze multiple scientific HPC applications on CPUs and GPUs. These applications include sequence-alignment codes from ADEPT and heat transfer from AMReX. Our experiments show near-native performance for our ported implementations on NVIDIA A100 GPUs on the Perlmutter supercomputer. We also study and analyze host-device data transfer traffic patterns and overheads for stdpar and provide helpful insights into application performance. Finally, we discuss some challenges and limitations encountered while porting these applications to C++26 with stdexec, as well as their workarounds, until stdexec is fully integrated and functional in the NVHPC compilers.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionScientific workflows have underpinned some of the most significant discoveries of the past several decades. Workflow management systems provide abstraction and automation which enable a broad range of researchers to easily define sophisticated computational processes and to then execute them efficiently on parallel and distributed computing systems. Workflows are becoming more complex and require more sophisticated workflow management capabilities.
This workshop focuses on the many facets of scientific workflow management systems, ranging from actual execution to service management and the coordination and optimization of data, service, and job dependencies. The workshop covers a broad range of issues in the scientific workflow lifecycle that include: scientific workflow representation; workflow scheduling techniques to optimize the execution on heterogeneous infrastructures; provisioning workflows on infrastructures; workflow engines that deal with failures in the application and infrastructure; and computer science problems related to scientific workflows such as semantic technologies, compiler methods, fault tolerance, etc.
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
DescriptionSupercomputers get faster and more complex every year. MPI, long the dominant model for distributed computation, has adapted by combining with models for intra-node parallelism (e.g. OpenMP, CUDA). These MPI+X hybrids offer performance but demand significant programmer effort to write, debug and tune applications.
Alternatives to MPI+X are worth exploring as programmer productivity becomes a major component of the time to science. Alternatives include parallel programming languages (e.g. Chapel, Regent, Fortran 2018), general purpose libraries (e.g. Charm++, COMPSs, HPX, Legion, UPC++), and domain specific libraries (e.g. Arkouda, Dask, Spark). With many options to choose from, it is hard for programmers to know which alternative models are appropriate for their application and for programming model developers to understand the opportunities for improvement.
Through discussion of specific applications, PAW-ATM brings together application experts and programming model developers to improve applications and models.
Early Career Program
Inclusivity
Inclusivity
TP
DescriptionIn this workshop, participants will be provided an overview of the different types of elevator pitches. There will be tips on posture, presence, and perspective. Participants will be provided with a worksheet to sketch out their ideas and of course, practice their pitch!
Workshop
Education
State of the Practice
W
DescriptionGiving students a good understanding of how micro-architectural effects impact achievable performance for given HPC workloads is essential. It enables them to find effective optimization strategies and to reason about sensible approaches towards better efficiency. This paper describes a lab course held in collaboration between LRZ, LMU and TUM. The course was born with a dual motivation in mind: filling a gap in educating students to become HPC experts, as well as understanding the stability and usability of emerging HPC programming models for recent CPU and GPU architectures with the help of students. We describe the course structure used to achieve the goals, resources made available to attract students, and experiences and statistics from running the course now for six semesters. We conclude with an assessment of how successfully the lab course could meet the vision.
Workshop
Education
State of the Practice
W
Description"Learning by Doing," also known as Active Learning, is a hands-on, experiential approach to learning that involves actively engaging in tasks or activities to acquire knowledge and skills. TACC has been incorporating "Learning by Doing" into introductory advanced computing topics through events termed Code-a-thons. This approach has greatly increased students' knowledge base in advanced computing through problem solving, debugging, and implementing. Three major student outcomes from our code-a-thons have been:
* Increased Retention and Understanding
* Skill Development
* Critical Thinking and Problem-Solving
Code-a-thons have promoted a more engaging and practical learning experience that encourages learners to become active participants in their own education, leading to a deeper and more comprehensive understanding of scientific computing. We will discuss the TACC Code-a-thon model and its benefits, detail our implementation, and share the real-world projects used during our community coding events along with feedback from our students.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
DescriptionNowadays, many scientific workflows from different domains, such as Remote Sensing, Astronomy, and Bioinformatics, are executed on large computing infrastructures managed by resource managers. Scientific workflow management systems (SWMS) support the workflow execution and communicate with the infrastructures' resource managers. However, the communication between SWMS and resource managers is complicated by a) inconsistent interfaces between SWMS and resource managers and b) the lack of support for workflow dependencies and workflow-specific properties.
To tackle these issues, we developed the Common Workflow Scheduler Interface (CWSI), a simple yet powerful interface to exchange workflow-related information between a SWMS and a resource manager, making the resource manager workflow-aware. The first prototype implementations show that the CWSI can reduce the makespan by up to 25%, even with simple workflow-aware strategies. We show how existing workflow resource management research can be integrated into the CWSI.
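As a toy illustration of why a workflow-aware resource manager can shorten the makespan (the CWSI itself exchanges far richer information, such as task dependencies; this sketch uses only task runtimes and invented numbers), compare arrival-order list scheduling against a schedule that exploits knowledge of runtimes:

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Makespan of list-scheduling `tasks` (runtimes) onto `workers` machines:
// each task goes to the earliest-free worker, in the given order. With
// `aware` set, the scheduler uses its knowledge of runtimes to place long
// tasks first (the classic longest-processing-time rule).
double makespan(std::vector<double> tasks, int workers, bool aware) {
    if (aware)
        std::sort(tasks.begin(), tasks.end(), std::greater<double>());
    std::vector<double> load(workers, 0.0);  // accumulated time per worker
    for (double t : tasks)
        *std::min_element(load.begin(), load.end()) += t;
    return *std::max_element(load.begin(), load.end());
}
```

For six tasks of lengths 1..6 on two workers, arrival order yields a makespan of 12 while the runtime-aware order yields 11 — a small version of the kind of gain the abstract reports.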
Exhibits
Flash Session
TP
XO/EX
DescriptionBoston discusses how HPC, quantum computing, and AI are paradigms that require massive amounts of computing power, are highly parallelizable, and demand sophisticated software development tools and techniques. They converge in a number of ways, with AI being used to optimize HPC and quantum workflows, quantum computers being used to accelerate AI workloads, and HPC being used to train and deploy large AI models.
Exhibitor Forum
Architecture and Networks
Cloud Computing
TP
XO/EX
DescriptionDesign of modern very large scale integrated circuits (VLSI) using electronic design automation (EDA) is an increasingly compute intensive and complex endeavor. Because of typical product cycles in chip design, EDA is an excellent candidate for offloading bursts of computations to cloud-based resources when close to design deadlines, to reduce infrastructure cost and improve flexibility by offering virtually unlimited computational power on-demand. However, running EDA workloads poses significant security risks, due to the designers’ intellectual property (IP) and high-value foundry process design kits (PDKs). The cost of a leaked proprietary design is measured in millions of dollars, lost competitiveness, and brand damage. To guarantee security of these highly valuable assets, all data and computations in the EDA workloads must be secured. Traditionally, encryption has been an effective solution to protect data at rest and in motion; however, data in use has so far seen less secure solutions based mostly on virtualization. Emerging confidential computing techniques can improve this aspect by providing truly isolated and encrypted environments for the computations. However, as of today, there is no comprehensive study on the challenges of running HPC workloads in confidential enclaves, and on how to deploy confidential computing in the public cloud. This talk focuses on EDA workloads as a proxy to generic HPC workloads that need thousands of cores, high-bandwidth network communication and shared storage. We present our experience running cloud-native EDA workloads in confidential VMs through the use of Confidential Containers, which allow a zero-effort conversion of cloud-native workloads. We will briefly discuss existing and novel mechanisms to integrate the data-in-use protection of Confidential Containers with secure private/shared storage and network.
Then, we will focus on measuring and characterizing the performance overhead of protecting data in every stage of the computation.
Birds of a Feather
Performance Measurement, Modeling, and Tools
TP
XO/EX
DescriptionAs supercomputing welcomes new workflows of simulations, data science and artificial intelligence in the Exascale era, the goal of this session is to pose, engage, debate, and address the question - "How should the SC community evolve performance benchmarks?". The session will be organized as presentations and panel discussions with audience participation that will invite active members of the Top500, HPCG, MLPerf, TeraSort, etc. and key personnel from industry, academia, and government to discuss the value, need and desire for evolving the benchmark suite that is inclusive and accommodative of emerging applications to guide future supercomputing system design and architecture.
Birds of a Feather
HPC in Society
TP
XO/EX
DescriptionThe National Science Foundation's vision and investment plans for cyberinfrastructure (CI) are designed to address the evolving needs of the science and engineering research community. Senior leadership and program staff from NSF’s Office of Advanced Cyberinfrastructure (OAC) will discuss OAC's vision, strategic and national priorities, as well as the latest funding opportunities across all aspects of the research cyberinfrastructure ecosystem. Substantial time will be devoted to Q&A between attendees and NSF staff.
Panel
Artificial Intelligence/Machine Learning
Compilers
Performance Optimization
TP
DescriptionThis panel discussion aims at identifying cross-cutting issues, opportunities, similarities, and discrepancies between HPC and AI workloads and systems, as well as defining the role of compilers in the development of HPC applications and AI models. While there is a clear overlap in problems being solved in HPC and AI communities, often solutions are siloed to one, with software fragmentation and increased maintenance cost. It has become critical to identify current gaps and potential solutions in current compiler frameworks and to develop an interoperable environment to help researchers move to the next stage of scientific discoveries, such as moving from classification models to machine reasoning. This panel brings together the experience of distinguished researchers from industry, academia, U.S. national laboratories, and the U.S. Department of Energy, to share their vision, identify current gaps and research opportunities, and define a future research agenda.
Paper
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
TP
Best Paper Finalist
DescriptionGraph databases (GDBs) are crucial in academic and industry applications. The key challenges in developing GDBs are achieving high performance, scalability, programmability, and portability. To tackle these challenges, we harness established practices from the HPC landscape to build a system that outperforms all past GDBs presented in the literature by orders of magnitude, for both OLTP and OLAP workloads. For this, we first identify and crystallize performance-critical building blocks in the GDB design, and abstract them into a portable and programmable API specification, called the Graph Database Interface (GDI), inspired by the best practices of MPI. We then use GDI to design a GDB for distributed-memory RDMA architectures. Our implementation harnesses one-sided RDMA communication and collective operations, and it offers architecture-independent theoretical performance guarantees. The resulting design achieves extreme scales of more than a hundred thousand cores. Our work will facilitate the development of next-generation extreme-scale graph databases.
Birds of a Feather
Energy Efficiency
TP
XO/EX
DescriptionWith power being a first-order design constraint on par with performance, it is important to measure and analyze energy-efficiency trends in supercomputing. To raise the awareness of greenness as a first-order design constraint, the Green500 seeks to characterize the energy-efficiency of supercomputers for different metrics, workloads, and methodologies. This BoF discusses trends across the Green500 and highlights from the current Green500 list. In addition, the Green500, Top500, and Energy Efficient HPC Working Group have been working together on improving power-measurement methodology, and this BoF presents recommendations for changes to sampling rates that will improve ease of submission without compromising accuracy.
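The Green500 metric itself is simple; what the BoF debates is how to measure its inputs. As a worked example (with illustrative round numbers, not figures from any real submission):

```cpp
// Green500-style efficiency: sustained HPL performance divided by average
// system power during the measured run, reported in GFLOPS per watt.
double gflops_per_watt(double rmax_tflops, double avg_power_kw) {
    // TFLOPS -> GFLOPS multiplies by 1000; kW -> W also multiplies by 1000,
    // so the conversion factors cancel, but we keep them explicit for clarity.
    return (rmax_tflops * 1000.0) / (avg_power_kw * 1000.0);
}
```

A hypothetical machine sustaining 1200 TFLOPS at an average draw of 24 kW would score 50 GFLOPS/W; the methodology questions (sampling rate, which subsystems are metered, what fraction of the run is measured) change `avg_power_kw`, not the formula.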
Panel
Artificial Intelligence/Machine Learning
Applications
Exascale
TP
DescriptionExascale computing promises broad advances in simulation, data analytics, and machine learning. The US Department of Energy (DOE) is funding the Exascale Computing Project (ECP) to develop the applications, software, and integration needed to harness the immense computing power of exascale machines. As part of this effort, the ECP established the Industry and Agency Council (IAC), made up of executives from US industry, US government agencies and US independent software vendors (ISVs). As the ECP winds down, this panel is a chance for IAC members to reflect on how the ECP and the move to exascale computing are impacting industry’s current and planned use of HPC in saving energy, boosting competitiveness, and building global technology leadership. Moderated by Fran Hill (Chief Scientist for DoD’s HPC Modernization Program), the panel will be a lively and informative discussion of how exascale and the ECP are impacting businesses both large and small.
Posters
Research Posters
TP
XO/EX
DescriptionRemote Memory Access (RMA) provides an alternate mechanism for data movement by separating communication from synchronization, exposing remote memory access features via one-sided communication semantics within a global address space. Performance of the most popular asynchronous RMA interfaces, such as MPI RMA and SHMEM, has steadily improved over the past years due to better software/hardware support from the vendors and community-driven programming model standardization efforts.
Current RMA benchmarking efforts are mostly focused on investigating elementary data movement overheads between a process pair within and across nodes, without considering a specific process topology. Distributed-memory applications, on the other hand, must deal with overlapped data distributions, which govern the underlying topology of the processes. We discuss the performance of SHMEM and MPI RMA (in comparison with MPI point-to-point) for grid and graph process topologies on the NERSC Perlmutter supercomputer, demonstrating average and 99th percentile latencies.
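The average and 99th-percentile statistics mentioned above can be derived from raw per-message timings as follows (a generic sketch using the nearest-rank percentile definition, not the benchmark's actual code):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

struct LatencyStats { double mean, p99; };

// Mean and 99th-percentile latency from per-message samples. The percentile
// uses the nearest-rank rule: the value at index ceil(0.99 * n) - 1 of the
// sorted samples.
LatencyStats summarize(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    double mean = std::accumulate(samples.begin(), samples.end(), 0.0) /
                  static_cast<double>(samples.size());
    std::size_t rank = static_cast<std::size_t>(
                           std::ceil(0.99 * static_cast<double>(samples.size()))) - 1;
    return {mean, samples[rank]};
}
```

Reporting p99 alongside the mean matters for RMA benchmarks because one-sided operations can have long tails (e.g. under remote-completion contention) that an average alone hides.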
Birds of a Feather
Codesign
Exascale
TP
XO/EX
DescriptionEfficient use of exascale systems for large-scale applications implies the development, in a combined manner, of applications, the full software stack, and the machine. As the BoF organizers did in the context of IESP and BDEC workshops (exascale.org), we plan to launch a new series of workshops that will gather stakeholders in Europe (EuroHPC, French NumPEX project, BSC, JSC), USA (DOE, NSF partners), Japan (FugakuNEXT, Riken-CC) and large-scale application communities to target the co-design of software and hardware components of future exascale systems and to prepare the major scientific and industrial application domains to fully exploit the capabilities of these systems.
Posters
Research Posters
TP
XO/EX
DescriptionGraphs are used to model real-world systems that often evolve over time. We have developed a streaming graph framework which, while ingesting an unbounded stream of events mirroring a graph's evolution, dynamically updates the solution to a user query, and is able to offer, on-demand and with low latency, the solution to the query. Integral to our framework is that graph topology changes and algorithmic messages are processed concurrently, asynchronously, and autonomously (i.e., without shared state). This poster uses graph coloring as a challenge problem to highlight two advantages of our framework beyond those showcased by past work (i.e., low result latency, high sustained ingestion throughput, and scalability). These additional advantages are: (i) the ability to efficiently leverage the "free" computational resources available when the rate of incoming topology events is below the maximum sustainable throughput, and (ii) the ability to produce "stable" solutions to queries as the graph evolves.
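To make the incremental-update idea concrete, here is a minimal greedy coloring that repairs the solution locally on each edge insertion (a hypothetical sketch for illustration — the poster's framework is asynchronous and distributed, which this toy is not):

```cpp
#include <set>
#include <unordered_map>
#include <vector>

// Streaming-style greedy coloring: every edge insertion repairs the coloring
// by touching only one endpoint's neighborhood, so the answer to a coloring
// query is always available without recomputing from scratch.
struct StreamColoring {
    std::unordered_map<int, std::vector<int>> adj;
    std::unordered_map<int, int> color;  // vertex -> color (0-based)

    void add_edge(int u, int v) {
        adj[u].push_back(v);
        adj[v].push_back(u);
        color.try_emplace(u, 0);
        color.try_emplace(v, 0);
        if (color[u] == color[v]) {        // conflict: recolor v locally
            std::set<int> used;
            for (int w : adj[v]) used.insert(color[w]);
            int c = 0;
            while (used.count(c)) ++c;     // smallest color free in N(v)
            color[v] = c;
        }
    }

    bool valid() const {                   // no edge joins same-colored ends
        for (const auto& [u, nbrs] : adj)
            for (int v : nbrs)
                if (color.at(u) == color.at(v)) return false;
        return true;
    }
};
```

The "stability" property the poster highlights corresponds here to the fact that an insertion recolors at most one vertex, rather than perturbing the whole solution.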
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionModern extreme scale computing systems rely on heterogeneous CPU and GPU architectures. While this design has enabled several remarkable achievements in high-performance computing, applications running at exascale have already identified multiple opportunities where this paradigm can be improved; notably, the communication costs, and the complexity of the resultant programming model, incurred by the presence of two isolated memory spaces for CPU and GPU. To address these challenges, AMD has developed the Instinct MI300 APU (Accelerated Processing Unit) architecture, which integrates CPU and GPU processing elements on the same system on a chip (SoC). This talk will discuss programmability advantages, and future possibilities, afforded by the MI300 for Exascale computing, including: the improved simplicity of porting from CPU codes and performance benefits resulting from close integration of CPU and GPU compute elements. These simplifications and improvements are realized in a variety of tools, including the RAJA and Kokkos accelerator abstraction frameworks, a recently developed Standard Parallelism interface to AMD APUs, and automatic offload of libraries.
Tutorial
Algorithms
Programming Frameworks and System Software
Task Parallelism
TUT
DescriptionOpenMP is the de facto standard for writing parallel applications for shared memory computers. Born ~25 years ago in 1997, it runs on just about every shared memory platform on the market. It’s also very complicated. We created OpenMP to be the “simple API” for application programmers. With a specification running to over 600 pages, OpenMP has grown into an intimidating API viewed by many as being for “experts only”.
Most OpenMP programmers, however, use around 21 items from the specification. We call these 21 items the “OpenMP Common Core”. By focusing on the common core, we make OpenMP what it was always meant to be: a simple API for parallel application programmers.
In this hands-on tutorial, we explore the OpenMP Common Core. We utilize active learning through a carefully selected set of exercises, so students will master the Common Core and learn to apply it to their own problems. Students will use their own laptops (with Windows, Linux, or macOS) to access remote systems that support OpenMP (a remote SMP server). Alternatively, students can load an OpenMP compiler onto their laptops before the tutorial. Information about OpenMP compilers is available at www.openmp.org.
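Among the roughly 21 Common Core items, the parallel worksharing loop with a reduction is the workhorse. This standalone sketch (our own example, not taken from the tutorial materials) shows the pattern; compiled without OpenMP support the pragma is simply ignored and the loop runs serially, which is OpenMP's incremental-parallelism design at work:

```cpp
#include <vector>

// A worksharing loop with a reduction — the single most common OpenMP idiom.
// With -fopenmp the iterations are divided among threads and the partial
// sums combined; without it, the pragma is ignored and the result is the same.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
#pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        sum += a[i] * b[i];
    return sum;
}
```

The reduction clause is what makes this correct under parallel execution: each thread accumulates into a private copy of `sum`, avoiding a data race on the shared variable.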
ACM Gordon Bell Finalist
Awards
TP
DescriptionWe present an efficient and performance portable implementation of the Simple Cloud Resolving E3SM Atmosphere Model (SCREAM). SCREAM is a full featured atmospheric global circulation model with a nonhydrostatic dynamical core and state-of-the-art parameterizations for microphysics, moist turbulence and radiation. It has been written from scratch in C++ with the Kokkos library used to abstract the on-node execution model for both CPUs and GPUs. SCREAM is one of only a few global atmosphere models to be ported to GPUs. As far as we know, SCREAM is the first such model to run on both AMD GPUs and NVIDIA GPUs, as well as the first to run on nearly an entire exascale system (Frontier). On Frontier, we obtained a record setting performance of 1.26 simulated years per day for a realistic cloud resolving simulation.
Workshop
W
DescriptionOriginally launched in 2018, Spin is a user-facing, container-based platform designed for NERSC users to deploy their own science gateways, workflow managers, API endpoints, databases, and other network services to support their scientific projects. Spin users enjoy the ease of use and rapid deployment typical of cloud technologies combined with close proximity to large-scale compute and storage resources; NERSC administrators benefit from managing a common, consolidated service platform with a reduced-privilege container runtime environment.
In just five years, Spin has evolved into an important platform for web services and complex scientific workflows, supporting hundreds of users in over 80 NERSC projects. Along this journey, Spin has undergone a major redeployment to a Kubernetes-based back end, a complete overhaul of its security policy subsystems, a full hardware and storage refresh, and numerous software upgrades.
In this quick talk intended to "spin off" deeper conversations with both users and facilities, we’ll highlight key milestones and lessons learned in the development of Spin, and we'll share plans for the future, including upgraded storage, improved security automation, and an expansion of Spin into the HPC platform itself.
Workshop
Education
State of the Practice
W
DescriptionAs of 2023 we, at PSC, have taught more than 24,000 students over the course of 106 events using the Wide Area Classroom, a novel distributed teaching platform. This has been a successful effort, as gauged by several important metrics. We describe both the technical and logistical structure of these events as well as the specific HPC curricula which have proven to be most popular.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionComputer networking is mostly experienced as an invisible, mysterious, and ubiquitous presence. Core problems in networking are grounded in the physicality of the network, which most students will not have experience or intuition with when taking a networks course. To better ground students in the physicality of computer networking, this work describes an experimental course delivered in Fall 2022 using custom built hardware.
A feature-poor optical NIC was developed and students were directed to "reinvent the Internet". Challenges encountered with the NICs required substantial changes to the course mid-delivery. Overall, the course was successful at meeting its intended goals.
Exhibits
Flash Session
TP
XO/EX
DescriptionMoshe lays out the future of AI in the world’s data centers, inviting attendees to think differently about what’s needed and what’s now possible to make inference AI technology systems economically viable and environmentally sustainable, now and in the future.
Workshop
Quantum Computing
Software Engineering
W
DescriptionWe introduce the Trapped-Ion Surface Code Compiler (TISCC), a software tool that generates circuits for a universal set of surface code patch operations in terms of a native trapped-ion gate set. To accomplish this, TISCC manages an internal representation of a trapped-ion system where a repeating pattern of trapping zones and junctions is arranged in an arbitrarily large rectangular grid. Surface code operations are compiled by instantiating surface code patches on the grid and using methods to generate transversal operations over data qubits, rounds of error correction over stabilizer plaquettes, and/or lattice surgery operations between neighboring patches. Beyond the implementation of a basic surface code instruction set, TISCC contains corner movement functionality and a patch translation that is implemented using ion movement alone. Except in the latter case, all TISCC functionality is extensible to alternative grid-like hardware architectures. TISCC output has been verified using the Oak Ridge Quasi-Clifford Simulator (ORQCS).
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionIn this panel, we focus on the challenges in programming models and runtime systems for large language model training and inference. We invite researchers across academia, national labs, and industry to share their experience and vision on programming tools, runtime performance, architecture, optimization, scalability, I/O, data, and communication to facilitate LLMs on supercomputers. The discussion will cover LLM pretraining, fine-tuning, deployment, and usage in science. We will identify the top five challenges across these areas.
Birds of a Feather
Exascale
TP
XO/EX
DescriptionThe TOP500 list of supercomputers serves as a “Who’s Who” in the field of High Performance Computing (HPC). It started as a list of the most powerful supercomputers in the world and has evolved to a major source of information about trends in HPC. The 62nd TOP500 list will be published in November 2023 just in time for SC23.
This BoF will present detailed analyses of the TOP500 and discuss the changes in the HPC marketplace during the past years. The BoF is meant as an open forum for discussion and feedback between the TOP500 authors and the user community.
Birds of a Feather
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionAI is driving scientific discovery and economic growth. While AI R&D is advancing rapidly, access to the computational and data resources that drive the frontiers of AI remains limited. This BoF will explore how democratizing access to national-level cyberinfrastructure (CI) for AI R&D can help strengthen the AI research and innovation ecosystem. Specifically, this BoF will catalyze a discussion about the nature and composition of such CI, how it can be realized nationally and connected internationally, how to measure both successes and failures, and what are necessary guardrails to ensure responsible AI.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionThe Fast Fourier Transform (FFT) is a numerical operation that transforms a function into a representation in terms of its constituent frequencies and is an integral part of scientific computation and data analysis. The objective of our work is to enable use of the FFT as part of a scientific in situ processing chain to facilitate the analysis of data in the spectral regime. We describe the implementation of an FFT endpoint for the transformation of multi-dimensional data within the SENSEI infrastructure. Our results show its use on a sample problem in the context of a multi-stage in situ processing workflow.
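For reference, the transform such an endpoint computes is the discrete Fourier transform, X[k] = Σ_j x[j]·e^(−2πi·kj/N); an FFT is just a faster algorithm for the same sums. A direct O(n²) version (illustrative only — a production endpoint would call an FFT library) makes the definition concrete:

```cpp
#include <complex>
#include <vector>

// Direct O(n^2) discrete Fourier transform of a complex signal.
// X[k] = sum over j of x[j] * exp(-2*pi*i * k * j / n).
std::vector<std::complex<double>>
dft(const std::vector<std::complex<double>>& x) {
    const double pi = 3.14159265358979323846;
    const std::size_t n = x.size();
    std::vector<std::complex<double>> X(n);  // zero-initialized bins
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            X[k] += x[j] * std::polar(1.0, -2.0 * pi *
                                      static_cast<double>(k * j) /
                                      static_cast<double>(n));
    return X;
}
```

A constant signal concentrates all its energy in the zero-frequency bin, which is a handy sanity check for any spectral endpoint.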
Workshop
Applications
Exascale
Heterogeneous Computing
Programming Frameworks and System Software
State of the Practice
W
DescriptionBenchmarking is integral to procurement of HPC systems, communicating HPC center workloads to HPC vendors, and verifying performance of the delivered HPC systems. Currently, HPC benchmarking is manual and challenging at every step, posing a high barrier to entry and hampering reproducibility of benchmarks across different HPC systems. We propose collaborative continuous benchmarking to enable functional reproducibility, automation, and community collaboration in HPC benchmarking. We define the minimal requirements for collaborative continuous benchmarking and develop a common language to streamline the interactions between HPC centers, vendors, and researchers. We demonstrate an initial implementation of collaborative continuous benchmarking and introduce an open-source continuous benchmarking repository, Benchpark, for community collaboration. We believe collaborative continuous benchmarking will help overcome the human bottleneck in HPC benchmarking, enabling better evaluation of our systems and more productive collaboration within the HPC community.
ACM Gordon Bell Finalist
Awards
TP
DescriptionA state-of-the-art large eddy simulation code has been developed to solve compressible flows in turbomachinery. The code has been engineered with a high degree of scalability, enabling it to effectively leverage the many-core architecture of the new Sunway system. A consistent performance of 115.8 DP-PFLOPs has been achieved on a high-pressure turbine cascade consisting of over 1.69 billion mesh elements and 865 billion degrees of freedom (DOFs). By leveraging a high-order unstructured solver and its portability to large heterogeneous parallel systems, we have progressed toward solving the grand challenge problem outlined by NASA: a time-dependent simulation of a complete engine, incorporating all the aerodynamic and heat-transfer components.
Posters
Research Posters
TP
XO/EX
DescriptionThere have been significant advances in machine learning-driven performance modeling in recent years. One key limitation of such approaches is that their success depends, to a large degree, on the formulation of the outcome or objective, which is typically done by human experts. In this paper, we propose a novel approach for automatically generating new optimization heuristics using inductive program synthesis. To explore the feasibility of this approach, we investigated the graph-coloring register allocation heuristic used in state-of-the-art compilers today. In particular, we focused on the task of live range splitting. The results show that, using a Genetic Algorithm, we can obtain splitting heuristics that are within 10% of the optimal split after 202 generations.
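For readers unfamiliar with the search procedure, a toy Genetic Algorithm over bit strings illustrates the selection/crossover/mutation loop. The fitness function and parameters here are hypothetical stand-ins; the paper's actual search space is live-range-splitting heuristics, not bit counting:

```python
import random

random.seed(0)

# Toy GA: evolve bit strings toward all-ones as a stand-in objective.
def fitness(bits):
    return sum(bits)

def evolve(pop_size=20, length=16, generations=50, mutation_rate=0.1):
    pop = [[random.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]              # selection (elitist)
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, length)         # single-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:       # occasional bit flip
                child[random.randrange(length)] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best = evolve()
best_score = fitness(best)
```

In the paper's setting, each "individual" would encode a candidate splitting heuristic and fitness would measure distance from the optimal split.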
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
TP
DescriptionThe rapid growth in demand for HPC systems has led to a rise in carbon footprint, which requires urgent intervention. In this work, we present a comprehensive analysis of the carbon footprint of high-performance computing (HPC) systems, considering the carbon footprint during both the hardware manufacturing and system operational stages. Our work employs HPC hardware component carbon footprint modeling, regional carbon intensity analysis, and experimental characterization of the system life cycle to highlight the importance of quantifying the carbon footprint of HPC systems.
Workshop
Quantum Computing
Software Engineering
W
DescriptionIn quantum programming, there is a natural conflict between high-level expression and low-level control. Existing quantum programming solutions optimize for either the expressiveness of quantum programs or ease of composing quantum programs, but not both. In this work, we describe a quantum programming interface called AutoQASM that is Python-native, clean, and expressive for general control flow as well as for low-level and device-dependent quantum instructions. It generates OpenQASM 3.0 programs and integrates with the Amazon Braket software development kit to enable program composition, execution, and analysis in the same environment.
Posters
Research Posters
TP
XO/EX
DescriptionParticle-resolved direct numerical simulations (PR-DNS), which resolve not only the smallest turbulent eddies but also track the development and motion of individual particles, are arguably an essential tool for exploring aerosol-cloud-turbulence interactions at the fundamental level. For instance, PR-DNS may complement experimental facilities designed to study key physical processes in a controlled environment and therefore serve as digital twins for such cloud chambers. In this poster we present our ongoing work aimed at enabling the use of a PR-DNS model for this purpose. We consider two approaches: traditional HPC techniques and emerging machine learning methods. Future research directions are outlined as well.
Workshop
Education
State of the Practice
Sustainability
W
DescriptionSkills in HPC are important for various professional fields. Several initiatives organize the required skills by goals and roles. These visions, built from discussions between specialists and users, propose three main actors: HPC systems engineers, software engineers, and users. Beyond formal university courses, the community has offered summer schools and similar events. To this end, specialists and industry have been involved in developing training on computer architectures and programming paradigms. The deployment of specific schools around the world democratizes knowledge and creates previously non-existent collaborations in multidisciplinary and inclusive ways. We therefore propose an original non-formal school for skills-based training and development, the Supercomputing and Distributed Camping School (SC-Camp): a non-profit event addressed to students who lack financial backing, with a strong focus on practical sessions and an itinerant format that brings knowledge to a different country each year.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
DescriptionTranscriptomics studies the RNA present in a specific cell or tissue at a given time or condition. This dependence on time makes the problem computationally challenging, as the data generated by transcriptomics experiments are larger than those of genomics studies on DNA sequences. The goal of the Transcriptomics Atlas project is to create a database of analyzed RNA sequences corresponding to given tissue and organ types, based on data from public repositories, and make it available to researchers. We describe our transcriptomics atlas pipeline as an example of a new data- and compute-intensive scientific workflow. After analyzing the requirements of the tasks in the pipeline, we describe our proposed cloud architecture. We present preliminary results of the experimental evaluation of the pipeline in the AWS cloud and compare its performance to traditional execution on an HPC cluster.
Posters
Research Posters
TP
XO/EX
DescriptionI/O performance prediction is challenging due to the many intertwined variables inside a cluster, which makes it a strong candidate for machine learning. However, making a high-quality prediction requires a large amount of equally high-quality data, and collecting it is a major challenge for most data centers.
In this project, we explore transfer learning to predict the I/O performance by utilizing the publicly available I/O performance data in Darshan logs from the NCSA's Blue Waters supercomputer. We devise a workflow to train a neural network model as a base to predict the POSIX I/O bandwidth of other clusters (CLAIX18 and Theta). With less than 1% of the data needed to build the base model, our experiment shows that our transfer learning workflow can predict the I/O bandwidth of another system with a mean absolute error better or equivalent to the state-of-the-art.
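A hedged sketch of the general transfer idea, with a simple least-squares model standing in for the authors' neural network and synthetic features standing in for Darshan log features (all names and numbers here are illustrative): fit on abundant "source-system" data, then adapt with a small residual model trained on roughly 1% as much "target-system" data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_linear(X, y):
    Xb = np.hstack([X, np.ones((len(X), 1))])   # fold the bias term in
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w

# "Source" system: abundant data with one bandwidth relation.
X_src = rng.normal(size=(5000, 2))
y_src = X_src @ [3.0, 2.0] + 1.0 + rng.normal(scale=0.1, size=5000)
w_base = fit_linear(X_src, y_src)

# "Target" system: a shifted relation, but only 50 labeled samples (~1%).
X_tgt = rng.normal(size=(50, 2))
y_tgt = X_tgt @ [3.5, 1.5] + 2.0 + rng.normal(scale=0.1, size=50)

# Transfer step: learn only the residual between the base model and target.
w_resid = fit_linear(X_tgt, y_tgt - predict(w_base, X_tgt))

X_test = rng.normal(size=(200, 2))
y_test = X_test @ [3.5, 1.5] + 2.0
mae = float(np.mean(np.abs(predict(w_base, X_test) + predict(w_resid, X_test) - y_test)))
```

The base model carries most of the structure, so the small target sample only has to explain the difference between the systems.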
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionFacing the need for carbon emission reduction, processes such as CO2 capture in nanoporous Metal-Organic Frameworks (MOFs) have emerged. However, such processes still need to be improved by understanding the dynamic properties of CO2 molecules when confined in MOF nanopores. To do so, molecular dynamics (MD) simulations are run for several million iterations, making it possible to accurately compute the CO2 residency time. Nevertheless, this dynamical parameter remains challenging to compute with standard post-processing approaches and may require terabytes of storage when data are saved after each iteration. To tackle this issue, we developed a trigger-based in situ approach that saves only the relevant data. We implement it by instrumenting the LAMMPS MD code with the SENSEI/Python in situ API. We show that this approach reduces the quantity of data saved by four orders of magnitude and can be up to 14% faster than traditional MD simulations without in situ processing.
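The trigger-based idea can be sketched generically (a toy of ours, not the authors' LAMMPS/SENSEI instrumentation): evaluate a condition at each timestep and write data only when it fires, so that only "relevant" steps reach storage.

```python
import random

random.seed(1)

# Hypothetical trigger: fire only when a step's maximum value is extreme.
def trigger(step_data, threshold=0.9999):
    return max(step_data) > threshold

saved = []
n_steps = 1000
for step in range(n_steps):
    data = [random.random() for _ in range(100)]  # stand-in per-step data
    if trigger(data):
        saved.append((step, data))                # only triggered steps persist
```

With a selective enough trigger, the volume written is orders of magnitude below the volume generated, which is the effect the abstract reports for the CO2 residency-time analysis.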
Paper
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionTrivial operations cause software inefficiencies that waste functional units and memory bandwidth for executing useless instructions. Although previous works have identified a significant amount of trivial operations in widely used programs, the proposed solutions only provide useful observations rather than actionable guidance for eliminating trivial operations to improve performance. In this paper, we propose TrivialSpy - a fine-grained and dataflow-based value profiler to effectively identify software triviality with optimization potential estimation. With the help of dataflow analysis, TrivialSpy can detect software trivialities of heavy operation, trivial chain, and redundant backward slice. In addition, TrivialSpy can identify trivial breakpoints that combine multiple trivial conditions for more optimization opportunities. The evaluation results demonstrate TrivialSpy is capable of identifying software triviality in highly optimized programs. Based on the optimization guidance provided by TrivialSpy, we achieve up to a 52.09% performance speedup after eliminating trivial operations.
Birds of a Feather
Artificial Intelligence/Machine Learning
TP
XO/EX
DescriptionThis Birds of a Feather session, “Two Worlds Collide: Forging Sustainable Coupled HPC Simulation/Deep Learning Applications from Hardware to Algorithm,” continues a series started in 2021 with a theme of discussing and brainstorming solutions for a new paradigm in HPC: the coupling of simulation with machine learning for state-of-the-art research. In this installment, we focus on sustainability and assurance for coupled simulation and deep learning. We discuss the current state and needs for enabling integration of HPC simulation with modern deep learning stacks to provide transformative scientific discoveries while delivering productivity, portability, and correctness for safety and mission critical applications.
Posters
Research Posters
Performance Measurement, Modeling, and Tools
TP
DescriptionNumerous sophisticated profiling and visualization tools have been developed to enable programmers to expose semantic information from their application components. However, effective and interactive exploration of the profiles of large-scale parallel programs remains a challenge due to the high I/O overheads of profiles and the difficulties in scaling downstream visualization tools. In this poster, we present a full-stack approach to a performance introspection framework that tackles key challenges in profiling and visualizing performance data at scale. Our novelty lies in a scalable and compact data model and a two-phase I/O system, which make the profiler scalable and low-overhead (under 5%) even at high process counts. We then build a web-based, visual-analytic dashboard with linked views. Our profiling and visualization tools are lightweight and easy to use, striking a balance between providing sophisticated features and operating quickly and efficiently at high process counts.
Workshop
Performance Optimization
W
DescriptionMetal additive manufacturing is a disruptive manufacturing technology that opens the design space for parts outside those possible with traditional manufacturing methods. To meet industry and R&D needs to certify AM parts, the ExaAM project has developed a suite of exascale-ready computational tools to model the process-to-structure-to-properties relationship for additively manufactured metal components. One tool is a UQ pipeline to quantify the effect that uncertainty in processing conditions has on local mechanical response. We present an overview of this pipeline and its codes. Using ORNL’s exascale computer, Frontier, we apply this pipeline across multiple length and time scales to predict the local mechanical response of a location within a complex AM bridge part, AMB2018-01, produced by NIST as part of its 2018 AM-Bench test series. Our results are then compared to experimental mechanical tests of parts from the NIST build to quantify the error in the ExaAM UQ workflow.
Workshop
Performance Optimization
W
DescriptionWith increased computational power through the use of low-precision arithmetic, a relevant question is how lower precision affects simulation results, especially for chaotic systems where analytical round-off estimates are non-trivial to obtain. In this work, we consider how the uncertainty of the time series of a direct numerical simulation of turbulent channel flow at 𝑅𝑒𝜏 = 180 is affected when restricted to a reduced-precision representation. We utilize a non-overlapping batch means estimator and find that the mean statistics can, in this case, be obtained with significantly fewer mantissa bits than conventional IEEE-754 double precision, but that the mean flow is more sensitive in the middle of the channel than in the boundary layer. This indicates that simulations may benefit significantly from the low-precision floating-point units found in upcoming computer hardware, particularly in the boundary layer, where the majority of the computational work is located.
Workshop
Education
State of the Practice
W
DescriptionThe “Understanding the Skills and Pathways Behind Research Software Training” BoF session run at ISC’23 provided an opportunity to gather attendees interested in enhancing skills within the RSE community. This included looking at options for understanding and developing pathways that practitioners can follow to develop their skills and competencies in a structured manner from beginner to advanced level.
During the session a live, anonymous survey was conducted. Participants were asked several questions including their role in the training community and how easy they feel it is to find/access training content targeting different skill levels. They were also asked about challenges faced in accessing relevant content, combining it into a coherent pathway, and linking training content from different sources.
The goal of this lightning talk is to present these findings within the context of the community-wide effort to make training materials more FAIR: findable, accessible, interoperable, and reusable.
Workshop
W
DescriptionHPC platforms seek peak computing performance at minimal energy cost in pursuit of sustainability. Depending on the use case and implementation, both post-Moore hardware elements and software deployment techniques (such as virtualization or containerization) are incorporated. However, as the number of devices proliferates, managing applications has become intricate, prompting the adoption of containerization methods for simplification. Understanding the performance of each deployment strategy is therefore important for proposing an adequate implementation and integration of HPC-based post-Moore architectures and guaranteeing efficiency. This study evaluates containerization strategies on a cost-effective post-Moore device suitable for an HPC platform. The deployment methods are examined against factors such as ease of use, reproducibility, and compatibility. The evaluation establishes metrics and employs stress tests to appraise application-specific aspects. The resulting insights are categorized to address deployment mechanisms, performance implications, execution duration, and energy consumption impacts.
Paper
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
TP
DescriptionModern Graphics Processing Units (GPUs) are expected to operate for many years, exposing the hardware to aging (i.e., permanent faults arising after the end-of-manufacturing test). Hence, techniques to assess the impact of permanent faults in GPUs are strongly required, especially in safety-critical domains.
This paper presents a method to evaluate permanent faults in the GPU's scheduler and control units, together with the first figures to quantify these effects. We inject 5.83x10^5 permanent faults in the gate-level units of a GPU model. Then, we map the observed error categories as software errors by instrumenting 13 applications and two convolutional neural networks, injecting more than 1.65x10^5 permanent errors (1,000 errors per application), reducing evaluation times from several years to hundreds of hours. Our results highlight that faults in GPU parallelism management units impact software execution parameters. Moreover, errors in resource management or instruction codes hang the code, while 45% of errors induce silent data corruption.
Panel
Artificial Intelligence/Machine Learning
Applications
Reproducibility
TP
DescriptionRecent advances in deep learning (DL) for scientific computing have paved the way for a new type of integrated programming environment. This environment must support the seamless integration of simulation applications with deep learning frameworks using methods such as in-memory coupling and inference serving. Especially for HPC, this environment brings a slew of challenges, forcing developers to revisit decades of solved problems in scientific computing: kernel optimization, verification/validation strategies, building/porting practices. Interfacing HPC simulation codes with DL frameworks from industry—whose philosophies and strategies may differ from those within HPC—brings critical questions about how these two communities can work together to develop sustainable, integrated programming environments that are trustworthy, vetted, and portable, and where HPC communities can express requirements for scientific software and can track ownership. Discussions are needed about how to overcome these challenges: here, panelists from academia, national laboratories and industry will start a conversation, sharing perspectives and experiences.
Paper
Accelerators
Algorithms
Linear Algebra
TP
DescriptionThis paper presents a unified framework for reducing communication costs of sparse triangular solvers (SpTRSV) on CPU and GPU clusters. The proposed framework builds upon a 3D communication-avoiding process layout that distributes a sparse triangular matrix into a 3D layout consisting of 2D grids. This work significantly reduces inter-process communication by replicating computation and using sparse allreduce operations across the 2D grids. This also allows for integration of a number of communication-optimized 2D SpTRSV algorithms, including binary communication tree-based CPU algorithms and one-sided GPU communication (e.g., NVSHMEM)-based algorithms. With all these communication reduction schemes, the resulting SpTRSV exhibits significantly better scalability than existing works on leadership CPU and GPU clusters such as Cori, Perlmutter, and Crusher.
Birds of a Feather
Architecture and Networks
TP
XO/EX
DescriptionIn order to exploit the capabilities of new HPC systems and to meet their demands in scalability, communication software needs to scale to millions of cores and support applications with adequate functionality. UCX is a collaboration between industry, national laboratories, and academia that provides a unified open-source communication framework.
The UCX project is managed by the UCF consortium (http://www.ucfconsortium.org/) and includes members from LANL, ANL, Ohio State University, AMD, ARM, IBM, NVIDIA, and more. The session will serve as the UCX community meeting and will introduce the latest developments to HPC developers and the broader user community.
Birds of a Feather
Energy Efficiency
Middleware and System Software
Sustainability
TP
XO/EX
DescriptionThis BoF will bring together academia, government research laboratories, and industry to discuss and contribute to the two active community-driven, vendor-neutral forums focusing on energy efficiency in HPC software stacks. For more than seven years, these two complementary forums, HPC-PowerStack and PowerAPI, have led the efforts in identifying and building software solutions across the software stack.
This interactive BoF will enable the community to discuss ongoing challenges in designing cost-effective, cohesive, portable, and interoperable implementations of HPC software for monitoring and control of system efficiency. Attendees will also contribute toward brainstorming solutions for addressing ongoing exascale power challenges.
Paper
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
TP
Best Student Paper Finalist
DescriptionDRAM vendors utilize On-Die Error Correction Codes (OD-ECC) to correct random bit errors internally. Meanwhile, system companies utilize Rank-Level ECC (RL-ECC) to protect data against chip errors. Separate protection increases the redundancy ratio to 32.8% in DDR5 and incurs significant performance penalties. This paper proposes a novel RL-ECC, Unity ECC, that can correct both single-chip and double-bit error patterns. Unity ECC corrects double-bit errors using unused syndromes of single-chip correction. Our evaluation shows that Unity ECC without OD-ECC can provide the same reliability level as Chipkill RL-ECC with OD-ECC. Moreover, it can significantly improve system performance and reduce DRAM energy and area by eliminating OD-ECC.
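The core mechanism behind such codes, a syndrome that names the faulty position, can be illustrated with a textbook single-error-correcting Hamming(7,4) code. This is a generic sketch only, not Unity ECC's rank-level DRAM construction:

```python
import numpy as np

# Parity-check matrix H: column i is the binary encoding of position i+1
# (LSB in row 0), so a nonzero syndrome directly points at the flipped bit.
H = np.array([[1, 0, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 0, 1, 1],
              [0, 0, 0, 1, 1, 1, 1]])

def correct(word):
    syndrome = H @ word % 2
    pos = int(syndrome[0] + 2 * syndrome[1] + 4 * syndrome[2])
    if pos:                          # nonzero syndrome: flip the named bit
        word = word.copy()
        word[pos - 1] ^= 1
    return word

codeword = np.array([1, 0, 1, 1, 0, 1, 0])   # valid: H @ codeword % 2 == 0
corrupted = codeword.copy()
corrupted[2] ^= 1                             # single-bit error at position 3
recovered = correct(corrupted)
```

Unity ECC's contribution, in these terms, is to assign syndrome values left unused by single-chip correction to double-bit error patterns, so one code covers both.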
Posters
Research Posters
TP
XO/EX
DescriptionThis poster highlights our previous and planned design-space exploration efforts to optimize our CGRA architecture for HPC, including intra-CGRA interconnect optimization, FMA and transcendental operations on the CGRA, a programmable buffer, systolic-array-style execution on the CGRA, predication support, and FPGA-based emulation in an actual HPC environment.
Panel
Applications
Reproducibility
TP
DescriptionThe scientific community needs a data fabric that integrates data delivery and access to shared storage, networking, computing, and educational resources. Such a data fabric can potentially democratize data-driven scientific discovery across the growing data science community.
In this panel, we will discuss the needs, challenges, and opportunities of the data science community leveraging the existing cyberinfrastructures and software tools while strategizing on what is missing to connect an open network of institutions, including resource-disadvantaged institutions.
Students@SC
DescriptionAre you looking to maximize your success in the job interview process? Negotiation is a crucial skill to help you secure the job offer you desire. In this workshop, participants will learn the best negotiation practices, such as what to negotiate, when to negotiate, how to evaluate total compensation, and what to avoid during negotiation. Participants will have the opportunity to put their negotiation skills to practice.
Invited Talk
Applications
Biology
Medicine
TP
DescriptionThe utilization of vascular digital twins has gained significant traction in the field of medicine, holding immense potential for transforming healthcare practices. These advanced models enable the creation of patient-specific replicas of vascular systems, facilitating precise measurements of blood flow conditions. Vascular digital twins provide a non-invasive solution for assessing stenosis severity, guiding treatment decisions, and optimizing surgical planning. Medical professionals can enhance their expertise and refine approaches with precision and confidence by performing virtual surgery and evaluating interventions beforehand. However, the development and deployment of vascular digital twins pose notable challenges, particularly in terms of data size, time-to-solution, and computational cost. Constructing a realistic model of human blood flow entails complex mathematical and computational tasks, incorporating fluid dynamics, intricate vessel geometry, pulse-driven flow and pressure changes, and the behavior of red blood cells. Furthermore, the seamless integration of personalized models with streaming wearable data for holistic patient views and virtual reality interfaces for intuitive interaction by clinicians and researchers presents additional hurdles. In this presentation, I will discuss the role of high performance computing in advancing the fidelity and use of personalized computational models.
Tutorial
Cloud Computing
Middleware and System Software
TUT
DescriptionCloud computing technologies have seen tremendous growth in recent years, with many organizations moving their HPC workloads to the cloud due to its flexibility in the organization and provisioning of HPC infrastructure. While such a diverse and flexible set of options brings additional degrees of freedom, it also brings a daunting set of hardware and software choices. Furthermore, the lines between traditional system administration and application deployment can be blurred.
In this tutorial, we will provide a foundation to understand how to run HPC workloads in the cloud effectively and with minimal complexity. We start with a primer on cloud foundations and how they map to common HPC concepts, and then dive deeper into core HPC cloud components. We then introduce important HPC partners, discuss industry-specific solutions and present blueprints describing infrastructure, scheduler and applications.
Finally, we present the best practices to run HPC in the cloud and how to explore your options for the best configuration for price/performance.
This tutorial will use a combination of lectures and hands-on labs using Google Cloud, the open-source Google Cloud HPC Toolkit, Slurm, Spack, and other popular open-source HPC software to provide a balance of both theoretical and hands-on learning.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionHigh-Performance Computing (HPC) has long been the driving force behind advancements in science, engineering, and beyond. Yet, realizing the full potential of HPC applications has often been hampered by the intricate nature of programming for the underlying parallel systems. In this keynote, we explore a transformative approach that bridges the gap between human ingenuity and computational power using the capabilities of large language models (LLMs).
Our research is an exploration of how cutting-edge LLMs can be tailored to the demanding domain of HPC, where computational speed and efficiency reign supreme. While LLMs have showcased remarkable proficiency in understanding and generating code, their training data primarily comes from general-purpose codebases. In stark contrast, HPC code involves intricate mathematical modeling, parallelism, and optimization, demanding customized adaptations.
That is why our journey toward ‘HPC LLMs’ began with the collection of an extensive dataset, HPCorpus, a curated collection of HPC code in C, C++, and Fortran from diverse domains. Armed with this invaluable resource, we embarked on an ambitious mission to enhance the capabilities of language models in the realm of HPC. The creation of Tokompiler, a pioneering HPC-specific code tokenizer, marked a pivotal turning point. Tokompiler, designed to preprocess code for language models, embeds abstract syntax tree (AST) information into the source code itself, reshaping the way language models comprehend and generate code to resemble how compilers, not humans, perceive it. Building upon this innovation, we undertook comprehensive pre-training efforts with CompCoder, adapting transformer-based language models to the intricacies of HPC. This journey has culminated in novel downstream tasks, including the generation of OpenMP and MPI code, where our models shine by transforming serial code into efficient parallel code. Together, these milestones represent a great leap forward in the convergence of AI and HPC from a different perspective, promising to redefine the landscape of computational science.
As we stand at the crossroads of AI and HPC, the possibilities are boundless. Our journey is merely the prologue, unveiling a multitude of untapped opportunities in HPC code comprehension, generation, and optimization. From refining domain-specific code to tackling complex simulations and accelerating scientific breakthroughs, the horizons are vast. The symbiotic partnership between LLMs and HPC promises to revolutionize how HPC practitioners write code. Looking ahead, we envision a future where LLMs for HPC become indispensable tools for researchers and developers in their quest for unprecedented speed, accuracy, and efficiency.
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionThis work explores a new use case of an in situ processing capability to study a particular diffusion process in magnetic confinement fusion. This diffusion process involves plasma particles that are likely to escape confinement; such particles carry a significant amount of energy from the burning plasma to the divertor, damaging the divertor plate. This study requires in situ processing because of the fast-changing nature of the particle diffusion process. However, the in situ processing approach is challenging because the amount of data to be retained for the diffusion calculations increases over time, unlike in other in situ processing cases where the amount of data to be processed is constant over time. Here we report our preliminary efforts to control memory usage while ensuring the necessary analysis tasks are completed in a timely manner.
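The memory-control problem described above can be thought of as a retention policy: keep appending per-timestep data until a budget is hit, then thin the retained history. The sketch below is our own illustration under assumed names and a simple uniform-thinning rule, not the authors' actual method:

```python
def retain_with_budget(history, new_snapshot, max_snapshots):
    """Append a new per-timestep particle snapshot; once the retained
    history exceeds the budget, drop every other snapshot (the slice has
    odd length when triggered, so the newest snapshot is always kept)."""
    history.append(new_snapshot)
    if len(history) > max_snapshots:
        history[:] = history[::2]  # uniform thinning of the retained history
    return history

# Hypothetical driver: 10 timesteps, budget of 4 retained snapshots.
history = []
for step in range(10):
    retain_with_budget(history, {"step": step}, max_snapshots=4)
```

A real in situ analysis would thin smarter (e.g., keep recent steps at full resolution), but the sketch shows how memory can stay bounded while the newest data survives.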
Posters
Research Posters
TP
XO/EX
DescriptionCardiovascular electrophysiology simulations often involve computationally expensive tasks due to the inherent multiphysics complexity of the problems. Additionally, the use of complex patient-specific geometries and biophysically detailed ionic models adds to the system's complexity. To numerically solve such problems within reasonable timeframes, high-performance computing plays a crucial role. In this poster, we present a high-performance electrophysiology library specifically designed to address these demanding simulations. The library's routines support the use of linear and quadratic tetrahedral elements. Moreover, our library offers a two-way coupling capability that enables interactions among multi-dimensional meshes. This important feature facilitates the simulation of electrical interactions between insulated regions of the heart, such as the atria and the ventricles. By enabling such coupling, the library aims to contribute to a more comprehensive understanding of the heart's electrophysiology and its intricate electrical behavior.
Birds of a Feather
Education
TP
XO/EX
DescriptionCreating and providing HPC training for practitioners with diverse backgrounds is challenging, and requires a multitude of educational resources covering different skills. However, the sheer volume does not guarantee discoverability or quality of the content. The main goal of the International HPC Certification program is to ease the provision and uptake of training by clearly categorizing, defining and eventually assessing the skills required to efficiently use HPC resources. The session aims to present the current status, discuss the developed processes, tools, and skills, and to ensure community involvement. Anyone interested in HPC education is invited to participate in the discussion.
Workshop
Quantum Computing
Software Engineering
W
DescriptionThe automatic resource estimation tools provided by Azure Quantum and the Microsoft Quantum Development Kit are described, and examples are given of obtaining resource estimates for fault-tolerant implementations of several quantum algorithms. The AQ Resource Estimator tool uses the planar quantum instruction-set architecture (ISA) as the logical abstraction level where the algorithm specification and the physical parameters of a chosen quantum hardware profile meet. More specifically, it enables the user to provide a high-level specification of an algorithm, which then gets automatically translated to the quantum ISA level. At the lower end of the stack, the tool enables the user to specify parameters such as the physical error rates, the time durations, the quantum error correction scheme, and the algorithmic error budget. Put together, the tool is thus able to calculate the physical resources needed to execute the specific quantum algorithm.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionFault tolerance remains a key challenge for current high performance computing systems. Effective and efficient scheduling of mitigation methods continues to be a critical issue in the face of dynamic and difficult-to-predict error rates found on many systems. Using failure data from the Astra supercomputer, we examine the efficacy of a simple method to determine if a sliding window of recent failures contains an unusual pattern of errors. Specifically, we investigate using Benford’s Law to predict the likelihood that the system is currently in a period of unusual failure occurrences. While still in its initial stages, this work provides critical analysis of failure status for extreme-scale systems and a simple form of prediction for determining when the scheduling of failure mitigation may be suboptimal and needs to be reevaluated due to the unusual pattern of errors that are occurring.
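To make the idea concrete, one can compare the leading-digit distribution of a sliding window of inter-failure times against the Benford expectation and flag windows that deviate strongly. Everything below (the deviation statistic, function names, and the synthetic data) is our own illustration, not the paper's method:

```python
import math
from collections import Counter

# Benford's Law: P(leading digit = d) = log10(1 + 1/d), for d = 1..9.
BENFORD = {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit(x):
    """First significant digit of a positive number, via scientific notation."""
    return int(f"{abs(x):.10e}"[0])

def benford_deviation(window):
    """Sum of squared deviations between observed leading-digit frequencies
    in a window of inter-failure times and the Benford expectation."""
    counts = Counter(leading_digit(x) for x in window if x != 0)
    n = sum(counts.values())
    return sum((counts.get(d, 0) / n - p) ** 2 for d, p in BENFORD.items())

# Hypothetical inter-failure times (seconds): a flat leading-digit
# distribution deviates far more from Benford than a log-uniform one does.
flat = [d * 100.0 for d in range(1, 10)] * 10       # digits 1..9 equally often
loguni = [10 ** (i / 25.0) for i in range(90)]      # roughly Benford-like
```

A scheduler could then treat a large `benford_deviation` over the current window as a hint that the failure pattern is unusual and the mitigation schedule should be reevaluated.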
Tutorial
Cloud Computing
Resource Management
Software Engineering
TUT
DescriptionWithin just the past few years, the use of containers has revolutionized the way in which industries and enterprises have developed and deployed computational software and distributed systems. The containerization model has gained traction within the HPC community as well with the promise of improved reliability, reproducibility, portability, and levels of customization that were not previously possible on supercomputers. This adoption has been enabled by a number of HPC Container runtimes that have emerged including Singularity, Shifter, Enroot, Charliecloud, and others.
This hands-on tutorial aims to train users on the use of containers on HPC resources. We will provide a detailed background on Linux containers, along with introductory hands-on experience building a container image, sharing the container, and running it on an HPC cluster. Furthermore, the tutorial will provide more advanced information on how to run MPI-based and GPU-enabled HPC applications, how to optimize I/O-intensive workflows, and how to set up GUI-enabled interactive sessions. Cutting-edge examples will include machine learning and bioinformatics. Users will leave the tutorial with a solid foundational understanding of how to utilize containers on HPC resources using Podman, Shifter, and Singularity, and in-depth knowledge to deploy custom containers on their own resources.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionThe Scientific Data and Computing Center (SDCC) at Brookhaven National Laboratory manages a data storage system with millions of files totaling petabytes of data. To optimize costs, they use a multi-tiered storage approach based on data temperature, storing infrequently accessed ("cold") data on cheaper technologies like Blu-ray disks or tape drives, and frequently accessed ("hot") data on faster but costlier mediums like Hard Disk Drives or Solid State Drives. Current data migration decisions rely on manual human judgment supported by simple algorithms not suitable for long-term predictions. To address this, our project aims to automate the process by training a deep neural network (DNN) on file metadata to predict data temperature upon upload. The model achieved promising initial results, with a 90.53% general accuracy in predicting data temperature. This automation could significantly improve the management and distribution of the vast research data generated at BNL.
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
DescriptionCurrent HPC architectures are deeply hierarchical (racks, nodes, sockets, NUMA domains, caches, ...), and the mapping of MPI processes to cores can significantly influence application performance. To study hierarchy effects on MPI application performance, we propose a procedure for expressing mappings by enumerating cores in the hierarchy in different orders. We explore two use cases: MPI rank reordering for applications using subcommunicators, and core selection for applications not using all cores on a node.
Results of micro-benchmarks executing collective operations in subcommunicators show a performance difference of up to a factor of 4 between the best and the worst rank orderings. By changing the rank orders, we observe a performance impact for the Splatt application. The evaluation of the strong scalability of a conjugate gradient benchmark shows that considering all hierarchy levels in the core selection policy can give better performance than using only the options available with common MPI application launchers.
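The mapping procedure the abstract describes, enumerating cores of a hierarchy in different level orders, can be sketched as below. The hierarchy shape, level names, and the two example orders are hypothetical, not taken from the paper:

```python
from itertools import product

def enumerate_cores(shape, order):
    """Enumerate core coordinates in a (socket, NUMA, core) hierarchy.

    `shape` gives the size of each level, e.g. (2, 4, 8) for 2 sockets x
    4 NUMA domains x 8 cores; `order` is a permutation of level indices,
    where the last entry is the level that varies fastest. Rank i is
    mapped to the i-th coordinate yielded.
    """
    ranges = [range(shape[lvl]) for lvl in order]
    for coord in product(*ranges):
        # Undo the permutation so each tuple is back in (socket, numa, core) order.
        full = [0] * len(shape)
        for lvl, c in zip(order, coord):
            full[lvl] = c
        yield tuple(full)

# "Compact": fill all cores of one NUMA domain before moving to the next.
compact = list(enumerate_cores((2, 2, 2), order=(0, 1, 2)))
# "Scatter": round-robin consecutive ranks across sockets first.
scatter = list(enumerate_cores((2, 2, 2), order=(2, 1, 0)))
```

Feeding such orderings to an MPI launcher as rank-to-core bindings is one way to explore the hierarchy effects the paper measures.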
Workshop
Education
State of the Practice
W
DescriptionDeep Learning (DL) methods have recently come to dominate the field of Machine Learning. Most DL models assume that the input data distribution is identical between training and testing, though it often is not. For example, if we train a traffic sign classifier, the model might confidently, but incorrectly, classify a graffitied stop sign as a speed limit sign. ML models often produce high-confidence (softmax) output for out-of-distribution input that should instead have been classified as "I don't know". By adding the capability of propagating uncertainty to our results, the model can provide not just a single prediction but a distribution over predictions, allowing the user to judge the model's reliability and whether the decision needs to be deferred to a human expert. Uncertainty estimation is computationally expensive; in this assignment, we will learn to accelerate the calculations using common distributed-systems divide-and-conquer techniques.
Files given to students (slides and code): \url{https://drive.google.com/drive/folders/1KrxWlMZpoJzph0Y7VbZj_yYyACK-Jusl?usp=sharing}
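The divide-and-conquer idea behind the assignment can be sketched by splitting the ensemble of stochastic forward passes into chunks, computing per-chunk sufficient statistics, and merging them into a predictive mean and variance. The chunking, names, and synthetic predictions below are our own illustration of the technique, not the assignment's code:

```python
import random

def partial_stats(predictions):
    """Per-chunk sufficient statistics for mean/variance: (n, sum, sum of squares)."""
    n = len(predictions)
    s = sum(predictions)
    ss = sum(p * p for p in predictions)
    return n, s, ss

def combine(stats):
    """Merge chunk statistics into a global predictive mean and variance."""
    n = sum(st[0] for st in stats)
    s = sum(st[1] for st in stats)
    ss = sum(st[2] for st in stats)
    mean = s / n
    var = ss / n - mean * mean
    return mean, var

# Hypothetical ensemble of 100 stochastic forward passes for one input,
# split into 4 chunks as a stand-in for 4 parallel workers.
random.seed(0)
preds = [0.8 + random.gauss(0, 0.05) for _ in range(100)]
chunks = [preds[i::4] for i in range(4)]
mean, var = combine([partial_stats(c) for c in chunks])
# A high variance flags an input the model should defer to a human expert.
```

Because the per-chunk statistics are associative, each worker can process its chunk independently and the merge step is cheap, which is exactly what makes the computation easy to distribute.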
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
DescriptionBecause memory is a highly constrained resource, Umpire, a data and memory management API, was created at Lawrence Livermore National Laboratory (LLNL). Umpire provides memory pools which enable less expensive ways to allocate very large amounts of memory in HPC environments. Additionally, memory pools can be used when many small allocations are needed, to avoid expensive calls to the underlying device-specific API. In-situ visualization is inherently resource constrained, making Umpire’s memory management API a valuable tool for improving performance. Umpire is used in many simulation codes at LLNL that also rely on cutting-edge in-situ visualization libraries. This lightning talk discusses Umpire's advantages and use cases, including some examples of in-situ visualization applications which rely on Umpire to improve memory performance.
Workshop
Education
State of the Practice
W
DescriptionWe have developed a series of course-based undergraduate research experiences integrated into the curriculum, centered on the use of 3D visualization. One project involves the creation and use of a volumetric renderer for hyperstack images, paired with a project in confocal microscopy. Students have developed and tested tools for confocal microscopy visualization across headset-based and CAVE-based VR platforms. Two applications of the tool are presented: a rendering of Drosophila primordial germ cells coupled with automated detection and counting, and a database, in development, of 3D renderings of pollen grains. Another project involves the development and testing of point cloud renderers. Student work has focused on performance testing and enhancement across a range of 2D and 3D hardware, including native Quest apps. Through the process, students are introduced to scientific visualization concepts while gaining practical experience with programming, software engineering, graphics, shader programming, and cross-platform design.
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
DescriptionObstructive sleep apnea (OSA) impacts millions and is linked to severe complications, yet understanding of its influence on comorbidities lags. Complications can be avoided by using expensive continuous positive airway pressure (CPAP) machines, but physicians cannot reliably identify those at risk. Large language models (LLMs) have recently made impressive advancements in sequence modeling, and clinical applications are quickly emerging. However, the medical relevance of pre-trained LLM latent spaces remains uncertain.
This study gauges 12 pre-trained clinical LLMs, clustering OSA-related phenotypes and comorbidities (atrial fibrillation, coronary artery disease, heart failure, hypertension, stroke, type 2 diabetes). Using 40 A100 GPUs on NERSC’s Perlmutter, document-level embeddings for 331,793 MIMIC-IV discharge reports were computed for each LLM. K-Means models were ranked by clustering entropy of phenotype classes, guiding model selection. The top models successfully subset patients with similar histories and outcomes. This work will support ongoing OSA research by identifying phenotypes and assist physicians by informing CPAP allocation.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionValue-based resource management heuristics, which are traditionally deployed in heterogeneous HPC systems, maximize system productivity by assigning resources to each job based on its priority and estimated value gain relative to the job's completion time. We investigate the utility of value-based resource management at heterogeneous SoC scale and demonstrate its ability to make effective scheduling decisions for time-constrained jobs in oversubscribed systems where system resources are shared by multiple users and applications arrive dynamically. The proposed approach dynamically drops tasks that are estimated to yield lower value gain, with the aim of completing a greater number of high-value jobs while keeping scheduling decision times at the 120𝜇s scale. Because value-based resource management treats scheduling as a global optimization problem, this study sets a path forward for deploying unified value-based resource management on a system composed of front-end SoC-based edge devices and a back-end HPC system.
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
DescriptionThe increasing success and scaling of Deep Learning models demand higher computational efficiency and power. Sparsification can lead both to smaller models and to higher compute efficiency, and accelerated hardware is becoming available. However, exploiting it efficiently requires kernel implementations, pruning algorithms, and storage formats that utilize hardware support for specialized sparse vector units. One example is NVIDIA's Sparse Tensor Cores (SPTCs), which promise a 2x speedup. However, SPTCs only support the 2:4 format, limiting achievable sparsity ratios to 50%. We present the V:N:M format, which enables the execution of arbitrary N:M ratios on SPTCs. To efficiently exploit the resulting format, we propose Spatha, a high-performance sparse library for DL routines. We show that Spatha achieves up to 37x speedup over cuBLAS. We also demonstrate a second-order pruning technique that enables sparsification to high sparsity ratios with V:N:M and little to no loss in accuracy in modern transformers.
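For context, the baseline 2:4 pattern that SPTCs support amounts to magnitude pruning within groups of four consecutive weights. The sketch below is our own illustration of that N:M constraint, not the paper's V:N:M implementation or its second-order pruning:

```python
def prune_n_m(row, n=2, m=4):
    """Magnitude-prune a weight row to the N:M sparsity pattern: in every
    group of m consecutive weights, keep only the n largest in magnitude
    and zero out the rest (n=2, m=4 matches the hardware's 2:4 format)."""
    assert len(row) % m == 0
    out = []
    for i in range(0, len(row), m):
        group = row[i:i + m]
        # Indices of the n entries with the largest absolute value.
        keep = sorted(range(m), key=lambda j: abs(group[j]), reverse=True)[:n]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
pruned = prune_n_m(weights)   # exactly 2 nonzeros per group of 4
```

Generalizing n and m beyond 2:4 (as the V:N:M format does for the storage and kernel side) is what lets sparsity ratios above 50% map onto the same hardware units.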
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
DescriptionMPI collective communication operations are crucial for high-performance computing, making the efficient implementation of collective algorithms essential for optimal application performance. While most MPI libraries provide several algorithms for a specific collective operation, each may work better in a specific scenario. Therefore, selecting the most suitable algorithm for each use case is important. However, even the best algorithm in a given MPI library’s set may deliver suboptimal performance.
Self-consistent MPI performance guidelines are general expectations that collectives must meet to be deemed performance-consistent. Specifically, a specialized collective call should not be slower than its less specialized counterparts. We introduce a tool for assessing the performance consistency of MPI collectives in a statistically sound manner. Through a case study, we demonstrate the current state of MPI performance consistency for three TOP500 machines.
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
DescriptionWe investigate the vertical scaling of a mixed-precision variational multiscale method. In this method, the fine scales are represented in a reduced-precision floating-point format while the coarse scales are represented in a double-precision floating-point format. We accelerate the solve of the fine-scale problem by shifting it from the central processing unit to the graphics processing unit. We observe that this vertical scaling technique successfully accelerates the fine-scale solve by over 900x in some instances. However, we also note that the observed acceleration is parameter-dependent and varies widely based on the coarse-scale and fine-scale polynomial degrees chosen for the variational multiscale method. Despite the demonstrated success of the present work, this case study highlights existing challenges when merging vertical and horizontal scaling techniques and motivates opportunities for future research on the topic.
Posters
Scientific Visualization & Data Analytics Showcase
Visualizing Megafires: How AI Can Be Used to Drive Wildfire Simulations with Better Predictive Skill
Data Analysis, Visualization, and Storage
HPC in Society
Modeling and Simulation
Visualization
TP
XO/EX
DescriptionThe East Troublesome Wildfire was the fourth-largest wildfire to date in Colorado history, igniting on October 14, 2020. Driven by low humidity and high winds, the wildfire spread to over 200,000 acres in nine days, with 87,000 of those acres burnt in a single 24-hour period. Wildfire simulations and forecasts help decision-makers issue evacuation orders and inform response teams, but these simulations depend on accurate variable inputs to produce trustworthy results. These wildfire visualizations demonstrate new AI tools developed at the National Center for Atmospheric Research (NCAR), which produce better wildfire simulation outputs than have been available in the past.
Posters
Scientific Visualization & Data Analytics Showcase
Data Analysis, Visualization, and Storage
HPC in Society
Modeling and Simulation
Visualization
TP
XO/EX
DescriptionWe present an explanatory-track visualization which utilizes multiple open-source graphics tools, including the C++ library OpenVDB and the 3D animation software Blender, to create a cinematic representation of simulation data generated in support of the Asian Summer Monsoon Chemical and Climate Impact Project (ACCLIP) campaign. After a brief summary of the project and data simulation, the process and techniques used to create the visualization are explained in detail.
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
DescriptionSince CUDA was introduced around 15 years ago, it has developed constantly to accommodate many different functionalities. For beginners in particular, it is already difficult to remember the expected function parameters of commonly used CUDA features, let alone to optimize their code by exploiting these new functionalities. To reduce the burden on CUDA programmers, we propose VSCuda, a Visual Studio Code extension for CUDA C/C++ whose functionality includes, but is not limited to, CUDA syntax highlighting, code help for the CUDA Runtime API, code completion for common CUDA functions, and integrated code-improvement suggestions from state-of-the-art large language models.
Workshop
Accelerators
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Birds of a Feather
Programming Frameworks and System Software
TP
XO/EX
DescriptionWelcome to C++23, the “pandemic” edition. C++ was named Tiobe Programming Language of the Year for 2022 by the Tiobe Index of language popularity, and C/C++ is used by 79.4% of parallel programming applications, according to Hyperion Research's 2021 HPC briefing at ISC 2021. We will review the final C++23 content and its implementation status across compilers while we look ahead to see what is coming for C++26. This BoF will pull together important leaders within the ISO C++ Standard committee who are co-authors of key C++23 features such as ML, executors, mdspan, the standard library, and concurrency.
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
Workshop
Resource Management
State of the Practice
W
DescriptionThere are likely more ideas about the fair way to schedule workloads than there are systems that run those workloads, possibly more than the number of users of those systems. Unfortunately, since fairness differs across contexts, the optimal solution is unlikely to be one-size-fits-all, or even an adjustable-size solution where everyone turns a couple of knobs to get the optimal fair solution for their needs.
I will present GReaT allocations, used at the Institute for Computational and Data Sciences at Penn State University. We provide guaranteed start time expectations and priority scheduling, similar in some respects to condo-like scheduling systems. In addition to start time and resource availability, we provide access to temporary extended resources and protection against rogue or runaway jobs that could drain allocations. Our configuration required extensions beyond the standard tools provided with Slurm scheduling systems.
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
DescriptionConsider an application executing for a fixed duration. The checkpoint duration is a stochastic random variable that obeys some well-known probability distribution law. The question is when to take a checkpoint towards the end of the execution, so that the expectation of the work done is maximized. In the first scenario, a checkpoint can be taken at any time.
We provide the optimal solution for a variety of probability distribution laws modeling checkpoint duration. In the second scenario, the application is a chain of tasks with IID stochastic execution times, and a checkpoint can be taken only at the end of a task. First, we introduce a static strategy where we compute the optimal number of tasks before the checkpoint at the beginning of the execution. Then, we design a dynamic strategy that decides whether to checkpoint or to continue execution at the end of each task.
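The static strategy can be illustrated with a small Monte Carlo sweep over the number of tasks to run before checkpointing. The exponential task-time assumption, the fixed checkpoint cost, and all numbers below are our own illustrative choices, not the paper's model or its optimal solution:

```python
import random

def expected_work(k, horizon, task_mean, ckpt_cost, trials=20000):
    """Monte Carlo estimate of the expected work saved by checkpointing
    after k tasks, with i.i.d. exponential task times, a fixed checkpoint
    cost, and a hard time horizon. Work counts only if the checkpoint
    completes before the horizon; otherwise everything is lost."""
    random.seed(42)  # common random numbers across values of k
    total = 0.0
    for _ in range(trials):
        t = 0.0
        for _ in range(k):
            t += random.expovariate(1.0 / task_mean)
        if t + ckpt_cost <= horizon:
            total += t  # the k tasks' work is safely saved
    return total / trials

# Sweep k to find the best static checkpoint position (illustrative numbers).
best = max(range(1, 12),
           key=lambda k: expected_work(k, horizon=10.0,
                                       task_mean=1.0, ckpt_cost=0.5))
```

The sweep captures the trade-off the paper studies: a later checkpoint saves more work when it fits before the horizon, but risks saving nothing; the dynamic strategy improves on this by re-deciding at the end of each task.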
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
DescriptionEnsuring high productivity in scientific software development necessitates developing and maintaining a single codebase that can run efficiently on a range of accelerator-based supercomputing platforms. This requires the use of performance portability layers such as OpenMP, RAJA, Kokkos and SYCL for developing the compute kernels. In this talk, I will present the results of a comprehensive study of a range of proxy applications implemented in the major programming models suitable for GPU-based platforms. We collect and analyze performance results across NVIDIA and AMD GPU hardware currently deployed in leadership-class computing facilities using a representative set of scientific codes and several programming models -- CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL. Based on the specific characteristics of applications tested, we discuss recommendations to developers on how to choose the right programming model for their code. These results provide a comprehensive evaluation of the extent to which each programming model for heterogeneous systems provides true performance portability in real-world usage.
Workshop
State of the Practice
W
DescriptionOne of the exciting aspects of a research career is that it can change rapidly, including the problems you work on and the collaborators you work with, in addition to the usual options of changing jobs and institutions. Using my own experience at universities, national labs, and a brief stint in industry, I will talk about different experiences with each and how to decide when it is time for a change.
Workshop
State of the Practice
W
DescriptionLet’s face it: Being in a leadership role where you are sandwiched between those who report to you and those to whom you report can be tough. In fact, research shows middle and upper middle management experiences the most workplace dissatisfaction in organizations. It’s often frustrating and occasionally demotivating. But! It’s also on-and-off satisfying and once-in-a-while outright inspiring. In this talk, we’ll discuss ways to harness your position, manage and advocate for your team, meet the expectations of your organization, and be an agent of achievement.
Workshop
State of the Practice
W
Description: The WHPC workshop audience will have a structured breakout session for attendees to mingle and network. This is also the opportunity to meet with the early career lightning talk speakers.
Workshop
State of the Practice
W
Description: With regard to diversity, many of us start our careers in computing with enthusiasm and potential, but as we progress through different career stages, our representation dwindles. In simpler terms, we're well-represented in the early stages of our careers but less so as we advance. In this discussion, I'd like to share insights I've gained from facilitating the ECP High-Performance Computing Workforce Development and Retention Action (HPC-WDR) Group’s Workforce Development Webinar series. The mission of the ECP-HPC-WDR Action Group was to enable DOE national laboratories and their related computing communities to share their collective insight for inclusive and equitable workforce development and retention for high-performance computing.
The goal of this session is to have an open conversation with all of you about how we can foster a sense of belonging in our workplaces or schools and develop a strong identity within the scientific or technical domain. We're focusing on these aspects because research shows they significantly influence how long students stay in their majors and how long professionals continue in their careers.
For this discussion, we're all panelists. I invite you to come prepared to share your experiences on what motivates you to persist in your career and your ideas on how we can create a supportive community that not only encourages our persistence but also transforms our institutions. Together, we'll explore solutions for better workplace retention among minority groups in computing.
Workshop
State of the Practice
W
Workshop
State of the Practice
W
Description: The 16th international Women in HPC workshop will be held at SC23 in Denver, CO, USA, with the goal of fostering a diverse and inclusive HPC community. The WHPC workshop series has become the leading SC event focused on DEI topics. We aim to cultivate skills for valuing a diverse workforce and creating a welcoming environment for all. New this year, we will place increased emphasis on the diversity and inclusion of both women and men from underrepresented groups.
At WHPC@SC23, we will focus on the following topics:
- Improving diversity and inclusion for all in the HPC workforce
- Building a deeper understanding of what diversity, equity, and inclusion means for different groups
- Strategies for recruitment, retention, and success
- Building community through real-time networking
- Learning from, and valuing, different experiences and career paths
We will also include short lightning talks by early career researchers from underrepresented groups.
Exhibits
Flash Session
TP
XO/EX
Description: This session will discuss the trends and growth of housing workloads in Norway, Bulk Data Centers' approach to this demand, and the secret that gives AI, ML, and supercomputing companies an advantage when operating in Norway.
Posters
Research Posters
TP
XO/EX
Description: Modern scientific applications produce vast amounts of data, typically stored in monolithic files on parallel file systems (PFS). Analyzing these large files often results in inefficiency due to I/O stalls. To mitigate these stalls, certain data can be pre-computed during the production phase and queried during analysis. However, this solution demands added storage capacity and an astute use of storage hierarchies. In this context, we introduce Hades, an I/O engine seamlessly integrated with the Adios2 framework. Hades offers hierarchical buffering, which enables smart data placement and prefetching across the spectrum of I/O devices. Additionally, it is adept at computing basic derived quantities required by I/O applications, such as the global and local min/max values. A notable feature of Hades is its memory-first metadata management strategy, which is designed for querying derived data, significantly enhancing system performance.
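The derived-quantity idea above can be sketched in a few lines of Python: record per-block (local) min/max while data is produced, then reduce them to global values at query time without re-reading the bulk data. This is a minimal illustration of the concept, not the Hades or Adios2 API; the function names are hypothetical.

```python
def produce_block_stats(blocks):
    """During the production phase, record per-block (local) min/max so the
    analysis phase can answer range queries without re-reading bulk data."""
    return [{"min": min(b), "max": max(b)} for b in blocks]

def global_min_max(stats):
    """Reduce the precomputed local stats to the global derived quantities."""
    return (min(s["min"] for s in stats), max(s["max"] for s in stats))

blocks = [[3.0, 7.5, -2.0], [10.0, 0.5]]
stats = produce_block_stats(blocks)
print(global_min_max(stats))  # (-2.0, 10.0)
```

The trade-off named in the abstract is visible even here: the stats take extra storage, but answering the global query touches only the small metadata, not the blocks.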
Workshop
Applications
Distributed Computing
Large Scale Systems
Programming Frameworks and System Software
Runtime Systems
W
Description: We present a prototype solution for a polyglot computational fluid dynamics code using the Python Multiprocessing API and Dragon. The code uses an actor-based dataflow architecture with a directed graph to explicitly express program execution, including parallelization and asynchronous communication. Computation-heavy parts are covered by individual Fortran executables; the shared state description is written in C with Fortran and Cython wrappers. Our code demonstrates dataflow programming in Python for a classical tightly coupled HPC problem, combining cloud-native programming paradigms with HPC communication techniques such as RDMA through the Dragon runtime. We demonstrate what a scalable software architecture for classical HPC, AI/ML, and HPC workflow applications could look like in the future.
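The actor-based dataflow pattern described above can be sketched with the standard multiprocessing API alone: worker processes ("actors") are wired into a directed graph through explicit queues, and sentinels signal completion. This toy sketch (the `square` and `total` actors are hypothetical stand-ins) only illustrates the wiring; the paper's code drives Fortran kernels under the Dragon runtime.

```python
from multiprocessing import Process, Queue

def square(inbox, outbox):
    """Actor: square each value received until the None sentinel arrives."""
    for x in iter(inbox.get, None):
        outbox.put(x * x)
    outbox.put(None)  # propagate the sentinel downstream

def total(inbox, result, n_producers):
    """Actor: sum values until a sentinel has arrived from each producer."""
    acc, seen = 0, 0
    while seen < n_producers:
        v = inbox.get()
        if v is None:
            seen += 1
        else:
            acc += v
    result.put(acc)

def run_pipeline(xs1, xs2):
    """Wire the directed graph: two square actors feed one total actor."""
    q1, q2, mid, out = Queue(), Queue(), Queue(), Queue()
    actors = [Process(target=square, args=(q1, mid)),
              Process(target=square, args=(q2, mid)),
              Process(target=total, args=(mid, out, 2))]
    for a in actors:
        a.start()
    for q, xs in ((q1, xs1), (q2, xs2)):
        for x in xs:
            q.put(x)
        q.put(None)  # end-of-stream for this producer
    res = out.get()
    for a in actors:
        a.join()
    return res

if __name__ == "__main__":
    print(run_pipeline([1, 2], [3, 4]))  # 1 + 4 + 9 + 16 = 30
```

Making the graph and the communication explicit, as here, is what lets a runtime like Dragon transparently substitute faster transports (e.g., RDMA) for the queues.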
Birds of a Feather
HPC in Society
TP
XO/EX
Description: HPC changes the world and society around us on a daily basis. Ensuring that HPC resources are both used ethically and are ethically available is of utmost importance for a more equitable world. With our first BoF in 2019, held annually since (save 2020's COVID disruption) and expanded to ISC 2023 as well, we have been fostering lively discussion with the community about what our ethical standards should be. This BoF will continue that tradition while incorporating, for the first time, efforts to establish specific ethical principles driving toward a formal community declaration.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Birds of a Feather
Cloud Computing
Distributed Computing
TP
XO/EX
Description: Discoveries in science increasingly rely on workflows to coordinate complex experiments, ranging from cloud-based data preprocessing to multi-facility computational workflows. Continuum and cross-facility workflows have gained prominence, providing continuous computing access and spanning multiple sites. This BoF session, organized by the Workflows Community Initiative, will address challenges, opportunities, and future directions for continuum and cross-facility workflows. Participants will share domain-specific insights, covering topics such as facility coordination, metadata tracking, and standardization. The BoF will produce tangible outputs, including lightning talks and a community roadmap, fostering networking and international collaborations.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Workshop
Applications
Data Movement and Memory
Large Scale Systems
W
Description: The growing disparity between compute and memory speed, known as the memory wall problem, has been one of the most critical and long-standing challenges in the computing industry. The prevalence of heterogeneous computing, the ongoing expansion of the memory hierarchy, and the advent of disaggregated architectures have considerably expanded the scope of this problem. Computer architecture, operating systems, storage systems, performance models, tools, and applications themselves are being enhanced or even redesigned to address the performance, programmability, and energy efficiency challenges of the increasingly complex and heterogeneous memory systems. Exploring the intersection of these research areas will enable cohesive and synergistic development and collaboration on the future of memory technologies, systems, and applications. MTSA’23: Workshop on Memory Technologies, Systems, and Applications aims to bring together researchers from industry, government labs, and academia concerned with the challenges of efficiently using existing and emerging memory systems.
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Paper
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
TP
Description: Directory tree walks on parallel file systems are costly operations frequently required by many storage management tasks. Even listing the contents of a single directory can take minutes to hours for huge directories, as the tree walk performance of parallel file systems in Linux is severely throttled by sequentially accessing distributed metadata for each file through the syscall interface.
We present extreme file attribute stat (Xfast), which scales the performance of directory tree walks by combining techniques developed over ten years for the Lustre file system. Scalable statahead predicts file access patterns and prefetches the required attributes, while the Size on MDT (SOM) mechanism reduces the number of RPC calls needed to collect file attributes. Xfast improves the performance of common directory operations, e.g., reducing the time to list one million files from 11 minutes to less than one minute for a single process.
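The core bottleneck the abstract names, one synchronous metadata round trip per file, can be illustrated in plain Python: list a directory once, then issue the per-entry `stat` calls concurrently so their latencies overlap. This is only a conceptual sketch of attribute prefetching, not Lustre's statahead implementation, and the function name is hypothetical.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def list_with_stat_prefetch(root, workers=8):
    """List a directory and fetch file attributes with a thread pool,
    overlapping the per-file metadata round trips that dominate a
    sequential walk on a parallel file system."""
    with os.scandir(root) as it:
        paths = [entry.path for entry in it]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stats = list(pool.map(os.stat, paths))
    return dict(zip(paths, stats))
```

On a local file system the gain is modest, since `stat` is cheap; the abstract's 11-minutes-to-under-a-minute improvement comes precisely because each `stat` on a PFS is a remote RPC whose latency can be hidden by prefetching.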
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W
Description: Advancement in computational power and high-speed networking is enabling a new model of scientific experiment, experiment-in-the-loop computing (EILC). In this model, simulation and/or learning modules are run as data is collected from observational and experimental sources. Presently, the amount and complexity of data generated by simulations and by observational and experimental sources, such as sensor networks and large-scale scientific facilities, continue to increase. Several research challenges exist, many of which are independent of the scientific application domain. New algorithms, including artificial intelligence and machine learning algorithms, must be developed to merge simulation ensembles and experimental data sets. Data transfer techniques and workflows must be constructed to control the ensembles and integrate simulated and observed data sets. The Workshop on Extreme-Scale Experiment-in-the-Loop Computing (XLOOP 2023) will be a unique opportunity to promote this interdisciplinary topic area. We invite papers, presentations, and participants from the physical and computer sciences.
Workshop
Programming Frameworks and System Software
W
Description: Heterogeneous high-performance computing (HPC) systems are highly specialized, complex, powerful, and expensive systems. Efficient utilization of these systems requires monitoring tools to confirm that users have configured their jobs, workflows, and applications correctly to consume the limited allocations they have been awarded. Historically, system monitoring tools have been designed for, and available only to, system administrators and facilities personnel to ensure that the system is healthy, utilized, and operating within acceptable parameters. However, there is demand for user-space monitoring capabilities to address the configuration validation and optimization problem. We describe a prototype tool, ZeroSum, designed to provide user-space monitoring of application processes, lightweight processes (threads), and hardware resources on heterogeneous, distributed HPC systems. ZeroSum is designed to be used either as a limited-use porting tool or as an always-on monitoring library.
Workshop
Architecture and Networks
W
Workshop
Middleware and System Software
Programming Frameworks and System Software
Runtime Systems
W
Workshop
Fault Handling and Tolerance
Large Scale Systems
W
Workshop
Algorithms
Heterogeneous Computing
Large Scale Systems
W
Workshop
Performance Optimization
W
Workshop
Performance Measurement, Modeling, and Tools
Performance Optimization
W
Workshop
Distributed Computing
Security
W
Workshop
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
W
Workshop
Distributed Computing
Middleware and System Software
Runtime Systems
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Artificial Intelligence/Machine Learning
Algorithms
Applications
Architecture and Networks
Cloud Computing
Distributed Computing
Data Analysis, Visualization, and Storage
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
XO/EX
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
TP
XO/EX
Workshop
Artificial Intelligence/Machine Learning
Software Engineering
W
Paper
Artificial Intelligence/Machine Learning
Applications
Modeling and Simulation
State of the Practice
TP
Paper
Artificial Intelligence/Machine Learning
Codesign
Performance Optimization
Programming Frameworks and System Software
TP
Awards
Awards Luncheon
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Posters
Best ACM SRC Poster Presentations
TP
Inclusivity
Childcare
Inclusivity
Inclusivity
Childcare
Inclusivity
Inclusivity
Childcare
Inclusivity
Inclusivity
Childcare
Inclusivity
Inclusivity
Childcare
Inclusivity
Paper
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
TP
Paper
Algorithms
Cloud Computing
Distributed Computing
Heterogeneous Computing
Large Scale Systems
State of the Practice
TP
Paper
Cloud Computing
Distributed Computing
Data Movement and Memory
Performance Measurement, Modeling, and Tools
TP
Inclusivity
Developing community and pipeline through SCC
Inclusivity
Workshop
Education
State of the Practice
W
Workshop
Large Scale Systems
Middleware and System Software
Programming Frameworks and System Software
W
Workshop
Exascale
Message Passing
Programming Frameworks and System Software
W
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
TP
XO/EX
Exhibitor Forum
Architecture and Networks
Data Movement and Memory
Hardware Technologies
TP
XO/EX
Exhibitor Forum
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Exhibitor Forum
Exascale
Programming Frameworks and System Software
Quantum Computing
TP
XO/EX
Exhibitor Forum
Artificial Intelligence/Machine Learning
Fault Handling and Tolerance
Large Scale Systems
Programming Frameworks and System Software
TP
XO/EX
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Exhibitor Forum
Accelerators
Artificial Intelligence/Machine Learning
Architecture and Networks
Hardware Technologies
TP
XO/EX
Invited Talk
Artificial Intelligence/Machine Learning
HPC Infrastructure
TP
Paper
Accelerators
Artificial Intelligence/Machine Learning
Codesign
Fault Handling and Tolerance
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
Workshop
Programming Frameworks and System Software
State of the Practice
W
Workshop
Quantum Computing
Software Engineering
W
Workshop
Data Movement and Memory
Heterogeneous Computing
W
Paper
Heterogeneous Computing
Programming Frameworks and System Software
Task Parallelism
TP
Paper
Accelerators
Distributed Computing
Middleware and System Software
Performance Measurement, Modeling, and Tools
Post-Moore Computing
TP
Paper
Architecture and Networks
Data Movement and Memory
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
Paper
Cloud Computing
Data Analysis, Visualization, and Storage
Graph Algorithms and Frameworks
TP
Paper
Accelerators
Architecture and Networks
Data Analysis, Visualization, and Storage
Fault Handling and Tolerance
TP
Paper
Accelerators
Applications
Graph Algorithms and Frameworks
Performance Measurement, Modeling, and Tools
Programming Frameworks and System Software
TP
Paper
Data Analysis, Visualization, and Storage
I/O and File Systems
State of the Practice
TP
Workshop
Large Scale Systems
Programming Frameworks and System Software
W
Workshop
Programming Frameworks and System Software
W
Inclusivity
Inclusive Practices and Software Project Productivity
Inclusivity
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Performance Measurement, Modeling, and Tools
W
Workshop
Compilers
Heterogeneous Computing
Performance Optimization
W
Paper
Distributed Computing
Message Passing
Programming Frameworks and System Software
TP
Paper
Accelerators
Applications
Architecture and Networks
Modeling and Simulation
TP
Inclusivity
Parents' Room
Inclusivity
Inclusivity
Parents' Room
Inclusivity
Inclusivity
Parents' Room
Inclusivity
Inclusivity
Parents' Room
Inclusivity
Inclusivity
Parents' Room
Inclusivity
Inclusivity
Parents' Room
Inclusivity
Workshop
Modeling and Simulation
Performance Measurement, Modeling, and Tools
W
ACM Student Research Competition: Graduate Poster
ACM Student Research Competition: Undergraduate Poster
Doctoral Showcase
Posters
Research Posters
Scientific Visualization & Data Analytics Showcase
TP
Inclusivity
Prayer Room
Inclusivity
Inclusivity
Prayer Room
Inclusivity
Inclusivity
Prayer Room
Inclusivity
Inclusivity
Prayer Room
Inclusivity
Inclusivity
Prayer Room
Inclusivity
Inclusivity
Prayer Room
Inclusivity
Press Briefing
Press Briefing
Paper
Architecture and Networks
Performance Measurement, Modeling, and Tools
Resource Management
TP
Workshop
Accelerators
Edge Computing
Heterogeneous Computing
W
Awards
SC23 Awards Ceremony
SC24
SC24 Conference Preview
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Posters
Scientific Visualization & Data Analytics Showcase
TP
XO/EX
Workshop
Architecture and Networks
Hardware Technologies
W
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition
Student Cluster Competition Kick-Off
Student Cluster Competition
Student Cluster Competition Posters Display
Student Cluster Competition
Student Cluster Competition Posters Display
Student Cluster Competition
Student Cluster Competition Posters Display
Student Cluster Competition
Student Cluster Competition Posters Display
Student Cluster Competition
Student Cluster Competition Wrapup
Students@SC
Students@SC GIGs Kickoff
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Students@SC
Students@SC Student Headquarters
TP
W
TUT
XO/EX
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Green Computing
Programming Frameworks and System Software
State of the Practice
Sustainability
TP
Paper
Artificial Intelligence/Machine Learning
Compilers
Performance Measurement, Modeling, and Tools
Performance Optimization
Programming Frameworks and System Software
Tensors
TP
Workshop
Education
State of the Practice
W
Workshop
Accelerators
Compilers
Heterogeneous Computing
Programming Frameworks and System Software
Runtime Systems
W
Awards
Test of Time
Applications
Architecture and Networks
Codesign
TP
W
Workshop
Applications
Cloud Computing
Distributed Computing
Edge Computing
Large Scale Systems
W
Workshop
Data Analysis, Visualization, and Storage
Large Scale Systems
Programming Frameworks and System Software
Reproducibility
Resource Management
Runtime Systems
W
Workshop
Artificial Intelligence/Machine Learning
Energy Efficiency
Green Computing
Performance Measurement, Modeling, and Tools
Sustainability
W
Workshop
Applications
Distributed Computing
Compilers
Heterogeneous Computing
Message Passing
Programming Frameworks and System Software
Task Parallelism
W
Paper
Cloud Computing
Distributed Computing
Energy Efficiency
Performance Measurement, Modeling, and Tools
TP
Tutorial
Tutorial Lunch
TUT
Tutorial
Tutorial Lunch
TUT
Workshop
Applications
Data Movement and Memory
Large Scale Systems
W
Workshop
Accelerators
Codesign
Heterogeneous Computing
Task Parallelism
W
Workshop
Large Scale Systems
Performance Measurement, Modeling, and Tools
Software Engineering
W