DescriptionLarge scale neural network training is challenging due to the high ratio of communication to computation. Recent work has shown that these large networks contain sparse subnetworks consisting of 10-20% of the parameters, which when trained in isolation reach comparable accuracy to the larger network. In this work, we propose a novel approach that exploits the existence of these sparse subnetworks to dramatically improve the efficiency of large scale neural network training. By storing in sparse and computing in dense, we are able to reduce the number of parameters drastically while matching the compute efficiency of the original network. We exploit this reduced parameter set to optimize the communication time of AxoNN, a state-of-the-art framework for parallel deep learning. Our approach yields a significant speedup of 17% when training a 2.7 billion parameter transformer model on 384 GPUs.
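The idea of "storing in sparse and computing in dense" can be illustrated with a minimal sketch. It is not AxoNN's implementation: the helper names (`to_sparse`, `to_dense`, `dense_dot`) and the fixed pruning mask are assumptions made for illustration only. Only the surviving parameters are kept (and would be communicated), and they are scattered back into a dense buffer before the computation runs.

```python
def to_sparse(dense, mask):
    """Keep only the unpruned entries as (flat index, value) pairs."""
    return [(i, w) for i, (w, keep) in enumerate(zip(dense, mask)) if keep]


def to_dense(sparse, size):
    """Scatter the stored entries back into a zero-filled dense buffer."""
    dense = [0.0] * size
    for i, w in sparse:
        dense[i] = w
    return dense


def dense_dot(weights, x):
    """Ordinary dense dot product; it never sees the sparse format."""
    return sum(w * xi for w, xi in zip(weights, x))


# Hypothetical 5-parameter layer in which a mask keeps 2 parameters.
weights = [0.5, -1.0, 0.0, 2.0, 0.25]
mask = [True, False, False, True, False]

stored = to_sparse(weights, mask)            # what is kept or communicated
expanded = to_dense(stored, len(weights))    # what the dense kernel consumes
y = dense_dot(expanded, [1.0, 1.0, 1.0, 1.0, 1.0])
```

In this toy setting the communicated payload shrinks from five values to two index-value pairs, while the arithmetic still runs on a dense buffer.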
DescriptionJobs for a High Performance Computing cluster are allocated system resources by a scheduling application such as SLURM. These scheduling applications are highly configurable by HPC administrators through the use of parameters which modify and customize their scheduling behavior. Although there are default values for these scheduling parameters provided by their creators and maintainers, it is unclear which values for scheduler parameter settings would be optimal for a particular HPC system running the types of jobs its users typically submit. Using over 37,000 jobs from historic job log data from Kansas State University’s High Performance Computing cluster, this research uses a SLURM simulator to execute over 90,000 scheduler simulations requiring over 840,000 compute hours, along with gradient boosted tree regression, to predict an optimal set of scheduler configuration parameters, which results in a 79% decrease in the average job queue time when compared with the default scheduler parameters.
DescriptionQuantum circuit simulation can be carried out as a contraction over many quantum tensors. QTensor, a library built for quantum circuit simulation using a bucket elimination algorithm, contracts tensors to return a final energy value. As bucket elimination advances, tensors can grow large, and memory becomes a bottleneck. To address memory limitations of circuit simulation while enabling more complex circuits to be simulated, we focus on implementing a lossy compressor that can compress the floating-point data stored in quantum circuit tensors while simultaneously preserving a final energy value within an error bound after decompression. We study the effects of various lossy compression/decompression strategies on data compressibility, throughput, and result error to ensure compression/decompression can be effective, fast, and does not heavily distort data. The work for this project is in progress and preliminary results for proposed preprocessing/postprocessing strategies and compressor optimizations that have been developed will be showcased.
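As a generic illustration of what "error-bounded" means here, the sketch below uses plain uniform scalar quantization. This is a textbook technique chosen for brevity, not the compressor developed in this project, and the function names are hypothetical. Each value is mapped to an integer code with a bin width of twice the error bound, so the reconstructed value differs from the original by at most the bound (up to floating-point rounding).

```python
def compress(values, error_bound):
    """Quantize each value to an integer code using bins of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def decompress(codes, error_bound):
    """Map each integer code back to the centre of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


# Hypothetical tensor entries and an absolute error bound of 0.01.
values = [0.1234, -0.9876, 0.5, 0.0]
error_bound = 0.01

codes = compress(values, error_bound)
restored = decompress(codes, error_bound)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

The integer codes are typically far more compressible by a subsequent lossless stage than the raw floating-point values, which is where the size reduction comes from.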
DescriptionAchieving full automation of program optimization is still an open problem for compiler writers. This work explores machine learning as a potential solution to learn data locality optimizations for tensor applications. Training models with supervised-learning for loop-nest optimization often requires prohibitively expensive training data generation for learning the combined effects of a transformation sequence. As a solution, this work proposes a novel learning strategy called Composed Singular Prediction (CSP) that significantly reduces the training data generation cost in the context of learned loop transformation models. The learned models are then deployed to predict data locality optimization schedules for Conv2d kernels to achieve performance improvements up to 4x against Intel oneDNN while saving over 100x in training data collection time over exhaustive search.
DescriptionWe present a novel parallel framework for large scale network alignment. Network alignment has applications in many disciplines including bioinformatics and social sciences. Our algorithm is one of the first network alignment tools that can not only identify similar networks, but also identify the differences between nearly similar networks. It is particularly useful in finding regions of non-determinism in event graphs, arising in large HPC simulations.
Our algorithm compares similarity between vertices based on the number of graphlets (or motifs) to which each vertex belongs. Thus, it can also be used to find motifs in a graph. However, compared to the state-of-the-art algorithms, our algorithm can (i) compute multiple motifs in one execution and (ii) be tuned to graph structure and user specification. We will present the algorithm, showcase the scalability results, and compare its performance and accuracy with other state-of-the-art software.
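A per-vertex graphlet count can be sketched for the simplest graphlet, the triangle. This is a serial, illustrative sketch only: the adjacency-set representation and the function name `triangle_counts` are assumptions, and the framework described above is parallel and handles multiple motifs.

```python
from itertools import combinations


def triangle_counts(adj):
    """For every vertex, count the triangles it belongs to.

    adj maps each vertex to the set of its neighbours (undirected graph).
    A vertex v is in a triangle for every pair of its neighbours that
    are themselves adjacent.
    """
    counts = {v: 0 for v in adj}
    for v, neighbours in adj.items():
        for a, b in combinations(sorted(neighbours), 2):
            if b in adj[a]:
                counts[v] += 1
    return counts


# Hypothetical graph: a triangle 0-1-2 with a pendant vertex 3 attached to 2.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
counts = triangle_counts(graph)
```

Vertices with the same count vector across several graphlet types are candidates for alignment, while vertices whose counts diverge point to structural differences between the two networks.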
DescriptionParallel file systems like Lustre contain complicated I/O paths from clients to storage servers. An efficient I/O path requires proper settings of multiple parameters, as the default settings often fail to deliver optimal performance, especially for diverse workloads in the HPC environment. Existing tuning strategies are limited in adaptivity, timeliness, and flexibility. We propose IOPathTune, which adaptively tunes the PFS I/O path online from the client side without characterizing the workloads, doing expensive profiling, or communicating with other machines. We leveraged CloudLab to conduct the evaluations with 20 different Filebench workloads under three different test conditions: single-client standalone tests, dynamic workload change, and multi-client executions. We observed performance on par with or better than the default configuration across all workloads. The most considerable improvements include 231%, 113%, 96%, and 43%.
DescriptionThe traceback phase of the Smith-Waterman (SW) algorithm requires significant memory and introduces an irregular memory access pattern which makes it challenging to implement for GPU architectures. In this work, we introduce a novel strategy for implementing the traceback kernel for the SW algorithm on GPUs by restructuring the global memory access patterns and introducing a memory-efficient data structure for storing large dynamic programming matrices in GPU’s limited memory. To demonstrate this kernel’s performance we integrated this into the existing ADEPT library and Metahipmer2, a de novo metagenomic short read assembler. Our implementation is 3.6x faster than traceback in GASAL2, and 51x faster than traceback in Striped Smith-Waterman, the current state of the art SW libraries on GPU and CPU respectively. It sped up the final alignment step in Metahipmer2 by an average of 44% and improved the overall execution time of Metahipmer2 by an average of 13%.
DescriptionScientific software is required to be fast, painless to change, and easy to deploy. Historically, compiled languages such as C/C++ and Fortran have been preferred when writing software with the highest performance requirements. However, these languages are complex, and the resulting software is challenging to maintain and deploy across platforms. We present our recent software projects written in Rust, a fast-growing, ergonomic, systems-level programming language with a toolchain designed for high performance and simple cross-platform builds. We illustrate the current state of the scientific computing ecosystem in Rust through our experience developing high-performance MPI-distributed software for computational physics problems.
DescriptionAggregated HPC resources have rigid allocation systems and programming models and struggle to adapt to diverse and changing workloads. Thus, HPC systems fail to efficiently use the large pools of unused memory and increase the utilization of idle computing resources. Prior work attempted to increase the throughput and efficiency of supercomputing systems through workload co-location and resource disaggregation. However, these methods fall short of providing a solution that can be applied to existing systems without major hardware modifications and performance losses.
In this project, we use the new cloud paradigm of serverless computing to improve the utilization of supercomputers. We show that the FaaS programming model satisfies the requirements of high-performance applications and how idle memory helps resolve cold startup issues. We demonstrate a software resource disaggregation approach where the co-location of functions allows idle cores and accelerators to be utilized while retaining near-native performance.
DescriptionDeep learning surrogate models have drawn much attention in large-scale scientific simulations because they can provide similar results to simulations at lower computational costs. To process large amounts of scientific data, distributed training on high-performance computing (HPC) clusters is often used. Training a surrogate model with data parallelism consists of three major steps: (1) loading a subset of the dataset from the parallel filesystem on each device; (2) computing the model update on each device; and (3) communicating between devices to synchronize the model update. Across these steps, we observe that data loading is the main performance bottleneck for training surrogate models. To this end, we propose SurrogateTrain, an efficient data-loading approach for training surrogate models, including offline scheduling and on-demand buffering. Our evaluation on a scientific surrogate model demonstrates that SurrogateTrain reduces the amount of data loaded by 6.7× and achieves up to 4.7× speedup in data loading.
DescriptionParaGraph is an open-source toolkit for use in co-designing hardware and software for supercomputer-scale systems. It bridges an infrastructure gap between an application target and existing high-fidelity computer-network simulators. The first component of ParaGraph is a high-level graph representation of a parallel program, which faithfully represents parallelism and communication, can be extracted automatically from a compiler, and is “tuned” for use with network simulators. The second is a runtime that can emulate the representation’s dynamic execution for a simulator. User-extensible mechanisms are available for modeling on-node performance and transforming high-level communication into operations that backend simulators understand. Case studies include deep learning workloads that are extracted automatically from programs written in JAX and TensorFlow and interfaced with several event-driven network simulators. These studies show how system designers can use ParaGraph to build flexible end-to-end software-hardware co-design workflows to tweak communication libraries, find future hardware bottlenecks, and validate simulations with traces.
DescriptionError-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous high-performance computing (HPC) architectures, GPU-accelerated error-bounded compressors such as cuSZ have been developed. In order to improve the data quality and the compression ratio while maintaining high throughput, an interpolation-based spline method is introduced, inspired by the existing CPU prototype. In this work, we present (1) an efficient GPU implementation of the 3D interpolative spline prediction method, (2) a finer-grained data chunking approach using anchor points to leverage the modern GPU architecture, (3) an in-depth analysis of how such anchor points affect the error formation and the compression ratio, and (4) preliminary performance results on state-of-the-art GPUs. Our solution can achieve (1) a higher compression ratio than the previous default prediction method in cuSZ, and (2) overall data quality and compression ratio comparable to the CPU prototype.
DescriptionIn the fields of science and engineering, lossy compression plays a growing role in running scientific simulations, as output data is on the scale of terabytes. Using error bounded lossy compression reduces the amount of storage for each simulation; however, there is no known bound for the upper limit of lossy compressibility. Data correlation structures, compressors and error bounds are factors allowing larger compression ratios and improved quality metrics. This provides one direction towards quantifying lossy compressibility. Our previous work explored 2D statistical methods to characterize the data correlation structures and their relationships, through functional models, to compression ratios and quality metrics for 2D scientific data. In this poster, we explore the expansion of our statistical methods to 3D scientific data. The method was comparable to 2D. Our work is the next step towards evaluating the theoretical limits of lossy compressibility used to predict compression performance and optimally adapt compressors.
DescriptionThe computational advance in high-performance computing leads to increased data generation by applications, resulting in a bottleneck within the system due to I/O limitations. One solution is the Spatio-temporal sampling method, which takes advantage of both spatial and temporal data reduction methods to produce higher post-reconstruction quality. Various user input parameters such as the number of bins or histogram intersection limit the performance for Spatio-temporal sampling. This poster focuses on determining the effect of the histogram intersection threshold in the Spatio-temporal sampling method. Results indicate that as long as a data set is not identical across adjacent time-steps, reducing the histogram intersection percentage increases the sampling bandwidth until blocks reused become static. The ExaAM data set shows an increase of 100-130% in sampling bandwidth, with only about a 5% decrease in PSNR value at 60% histogram intersection or lower.
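The histogram intersection test can be sketched as follows. This is a minimal illustration, assuming that a block from the previous time-step is reused when the normalized intersection of the two blocks' histograms meets the threshold; the function names, the bin count, and the value range are hypothetical and not taken from the poster.

```python
def histogram(values, num_bins, lo, hi):
    """Count values into num_bins equal-width bins over [lo, hi]."""
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)  # clamp v == hi
        counts[idx] += 1
    return counts


def histogram_intersection(h1, h2):
    """Fraction of samples shared by two histograms of equal total size."""
    total = sum(h1)
    if total == 0:
        return 1.0
    return sum(min(a, b) for a, b in zip(h1, h2)) / total


def can_reuse(prev_block, curr_block, threshold, num_bins=4, lo=0.0, hi=1.0):
    """Assumed rule: reuse the previous block when intersection >= threshold."""
    h_prev = histogram(prev_block, num_bins, lo, hi)
    h_curr = histogram(curr_block, num_bins, lo, hi)
    return histogram_intersection(h_prev, h_curr) >= threshold


# The same block at two adjacent time-steps (hypothetical values in [0, 1]).
prev_block = [0.1, 0.2, 0.6, 0.9]
curr_block = [0.1, 0.3, 0.7, 0.9]
overlap = histogram_intersection(
    histogram(prev_block, 4, 0.0, 1.0), histogram(curr_block, 4, 0.0, 1.0)
)
```

Under this assumed rule, the example block (intersection 0.75) is reused with a 60% threshold but resampled with an 80% threshold, which is consistent with lower thresholds leading to more reuse.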
DescriptionThe dCache installation is a storage management system that acts as a disk cache for high-energy physics (HEP) data. Storage space on dCache is limited relative to persistent storage devices; therefore, a heuristic is needed to determine what data should be kept in the cache. A good cache policy would keep frequently accessed data in the cache, but this requires knowledge of future dataset popularity. We present methods for forecasting the number of times a dataset stored on dCache will be accessed in the future. We present a deep neural network that can predict future dataset accesses accurately, reporting a final normalized loss of 4.6e-8. We also present a set of algorithms that can forecast future dataset accesses given an access sequence. Included are two novel algorithms, Backup Predictor and Last N Successors, that outperform other file prediction algorithms. Findings suggest that it is possible to anticipate dataset popularity in advance.
DescriptionWhen transmitting image data from a deployed edge device, a high-bandwidth connection to a cloud system cannot be guaranteed. An early-warning system for an intersection crosswalk, for instance, would have to be able to compress and transmit data with enough quality to ensure prompt detection of danger through remote image processing. Adaptive lossy compression provides a potential solution for this, although it is yet to be evaluated on actual edge hardware. By separating the compression and detection pipelines between client and server processes, improving compression ratios by up to 4.95% via a unified lossless stage, demonstrating compression performance on an Arm-powered edge device, and benchmarking network performance under a range of realistic bandwidth conditions, we attempt to evaluate the viability of this method under realistic conditions. This poster discusses our revised architecture and its performance, along with the relevance of our results towards method refinement.
DescriptionLarge data sets are very common in many areas of high-performance computing. Oftentimes, the size of these data sets is so extreme that it far exceeds the storage capabilities of the system. This highlights an opportunity to employ compression methods in order to reduce a data set to a manageable size. Given that reduction methods operate on data in different ways, it is important to compare these methods with the goal of determining the optimal approach for any given data set. This poster compares the effectiveness of different data reduction methods on image data from Los Alamos National Labs based on three major parameters: PSNR, compression ratio, and compression rate. Our analysis indicated the SZ lossy compressor was the most effective for this data set, given that it offered the highest PSNR along with a very reasonable compression ratio.
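Two of the metrics named above can be computed as in the sketch below. It assumes the common convention for scientific data of using the value range of the original data as the peak signal in PSNR; the poster does not state which definition it uses, and the function names are illustrative.

```python
import math


def psnr(original, reconstructed):
    """Peak signal-to-noise ratio in dB, with the data range as the peak.

    Assumes the original data is not constant (non-zero value range).
    """
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # lossless reconstruction
    value_range = max(original) - min(original)
    return 20 * math.log10(value_range) - 10 * math.log10(mse)


def compression_ratio(original_bytes, compressed_bytes):
    """Original size divided by compressed size; larger is better."""
    return original_bytes / compressed_bytes


# Hypothetical data with a reconstruction error of about 0.1 per value.
original = [0.0, 1.0, 2.0, 3.0]
reconstructed = [0.1, 0.9, 2.1, 2.9]
quality_db = psnr(original, reconstructed)
ratio = compression_ratio(1000, 100)
```

A higher PSNR means the reconstructed data is closer to the original, so a compressor is usually judged by the PSNR it achieves at a given compression ratio rather than by either number alone.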
DescriptionCells are the basic building blocks of human organisms. Single-cell RNA sequencing is a technology for studying the heterogeneity of cells of different organs, tissues, subjects, conditions, and treatments. Identification of cell types and states in sequenced data is an important and challenging task, requiring computational approaches that are accurate, robust, and scalable. Existing approaches use cluster analysis as the first step of cell-types prediction. Their performance remains limited because they optimize only one objective function. In this study, two evolutionary clustering approaches were designed, implemented, and systematically validated, namely a single-objective evolutionary algorithm and a multi-objective evolutionary algorithm. The algorithms were evaluated on synthetic and real datasets. The results demonstrated that the performance and the accuracy of both evolutionary algorithms were consistent, stable, and on par with or better than baseline algorithms. Running time analysis of multi-processing on an HPC showed that evolutionary algorithms can efficiently handle large datasets.
DescriptionToday’s scientific projects and simulations often require repeated transfer of large data volumes between the storage system and the client. This increases the load on the network, leading to congestion. In order to mitigate these effects, regional data storage cache systems are used to store data locally. This project examines the XCache storage system to closely analyze data trend patterns in the data volume and data throughput performance, while also creating a model for predicting how caches could potentially impact network traffic and data transfer performance overall. The results of the data access patterns demonstrated that traffic volume was reduced by an average factor of 2.35. The hourly and daily prediction models also showed low error values, reinforcing the learning methods used in this effort.
DescriptionThe architectures of supercomputers are increasing in diversity. It is important to maintain efficient code portability to take advantage of the computing capabilities of the evolving hardware in these systems. Intel has adopted an open standard programming interface for heterogeneous systems called oneAPI, designed to allow code portability across different processor architectures. This report evaluates oneAPI by migrating a general matrix-matrix multiplication CUDA algorithm from the dense linear algebra library Matrix Algebra on GPU and Multicore Architectures to Data Parallel C++, the direct programming language of oneAPI. Performance of the migrated code is compared to native CUDA implementations on multicore CPUs and GPUs. The initial migrated code demonstrates impressive performance on multicore CPUs. It retains the performance of CUDA on NVIDIA GPUs. It performs poorly on the Intel GPU but is improved with tuning. Intel's oneAPI allowed for a successful extension of MAGMA portability to multicore CPUs and Intel GPUs.
DescriptionOpenACC is a high-level directive-based parallel programming model that can manage the sophistication of heterogeneity in architectures and abstract it from the users. The portability of the model across CPUs and accelerators has gained the model a wide variety of users. This means it is also crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite to verify the OpenACC implementations across various compilers, with an infrastructure for a more streamlined execution. This paper covers the following aspects: (a) new developments since the last publication on the testsuite, (b) the use of the infrastructure, (c) tests that highlight our workflow process, (d) results from executing the testsuite on various systems, and (e) future developments.
DescriptionTo enable efficient and productive programming of today's supercomputers and beyond, a variety of issues must be addressed, including: load balancing (i.e., utilizing all resources equally), fault tolerance (i.e., coping with hardware failures), and resource elasticity (i.e., allowing the addition/release of resources).
In this work, we address the above issues in the context of Asynchronous Many-Tasking (AMT) for clusters. Here, programmers split a computation into many fine-grained execution units (called tasks), which are dynamically mapped to processing units (called workers) by a runtime system.
Regarding load balancing, we propose a work stealing technique that transparently schedules tasks to resources of the overall system, balancing the workload over all processing units. Experiments show good scalability, and a productivity evaluation shows intuitive use.
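The general principle of work stealing can be shown with a small sequential simulation. This is only a sketch of the textbook scheme, not the cluster-level technique proposed here: each worker owns a double-ended queue, takes its newest task from one end, and an idle worker steals the oldest task from the other end of a victim's queue. Choosing the victim with the longest queue is an assumption made to keep the example deterministic.

```python
from collections import deque


def run_work_stealing(initial_tasks, num_workers):
    """Simulate work stealing in rounds; return the tasks each worker executed."""
    queues = [deque() for _ in range(num_workers)]
    queues[0].extend(initial_tasks)  # all work starts on worker 0
    executed = [[] for _ in range(num_workers)]

    while any(queues):
        for w in range(num_workers):
            if queues[w]:
                task = queues[w].pop()  # owner takes its newest task
            else:
                # Idle worker: pick the victim with the most queued tasks.
                victim = max(range(num_workers), key=lambda i: len(queues[i]))
                if not queues[victim]:
                    continue  # nothing left to steal
                task = queues[victim].popleft()  # thief takes the oldest task
            executed[w].append(task)
    return executed


executed = run_work_stealing(list(range(6)), num_workers=2)
```

Although every task starts on worker 0, both workers end up executing three tasks each, without the program itself assigning tasks to workers.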
Regarding fault tolerance, we propose four techniques to protect programs transparently. All perform localized recovery and continue the program execution with fewer resources. Three techniques write uncoordinated checkpoints of task descriptors in a resilient store. One technique does not write checkpoints, but exploits natural task duplication of work stealing. Experiments show failure-free running time overhead below 1% and a recovery overhead below 0.5 seconds. Simulations of job set executions show that makespans can be reduced by up to 97%.
Regarding resource elasticity, we propose a technique to enable the addition and release of nodes at runtime by transparently relocating tasks accordingly. Experiments show costs for adding and releasing nodes below 0.5 seconds. Additionally, simulations of job set executions show that makespans can be reduced by up to 20%.
DescriptionModern HPC workloads produce massive amounts of distributed intermediate data that needs to be checkpointed concurrently in real-time at scale. One such popular scenario is the use of checkpoint-restore for revisiting previous states (intermediate data) to advance computations, such as adjoint methods. In this context, GPUs have shown tremendous performance improvements during computations but demonstrate I/O limitations while managing high-frequency large-volume data movement across heterogeneous memory tiers. Existing data movement runtimes are not well suited for such I/O because of factors such as imbalance in checkpoint distribution across fast memory tiers, slow memory allocation, and restore oblivious cache eviction and prefetching strategies. We address these challenges by designing a set of transparent, asynchronous checkpoint-restore techniques that minimize the blocking time of the application during I/O using three novel contributions. First, we design techniques to evenly distribute checkpoints across fast memory tiers (e.g. peer GPUs) using collaborative checkpointing that leverages fast interconnects such as NVLinks and NVSwitches for load balancing. Second, we mitigate the slow cache allocation for storing checkpoints on both GPU and host by leveraging techniques such as CUDA's virtual memory management functions, eager memory mapping, and lazy pinning. Third, we design a restore-order aware eviction and prefetching approach that is coordinated by a finite state machine based on a unified checkpoint-restore abstraction for optimal evictions. Our evaluations across real-world and synthetic benchmarks demonstrate significant speedup in both checkpoint and restore phases of the application compared to the current state-of-the-art data movement engines.
DescriptionEfficiently and accurately simulating partial differential equations (PDEs) in and around arbitrarily defined geometries, especially with high levels of adaptivity, has significant implications for different application domains. In this work, we develop a fast construction of a ‘good’ adaptively-refined incomplete octree-based mesh capable of carving out arbitrarily shaped void regions from the parent domain: an essential requirement for fluid simulations around complex objects. Further, we integrate the mesh generation with PETSc to solve several multiphysics and multiphase phenomena. We showcase the applicability of the algorithms to large-scale problems. The algorithms developed have enabled us to run the most resolved jet atomization simulations and have demonstrated scaling up to O(100K) processors on TACC Frontera.
DescriptionWith the rise of Big Data, there has been a significant effort in increasing compute power through GPUs, TPUs, and heterogeneous architectures. As a result, the bottleneck of applications is shifting toward memory performance. Prefetching techniques are widely used to hide memory latency and improve instructions per cycle (IPC). A data prefetching process is a form of speculation that looks at memory access patterns to forecast the near future accesses and avoid cache misses. Traditional hardware data prefetchers use pre-defined rules, which are not powerful enough to adapt to the increasingly complex memory access patterns from new workloads.
We hypothesize that a machine learning-based prefetcher can be developed to achieve high-quality memory access prediction, leading to the improvement of IPC for a system. We develop several optimizations for ML-based prefetching. First, we propose RAOP, a framework for an RNN-augmented offset prefetcher, in which the RNN provides temporal references for a spatial offset prefetcher, leading to the improvement of IPC. Second, we propose C-MemMAP, which provides clusters for downstream meta-models to balance the model size and prediction accuracy. We propose the DM (delegated model) clustering method, which learns latent patterns from long memory traces and has significantly raised the prediction accuracy of the meta-models. Third, we propose TransFetch, an attention-based prefetcher that supports variable-degree prefetching by modeling prefetching as a multi-label classification problem. In addition, we propose ReSemble, a Reinforcement Learning (RL) based adaptive ensemble framework that enables multiple prefetchers to complement each other on hybrid applications and updates online.
DescriptionAs computational resources scale larger, applications often need to be refactored to deal with the bottlenecks that arise in order to gain the advantages of strong scaling. When not properly addressed, legacy workloads can lead to inefficient usage of available hardware, which leads to poor throughput. One solution is to allow multiple tasks to share a system to provide multi-tenancy. Multi-tenant environments fall into two categories: time-sharing and space-sharing. Time-sharing has been an effective technique to deal with multiple applications sharing the CPU and GPU at the node level. However, time-sharing can have a heavy performance cost, such as saving and restoring architectural state (context switch overhead), which is very costly on GPUs. While space-sharing can avoid this overhead and improve throughput, current hardware and software systems lack full isolation to provide the necessary quality of service. In this work, we identify key challenges that arise when sharing resources in an HPC context. We evaluate real-world scenarios at both the node level and the cluster level. Using these insights, we propose middleware to mitigate these challenges and improve quality of service. We introduce a runtime CUDA middleware that improves QoS for GPUs. We also introduce and study two new features of HDF5, GDS VFD and Async I/O. The former improves I/O latency, while the latter improves and hides variability in I/O latency.
DescriptionPython's extensive software ecosystem leads to high productivity, rendering it the language of choice for scientific computing. However, executing Python code is often slow or impossible in emerging architectures and accelerators. To complement Python's productivity with the performance and portability required in high-performance computing (HPC), we introduce a workflow based on data-centric (DaCe) parallel programming. Python code with HPC-oriented extensions is parsed into a dataflow-based intermediate representation, facilitating analysis of the program's data movement. The representation is optimized via graph transformations driven by the users, performance models, and automatic heuristics. Subsequently, hardware-specific code is generated for supported architectures, including CPU, GPU, and FPGA. We evaluate the above workflow through three case studies. First, to compare our work to other Python-accelerating solutions, we introduce NPBench, a collection of over 50 Python microbenchmarks across a wide range of scientific domains. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer. DaCe runs 10x faster than the reference Python execution and achieves 2.47x and 3.75x speedups over previous-best solutions and up to 93.16% scaling efficiency. Second, we re-implement in Python and optimize the Quantum Transport Simulator OMEN. The application's DaCe version executes one to two orders of magnitude faster than the original code written in C++, achieving 42.55% of the Summit supercomputer's peak performance. Last, we utilize our workflow to build Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation. Deinsum performs up to 19x faster over state-of-the-art solutions on the Piz Daint supercomputer.
DescriptionIn contrast to conventional integrated circuits, Field Programmable Gate Arrays (FPGAs) can be reconfigured dynamically. This flexibility unlocks potential for FPGA-based accelerators to offload tasks in HPC. Scheduling tasks on FPGAs is equivalent to the allocation of chip resources: each offloaded task occupies chip area during its execution. Hence, task scheduling on FPGAs is typically done with Partial Reconfiguration (PR). However, PR incurs a high development overhead, requires expert knowledge, and has limited portability, making it difficult to apply existing research and lowering the adoption of FPGAs in HPC. We want to help software developers and vendors integrate FPGA-based accelerators without these issues, and ask: how can we optimize task scheduling on FPGAs without relying on PR?
We answer this question with three key contributions. First, we introduce an abstraction-agnostic methodology to analyze and compare scheduling strategies for FPGAs. At the center of our method is the derivation of scheduling constraints from a machine model representing a target FPGA. The schedules generated for HPC applications are compared for two models. We show that the overhead for avoiding PR is feasible. Second, we propose algorithms to generate recommendations for minimal changes to the program that affect the quality of possible schedules. We show that effective recommendations can be generated for HPC applications. Third, we contribute two polynomial-time scheduling algorithms. Our results can help vendors to provide significantly more streamlined workflows for programming FPGAs, making the platform more appealing and helping the adoption of high-level programming environments like OpenCL for FPGAs.
DescriptionIn this work, we use sparse techniques to accelerate DD-Net, a deep learning model designed to enhance CT images of COVID-19 chest scans. The model follows an autoencoder (encoder-decoder) architecture and has high dimensionality, and thus takes many compute hours to train. We propose a set of techniques that target these two aspects of the model: dimensionality and training time. We will implement techniques to prune neurons, making the model sparse and thus reducing the effective dimensionality with a loss of accuracy of no more than 5% and minimal additional retraining overhead. We then propose a set of techniques tailored to the underlying hardware in order to better utilize existing hardware components (such as tensor cores) and thus reduce the time and associated cost required to train this model.
DescriptionData transformation tasks - such as encoding, decoding, parsing, and conversion between common data formats - are at the core of many data analytics, data processing and scientific applications. This has led to the development of custom software libraries and hardware implementations targeting popular data transformations. By accelerating specific transformations, however, these solutions suffer from lack of generality. On the other hand, a generic and programmable data processing engine might support a wide range of data transformations, but do so at the cost of reduced performance compared to custom, algorithm-specific solutions.
In this work, we aim to bridge this gap between generality and performance. To this end, we provide a compilation framework that transparently converts data transformation tasks expressed using pushdown transducers into efficient GPU code.
DescriptionThe LLVM Flang compiler ("Flang") is currently Fortran 95 compliant, and the frontend can parse Fortran 2018. However, Flang does not have a comprehensive 2018 test suite and does not fully implement the static semantics of the 2018 standard. We are investigating whether agile software development techniques, such as pair programming and test-driven development (TDD), can help Flang to rapidly progress to Fortran 2018 compliance. Because of the paramount importance of parallelism in high-performance computing, we are focusing on Fortran’s parallel features, commonly denoted “CoArray Fortran”. We are developing what we believe are the first exhaustive, open-source tests for the static semantics of Fortran 2018 parallel features, and contributing them to the LLVM project. A related effort involves writing runtime tests for parallel 2018 features and supporting those tests by developing a new parallel runtime library: the CoArray Fortran Framework of Efficient Interfaces to Network Environments (Caffeine).
DescriptionIn past years, the world has switched to many-core and multi-core shared memory architectures. As a result, there is a growing need to utilize these architectures by introducing shared memory parallelization schemes, such as OpenMP, to software applications. Nevertheless, introducing OpenMP into code, especially legacy code, is challenging due to pervasive pitfalls in the management of parallel shared memory. To facilitate this task, many source-to-source (S2S) compilers have been created over the years, tasked with inserting OpenMP directives into code automatically. In addition to having limited robustness to their input format, these compilers still do not achieve satisfactory coverage and precision in locating parallelizable code and generating appropriate directives. In this work, we propose leveraging recent advances in machine learning, specifically the transformer model from natural language processing (NLP), to suggest the need for an OpenMP directive or for specific clauses (reduction and private).
DescriptionThe GeoCAT-comp program is a Python toolkit used by the geoscience community to analyze data. This project explores ways to port GeoCAT-comp to run on GPUs, as recent supercomputers are shifting to include GPU accelerators as the major resource. Although GeoCAT-comp's routines are all sequential or utilize Dask parallelization on the CPU, the data processing is embarrassingly parallel and computationally costly, enabling us to optimize using GPUs. GeoCAT uses NumPy, Xarray, and Dask arrays for CPU parallelization. In this project, we examined different GPU-accelerated Python packages (e.g., Numba and CuPy). Taking into account the deliverability of the final porting method to the GeoCAT team, we selected CuPy, a CUDA-enabled Python array backend module that is quite similar to NumPy. We analyzed the performance of the GPU-accelerated code compared to the Dask CPU-parallelized code over various array sizes and resources, and through strong and weak scaling.
DescriptionAppropriately adjusting the power draw of computational hardware plays a crucial role in its efficient use. While vendors have already implemented hardware-controlled power management, additional energy savings are available, depending on the state of the machine. We propose the online classification of such states based on computationally informed machine learning algorithms to adjust the power cap of the next time step. This research highlights that the overall energy consumption can be reduced significantly, often without a prohibitive penalty in the runtime of the applications.
DescriptionFast analysis of scientific data from X-ray free electron laser (XFEL) experimental facilities is key for supporting real-time decisions that efficiently use these facilities to speed up scientific discovery. Our research shows gains obtained using graphics processing units (GPUs) to accelerate 3D reconstruction of Single Particle Imaging (SPI) X-ray diffraction data. We achieve a 4X speedup over the previous GPU implementation, 50% better image reconstruction resolution, and 485X speedup when calculating resolution compared to the existing implementation. We showcase techniques to optimize per-node computational efficiency, increase scalability and improve the accuracy of SPI by using better algorithms, improving data movement and accesses, reusing data structures, and reducing memory fragmentation.
DescriptionQueries on large graphs use the stored graph properties to generate responses. Most real-world graphs are dynamic, i.e., the graph topology changes with time, and hence the related graph properties are also time-varying. In such cases, maintaining correctness in stored graph properties requires recomputing or updating the previous properties. Here, we present an efficient framework, CANDY, for updating the properties in large dynamic networks. We demonstrate the efficacy of our general framework by applying it to update graph properties such as Single Source Shortest Path (SSSP), Vertex Coloring, and PageRank. Empirically, we show that our shared-memory parallel and NVIDIA GPU-based data-parallel implementations perform better than the state-of-the-art implementations.
DescriptionWith modern technology and High-Performance Computing (HPC), Molecular Dynamics (MD) simulations can be task and data parallel. That is, they can be decomposed into multiple independent tasks (i.e., trajectories) with their own data, which can be processed in parallel. Analysis of MD simulations includes finding specific molecular events and the conformation changes that a protein undergoes. However, the traditional analysis relies on the global decomposition of all the trajectories for a specific molecular system, which can be performed only in a centralized way. We propose a lightweight self-supervised machine learning technique to analyze MD simulations in situ. That is, we aim to speed up the process of finding molecular events in the protein trajectory at run-time, without having to wait for the entire simulation to finish. This allows us to scale the analysis with the simulation.
DescriptionKokkos is a representative approach among template metaprogramming solutions that offers programmers high-level abstractions for generic programming while most of the device-specific code generation and optimizations are delegated to the compiler through template specializations. For this, Kokkos provides a set of device-specific code specializations in multiple back ends, such as CUDA and HIP. However, maintaining and optimizing multiple device-specific back ends for each new device type can be complex and error-prone. To alleviate these concerns, this paper presents an alternative OpenACC back end for Kokkos: KokkACC. KokkACC provides a high-productivity programming environment and, potentially, a multi-architecture back end. We have observed competitive performance; in some cases, KokkACC is faster than NVIDIA’s CUDA back end and much faster than OpenMP’s GPU offloading back end. This work also includes implementation details and a detailed performance study conducted with a set of mini-benchmarks (AXPY and DOT product) and two mini-apps (LULESH and miniFE).
DescriptionNOvA is a world-leading neutrino physics experiment that is making measurements of fundamental neutrino physics parameters and performing searches for physics beyond the Standard Model. These measurements must leverage high performance computing facilities to perform data intensive computations and execute complex statistical analyses. We outline the NOvA analysis workflows we have implemented on NERSC Cori and Perlmutter systems. We have developed an implicitly-parallel data-filtering framework for high energy physics data based on pandas and HDF5. We demonstrate scalability of the framework and advantages of an aggregated monolithic dataset by using a realistic neutrino cross-section measurement. We also demonstrate the performance and scalability of the computationally intensive profiled Feldman-Cousins procedure for statistical analysis. This process performs statistical confidence interval construction based on non-parametric Monte Carlo simulation and was applied to the NOvA sterile neutrino search. We show the NERSC Perlmutter system provides an order of magnitude computing performance gain over Cori.
DescriptionGPU matrix chain multiplication serves as a basis for a wide range of scientific domains like computer graphics, physics, and machine learning. While its time performance has been studied for years, there has been significantly less effort in optimizing its energy efficiency. GPU power consumption is heavily impacted by the number of data transfers performed. In fact, a data transfer from global memory needs a thousand times more energy than a double precision arithmetic operation. Thus, minimizing data transfers is key to reducing energy consumption. We present an energy-efficient solution for Matrix Chain Multiplication on GPUs that minimizes computation as well as off-chip data transfers. To this end, we provide optimizations at three different levels. For a single matrix multiplication, we use a large tile blocking strategy. Then, we extend our approach to three matrices. Finally, we propose a solution for a sequence of matrices.
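To make "minimizes computation" concrete for a sequence of matrices, the sketch below shows the textbook dynamic program for choosing a multiplication order that minimizes scalar multiplications. This is an illustration only: the abstract does not say which ordering method the authors use, the function name `matrix_chain_order` is hypothetical, and the cost model ignores the data transfers that the poster also targets.

```python
def matrix_chain_order(dims):
    """Textbook matrix-chain DP (not the poster's method).

    Matrix i has shape dims[i] x dims[i + 1]. Returns the minimum number
    of scalar multiplications and one optimal parenthesization.
    """
    n = len(dims) - 1  # number of matrices in the chain
    if n < 1:
        raise ValueError("need at least one matrix")

    # cost[i][j]: cheapest way to multiply matrices i..j (inclusive)
    cost = [[0] * n for _ in range(n)]
    # split[i][j]: index k where the optimal product i..j is split
    split = [[0] * n for _ in range(n)]

    for length in range(2, n + 1):          # length of the sub-chain
        for i in range(n - length + 1):
            j = i + length - 1
            best = None
            for k in range(i, j):           # try every split point
                c = (cost[i][k] + cost[k + 1][j]
                     + dims[i] * dims[k + 1] * dims[j + 1])
                if best is None or c < best:
                    best = c
                    split[i][j] = k
            cost[i][j] = best

    def paren(i, j):
        if i == j:
            return f"A{i}"
        k = split[i][j]
        return f"({paren(i, k)} x {paren(k + 1, j)})"

    return cost[0][n - 1], paren(0, n - 1)


if __name__ == "__main__":
    # Three matrices: 10x30, 30x5, 5x60
    print(matrix_chain_order([10, 30, 5, 60]))
```

For the three-matrix case in the usage line, multiplying the first two matrices first costs 4,500 scalar multiplications, while the other order costs 27,000, which is why the order matters before any tiling is considered.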
DescriptionEnergy systems research strongly relies on large modeling frameworks. Many of them use linear optimization approaches to calculate blueprints for ideal future energy systems, which become increasingly complex, as do the models. The state of the art is to compute them with shared-memory computers combined with approaches to reduce the model size. We overcome this and implement a fully automated workflow on HPC using a newly developed solver for distributed memory architectures. Moreover, we address the challenge of uncertainty in scenario analysis by performing sophisticated parameter variations for large-scale power system models, which cannot be solved in the conventional way. Preliminary results show that we are able to identify clusters of future energy system designs that perform well from different perspectives of energy system research and also consider disruptive events. Furthermore, we observe that our approach provides the most insights when applied to complex rather than simple models.
DescriptionThe NERSC Perlmutter HPC system is the most recent large-scale US system that is publicly available. NERSC chose to deploy a first phase of its GPU-based nodes in late 2021 using 2x Slingshot10 connections and has been upgrading them to 4x Slingshot11 connections starting in summer 2021. In this poster we provide benchmark numbers for using CGYRO, a popular fusion turbulence simulation tool, comparing the original and the upgraded network setup. CGYRO has been previously shown to be communication-bound in many recent HPC systems and we show that the upgraded networking provides a significant boost for fusion science.
DescriptionThe standard implementation of MPI_Alltoall uses a combination of techniques, including the spread-out and Bruck algorithms. The existing Bruck algorithm implementation is limited to a radix of two, so the total number of communication steps is fixed at log2(P) (P: total number of processes). The spread-out algorithm, on the other hand, requires P-1 communication steps. Between these two extremes of the communication spectrum lies a wide, unexplored parameter space that can be tuned. In this paper, we present a generalized formulation and implementation of the Bruck algorithm, whose radix can be varied from 2 to P-1. With this ability, both the total number of communication steps and the total amount of data transmitted can be tuned, which allows performance tuning. We performed an experimental investigation and demonstrated that Bruck with the optimal radix is up to 57% faster than the vendor's optimized MPI_Alltoall on the Theta supercomputer.
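The trade-off between the two extremes can be illustrated by counting communication steps as a function of the radix. The sketch below assumes the common radix-r formulation of Bruck, in which each step handles one nonzero value of one base-r digit of the rank offset; this counting model and the function name `bruck_steps` are assumptions made for illustration and are not taken from the paper's implementation.

```python
def bruck_steps(num_procs, radix):
    """Count communication steps of a radix-`radix` Bruck all-to-all.

    Assumed model (not from the paper): one step per (digit position,
    nonzero digit value) pair whose offset is smaller than num_procs.
    """
    if num_procs < 2 or radix < 2:
        raise ValueError("need num_procs >= 2 and radix >= 2")

    steps = 0
    weight = 1  # radix ** digit_position
    while weight < num_procs:
        for digit in range(1, radix):
            # Offsets of num_procs or more never occur, so they add no step.
            if digit * weight < num_procs:
                steps += 1
        weight *= radix
    return steps


if __name__ == "__main__":
    for r in (2, 4, 8, 63):
        print(f"P=64, radix={r}: {bruck_steps(64, r)} steps")
```

Under this model, radix 2 reproduces the log2(P) steps of the classic Bruck algorithm and radix P-1 reproduces the P-1 steps of spread-out; intermediate radices take more steps than radix 2 but, in the usual analysis, move less data per step.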
DescriptionPerformance data are collected to establish how efficiently exascale applications execute their code or workflows. Chimbuko, a tool specifically focused on the analysis of performance data in real time, looks through these data and collects the performance anomalies that are detected. These anomalies are saved into the Chimbuko Provenance Database, together with as much contextual information as needed. The goal of our work is to perform statistical analysis on the Chimbuko Provenance Database by presenting simple visualizations and determining if the information collected for each anomaly is sufficient to conduct a causal analysis. Statistical methods such as Theil’s U correlation analysis, logistic regression, and K-Prototype clustering were used to identify associations between variables. Furthermore, feature selection was conducted with decision trees and random forests. We identified associations between call_stack and several variables, which reveals that call_stack is a very important feature of the dataset.
DescriptionThis poster presents GPU optimizations for Sparse Deep Neural Networks (SpDNNs) using Apache TVM. Although various deep neural network models exist, SpDNNs have shown great improvements in the size and memory footprint of neural networks. SpDNNs pose unique scalability difficulties, which leave room for optimizations and advancements. Apache TVM is a machine learning compiler framework for CPUs and GPUs. It has shown promising improvements in the performance, deployment, and optimization of networks. To evaluate its effectiveness for SpDNNs, this work builds SpDNNs with Apache TVM and compares them with current SpDNN implementations. When tested with various datasets, the TVM-based implementation can achieve faster and more efficient optimizations.
DescriptionThe poster presents a scalable approach that converts the results of large-scale Computational Fluid Dynamics (CFD) simulations into a volumetric representation used by volume rendering-based visualization. Although this functionality is provided by common post-processing tools, its efficient parallelization requires appropriate load balancing. Unfortunately, load balancing according to the number of cells does not scale for unstructured meshes with a high growth rate, which are common in CFD. In the poster, we show that with an appropriate redistribution of data among available resources it is possible to perform the operation in just several seconds with significantly improved scalability.
DescriptionFederated Learning (FL) is a distributed Machine Learning paradigm aiming to collaboratively learn a shared model while considering privacy preservation by letting the clients process their private data locally. In the Computing Continuum context (edge-fog-cloud ecosystem), FL raises several challenges such as supporting very heterogeneous devices and optimizing massively distributed applications.
We propose a workflow to better support and optimize FL systems across the Computing Continuum by relying on formal descriptions of the infrastructure, hyperparameter optimization and model retraining in case of performance degradation. We motivate our approach by providing preliminary results using a human activity recognition dataset. The next objective will be to implement and deploy our solution on the Grid’5000 testbed.
During the poster session, I will start by presenting the main problems in applying FL in the Computing Continuum and how our approach tackles them. Next, I will present preliminary results and discuss the remaining challenges.
DescriptionIn structured grid finite-difference, finite-volume, and finite-element discretizations of partial differential equation conservation laws, regular stencil computations constitute the core kernel in many temporally explicit approaches for such problems. For various blocking dimensions, the Spatial Blocking (SB) approach enables data reuse within multiple cache levels.
Introduced in GIRIH, the Multi-core Wavefront Diamond blocking (MWD) method optimizes practically relevant stencil algorithms by combining the concepts of diamond tiling and multi-core aware wavefront temporal blocking, leading to a significant increase in data reuse and locality.
We evaluate the performance of MWD on a variety of recent multi-core architectures. Among all of them, the new AMD multi-processor, codenamed Milan-X, provides an unprecedented capacity for the Last Level Cache. We show that the Milan-X hardware design is ideal for the MWD method, and significant performance gain can be achieved relative to its predecessors Milan and Rome.
DescriptionThe Fast Fourier Transform (FFT), a reduced-complexity formulation of the Discrete Fourier Transform (DFT), dominates the computational cost in many areas of science and engineering. Because of the large data volumes involved, multi-node heterogeneous systems are needed to meet the increasing demand for parallel FFT computation in High-Performance Computing (HPC). In this work, we present a highly efficient GPU-based distributed FFT framework by adapting the Cooley-Tukey recursive FFT algorithm. Two major types of optimizations, automatic low-dimensional FFT kernel generation and an asynchronous strategy for multiple GPUs, are presented to enhance the performance of our approach for large-scale distributed FFT. Numerical experiments demonstrate that our work achieves more than 40x speedup over CPU FFT libraries and about 2x speedup on GPUs over heFFTe, the currently available state of the art.
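As background on the algorithm the framework adapts, the following is a minimal single-process sketch of the recursive radix-2 Cooley-Tukey FFT. It is the textbook formulation restricted to power-of-two lengths, not the authors' distributed GPU implementation; the names `fft` and `naive_dft` are chosen here for illustration.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT (textbook sketch, power-of-two length)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]

    # Divide: transform the even- and odd-indexed samples separately.
    even = fft(x[0::2])
    odd = fft(x[1::2])

    # Conquer: combine the half-size transforms with twiddle factors.
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def naive_dft(x):
    """O(n^2) DFT straight from the definition, used only as a reference."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


if __name__ == "__main__":
    print(fft([1, 2, 3, 4]))
```

The recursion splits one length-n transform into two length-n/2 transforms; the same structure is what allows a large transform to be decomposed into many small, low-dimensional kernels.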
DescriptionThe poster presents the use of a deterministic traffic simulator for optimizing traffic flow within a city. The simulator is one part of a traffic modeling framework for intelligent transportation in smart cities. In contrast to standard navigation systems, where the navigation is optimized for individual drivers, we aim to optimize the distribution of the global traffic flow. We utilize HPC resources to explore the simulator’s parameters, using the EVEREST SDK.
The EVEREST project aims at developing a holistic design environment that simplifies the programmability of heterogeneous and distributed architectures for Big Data applications. The project uses a “data-driven” design approach with domain-specific language extensions, hardware-accelerated AI, and efficient monitoring of the execution with a unified hardware/software paradigm. During the presentation, the distribution of traffic flow in a selected city will be presented in the form of a short video to demonstrate the dynamicity of the system.
DescriptionApplications of quantum machine learning algorithms are currently still being studied. Recent work suggests that classical gradient descent techniques can effectively train variational quantum circuits. We propose to train quantum variational circuits to find smaller text and image embeddings that preserve contrastive-learning distances based on CLIP large embeddings. This is a critical task since fine-tuning CLIP to produce low-dimensional embeddings is prohibitively expensive. We introduce CLIP-ACQUA, a model trained in a self-supervised configuration from CLIP embeddings to reduce the latent space. We use CLIP-ACQUA on a sizeable unlabelled corpus of text and images to demonstrate its effectiveness. Our experiments show that we can obtain smaller latent spaces that preserve the original embedding distances inferred during contrastive learning. Furthermore, using our model requires no fine-tuning of CLIP, preserving its original robustness and structure. The data used as a demonstration aids in modeling consumer-to-consumer online marketplaces to detect illicit activities.
DescriptionMany HPC and certainly AI or DL applications are composed at their core of small linear algebra operations, which are then used to compose large and more complicated tensor operations. Especially in the field of AI/DL, portability among different hardware platforms is essential due to an extensive reliance on Python and the high-level nature of many frontends. However, scientists are often faced with the challenge of running their codes in vastly different environments. They therefore have to restrict themselves to high-level languages and hope for good compiler optimizations. Especially for complicated linear algebra operators, as they arise in high-order methods in the computational sciences, this is a huge leap of faith. In this work we demonstrate how Tensor Processing Primitives, a low-dimensional SIMD abstraction for various CPU architectures, can be used to obtain very high fractions of floating point peak on seven different CPU micro-architectures offering four different ISAs.
DescriptionWe have successfully developed an efficient algorithm capable of simulating N = 1 million elements over 0.1 million time steps. Strong-scaling analyses show that the algorithm exhibits good scalability under hybrid OpenMP/MPI parallelization with 8 threads and more than 10,000 cores (~200 nodes). This capacity is necessary to simulate the nationwide fault activity for the Japanese Islands with current HPC systems. The algorithm is applied to simulate 15 thousand years of earthquake recurrence history along one of the largest active faults in SW Japan, the Median Tectonic Line. We demonstrate that the optimized algorithm is a powerful tool that enables us to build a physics-based method for long-term forecasting of earthquake generation.
DescriptionRadio-frequency cavities are key components for high-energy particle accelerators, quantum computing, etc. Designing cavities comes with many computational challenges, such as multi-objective optimization and the high performance computing (HPC) requirements for handling large cavities. More precisely, the multi-objective optimization requires an efficient 3D full-wave electromagnetic simulator. For this, we rely on the integral equation (IE) method, which requires a fast solver with HPC and ML algorithms to search for resonance modes.
We propose an HPC-based fast direct matrix solver for IE, combined with hybrid optimization algorithms, to attain an efficient simulator for accelerator cavity modeling. First, we solve the linear eigenproblem for each trial frequency with a distributed-memory parallel, fast direct solver. Second, we propose combining the global optimizer Gaussian Process with the local optimizer Downhill-simplex to generate the trial frequency samples, which successfully optimizes the corresponding 1D objective function with multiple sharp minima.
DescriptionAutotuning is a widely used method for guiding developers of large-scale applications to achieve high performance. However, autotuners typically employ black-box optimizations to recommend parameter settings at the cost of users missing the opportunity to identify performance bottlenecks. Performance analysis fills that gap and identifies problems and optimization opportunities that can result in better runtime and utilization of hardware resources. This work combines the best of both worlds by integrating a systematic performance analysis and visualization approach into a publicly available autotuning framework, GPTune, to suggest to users which configuration parameters are important to tune, to what value, and how tuning the parameters affects hardware-application interactions. Our experiments demonstrate that a subset of the task parameters impacts the execution time of the Hypre application, and that memory traffic and page faults cause performance problems in the Plasma-DGEMM routine on Cori-Haswell.
DescriptionWriting software that can exploit supercomputers is difficult, and this is going to get much harder as we move toward exascale, where the scale and heterogeneity of our machines will increase significantly. A potential solution is the use of Domain Specific Languages (DSLs), which separate the programmer's logic from the mechanisms of parallelism. However, while these have shown promise, a major challenge is that DSL toolchains are often siloed, sharing little or no infrastructure between DSLs.
In this poster, we present xDSL, an ecosystem for DSL development. Built upon the hugely popular LLVM and MLIR, xDSL provides a Python-based toolbox to ease integration with MLIR, and a series of IR dialects and transformations that DSL developers can apply. The result is that DSLs become a thin layer of abstraction atop a common, well supported, mature and maintained ecosystem that targets a variety of hardware architectures.
DescriptionHPC systems are at risk of being underutilized due to various resource requirements of applications and the imbalance of utilization among subsystems. This work provides a holistic analysis and view of memory utilization on a leadership computing facility, the Perlmutter system at NERSC, through which we gain insights about the resource usage patterns of the memory subsystem. The results of the analysis can help evaluate current system configurations, offer recommendations for future procurement, provide feedback to users on code efficiency, and motivate research in new architecture and system designs.
DescriptionIn this work, we study the performance portability of offloaded lattice Boltzmann kernels and the trade-off between portability and efficiency. The study is based on a proxy application for the lattice Boltzmann method (LBM). The Kokkos performance portability programming framework (with CUDA or SYCL back end) is used and compared with the native CUDA and native SYCL programming models. The Kokkos library supports the mainstream GPU products on the market. The performance of the code can vary with the acceleration model, number of GPUs, problem scale, propagation pattern, and architecture. Both the Kokkos library and the CUDA toolkit are studied on the ThetaGPU supercomputer (Argonne Leadership Computing Facility). We find that Kokkos (CUDA) has almost the same performance as native CUDA. The automatic data and kernel management in Kokkos may sacrifice some efficiency, but the parallelization parameters can also be tuned in Kokkos to optimize performance.
DescriptionThe Fast Fourier Transform is an essential algorithm of modern computational science. The highly parallel structure of the FFT allows for its efficient implementation on graphics processing units (GPUs), which are now widely used for general-purpose computing. This poster presents VkFFT, an efficient GPU-accelerated multidimensional Fast Fourier Transform library for Vulkan/CUDA/HIP/OpenCL/Level Zero projects. VkFFT aims to provide the community with a cross-platform open-source alternative to vendor-specific solutions while achieving comparable or better performance. This poster presents the optimizations implemented in VkFFT and compares its performance and precision against NVIDIA's cuFFT and AMD's rocFFT libraries on their latest HPC GPUs. This poster also presents the first performant implementation of Discrete Cosine Transforms on GPUs. VkFFT is released under the MIT license.
DescriptionApplications can experience significant performance differences when run on different architectures. For example, GPUs are often utilized to accelerate an application over its CPU implementation. Understanding how performance changes across platforms is vital to the design of hardware, systems software, and performance critical applications. However, modeling the relationship between systems and performance is difficult as run time data needs to be collected on each platform. In this poster, we present a methodology for predicting the relative performance of an application across multiple systems using profiled performance counters and deep learning.
DescriptionScientific software in high performance computing is becoming increasingly complex both in terms of its size and the number of external dependencies. Correctness and performance issues can become more challenging in actively developed software with increasing complexity. This leads to software developers having to spend larger portions of their time on debugging, optimizing, and maintaining code. Making software optimization and maintenance easier for developers is paramount to accelerating the rate of scientific progress. Fortunately, there is a wealth of data on scientific coding practices available implicitly via version control histories. These contain the state of a code at each stage throughout its development via commit snapshots. Commit snapshots provide dynamic insight into the software development process that static analyses of release tarballs do not. We propose a new machine learning based approach for studying the performance of source code across code modifications.
DescriptionIn this work, we evaluate the performance of unroll and tile, two loop transformations introduced in OpenMP 5.1 and implemented early in Clang 13, on GPUs. Experiments on a common seismic computational kernel demonstrate performance gains on three GPU architectures.
DescriptionBy detecting different animal species reliably at scale, we can protect biodiversity. Yet, traditionally, biodiversity data has been collected by expert observers, which is prohibitively expensive and neither reliable nor scalable. Automated species detection via machine learning is promising, but it is constrained by the need for large training data sets labeled entirely by human experts. Here, we propose to use Self-Supervised Learning for studying semantic features from passively collected acoustic data. We utilized a joint embedding configuration to acquire features from spectrograms. We processed recordings from ∼190 hours of audio. In order to process these volumes of data, we utilized an HPC cluster provided by the Argonne Leadership Computing Facility. We analyzed the output space from a trained backbone, which highlights important semantic attributes of the spectrograms. We envisage these preliminary results as compelling for future automatic assistance of biologists as a pre-processing stage for labeling very large data sets.
DescriptionIt is difficult to implement a CNN for edge processing in satellites, automobiles, and similar settings, where machine resources and power are limited. FPGAs can meet the machine resource and power constraints associated with CNNs: they have low power consumption, but limited machine resources. Quantization Neural Networks use a lower bit depth for their parameters than CNNs and have better estimation accuracy than BNNs.
Although CNNs for regression problems are rarely implemented on FPGAs, our study implemented debris pose estimation on an FPGA using recent edge technology such as quantization neural networks. Pose estimations were run on a workstation using 32-bit floating-point precision and on an FPGA using 8-bit integer precision. The average errors were 4.98% and 5.38%, respectively. This demonstrates that the regression problem can be transferred to an FPGA without a significant loss of accuracy. The FPGA power efficiency is more than 218k times that of the workstation implementation.
DescriptionGlobal pandemics can wreak havoc and lead to significant social, economic and personal losses. Preventing the spread of infectious diseases requires interventions at different levels needing the study of potential impact and efficacy of those preemptive measures. Modeling epidemic diffusion and possible interventions can help us in this goal. Agent-based models have been used effectively in the past to model contagion processes. We present Loimos, a highly parallel simulation of epidemic diffusion written on top of the Charm++ asynchronous task-based system. Loimos uses a hybrid time-stepped and discrete-event simulation to model disease spread. We demonstrate that our implementation of Loimos is able to efficiently utilize a large number of cores on different HPC platforms, namely, we scale to about 32k cores on Theta at ALCF and about 4k cores on Cori at NERSC.
DescriptionHistorical temperature measurements are the basis of important global climate datasets like HadCRUT4 and HadCRUT5, which are used to analyze climate change. These datasets contain many missing values and have low-resolution grids. Here we demonstrate that artificial intelligence can skillfully fill these observational gaps and increase the resolution of these datasets when combined with numerical climate model data. We show that recently developed image inpainting techniques perform accurate reconstructions via transfer learning. In addition, high resolution in weather and climate data has long been a goal of the community. We obtain a neural network which reconstructs and downscales the important observational data sets (IPCC AR6) at the same time, which is unique and state-of-the-art in climate research.
DescriptionThe k-nearest neighbor search is used in various applications such as machine learning, computer vision, database search, and information retrieval. Because the computational cost of exact nearest neighbor search is enormous, approximate nearest neighbor search (ANNS) has attracted much attention. IVFPQ is one of the ANNS methods. Although we can leverage the high bandwidth and low latency of shared memory to compute the search phase of IVFPQ on NVIDIA GPUs, the throughput can degrade due to shared memory bank conflicts. To reduce bank conflicts and improve the search throughput, we propose a custom 8-bit floating point value format. This format doesn’t have a sign bit and can be converted from/to FP32 with a few instructions. We use this format for IVFPQ on GPUs and get better performance without significant recall loss compared to FP32 and FP16.
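To illustrate how a sign-less 8-bit floating point value can be converted from and to FP32 with only a subtraction and a shift, here is a minimal sketch. The bit layout (5 exponent bits, 3 mantissa bits), the exponent offset, and the function names are assumptions made for illustration; the abstract does not specify the actual format.

```python
import struct

# Hypothetical layout (an assumption, not taken from the abstract):
# no sign bit, 5 exponent bits, 3 mantissa bits.
EXP_OFFSET = 112    # FP32 biased exponent that maps to 8-bit exponent field 0
DROPPED_BITS = 20   # 23 FP32 mantissa bits minus the 3 mantissa bits kept


def fp32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def encode_u8(x: float) -> int:
    """Encode a non-negative value into the hypothetical 8-bit format."""
    if x < 0.0:
        raise ValueError("format has no sign bit; input must be non-negative")
    # Rebase the exponent, then drop the low mantissa bits (truncation).
    shifted = fp32_bits(x) - (EXP_OFFSET << 23)
    code = shifted >> DROPPED_BITS if shifted > 0 else 0
    return min(code, 0xFF)  # saturate values that are too large


def decode_u8(code: int) -> float:
    """Decode an 8-bit code back to a Python float via the FP32 bit pattern."""
    bits = (code << DROPPED_BITS) + (EXP_OFFSET << 23)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

With these assumed constants, `encode_u8(1.5)` yields 124 and `decode_u8(124)` yields 1.5. Because the sketch truncates rather than rounds, the relative error stays below 12.5%, and zero or very small inputs decode to the smallest representable value, 2^-15.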
DescriptionMonitoring the status of large computing systems is essential to identify unexpected behavior and improve their performance and up-time. However, due to the large-scale and distributed design of such computing systems as well as the large number of monitoring parameters, automated monitoring methods should be applied. Such automatic monitoring methods should also be able to adapt to the continuous changes in the computing system. In addition, they should be able to identify behavioral anomalies in a timely manner so that appropriate action can be taken. This work proposes a general, light-weight, and unsupervised method for near real-time anomaly detection using operational data measurements on large computing systems. The proposed model requires as little as 4 hours of data and 50 epochs for each training process to accurately resemble the behavioral pattern of computing systems.
DescriptionMissing climatological data is a general problem in climate research that leads to uncertainty in prediction models that rely on these data resources. So far, existing approaches for infilling missing precipitation data are mostly numerical or statistical techniques that require time-consuming computations and are not suitable for large regions with missing data. Recent machine learning techniques have proven to perform well on infilling missing temperature or satellite data. However, these techniques consider only spatial variability in the data, whereas precipitation data is much more variable in both space and time. We propose a convolutional inpainting network that additionally considers temporal variability and atmospheric parameters in the data. The model was trained and evaluated on the RADOLAN data set over Germany. Since training on this high-resolution data set requires a large amount of computational resources, we apply distributed training on an HPC system to maximize the performance.
DescriptionSupraventricular Tachycardia (SVT) is when the heart’s upper chambers beat either too quickly or out of rhythm with the heart’s lower chambers. This out-of-step heart beating is a leading cause of strokes, heart attacks, and heart failure. The most successful treatment for SVT is catheter ablation, a process where an electrophysiologist (EP) maps the heart to find areas with abnormal electrical activity. The EP then runs a catheter into the heart to burn the abnormal area, blocking the electrical signals. Much is not known about what triggers SVT and where to place scar tissue for optimal patient outcomes. We have produced a dynamic model of the right atrium accelerated on NVIDIA GPUs. An interface allows researchers to insert ectopic signals into the simulated atria and ablate sections of the atria allowing them to rapidly gain insight into what causes SVTs and how to terminate them.
DescriptionMemory management APIs like Umpire were created to solve the memory constraints for applications running on heterogeneous HPC systems. At Lawrence Livermore National Laboratory (LLNL), many application codes utilize the memory management capabilities of Umpire. This study focuses on one such code, a high explosive equation of state chemistry application from LLNL. This code uses Umpire’s memory pools in order to allocate all required memory at once instead of many times throughout the code. The performance of memory pools varies widely and depends upon how the blocks of memory within the pool are managed. We conducted several experiments that tested different strategies to manage allocations within a memory pool in order to study the impact on performance. Our experiments demonstrate how this performance varies, from causing an application to run out of memory prematurely to reducing peak memory usage by 64%, depending upon that management strategy.
DescriptionReal-world HPC workloads impose a lot of pressure on storage systems as they are highly data dependent. On the other hand, as a result of recent developments in storage hardware, it is expected that the storage diversity in upcoming HPC systems will grow. This growing complexity in the storage system presents challenges to users, and often results in I/O bottlenecks due to inefficient usage. There have been several studies on reducing I/O bottlenecks. The earliest attempts worked to solve this problem by combining I/O characteristics with expert insight. The recent attempts rely on the performance analysis from the I/O characterization tools. However, the problem is multifaceted with many metrics to consider, hence difficult to do manually, even for experts. In this work, we develop a methodology that produces a multifaceted view of the I/O behavior of a workload to identify potential I/O bottlenecks automatically.
DescriptionMulti-precision methods commonly follow an approximate-iterate scheme: they first obtain an approximate solution from a low-precision factorization and solve, then iteratively refine the solution to the desired accuracy, which is often as high as what is possible with traditional approaches. Targeting symmetric/Hermitian eigenvalue problems of the form Ax=(lambda)x, we revisited the SICE algorithm and, by applying the Sherman-Morrison formula to the diagonally shifted tridiagonal systems, propose an updated SICE-SM algorithm. We exploited asynchronous scheduling techniques to take advantage of the new computational graph enabled by the use of mixed precision in the eigensolver. By incorporating the latest two-stage algorithms from the PLASMA and MAGMA software libraries for numerical linear algebra, we achieved up to 3.6x speedup using the mixed-precision eigensolver with the blocked SICE-SM algorithm for iterative refinement, compared with full double complex precision solvers, for cases where only a portion of the eigenvalues and eigenvectors is requested.
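For reference, the Sherman-Morrison formula named in this abstract gives the inverse of a rank-one update of an invertible matrix. The symbols A, u, and v below are generic; how SICE-SM maps them onto the diagonally shifted tridiagonal systems is not stated in the abstract.

```latex
(A + u v^{\mathsf{T}})^{-1}
  = A^{-1} - \frac{A^{-1} u \, v^{\mathsf{T}} A^{-1}}{1 + v^{\mathsf{T}} A^{-1} u},
\qquad 1 + v^{\mathsf{T}} A^{-1} u \neq 0 .
```

The practical appeal is that a system with the updated matrix can be solved by reusing solves with A alone, without factorizing the updated matrix from scratch.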
DescriptionThe SOLLVE V&V suite tests new OpenMP features to visualize compiler and system compliance. Systems include Oak Ridge National Laboratory's (ORNL) Summit and Crusher systems, as well as the National Energy Research Scientific Computing Center's (NERSC) Perlmutter system.
DescriptionIn recent years, despite remarkable progress in computing and network performance, HPC platforms have struggled to maintain satisfactory I/O throughput. Various solutions have been proposed to mitigate the contention and variability experienced by more and more concurrent applications, particularly on heavily shared parallel file systems. In consequence, many large scale platforms now offer complex hierarchies of storage resources using diverse architectures based on different hardware technologies such as persistent memories or flash. In that context, we propose to study how to efficiently allocate these heterogeneous storage resources. In our poster, we introduce StorAlloc, a modular and extensible simulator of a storage-aware job-scheduler. We present the design of the tool before showing through the concrete example of the dimensioning of a partition of burst buffers the insights StorAlloc can provide in terms of storage system design and resource scheduling algorithms.
DescriptionWe present a modern C++20 interface for MPI 4.0. The interface utilizes recent language features to ease development of MPI applications. An aggregate reflection system enables generation of MPI data types from user-defined classes automatically. Immediate and persistent operations are mapped to futures, which can be chained to describe sequential asynchronous operations and task graphs in a concise way. This work introduces the prominent features of the interface with examples. We further measure its performance overhead with respect to the raw C interface.
DescriptionAccurate modeling of turbulent hypersonic flows has tremendous scientific and commercial value, and applies to atmospheric flight, supersonic combustion, materials discovery and climate prediction. In this poster, we describe our experiences in extending the capabilities of and modernizing CRoCCo, an MPI-based, CPU-only compressible computational fluid dynamics code. We extend CRoCCo to support block-structured adaptive mesh refinement using a highly-scalable AMR library, AMReX, and add support for a fully curvilinear solver. We also port the computational kernels in CRoCCo to NVIDIA GPUs to enable scaling on modern exascale systems. We present our techniques for overcoming performance challenges and evaluate the updated code, CRoCCo-AMR, on the Summit system, demonstrating a 5× to 24× speedup over the CPU-only version.
DescriptionA large body of approaches has been proposed to analyze the resilience of HPC applications. However, existing studies rarely address the challenge of interpreting the analysis results. Specifically, resilience analysis techniques produce a massive volume of unstructured data, making it difficult to conduct the resilience analysis. Furthermore, different analysis models produce diverse results at multiple levels of detail, which makes it hard to compare and explore the resilience of HPC program execution. To this end, we present VISILIENCE, an interactive VISual resILIENCE analysis framework that helps programmers analyze the resilience of HPC applications. In particular, VISILIENCE uses a Control Flow Graph (CFG) visualization to present a function execution. In addition, three widely-used models for resilience analysis (i.e., Y-Branch, IPAS, and TRIDENT) are seamlessly embedded into the framework for resilience analysis and result comparison. Case studies have been conducted to demonstrate the effectiveness of our proposed framework VISILIENCE.
DescriptionA new machine learning-based non-destructive testing (NDT) technique for the examination of conductive objects is presented. NDT of objects behind barriers utilizes the defect-induced distortions of electromagnetic (EM) fields to detect flaws in the structure of inspected targets. Such distortions are highly non-linear, requiring significant amounts of data for training neural networks. To this end, a massively parallelized data generation framework is proposed in conjunction with a multi-frequency hybrid neural network (MF-HNN), to create a physics-informed inversion AI model. The resulting inversion algorithm is evaluated on casings, where tubular pipes are inspected. For data generation, physics-based solvers are employed to simulate the EM field distribution resulting from pipes with defects. The large-scale distribution of this step leads to 43 times faster execution than on a single CPU. This allows the MF-HNN to achieve significantly improved generalization performance and to generate high-resolution cross-sectional images of the pipelines.
DescriptionMassive Multiple-Input-Multiple-Output is a crucial technology for Next-Generation networks (Next-G). It uses hundreds of antennas at transceivers to exchange data. However, its accurate signal detection relies on solving an NP-hard optimization problem within real-time latency constraints.
In this poster, we propose a new GPU-based detection algorithm that demonstrates the positive impact of low-precision arithmetic with multiple GPUs to achieve Next-G latency/scalability/accuracy requirements. Our approach iteratively extends a solution with several symbols representing the best combination out of the aggregated levels. The computation at each iteration is formulated as a matrix multiplication operation to leverage GPU architectures.
The results obtained on an A100 GPU show a 1.7x improvement by exploiting half-precision arithmetic without loss in accuracy. Furthermore, our low-precision multi-GPU version with four A100 GPUs is 4x faster than the single-precision single-GPU version and 40x faster than a similar parallel CPU implementation executed on a two-socket 28-core IceLake CPU with 56 threads.
DescriptionServices on edge and fog systems require mobility owing to user or data mobility and the necessity of relocating to the cloud upon oversubscription. More specifically, live migration of containerized microservices is required for service mobility, elasticity, and load balancing purposes. Although container runtimes and orchestrators recently provided native live migration support, they do not allow migration across autonomous computing systems with heterogeneous orchestrators. Our hypothesis is that non-native and non-invasive support for live container migration is the need of the hour and can unlock several new use cases. We develop a non-native and non-invasive live container migration method leveraging the nested container runtime. We design the architecture and develop the solution to enable container migration across heterogeneous orchestrators. We evaluate the performance against other approaches. We observe that for microservices smaller than 512 MiB, the nested container runtime approach can be implemented within an acceptable overhead.
DescriptionChimbuko is a framework for detecting real-time performance anomalies incurred by large-scale applications. Understanding the source of anomalous behaviors is difficult due to the high volume of information stored by Chimbuko in a provenance database. This undergraduate research project aims to intuitively display this high volume of information without overwhelming users. We then integrate our analysis and visualization techniques into a publicly available framework called Dashing. This project facilitates interactive user investigation of anomaly provenance in large-scale applications.
DescriptionWe present a strategy for GPU acceleration of a multiphase compressible flow solver that brings us closer to exascale computing. Given the memory-bound nature of most CFD problems, one must be prudent in implementing algorithms and offloading work to accelerators for efficient use of resources. Through careful choice of OpenACC decorations, we achieve 46% of peak GPU FLOPS on the most expensive kernel, leading to a 500-times speedup on an NVIDIA A100 compared to 1 modern Intel CPU core. The implementation also demonstrates ideal weak scaling for up to 13824 GPUs on OLCF Summit. Strong scaling behavior is typical but improved by reduced communication times via CUDA-aware MPI.
DescriptionThe Influence Maximization (IM) problem on a social network is the problem of identifying a small cohort of vertices that, when initially activated, results in a cascading effect that will activate the maximum expected number of other vertices in the network. While the problem is NP-hard under budget constraints, it has a submodular structure that leads to efficient approximation.
In this work, we present techniques and our performance analysis that we are using to drive the design of efficient FPGA acceleration for the seed selection step within the IMM algorithm. Currently, we are able to achieve from 0.75x to 4.78x speedup, with the main bottleneck being a static overhead determined by the size of the input graph. We discuss future work to improve on the current architecture, and hope to provide techniques for making "almost-regular" applications fast and efficient on FPGAs.
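The seed selection step can be pictured as a greedy maximum-coverage pass over sampled reverse-reachable (RRR) sets, which is how the submodular structure leads to an efficient approximation. The following is a minimal CPU-side sketch of that idea, not the FPGA design described in the poster; the function name, the input representation (a list of vertex sets), and the tie-breaking rule are assumptions for illustration.

```python
from collections import defaultdict


def greedy_seed_selection(rrr_sets, k):
    """Pick up to k seeds that greedily cover the most RRR sets.

    rrr_sets: list of sets of vertex ids (assumed to be sampled beforehand).
    Returns the chosen seeds and the fraction of RRR sets they cover.
    """
    # Invert the samples: vertex -> indices of the RRR sets that contain it.
    covers = defaultdict(set)
    for idx, rrr in enumerate(rrr_sets):
        for vertex in rrr:
            covers[vertex].add(idx)

    covered = set()
    seeds = []
    for _ in range(k):
        best, best_gain = None, 0
        # Sorted iteration makes ties deterministic (lowest vertex id wins).
        for vertex in sorted(covers):
            if vertex in seeds:
                continue
            gain = len(covers[vertex] - covered)  # marginal coverage
            if gain > best_gain:
                best, best_gain = vertex, gain
        if best is None:  # no remaining vertex adds coverage
            break
        seeds.append(best)
        covered |= covers[best]

    fraction = len(covered) / len(rrr_sets) if rrr_sets else 0.0
    return seeds, fraction
```

For example, with the RRR sets `[{1, 2}, {2, 3}, {3}, {4}, {2}]` and `k = 2`, the sketch selects vertex 2 first (it appears in three sets) and then vertex 3, covering four of the five sets. The irregular, data-dependent set lookups in the inner loop hint at why such a step is only "almost-regular" from a hardware point of view.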
DescriptionThis study aimed to employ deep learning and scalable computing to create a model that predicts the velocity of a strained turbulent flow. The turbulent flow was generated in a laboratory. The turbulence intensity of the flow is controlled via impeller rotation speed. The mean strain rate is generated by two circular plates that an actuator moves toward each other in the center of the measuring area. The dynamics of the particles are measured using high-speed Lagrangian Particle Tracking at 10,000 frames per second. Measured data from the experiment were employed to design a gated recurrent unit model. Two powerful parallel computing machines, JUWELS and DEEP-EST, were employed to implement the model. Velocity forecasting with the gated recurrent network yields promising results. The scalability of the computing machines using GPUs significantly reduces this model's computing time, which strengthens the ability to predict turbulent flow.
DescriptionA jet of fluid -- when we open a garden hose, for instance -- exhibits a rich tapestry of flow physics, including the rupture of fluid films and a cascade of filament and droplet breakup and coalescence. In addition to its breathtaking beauty, this jet atomization is a critical component for a broad spectrum of energy and healthcare applications. Simulating and visualizing jet atomization is an ideal way to understand and control this phenomenon. However, the multiscale nature of jet atomization makes this a very challenging problem. Here, we visualize one of the highest resolution simulation datasets of this phenomenon. The dataset consists of over 120,000 time steps of an adaptively resolved spatial mesh with length scales. We describe the parallel workflow and associated challenges while visualizing the time evolution of the jet. We show how this visualization produces a deep qualitative understanding of fluid dynamics from the outputs of these massive simulations.
DescriptionHigh Performance Computing (HPC) critically underpins the design of aero-engines. With global emissions targets, engine designs require a fundamental change, including designs utilizing sustainable aviation fuels and electric/hybrid flight. Virtual certification of designs with HPC is recognized as a key technology to meet these challenges, but requires analysis on models with higher fidelity, using ultra-large scale executions. In this explanatory SC-SciVis showcase, we present results from time-accurate simulations of a 4.6B-element full 360-degree model of a production-representative gas turbine engine compressor, the Rig250 at DLR. This represents a grand challenge problem, at the fidelity required for virtual certification standards. The results are achieved through Rolls-Royce's Hydra CFD suite on ARCHER2. The compressor is visualized under off-design conditions, demonstrating flow contours of velocity, Mach number and iso-surfaces of vorticity. The level of detail and the HPC simulations leading to the visualizations demonstrate a step-change towards achieving virtual certification objectives under production settings.
DescriptionIn the United States, fossil-fuel related industrial processes account for approximately half of all greenhouse gas emissions. Chemical Looping Reactors (CLRs) provide a promising path to reducing carbon emissions; however, scale-up and testing of these systems is expensive and time-consuming. In our video, we focus on understanding bubble dynamics in fluidized beds of Chemical Looping Reactors as simulated by the MFIX-Exa code, including the importance of Los Alamos National Laboratory’s in situ feature detection algorithm and the use of the Cinema visualization tool in the post hoc workflow. MFIX-Exa provides new computing capabilities needed to combine CFD-DEM simulation with computing at the exascale via an adaptive mesh refinement (AMReX) framework.
DescriptionMarine macroalgae in the Gulf of Mexico is an important potential source for biofuel. However, identifying locations with the correct biogeochemical and hydrodynamic conditions for cultivation on a large enough scale to meet the needs of the U.S. private energy sector is impossible from purely observational studies. Large-scale HPC modeling of earth system processes enables researchers to study complex physical relationships with high fidelity. Here, we present novel visualization techniques showing the results of a global run of E3SM's MPAS-Ocean model with biogeochemistry extensions to improve ongoing research in macroalgae cultivation.
DescriptionThe Advanced Visualization Lab at the National Center for Supercomputing Applications created a cinematic scientific visualization of the ArcticDEM survey and Vavilov ice cap collapse for the documentary film "Atlas of a Changing Earth", in both digital fulldome and flatscreen television formats. While the ArcticDEM dataset is the main one featured here, this visualization fills in gaps using other datasets, including a climate simulation by Bates et al and Landsat imagery. The visualization required a number of steps including: both manual and algorithmic data cleaning, processing, and alignment; data fusion; virtual scene design; morphing interpolation; lighting design; camera choreography; compositing; and rendering on the Blue Waters supercomputer.
DescriptionThis explanatory visualization shows the results of a state-of-the-art 3D simulation of a supernova explosion and neutron-star birth. It is a rare instance where the full stellar evolution of an object, including the physics of the convection and the radiation, has been simulated in three dimensions. Among the highlights is the deep core that is shrinking after explosion due to neutrino cooling and deleptonization on its way to becoming a cold, compact neutron star. There is also evidence of inner proto-neutron star convection, perhaps the site of magnetic dynamo action that can turn a pulsar into a magnetar. An exterior view shows the blast wave, which cocoons the newly-birthed neutron star, moving at ∼10,000 km/s. Additionally, a reusable pipeline was developed, which leverages state-of-the-art tools for scientific data analysis and visualization, resulting in high-quality renderings.