Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies’ characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.
Solving dense systems of linear equations is essential in applications encountered in physics, mathematics, and engineering. This paper describes our current efforts toward the development of the ADELUS package for current and next generation distributed, accelerator-based, high-performance computing platforms. The package solves dense linear systems using partial pivoting LU factorization on distributed-memory systems with CPUs/GPUs. The matrix is block-mapped onto distributed memory on CPUs/GPUs and is solved as if it were torus-wrapped for an optimal balance of computation and communication. A permutation operation is performed to restore the results so the torus-wrap distribution is transparent to the user. This package targets performance portability by leveraging the abstractions provided in the Kokkos and Kokkos Kernels libraries. A comparison of the performance gains versus the state-of-the-art SLATE and DPLASMA GESV functionalities on the Summit supercomputer is provided. Preliminary performance results from large-scale electromagnetic simulations using ADELUS are also presented. The solver achieves 7.7 Petaflops on 7600 GPUs of the Sierra supercomputer, translating to 16.9% efficiency.
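The load-balancing argument for a torus-wrapped layout can be illustrated with a small sketch. It is not the ADELUS implementation: the function names, the 8x8 matrix, and the 2x4 process grid are assumptions chosen for illustration, and the wrap is applied per entry rather than per block. The sketch counts how many entries of the trailing submatrix (the part still active after some elimination steps) each process owns under a wrapped mapping and under a contiguous block mapping.

```python
def torus_wrap_owner(i, j, proc_rows, proc_cols):
    """Owner of entry (i, j) when rows and columns wrap around the process grid."""
    return (i % proc_rows, j % proc_cols)


def block_owner(i, j, n, proc_rows, proc_cols):
    """Owner of entry (i, j) when each process holds one contiguous block."""
    return (i // (n // proc_rows), j // (n // proc_cols))


def active_load(n, k, owner):
    """Count trailing-submatrix entries (i >= k, j >= k) held by each process."""
    load = {}
    for i in range(k, n):
        for j in range(k, n):
            p = owner(i, j)
            load[p] = load.get(p, 0) + 1
    return load


# Hypothetical sizes: 8x8 matrix, 2x4 process grid, 4 elimination steps done.
n, pr, pc, k = 8, 2, 4, 4
wrapped = active_load(n, k, lambda i, j: torus_wrap_owner(i, j, pr, pc))
blocked = active_load(n, k, lambda i, j: block_owner(i, j, n, pr, pc))
print("processes still busy (wrapped):", len(wrapped))
print("processes still busy (blocked):", len(blocked))
```

In this toy setting, the wrapped mapping keeps every process working on the trailing submatrix, while the contiguous mapping leaves most of the grid idle as the factorization proceeds.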
Persistent memory (PMEM) devices can achieve comparable performance to DRAM while providing significantly more capacity. This has made the technology compelling as an expansion to main memory. Rethinking PMEM as a storage device can offer a high performance buffering layer for HPC applications to temporarily but safely store data. However, modern parallel I/O libraries, such as HDF5 and pNetCDF, are complicated and introduce significant software and metadata overheads when persisting data to these storage devices, wasting much of their potential. In this work, we explore the potential of PMEM as storage through pMEMCPY: a simple, lightweight, and portable I/O library for storing data in persistent memory. We demonstrate that our approach is up to 2x faster than other popular parallel I/O libraries under real workloads.
Gate set tomography (GST) is a protocol for detailed, predictive characterization of logic operations (gates) on quantum computing processors. Early versions of GST emerged around 2012-13, and since then it has been refined, demonstrated, and used in a large number of experiments. This paper presents the foundations of GST in comprehensive detail. The most important feature of GST, compared to older state and process tomography protocols, is that it is calibration-free. GST does not rely on pre-calibrated state preparations and measurements. Instead, it characterizes all the operations in a gate set simultaneously and self-consistently, relative to each other. Long sequence GST can estimate gates with very high precision and efficiency, achieving Heisenberg scaling in regimes of practical interest. In this paper, we cover GST’s intellectual history, the techniques and experiments used to achieve its intended purpose, data analysis, gauge freedom and fixing, error bars, and the interpretation of gauge-fixed estimates of gate sets. Our focus is fundamental mathematical aspects of GST, rather than implementation details, but we touch on some of the foundational algorithmic tricks used in the pyGSTi implementation.
To meet the extreme compute demands for deep learning across commercial and scientific applications, dataflow accelerators are becoming increasingly popular. While these “domain-specific” accelerators are not fully programmable like CPUs and GPUs, they retain varying levels of flexibility with respect to data orchestration, i.e., dataflow and tiling optimizations to enhance efficiency. There are several challenges when designing new algorithms and mapping approaches to execute the algorithms for a target problem on new hardware. Previous works have addressed these challenges individually. To address these challenges as a whole, in this work, we present a HW-SW codesign ecosystem for spatial accelerators called Union within the popular MLIR compiler infrastructure. Our framework allows exploring different algorithms and their mappings on several accelerator cost models. Union also includes a plug-and-play library of accelerator cost models and mappers which can easily be extended. The algorithms and accelerator cost models are connected via a novel mapping abstraction that captures the map space of spatial accelerators, which can be systematically pruned based on constraints from the hardware, workload, and mapper. We demonstrate the value of Union for the community with several case studies which examine offloading different tensor operations (CONV/GEMM/Tensor Contraction) on diverse accelerator architectures using different mapping schemes.
Deep neural networks (DNNs) have achieved state-of-the-art performance across a variety of traditional machine learning tasks, e.g., speech recognition, image classification, and segmentation. The ability of DNNs to efficiently approximate high-dimensional functions has also motivated their use in scientific applications, e.g., to solve partial differential equations and to generate surrogate models. In this paper, we consider the supervised training of DNNs, which arises in many of the above applications. We focus on the central problem of optimizing the weights of the given DNN such that it accurately approximates the relation between observed input and target data. Devising effective solvers for this optimization problem is notoriously challenging due to the large number of weights, nonconvexity, data sparsity, and nontrivial choice of hyperparameters. To solve the optimization problem more efficiently, we propose the use of variable projection (VarPro), a method originally designed for separable nonlinear least-squares problems. Our main contribution is the Gauss–Newton VarPro method (GNvpro) that extends the reach of the VarPro idea to nonquadratic objective functions, most notably cross-entropy loss functions arising in classification. These extensions make GNvpro applicable to all training problems that involve a DNN whose last layer is an affine mapping, which is common in many state-of-the-art architectures. In our four numerical experiments from surrogate modeling, segmentation, and classification, GNvpro solves the optimization problem more efficiently than commonly used stochastic gradient descent (SGD) schemes. Also, GNvpro finds solutions that generalize well, and in all but one example better than well-tuned SGD methods, to unseen data points.
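The variable projection idea referred to above can be sketched in its original separable least-squares setting. This is not GNvpro: the one-parameter model w * exp(-theta * x), the synthetic data, and the grid search over theta are all assumptions made for illustration, whereas the paper applies a Gauss-Newton scheme to DNN training. The point shown is that the linear weight w is eliminated in closed form, so the outer search only runs over the nonlinear parameter theta.

```python
import math


def optimal_weight(theta, xs, ys):
    """Closed-form least-squares weight w for the model w * exp(-theta * x)."""
    phi = [math.exp(-theta * x) for x in xs]
    return sum(p * y for p, y in zip(phi, ys)) / sum(p * p for p in phi)


def reduced_objective(theta, xs, ys):
    """Least-squares loss after projecting out the linear weight w."""
    w = optimal_weight(theta, xs, ys)
    return 0.5 * sum((w * math.exp(-theta * x) - y) ** 2 for x, y in zip(xs, ys))


# Synthetic data generated with w = 2.0 and theta = 0.5 (illustrative values).
xs = [float(i) for i in range(10)]
ys = [2.0 * math.exp(-0.5 * x) for x in xs]

# Outer search over the nonlinear parameter only; a coarse grid stands in
# for the Gauss-Newton iteration used in the paper.
candidates = [i / 10 for i in range(1, 11)]
best_theta = min(candidates, key=lambda t: reduced_objective(t, xs, ys))
best_w = optimal_weight(best_theta, xs, ys)
print(best_theta, best_w)
```

For a DNN whose last layer is affine, the last-layer weights play the role of w here and the remaining weights play the role of theta.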
In this work we evaluated the effects that equations of state (EOS) and strength models have on shaped charge jet (SCJ) development using the Sandia National Laboratories multiphysics shock code, ALEGRA. Results were quantified using a Lagrangian tracer particle following liner collapse, passing through the compression zone, and flowing into the jet tip. We found consistent results among several EOS: 3320, 3331, and 3337. The 3325 EOS generated a measurable low-density, hollow region near the jet tip, which appears to be reflected in a lower internal energy of the jet. At this time, we cannot tell, experimentally, if such a hollow region exists. The 3337 EOS is recent, well documented [6], and produces results similar to 3320 [3]. The various strength models produced more noticeable differences. In terms of internal energy and temperature, SGL had the largest values followed by PTW, ZA, and finally JC and MTS, which were quite similar to each other. We looked at melt conditions in the SGL and JC models using the 3337 EOS. The SGL model reported a liquid region along the jet axis all the way to the tip, seemingly consistent with experiment, while the JC model did not indicate any phase transition. None of the other yield models indicated melt along the jet axis. For all EOS and strength models, we found similar results for the velocity history of the jet tip as measured against experiment using photon Doppler velocimetry.
Microstructure reconstruction is a long-standing problem in experimental and computational materials science, and numerous attempts have been made to solve it. However, the majority of approaches treat microstructure as discrete phases, which, in turn, reduces the quality of the resulting microstructures and limits their usage to the computational level of fidelity rather than the experimental level of fidelity. In this work, we applied our previously proposed approach [41] to generate synthetic microstructure images at the experimental level of fidelity for the UltraHigh Carbon Steel DataBase (UHCSDB) [13].
The FAIR principles of open science (Findable, Accessible, Interoperable, and Reusable) have had transformative effects on modern large-scale computational science. In particular, they have encouraged more open access to and use of data, an important consideration as collaboration among teams of researchers accelerates and the use of workflows by those teams to solve problems increases. How best to apply the FAIR principles to workflows themselves, and software more generally, is not yet well understood. We argue that the software engineering concept of technical debt management provides a useful guide for application of those principles to workflows, and in particular that it implies reusability should be considered as 'first among equals'. Moreover, our approach recognizes a continuum of reusability where we can make explicit and selectable the tradeoffs required in workflows for both their users and developers. To this end, we propose a new abstraction approach for reusable workflows, with demonstrations for both synthetic workloads and real-world computational biology workflows. Through application of novel systems and tools that are based on this abstraction, these experimental workflows are refactored to rightsize the granularity of workflow components to efficiently fill the gap between end-user simplicity and general customizability. Our work makes it easier to selectively reason about and automate the connections between trade-offs across user and developer concerns when exposing degrees of freedom for reuse. Additionally, by exposing fine-grained reusability abstractions we enable performance optimizations, as we demonstrate on both institutional-scale and leadership-class HPC resources.
In power grid operation, optimal power flow (OPF) problems are solved several times per day to find economically optimal generator setpoints that balance given load demands. Ideally, we seek an optimal solution that is also “N-1 secure”, meaning the system can absorb contingency events such as transmission line or generator failure without loss of service. Current practice is to solve the OPF problem and then check a subset of contingencies against heuristic values, resulting in, at best, suboptimal solutions. Unfortunately, online solution of the OPF problem including the full N-1 contingencies (i.e., two-stage stochastic programming formulation) is intractable for even modest sized electrical grids. To address this challenge, this work presents an efficient method to embed N-1 security constraints into the solution of the OPF by using Neural Network (NN) models to represent the security boundary. Our approach introduces a novel sampling technique, as well as a tuneable parameter to allow operators to balance the conservativeness of the security model within the OPF problem. Our results show that we are able to solve contingency formulations of larger size grids than reported in literature using non-linear programming (NLP) formulations with embedded NN models to local optimality. Solutions found with the NN constraint have marginally increased computational time but are more secure to contingency events.
Rendezvous algorithms encode a communication pattern that is useful when processors sending data do not know who the receiving processors should be, or vice versa. The idea is to define an intermediate decomposition where datums from different sending processors can “rendezvous” to perform a computation, in a manner that both the senders and eventual receivers of the results can identify the appropriate rendezvous processor. Originally designed for interpolating between overlaid grids with independent parallel decompositions (Plimpton et al., 2004), we have recently found rendezvous algorithms useful for a variety of operations in particle- or grid-based simulation codes when running large problems on large numbers of processors. In particular, we show they can perform well when a load-balanced intermediate decomposition is randomized and not spatial, requiring all-to-all communication to move data between processors. In this case rendezvous algorithms leverage the large bisection communication bandwidths which parallel machines provide. We describe how rendezvous algorithms work in a scientific computing context and give specific examples for molecular dynamics and Direct Simulation Monte Carlo codes which result in dramatic performance improvements versus simpler algorithms which do not scale as well. We explain how a generic rendezvous algorithm can be implemented, and also point out similarities with the MapReduce paradigm popularized by Google and Hadoop.
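The communication pattern described above can be sketched in a single process, with lists standing in for the message buffers of each rank. This is a simplified illustration, not the implementation from the paper: the function names, the integer keys, the `key % nprocs` choice of rendezvous rank, and the summation performed at the rendezvous are all assumptions. The essential property is that both the ranks holding data and the ranks wanting results can compute the rendezvous rank from the key alone.

```python
from collections import defaultdict


def rendezvous_rank(key, nprocs):
    """Rank where data and requests for `key` meet (hypothetical modulo rule)."""
    return key % nprocs


def rendezvous_sum(contributions, requests, nprocs):
    """contributions[p]: (key, value) pairs held by rank p.
    requests[p]: keys whose summed value rank p wants to receive."""
    # Phase 1: every rank sends its data and its requests to the rendezvous
    # rank of each key (an all-to-all exchange in a real MPI code).
    inbox_data = defaultdict(list)
    inbox_req = defaultdict(list)
    for items in contributions:
        for key, value in items:
            inbox_data[rendezvous_rank(key, nprocs)].append((key, value))
    for rank, keys in enumerate(requests):
        for key in keys:
            inbox_req[rendezvous_rank(key, nprocs)].append((key, rank))

    # Phase 2: each rendezvous rank computes locally on what it received.
    # Phase 3: it returns each result to the rank that asked for it.
    results = [dict() for _ in range(nprocs)]
    for r in range(nprocs):
        totals = defaultdict(float)
        for key, value in inbox_data[r]:
            totals[key] += value
        for key, requester in inbox_req[r]:
            results[requester][key] = totals[key]
    return results


contributions = [[(7, 1.0), (3, 2.0)], [(7, 4.0)], [(3, 0.5), (10, 1.5)]]
requests = [[10], [3], [7]]
results = rendezvous_sum(contributions, requests, nprocs=3)
print(results)
```

Neither the senders nor the requesters need to know about each other; the similarity to a MapReduce shuffle followed by a reduce is visible in phases 1 and 2.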
Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and diversity of hardware platforms at supercomputing centers has motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.
Network modeling is a powerful tool to enable rapid analysis of complex systems that can be challenging to study directly using physical testing. Two approaches are considered: emulation and simulation. The former runs real software on virtualized hardware, while the latter mimics the behavior of network components and their interactions in software. Although emulation provides an accurate representation of physical networks, this approach alone cannot guarantee the characterization of the system under realistic operative conditions. Operative conditions for physical networks are often characterized by intrinsic variability (payload size, packet latency, etc.) or a lack of precise knowledge regarding the network configuration (bandwidth, delays, etc.); therefore uncertainty quantification (UQ) strategies should also be employed. UQ strategies require multiple evaluations of the system, with a number of evaluation instances that roughly increases with the problem dimensionality, i.e., the number of uncertain parameters. It follows that a typical UQ workflow for network modeling based on emulation can easily become unattainable due to its prohibitive computational cost. In this paper, a multifidelity sampling approach is discussed and applied to network modeling problems. The main idea is to optimally fuse information coming from simulations, which are a low-fidelity version of the emulation problem of interest, in order to decrease the estimator variance. By reducing the estimator variance in a sampling approach it is usually possible to obtain more reliable statistics and therefore a more reliable system characterization. Several network problems of increasing difficulty are presented. For each of them, the performance of the multifidelity estimator is compared with that of its single-fidelity counterpart, namely, Monte Carlo (MC) sampling. For all the test problems studied in this work, the multifidelity estimator demonstrated an increased efficiency with respect to MC.
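One common way to fuse high- and low-fidelity samples is a control-variate estimator, sketched below. This is a generic illustration, not necessarily the estimator used in the paper: the function names, the sample counts, and the two analytic stand-ins for the emulation (high-fidelity) and simulation (low-fidelity) models are assumptions. A few expensive high-fidelity evaluations are corrected by many cheap low-fidelity evaluations, weighted by a coefficient estimated from the shared samples.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def multifidelity_mean(f_hi, f_lo, n_hi, n_lo_extra, rng):
    """Control-variate estimate of E[f_hi] using extra low-fidelity samples."""
    xs = [rng.random() for _ in range(n_hi + n_lo_extra)]
    hi = [f_hi(x) for x in xs[:n_hi]]   # few expensive evaluations
    lo = [f_lo(x) for x in xs]          # many cheap evaluations
    lo_shared = lo[:n_hi]               # low-fidelity values at the shared inputs
    m_hi, m_lo = mean(hi), mean(lo_shared)
    cov = sum((h - m_hi) * (l - m_lo) for h, l in zip(hi, lo_shared))
    var = sum((l - m_lo) ** 2 for l in lo_shared)
    alpha = cov / var if var > 0 else 0.0
    # Shift the high-fidelity mean by the better-resolved low-fidelity mean.
    return m_hi + alpha * (mean(lo) - m_lo)


# Hypothetical stand-ins: an "emulation" response and a cheaper, biased
# "simulation" response that is strongly correlated with it.
def emulation(x):
    return math.sin(3.0 * x) + 0.1 * x * x


def simulation(x):
    return math.sin(3.0 * x)


estimate = multifidelity_mean(emulation, simulation, 20, 480, random.Random(0))
print(estimate)
```

The stronger the correlation between the two models, the more of the high-fidelity sampling error the correction term removes.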
In this work, we show that reduced communication algorithms for distributed stochastic gradient descent improve the time per epoch and strong scaling for the Generalized Canonical Polyadic (GCP) tensor decomposition, but at a cost: achieving convergence becomes more difficult. The implementation, based on MPI, shows that while one-sided algorithms offer a path to asynchronous execution, the performance benefits of optimized allreduce are difficult to best.
2021 IEEE High Performance Extreme Computing Conference, HPEC 2021
Chester, Dean G.; Groves, Taylor; Hammond, Simon; Law, Tim; Wright, Steven A.; Smedley-Stevenson, Richard; Fahmy, Suhaib A.; Mudalidge, Gihan R.; Jarvis, Stephen A.
We present StressBench, a network benchmarking framework written for testing MPI operations and file I/O concurrently. It is designed specifically to execute MPI communication and file access patterns that are representative of real-world scientific applications. Existing tools consider either the worst case congestion with small abstract patterns or peak performance with simplistic patterns. StressBench allows for a richer study of congestion by allowing orchestration of network load scenarios that are representative of those typically seen at HPC centres, something that is difficult to achieve with existing tools. We demonstrate the versatility of the framework from micro benchmarks through to finely controlled congested runs across a cluster. Validation of the results using four proxy application communication schemes within StressBench against parent applications shows a maximum difference of 15%. Using the I/O modeling capabilities of StressBench, we are able to quantify the impact of file I/O on application traffic showing how it can be used in procurement and performance studies.
On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. As the placement changes a job's communication performance, allocation can significantly affect execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limit, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted in allocation, even though it significantly affects job execution times. In this work, we demonstrate using monitoring data to improve HPC systems' performance by proposing a Network-Data-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting the network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic. Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and up to 34%.
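The pairing rule in the last step can be sketched as follows. This is only an illustration of the stated rule, not the NeDD implementation: the function name, the job and node identifiers, the traffic values, and the boolean sensitivity flag are hypothetical, and a real allocator would derive them from router link counters and MPI statistics.

```python
def allocate(jobs, node_traffic):
    """jobs: list of (job_id, network_sensitive, nodes_needed).
    node_traffic: dict mapping node_id to its parent router's traffic level."""
    # Free nodes ordered from quietest to busiest parent router.
    free = sorted(node_traffic, key=node_traffic.get)
    placement = {}
    for job_id, sensitive, count in jobs:
        if count > len(free):
            continue  # not enough free nodes: the job waits in this sketch
        if sensitive:
            # Network-sensitive jobs take the low-traffic end.
            placement[job_id], free = free[:count], free[count:]
        else:
            # Network-insensitive jobs take the high-traffic end.
            split = len(free) - count
            placement[job_id], free = free[split:], free[:split]
    return placement


# Hypothetical monitoring snapshot and job queue.
node_traffic = {"n0": 0.9, "n1": 0.1, "n2": 0.5, "n3": 0.3}
jobs = [("fft", True, 2), ("montecarlo", False, 1), ("big", True, 4)]
placement = allocate(jobs, node_traffic)
print(placement)
```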
We propose a vertical TFET using atomic precision advanced manufacturing (APAM) to create an abrupt buried n++-doped source. We developed a gate stack that preserves the APAM source to accumulate holes above it, with a goal of band-to-band tunneling (BTBT) perpendicular to the gate – critical for the proposed device. A metal-insulator-semiconductor (MIS) capacitor shows hole accumulation above the APAM source, corroborated by simulation, demonstrating the TFET’s feasibility.
In this paper we consider 2D nonlocal diffusion models with a finite nonlocal horizon parameter δ characterizing the range of nonlocal interactions, and consider the treatment of Neumann-like boundary conditions that have proven challenging for discretizations of nonlocal models. We propose a new generalization of classical local Neumann conditions by converting the local flux to a correction term in the nonlocal model, which provides an estimate for the nonlocal interactions of each point with points outside the domain. While existing 2D nonlocal flux boundary conditions have been shown to exhibit at most first order convergence to the local counterpart as δ → 0, the proposed Neumann-type boundary formulation recovers the local case as O(δ²) in the L^∞(Ω) norm, which is optimal considering the O(δ²) convergence of the nonlocal equation to its local limit away from the boundary. We analyze the application of this new boundary treatment to the nonlocal diffusion problem, and present conditions under which the solution of the nonlocal boundary value problem converges to the solution of the corresponding local Neumann problem as the horizon is reduced. To demonstrate the applicability of this nonlocal flux boundary condition to more complicated scenarios, we extend the approach to less regular domains, numerically verifying that we preserve second-order convergence for non-convex domains with corners. Based on the new formulation for the nonlocal boundary condition, we develop an asymptotically compatible meshfree discretization, obtaining a solution to the nonlocal diffusion equation with mixed boundary conditions that converges at the rate O(δ²).
Polynomial preconditioning can improve the convergence of the Arnoldi method for computing eigenvalues. Such preconditioning significantly reduces the cost of orthogonalization; for difficult problems, it can also reduce the number of matrix-vector products. Parallel computations can particularly benefit from the reduction of communication-intensive operations. The GMRES algorithm provides a simple and effective way of generating the preconditioning polynomial. For some problems high degree polynomials are especially effective, but they can lead to stability problems that must be mitigated. A two-level "double polynomial preconditioning" strategy provides an effective way to generate high-degree preconditioners.
Interval Assignment (IA) is the problem of selecting the number of mesh edges (intervals) for each curve for conforming quad and hex meshing. The interval vector x is fundamentally integer-valued, yet many approaches perform floating-point optimization and convert a floating-point solution into an integer solution. We avoid such steps: we start integer, stay integer. Incremental Interval Assignment (IIA) uses integer linear algebra (Hermite normal form) to find an initial solution to the matrix equation Ax = b satisfying the meshing constraints. Solving for reduced row echelon form provides integer vectors spanning the nullspace of A. We add vectors from the nullspace to improve the initial solution. Compared to floating-point optimization approaches, IIA is faster and always produces an integer solution. The potential drawback is that there is no theoretical guarantee that the solution is optimal, but in practice we achieve solutions close to the user goals. The software is freely available.
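The "start integer, stay integer" improvement step can be illustrated on a toy problem. This is a minimal sketch, not the IIA software: the constraint matrix, the initial solution, and the nullspace vectors are written down by hand for a single four-sided surface whose opposite curves must have equal interval counts (the paper computes them with Hermite normal form and reduced row echelon form), and the squared-deviation objective and greedy acceptance rule are assumptions. Because every update adds an integer nullspace vector, Ax = b holds and x stays integer throughout.

```python
def residual(A, x, b):
    """Return A x - b, which must stay zero for a valid assignment."""
    return [sum(a * xi for a, xi in zip(row, x)) - bi for row, bi in zip(A, b)]


def cost(x, goals):
    """Squared deviation from the user's desired interval counts (assumed objective)."""
    return sum((xi - gi) ** 2 for xi, gi in zip(x, goals))


def improve(x, nullspace, goals):
    """Greedily add +/- integer nullspace vectors while the cost decreases."""
    x = list(x)
    improved = True
    while improved:
        improved = False
        for v in nullspace:
            for sign in (1, -1):
                cand = [xi + sign * vi for xi, vi in zip(x, v)]
                # Keep at least one interval per curve and require real progress.
                if min(cand) >= 1 and cost(cand, goals) < cost(x, goals):
                    x, improved = cand, True
    return x


# Toy constraints: x0 = x2 and x1 = x3 (opposite sides of one mapped surface).
A = [[1, 0, -1, 0], [0, 1, 0, -1]]
b = [0, 0]
x_init = [1, 1, 1, 1]                        # integer solution of A x = b
nullspace = [[1, 0, 1, 0], [0, 1, 0, 1]]     # integer vectors with A v = 0
goals = [4.2, 2.6, 3.9, 3.1]                 # hypothetical user goals

x_final = improve(x_init, nullspace, goals)
print(x_final, residual(A, x_final, b))
```

As the abstract notes for IIA itself, a greedy scheme of this kind carries no optimality guarantee; it only ensures that every intermediate solution is feasible and integer.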
We consider the integral definition of the fractional Laplacian and analyze a linear-quadratic optimal control problem for the so-called fractional heat equation; control constraints are also considered. We derive existence and uniqueness results, first order optimality conditions, and regularity estimates for the optimal variables. To discretize the state equation we propose a fully discrete scheme that relies on an implicit finite difference discretization in time combined with a piecewise linear finite element discretization in space. We derive stability results and a novel L²(0, T; L²(Ω)) a priori error estimate. On the basis of the aforementioned solution technique, we propose a fully discrete scheme for our optimal control problem that discretizes the control variable with piecewise constant functions, and we derive a priori error estimates for it. We illustrate the theory with one- and two-dimensional numerical experiments.
In this paper, we introduce and analyze a new class of optimal control problems constrained by elliptic equations with uncertain fractional exponents. We utilize risk measures to formulate the resulting optimization problem. We develop a functional analytic framework, study the existence of solutions, and rigorously derive the first-order optimality conditions. Additionally, we employ a sample-based approximation for the uncertain exponent and the finite element method to discretize in space. We prove the rate of convergence for the optimal risk-neutral controls when using quadrature approximation for the uncertain exponent and conclude with illustrative examples.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Milewicz, Reed M.; Pirkelbauer, Peter; Soundararajan, Prema; Ahmed, Hadia; Skjellum, Tony
A source-to-source (S2S) compiler is a type of translator that accepts the source code of a program written in a programming language as its input and produces equivalent source code in the same or a different programming language. S2S techniques are commonly used to enable fluent translation between high-level programming languages, to perform large-scale refactoring operations, and to facilitate instrumentation for dynamic analysis. Negative perceptions about S2S’s applicability in High Performance Computing (HPC) are studied and evaluated here. This is a first study that brings to light reasons why scientists do not use source-to-source techniques for HPC. The primary audience for this paper is those considering S2S technology in their HPC application work.
Aeroengines ingest foreign object debris such as sand, which eventually erodes components through repeated impacts. Due to the wide feature space, modeling and simulations are needed to rapidly assess the erosion behavior of materials such as composites. Peridynamic simulations were performed to analyze erosion of SiC/SiC composite due to sand impacts, giving direct insight into the impact erosion mechanism and amounts. The erosion data was strongly correlated to impact velocity and angle, providing predictive equations.