As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments are becoming increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper highlights a case study of this mismatch, focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially causing applications to miss their time-per-iteration targets. Several proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight general concepts that need to be leveraged in the design of next-generation scalable system software.
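As a rough illustration of the matching semantics at issue (a toy model, not any MPI library's actual implementation), the sketch below matches an incoming message against a posted-receive queue on (source, tag, communicator) with wildcards; the linear scan is the step whose cost grows when many threads post receives in a nondeterministic order.

```python
# Toy model of MPI-style message matching; names and structure are
# illustrative, not taken from any MPI implementation.
ANY_SOURCE = -1
ANY_TAG = -1

class PostedRecv:
    def __init__(self, source, tag, comm):
        self.source, self.tag, self.comm = source, tag, comm

def match(posted_recvs, msg_source, msg_tag, msg_comm):
    """Return the index of the first posted receive matching the message.

    Matching is on (source, tag, communicator), with wildcards permitted.
    The scan is linear in the queue length, so nondeterministic ordering
    of posts from many threads lengthens the expected traversal.
    """
    for i, r in enumerate(posted_recvs):
        if r.comm != msg_comm:
            continue
        if r.source not in (ANY_SOURCE, msg_source):
            continue
        if r.tag not in (ANY_TAG, msg_tag):
            continue
        return i
    return None

queue = [PostedRecv(0, 7, "world"), PostedRecv(ANY_SOURCE, ANY_TAG, "world")]
print(match(queue, msg_source=3, msg_tag=7, msg_comm="world"))  # -> 1
```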
Haddock, Walker; Bangalore, Purushotham V.; Curry, Matthew L.; Skjellum, Anthony
Exascale computing demands high-bandwidth and low-latency I/O on the computing edge. Object storage systems can provide higher bandwidth and lower latencies than tape archives. File transfer nodes present a single point of mediation through which data moving between these storage systems must pass. By increasing the performance of erasure coding, stripes can be subdivided into large numbers of shards. This paper’s contribution is a prototype nearline disk object storage system based on Ceph. We show that using general-purpose graphics processing units (GPGPUs) for erasure coding on file transfer nodes is effective when using a large number of shards. We describe an architecture for nearline disk archive storage for use with high performance computing (HPC) and demonstrate its performance with benchmarking results. We compare the benchmark performance of our design with the Intel® Intelligent Storage Acceleration Library (ISA-L) CPU-based erasure coding library using the native Ceph erasure coding feature.
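For orientation, the arithmetic behind a Reed-Solomon-style erasure-coded stripe is roughly as follows (the parameters k and m here are generic, not the specific pool configurations benchmarked):

\[
p_i \;=\; \sum_{j=1}^{k} g_{ij}\, d_j \quad (i = 1,\dots,m), \qquad
\text{shard size} = \frac{S}{k}, \qquad
\text{storage overhead} = \frac{k+m}{k},
\]

where a stripe of size \(S\) is split into \(k\) data shards \(d_j\), the \(m\) parity shards \(p_i\) are generator-matrix combinations computed over \(\mathrm{GF}(2^8)\), and any \(k\) of the \(k+m\) shards suffice to reconstruct the stripe. These generator-matrix products are the operations offloaded either to the GPGPU or to ISA-L on the CPU.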
Communication networks have evolved to a level of sophistication that requires computer models and numerical simulations to understand and predict their behavior. A network simulator is a software tool that enables the network designer to model the components of a computer network, such as nodes, routers, switches, and links, and events such as data transmissions and packet errors, in order to obtain device- and network-level metrics. Network simulations, like many other numerical approximations of complex systems, are subject to the specification of parameters and operating conditions of the system. Very often the full characterization of the system and its inputs is not possible, so Uncertainty Quantification (UQ) strategies need to be deployed to evaluate the statistics of its response and behavior. UQ techniques, despite the advancements of the last two decades, still struggle in the presence of a large number of uncertain variables and when the regularity of the system's response cannot be guaranteed. In this context, multifidelity approaches have recently gained popularity in the UQ community due to their flexibility and robustness with respect to these challenges. The main idea behind these techniques is to extract information from a limited number of high-fidelity model realizations and complement it with a much larger set of lower-fidelity evaluations. The final result is an estimator with much lower variance, i.e., a more accurate and reliable estimator. In this contribution we investigate the possibility of deploying multifidelity UQ strategies for computer network analysis. Two numerical configurations are studied, based on a simplified network with one client and one server. Preliminary results for these tests suggest that multifidelity sampling techniques can be effective UQ tools for network applications.
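For concreteness, one common two-fidelity control-variate estimator of a quantity of interest Q (stated here generically; the estimators used in this study may differ in detail) takes the form

\[
\hat{Q}^{\mathrm{MF}}
= \frac{1}{N_{\mathrm{HF}}}\sum_{i=1}^{N_{\mathrm{HF}}} Q_{\mathrm{HF}}(\xi_i)
+ \alpha\!\left(\frac{1}{N_{\mathrm{LF}}}\sum_{j=1}^{N_{\mathrm{LF}}} Q_{\mathrm{LF}}(\xi_j)
- \frac{1}{N_{\mathrm{HF}}}\sum_{i=1}^{N_{\mathrm{HF}}} Q_{\mathrm{LF}}(\xi_i)\right),
\]

where \(N_{\mathrm{LF}} \gg N_{\mathrm{HF}}\) and the coefficient \(\alpha\) is chosen (e.g., from the estimated correlation between the two fidelities) to minimize the variance of \(\hat{Q}^{\mathrm{MF}}\); the low-fidelity simulator absorbs most of the sampling effort while the few high-fidelity runs keep the estimator anchored to the detailed network model.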
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
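As a point of reference for readers unfamiliar with the kernel, a serial mode-0 MTTKRP over a COO-format sparse 3-way tensor can be sketched as follows (illustrative NumPy only; the Kokkos implementation described above parallelizes and reorders this loop):

```python
import numpy as np

def mttkrp_mode0(coords, vals, B, C, I, R):
    """Reference (serial) mode-0 MTTKRP for a sparse 3-way tensor in COO format.

    coords: (nnz, 3) integer array of (i, j, k) indices
    vals:   (nnz,) nonzero values
    B, C:   factor matrices of shapes (J, R) and (K, R)
    Returns M of shape (I, R) with M[i, :] = sum over (i, j, k) of x_ijk * (B[j, :] * C[k, :]).
    """
    M = np.zeros((I, R))
    for (i, j, k), x in zip(coords, vals):
        M[i, :] += x * (B[j, :] * C[k, :])   # elementwise (Hadamard) product of rows
    return M

# Tiny example: a 2x2x2 tensor with three nonzeros and rank R = 2
coords = np.array([[0, 0, 0], [1, 1, 0], [1, 0, 1]])
vals = np.array([1.0, 2.0, 3.0])
B = np.ones((2, 2)); C = np.ones((2, 2))
print(mttkrp_mode0(coords, vals, B, C, I=2, R=2))
```

A parallel version must synchronize the row update `M[i, :] += ...` (e.g., with atomic writes), which is the contention that the permuted traversal of nonzeros described above is designed to reduce.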
For at least the last 20 years, many have tried to create a general resource management system to support interoperability across various concurrent libraries. The previous strategies all suffered from additional toolchain requirements and/or the use of a shared programming model that assumed it owned or controlled access to all resources available to the program. None of these techniques has achieved widespread adoption. The ubiquity of OpenMP, coupled with C++ developing a standard way to describe many different concurrent paradigms (C++23 executors), would allow OpenMP to assume the role of a general resource manager without requiring user code to be written directly in OpenMP. With a few added features, such as the ability to use otherwise idle threads to execute tasks and to specify a task “width”, many interesting concurrent frameworks could be developed in native OpenMP and achieve high performance. Further, one could create concrete C++ OpenMP executors that enable support for general C++ executor-based codes, which would allow Fortran, C, and C++ codes to use the same underlying concurrent framework whether expressed as native OpenMP or using language-specific features. Effectively, OpenMP would become the de facto solution for a problem that has long plagued the HPC community.
Probabilistic simulations of the post-closure performance of a generic deep geologic repository for commercial spent nuclear fuel in shale host rock provide a test case for comparing sensitivity analysis methods available in Geologic Disposal Safety Assessment (GDSA) Framework, the U.S. Department of Energy's state-of-the-art toolkit for repository performance assessment. Simulations assume a thick low-permeability shale with aquifers (potential paths to the biosphere) above and below the host rock. Multi-physics simulations on the 7-million-cell grid are run in a high-performance computing environment with PFLOTRAN. Epistemically uncertain inputs include properties of the engineered and natural systems. The output variables of interest, maximum I-129 concentrations (independent of time) at observation points in the aquifers, vary over several orders of magnitude. Variance-based global sensitivity analyses (i.e., calculations of sensitivity indices) conducted with Dakota use polynomial chaos expansion (PCE) and Gaussian process (GP) surrogate models. Results of analyses conducted with raw output concentrations and with log-transformed output concentrations are compared. Using log-transformed concentrations results in larger sensitivity indices for more influential input variables, smaller sensitivity indices for less influential input variables, and more consistent values for sensitivity indices between methods (PCE and GP) and between analyses repeated with samples of different sizes.
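The variance-based indices computed here are the standard Sobol' first-order and total-effect indices; schematically, for output \(y\) (raw or log-transformed concentration) and input \(x_i\),

\[
S_i = \frac{\operatorname{Var}\!\big[\mathbb{E}(y \mid x_i)\big]}{\operatorname{Var}(y)},
\qquad
S_i^{T} = 1 - \frac{\operatorname{Var}\!\big[\mathbb{E}(y \mid x_{\sim i})\big]}{\operatorname{Var}(y)},
\]

where \(x_{\sim i}\) denotes all inputs except \(x_i\). Because the transformation \(y \mapsto \log y\) is nonlinear, the variance decomposition, and hence the indices, can differ between the raw and log-transformed analyses.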
Atmospheric tracer transport is a computationally demanding component of the atmospheric dynamical core of weather and climate simulations. Simulations typically have tens to hundreds of tracers. A tracer field is required to preserve several properties, including mass, shape, and tracer consistency. To improve computational efficiency, it is common to apply different spatial and temporal discretizations to the tracer transport equations than to the dynamical equations. Using different discretizations increases the difficulty of preserving properties. This paper provides a unified framework to analyze the property preservation problem and classes of algorithms to solve it. We examine the primary problem and a safety problem; describe three classes of algorithms to solve these; introduce new algorithms in two of these classes; make connections among the algorithms; analyze each algorithm in terms of correctness, bound on its solution magnitude, and its communication efficiency; and study numerical results. A new algorithm, QLT, has the smallest communication volume, and in an important case it redistributes mass approximately locally. These algorithms are only very loosely coupled to the underlying discretizations of the dynamical and tracer transport equations and thus are broadly and efficiently applicable. In addition, they may be applied to remap problems in applications other than tracer transport.
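To fix ideas, a representative statement of the primary property-preservation problem (written generically; the precise formulations of the primary and safety problems in this paper differ in detail) is the bounded redistribution problem

\[
\min_{\tilde m}\ \sum_{k} \big(\tilde m_k - m_k\big)^2
\quad \text{subject to} \quad
\sum_{k} \tilde m_k = M^{\ast},
\qquad l_k \le \tilde m_k \le u_k ,
\]

where \(m_k\) are the tracer masses produced by the tracer transport discretization in cells \(k\), \(M^{\ast}\) is the total mass required for conservation and tracer consistency with the dynamics, and \([l_k, u_k]\) are shape-preservation bounds; the three classes of algorithms differ in how, and how locally, they solve problems of this type and in the communication they require.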
Drinking water utilities use booster stations to maintain chlorine residuals throughout water distribution systems. Booster stations could also be used as part of an emergency response plan to minimize health risks in the event of an unintentional or malicious contamination incident. The benefit of booster stations for emergency response depends on several factors, including the reaction between chlorine and an unknown contaminant species, the fate and transport of the contaminant in the water distribution system, and the time delay between detection and initiation of boosted levels of chlorine. This paper takes these aspects into account and proposes a mixed-integer linear program formulation for optimizing the placement of booster stations for emergency response. A case study is used to explore the ability of optimally placed booster stations to reduce the impact of contamination in water distribution systems.
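Schematically (with hypothetical symbols, not the exact model in the paper), the formulation resembles a scenario-weighted covering problem:

\[
\min_{x,\,z}\ \sum_{s} p_s \sum_{n} d_{sn}\, z_{sn}
\quad \text{subject to} \quad
\sum_{b} x_b \le B,
\qquad
z_{sn} \ge 1 - \sum_{b \in \mathcal{C}(s,n)} x_b,
\qquad
x_b \in \{0,1\},\ \ z_{sn} \ge 0,
\]

where \(x_b\) selects candidate booster location \(b\) within a budget of \(B\) stations, \(d_{sn}\) is the health impact at node \(n\) under contamination scenario \(s\) (with weight \(p_s\)), \(z_{sn}\) indicates that node \(n\) remains impacted, and \(\mathcal{C}(s,n)\) is the set of booster locations whose boosted chlorine, given the contaminant reaction, the fate and transport in the network, and the detection delay, protects node \(n\) in scenario \(s\).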
Bayesian optimization is an effective surrogate-based optimization method that has been widely used for simulation-based applications. However, the traditional Bayesian optimization (BO) method is only applicable to single-fidelity applications, whereas multiple levels of fidelity exist in reality. In this work, we propose a bi-fidelity known/unknown constrained Bayesian optimization method for design applications. The proposed framework, called sBF-BO-2CoGP, is built on a two-level CoKriging method to predict the objective function. An external binary classifier, which is itself another CoKriging model, is used to distinguish between feasible and infeasible regions. The sBF-BO-2CoGP method is demonstrated using a numerical example and a flip-chip application for design optimization to minimize the warpage deformation under thermal loading conditions.
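A stripped-down, single-fidelity sketch of the constrained BO loop is given below (using ordinary GP surrogates from scikit-learn as stand-ins; sBF-BO-2CoGP replaces both the objective surrogate and the feasibility classifier with two-level CoKriging models, which are omitted here):

```python
# Simplified single-fidelity constrained Bayesian optimization step:
# a GP regressor models the objective, a GP classifier models feasibility,
# and the next sample maximizes expected improvement x probability of
# feasibility. All names and data below are illustrative.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor, GaussianProcessClassifier

def propose_next(X, y, feasible, candidates):
    gp_obj = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    gp_con = GaussianProcessClassifier().fit(X, feasible)
    mu, sigma = gp_obj.predict(candidates, return_std=True)
    best = np.min(y[feasible])                       # best feasible value so far
    imp = best - mu                                  # improvement for minimization
    z = imp / np.maximum(sigma, 1e-12)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)     # expected improvement
    p_feas = gp_con.predict_proba(candidates)[:, 1]  # probability of feasibility
    return candidates[np.argmax(ei * p_feas)]

rng = np.random.default_rng(0)
X = np.linspace(-2, 2, 12).reshape(-1, 1)            # toy design samples
y = X[:, 0] ** 2 + 0.1 * rng.normal(size=12)          # toy objective values
feasible = X[:, 0] > 0.0                              # toy feasibility labels
print(propose_next(X, y, feasible, np.linspace(-2, 2, 201).reshape(-1, 1)))
```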
We present a meshfree quadrature rule for compactly supported nonlocal integro-differential equations (IDEs) with radial kernels. We apply this rule to develop a meshfree discretization of a peridynamic solid mechanics model that requires no background mesh. Existing discretizations of peridynamic models have been shown to exhibit a lack of asymptotic compatibility to the corresponding linearly elastic local solution. By posing the quadrature rule as an equality constrained least squares problem, we obtain asymptotically compatible convergence by introducing polynomial reproduction constraints. Our approach naturally handles traction-free conditions, surface effects, and damage modeling for both static and dynamic problems. We demonstrate high-order convergence to the local theory by comparing to manufactured solutions and to cases with crack singularities for which an analytic solution is available. Finally, we verify the applicability of the approach to realistic problems by reproducing high-velocity impact results from the Kalthoff–Winkler experiments.
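Schematically, the quadrature weights attached to each point's neighborhood are obtained from an equality-constrained least-squares problem of the form (weighting details omitted):

\[
\min_{\{\omega_j\}}\ \sum_{j} \omega_j^{2}
\quad \text{subject to} \quad
\sum_{j} \omega_j\, p(\mathbf{x}_j) = \int_{\mathcal{H}} p(\mathbf{y})\, d\mathbf{y}
\quad \text{for all } p \in \mathcal{P}_m,
\]

where the \(\mathbf{x}_j\) are neighboring particles within the horizon \(\mathcal{H}\) and \(\mathcal{P}_m\) is a space of polynomials up to degree \(m\); enforcing the reproduction constraints exactly is what yields asymptotic compatibility as the horizon and particle spacing are refined.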
Existing machines for lazy evaluation use a flat representation of environments, storing the terms associated with free variables in an array. Combined with a heap, this structure supports the shared intermediate results required by lazy evaluation. We propose and describe an alternative approach that uses a shared environment to minimize the overhead of delayed computations. We show how a shared environment can act as both an environment and a mechanism for sharing results. To formalize this approach, we introduce a calculus that makes the shared environment explicit, as well as a machine to implement the calculus, the Cactus Environment Machine. A simple compiler implements the machine and is used to run experiments for assessing performance. The results show reasonable performance and suggest that incorporating this approach into real-world compilers could yield performance benefits in some scenarios.
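The sketch below is a toy illustration of the underlying idea only (not the Cactus Environment Machine or its calculus): the environment is a linked, tree-shaped structure of mutable cells, and each cell doubles as the heap slot for its thunk, so forcing a variable once makes the result available to every later lookup that shares the cell.

```python
# Toy call-by-need evaluator with a shared (cactus) environment.
# Terms use de Bruijn indices; every name here is illustrative.

class Var:                      # variable, as a de Bruijn index
    def __init__(self, n): self.n = n

class Lam:                      # \ . body
    def __init__(self, body): self.body = body

class App:                      # f applied to arg
    def __init__(self, f, arg): self.f, self.arg = f, arg

class Cell:
    """One environment frame: binds the nearest variable to a suspended
    (term, environment) pair, replaced by a value once forced."""
    def __init__(self, term, thunk_env, parent):
        self.term, self.thunk_env, self.parent, self.value = term, thunk_env, parent, None

def lookup(env, n):
    while n > 0:
        env, n = env.parent, n - 1
    return env

def evaluate(term, env):
    if isinstance(term, Lam):
        return (term, env)                         # closures are values
    if isinstance(term, Var):
        cell = lookup(env, term.n)
        if cell.value is None:                     # force the thunk once...
            cell.value = evaluate(cell.term, cell.thunk_env)
        return cell.value                          # ...then share the result
    if isinstance(term, App):
        lam, lam_env = evaluate(term.f, env)
        # The argument stays suspended in the caller's environment; the new
        # frame extends the closure's environment, forming the cactus shape.
        return evaluate(lam.body, Cell(term.arg, env, lam_env))

# (\x. x x) ((\y. y) (\z. z)): the argument thunk is forced once and reused.
identity = Lam(Var(0))
term = App(Lam(App(Var(0), Var(0))), App(identity, Lam(Var(0))))
value, _ = evaluate(term, None)
print(isinstance(value, Lam))                      # True
```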
This work introduces a new method to efficiently solve optimization problems constrained by partial differential equations (PDEs) with uncertain coefficients. The method leverages two sources of inexactness that trade accuracy for speed: (1) stochastic collocation based on dimension-adaptive sparse grids (SGs), which approximates the stochastic objective function with a limited number of quadrature nodes, and (2) projection-based reduced-order models (ROMs), which generate efficient approximations to PDE solutions. These two sources of inexactness lead to inexact objective function and gradient evaluations, which are managed by a trust-region method that guarantees global convergence by adaptively refining the SG and ROM until a proposed error indicator drops below a tolerance specified by trust-region convergence theory. A key feature of the proposed method is that the error indicator, which accounts for errors incurred by both the SG and ROM, must be only an asymptotic error bound, i.e., a bound that holds up to an arbitrary constant that need not be computed. This enables the method to be applicable to a wide range of problems, including those where sharp, computable error bounds are not available; this distinguishes the proposed method from previous works. Numerical experiments performed on a model problem from optimal flow control under uncertainty verify global convergence of the method and demonstrate the method's ability to outperform previously proposed alternatives.
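For context, inexact trust-region methods of this kind typically require gradient (and objective) error controls of the schematic form

\[
\big\| \nabla J(z_k) - g_k \big\| \;\le\; \kappa \,\min\big\{ \|g_k\|,\ \Delta_k \big\},
\]

where \(g_k\) is the gradient computed from the current SG/ROM approximation, \(\Delta_k\) is the trust-region radius, and \(\kappa\) is a constant; the point of the proposed error indicator is that it needs to bound the left-hand side only asymptotically, up to an unknown constant, rather than sharply and computably.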
We use Bayesian data analysis to predict dengue fever outbreaks and quantify the link between outbreaks and meteorological precursors tied to the breeding conditions of vector mosquitos. We use Hamiltonian Monte Carlo sampling to estimate a seasonal Gaussian process modeling infection rate, and aperiodic basis coefficients for the rate of an “outbreak level” of infection beyond seasonal trends across two separate regions. We use this outbreak level to estimate an autoregressive moving average (ARMA) model from which we extrapolate a forecast. We show that the resulting model has useful forecasting power in the 6–8 week range. The forecasts are not significantly more accurate with the inclusion of meteorological covariates than with infection trends alone.
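The final forecasting stage can be sketched as follows (illustrative only: the series, ARMA orders, and horizon are placeholders, and the Hamiltonian Monte Carlo / Gaussian process stage that produces the outbreak-level series is omitted):

```python
# Fit an ARMA(p, q) model to a weekly outbreak-level series and extrapolate
# an 8-week forecast; everything here is a stand-in for the real pipeline.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(1)
noise = rng.normal(size=200)
outbreak_level = np.zeros(200)
for t in range(1, 200):                          # synthetic AR(1)-like series
    outbreak_level[t] = 0.7 * outbreak_level[t - 1] + noise[t]

model = ARIMA(outbreak_level, order=(2, 0, 1))   # ARMA(2, 1), i.e., d = 0
fit = model.fit()
print(fit.forecast(steps=8))                     # forecast 8 weeks ahead
```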
Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.
Krichmar, Jeffrey L.; Severa, William M.; Khan, Muhammad S.; Olds, James L.
The Artificial Intelligence (AI) revolution foretold during the 1960s is well underway in the second decade of the twenty-first century. Its period of phenomenal growth likely lies ahead. AI-operated machines and technologies will extend the reach of Homo sapiens far beyond the biological constraints imposed by evolution: outwards further into deep space, as well as inwards into the nano-world of DNA sequences and relevant medical applications. And yet, we believe, there are crucial lessons that biology can offer that will enable a prosperous future for AI. For machines in general, and for AIs especially, operating over extended periods or in extreme environments will require energy usage orders of magnitude more efficient than exists today. In many operational environments, energy sources will be constrained. An AI's design and function may depend upon the type of energy source, as well as its availability and accessibility. Any plans for AI devices operating in a challenging environment must begin with the question of how they are powered, where fuel is located, how energy is stored and made available to the machine, and how long the machine can operate on specific energy units. While one of the key advantages of AI use is to reduce the dimensionality of a complex problem, the fact remains that some energy is required for functionality. Hence, the materials and technologies that provide the needed energy represent a critical challenge for future AI use scenarios and should be integrated into their design. Here we look to the brain and other aspects of biology as inspiration for Biomimetic Research for Energy-efficient AI Designs (BREAD).
Herrington, Adam R.; Lauritzen, Peter H.; Taylor, Mark A.; Goldhaber, Steve; Eaton; Reed; Ullrich, Paul A.
Atmospheric modeling with element-based high-order Galerkin methods presents a unique challenge to the conventional physics–dynamics coupling paradigm, due to the highly irregular distribution of nodes within an element and the distinct numerical characteristics of the Galerkin method. The conventional coupling procedure is to evaluate the physical parameterizations (physics) on the dynamical core grid. Evaluating the physics at the nodal points exacerbates numerical noise from the Galerkin method, enabling and amplifying local extrema at element boundaries. Grid imprinting may be substantially reduced through the introduction of an entirely separate, approximately isotropic finite-volume grid for evaluating the physics forcing. Integration of the spectral basis over the control volumes provides an area-averaged state to the physics, which is more representative of the state in the vicinity of the nodal points than of any single nodal point, and is more consistent with the notion of a “large-scale state” required by conventional physics packages. This study documents the implementation of a quasi-equal-area physics grid in NCAR’s Community Atmosphere Model Spectral Element and shows that it is effective at mitigating grid imprinting in the solution. The physics grid is also appropriate for coupling to other components within the Community Earth System Model, since the coupler requires component fluxes to be defined on a finite-volume grid, and one can be certain that the fluxes on the physics grid are, indeed, volume averaged.
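Schematically (the symbols here are generic), the physics-grid state is the control-volume average of the spectral-element expansion,

\[
\bar q_k \;=\; \frac{1}{A_k} \int_{\Omega_k} q_h(\mathbf{x})\, dA
\;=\; \frac{1}{A_k} \sum_{i} w_{ki}\, q_i ,
\]

where \(\Omega_k\) is the \(k\)-th physics control volume of area \(A_k\), \(q_h\) is the element's spectral expansion with nodal coefficients \(q_i\), and the weights \(w_{ki}\) come from integrating the basis functions over \(\Omega_k\); the physics forcing is evaluated on these area averages and then mapped back to the dynamical core grid.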