Continued progress in computing has augmented the quest for higher performance with a new quest for higher energy efficiency. This has led to the re-emergence of Processing-In-Memory (PIM) architectures that offer higher density and performance with some boost in energy efficiency. Past PIM work either integrated a standard CPU with a conventional DRAM to improve the CPU-memory link, or used a bit-level processor with Single Instruction Multiple Data (SIMD) control, but neither matched the energy consumption of the memory to the computation. We originally proposed to develop a new architecture derived from PIM that more effectively addressed energy efficiency for high performance scientific, data analytics, and neuromorphic applications. We also originally planned to implement a von Neumann architecture with arithmetic/logic units (ALUs) that matched the power consumption of an advanced storage array to maximize energy efficiency. Implementing this architecture in storage was our original idea, since by augmenting storage (instead of memory), the system could address both in-memory computation and applications that accessed larger data sets directly from storage, hence Processing-in-Memory-and-Storage (PIMS). However, as our research matured, we discovered several things that changed our original direction, the most important being that a PIM that implements a standard von Neumann-type architecture results in significant energy efficiency improvement, but only about an O(10) performance improvement. In addition to this, the emergence of new memory technologies moved us to proposing a non-von Neumann architecture, called Superstrider, implemented not in storage, but in a new DRAM technology called High Bandwidth Memory (HBM). HBM is a stacked DRAM technology that includes a logic layer where an architecture such as Superstrider could potentially be implemented.
Could combining quantum computing and machine learning with Moore's law produce a true 'rebooted computer'? This article posits that a three-technology hybrid-computing approach might yield sufficiently improved answers to a broad class of problems such that energy efficiency will no longer be the dominant concern.
The heterogeneity in mechanical fields introduced by microstructure plays a critical role in the localization of deformation. To resolve this incipient stage of failure, it is therefore necessary to incorporate microstructure with sufficient resolution. On the other hand, computational limitations make it infeasible to represent the microstructure in the entire domain at the component scale. In this study, the authors demonstrate the use of concurrent multiscale modeling to incorporate explicit, finely resolved microstructure in a critical region while resolving the smoother mechanical fields outside this region with a coarser discretization to limit computational cost. The microstructural physics is modeled with a high-fidelity model that incorporates anisotropic crystal elasticity and rate-dependent crystal plasticity to simulate the behavior of a stainless steel alloy. The component-scale material behavior is treated with a lower fidelity model incorporating isotropic linear elasticity and rate-independent J2 plasticity. The microstructural and component scale subdomains are modeled concurrently, with coupling via the Schwarz alternating method, which solves boundary-value problems in each subdomain separately and transfers solution information between subdomains via Dirichlet boundary conditions. In this study, the framework is applied to model incipient localization in tensile specimens during necking.
Quantum state tomography on a d-dimensional system demands resources that grow rapidly with d. These resources may be reduced by using model selection to tailor the number of parameters in the model (i.e., the size of the density matrix). Most model selection methods rely on a test statistic and a null theory that describes its behavior when two models are equally good. Here, we consider the loglikelihood ratio. Because of the positivity constraint ρ ≥ 0, quantum state space does not generally satisfy local asymptotic normality (LAN), meaning the classical null theory for the loglikelihood ratio (the Wilks theorem) should not be used. Thus, understanding and quantifying how positivity affects the null behavior of this test statistic is necessary for its use in model selection for state tomography. We define a new generalization of LAN, metric-projected LAN, show that quantum state space satisfies it, and derive a replacement for the Wilks theorem. In addition to enabling reliable model selection, our results shed more light on the qualitative effects of the positivity constraint on state tomography.
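For context, the classical null theory being replaced is the following standard result (textbook statistics, not a result of this paper): for nested models, the loglikelihood ratio statistic

\[
\lambda \;=\; 2\,\big[\log \mathcal{L}(\hat{\rho}_{\text{large}}) - \log \mathcal{L}(\hat{\rho}_{\text{small}})\big]
\;\overset{\text{Wilks}}{\sim}\; \chi^2_{\,k_{\text{large}} - k_{\text{small}}},
\]

where k denotes the number of free parameters in each model. The positivity constraint ρ ≥ 0 can pin estimates to the boundary of state space and thereby break the local asymptotic normality that this χ² null relies on, which is what the metric-projected LAN analysis is intended to repair.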
Shock wave interactions with defects, such as pores, are known to play a key role in the chemical initiation of energetic materials. The shock response of hexanitrostilbene is studied through a combination of large-scale reactive molecular dynamics and mesoscale hydrodynamic simulations. In order to extend our simulation capability at the mesoscale to include weak shock conditions (<6 GPa), atomistic simulations of pore collapse are used to define a strain-rate-dependent strength model. Comparing these simulation methods allows us to impose physically reasonable constraints on the mesoscale model parameters. In doing so, we have been able to study shock waves interacting with pores as a function of this viscoplastic material response. We find that the pore collapse behavior of weak shocks is characteristically different than that of strong shocks.
The scientific goal of the ExaWind Exascale Computing Project (ECP) is to advance our fundamental understanding of the flow physics governing whole wind plant performance, including wake formation, complex terrain impacts, and turbine-turbine interaction effects. Current methods for modeling wind plant performance fall short due to insufficient model fidelity and inadequate treatment of key phenomena, combined with a lack of the computational power necessary to address the wide range of relevant length scales associated with wind plants. Thus, our ten-year exascale challenge is the predictive simulation of a wind plant composed of O(100) multi-MW wind turbines sited within a 100 km² area with complex terrain, involving simulations with O(100) billion grid points. The project plan builds progressively from predictive petascale simulations of a single turbine, where the detailed blade geometry is resolved, meshes rotate and deform with blade motions, and atmospheric turbulence is realistically modeled, to a multi-turbine array in complex terrain. The ALCC allocation will be used continually throughout the allocation period. In the first half of the allocation period, small jobs (e.g., for testing Kokkos algorithms) and medium jobs (e.g., 10K cores for highly resolved ABL simulations) will be typical. In the second half of the allocation period, we will also have a number of large submittals for our resolved-turbine simulations. A challenge in the latter period is that small time step sizes will require long wall-clock times for statistically meaningful solutions. As such, we expect our allocation-hour burn rate to increase as we move through the allocation period.
The inverse problem of Kohn–Sham density functional theory (DFT) is often solved in an effort to benchmark and design approximate exchange-correlation potentials. The forward and inverse problems of DFT rely on the same equations but the numerical methods for solving each problem are substantially different. We examine both problems in this tutorial with a special emphasis on the algorithms and error analysis needed for solving the inverse problem. Two inversion methods based on partial differential equation constrained optimization and constrained variational ideas are introduced. We compare and contrast several different inversion methods applied to one-dimensional finite and periodic model systems.
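Schematically (our notation, not necessarily the tutorial's), the inverse problem can be written as a density-matching optimization constrained by the Kohn–Sham equations:

\[
\min_{v}\;\tfrac12\!\int\!\big(\rho_v(\mathbf{r})-\rho^{\text{target}}(\mathbf{r})\big)^2\,d\mathbf{r}
\quad\text{s.t.}\quad
\Big(-\tfrac12\nabla^2+v(\mathbf{r})\Big)\phi_i=\varepsilon_i\phi_i,
\qquad
\rho_v=\sum_{i=1}^{N}|\phi_i|^2,
\]

i.e., find the potential v whose self-consistent Kohn–Sham density reproduces a given target density, whereas the forward problem fixes an approximate exchange-correlation potential and solves for the density.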
Methods for the efficient representation of fracture response in geoelectric models impact an impressively broad range of problems in applied geophysics. We adopt the recently developed hierarchical material property representation in finite element analysis (Weiss, 2017) to model the electrostatic response of a discrete set of vertical fractures in the near surface and compare these results to those from anisotropic continuum models. We also examine the power law behavior of these results and compare to continuum theory. We find that in measurement profiles from a single point source in directions both parallel and perpendicular to the fracture set, the fracture signature persists over all distances. Furthermore, the homogenization limit (distance at which the individual fracture anomalies are too small to be either measured or of interest) is not strictly a function of the geometric distribution of the fractures, but also their conductivity relative to the background. Hence, we show that the definition of “representative elementary volume”, that distance over which the statistics of the underlying heterogeneities are stationary, is incomplete as it pertains to the applicability of an equivalent continuum model. We also show that detailed interrogation of such intrinsically heterogeneous models may reveal power law behavior that appears anomalous, thus suggesting a possible mechanism to reconcile emerging theories in fractional calculus with classical electromagnetic theory.
A closed-form solution is described here for the equilibrium configurations of the magnetic field in a simple heterogeneous domain. This problem and its solution are used for rigorous assessment of the accuracy of the ALEGRA code in the quasi-static limit. By the equilibrium configuration we understand the static condition, or the stationary states without macroscopic current. The analysis includes quite a general class of 2D solutions for which a linear isotropic metallic matrix is placed inside a stationary magnetic field approaching a constant value at infinity. The evolution of the magnetic fields inside and outside the inclusion, and the parameters for which the quasi-static approach provides self-consistent results, are also explored. It is demonstrated that under spatial mesh refinement, ALEGRA converges to the analytic solution for the interior of the inclusion at the expected rate, for both body-fitted and regular rectangular meshes.
Anomaly detection is an important problem in various fields of complex systems research including image processing, data analysis, physical security and cybersecurity. In image processing, it is used for removing noise while preserving image quality, and in data analysis, physical security and cybersecurity, it is used to find interesting data points, objects or events in a vast sea of information. Anomaly detection will continue to be an important problem in domains intersecting with “Big Data”. In this paper we provide a novel algorithm for anomaly detection that uses phase-coded spiking neurons as basic computational elements.
In many applications the resolution of small-scale heterogeneities remains a significant hurdle to robust and reliable predictive simulations. In particular, while material variability at the mesoscale plays a fundamental role in processes such as material failure, the resolution required to capture mechanisms at this scale is often computationally intractable. Multiscale methods aim to overcome this difficulty through judicious choice of a subscale problem and a robust manner of passing information between scales. One promising approach is the multiscale finite element method, which increases the fidelity of macroscale simulations by solving lower-scale problems that produce enriched multiscale basis functions. In this study, we present the first work toward application of the multiscale finite element method to the nonlocal peridynamic theory of solid mechanics. This is achieved within the context of a discontinuous Galerkin framework that facilitates the description of material discontinuities and does not assume the existence of spatial derivatives. Analysis of the resulting nonlocal multiscale finite element method is achieved using the ambulant Galerkin method, developed here with sufficient generality to allow for application to multiscale finite element methods for both local and nonlocal models that satisfy minimal assumptions. We conclude with preliminary results on a mixed-locality multiscale finite element method in which a nonlocal model is applied at the fine scale and a local model at the coarse scale.
Evaluating the effectiveness of data visualizations is a challenging undertaking and often relies on one-off studies that test a visualization in the context of one specific task. Researchers across the fields of data science, visualization, and human-computer interaction are calling for foundational tools and principles that could be applied to assessing the effectiveness of data visualizations in a more rapid and generalizable manner. One possibility for such a tool is a model of visual saliency for data visualizations. Visual saliency models are typically based on the properties of the human visual cortex and predict which areas of a scene have visual features (e.g. color, luminance, edges) that are likely to draw a viewer's attention. While these models can accurately predict where viewers will look in a natural scene, they typically do not perform well for abstract data visualizations. In this paper, we discuss the reasons for the poor performance of existing saliency models when applied to data visualizations. We introduce the Data Visualization Saliency (DVS) model, a saliency model tailored to address some of these weaknesses, and we test the performance of the DVS model and existing saliency models by comparing the saliency maps produced by the models to eye tracking data obtained from human viewers. Finally, we describe how modified saliency models could be used as general tools for assessing the effectiveness of visualizations, including the strengths and weaknesses of this approach.
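As a concrete example of the kind of comparison involved in such evaluations, the normalized scanpath saliency (NSS) score averages a z-scored saliency map over measured fixation locations. This is a generic metric sketch, not the DVS model itself, and the array names are hypothetical.

    import numpy as np

    def nss(saliency_map, fixation_mask):
        """Normalized scanpath saliency: mean z-scored saliency at fixation points."""
        z = (saliency_map - saliency_map.mean()) / saliency_map.std()
        return float(z[fixation_mask.astype(bool)].mean())

    # Hypothetical usage: higher NSS means the model better predicts where people looked.
    # score_dvs      = nss(dvs_saliency_map, eye_tracking_fixations)
    # score_baseline = nss(baseline_saliency_map, eye_tracking_fixations)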
When very few samples of a random quantity are available from a source distribution or probability density function (PDF) of unknown shape, it is usually not possible to accurately infer the PDF from which the data samples come. A significant component of epistemic uncertainty then exists concerning the source distribution of random or aleatory variability. For many engineering purposes, including design and risk analysis, one would normally want to avoid inference-related underestimation of important quantities such as response variance and failure probabilities. Recent research has established the practicality and effectiveness of a class of simple and inexpensive UQ methods for reasonably conservative estimation of such quantities when only sparse samples of a random quantity are available. This class of UQ methods is explained, demonstrated, and analyzed in this paper within the context of the Sandia Cantilever Beam End-to-End UQ Problem, Part A.1. Several sets of sparse replicate data are involved, and several representative uncertainty quantities are to be estimated: A) beam deflection variability, in particular the 2.5 to 97.5 percentile “central 95%” range of the sparsely sampled PDF of deflection; and B) a small exceedance probability associated with a tail of the PDF integrated beyond a specified deflection tolerance.
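To make the two estimation targets concrete, the naive empirical versions below are a minimal sketch; they are not the conservative UQ methods analyzed in the paper (which are designed to outperform exactly these estimates under sparse data), and the sample values and tolerance are hypothetical.

    import numpy as np

    deflections = np.array([1.02, 0.97, 1.10, 1.05, 0.99, 1.08])  # hypothetical sparse replicate data
    tolerance = 1.20                                              # hypothetical deflection tolerance

    # A) "central 95%" range: 2.5 to 97.5 percentiles of the sampled deflection PDF
    lo, hi = np.percentile(deflections, [2.5, 97.5])

    # B) exceedance probability: PDF tail integrated beyond the tolerance
    p_exceed = float(np.mean(deflections > tolerance))

    print(f"central 95% range: [{lo:.3f}, {hi:.3f}]; P(deflection > tolerance) = {p_exceed:.3f}")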
Park, Michael A.; Barral, Nicolas; Ibanez-Granados, Daniel A.; Kamenetskiy, Dmitry S.; Krakos, Joshua A.; Michal, Todd; Loseille, Adrien
Unstructured grid adaptation is a tool to control Computational Fluid Dynamics (CFD) discretization error. However, adaptive grid techniques have made limited impact on production analysis workflows where the control of discretization error is critical to obtaining reliable simulation results. Issues that prevent the use of adaptive grid methods are identified by applying unstructured grid adaptation methods to a series of benchmark cases. Once identified, these challenges to existing adaptive workflows can be addressed. Unstructured grid adaptation is evaluated for test cases described on the Turbulence Modeling Resource (TMR) web site, which documents uniform grid refinement of multiple schemes. The cases are turbulent flow over a Hemisphere Cylinder and an ONERA M6 Wing. Adaptive grid force and moment trajectories are shown for three integrated grid adaptation processes with Mach interpolation control and output error based metrics. The integrated grid adaptation process with a finite element (FE) discretization produced results consistent with uniform grid refinement of fixed grids. The integrated grid adaptation processes with finite volume schemes were slower to converge to the reference solution than the FE method. Metric conformity is documented on grid/metric snapshots for five grid adaptation mechanics implementations. These tools produce anisotropic boundary conforming grids requested by the adaptation process.
In this work we present a computational capability featuring a hierarchy of models with different fidelities for the solution of electrokinetics problems at the micro-/nano-scale. A multifidelity approach allows the selection of the most appropriate model, in terms of accuracy and computational cost, for the particular application at hand. We demonstrate the proposed multifidelity approach by studying the mobility of a colloid in a micro-channel as a function of the colloid charge and of the size of the ions dissolved in the fluid.
Independent meshing of subdomains separated by an interface can lead to spatially non-coincident discrete interfaces. We present an optimization-based coupling method for such problems, which does not require a common mesh refinement of the interface, has optimal H1 convergence rates, and passes a patch test. The method minimizes the mismatch of the state and normal stress extensions on discrete interfaces subject to the subdomain equations, while interface “fluxes” provide virtual Neumann controls.
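In schematic form (notation ours), the coupling solves a constrained minimization of the interface mismatch, with the interface flux data g_i acting as virtual Neumann controls:

\[
\min_{g_1,\,g_2}\;
\tfrac12\big\|u_1-u_2\big\|_{\Gamma}^2
+\tfrac12\big\|\sigma(u_1)\mathbf{n}_1+\sigma(u_2)\mathbf{n}_2\big\|_{\Gamma}^2
\quad\text{s.t.}\quad
\text{the subdomain equations hold on }\Omega_i\text{ with Neumann data }g_i,\;\; i=1,2,
\]

where the norms are evaluated using extensions of the states and normal stresses onto the (non-coincident) discrete interfaces.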
Maintaining the performance of high-performance computing (HPC) applications with the expected increase in failures is a major challenge for next-generation extreme-scale systems. With increasing scale, resilience activities (e.g. checkpointing) are expected to become more diverse, less tightly synchronized, and more computationally intensive. Few existing studies, however, have examined how decisions about scheduling resilience activities impact application performance. In this work, we examine the relationship between the duration and frequency of resilience activities and application performance. Our study reveals several key findings: (i) the aggregate amount of time consumed by resilience activities is not an effective metric for predicting application performance; (ii) the duration of the interruptions due to resilience activities has the greatest influence on application performance; shorter, but more frequent, interruptions are correlated with better application performance; and (iii) the differential impact of resilience activities across applications is related to the applications’ inter-collective frequencies; the performance of applications that perform infrequent collective operations scales better in the presence of resilience activities than the performance of applications that perform more frequent collective operations. This initial study demonstrates the importance of considering how resilience activities are scheduled. We provide critical analysis and direct guidance on how the resilience challenges of future systems can be met while minimizing the impact on application performance.
We review the physical foundations of Landauer’s Principle, which relates the loss of information from a computational process to an increase in thermodynamic entropy. Despite the long history of the Principle, its fundamental rationale and proper interpretation remain frequently misunderstood. Contrary to some misinterpretations of the Principle, the mere transfer of entropy between computational and non-computational subsystems can occur in a thermodynamically reversible way without increasing total entropy. However, Landauer’s Principle is not about general entropy transfers; rather, it more specifically concerns the ejection of (all or part of) some correlated information from a controlled, digital form (e.g., a computed bit) to an uncontrolled, non-computational form, i.e., as part of a thermal environment. Any uncontrolled thermal system will, by definition, continually re-randomize the physical information in its thermal state, from our perspective as observers who cannot predict the exact dynamical evolution of the microstates of such environments. Thus, any correlations involving information that is ejected into and subsequently thermalized by the environment will be lost from our perspective, resulting directly in an irreversible increase in thermodynamic entropy. Avoiding the ejection and thermalization of correlated computational information motivates the reversible computing paradigm, although the requirements for computations to be thermodynamically reversible are less restrictive than frequently described, particularly in the case of stochastic computational operations. There are interesting possibilities for the design of computational processes that utilize stochastic, many-to-one computational operations while nevertheless avoiding net entropy increase that remain to be fully explored.
The development of scramjet engines is an important research area for advancing hypersonic and orbital flights. Progress toward optimal engine designs requires accurate flow simulations together with uncertainty quantification. However, performing uncertainty quantification for scramjet simulations is challenging due to the large number of uncertain parameters involved and the high computational cost of flow simulations. These difficulties are addressed in this paper by developing practical uncertainty quantification algorithms and computational methods, and deploying them in the current study to large-eddy simulations of a jet in crossflow inside a simplified HIFiRE Direct Connect Rig scramjet combustor. First, global sensitivity analysis is conducted to identify influential uncertain input parameters, which can help reduce the system's stochastic dimension. Second, because models of different fidelity are used in the overall uncertainty quantification assessment, a framework for quantifying and propagating the uncertainty due to model error is presented. These methods are demonstrated on a nonreacting jet-in-crossflow test problem in a simplified scramjet geometry, with parameter space up to 24 dimensions, using static and dynamic treatments of the turbulence subgrid model, and with two-dimensional and three-dimensional geometries.
Previous work has demonstrated that propagating groups of samples, called ensembles, together through forward simulations can dramatically reduce the aggregate cost of sampling-based uncertainty propagation methods [E. Phipps, M. D'Elia, H. C. Edwards, M. Hoemmen, J. Hu, and S. Rajamanickam, SIAM J. Sci. Comput., 39 (2017), pp. C162-C193]. However, critical to the success of this approach when applied to challenging problems of scientific interest is the grouping of samples into ensembles to minimize the total computational work. For example, the total number of linear solver iterations for ensemble systems may be strongly influenced by which samples form the ensemble when applying iterative linear solvers to parameterized and stochastic linear systems. In this work we explore sample grouping strategies for local adaptive stochastic collocation methods applied to PDEs with uncertain input data, in particular canonical anisotropic diffusion problems where the diffusion coefficient is modeled by truncated Karhunen-Loève expansions. We demonstrate that a measure of the total anisotropy of the diffusion coefficient is a good surrogate for the number of linear solver iterations for each sample and therefore provides a simple and effective metric for grouping samples.
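A minimal sketch of this grouping strategy follows (illustrative only; the anisotropy functional below is a stand-in for whatever measure of the truncated Karhunen-Loève realization is actually used as the surrogate).

    import numpy as np

    def group_samples(samples, anisotropy, ensemble_size):
        """Sort collocation samples by a cheap surrogate for per-sample solver
        cost (a total-anisotropy measure of the diffusion coefficient), then
        group neighbors so each ensemble contains similarly expensive samples."""
        scores = np.array([anisotropy(s) for s in samples])
        order = np.argsort(scores)
        return [order[i:i + ensemble_size].tolist()
                for i in range(0, len(order), ensemble_size)]

    # Hypothetical usage: samples are KL coefficient vectors xi, weighted by the
    # KL eigenvalues lam to form a simple anisotropy surrogate.
    # ensembles = group_samples(xi_samples, lambda xi: np.linalg.norm(lam * xi), 8)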
Sparse matrix-matrix multiplication is a key kernel that has applications in several domains, such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, kkSpGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
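To make the accumulator's role concrete, here is a textbook row-wise (Gustavson-style) product of one sparse row using a hash-map accumulator. This is an illustration of the data structure choice being compared, not the kkSpGEMM implementation.

    def spgemm_row(a_row, b):
        """Compute one row of C = A*B.
        a_row: list of (col, val) pairs for a row of A.
        b: dict mapping a row index of B -> list of (col, val) pairs.
        A dict serves as the hash-map accumulator; a dense array indexed by
        column would be the alternative accumulator with different trade-offs."""
        acc = {}
        for k, a_val in a_row:
            for j, b_val in b.get(k, []):
                acc[j] = acc.get(j, 0.0) + a_val * b_val
        return sorted(acc.items())

    # Example: row [2, 3] of A times B = [[1, 0], [4, 5]] gives [(0, 14.0), (1, 15.0)].
    # spgemm_row([(0, 2.0), (1, 3.0)], {0: [(0, 1.0)], 1: [(0, 4.0), (1, 5.0)]})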
The development of scramjet engines is an important research area for advancing hypersonic and orbital flights. Progress towards optimal engine designs requires accurate and computationally affordable flow simulations, as well as uncertainty quantification (UQ). While traditional UQ techniques can become prohibitive under expensive simulations and high-dimensional parameter spaces, polynomial chaos (PC) surrogate modeling is a useful tool for alleviating some of the computational burden. However, non-intrusive quadrature-based constructions of PC expansions relying on a single high-fidelity model can still be quite expensive. We thus introduce a two-stage numerical procedure for constructing PC surrogates while making use of multiple models of different fidelity. The first stage involves an initial dimension reduction through global sensitivity analysis using compressive sensing. The second stage utilizes adaptive sparse quadrature on a multifidelity expansion to compute PC surrogate coefficients in the reduced parameter space where quadrature methods can be more effective. The overall method is used to produce accurate surrogates and to propagate uncertainty induced by uncertain boundary conditions and turbulence model parameters, for performance quantities of interest from large eddy simulations of supersonic reactive flows inside a scramjet engine.
Within the SEQUOIA project, funded by the DARPA EQUiPS program, we pursue algorithmic approaches that enable comprehensive design under uncertainty, through inclusion of aleatory/parametric and epistemic/model form uncertainties within scalable forward/inverse UQ approaches. These statistical methods are embedded within design processes that manage computational expense through active subspace, multilevel-multifidelity, and reduced-order modeling approximations. To demonstrate these methods, we focus on the design of devices that involve multi-physics interactions in advanced aerospace vehicles. A particular problem of interest is the shape design of nozzles for advanced vehicles such as the Northrop Grumman UCAS X-47B, involving coupled aero-structural-thermal simulations for nozzle performance. In this paper, we explore a combination of multilevel and multifidelity forward and inverse UQ algorithms to reduce the overall computational cost of the analysis by leveraging hierarchies of model form (i.e., multifidelity hierarchies) and solution discretization (i.e., multilevel hierarchies) in order to exploit trade-offs between solution accuracy and cost. In particular, we seek the most cost-effective fusion of information across complex multi-dimensional modeling hierarchies. Results to date indicate the utility of multiple approaches, including methods that optimally allocate resources when estimator variance varies smoothly across levels, methods that allocate sufficient sampling density based on sparsity estimates, and methods that employ greedy multilevel refinement.
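For reference, the "optimal allocation when estimator variance varies smoothly across levels" refers to estimators of the standard multilevel Monte Carlo form (a textbook result, not a SEQUOIA-specific one):

\[
\mathbb{E}[Q_L]\;\approx\;\sum_{\ell=0}^{L}\frac{1}{N_\ell}\sum_{n=1}^{N_\ell}\Big(Q_\ell^{(n)}-Q_{\ell-1}^{(n)}\Big),
\qquad
N_\ell\;\propto\;\sqrt{V_\ell/C_\ell},
\]

with Q_{-1} := 0, where V_ℓ and C_ℓ are the variance and cost of the level-ℓ correction; the allocation minimizes total cost for a prescribed estimator variance, and multifidelity variants replace the discretization levels with model-form hierarchies.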
Slycat™ is a web-based system for performing data analysis and visualization of potentially large quantities of remote, high-dimensional data. Slycat™ specializes in working with ensemble data. An ensemble is a group of related data sets, which typically consists of a set of simulation runs exploring the same problem space. An ensemble can be thought of as a set of samples within a multi-variate domain, where each sample is a vector whose value defines a point in high-dimensional space. To understand and describe the underlying problem being modeled in the simulations, ensemble analysis looks for shared behaviors and common features across the group of runs. Additionally, ensemble analysis tries to quantify differences found in any members that deviate from the rest of the group. The Slycat™ system integrates data management, scalable analysis, and visualization. Results are viewed remotely on a user’s desktop via commodity web clients using a multi-tiered hierarchy of computation and data storage, as shown in Figure 1. Our goal is to operate on data as close to the source as possible, thereby reducing time and storage costs associated with data movement. Consequently, we are working to develop parallel analysis capabilities that operate on High Performance Computing (HPC) platforms, to explore approaches for reducing data size, and to implement strategies for staging computation across the Slycat™ hierarchy. Within Slycat™, data and visual analysis are organized around projects, which are shared by a project team. Project members are explicitly added, each with a designated set of permissions. Although users sign-in to access Slycat™, individual accounts are not maintained. Instead, authentication is used to determine project access. Within projects, Slycat™ models capture analysis results and enable data exploration through various visual representations. Although for scientists each simulation run is a model of real-world phenomena given certain conditions, we use the term model to refer to our modeling of the ensemble data, not the physics. Different model types often provide complementary perspectives on data features when analyzing the same data set. Each model visualizes data at several levels of abstraction, allowing the user to range from viewing the ensemble holistically to accessing numeric parameter values for a single run. Bookmarks provide a mechanism for sharing results, enabling interesting model states to be labeled and saved.
The design of satellites usually includes the objective of minimizing mass due to high launch costs, which is complicated by the need to protect sensitive electronics from the space radiation environment. There is growing interest in automated design optimization techniques to help achieve that objective. Traditional optimization approaches that rely exclusively on response functions (e.g. dose calculations) can be quite expensive when applied to transport problems. Previously we showed how adjoint-based transport sensitivities used in conjunction with gradient-based optimization algorithms can be quite effective in designing mass-efficient electron/proton shields in one-dimensional slab geometries. In this paper we extend that work to two-dimensional Cartesian geometries. This consists primarily of deriving the sensitivities to geometric changes, given a particular prescription for parametrizing the shield geometry. We incorporate these sensitivities into our optimization process and demonstrate their effectiveness in such design calculations.
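The underlying adjoint machinery is the standard first-order perturbation identity (background only; the paper's contribution is the derivation of the sensitivities to the geometric shield parameters themselves). For a response R = ⟨Σ_d, φ⟩ with forward and adjoint transport problems Lφ = q and L†φ† = Σ_d,

\[
\frac{dR}{dp}\;=\;\Big\langle \phi^{\dagger},\;\frac{\partial q}{\partial p}-\frac{\partial L}{\partial p}\,\phi\Big\rangle,
\]

so, once the forward and adjoint fluxes are available, the sensitivity to each design parameter p costs only inner products rather than an additional transport solve.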
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Klinkenberg, Jannis; Samfass, Philipp; Terboven, Christian; Duran, Alejandro; Klemm, Michael; Teruel, Xavier; Mateo, Sergi; Olivier, Stephen L.; Muller, Matthias S.
In modern shared-memory NUMA systems, which typically consist of two or more multi-core processor packages with local memory, affinity of data to computation is crucial for achieving high performance with an OpenMP program. OpenMP* 3.0 introduced support for task-parallel programs in 2008, and subsequent versions have continued to extend its applicability and expressiveness. However, the ability to support data affinity of tasks is missing. In this paper, we investigate several approaches for task-to-data affinity that combine locality-aware task distribution and task stealing. We introduce the task affinity clause that will be part of OpenMP 5.0 and provide the reasoning behind its design. Evaluation with our experimental implementation in the LLVM OpenMP runtime shows that task affinity improves execution performance by up to 4.5x on an 8-socket NUMA machine and significantly reduces runtime variability of OpenMP tasks. Our results demonstrate that a variety of applications can benefit from task affinity and that the presented clause is closing the gap of task-to-data affinity in OpenMP 5.0.
Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have different and characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6 month period. We show that our technique achieves nearly perfect classification for this challenging data set.
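A minimal sketch of the classify-and-reject pattern described above (illustrative; the feature set, the choice of a random forest, and the confidence threshold are our assumptions rather than details of Taxonomist):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def train_classifier(features, app_labels):
        """features: per-job resource-utilization statistics (rows = jobs);
        app_labels: names of the known applications that produced them."""
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(features, app_labels)
        return clf

    def identify(clf, features, threshold=0.7):
        """Return the most likely known application per job, or 'unknown'
        when the predicted class probability falls below the threshold."""
        proba = clf.predict_proba(features)
        best = proba.argmax(axis=1)
        confident = proba.max(axis=1) >= threshold
        return np.where(confident, clf.classes_[best], "unknown")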
Supercomputing hardware is undergoing a period of significant change. In order to cope with the rapid pace of hardware and, in many cases, programming model innovation, we have developed the Kokkos Programming Model – a C++-based abstraction that permits performance portability across diverse architectures. Our experience has shown that the abstractions developed can significantly frustrate debugging and profiling activities because they break expected code proximity and layout assumptions. In this paper we present the Kokkos Profiling interface, a lightweight suite of hooks to which debugging and profiling tools can attach to gain deep insights into the execution and data structure behaviors of parallel programs written to the Kokkos interface.
We discuss uncertainty quantification in multisensor data integration and analysis, including estimation methods and the role of uncertainty in decision making and trust in automated analytics. The challenges associated with automatically aggregating information across multiple images, identifying subtle contextual cues, and detecting small changes in noisy activity patterns are well-established in the intelligence, surveillance, and reconnaissance (ISR) community. In practice, such questions cannot be adequately addressed with discrete counting, hard classifications, or yes/no answers. For a variety of reasons ranging from data quality to modeling assumptions to inadequate definitions of what constitutes "interesting" activity, variability is inherent in the output of automated analytics, yet it is rarely reported. Consideration of these uncertainties can provide nuance to automated analyses and engender trust in their results. In this work, we assert the importance of uncertainty quantification for automated data analytics and outline a research agenda. We begin by defining uncertainty in the context of machine learning and statistical data analysis, identify its sources, and motivate the importance and impact of its quantification. We then illustrate these issues and discuss methods for data-driven uncertainty quantification in the context of a multi-source image analysis example. We conclude by identifying several specific research issues and by discussing the potential long-term implications of uncertainty quantification for data analytics, including sensor tasking and analyst trust in automated analytics.
Today’s computational, experimental, and observational sciences rely on computations that involve many related tasks. The success of a scientific mission often hinges on the computer automation of these workflows. In April 2015, the US Department of Energy (DOE) invited a diverse group of domain and computer scientists from national laboratories supported by the Office of Science and the National Nuclear Security Administration, from industry, and from academia to review the workflow requirements of DOE’s science and national security missions, to assess the current state of the art in science workflows, to understand the impact of emerging extreme-scale computing systems on those workflows, and to develop requirements for automated workflow management in future and existing environments. This article is a summary of the opinions of over 50 leading researchers attending this workshop. We highlight use cases, computing systems, and workflow needs, and conclude by summarizing the remaining challenges this community sees that inhibit large-scale scientific workflows from becoming a mainstream tool for extreme-scale science.
Real-time energy pricing has caused a paradigm shift for process operations, with flexibility becoming a critical driver of economics. As such, incorporating real-time pricing into planning and scheduling optimization formulations has received much attention over the past two decades (Zhang and Grossmann, 2016). These formulations, however, focus on 1-hour or longer time discretizations and neglect process dynamics. Recent analysis of historical price data from the California electricity market (CAISO) reveals that a majority of economic opportunities come from fast market layers, i.e., the real-time energy market and ancillary services (Dowling et al., 2017). We present a dynamic optimization framework to quantify the revenue opportunities of chemical manufacturing systems providing frequency regulation (FR). Recent analysis of first-order systems finds that slow process dynamics naturally dampen high-frequency harmonics in FR signals (Dowling and Zavala, 2017). As a consequence, traditional chemical processes with long time constants may be able to provide fast flexibility without disrupting product quality, performance of downstream unit operations, etc. This study quantifies the ability of a distillation system to provide sufficient dynamic flexibility to adjust energy demands every 4 seconds in response to market signals. Using a detailed differential algebraic equation (DAE) model (Hahn and Edgar, 2002) and historical data from the Texas electricity market (ERCOT), we estimate revenue opportunities for different column designs. We implement our model using the algebraic modeling language Pyomo (Hart et al., 2011) and its dynamic optimization extension Pyomo.DAE (Nicholson et al., 2017). These software packages enable rapid development of complex optimization models using high-level modeling constructs and provide flexible tools for initializing and discretizing DAE models.
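A minimal Pyomo/Pyomo.DAE sketch of this modeling pattern follows (a toy first-order lag with a flexible energy demand, assuming a hypothetical time constant and cost weights; the study itself uses a detailed distillation DAE model and real FR/price data):

    from pyomo.environ import (ConcreteModel, Var, Constraint, Objective,
                               TransformationFactory, SolverFactory, minimize)
    from pyomo.dae import ContinuousSet, DerivativeVar, Integral

    m = ConcreteModel()
    m.t = ContinuousSet(bounds=(0.0, 3600.0))          # one hour of operation [s]
    m.x = Var(m.t, initialize=1.0)                     # slow process state (stand-in for the column model)
    m.u = Var(m.t, bounds=(0.8, 1.2), initialize=1.0)  # normalized, flexible energy demand
    m.dxdt = DerivativeVar(m.x, wrt=m.t)

    tau = 600.0  # hypothetical process time constant [s]
    m.dyn = Constraint(m.t, rule=lambda m, t: tau * m.dxdt[t] == -m.x[t] + m.u[t])
    m.x[0.0].fix(1.0)                                  # initial condition

    # Penalize state deviation while letting the energy demand flex; in the study the
    # objective instead tracks the 4-second frequency-regulation signal and market prices.
    m.cost = Integral(m.t, wrt=m.t,
                      rule=lambda m, t: (m.x[t] - 1.0)**2 + 0.1 * (m.u[t] - 1.0)**2)
    m.obj = Objective(expr=m.cost, sense=minimize)

    # Discretize the DAE by collocation and solve with an NLP solver (e.g., Ipopt).
    TransformationFactory('dae.collocation').apply_to(m, nfe=60, ncp=3)
    SolverFactory('ipopt').solve(m)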
Multiple physical time-scales can arise in electromagnetic simulations when dissipative effects are introduced through boundary conditions, when currents follow external time-scales, and when material parameters vary spatially. In such scenarios, the time-scales of interest may be much slower than the fastest time-scales supported by the Maxwell equations, therefore making implicit time integration an efficient approach. The use of implicit temporal discretizations results in linear systems in which fast time-scales, which severely constrain the stability of an explicit method, can manifest as so-called stiff modes. This study proposes a new block preconditioner for structure preserving (also termed physics compatible) discretizations of the Maxwell equations in first order form. The intent of the preconditioner is to enable the efficient solution of multiple-time-scale Maxwell type systems. An additional benefit of the developed preconditioner is that it requires only a traditional multigrid method for its subsolves and compares well against alternative approaches that rely on specialized edge-based multigrid routines that may not be readily available. Results demonstrate parallel scalability at large electromagnetic wave CFL numbers on a variety of test problems.
The solution of the Optimal Power Flow (OPF) and Unit Commitment (UC) problems (i.e., determining generator schedules and set points that satisfy demands) is critical for efficient and reliable operation of the electricity grid. For computational efficiency, the alternating current OPF (ACOPF) problem is usually formulated with a linearized transmission model, often referred to as the DCOPF problem. However, these linear approximations do not guarantee global optimality or even feasibility for the true nonlinear alternating current (AC) system. Nonlinear AC power flow models can and should be used to improve model fidelity, but successful global solution of problems with these models requires the availability of strong relaxations of the AC optimal power flow constraints. In this paper, we use McCormick envelopes to strengthen the well-known second-order cone (SOC) relaxation of the ACOPF problem. With this improved relaxation, we can further include tight bounds on the voltages at the reference bus, and this paper demonstrates the effectiveness of this for improved bounds tightening. We present results on the optimality gap of both the base SOC relaxation and our Strengthened SOC (SSOC) relaxation for the National Information and Communications Technology Australia (NICTA) Energy System Test Case Archive (NESTA). For the cases where the SOC relaxation yields an optimality gap more than 0.1 %, the SSOC relaxation with bounds tightening further reduces the optimality gap by an average of 67 % and ultimately reduces the optimality gap to less than 0.1 % for 58 % of all the NESTA cases considered. Stronger relaxations enable more efficient global solution of the ACOPF problem and can improve computational efficiency of MINLP problems with AC power flow constraints, e.g., unit commitment.
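For reference, the McCormick envelope of a bilinear term w = x·y with x ∈ [x^L, x^U] and y ∈ [y^L, y^U] consists of the four linear inequalities (standard formulation, not this paper's notation):

\[
\begin{aligned}
w &\ge x^{L}y + x\,y^{L} - x^{L}y^{L}, &\qquad w &\ge x^{U}y + x\,y^{U} - x^{U}y^{U},\\
w &\le x^{U}y + x\,y^{L} - x^{U}y^{L}, &\qquad w &\le x^{L}y + x\,y^{U} - x^{L}y^{U}.
\end{aligned}
\]

Because these envelopes tighten as the variable bounds tighten, adding tight bounds on the reference-bus voltage and performing bounds tightening directly strengthens the SSOC relaxation.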
MPI usage patterns are changing as applications move towards fully-multithreaded runtimes. However, the impact of these patterns on MPI message matching is not well-studied. In particular, MPI’s mechanism for receiver-side data placement, message matching, can be impacted by the increased message volume and nondeterminism incurred by multithreading. While there has been significant developer interest and work to provide an efficient MPI interface for multithreaded access, there has not been a study showing how these usage patterns affect message matching behavior. In this paper, we present a framework for studying the effects of multithreading on MPI message matching. This framework allows us to explore the implications of different common communication patterns and thread-level decompositions. We present a study of these impacts on the architecture of two of the Top 10 supercomputers (NERSC’s Cori and LANL’s Trinity). This data provides a baseline to evaluate reasonable matching engine queue lengths, search depths, and queue drain times under the multithreaded model. Furthermore, the study highlights surprising results on the challenge posed by message matching for multithreaded application performance.