Atmospheric tracer transport is a computationally demanding component of the atmospheric dynamical core of weather and climate simulations. Simulations typically have tens to hundreds of tracers. A tracer field is required to preserve several properties, including mass, shape, and tracer consistency. To improve computational efficiency, it is common to apply different spatial and temporal discretizations to the tracer transport equations than to the dynamical equations. Using different discretizations increases the difficulty of preserving properties. This paper provides a unified framework to analyze the property preservation problem and classes of algorithms to solve it. We examine the primary problem and a safety problem; describe three classes of algorithms to solve these; introduce new algorithms in two of these classes; make connections among the algorithms; analyze each algorithm in terms of correctness, bound on its solution magnitude, and its communication efficiency; and study numerical results. A new algorithm, QLT, has the smallest communication volume, and in an important case it redistributes mass approximately locally. These algorithms are only very loosely coupled to the underlying discretizations of the dynamical and tracer transport equations and thus are broadly and efficiently applicable. In addition, they may be applied to remap problems in applications other than tracer transport.
Haddock, Walker; Bangalore, Purushotham V.; Curry, Matthew L.; Skjellum, Anthony
Exascale computing demands high-bandwidth, low-latency I/O on the computing edge. Object storage systems can provide higher bandwidth and lower latencies than tape archives. File transfer nodes present a single point of mediation through which data moving between these storage systems must pass. By increasing the performance of erasure coding, stripes can be subdivided into large numbers of shards. This paper's contribution is a prototype nearline disk object storage system based on Ceph. We show that using general-purpose graphics processing units (GPGPUs) for erasure coding on file transfer nodes is effective when using a large number of shards. We describe an architecture for nearline disk archive storage for use with high performance computing (HPC) and demonstrate its performance with benchmarking results. We compare the benchmark performance of our design with the Intel® Storage Acceleration Library (ISA-L) CPU-based erasure coding libraries using the native Ceph erasure coding feature.
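As a toy illustration of the erasure-coding idea in the preceding abstract (a single XOR parity shard, the k+1 special case; production systems such as Ceph use Reed-Solomon k+m codes, and the paper offloads the coding arithmetic to GPGPUs), here is a minimal sketch showing how a stripe split into shards survives the loss of any one data shard:

```python
# Toy single-parity erasure code: k data shards plus one XOR parity shard.
# Illustrative only; real k+m Reed-Solomon coding is far more general.
import numpy as np

rng = np.random.default_rng(7)
k = 4
shards = [rng.integers(0, 256, 1024, dtype=np.uint8) for _ in range(k)]
parity = np.bitwise_xor.reduce(shards)            # encode: one parity shard

lost = 2                                          # lose any single data shard
survivors = [s for i, s in enumerate(shards) if i != lost]
recovered = np.bitwise_xor.reduce(survivors + [parity])
assert np.array_equal(recovered, shards[lost])    # reconstruction succeeds
```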
The MPI multithreading model has been historically difficult to optimize; the interface that it provides for threads was designed as a process-level interface. This model has led to implementations that treat function calls as critical regions and protect them with locks to avoid race conditions. We hypothesize that an interface designed specifically for threads can provide superior performance to current approaches and even outperform single-threaded MPI. In this paper, we describe a design for partitioned communication in MPI that we call finepoints. First, we assess the existing communication models for MPI two-sided communication and then introduce finepoints as a hybrid that takes the best features of each existing MPI communication model. In addition, "partitioned communication" created with finepoints leverages new network hardware features that cannot be exploited with current MPI point-to-point semantics, making this new approach useful both now and in the future. To demonstrate the validity of our hypothesis, we implement a finepoints library and show improvements against a state-of-the-art multithreaded optimized Open MPI implementation on a Cray XC40 with an Aries network. Our experiments demonstrate up to a 12× reduction in wait time for completion of send operations. This new model is shown working on a nuclear reactor physics neutron-transport proxy application, providing up to 26.1% improvement in communication time and up to 4.8% improvement in runtime over the best-performing MPI communication mode, single-threaded MPI.
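The following is a conceptual sketch of the partitioned-communication pattern the abstract describes, with no MPI and illustrative names only: each thread fills and flags its own partition of a shared send buffer, so transfer of a partition can begin as soon as it is ready (analogous in spirit to the MPI-4 MPI_Pready call that grew out of this line of work) instead of serializing through a lock-guarded whole-message handoff:

```python
# Conceptual sketch of partitioned communication; not the finepoints library.
import threading
import numpy as np

NPART, PART_LEN = 8, 1024
buf = np.empty(NPART * PART_LEN)
ready = [threading.Event() for _ in range(NPART)]

def worker(p):
    buf[p * PART_LEN:(p + 1) * PART_LEN] = p   # compute this thread's partition
    ready[p].set()                             # flag it ready (cf. MPI_Pready)

def progress_engine():
    for p in range(NPART):                     # "send" each partition as soon
        ready[p].wait()                        # as it becomes ready
        print(f"partition {p} sent")

threads = [threading.Thread(target=worker, args=(p,)) for p in range(NPART)]
sender = threading.Thread(target=progress_engine)
sender.start()
for t in threads:
    t.start()
for t in threads:
    t.join()
sender.join()
```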
This paper considers response surface approximations for discontinuous quantities of interest. Our objective is not to adaptively characterize the interface defining the discontinuity. Instead, we utilize an epistemic description of the uncertainty in the location of a discontinuity to produce robust bounds on sample-based estimates of probabilistic quantities of interest. We demonstrate that two common machine learning strategies for classification, one based on nearest neighbors (Voronoi cells) and one based on support vector machines, provide reasonable descriptions of the region where the discontinuity may reside. In higher dimensional spaces, we demonstrate that support vector machines are more accurate for discontinuities defined by smooth interfaces. We also show how gradient information, often available via adjoint-based approaches, can be used to define indicators to effectively detect a discontinuity and to decompose the samples into clusters using an unsupervised learning technique. Numerical results demonstrate the epistemic bounds on probabilistic quantities of interest for simplistic models and for a compressible fluid model with a shock-induced discontinuity.
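A minimal sketch of the SVM-based classification strategy named in the abstract above, under toy assumptions (a 2-D quantity of interest with a jump across a smooth interface; not the paper's models): the classifier learns which side of the discontinuity each sample falls on, and points with small SVM margin delimit the region where the discontinuity may reside.

```python
# Classify samples on either side of a toy discontinuity with an SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))

# Toy QoI with a jump across the smooth interface x1 = sin(pi*x0)/2.
qoi = np.where(X[:, 1] > 0.5 * np.sin(np.pi * X[:, 0]), 1.0, -1.0)

clf = SVC(kernel="rbf", C=10.0).fit(X, qoi > 0)

# Samples with small decision-function margin lie near the learned interface;
# treat that band as the epistemic region containing the discontinuity.
X_test = rng.uniform(-1.0, 1.0, size=(2000, 2))
margin = np.abs(clf.decision_function(X_test))
uncertain = X_test[margin < 0.3]
print(len(uncertain), "samples flagged as possibly straddling the jump")
```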
Communication networks have evolved to a level of sophistication that requires computer models and numerical simulations to understand and predict their behavior. A network simulator is a software tool that enables the network designer to model several components of a computer network, such as nodes, routers, switches, and links, and events, such as data transmissions and packet errors, in order to obtain device- and network-level metrics. Network simulations, like many other numerical approximations that model complex systems, are subject to the specification of parameters and operating conditions of the system. Very often the full characterization of the system and its inputs is not possible; therefore, Uncertainty Quantification (UQ) strategies need to be deployed to evaluate the statistics of its response and behavior. UQ techniques, despite the advancements of the last two decades, still suffer in the presence of a large number of uncertain variables and when the regularity of the system's response cannot be guaranteed. In this context, multifidelity approaches have recently gained popularity in the UQ community due to their flexibility and robustness with respect to these challenges. The main idea behind these techniques is to extract information from a limited number of high-fidelity model realizations and complement it with a much larger set of lower-fidelity evaluations. The result is an estimator with much lower variance, i.e., a more accurate and reliable estimator. In this contribution we investigate the possibility of deploying multifidelity UQ strategies for computer network analysis. Two numerical configurations are studied, based on a simplified network with one client and one server. Preliminary results for these tests suggest that multifidelity sampling techniques can be effective tools for UQ in network applications.
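The generic two-fidelity control-variate estimator behind such multifidelity sampling can be sketched as follows (toy stand-in models, not the network simulators of the study): a few expensive high-fidelity runs are corrected by the discrepancy between a small paired low-fidelity sample and a large independent low-fidelity sample.

```python
# Two-fidelity control-variate estimate of E[f_hi]; toy models only.
import numpy as np

rng = np.random.default_rng(1)
f_hi = lambda x: np.sin(3 * x) + 0.05 * x**2   # expensive model (stand-in)
f_lo = lambda x: np.sin(3 * x)                 # cheap, correlated model

n_hi, n_lo = 50, 5000                          # few HF runs, many LF runs
x_hi = rng.normal(size=n_hi)                   # shared inputs for pairing
x_lo = rng.normal(size=n_lo)

y_hi, y_lo_paired = f_hi(x_hi), f_lo(x_hi)
alpha = np.cov(y_hi, y_lo_paired)[0, 1] / np.var(y_lo_paired)

# HF sample mean, corrected by the LF discrepancy term (variance reduction).
q_mf = y_hi.mean() + alpha * (f_lo(x_lo).mean() - y_lo_paired.mean())
print(q_mf)
```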
Probabilistic simulations of the post-closure performance of a generic deep geologic repository for commercial spent nuclear fuel in shale host rock provide a test case for comparing sensitivity analysis methods available in Geologic Disposal Safety Assessment (GDSA) Framework, the U.S. Department of Energy's state-of-the-art toolkit for repository performance assessment. Simulations assume a thick low-permeability shale with aquifers (potential paths to the biosphere) above and below the host rock. Multi-physics simulations on the 7-million-cell grid are run in a high-performance computing environment with PFLOTRAN. Epistemic uncertain inputs include properties of the engineered and natural systems. The output variables of interest, maximum I-129 concentrations (independent of time) at observation points in the aquifers, vary over several orders of magnitude. Variance-based global sensitivity analyses (i.e., calculations of sensitivity indices) conducted with Dakota use polynomial chaos expansion (PCE) and Gaussian process (GP) surrogate models. Results of analyses conducted with raw output concentrations and with log-transformed output concentrations are compared. Using log-transformed concentrations results in larger sensitivity indices for more influential input variables, smaller sensitivity indices for less influential input variables, and more consistent values for sensitivity indices between methods (PCE and GP) and between analyses repeated with samples of different sizes.
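The effect of the log transform reported above can be reproduced on a toy model whose response spans orders of magnitude. The sketch below (a plain sampling-based Sobol' estimator, not the Dakota PCE/GP workflow) computes first-order sensitivity indices for raw and log-transformed outputs:

```python
# First-order Sobol' indices via the Saltelli (2010) estimator; toy model.
import numpy as np

def sobol_first(f, d, n=20000, rng=np.random.default_rng(2)):
    A, B = rng.uniform(size=(n, d)), rng.uniform(size=(n, d))
    fA, fB = f(A), f(B)
    var = np.var(np.concatenate([fA, fB]))
    S = []
    for i in range(d):
        ABi = A.copy()
        ABi[:, i] = B[:, i]                      # resample only input i
        S.append(np.mean(fB * (f(ABi) - fA)) / var)
    return np.array(S)

# Stand-in concentration model spanning several orders of magnitude.
model = lambda X: np.exp(6 * X[:, 0] + 1 * X[:, 1])

print(sobol_first(model, 2))                         # raw outputs
print(sobol_first(lambda X: np.log(model(X)), 2))    # log-transformed outputs
```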
In the field of semiconductor quantum dot spin qubits, there is growing interest in leveraging the unique properties of hole-carrier systems and their intrinsically strong spin-orbit coupling to engineer novel qubits. Recent advances in semiconductor heterostructure growth have made available high quality, undoped Ge/SiGe quantum wells, consisting of a pure strained Ge layer flanked by Ge-rich SiGe layers above and below. These quantum wells feature heavy hole carriers and a cubic Rashba-type spin-orbit interaction. Here, we describe progress toward realizing spin qubits in this platform, including development of multi-metal-layer gated device architectures, device tuning protocols, and charge-sensing capabilities. Iterative improvement of a three-layer metal gate architecture has significantly enhanced device performance over that achieved using an earlier single-layer gate design. We discuss ongoing, simulation-informed work to fine-tune the device geometry, as well as efforts toward a single-spin qubit demonstration.
The design of satellites usually includes the objective of minimizing mass due to high launch costs, which is challenging due to the need to protect sensitive electronics from the space radiation environment by means of radiation shielding. This is further complicated by the need to account for uncertainties, e.g. in manufacturing. There is growing interest in automated design optimization and uncertainty quantification (UQ) techniques to help achieve that objective. Traditional optimization and UQ approaches that rely exclusively on response functions (e.g. dose calculations) can be quite expensive when applied to transport problems. Previously we showed how adjoint-based transport sensitivities used in conjunction with gradient-based optimization algorithms can be quite effective in designing mass-efficient electron and/or proton shields in one- or two-dimensional Cartesian geometries. In this paper we extend that work to UQ and to robust design (i.e. optimization that considers uncertainties) in 2D. This consists primarily of using the sensitivities to geometric changes, originally derived for optimization, within relevant algorithms for UQ and robust design. We perform UQ analyses on previous optimized designs given some assumed manufacturing uncertainties. We also conduct a new optimization exercise that accounts for the same uncertainties. Our results show much improved computational efficiencies over previous approaches.
Bayesian optimization is an effective surrogate-based optimization method that has been widely used for simulation-based applications. However, the traditional Bayesian optimization (BO) method is only applicable to single-fidelity applications, whereas multiple levels of fidelity exist in reality. In this work, we propose a bi-fidelity known/unknown constrained Bayesian optimization method for design applications. The proposed framework, called sBF-BO-2CoGP, is built on a two-level CoKriging method to predict the objective function. An external binary classifier, which is also another CoKriging model, is used to distinguish between feasible and infeasible regions. The sBF-BO-2CoGP method is demonstrated using a numerical example and a flip-chip application for design optimization to minimize the warpage deformation under thermal loading conditions.
Input/output (I/O) from various sources often contend for scarcely available bandwidth. For example, checkpoint/restart (CR) protocols can help to ensure application progress in failure-prone environments. However, CR I/O alongside an application's normal, requisite I/O can increase I/O contention and might negatively impact performance. In this work, we consider different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing CR-based applications that share I/O resources. We provide a theoretical model and derive a set of necessary constraints to minimize the global waste on a given platform. Our results demonstrate that Young/Daly's optimal checkpoint interval, despite providing a sensible metric for a single, undisturbed application, is not sufficient to optimally address resource contention at scale. We show that by combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies, we can significantly improve overall application performance and maximize the platform throughput. Finally, we evaluate how specialized hardware, namely burst buffers, may help to mitigate the I/O contention problem. Altogether, these results provide critical analysis and direct guidance on how to design efficient, CR-ready, large-scale platforms without a large investment in the I/O subsystem.
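For reference, the Young/Daly period mentioned above is the first-order optimum T_opt ≈ sqrt(2·μ·C) for checkpoint cost C and platform MTBF μ, minimizing the single-application waste W(T) = C/T + T/(2μ). A short worked example (illustrative numbers):

```python
# Young/Daly optimal checkpoint period and the resulting waste fraction
# for one undisturbed application (no I/O contention modeled here).
import math

C = 300.0            # checkpoint cost (s), illustrative
mu = 24 * 3600.0     # platform MTBF (s), illustrative

T_opt = math.sqrt(2 * mu * C)

# First-order waste: checkpoint overhead plus expected rework per failure.
waste = lambda T: C / T + T / (2 * mu)

print(f"T_opt = {T_opt:.0f} s, waste = {waste(T_opt):.3%}")
```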
We present a meshfree quadrature rule for compactly supported nonlocal integro-differential equations (IDEs) with radial kernels. We apply this rule to develop a meshfree discretization of a peridynamic solid mechanics model that requires no background mesh. Existing discretizations of peridynamic models have been shown to exhibit a lack of asymptotic compatibility to the corresponding linearly elastic local solution. By posing the quadrature rule as an equality constrained least squares problem, we obtain asymptotically compatible convergence by introducing polynomial reproduction constraints. Our approach naturally handles traction-free conditions, surface effects, and damage modeling for both static and dynamic problems. We demonstrate high-order convergence to the local theory by comparing to manufactured solutions and to cases with crack singularities for which an analytic solution is available. Finally, we verify the applicability of the approach to realistic problems by reproducing high-velocity impact results from the Kalthoff–Winkler experiments.
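A minimal 1-D sketch of the quadrature construction described above, under toy assumptions (scattered nodes on an interval rather than a peridynamic neighborhood; not the paper's solver): the weights are the minimum-norm solution of an equality-constrained least-squares problem whose constraints enforce exact integration of polynomials up to degree m, i.e., the polynomial reproduction that restores asymptotic compatibility.

```python
# Quadrature weights from an equality-constrained least-squares problem.
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-1.0, 1.0, size=40))   # scattered nodes in [-1, 1]
m = 4                                          # polynomial reproduction degree

P = np.vander(x, m + 1, increasing=True).T     # rows: monomials x^0 .. x^m
b = np.array([(1 - (-1) ** (k + 1)) / (k + 1) for k in range(m + 1)])
# b[k] = exact integral of x^k over [-1, 1]

# Minimum-norm solution of the underdetermined system P w = b.
w, *_ = np.linalg.lstsq(P, b, rcond=None)

f = lambda t: np.exp(t)                        # smooth test integrand
print(w @ f(x), np.exp(1) - np.exp(-1))        # quadrature vs exact value
```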
This work introduces a new method to efficiently solve optimization problems constrained by partial differential equations (PDEs) with uncertain coefficients. The method leverages two sources of inexactness that trade accuracy for speed: (1) stochastic collocation based on dimension-adaptive sparse grids (SGs), which approximates the stochastic objective function with a limited number of quadrature nodes, and (2) projection-based reduced-order models (ROMs), which generate efficient approximations to PDE solutions. These two sources of inexactness lead to inexact objective function and gradient evaluations, which are managed by a trust-region method that guarantees global convergence by adaptively refining the SG and ROM until a proposed error indicator drops below a tolerance specified by trust-region convergence theory. A key feature of the proposed method is that the error indicator, which accounts for errors incurred by both the SG and ROM, must be only an asymptotic error bound, i.e., a bound that holds up to an arbitrary constant that need not be computed. This enables the method to be applicable to a wide range of problems, including those where sharp, computable error bounds are not available; this distinguishes the proposed method from previous works. Numerical experiments performed on a model problem from optimal flow control under uncertainty verify global convergence of the method and demonstrate the method's ability to outperform previously proposed alternatives.
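A schematic sketch of the refinement logic described above, under heavy simplifying assumptions (Cauchy-point trust-region steps on a toy smooth objective; a synthetic error indicator that halves with each refinement level; not the paper's algorithm): the inexact model is refined whenever its error indicator exceeds a tolerance proportional to the gradient norm, the condition trust-region theory imposes.

```python
# Trust-region loop with indicator-driven refinement of an inexact objective.
import numpy as np

def cauchy_step(g, H, delta):
    # Minimizer of the quadratic model along -g, clipped to radius delta.
    gHg = g @ H @ g
    gn = np.linalg.norm(g)
    tau = 1.0 if gHg <= 0 else min(1.0, gn**3 / (delta * gHg))
    return -(tau * delta / gn) * g

def f_true(x):                      # stands in for the fully refined objective
    return 0.5 * x @ x + np.sin(x[0])

def f_inexact(x, level):
    # Coarse value plus an error indicator that shrinks with refinement;
    # in the paper the indicator bounds the combined SG + ROM error.
    return f_true(x) + 0.5**level, 0.5**level

x, delta, level = np.array([2.0, -1.5]), 1.0, 1
for _ in range(40):
    g = x + np.array([np.cos(x[0]), 0.0])        # gradient of f_true
    H = np.eye(2)
    if np.linalg.norm(g) < 1e-8:
        break
    # Refine until the indicator meets the trust-region tolerance.
    while f_inexact(x, level)[1] > 0.1 * np.linalg.norm(g):
        level += 1
    s = cauchy_step(g, H, delta)
    pred = -(g @ s + 0.5 * s @ H @ s)            # predicted decrease (> 0)
    rho = (f_inexact(x, level)[0] - f_inexact(x + s, level)[0]) / pred
    if rho > 0.1:
        x = x + s                                 # accept the step
    delta = 1.5 * delta if rho > 0.75 else (0.5 * delta if rho < 0.1 else delta)
print(x)
```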
Sampling of drinking water distribution systems is performed to ensure good water quality and protect public health. Sampling also satisfies regulatory requirements and is done to respond to customer complaints or emergency situations. Water distribution system modeling techniques can be used to plan and inform sampling strategies. However, a high degree of accuracy and confidence in the hydraulic and water quality models is required to support real-time response. One source of error in these models is related to uncertainty in model input parameters. Effective characterization of these uncertainties and their effect on contaminant transport during a contamination incident is critical for providing confidence estimates in model-based design and evaluation of different sampling strategies. In this paper, the effects of uncertainty in customer demand, isolation valve status, bulk reaction rate coefficient, contaminant injection location, start time, duration, and rate on the size and location of the contaminant plume are quantified for two example water distribution systems. Results show that the most important parameter was the injection location. The size of the plume was also affected by the reaction rate coefficient, injection rate, and injection duration, whereas the exact location of the plume was additionally affected by the isolation valve status. Uncertainty quantification provides a more complete picture of how contaminants move within a water distribution system and more information when using modeling results to select sampling locations.
Existing machines for lazy evaluation use a flat representation of environments, storing the terms associated with free variables in an array. Combined with a heap, this structure supports the shared intermediate results required by lazy evaluation. We propose and describe an alternative approach that uses a shared environment to minimize the overhead of delayed computations. We show how a shared environment can act as both an environment and a mechanism for sharing results. To formalize this approach, we introduce a calculus that makes the shared environment explicit, as well as a machine to implement the calculus, the Cactus Environment Machine. A simple compiler implements the machine and is used to run experiments for assessing performance. The results show reasonable performance and suggest that incorporating this approach into real-world compilers could yield performance benefits in some scenarios.
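The following is a toy sketch of the shared-environment idea above (illustrative names, not the Cactus Environment Machine's representation): environment frames form a tree, so extended environments share their tails, and updating a delayed binding in place makes the forced result visible to every closure that captured that frame, giving both an environment and a sharing mechanism in one structure.

```python
# Linked (tree-shaped) environment frames with in-place thunk update.
class Frame:
    def __init__(self, name, thunk, parent=None):
        self.name, self.thunk, self.parent = name, thunk, parent

    def lookup(self, name):
        env = self
        while env is not None:
            if env.name == name:
                if callable(env.thunk):       # delayed computation (thunk)
                    env.thunk = env.thunk()   # update in place: shared result
                return env.thunk
            env = env.parent
        raise NameError(name)

base = Frame("x", lambda: print("forced once") or 42)
e1 = Frame("y", 1, parent=base)   # two environments sharing the tail `base`
e2 = Frame("z", 2, parent=base)

print(e1.lookup("x"))   # forces the thunk: prints "forced once", then 42
print(e2.lookup("x"))   # shared result: 42, without re-forcing
```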
There is a need for efficient optimization strategies to solve large-scale, nonlinear optimization problems. Many problem classes, including design under uncertainty, are inherently structured and can be accelerated with decomposition approaches. This paper describes a second-order multiplier update for the alternating direction method of multipliers (ADMM) to solve nonlinear stochastic programming problems. We exploit connections between ADMM and the Schur-complement decomposition to derive an accelerated version of ADMM. Specifically, we study the effectiveness of performing a Newton-Raphson algorithm to compute multiplier estimates for the method of multipliers (MM). We interpret ADMM as a decomposable version of MM and propose modifications to the multiplier update of the standard ADMM scheme based on improvements observed in MM. The modifications to the ADMM algorithm seek to accelerate solutions of optimization problems for design under uncertainty, and the numerical effectiveness of the approaches is demonstrated on a set of ten stochastic programming problems. Practical strategies for improving computational performance are discussed along with comparisons between the algorithms. We observe that the second-order update achieves convergence in fewer unconstrained minimizations for MM on general nonlinear problems. In the case of ADMM, the second-order update significantly reduces the number of subproblem solves for convex quadratic programs (QPs).
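For orientation, here is a minimal consensus-ADMM sketch on a toy two-scenario QP (illustrative only; the paper's second-order, Schur-complement-based multiplier update is marked by a comment where it would substitute for the standard first-order ascent step):

```python
# Consensus ADMM for: min over x_s of sum_s 0.5*(x_s - a_s)^2, s.t. x_s = z.
import numpy as np

a = np.array([1.0, 3.0])            # scenario data (toy)
rho = 1.0
x, z, lam = np.zeros(2), 0.0, np.zeros(2)

for k in range(50):
    # x-update: closed form for each decoupled scenario subproblem.
    x = (a + rho * z - lam) / (1.0 + rho)
    # z-update: consensus (average of scenario variables and scaled duals).
    z = np.mean(x + lam / rho)
    # Multiplier update: standard first-order ascent step. A second-order
    # scheme would replace this line with a Newton-type step on the dual.
    lam = lam + rho * (x - z)

print(x, z)   # converges to the consensus minimizer z = mean(a) = 2.0
```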
Two surrogate models are under development to rapidly emulate the effects of the Fuel Matrix Degradation (FMD) model in GDSA Framework. One is a polynomial regression surrogate with linear and quadratic fits, and the other is a k-Nearest Neighbors regressor (kNNr) method that operates on a lookup table. Direct coupling of the FMD model to GDSA Framework is too computationally expensive. Preliminary results indicate these surrogate models will enable GDSA Framework to rapidly simulate spent fuel dissolution for each individual breached spent fuel waste package in a probabilistic repository simulation. This capability will allow uncertainties in spent fuel dissolution to be propagated and sensitivities in FMD inputs to be quantified and ranked against other inputs.
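A minimal sketch of the lookup-table-style kNN surrogate named above, with synthetic stand-in data (the actual FMD input/output table is not reproduced here): once trained on precomputed runs, the regressor answers per-waste-package queries far faster than the mechanistic model.

```python
# k-nearest-neighbors regression on a precomputed (toy) lookup table.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(4)
X = rng.uniform(size=(1000, 3))                   # stand-in FMD inputs
y = X[:, 0] * np.exp(-X[:, 1]) + 0.1 * X[:, 2]    # stand-in FMD output

knn = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)
print(knn.predict(rng.uniform(size=(5, 3))))      # fast in-simulation queries
```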
We use Bayesian data analysis to predict dengue fever outbreaks and quantify the link between outbreaks and meteorological precursors tied to the breeding conditions of vector mosquitos. We use Hamiltonian Monte Carlo sampling to estimate a seasonal Gaussian process modeling infection rate, and aperiodic basis coefficients for the rate of an “outbreak level” of infection beyond seasonal trends across two separate regions. We use this outbreak level to estimate an autoregressive moving average (ARMA) model from which we extrapolate a forecast. We show that the resulting model has useful forecasting power in the 6–8 week range. The forecasts are not significantly more accurate with the inclusion of meteorological covariates than with infection trends alone.
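The final forecasting stage can be sketched as follows, on synthetic data (the Bayesian Gaussian-process and Hamiltonian Monte Carlo stages that produce the outbreak-level series are not reproduced; model order here is illustrative):

```python
# Fit an ARMA model to an "outbreak level" series and forecast 8 weeks ahead.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(5)
level = np.cumsum(rng.normal(scale=0.1, size=200))  # stand-in outbreak level

res = ARIMA(level, order=(2, 0, 1)).fit()           # ARMA(2,1): d = 0
print(res.forecast(steps=8))                        # 8-step-ahead forecast
```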
The Tularosa study was designed to understand how defensive deception, including both cyber and psychological deception, affects cyber attackers. Over 130 red teamers participated in a two-day network penetration task in which we controlled both the presence of and explicit mention of deceptive defensive techniques. To our knowledge, this represents the largest study of its kind ever conducted on a professional red team population. The study included a battery of questionnaires (e.g., experience, personality) and cognitive tasks (e.g., fluid intelligence, working memory), allowing for the characterization of a "typical" red teamer, as well as physiological measures (e.g., galvanic skin response, heart rate) to be correlated with the cyber events. This paper focuses on the study design, implementation, data, and population characteristics, and begins to examine preliminary results.
This is the official user guide for the MUELU multigrid library in Trilinos version 12.13 (Dev). This guide provides an overview of MUELU, its capabilities, and instructions for new users who want to start using MUELU with a minimum of effort. Detailed information is given on how to drive MUELU through its XML interface. Links to more advanced use cases are given. This guide also explains how to achieve good parallel performance and how to introduce new algorithms. Finally, readers will find a comprehensive listing of available MUELU options. Any options not documented in this manual should be considered strictly experimental.
We present an overview of optimization under uncertainty efforts under the DARPA Enabling Quantification of Uncertainty in Physical Systems (EQUiPS) ScramjetUQ project. We introduce the mathematical frameworks and computational tools employed for performing this task. In particular, we provide details on the optimization and multilevel uncertainty quantification algorithms, which are available through the SNOWPAC and DAKOTA software packages. The overall workflow is first demonstrated on a simplified model design problem with non-reacting inviscid supersonic flows. Preliminary results and updates are then reported for an in-progress scramjet design optimization case using large-eddy simulations of supersonic reactive flows inside the HIFiRE Direct Connect Rig.
For at least the last 20 years, many have tried to create a general resource management system to support interoperability across various concurrent libraries. The previous strategies all suffered from additional toolchain requirements and/or use of a shared programming model that assumed it owned/controlled access to all resources available to the program. None of these techniques has achieved widespread adoption. The ubiquity of OpenMP, coupled with C++'s development of a standard way to describe many different concurrent paradigms (C++23 executors), would allow OpenMP to assume the role of a general resource manager without requiring user code written directly in OpenMP. With a few added features, such as the ability to use otherwise idle threads to execute tasks and to specify a task "width", many interesting concurrent frameworks could be developed in native OpenMP and achieve high performance. Further, one could create concrete C++ OpenMP executors that enable support for general C++ executor-based codes, which would allow Fortran, C, and C++ codes to use the same underlying concurrent framework when expressed as native OpenMP or using language-specific features. Effectively, OpenMP would become the de facto solution for a problem that has long plagued the HPC community.
Krichmar, Jeffrey L.; Severa, William M.; Khan, Muhammad S.; Olds, James L.
The Artificial Intelligence (AI) revolution foretold during the 1960s is well underway in the second decade of the twenty-first century. Its period of phenomenal growth likely lies ahead. AI-operated machines and technologies will extend the reach of Homo sapiens far beyond the biological constraints imposed by evolution: outwards further into deep space, as well as inwards into the nano-world of DNA sequences and relevant medical applications. And yet, we believe, there are crucial lessons that biology can offer that will enable a prosperous future for AI. For machines in general, and for AIs especially, operating over extended periods or in extreme environments will require energy usage orders of magnitude more efficient than exists today. In many operational environments, energy sources will be constrained. An AI's design and function may depend upon the type of energy source, as well as its availability and accessibility. Any plans for AI devices operating in a challenging environment must begin with the question of how they are powered, where fuel is located, how energy is stored and made available to the machine, and how long the machine can operate on specific energy units. While one of the key advantages of AI use is to reduce the dimensionality of a complex problem, the fact remains that some energy is required for functionality. Hence, the materials and technologies that provide the needed energy represent a critical challenge toward future use scenarios of AI and should be integrated into their design. Here we look to the brain and other aspects of biology as inspiration for Biomimetic Research for Energy-efficient AI Designs (BREAD).
Herrington, Adam R.; Lauritzen, Peter H.; Taylor, Mark A.; Goldhaber, Steve; Eaton; Reed; Ullrich, Paul A.
Atmospheric modeling with element-based high-order Galerkin methods presents a unique challenge to the conventional physics–dynamics coupling paradigm, due to the highly irregular distribution of nodes within an element and the distinct numerical characteristics of the Galerkin method. The conventional coupling procedure is to evaluate the physical parameterizations (physics) on the dynamical core grid. Evaluating the physics at the nodal points exacerbates numerical noise from the Galerkin method, enabling and amplifying local extrema at element boundaries. Grid imprinting may be substantially reduced through the introduction of an entirely separate, approximately isotropic finite-volume grid for evaluating the physics forcing. Integration of the spectral basis over the control volumes provides an area-average state to the physics, which is more representative of the state in the vicinity of the nodal points rather than the nodal point itself and is more consistent with the notion of a “large-scale state” required by conventional physics packages. This study documents the implementation of a quasi-equal-area physics grid into NCAR’s Community Atmosphere Model Spectral Element and is shown to be effective at mitigating grid imprinting in the solution. The physics grid is also appropriate for coupling to other components within the Community Earth System Model, since the coupler requires component fluxes to be defined on a finite-volume grid, and one can be certain that the fluxes on the physics grid are, indeed, volume averaged.
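In symbols, the area-averaged state handed to the physics can be written as follows (notation ours and illustrative, not CAM-SE's source): with nodal values q_i, element basis functions ψ_i, and physics-grid control volume Ω_k,

```latex
% Control-volume average of the spectral-element state q_h on the physics grid.
\bar{q}_k \;=\; \frac{1}{|\Omega_k|} \int_{\Omega_k} q_h \,\mathrm{d}A
\;=\; \frac{1}{|\Omega_k|} \sum_{i} q_i \int_{\Omega_k} \psi_i(\mathbf{x}) \,\mathrm{d}A .
```

Because each control-volume average integrates the spectral basis exactly, summing the averages weighted by |Ω_k| recovers the element integral, which is why the fluxes on the physics grid are genuinely volume averaged.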
Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.
The study of hypersonic flows and their underlying aerothermochemical reactions is particularly important in the design and analysis of vehicles exiting and reentering Earth’s atmosphere. Computational physics codes can be employed to simulate these phenomena; however, code verification of these codes is necessary to certify their credibility. To date, few approaches have been presented for verifying codes that simulate hypersonic flows, especially flows reacting in thermochemical nonequilibrium. In this paper, we present our code-verification techniques for hypersonic reacting flows in thermochemical nonequilibrium, as well as their deployment in the Sandia Parallel Aerodynamics and Reentry Code (SPARC).
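As background, the standard order-of-accuracy verification workflow can be sketched on a deliberately simple problem (a 1-D Poisson solve with a manufactured solution; this illustrates the generic technique, not SPARC's hypersonic, thermochemical-nonequilibrium test problems): manufacture an exact solution, derive the corresponding source term, and confirm that the discrete error decays at the scheme's formal order.

```python
# Order verification via a manufactured solution for -u'' = f on (0, 1).
import numpy as np

def max_error(n):
    x = np.linspace(0.0, 1.0, n + 1)
    h = 1.0 / n
    # Manufactured solution u = sin(pi x)  =>  f = pi^2 sin(pi x).
    f = np.pi**2 * np.sin(np.pi * x[1:-1])
    # Central-difference operator for -u'' with u(0) = u(1) = 0.
    A = (np.diag(2.0 * np.ones(n - 1))
         - np.diag(np.ones(n - 2), 1)
         - np.diag(np.ones(n - 2), -1)) / h**2
    u = np.linalg.solve(A, f)
    return np.max(np.abs(u - np.sin(np.pi * x[1:-1])))

e1, e2 = max_error(32), max_error(64)
print(np.log2(e1 / e2))   # observed order: ~2 for a correct implementation
```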
We begin by presenting an overview of the general philosophy that is guiding the novel DARMA developments, followed by a brief reminder about the background of this project. We finally present the FY19 design requirements. As the Exascale era arises, DARMA is uniquely positioned at the forefront of asynchronous many-task (AMT) research and development (R&D) to explore emerging programming model paradigms for next-generation HPC applications at Sandia, across NNSA labs, and beyond. The DARMA project explores how to fundamentally shift the expression (PM) and execution (EM) of massively concurrent HPC scientific algorithms to be more asynchronous, resilient to executional aberrations in heterogeneous/unpredictable environments, and data-dependency conscious, thereby enabling an intelligent, dynamic, and self-aware runtime to guide execution.
Significant testing is required to design and certify primary aircraft structures subject to High Energy Dynamic Impact (HEDI) events; current work under the NASA Advanced Composites Consortium (ACC) HEDI Project seeks to determine the state of the art of dynamic fracture simulations for composite structures in these events. This paper discusses one of three Progressive Damage Analysis (PDA) methods selected for the second phase of the NASA ACC project: peridynamics, through its implementation in EMU. A brief discussion of peridynamic theory is provided, including the effects of nonlinearity and strain-rate dependence of the matrix, followed by a blind prediction and test-analysis correlation for ballistic impact testing performed on configured skin-stringer panels.
Drinking water utilities use booster stations to maintain chlorine residuals throughout water distribution systems. Booster stations could also be used as part of an emergency response plan to minimize health risks in the event of an unintentional or malicious contamination incident. The benefit of booster stations for emergency response depends on several factors, including the reaction between chlorine and an unknown contaminant species, the fate and transport of the contaminant in the water distribution system, and the time delay between detection and initiation of boosted levels of chlorine. This paper takes these aspects into account and proposes a mixed-integer linear program formulation for optimizing the placement of booster stations for emergency response. A case study is used to explore the ability of optimally placed booster stations to reduce the impact of contamination in water distribution systems.
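A toy sketch in the spirit of the formulation described above (illustrative impact coefficients and a simple p-median structure; the paper's actual MILP couples chlorine decay, contaminant transport, and response delay): choose at most B booster locations so that the expected impact over contamination scenarios is minimized.

```python
# Toy booster-placement MILP (p-median flavor); not the paper's exact model.
import pulp

nodes, scenarios, B = range(6), range(4), 2
# impact[s][n]: residual impact of scenario s if served by a booster at n (toy)
impact = [[(s + 1) * ((n - s) % 6 + 1) for n in nodes] for s in scenarios]

prob = pulp.LpProblem("booster_placement", pulp.LpMinimize)
y = pulp.LpVariable.dicts("place", nodes, cat="Binary")
x = pulp.LpVariable.dicts("assign",
                          [(s, n) for s in scenarios for n in nodes],
                          cat="Binary")

prob += pulp.lpSum(impact[s][n] * x[(s, n)] for s in scenarios for n in nodes)
prob += pulp.lpSum(y[n] for n in nodes) <= B               # station budget
for s in scenarios:
    prob += pulp.lpSum(x[(s, n)] for n in nodes) == 1      # each scenario served
    for n in nodes:
        prob += x[(s, n)] <= y[n]                          # only placed stations

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print([n for n in nodes if y[n].value() == 1])             # chosen locations
```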
We describe new machine-learning-based methods to defeature CAD models for tetrahedral meshing. Using machine learning predictions of mesh quality for geometric features of a CAD model prior to meshing, we can identify potential problem areas and improve meshing outcomes by presenting a prioritized list of suggested geometric operations to users. Our machine learning models are trained using a combination of geometric and topological features from the CAD model and local quality metrics as ground truth. We demonstrate a proof-of-concept implementation of the resulting workflow using Sandia's Cubit Geometry and Meshing Toolkit.
As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput, yet most scalable system software was designed and developed for single-threaded environments. Multithreaded environments are becoming increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper highlights a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking the time-per-iteration targets of many applications. Several proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next-generation scalable system software.
In this paper, we develop software for decomposing sparse tensors that is portable to and performant on a variety of multicore, manycore, and GPU computing architectures. The result is a single code whose performance matches optimized architecture-specific implementations. The key to a portable approach is to determine multiple levels of parallelism that can be mapped in different ways to different architectures, and we explain how to do this for the matricized tensor times Khatri-Rao product (MTTKRP), which is the key kernel in canonical polyadic tensor decomposition. Our implementation leverages the Kokkos framework, which enables a single code to achieve high performance across multiple architectures that differ in how they approach fine-grained parallelism. We also introduce a new construct for portable thread-local arrays, which we call compile-time polymorphic arrays. Not only are the specifics of our approaches and implementation interesting for tuning tensor computations, but they also provide a roadmap for developing other portable high-performance codes. As a last step in optimizing performance, we modify the MTTKRP algorithm itself to do a permuted traversal of tensor nonzeros to reduce atomic-write contention. We test the performance of our implementation on 16- and 68-core Intel CPUs and the K80 and P100 NVIDIA GPUs, showing that we are competitive with state-of-the-art architecture-specific codes while having the advantage of being able to run on a variety of architectures.
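For reference, the MTTKRP kernel named above can be written serially in a few lines (a plain reference implementation for a 3-way sparse tensor in COO form; the paper's contribution is the portable Kokkos parallelization and permuted traversal, not this loop):

```python
# Reference mode-0 MTTKRP for a sparse COO tensor: M[i,:] += v * (B[j,:]*C[k,:]).
import numpy as np

I, J, K, R, nnz = 30, 40, 50, 8, 2000
rng = np.random.default_rng(6)
subs = np.column_stack([rng.integers(0, d, nnz) for d in (I, J, K)])
vals = rng.normal(size=nnz)
B, C = rng.normal(size=(J, R)), rng.normal(size=(K, R))

M = np.zeros((I, R))
for (i, j, k), v in zip(subs, vals):
    M[i, :] += v * B[j, :] * C[k, :]   # the writes contended under parallelism

# Vectorized equivalent: scatter-add handles duplicate row indices correctly.
M2 = np.zeros((I, R))
np.add.at(M2, subs[:, 0], vals[:, None] * B[subs[:, 1]] * C[subs[:, 2]])
assert np.allclose(M, M2)
```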