In this work, a stabilized continuous Galerkin (CG) method for magnetohydrodynamics (MHD) is presented. The ideal, compressible, inviscid MHD equations are discretized in space on unstructured meshes using piecewise linear or bilinear finite element bases to obtain a semi-discrete scheme. Stabilization is then introduced to the semi-discrete method in a strategy that follows the algebraic flux correction paradigm. This involves adding artificial diffusion to the high-order, semi-discrete method and lumping the mass matrix in the time derivative term. The result is a low-order method that provides local extremum diminishing properties for hyperbolic systems. The difference between the low-order method and the high-order method is scaled element-wise using a limiter and added to the low-order scheme. The limiter is solution dependent and computed via an iterative linearity-preserving nodal variation limiting strategy. The stabilization also involves an optional consistent background high-order dissipation that reduces phase errors. The resulting stabilized scheme is a semi-discrete method that can be applied to inviscid shock MHD problems and may even be extended to resistive and viscous MHD problems. To satisfy the divergence-free constraint of the MHD equations, we add parabolic divergence cleaning to the system. Various time integration methods can be used to discretize the scheme in time. We demonstrate the robustness of the scheme by solving several shock MHD problems.
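Schematically (notation assumed here for illustration; it is not taken verbatim from the paper), the flux-corrected semi-discrete scheme described above combines the low-order method with limited antidiffusion:

```latex
% m_i: lumped mass at node i; u_i: nodal unknowns; L_i(u_h): low-order,
% local-extremum-diminishing spatial operator; f_{ij}: raw antidiffusive fluxes
% whose full inclusion recovers the high-order Galerkin method; alpha_{ij}:
% solution-dependent limiter from the nodal variation limiting strategy.
m_i \frac{\mathrm{d} u_i}{\mathrm{d} t} = L_i(u_h) + \sum_{j \neq i} \alpha_{ij} f_{ij},
\qquad 0 \le \alpha_{ij} \le 1,
```

with \alpha_{ij} = 0 recovering the low-order method and \alpha_{ij} = 1 the high-order one.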
In this paper, we continue our efforts to exploit optimization and control ideas as a common foundation for the development of property-preserving numerical methods. Here we focus on a class of scalar advection equations whose solutions have fixed mass in a given Eulerian region and constant bounds in any Lagrangian volume. Our approach separates discretization of the equations from the preservation of their solution properties by treating the latter as optimization constraints. This relieves the discretization process from having to comply with additional restrictions and makes stability and accuracy the sole considerations in its design. A property-preserving solution is then sought as a state that minimizes the distance to an optimally accurate but not property-preserving target solution computed by the scheme, subject to constraints enforcing discrete proxies of the desired properties. We consider two such formulations, in which the optimization variables are given by the nodal solution values and by suitably defined nodal fluxes, respectively. A key result of the paper reveals that a standard Algebraic Flux Correction (AFC) scheme is a modified version of the second formulation, obtained by shrinking its feasible set to a hypercube. We conclude with numerical studies illustrating the optimization-based formulations and comparing them with AFC.
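In generic form (symbols assumed here for illustration), the optimization-based formulation reads:

```latex
% u_T: accurate but not property-preserving target solution; u: optimization
% variable (nodal values in the first formulation, nodal fluxes in the second);
% m: vector of (lumped) masses enforcing conservation; u_min, u_max: local bounds.
\min_{u} \; \tfrac{1}{2}\, \| u - u_T \|^2
\quad \text{subject to} \quad
m^\top u = m^\top u_T, \qquad u_{\min} \le u \le u_{\max}.
```

The connection to AFC noted above arises when the feasible set of the flux-based formulation is shrunk to a hypercube, so that each flux can be limited independently of the others.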
We extend and improve recent results given by Singh and Watson on using classical bounds on the union of sets in a chance-constrained optimization problem. Specifically, we revisit the so-called Dawson and Sankoff bound, which provided one of the best approximations of a chance constraint in the previous analysis. We show that our work generalizes the previous work, and that the inequality employed previously is in fact a very relaxed approximation resting on assumptions that do not generally hold. Computational results demonstrate an average improvement of over 43% in the bounds. As a byproduct, we provide an exact reformulation of the floor function in optimization models.
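For reference, the Dawson and Sankoff bound on the probability of a union of n events A_1, ..., A_n uses the first two binomial moments of the number of events that occur:

```latex
S_1 = \sum_{i=1}^{n} P(A_i), \qquad
S_2 = \sum_{1 \le i < j \le n} P(A_i \cap A_j), \qquad
h = 1 + \left\lfloor 2 S_2 / S_1 \right\rfloor,
\\
P\Big( \textstyle\bigcup_{i=1}^{n} A_i \Big) \;\ge\; \frac{2}{h+1}\, S_1 - \frac{2}{h(h+1)}\, S_2 .
```

The floor function appearing in h is precisely what motivates the exact reformulation of the floor function mentioned above.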
Mimetic methods discretize divergence by restricting the Gauss theorem to mesh cells. Because point clouds lack such geometric entities, construction of a compatible meshfree divergence remains a challenge. In this work, we define an abstract Meshfree Mimetic Divergence (MMD) operator on point clouds by contraction of field and virtual face moments. This MMD satisfies a discrete divergence theorem, provides a discrete local conservation principle, and is first-order accurate. We consider two MMD instantiations. The first one assumes a background mesh and uses generalized moving least squares (GMLS) to obtain the necessary field and face moments. This MMD instance is appropriate for settings where a mesh is available but its quality is insufficient for a robust and accurate mesh-based discretization. The second MMD operator retains the GMLS field moments but defines virtual face moments using computationally efficient weighted graph-Laplacian equations. This MMD instance does not require a background grid and is appropriate for applications where mesh generation creates a computational bottleneck. It allows one to trade an expensive mesh generation problem for a scalable algebraic one, without sacrificing compatibility with the divergence operator. We demonstrate the approach by using the MMD operator to obtain a virtual finite-volume discretization of conservation laws on point clouds. Numerical results in the paper confirm the mimetic properties of the method and show that it behaves similarly to standard finite volume methods.
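Schematically (notation assumed here, not the paper's), the discrete divergence theorem satisfied by the MMD operator mirrors the Gauss theorem:

```latex
% The discrete divergence at a point, weighted by its virtual cell measure |c|,
% equals the contraction of flux moments over the virtual faces f bounding the
% cell; summing over cells cancels interior face contributions in pairs, which
% yields the discrete local conservation principle.
(\mathrm{DIV}\, u)_c \, |c| \;=\; \sum_{f \in \partial c} (u \cdot n)_f \, |f| .
```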
We utilize generalized moving least squares (GMLS) to develop meshfree techniques for discretizing hydrodynamic flow problems on manifolds. We use exterior calculus to formulate incompressible hydrodynamic equations in the Stokesian regime and handle the divergence-free constraints via a generalized vector potential. This provides less coordinate-centric descriptions and enables the development of efficient numerical methods and splitting schemes for the fourth-order governing equations in terms of a system of second-order elliptic operators. Using a Hodge decomposition, we develop methods for manifolds having spherical topology. We show the methods exhibit high-order convergence rates for solving hydrodynamic flows on curved surfaces. The methods also provide general high-order approximations for the metric, curvature, and other geometric quantities of the manifold and associated exterior calculus operators. The approaches can also be utilized to develop high-order solvers for other scalar-valued and vector-valued problems on manifolds.
Historically, neuroscience principles have heavily influenced artificial intelligence (AI), for example the influence of the perceptron model, essentially a simple model of a biological neuron, on artificial neural networks. More recently, notable AI advances, for example the growing popularity of reinforcement learning, often appear more aligned with cognitive neuroscience or psychology, focusing on function at a relatively abstract level. At the same time, neuroscience stands poised to enter a new era of large-scale high-resolution data and appears more focused on underlying neural mechanisms or architectures that can, at times, seem rather removed from functional descriptions. While this might seem to foretell a new generation of AI approaches arising from a deeper exploration of neuroscience specifically for AI, the most direct path for achieving this is unclear. Here we discuss cultural differences between the two fields, including divergent priorities that should be considered when leveraging modern-day neuroscience for AI. For example, the two fields feed two very different applications that at times require potentially conflicting perspectives. We highlight small but significant cultural shifts that we feel would greatly facilitate increased synergy between the two fields.
Rushdi, Mostafa A.; Dief, Tarek N.; Yoshida, Shigeo; Schmehl, Roland; Rushdi, Ahmad R.
Kites can be used to harvest wind energy at higher altitudes while using only a fraction of the material required for conventional wind turbines. In this work, we present the kite system of Kyushu University and demonstrate how experimental data can be used to train machine learning regression models. The system is designed for 7 kW traction power and comprises an inflatable wing with a suspended kite control unit that is either tethered to a fixed ground anchor or to a towing vehicle to produce a controlled relative flow environment. A measurement unit was attached to the kite for data acquisition. To predict the generated tether force, we collected input–output samples from a set of well-designed experimental runs to act as our labeled training data in a supervised machine learning setting. We then identified a set of key input parameters which were found to be consistent with our sensitivity analysis using Pearson input–output correlation metrics. Finally, we designed and tested the accuracy of a neural network, among other multivariate regression models. The quality metrics of our models show great promise in accurately predicting the tether force for new input/feature combinations and can potentially guide new designs for optimal power generation.
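A minimal sketch of the supervised-regression setup (the feature names, synthetic data, and scikit-learn MLPRegressor below are illustrative assumptions, not the paper's instrumentation or network architecture):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n = 2000
# Hypothetical features: apparent wind speed, elevation angle, reel-out speed,
# kite pitch; target: measured tether force (synthetic stand-in data).
X = rng.uniform(size=(n, 4))
y = 3.0 * X[:, 0] ** 2 + X[:, 1] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=n)

# Pearson input-output correlations, as in the paper's sensitivity screening.
print("Pearson r:", [round(np.corrcoef(X[:, k], y)[0, 1], 2) for k in range(4)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
model = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000, random_state=0)
model.fit(scaler.transform(X_tr), y_tr)
print("held-out R^2:", r2_score(y_te, model.predict(scaler.transform(X_te))))
```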
Graph partitioning has been an important tool to partition the work among several processors to minimize the communication cost and balance the workload. While accelerator-based supercomputers are emerging to be the standard, the use of graph partitioning becomes even more important as applications are rapidly moving to these architectures. However, there is no scalable, distributed-memory, multi-GPU graph partitioner available for applications. We developed a spectral graph partitioner, Sphynx, using the portable, accelerator-friendly stack of the Trilinos framework. We use Sphynx to systematically evaluate the various algorithmic choices in spectral partitioning with a focus on GPU performance. We perform these evaluations on irregular graphs, because state-of-the-art partitioners have the most difficulty with them. We demonstrate that Sphynx is up to 17x faster on GPUs than on CPUs, and up to 580x faster than a state-of-the-art multilevel partitioner. Sphynx provides a robust alternative for applications looking for a GPU-based partitioner.
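A minimal sketch of the spectral bisection idea underlying spectral partitioners such as Sphynx (an illustrative Python/SciPy toy, not Sphynx's Trilinos/Kokkos implementation): split the vertices by the median of the Fiedler vector of the graph Laplacian.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

# Tiny example graph: an 8-vertex ring, stored symmetrically.
i = np.arange(8)
j = (i + 1) % 8
A = sp.coo_matrix((np.ones(16), (np.r_[i, j], np.r_[j, i])), shape=(8, 8)).tocsr()

L = laplacian(A)
vals, vecs = eigsh(L, k=2, which="SM")             # two smallest eigenpairs
fiedler = vecs[:, np.argsort(vals)[1]]             # eigenvector for lambda_2
part = (fiedler > np.median(fiedler)).astype(int)  # median split -> balanced halves
print(part)
```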
Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce the cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.
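A conceptual sketch of the explicit-reuse idea in contributions (i) and (ii) (a Python illustration; `register` is a hypothetical stand-in for NIC memory registration, the role played by ibv_reg_mr in verbs-based networks, and the slab/chunk sizes are arbitrary):

```python
class RegisteredPool:
    SLAB, CHUNK = 1 << 20, 4096   # register 1 MiB slabs, hand out 4 KiB chunks

    def __init__(self, register):
        self._register = register  # expensive call: pins pages, programs the NIC
        self._free = []            # chunk offsets ready for reuse

    def alloc(self):
        if not self._free:
            self._register(self.SLAB)  # one registration amortized over 256 chunks
            self._free = list(range(0, self.SLAB, self.CHUNK))
        return self._free.pop()

    def free(self, off):
        self._free.append(off)     # explicit reuse: the memory stays registered

pool = RegisteredPool(register=lambda nbytes: print(f"registered {nbytes} bytes"))
chunks = [pool.alloc() for _ in range(8)]  # prints once: a single registration
for c in chunks:
    pool.free(c)
```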
Generally, scientific simulations load the entire simulation domain into memory because most, if not all, of the data changes with each time step. This has driven application structures that have, in turn, affected the design of popular IO libraries, such as HDF-5, ADIOS, and NetCDF. This assumption makes sense for many cases, but there is also a significant collection of simulations where this approach results in vast swaths of unchanged data written each time step. This paper explores a new IO approach that is capable of stitching together a coherent global view of the total simulation space at any given time. This benefit is achieved with no performance penalty compared to running with the full data set in memory, requires radically fewer processes, and yields substantial data reduction with no loss of fidelity. Additionally, the structures employed enable online simulation monitoring.
Uncertainty pervades virtually every branch of science and engineering, and in many disciplines, the underlying phenomena can be modeled by partial differential equations (PDEs) with uncertain or random inputs. This work is motivated by risk-averse stochastic programming problems constrained by PDEs. These problems are posed in infinite dimensions, which leads to a significant increase in the scale of the (discretized) problem. In order to handle the inherent nonsmoothness of, for example, coherent risk measures and to exploit existing solution techniques for smooth, PDE-constrained optimization problems, we propose a variational smoothing technique called epigraphical (epi-)regularization. We investigate the effects of epi-regularization on the axioms of coherency and prove differentiability of the smoothed risk measures. In addition, we demonstrate variational convergence of the epi-regularized risk measures and prove the consistency of minimizers and first-order stationary points for the approximate risk-averse optimization problem. We conclude with numerical experiments confirming our theoretical results.
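As a concrete instance (the paper treats general coherent risk measures; the formula below is the standard Rockafellar-Uryasev variational form, not a result specific to this work), the conditional value-at-risk illustrates the nonsmoothness being regularized:

```latex
% CVaR at confidence level beta; the plus function (x)_+ = max(x, 0) inside the
% expectation is the source of nonsmoothness in the risk-averse objective.
\mathrm{CVaR}_\beta[X] = \inf_{t \in \mathbb{R}} \left\{ t + \frac{1}{1-\beta}\, \mathbb{E}\big[ (X - t)_+ \big] \right\}.
```

Epi-regularization replaces such nonsmooth integrands with smooth, ε-parameterized approximations whose epigraphs converge to that of the original risk measure, which is what underlies the consistency of minimizers and stationary points as ε → 0.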
As part of the Department of Energy response to the novel coronavirus pandemic of 2020, a modeling effort was sponsored by the DOE Office of Science. One task of this modeling effort at Sandia was to develop a model to predict medical resource needs given various patient arrival scenarios. Resources needed include personnel resources (nurses, ICU nurses, physicians, respiratory therapists), fixed resources (regular or ICU beds and ventilators), and consumable resources (masks, gowns, gloves, face shields, sedatives). This report documents the uncertainty analysis that was performed on the resource model. The uncertainty analysis involved sampling 26 input parameters to the model. The sampling was performed conditional on the patient arrival streams that also were inputs to the model. These patient arrival streams were derived from various epidemiology models and had a significant effect on the projected resource needs. In this report, we document the sampling approach, the parameter ranges used, and the computational workflow necessary to perform large-scale uncertainty studies for every county and state in the United States.
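A sketch of what the sampling step could look like (the sampler type, bounds, and names below are assumptions for illustration; the actual approach and parameter ranges are documented in the report):

```python
import numpy as np
from scipy.stats import qmc

n_params, n_samples = 26, 1000
# Hypothetical per-parameter lower/upper bounds; the report specifies the real ranges.
lo = np.zeros(n_params)
hi = np.ones(n_params)

sampler = qmc.LatinHypercube(d=n_params, seed=0)
unit = sampler.random(n=n_samples)      # samples in the unit hypercube [0, 1]^26
samples = qmc.scale(unit, lo, hi)       # mapped to the parameter ranges
print(samples.shape)                    # (1000, 26): one row per model evaluation
```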
Data movement is a significant and growing consumer of energy in modern systems, from specialized low-power accelerators to GPUs with power budgets in the hundreds of watts. Given the importance of the problem, prior work has proposed designing interconnects on which the energy cost of transmitting a 0 is significantly lower than that of transmitting a 1. With such an interconnect, data movement energy is reduced by encoding the transmitted data such that the number of 1s is minimized. Although promising, these data encoding proposals do not take full advantage of application-level semantics. As an example of a neglected optimization opportunity, consider the case of a dot product computation as part of a neural network inference task. The order in which the neural network weights are fetched and processed does not affect correctness, and can be optimized to further reduce data movement energy. This paper presents commutative data reordering (CDR), a hardware-software approach that leverages the commutative property in linear algebra to strategically select the order in which weight matrix coefficients are fetched from memory. To find a low-energy transmission order, weight ordering is modeled as an instance of one of two well-studied problems, the Traveling Salesman Problem and the Capacitated Vehicle Routing Problem. This reduction makes it possible to leverage the vast body of work on efficient approximation methods to find a good transmission order. CDR exploits the indirection inherent to sparse matrix formats such that no additional metadata is required to specify the selected order. The hardware modifications required to support CDR are minimal, and incur an area penalty of less than 0.01% when implemented on top of a mobile-class GPU. When applied to 7 neural network inference tasks running on a GPU-based system, CDR reduces average DRAM IO energy by 53.1% and 22.2% relative to the data bus invert encoding scheme used by LPDDR4 and the recently proposed Base + XOR encoding, respectively. These savings are attained with no changes to the mobile system software and no runtime performance penalty.
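A toy illustration of the reduction (the greedy nearest-neighbor heuristic and Hamming-distance cost below are stand-ins; the paper uses TSP/CVRP approximation methods and a real bus-energy cost model):

```python
# Treat each weight row's bit pattern as a TSP "city" and order rows so that
# consecutive transmissions differ in few bits -- a proxy for IO energy under
# XOR-style bus encodings, where cost depends on bit transitions between words.
import numpy as np

rng = np.random.default_rng(0)
words = rng.integers(0, 2, size=(16, 32))        # 16 rows of 32-bit patterns

def hamming(a, b):
    return int(np.sum(a != b))

order, remaining = [0], set(range(1, len(words)))
while remaining:                                  # nearest-neighbor TSP heuristic
    last = order[-1]
    nxt = min(remaining, key=lambda k: hamming(words[last], words[k]))
    order.append(nxt)
    remaining.remove(nxt)

def total_transitions(seq):
    return sum(hamming(words[a], words[b]) for a, b in zip(seq, seq[1:]))

print("naive order :", total_transitions(range(len(words))))
print("CDR-style   :", total_transitions(order))
```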
We describe and analyze a variance reduction approach for Monte Carlo (MC) sampling that accelerates the estimation of statistics of computationally expensive simulation models using an ensemble of models with lower cost. These lower-cost models, which are typically lower fidelity with unknown statistics, are used to reduce the variance in statistical estimators relative to a MC estimator with equivalent cost. We derive the conditions under which our proposed approximate control variate framework recovers existing multifidelity variance reduction schemes as special cases. We demonstrate that existing recursive/nested strategies are suboptimal because they use the additional low-fidelity models only to efficiently estimate the unknown mean of the first low-fidelity model. As a result, they cannot achieve variance reduction beyond that of a control variate estimator that uses a single low-fidelity model with known mean. However, there often exists a gap of roughly an order of magnitude between the maximum achievable variance reduction using all low-fidelity models and that achieved by a single low-fidelity model with known mean. We show that our proposed approach can exploit this gap to achieve greater variance reduction by using non-recursive sampling schemes. The proposed strategy reduces the total cost of accurately estimating statistics, especially in cases where only low-fidelity simulation models are accessible for additional evaluations. Several analytic examples and an example with a hyperbolic PDE describing elastic wave propagation in heterogeneous media are used to illustrate the main features of the methodology.
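For orientation (standard control variate notation, assumed here), the single low-fidelity benchmark that recursive strategies cannot beat is the classical estimator:

```latex
% Q_0-hat: high-fidelity MC estimator; Q_1-hat: low-fidelity estimator with
% known mean mu_1. The optimal weight gives variance Var(Q_0-hat)(1 - rho^2),
% with rho the correlation between the two estimators. The approximate control
% variate setting replaces the unknown mu_1 with an independently estimated mean.
\hat{Q} = \hat{Q}_0 + \alpha \big( \hat{Q}_1 - \mu_1 \big),
\qquad
\alpha^\star = -\frac{\mathrm{Cov}(\hat{Q}_0, \hat{Q}_1)}{\mathrm{Var}(\hat{Q}_1)}.
```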
A broad set of data science and engineering questions may be organized as graphs, providing a powerful means for describing relational data. Although experts now routinely compute graph algorithms on huge, unstructured graphs using high performance computing (HPC) or cloud resources, this practice has not yet broken into the mainstream. Such computations require great expertise, yet users often need rapid prototyping and development to quickly customize existing code. Toward that end, we are exploring the use of the Chapel programming language as a means of making some important graph analytics more accessible, examining the breadth of characteristics that would make for a productive programming environment: one that is expressive, performant, portable, and robust.
Bayesian optimization (BO) is an effective surrogate-based method that has been widely used to optimize simulation-based applications. While the traditional Bayesian optimization approach only applies to single-fidelity models, many realistic applications provide multiple levels of fidelity with various levels of computational complexity and predictive capability. In this work, we propose a multi-fidelity Bayesian optimization method for design applications with both known and unknown constraints. The proposed framework, called sMF-BO-2CoGP, is built on a multi-level CoKriging method to predict the objective function. An external binary classifier, which we approximate using a separate CoKriging model, is used to distinguish between feasible and infeasible regions. Finally, the sMF-BO-2CoGP method is demonstrated using a series of analytical examples and a flip-chip application for design optimization to minimize the deformation due to warping under thermal loading conditions.
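For context, a minimal single-fidelity BO loop (the objective, grid, and scikit-learn Gaussian process below are illustrative stand-ins; sMF-BO-2CoGP extends this pattern with CoKriging across fidelities and a CoKriging classifier for unknown constraints, none of which is shown here):

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                                    # toy objective to minimize
    return np.sin(3 * x) + 0.5 * x ** 2

X = np.array([[0.0], [1.0], [2.0]])          # initial designs
y = f(X).ravel()
grid = np.linspace(-2, 3, 400).reshape(-1, 1)

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sd = gp.predict(grid, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sd, 1e-12)
    ei = (best - mu) * norm.cdf(z) + sd * norm.pdf(z)   # expected improvement
    x_next = grid[np.argmax(ei)].reshape(1, -1)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next).ravel())

print("best observed:", X[np.argmin(y)].item(), y.min())
```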
We propose a multilevel approach for trace systems resulting from hybridized discontinuous Galerkin (HDG) methods. The key is to blend ideas from nested dissection, domain decomposition, and the high-order nature of HDG discretizations. Specifically, we first create a coarse solver by eliminating and/or limiting the front growth in nested dissection. This is accomplished by projecting the trace data onto a sequence of same- or higher-order polynomials on a set of increasingly h-coarser edges/faces. We then combine the coarse solver with a block-Jacobi fine-scale solver to form a two-level solver/preconditioner. Numerical experiments indicate that the performance of the resulting two-level solver/preconditioner depends on the smoothness of the solution and can offer significant speedups and memory savings compared to the nested dissection direct solver. While the proposed algorithms are developed within the HDG framework, they are applicable to other hybrid(ized) high-order finite element methods. Moreover, we show that our multilevel algorithms can be interpreted as a multigrid method with specific intergrid transfer and smoothing operators. With several numerical examples from Poisson, pure transport, and convection-diffusion equations we demonstrate the robustness and scalability of the algorithms with respect to solution order. While scalability with mesh size is in general not guaranteed and depends on the smoothness of the solution and the type of equation, improving it is part of our future work.
Here, we study orthogonal polynomials with respect to self-similar measures, focusing on the class of infinite Bernoulli convolutions, which are defined by iterated function systems with overlaps, especially those defined by the Pisot, Garsia, and Salem numbers. Using an algorithm of Mantica, we obtain graphs of the coefficients of the 3-term recursion relation defining the orthogonal polynomials. We use these graphs to predict whether the singular infinite Bernoulli convolutions belong to the Nevai class. Based on our numerical results, we conjecture that all infinite Bernoulli convolutions with contraction ratios greater than or equal to 1/2 belong to the Nevai class, regardless of the probability weights assigned to the self-similar measures.
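For reference, the 3-term recursion in question (standard orthonormal-polynomial form; a_n and b_n are the coefficients computed by Mantica's algorithm and plotted in the paper):

```latex
% Jacobi-matrix form of the recursion for the orthonormal polynomials p_n of a
% measure mu; membership in the Nevai class corresponds to the recursion
% coefficients a_n and b_n converging to limits as n grows.
x\, p_n(x) = a_{n+1}\, p_{n+1}(x) + b_n\, p_n(x) + a_n\, p_{n-1}(x),
\qquad p_{-1} \equiv 0, \quad p_0 \equiv 1.
```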
This report describes the results of a seven-day effort to assist subject matter experts in addressing a problem related to COVID-19. In the course of this effort, we analyzed the 29K documents provided as part of the White House's call to action. This involved applying a variety of natural language processing techniques and compression-based analytics, in combination with visualization techniques and assessment with subject matter experts, to pursue answers to a specific question. In this paper, we describe the algorithms, the software, the study performed, and the availability of the software developed during the effort.
In response to anticipated resource shortfalls related to the treatment and testing of COVID-19, many communities are planning to build additional facilities to increase capacity. These facilities include field hospitals, testing centers, mobile manufacturing units, and distribution centers. In many cases, these facilities are intended to be temporary and are designed to meet an immediate need. When deciding where to place new facilities, many factors need to be considered, including the feasibility of potential locations, existing resource availability, anticipated demand, and accessibility between patients and the new facility. In this project, a facility location optimization model was developed to integrate these key pieces of information and help decision makers identify the best place, or places, to build a facility to meet anticipated resource demands. The facility location optimization model uses the location of existing resources and the anticipated resource demand at each location to minimize the distance a patient must travel to get to the resource they need; a representative formulation of this type is sketched below. The model was designed to operate at the county scale, where patients are grouped by county. This assumption can be modified to integrate other scales or include individual patients.
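A representative formulation, sketched for illustration (a standard p-median model; the report's exact formulation may differ):

```latex
% d_{ij}: travel distance from county i to candidate site j; w_i: anticipated
% demand at i; y_j = 1 if a facility is built at site j; x_{ij} = 1 if demand
% at i is served by site j; p: number of facilities to build.
\min_{x, y} \; \sum_{i} \sum_{j} w_i\, d_{ij}\, x_{ij}
\quad \text{s.t.} \quad
\sum_{j} x_{ij} = 1 \;\; \forall i, \qquad
x_{ij} \le y_j \;\; \forall i, j, \qquad
\sum_{j} y_j = p, \qquad x_{ij},\, y_j \in \{0, 1\}.
```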
The ExaLearn miniGAN team (Ellis and Rajamanickam) has released miniGAN, a generative adversarial network (GAN) proxy application, through the ECP proxy application suite. miniGAN is the first machine learning proxy application in the suite (note: the ECP CANDLE project did previously release some benchmarks) and models the performance of training generator and discriminator networks. The GAN's generator and discriminator generate plausible 2D/3D maps and identify fake maps, respectively. miniGAN aims to be a proxy application for related applications in cosmology (CosmoFlow, ExaGAN) and wind energy (ExaWind). miniGAN has been developed so that optimized mathematical kernels (e.g., kernels provided by Kokkos Kernels) can be plugged into the proxy application to explore potential performance improvements. miniGAN has been released as open source software and is available through the ECP proxy application website (https://proxyapps.exascaleproject.org/ecp-proxy-apps-suite/) and on GitHub (https://github.com/SandiaMLMiniApps/miniGAN). As part of this release, a generator is provided to produce a data set (a series of images) that serves as input to the proxy application.
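For orientation, a minimal GAN training loop of the kind miniGAN models (an illustrative PyTorch sketch, not miniGAN's code; see the GitHub link above for the actual proxy application):

```python
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64), nn.Tanh())
D = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.rand(32, 64)  # stand-in batch of "real" 8x8 maps, flattened
for step in range(100):
    # Discriminator step: real maps -> label 1, generated (fake) maps -> label 0.
    fake = G(torch.randn(32, 16)).detach()
    d_loss = bce(D(real), torch.ones(32, 1)) + bce(D(fake), torch.zeros(32, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: try to make the discriminator label fakes as real.
    g_loss = bce(D(G(torch.randn(32, 16))), torch.ones(32, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```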
Non-volatile memory arrays can deploy pre-trained neural network models for edge inference. However, these systems are affected by device-level noise and retention issues. Here, we examine damage caused by these effects, introduce a mitigation strategy, and demonstrate its use in a fabricated array of SONOS (Silicon-Oxide-Nitride-Oxide-Silicon) devices. On MNIST, Fashion-MNIST, and CIFAR-10 tasks, our approach increases resilience to synaptic noise and drift. We also show that strong performance can be realized with ADCs of only 5-8 bits of precision.
The attachment of dopant precursor molecules to depassivated areas of hydrogen-terminated silicon templated with a scanning tunneling microscope (STM) has been used to create electronic devices with sub-nanometer precision, typically for quantum physics demonstrations, and to dope silicon past the solid-solubility limit, with potential applications in microelectronics and plasmonics. However, this process, which we call atomic precision advanced manufacturing (APAM), currently lacks the throughput required to develop sophisticated applications because there is no proven scalable hydrogen lithography pathway. Here, we demonstrate and characterize an APAM device workflow where STM lithography has been replaced with photolithography. An ultraviolet laser is shown to locally heat silicon controllably above the temperature required for hydrogen depassivation. STM images indicate a narrow range of laser energy density where hydrogen has been depassivated, and the surface remains well-ordered. A model for photothermal heating of silicon predicts a local temperature which is consistent with atomic-scale STM images of the photo-patterned regions. Finally, a simple device made by exposing photo-depassivated silicon to phosphine is found to have a carrier density and mobility similar to that produced by similar devices patterned by STM.
A number of applications in basic science and technology would benefit from high-fidelity photon-number-resolving photodetectors. While some recent experimental progress has been made in this direction, the requirements for true photon number resolution are stringent, and no design currently exists that achieves this goal. Here we employ techniques from fundamental quantum optics to demonstrate that detectors composed of subwavelength elements interacting collectively with the photon field can achieve high-performance photon number resolution. We propose a new design that simultaneously achieves photon number resolution, high efficiency, low jitter, low dark counts, and high count rate. We discuss specific systems that satisfy the design requirements, pointing to the important role of nanoscale device elements.
Asaad, Serwan; Mourik, Vincent; Joecker, Benjamin; Johnson, Mark A.I.; Baczewski, Andrew D.; Firgau, Hannes R.; Madzik, Mateusz T.; Schmitt, Vivien; Pla, Jarryd J.; Hudson, Fay E.; Itoh, Kohei M.; Mccallum, Jeffrey C.; Dzurak, Andrew S.; Laucht, Arne; Morello, Andrea
Nuclear spins are highly coherent quantum objects. In large ensembles, their control and detection via magnetic resonance is widely exploited, for example, in chemistry, medicine, materials science and mining. Nuclear spins also featured in early proposals for solid-state quantum computers [1] and demonstrations of quantum search [2] and factoring [3] algorithms. Scaling up such concepts requires controlling individual nuclei, which can be detected when coupled to an electron [4-6]. However, the need to address the nuclei via oscillating magnetic fields complicates their integration in multi-spin nanoscale devices, because the field cannot be localized or screened. Control via electric fields would resolve this problem, but previous methods [7-9] relied on transducing electric signals into magnetic fields via the electron–nuclear hyperfine interaction, which severely affects nuclear coherence. Here we demonstrate the coherent quantum control of a single ¹²³Sb (spin-7/2) nucleus using localized electric fields produced within a silicon nanoelectronic device. The method exploits an idea proposed in 1961 [10] but not previously realized experimentally with a single nucleus. Our results are quantitatively supported by a microscopic theoretical model that reveals how the purely electrical modulation of the nuclear electric quadrupole interaction results in coherent nuclear spin transitions that are uniquely addressable owing to lattice strain. The spin dephasing time, 0.1 seconds, is orders of magnitude longer than those obtained by methods that require a coupled electron spin to achieve electrical driving. These results show that high-spin quadrupolar nuclei could be deployed as chaotic models, strain sensors and hybrid spin-mechanical quantum systems using all-electrical controls. Integrating electrically controllable nuclei with quantum dots [11,12] could pave the way to scalable, nuclear- and electron-spin-based quantum computers in silicon that operate without the need for oscillating magnetic fields.
Lischke, Anna; Pang, Guofei; Gulian, Mamikon G.; Song, Fangying; Glusa, Christian A.; Zheng, Xiaoning; Mao, Zhiping; Cai, Wei; Meerschaert, Mark M.; Ainsworth, Mark; Karniadakis, George E.
The fractional Laplacian in R^d, which we write as (−Δ)^{α/2} with α ∈ (0, 2), has multiple equivalent characterizations. Moreover, in bounded domains, boundary conditions must be incorporated in these characterizations in mathematically distinct ways, and there is currently no consensus in the literature as to which definition of the fractional Laplacian in bounded domains is most appropriate for a given application. The Riesz (or integral) definition, for example, admits a nonlocal boundary condition, where the value of a function must be prescribed on the entire exterior of the domain in order to compute its fractional Laplacian. In contrast, the spectral definition requires only the standard local boundary condition. These differences, among others, lead us to ask the question: “What is the fractional Laplacian?” Beginning from first principles, we compare several commonly used definitions of the fractional Laplacian theoretically, through their stochastic interpretations as well as their analytical properties. Then, we present quantitative comparisons using a sample of state-of-the-art methods. We discuss recent advances on nonzero boundary conditions and present new methods to discretize such boundary value problems: radial basis function collocation (for the Riesz fractional Laplacian) and nonharmonic lifting (for the spectral fractional Laplacian). In our numerical studies, we aim to compare different definitions on bounded domains using a collection of benchmark problems. We consider the fractional Poisson equation with both zero and nonzero boundary conditions, where the fractional Laplacian is defined according to the Riesz definition, the spectral definition, the directional definition, and the horizon-based nonlocal definition. We verify the accuracy of the numerical methods used in the approximations for each operator, and we focus on identifying differences in the boundary behaviors of solutions to equations posed with these different definitions. Through our efforts, we aim to further engage the research community in open problems and assist practitioners in identifying the most appropriate definition and computational approach to use for their mathematical models in addressing anomalous transport in diverse applications.
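Two of the definitions being compared, for reference (standard forms; C_{d,α} is the usual normalizing constant):

```latex
% Riesz (integral) definition, a principal-value singular integral:
(-\Delta)^{\alpha/2} u(x) = C_{d,\alpha}\; \mathrm{p.v.} \int_{\mathbb{R}^d}
\frac{u(x) - u(y)}{|x - y|^{d+\alpha}} \, \mathrm{d}y ,
\\
% Spectral definition on a bounded domain Omega, via the Dirichlet eigenpairs
% (lambda_k, phi_k) of the standard Laplacian:
(-\Delta)^{\alpha/2} u = \sum_{k=1}^{\infty} \lambda_k^{\alpha/2} \, \hat{u}_k \, \phi_k ,
\qquad \hat{u}_k = \int_{\Omega} u\, \phi_k \, \mathrm{d}x .
```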
As we approach exascale, computational parallelism will have to drastically increase in order to meet throughput targets. Many-core architectures have exacerbated this problem by trading reduced clock speeds, core complexity, and computation throughput for increased parallelism. This presents two major challenges for communication libraries such as MPI: the library must leverage the performance advantages of thread-level parallelism and avoid the scalability problems associated with increasing the number of processes to that scale. Hybrid programming models, such as MPI+X, have been proposed to address these challenges. MPI_THREAD_MULTIPLE is MPI's thread-safe mode. While there has been work to optimize it, it largely remains non-performant in most implementations. While current applications avoid MPI multithreading due to performance concerns, it is expected to be utilized in future applications. One of the major shared data structures requiring synchronization in MPI is the matching engine. In this paper, we present a parallel matching algorithm that can improve MPI matching for multithreaded applications. We then perform a feasibility study to demonstrate the performance benefit of the technique.
Gunther, Stefanie; Ruthotto, Lars; Schroder, Jacob B.; Cyr, Eric C.; Gauger, Nicolas R.
Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be interpreted as forward Euler discretizations of a nonlinear initial value problem whose time-dependent control variables represent the weights of the neural network. Hence, training a ResNet can be cast as an optimal control problem of the associated dynamical system. For similar time-dependent optimal control problems arising in engineering applications, parallel-in-time methods have shown notable improvements in scalability. This paper demonstrates the use of those techniques for efficient and effective training of ResNets. The proposed algorithms replace the classical (sequential) forward and backward propagation through the network layers with a parallel nonlinear multigrid iteration applied to the layer domain. This adds a new dimension of parallelism across layers that is attractive when training very deep networks. From this basic idea, we derive multiple layer-parallel methods. The most efficient version employs a simultaneous optimization approach where updates to the network parameters are based on inexact gradient information in order to speed up the training process. Finally, using numerical examples from supervised classification, we demonstrate that the new approach achieves a training performance similar to that of traditional methods, but enables layer-parallelism and thus provides speedup over layer-serial methods through greater concurrency.
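In the dynamical-systems notation this line of work uses (symbols here are generic, not the paper's exact statement), a ResNet layer and the resulting training problem read:

```latex
% Forward Euler step through layer k with step size h and weights theta_k:
u_{k+1} = u_k + h\, F(u_k, \theta_k), \qquad k = 0, \dots, N-1,
\\
% training as a discrete optimal control problem: fit the network output u_N
% to the labels y subject to the layer dynamics above.
\min_{\theta_0, \dots, \theta_{N-1}} \; \mathcal{L}\big( u_N(\theta),\, y \big).
```

Layer-parallel training replaces the sequential sweep over k with a parallel multigrid iteration in the layer direction, which is the source of the additional concurrency.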
We study the solution of block-structured linear algebra systems arising in optimization by using iterative solution techniques. These systems are the core computational bottleneck of many problems of interest, such as parameter estimation, optimal control, network optimization, and stochastic programming. Our approach uses a Krylov solver (GMRES) that is preconditioned with an alternating direction method of multipliers (ADMM). We show that this ADMM-GMRES approach overcomes well-known scalability issues of Schur complement decomposition in problems that exhibit a high degree of coupling. The effectiveness of the approach is demonstrated using linear systems that arise in stochastic optimal power flow problems and that contain up to 2 million total variables and 4000 coupling variables. We find that ADMM-GMRES is nearly an order of magnitude faster than Schur complement decomposition. Moreover, we demonstrate that the approach is robust to the selection of the augmented Lagrangian penalty parameter, which is a key advantage over the direct use of ADMM.
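A sketch of the solver composition (the Jacobi sweep below is a generic stand-in for the ADMM preconditioner; the point is that any approximate inverse, including one ADMM iteration on the block-structured system, can be supplied to GMRES as a LinearOperator):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, gmres

n = 1000
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)

# Stand-in preconditioner: one Jacobi sweep. In the paper's setting, this slot
# would instead apply one ADMM iteration to the block-structured system.
inv_diag = 1.0 / A.diagonal()
M = LinearOperator((n, n), matvec=lambda r: inv_diag * r)

x, info = gmres(A, b, M=M)
print("converged:", info == 0, "residual:", np.linalg.norm(b - A @ x))
```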