Bayesian calibration of hydrological parameters in the Community Land Model (CODA Presentation)
Abstract not provided.
Abstract not provided.
Abstract not provided.
Applied Optics
Abstract not provided.
Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis
The Piecewise Parabolic Method (PPM) was designed as a means of exploring compressible gas dynamics problems of interest in astrophysics, including supersonic jets, compressible turbulence, stellar convection, and turbulent mixing and burning of gases in stellar interiors. Over time, the capabilities encapsulated in PPM have co-evolved with the availability of a series of high performance computing platforms. Implementation of the algorithm has adapted to and advanced with the architectural capabilities and characteristics of these machines. This adaptability of our PPM codes has enabled targeted astrophysical applications of PPM to exploit these scarce resources to explore complex physical phenomena. Here we describe the means by which this was accomplished, and set a path forward, with a new miniapp, mPPM, for continuing this process in a diverse and dynamic architecture design environment. Adaptations in mPPM for the latest high performance machines are discussed that address the important issue of limited bandwidth from locally attached main memory to the microprocessor chip.
Proceedings of Co-HPC 2014: 1st International Workshop on Hardware-Software Co-Design for High Performance Computing - Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis
Disruptive changes to computer architecture are paving the way toward extreme scale computing. The co-design strategy of collaborative research and development among computer architects, system software designers, and application teams can help to ensure that applications not only cope with these changes but thrive on them. In this paper, we present a novel combined co-design approach of emulation and simulation in the context of investigating future Processing in Memory (PIM) architectures. PIM enables co-location of data and computation to decrease data movement, to provide increases in memory speed and capacity compared to existing technologies and, perhaps most importantly for extreme scale, to improve energy efficiency. Our evaluation of PIM focuses on three mini-applications representing important production applications. The emulation and simulation studies examine the effects of locality-aware versus locality-oblivious data distribution and computation, and they compare PIM to conventional architectures. Both studies contribute in their own way to the overall understanding of the application-architecture interactions, and our results suggest that PIM technology shows great potential for efficient computation without negatively impacting productivity.
International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Krylov subspace projection methods are widely used iterative methods for solving large-scale linear systems of equations. Researchers have demonstrated that communication-avoiding (CA) techniques can improve Krylov methods' performance on modern computers, where communication is becoming increasingly expensive compared to arithmetic operations. In this paper, we extend these studies by two major contributions. First, we present our implementation of a CA variant of the Generalized Minimum Residual (GMRES) method, called CA-GMRES, for solving nonsymmetric linear systems of equations on a hybrid CPU/GPU cluster. Our performance results on up to 120 GPUs show that CA-GMRES gives a speedup of up to 2.5x in total solution time over standard GMRES on a hybrid cluster with twelve Intel Xeon CPUs and three Nvidia Fermi GPUs on each node. We then outline a domain decomposition framework to introduce a family of preconditioners that are suitable for CA Krylov methods. Our preconditioners do not incur any additional communication and allow the easy reuse of existing algorithms and software for the subdomain solves. Experimental results on the hybrid CPU/GPU cluster demonstrate that CA-GMRES with preconditioning achieves a speedup of up to 7.4x over CA-GMRES without preconditioning, and a speedup of up to 1.7x over GMRES with preconditioning in total solution time. These results confirm the potential of our framework to develop a practical and effective preconditioned CA Krylov method.
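To illustrate the communication-avoiding idea behind CA-GMRES, the following is a minimal NumPy sketch of one s-step basis construction: a matrix-powers kernel generates s+1 Krylov basis vectors with no inner products, and a single block QR then orthogonalizes them all at once, replacing the per-iteration global reductions of standard Arnoldi. The monomial basis, the test matrix, and all function names here are illustrative assumptions, not the hybrid CPU/GPU implementation described in the paper.

```python
# Sketch of the s-step (communication-avoiding) Krylov basis idea:
# s matrix-vector products with no inner products, then one block
# orthogonalization, so global synchronization happens once per s steps.
import numpy as np

def matrix_powers(A, v, s):
    """Return the monomial basis [v, Av, A^2 v, ..., A^s v] as columns."""
    V = np.empty((A.shape[0], s + 1))
    V[:, 0] = v / np.linalg.norm(v)
    for j in range(s):
        V[:, j + 1] = A @ V[:, j]
    return V

def ca_basis(A, v, s):
    """One CA step: s matvecs without communication, then one block QR."""
    V = matrix_powers(A, v, s)
    Q, R = np.linalg.qr(V)          # single block orthogonalization
    return Q, R

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, s = 200, 5
    A = np.eye(n) + 0.1 * rng.standard_normal((n, n))   # nonsymmetric test matrix
    b = rng.standard_normal(n)
    Q, R = ca_basis(A, b, s)
    # Q spans the same Krylov subspace that s steps of standard Arnoldi
    # would build, but with far fewer global synchronizations.
    print("orthogonality error:", np.linalg.norm(Q.T @ Q - np.eye(s + 1)))
```

In practice the monomial basis can become ill-conditioned for larger s, which is why production CA-Krylov codes typically use Newton or Chebyshev bases; the sketch above shows only the communication pattern.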
Journal of Chemical Physics
The random-phase approximation with second-order screened exchange (RPA+SOSEX) is a model of electron correlation energy with two caveats: its accuracy depends on an arbitrary choice of mean field, and it scales as O(n^5) operations and O(n^3) memory for n electrons. We derive a new algorithm that reduces its scaling to O(n^3) operations and O(n^2) memory using controlled approximations and a new self-consistent field that approximates Brueckner coupled-cluster doubles theory with RPA+SOSEX, referred to as Brueckner RPA theory. The algorithm comparably reduces the scaling of second-order Møller-Plesset perturbation theory with smaller cost prefactors than RPA+SOSEX. Within a semiempirical model, we study H2 dissociation to test accuracy and Hn rings to verify scaling. © 2014 AIP Publishing LLC.
Studies in Computational Intelligence
Large-scale computational models have become common tools for analyzing complex man-made systems. However, when coupled with optimization or uncertainty quantification methods in order to conduct extensive model exploration and analysis, the computational expense quickly becomes intractable. Furthermore, these models may have both continuous and discrete parameters. One common approach to mitigating the computational expense is the use of response surface approximations. While well developed for models with continuous parameters, they are still new and largely untested for models with both continuous and discrete parameters. In this work, we describe and investigate the performance of three types of response surfaces developed for mixed-variable models: Adaptive Component Selection and Shrinkage Operator, Treed Gaussian Process, and Gaussian Process with Special Correlation Functions. We focus our efforts on test problems with a small number of parameters of interest, a characteristic of many physics-based engineering models. We present the results of our studies and offer some insights regarding the performance of each response surface approximation method. © 2014 Springer International Publishing Switzerland.
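As a concrete illustration of a mixed-variable response surface, the following is a minimal sketch of a Gaussian process surrogate that uses a product kernel: a squared-exponential term on the continuous inputs multiplied by a simple "same category" correlation on the discrete inputs. The kernel form, the fixed hyperparameters, and the toy model are assumptions made for illustration; they are not the special correlation functions investigated in the paper.

```python
# Gaussian-process response surface for a model with mixed
# continuous/discrete inputs, using an assumed product kernel.
import numpy as np

def mixed_kernel(Xc1, Xd1, Xc2, Xd2, length=0.5, rho=0.3):
    # Squared-exponential part on continuous coordinates.
    d2 = ((Xc1[:, None, :] - Xc2[None, :, :]) ** 2).sum(-1)
    k_cont = np.exp(-0.5 * d2 / length**2)
    # Discrete part: correlation 1 if categories match, rho otherwise.
    same = (Xd1[:, None, :] == Xd2[None, :, :]).all(-1)
    k_disc = np.where(same, 1.0, rho)
    return k_cont * k_disc

def gp_predict(Xc, Xd, y, Xc_new, Xd_new, noise=1e-6):
    K = mixed_kernel(Xc, Xd, Xc, Xd) + noise * np.eye(len(y))
    Ks = mixed_kernel(Xc_new, Xd_new, Xc, Xd)
    alpha = np.linalg.solve(K, y)
    return Ks @ alpha               # posterior mean of the surrogate

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    # Toy "expensive model": one continuous knob, one binary design choice.
    Xc = rng.uniform(0, 1, (30, 1))
    Xd = rng.integers(0, 2, (30, 1))
    y = np.sin(6 * Xc[:, 0]) + 0.5 * Xd[:, 0]
    Xc_new = np.linspace(0, 1, 5)[:, None]
    Xd_new = np.ones((5, 1), dtype=int)
    print(gp_predict(Xc, Xd, y, Xc_new, Xd_new))
```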
A method is investigated to reduce the number of numerical parameters in a material model for a solid. The basis of the method is to detect interdependencies between parameters within a class of materials of interest. The method is demonstrated for a set of material property data for iron and steel using the Johnson-Cook plasticity model.
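One simple way to detect such interdependencies, sketched below, is to assemble calibrated parameter sets across a class of materials and examine their correlation structure with a principal-component analysis: a sharp drop in the spectrum suggests fewer effective parameters than nominal ones. The synthetic Johnson-Cook values and the PCA criterion here are illustrative assumptions, not the procedure used in the report.

```python
# Detecting parameter interdependencies from calibrated parameter sets
# using the correlation matrix and its eigenvalue spectrum (PCA).
import numpy as np

rng = np.random.default_rng(2)

# Synthetic "calibrated" Johnson-Cook parameters for 20 hypothetical alloys,
# constructed so that B and n are strongly tied to A (an interdependency).
n_mat = 20
A = rng.uniform(200.0, 800.0, n_mat)            # yield stress [MPa]
B = 0.9 * A + rng.normal(0, 20.0, n_mat)        # hardening modulus [MPa]
n = 0.0005 * A + rng.normal(0, 0.02, n_mat)     # hardening exponent [-]
C = rng.uniform(0.01, 0.03, n_mat)              # rate sensitivity [-]
m = rng.uniform(0.8, 1.2, n_mat)                # thermal softening [-]
P = np.column_stack([A, B, n, C, m])

# Standardize, then inspect the spectrum of the correlation matrix.
Z = (P - P.mean(0)) / P.std(0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]
explained = np.cumsum(eigvals) / eigvals.sum()

print("correlation of B with A:", np.corrcoef(A, B)[0, 1])
print("variance explained by leading components:", explained)
# A sharp drop in the spectrum suggests the five nominal parameters can be
# described by fewer effective parameters for this material class.
```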
Journal of Parallel and Distributed Computing
The need to process streaming data, which arrives continuously at high volume in real time, arises in a variety of contexts including data produced by experiments, collections of environmental or network sensors, and running simulations. Streaming data can also be formulated as queries or transactions which operate on a large dynamic data store, e.g. a distributed database. We describe a lightweight, portable framework named PHISH which provides a communication model enabling a set of independent processes to compute on a stream of data in a distributed-memory parallel manner. Datums are routed between processes in patterns defined by the application. PHISH provides multiple communication backends including MPI and sockets/ZMQ. The former means streaming computations can be run on any parallel machine which supports MPI; the latter allows them to run on a heterogeneous, geographically dispersed network of machines. We illustrate how streaming MapReduce operations can be implemented using the PHISH communication model, and describe streaming versions of three algorithms for large, sparse graph analytics: triangle enumeration, sub-graph isomorphism matching, and connected component finding. We also provide benchmark timings comparing MPI and socket performance for several kernel operations useful in streaming algorithms. © 2014 Elsevier Inc. All rights reserved.
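For context, the following is a minimal single-process sketch of streaming triangle enumeration, the kind of computation the paper distributes across PHISH processes: edges arrive one at a time, and each new edge (u, v) closes a triangle with every vertex already adjacent to both endpoints. The in-memory adjacency-set approach below is a generic illustration and does not use the PHISH API or its datum routing.

```python
# Streaming triangle enumeration: report each triangle the moment the
# edge that completes it arrives.
from collections import defaultdict

def stream_triangles(edge_stream):
    adj = defaultdict(set)
    for u, v in edge_stream:
        if u == v or v in adj[u]:
            continue                          # skip self-loops and repeats
        # Any common neighbor w completes the triangle {u, v, w}.
        for w in adj[u] & adj[v]:
            yield tuple(sorted((u, v, w)))
        adj[u].add(v)
        adj[v].add(u)

if __name__ == "__main__":
    edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4)]
    for tri in stream_triangles(edges):
        print("triangle:", tri)
```

In a PHISH-style deployment, the edge stream and the adjacency state would be partitioned across independent processes, with incoming edges routed to the processes owning their endpoints.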
Journal of Mechanics of Materials and Structures
A simple demonstration of nonlocality in a heterogeneous material is presented. By analysis of the microscale deformation of a two-component layered medium, it is shown that nonlocal interactions necessarily appear in a homogenized model of the system. Explicit expressions for the nonlocal forces are determined. The way these nonlocal forces appear in various nonlocal elasticity theories is derived. The length scales that emerge involve the constituent material properties as well as their geometrical dimensions. A peridynamic material model for the smoothed displacement field is derived. It is demonstrated by comparison with experimental data that the incorporation of nonlocality in modeling improves the prediction of the stress concentration in an open-hole tension test on a composite plate. © 2014 Mathematical Sciences Publishers.
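To make the nonlocal force structure concrete, the following is a small sketch of a 1D bond-based peridynamic internal force density, in which each point interacts with every neighbor inside a horizon delta, so the force at a point depends nonlocally on the displacement field. The constant micromodulus, the uniform grid, and the horizon value are illustrative assumptions, not the specific homogenized model derived in the paper.

```python
# 1D bond-based peridynamic internal force density on a uniform grid.
import numpy as np

def peridynamic_force(x, u, delta, c):
    """f(x_i) = sum over neighbors j with 0 < |x_j - x_i| <= delta of
    c * (u_j - u_i) / |x_j - x_i| * dx  (assumed micromodulus c/|xi|)."""
    n = len(x)
    dx = x[1] - x[0]
    f = np.zeros(n)
    for i in range(n):
        for j in range(n):
            r = x[j] - x[i]
            if i != j and abs(r) <= delta:
                # Linearized bond force: bond stretch times micromodulus.
                f[i] += c * (u[j] - u[i]) / abs(r) * dx
    return f

if __name__ == "__main__":
    x = np.linspace(0.0, 1.0, 101)
    u = 0.01 * x**2                     # a smooth, non-uniform displacement
    f = peridynamic_force(x, u, delta=0.05, c=1.0)
    # For smooth u and a small horizon, this force density is proportional
    # to the second derivative of u, i.e. the classical local limit.
    print(f[40:45])
```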
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Transport algorithms are highly important for dynamical modeling of the atmosphere, where it is critical that scalar tracer species are conserved and satisfy physical bounds. We present an optimization-based algorithm for the conservative transport of scalar quantities (i.e., mass) on the cubed sphere grid, which preserves local solution bounds without the use of flux limiters. The optimization variables are the net mass updates to the cell, the objective is to minimize the discrepancy between these variables and a suitable high-order cell mass update (the "target"), and the constraints are derived from the local solution bounds and the conservation of the total mass. The resulting robust and efficient algorithm for conservative and local bound-preserving transport on the sphere further demonstrates the flexibility and scope of the recently developed optimization-based modeling approach [1, 2]. © 2014 Springer-Verlag.
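The optimization problem in the abstract can be sketched compactly: find cell mass updates that stay as close as possible to the high-order target while respecting local bounds and conserving total mass exactly. The sketch below uses SciPy's general-purpose SLSQP solver on synthetic data; the actual algorithm exploits the structure of this singly-constrained problem rather than calling a black-box optimizer, so the code is illustrative only.

```python
# Bound-preserving, mass-conserving projection of a high-order update.
import numpy as np
from scipy.optimize import minimize

def bound_preserving_update(target, lo, hi, total_mass):
    objective = lambda m: 0.5 * np.sum((m - target) ** 2)
    jac = lambda m: m - target
    cons = {"type": "eq", "fun": lambda m: np.sum(m) - total_mass}
    x0 = np.clip(target, lo, hi)
    res = minimize(objective, x0, jac=jac, method="SLSQP",
                   bounds=list(zip(lo, hi)), constraints=[cons])
    return res.x

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n = 10
    target = rng.uniform(0.0, 1.0, n)          # high-order (unlimited) update
    lo = np.full(n, 0.1)                       # local solution bounds
    hi = np.full(n, 0.9)
    total = 0.5 * n                            # total mass to be conserved
    m = bound_preserving_update(target, lo, hi, total)
    print("bounds respected:", bool(np.all(m >= lo - 1e-9) and np.all(m <= hi + 1e-9)))
    print("mass conserved:   ", bool(np.isclose(m.sum(), total)))
```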
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Much recent research has explored fault-tolerance mechanisms intended for current and future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions have typically been carried out using relatively uncomplicated computational kernels designed to measure floating point performance. More recent investigations have added scaled-down "proxy" applications to more closely match the composition and behavior of deployed ones. However, the information obtained from these studies (whether floating point performance or application runtime) is not necessarily of the most value in evaluating resilience strategies. We observe that even when using a more sophisticated metric, the information available from evaluating uncoordinated checkpointing using both microbenchmarks and proxy applications does not agree. This implies that not only might researchers be asking the wrong questions, but that the answers to the right ones might be unexpected and potentially misleading. We seek to open a discussion on whether benchmarks designed to provide predictable performance evaluations of HPC hardware and toolchains are providing the right feedback for the evaluation of fault-tolerance in these applications, and more generally on how benchmarking of resilience mechanisms ought to be approached in the exascale design space. © 2014 Springer-Verlag Berlin Heidelberg.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Two decades of experience with massively parallel supercomputing has given insight into the problem domains where these architectures are cost-effective. Likewise, experience with database machines and more recently massively parallel database appliances has shown where these architectures are valuable. Combining both architectures to simultaneously solve problems has received much less attention. In this paper, we describe a motivating application for economic modeling that requires both HPC and database capabilities. Then we discuss hardware and software integration issues related to a direct integration of a Cray XT supercomputer and a Netezza database appliance. © 2014 Springer-Verlag Berlin Heidelberg.
IEEE Transactions on Plasma Science
A new algorithm was developed, which reduces the self-force in particle-in-cell codes on unstructured meshes in a predictable and controllable way. This is accomplished by computing a charge density weighting function for a particle, which reproduces the Green's function solution to Poisson's equation at nodes when using a standard finite element methodology. This provides a superior local potential and allows particle-particle particle-mesh techniques to be used to subtract off local force contributions, including fictitious self-forces, resulting in accurate long-range forces on a particle and improved local Coulomb collisions. Local physical forces are then computed using the Green's function on local particle pairs and added to the long-range forces. Results were shown with up to five orders of magnitude reduction in self-force and superior interparticle forces for two test cases. © 2013 IEEE.
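The particle-particle particle-mesh (P3M) correction the abstract relies on splits the exact pair force into a smooth long-range part, normally obtained from the mesh field solve, and a short-range part computed directly for nearby pairs. The sketch below illustrates that splitting with a generic Ewald-style screening function in place of an actual mesh solve; the splitting parameter alpha and the unit charges are arbitrary, and this is not the finite-element Green's-function weighting developed in the paper.

```python
# Generic P3M-style force splitting: short-range (direct) + long-range
# ("mesh", here represented analytically) recovers the exact Coulomb force.
import numpy as np
from scipy.special import erf, erfc

def force_exact(q1, q2, r):
    return q1 * q2 / r**2

def force_short(q1, q2, r, alpha):
    # Computed directly, pair by pair, only for neighbors within a cutoff.
    return q1 * q2 * (erfc(alpha * r) / r**2
                      + 2 * alpha / np.sqrt(np.pi) * np.exp(-(alpha * r)**2) / r)

def force_long(q1, q2, r, alpha):
    # In a real P3M code this smooth part comes from the mesh field solve.
    return q1 * q2 * (erf(alpha * r) / r**2
                      - 2 * alpha / np.sqrt(np.pi) * np.exp(-(alpha * r)**2) / r)

if __name__ == "__main__":
    q1 = q2 = 1.0
    alpha = 2.0
    for r in (0.1, 0.5, 1.0, 2.0):
        split = force_short(q1, q2, r, alpha) + force_long(q1, q2, r, alpha)
        print(r, split, force_exact(q1, q2, r), np.isclose(split, force_exact(q1, q2, r)))
```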
A simple demonstration of nonlocality in a heterogeneous material is presented. By analysis of the microscale deformation of a two-component layered medium, it is shown that nonlocal interactions necessarily appear in a homogenized model of the system. Explicit expressions for the nonlocal forces are determined. The way these nonlocal forces appear in various nonlocal elasticity theories is derived. The length scales that emerge involve the constituent material properties as well as their geometrical dimensions. A peridynamic material model for the smoothed displacement field is derived. It is demonstrated by comparison with experimental data that the incorporation of nonlocality in modeling dramatically improves the prediction of the stress concentration in an open-hole tension test on a composite plate.
Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS
Finding the strongly connected components (SCCs) of a directed graph is a fundamental graph-theoretic problem. Tarjan's algorithm is an efficient serial algorithm to find SCCs, but relies on the hard-to-parallelize depth-first search (DFS). We observe that implementations of several parallel SCC detection algorithms show poor parallel performance on modern multicore platforms and large-scale networks. This paper introduces the Multistep method, a new approach that avoids work inefficiencies seen in prior SCC approaches. It does not rely on DFS, but instead uses a combination of breadth-first search (BFS) and a parallel graph coloring routine. We show that the Multistep method scales well on several real-world graphs, with performance fairly independent of topological properties such as the size of the largest SCC and the total number of SCCs. On a 16-core Intel Xeon platform, our algorithm achieves a 20X speedup over the serial approach on a 2 billion edge graph, fully decomposing it in under two seconds. For our collection of test networks, we observe that the Multistep method is 1.92X faster (mean speedup) than the state-of-the-art Hong et al. SCC method. In addition, we modify the Multistep method to find connected and weakly connected components, as well as introduce a novel algorithm for determining articulation vertices of biconnected components. These approaches all utilize the same underlying BFS and coloring routines. © 2014 IEEE.
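For reference, the following is a compact sketch of the forward-backward reachability step that BFS-based SCC methods such as Multistep build on: the strongly connected component containing a chosen pivot is exactly the intersection of the set of vertices reachable from the pivot and the set of vertices that can reach it. The parallel trimming and coloring phases of the full algorithm are omitted; this serial version is an illustration, not the implementation benchmarked in the paper.

```python
# Forward-backward (FW-BW) pivot step used by BFS-based SCC algorithms.
from collections import defaultdict, deque

def bfs_reach(adj, source):
    seen = {source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def pivot_scc(edges, pivot):
    fwd, bwd = defaultdict(list), defaultdict(list)
    for u, v in edges:
        fwd[u].append(v)
        bwd[v].append(u)
    # SCC(pivot) = forward-reachable set intersected with backward-reachable set.
    return bfs_reach(fwd, pivot) & bfs_reach(bwd, pivot)

if __name__ == "__main__":
    # 1->2->3->1 is a cycle (one SCC); 3->4 dangles off it.
    edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
    print(pivot_scc(edges, 1))   # {1, 2, 3}
```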
Journal of Elasticity
Abstract not provided.
We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message-passing processes to much finer-grained thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.