Algebraic Flux Correction FEM for Convection-Dominated Transport
Abstract not provided.
This paper examines task mapping algorithms for non-contiguously allocated parallel jobs. Several studies have shown that task placement affects job running time for both contiguously and non-contiguously allocated jobs. Traditionally, work on task mapping either uses a very general model where the job has an arbitrary communication pattern or assumes that jobs are allocated contiguously, making them completely isolated from each other. A middle ground between these two cases is the mapping problem for non-contiguous jobs having a specific communication pattern. We propose several task mapping algorithms for jobs with a stencil communication pattern and evaluate them using experiments and simulations. Our strategies improve the running time of a MiniApp by as much as 30% over a baseline strategy. Furthermore, this improvement increases markedly with the job size, demonstrating the importance of task mapping as systems grow toward exascale.
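As a concrete illustration of the locality-preserving placement such strategies aim for, the sketch below (our own toy example, not one of the paper's algorithms; the grid and block dimensions are assumptions) tiles a 2D stencil task grid into rectangular blocks and assigns one block per allocated node, so that most stencil neighbors land on the same node:

```cpp
// Toy stencil-aware mapping: tile the task grid into bx-by-by blocks,
// one block per node, to keep most stencil traffic on-node.
#include <cstdio>
#include <vector>

int main() {
    const int px = 8, py = 8;     // stencil task grid (assumed dimensions)
    const int bx = 2, by = 2;     // tasks per node in each dimension
    const int nodes_y = py / by;  // nodes per row of the node grid
    std::vector<int> node_of(px * py);
    for (int i = 0; i < px; ++i)
        for (int j = 0; j < py; ++j)
            node_of[i * py + j] = (i / bx) * nodes_y + (j / by);
    std::printf("task (3,5) -> node %d\n", node_of[3 * py + 5]);  // node 6
}
```

A baseline strategy that assigns tasks to nodes in simple rank order splits far more stencil neighbor pairs across nodes; that is the kind of gap the proposed strategies target.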
IEEE Transactions on Visualization and Computer Graphics
The most common abstraction used by visualization libraries and applications today is what is known as the visualization pipeline. The visualization pipeline provides a mechanism to encapsulate algorithms and then couple them together in a variety of ways. The visualization pipeline has been in existence for over 20 years, and over this time many variations and improvements have been proposed. This paper provides a literature review of the most prevalent features of visualization pipelines and some of the most recent research directions. © 1995-2012 IEEE.
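For readers new to the abstraction, the toy C++ sketch below (our illustration, not the API of any library surveyed) shows the essential idea: algorithms are encapsulated as filters with upstream connections, and data is pulled through the coupled chain on demand:

```cpp
// Minimal demand-driven pipeline: filters encapsulate algorithms and are
// coupled by connecting each filter's input to an upstream filter.
#include <cstdio>
#include <memory>
#include <vector>

using Data = std::vector<double>;

struct Filter {
    std::shared_ptr<Filter> input;   // upstream connection
    virtual Data execute() = 0;      // pull data through the pipeline
    virtual ~Filter() = default;
};

struct Source : Filter {             // produces data
    Data execute() override { return {3.0, 1.0, 4.0, 1.0, 5.0}; }
};

struct Scale : Filter {              // transforms upstream data
    double factor = 2.0;
    Data execute() override {
        Data d = input->execute();   // demand-driven: ask upstream first
        for (double& v : d) v *= factor;
        return d;
    }
};

int main() {
    auto src = std::make_shared<Source>();
    auto scale = std::make_shared<Scale>();
    scale->input = src;              // couple the filters into a pipeline
    for (double v : scale->execute()) std::printf("%g ", v);
}
```

Variations on this pattern (push versus pull execution, caching, streaming) are the kinds of refinements such surveys cover.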
Journal of Computational Physics
We present a new approach to the simulation of gravity-driven viscous fingering instabilities in porous media flow. These instabilities play a very important role during carbon sequestration processes in brine aquifers. Our approach is based on a nonlinear implementation of the discontinuous Galerkin method, and possesses a number of key features. First, the method is inherently high order and is therefore well suited to studying unstable flow mechanisms. Second, it maintains high-order accuracy on completely unstructured meshes. The combination of these two features makes it a very appealing strategy for simulating the challenging flow patterns and very complex geometries of actual reservoirs and aquifers. This article includes an extensive set of verification studies on the stability and accuracy of the method, and also features a number of computations with unstructured grids and non-standard geometries.
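For context, a generic discontinuous Galerkin weak form for a scalar conservation law is shown below in standard notation (the paper's nonlinear formulation for porous media flow is more involved): on each element K one seeks a piecewise-polynomial u_h satisfying

```latex
% Generic DG weak form for \partial_t u + \nabla\cdot f(u) = 0 on element K;
% \hat{f} is a numerical flux coupling the two traces across element faces.
\int_K \frac{\partial u_h}{\partial t}\, v \, dx
  \;-\; \int_K f(u_h)\cdot\nabla v \, dx
  \;+\; \int_{\partial K} \hat{f}\!\left(u_h^-, u_h^+\right)\cdot n \, v \, ds
  \;=\; 0
  \qquad \forall\, v \in \mathbb{P}_k(K).
```

High order comes from raising the local polynomial degree k, and because every integral is elementwise, nothing in the formulation changes on unstructured meshes; these are the two features the abstract highlights.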
Procedia Engineering
Hydrocarbon polymers and foams are utilized in high energy-density physics (HEDP) and inertial confinement fusion (ICF) experiments as tampers, energy conversion and radiation pulse shaping layers in dynamic hohlraum Z-pinches, and ablators in ICF capsule implosions. Shocked foams are frequently found to be mixed with other materials, either by intentional doping with high-Z elements or by instabilities and turbulent mixing with surrounding materials. In this paper we present one-dimensional and three-dimensional mesoscale hydrodynamic simulations of high-Z-doped poly-(4-methyl-1-pentene) (PMP or TPX) foams in order to examine the validity of various equation of state (EOS) mixing rules available in two state-of-the-art simulation codes. Platinum-doped PMP foam experiments conducted at Sandia's Z facility provide data that can be used to test EOS mixing rules. We apply Sandia's ALEGRA-MHD code and the joint LLNL/SNL KULL HEDP code to model these doped foam experiments and exercise the available EOS mixing methods. One-dimensional simulations homogenize the foam with the platinum dopant and show which EOS mixing methods produce results consistent with measured Hugoniot states. These simulations produce sharp shock fronts that are well described by traditional Hugoniot relations. Three-dimensional mesoscale simulations explicitly model the foam structure embedded with discrete platinum particles. The heterogeneous structure of the foam results in diffuse shock fronts and an unsteady post-shock state with large fluctuations about an average state. We compare shock propagation through pure foam and Pt-doped foams (50-50 mixture by weight) at equal average initial density, and examine how well the results compare to the experimentally measured Hugoniot states. © 2013 The Authors.
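The "traditional Hugoniot relations" referenced are the Rankine-Hugoniot jump conditions, which relate the pre-shock state (subscript 0) to the post-shock state (subscript 1) through the shock speed U_s and particle velocity u_p:

```latex
% Rankine-Hugoniot conditions (mass, momentum, energy) across a steady shock.
\rho_0 U_s = \rho_1\,(U_s - u_p), \qquad
P_1 - P_0 = \rho_0\, U_s\, u_p, \qquad
E_1 - E_0 = \tfrac{1}{2}\,(P_1 + P_0)\!\left(\frac{1}{\rho_0} - \frac{1}{\rho_1}\right).
```

A measured (U_s, u_p) pair thus fixes the post-shock state against which the homogenized 1D simulations can be compared; the 3D mesoscale runs produce diffuse, unsteady fronts for which these steady relations hold only in an averaged sense.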
Proc. of 2nd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine 2013 - Held in Conj. with SIGKDD 2013 Conf.
We present an algorithm to maintain the connected components of a graph that arrives as an infinite stream of edges. We formalize the algorithm on X-stream, a new parallel theoretical computational model for infinite streams. Connectivity-related queries, including component spanning trees, are supported with some latency, returning the state of the graph at the time of the query. Because an infinite stream may eventually exceed the storage limits of any number of finite-memory processors, we assume an aging command or daemon where "uninteresting" edges are removed when the system nears capacity. Following an aging command the system will block queries until its data structures are repaired, but edges will continue to be accepted from the stream, never dropped. The algorithm will not fail unless a model-specific constant fraction of the aggregate memory across all processors is full. In normal operation, it will not fail unless aggregate memory is completely full. Unlike previous theoretical streaming models designed for finite graphs that assume a single shared memory machine or require arbitrary-size intermediate files, X-stream distributes a graph over a ring network of finite-memory processors. Though the model is synchronous and reminiscent of systolic algorithms, our implementation uses an asynchronous message-passing system. We argue the correctness of our X-stream connected components algorithm, and give preliminary experimental results on synthetic and real graph streams.
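The core bookkeeping, maintaining components incrementally as edges arrive, can be shown with a serial union-find sketch (ours; the paper's contribution is performing this on X-stream's ring of finite-memory processors, with aging and query latency, none of which this toy captures):

```cpp
// Union-find over a fixed vertex set: each arriving edge merges two
// components; queries reflect the state of the graph at query time.
#include <cstdio>
#include <numeric>
#include <vector>

struct DSU {
    std::vector<int> parent;
    explicit DSU(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};

int main() {
    DSU dsu(6);
    const int stream[][2] = {{0, 1}, {2, 3}, {1, 2}, {4, 5}};
    for (const auto& e : stream) dsu.unite(e[0], e[1]);   // process the edge stream
    std::printf("0~3: %d\n", dsu.find(0) == dsu.find(3)); // 1: connected
    std::printf("0~4: %d\n", dsu.find(0) == dsu.find(4)); // 0: separate component
}
```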
Frontiers in Computational Neuroscience
Mathematical modeling of anatomically-constrained neural networks has provided significant insights regarding the response of networks to neurological disorders or injury. A logical extension of these models is to incorporate treatment regimens to investigate network responses to intervention. The addition of nascent neurons from stem cell precursors into damaged or diseased tissue has been used as a successful therapeutic tool in recent decades. Interestingly, models have been developed to examine the incorporation of new neurons into intact adult structures, particularly the dentate granule neurons of the hippocampus. These studies suggest that the unique properties of maturing neurons can impact circuit behavior in unanticipated ways. In this perspective, we first review the current status of models used to examine damaged CNS structures, with particular focus on cortical damage due to stroke. Second, we suggest that computational modeling of cell replacement therapies can be made feasible by implementing approaches taken by current models of adult neurogenesis. The development of these models is critical for generating hypotheses regarding transplant therapies and improving outcomes by tailoring transplants to desired effects.
ACM International Conference Proceeding Series
Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.
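The operation being timed is essentially an ordered list traversal. The sketch below is our simplification of MPI match processing (real implementations also maintain an unexpected-message queue and per-communicator structures): an incoming envelope is matched against posted receives on (communicator, source, tag), with wildcards permitted:

```cpp
// Walk the posted-receive queue in order; the first match wins, preserving
// MPI's message-ordering rule. ANY stands in for MPI_ANY_SOURCE/MPI_ANY_TAG.
#include <cstdio>
#include <list>

constexpr int ANY = -1;

struct PostedRecv { int src, tag, comm; };
struct Envelope   { int src, tag, comm; };

bool match(std::list<PostedRecv>& q, const Envelope& e) {
    for (auto it = q.begin(); it != q.end(); ++it) {
        bool ok = it->comm == e.comm &&
                  (it->src == ANY || it->src == e.src) &&
                  (it->tag == ANY || it->tag == e.tag);
        if (ok) { q.erase(it); return true; }  // consume the matched receive
    }
    return false;  // unmatched: the message would join the unexpected queue
}

int main() {
    std::list<PostedRecv> posted = {{0, 7, 0}, {ANY, 3, 0}};
    std::printf("%d\n", match(posted, {5, 3, 0}));  // 1: hits the wildcard entry
}
```

The per-message cost of this traversal, pointer chasing over short or long queues, is exactly the workload whose sensitivity to core complexity the paper measures.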
Proceedings - 2013 Extreme Scaling Workshop, XSW 2013
The manycore revolution in computational hardware can be characterized by increasing thread counts, decreasing memory per thread, and architecture-specific performance constraints for memory access patterns. High performance computing (HPC) on emerging manycore architectures requires codes to exploit every opportunity for thread-level parallelism and satisfy conflicting performance constraints. We developed the Kokkos C++ library to provide scientific and engineering codes with a user-accessible, performance-portable manycore programming model. The two foundational abstractions of Kokkos are (1) dispatching work to a manycore device for parallel execution and (2) managing multidimensional arrays with polymorphic layouts. The integration of these abstractions enables user code to satisfy multiple architecture-specific memory access pattern performance constraints without source modification. In this paper we describe the Kokkos abstractions, summarize its application programmer interface (API), and present performance results for a molecular dynamics computational kernel and a finite element mini-application. © 2013 IEEE.
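A minimal sketch of the two abstractions, written against today's Kokkos API (which postdates this 2013 paper in some syntactic details): parallel_for and parallel_reduce dispatch work to the device, and a View is a multidimensional array whose layout Kokkos selects for the target architecture:

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int N = 1 << 20;
        // View: layout and memory space are chosen for the execution device,
        // so the same source satisfies different access-pattern constraints.
        Kokkos::View<double*> x("x", N);
        Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
            x(i) = 1.0 / (i + 1);
        });
        double sum = 0.0;
        Kokkos::parallel_reduce("sum", N, KOKKOS_LAMBDA(const int i, double& s) {
            s += x(i);                       // reduction over the device
        }, sum);
        std::printf("sum = %f\n", sum);
    }
    Kokkos::finalize();
}
```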
International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Scalable parallel computing is essential for processing large scale-free (power-law) graphs. The distribution of data across processes becomes important on distributed-memory computers with thousands of cores. It has been shown that two-dimensional layouts (edge partitioning) can have significant advantages over traditional one-dimensional layouts. However, simple 2D block distribution does not use the structure of the graph, and more advanced 2D partitioning methods are too expensive for large graphs. We propose a new two-dimensional partitioning algorithm that combines graph partitioning with 2D block distribution. The computational cost of the algorithm is essentially the same as that of 1D graph partitioning. We study the performance of sparse matrix-vector multiplication (SpMV) for scale-free graphs from the web and social networks using several different partitioners and both 1D and 2D data layouts. We show that SpMV run time is reduced by exploiting the graph's structure. Contrary to popular belief, we observe that current graph and hypergraph partitioners often yield relatively good partitions on scale-free graphs. We demonstrate that our new 2D partitioning method consistently outperforms the other methods considered, for both SpMV and an eigensolver, on matrices with up to 1.6 billion nonzeros using up to 16,384 cores. Copyright 2013 ACM.
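For reference, the baseline 2D block (checkerboard) distribution assigns a nonzero (i, j) to a process by its block row and block column; the proposed method keeps this owner computation but first reorders rows and columns with a graph partitioner. A sketch of the owner function (our illustration):

```cpp
// Owner of nonzero (i, j) of an n-by-n matrix on a pr-by-pc process grid:
// the block-row index selects the grid row, the block-column index the column.
#include <cstdio>

int owner(long i, long j, long n, int pr, int pc) {
    long rows_per = (n + pr - 1) / pr;
    long cols_per = (n + pc - 1) / pc;
    return static_cast<int>((i / rows_per) * pc + (j / cols_per));
}

int main() {
    // 8x8 matrix on a 2x2 grid: nonzero (5, 2) lands on process 2.
    std::printf("owner(5,2) = %d\n", owner(5, 2, 8, 2, 2));
}
```

In a 2D SpMV, input-vector entries are then exchanged only within process columns and partial sums combined only within process rows, which bounds each process's communication partners by pr + pc rather than the full process count; this is the structural advantage over 1D layouts.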
Modeling geospatial information with semantic graphs enables search for sites of interest based on relationships between features, without requiring strong a priori models of feature shape or other intrinsic properties. Geospatial semantic graphs can be constructed from raw sensor data with suitable preprocessing to obtain a discretized representation. This report describes initial work toward extending geospatial semantic graphs to include temporal information, and initial results applying semantic graph techniques to SAR image data. We describe an efficient graph structure that includes geospatial and temporal information, which is designed to support simultaneous spatial and temporal search queries. We also report a preliminary implementation of feature recognition, semantic graph modeling, and graph search based on input SAR data. The report concludes with lessons learned and suggestions for future improvements.
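A toy sketch of the kind of structure described (ours; the field names and interval representation are assumptions, not the report's implementation): each vertex carries a feature label, a location, and a validity interval, so a search can constrain relationships in space and time simultaneously:

```cpp
#include <cstdio>
#include <string>
#include <vector>

struct Feature {              // graph vertex: what, where, and when
    std::string label;        // e.g. "building", "road"
    double x, y;              // location from the discretized sensor data
    double t0, t1;            // interval during which the feature was observed
};
struct Edge { int a, b; std::string relation; };  // e.g. "adjacent-to"

bool temporal_overlap(const Feature& f, const Feature& g) {
    return f.t0 <= g.t1 && g.t0 <= f.t1;   // intervals intersect
}

int main() {
    std::vector<Feature> v = {{"building", 0, 0, 10, 20},
                              {"road",     1, 0, 15, 30},
                              {"building", 5, 5, 40, 50}};
    std::vector<Edge> e = {{0, 1, "adjacent-to"}, {1, 2, "adjacent-to"}};
    // Query: buildings adjacent to a road during an overlapping time window.
    for (const Edge& ed : e)
        if (v[ed.a].label == "building" && v[ed.b].label == "road" &&
            ed.relation == "adjacent-to" && temporal_overlap(v[ed.a], v[ed.b]))
            std::printf("match: vertex %d\n", ed.a);  // prints vertex 0
}
```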
Journal of Applied Physics
We present a method that computes many-electron energies and eigenfunctions by full configuration interaction in a basis of atomistic tight-binding wave functions. This approach captures electron correlation as well as atomistic effects, and is well suited to solid-state quantum dot systems containing few electrons, where valley physics and disorder contribute significantly to device behavior. Results are reported for a two-electron silicon double quantum dot as an example. © 2012 American Institute of Physics.
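In standard notation (generic to configuration interaction, not specific to this paper), the many-electron state is expanded over determinants built from the single-particle basis (here, atomistic tight-binding wave functions) and the coefficients solve a matrix eigenproblem:

```latex
% Full CI: expand in all determinants \Phi_I of the chosen single-particle
% basis and diagonalize the many-electron Hamiltonian.
|\Psi\rangle = \sum_I c_I\, |\Phi_I\rangle, \qquad
\sum_J H_{IJ}\, c_J = E\, c_I, \qquad
H_{IJ} = \langle \Phi_I |\, \hat{H} \,| \Phi_J \rangle .
```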
11th International Probabilistic Safety Assessment and Management Conference and the Annual European Safety and Reliability Conference 2012, PSAM11 ESREL 2012
Many of the Performance Shaping Factors (PSFs) used in Human Reliability Analysis (HRA) methods are not directly measurable or observable. Methods like SPAR-H require the analyst to assign values for all of the PSFs, regardless of the PSF observability; this introduces subjectivity into the human error probability (HEP) calculation. One method to reduce the subjectivity of HRA estimates is to formally incorporate information about the probability of the PSFs into the methodology for calculating the HEP. This can be accomplished by encoding prior information in a Bayesian Network (BN) and updating the network using available observations. We translated an existing HRA methodology, SPAR-H, into a Bayesian Network to demonstrate the usefulness of the BN framework. We focus on the ability to incorporate prior information about PSF probabilities into the HRA process. This paper discusses how we produced the model by combining information from two sources, and how the BN model can be used to estimate HEPs despite missing observations. Use of the prior information allows HRA analysts to use partial information to estimate HEPs, and to rely on the prior information (from data or cognitive literature) when they are unable to gather information about the state of a particular PSF. The SPAR-H BN model is a starting point for future research activities to create a more robust HRA BN model using data from multiple sources.
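For reference, the SPAR-H point estimate that the BN framework builds on multiplies a nominal HEP by the PSF multipliers, with the method's standard adjustment (quoted here from memory of the SPAR-H documentation) applied when several PSFs are negative so the result stays below 1:

```latex
% SPAR-H point estimate and composite adjustment.
\mathrm{HEP} = \mathrm{NHEP} \cdot \prod_i \mathrm{PSF}_i,
\qquad
\mathrm{HEP}_{\mathrm{adj}} =
  \frac{\mathrm{NHEP}\cdot \mathrm{PSF}_{\mathrm{comp}}}
       {\mathrm{NHEP}\,\left(\mathrm{PSF}_{\mathrm{comp}} - 1\right) + 1},
\qquad
\mathrm{PSF}_{\mathrm{comp}} = \prod_i \mathrm{PSF}_i .
```

The BN replaces the analyst's single assignment of each PSF with a probability distribution over its states, so the HEP can be estimated by marginalizing over unobserved PSFs.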
Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
The predictions for exascale computing are dire. Although we have benefited from a consistent supercomputer architecture design, even across manufacturers, for well over a decade, recent trends indicate that future high-performance computers will have different hardware structure and programming models to which software must adapt. This paper provides an informal discussion on the ways in which changes in high-performance computing architecture will profoundly affect the scalability of our current generation of scientific visualization and analysis codes and how we must adapt our applications, workflows, and attitudes to continue our success at exascale computing. © 2012 IEEE.
Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
The push to exascale computing is informed by the assumption that the architecture, regardless of the specific design, will be fundamentally different from petascale computers. The Mantevo project has been established to produce a set of proxies, or 'miniapps,' which enable rapid exploration of key performance issues that impact a broad set of scientific application programs of interest to ASC and the broader HPC community. The conditions under which a miniapp can be confidently used as predictive of an application's behavior must be clearly elucidated. Toward this end, we have developed a methodology for assessing the predictive capabilities of application proxies. Adhering to the spirit of experimental validation, our approach provides a framework for comparing data from the application with that provided by its proxies. In this poster we present this methodology and apply it to three miniapps developed by the Mantevo project. © 2012 IEEE.
The rise in node-level parallelism has increased interest in task-based parallel runtimes for a wide array of application areas. Applications have a wide variety of task spawning patterns which frequently change during the course of application execution, based on the algorithm or solver kernel in use. Task scheduling and load balance regimes, however, are often highly optimized for specific patterns. This paper uses four basic task spawning patterns to quantify the impact of specific scheduling policy decisions on execution time. We compare the behavior of six publicly available tasking runtimes: Intel Cilk, Intel Threading Building Blocks (TBB), Intel OpenMP, GCC OpenMP, Qthreads, and High Performance ParalleX (HPX). With the exception of Qthreads, the runtimes prove to have schedulers that are highly sensitive to application structure. No runtime is able to provide the best performance in all cases, and those that do provide the best performance in some cases, unfortunately, provide extremely poor performance when application structure does not match the scheduler's assumptions.
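One of the basic spawning patterns such comparisons exercise is a recursive binary spawn tree; the sketch below expresses it with OpenMP tasks (any of the six runtimes can express the same shape; the pattern, not this code, is what the paper benchmarks). Schedulers tuned for flat, loop-like spawning often degrade on deep recursive trees, which is exactly the sensitivity quantified here:

```cpp
// Recursive binary spawn tree: one child is spawned as a task, the other
// executed inline; taskwait joins the two before combining results.
#include <cstdio>
#include <omp.h>

long tree(int depth) {
    if (depth == 0) return 1;
    long a = 0, b = 0;
    #pragma omp task shared(a)
    a = tree(depth - 1);          // spawned subtree
    b = tree(depth - 1);          // inline subtree
    #pragma omp taskwait
    return a + b;
}

int main() {
    long n = 0;
    #pragma omp parallel
    #pragma omp single            // one thread seeds the tree; others steal
    n = tree(20);
    std::printf("leaves: %ld\n", n);  // 2^20 = 1048576
}
```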
Sandia's parallel circuit simulator, Xyce, can address large-scale neuron simulations in a new way, extending the range within which one can perform high-fidelity, multi-compartment neuron simulations. This report documents the implementation of neuron devices in Xyce and their use in the simulation and analysis of neuron systems.
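A multi-compartment neuron device ultimately contributes equations of the standard compartmental form to the circuit system (generic form shown here; the report documents Xyce's actual device formulation): each compartment's membrane voltage is driven by ionic channel currents, axial coupling to neighboring compartments, and any injected current:

```latex
% Generic compartmental membrane equation: capacitive current balances
% channel, axial, and external currents for compartment i.
C_m \frac{dV_i}{dt} =
  -\sum_{c} g_c\,(V_i - E_c)
  \;+\; \sum_{j \in \mathcal{N}(i)} g_{ij}\,(V_j - V_i)
  \;+\; I_{\mathrm{ext},i} .
```

Structurally this is a nonlinear circuit (capacitors, conductances, and controlled sources), which is why a parallel circuit simulator is a natural host for it.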
Over the past few years, we have defined, and gone a long way toward implementing, a component-based strategy for building scientific application codes. We have asserted that this approach offers significant advantages over a model of writing project-based application codes. There are now several technical and programmatic successes that validate these claims. Not only are there net benefits to code projects that follow this strategy, but the most striking gains are in the long-term impact and productivity of our computational science organizations.
Reliability Engineering and System Safety
Sensitivity analysis comprises techniques to quantify the effects of input variables on a set of outputs. In particular, sensitivity indices can be used to infer which input parameters most significantly affect the results of a computational model. With continually increasing computing power, sensitivity analysis has become an important technique by which to understand the behavior of large-scale computer simulations. Many sensitivity analysis methods rely on sampling from distributions of the inputs. Such sampling-based methods can be computationally expensive, requiring many evaluations of the simulation; in this case, the Sobol method provides an easy and accurate way to compute variance-based measures, provided a sufficient number of model evaluations are available. As an alternative, meta-modeling approaches have been devised to approximate the response surface and estimate various measures of sensitivity. In this work, we consider a variety of sensitivity analysis methods, including different sampling strategies, different meta-models, and different ways of evaluating variance-based sensitivity indices. The problem we consider is the 1-D Riemann problem. By a careful choice of inputs, discontinuous solutions are obtained, leading to discontinuous response surfaces; such surfaces can be particularly problematic for meta-modeling approaches. The goal of this study is to compare the estimated sensitivity indices with exact values and to evaluate the convergence of these estimates with increasing sample sizes and increasing numbers of meta-model evaluations. © 2011 Elsevier Ltd. All rights reserved.
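The variance-based measures in question are the Sobol indices; in standard notation, the first-order index of input x_i and its total-effect counterpart are:

```latex
% First-order index: fraction of Var(Y) explained by x_i alone.
S_i = \frac{\operatorname{Var}_{x_i}\!\big(\,\mathbb{E}[\,Y \mid x_i\,]\,\big)}
           {\operatorname{Var}(Y)},
\qquad
% Total-effect index: x_i's contribution including all its interactions.
S_{T_i} = 1 -
  \frac{\operatorname{Var}_{x_{\sim i}}\!\big(\,\mathbb{E}[\,Y \mid x_{\sim i}\,]\,\big)}
       {\operatorname{Var}(Y)} .
```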
This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system, and is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.
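One detail of the matching interface worth illustrating is the match/ignore-bits test (our paraphrase; the report contains the normative definition): an incoming message's 64-bit match bits must agree with a list entry's match bits in every position where the entry's ignore mask is clear:

```cpp
// A message matches a list entry when the two bit patterns agree everywhere
// the ignore mask is zero; set ignore bits act as wildcards.
#include <cstdint>
#include <cstdio>

bool matches(uint64_t msg_bits, uint64_t me_bits, uint64_t ignore_bits) {
    return ((msg_bits ^ me_bits) & ~ignore_bits) == 0;
}

int main() {
    uint64_t me = 0x00AB0000, ignore = 0xFFFF;    // low 16 bits wildcarded
    std::printf("%d %d\n",
        (int)matches(0x00AB0042, me, ignore),     // 1: differs only in ignored bits
        (int)matches(0x001B0042, me, ignore));    // 0: differs in a checked bit
}
```

This single test is what lets one list entry serve, for example, an MPI wildcard receive without host intervention.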
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Message passing paradigms provide many-to-one messaging patterns that can result in receive-side resource exhaustion. Traditionally, MPI implementations layered over the Portals network programming interface provided a large default unexpected-receive buffer space; the user was expected to configure the buffer size to the application's demand, and the application was aborted when the buffer space was overrun. The Portals 4 design provides a set of primitives for implementing scalable resource exhaustion recovery without negatively impacting normal operation. A resource exhaustion recovery protocol for MPI implementations is presented, as well as performance results for an Open MPI implementation of the protocol. © 2012 Springer-Verlag.
Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
With the ubiquity of multicore processors, it is crucial that solvers adapt to the hierarchical structure of modern architectures. We present ShyLU, a "hybrid-hybrid" solver for general sparse linear systems that is hybrid in two ways. First, it combines direct and iterative methods; the iterative part is based on approximate Schur complements, computed with either a value-based dropping strategy or a structure-based probing strategy. Second, the solver uses two levels of parallelism via hybrid programming (MPI+threads). ShyLU is useful both in shared-memory environments and on large parallel computers with distributed memory; in the latter case, it should be used as a subdomain solver. We argue that with the increasing complexity of compute nodes, it is important to exploit multiple levels of parallelism even within a single compute node. We show the robustness of ShyLU against other algebraic preconditioners. ShyLU scales well up to 384 cores for a given problem size. We also study the MPI-only performance of ShyLU against a hybrid implementation and conclude that on present multicore nodes an MPI-only implementation is better. However, for future multicore machines (96 or more cores) hybrid/hierarchical algorithms and implementations will be important for sustained performance. © 2012 IEEE.
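In generic notation (block names are ours, not necessarily the paper's), the hybrid direct/iterative split operates on a partitioned system: the interior block is factored directly, and the interface unknowns are solved iteratively using an approximate Schur complement obtained by dropping or probing:

```latex
% x_1: interior unknowns (direct factorization of B); x_2: interface unknowns.
\begin{pmatrix} B & F \\ E & C \end{pmatrix}
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix} =
\begin{pmatrix} b_1 \\ b_2 \end{pmatrix},
\qquad
S = C - E\,B^{-1}F, \qquad
S\,x_2 = b_2 - E\,B^{-1}b_1, \qquad
x_1 = B^{-1}\,(b_1 - F\,x_2).
% Forming S exactly is too expensive, so the iterative solve for x_2 uses a
% sparse approximation of S (value-based dropping or structure-based probing).
```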
Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
We design, implement, and evaluate algorithms for computing a matching of maximum cardinality in a bipartite graph on multicore and massively multithreaded computers. As computers with larger numbers of slower cores dominate the commodity processor market, the design of multithreaded algorithms to solve large matching problems becomes a necessity. Recent work on serial algorithms for the matching problem has shown that their performance is sensitive to the order in which the vertices are processed for matching. In a multithreaded environment, imposing a serial order in which vertices are considered for matching would lead to loss of concurrency and performance. But this raises the question: would parallel matching algorithms on multithreaded machines improve performance over a serial algorithm? We answer this question in the affirmative. We report efficient multithreaded implementations of three classes of algorithms based on their manner of searching for augmenting paths: breadth-first search, depth-first search, and a combination of both. The Karp-Sipser initialization algorithm is used to make the parallel algorithms practical. We report extensive results and insights using three shared-memory platforms (a 48-core AMD Opteron, a 32-core Intel Nehalem, and a 128-processor Cray XMT) on a representative set of real-world and synthetic graphs. To the best of our knowledge, this is the first study of augmentation-based parallel algorithms for bipartite cardinality matching that demonstrates good speedups on multithreaded shared-memory multiprocessors. © 2012 IEEE.
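The Karp-Sipser initialization mentioned above fits in a few lines (a simplified, static version is shown; the full rule re-examines degrees as matched vertices are removed): a degree-1 vertex is always matched to its only neighbor, since that choice never precludes a maximum matching, and the rest are matched greedily before augmentation begins:

```cpp
// Simplified Karp-Sipser initialization for a bipartite graph:
// safe degree-1 matches first, then a greedy pass over the remainder.
#include <cstdio>
#include <vector>

int main() {
    // adj[u] lists the column vertices adjacent to row vertex u.
    std::vector<std::vector<int>> adj = {{0, 1}, {0}, {1, 2}};
    const int nrows = static_cast<int>(adj.size()), ncols = 3;
    std::vector<int> match_row(nrows, -1), match_col(ncols, -1);

    auto try_match = [&](int u) {
        for (int v : adj[u])
            if (match_col[v] == -1) { match_row[u] = v; match_col[v] = u; return; }
    };
    for (int u = 0; u < nrows; ++u)       // degree-1 rows: the "safe" moves
        if (adj[u].size() == 1) try_match(u);
    for (int u = 0; u < nrows; ++u)       // greedy pass for everything else
        if (match_row[u] == -1) try_match(u);

    for (int u = 0; u < nrows; ++u)
        std::printf("row %d -> col %d\n", u, match_row[u]);
}
```

A good initial matching leaves few unmatched vertices, so the parallel BFS/DFS augmentation phases have less work, which is why initialization matters for practicality.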