An Algebraic Multigrid Method for Compatible Least-Squares Formulations of Div-Curl Equations
Running untrusted user-level code inside an operating system kernel was studied in the 1990s but never really caught on. We believe the time has come to resurrect kernel extensions for operating systems that run on highly parallel clusters and supercomputers, because the usage model for these machines differs significantly from that of a desktop machine or a server. In addition, vendors are starting to add features such as floating-point accelerators, multicore processors, and reconfigurable compute elements. An operating system for such machines must be adaptable to the requirements of specific applications and must provide abstractions for accessing next-generation hardware features, without sacrificing performance or scalability.
Algorithmic properties of the midpoint predictor-corrector time integration algorithm are examined. In the case of a finite number of iterations, the errors in angular momentum conservation and incremental objectivity are controlled by the number of iterations performed. Exact angular momentum conservation and exact incremental objectivity are achieved in the limit of an infinite number of iterations. A complete stability and dispersion analysis of the linearized algorithm is detailed. The main observation is that stability depends critically on the number of iterations performed.
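To illustrate the role of the iteration count, here is a minimal sketch of an iterated implicit-midpoint step for a generic ODE (our own illustration, not the paper's solid-mechanics implementation; the step size and iteration counts are arbitrary). For a planar circular orbit, the angular momentum drift shrinks as the number of corrector iterations grows, consistent with the abstract's observation that conservation is exact only in the limit of infinitely many iterations.

```python
import numpy as np

def midpoint_pc_step(f, y, dt, n_iter):
    """One midpoint predictor-corrector step with a fixed number of iterations."""
    y_new = y + dt * f(y)                       # predictor: forward Euler
    for _ in range(n_iter):                     # correctors: fixed-point passes
        y_new = y + dt * f(0.5 * (y + y_new))   # on the implicit midpoint rule
    return y_new

def f(y):
    x, v = y[:2], y[2:]
    return np.concatenate([v, -x])  # planar circular orbit: x'' = -x

y0 = np.array([1.0, 0.0, 0.0, 1.0])
for n_iter in (1, 2, 10):
    y = y0.copy()
    for _ in range(1000):
        y = midpoint_pc_step(f, y, 0.01, n_iter)
    L = y[0] * y[3] - y[1] * y[2]   # angular momentum; exactly 1.0 only in
    print(n_iter, L)                # the limit of many corrector iterations
```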
Over the past several years, verifying and validating complex codes at Sandia National Laboratories has become a major part of code development. These activities address two important questions in simulation modeling: determining whether the models have been implemented correctly (verification) and determining whether the correct models have been selected (validation). In this talk, we focus on verification and discuss the basics of code verification and its application to a few codes and problems at Sandia.
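As a toy illustration of the kind of check code verification relies on (our own example, not drawn from the talk), the snippet below estimates a scheme's observed order of accuracy from discretization errors on two grids; a verified second-order code should recover an observed order near 2. The function name and error values are illustrative only.

```python
import math

def observed_order(err_coarse, err_fine, refinement=2.0):
    """Observed order of accuracy from errors on two grids related by a refinement ratio."""
    return math.log(err_coarse / err_fine) / math.log(refinement)

# e.g. errors against an exact (or manufactured) solution on grids h and h/2
print(observed_order(4.0e-3, 1.0e-3))   # -> 2.0 for a second-order scheme
```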
This document is a reference guide to the Xyce Parallel Electronic Simulator and is a companion document to the Xyce Users Guide. The focus of this document is to list, as exhaustively as possible, the device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial; users who are new to circuit simulation are better served by the Xyce Users Guide.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of Sandia National Laboratories' electrical designers. This development has focused on improving capability over the current state of the art in the following areas: (1) the capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors), including support for most popular parallel and serial computers; (2) improved performance of all numerical kernels (e.g., the time integrator and the nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques; (3) device models specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only); and (4) object-oriented code design and implementation using modern coding practices to ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message-passing parallel implementation - which allows it to run efficiently on the widest possible range of computing platforms, including serial, shared-memory, and distributed-memory parallel machines as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an in-house capability with which both electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the Laboratory's specific needs.
The object of the 'Enabling Immersive Simulation for Complex Systems Analysis and Training' LDRD has been to research, design, and engineer a capability to develop simulations which (1) provide a rich, immersive interface for participation by real humans (exploiting existing high-performance game-engine technology wherever possible), and (2) can leverage Sandia's substantial investment in high-fidelity physical and cognitive models implemented in the Umbra simulation framework. We report here on these efforts. First, we describe the integration of Sandia's Umbra modular simulation framework with the open-source Delta3D game engine. Next, we report on Umbra's integration with Sandia's Cognitive Foundry, specifically to provide for learning behaviors for 'virtual teammates' directly from observed human behavior. Finally, we describe the integration of Delta3D with the ABL behavior engine, and report on research into establishing the theoretical framework that will be required to make use of tools like ABL to scale up to increasingly rich and realistic virtual characters.
The motivating vision behind Sandia's MENTOR/PAL LDRD project has been that of systems which use real-time psychophysiological data to support and enhance human performance, both individually and of groups. Relevant and significant psychophysiological data being a necessary prerequisite to such systems, this LDRD has focused on identifying and refining such signals. The project has focused in particular on EEG (electroencephalogram) data as a promising candidate signal because it (potentially) provides a broad window on brain activity with relatively low cost and logistical constraints. We report here on two analyses performed on EEG data collected in this project using the SOBI (Second Order Blind Identification) algorithm to identify two independent sources of brain activity: one in the frontal lobe and one in the occipital. The first study looks at directional influences between the two components, while the second study looks at inferring gender based upon the frontal component.
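For readers unfamiliar with second-order blind identification, the sketch below shows the core mechanism using the simpler single-lag AMUSE variant rather than full SOBI, which jointly diagonalizes lagged covariance matrices at many lags. This is a simplified stand-in of our own (the function name and lag choice are illustrative), not the analysis pipeline used in the project.

```python
import numpy as np

def amuse(X, lag=1):
    """X: (channels, samples) array; returns estimated source signals as rows."""
    X = X - X.mean(axis=1, keepdims=True)
    # Whiten using the zero-lag covariance.
    C0 = X @ X.T / X.shape[1]
    d, E = np.linalg.eigh(C0)
    W = E @ np.diag(d ** -0.5) @ E.T        # symmetric whitening matrix
    Z = W @ X
    # Diagonalize a symmetrized time-lagged covariance of the whitened data;
    # SOBI proper would jointly diagonalize such matrices over many lags.
    C1 = Z[:, lag:] @ Z[:, :-lag].T / (Z.shape[1] - lag)
    _, U = np.linalg.eigh(0.5 * (C1 + C1.T))
    return U.T @ Z                           # rows are recovered components
```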
Tensor decompositions are higher-order analogues of matrix decompositions and have proven to be powerful tools for data analysis. In particular, we are interested in the canonical tensor decomposition, otherwise known as the CANDECOMP/PARAFAC decomposition (CPD), which expresses a tensor as the sum of component rank-one tensors and is used in a multitude of applications such as chemometrics, signal processing, neuroscience, and web analysis. The task of computing the CPD, however, can be difficult. The typical approach is based on alternating least squares (ALS) optimization, which can be remarkably fast but is not very accurate. Previously, nonlinear least squares (NLS) methods have also been recommended; existing NLS methods are accurate but slow. In this paper, we propose the use of gradient-based optimization methods. We discuss the mathematical calculation of the derivatives and further show that they can be computed efficiently, at the same cost as one iteration of ALS. Computational experiments demonstrate that the gradient-based optimization methods are much more accurate than ALS and orders of magnitude faster than NLS.
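To make the cost claim concrete, here is a minimal NumPy sketch (our own illustration, not the authors' implementation) of the gradient of f = ½||X − [[A,B,C]]||² for a third-order tensor. Each mode requires one matricized-tensor-times-Khatri-Rao-product (MTTKRP), the same kernel that dominates an ALS sweep, which is why the full gradient costs about as much as one ALS iteration.

```python
import numpy as np

def khatri_rao(B, C):
    """Column-wise Kronecker product: shape (B.rows * C.rows, R)."""
    R = B.shape[1]
    return np.einsum('ir,jr->ijr', B, C).reshape(-1, R)

def cp_gradient(X, A, B, C):
    """Gradient of 0.5*||X - [[A,B,C]]||_F^2 w.r.t. the factor matrices.
    Costs one MTTKRP per mode, the same kernel as one ALS sweep."""
    X1 = X.reshape(X.shape[0], -1)                      # mode-1 unfolding
    X2 = np.moveaxis(X, 1, 0).reshape(X.shape[1], -1)   # mode-2 unfolding
    X3 = np.moveaxis(X, 2, 0).reshape(X.shape[2], -1)   # mode-3 unfolding
    GA = A @ ((B.T @ B) * (C.T @ C)) - X1 @ khatri_rao(B, C)
    GB = B @ ((A.T @ A) * (C.T @ C)) - X2 @ khatri_rao(A, C)
    GC = C @ ((A.T @ A) * (B.T @ B)) - X3 @ khatri_rao(A, B)
    return GA, GB, GC

# Quick check against a rank-2 tensor assembled from known factors:
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((n, 2)) for n in (4, 5, 6))
X = np.einsum('ir,jr,kr->ijk', A, B, C)
GA, GB, GC = cp_gradient(X, A, B, C)
print(np.abs(GA).max())   # ~0: the exact factors are a stationary point
```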
International Journal of Decision Support System Technology (IJDSST)
The Design for Tractable Analysis (DTA) framework was developed to address the analysis of complex systems and so-called “wicked problems.” DTA is distinctive because it treats analytic processes as key artifacts that can be created and improved through formal design processes. Systems (or enterprises) are analyzed as a whole, in conjunction with decomposing them into constituent elements for domain-specific analyses that are informed by the whole. After using the Systems Modeling Language (SysML) to frame the problem in the context of stakeholder needs, DTA harnesses the Design Structure Matrix (DSM) to structure the analysis of the system and address questions about the emergent properties of the system. The novel use of DSM to “design the analysis” makes DTA particularly suitable for addressing the interdependent nature of complex systems. The use of DTA is demonstrated by a case study of sensor grid placement decisions to secure assets at a fixed site. © 2009, IGI Global. All rights reserved.
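As a toy illustration of how a DSM encodes an analysis process (the tasks and dependencies below are invented to echo the sensor-placement case study, not taken from the paper), a binary matrix records which analysis task consumes outputs of which other task; entries above the diagonal expose feedback loops that force iteration rather than a single forward pass.

```python
import numpy as np

tasks = ["frame problem", "site model", "sensor model",
         "grid placement", "coverage eval"]
# dsm[i, j] = 1 means task i depends on an output of task j.
dsm = np.array([
    [0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],   # placement also depends on coverage feedback
    [0, 1, 0, 1, 0],
])

# Dependencies above the diagonal mark feedback: these task pairs must
# be iterated (or re-sequenced) when "designing the analysis".
feedback = [(tasks[i], tasks[j])
            for i, j in zip(*np.nonzero(np.triu(dsm, 1)))]
print(feedback)   # [('grid placement', 'coverage eval')]
```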
Journal of Physics: Conference Series
SciDAC applications have a demonstrated need for advanced software tools to manage the complexities associated with sophisticated geometry, mesh, and field manipulation tasks, particularly as computer architectures move toward the petascale. In this paper, we describe a software component - an abstract data model and programming interface - designed to provide support for parallel unstructured mesh operations. We describe key issues that must be addressed to successfully provide high-performance, distributed-memory unstructured mesh services and highlight some recent research accomplishments in developing new load balancing and MPI-based communication libraries appropriate for leadership class computing. Finally, we give examples of the use of parallel adaptive mesh modification in two SciDAC applications. © 2009 IOP Publishing Ltd.
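As a schematic of the kind of distributed-mesh data model such a component must support (the class and method names below are hypothetical, not the component's actual interface), each rank owns a mesh part and tracks ghost copies of remotely owned entities, so that load balancing can migrate ownership without breaking stencil access.

```python
from dataclasses import dataclass, field

@dataclass
class MeshPart:
    rank: int
    owned: set = field(default_factory=set)     # entity handles owned by this rank
    ghosts: dict = field(default_factory=dict)  # handle -> owning rank

    def migrate_out(self, handle, new_owner):
        """Rebalance: hand an entity to another rank, keeping a local ghost copy."""
        self.owned.discard(handle)
        self.ghosts[handle] = new_owner

part = MeshPart(rank=0, owned={"tet17", "tet18"})
part.migrate_out("tet18", new_owner=3)
print(part.owned, part.ghosts)   # {'tet17'} {'tet18': 3}
```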
Journal of Physics: Conference Series
This paper describes an application-driven methodology for understanding the impact of future architecture decisions at the end of the MPP era. Fundamental transistor device limitations, combined with application performance characteristics, have driven the switch to multicore/multithreaded architectures. Designing large-scale supercomputers to match application demands is particularly challenging because performance characteristics are highly counter-intuitive: in fact, data movement, more than FLOPS, dominates. This work discusses basic performance analysis for a set of DOE applications, the limits of CMOS technology, and the impact of both on future architectures. © 2009 IOP Publishing Ltd.
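To see why data movement rather than FLOPS dominates, consider a back-of-the-envelope roofline estimate (the machine numbers below are hypothetical, chosen only to illustrate the point): a sparse matrix-vector product, typical of implicit science codes, has such low arithmetic intensity that only a few percent of peak FLOPS is attainable on a bandwidth-limited node.

```python
# Hypothetical node: 100 GFLOP/s peak compute, 10 GB/s memory bandwidth.
peak_flops = 100e9
peak_bw    = 10e9

# Sparse matrix-vector product: roughly 2 flops per nonzero, ~12 bytes
# moved per nonzero (8-byte value + 4-byte column index), ignoring vectors.
ai = 2.0 / 12.0                                  # arithmetic intensity, flops/byte
attainable = min(peak_flops, ai * peak_bw)       # roofline bound
print(f"{attainable / peak_flops:.1%} of peak")  # ~1.7%: bandwidth-bound
```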
Computer Aided Chemical Engineering
Process Systems Engineering (PSE) is built on the application of computational tools to the solution of physical engineering problems. Over the course of its nearly five-decade history, advances in PSE have relied roughly equally on advancements in desktop computing technology and developments of new tools and approaches for representing and solving problems (Westerberg, 2004). Just as desktop computing development over that period focused on increasing the net serial instruction rate, tool development in PSE has emphasized creating faster general-purpose serial algorithms. However, in recent years the increase in net serial instruction rate has slowed dramatically, with processors first reaching an effective upper limit for clock speed and now approaching apparent limits for microarchitecture efficiency. Current trends in desktop processor development suggest that future performance gains will occur primarily through exploitation of parallelism. For PSE to continue to leverage the "free" advancements from desktop computing technology in the future, the PSE toolset will need to embrace the use of parallelization.

Unfortunately, "parallelization" is more than just identifying multiple things to do at once. Parallel algorithm design has two fundamental challenges: first, to match the characteristics of the parallelizable problem workload to the capabilities of the hardware platform, and second, to properly balance parallel computation with the overhead of communication and synchronization on that platform. The performance of any parallel algorithm is thus a strong function of how well the characteristics of the problem and algorithm match those of the hardware platform on which it will run. This has led to a proliferation of highly specialized parallel hardware platforms, each designed around specific problems or problem classes. While every platform has its own unique characteristics, we can group current approaches into six basic classes: symmetric multiprocessing (SMP), networks of workstations (NOW), massively parallel processing (MPP), specialized coprocessors, multi-threaded shared memory, and hybrids that combine components of the first five classes.

Perhaps the most familiar of these is the SMP architecture, which forms the bulk of the current desktop and workstation market. These systems have multiple processing units (processors and/or cores) controlled by a single operating system image and sharing a single common shared memory space. While SMP systems provide only a modest level of parallelism (typically 2-16 processing units), the existence of shared memory and full-featured processing units makes them perhaps the most straightforward development platform. A challenge of SMP platforms is the discrepancy between the speed of the processor and the memory system: both latency and overall memory bandwidth limitations can leave processors idling while waiting for data. Clusters, a generic term for coordinated groups of independent computers (nodes) connected with high-speed networks, provide the opportunity for a radically different level of parallelism, with the largest clusters having over 25,000 nodes and 100,000 processing units. The challenge with clusters is that memory is distributed across independent nodes: communication and coordination among nodes must be explicitly managed and occur over a relatively high-latency network interconnect.
Efficient use of this architecture requires applications that decompose into pseudo-independent components that run with high computation-to-communication ratios. The level to which systems utilize commodity components distinguishes the two main types of cluster architectures, with NOW nodes running commodity network interconnects and operating systems, and MPP nodes using specialized or proprietary network layers or microkernels.

Specialized coprocessors, including graphics processing units (GPU) and the Cell Broadband Engine (Cell), are gaining popularity as scientific computing platforms. These platforms employ specialized, non-general-purpose processing units to speed fine-grained, repetitive processing. Architecturally, they are reminiscent of vector computing, combining very fast access to a small amount of local memory with processing elements implementing either a single-instruction-multiple-data (SIMD) model (GPU) or a pipelined model (Cell). As application developers must explicitly manage both parallelism on the coprocessor and the movement of data to and from the coprocessor memory space, these architectures can be some of the most challenging to program. Finally, multi-threaded shared memory (MTSM) systems represent a fundamental departure from traditional distributed memory systems like NOW and MPP. Instead of a collection of independent nodes and memory spaces, an MTSM system runs a single system image across all nodes, combining all node memory into a single coherent shared memory space. To a developer, the MTSM appears to be a single very large SMP. However, unlike an SMP, which uses caches to reduce the latency of a memory access, the MTSM tolerates latency by using a large number of concurrent threads. While this architecture lends itself to problems that are not readily decomposable, effective utilization of MTSM systems requires applications to run hundreds - or thousands - of concurrent threads.

The proliferation of specialized parallel computing architectures presents several significant challenges for developers of parallel modeling and optimization applications. Foremost is the challenge of selecting the "appropriate" platform to target when developing the application. While it is clear that architectural characteristics can significantly affect the performance of an algorithm, relatively few rules or heuristics exist for selecting a platform based solely on application characteristics. A contributing challenge is that different architectures employ fundamentally different programming paradigms, libraries, and tools: knowledge and experience on one platform does not necessarily translate to other platforms. This also complicates the process of directly comparing platform performance, as applications are rarely portable: software designed for one platform rarely compiles on another without modification, and the modifications may require a redesign of the fundamental parallelization approach. A final challenge is effectively communicating parallel results. While the relatively homogeneous environment of serial desktop computing facilitated extremely terse descriptions of a test platform, often limited to processor make and clock speed, reporting results for parallel architectures must include not only processor information but also, depending on the architecture, the operating system, network interconnect, coprocessor make, model, and interconnect, and node configuration.
There are numerous examples of algorithms and applications designed explicitly to leverage specific architectural features of parallel systems. While by no means comprehensive, three current representative efforts are the development of parallel branch-and-bound algorithms, distributed collaborative optimization algorithms, and multithreaded parallel discrete event simulation. PICO, the Parallel Integer and Combinatorial Optimizer (Eckstein et al., 2001), is a scalable parallel mixed-integer linear optimizer. Designed explicitly for cluster environments (both NOW and MPP), PICO leverages the synergy between the inherently decomposable branch-and-bound tree search and the independent nature of the nodes within a cluster by distributing the independent sub-problems of the tree search across the nodes of the cluster. In contrast, agent-based collaborative optimization (Siirola et al., 2004, 2007) matches traditionally non-decomposable nonlinear programming algorithms to high-latency clusters (e.g., NOWs or Grids) by replicating serial search algorithms, intact and unmodified, across the independent nodes of the cluster. The system then enforces collaboration through sharing intermediate "solutions" to the common problem. This creates a decomposable artificial meta-algorithm with a high computation-to-communication ratio that can scale efficiently on large, high-latency, low-bandwidth cluster environments. For modeling applications, efficiently parallelizing discrete event simulations has presented a longstanding challenge, with several decades of study and literature (Perumalla, 2006). The central challenge in parallelizing discrete event simulations on traditional distributed memory clusters is efficiently synchronizing the simulation time across the processing nodes during a simulation. A promising alternative approach leverages the Cray XMT (formerly called Eldorado; Feo et al., 2005). The XMT implements an MTSM architecture and provides a single shared memory space across all nodes, greatly simplifying the time synchronization challenge. Further, the fine-grained parallelism provided by the architecture opens new opportunities for additional parallelism beyond simple event parallelization, for example, parallelizing the event queue management. While these three examples are a small subset of current parallel algorithm design, they demonstrate the impact that parallel architectures have had, and will continue to have, on future developments for modeling and optimization in PSE. © 2009 Elsevier B.V. All rights reserved.
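The computation-to-communication trade-off running through all three examples can be captured in a one-line model (an illustrative sketch with made-up timings, not data from any of the cited systems): parallel efficiency decays once the per-node slice of decomposable work no longer dwarfs the fixed communication cost.

```python
def efficiency(t_comp, t_comm_per_node, n):
    """Fraction of ideal speedup retained when t_comp seconds of perfectly
    decomposable work is split across n nodes, each also paying a fixed
    communication/synchronization cost per sweep."""
    per_node = t_comp / n + t_comm_per_node
    return (t_comp / per_node) / n

# 1000 s of decomposable work; 0.5 s of communication per node per sweep.
for n in (4, 64, 1024):
    print(n, round(efficiency(1000.0, 0.5, n), 3))   # 0.998, 0.969, 0.661
```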
This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes, each based on a homogeneous multicore architecture with 16 cores per node. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad-socket/quad-core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented against the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicate that the multilevel preconditioner, which is critical for large-scale capability-class simulations, scales better on the Red Storm machine than on the TLCC machine.
We present an analysis of gas chromatographic columns where the stationary phase is not assumed to be a thin uniform coating along the walls of the cross section. We also give an asymptotic analysis assuming that the parameter $\beta = K D^{II} \rho^{II} / (D^{I} \rho^{I})$ is small. Here $K$ is the partition coefficient, and $D^{i}$ and $\rho^{i}$, $i = I, II$, are the diffusivity and density in the mobile ($i = I$) and stationary ($i = II$) regions.