This report details the accomplishments of the 'Building More Powerful Less Expensive Supercomputers Using Processing-In-Memory (PIM)' LDRD ('PIM LDRD', number 105809) for FY07-FY09. Latency dominates all levels of supercomputer design. Within a node, increasing memory latency, relative to processor cycle time, limits CPU performance. Between nodes, the same increase in relative latency impacts scalability. Processing-In-Memory (PIM) is an architecture that directly addresses this problem using enhanced chip fabrication technology and machine organization. PIMs combine high-speed logic and dense, low-latency, high-bandwidth DRAM, and lightweight threads that tolerate latency by performing useful work during memory transactions. This work examines the potential of PIM-based architectures to support mission critical Sandia applications and an emerging class of more data intensive informatics applications. This work has resulted in a stronger architecture/implementation collaboration between 1400 and 1700. Additionally, key technology components have impacted vendor roadmaps, and we are in the process of pursuing these new collaborations. This work has the potential to impact future supercomputer design and construction, reducing power and increasing performance. This final report is organized as follow: this summary chapter discusses the impact of the project (Section 1), provides an enumeration of publications and other public discussion of the work (Section 1), and concludes with a discussion of future work and impact from the project (Section 1). The appendix contains reprints of the refereed publications resulting from this work.
Functional brain imaging is of great interest for understanding correlations between specific cognitive processes and underlying neural activity. This understanding can provide the foundation for developing enhanced human-machine interfaces, decision aides, and enhanced cognition at the physiological level. The functional near infrared spectroscopy (fNIRS) based event-related optical signal (EROS) technique can provide direct, high-fidelity measures of temporal and spatial characteristics of neural networks underlying cognitive behavior. However, current EROS systems are hampered by poor signal-to-noise-ratio (SNR) and depth of measure, limiting areas of the brain and associated cognitive processes that can be investigated. We propose to investigate a flexible, tunable, multi-spectral fNIRS EROS system which will provide up to 10x greater SNR as well as improved spatial and temporal resolution through significant improvements in electronics, optoelectronics and optics, as well as contribute to the physiological foundation of higher-order cognitive processes and provide the technical foundation for miniaturized portable neuroimaging systems.
Large relational datasets such as national-scale social networks and power grids present different computational challenges than do physical simulations. Sandia's distributed-memory supercomputers are well suited for solving problems concerning the latter, but not the former. The reason is that problems such as pattern recognition and knowledge discovery on large networks are dominated by memory latency and not by computation. Furthermore, most memory requests in these applications are very small, and when the datasets are large, most requests miss the cache. The result is extremely low utilization. We are unlikely to be able to grow out of this problem with conventional architectures. As the power density of microprocessors has approached that of a nuclear reactor in the past two years, we have seen a leveling of Moores Law. Building larger and larger microprocessor-based supercomputers is not a solution for informatics and network infrastructure problems since the additional processors are utilized to only a tiny fraction of their capacity. An alternative solution is to use the paradigm of massive multithreading with a large shared memory. There is only one instance of this paradigm today: the Cray MTA-2. The proposal team has unique experience with and access to this machine. The XMT, which is now being delivered, is a Red Storm machine with up to 8192 multithreaded 'Threadstorm' processors and 128 TB of shared memory. For many years, the XMT will be the only way to address very large graph problems efficiently, and future generations of supercomputers will include multithreaded processors. Roughly 10 MTA processor can process a simple short paths problem in the time taken by the Gordon Bell Prize-nominated distributed memory code on 32,000 processors of Blue Gene/Light. We have developed algorithms and open-source software for the XMT, and have modified that software to run some of these algorithms on other multithreaded platforms such as the Sun Niagara and Opteron multi-core chips.
Palacios and Kitten are new open source tools that enable applications, whether ported or not, to achieve scalable high performance on large machines. They provide a thin layer over the hardware to support both full-featured virtualized environments and native code bases. Kitten is an OS under development at Sandia that implements a lightweight kernel architecture to provide predictable behavior and increased flexibility on large machines, while also providing Linux binary compatibility. Palacios is a VMM that is under development at Northwestern University and the University of New Mexico. Palacios, which can be embedded into Kitten and other OSes, supports existing, unmodified applications and operating systems by using virtualization that leverages hardware technologies. We describe the design and implementation of both Kitten and Palacios. Our benchmarks show that they provide near native, scalable performance. Palacios and Kitten provide an incremental path to using supercomputer resources that is not performance-compromised.
As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.
Recent papers have used Fiedler's definition of algebraic connectivity to show that network robustness, as measured by node-connectivity and edge-connectivity, can be increased by increasing the algebraic connectivity of the network. By the definition of algebraic connectivity, the second smallest eigenvalue of the graph Laplacian is a lower bound on the node-connectivity. In this paper we show that for circular random lattice graphs and mesh graphs algebraic connectivity is a conservative lower bound, and that increases in algebraic connectivity actually correspond to a decrease in node-connectivity. This means that the networks are actually less robust with respect to node-connectivity as the algebraic connectivity increases. However, an increase in algebraic connectivity seems to correlate well with a decrease in the characteristic path length of these networks - which would result in quicker communication through the network. Applications of these results are then discussed for perimeter security.
The Image Composition Engine for Tiles (IceT) is a high-performance sort-last parallel rendering library. In addition to providing accelerated rendering for a standard display, IceT provides the unique ability to generate images for tiled displays. The overall resolution of the display may be several times larger than any viewport that may be rendered by a single machine. This document is an overview of the user interface to IceT.
The purpose of this work was to help develop a research roadmap and small proof ofconcept for addressing key problems and gaps from the perspective of using text analysis methods as a primary tool for detecting when a group is undergoing a phase change. Self- rganizing map (SOM) techniques were used to analyze text data obtained from the tworld-wide web. Statistical studies indicate that it may be possible to predict phase changes, as well as detect whether or not an example of writing can be attributed to a group of interest.
This paper presents a parallel programming model, Parallel Phase Model (PPM), for next-generation high-end parallel machines based on a distributed memory architecture consisting of a networked cluster of nodes with a large number of cores on each node. PPM has a unified high-level programming abstraction that facilitates the design and implementation of parallel algorithms to exploit both the parallelism of the many cores and the parallelism at the cluster level. The programming abstraction will be suitable for expressing both fine-grained and coarse-grained parallelism. It includes a few high-level parallel programming language constructs that can be added as an extension to an existing (sequential or parallel) programming language such as C; and the implementation of PPM also includes a light-weight runtime library that runs on top of an existing network communication software layer (e.g. MPI). Design philosophy of PPM and details of the programming abstraction are also presented. Several unstructured applications that inherently require high-volume random fine-grained data accesses have been implemented in PPM with very promising results.
Running untrusted user-level code inside an operating system kernel has been studied in the 1990's but has not really caught on. We believe the time has come to resurrect kernel extensions for operating systems that run on highly-parallel clusters and supercomputers. The reason is that the usage model for these machines differs significantly from a desktop machine or a server. In addition, vendors are starting to add features, such as floating-point accelerators, multicore processors, and reconfigurable compute elements. An operating system for such machines must be adaptable to the requirements of specific applications and provide abstractions to access next-generation hardware features, without sacrificing performance or scalability.
Algorithmic properties of the midpoint predictor-corrector time integration algorithm are examined. In the case of a finite number of iterations, the errors in angular momentum conservation and incremental objectivity are controlled by the number of iterations performed. Exact angular momentum conservation and exact incremental objectivity are achieved in the limit of an infinite number of iterations. A complete stability and dispersion analysis of the linearized algorithm is detailed. The main observation is that stability depends critically on the number of iterations performed.
Over the past several years, verifying and validating complex codes at Sandia National Laboratories has become a major part of code development. These aspects tackle two important parts of simulation modeling: determining if the models have been correctly implemented - verification, and determining if the correct models have been selected - validation. In this talk, we will focus on verification and discuss the basics of code verification and its application to a few codes and problems at Sandia.
This document is a reference guide to the Xyce Parallel Electronic Simulator, and is a companion document to the Xyce Users Guide. The focus of this document is (to the extent possible) exhaustively list device parameters, solver options, parser options, and other usage details of Xyce. This document is not intended to be a tutorial. Users who are new to circuit simulation are better served by the Xyce Users Guide.
This manual describes the use of the Xyce Parallel Electronic Simulator. Xyce has been designed as a SPICE-compatible, high-performance analog circuit simulator, and has been written to support the simulation needs of the Sandia National Laboratories electrical designers. This development has focused on improving capability over the current state-of-the-art in the following areas: (1) Capability to solve extremely large circuit problems by supporting large-scale parallel computing platforms (up to thousands of processors). Note that this includes support for most popular parallel and serial computers. (2) Improved performance for all numerical kernels (e.g., time integrator, nonlinear and linear solvers) through state-of-the-art algorithms and novel techniques. (3) Device models which are specifically tailored to meet Sandia's needs, including some radiation-aware devices (for Sandia users only). (4) Object-oriented code design and implementation using modern coding practices that ensure that the Xyce Parallel Electronic Simulator will be maintainable and extensible far into the future. Xyce is a parallel code in the most general sense of the phrase - a message passing parallel implementation - which allows it to run efficiently on the widest possible number of computing platforms. These include serial, shared-memory and distributed-memory parallel as well as heterogeneous platforms. Careful attention has been paid to the specific nature of circuit-simulation problems to ensure that optimal parallel efficiency is achieved as the number of processors grows. The development of Xyce provides a platform for computational research and development aimed specifically at the needs of the Laboratory. With Xyce, Sandia has an 'in-house' capability with which both new electrical (e.g., device model development) and algorithmic (e.g., faster time-integration methods, parallel solver algorithms) research and development can be performed. As a result, Xyce is a unique electrical simulation capability, designed to meet the unique needs of the laboratory.
The object of the 'Enabling Immersive Simulation for Complex Systems Analysis and Training' LDRD has been to research, design, and engineer a capability to develop simulations which (1) provide a rich, immersive interface for participation by real humans (exploiting existing high-performance game-engine technology wherever possible), and (2) can leverage Sandia's substantial investment in high-fidelity physical and cognitive models implemented in the Umbra simulation framework. We report here on these efforts. First, we describe the integration of Sandia's Umbra modular simulation framework with the open-source Delta3D game engine. Next, we report on Umbra's integration with Sandia's Cognitive Foundry, specifically to provide for learning behaviors for 'virtual teammates' directly from observed human behavior. Finally, we describe the integration of Delta3D with the ABL behavior engine, and report on research into establishing the theoretical framework that will be required to make use of tools like ABL to scale up to increasingly rich and realistic virtual characters.
The motivating vision behind Sandia's MENTOR/PAL LDRD project has been that of systems which use real-time psychophysiological data to support and enhance human performance, both individually and of groups. Relevant and significant psychophysiological data being a necessary prerequisite to such systems, this LDRD has focused on identifying and refining such signals. The project has focused in particular on EEG (electroencephalogram) data as a promising candidate signal because it (potentially) provides a broad window on brain activity with relatively low cost and logistical constraints. We report here on two analyses performed on EEG data collected in this project using the SOBI (Second Order Blind Identification) algorithm to identify two independent sources of brain activity: one in the frontal lobe and one in the occipital. The first study looks at directional influences between the two components, while the second study looks at inferring gender based upon the frontal component.
Tensor decompositions are higher-order analogues of matrix decompositions and have proven to be powerful tools for data analysis. In particular, we are interested in the canonical tensor decomposition, otherwise known as the CANDECOMP/PARAFAC decomposition (CPD), which expresses a tensor as the sum of component rank-one tensors and is used in a multitude of applications such as chemometrics, signal processing, neuroscience, and web analysis. The task of computing the CPD, however, can be difficult. The typical approach is based on alternating least squares (ALS) optimization, which can be remarkably fast but is not very accurate. Previously, nonlinear least squares (NLS) methods have also been recommended; existing NLS methods are accurate but slow. In this paper, we propose the use of gradient-based optimization methods. We discuss the mathematical calculation of the derivatives and further show that they can be computed efficiently, at the same cost as one iteration of ALS. Computational experiments demonstrate that the gradient-based optimization methods are much more accurate than ALS and orders of magnitude faster than NLS.
This preliminary study considers the scaling and performance of a finite element (FE) semiconductor device simulator on a capacity cluster with 272 compute nodes based on a homogeneous multicore node architecture utilizing 16 cores. The inter-node communication backbone for this Tri-Lab Linux Capacity Cluster (TLCC) machine is comprised of an InfiniBand interconnect. The nonuniform memory access (NUMA) nodes consist of 2.2 GHz quad socket/quad core AMD Opteron processors. The performance results for this study are obtained with a FE semiconductor device simulation code (Charon) that is based on a fully-coupled Newton-Krylov solver with domain decomposition and multilevel preconditioners. Scaling and multicore performance results are presented for large-scale problems of 100+ million unknowns on up to 4096 cores. A parallel scaling comparison is also presented with the Cray XT3/4 Red Storm capability platform. The results indicate that an MPI-only programming model for utilizing the multicore nodes is reasonably efficient on all 16 cores per compute node. However, the results also indicated that the multilevel preconditioner, which is critical for large-scale capability type simulations, scales better on the Red Storm machine than the TLCC machine.
We present an analysis of gas chromatographic columns where the stationary phase is not assumed to be a thin uniform coating along the walls of the cross section. We also give an asymptotic analysis assuming that the parameter {beta} = KD{sup II}{rho}{sup II}/D{sup I}{rho}{sup I} is small. Here K is the partition coefficient, and D{sup i} and {rho}{sup i}, i = I, II are the diffusivity and density in the mobile (i = I) and stationary (i = II) regions.