Understanding the Effects of Communication on Uncoordinated Checkpointing at Scale
Abstract not provided.
Physical Review Letters
Abstract not provided.
SIAM Journal on Control and Optimization
Abstract not provided.
Operations Research Letters
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
The rapidly improving compute capability of contemporary processors and accelerators is providing the opportunity for significant increases in the accuracy and fidelity of scientific calculations. In this paper we present performance studies of a new molecular dynamics (MD) potential called SNAP. The SNAP potential has shown great promise in accurately reproducing physics and chemistry not described by simpler potentials. We have developed new algorithms to exploit the high single-node concurrency provided by three different classes of machine: the Titan GPU-based system operated by Oak Ridge National Laboratory, the combined Sequoia and Vulcan BlueGene/Q machines located at Lawrence Livermore National Laboratory, and the large-scale Intel Sandy Bridge system, Chama, located at Sandia. Our analysis focuses on strong scaling experiments with approximately 246,000 atoms over the range 1-122,880 nodes on Sequoia/Vulcan and 40-18,630 nodes on Titan. We compare these machines in terms of both simulation rate and power efficiency. We find that node performance correlates with power consumption across the range of machines, except in the case of extreme strong scaling, where more powerful compute nodes show greater efficiency. This study is a unique assessment of a challenging, scientifically relevant calculation running on several of the world's leading contemporary production supercomputing platforms.
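To put the strong-scaling limit in perspective, the numbers quoted above imply only a handful of atoms per node at the largest scales (a back-of-the-envelope figure derived from the abstract, not taken from the paper):

```latex
\frac{246{,}000 \text{ atoms}}{122{,}880 \text{ nodes}} \approx 2 \text{ atoms/node},
\qquad
\frac{246{,}000 \text{ atoms}}{18{,}630 \text{ nodes}} \approx 13 \text{ atoms/node}.
```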
Proceedings of the Human Factors and Ergonomics Society
Within large organizations, the defense of cyber assets generally involves the use of various mechanisms, such as intrusion detection systems, to alert cyber security personnel to suspicious network activity. Resulting alerts are reviewed by the organization's cyber security personnel to investigate and assess the threat and initiate appropriate actions to defend the organization's network assets. While automated software routines are essential to cope with the massive volumes of data transmitted across data networks, the ultimate success of an organization's efforts to resist adversarial attacks upon their cyber assets relies on the effectiveness of individuals and teams. This paper reports research to understand the factors that impact the effectiveness of Cyber Security Incident Response Teams (CSIRTs). Specifically, a simulation is described that captures the workflow within a CSIRT. The simulation is then demonstrated in a study comparing the differential response time to threats that vary with respect to key characteristics (attack trajectory, targeted asset and perpetrator). It is shown that the results of the simulation correlate with data from the actual incident response times of a professional CSIRT.
Proceedings of ExaMPI 2014: Exascale MPI 2014 - held in conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Storage and Analysis
Advances in node-level architecture and interconnect technology needed to reach extreme scale necessitate a reevaluation of long-standing models of computation, in particular bulk synchronous processing. The end of Dennard scaling and the subsequent increases in CPU core counts with each successive generation of general-purpose processors have made the ability to leverage parallelism for communication an increasingly critical aspect of future extreme-scale application performance. But the use of massive multithreading in combination with MPI is an open research area, with many proposed approaches requiring code changes that can be infeasible for important large legacy applications already written in MPI. This paper covers the design and initial evaluation of an extension of a massive-multithreading runtime system supporting dynamic parallelism that interfaces with MPI to handle fine-grain parallel communication and communication-computation overlap. Our initial evaluation of the approach uses the ubiquitous stencil computation, in three dimensions, with the halo exchange as the driving example, as it has a demonstrated tie to real code bases. The preliminary results suggest that even for a very well-studied and balanced workload and message exchange pattern, co-scheduling work and communication tasks is effective at significant levels of decomposition, using up to 131,072 cores. Furthermore, we demonstrate useful communication-computation overlap when handling blocking send and receive calls, and show evidence suggesting that we can decrease the burstiness of network traffic, with a corresponding decrease in the rate of stalls (congestion) seen on the host link and network.
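The core idea of overlapping halo exchange with interior computation can be sketched with nonblocking MPI calls. The sketch below is a minimal illustration using mpi4py and a 1D decomposition, not the paper's runtime-integrated, task-based 3D approach; the buffer layout and the interior/boundary split are assumptions for the example.

```python
# Hedged sketch: overlap halo exchange with interior work (1D decomposition).
# Assumes mpi4py and numpy; run with e.g. `mpiexec -n 4 python halo.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

n = 1024                       # local interior points
u = np.random.rand(n + 2)      # one ghost cell on each side
u_new = np.empty_like(u)

# Post nonblocking receives/sends for the ghost cells first ...
reqs = [
    comm.Irecv(u[0:1],   source=left,  tag=0),
    comm.Irecv(u[-1:],   source=right, tag=1),
    comm.Isend(u[1:2],   dest=left,    tag=1),
    comm.Isend(u[-2:-1], dest=right,   tag=0),
]

# ... then update the interior, which needs no ghost data, while messages fly.
u_new[2:-2] = 0.5 * (u[1:-3] + u[3:-1])

# Complete communication, then update the two boundary points.
MPI.Request.Waitall(reqs)
u_new[1] = 0.5 * (u[0] + u[2])
u_new[-2] = 0.5 * (u[-3] + u[-1])
```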
SIAM Journal on Numerical Analysis
In this paper we introduce an approach that augments least-squares finite element formulations with user-specified quantities-of-interest. The method incorporates the quantity-of-interest into the least-squares functional and inherits the global approximation properties of the standard formulation as well as increased resolution of the quantity-of-interest. We establish theoretical properties such as optimality and enhanced convergence under a set of general assumptions. Central to the approach is that it offers an element-level estimate of the error in the quantity-of-interest. As a result, we introduce an adaptive approach that yields efficient, adaptively refined approximations. Several numerical experiments for a range of situations are presented to support the theory and highlight the effectiveness of our methodology. Notably, the results show that the new approach is effective at improving the accuracy per total computational cost.
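Schematically, such an augmented functional can be written as follows (a sketch of the general idea only; the weight $\omega$, operator $\mathcal{L}$, and quantity-of-interest functional $Q$ with target value $q$ are placeholder notation, not the paper's definitions):

```latex
J_{\omega}(u) \;=\; \|\mathcal{L}u - f\|^{2} \;+\; \omega\,\bigl|Q(u) - q\bigr|^{2},
```

so that minimizing $J_{\omega}$ retains the global least-squares fit while penalizing error in the quantity-of-interest.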
SIAM Journal on Numerical Analysis
We present a new optimization-based method for atomistic-to-continuum (AtC) coupling. The main idea is to cast the latter as a constrained optimization problem with virtual Dirichlet controls on the interfaces between the atomistic and continuum subdomains. The optimization objective is to minimize the error between the atomistic and continuum solutions on the overlap between the two subdomains, while the atomistic and continuum force balance equations provide the constraints. Separation, rather than blending, of the atomistic and continuum problems, and their subsequent use as constraints in the optimization problem, distinguishes our approach from the existing AtC formulations. We present and analyze the method in the context of a one-dimensional chain of atoms modeled using a linearized two-body potential with next-nearest neighbor interactions.
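In schematic form (illustrative notation, not the paper's: $u^{a}$ and $u^{c}$ are the atomistic and continuum states, $\theta$ the virtual Dirichlet controls on the coupling interfaces, and $\Omega_{o}$ the overlap region):

```latex
\min_{\theta}\ \tfrac{1}{2}\,\bigl\|u^{a}(\theta) - u^{c}(\theta)\bigr\|^{2}_{\Omega_{o}}
\quad\text{subject to}\quad
F^{a}\!\bigl(u^{a};\theta\bigr) = 0, \qquad
F^{c}\!\bigl(u^{c};\theta\bigr) = 0,
```

where $F^{a}$ and $F^{c}$ denote the atomistic and continuum force-balance equations with Dirichlet data $\theta$.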
ECS Transactions
Resistive random access memory (ReRAM), or memristors, may be capable of significantly improving the efficiency of neuromorphic computing when used as a central component of an analog hardware accelerator. However, the significant electrical variation within a device and between devices degrades the maximum efficiency and accuracy that can be achieved by a ReRAM-based neuromorphic accelerator. In this report, the electrical variability is characterized, with a particular focus on that which is due to fundamental, intrinsic factors. Analytical and ab initio models are presented which offer some insight into the factors responsible for this variability.
Proceedings - 2014 IEEE International Conference on Big Data, IEEE Big Data 2014
We present PuLP, a parallel and memory-efficient graph partitioning method specifically designed to partition low-diameter networks with skewed degree distributions. Graph partitioning is an important Big Data problem because it impacts the execution time and energy efficiency of graph analytics on distributed-memory platforms. Partitioning determines the in-memory layout of a graph, which affects locality, intertask load balance, communication time, and overall memory utilization of graph analytics. A novel feature of our method PuLP (Partitioning using Label Propagation) is that it optimizes for multiple objective metrics simultaneously, while satisfying multiple partitioning constraints. Using our method, we are able to partition a web crawl with billions of edges on a single compute server in under a minute. For a collection of test graphs, we show that PuLP uses 8-39× less memory than state-of-the-art partitioners and is up to 14.5× faster, on average, than alternate approaches (with 16-way parallelism). We also achieve better partitioning quality results for the multi-objective scenario.
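The label-propagation kernel at the heart of this family of methods is simple to sketch: each vertex iteratively adopts the part label most common among its neighbors, subject to a balance constraint. The sketch below is a generic, serial illustration under assumed data structures, not PuLP's parallel multi-objective, multi-constraint algorithm.

```python
# Hedged sketch: balance-constrained label propagation partitioning.
# `adj` maps each vertex to its neighbor list; serial and single-objective,
# unlike PuLP's parallel multi-objective/multi-constraint scheme.
from collections import Counter
import random

def label_prop_partition(adj, nparts, max_size, iters=10, seed=0):
    rng = random.Random(seed)
    verts = list(adj)
    part = {v: rng.randrange(nparts) for v in verts}   # random initial parts
    size = Counter(part.values())
    for _ in range(iters):
        rng.shuffle(verts)                             # randomized sweep order
        moved = 0
        for v in verts:
            counts = Counter(part[u] for u in adj[v])  # neighbor label counts
            for p, _ in counts.most_common():
                if p == part[v]:
                    break                              # already in best part
                if size[p] < max_size:                 # balance constraint
                    size[part[v]] -= 1
                    size[p] += 1
                    part[v] = p
                    moved += 1
                    break
        if moved == 0:
            break
    return part

# Tiny usage example: two triangles joined by one edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(label_prop_partition(adj, nparts=2, max_size=4))
```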
13th International Conference on Space Operations, SpaceOps 2014
Spacecraft state-of-health (SOH) analysis typically consists of limit-checking to compare incoming measurand values against their predetermined limits. While useful, this approach requires significant engineering insight along with the ability to evolve limit values over time as components degrade and their operating environment changes. In addition, it fails to take into account the effects of measurand combinations, as multiple values together could signify an imminent problem. A more powerful approach is to apply data mining techniques to uncover hidden trends and patterns as well as interactions among groups of measurands. In an internal research and development effort, software engineers at Sandia National Laboratories explored ways to mine SOH data from a remote sensing spacecraft. Because our spacecraft uses variable sample rates and packetized telemetry to transmit values for 30,000 measurands across 700 unique packet IDs, our data is characterized by a wide disparity of time and value pairs. We discuss how we summarized and aligned this data to be efficiently applied to data mining algorithms. After the data preprocessing step, we apply supervised learning, including decision trees and principal component analysis, and unsupervised learning, including k-means clustering, orthogonal partitioning clustering, and one-class support vector machines, to four different spacecraft SOH scenarios. Our experimental results show that data mining is a low-cost, high-payoff approach to SOH analysis and provides an excellent way to exploit vast quantities of time-series data among groups of measurands in different scenarios. Our scenarios show that the supervised cases were particularly useful in identifying key contributors to anomalous events, and the unsupervised cases were well-suited for automated analysis of the system as a whole. The developed underlying models can be updated over time to accurately represent a changing operating environment and ultimately to extend the mission lifetime of our valuable space assets.
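As an illustration of the unsupervised side of such a pipeline (a generic sketch using scikit-learn on synthetic stand-in data, not the authors' tooling or telemetry), aligned measurand features can be clustered and screened for anomalies as follows:

```python
# Hedged sketch: clustering and one-class-SVM screening of telemetry features.
# Synthetic stand-in data; a real SOH pipeline must first align variable-rate,
# packetized measurands onto a common time grid (the preprocessing step above).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))          # 500 time windows x 8 measurand features
X[-5:] += 6.0                          # inject a few gross anomalies

Xs = StandardScaler().fit_transform(X)

# Group nominal operating modes.
modes = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(Xs)

# Flag windows that look unlike the bulk of the data (-1 = anomaly).
flags = OneClassSVM(nu=0.02, gamma="scale").fit_predict(Xs)
print("modes:", np.unique(modes), "| anomalous windows:", int((flags == -1).sum()))
```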
A method is investigated to reduce the number of numerical parameters in a material model for a solid. The basis of the method is to detect interdependencies between parameters within a class of materials of interest. The method is demonstrated for a set of material property data for iron and steel using the Johnson-Cook plasticity model.
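For reference, the Johnson-Cook flow stress takes the standard form

```latex
\sigma \;=\; \bigl(A + B\,\varepsilon_{p}^{\,n}\bigr)
\bigl(1 + C \ln \dot{\varepsilon}^{*}\bigr)
\bigl(1 - T^{*m}\bigr),
```

where $\varepsilon_{p}$ is the equivalent plastic strain, $\dot{\varepsilon}^{*}$ the dimensionless plastic strain rate, and $T^{*}$ the homologous temperature; the fitted constants $A$, $B$, $n$, $C$, and $m$ are the parameters whose interdependencies such a reduction method targets.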
A simple demonstration of nonlocality in a heterogeneous material is presented. By analysis of the microscale deformation of a two-component layered medium, it is shown that nonlocal interactions necessarily appear in a homogenized model of the system. Explicit expressions for the nonlocal forces are determined. The way these nonlocal forces appear in various nonlocal elasticity theories is derived. The length scales that emerge involve the constituent material properties as well as their geometrical dimensions. A peridynamic material model for the smoothed displacement field is derived. It is demonstrated by comparison with experimental data that the incorporation of nonlocality in modeling dramatically improves the prediction of the stress concentration in an open hole tension test on a composite plate.
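The peridynamic setting referenced here replaces the local divergence of stress with an integral of pairwise bond forces over a neighborhood. Shown for orientation is the standard form of this model class, not the paper's specific smoothed-displacement model:

```latex
\rho(\mathbf{x})\,\ddot{\mathbf{u}}(\mathbf{x},t)
\;=\; \int_{\mathcal{H}_{\delta}(\mathbf{x})}
\mathbf{f}\bigl(\mathbf{u}(\mathbf{x}',t)-\mathbf{u}(\mathbf{x},t),\;\mathbf{x}'-\mathbf{x}\bigr)\, dV_{\mathbf{x}'}
\;+\; \mathbf{b}(\mathbf{x},t),
```

where $\mathcal{H}_{\delta}(\mathbf{x})$ is the neighborhood (horizon) of radius $\delta$ and $\mathbf{f}$ the pairwise force function; the horizon $\delta$ is one source of the length scales discussed above.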
Parallel Processing Letters
For over two decades the dominant means of enabling portable performance of computational science and engineering applications on parallel processing architectures has been the bulk-synchronous parallel (BSP) programming model. Code developers, motivated by performance considerations to minimize the number of messages transmitted, have typically pursued a strategy of aggregating message data into fewer, larger messages. Emerging and future high-performance architectures, especially those seen as targeting Exascale capabilities, provide motivation and capabilities for revisiting this approach. In this paper we explore alternative configurations within the context of a large-scale complex multi-physics application and a proxy that represents its behavior, presenting results that demonstrate some important advantages as the number of processors increases in scale.
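The aggregation strategy the paper revisits can be illustrated in a few lines (a hedged mpi4py sketch of packing several logical fragments to the same peer into one message rather than sending each separately; buffer names and sizes are illustrative):

```python
# Hedged sketch: one aggregated message vs. many small ones to the same peer.
# Assumes mpi4py/numpy; `chunks` stands in for per-task message fragments.
# Run with exactly 2 ranks, e.g. `mpiexec -n 2 python aggregate.py`.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
peer = 1 - rank

chunks = [np.full(64, rank, dtype=np.float64) for _ in range(16)]

if rank == 0:
    # BSP-style aggregation: one large send instead of 16 small ones.
    comm.Send(np.concatenate(chunks), dest=peer, tag=0)
elif rank == 1:
    buf = np.empty(64 * 16, dtype=np.float64)
    comm.Recv(buf, source=peer, tag=0)
    parts = np.split(buf, 16)        # unpack back into logical fragments
```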
We are on the threshold of a transformative change in the basic architecture of high-performance computing. The use of accelerator processors, characterized by large core counts, shared but asymmetrical memory, and heavy thread loading, is quickly becoming the norm in high-performance computing. These accelerators represent significant challenges in updating our existing base of software. An intrinsic problem with this transition is a fundamental programming shift from message-passing processes to much finer-grained thread scheduling with memory sharing. Another problem is the lack of stability in accelerator implementation; processor and compiler technology is currently changing rapidly. This report documents the results of our three-year ASCR project to address these challenges. Our project includes the development of the Dax toolkit, which contains the beginnings of new algorithms for a new generation of computers and the underlying infrastructure to rapidly prototype and build further algorithms as necessary.