Analyzing multicore characteristics for a suite of applications on an XT5 system
International Journal of Distributed Systems and Technologies
In a recent acquisition by DOE/NNSA, several large-capacity computing clusters, collectively called TLCC, have been installed at the DOE laboratories SNL, LANL, and LLNL. The TLCC architecture, with ccNUMA, multi-socket, multi-core nodes and an InfiniBand interconnect, is representative of the trend in HPC architectures. This paper examines application performance on TLCC, contrasting it with Red Storm/Cray XT4. TLCC and Red Storm use similar AMD processors and memory DIMMs; Red Storm, however, has single-socket nodes and a custom interconnect. Micro-benchmarks and performance analysis tools help explain the causes of the observed performance differences. Controlling processor and memory affinity on TLCC with the numactl utility is shown to yield significant performance gains and is essential to attenuate the detrimental impact of OS interference and cache-coherency overhead. While previous studies have investigated the impact of affinity control mostly in the context of small SMP systems, the focus of this paper is on highly parallel MPI applications.
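As a rough illustration of the kind of affinity control the abstract refers to, the sketch below pins a process to one socket's cores from Python. It is not taken from the paper: the per-socket core count, the dual-socket assumption, and the MPI_LOCALRANKID environment variable (launcher-specific) are all assumptions, and memory binding, which numactl also provides, is not shown.

```python
# Hypothetical sketch: pin each local MPI rank to one socket's cores,
# approximating what launching under `numactl --cpunodebind/--membind` does.
# CORES_PER_SOCKET, the dual-socket node, and the local-rank environment
# variable are assumptions, not values from the paper.
import os

CORES_PER_SOCKET = 4                                        # assumed
local_rank = int(os.environ.get("MPI_LOCALRANKID", "0"))    # launcher-specific

socket_id = local_rank % 2                                  # assume 2 sockets
first = socket_id * CORES_PER_SOCKET
cpus = set(range(first, first + CORES_PER_SOCKET))

# Restrict this process to the chosen socket's cores (Linux only).
os.sched_setaffinity(0, cpus)
print(f"rank {local_rank} bound to CPUs {sorted(cpus)}")
```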
This report describes efforts by the Performance Modeling and Analysis Team to investigate performance characteristics of Sandia's engineering and scientific applications on the ASC capability and advanced architecture supercomputers, and on Sandia's capacity Linux clusters. Efforts to model various aspects of these computers are also discussed. The goals of these efforts are to quantify and compare Sandia's supercomputer and cluster performance characteristics; to reveal strengths and weaknesses in such systems; and to predict performance characteristics of, and provide guidelines for, future acquisitions and follow-on systems. Described herein are the results obtained from running benchmarks and applications to extract performance characteristics and comparisons, as well as the modeling efforts undertaken, during the period 2004-2006. The format of the report, with hypertext links to numerous additional documents, purposefully minimizes the document size needed to disseminate the extensive results from our research.
Proceedings - IEEE International Conference on Cluster Computing, ICCC
The RandomAccess benchmark as defined by the High Performance Computing Challenge (HPCC) tests the speed at which a machine can update the elements of a table spread across global system memory, as measured in billions (giga) of updates per second (GUPS). The parallel implementation provided by HPCC typically performs poorly on distributed-memory machines, due to updates requiring numerous small point-to-point messages between processors. We present an alternative algorithm which treats the collection of P processors as a hypercube, aggregating data so that larger messages are sent, and routing individual datums through dimensions of the hypercube to their destination processor. The algorithm's computation (the GUP count) scales linearly with P while its communication overhead scales as log2(P), thus enabling better performance on large numbers of processors. The new algorithm achieves a GUPS rate of 19.98 on 8192 processors of Sandia's Red Storm machine, compared to 1.02 for the HPCC-provided algorithm on 10350 processors. We also illustrate how GUPS performance varies with the benchmark's specification of its "look-ahead" parameter. As expected, parallel performance degrades for small look-ahead values, and improves dramatically for large values. © 2006 IEEE.
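To illustrate the routing idea described above, the following single-process Python sketch simulates dimension-by-dimension forwarding of updates through a hypercube of P ranks. It is only a schematic of the aggregation pattern, not the HPCC or Red Storm implementation; the actual table update and look-ahead handling are omitted.

```python
# Simulate hypercube routing: in each of the log2(P) dimensions, every rank
# bundles the updates whose destination differs from its own rank in that bit
# and ships the bundle to its partner rank, so message counts grow as log2(P).
import random

P = 8                                    # simulated "processors"; power of two
random.seed(0)
# Each rank starts with a few (destination, update) pairs.
held = {r: [(random.randrange(P), random.random()) for _ in range(4)]
        for r in range(P)}

num_dims = P.bit_length() - 1            # log2(P) hypercube dimensions
for d in range(num_dims):
    bit = 1 << d
    moving = {r: [] for r in range(P)}   # one aggregated bundle per rank
    for r in range(P):
        staying = []
        for dest, val in held[r]:
            if (dest ^ r) & bit:         # destination differs in this bit:
                moving[r].append((dest, val))   # route toward partner r ^ bit
            else:
                staying.append((dest, val))
        held[r] = staying
    for r in range(P):                   # deliver each bundle to the partner
        held[r ^ bit].extend(moving[r])

# After log2(P) exchanges every update sits on its destination rank.
assert all(dest == r for r, items in held.items() for dest, _ in items)
```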
A new capability for modeling thin-shell structures within the coupled Euler-Lagrange code, Zapotec, is under development. The new algorithm creates an artificial material interface for the Eulerian portion of the problem by expanding a Lagrangian shell element such that it has an effective thickness that spans one or more Eulerian cells. The algorithm implementation is discussed along with several examples involving blast loading on plates.
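A heavily simplified sketch of the thickness-expansion idea follows, under the assumption that the effective thickness is simply clamped to span a chosen number of Eulerian cells; the rule and the numbers are illustrative only, not Zapotec's actual algorithm.

```python
# Illustrative rule: if a shell is thinner than the Eulerian cells it crosses,
# expand its effective thickness so the Eulerian solver sees a material
# interface. The one-cell minimum is an assumption for this sketch.
def effective_thickness(physical_thickness: float,
                        eulerian_cell_size: float,
                        min_cells_spanned: int = 1) -> float:
    """Expand a shell's thickness to cover at least `min_cells_spanned`
    Eulerian cells; leave it unchanged if it is already thick enough."""
    return max(physical_thickness, min_cells_spanned * eulerian_cell_size)


# A 2 mm plate on a 10 mm Eulerian mesh would be expanded to 10 mm.
print(effective_thickness(physical_thickness=0.002, eulerian_cell_size=0.01))
```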
Proceedings of the International Conference on Supercomputing
The design of general-purpose dynamic load-balancing tools for parallel applications is more challenging than the design of static partitioning tools. Both algorithmic and software engineering issues arise. We have addressed many of these issues in the design of the Zoltan dynamic load-balancing library. Zoltan has an object-oriented interface that makes it easy to use and provides separation between the application and the load-balancing algorithms. It contains a suite of dynamic load-balancing algorithms, including both geometric and graph-based algorithms. Its design makes it valuable both as a partitioning tool for a variety of applications and as a research test-bed for new algorithmic development. In this paper, we describe Zoltan's design and demonstrate its use in an unstructured-mesh finite element application.
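The Python sketch below is not Zoltan's actual API; it only illustrates, under assumed names (set_fn, "obj_list", "geom"), the callback-style separation the abstract describes, in which the load balancer pulls object IDs and coordinates from application-supplied query functions rather than depending on the application's data structures.

```python
# Hypothetical callback-driven partitioner (names are assumptions, not Zoltan's
# API): the application registers query functions, and the balancer uses them
# to fetch objects and geometry, keeping the two sides decoupled.
from typing import Callable, Dict, List, Tuple


class LoadBalancer:
    """Toy partitioner that sees mesh data only through registered callbacks."""

    def __init__(self) -> None:
        self.queries: Dict[str, Callable] = {}

    def set_fn(self, name: str, fn: Callable) -> None:
        # Application-supplied query function, e.g. "obj_list" or "geom".
        self.queries[name] = fn

    def partition(self, num_parts: int) -> Dict[int, int]:
        # Toy geometric method: sort objects by x-coordinate, slice evenly.
        ids: List[int] = self.queries["obj_list"]()
        coords: Dict[int, Tuple[float, float]] = self.queries["geom"](ids)
        order = sorted(ids, key=lambda i: coords[i][0])
        chunk = max(1, len(order) // num_parts)
        return {obj: min(k // chunk, num_parts - 1)
                for k, obj in enumerate(order)}


# Application side: a few mesh nodes and their coordinates.
mesh = {0: (0.1, 0.0), 1: (0.9, 0.2), 2: (0.5, 0.7), 3: (0.3, 0.4)}
lb = LoadBalancer()
lb.set_fn("obj_list", lambda: list(mesh))
lb.set_fn("geom", lambda ids: {i: mesh[i] for i in ids})
print(lb.partition(num_parts=2))   # e.g. {0: 0, 3: 0, 2: 1, 1: 1}
```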
Current supercomputers use large parallel arrays of tightly coupled processors to achieve levels of performance far surpassing conventional vector supercomputers. Shock-wave physics codes have been developed for these new supercomputers at Sandia National Laboratories and elsewhere. These parallel codes run fast enough on many simulations that they can be used to study the effects of varying design parameters on the performance of models of conventional munitions and other complex systems. Such studies may be directed by optimization software to improve the performance of the modeled system. Using a shaped-charge jet design as an archetypal test case and the CTH parallel shock-wave physics code controlled by the Dakota optimization software, we explored the use of automatic optimization tools to improve designs for conventional munitions. We used a scheme in which a lower-resolution computational mesh was used to identify candidate optimal solutions, which were then verified using a higher-resolution mesh. We identified three optimal solutions for the model and a region of the design domain where the jet tip speed is nearly optimal, indicating the possibility of a robust design. Based on this study, we identified some of the difficulties in using high-fidelity models with optimization software to develop improved designs; these include developing robust algorithms for the objective function and constraints and mitigating the effects of numerical noise in them. We conclude that optimization software running high-fidelity models of physical systems with parallel shock-wave physics codes to find improved designs can be a valuable tool for designers. While the current state of algorithm and software development does not permit routine, "black box" optimization of designs, the effort involved in using the existing tools may well be worth the improvement achieved in designs.
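The two-level scheme described above can be sketched generically as screen-then-verify. The Python below uses a placeholder objective (jet_tip_speed here is a cheap stand-in with resolution-dependent noise, not the CTH shaped-charge model) purely to show the structure of screening many designs at low resolution and re-checking only the best few at high resolution.

```python
# Schematic two-fidelity search: coarse-mesh screening of many candidate
# designs, then fine-mesh verification of a short list. The objective and all
# numbers are placeholders, not results from the paper.
import random

def jet_tip_speed(design: float, resolution: int) -> float:
    """Stand-in objective; higher resolution costs more but adds less noise."""
    noise = random.gauss(0.0, 0.5 / resolution)
    return -(design - 2.0) ** 2 + 10.0 + noise   # peak near design = 2.0

random.seed(1)
candidates = [random.uniform(0.0, 4.0) for _ in range(50)]

# Stage 1: coarse-mesh screening of all candidates.
shortlist = sorted(candidates,
                   key=lambda d: jet_tip_speed(d, resolution=1),
                   reverse=True)[:3]

# Stage 2: fine-mesh verification of the shortlisted designs only.
verified = [(d, jet_tip_speed(d, resolution=10)) for d in shortlist]
best = max(verified, key=lambda pair: pair[1])
print(f"best design {best[0]:.2f} with verified tip speed {best[1]:.2f}")
```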