Publications

Results 51–100 of 206

Search results

Jump to search filters

Software Requirements for ATDM On-Node Resource Management

Olivier, Stephen L.; Foulk, James W.; Brightwell, Ronald B.

This report outlines the software requirements for on-node resource management in the Advanced Simulation and Computing (ASC) Advanced Technology Development and Mitigation (ATDM) project at Sandia National Laboratories (SNL). The need for on-node resource management has arisen from the componentization of the software stack. Componentization aids in managing complexity and making software more composable and reusable. However, components must compete for limited on-node resources for execution (e.g., cores and hardware threads) and memory. The requirements documented in this report support an effort to manage this contention, avoiding oversubscription of resources and enabling their efficient deployment for application execution.

More Details

Metrics for evaluating energy saving techniques for resilient HPC systems

Proceedings - IEEE 28th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2014

Grant, Ryan; Olivier, Stephen L.; Laros, James H.; Brightwell, Ronald B.; Porterfield, Allan K.

The metrics used for evaluating energy saving techniques for future HPC systems are critical to the correct assessment of proposed methods. Current predictions forecast that overcoming reduced system reliability, increased power requirements and energy consumption will be a major design challenge for future systems. Modern runtime energy-saving research efforts do not take into account the energy spent providing reliability. They also do not account for the increase in the probability of failure during application execution due to runtime overhead from energy saving methods. While this is very reasonable for current systems, it is insufficient for future generation systems. By taking into account the energy consumption ramifications of increased runtimes on system reliability, better energy saving techniques can be developed. This paper demonstrates how to determine the impact of runtime energy conservation methods within the context of failure-prone large scale systems. In addition, a survey of several energy savings methodologies is conducted and an analysis is performed with respect to their effectiveness in an environment in which failures occur.

More Details

An evaluation of MPI message rate on hybrid-core processors

International Journal of High Performance Computing Applications

Brightwell, Ronald B.; Barrett, Brian W.; Grant, Ryan; Hammond, Simon; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

More Details

The Portals 4.0.2 Networking Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle B.; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

More Details

Asking the right questions: Benchmarking fault-tolerant extreme-scale systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ferreira, Kurt; Levy, Scott; Bridges, Patrick G.; Arnold, Dorian; Brightwell, Ronald B.

Much recent research has explored fault-tolerance mechanisms intended for current and future extreme-scale systems. Evaluations of the suitability of checkpoint-based solutions have typically been carried out using relatively uncomplicated computational kernels designed to measure floating point performance. More recent investigations have added scaled-down "proxy" applications to more closely match the composition and behavior of deployed ones. However, the information obtained from these studies (whether floating point performance or application runtime) is not necessarily of the most value in evaluating resilience strategies. We observe that even when using a more sophisticated metric, the information available from evaluating uncoordinated checkpointing using both microbenchmarks and proxy applications does not agree. This implies that not only might researchers be asking the wrong questions, but that the answers to the right ones might be unexpected and potentially misleading. We seek to open a discussion on whether benchmarks designed to provide predictable performance evaluations of HPC hardware and toolchains are providing the right feedback for the evaluation of fault-tolerance in these applications, and more generally on how benchmarking of resilience mechanisms ought to be approached in the exascale design space. © 2014 Springer-Verlag Berlin Heidelberg.

More Details

The portals 4.0.1 network programming interface

Barrett, Brian; Brightwell, Ronald B.; Pedretti, Kevin; Hemmert, Karl S.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandias Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities. 3

More Details
Results 51–100 of 206
Results 51–100 of 206