Publications

Results 101–125 of 206

Search results


The impact of hybrid-core processors on MPI message rate

ACM International Conference Proceeding Series

Barrett, Brian B.; Brightwell, Ronald B.; Hammond, Simon D.; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.
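As a rough illustration of the match processing this abstract refers to, the sketch below models the two queues an MPI implementation commonly keeps: posted receives and unexpected messages. It is a minimal sketch and not the implementation studied in the paper. The class and function names (`MatchEngine`, `PostedReceive`, `Message`, `matches`) and the `-1` wildcard values are assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG.
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass
class PostedReceive:
    source: int  # sender rank, or ANY_SOURCE
    tag: int     # message tag, or ANY_TAG
    comm: int    # communicator id (no wildcard allowed)


@dataclass
class Message:
    source: int
    tag: int
    comm: int
    payload: bytes = b""


def matches(recv: PostedReceive, msg: Message) -> bool:
    """A receive matches when the communicator is equal and the
    source and tag are equal or wildcarded on the receive side."""
    return (recv.comm == msg.comm
            and recv.source in (ANY_SOURCE, msg.source)
            and recv.tag in (ANY_TAG, msg.tag))


class MatchEngine:
    """Linear-search matching over two ordered queues (illustrative only)."""

    def __init__(self):
        self.posted = deque()      # receives waiting for a message
        self.unexpected = deque()  # messages that arrived before a receive

    def post_receive(self, recv: PostedReceive):
        # Check messages that already arrived, oldest first.
        for i, msg in enumerate(self.unexpected):
            if matches(recv, msg):
                del self.unexpected[i]
                return msg
        self.posted.append(recv)
        return None

    def incoming(self, msg: Message):
        # Walk posted receives in posting order; the first match wins.
        for i, recv in enumerate(self.posted):
            if matches(recv, msg):
                del self.posted[i]
                return recv
        self.unexpected.append(msg)
        return None
```

Each arrival or posted receive costs a linear walk over one of the queues, which is why the per-core cost of this traversal bears directly on message rate.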


The Portals 4.0 network programming interface

Brightwell, Ronald B.; Pedretti, Kevin; Wheeler, Kyle B.; Hemmert, Karl S.; Barrett, Brian B.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.


Leveraging MPI's one-sided communication interface for shared-memory programming

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Hoefler, Torsten; Dinan, James; Buntinas, Darius; Balaji, Pavan; Barrett, Brian W.; Brightwell, Ronald B.; Gropp, William; Kale, Vivek; Thakur, Rajeev

Hybrid parallel programming with MPI for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the upcoming MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% to the communication component of a five-point stencil solver. © 2012 Springer-Verlag.


A low impact flow control implementation for offload communication interfaces

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Barrett, Brian W.; Brightwell, Ronald B.; Underwood, Keith D.

Message passing paradigms provide for many-to-one messaging patterns that result in receive-side resource exhaustion. Traditionally, MPI implementations layered over the Portals network programming interface provided a large default unexpected receive buffer space; the user was expected to configure the buffer size to the application demand, and the application was aborted when the buffer space was overrun. The Portals 4 design provides a set of primitives for implementing scalable resource exhaustion recovery without negatively impacting normal operation. A resource exhaustion recovery protocol for MPI implementations is presented, as well as performance results for an Open MPI implementation of the protocol. © 2012 Springer-Verlag.


Cooperative application/OS DRAM fault recovery

Hoemmen, Mark F.; Ferreira, Kurt; Heroux, Michael A.; Brightwell, Ronald B.

Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.


Evaluating operating system vulnerability to memory errors

Ferreira, Kurt; Pedretti, Kevin T.T.; Brightwell, Ronald B.

Reliability is of great concern to the scalability of extreme-scale systems. Of particular concern are soft errors in main memory, which are a leading cause of failures on current systems and are predicted to be the leading cause on future systems. While great effort has gone into designing algorithms and applications that can continue to make progress in the presence of these errors without restarting, the most critical software running on a node, the operating system (OS), is currently left relatively unprotected. OS resiliency is of particular importance because, though this software typically represents a small footprint of a compute node's physical memory, recent studies show more memory errors in this region of memory than the remainder of the system. In this paper, we investigate the soft error vulnerability of two operating systems used in current and future high-performance computing systems: Kitten, the lightweight kernel developed at Sandia National Laboratories, and CLE, a high-performance Linux-based operating system developed by Cray. For each of these platforms, we outline major structures and subsystems that are vulnerable to soft errors and describe methods that could be used to reconstruct damaged state. Our results show the Kitten lightweight operating system may be an easier target to harden against memory errors due to its smaller memory footprint, largely deterministic state, and simpler system structure.


Portals 4 network API definition and performance measurement

Brightwell, Ronald B.

Portals is a low-level network programming interface for distributed memory massively parallel computing systems designed by Sandia, UNM, and Intel. Portals has been designed to provide high message rates and to provide the flexibility to support a variety of higher-level communication paradigms. This project developed and analyzed an implementation of Portals using shared memory in order to measure and understand the impact of using general-purpose compute cores to handle network protocol processing functions. The goal of this study was to evaluate an approach to high-performance networking software design and hardware support that would enable important DOE modeling and simulation applications to perform well and to provide valuable input to Intel so they can make informed decisions about future network software and hardware products that impact DOE applications.


Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin T.T.; Rodrigues, Arun; Barrett, Richard F.; Thompson, David C.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.


Enabling flexible collective communication offload with triggered operations

Proceedings - Symposium on the High Performance Interconnects, Hot Interconnects

Underwood, Keith D.; Coffman, Jerrie; Larsen, Roy; Hemmert, Karl S.; Barrett, Brian W.; Brightwell, Ronald B.; Levenhagen, Michael J.

Low-latency collective communications are key to application scalability. As systems grow larger, minimizing collective communication time becomes increasingly challenging. Offload is an effective technique for accelerating collective operations; however, algorithms for collective communication constantly evolve such that flexible implementations are critical. This paper presents triggered operations, a semantic building block that allows the key components of collective communications to be offloaded while allowing the host-side software to define the algorithm. Simulations are used to demonstrate the performance improvements achievable through the offload of MPI_Allreduce using these building blocks. © 2011 IEEE.
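The idea of a triggered operation can be sketched in a few lines: an operation is armed in advance and fires by itself once a counter reaches a threshold, so the host defines the algorithm up front and is not involved at each step. The example below is a minimal software model under that assumption, showing the reduce half of a tree reduction. It is not the Portals API or the paper's simulator; `Counter`, `trigger_when`, `Node`, and `deliver` are hypothetical names invented for this illustration.

```python
class Counter:
    """Counting event with operations armed against thresholds."""

    def __init__(self):
        self.value = 0
        self._pending = []  # list of (threshold, operation) pairs

    def trigger_when(self, threshold, operation):
        # Fire at once if the threshold is already met, else arm it.
        if self.value >= threshold:
            operation()
        else:
            self._pending.append((threshold, operation))

    def increment(self, amount=1):
        self.value += amount
        ready = [op for t, op in self._pending if t <= self.value]
        self._pending = [(t, op) for t, op in self._pending if t > self.value]
        for op in ready:
            op()


class Node:
    """One participant in a tree reduction (sum).

    The 'host' arms a single triggered forward in the constructor;
    after that, all progress is driven by arriving contributions.
    """

    def __init__(self, local_value, parent=None, num_children=0):
        self.acc = local_value
        self.parent = parent
        self.result = None
        self.counter = Counter()
        # Leaves (num_children == 0) forward immediately.
        self.counter.trigger_when(num_children, self._forward)

    def deliver(self, value):
        # Models an atomic add into the buffer plus a counter increment.
        self.acc += value
        self.counter.increment()

    def _forward(self):
        if self.parent is None:
            self.result = self.acc  # root holds the reduced value
        else:
            self.parent.deliver(self.acc)


# Usage: parents must exist before their children are created.
root = Node(1, num_children=2)
inner = Node(2, parent=root, num_children=2)
leaf_b = Node(3, parent=root)
leaf_c = Node(4, parent=inner)
leaf_d = Node(5, parent=inner)
```

Changing the tree shape or the combining rule only changes how the triggers are armed, which is the flexibility the abstract attributes to leaving the algorithm definition with host-side software.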
