Center for Computing Research (CCR)

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2013

OSTI DOI

Portals 4 Update

Barrett, Brian B.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

Remote Memory Access Programming in MPI-3

ACM Transactions on Parallel Computing

Barrett, Brian B.

Abstract not provided.

TYPE Journal Article YEAR 2013

OSTI

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters

Grant, Ryan E.; Barrett, Brian B.; Brightwell, Ronald B.

Abstract not provided.

TYPE Conference YEAR 2013

OSTI

Extreme-scale Computing Grand Challenge (XGC)

Barrett, Brian B.; Barrett, Richard F.; Rodrigues, Arun; Lentine, Anthony L.; Denton-Hill, Kim M.

Abstract not provided.

TYPE Presentation YEAR 2013

OSTI

The Structural Simulation Toolkit

Proposed for publication in SIGMETRICS Performance Evaluation Review.

Rodrigues, Arun; Hemmert, Karl S.; Barrett, Brian B.; Oldfield, Ron A.

Abstract not provided.

TYPE Journal Article YEAR 2012

OSTI

Open MPI Data Transfer

Barrett, Brian B.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

The Portals 4.0 network programming interface

Brightwell, Ronald B.; Pedretti, Kevin P.; Wheeler, Kyle B.; Hemmert, Karl S.; Barrett, Brian B.

This report presents a specification for the Portals 4.0 network programming interface. Portals 4.0 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4.0 is well suited to massively parallel processing and embedded systems. Portals 4.0 represents an adaptation of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4.0 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2012

OSTI DOI

Portals 4 Network Programming Interface

Barrett, Brian B.; Brightwell, Ronald B.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Overview of the XGC project

Shinde, Subhash L.; Ang, James A.; Barrett, Brian B.; Barrett, Richard F.; Denton-Hill, Kim M.; Lentine, Anthony L.; Murphy, Richard C.; Rodrigues, Arun; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Scalable Parallel Runtime: Lightweight Task Spawning in a Distributed Memory Environment

Barrett, Brian B.; Stark, Dylan S.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Improvements to the Structural Simulation Toolkit

Rodrigues, Arun; Leung, Vitus J.; Levenhagen, Michael J.; Ferreira, Kurt; Hemmert, Karl S.; Barrett, Brian B.

Abstract not provided.

TYPE Conference YEAR 2012

OSTI

Demonstration of a Legacy Application's Path to Exascale - ASC L2 Milestone 4467

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

Abstract not provided.

TYPE Presentation YEAR 2012

OSTI

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale

Barrett, Brian B.; Kelly, Suzanne M.; Klundt, Ruth A.; Laros, James H.; Leung, Vitus J.; Levenhagen, Michael J.; Lofstead, Gerald F.; Moreland, Kenneth D.; Oldfield, Ron A.; Pedretti, Kevin P.; Rodrigues, Arun; Barrett, Richard F.; Ward, Harry L.; Vandyke, John P.; Vaughan, Courtenay T.; Wheeler, Kyle B.; Brandt, James M.; Brightwell, Ronald B.; Curry, Matthew L.; Fabian, Nathan D.; Ferreira, Kurt; Gentile, Ann C.; Hemmert, Karl S.

This report documents thirteen of Sandia's contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels, allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467, Demonstration of a Legacy Application's Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE R&D in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight kernels (LWKs), virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications, Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point, as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

TYPE SAND Report YEAR 2012

OSTI DOI

Using triggered operations to offload rendezvous messages

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Barrett, Brian B.; Brightwell, Ronald B.; Hemmert, Karl S.; Wheeler, Kyle B.; Underwood, Keith D.

Historically, MPI implementations have had to choose between eager messaging protocols that require buffering and rendezvous protocols that sacrifice overlap and strong independent progress in some scenarios. The typical choice is to use an eager protocol for short messages and switch to a rendezvous protocol for long messages. If overlap and progress are desired, some implementations offer the option of using a thread. We propose an approach that leverages triggered operations to implement a long message rendezvous protocol that provides strong progress guarantees. The results indicate that a triggered operation based rendezvous can achieve better overlap than a traditional rendezvous implementation and less wasted bandwidth than an eager long protocol. © 2011 Springer-Verlag Berlin Heidelberg.

TYPE Conference YEAR 2011

Scopus OSTI

Enhanced Support for PGAS Communication in Portals

Barrett, Brian B.; Brightwell, Ronald B.; Hemmert, Karl S.; Pedretti, Kevin P.; Wheeler, Kyle B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Reducing MPI memory usage in Exascale Networks

Barrett, Brian B.; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Enabling Flexible Collective Communication Offload with Triggered Operations

Hemmert, Karl S.; Barrett, Brian B.; Brightwell, Ronald B.; Levenhagen, Michael J.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

A Comparison of the Performance Characteristics of Capability and Capacity Class HPC Systems

Doerfler, Douglas W.; Rajan, Mahesh R.; Epperson, Marcus E.; Vaughan, Courtenay T.; Pedretti, Kevin P.; Barrett, Richard F.; Barrett, Brian B.

Abstract not provided.

TYPE Conference YEAR 2011

OSTI

Using triggered operations to offload collective communication operations

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Hemmert, K.S.; Barrett, Brian B.; Underwood, Keith D.

Efficient collective operations are a major component of application scalability. Offload of collective operations onto the network interface reduces many of the latencies that are inherent in network communications and, consequently, reduces the time to perform the collective operation. To support offload, it is desirable to expose semantic building blocks that are simple to offload and yet powerful enough to implement a variety of collective algorithms. This paper presents the implementation of barrier and broadcast leveraging triggered operations - a semantic building block for collective offload. Triggered operations are shown to be both semantically powerful and capable of improving performance. © 2010 Springer-Verlag.

TYPE Conference YEAR 2010

Scopus OSTI

Accelerating multicore graph algorithms by trading latency for bandwidth

Stark, Dylan S.; Murphy, Richard C.; Barrett, Brian B.; Berry, Jonathan W.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Challenges for high-performance networking for exascale computing

Brightwell, Ronald B.; Barrett, Brian B.; Hemmert, Karl S.

Achieving the next three orders of magnitude performance increase to move from petascale to exascale computing will require significant advancements in several fundamental areas. Recent studies have outlined many of the hardware and software challenges that will need to be addressed. In this paper, we examine these challenges with respect to high-performance networking. We describe the repercussions of anticipated changes to computing and networking hardware and discuss the impact that alternative parallel programming models will have on the network software stack. We also present some ideas on possible approaches that address some of these challenges.

TYPE Conference YEAR 2010

OSTI

Introducing the graph 500

Wheeler, Kyle B.; Barrett, Brian B.; Ang, James A.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Introducing the graph 500

Murphy, Richard C.; Wheeler, Kyle B.; Barrett, Brian B.; Ang, James A.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

On the path to exascale

International Journal of Distributed Systems and Technologies

Alvin, Kenneth F.; Barrett, Brian B.; Brightwell, Ronald B.; Dosanjh, Sudip S.; Geist, Al; Hemmert, Karl S.; Heroux, Michael; Kothe, Doug; Murphy, Richard C.; Nichols, Jeff; Oldfield, Ron A.; Rodrigues, Arun; Vetter, Jeffrey S.

There is considerable interest in achieving a 1000-fold increase in supercomputing power in the next decade, but the challenges are formidable. In this paper, the authors discuss some of the driving science and security applications that require Exascale computing (a million trillion operations per second). Key architectural challenges include power, memory, interconnection networks and resilience. The paper summarizes ongoing research aimed at overcoming these hurdles. Topics of interest are architecture-aware and scalable algorithms, system simulation, 3D integration, new approaches to system-directed resilience and new benchmarks. Although significant progress is being made, a broader international program is needed.

TYPE Journal Article YEAR 2010

Scopus OSTI

Comparing Programming Paradigms for Graph Algorithms

Devine, Karen D.; Plimpton, Steven J.; Bayer, Gregory B.; Barrett, Brian B.; Berry, Jonathan W.

Abstract not provided.

TYPE Conference YEAR 2010

OSTI

Implementing a portable multi-threaded graph library: The mtgl on qthreads

IPDPS 2009 - Proceedings of the 2009 IEEE International Parallel and Distributed Processing Symposium

Barrett, Brian B.; Berry, Jonathan W.; Murphy, Richard C.; Wheeler, Kyle B.

Abstract not provided.

TYPE Conference YEAR 2009

Scopus OSTI

LDRD final report : massive multithreading applied to national infrastructure and informatics

Barrett, Brian B.; Hendrickson, Bruce A.; Laviolette, Randall A.; Leung, Vitus J.; Mackey, Greg; Murphy, Richard C.; Phillips, Cynthia A.; Pinar, Ali P.

Large relational datasets such as national-scale social networks and power grids present different computational challenges than do physical simulations. Sandia's distributed-memory supercomputers are well suited for solving problems concerning the latter, but not the former. The reason is that problems such as pattern recognition and knowledge discovery on large networks are dominated by memory latency and not by computation. Furthermore, most memory requests in these applications are very small, and when the datasets are large, most requests miss the cache. The result is extremely low utilization. We are unlikely to be able to grow out of this problem with conventional architectures. As the power density of microprocessors has approached that of a nuclear reactor in the past two years, we have seen a leveling of Moore's Law. Building larger and larger microprocessor-based supercomputers is not a solution for informatics and network infrastructure problems since the additional processors are utilized to only a tiny fraction of their capacity. An alternative solution is to use the paradigm of massive multithreading with a large shared memory. There is only one instance of this paradigm today: the Cray MTA-2. The proposal team has unique experience with and access to this machine. The XMT, which is now being delivered, is a Red Storm machine with up to 8192 multithreaded 'Threadstorm' processors and 128 TB of shared memory. For many years, the XMT will be the only way to address very large graph problems efficiently, and future generations of supercomputers will include multithreaded processors. Roughly 10 MTA processors can process a simple short paths problem in the time taken by the Gordon Bell Prize-nominated distributed memory code on 32,000 processors of Blue Gene/Light. We have developed algorithms and open-source software for the XMT, and have modified that software to run some of these algorithms on other multithreaded platforms such as the Sun Niagara and Opteron multi-core chips.

TYPE SAND Report YEAR 2009

OSTI DOI