Publications Search

The Portals 4.1 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2017

DOI OSTI

The Portals 4.0.2 Networking Programming Interface

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan; Hemmert, Karl S.; Foulk, James W.; Wheeler, Kyle B.; Underwood, Keith D.; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammell

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

TYPE SAND Report YEAR 2014

DOI OSTI

Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07

Underwood, Keith D.; Levenhagen, Michael; Brightwell, Ronald B.

Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the network interface hardware to enable efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well known RandomAccess benchmark. (c) 2007 ACM.

TYPE Conference YEAR 2007

Scopus OSTI

Approaching the Exascale? Most aren't even ready for the petascale

Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Advanced Interconnect Architectures

Underwood, Keith D.

Abstract not provided.

TYPE Presentation YEAR 2007

OSTI

Opportunities for Reconfigurable Computing in HPC

Underwood, Keith D.

Abstract not provided.

TYPE Presentation YEAR 2007

OSTI

An Architecture to Perform NIC Based MPI Matching

Underwood, Keith D.; Hemmert, Karl S.; Rodrigues, Arun

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Next Generation Network Goals

Underwood, Keith D.

Abstract not provided.

TYPE Presentation YEAR 2007

OSTI

The structural simulation toolkit: a tool for exploring parallel architectures and applications

Murphy, Richard C.; Underwood, Keith D.; Rodrigues, Arun

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

A simple synchronous distributed-memory algorithm for the HPCC RandomAccess benchmark

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Plimpton, Steven J.; Brightwell, Ronald B.; Vaughan, Courtenay T.; Underwood, Keith D.

The RandomAccess benchmark as defined by the High Performance Computing Challenge (HPCC) tests the speed at which a machine can update the elements of a table spread across global system memory, as measured in billions (giga) of updates per second (GUPS). The parallel implementation provided by HPCC typically performs poorly on distributed-memory machines, due to updates requiring numerous small point-to-point messages between processors. We present an alternative algorithm which treats the collection of P processors as a hypercube, aggregating data so that larger messages are sent, and routing individual datums through dimensions of the hypercube to their destination processor. The algorithm's computation (the GUP count) scales linearly with P while its communication overhead scales as log2(P), thus enabling better performance on large numbers of processors. The new algorithm achieves a GUPS rate of 19.98 on 8192 processors of Sandia's Red Storm machine, compared to 1.02 for the HPCC-provided algorithm on 10350 processors. We also illustrate how GUPS performance varies with the benchmark's specification of its "look-ahead" parameter. As expected, parallel performance degrades for small look-ahead values, and improves dramatically for large values. © 2006 IEEE.

TYPE Conference YEAR 2006

OSTI Scopus

FPGAs in HPC: A long road to production

Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

FPGAs in High Performance Computing: Results from Two LDRD Projects

Underwood, Keith D.; Ulmer, Craig; Thompson, David; Hemmert, Karl S.

Field programmable gate arrays (FPGAs) have been used as alternative computational devices for over a decade; however, they have not been used for traditional scientific computing due to their perceived lack of floating-point performance. In recent years, there has been a surge of interest in alternatives to traditional microprocessors for high performance computing. Sandia National Labs began two projects to determine whether FPGAs would be a suitable alternative to microprocessors for high performance scientific computing and, if so, how they should be integrated into the system. We present results that indicate that FPGAs could have a significant impact on future systems. FPGAs have the potential to have order of magnitude levels of performance wins on several key algorithms; however, there are serious questions as to whether the system integration challenge can be met. Furthermore, there remain challenges in FPGA programming and system level reliability when using FPGA devices. Acknowledgment: Arun Rodrigues provided valuable support and assistance in the use of the Structural Simulation Toolkit within an FPGA context. Curtis Janssen and Steve Plimpton provided valuable insights into the workings of two Sandia applications (MPQC and LAMMPS, respectively).

TYPE SAND Report YEAR 2006

DOI OSTI

Analyzing the Scalability of Graph Algorithms on Eldorado

Underwood, Keith D.; Vance, Megan L.; Hendrickson, Bruce A.; Berry, Jonathan

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Architectures and APIs: Assessing Requirements for Delivering FPGA Performance to Applications

Underwood, Keith D.; Ulmer, Craig; Hemmert, Karl S.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Scalability of Graph Algorithms on Eldorado

Underwood, Keith D.; Berry, Jonathan; Hendrickson, Bruce A.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

NIC Architecture Research

Underwood, Keith D.

Abstract not provided.

TYPE Presentation YEAR 2006

OSTI

A Simple Synchronous Distributed-Memory Algorithm for the HPCC RandomAccess Benchmark

Underwood, Keith D.; Plimpton, Steven J.; Brightwell, Ronald B.; Vaughan, Courtenay T.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Benchmarking MPI: The Challenges of Getting it Right

Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

The Impacts of Message Rate on Applications Programming

Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Multi-core processors: coping with the inevitable

Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Open source high performance floating-point modules

Underwood, Keith D.

Given the logic density of modern FPGAs, it is feasible to use FPGAs for floating-point applications. However, it is important that any floating-point units that are used be highly optimized. This paper introduces an open source library of highly optimized floating-point units for Xilinx FPGAs. The units are fully IEEE compliant and achieve approximately 230 MHz operation frequency for double-precision add and multiply in a Xilinx Virtex-2-Pro FPGA (-7 speed grade). This speed is achieved with a 10 stage adder pipeline and a 12 stage multiplier pipeline. The area requirement is 571 slices for the adder and 905 slices for the multiplier.

TYPE Conference YEAR 2006

OSTI

Considering the relative importance of network performance and network features

Proceedings of the International Conference on Parallel Processing

Lawry, William L.; Underwood, Keith D.

Latency and bandwidth are usually considered to be the dominant factor in parallel application performance; however, recent studies have indicated that support for independent progress in MPI can also have a significant impact on application performance. This paper leverages the Cplant system at Sandia National Labs to compare a faster, vendor provided MPI library without independent progress to an internally developed MPI library that sacrifices some performance to provide independent progress. The results are surprising. Although some applications see significant negative impacts from the reduced network performance, others are more sensitive to the presence of independent progress. © 2005 IEEE.

TYPE Conference YEAR 2005

OSTI Scopus

An analysis of the double-precision floating-point FFT on FPGAs

Proceedings - 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2005

Hemmert, Karl S.; Underwood, Keith D.

Advances in FPGA technology have led to dramatic improvements in double precision floating-point performance. Modern FPGAs boast several GigaFLOPs of raw computing power. Unfortunately, this computing power is distributed across 30 floating-point units with over 10 cycles of latency each. The user must find two orders of magnitude more parallelism than is typically exploited in a single microprocessor; thus, it is not clear that the computational power of FPGAs can be exploited across a wide range of algorithms. This paper explores three implementation alternatives for the Fast Fourier Transform (FFT) on FPGAs. The algorithms are compared in terms of sustained performance and memory requirements for various FFT sizes and FPGA sizes. The results indicate that FPGAs are competitive with microprocessors in terms of performance and that the "correct" FFT implementation varies based on the size of the transform and the size of the FPGA. © 2005 IEEE.

TYPE Conference YEAR 2005

OSTI Scopus

A comparison of floating point and logarithmic number systems for FPGAs

Proceedings - 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2005

Haselman, Michael; Beauchamp, Michael; Wood, Aaron; Hauck, Scott; Underwood, Keith D.; Hemmert, Karl S.

There have been many papers proposing the use of logarithmic numbers (LNS) as an alternative to floating point because of simpler multiplication, division and exponentiation computations [1,4-9,13]. However, this advantage comes at the cost of complicated, inexact addition and subtraction, as well as the need to convert between the formats. In this work, we created a parameterized LNS library of computational units and compared them to an existing floating point library. Specifically, we considered multiplication, division, addition, subtraction, and format conversion to determine when one format should be used over the other and when it is advantageous to change formats during a calculation. © 2005 IEEE.

TYPE Conference YEAR 2005

Scopus OSTI

How Relevant is Computer Architecture Research to Emerging Memory Intensive Applications?

Hendrickson, Bruce A.; Underwood, Keith D.

Abstract not provided.

TYPE Conference YEAR 2005

OSTI
