Publications

Results 1–25 of 33

Search results

The Portals 4.1 Network Programming Interface

Barrett, Brian; Brightwell, Ronald B.; Grant, Ryan E.; Hemmert, Karl S.; Laros, James H.; Wheeler, Kyle; Underwood, Keith; Riesen, Rolf; Maccabe, Arthur B.; Hudson, Trammel

This report presents a specification for the Portals 4 network programming interface. Portals 4 is intended to allow scalable, high-performance network communication between nodes of a parallel computing system. Portals 4 is well suited to massively parallel processing and embedded systems. Portals 4 represents an adaption of the data movement layer developed for massively parallel processing platforms, such as the 4500-node Intel TeraFLOPS machine. Sandia's Cplant cluster project motivated the development of Version 3.0, which was later extended to Version 3.3 as part of the Cray Red Storm machine and XT line. Version 4 is targeted to the next generation of machines employing advanced network interface architectures that support enhanced offload capabilities.

Evaluating NIC hardware requirements to achieve high message rate PGAS support on multi-core processors

Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, SC'07

Underwood, Keith; Levenhagen, Michael J.; Brightwell, Ronald B.

Partitioned global address space (PGAS) programming models have been identified as one of the few viable approaches for dealing with emerging many-core systems. These models tend to generate many small messages, which requires specific support from the network interface hardware to enable efficient execution. In the past, Cray included E-registers on the Cray T3E to support the SHMEM API; however, with the advent of multi-core processors, the balance of computation to communication capabilities has shifted toward computation. This paper explores the message rates that are achievable with multi-core processors and simplified PGAS support on a more conventional network interface. For message rate tests, we find that simple network interface hardware is more than sufficient. We also find that even typical data distributions, such as cyclic or block-cyclic, do not need specialized hardware support. Finally, we assess the impact of such support on the well known RandomAccess benchmark. © 2007 ACM.

FPGAs in High Perfomance Computing: Results from Two LDRD Projects

Underwood, Keith; Ulmer, Craig D.; Thompson, David C.; Hemmert, Karl S.

Field programmable gate arrays (FPGAs) have been used as alternative computational devices for over a decade; however, they have not been used for traditional scientific computing due to their perceived lack of floating-point performance. In recent years, there has been a surge of interest in alternatives to traditional microprocessors for high performance computing. Sandia National Labs began two projects to determine whether FPGAs would be a suitable alternative to microprocessors for high performance scientific computing and, if so, how they should be integrated into the system. We present results that indicate that FPGAs could have a significant impact on future systems. FPGAs have the potential to have order of magnitude levels of performance wins on several key algorithms; however, there are serious questions as to whether the system integration challenge can be met. Furthermore, there remain challenges in FPGA programming and system level reliability when using FPGA devices. Acknowledgment: Arun Rodrigues provided valuable support and assistance in the use of the Structural Simulation Toolkit within an FPGA context. Curtis Janssen and Steve Plimpton provided valuable insights into the workings of two Sandia applications (MPQC and LAMMPS, respectively).

Open source high performance floating-point modules

Underwood, Keith

Given the logic density of modern FPGAs, it is feasible to use FPGAs for floating-point applications. However, it is important that any floating-point units that are used be highly optimized. This paper introduces an open source library of highly optimized floating-point units for Xilinx FPGAs. The units are fully IEEE compliant and achieve approximately 230 MHz operation frequency for double-precision add and multiply in a Xilinx Virtex-2-Pro FPGA (-7 speed grade). This speed is achieved with a 10 stage adder pipeline and a 12 stage multiplier pipeline. The area requirement is 571 slices for the adder and 905 slices for the multiplier.

A comparison of floating point and logarithmic number systems for FPGAs

Proceedings - 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2005

Haselman, Michael; Beauchamp, Michael; Wood, Aaron; Hauck, Scott; Underwood, Keith; Hemmert, Karl S.

There have been many papers proposing the use of logarithmic numbers (LNS) as an alternative to floating point because of simpler multiplication, division and exponentiation computations [1,4-9,13]. However, this advantage comes at the cost of complicated, inexact addition and subtraction, as well as the need to convert between the formats. In this work, we created a parameterized LNS library of computational units and compared them to an existing floating point library. Specifically, we considered multiplication, division, addition, subtraction, and format conversion to determine when one format should be used over the other and when it is advantageous to change formats during a calculation. © 2005 IEEE.
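
The trade-off described in this abstract can be illustrated with a minimal sketch. It is not the parameterized library from the paper: the function names (`to_lns`, `from_lns`, `lns_mul`, `lns_div`, `lns_add`) are hypothetical, the values are assumed to be positive, and the base-2 logarithm is held in an ordinary Python float rather than a fixed-point hardware format. Multiplication and division reduce to adding or subtracting exponents, while addition needs a nonlinear correction term, which hardware implementations typically have to approximate.

```python
import math


def to_lns(x: float) -> float:
    """Encode a positive real as its base-2 logarithm (sign handling omitted)."""
    if x <= 0:
        raise ValueError("this sketch only handles positive values")
    return math.log2(x)


def from_lns(e: float) -> float:
    """Decode an LNS value back to an ordinary real number."""
    return 2.0 ** e


def lns_mul(a: float, b: float) -> float:
    """Multiplication in LNS is a single addition of the logarithms."""
    return a + b


def lns_div(a: float, b: float) -> float:
    """Division in LNS is a single subtraction of the logarithms."""
    return a - b


def lns_add(a: float, b: float) -> float:
    """Addition in LNS needs the nonlinear term log2(1 + 2^(lo - hi))."""
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))


if __name__ == "__main__":
    six, seven = to_lns(6.0), to_lns(7.0)
    print(from_lns(lns_mul(six, seven)))  # product computed with one add
    print(from_lns(lns_add(six, seven)))  # sum needs the correction term
```

The format conversions (`to_lns`, `from_lns`) are the extra cost the abstract mentions when a calculation switches between floating point and LNS.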

An analysis of the double-precision floating-point FFT on FPGAs

Proceedings - 13th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2005

Hemmert, Karl S.; Underwood, Keith

Advances in FPGA technology have led to dramatic improvements in double precision floating-point performance. Modern FPGAs boast several GigaFLOPs of raw computing power. Unfortunately, this computing power is distributed across 30 floating-point units with over 10 cycles of latency each. The user must find two orders of magnitude more parallelism than is typically exploited in a single microprocessor; thus, it is not clear that the computational power of FPGAs can be exploited across a wide range of algorithms. This paper explores three implementation alternatives for the Fast Fourier Transform (FFT) on FPGAs. The algorithms are compared in terms of sustained performance and memory requirements for various FFT sizes and FPGA sizes. The results indicate that FPGAs are competitive with microprocessors in terms of performance and that the "correct" FFT implementation varies based on the size of the transform and the size of the FPGA. © 2005 IEEE.
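
The parallelism figure in this abstract follows from simple arithmetic, sketched below. The sketch assumes each floating-point unit is fully pipelined and accepts one new operation per cycle, which the abstract does not state; the helper name `operations_in_flight` is hypothetical. Under that assumption, keeping every unit busy requires roughly units × latency independent operations at any moment.

```python
def operations_in_flight(units: int, latency_cycles: int) -> int:
    """Independent operations needed to keep every pipeline stage busy.

    Assumes each unit is fully pipelined (one new operation per cycle).
    """
    if units < 0 or latency_cycles < 0:
        raise ValueError("units and latency_cycles must be non-negative")
    return units * latency_cycles


if __name__ == "__main__":
    # Figures from the abstract: 30 units, "over 10 cycles" of latency each,
    # so this product is a lower bound.
    print(operations_in_flight(30, 10))
```

With the abstract's figures this gives at least 300 concurrent independent operations, which is consistent with the stated need for about two orders of magnitude more parallelism than a single microprocessor typically exploits.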

Considering the relative importance of network performance and network features

Proceedings of the International Conference on Parallel Processing

Lawry, William L.; Underwood, Keith

Latency and bandwidth are usually considered to be the dominant factors in parallel application performance; however, recent studies have indicated that support for independent progress in MPI can also have a significant impact on application performance. This paper leverages the Cplant system at Sandia National Labs to compare a faster, vendor-provided MPI library without independent progress to an internally developed MPI library that sacrifices some performance to provide independent progress. The results are surprising. Although some applications see significant negative impacts from the reduced network performance, others are more sensitive to the presence of independent progress. © 2005 IEEE.
