Publications

Results 101–114 of 114

An architecture to perform NIC based MPI matching

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Hemmert, Karl S.; Underwood, Keith; Rodrigues, Arun

Modern supercomputers aggregate thousands of microprocessors through a high performance network. Many of these systems place a processor on the network interface controller (NIC) to handle some portion of the MPI processing. This processing involves traversing a linked list and invoking a matching function for each item. Although this task is critical to the performance of the system, microprocessors perform it extremely poorly. Furthermore, the traditional network processor approaches of multicore and multithreading map poorly to the problem because the list is a shared data structure. While match processing can be implemented directly in hardware, hardware implementations can be extremely inflexible and lead to extremely high risk. This paper presents a novel, programmable architecture for a processor to handle the matching function. The matching engine approaches the performance of a direct hardware implementation while maintaining a high degree of flexibility and programmability. More importantly, it requires a dramatically smaller area than a conventional processor. © 2007 IEEE.
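
The matching step the abstract refers to can be sketched in plain C. The fragment below is an illustrative rendering of generic MPI matching semantics, not the paper's matching-engine design; the type and function names (posted_recv_t, match_incoming) and the wildcard constants are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative posted-receive entry; fields are hypothetical, not the paper's. */
typedef struct posted_recv {
    int src;                    /* expected source rank, or ANY_SOURCE */
    int tag;                    /* expected tag, or ANY_TAG            */
    int comm_id;                /* communicator context id             */
    void *buffer;               /* user receive buffer                 */
    struct posted_recv *next;
} posted_recv_t;

#define ANY_SOURCE (-1)
#define ANY_TAG    (-1)

static bool matches(const posted_recv_t *r, int src, int tag, int comm_id)
{
    return r->comm_id == comm_id &&
           (r->src == ANY_SOURCE || r->src == src) &&
           (r->tag == ANY_TAG    || r->tag == tag);
}

/* Walk the posted-receive list; unlink and return the first match.
 * This per-message traversal is the work a NIC-side matching engine
 * would take over from the embedded processor. */
posted_recv_t *match_incoming(posted_recv_t **head, int src, int tag, int comm_id)
{
    for (posted_recv_t **pp = head; *pp != NULL; pp = &(*pp)->next) {
        if (matches(*pp, src, tag, comm_id)) {
            posted_recv_t *hit = *pp;
            *pp = hit->next;    /* remove from the list */
            return hit;
        }
    }
    return NULL;                /* no match: message goes to the unexpected queue */
}
```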

Implications of application usage characteristics for collective communication offload

International Journal of High Performance Computing and Networking

Brightwell, Ronald B.; Goudy, Sue P.; Rodrigues, Arun; Underwood, Keith D.

The global, synchronous nature of some collective operations implies that they will become the bottleneck when scaling to hundreds of thousands of nodes. One approach improves collective performance using a programmable network interface to directly implement collectives. While these implementations improve micro-benchmark performance, accelerating applications will require deeper understanding of application behaviour. We describe several characteristics of applications that impact collective communication performance. We analyse network resource usage data to guide the design of collective offload engines and their associated programming interfaces. In particular, we provide an analysis of the potential benefit of non-blocking collective communication operations for MPI. © 2006 Inderscience Enterprises Ltd.
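
The non-blocking collective operations whose potential benefit the paper analyses were later standardized in MPI-3. The sketch below shows the generic overlap pattern only; it is not code from the paper, and compute_independent_work() is a hypothetical stand-in for application work that does not depend on the reduction result.

```c
#include <mpi.h>

/* Hypothetical placeholder for computation that does not depend on the
 * reduction result and can therefore overlap the collective. */
static void compute_independent_work(void) { /* ... application work ... */ }

/* Start the reduction, do unrelated work, then wait for completion. */
void reduce_with_overlap(double *local, double *global, int n, MPI_Comm comm)
{
    MPI_Request req;
    MPI_Iallreduce(local, global, n, MPI_DOUBLE, MPI_SUM, comm, &req);
    compute_independent_work();          /* overlapped computation */
    MPI_Wait(&req, MPI_STATUS_IGNORE);   /* result is needed past this point */
}
```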

Enhancing NIC performance for MPI using processing-in-memory

Proceedings - 19th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2005

Rodrigues, Arun; Murphy, Richard; Brightwell, Ronald B.; Underwood, Keith D.

Processing-in-Memory (PIM) technology encompasses a range of research leveraging a tight coupling of memory and processing. The most unique features of the technology are extremely wide paths to memory, extremely low memory latency, and wide functional units. Many PIM researchers are also exploring extremely fine-grained multi-threading capabilities. This paper explores a mechanism for leveraging these features of PIM technology to enhance commodity architectures in a seemingly mundane way: accelerating MPI. Modern network interfaces leverage simple processors to offload portions of the MPI semantics, particularly the management of posted receive and unexpected message queues. Without adding cost or increasing clock frequency, using PIMs in the network interface can enhance performance. The results are a significant decrease in latency and increase in small message bandwidth, particularly when long queues are present.
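
The posted-receive and unexpected-message queues mentioned above follow the standard two-queue MPI protocol. Below is a minimal sketch of the receive-posting half of that protocol; every name and field is invented for illustration rather than taken from the paper, and long versions of either list are what the PIM-based NIC is meant to accelerate.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Illustrative unexpected-message and posted-receive entries; a negative
 * src or tag stands in for MPI_ANY_SOURCE / MPI_ANY_TAG. */
typedef struct unexpected_msg {
    int src, tag, comm_id;
    void *payload;
    size_t len;
    struct unexpected_msg *next;
} unexpected_msg_t;

typedef struct posted_recv {
    int src, tag, comm_id;
    void *buffer;               /* assumed large enough for the payload */
    struct posted_recv *next;
} posted_recv_t;

/* On a receive post: search the unexpected queue first; on a hit, copy
 * the payload out; on a miss, append the receive to the posted queue. */
bool post_receive(unexpected_msg_t **unexpected, posted_recv_t **posted,
                  posted_recv_t *recv)
{
    for (unexpected_msg_t **pp = unexpected; *pp != NULL; pp = &(*pp)->next) {
        unexpected_msg_t *m = *pp;
        if (m->comm_id == recv->comm_id &&
            (recv->src < 0 || recv->src == m->src) &&
            (recv->tag < 0 || recv->tag == m->tag)) {
            memcpy(recv->buffer, m->payload, m->len);
            *pp = m->next;      /* unlink the unexpected entry */
            return true;        /* receive completed immediately */
        }
    }
    /* No unexpected match: append to the tail of the posted queue. */
    posted_recv_t **tail = posted;
    while (*tail != NULL)
        tail = &(*tail)->next;
    recv->next = NULL;
    *tail = recv;
    return false;
}
```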

Programming future architectures: dusty decks, memory walls, and the speed of light

Rodrigues, Arun

Due to advances in CMOS fabrication technology, high performance computing capabilities have continually grown. More capable hardware has allowed a range of complex scientific applications to be developed. However, these applications present a bottleneck to future performance. Entrenched 'legacy' codes - 'Dusty Decks' - demand that new hardware must remain compatible with existing software. Additionally, conventional architectures face increasing challenges. Many of these challenges revolve around the growing disparity between processor and memory speed - the 'Memory Wall' - and difficulties scaling to large numbers of parallel processors. To a large extent, these limitations are inherent to the traditional computer architecture. As data is consumed more quickly, moving that data to the point of computation becomes more difficult. Barring any upward revision in the speed of light, this will continue to be a fundamental limitation on the speed of computation. This work focuses on solving these problems in the context of Light Weight Processing (LWP). LWP is an innovative technique which combines Processing-In-Memory, short vector computation, multithreading, and extended memory semantics. It applies these techniques to answer the questions 'What will a next-generation supercomputer look like?' and 'How will we program it?' To that end, this work presents four contributions: (1) An implementation of MPI which uses features of LWP to substantially improve message processing throughput; (2) A technique leveraging extended memory semantics to improve message passing by overlapping computation and communication; (3) An OpenMP library modified to allow efficient partitioning of threads between a conventional CPU and LWPs - greatly improving cost/performance; and (4) An algorithm to extract very small 'threadlets' which can overcome the inherent disadvantages of a simple processor pipeline.

Accelerating list management for MPI

Hemmert, Karl S.; Rodrigues, Arun; Underwood, Keith

The latency and throughput of MPI messages are critically important to a range of parallel scientific applications. In many modern networks, both of these performance characteristics are largely driven by the performance of a processor on the network interface. Because of the semantics of MPI, this embedded processor is forced to traverse a linked list of posted receives each time a message is received. As this list grows long, the latency of message reception grows and the throughput of MPI messages decreases. This paper presents a novel hardware feature to handle list management functions on a network interface. By moving functions such as list insertion, list traversal, and list deletion to the hardware unit, latencies are decreased by up to 20% in the zero length queue case with dramatic improvements in the presence of long queues. Similarly, the throughput is increased by up to 10% in the zero length queue case and by nearly 100% in the presence of queues of 30 messages.
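
One way to picture the offload interface is as a small command descriptor the host posts to the NIC-side list unit for each insertion, deletion, or search. The layout below is purely speculative; the paper's actual hardware interface is not reproduced here.

```c
#include <stdint.h>

/* Hypothetical command descriptor for a NIC-side list management unit. */
typedef enum { LIST_INSERT, LIST_DELETE, LIST_SEARCH } list_op_t;

typedef struct {
    list_op_t op;         /* which list operation to perform              */
    uint32_t  queue_id;   /* posted-receive vs. unexpected-message queue  */
    uint64_t  match_bits; /* packed (context, source, tag) to match on    */
    uint64_t  mask_bits;  /* wildcard mask for MPI_ANY_SOURCE/MPI_ANY_TAG */
    uint64_t  user_ptr;   /* host pointer returned on a successful match  */
} list_cmd_t;

/* Pack (context, source, tag) into the 64-bit match field; a common
 * encoding, though not necessarily the one the paper's hardware uses. */
static uint64_t pack_match_bits(uint16_t ctx, uint32_t src, uint16_t tag)
{
    return ((uint64_t)ctx << 48) | ((uint64_t)src << 16) | (uint64_t)tag;
}
```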

The implications of working set analysis on supercomputing memory hierarchy design

Underwood, Keith; Rodrigues, Arun

Supercomputer architects strive to maximize the performance of scientific applications. Unfortunately, the large, unwieldy nature of most scientific applications has led to the creation of artificial benchmarks, such as SPEC-FP, for architecture research. Given the impact that these benchmarks have on architecture research, this paper seeks an understanding of how they relate to real-world applications within the Department of Energy. Since the memory system has been found to be a particularly key issue for many applications, the focus of the paper is on the relationship between how the SPEC-FP benchmarks and DOE applications use the memory system. The results indicate that while the SPEC-FP suite is a well-balanced suite, supercomputing applications typically demand more from the memory system and must perform more 'other work' (in the form of integer computations) along with the floating point operations. The SPEC-FP suite generally demonstrates slightly more temporal locality leading to somewhat lower bandwidth demands. The most striking result is the cumulative difference between the benchmarks and the applications in terms of the requirements to sustain the floating-point operation rate: the DOE applications require significantly more data from main memory (not cache) per FLOP and dramatically more integer instructions per FLOP.
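
The headline metrics here, main-memory bytes per FLOP and integer instructions per FLOP, are simple ratios of hardware-counter totals. The sketch below only illustrates the arithmetic; the counter values are made-up placeholders, not data from the paper.

```c
#include <stdio.h>

/* Compute bytes-from-memory per FLOP and integer ops per FLOP from
 * (placeholder) counter totals. */
int main(void)
{
    double flops       = 1.0e12;  /* floating-point operations retired */
    double dram_bytes  = 2.5e11;  /* bytes moved to/from main memory   */
    double integer_ops = 3.0e12;  /* integer instructions retired      */

    printf("bytes from memory per FLOP : %.3f\n", dram_bytes / flops);
    printf("integer ops per FLOP       : %.3f\n", integer_ops / flops);
    return 0;
}
```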
