Publications

Results 1–25 of 130
Skip to search filters

(SAI) stalled, active and idle: Characterizing power and performance of large-scale dragonfly networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Groves, Taylor G.; Grant, Ryan E.; Hemmert, Karl S.; Hammond, Simon D.; Levenhagen, Michael J.; Arnold, Dorian C.

Exascale networks are expected to comprise a significant part of the total monetary cost and 10-20% of the power budget allocated to exascale systems. Yet, our understanding of current and emerging workloads on these networks is limited. Left ignored, this knowledge gap likely will translate into missed opportunities for (1) improved application performance and (2) decreased power and monetary costs in next generation systems. This work targets a detailed understanding and analysis of the performance and utilization of the dragonfly network topology. Using the Structural Simulation Toolkit (SST) and a range of relevant workloads on a dragonfly topology of 110,592 nodes, we examine network design tradeoffs amongst execution time, power, bandwidth, and the number of global links. Our simulations report stalled, active and idle time on a per-port level of the fabric, in order to provide a detailed picture of future networks. The results of this work show potential savings of 3-10% of the exascale power budget and provide valuable insights to researchers looking for new opportunities to improve performance and increase power efficiency of next generation HPC systems.

More Details

A comparison of power management mechanisms: P-States vs. node-level power cap control

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2018

Pedretti, Kevin P.; Grant, Ryan E.; Laros, James H.; Levenhagen, Michael J.; Olivier, Stephen L.; Ward, Harry L.; Younge, Andrew J.

Large-scale HPC systems increasingly incorporate sophisticated power management control mechanisms. While these mechanisms are potentially useful for performing energy and/or power-aware job scheduling and resource management (EPA JSRM), greater understanding of their operation and performance impact on real-world applications is required before they can be applied effectively in practice. In this paper, we compare static p-state control to static node-level power cap control on a Cray XC system. Empirical experiments are performed to evaluate node-to-node performance and power usage variability for the two mechanisms. We find that static p-state control produces more predictable and higher performance characteristics than static node-level power cap control at a given power level. However, this performance benefit is at the cost of less predictable power usage. Static node-level power cap control produces predictable power usage but with more variable performance characteristics. Our results are not intended to show that one mechanism is better than the other. Rather, our results demonstrate that the mechanisms are complementary to one another and highlight their potential for combined use in achieving effective EPA JSRM solutions.

More Details

A dedicated message matching mechanism for collective communications

ACM International Conference Proceeding Series

Ghazimirsaeed, S.M.; Grant, Ryan E.; Afsahi, Ahmad

The Message Passing Interface (MPI) libraries use message queues to guarantee correct message ordering between communicating processes. Message queues are in the critical path of MPI communications and thus, the performance of message queue operations can have significant impact on the performance of applications. Collective communications are widely used in MPI applications and they can have considerable impact on generating long message queues. In this paper, we propose a message matching mechanism that improves the message queue search time by distinguishing messages coming from point-to-point and collective communications and allocating separate queues for them. Moreover, it dynamically profiles the impact of each collective call on message queues during the application runtime and uses this information to adapt the message queue data structure for each collective operation dynamically. The proposed approach can successfully reduce the queue search time while maintaining scalable memory consumption. The evaluation results show that we can obtain up to 5.5x runtime speedup for applications with long list traversals. Moreover, we can gain up to 15% and 45% queue search time improvement for applications with short and medium list traversals, respectively.

More Details

A dynamic, unified design for dedicated message matching engines for collective and point-to-point communications

Parallel Computing

Ghazimirsaeed, S.M.; Grant, Ryan E.; Afsahi, Ahmad

The Message Passing Interface (MPI) libraries use message queues to guarantee correct message ordering between communicating processes. Message queues are in the critical path of MPI communications and thus, the performance of message queue operations can have significant impact on the performance of applications. Collective communications are widely used in MPI applications and they can have considerable impact on generating long message queues. In this paper, we propose a unified message matching mechanism that improves the message queue search time by distinguishing messages coming from point-to-point and collective communications and using a distinct message queue data structure for them. For collective operations, it dynamically profiles the impact of each collective call on message queues during the application runtime and uses this information to adapt the message queue data structure for each collective dynamically. Moreover, we use a partner/non-partner message queue data structure for the messages coming from point-to-point communications. The proposed approach can successfully reduce the queue search time while maintaining scalable memory consumption. The evaluation results show that we can obtain up to 5.5x runtime speedup for applications with long list traversals. Moreover, we can gain up to 15% and 94% queue search time improvement for all elements in applications with short and medium list traversals, respectively.

More Details

A survey of MPI usage in the US exascale computing project

Concurrency and Computation: Practice and Experience

Bernholdt, David E.; Boehm, Swen; Bosilca, George; Gorentla Venkata, Manjunath; Grant, Ryan E.; Naughton, Thomas; Pritchard, Howard P.; Schulz, Martin; Vallee, Geoffroy R.

The Exascale Computing Project (ECP) is currently the primary effort in the United States focused on developing “exascale” levels of computing capabilities, including hardware, software, and applications. In order to obtain a more thorough understanding of how the software projects under the ECP are using, and planning to use the Message Passing Interface (MPI), and help guide the work of our own project within the ECP, we created a survey. Of the 97 ECP projects active at the time the survey was distributed, we received 77 responses, 56 of which reported that their projects were using MPI. This paper reports the results of that survey for the benefit of the broader community of MPI developers.

More Details

An evaluation of MPI message rate on hybrid-core processors

International Journal of High Performance Computing Applications

Barrett, Brian W.; Brightwell, Ronald B.; Grant, Ryan E.; Hammond, Simon D.; Hemmert, Karl S.

Power and energy concerns are motivating chip manufacturers to consider future hybrid-core processor designs that may combine a small number of traditional cores optimized for single-thread performance with a large number of simpler cores optimized for throughput performance. This trend is likely to impact the way in which compute resources for network protocol processing functions are allocated and managed. In particular, the performance of MPI match processing is critical to achieving high message throughput. In this paper, we analyze the ability of simple and more complex cores to perform MPI matching operations for various scenarios in order to gain insight into how MPI implementations for future hybrid-core processors should be designed.

More Details

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; franko, ken f.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; kale, laxkimant k.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; humphreys, alan h.; schmidt, john s.; sunderland, dan s.; Mccormick, Pat M.; gutierrez, samuel g.; shulz, martin s.; Gamblin, Todd G.; Bremer, Peer-Timo B.

Abstract not provided.

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms

Baker, Gavin M.; Bettencourt, Matthew T.; Bova, S.W.; franko, ken f.; Gamell, Marc G.; Grant, Ryan E.; Hammond, Simon D.; Hollman, David S.; Knight, Samuel K.; Kolla, Hemanth K.; Lin, Paul L.; Olivier, Stephen O.; Sjaardema, Gregory D.; Slattengren, Nicole L.; Teranishi, Keita T.; Wilke, Jeremiah J.; Bennett, Janine C.; Clay, Robert L.; kale, laxkimant k.; Jain, Nikhil J.; Mikida, Eric M.; Aiken, Alex A.; Bauer, Michael B.; Lee, Wonchan L.; Slaughter, Elliott S.; Treichler, Sean T.; Berzins, Martin B.; Harman, Todd H.; humphreys, alan h.; schmidt, john s.; sunderland, dan s.; Mccormick, Pat M.; gutierrez, samuel g.; shulz, martin s.; Gamblin, Todd G.; Bremer, Peer-Timo B.

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) work- load requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "Oriascale7 computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AIM] runtime systems – Charm-++, Legion, and Uintah, all of which are in use as part of the Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching Predictive Science Academic Alliance Program II (PSAAP-II) Asc findings emerge. From a performance perspective, AIV runtimes show tremendous potential for addressing extreme- scale challenges. Empirical studies show an AM runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MP1) and AM11runtimes perform comparably under balanced conditions. From a programmability and mutability perspective however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co- design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers developers. and high-performance computing runtime systein

More Details

Characterizing MPI matching via trace-based simulation

ACM International Conference Proceeding Series

Ferreira, Kurt B.; Levy, Scott; Pedretti, Kevin P.; Grant, Ryan E.

With the increased scale expected on future leadership-class systems, detailed information about the resource usage and performance of MPI message matching provides important insights into how to maintain application performance on next-generation systems. However, obtaining MPI message matching performance data is often not possible without significant effort. A common approach is to instrument an MPI implementation to collect relevant statistics. While this approach can provide important data, collecting matching data at runtime perturbs the application’s execution, including its matching performance, and is highly dependent on the MPI library’s matchlist implementation. In this paper, we introduce a trace-based simulation approach to obtain detailed MPI message matching performance data for MPI applications without perturbing their execution. Using a number of key parallel workloads, we demonstrate that this simulator approach can rapidly and accurately characterize matching behavior. Specifically, we use our simulator to collect several important statistics about the operation of the MPI posted and unexpected queues. For example, we present data about search lengths and the duration that messages spend in the queues waiting to be matched. Data gathered using this simulation-based approach have significant potential to aid hardware designers in determining resource allocation for MPI matching functions and provide application and middleware developers with insight into the scalability issues associated with MPI message matching.

More Details
Results 1–25 of 130
Results 1–25 of 130