OSTI Scopus

Evaluating MPI resource usage summary statistics

Parallel Computing

Levy, Scott L.N.; Ferreira, Kurt; Siddiqua, Taniya; Debardelebe, Nathan; Sridharan, Vilas; Baseman, Elisabeth

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI

Hardware MPI message matching: Insights into MPI matching behavior to inform design: Hardware MPI message matching

Concurrency and Computation. Practice and Experience

Ferreira, Kurt; Grant, Ryan; Levenhagen, Michael; Levy, Scott L.N.; Groves, Taylor

Here, this paper explores key differences of MPI match lists for several important United States Department of Energy (DOE) applications and proxy applications. This understanding is critical in determining the most promising hardware matching design for any given high-speed network. The results of MPI match list studies for the major open-source MPI implementations, MPICH and Open MPI, are presented, and we modify an MPI simulator, LogGOPSim, to provide match list statistics. These results are discussed in the context of several different potential design approaches to MPI matching–capable hardware. The data illustrate the requirements for different hardware designs in terms of performance and memory capacity. Finally, this paper's contributions are the collection and analysis of data to help inform hardware designers of common MPI requirements and highlight the difficulties in determining these requirements by only examining a single MPI implementation.

More Details

TYPE Journal Article YEAR 2019

DOI OSTI

Mediating Data Center Storage Diversity in HPC Applications with FAODEL

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ulmer, Craig; Levy, Scott L.N.; Kordenbrock, Todd; Templet, Gary J.

Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.

More Details

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

The upcoming storm: The implications of increasing core count on scalable system software

Advances in Parallel Computing

Dosanjh, Matthew G.; Grant, Ryan; Hjelm, Nathan; Levy, Scott L.N.; Schonbein, William W.

As clock speeds have stagnated, the number of cores in a node has been drastically increased to improve processor throughput. Most scalable system software was designed and developed for single-threaded environments. Multithreaded environments become increasingly prominent as application developers optimize their codes to leverage the full performance of the processor; however, these environments are incompatible with a number of assumptions that have driven scalable system software development. This paper will highlight a case study of this mismatch focusing on MPI message matching. MPI message matching has been designed and optimized for traditional serial execution. The reduced determinism in the order of MPI calls can significantly reduce the performance of MPI message matching, potentially overtaking time-per-iteration targets of many applications. Different proposed techniques attempt to address these issues and enable multithreaded MPI usage. These approaches highlight a number of tradeoffs that make adapting MPI message matching complex. This case study and its proposed solutions highlight a number of general concepts that need to be leveraged in the design of next generation scaleable system software.

More Details

TYPE Book YEAR 2019

OSTI Scopus

SNL ATDM Software Ecosystem

Olivier, Stephen L.; Brightwell, Ronald B.; Foulk, James W.; Younge, Andrew J.; Evans, Noah; Levy, Scott L.N.; Ferreira, Kurt; Grant, Ryan

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

Using simulation to examine the effect of MPI message matching costs on application performance

ACM International Conference Proceeding Series

Levy, Scott L.N.; Ferreira, Kurt

Attaining high performance with MPI applications requires efficient message matching to minimize message processing overheads and the latency these overheads introduce into application communication. In this paper, we use a validated simulation-based approach to examine the relationship between MPI message matching performance and application time-to-solution. Specifically, we examine how the performance of several important HPC workloads is affected by the time required for matching. Our analysis yields several important contributions: (i) the performance of current workloads is unlikely to be significantly affected by MPI matching unless match queue operations get much slower or match queues get much longer; (ii) match queue designs that provide sublinear performance as a function of queue length are unlikely to yield much benefit unless match queue lengths increase dramatically; and (iii) we provide guidance on how long the mean time per match attempt may be without significantly affecting application performance. The results and analysis in this paper provide valuable guidance on the design and development of MPI message match queues.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI Scopus

Levy, Scott L.N.; Ferreira, Kurt; Debardeleben, Nathan; Siddiqua, Taniya; Sridharan, Vilas; Baseman, Elisabeth

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

Faodel: Data Management for Next-Generation Application Workflows

Ulmer, Craig; Mukherjee, Shyamali; Templet, Gary J.; Levy, Scott L.N.; Lofstead, Gerald F.; Widener, Patrick; Lawson, Margaret

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2018

DOI OSTI

SNL ATDM: I/O and Data Management

Ulmer, Craig; Kordenbrock, Todd; Lawson, Margaret; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Sjaardema, Gregory D.; Templet, Gary J.; Ward, Harry L.; Widener, Patrick

Abstract not provided.

More Details

TYPE Presentation YEAR 2018

OSTI

It’s not the heat, it’s the humidity: Scheduling resilience activity at scale

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ulmer, Craig; Oldfield, Ron; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Templet, Gary J.; Widener, Patrick

Abstract not provided.

More Details

TYPE Presentation YEAR 2017

OSTI

Lifetime memory reliability data from the field

2017 IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems, DFT 2017

Siddiqua, Taniya; Sridharan, Vilas; Raasch, Steven E.; Debardeleben, Nathan; Ferreira, Kurt; Levy, Scott L.N.; Baseman, Elisabeth; Guan, Qiang

In order to provide high system resilience, it is important to understand the nature of the faults that occur in the field. This study analyzes fault rates from a production system that has been monitored for five years, capturing data for the entire operational lifetime of the system. The data show that devices in this system did not show any sign of aging during the monitoring period, suggesting that the lifetime of a system may be longer than five years. In DRAM, the relative incidence of fault modes changed insignificantly over the system's lifetime: The relative rate of each fault mode at the end of the system's lifetime was within 1.4 percentage point of the rate observed during the first year. SRAM caches in the system exhibited different fault modes including cache-way fault and single-bit faults. Overall, this study provides insights on how fault modes and types in a system evolve over the system's lifetime.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Horseshoes and hand grenades: The case for approximate coordination in local checkpointing protocols

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Widener, Patrick; Ferreira, Kurt; Levy, Scott L.N.

Fault-tolerance poses a major challenge for future large-scale systems. Active research into coordinated, uncoordinated, and hybrid checkpointing systems has explored how the introduction of asynchrony can address anticipated scalability issues. While fully uncoordinated approaches have been shown to have significant delays, the degree of sychronization required to keep overheads low has not yet been significantly addressed. In this paper, we use a simulation-based approach to show the impact of synchronization on local checkpoint activity. Specifically, we show the degree of synchronization needed to keep the impacts of local checkpointing low is attainable with current technology for a number of key production HPC workloads. Our work provides a critical analysis and comparison of synchronization and local checkpointing. This enables users and system administrators to fine-tune the checkpointing scheme to the application and system characteristics available.

More Details

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

ATDM Data Warehouse

Ulmer, Craig; Kordenbrock, Todd; Levy, Scott L.N.; Lofstead, Gerald F.; Mukherjee, Shyamali; Sjaardema, Gregory D.; Templet, Gary J.; Widener, Patrick; Oldfield, Ron

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar H.

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

An examination of the impact of failure distribution on coordinated checkpoint/restart

FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Levy, Scott L.N.; Ferreira, Kurt

Fault tolerance is a key challenge to building the first exascale system. To understand the potential impacts of failures on next-generation systems, significant effort has been devoted to collecting, characterizing and analyzing failures on current systems. These studies require large volumes of data and complex analysis. Because the occurrence of failures in large-scale systems is unpredictable, failures are commonly modeled as a stochastic process. Failure data from current systems is examined in an attempt to identify the underlying probability distribution and its statistical properties. In this paper, we use modeling to examine the impact of failure distributions on the time-to-solution and the optimal checkpoint interval of applications that use coordinated checkpoint/restart. Using this approach, we show that as failures become more frequent, the failure distribution has a larger influence on application performance. We also show that as failure times are less tightly grouped (i.e., as the standard deviation increases) the underlying probability distribution has a greater impact on application performance. Finally, we show that computing the checkpoint interval based on the assumption that failures are exponentially distributed has a modest impact on application performance even when failures are drawn from a different distribution. Our work provides critical analysis and guidance to the process of analyzing failure data in the context of coordinated checkpoint/restart. Specifically, the data presented in this paper helps to distinguish cases where the failure distribution has a strong influence on application performance from those cases when the failure distribution has relatively little impact.

More Details

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus

How I Learned to Stop Worrying and Love In Situ Analytics:Leveraging latent synchronization in MPI collective algorithms

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick; Mondragon, Oscar

Abstract not provided.

More Details

TYPE Conference Poster YEAR 2016

OSTI

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale

Levy, Scott L.N.; Ferreira, Kurt; Widener, Patrick; Bridges, Patrick G.; Mondragon, Oscar

Abstract not provided.

More Details

TYPE Presentation YEAR 2016

DOI OSTI

Using Rollback Avoidance to Mitigate Failures in Next-Generation Extreme-Scale Systems

Levy, Scott L.N.

High-performance computing (HPC) systems enable scientists to numerically model complex phenomena in many important physical systems. The next major milestone in the development of HPC systems is the construction of the rst supercomputer capable executing more than an exa op, 10¹⁸ oating point operations per second. On systems of this scale, failures will occur much more frequently than on current systems. As a result, resilience is a key obstacle to building next-generation extremescale systems. Coordinated checkpointing is currently the most widely-used mechanism for handling failures on HPC systems. Although coordinated checkpointing remains e ective on current systems, increasing the scale of today's systems to build next-generation systems will increase the cost of fault tolerance as more and more time is taken away from the application to protect against or recover from failure. Rollback avoidance techniques seek to mitigate the cost of checkpoint/restart by allowing an application to continue its execution rather than rolling back to an earlier checkpoint when failures occur. These techniqes include failure prediction and preventive migration, replicated computation, fault-tolerant algorithms, and softwarebased memory fault correction. In this thesis, we examine how rollback avoidance techniques can be used to address failures on extreme-scale systems. Using a combination of analytic modeling and simulation, we evaluate the potential impact of rollback avoidance on these systems. We then present a novel rollback avoidance technique that exploits similarities in application memory. Finally, we examine the feasibility of using this technique to protect against memory faults in kernel memory.

More Details

TYPE Thesis or Dissertation YEAR 2016

OSTI

On noise and the performance benefit of nonblocking collectives

International Journal of High Performance Computing Applications

Widener, Patrick; Levy, Scott L.N.; Ferreira, Kurt; Hoefler, Torsten

Relaxed synchronization offers the potential for maintaining application scalability, by allowing many processes to make independent progress when some processes suffer delays. Yet the benefits of this approach for important parallel workloads have not been investigated in detail. In this paper, we use a validated simulation approach to explore the noise-mitigation effects of idealized nonblocking collectives, in workloads where these collectives are a major contributor to total execution time. Although nonblocking collectives are unlikely to provide significant noise mitigation to applications in the low operating system noise environments expected in next-generation high-performance computing systems, we show that they can potentially improve application runtime with respect to other noise types.

More Details

TYPE Journal Article YEAR 2016

DOI OSTI Scopus