Publications Search

Virtualization has the potential to dramatically increase the usability and reliability of high performance computing (HPC) systems. However, this potential will remain unrealized unless overheads can be minimized. This is particularly challenging on large scale machines that run carefully crafted HPC OSes supporting tightlycoupled, parallel applications. In this paper, we show how careful use of hardware and VMM features enables the virtualization of a large-scale HPC system, specifically a Cray XT4 machine, with .5% overhead on key HPC applications, microbenchmarks, and guests at scales of up to 4096 nodes. We describe three techniques essential for achieving such low overhead: passthrough I/O, workload-sensitive selection of paging mechanisms, and carefully controlled preemption. These techniques are forms of symbiotic virtualization, an approach on which we elaborate. Copyright © 2011 ACM.

More Details

TYPE Conference YEAR 2011

Scopus OSTI

Evaluating the Viability of Process Replication Reliability for Exascale Systems

Ferreira, Kurt; Stearley, Jon S.; Laros, James H.; Oldfield, Ron; Pedretti, Kevin P.; Brightwell, Ronald B.; Bridges, Patrick G.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Energy Based Performance Tuning for Large Scale High Performance Computing Systems

Pedretti, Kevin P.; Kelly, Suzanne M.; Vaughan, Courtenay T.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Application-driven analysis of two generations of capability computing platforms :

Rajan, Mahesh; Vaughan, Courtenay T.; Doerfler, Douglas W.; Lin, Paul T.; Pedretti, Kevin P.

Abstract not provided.

More Details

TYPE Conference YEAR 2011

OSTI

Investigating the impact of the cielo cray XE6 architecture on scientific application codes

Vaughan, Courtenay T.; Rajan, Mahesh; Barrett, Richard F.; Doerfler, Douglas W.; Pedretti, Kevin P.

Cielo, a Cray XE6, is the Department of Energy NNSA Advanced Simulation and Computing (ASC) campaign's newest capability machine. Rated at 1.37 PFLOPS, it consists of 8,944 dual-socket oct-core AMD Magny-Cours compute nodes, linked using Cray's Gemini interconnect. Its primary mission objective is to enable a suite of the ASC applications implemented using MPI to scale to tens of thousands of cores. Cielo is an evolutionary improvement to a successful architecture previously available to many of our codes, thus enabling a basis for understanding the capabilities of this new architecture. Using three codes strategically important to the ASC campaign, and supplemented with some micro-benchmarks that expose the fundamental capabilities of the XE6, we report on the performance characteristics and capabilities of Cielo.

More Details

TYPE Conference YEAR 2010

OSTI

Opportunities for leveraging OS virtualization in high-end supercomputing

Pedretti, Kevin P.; Bridges, Patrick G.

This paper examines potential motivations for incorporating virtualization support in the system software stacks of high-end capability supercomputers. We advocate that this will increase the flexibility of these platforms significantly and enable new capabilities that are not possible with current fixed software stacks. Our results indicate that compute, virtual memory, and I/O virtualization overheads are low and can be further mitigated by utilizing well-known techniques such as large paging and VMM bypass. Furthermore, since the addition of virtualization support does not affect the performance of applications using the traditional native environment, there is essentially no disadvantage to its addition.

More Details

TYPE Conference YEAR 2010

OSTI

Palacios and kitten: New high performance operating systems for scalable virtualized and native supercomputing

Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010

Lange, John; Pedretti, Kevin P.; Hudson, Trammell; Dinda, Peter; Cui, Zheng; Xia, Lei; Bridges, Patrick; Gocke, Andy; Jaconette, Steven; Levenhagen, Michael; Brightwell, Ronald B.

Palacios is a new open-source VMM under development at Northwestern University and the University of New Mexico that enables applications executing in a virtualized environment to achieve scalable high performance on large machines. Palacios functions as a modularized extension to Kitten, a high performance operating system being developed at Sandia National Laboratories to support large-scale supercomputing applications. Together, Palacios and Kitten provide a thin layer over the hardware to support full-featured virtualized environments alongside Kitten's lightweight native environment. Palacios supports existing, unmodified applications and operating systems by using the hardware virtualization technologies in recent AMD and Intel processors. Additionally, Palacios leverages Kitten's simple memory management scheme to enable low-overhead pass-through of native devices to a virtualized environment. We describe the design, implementation, and integration of Palacios and Kitten. Our benchmarks show that Palacios provides near native (within 5%), scalable performance for virtualized environments running important parallel applications. This new architecture provides an incremental path for applications to use supercomputers, running specialized lightweight host operating systems, that is not significantly performance-compromised. © 2010 IEEE.

More Details

TYPE Conference YEAR 2010

Scopus OSTI

rMPI : increasing fault resiliency in a message-passing environment

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron; Laros, James H.; Pedretti, Kevin P.; Stearley, Jon S.; Brightwell, Ronald B.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

Virtualizing a large scale supercomputer with minimal overhead

Pedretti, Kevin P.

Abstract not provided.

More Details

TYPE Conference YEAR 2010

OSTI

HPC application fault-tolerance using transparent redundant computation

Ferreira, Kurt; Riesen, Rolf; Oldfield, Ron; Brightwell, Ronald B.; Laros, James H.; Pedretti, Kevin P.

As the core count of HPC machines continue to grow in size, issues such as fault tolerance and reliability are becoming limiting factors for application scalability. Current techniques to ensure progress across faults, for example coordinated checkpoint-restart, are unsuitable for machines of this scale due to their predicted high overheads. In this study, we present the design and implementation of a novel system for ensuring reliability which uses transparent, rank-level, redundant computation. Using this system, we show the overheads involved in redundant computation for a number of real-world HPC applications. Additionally, we relate the communication characteristics of an application to the overheads observed.

More Details

TYPE Conference YEAR 2009

OSTI