IEEE Transactions on Parallel and Distributed Systems

Unat, Didem; Dubey, Anshu; Hoefler, Torsten; Shalf, John B.; Abraham, Mark; Bianco, Mauro; Chamberlain, Bradford L.; Cledat, Romain; Edwards, Harold C.; Finkel, Hal; Fuerlinger, Karl; Hannig, Frank; Jeannot, Emmanuel; Kamil, Amir; Keasler, Jeff; Kelly, Paul H.J.; Leung, Vitus J.; Ltaief, Hatem; Maruyama, Naoya; Newburn, Chris J.; Pericas, Miquel

The cost of data movement has always been an important concern in high-performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the form of libraries, data structures, languages and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.

TYPE Journal Article YEAR 2017

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita T.; Mayo, Jackson M.; Kolla, Hemanth K.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner, further scalability is enabled, not only due to the intrinsically lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running stencil parallel computations in an environment where failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, whose manifestation is more evident and beneficial as the machine size or the failure rate increases.

TYPE Journal Article YEAR 2017

IEEE Spectrum

Swiler, Laura P.; Ray, Jaideep R.; Ebeida, Mohamed S.; Huang, Maoyi; Hou, Zhangshuan; Bao, Jie; Ren, Huiying

We present the development of a parallel Markov Chain Monte Carlo (MCMC) method called SAChES, Scalable Adaptive Chain-Ensemble Sampling. This capability is targeted to Bayesian calibration of computationally expensive simulation models. SAChES involves a hybrid of two methods: Differential Evolution Monte Carlo followed by Adaptive Metropolis. Both methods involve parallel chains. Differential evolution allows one to explore high-dimensional parameter spaces using loosely coupled (i.e., largely asynchronous) chains. Loose coupling allows the use of large chain ensembles, with far more chains than the number of parameters to explore. This reduces per-chain sampling burden and enables high-dimensional inversions and the use of computationally expensive forward models. The large number of chains can also ameliorate the impact of silent errors, which may affect only a few chains. The chain ensemble can also be sampled to provide an initial condition when an aberrant chain is re-spawned. Adaptive Metropolis takes the best points from the differential evolution and efficiently homes in on the posterior density. The multitude of chains in SAChES is leveraged to (1) enable efficient exploration of the parameter space; and (2) ensure robustness to silent errors, which may be unavoidable in extreme-scale computational platforms of the future. This report outlines SAChES, describes four papers that are the result of the project, and discusses some additional results.

TYPE SAND Report YEAR 2017

Extreme-Value Statistics Reveal Rare Failure-Critical Defects in Additive Manufacturing

Advanced Engineering Materials

Journal of Geophysical Research: Solid Earth

Ziegler, Abra E.; Balch, Robert; Knox, Hunter A.; Van Wijk, Jolante; Draelos, Timothy J.; Peterson, Matthew G.

Abstract not provided.

TYPE Conference Poster YEAR 2017
