Performance portability on heterogeneous high-performance computing (HPC) systems is a major challenge faced today by code developers: parallel code needs to execute correctly as well as with high performance on machines with different architectures, operating systems, and software libraries. The Finite Element Method (FEM) is a popular and flexible method for discretizing partial differential equations arising in a wide variety of scientific, engineering, and industry applications that require HPC. This paper presents some preliminary results pertaining to our development of a performance portable implementation of the FEM-based Albany code. Performance portability is achieved using the Kokkos library of Trilinos. We present performance results for two different physics simulations modules in Albany: the Aeras global atmosphere dynamical code and the FELIX land-ice solver. As a result, numerical experiments show that our single code implementation gives reasonable performance across two multi-core/many-core architectures: NVIDIA GPUs and multi-core CPUs.
Computational Science and Engineering (CSE) software can benefit substantially from an explicit focus on quality improvement. This is especially true as we face increased demands in both modeling and software complexities. At the same time, just desiring improved quality is not sufficient. We must work with the entities that provide CSE research teams with publication venues, funding, and professional recognition in order to increase incentives for improved software quality. In fact, software quality is precisely calibrated to the expectations, explicit and implicit, set by these entities. We will see broad improvements in sustainability and productivity only when publishers, funding agencies and employers raise their expectations for software quality. CSE software community leaders, those who are in a position to inform and influence these entities, have a unique opportunity to broadly and positively impact software quality by working to establish incentives that will spur creative and novel approaches to improve developer productivity and software sustainability.
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
The performance of a large-scale, production-quality science and engineering application (‘app’) is often dominated by a small subset of the code. Even within that subset, computational and data access patterns are often repeated, so that an even smaller portion can represent the performance-impacting features. If application developers, parallel computing experts, and computer architects can together identify this representative subset and then develop a small mini-application (‘miniapp’) that can capture these primary performance characteristics, then this miniapp can be used to both improve the performance of the app as well as provide a tool for co-design for the high-performance computing community. However, a critical question is whether a miniapp can effectively capture key performance behavior of an app. This study provides a comparison of an implicit finite element semiconductor device modeling app on unstructured meshes with an implicit finite element miniapp on unstructured meshes. The goal is to assess whether the miniapp is predictive of the performance of the app. Finally, single compute node performance will be compared, as well as scaling up to 16,000 cores. Results indicate that the miniapp can be reasonably predictive of the performance characteristics of the app for a single iteration of the solver on a single compute node.
Application resilience is a key challenge that must be ad-dressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically re-duce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the over-heads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
In this study, the scientific community relies on the peer review process for assuring the quality of published material, the goal of which is to build a body of work we can trust. Computational journals such as The ACM Transactions on Mathematical Software (TOMS) use this process for rigorously promoting the clarity and completeness of content, and citation of prior work. At the same time, it is unusual to independently confirm computational results.
Computational science and engineering application programs are typically large, complex, and dynamic, and are often constrained by distribution limitations. As a means of making tractable rapid explorations of scientific and engineering application programs in the context of new, emerging, and future computing architectures, a suite of "miniapps" has been created to serve as proxies for full scale applications. Each miniapp is designed to represent a key performance characteristic that does or is expected to significantly impact the runtime performance of an application program. In this paper we introduce a methodology for assessing the ability of these miniapps to effectively represent these performance issues. We applied this methodology to three miniapps, examining the linkage between them and an application they are intended to represent. Herein we evaluate the fidelity of that linkage. This work represents the initial steps required to begin to answer the question, "Under what conditions does a miniapp represent a key performance characteristic in a full app?"
The emergence of high-concurrency architectures offering unprecedented performance has brought many high-performance partial differential equation (PDE) discretization codes to the precipice of a major refactor. To help address this challenge a workshop titled "Algorithms and Abstractions for Assembly in PDE Codes" was held in the Computer Science Research Institute at Sandia National Laboratories on May 12th-14th, 2014. This document summarizes the goals of the workshop and the results of the presentations and subsequent discussions.