Higher Order Moment Tensors for Combustion Analysis: GPU Acceleration
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0% and 10.0%) and found a modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that it can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.
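The task replay scheme named in this abstract can be illustrated with a minimal Python sketch. It is not the HClib interface: `run_with_replay`, `make_flaky_sum`, and `plausible` are hypothetical names introduced here, and the injected corruption is a deterministic stand-in for a silent data error. The idea shown is only the general pattern: execute a task, apply a user-supplied error checking function to its result, and re-execute the task when the check fails.

```python
from typing import Callable, List, Tuple


class ReplayError(RuntimeError):
    """Raised when every attempt fails the error check."""


def run_with_replay(task: Callable[[], int],
                    check: Callable[[int], bool],
                    max_attempts: int = 3) -> Tuple[int, int]:
    """Run `task`; replay it until `check` accepts the result.

    Returns (result, attempts_used). Hypothetical helper, not an HClib call.
    """
    for attempt in range(1, max_attempts + 1):
        result = task()
        if check(result):
            return result, attempt
    raise ReplayError(f"task failed the error check {max_attempts} times")


def make_flaky_sum(values: List[int], corrupt_first_n: int) -> Callable[[], int]:
    """Build a summation task whose first `corrupt_first_n` runs are corrupted.

    The corruption emulates silent data corruption in a deterministic way.
    """
    calls = {"n": 0}

    def task() -> int:
        calls["n"] += 1
        total = sum(values)
        if calls["n"] <= corrupt_first_n:
            total += 1000  # emulated silent corruption of the result
        return total

    return task


values = [1, 2, 3, 4]


def plausible(result: int) -> bool:
    """Cheap bound check: a sum of non-negative inputs cannot exceed n * max."""
    return 0 <= result <= len(values) * max(values)


result, attempts = run_with_replay(make_flaky_sum(values, 1), plausible)
print(result, attempts)
```

The check here is a cheap plausibility bound rather than a recomputation, which is one reason replay can cost less than replication: the task runs a second time only when the check fails, whereas replication always pays for redundant executions.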
This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Resilient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT programming model are assessed through two approaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the performance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily usable Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastructure. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).
We describe our work to embed a Python interpreter in S3D, a highly scalable parallel direct numerical simulation reacting flow solver written in Fortran. Although S3D had no in-situ capability when we began, embedding the interpreter was surprisingly easy, and the result is an extremely flexible platform for conducting machine-learning experiments in-situ.
The SNL ATDM Data and Visualization work consolidates existing ATDM activities in scalable data management and visualization. Part of the responsibilities of the SNL ATDM Data and Visualization Project is the maintenance and development of visualization resources for ATDM applications on Exascale platforms. The ATDM Scalable Visualization project provides visualization and analysis required to satisfy the needs of the ASC/ATDM applications on next-generation, many-core platforms. This involves many activities including the re-engineering of visualization algorithms, services, and tools that enable ASC customers to carry out data analysis on ASC systems and ACES platforms. Current tools include scalable data analysis software released open source through ParaView, VTK, and Catalyst. We are also both leveraging and contributing to VTK-m, a many-core visualization library, to satisfy our visualization needs on advanced architectures. The scope of the Scalable Visualization under ATDM at SNL is R&D for the programming model and implementation of visualization code for ASC/ATDM projects and ASC/ATDM application support.
Energy, Environment, and Sustainability
Considering that simulations of turbulent combustion are computationally expensive, this chapter takes a decidedly different perspective, that of high-performance computing (HPC). The cost scaling arguments of non-reacting turbulence simulations are revisited and it is shown that the cost scaling for reacting flows is much more stringent for comparable conditions, making parallel computing and HPC indispensable. Hardware abstractions of typical parallel supercomputers are presented, which show that designing an efficient program requires exploiting both distributed-memory parallelism and shared-memory parallelism, i.e., hierarchical parallelism. Principles of efficient programming at various levels of parallelism are illustrated using archetypal code examples. The vast array of numerical methods, particularly schemes for spatial and temporal discretization, is examined in terms of the tradeoffs they present from an HPC perspective. Aspects of data analytics that invariably result from large feature-rich data sets generated by combustion simulations are covered briefly.
SIAM Journal on Scientific Computing
In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.
IEEE Transactions on Parallel and Distributed Systems
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner, further scalability is enabled, not only due to the intrinsically lower costs of recovering locally, but also due to derived effects that arise with some application types. In this paper we model one such effect, namely multiple failure masking, which manifests when running parallel stencil computations in an environment where failures are recovered locally. First, the shape of the delay propagation of one or multiple locally recovered failures is modeled to enable several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, whose manifestation becomes more evident and beneficial as the machine size or the failure rate increases.
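Failure masking can be illustrated with a toy Python model. This is a sketch under simplifying assumptions, not the model from the paper: the function `time_to_solution`, the 1D three-point neighbor dependency, the uniform per-iteration work, and the fixed recovery delay are all assumptions made here for illustration. Each rank starts an iteration once it and its two neighbors have finished the previous one, and a locally recovered failure adds a recovery delay to a single rank in a single iteration.

```python
from typing import Dict, Tuple


def time_to_solution(num_ranks: int,
                     num_iters: int,
                     work: float,
                     failures: Dict[Tuple[int, int], float]) -> float:
    """Finish time of a 1D stencil with locally recovered failures.

    `failures` maps (iteration, rank) to the recovery delay paid by that
    rank in that iteration. All names here are hypothetical.
    """
    finish = [0.0] * num_ranks
    for it in range(num_iters):
        # A rank can start once it and both neighbors finished the last step.
        ready = [
            max(finish[max(r - 1, 0)], finish[r], finish[min(r + 1, num_ranks - 1)])
            for r in range(num_ranks)
        ]
        finish = [
            ready[r] + work + failures.get((it, r), 0.0)
            for r in range(num_ranks)
        ]
    return max(finish)


baseline = time_to_solution(8, 10, 1.0, {})
one_failure = time_to_solution(8, 10, 1.0, {(0, 2): 5.0})
# Second failure hits a rank the first delay has not reached yet: masked.
masked = time_to_solution(8, 10, 1.0, {(0, 2): 5.0, (3, 6): 5.0})
# Second failure hits the rank that is already delayed: delays add up.
not_masked = time_to_solution(8, 10, 1.0, {(0, 2): 5.0, (3, 2): 5.0})
print(baseline, one_failure, masked, not_masked)
```

In this toy model the delay from a failure spreads by one rank per iteration, so a second failure on a rank that the first delay has not yet reached is absorbed into slack that rank would have spent waiting anyway. The total time matches the single-failure case, whereas two failures on the same rank accumulate.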