We describe our work to embed a Python interpreter in S3D, a highly scalable parallel direct numerical simulation reacting flow solver written in Fortran. Although S3D had no in-situ capability when we began, embedding the interpreter was surprisingly easy, and the result is an extremely flexible platform for conducting machine-learning experiments in-situ.
The SNL ATDM Data and Visualization work consolidates existing ATDM activities in scalable data management and visualization. Part of the responsibilities of the SNL ATDM Data and Visualization Project is the maintenance and development of visualization resources for ATDM applications on Exascale platforms. The ATDM Scalable Visualization project provides visualization and analysis required to satisfy the needs of the ASC/ATDM applications on next-generation, many-core platforms. This involves many activities including the re-engineering of visualization algorithms, services, and tools that enable ASC customers to carry out data analysis on ASC systems and ACES platforms. Current tools include scalable data analysis software released open source through ParaView, VTK, and Catalyst. We are also both leveraging and contributing to VTK-m, a many-core visualization library, to satisfy our visualization needs on advanced architectures. The scope of the Scalable Visualization under ATDM at SNL is R&D for the programming model and implementation of visualization code for ASC/ATDM projects and ASC/ATDM application support.
Considering that simulations of turbulent combustion are computationally expensive, this chapter takes a decidedly different perspective, that of high-performance computing (HPC). The cost scaling arguments of non-reacting turbulence simulations are revisited and it is shown that the cost scaling for reacting flows is much more stringent for comparable conditions, making parallel computing and HPC indispensable. Hardware abstractions of typical parallel supercomputers are presented which show that for design of an efficient and optimal program, it is essential to exploit both distributed memory parallelism and shared-memory parallelism, i.e. hierarchical parallelism. Principles of efficient programming at various levels of parallelism are illustrated using archetypal code examples. The vast array of numerical methods, particularly schemes for spatial and temporal discretization, are examined in terms of tradeoffs they present from an HPC perspective. Aspects of data analytics that invariably result from large feature-rich data sets generated by combustion simulations are covered briefly.
In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.
Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner further scalability is enabled, not only due to the intrinsic lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, that manifests when running Stencil parallel computations on an environment when failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain Stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale which manifestation is more evident and beneficial as the machine size or the failure rate increase.
This report is an outcome of the ASC ATDM Level 2 Milestone 6015: Asynchronous Many-Task Software Stack Demonstration. It comprises a summary and in depth analysis of DARMA and a DARMA-compliant Asynchronous Many-Task (AMT) runtime software stack. Herein performance and productivity of the over- all approach are assessed on benchmarks and proxy applications representative of the Sandia ATDM applications. As part of the effort to assess the perceived strengths and weaknesses of AMT models compared to more traditional methods, experiments were performed on ATS-1 (Advanced Technology Systems) test bed machines and Trinity. In addition to productivity and performance assessments, this report includes findings on the generality of DARMAs backend API as well as findings on interoperability with node- level and network-level system libraries. Together, this information provides a clear understanding of the strengths and limitations of the DARMA approach in the context of Sandias ATDM codes, to guide our future research and development in this area.
This report is a sequel to [PB16], in which we provided a first progress report on research and development towards a scalable, asynchronous many-task, in situ statistical analysis engine using the Legion runtime system. This earlier work included a prototype implementation of a proposed solution, using a proxy mini-application as a surrogate for a full-scale scientific simulation code. The first scalability studies were conducted with the above on modestly-sized experimental clusters. In contrast, in the current work we have integrated our in situ analysis engines with a full-size scientific application (S3D, using the Legion-SPMD model), and have conducted nu- merical tests on the largest computational platform currently available for DOE science ap- plications. We also provide details regarding the design and development of a light-weight asynchronous collectives library. We describe how this library is utilized within our SPMD- Legion S3D workflow, and compare the data aggregation technique deployed herein to the approach taken within our previous work.
As we look ahead to next generation high performance computing platforms, the placement and movement of data is becoming the key-limiting factor on both performance and energy efficiency. Furthermore, the increased quantities of data the systems are capable of generating, in conjunction with the insufficient rate of improvements in the supporting I/0 infrastructure, is forcing applications away from the off-line post-processing of data towards techniques based on in ,situ analysis and visualization. Together, these challenges are shaping how we will both design and develop effective, performant and energy-efficient software. In particular, the challenges highlight the need for data and data-centric operations to be fundamental in the reasoning about, and optimization of, scientific workflows on extreme-scale architectures.