Publications Search

Demonstration of Model-Based Design for Digital Controller Using Formal Methods

Mayo, Jackson R.; Morris Wright, Karla V.; Aytac, Jon M.; Smith, Andrew M.; Armstrong, Robert C.; Hulette, Geoffrey C.; Lober, Randall R.

This report describes work originally performed in FY19 that assembled a workflow enabling formal verification of high-consequence digital controllers. The approach builds on an engineering analysis strategy using multiple abstraction levels (Model-Based Design) and performs exhaustive formal analysis of appropriate levels – here, state machines and C code – to assure always/never properties of digital logic that cannot be verified by testing alone. The operation of the workflow is illustrated using example models and code, including expected failures of verification when properties are violated.

TYPE SAND Report YEAR 2023

DOI OSTI
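
As a rough illustration of what exhaustive analysis of a "never" property on a state machine can look like, the sketch below explores every reachable state of a toy two-flag controller and returns a counterexample path if a forbidden state is reachable. The controller, its commands, and the names `check_never`, `good_controller`, and `faulty_controller` are hypothetical and are not taken from the report; the report's actual workflow and tools are not reproduced here.

```python
from collections import deque

def check_never(initial, transitions, bad):
    """Breadth-first search over all reachable states.

    Returns None if no reachable state satisfies `bad`, otherwise the
    shortest path of states leading to a violation.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if bad(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        for nxt in transitions(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None

# Toy controller state: (armed, fired). Hypothetical example only.
def good_controller(state):
    armed, fired = state
    yield (True, fired)    # "arm" command
    yield (False, False)   # "disarm" command also clears the output
    yield (armed, armed)   # "fire" command only takes effect when armed

def faulty_controller(state):
    armed, fired = state
    yield (True, fired)    # "arm" command
    yield (False, fired)   # seeded fault: "disarm" leaves the output set
    yield (armed, armed)   # "fire" command only takes effect when armed

def fired_while_disarmed(state):
    armed, fired = state
    return fired and not armed

ok = check_never((False, False), good_controller, fired_while_disarmed)
counterexample = check_never((False, False), faulty_controller, fired_while_disarmed)
```

Here `ok` is `None` because the property holds on every reachable state, while `counterexample` is a trace ending in the forbidden state. This mirrors the idea of an expected verification failure when a property is violated.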

Algorithmic Input Generation for More Effective Software Testing

Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022

Epifanovskaya, Laura; Meeson, Reginald; Mccormack, Christopher; Lee, Jinseo R.; Armstrong, Robert C.; Mayo, Jackson R.

It is impossible in practice to comprehensively test even small software programs due to the vastness of the reachable state space; however, modern cyber-physical systems such as aircraft require a high degree of confidence in software safety and reliability. Here we explore methods of generating test sets to effectively and efficiently explore the state space for a module based on the Traffic Collision Avoidance System (TCAS) used on commercial aircraft. A formal model of TCAS in the model-checking language NuSMV provides an output oracle. We compare test sets generated using various methods, including covering arrays, random, and a low-complexity input paradigm applied to 28 versions of the TCAS C program containing seeded errors. Faults are triggered by tests for all 28 programs using a combination of covering arrays and random input generation. Complexity-based inputs perform more efficiently than covering arrays, and can be paired with random input generation to create efficient and effective test sets. A random forest classifier identifies variable values that can be targeted to generate tests even more efficiently in future work, by combining a machine-learned fuzzing algorithm with more complex model oracles developed in model-based systems engineering (MBSE) software.

TYPE Conference Paper YEAR 2022

DOI OSTI Scopus
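
To make the notion of a covering array concrete, here is a minimal sketch of a greedy generator for a 2-way (pairwise) covering test set over small input domains. It is not the generator used in the paper; the function names and the greedy strategy are assumptions chosen for brevity, and the exhaustive candidate scan is only practical for tiny domains.

```python
from itertools import combinations, product

def pairs_of(test):
    """All (parameter, value) pairs that one test exercises together."""
    return {((i, test[i]), (j, test[j]))
            for i, j in combinations(range(len(test)), 2)}

def pairwise_tests(domains):
    """Greedily pick tests until every value pair of every two parameters is covered."""
    uncovered = set()
    for candidate in product(*domains):
        uncovered |= pairs_of(candidate)
    tests = []
    while uncovered:
        # Choose the candidate that covers the most still-uncovered pairs.
        best = max(product(*domains),
                   key=lambda t: len(pairs_of(t) & uncovered))
        tests.append(best)
        uncovered -= pairs_of(best)
    return tests

domains = [[0, 1], [0, 1], [0, 1]]   # three hypothetical boolean inputs
tests = pairwise_tests(domains)
```

The resulting set covers every pair of input values with fewer tests than the full Cartesian product, which is the property that makes covering arrays attractive as a complement to random inputs.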

Algorithmic Input Generation for More Effective Software Testing

Proceedings - 2022 IEEE 46th Annual Computers, Software, and Applications Conference, COMPSAC 2022

Epifanovskaya, Laura; Lee, Jinseo R.; Mccormack, Christopher; Meeson, Reginald; Armstrong, Robert C.; Mayo, Jackson R.

It is impossible in practice to comprehensively test even small software programs due to the vastness of the reachable state space; however, modern cyber-physical systems such as aircraft require a high degree of confidence in software safety and reliability. Here we explore methods of generating test sets to effectively and efficiently explore the state space for a module based on the Traffic Collision Avoidance System (TCAS) used on commercial aircraft. A formal model of TCAS in the model-checking language NuSMV provides an output oracle. We compare test sets generated using various methods, including covering arrays, random, and a low-complexity input paradigm applied to 28 versions of the TCAS C program containing seeded errors. Faults are triggered by tests for all 28 programs using a combination of covering arrays and random input generation. Complexity-based inputs perform more efficiently than covering arrays, and can be paired with random input generation to create efficient and effective test sets. A random forest classifier identifies variable values that can be targeted to generate tests even more efficiently in future work, by combining a machine-learned fuzzing algorithm with more complex model oracles developed in model-based systems engineering (MBSE) software.

TYPE Conference Presentation YEAR 2022

DOI OSTI Scopus

Asynchrony and Failure Masking via Pseudo-Local Process Recovery in MPI Stencil Applications

Kolla, Hemanth; Mayo, Jackson R.; Whitlock, Matthew J.; Teranishi, Keita; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Presentation YEAR 2021

DOI OSTI

Thoughts on a Cyber Threat Model for a High-Consequence System

Mayo, Jackson R.

Abstract not provided.

TYPE Other Report YEAR 2020

DOI OSTI

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.

TYPE Conference Proceeding YEAR 2020

OSTI Scopus
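
The task replay and task replication ideas in the abstract above can be sketched in a few lines. This is an illustration in Python, not the HPX API: the names `replay` and `replicate`, their signatures, and the choice of a majority vote among valid replicas are all assumptions of this sketch.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def replay(task, n, validate, *args):
    """Re-run `task` up to n times until `validate(result)` is true."""
    for _ in range(n):
        try:
            result = task(*args)
        except Exception:
            continue  # treat an exception as a failed attempt
        if validate(result):
            return result
    raise RuntimeError("no valid result after %d attempts" % n)

def replicate(task, n, validate, *args):
    """Run n copies of `task` concurrently and return the most common valid result.

    Results must be hashable so they can be counted (assumption of this sketch).
    """
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [pool.submit(task, *args) for _ in range(n)]
        valid = []
        for future in futures:
            try:
                result = future.result()
            except Exception:
                continue
            if validate(result):
                valid.append(result)
    if not valid:
        raise RuntimeError("no replica produced a valid result")
    return Counter(valid).most_common(1)[0][0]

# Hypothetical flaky task: returns a corrupted value on its first two calls.
calls = {"n": 0}
def flaky_double(x):
    calls["n"] += 1
    return x * 2 if calls["n"] >= 3 else -1

replayed = replay(flaky_double, 5, lambda r: r >= 0, 21)
replicated = replicate(lambda x: x + 1, 3, lambda r: True, 1)
```

The `validate` argument plays the role of the user-provided predicate (for example, a checksum) mentioned in the abstract.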

Towards Distributed Software Resilience in Asynchronous Many-Task Programming Models

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut

Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.

TYPE Conference Presentation YEAR 2020

DOI OSTI Scopus

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Grossman, Max; Sarkar, Vivek

Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.

TYPE Conference Presentation YEAR 2020

DOI OSTI Scopus

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth; Mayo, Jackson R.; Teranishi, Keita; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

TYPE Conference Presentation YEAR 2020

DOI OSTI Scopus

Improving Scalability of Silent-Error Resilience for Message-Passing Solvers via Local Recovery and Asynchrony

Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis

Kolla, Hemanth; Mayo, Jackson R.; Teranishi, Keita; Armstrong, Robert C.

Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.

TYPE Conference Paper YEAR 2020

OSTI Scopus

Integrating Inter-Node Communication with a Resilient Asynchronous Many-Task Runtime System

Paul, Sri R.; Hayashi, Akihiro; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Grossman, Max; Sarkar, Vivek

Abstract not provided.

TYPE Conference Paper YEAR 2020

OSTI

Implementing Software Resiliency in HPX for Extreme Scale Computing

Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut

The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With such large machines, the need to make applications resilient has become paramount. The benefits of adding resiliency to mission critical and scientific applications include the reduced cost of restarting a failed simulation, in terms of both time and power. Most current software-level implementations of resiliency use Coordinated Checkpoint and Restart (C/R). This technique generates a consistent global snapshot, also called a checkpoint. Generating snapshots involves global communication and coordination and is achieved by synchronizing all running processes. The generated checkpoint is then stored in some form of persistent storage. On failure detection, the runtime initiates a global rollback to the most recently saved checkpoint. This involves aborting all running processes, rolling them back to the previous state and restarting them.

TYPE Other Report YEAR 2020

DOI OSTI

Silent-Error Detection Local Recovery and Failure Masking in MPI-Based Solvers

Kolla, Hemanth; Mayo, Jackson R.; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Composing Asynchrony Communication and Resilience

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Bak, Seonmyeong; Whitlock, Matthew J.; Mayo, Jackson R.; Teranishi, Keita; Sarkar, Vivek; Grossman, Max

Abstract not provided.

TYPE Conference Poster YEAR 2020

OSTI

Physics-Based Checksums for Silent-Error Detection in PDE Solvers

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.

We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.

TYPE Conference Poster YEAR 2020

OSTI Scopus
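
A minimal sketch of the conserved-quantity idea follows, assuming a 1D periodic diffusion problem in which the sum of the field is conserved by a flux-form update. The solver, the per-step checkpoint, the tolerance, and all names are assumptions made for illustration; the paper's local, flux-based checksum updates between subdomains are simplified here to a single global sum.

```python
def diffuse_step(u, alpha=0.1):
    """One explicit flux-form diffusion step on a periodic 1D grid.

    Every face flux is added to one cell and subtracted from its
    neighbor, so sum(u) is conserved up to rounding.
    """
    n = len(u)
    # flux[i] is the flux through the face between cell i and cell i+1.
    flux = [alpha * (u[(i + 1) % n] - u[i]) for i in range(n)]
    return [u[i] + flux[i] - flux[i - 1] for i in range(n)]

def run_with_checksum(u0, steps, corrupt_at=None, tol=1e-9):
    """Advance `steps` steps, rolling back whenever the conserved sum drifts."""
    u = list(u0)
    checkpoint = list(u)
    expected_total = sum(u)   # the "physics-based checksum"
    rollbacks = 0
    step = 0
    while step < steps:
        u = diffuse_step(u)
        if step == corrupt_at:
            u[0] += 1000.0    # emulated silent corruption (hypothetical)
            corrupt_at = None # transient error: does not recur on retry
        if abs(sum(u) - expected_total) > tol:
            u = list(checkpoint)  # roll back and redo this step
            rollbacks += 1
            continue
        checkpoint = list(u)
        step += 1
    return u, rollbacks

u0 = [0.0, 1.0, 4.0, 2.0, 0.5, 3.0, 1.5, 0.0]
clean, clean_rollbacks = run_with_checksum(u0, 20)
recovered, recovered_rollbacks = run_with_checksum(u0, 20, corrupt_at=5)
```

In this toy setting the corrupted run detects the violation of conservation, rolls back once, and reproduces the clean result. Checkpointing every step is chosen only to keep the sketch short.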

Dishwashers of Armageddon: Verifying high consequence systems for Nuclear Weapons

Armstrong, Robert C.; Evans, Noah; Hulette, Geoffrey C.; Foulk, James W.; Aytac, Jon M.; Johnson-Freyd, Philip; Mayo, Jackson R.; Punnoose, Ratish J.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Physics-based mitigation of silent errors

Mayo, Jackson R.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Enabling Resilience in Asynchronous Many-Task Programming Models

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

Physics-Based Checksums for Silent-Error Detection in PDE Solvers

Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2019

DOI OSTI

Log-Correlated Large-Deviation Statistics Governing Huygens Fronts in Turbulence

Journal of Statistical Physics

Mayo, Jackson R.; Kerstein, Alan R.

Analyses have disagreed on whether the velocity u_T of bulk advancement of a Huygens front in turbulence vanishes or remains finite in the limit of vanishing local front propagation speed u. Here, a connection to the large-deviation statistics of log-correlated random processes enables a definitive determination of the correct small-u asymptotics. This result reconciles several theoretical and phenomenological perspectives with the conclusion that u_T remains finite for vanishing u, which implies a propagation anomaly akin to the energy-dissipation anomaly in the limit of vanishing viscosity. Various leading-order structural properties such as a novel u dependence of a bulk length scale associated with front geometry are predicted in this limit. The analysis involves a formal analogy to random advection of diffusive scalars.

TYPE Journal Article YEAR 2019

DOI OSTI Scopus

Programming Model Tradeoffs for Global vs Local Recovery: Algorithm Based Fault Tolerance

Kolla, Hemanth; Teranishi, Keita; Mayo, Jackson R.; Salloum, Maher; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Targeted modification of hardware trojans

Journal of Hardware and Systems Security (Online)

Hamlet, Jason; Mayo, Jackson R.; Kammler, Vivian

The use of untrusted design tools, components, and designers, coupled with untrusted device fabrication, introduces the possibility of malicious modifications being made to integrated circuits (ICs) during their design and fabrication. These modifications are known as hardware trojans. The widespread use of commercially purchased 3rd party intellectual property (3PIP) and commercial design tools extends even into trusted design flows. Unfortunately, due to the theoretical result that there is no program that can decide whether any other program will eventually halt, we know that the properties of a program, or circuit, cannot be known in advance of running it. While we can design a circuit to meet some functional specification and generate a simulation or test suite to obtain at least probabilistic confidence that the circuit implements the intended functionality, we cannot test a circuit for unintended functionality due to the combinatorially large state space. To address these concerns, we have developed a design-time method for automatically and systematically modifying portions of a design that exhibit characteristics of hardware trojans. After each modification, the functionality of the design is verified against a comprehensive simulation suite to ensure that the intended circuit functionality has not been changed. This approach can be applied to any digital circuit and does not rely on secret keys or obfuscation.

TYPE Journal Article YEAR 2019

DOI OSTI

Scalable Efficient Fault Tolerance in Asynchronous Many Task (AMT) Programming Models

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Mayo, Jackson R.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek; Bak, Seonmyeong

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Physics-Based Checksums for Silent-Error Detection in PDE Solvers

Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2019

OSTI

Enabling Resilience in Asynchronous Many-Task Programming Models

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Paul, Sri R.; Hayashi, Akihiro; Slattengren, Nicole L.; Kolla, Hemanth; Whitlock, Matthew J.; Bak, Seonmyeong; Teranishi, Keita; Mayo, Jackson R.; Sarkar, Vivek

Resilience is an imminent issue for next-generation platforms due to projected increases in soft/transient failures as part of the inherent trade-offs among performance, energy, and costs in system design. In this paper, we introduce a comprehensive approach to enabling application-level resilience in Asynchronous Many-Task (AMT) programming models with a focus on remedying Silent Data Corruption (SDC) that can often go undetected by the hardware and OS. Our approach makes it possible for the application programmer to declaratively express resilience attributes with minimal code changes, and to delegate the complexity of efficiently supporting resilience to our runtime system. We have created a prototype implementation of our approach as an extension to the Habanero C/C++ library (HClib), where different resilience techniques including task replay, task replication, algorithm-based fault tolerance (ABFT), and checkpointing are available. Our experimental results show that task replay incurs lower overhead than task replication when an appropriate error checking function is provided. Further, task replay matches the low overhead of ABFT. Our results also demonstrate the ability to combine different resilience schemes. To evaluate the effectiveness of our resilience mechanisms in the presence of errors, we injected synthetic errors at different error rates (1.0% and 10.0%) and found a modest increase in execution times. In summary, the results show that our approach supports efficient and scalable recovery, and that our approach can be used to influence the design of future AMT programming models and runtime systems that aim to integrate first-class support for user-level resilience.

TYPE Conference Poster YEAR 2019

DOI OSTI Scopus

Robust digital computation in the physical world

Cyber-Physical Systems Security

Mayo, Jackson R.; Armstrong, Robert C.; Hulette, Geoffrey C.; Salloum, Maher; Smith, Andrew M.

Modern digital hardware and software designs are increasingly complex but are themselves only idealizations of a real system that is instantiated in, and interacts with, an analog physical environment. Insights from physics, formal methods, and complex systems theory can aid in extending reliability and security measures from pure digital computation (itself a challenging problem) to the broader cyber-physical and out-of-nominal arena. Example applications to design and analysis of high-consequence controllers and extreme-scale scientific computing illustrate the interplay of physics and computation. In particular, we discuss the limitations of digital models in an analog world, the modeling and verification of out-of-nominal logic, and the resilience of computational physics simulation. A common theme is that robustness to failures and attacks is fostered by cyber-physical system designs that are constrained to possess inherent stability or smoothness. This chapter contains excerpts from previous publications by the authors.

TYPE Book YEAR 2018

DOI OSTI Scopus

Diversity for Microelectronics Lifecycle Security

Hamlet, Jason; Mayo, Jackson R.; Martin, Mitchell T.; Torres, David; Cruz, Jonathan W.

In this work we examine approaches for using implementation diversity to disrupt or disable hardware trojans. We explore a variety of general frameworks for building diverse variants of circuits in voting architectures, and examine the impact of these on attackers and defenders mathematically and empirically. This work is augmented by analysis of a new majority voting technique. We also describe several automated approaches for generating diverse variants of a circuit and empirically study the overheads associated with these. We then describe a general technique for targeting functional circuit modifications to hardware trojans, present several specific implementations of this technique, and study the impact that they have on trojanized benchmark circuits.

TYPE SAND Report YEAR 2018

DOI OSTI

Digital/Analog Cosimulation using CocoTB and Xyce

Smith, Andrew M.; Mayo, Jackson R.; Armstrong, Robert C.; Schiek, Richard; Sholander, Peter E.; Mei, Ting

In this article, we describe a prototype cosimulation framework using Xyce, GHDL and CocoTB that can be used to analyze digital hardware designs in out-of-nominal environments. We demonstrate current software methods and inspire future work via analysis of an open-source encryption core design. Note that this article is meant as a proof-of-concept to motivate integration of general cosimulation techniques with Xyce, an open-source circuit simulator.

TYPE SAND Report YEAR 2018

DOI OSTI

Analysis of Local Recovery Resilience Model for Asynchronous Many Task Parallel Programming Models

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

Abstract not provided.

TYPE Presentation YEAR 2018

OSTI

ASC CSSE Level 2 Milestone #6362: Resilient Asynchronous Many Task Programming Model

Teranishi, Keita; Kolla, Hemanth; Slattengren, Nicole L.; Whitlock, Matthew J.; Mayo, Jackson R.; Clay, Robert L.; Paul, Sri R.; Hayashi, Akihiro; Sarkar, Vivek

This report is an outcome of the ASC CSSE Level 2 Milestone 6362: Analysis of Resilient Asynchronous Many-Task (AMT) Programming Model. It comprises a summary and in-depth analysis of resilience schemes adapted to the AMT programming model. Herein, performance trade-offs of a resilient-AMT programming model are assessed through two approaches: (1) an analytical model realized by discrete event simulations and (2) empirical evaluation of benchmark programs representing regular and irregular workloads of explicit partial differential equation solvers. As part of this effort, an AMT execution simulator and a prototype resilient-AMT programming framework have been developed. The former permits us to hypothesize the performance behavior of a resilient-AMT model, and has undergone a verification and validation (V&V) process. The latter allows empirical evaluation of the performance of resilience schemes under emulated program failures and enabled the aforementioned V&V process. The outcome indicates that (1) resilience techniques implemented within an AMT framework allow efficient and scalable recovery under frequent failures, that (2) the abstraction of task and data instances in the AMT programming model enables readily usable Application Program Interfaces (APIs) for resilience, and that (3) this abstraction enables predicting the performance of resilient-AMT applications with a simple simulation infrastructure. This outcome will provide guidance for the design of the AMT programming model and runtime systems, user-level resilience support, and application development for ASC's next generation platforms (NGPs).

TYPE SAND Report YEAR 2018

DOI OSTI

Fault Tolerance in Asynchronous Many-Task Programming Models & Runtimes

Kolla, Hemanth; Teranishi, Keita; Slattengren, Nicole; Whitlock, Matthew J.; Mayo, Jackson R.; Armstrong, Robert C.; Paul, Sriraj; Hayashi, Akihiro; Sarkar, Vivek

Abstract not provided.

TYPE Conference Poster YEAR 2018

OSTI

Modeling and Analysis of the Impact of Diversity in Digital Circuits on Attackers

Hamlet, Jason; Mayo, Jackson R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Scalable Failure Masking for Stencil Computations using Ghost Region Expansion and Cell to Rank Remapping

SIAM Journal on Scientific Computing

Gamell, Marc; Teranishi, Keita; Kolla, Hemanth; Mayo, Jackson R.; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

In order to achieve exascale systems, application resilience needs to be addressed. Some programming models, such as task-DAG (directed acyclic graphs) architectures, currently embed resilience features whereas traditional SPMD (single program, multiple data) and message-passing models do not. Since a large part of the community's code base follows the latter models, it is still required to take advantage of application characteristics to minimize the overheads of fault tolerance. To that end, this paper explores how recovering from hard process/node failures in a local manner is a natural approach for certain applications to obtain resilience at lower costs in faulty environments. In particular, this paper targets enabling online, semitransparent local recovery for stencil computations on current leadership-class systems as well as presents programming support and scalable runtime mechanisms. Also described and demonstrated in this paper is the effect of failure masking, which allows the effective reduction of impact on total time to solution due to multiple failures. Furthermore, we discuss, implement, and evaluate ghost region expansion and cell-to-rank remapping to increase the probability of failure masking. To conclude, this paper shows the integration of all aforementioned mechanisms with the S3D combustion simulation through an experimental demonstration (using the Titan system) of the ability to tolerate high failure rates (i.e., node failures every five seconds) with low overhead while sustaining performance at large scales. In addition, this demonstration also displays the failure masking probability increase resulting from the combination of both ghost region expansion and cell-to-rank remapping.

TYPE Journal Article YEAR 2017

DOI OSTI

Modeling and simulating multiple failure masking enabled by local recovery for stencil-based applications at extreme scales

IEEE Transactions on Parallel and Distributed Systems

Gamell, Marc; Teranishi, Keita; Mayo, Jackson R.; Kolla, Hemanth; Heroux, Michael A.; Chen, Jacqueline H.; Parashar, Manish

Obtaining multi-process hard failure resilience at the application level is a key challenge that must be overcome before the promise of exascale can be fully realized. Previous work has shown that online global recovery can dramatically reduce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. If online recovery is performed in a local manner, further scalability is enabled, not only due to the intrinsically lower costs of recovering locally, but also due to derived effects when using some application types. In this paper we model one such effect, namely multiple failure masking, which manifests when running stencil parallel computations in an environment where failures are recovered locally. First, the delay propagation shape of one or multiple failures recovered locally is modeled to enable several analyses of the probability of different levels of failure masking under certain stencil application behaviors. Our results indicate that failure masking is an extremely desirable effect at scale, whose manifestation becomes more evident and beneficial as the machine size or the failure rate increases.

TYPE Journal Article YEAR 2017

DOI OSTI Scopus

Tools for Simple Yet Very High Consequence Controls

Armstrong, Robert C.; Foulk, James W.; Hulette, Geoffrey C.; Mayo, Jackson R.; Michnovicz, Jason; Aytac, Jon M.; Johnson-Freyd, Philip; Punnoose, Ratish J.; Smith, Andrew M.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Modeling and Analysis of the Impact of Diversity in Digital Circuits on Attackers

Hamlet, Jason; Mayo, Jackson R.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Using computational game theory to guide verification and security in hardware designs

Proceedings of the 2017 IEEE International Symposium on Hardware Oriented Security and Trust, HOST 2017

Smith, Andrew M.; Mayo, Jackson R.; Kammler, Vivian; Armstrong, Robert C.; Vorobeychik, Yevgeniy

Verifying that hardware design implementations adhere to specifications is a time-intensive and sometimes intractable problem due to the massive size of the system's state space. Formal methods techniques can be used to prove certain tractable specification properties; however, they are expensive and often require subject matter experts to develop and solve. Nonetheless, hardware verification is a critical process for ensuring that security and safety properties are met, and it encapsulates problems associated with trust and reliability. For complex designs where coverage of the entire state space is unattainable, prioritizing regions most vulnerable to security or reliability threats would allow efficient allocation of valuable verification resources. Stackelberg security games model interactions between a defender, whose goal is to assign resources to protect a set of targets, and an attacker, who aims to inflict maximum damage on the targets after first observing the defender's strategy. In equilibrium, the defender has an optimal security deployment strategy, given the attacker's best response. We apply this Stackelberg security framework to synthesized hardware implementations using the design's network structure and logic to inform defender valuations and verification costs. The defender's strategy in equilibrium is thus interpreted as a prioritization of the allocation of verification resources in the presence of an adversary. We demonstrate this technique on several open-source synthesized hardware designs.
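
The defender/attacker interaction described in this abstract can be illustrated with a minimal sketch. It assumes a simplified zero-sum game that is not taken from the paper: each target has a hypothetical value, the defender spreads a fixed coverage budget over the targets, and the attacker then strikes the target with the highest expected payoff, value times (1 - coverage). The function names, target values, and budget are illustrative assumptions.

```python
def coverage_for_level(values, level):
    """Coverage needed on each target so the attacker's payoff is at most `level`."""
    return [max(0.0, 1.0 - level / v) for v in values]


def stackelberg_coverage(values, budget, iterations=60):
    """Bisect on the attacker's best achievable payoff (zero-sum assumption).

    Assumes every value is positive and budget is smaller than the number
    of targets, so not everything can be fully covered.
    """
    lo, hi = 0.0, max(values)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(coverage_for_level(values, mid)) > budget:
            lo = mid  # this payoff level is unaffordable for the defender
        else:
            hi = mid  # affordable; try to push the attacker's payoff lower
    return coverage_for_level(values, hi), hi


# Hypothetical target values (e.g., damage if an unverified region is attacked).
values = [10.0, 5.0, 1.0]
coverage, attacker_payoff = stackelberg_coverage(values, budget=1.0)
print(coverage, attacker_payoff)
```

In this toy setting the defender covers only the two most valuable targets, which mirrors the abstract's reading of the equilibrium strategy as a prioritization of verification resources.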

TYPE Conference Poster YEAR 2017

DOI OSTI Scopus

Scientific Approaches to Cybersecurity: Design and Analysis of Complex Digital Systems

Armstrong, Robert C.; Mayo, Jackson R.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Compiling Statecharts into Why3

Armstrong, Robert C.; Aytac, Jon M.; Hulette, Geoffrey C.; Mayo, Jackson R.; Foulk, James W.

Abstract not provided.

TYPE Presentation YEAR 2017

OSTI

Programming Constructs for Transparent Silent-Error Mitigation in PDE Solvers

Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2017

OSTI

Robust Digital Computation in the Physical World

Mayo, Jackson R.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Using Computational Game Theory To Guide Verification and Security in Hardware Designs

Smith, Andrew M.; Mayo, Jackson R.; Kammler, Vivian; Armstrong, Robert C.; Vorobeychik, Yevgeniy

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

Characterizing short-term stability for Boolean networks over any distribution of transfer functions

Physical Review E

Seshadhri, C.; Smith, Andrew M.; Vorobeychik, Yevgeniy; Mayo, Jackson R.; Armstrong, Robert C.

We present a characterization of short-term stability of Kauffman's NK (random) Boolean networks under arbitrary distributions of transfer functions. Given such a Boolean network where each transfer function is drawn from the same distribution, we present a formula that determines whether short-term chaos (damage spreading) will happen. Our main technical tool, which enables the formal proof of this formula, is the Fourier analysis of Boolean functions, which describes such functions as multilinear polynomials over the inputs. Numerical simulations on mixtures of threshold functions and nested canalyzing functions demonstrate the formula's correctness.
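
As a small illustration of the Fourier view mentioned in this abstract, the sketch below expands a Boolean function on inputs in {-1, +1} into its multilinear polynomial by brute-force enumeration. It does not reproduce the paper's stability formula; the three-input majority function is chosen here only as a familiar example of a threshold function, and the helper names are assumptions.

```python
from itertools import combinations, product
from math import prod


def fourier_coefficients(f, n):
    """Return {subset: coefficient} for f: {-1, +1}^n -> {-1, +1}.

    The coefficient of a subset S is the average over all inputs x of
    f(x) * prod(x[i] for i in S), so f(x) equals the sum over S of
    coefficient(S) * prod(x[i] for i in S).
    """
    points = list(product((-1, 1), repeat=n))
    coeffs = {}
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            total = sum(f(x) * prod(x[i] for i in subset) for x in points)
            coeffs[subset] = total / len(points)
    return coeffs


def majority3(x):
    """Three-input majority vote, a simple threshold function."""
    return 1 if sum(x) > 0 else -1


coeffs = fourier_coefficients(majority3, 3)
for subset, value in coeffs.items():
    if value != 0:
        print(subset, value)
```

The nonzero coefficients correspond to the polynomial (x1 + x2 + x3)/2 - (x1 x2 x3)/2, and their squares sum to 1, as Parseval's identity requires for a ±1-valued function.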

TYPE Journal Article YEAR 2016

DOI OSTI Scopus

Enabling V&V for Engineered Complex Systems via Resilient Design

Mayo, Jackson R.

Abstract not provided.

TYPE Presentation YEAR 2016

OSTI

Robust finite difference stencils for solving the advection equation

Strazdins, Peter; Harding, Brendan; Lee, Brian; Mayo, Jackson R.; Ray, Jaideep; Armstrong, Robert C.

Abstract not provided.

TYPE Conference Poster YEAR 2016

OSTI

In-situ mitigation of silent data corruption in PDE solvers

FTXS 2016 - Proceedings of the ACM Workshop on Fault-Tolerance for HPC at Extreme Scale

Salloum, Maher; Mayo, Jackson R.; Armstrong, Robert C.

We present algorithmic techniques for parallel PDE solvers that leverage numerical smoothness properties of physics simulation to detect and correct silent data corruption within local computations. We initially model such silent hardware errors (which are of concern for extreme scale) via injected DRAM bit flips. Our mitigation approach generalizes previously developed "robust stencils" and uses modified linear algebra operations that spatially interpolate to replace large outlier values. Prototype implementations for 1D hyperbolic and 3D elliptic solvers, tested on up to 2048 cores, show that this error mitigation enables tolerating orders of magnitude higher bit-flip rates. The runtime overhead of the approach generally decreases with greater solver scale and complexity, becoming no more than a few percent in some cases. A key advantage is that silent data corruption can be handled transparently with data in cache, reducing the cost of false-positive detections compared to rollback approaches.
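
The general idea of using smoothness to replace a large outlier by spatial interpolation can be sketched as follows. This is not the authors' implementation: the median-based detector, the threshold, and the function names are assumptions made for illustration, and the sketch handles only a single isolated corrupted interior point in a smooth 1D field. The bit flip is injected into the IEEE 754 representation of one value.

```python
import math
import struct


def flip_bit(value, bit):
    """Flip one bit in the 64-bit IEEE 754 representation of a float."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def repair_outliers(u, threshold):
    """Replace interior points that deviate strongly from the local median.

    A flagged point is replaced by the mean of its two neighbors. Using
    the median for detection keeps the healthy neighbors of a corrupted
    point from being flagged themselves.
    """
    repaired = list(u)
    for i in range(1, len(u) - 1):
        local_median = sorted((u[i - 1], u[i], u[i + 1]))[1]
        if abs(u[i] - local_median) > threshold:
            repaired[i] = 0.5 * (u[i - 1] + u[i + 1])
    return repaired


n = 64
field = [math.sin(2 * math.pi * i / n) for i in range(n)]  # smooth test field
corrupted = list(field)
corrupted[20] = flip_bit(corrupted[20], 62)  # flip the top exponent bit
repaired = repair_outliers(corrupted, threshold=0.5)
print(corrupted[20], repaired[20], field[20])
```

Because the repair touches only values already held in memory, no rollback to an earlier state is needed, which is the kind of advantage over rollback approaches that the abstract points to.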

TYPE Conference Poster YEAR 2016

DOI OSTI Scopus