This report describes work originally performed in FY19 that assembled a workflow enabling formal verification of high-consequence digital controllers. The approach builds on an engineering analysis strategy using multiple abstraction levels (Model-Based Design) and performs exhaustive formal analysis of appropriate levels – here, state machines and C code – to assure always/never properties of digital logic that cannot be verified by testing alone. The operation of the workflow is illustrated using example models and code, including expected failures of verification when properties are violated.
It is impossible in practice to comprehensively test even small software programs due to the vastness of the reachable state space; however, modern cyber-physical systems such as aircraft require a high degree of confidence in software safety and reliability. Here we explore methods of generating test sets to effectively and efficiently explore the state space for a module based on the Traffic Collision Avoidance System (TCAS) used on commercial aircraft. A formal model of TCAS in the model-checking language NuSMV provides an output oracle. We compare test sets generated using various methods, including covering arrays, random, and a low-complexity input paradigm applied to 28 versions of the TCAS C program containing seeded errors. Faults are triggered by tests for all 28 programs using a combination of covering arrays and random input generation. Complexity-based inputs perform more efficiently than covering arrays, and can be paired with random input generation to create efficient and effective test sets. A random forest classifier identifies variable values that can be targeted to generate tests even more efficiently in future work, by combining a machine-learned fuzzing algorithm with more complex model oracles developed in model-based systems engineering (MBSE) software.
It is impossible in practice to comprehensively test even small software programs due to the vastness of the reachable state space; however, modern cyber-physical systems such as aircraft require a high degree of confidence in software safety and reliability. Here we explore methods of generating test sets to effectively and efficiently explore the state space for a module based on the Traffic Collision Avoidance System (TCAS) used on commercial aircraft. A formal model of TCAS in the model-checking language NuSMV provides an output oracle. We compare test sets generated using various methods, including covering arrays, random, and a low-complexity input paradigm applied to 28 versions of the TCAS C program containing seeded errors. Faults are triggered by tests for all 28 programs using a combination of covering arrays and random input generation. Complexity-based inputs perform more efficiently than covering arrays, and can be paired with random input generation to create efficient and effective test sets. A random forest classifier identifies variable values that can be targeted to generate tests even more efficiently in future work, by combining a machine-learned fuzzing algorithm with more complex model oracles developed in model-based systems engineering (MBSE) software.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Benefits of local recovery (restarting only a failed process or task) have been previously demonstrated in parallel solvers. Local recovery has a reduced impact on application performance due to masking of failure delays (for message-passing codes) or dynamic load balancing (for asynchronous many-task codes). In this paper, we implement MPI-process-local checkpointing and recovery of data (as an extension of the Fenix library) in combination with an existing method for local detection of silent errors in partial-differential-equation solvers, to show a path for incorporating lightweight silent-error resilience. In addition, we demonstrate how asynchrony introduced by maximizing computation-communication overlap can halt the propagation of delays. For a prototype stencil solver (including an iterative-solver-like variant) with injected memory bit flips, results show greatly reduced overhead under weak scaling compared to global recovery, and high failure-masking efficiency. The approach is expected to be generalizable to other MPI-based solvers.
Proceedings of ExaMPI 2020: Exascale MPI Workshop, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Achieving fault tolerance is one of the significant challenges of exascale computing due to projected increases in soft/transient failures. While past work on software-based resilience techniques typically focused on traditional bulk-synchronous parallel programming models, we believe that Asynchronous Many-Task (AMT) programming models are better suited to enabling resiliency since they provide explicit abstractions of data and tasks which contribute to increased asynchrony and latency tolerance. In this paper, we extend our past work on enabling application-level resilience in single node AMT programs by integrating the capability to perform asynchronous MPI communication, thereby enabling resiliency across multiple nodes. We also enable resilience against fail-stop errors where our runtime will manage all re-execution of tasks and communication without user intervention. Our results show that we are able to add communication operations to resilient programs with low overhead, by offloading communication to dedicated communication workers and also recover from fail-stop errors transparently, thereby enhancing productivity.
Proceedings of FTXS 2020: Fault Tolerance for HPC at eXtreme Scale, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Kaiser, Hartmut
Exceptions and errors occurring within mission critical applications due to hardware failures have a high cost. With the emerging Next Generation Platforms (NGPs), the rate of hardware failures will likely increase. Therefore, designing our applications to be resilient is a critical concern in order to retain the reliability of results while meeting the constraints on power budgets. In this paper, we discuss software resilience in AMTs at both local and distributed scale. We choose HPX to prototype our resiliency designs. We implement two resiliency APIs that we expose to the application developers, namely task replication and task replay. Task replication repeats a task n-times and executes them asynchronously. Task replay reschedules a task up to n-times until a valid output is returned. Furthermore, we expose algorithm based fault tolerance (ABFT) using user provided predicates (e.g., checksums) to validate the returned results. We benchmark the resiliency scheme for both synthetic and real world applications at local and distributed scale and show that most of the added execution time arises from the replay, replication or data movement of the tasks and not the boilerplate code added to achieve resilience.
Gupta, Nikunj; Mayo, Jackson R.; Lemoine, Adrian S.; Hartmut, Kaiser
The DOE Office of Science Exascale Computing Project (ECP) outlines the next milestones in the supercomputing domain. The target computing systems under the project will deliver 10x performance while keeping the power budget under 30 megawatts. With such large machines, the need to make applications resilient has become paramount. The benefits of adding resiliency to mission critical and scientific applications, includes the reduced cost of restarting the failed simulation both in terms of time and power. Most of the current implementation of resiliency at the software level makes use of a Coordinated Checkpoint and Restart (C/R). This technique of resiliency generates a consistent global snapshot, also called a checkpoint. Generating snapshots involves global communication and coordination and is achieved by synchronizing all running processes. The generated checkpoint is then stored in some form of persistent storage. On failure detection, the runtime initiates a global rollback to the most recent previously saved checkpoint. This involves aborting all running processes, rolling them back to the previous state and restarting them.
We discuss techniques for efficient local detection of silent data corruption in parallel scientific computations, leveraging physical quantities such as momentum and energy that may be conserved by discretized PDEs. The conserved quantities are analogous to “algorithm-based fault tolerance” checksums for linear algebra but, due to their physical foundation, are applicable to both linear and nonlinear equations and have efficient local updates based on fluxes between subdomains. These physics-based checksums enable precise intermittent detection of errors and recovery by rollback to a checkpoint, with very low overhead when errors are rare. We present applications to both explicit hyperbolic and iterative elliptic (unstructured finite-element) solvers with injected memory bit flips.
Analyses have disagreed on whether the velocity uT of bulk advancement of a Huygens front in turbulence vanishes or remains finite in the limit of vanishing local front propagation speed u. Here, a connection to the large-deviation statistics of log-correlated random processes enables a definitive determination of the correct small-u asymptotics. This result reconciles several theoretical and phenomenological perspectives with the conclusion that uT remains finite for vanishing u, which implies a propagation anomaly akin to the energy-dissipation anomaly in the limit of vanishing viscosity. Various leading-order structural properties such as a novel u dependence of a bulk length scale associated with front geometry are predicted in this limit. The analysis involves a formal analogy to random advection of diffusive scalars.
The use of untrusted design tools, components, and designers, coupled with untrusted device fabrication, introduces the possibility of malicious modifications being made to integrated circuits (ICs) during their design and fabrication. These modifications are known as hardware trojans. The widespread use of commercially purchased 3rd party intellectual property (3PIP) and commercial design tools extends even into trusted design flows. Unfortunately, due to the theoretical result that there is no program that can decide whether any other program will eventually halt, we know that the properties of a program, or circuit, cannot be known in advance of running it. While we can design a circuit to meet some functional specification and generate a simulation or test suite to obtain at least probabilistic confidence that the circuit implements the intended functionality, we cannot test a circuit for unintended functionality due to the combinatorially large state space. To address these concerns, we have developed a design-time method for automatically and systematically modifying portions of a design that exhibit characteristics of hardware trojans. After each modification, the functionality of the design is verified against a comprehensive simulation suite to ensure that the intended circuit functionality has not been changed. This approach can be applied to any digital circuit and does not rely on secret keys or obfuscation.