Digital systems in an out-of-nominal environment (e.g., one causing hardware bit flips) may not be expected to function correctly in all respects but may be required to fail safely. We present an approach for understanding and verifying a system’s out-of-nominal behavior as an abstraction of nominal behavior that preserves designated critical safety requirements. Because abstraction and refinement are already widely used for improved tractability in formal design and proof techniques, this additional way of viewing an abstraction can potentially verify a system’s out-of-nominal safety with little additional work. We illustrate the approach with a simple model of a turnstile controller with possible logic faults (formalized in the temporal logic of actions and NuSMV), noting how design choices can be guided by the desired out-of-nominal abstraction. Principles of robustness in complex systems (specifically, Boolean networks) are found to be compatible with the formal abstraction approach. This work indicates a direction for broader use of formal methods in safety-critical systems.
Application resilience is a key challenge that has to be addressed to realize the exascale vision. Online recovery, even when it involves all processes, can dramatically reduce the overhead of failures as compared to the more traditional approach where the job is terminated and restarted from the last checkpoint. In this paper we explore how local recovery can be used for certain classes of applications to further reduce overheads due to resilience. Specifically we develop programming support and scalable runtime mechanisms to enable online and transparent local recovery for stencil-based parallel applications on current leadership class systems. We also show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution. We integrate these mechanisms with the S3D combustion simulation, and experimentally demonstrate (using the Titan Cray-XK7 system at ORNL) the ability to tolerate high failure rates (i.e., node failures every 5 seconds) with low overhead while sustaining performance, at scales up to 262144 cores.
Application resilience is a key challenge that must be ad-dressed in order to realize the exascale vision. Previous work has shown that online recovery, even when done in a global manner (i.e., involving all processes), can dramatically re-duce the overhead of failures when compared to the more traditional approach of terminating the job and restarting it from the last stored checkpoint. In this paper we suggest going one step further, and explore how local recovery can be used for certain classes of applications to reduce the over-heads due to failures. Specifically we study the feasibility of local recovery for stencil-based parallel applications and we show how multiple independent failures can be masked to effectively reduce the impact on the total time to solution.
Current programming languages and programming models make it easy to create software and hardware systems that fulfill an intended function but also leave such systems open to unintended function and vulnerabilities. Software engineering and code hygiene may make systems incrementally safer, but do not produce the wholesale change necessary for secure systems from the outset. Yet there exists an approach with impressive results: We cite recent examples showing that formal methods, coupled with formally informed digital design, have produced objectively more robust code even beyond the properties directly proven. Though discovery of zero-day vulnerabilities is almost always a surprise and powerful tools like semantic fuzzers can cover a larger search space of vulnerabilities than a developer can conceive of, formal models seem to produce robustness of a higher qualitative order than traditionally developed digital systems. Because the claim is necessarily a qualitative one, we illustrate similar results with an idealized programming language in the form of Boolean networks where we have control of parameters related to stability and adaptability. We argue that verifiability with formal methods is an instance of broader design constraints that promote robustness. We draw analogies to real-world programming models and languages that can be mathematically reasoned about in contrast to ones that are essentially undecidable.
Several tensor eigenpair definitions have been put forth in the past decade, but these can all be unified under generalized tensor eigenpair framework, introduced by Chang, Pearson, and Zhang [J. Math. Anal. Appl., 350 (2009), pp. 416-422]. Given mth-order, n-dimensional realvalued symmetric tensors A and B, the goal is to find λ ε ℝ and x ε ℝn, x ≠= 0 such that Axm-1 = λBxm-1. Different choices for B yield different versions of the tensor eigenvalue problem. We present our generalized eigenproblem adaptive power (GEAP) method for solving the problem, which is an extension of the shifted symmetric higher-order power method (SS-HOPM) for finding Z-eigenpairs. A major drawback of SS-HOPM is that its performance depended on choosing an appropriate shift, but our GEAP method also includes an adaptive method for choosing the shift automatically.