Monte Carlo (MC) sampling is a common method for randomly sampling a range of scenarios. The associated error converges at a predictable rate of $1/\sqrt{N}$, where $N$ is the sample size, so that quadrupling the sample size halves the error. The method is often used in global sensitivity analysis, which computes sensitivity indices that measure the fractional contributions of uncertain model inputs to the total output variance. In this study, several models are used to observe the rate of decay of the MC error in the estimates of the conditional variance, the total output variance, and the global sensitivity indices. The purpose is to examine the rate of convergence of the error in existing specialized, albeit MC-based, sampling methods for estimating the sensitivity indices. It was found that the conditional variances and sensitivity indices all follow the $1/\sqrt{N}$ convergence rate. Future work will test the convergence of observables from more complex models, such as ignition time in combustion.
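The $1/\sqrt{N}$ rate can be checked numerically with a minimal sketch. The test integrand $E[U^2]$ with $U$ uniform on $(0, 1)$, the sample sizes, the number of repeats, and the function names `mc_mean` and `rms_error` are all assumptions chosen for illustration; they are not the models or estimators used in the study.

```python
import math
import random


def mc_mean(n, rng):
    """Plain MC estimate of E[U^2] for U ~ Uniform(0, 1); the exact value is 1/3."""
    return sum(rng.random() ** 2 for _ in range(n)) / n


def rms_error(n, repeats, rng):
    """Root-mean-square error of the n-sample estimate over independent repeats."""
    exact = 1.0 / 3.0
    squared = [(mc_mean(n, rng) - exact) ** 2 for _ in range(repeats)]
    return math.sqrt(sum(squared) / repeats)


rng = random.Random(0)                 # fixed seed so the run is reproducible
err_n = rms_error(250, 400, rng)       # error with N samples
err_4n = rms_error(1000, 400, rng)     # error with 4N samples
ratio = err_n / err_4n                 # expected to be close to 2
print(f"RMS error N=250: {err_n:.5f}, N=1000: {err_4n:.5f}, ratio: {ratio:.2f}")
```

Averaging the squared error over many independent repeats is a design choice: a single pair of runs is too noisy to show the halving reliably.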
Demonstrate algorithm-based resilience to silent data corruption (SDC) and hard faults in a task-based domain-decomposition preconditioner for elliptic PDEs.
Explore the scalability of a resilient task-based domain-decomposition preconditioner for elliptic PDEs. Use selective reliability to study the impact of different levels of simulated SDC and hard faults. Explore the interplay between application resilience and the server-client programming model.
We discuss algorithm-based resilience to silent data corruption (SDC) in a task-based domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm exploits a reformulation of the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to SDC. The implementation is based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Scalability tests run on up to ~51K cores show a parallel efficiency greater than 90%. We use a 2D elliptic PDE and a fault model based on random single bit-flips to demonstrate the resilience of the application to synthetically injected SDC. We discuss two fault scenarios: one based on the corruption of all data of a target task, and the other involving the corruption of a single data point. We show that for our application, given the test problem considered, a four-fold increase in the number of faults raises the overhead needed to overcome them by only two percentage points, from 7% to 9%. We then discuss potential savings in energy consumption via dynamic voltage/frequency scaling, and its interplay with fault rates and application overhead.
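A random single bit-flip fault model of the kind described above can be sketched as follows. This is an illustrative sketch only: it assumes task data are 64-bit IEEE-754 doubles, and the helper names `flip_bit`, `corrupt_single_point`, and `corrupt_all` are hypothetical, not the injector used in the work.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant) of the 64-bit IEEE-754 encoding of value."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in [0, 64)")
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def corrupt_single_point(data, rng):
    """Scenario with one corrupted data point: flip a random bit in one random entry."""
    out = list(data)
    idx = rng.randrange(len(out))
    out[idx] = flip_bit(out[idx], rng.randrange(64))
    return out


def corrupt_all(data, rng):
    """Scenario with all task data corrupted: flip a random bit in every entry."""
    return [flip_bit(x, rng.randrange(64)) for x in data]


rng = random.Random(1)                      # fixed seed for reproducibility
task_data = [1.0, 2.0, 3.0, 4.0]            # stand-in for the data of one task
one_fault = corrupt_single_point(task_data, rng)
all_faults = corrupt_all(task_data, rng)
print(one_fault)
print(all_faults)
```

The effect of a flip depends strongly on its position: a flip in the low mantissa bits is nearly invisible, while a flip in the exponent or sign bit changes the value drastically, and some flips produce infinity or NaN.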
We present a resilient domain-decomposition preconditioner for partial differential equations (PDEs). The algorithm reformulates the PDE as a sampling problem, followed by a solution update through data manipulation that is resilient to both soft and hard faults. We discuss an implementation based on a server-client model where all state information is held by the servers, while clients are designed solely as computational units. Servers are assumed to be “sandboxed”, while no assumption is made on the reliability of the clients. We explore the scalability of the algorithm up to ∼12k cores, build an SST/macro skeleton to extrapolate to ∼50k cores, and show the resilience under simulated hard and soft faults for a 2D linear Poisson equation.