Sandia LabNews

Mitigating silent computer hardware errors


Sandia researchers focus on enhancement of quality, performance of computer calculations

MITIGATING COMPUTING ERRORS — Rob Armstrong, Maher Salloum (both 8956) and Jackson Mayo (8953) work on mitigating silent errors in computer hardware. (Photo by Randy Wong)

Understanding and pinpointing errors in scientific computing helps keep simulations trustworthy and accurate at an affordable cost, according to ongoing research at Sandia.

The work, done in support of Sandia’s Advanced Simulation and Computing (ASC) program, focuses on mitigating silent errors in computer hardware. These errors arise when the hardware develops an incorrect state due to some physical upset, such as a cosmic ray striking a piece of silicon and flipping a bit, says project leader Jackson Mayo (8953).
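To make the bit-flip scenario concrete, here is a minimal Python sketch of what a single flipped bit can do to one stored number. It is an illustration only, not code from the Sandia project: the helper name `flip_bit` and the sample value are assumptions made for this example, and the value is assumed to be stored as a standard 64-bit IEEE 754 double.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit; bit 63 is the sign bit.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    # Reinterpret the double as an unsigned 64-bit integer, flip, convert back.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


temperature = 300.0                    # assumed sample value
low_bit = flip_bit(temperature, 0)     # lowest mantissa bit: a tiny change
high_bit = flip_bit(temperature, 62)   # top exponent bit: a drastic change
sign_bit = flip_bit(temperature, 63)   # sign bit: the value becomes -300.0

print(low_bit, high_bit, sign_bit)
```

The same physical upset can therefore be nearly harmless or can wreck a result, depending on which bit it hits. In every case the stored value is still a valid floating-point number, which is part of why such an error can go unnoticed.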

“Sandia’s mission requires very large scale computations that have to be carefully performed to ensure accuracy and trust,” Jackson says. “What we are simulating is important, whether it be for nuclear weapons, for simulations related to climate change, or for other fairly high-stakes computations. Not just getting an answer, but being sure we get the right answer, matters to people making those decisions.”

Jackson and team members Rob Armstrong and Maher Salloum (both 8956) look specifically at ways to use the characteristics of ASC applications, typically continuum physics simulations, more efficiently, so that resilience against silent errors is intrinsic to the computation. The researchers take an application-targeted approach, using the properties of what is being computed to check for nonsensical behaviors and achieve a reliable result.

“Some of these errors may be detected or corrected within the hardware automatically, but a silent error is one that doesn’t get detected or corrected that way, so it would actually appear to be normal,” Jackson says. “The application would not know that anything was wrong unless it did its own check or other mechanism to ensure that the answer is right.”
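The idea of an application doing its own check can be sketched with a toy continuum-physics problem. The Python example below is an assumption-laden illustration, not the team's algorithm: it steps a one-dimensional heat-diffusion model with periodic ends, then tests two properties the true result must have (values stay within the previous range, and total heat is conserved). All function names, the tolerance, and the fault injector are hypothetical.

```python
def diffuse_step(u, r):
    """One explicit finite-difference step of 1D heat diffusion, periodic ends.

    For 0 < r <= 0.5 each new value is a weighted average of old neighbors,
    so results stay within [min(u), max(u)] and the sum is unchanged.
    """
    n = len(u)
    return [u[i] + r * (u[i - 1] - 2.0 * u[i] + u[(i + 1) % n]) for i in range(n)]


def looks_physical(old, new, tol=1e-9):
    """Hypothetical check: bounds preserved and total heat conserved."""
    lo, hi = min(old), max(old)
    scale = max(abs(lo), abs(hi), 1.0)
    # A NaN fails both comparisons, so it is rejected here as well.
    in_bounds = all(lo - tol * scale <= x <= hi + tol * scale for x in new)
    conserved = abs(sum(new) - sum(old)) <= tol * scale * len(old)
    return in_bounds and conserved


def checked_step(u, r, corrupt=None, max_retries=3):
    """Advance one step, recomputing if the result fails the physics check.

    `corrupt` is a hypothetical fault injector applied to the first attempt only.
    Returns the accepted state and the number of recomputations needed.
    """
    for attempt in range(max_retries):
        new = diffuse_step(u, r)
        if corrupt is not None and attempt == 0:
            new = corrupt(new)
        if looks_physical(u, new):
            return new, attempt
    raise RuntimeError("step failed the consistency check repeatedly")


def inject(values):
    """Simulate a silent error by overwriting one cell with a wild value."""
    bad = list(values)
    bad[5] = 1.0e30
    return bad


u0 = [300.0] * 8
u0[3] = 400.0  # one warm cell in an otherwise uniform bar

clean, retries_clean = checked_step(u0, 0.25)
fixed, retries_fault = checked_step(u0, 0.25, corrupt=inject)
print(retries_clean, retries_fault, fixed == clean)
```

A check like this only catches results that violate the chosen properties. A flip in a low-order bit could still slip through within the tolerance, so the choice of properties and tolerances is a design decision for each application.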

Integrating algorithms

Maher develops silent error detection and correction algorithms and integrates them in the software frameworks used at Sandia and at other DOE institutions.

“There are a lot of smart ideas to treat silent errors but they are restricted to a few small applications,” Maher says. “However, unlike such academic-style research, what we are developing aims to have a large impact on a wide variety of applications and code frameworks.”

The algorithms used at Sandia must detect and correct errors efficiently. “Developing fast mathematical algorithms for error detection and correction has been the most challenging aspect of this work,” Maher says, “especially while at the same time meeting the software engineering needs such that the algorithms are generalizable and maintainable in large software frameworks used at Sandia and in the DOE.”

Co-design helps mitigate errors

Part of the solution for mitigating silent errors in computer hardware is a concept called co-design, Jackson says. Co-design combines hardware and software design in an iterative process, so advances in software can guide the development of hardware.

“With broad and efficient silent-error mitigation, our goal is to contribute software techniques for extreme-scale simulations that may help under a variety of future co-design scenarios,” Jackson says.

The work can support co-design by mitigating errors that might be unpredictable in extreme-scale architectures and widening design choices for those architectures. It also provides a diagnostic capability to detect silent errors, he says.

“The choice of future hardware will be influenced and optimized based on what kind of software we can produce,” he says. “If we show that in software we can handle certain types of errors that would otherwise be unacceptable, that would otherwise corrupt our calculations, if we can handle those in the software, then the hardware doesn’t have as stringent a requirement on its reliability.”

Jackson says more efficient hardware and software will save money and provide better reliability and performance for the user. The software techniques may also be useful for improving cybersecurity, because deliberate tampering with data by an attacker may be detectable in the same way as an accidental hardware error.