Sandia LabNews

Project aims to create an OS for exascale computing environment


A Sandia-led team in the stratosphere of high-performance supercomputing has been funded by DOE’s Office of Advanced Scientific Computing Research to design an operating system suitable for handling the million trillion mathematical operations per second of an envisioned exascale computer, and then to create prototypes of several of its programming components.

Called the XPRESS project (eXascale Programming Environment and System Software), the effort to achieve a major milestone in supercomputing is funded at $2.3 million a year for three years and engages a team that includes the universities of Indiana, North Carolina, Oregon, and Houston; Louisiana State University; and Oak Ridge and Lawrence Berkeley national laboratories. Work began Sept. 1.

The project’s goal is to devise an innovative operating system and associated components that will enable exascale computing by 2020, making contributions along the way to improve current petaflop (a million billion operations a second) systems, says program lead Ron Brightwell (1423).

Scientists in industry and in research institutions believe that exascale computing speeds will enable more accurate simulation of the most complex nuclear, chem/bio, and atmospheric reactions, but enormous preparation is necessary to improve supercomputing so that it can achieve such speeds.

Current software based on 20-year-old technologies

“System software on today’s parallel-processing computers is largely based on ideas and technologies developed more than 20 years ago, before processors with hundreds of computing cores were even imagined,” says Ron. “The XPRESS project aims to provide a system software foundation designed to maximize the performance and scalability of future large-scale parallel computers as well as enable a new approach to the science and engineering applications that run on them.”

Current supercomputers operate through a method called parallel processing, where individual chips work out parts of a problem and contribute results in an order controlled by an overall program, much like the output of instruments in an orchestra is controlled by a conductor. Chip speed itself thus plays a less important role than the ability to synchronize individual results, since more chips can be added for greater traction in solving harder problems.

But merely adding more chips to a supercomputer “orchestra” to solve extremely difficult problems in a reasonable amount of time can make the orchestra unwieldy, the conductor’s job more difficult and, in the end, impossible. 

In addition to programming difficulties, excess heat generation wastes energy, adding more components increases the chances that some will fail, and designing convenient information storage locations so memories are immediately available to processors is not a trivial problem.

The conundrum is, in short, that an exascale computer using current technologies could have the unwanted complexity of a Rube Goldberg contraption that uses the energy of a small city and requires constant upkeep.

To reduce these problems and start researchers on the road to solutions, the multi-institution XPRESS effort will address specific factors known to degrade fast supercomputer performance. These include “starvation” — a shortage of concurrent work at particular locations, which hinders both efficiency and scalability because keeping the machine busy can require still more parallelism. Information delays, known as latency effects, need to be reduced through a combination of better locality management, reduction of superfluous messaging, and the hiding of delays behind useful computation. Overhead — the bookkeeping cost of managing parallel work — limits the fineness of granularity that can be effectively exploited, which in turn reduces scalability. Waiting — because the same memory is needed by several processors — also causes slowdowns.
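The last of these factors, waiting on shared memory, can be illustrated with a small thread-based Python sketch (an illustration only, not XPRESS code). Threads that all update one shared counter must take turns on a lock, while threads that accumulate locally and merge their results once at the end avoid most of that waiting:

```python
import threading

N = 100_000
WORKERS = 4

# Contended approach: every thread updates the same shared counter,
# so the threads serialize on one lock -- an analogue of many
# processors waiting on the same memory location.
shared = {"count": 0}
lock = threading.Lock()

def contended(n):
    for _ in range(n):
        with lock:
            shared["count"] += 1

# Reduced-contention approach: each thread accumulates privately and
# the partial results are merged exactly once at the end.
def local_then_merge(n, out, idx):
    local = 0
    for _ in range(n):
        local += 1
    out[idx] = local

threads = [threading.Thread(target=contended, args=(N,))
           for _ in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

partials = [0] * WORKERS
threads = [threading.Thread(target=local_then_merge, args=(N, partials, i))
           for i in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Both approaches produce the same answer; they differ in how much
# time the threads spend waiting on each other.
assert shared["count"] == sum(partials) == N * WORKERS
```

The structural point — combine results as late and as rarely as possible — is one of the locality-management ideas that systems like the one XPRESS envisions must apply at vastly larger scale.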

The team brings together researchers with expertise not only in operating systems, says Ron, but also in other system software capabilities, such as performance analysis and dynamic resource management, that are crucial to supporting the features needed to manage the increasing complexity of future exascale systems.