Sandia LabNews

Red Storm upgrade boosts Sandia supercomputer to #2 in world

Sandia’s Thunderbird Linux cluster ranks #6 in Top500 supercomputing list

A $15 million upgrade to Sandia’s Red Storm computer has increased its peak speed from 41.5 to 124.4 teraflops in a computing terrain in which a single teraflop was a big deal only six years ago.

The machine, built by Cray Inc., is now rated second fastest in the world, with a Linpack speed of 101.4 teraflops. The widely recognized Linpack test measures a supercomputer’s speed on a standard computing problem: solving a dense system of linear equations.
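The ratio of Linpack speed to peak speed gives a rough sense of how much of the machine’s theoretical capacity the benchmark run actually reached. A minimal Python sketch of that arithmetic follows, using only the figures quoted above; the function and variable names are illustrative and do not come from Sandia or Cray.

```python
def linpack_efficiency(linpack_tflops, peak_tflops):
    """Fraction of theoretical peak speed achieved on the Linpack benchmark."""
    return linpack_tflops / peak_tflops

# Figures quoted in the article, in teraflops.
OLD_PEAK = 41.5
NEW_PEAK = 124.4
LINPACK = 101.4

upgrade_factor = NEW_PEAK / OLD_PEAK                 # roughly 3.0
efficiency = linpack_efficiency(LINPACK, NEW_PEAK)   # roughly 0.815

print(f"Peak speed grew by a factor of {upgrade_factor:.2f}")
print(f"Linpack reached {efficiency:.1%} of peak")
```

On these numbers the upgrade roughly tripled peak speed, and the Linpack run reached a little over four-fifths of the new peak.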

In peak speed, Red Storm remains well behind BlueGene/L at Lawrence Livermore National Laboratory, but, “in terms of scalability, Red Storm is the best in the world,” says Bill Camp (1400), director of Sandia’s Computation, Computers, Information, and Math Center.

Scalability refers to a supercomputer’s computational efficiency as the number of processors on a job is increased. “You want to use more processors to get large jobs done more quickly,” says Bill, “but if the computer doesn’t scale well you can lose much of that speedup.” Red Storm loses very little efficiency on large numbers of processors.
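One common way to put a number on scalability is parallel efficiency: the speedup over a single processor divided by the number of processors used. The sketch below assumes that definition. The timings are hypothetical values chosen for illustration, not measurements from Red Storm or any other machine.

```python
def speedup(t_one, t_p):
    """Speedup of a job that takes t_one seconds on 1 processor and t_p on p."""
    return t_one / t_p

def parallel_efficiency(t_one, t_p, p):
    """Speedup divided by processor count; 1.0 means perfect scaling."""
    return speedup(t_one, t_p) / p

# Hypothetical wall-clock times in seconds, keyed by processor count.
# These are made-up illustration values, NOT Red Storm measurements.
timings = {1: 1000.0, 10: 105.0, 100: 12.5, 1000: 2.0}

t_one = timings[1]
for p, t_p in timings.items():
    eff = parallel_efficiency(t_one, t_p, p)
    print(f"{p:>5} processors: speedup {speedup(t_one, t_p):7.1f}, efficiency {eff:.0%}")
```

In this made-up example the job on 1,000 processors runs 500 times faster rather than 1,000 times faster, so half of the potential speedup is lost. A machine that scales well keeps the efficiency figure close to 100 percent as the processor count grows.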

“The Cray XT3 supercomputers now dominating the highest end of computing worldwide are based upon Sandia’s Red Storm,” says Bill, who together with Sandia colleague Jim Tomkins (1420) led the design of the machine. “Scientists love it because they can do bigger science more quickly on it than any other computer in existence, except for molecular dynamics studies on BlueGene. Otherwise, it’s the best thing since night baseball.”

“The machine’s also a computational workhorse. It gets the job done,” says Sandia researcher Steve Attaway (1534), winner of several national computing awards. He runs large engineering simulations on the machine.

Red Storm, designed under NNSA’s Advanced Simulation & Computing program, became the basis for the Cray XT3™ massively parallel processor supercomputer that has been installed at supercomputing centers around the world.

Purchasers of this design include Oak Ridge National Laboratory, which will create an even bigger supercomputer than Red Storm based on the same design; Lawrence Berkeley Lab; Pittsburgh Supercomputing Center (the largest National Science Foundation site); the US Army; the United Kingdom’s Atomic Weapons Establishment program; the national computing centers in Finland, Switzerland, and the UK; and other US and allied government sites.

Thrifty in its use of power

Red Storm is Sandia’s largest high-performance computer, but it is thrifty in its use of power. It uses 2.2 megawatts; by comparison, IBM Purple, another highly capable NNSA platform, requires 4.5 megawatts. This means that comparatively less of Red Storm’s energy is converted to useless heat.

Red Storm also takes up a relatively small area — about 3,500 square feet.

Its Linpack test demonstrated high reliability, repeatedly running for nine hours on more than 26,000 processor cores without a failure.

The machine was created in less than three years from concept to customer shipment. It was relatively inexpensive to develop and build — $77.5 million including engineering and design costs — and is used for large scientific and technical problems.

Sandia developed the architectural specifications of the machine and did much of the software development. “The hardware at Cray was built to meet our specifications,” says Jim Tomkins.

The upgrade included the addition of a fifth row of cabinets and upgrading the entire system with dual-core AMD Opteron™ processors, resulting in a supercomputer with more than 26,000 processor cores. Dual-core technology fits two processor cores on a single die, doubling processing capacity with minimal impact on power consumption and temperature levels.

Why is Red Storm so efficient? In part, says Sandia researcher Robert Ballance (4328), because its operating system is based on minimalist software, termed a lightweight kernel, which carries just enough functionality to load the job, put it on the network, and stop it. Any other software is job-specific; thus, each compute node (at which two chips are located) in effect lugs no useless software on its back.

The original technology was pioneered by Sandia on its ASCI Red machine, built by Intel Corporation, which became the world’s first terascale supercomputer.

Sandia’s Thunderbird Linux cluster ranks #6 in Top500 supercomputing list

Sandia’s 8,960-processor Thunderbird Linux cluster, developed in collaboration with Dell Inc. and Cisco, maintained its sixth position in the Top500 Supercomputers list by achieving an improved overall performance of 53 teraflops, an increase of more than 18 percent over last year’s performance testing.

“This achievement represents a long-term investment to meet our mission to transform engineering and provide greater processing capacity,” says Computing Systems Senior Manager John Zepper (4320).

Sandia researchers use Thunderbird to perform a broad range of weapons simulations, including atomistic scale-to-device modeling of radiation effects on semiconductor electronics, assessing weapon-response safety in extreme thermal and impact environments, and quantifying uncertainties in weapon performance.

The level of detail being modeled in these assessments would not be practical without the new level of scalable capacity that Thunderbird provides.

With its 4,480 commodity compute servers linked with an Infiniband message-passing interconnect, Thunderbird is the largest cluster of its type in the world.
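The Thunderbird figures quoted in this section imply a few per-unit numbers. The sketch below works them out, assuming the quoted values are exact and that the 53-teraflop result is the Top500 benchmark figure. The variable names are illustrative only.

```python
# Figures quoted in the article.
PROCESSORS = 8960
SERVERS = 4480
TOP500_TFLOPS = 53.0
MIN_INCREASE = 0.18   # "more than 18 percent" over last year

processors_per_server = PROCESSORS // SERVERS                # 2
gflops_per_processor = TOP500_TFLOPS * 1000 / PROCESSORS     # roughly 5.9

# Because the increase was MORE than 18 percent, last year's result
# must have been below this bound (roughly 44.9 teraflops).
previous_upper_bound = TOP500_TFLOPS / (1 + MIN_INCREASE)

print(f"{processors_per_server} processors per server")
print(f"{gflops_per_processor:.1f} gigaflops per processor")
print(f"Last year's result was below {previous_upper_bound:.1f} teraflops")
```

The previous-year figure is only a bound, since the article gives the improvement as "more than 18 percent" rather than an exact percentage.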

The improvements in Thunderbird’s performance were propelled by its switch to OpenFabrics Enterprise Distribution (OFED) and OpenMPI. Together, they form a Linux-based open-source software stack qualified by the OpenFabrics Alliance to operate with multi-vendor Infiniband hardware and to implement the open-source Message Passing Interface (MPI) protocol.

The achievement was a joint venture involving Sandia and Cisco. Cisco, an active developer in the OFED and OpenMPI projects, had its engineers on site at Sandia to assist with monitoring, diagnosing, and fine-tuning Thunderbird’s performance.

The new software-stack environment allows for more memory per node to be available for parallel jobs at runtime, as well as an increase in reliability and scalability of users’ jobs. Sandia’s extensive use of the new software ironed out bugs and tweaked performance — improvements that benefit the entire high-performance-computing community.

Infiniband is widely regarded as one of the most attractive commodity interconnect technologies because of its high bandwidth, low latency, and low cost. This is the first time Infiniband, OpenMPI, and OFED have been used in such a massive configuration as Thunderbird.