The plant resembles a pumpkin vine whose crop is high-performance computing. In place of pumpkins are microprocessor nodes, hooked together with vine-like networks.
This system can evolve almost organically as its caretakers graft on new nodes and prune off old ones. The patch yields sufficient computing power for many needs, though less than current state-of-the-art supercomputers. This "computer cluster," known as Cplant, is sown from commodity computers instead of being grown from the ground up as a monolithic supercomputer.
The approach represents a shift from the early days when "computers were scarce and people were plentiful," says Rob Armstrong of Distributed Applications Dept. 8980. By buying off-the-shelf units instead of tending one large mainframe, Rob says, "what you're really saving is human time." Cplant's research challenges include combining diverse operating systems, improving short-distance networks, and allocating computer power to problems suitable for highly parallel approaches.
The partitionable system is well-suited to tasks where several manufacturing design iterations, for instance, might run simultaneously, Rob says. "We don't envision running a huge code on the machine, but people can run many codes on pieces of the machine."
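The idea of running many codes on pieces of the machine amounts to carving the node pool into disjoint partitions. The sketch below illustrates that scheme with a minimal first-come, first-served allocator; the job names and node counts are hypothetical, not from the article.

```python
# Sketch of partitioned scheduling on a cluster: several moderate-sized
# jobs run side by side on disjoint slices of the node pool.

def allocate(total_nodes, jobs):
    """Assign each job a contiguous, disjoint block of nodes.

    jobs: list of (name, nodes_requested) tuples.
    Returns a dict mapping job name -> (first_node, last_node);
    jobs that no longer fit are simply skipped.
    """
    allocations = {}
    next_free = 0
    for name, requested in jobs:
        if next_free + requested <= total_nodes:
            allocations[name] = (next_free, next_free + requested - 1)
            next_free += requested
    return allocations

# Three hypothetical design iterations sharing a 96-node cluster at once.
placement = allocate(96, [("design_a", 32), ("design_b", 32), ("design_c", 32)])
```

Because each job owns its slice of nodes outright, the iterations run concurrently without contending for processors.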
Huge codes will be run on supercomputers being acquired through DOE's Accelerated Strategic Computing Initiative. Eventually the ASCI machines at the three DOE weapons labs will be tied together, along with the Cplant distributed computing system, into an extensive computational grid. Cplant also serves as a testbed for new technologies necessary to create this multi-site computational infrastructure.
Approach for next 10 years?
When Sandia's ASCI teraflops machine went on-line in 1997, it culminated 10 years of development of massively parallel computing, points out Art Hale (9224), Deputy Director of Computational/Computer Sciences and Math Center 9200.
"We've had essentially four generations of experience," he says. "We learned how to aggregate together hardware and architect software for large-scale systems, and we're applying that experience in the design of the computational plant."
Art sees the Cplant approach guiding Sandia's high-performance computing research and development in the next 10 years. "It adds complexity," he says about the Cplant concept, "but once it's managed, it removes barriers."
With Cplant's building blocks becoming available commercially, he adds, "we have in mind it would provide a very flexible computing fabric to meet Sandia's capacity needs." These building blocks include microprocessing nodes provided by relatively fast, affordable personal computers, emerging network-connection technology, and network-management software. "We couldn't have done this very well five years ago," Art says.
Other changes also spurred interest in developing cluster computing. Only a few companies remain interested in building supercomputers, and market share is shifting from providers of workstations and mainframes to personal computer manufacturers.
"The difference between a personal computer and a workstation has virtually disappeared," says Bill Camp, Director of Center 9200. Cplant began last year with the linking of microprocessors from Digital Equipment Corp. There are some 96 nodes in Albuquerque and 32 nodes in California. Currently, each node is a Miata personal computer, but the plant could use other computers.
Roughly equal numbers of researchers at both Sandia sites are working on the cluster, which is expected to operate as a single system so people at each location have computing access.
The configuration is designed to operate at a large scale and incorporates lessons from previous supercomputing efforts. For instance, connections monitor messages across multiple communication channels to show whether a node has hung or a hardware failure has deadlocked the system. Art expects the resulting configuration to be easier to manage and monitor, more reliable, and better able to dice problems into load-balanced segments for computing ease.
The plant is also flourishing through Sandia's foundation in distributed computing, he adds. That front "has evolved rapidly through growth of the Internet, motivated by a desire to pave cyberspace," he says.
"We wondered if we could even build one of these, and if so, how would it work," says Robert Clay (8980). "We absolutely proved you could."
The Cplant machine is designed with units of 16 microprocessors and one computer to manage them. To add more computing power, you�d simply add more units of 16-plus-one.
"The notion of scalability is implicit in the design," says Robert. "Being indefinitely scalable gives Sandia a lead."
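The 16-plus-one building block implies a simple relationship between the number of units and the cluster's size. A minimal sketch, taking only the unit sizes from the article (the six-unit figure for Albuquerque is an inferred illustration):

```python
COMPUTE_PER_UNIT = 16   # compute microprocessors per unit (from the article)
MANAGERS_PER_UNIT = 1   # one management computer per unit (from the article)

def cluster_size(units):
    """Total compute nodes and management computers for a given unit count."""
    return units * COMPUTE_PER_UNIT, units * MANAGERS_PER_UNIT

# The roughly 96-node Albuquerque installation would correspond to six units:
compute, managers = cluster_size(6)
```

Growing the machine never changes its structure, only the unit count, which is what makes the design scalable in principle.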
Cplant also benefits from improvements in networking capabilities. A wide-area network link was demonstrated last fall at SC97, the premier national conference on high-performance networking and computing. The link ran from the San Jose convention center to Livermore's National Transparent Optical Network, a high-speed data pipe, to Albuquerque and back. At the conference, the researchers demonstrated modeling applications on what Robert terms "sort of a supercluster."
Designing for DASE
Down the road, he anticipates Cplant will become "an extremely flexible, general-purpose computing platform" in which users can create their own computing environments. He is at work on a software architecture that will allow users to call up an operating system, tools, and other applications as needed.
One environment will resemble that used for massively parallel computing. Another will support engineering design and visualization, such as the Product Realization Environment, operated through a Windows interface in which objects carry out transactions for the user.
This architecture, which he calls DASE for Dynamically Adaptive Software Environment, will take a while to bring on line, but once in place it can help Sandia carry out its role in revolutionizing engineering and advancing computer modeling and simulation.
"It's a new way of looking at these resources," Robert says. Realizing this potential for large-scale computing may require separating functions of the physical network from the software operating system that runs applications on the user's desktop.
Gigabit a second and up
Computing speed drops when network distances increase. One question some researchers associated with Cplant are addressing is, "How can we take the tremendous speeds we can get on a chip and extend it a little further?" says Steve Gossage (4616).
Sandia's ASCI red machine (the teraflops) links 9,000 processors on one network, but the network can't be extended beyond a few meters, adds George Davidson (9215). Currently, Alpha computers can be linked over long distances using Ethernet, but this link is too slow for thousands of machines working on the same problem.
Over a short distance, system-area network products are becoming available that link machines within a room and ship data rapidly. Myrinet and ServerNet are both promising, early products addressing this level of networking.
Links between buildings over local area networks typically run at about 10 megabits per second. Cplant, however, will initially attempt to operate at a gigabit per second and scale up from there, says Steve. He is participating in designing a Virtual Interface Architecture (VIA), whose specifications have been set forth by a consortium of vendors.
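The gap between typical building-to-building links and Cplant's gigabit target is easy to quantify. The back-of-the-envelope estimate below ignores protocol overhead, and the 100 MB payload is a hypothetical example, not a figure from the article:

```python
def transfer_seconds(megabytes, megabits_per_second):
    """Ideal time to ship a payload at a given link rate (no overhead)."""
    return megabytes * 8 / megabits_per_second

# Shipping a hypothetical 100 MB dataset between buildings:
lan_time = transfer_seconds(100, 10)       # ~10 Mb/s local area network
cplant_time = transfer_seconds(100, 1000)  # 1 Gb/s Cplant target
```

At the gigabit rate the same transfer is a hundred times faster, which is the difference between a network that isolates nodes and one that lets them cooperate on a single problem.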
"What VIA's attempting to do is an operating system bypass," Steve says, "to decouple the operating system from the physical implementation of the system area network, for faster communications."
Sandia's role will be to examine, through modeling and simulation, how joining many computers can provide high performance.
Although it's challenging to write code that will interface well between the network and operating system, he adds, there is a potential for "tremendous freedom to pick components and plug them together." Already, Sandians in Center 9200 are adapting codes to a parallel environment so Cplant can begin to be used. Early uses of Cplant will include comprehensive weapons-safety assessments by Dave Carlson (12333) and co-workers. Cplant is ideally suited to this work, which entails running many moderate-sized simulations as part of a systematic approach to exploring safety margins.
John Maenchen (9515) and colleagues will also use Cplant to develop new approaches for weapons surety tests. The Advanced Hydrotest Facility under consideration by DOE would radio-graphically probe implosions to understand nuclear weapon physics and validate simulations.
"We're going to get the first results this summer," says John, who intends to quantitatively determine what can be learned from penetrating X-ray radiography probes of these explosive tests. He anticipates the initial Monte Carlo computations will require 22,000 CPU hours.
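The 22,000 CPU-hour estimate translates directly into wall-clock time once the work is spread across the cluster. A rough sketch, assuming perfect parallel efficiency (the node counts are illustrative, taken from the cluster sizes mentioned earlier):

```python
CPU_HOURS = 22_000  # initial Monte Carlo estimate from the article

def wall_clock_days(cpu_hours, nodes):
    """Ideal wall-clock days if the work divides evenly across the nodes."""
    return cpu_hours / nodes / 24

# On a single processor vs. the roughly 96-node Albuquerque cluster:
serial_days = wall_clock_days(CPU_HOURS, 1)    # about 916 days
cluster_days = wall_clock_days(CPU_HOURS, 96)  # under 10 days
```

Monte Carlo runs are close to this ideal in practice, since each sample can be computed independently of the others.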
Cplant, he adds, "is the architecture of the future." The prospect of quickening computations from weeks or months into hours or days should prompt people to tackle many complex problems that were once avoided, he predicts.
Besides those people already mentioned, Cplant developers include Ron Brightwell (9233), William Davidson (9622), Carl Diegert (9215), Doug Doerfler (2523), Lee Ann Fisk (9223), David Greenberg (9223), Michael Hannah (4418), Richard Hu (4616), John Laroco (8980), Mike Levenhagen (6533), Barney Maccabe (University of New Mexico), Luis Martinez (4616), Scott Miller (9223), John Naegle (4616), Tom Pratt (4616), Rolf E. Riesen (9223), John VanDyke (9223), Alan Williams (8950), Peter Wyckoff (8980), and former Sandians Pang Chen, Joe Durant, Ed Seidl, and David van Dresser.