The Facilities, Operations and User Support (FOUS) program is responsible for operating and maintaining the computing systems procured by the Advanced Simulation and Computing (ASC) program, and for delivering additional computing-related services to Defense Program customers located across the Nuclear Weapons Complex. Sandia has developed a robust User Support capability that provides services to analysts, tool developers, major code groups, and computer science researchers alike.
Because major computing resources are procured periodically and are not duplicated at each NNSA laboratory, a highly reliable, dedicated Wide Area Network connects the computing environments at Sandia (both NM and CA), Los Alamos, and Lawrence Livermore national laboratories. The FOUS program maintains local high-performance networks that connect computing and storage resources in the various security environments needed by our customers, and provides support for remote access and job submission to platforms located at other laboratories. These interconnects require constant observation and analysis because even minor changes or error conditions can drastically alter the performance of data transfer between the sites.
Facilities, Network, and Power
All of the resources comprising the high-performance computing environment require a building and supporting cooling and power infrastructure. At Sandia, we are taking new approaches to energy conservation that inform the design of new facilities, as well as the innovative use of existing cooling equipment and power distribution systems. Sandia has been recognized for several groundbreaking initiatives in the area of cooling and power conservation. Our newest collaboration will provide access to over 2 MW of solar photovoltaic power in partnership with our Alternative Energy research programs. Prior efforts have saved millions of gallons of water and reduced dependence on refrigerated cooling, lowering our energy bill. These innovations will be leveraged into new facilities in the coming years.
System and Environment Administration and Operations
The Operations area is where the daily activities of running and managing computing resources deliver critical support to customers both near and far. Although we do not generate the results of computational simulation, we are stewards of that information and work to protect it and control access to it through various access control mechanisms and workflow processes. In addition, monitoring of the environment and of the specific computing platforms raises early warnings that provide time for manual intervention. The Systems Operations Center calls in administrators or technicians to address problems ranging from power and cooling to computer or disk failures, or it may determine that networking support is required.
While system administrators are often engaged to help isolate and correct user-detected errors, the first responders to problem situations are the people engaged in User Support. The services provided range from documentation of systems to training on new platforms and on new tools such as debuggers or optimizers. Every code and simulation is a slightly different instance of a general environment, and errors may arise in any one of several layers of that environment, from the hardware level, to the communications interconnect, to the logic of the code or the interpretations provided by the compilers. Most large simulations run for days to weeks and create thousands of files. Managing this complexity takes a thorough understanding of the codes, the file systems, and the limitations of the individual computing platform.
Common Computing Environment
As mentioned above, major computing resources are procured periodically and are not duplicated at each NNSA laboratory. As a result, common computing tools and services are required to meet user needs. Sandia works with Los Alamos and Lawrence Livermore national laboratories to provide these services.