How a social network proved to be the missing link in building the world’s fastest supercomputer.

A lot of effort goes into building supercomputers, and even more goes into bringing them online. With many stakeholders working at different paces, getting everyone aligned is critical to success.
El Capitan, which would go on to be the world’s fastest supercomputer in 2024, was no exception. Sandia, Lawrence Livermore and Los Alamos national laboratories were tasked with porting their modeling and simulation codes while Advanced Micro Devices supplied cutting-edge accelerated processing units and Hewlett Packard Enterprise integrated the machine at Lawrence Livermore. Pulling it off required bringing the right people together at the right time throughout the procurement.
“The realization of Exascale Computing at the NNSA has been an agency priority for many years due to the incredible impact that the capabilities offer the nation’s stockpile stewardship program,” said Simon Hammond, program director for advanced computing in the NNSA Office of Advanced Simulation and Computing. “As with all large-scale and complex programs at the NNSA, we cannot do this alone and must partner across our national laboratories, and with industry partners, to push the boundaries of what’s possible.”
A bumpy start on the road to success

In early 2020, a formal kickoff of the El Capitan Center of Excellence brought together Sandia, Lawrence Livermore, Los Alamos and industry partners Hewlett Packard Enterprise and Advanced Micro Devices. The intent was to get engineers and programmers on early access systems to assess Advanced Simulation and Computing Tri-lab application codes for a supercomputer that did not yet exist.
Critical to this effort was the ability to provide feedback on issues and guide software development so that required features could be roadmapped for future releases.
At the heart of the Center of Excellence were an issue tracker, in the form of a ticket file system, and a message board. These tools were essential but not sufficient. Developers could work on problems independently, file tickets to document a bug or feature request, and post general messages where everyone could read them. The tools to steward new software and system research and development were present, but the burden still rested on developers to explain issues clearly enough that engineers at vendor sites, whom they had never met, could read and understand the laboratories’ complicated problems. Additionally, security constraints prevented the teams from sharing complete examples that mimicked the problem.
The challenges now included not only state lines but also export-control rules and divergent software ecosystems.
“This is a daunting problem,” said James Elliott, principal member of technical staff and Sandia Center of Excellence lead. “How do you socialize complex problems with partners that have no prior experience with our code? Enter the COE. The main objective was to architect how to use the tools we had — expert personnel, ticket systems, message boards — while knowing the constraints present, such as security restrictions on our codes and personnel with diverse experience working from remote locations.”
Success was built on connecting people

There was another barrier, one that was entirely unprecedented: the COVID-19 pandemic. After months of lockdowns, the Center of Excellence wanted to revive in-person hackathons, which had been useful on other projects. The idea was to host in-person events with vendors and pair targeted personnel with code teams. In November 2022, the idea became a reality.
Each year, four hackathons would take place, two at Sandia and two at Lawrence Livermore, providing dedicated focus time for teams to nominally “hack” on their codes. The real effect was to build camaraderie and enable vendor and lab staff to work side by side. The hackathons hosted by Sandia included over 50 developers from Sandia, Los Alamos, Lawrence Livermore, Hewlett Packard Enterprise and Advanced Micro Devices.
“The hackathons have proven to be a force-multiplier. They are extremely popular with both the code teams and our vendor partners at Hewlett Packard Enterprise and Advanced Micro Devices,” said Judy Hill, lead of the El Capitan Center of Excellence and computational scientist at Lawrence Livermore National Laboratory. “We can give the vendors real-time feedback on critical challenges and obstacles in the El Capitan software stack, and our code teams have received similar real-time assistance from our COE engineers, yielding leaps forward in application readiness, all from one week of dedicated collaboration time.”
The hackathons proved so popular that they grew from three-day to four-day events, and their collaborative spirit extended beyond the events themselves. Many teams began holding regular meetings outside of the hackathons with dedicated vendor personnel, often from three or four different employers, to continue the efforts.
This cohesive social network enabled deployment of El Capitan and, later, El Dorado. El Capitan, an exascale computing system located at Lawrence Livermore, was ranked first on the TOP500 list announced at Supercomputing in November 2024. El Dorado, smaller in scale than El Capitan but architecturally identical, ranked at number 20.
With the complexity of getting the codes both functional and performant on computing systems, the hackathons showed that building a successful network of people might be as fundamental as building the machine itself.
“What really unlocked our progress wasn’t just cutting-edge hardware or code tweaks, but the simple act of bringing people together,” said Jen Gaudioso, director of Advanced Simulation and Computing at Sandia. “Developers, vendors and lab scientists worked side by side to build trust, share insights and solve problems in real time.”
The development teams are now running at full scale on El Capitan, with some teams even completing Gordon Bell runs, the means of competing for the Gordon Bell Prize. With the sixth hackathon just completed to incredible turnout, Sandia’s hackathons continue to be a key mechanism for giving teams knowledge of and access to vendor experts.