Sandia News

Stewardship beats spectatorship


“We can walk down the road with vision and leadership — or be dragged.” 

These words weren’t spoken out loud until years later, after the road to the world’s fastest supercomputer had already been traveled. For Sandians on the project, though, the quiet imperatives of vision and leadership took hold in December 2019, when the path to El Capitan lay ahead. 

That path would span five years, three working prototypes, and hundreds of quarterly software drops, nondisclosure agreements and red-taped firewalls. The journey produced an exascale bulwark for America’s nuclear stockpile — and demonstrated Sandia’s resolve to influence technology that will persist for years to come. 

The procurement was one of the most ambitious system standups in tri-lab history, and on paper it seemed straightforward. Hewlett Packard Enterprise would integrate a machine at Lawrence Livermore National Laboratory; AMD would supply cutting-edge MI300A accelerated processing units, or APUs; and Lawrence Livermore, Los Alamos National Laboratory, and Sandia would port their flagship modeling and simulation codes.

Yet the risks were many. In 2019, AMD’s data-center GPU market share rounded to zero, its ROCm compiler was brand new, and no APU had ever powered a production supercomputer. In addition, the three laboratories were separated by state lines, export-control regimes, and divergent software ecosystems.

The contract’s wording therefore read something like a promissory note: We buy your roadmap today, trusting it will harden into silicon tomorrow. Sandia’s choice was clear: Either lead by participating early and helping draw the roadmap, or follow and face a steep climb to meet program milestones or — worse — be unable to run on the El Capitan system at all.  

Footholds on thin ice — The EAS ladder  

Sandia’s influence took root in its work with a series of three early access systems, each one a step along AMD’s hardware evolution: first the MI60, which challenged the prior state of the art; then the MI250X, which challenged modern chips; and finally the novel MI300A APU. These systems were sited at Livermore, the vanguard for installing and testing vendor software on the Tri-Lab Operating System Stack. For this plan to succeed, Sandia and Livermore needed to work together.

Multiyear collaboration bloomed. Sandia embedded engineers inside each system, climbing a hardware ladder whose rungs materialized only when the previous one was fully ascended. 

In the first system, Sandians found that complex kernels in their simulation codes repeatedly crashed the ROCm compiler, prompting AMD to accelerate revisions. By the third system, the initial compiler issues were largely resolved and Sandia’s highly modified codes were running successfully. Meanwhile, ROCm updates rained down to address other issues, and Sandia regression-tested each one, shortening bug-identification cycles from months to days.  

Engineering the human fabric 

Hardware alone, however, could not bridge Sandia, Los Alamos and Livermore. A five-member Center of Excellence (one lead per lab plus AMD and HPE counterparts) became the focal point for enabling vendor and app team integration. The leads established information-sharing platforms that crossed institutional barriers and later revived in-person hackathons, alternating between Sandia, Los Alamos and Livermore.  

After months of Covid-19 lockdowns, the first Sandia-hosted hackathon, in November 2022, brought an immediate sense of connection. These gatherings yielded technical progress, but the face time also yielded empathy, which translated into vendor sprint priorities. Another relationship-building moment came when Livermore and Sandia code teams presented portions of a key milestone report directly to AMD, giving the vendor rare insight into how NNSA grades tri-lab efforts.  

After this presentation, in July 2023, the development cadence solidified: Code-team-specific standing meetings were held every week, ROCm and Cray compiler drops arrived every month, and hackathons took place every quarter. What looked frenetic from the outside became ritual on the inside: regular rapport between the labs’ code teams and our vendor partners. Livermore and Sandia’s collaboration was much more than lip service.

A software metronome at exascale tempo 

The cadence was also, at times, relentless. For extended stretches, AMD shipped fresh ROCm releases almost weekly. Each one required Sandia to rebuild Trilinos’ 5-million-line ecosystem, re-evaluate Kokkos compilation and linking, and verify that solvers still converged.  

But every pass through the loop also pulled AMD’s tooling closer to Sandia’s real workloads. By September 2023, the ROCm 5.7 release included support for the MI300A processor to be used in the production system. The processor itself was not yet available, so Sandia engineers compiled against a ghost architecture, exposed latent linkage issues, and sent patches upstream.  

When beta MI300A silicon arrived in December 2023, codes executed on day one. And, by December 2024, when 11,136 production nodes lit green LEDs inside Livermore’s datacenter, the question was no longer “Will it run?” but “How fast can it scale?” — a notable accomplishment enabled by tri-lab code teams.  

The payoff and the horizon  

By January 2025, when teams received early access to the production system, the compiler crashes and the register and optimizer issues were safely in the rear-view mirror. New scaling issues cropped up, but, remarkably, teams were able to plan and launch jobs at 1,000, 4,000, and even 11,000 nodes during the early access phase.

Today, some long-running topics, such as link times and runtime performance, are still open. Performance tuning continues. But the vendors have demonstrated they can and will tackle problems, and the foundation is solid concrete. AMD, once a rookie in exascale, now fields a production stack with ongoing R&D. HPE’s Cray network stack speaks the Tri-Lab Operating System Stack natively. And Sandia’s code teams have earned line-item influence in ROCm roadmaps that will long outlive El Capitan.

Sandians not only walked the road, they helped to blaze the trail. They left a wake of failing test cases and performance deltas by which vendors could navigate. In doing so they proved a thesis worth remembering: Stewardship beats spectatorship. When the next procurement beckons, AMD will arrive as a seasoned competitor, its toolchain tempered by thousands of Sandia-forged edge cases.  

And somewhere in a kickoff meeting yet to be scheduled, another engineer will answer the old ultimatum, this time with a smile: We’ll walk, thank you, because we now know exactly how far that road can take us.