Advanced Systems Technology Test Beds

As part of NNSA’s Advanced Simulation and Computing (ASC) program, Sandia has acquired a set of advanced architecture test beds to help prepare applications and system software for the disruptive computer architecture changes that have begun to emerge and will continue to appear as HPC systems approach Exascale. While these test beds can be used for node-level exploration, they also provide the ability to study inter-node characteristics and understand future scalability challenges. To date, the test bed systems occupy one to six racks and contain on the order of 50-200 multi-core nodes, many with an attached co-processor or GP-GPU.

The test beds allow for pathfinding explorations of (1) alternative programming models, (2) architecture-aware algorithms, (3) energy-efficient runtime and system software, (4) advanced memory subsystem development, and (5) application performance. Beyond these, validation of computer architecture simulation studies can also be performed on these early examples of future Exascale platform architectures. As proxy applications are developed and re-implemented in architecture-centric versions, developers need these advanced architecture systems to explore how to adapt to an “MPI + X” paradigm, where “X” may be more than one disparate alternative. This, in turn, demands that tools be developed to inform the performance analyses. ASC has embraced a co-design approach for its future advanced technology systems. By purchasing from and working closely with the vendors on pre-production test beds, both ASC and the vendors are afforded early guidance and feedback on their paths forward. This applies not only to hardware but also to other enabling technologies such as system software, compilers, and tools.
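To make the “MPI + X” idea concrete, the following is a minimal sketch assuming X = OpenMP, one common choice on multi-core nodes: MPI handles parallelism across nodes while OpenMP threads exploit the cores within a node. The program, array size, and workload are hypothetical illustrations, not code drawn from the test beds.

```c
/* Minimal "MPI + X" sketch with X = OpenMP: each MPI rank uses
 * threads to compute a local sum, then ranks reduce to a global sum.
 * Compile (e.g.): mpicc -fopenmp mpi_plus_x.c -o mpi_plus_x
 * Hypothetical example for illustration only. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided;
    /* Request thread support so OpenMP regions are safe alongside MPI. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int n = 1000000;
    double local = 0.0;

    /* "X": on-node parallelism via OpenMP threads. */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < n; i++)
        local += 1.0 / (double)nranks;   /* contrived work */

    double global = 0.0;
    /* MPI: inter-node parallelism via message passing. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %f (expect %d)\n", global, n);

    MPI_Finalize();
    return 0;
}
```

On accelerator-based test beds, “X” would instead be a device programming model (for example CUDA, HIP, or OpenMP target offload), which is precisely the kind of disparity these systems let developers explore.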

There are currently several test beds available for use, with more in planning and integration phases. They represent distinct architectural directions and/or unique features important for future study; examples of the latter are custom power monitors and on-node solid-state disks (SSDs).
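The custom power monitors are vendor- and platform-specific. As a generic illustration of node-level power measurement, the sketch below samples the standard Linux powercap (Intel RAPL) interface, which exposes cumulative CPU package energy in microjoules. This is an assumption-laden stand-in, not the test beds’ own monitoring interface, and the sysfs path may be absent or permission-restricted on any given system.

```c
/* Sketch: estimate CPU package power by sampling the Linux powercap
 * (Intel RAPL) energy counter twice and dividing by elapsed time.
 * Path and units follow the standard Linux powercap ABI; this is a
 * generic illustration, not the test beds' custom monitoring interface. */
#include <stdio.h>
#include <unistd.h>

static long long read_energy_uj(void)
{
    /* energy_uj reports cumulative package energy in microjoules. */
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (!f) return -1;
    long long uj = -1;
    fscanf(f, "%lld", &uj);
    fclose(f);
    return uj;
}

int main(void)
{
    long long e0 = read_energy_uj();
    if (e0 < 0) { fprintf(stderr, "powercap/RAPL not available\n"); return 1; }
    sleep(1);                      /* sample interval: one second */
    long long e1 = read_energy_uj();
    /* The counter eventually wraps; a robust tool would also read
     * max_energy_range_uj and handle rollover. */
    printf("package power ~= %.2f W\n", (e1 - e0) / 1e6);
    return 0;
}
```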

Heterogeneous Advanced Architecture Platform Test Beds

| Host Name | Nodes | CPU | Accelerator or Co-Processor | Cores per Accelerator or Co-Processor | Interconnect | Other |
|---|---|---|---|---|---|---|
| Blake | 40 | Dual-socket Intel Xeon Platinum (24 cores) | None | N/A | Intel Omni-Path | Each core has dual AVX-512 vector processing units with FMA (see the vectorization sketch after this table) |
| Caraway | 4 | Dual AMD EPYC 7401 | Dual AMD Radeon Instinct MI25 GPUs | 64 CUs | Mellanox FDR InfiniBand | Only two nodes have GPUs |
| DodgeCity | 1 | Intel Xeon Platinum | 16 Graphcore IPUs per node | 1,472 cores | N/A | Custom machine-learning accelerator |
| Inouye | 7 | Fujitsu A64FX (48 cores) | None | N/A | Mellanox EDR InfiniBand | Fujitsu PRIMEHPC FX700; Armv8.2-A SVE instruction set |
| Mayer | 44 | ThunderX2 (28 cores): four B0 nodes, forty A1 nodes | None | N/A | Mellanox EDR InfiniBand (with Socket Direct) | Small-scale Vanguard Astra/Stria prototype |
| Morgan | 8 | Four dual Intel Ivy Bridge nodes; five Intel Cascade Lake nodes (24 cores) | Dual Intel Xeon Phi co-processors | Two 57-core; two 61-core | Mellanox QDR InfiniBand | Intel Xeon Phi (codenamed Knights Corner) available only on the Ivy Bridge nodes |
| Ride | 4 | Dual IBM Power8 (10 cores) | None | N/A | Mellanox FDR InfiniBand | Technology on the path to the anticipated CORAL systems |
| Voltrino | 56 | Dual Intel Xeon Ivy Bridge (24 cores) | None | N/A | Cray Aries | Cray XC30m; full-featured RAS system including power monitoring and control capabilities |
| White | 9 | Dual IBM Power8 (10 cores) | Dual NVIDIA K40 | 2,880 cores | Mellanox FDR InfiniBand | Technology on the path to the anticipated CORAL systems |
| Weaver | 10 | Dual IBM Power9 (20 cores) | Dual NVIDIA Tesla V100 | 5,120 cores | Mellanox EDR InfiniBand | Technology on the path to the anticipated CORAL systems |
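As a concrete example of the architecture-aware exploration these nodes support, the sketch below (referenced in the Blake row above) uses AVX-512 intrinsics to issue fused multiply-add operations of the kind Blake’s dual AVX-512 FMA units execute. The kernel, array names, and sizes are illustrative only.

```c
/* Sketch: fused multiply-add over arrays using AVX-512 intrinsics,
 * the kind of kernel one might use to probe dual AVX-512 FMA units.
 * Compile (e.g.): gcc -O2 -mavx512f avx512_fma.c -o avx512_fma */
#include <immintrin.h>
#include <stdio.h>

#define N 1024  /* multiple of 8 (eight doubles per 512-bit vector) */

int main(void)
{
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 3.0; }

    /* c = a * b + c, eight doubles at a time. */
    for (int i = 0; i < N; i += 8) {
        __m512d va = _mm512_loadu_pd(&a[i]);
        __m512d vb = _mm512_loadu_pd(&b[i]);
        __m512d vc = _mm512_loadu_pd(&c[i]);
        vc = _mm512_fmadd_pd(va, vb, vc);   /* single FMA instruction */
        _mm512_storeu_pd(&c[i], vc);
    }

    printf("c[0] = %f (expect 5.0)\n", c[0]);
    return 0;
}
```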

Application Readiness Test Beds

| Host Name | Nodes | CPU | Accelerator or Co-Processor | Cores per Accelerator or Co-Processor | Interconnect | Other |
|---|---|---|---|---|---|---|
| Mutrino | 200 | 100 Intel Haswell nodes (32 cores); 100 Intel Knights Landing nodes (64 cores) | None | N/A | Cray Aries dragonfly | Small-scale Cray XC system supporting the Sandia/LANL ACES partnership Trinity platform located at LANL |
| Vortex | 72 | Dual IBM Power9 (22 cores) | Quad NVIDIA Tesla V100 GPUs per node | 5,120 cores | Mellanox EDR InfiniBand (full fat-tree) | Small-scale IBM system supporting evaluation of codes targeting the LLNL Sierra cluster |