Astra, deployed in 2018, was the first petascale supercomputer to utilize processors based on the ARM instruction set. The system was also the first under Sandia's Vanguard program which seeks to provide an evaluation vehicle for novel technologies that with refinement could be utilized in demanding, large-scale HPC environments. In addition to ARM, several other important first-of-a-kind developments were used in the machine, including new approaches to cooling the datacenter and machine. This article documents our experiences building a power measurement and control infrastructure for Astra. While this is often beyond the control of users today, the accurate measurement, cataloging, and evaluation of power, as our experiences show, is critical to the successful deployment of a large-scale platform. While such systems exist in part for other architectures, Astra required new development to support the novel Marvell ThunderX2 processor used in compute nodes. In addition to documenting the measurement of power during system bring up and for subsequent on-going routine use, we present results associated with controlling the power usage of the processor, an area which is becoming of progressively greater interest as data centers and supercomputing sites look to improve compute/energy efficiency and find additional sources for full system optimization.
Astra, deployed in 2018, was the first petascale supercomputer to utilize processors based on the ARM instruction set. The system was also the first under Sandia's Vanguard program which seeks to provide an evaluation vehicle for novel technologies that with refinement could be utilized in demanding, large-scale HPC environments. In addition to ARM, several other important first-of-a-kind developments were used in the machine, including new approaches to cooling the datacenter and machine. Here we document our experiences building a power measurement and control infrastructure for Astra. While this is often beyond the control of users today, the accurate measurement, cataloging, and evaluation of power, as our experiences show, is critical to the successful deployment of a large-scale platform. While such systems exist in part for other architectures, Astra required new development to support the novel Marvell ThunderX2 processor used in compute nodes. In addition to documenting the measurement of power during system bring up and for subsequent on-going routine use, we present results associated with controlling the power usage of the processor, an area which is becoming of progressively greater interest as data centers and supercomputing sites look to improve compute/energy efficiency and find additional sources for full system optimization.
Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments ( EMPIRE ) modeling and design tool and resulted in 1TB of application data and 50TB of system data. EMPIRE developers remarked this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase total scientific output of Sandia's HPC users.
Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and diversity of hardware platforms at supercomputing centers has motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.
Multithreaded MPI applications are gaining popularity in scientific and high-performance computing. While the combination of programming models is suited to support current parallel hardware, it moves threading models and their interaction with MPI into focus. With the advent of new threading libraries, the flexibility to select threading implementations of choice is becoming an important usability feature. Open MPI has traditionally avoided componentizing its threading model, relying on code inlining and static initialization to minimize potential impacts on runtime fast paths and synchronization. This paper describes the implementation of a generic threading runtime support in Open MPI using the Opal Modular Component Architecture. This architecture allows the programmer to select a threading library at compile-or run-time, providing both static initialization of threading primitives as well as dynamic instantiation of threading objects. In this work, we present the implementation, define required interfaces, and discuss trade-offs of dynamic and static initialization.
Proceedings of MCHPC 2020: Workshop on Memory Centric High Performance Computing, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Many-core systems are beginning to feature novel large, high-bandwidth intermediate memory as a visible part of the memory hierarchy. This paper discusses how to make use of intermediate memory when composing matrix multiply with transpose to compute $A$ * AT. We re-purpose the cache-oblivious approach developed by Frigo et al. and apply it to the composition of a bandwidth-bound kernel (transpose) with a compute-bound kernel (matrix multiply). Particular focus is on regions of matrix shapes far from square that are not usually considered. Our codes are simpler than optimized codes, but reasonably close in performance. Also, perhaps of more importance is developing a paradigm for how to construct other codes using intermediate memories.
In the decade since support for task parallelism was incorporated into OpenMP, its use has remained limited in part due to concerns about its performance and scalability. This paper revisits a study from the early days of OpenMP tasking that used the Unbalanced Tree Search (UTS) benchmark as a stress test to gauge implementation efficiency. The present UTS study includes both Clang/LLVM and vendor OpenMP implementations on four different architectures. We measure parallel efficiency to examine each implementation’s performance in response to varying task granularity. We find that most implementations achieve over 90% efficiency using all available cores for tasks of O(100k) instructions, and the best even manage tasks of O(10k) instructions well.
Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of teras-cale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdos-Renyi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence, and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.
In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form.
Existing machines for lazy evaluation use a flat representation of environments, storing the terms associated with free variables in an array. Combined with a heap, this structure supports the shared intermediate results required by lazy evaluation. We propose and describe an alternative approach that uses a shared environment to minimize the overhead of delayed computations. We show how a shared environment can act as both an environment and a mechanism for sharing results. To formalize this approach, we introduce a calculus that makes the shared environment explicit, as well as a machine to implement the calculus, the Cactus Environment Machine. A simple compiler implements the machine and is used to run experiments for assessing performance. The results show reasonable performance and suggest that incorporating this approach into real-world compilers could yield performance benefits in some scenarios.
For at least the last 20 years, many have tried to create a general resource management system to support interoperability across various concurrent libraries. The previous strategies all suffered from additional toolchain requirements, and/or a usage of a shared programing model that assumed it owned/controlled access to all resources available to the program. None of these techniques have achieved wide spread adoption. The ubiquity of OpenMP coupled with C++ developing a standard way to describe many different concurrent paradigms (C++23 executors) would allow OpenMP to assume the role of a general resource manager without requiring user code written directly in OpenMP. With a few added features such as the ability to use otherwise idle threads to execute tasks and to specify a task “width”, many interesting concurrent frameworks could be developed in native OpenMP and achieve high performance. Further, one could create concrete C++ OpenMP executors that enable support for general C++ executor based codes, which would allow Fortran, C, and C++ codes to use the same underlying concurrent framework when expressed as native OpenMP or using language specific features. Effectively, OpenMP would become the de facto solution for a problem that has long plagued the HPC community.