Kubernetes for HPC Administration
Abstract not provided.
Abstract not provided.
Proceedings of INDIS 2020: Innovating the Network for Data-Intensive Science, Held in conjunction with SC 2020: The International Conference for High Performance Computing, Networking, Storage and Analysis
Priority-based Flow Control (PFC), RDMA over Converged Ethernet (RoCE) and Enhanced Transmission Selection (ETS) are three enhancements to Ethernet networks which allow increased performance and may make Ethernet attractive for systems supporting a diverse scientific workload. We constructed a 96-node testbed cluster with a 100 Gb/s Ethernet network configured as a tapered fat tree. Tests representing important network operating conditions were completed and we provide an analysis of these performance results. RoCE running over a PFC-enabled network was found to significantly increase performance for both bandwidth-sensitive and latency-sensitive applications when compared to TCP. Additionally, a case study of interfering applications showed that ETS can prevent starvation of network traffic for latency-sensitive applications running on congested networks. We did not encounter any notable performance limitations for our Ethernet testbed, but we found that practical disadvantages still tip the balance towards traditional HPC networks unless a system design is driven by additional external requirements.
Abstract not provided.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Architecture simulation can aid in predicting and understanding application performance, particularly for proposed hardware or large system designs that do not exist. In network design studies for high-performance computing, most simulators focus on the dominant message passing (MPI) model. Currently, many simulators build and maintain their own simulator-specific implementations of MPI. This approach has several drawbacks. Rather than reusing an existing MPI library, simulator developers must implement all semantics, collectives, and protocols. Additionally, alternative runtimes like GASNet cannot be simulated without again building a simulator-specific version. It would be far more sustainable and flexible to maintain lower-level layers like uGNI or IB-verbs and reuse the production runtime code. Directly building and running production communication runtimes inside a simulator poses technical challenges, however. We discuss these challenges and show how they are overcome via the macroscale components for the Structural Simulation Toolkit (SST), leveraging a basic source-to-source tool to automatically adapt production code for simulation. SST is able to encapsulate and virtualize thousands of MPI ranks in a single simulator process, providing a “supercomputer in a laptop” environment. We demonstrate the approach for the production GASNet runtime over uGNI running inside SST. We then discuss the capabilities enabled, including investigating performance with tunable delays, deterministic debugging of race conditions, and distributed debugging with serial debuggers.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Architecture simulation can aid in predicting and understanding application performance, particularly for proposed hardware or large system designs that do not exist. In network design studies for high-performance computing, most simulators focus on the dominant message passing (MPI) model. Currently, many simulators build and maintain their own simulator-specific implementations of MPI. This approach has several drawbacks. Rather than reusing an existing MPI library, simulator developers must implement all semantics, collectives, and protocols. Additionally, alternative runtimes like GASNet cannot be simulated without again building a simulator-specific version. It would be far more sustainable and flexible to maintain lower-level layers like uGNI or IB-verbs and reuse the production runtime code. Directly building and running production communication runtimes inside a simulator poses technical challenges, however. We discuss these challenges and show how they are overcome via the macroscale components for the Structural Simulation Toolkit (SST), leveraging a basic source-to-source tool to automatically adapt production code for simulation. SST is able to encapsulate and virtualize thousands of MPI ranks in a single simulator process, providing a “supercomputer in a laptop” environment. We demonstrate the approach for the production GASNet runtime over uGNI running inside SST. We then discuss the capabilities enabled, including investigating performance with tunable delays, deterministic debugging of race conditions, and distributed debugging with serial debuggers.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Performance modeling of networks through simulation requires application endpoint models that inject traffic into the simulation models. Endpoint models today for system-scale studies consist mainly of post-mortem trace replay, but these off-line simulations may lack flexibility and scalability. On-line simulations running so-called skeleton applications run reduced versions of an application that generate traffic that is the same or similar to the full application. These skeleton apps have advantages for flexibility and scalability, but they often must be custom written for the simulator itself. Auto-skeletonization of existing application source code via compiler tools would provide endpoint models with minimal development effort. These source-to-source transformations have been only narrowly explored. We introduce a pragma language and corresponding Clang-driven source-to-source compiler that performs auto-skeletonization based on provided pragma annotations. We describe the compiler toolchain, validate the generated skeletons, and show scalability of the generated simulation models beyond 100Â K endpoints for example MPI applications. Overall, we assert that our proposed auto-skeletonization approach and the flexible skeletons it produces can be an important tool in realizing balanced exascale interconnect designs.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Data movement is considered the main performance concern for exascale, including both on-node memory and off-node network communication. Indeed, many application traces show significant time spent in MPI calls, potentially indicating that faster networks must be provisioned for scalability. However, equating MPI times with network communication delays ignores synchronization delays and software overheads independent of network hardware. Using point-to-point protocol details, we explore the decomposition of MPI time into communication, synchronization and software stack components using architecture simulation. Detailed validation using Bayesian inference is used to identify the sensitivity of performance to specific latency/bandwidth parameters for different network protocols and to quantify associated uncertainties. The inference combined with trace replay shows that synchronization and MPI software stack overhead are at least as important as the network itself in determining time spent in communication routines.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) work- load requirements. The focus herein is on asynchronous many-task (AMT) model and runtime systems, which are of great interest in the context of "Oriascale7 computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AIM] runtime systems – Charm-++, Legion, and Uintah, all of which are in use as part of the Centers. The studies focus on each of the runtimes' programmability, performance, and mutability. Through the experiments and analysis presented, several overarching Predictive Science Academic Alliance Program II (PSAAP-II) Asc findings emerge. From a performance perspective, AIV runtimes show tremendous potential for addressing extreme- scale challenges. Empirical studies show an AM runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MP1) and AM11runtimes perform comparably under balanced conditions. From a programmability and mutability perspective however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co- design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the community as a whole, with widespread community engagement mitigating risk for both application developers developers. and high-performance computing runtime systein
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.