The Kokkos OpenMPTarget Backend: Implementation and Lessons Learned
Abstract not provided.
ACM Transactions on Mathematical Software
Automatic differentiation (AD) is a well-known technique for evaluating analytic derivatives of calculations implemented on a computer, with numerous software tools available for incorporating AD technology into complex applications. However, a growing challenge for AD is the efficient differentiation of parallel computations implemented on emerging manycore computing architectures such as multicore CPUs, GPUs, and accelerators as these devices become more pervasive. In this work, we explore forward mode, operator overloading-based differentiation of C++ codes on these architectures using the widely available Sacado AD software package. In particular, we leverage Kokkos, a C++ tool providing APIs for implementing parallel computations that is portable to a wide variety of emerging architectures. We describe the challenges that arise when differentiating code for these architectures using Kokkos, and two approaches for overcoming them that ensure optimal memory access patterns as well as expose additional dimensions of fine-grained parallelism in the derivative calculation. We describe the results of several computational experiments that demonstrate the performance of the approach on a few contemporary CPU and GPU architectures. We then conclude with applications of these techniques to the simulation of discretized systems of partial differential equations.
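To illustrate what forward-mode, operator overloading-based differentiation means in C++, the following is a minimal sketch. It assumes a single derivative component, and the names `Dual`, `f`, and `derivative_of_f` are hypothetical; this is not Sacado's API and it omits the Kokkos-specific memory layout and parallelism concerns the abstract discusses.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical forward-mode AD scalar: a value plus one derivative component.
// Illustration only; Sacado's real types are far more general.
struct Dual {
  double val;  // function value
  double dot;  // derivative with respect to the seeded input
};

// Overloaded operators propagate derivatives with the sum and product rules.
inline Dual operator+(Dual a, Dual b) { return {a.val + b.val, a.dot + b.dot}; }
inline Dual operator*(Dual a, Dual b) {
  return {a.val * b.val, a.dot * b.val + a.val * b.dot};
}
// Chain rule for sin: d/dx sin(u) = cos(u) * u'.
inline Dual sin(Dual a) { return {std::sin(a.val), std::cos(a.val) * a.dot}; }

// The function is written once as a template, so the same source can be
// evaluated with double (value only) or Dual (value and derivative).
template <typename T>
T f(T x) {
  using std::sin;  // std::sin for double; sin(Dual) is found by ADL
  return x * x + sin(x);
}

// Seed dx/dx = 1, evaluate, and read the derivative off the result.
inline double derivative_of_f(double x) {
  assert(std::isfinite(x) && "input must be finite");
  Dual r = f(Dual{x, 1.0});
  return r.dot;
}
```

Calling `derivative_of_f(x)` evaluates 2x + cos(x) without any hand-written derivative code, because the derivative is carried through the overloaded operators alongside the value.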
Proceedings - 2022 IEEE 18th International Conference on e-Science, eScience 2022
To keep pace with the demand for innovation through scientific computing, modern scientific software development is increasingly reliant upon a rich and diverse ecosystem of software libraries and toolchains. Research software engineers (RSEs) responsible for that infrastructure perform highly integrative work, acting as a bridge between the hardware, the needs of researchers, and the software layers situated between them; relatively little, however, has been written about the role played by RSEs in that work and what support they need to thrive. To that end, we present a two-part report on the development of half-precision floating point support in the Kokkos Ecosystem. Half-precision computation is a promising strategy for increasing performance in numerical computing and is particularly attractive for emerging application areas (e.g., machine learning), but developing practicable, portable, and user-friendly abstractions is a nontrivial task. In the first half of the paper, we conduct an engineering study on the technical implementation of the Kokkos half-precision scalar feature and showcase experimental results; in the second half, we offer an experience report on the challenges and lessons learned during feature development by the first author. We hope our study provides a holistic view on scientific library development and surfaces opportunities for future studies into effective strategies for RSEs engaged in such work.
Computer Physics Communications
Since the classical molecular dynamics simulator LAMMPS was released as an open source code in 2004, it has become a widely-used tool for particle-based modeling of materials at length scales ranging from atomic to mesoscale to continuum. Reasons for its popularity are that it provides a wide variety of particle interaction models for different materials, that it runs on any platform from a single CPU core to the largest supercomputers with accelerators, and that it gives users control over simulation details, either via the input script or by adding code for new interatomic potentials, constraints, diagnostics, or other features needed for their models. As a result, hundreds of people have contributed new capabilities to LAMMPS and it has grown from fifty thousand lines of code in 2004 to a million lines today. In this paper several of the fundamental algorithms used in LAMMPS are described along with the design strategies which have made it flexible for both users and developers. We also highlight some capabilities recently added to the code which were enabled by this flexibility, including dynamic load balancing, on-the-fly visualization, magnetic spin dynamics models, and quantum-accuracy machine learning interatomic potentials.
Computing in Science and Engineering
State-of-the-art engineering and science codes have grown dramatically in complexity over the last two decades. Application teams have adopted more sophisticated development strategies, leveraging third-party libraries, deploying comprehensive testing, and using advanced debugging and profiling tools. In today's environment of diverse hardware platforms, these applications also desire performance portability, avoiding the need to duplicate work for various platforms. The Kokkos EcoSystem provides that portable software stack. Based on the Kokkos Core Programming Model, the EcoSystem provides math libraries, interoperability capabilities with Python and Fortran, and Tools for analyzing, debugging, and optimizing applications. In this article, we overview the components, discuss some specific use cases, and highlight how codesigning these components enables a more developer-friendly experience.
IEEE Transactions on Parallel and Distributed Systems
As the push towards exascale hardware has increased the diversity of system architectures, performance portability has become a critical aspect for scientific software. We describe the Kokkos Performance Portable Programming Model that allows developers to write single source applications for diverse high performance computing architectures. Kokkos provides key abstractions for both the compute and memory hierarchy of modern hardware. Here, we describe the novel abstractions that have been added to Kokkos recently such as hierarchical parallelism, containers, task graphs, and arbitrary-sized atomic operations. We demonstrate the performance of these new features with reproducible benchmarks on CPUs and GPUs.
In 2016/2017, the field of High-Performance Computing (HPC) entered a new era driven by fundamental physics challenges to produce ever more energy- and cost-efficient processors. Since the convergence on the Message-Passing Interface (MPI) standard in the mid-1990s, application developers enjoyed a seemingly static view of the underlying machine: that of a distributed collection of homogeneous nodes executing in collaboration. However, after almost two decades of dominance, the sole use of MPI to derive parallelism acted as a limiter to improved future performance. While MPI is widely expected to continue to function as the basic mechanism for communication between compute nodes for the immediate future, additional parallelism is required on the computing node itself if high performance and efficiency goals are to be realized. When reviewing the architectures of the top HPC systems today, the change in paradigm is clear: the compute nodes of the leading machines in the world are either powered by many-core chips with a few dozen cores each, or use heterogeneous designs, where traditional CPUs marshal work to massively parallel compute accelerators that have as many as 200,000 processing threads in flight simultaneously. Complicating matters further for application developers, each processor vendor has its own preferred way of writing code for its architecture.
The Kokkos EcoSystem was released by Sandia in 2017 to address this new era in HPC system design by providing a vendor-independent, performance-portable programming system for scientific, engineering, and mathematical software applications written in the C++ programming language. Using Kokkos, application developers can be more productive because they will not have to create and maintain separate versions of their software for each architecture, nor will they have to be experts in each architecture's peculiar requirements. Instead, they will have a single method of programming for the diverse set of modern HPC architectures. While Kokkos started in 2011 as a programming model only, it soon became clear that complex applications needed more. It is also critical to have portable mathematical functions, and developers need tools to debug their applications, gain insight into the performance characteristics of their codes, and tune algorithm performance parameters through automated processes. The Kokkos EcoSystem addresses those needs through its three main components: the Kokkos Core programming model, the Kokkos Kernels math library, and the Kokkos Tools project.
Engage the C++ standards committee to further the adoption of successful Kokkos concepts into the C++ standard, and provide feedback on proposed concurrency mechanisms such as the executors proposal.
Provide high quality (production) Kokkos support and consultation for ASC applications and libraries.
Supporting the latest hardware and compiler versions is important to leverage improvements in the software environment and new HPC platforms. We will provide certified support for the latest releases of vendor compilers from Intel, AMD, IBM, NVIDIA, ARM and Cray as well as of open source compilers GCC and Clang.
Harden and optimize the ROCm based AMD GPU backend, develop a prototype backend for the Intel ECP Path Forward architecture, and improve the existing prototype Remote Memory Space capabilities.
This report documents the completion of milestone STPRO4-26 Engaging the C++ Committee. Multiple members of the Kokkos team attended the three C++ Committee meetings in San Diego, Hawaii, and Cologne, updated multiple in-flight proposals (e.g. mdspan, atomic_ref), contributed to numerous proposals central to future capabilities in C++ (e.g. executors, affinity), and organized a new effort to introduce a Basic Linear Algebra library into the C++ standard. We also implemented a production-quality version of mdspan as the basis for replacing the vast majority of the implementation of Kokkos::View, and thus started transitioning one of the core features in Kokkos to its future replacement.
This report documents the completion of milestone STPRO4-25 Harden and optimize the ROCm based AMD GPU backend, develop a prototype backend for the Intel ECP Path Forward architecture, and improve the existing prototype Remote Memory Space capabilities. The ROCm code was hardened to the point of passing all Kokkos unit tests; AMD then deprecated the programming model, forcing us to start over in FY20 with HIP. The Intel ECP Path Forward architecture prototype was developed with some initial capabilities on simulators, but plans changed, so that work will not continue. Instead, SYCL will be developed as a backend for Aurora. Remote Spaces was improved, and development is ongoing as part of a collaboration with NVIDIA.
This report documents the completion of milestone STPM12-19 Documented Kokkos application usecases. The goal of this milestone was to develop use case examples for common patterns users implement with Kokkos. This work was performed in the fourth quarter of FY19 and resulted in use case descriptions available in the Kokkos Wiki, with code examples.
This report documents the completion of milestone STPM12-17 Kokkos Training Bootcamp. The goal of this milestone was to hold a combined tutorial and hackathon bootcamp event for the Kokkos community and prospective users. The Kokkos Bootcamp event was held at Argonne National Laboratory from August 27 to August 29, 2019. Because attendance was lower than expected (we believe largely due to bad timing), the team focused, together with a select set of ECP partners, on early work in preparation for Aurora. In particular, we evaluated issues posed by exposing SYCL and OpenMP target offload to applications via the Kokkos programming model.
This report documents the completion of milestone STPRO4-13 "Documented Kokkos API", which is part of the Exascale Computing Project (ECP). The goal of this milestone was to generate documentation for the Kokkos programming model accessible to the open HPC community, beyond what was available via the tutorials. The total documentation for Kokkos now contains the equivalent of about 250 pages in textbook format. About a third of it is written in a textbook-like style, such as the Kokkos Programming Guide, while most of the rest is an API reference modelled after popular C++ reference webpages. On the order of 175 pages were newly written as part of the work for this milestone.
Due to the cost of hardware failures within mission-critical and scientific applications, it is necessary for software to provide a mechanism to prevent or recover from interruptions. The Kokkos ecosystem is a programming environment that provides performance and portability to many applications that run on DOE supercomputers as well as smaller-scale systems. These applications require a higher level of service due to the cost associated with each simulation or the critical nature of the mission. Software resilience enables an application to manage hardware failures, reducing the cost of an interruption. Two different resilience methodologies have been added to the Kokkos ecosystem: checkpointing has been added for restart capabilities, and a resilient execution model has been added to account for failures in compute devices. The design and implementation of each of these additions are described, and appropriate examples are included for end users.
Scope and Objectives: Kokkos Support provides cyber resources and conducts training events for current and prospective Kokkos users. In-person training events are organized at various venues, providing both generic Kokkos tutorials with lectures and exercises and hands-on work on users' applications.
Parallel Computing
Sparse matrix-matrix multiplication is a key kernel that has applications in several domains such as scientific computing and graph analysis. Several algorithms have been studied in the past for this foundational kernel. In this paper, we develop parallel algorithms for sparse matrix-matrix multiplication with a focus on performance portability across different high performance computing architectures. The performance of these algorithms depends on the data structures used in them. We compare different types of accumulators in these algorithms and demonstrate the performance difference between these data structures. Furthermore, we develop a meta-algorithm, KKSPGEMM, to choose the right algorithm and data structure based on the characteristics of the problem. We show performance comparisons on three architectures and demonstrate the need for the community to develop two-phase sparse matrix-matrix multiplication implementations for efficient reuse of the data structures involved.
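As a rough illustration of a two-phase structure, here is a serial sketch. It assumes "two-phase" means a symbolic pass that computes the sparsity structure of C = A*B, followed by a numeric pass that fills in values, and it uses a simple dense marker array as the accumulator. The `Csr` type and function names are hypothetical; this is not KKSPGEMM or the Kokkos Kernels interface, and it is not parallel.

```cpp
#include <cassert>
#include <vector>

// Hypothetical compressed sparse row (CSR) container for this sketch.
struct Csr {
  int rows;
  int cols;
  std::vector<int> row_ptr;    // size rows + 1
  std::vector<int> col_idx;    // column index of each stored entry
  std::vector<double> values;  // value of each stored entry
};

// Symbolic phase: count the nonzeros in each row of C = A * B and return
// C's row pointer array. It depends only on the sparsity patterns of A and B.
std::vector<int> spgemm_symbolic(const Csr& A, const Csr& B) {
  assert(A.cols == B.rows);
  std::vector<int> row_ptr(A.rows + 1, 0);
  std::vector<int> marker(B.cols, -1);  // last row in which column j was seen
  for (int i = 0; i < A.rows; ++i) {
    int count = 0;
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
      const int k = A.col_idx[p];
      for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
        const int j = B.col_idx[q];
        if (marker[j] != i) { marker[j] = i; ++count; }
      }
    }
    row_ptr[i + 1] = row_ptr[i] + count;
  }
  return row_ptr;
}

// Numeric phase: fill column indices and values using the row pointers from
// the symbolic phase. pos[j] is the slot of column j in C; a slot before the
// current row's start belongs to an earlier row and is treated as unset.
Csr spgemm_numeric(const Csr& A, const Csr& B, const std::vector<int>& c_row_ptr) {
  assert(A.cols == B.rows);
  Csr C{A.rows, B.cols, c_row_ptr,
        std::vector<int>(c_row_ptr.back()),
        std::vector<double>(c_row_ptr.back())};
  std::vector<int> pos(B.cols, -1);
  for (int i = 0; i < A.rows; ++i) {
    const int start = c_row_ptr[i];
    int next = start;
    for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
      const int k = A.col_idx[p];
      for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
        const int j = B.col_idx[q];
        const double v = A.values[p] * B.values[q];
        if (pos[j] < start) {  // first contribution to C(i, j)
          pos[j] = next;
          C.col_idx[next] = j;
          C.values[next] = v;
          ++next;
        } else {               // accumulate into the existing entry
          C.values[pos[j]] += v;
        }
      }
    }
  }
  return C;
}
```

When the sparsity patterns of A and B stay fixed and only their values change, the result of `spgemm_symbolic` can be kept and only `spgemm_numeric` rerun, which is the kind of reuse a two-phase split makes possible.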
This report documents the completion of milestone STPRO4-4 Kokkos back-ends research, collaborations, development, optimization, and documentation. The Kokkos team updated its existing backend to support the software stack and hardware of DOE's Sierra, Summit and Astra machines. They also collaborated with ECP PathForward vendors on developing backends for possible exa-scale architectures. Furthermore, the team ramped up its engagement with the ISO/C++ committee to accelerate the adoption of features important for the HPC community into the C++ standard.
This report documents the completion of milestone STPRO4-5 Kokkos interoperability with general SIMD types to force vectorization on ATS-1. The Kokkos team worked with application developers to enable the utilization of SIMD intrinsics, which allowed up to 3.7x improvement of the affected kernels on ATS-1 in a proxy application. SIMD types are now deployed in the production code base.
This report documents the completion of milestone STPRO4-6 Kokkos Support for ASC applications and libraries. The team provided consultation and support for numerous ASC code projects including Sandia's SPARC, EMPIRE, Aria, GEMMA, Alexa, Trilinos, LAMMPS and nimbleSM. Over the year more than 350 Kokkos GitHub issues were resolved, with over 220 requiring fixes and enhancements to the code base. Resolving these requests, many of them issued by ASC code teams, provided applications with the necessary capabilities in Kokkos to be successful.
This report documents the completion of milestone STPRO4-7 Kokkos R&D: Remote Memory Spaces for One-Sided Halo-Exchange. The goal of this milestone was to develop and deploy an initial capability to support PGAS-like communication models integrated into Kokkos via Remote Memory Spaces. The team developed semantic requirements for Remote Memory Spaces and implemented a prototype library leveraging four different communication libraries: libQUO, SHMEM, MPI-OneSided and NVSHMEM. In conjunction with ADCD02-COPA, the Remote Memory Space capability was used in ExaMiniMD, a Molecular Dynamics Proxy Application, to explore the current state of the technology and its usability. The obtained results demonstrate that usability is very good, allowing a significant simplification of communication routines, but performance is still lacking.
This report documents the completion of milestone STPM12-4 Kokkos Training Bootcamp. The goal of this milestone was to hold a combined tutorial and hackathon bootcamp event for the Kokkos community and prospective users. The Kokkos Bootcamp event was held on-site at Oak Ridge National Laboratory from July 24 to July 27, 2018. There were over 40 registered participants from 12 institutions, including 7 Kokkos project staff from SNL, LANL, and ORNL. The event consisted of a roughly two-day tutorial session including hands-on exercises, followed by 1.5 days of intensive porting work on codes that the participants brought to explore, port, and optimize the use of Kokkos with the help of Kokkos project experts.