In 2016/2017, the field of HPC entered a new era, driven by fundamental physics challenges in producing more energy-efficient and cost-efficient processors. Since the convergence on the Message-Passing Interface (MPI) standard in the mid-1990s, application developers had enjoyed a seemingly static view of the underlying machine: a distributed collection of homogeneous nodes executing in collaboration. However, after almost two decades of dominance, relying on MPI alone for parallelism had become a limit on future performance gains. While MPI is widely expected to remain the basic mechanism for communication between compute nodes for the immediate future, additional parallelism is required on the compute node itself if high performance and efficiency goals are to be realized.
A review of the architectures of today's top HPC systems makes the change in paradigm clear: the compute nodes of the leading machines in the world are either powered by many-core chips with a few dozen cores each, or use heterogeneous designs in which traditional central processing units marshal work to massively parallel compute accelerators that can have as many as 200,000 processing threads in flight simultaneously. Complicating matters further for application developers, each processor vendor has its own preferred way of writing code for its architecture.
The Kokkos EcoSystem was released by Sandia in 2017 to address this new era in HPC system design by providing a vendor-independent, performance-portable programming system for scientific, engineering, and mathematical software applications written in the C++ programming language. Using Kokkos, application developers can be more productive because they do not have to create and maintain separate versions of their software for each architecture, nor do they have to be experts in each architecture’s peculiar requirements. Instead, they have a single method of programming for the diverse set of modern HPC architectures.
While Kokkos started in 2011 as a programming model only, it soon became clear that complex applications needed more. Portable mathematical functions are also critical, and developers need tools to debug their applications, gain insight into the performance characteristics of their codes, and tune algorithm performance parameters through automated processes. The Kokkos EcoSystem addresses those needs through its three main components: the Kokkos Core programming model, the Kokkos Kernels math library, and the Kokkos Tools project.
Kokkos Core is a programming model for parallel shared memory architectures. The model enables most application-written code to be performance portable across architectures. The programming model includes abstractions for frequently used parallel computing patterns, policies that provide details for how those computing patterns are to be applied, and execution spaces that denote on which compute resources the parallel computation is performed. The programming model also includes patterns for common data structures, policies that provide details for how those data structures are laid out in memory, and memory spaces that denote in which memory the data will reside.
The Kokkos Core programming model works by requiring that application development teams implement their algorithms in terms of Kokkos’ patterns, policies, and spaces. Kokkos Core is then free to map these algorithms and data structures onto each target architecture according to architecture-specific rules necessary to achieve the best performance. While other programming models support execution patterns, execution policies, execution spaces, and memory spaces, only Kokkos supports memory layouts and memory traits, which are necessary for performance portability.
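To make the role of a memory layout concrete, the following is a minimal sketch in plain standard C++. It is not the Kokkos API: the `sketch` namespace, the `View2D` class, and the `fill_and_sum` function are hypothetical names introduced only for illustration, and the layout tags merely borrow the names of Kokkos' `LayoutLeft` and `LayoutRight`. The point it tries to show is that an algorithm written against logical indices `(i, j)` stays unchanged while a compile-time policy decides how those indices map to memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch {  // hypothetical namespace, not part of Kokkos

// Row-major mapping: the rightmost index is contiguous in memory.
struct LayoutRight {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t /*rows*/, std::size_t cols) {
        return i * cols + j;
    }
};

// Column-major mapping: the leftmost index is contiguous in memory.
struct LayoutLeft {
    static std::size_t offset(std::size_t i, std::size_t j,
                              std::size_t rows, std::size_t /*cols*/) {
        return i + j * rows;
    }
};

// A 2D array whose index-to-memory mapping is chosen by the Layout policy.
template <class Layout>
class View2D {
public:
    View2D(std::size_t rows, std::size_t cols)
        : rows_(rows), cols_(cols), data_(rows * cols, 0.0) {}

    double& operator()(std::size_t i, std::size_t j) {
        assert(i < rows_ && j < cols_);
        return data_[Layout::offset(i, j, rows_, cols_)];
    }

    std::size_t rows() const { return rows_; }
    std::size_t cols() const { return cols_; }

    // Exposes the raw storage so the effect of the layout can be inspected.
    const std::vector<double>& storage() const { return data_; }

private:
    std::size_t rows_;
    std::size_t cols_;
    std::vector<double> data_;
};

// Written once against logical indices; it never mentions the layout.
template <class Layout>
double fill_and_sum(View2D<Layout>& a) {
    double sum = 0.0;
    for (std::size_t i = 0; i < a.rows(); ++i) {
        for (std::size_t j = 0; j < a.cols(); ++j) {
            a(i, j) = static_cast<double>(i * a.cols() + j);
            sum += a(i, j);
        }
    }
    return sum;
}

}  // namespace sketch
```

Calling `fill_and_sum` on a `View2D<LayoutRight>` and on a `View2D<LayoutLeft>` of the same shape yields the same sum, but the values land at different positions in the underlying storage. In Kokkos itself the layout is a template argument of the view type and a default is selected per memory space, so that the access pattern can suit the target hardware without changes to the algorithm.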
Kokkos Kernels is a software library of linear algebra and graph algorithms that are used across many HPC applications and are designed to achieve the best performance on every architecture. The baseline version of this library is written using the Kokkos Core programming model for portability and good performance. Where needed, the library includes architecture-specific optimizations or calls vendor-specific versions of these mathematical algorithms. This reduces the amount of architecture-specific software that an application team needs to develop, and with it the cost of the modifications required to achieve “best in class” performance.
Kokkos Tools is an innovative “plug-in” software interface and a growing set of tools that understand the Kokkos programming model and runtime. By providing debugging, profiling, and tuning tools, the project helps application developers during the entire life cycle of a code. Debugging and correctness tools help identify complex software bugs and corner cases that often evade even manual inspection. Development teams can use the performance profiling tools to determine how well they designed and implemented their algorithms and to identify portions of their software that need improvement. Most recently, auto-tuning tools were added that allow applications to adapt to new hardware automatically, reducing the need for developers to fine-tune the code for every new HPC platform their users want to leverage. Furthermore, the “plug-in” interface allows third-party tool providers to hook into Kokkos codes in the same way, enabling widely used profiling tools such as TAU and HPCToolkit to understand Kokkos.
Today, the Kokkos EcoSystem allows an ever-larger number of application teams to achieve portability and improve performance on advanced computing architectures. No longer just a Sandia product, Kokkos is now developed by a core team distributed over five DOE national laboratories, whose members maintain and improve the EcoSystem and support Kokkos users at their institutions. New efforts such as Kokkos Remote Spaces cover a wider range of future HPC application needs, and a dedicated support effort helps to train and educate software engineers and computational scientists. Kokkos is now used by hundreds of HPC developers around the world. Within DOE’s Exascale Computing Project, it serves as the underlying portability layer for almost two dozen projects. Kokkos is also a basis for DOE laboratories to propose improvements to the ISO C++ language standard such that, eventually, Kokkos capabilities will become native to the language. But until then: Performance Portability is Kokkos.