Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to offload data-flow tasks into the network fabric, thereby freeing the hosts to perform other work. System architects in this space face multiple questions about the best way to leverage SmartNICs as processing elements in data flows. In this paper, we advocate the use of Apache Arrow as a foundation for implementing data-flow tasks on SmartNICs. We report on our experiences adapting a partitioning algorithm for particle data to Apache Arrow and measure the on-card processing performance for the BlueField-2 SmartNIC. Our experiments confirm that the BlueField-2’s (de)compression hardware can have a significant impact on in-transit workflows where data must be unpacked, processed, and repacked.
Recent efforts at Sandia such as DataSEA are creating search engines that enable analysts to query the institution’s massive archive of simulation and experiment data. The benefit of this work is that analysts will be able to retrieve all historical information about a system component that the institution has amassed over the years and make better-informed decisions in current work. As DataSEA gains momentum, it faces multiple technical challenges relating to capacity storage. From a raw capacity perspective, data producers will rapidly overwhelm the system with massive amounts of data. From an accessibility perspective, analysts will expect to be able to retrieve any portion of the bulk data, from any system on the enterprise network. Sandia’s Institutional Computing is mitigating storage problems at the enterprise level by procuring new capacity storage systems that can be accessed from anywhere on the enterprise network. These systems use the simple storage service, or S3, API for data transfers. While S3 uses objects instead of files, users can access it from their desktops or Sandia’s high-performance computing (HPC) platforms. S3 is particularly well suited for bulk storage in DataSEA, as datasets can be decomposed into object that can be referenced and retrieved individually, as needed by an analyst. In this report we describe our experiences working with S3 storage and provide information about how developers can leverage Sandia’s current systems. We present performance results from two sets of experiments. First, we measure S3 throughput when exchanging data between four different HPC platforms and two different enterprise S3 storage systems on the Sandia Restricted Network (SRN). Second, we measure the performance of S3 when communicating with a custom-built Ceph storage system that was constructed from HPC components. Overall, while S3 storage is significantly slower than traditional HPC storage, it provides significant accessibility benefits that will be valuable for archiving and exploiting historical data. There are multiple opportunities that arise from this work, including enhancing DataSEA to leverage S3 for bulk storage and adding native S3 support to Sandia’s IOSS library.
High-performance computing (HPC) researchers have long envisioned scenarios where application workflows could be improved through the use of programmable processing elements embedded in the network fabric. Recently, vendors have introduced programmable Smart Network Interface Cards (SmartNICs) that enable computations to be offloaded to the edge of the network. There is great interest in both the HPC and high-performance data analytics (HPDA) communities in understanding the roles these devices may play in the data paths of upcoming systems. This paper focuses on characterizing both the networking and computing aspects of NVIDIA’s new BlueField-2 SmartNIC when used in a 100Gb/s Ethernet environment. For the networking evaluation we conducted multiple transfer experiments between processors located at the host, the SmartNIC, and a remote host. These tests illuminate how much effort is required to saturate the network and help estimate the processing headroom available on the SmartNIC during transfers. For the computing evaluation we used the stress-ng benchmark to compare the BlueField-2 to other servers and place realistic bounds on the types of offload operations that are appropriate for the hardware. Our findings from this work indicate that while the BlueField-2 provides a flexible means of processing data at the network’s edge, great care must be taken to not overwhelm the hardware. While the host can easily saturate the network link, the SmartNIC’s embedded processors may not have enough computing resources to sustain more than half the expected bandwidth when using kernel-space packet processing. From a computational perspective, encryption operations, memory operations under contention, and on-card IPC operations on the SmartNIC perform significantly better than the general-purpose servers used for comparisons in our experiments. Therefore, applications that mainly focus on these operations may be good candidates for offloading to the SmartNIC.
Remote Direct Memory Access (RDMA) is an increasingly important technology in high-performance computing (HPC). RDMA provides low-latency, high-bandwidth data transfer between compute nodes. Additionally, it does not require explicit synchronization with the destination processor. Eliminating unnecessary synchronization can significantly improve the communication performance of large-scale scientific codes. A long-standing challenge presented by RDMA communication is mitigating the cost of registering memory with the network interface controller (NIC). Reusing memory once it is registered has been shown to significantly reduce the cost of RDMA communication. However, existing approaches for reusing memory rely on implicit memory semantics. In this paper, we introduce an approach that makes memory reuse semantics explicit by exposing a separate allocator for registered memory. The data and analysis in this paper yield the following contributions: (i) managing registered memory explicitly enables efficient reuse of registered memory; (ii) registering large memory regions to amortize the registration cost over multiple user requests can significantly reduce cost of acquiring new registered memory; and (iii) reducing the cost of acquiring registered memory can significantly improve the performance of RDMA communication. Reusing registered memory is key to high-performance RDMA communication. By making reuse semantics explicit, our approach has the potential to improve RDMA performance by making it significantly easier for programmers to efficiently reuse registered memory.
Composition of computational science applications into both ad hoc pipelines for analysis of collected or generated data and into well-defined and repeatable workflows is becoming increasingly popular. Meanwhile, dedicated high performance computing storage environments are rapidly becoming more diverse, with both significant amounts of non-volatile memory storage and mature parallel file systems available. At the same time, computational science codes are being coupled to data analysis tools which are not filesystem-oriented. In this paper, we describe how the FAODEL data management service can expose different available data storage options and mediate among them in both application- and FAODEL-directed ways. These capabilities allow applications to exploit their knowledge of the different types of data they may exchange during a workflow execution, and also provide FAODEL with mechanisms to proactively tune data storage behavior when appropriate. We describe the implementation of these capabilities in FAODEL and how they are used by applications, and present preliminary performance results demonstrating the potential benefits of our approach.
This report documents the outcome from the ASC ATDM Level 2 Milestone 6358: Assess Status of Next Generation Components and Physics Models in EMPIRE. This Milestone is an assessment of the EMPIRE (ElectroMagnetic Plasma In Realistic Environments) application and three software components. The assessment focuses on the electromagnetic and electrostatic particle-in-cell solu- tions for EMPIRE and its associated solver, time integration, and checkpoint-restart components. This information provides a clear understanding of the current status of the EMPIRE application and will help to guide future work in FY19 in order to ready the application for the ASC ATDM L 1 Milestone in FY20. It is clear from this assessment that performance of the linear solver will have to be a focus in FY19.
Composition of computational science applications, whether into ad hoc pipelines for analysis of simulation data or into well-defined and repeatable workflows, is becoming commonplace. In order to scale well as projected system and data sizes increase, developers will have to address a number of looming challenges. Increased contention for parallel filesystem bandwidth, accomodating in situ and ex situ processing, and the advent of decentralized programming models will all complicate application composition for next-generation systems. In this paper, we introduce a set of data services, Faodel, which provide scalable data management for workflows and composed applications. Faodel allows workflow components to directly and efficiently exchange data in semantically appropriate forms, rather than those dictated by the storage hierarchy or programming model in use. We describe the architecture of Faodel and present preliminary performance results demonstrating its potential for scalability in workflow scenarios.
Recent high-performance computing (HPC) platforms such as the Trinity Advanced Technology System (ATS-1) feature burst buffer resources that can have a dramatic impact on an application’s I/O performance. While these non-volatile memory (NVM) resources provide a new tier in the storage hierarchy, developers must find the right way to incorporate the technology into their applications in order to reap the benefits. Similar to other laboratories, Sandia is actively investigating ways in which these resources can be incorporated into our existing libraries and workflows without burdening our application developers with excessive, platform-specific details. This FY18Q1 milestone summaries our progress in adapting the Sandia Parallel Aerodynamics and Reentry Code (SPARC) in Sandia’s ATDM program to leverage Trinity’s burst buffers for checkpoint/restart operations. We investigated four different approaches with varying tradeoffs in this work: (1) simply updating job script to use stage-in/stage out burst buffer directives, (2) modifying SPARC to use LANL’s hierarchical I/O (HIO) library to store/retrieve checkpoints, (3) updating Sandia’s IOSS library to incorporate the burst buffer in all meshing I/O operations, and (4) modifying SPARC to use our Kelpie distributed memory library to store/retrieve checkpoints. Team members were successful in generating initial implementation for all four approaches, but were unable to obtain performance numbers in time for this report (reasons: initial problem sizes were not large enough to stress I/O, and SPARC refactor will require changes to our code). When we presented our work to the SPARC team, they expressed the most interest in the second and third approaches. The HIO work was favored because it is lightweight, unobtrusive, and should be portable to ATS-2. The IOSS work is seen as a long-term solution, and is favored because all I/O work (including checkpoints) can be deferred to a single library.
Field programmable gate arrays (FPGAs) have been used as alternative computational de-vices for over a decade; however, they have not been used for traditional scientific com-puting due to their perceived lack of floating-point performance. In recent years, there hasbeen a surge of interest in alternatives to traditional microprocessors for high performancecomputing. Sandia National Labs began two projects to determine whether FPGAs wouldbe a suitable alternative to microprocessors for high performance scientific computing and,if so, how they should be integrated into the system. We present results that indicate thatFPGAs could have a significant impact on future systems. FPGAs have thepotentialtohave order of magnitude levels of performance wins on several key algorithms; however,there are serious questions as to whether the system integration challenge can be met. Fur-thermore, there remain challenges in FPGA programming and system level reliability whenusing FPGA devices.4 AcknowledgmentArun Rodrigues provided valuable support and assistance in the use of the Structural Sim-ulation Toolkit within an FPGA context. Curtis Janssen and Steve Plimpton provided valu-able insights into the workings of two Sandia applications (MPQC and LAMMPS, respec-tively).5