Computational Offload with BlueField Smart NICs
The recent introduction of a new generation of "smart NICs" have provided new accelerator platforms that include CPU cores or reconfigurable fabric in addition to traditional networking hardware and packet offloading capabilities. While there are currently several proposals for using these smartNICs for low-latency, in-line packet processing operations, there remains a gap in knowledge as to how they might be used as computational accelerators for traditional high-performance applications. This work aims to look at benchmarks and mini-applications to evaluate possible benefits of using a smartNIC as a compute accelerator for HPC applications. We investigate NVIDIA's current-generation BlueField-2 card, which includes eight Arm CPUs along with a small amount of storage, and we test the networking and data movement performance of these cards compared to a standard Intel server host. We then detail how two different applications, YASK and miniMD can be modified to make more efficient use of the BlueField-2 device with a focus on overlapping computation and communication for operations like neighbor building and halo exchanges. Our results show that while the overall compute performance of these devices is limited, using them with a modified miniMD algorithm allows for potential speedups of 5 to 20% over the host CPU baseline with no loss in simulation accuracy.