Publications

Publications / SAND Report

LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs

Elwazir, Ammar E.; Badawy, Abdel-Hameed, B.; Aaziz, Omar R.; Cook, Jeanine C.

GPUs are now a fundamental accelerator for many high-performance computing applications. They are viewed by many as a technology facilitator for the surge in fields like machine learning and Convolutional Neural Networks. To deliver the best performance on a GPU, we need to create monitoring tools to ensure that we optimize the code to get the most performance and efficiency out of a GPU. Since NVIDIA GPUs are currently the most commonly implemented in HPC applications and systems, NVIDIA tools are the solution for performance monitoring. The Light-Weight Distributed Metric System (LDMS) at Sandia is an infrastructure widely adopted for large-scale systems and application monitoring. Sandia has developed CPU application monitoring capability within LDMS. Therefore, we chose to develop a GPU monitoring capability within the same framework. In this report, we discuss the current limitations in the NVIDIA monitoring tools, how we overcame such limitations, and present an overview of the tool we built to monitor GPU performance in LDMS and its capabilities. Also, we discuss our current validation results. Most of the performance counter results are the same in both vendor tools and our tool when using LDMS to collect these results. Furthermore, our tool provides these statistics during the entire runtime of the tool as a time series and not just aggregate statistics at the end of the application run. This allows the user to see the progress of the behavior of the applications during their lifetime.