Publications / SAND Report

High Performance Computing Metrics to Enable Application-Platform Communication

Agelastos, Anthony M.; Brandt, James M.; Gentile, Ann C.; Lamb, Justin M.; Ruggirello, Kevin P.; Stevenson, Joel O.

Sandia has invested heavily in scientific/engineering application development and in the research, development, and deployment of large-scale HPC platforms to support the computational needs of these applications. As application developers continually expand the capabilities of their software and spend more time on performance tuning of applications for these platforms, HPC platform resources are at a premium because they are heavily shared and serve the varied needs of many users. To ensure that HPC platform resources are used efficiently and perform as designed, it is necessary to obtain reliable data on resource utilization that will allow us to investigate the occurrence, severity, and causes of performance-affecting contention between applications. The work presented in this paper was an initial step to determine whether resource contention can be understood and minimized through monitoring, modeling, planning, and infrastructure. This paper describes the set of metric definitions, identified in this research, that can be used as meaningful and potentially actionable indicators of performance-affecting contention between applications. These metrics were verified using the observed slowdown of IOR, IMB, and CTH in operating scenarios that forced contention. This paper also describes system/application monitoring activities that are critical to distilling vast amounts of data into quantities that hold the key to understanding an application's performance under production conditions and that will ultimately aid Sandia's efforts to succeed in extreme-scale computing.