Lightweight Distributed Metric Service (LDMS)

The open-source, Sandia-led Lightweight Distributed Metric Service (LDMS) is state-of-the-art monitoring software that provides continuous, run-time insight into application performance in conjunction with system conditions. “Always-on” monitoring on our HPC systems provides a wealth of data for statistical and ML-based analysis, which can be used to discover causes of performance degradation and to inform resource management and scheduling decisions. LDMS was designed from the ground up for extreme scalability and lightweight data collection. These properties let it resolve data features on timescales fine enough to reveal cause and effect, and to drive automated, run-time feedback and response that improves the performance and efficiency of current and future HPC systems. In addition, data-driven insights from LDMS can influence next-generation architectural design.

LDMS is an R&D 100 award winner and is deployed on extreme-scale HPC systems throughout the world. Advances in LDMS and collaborative R&D directions are discussed at the LDMS Users Group Conference, LDMSCON.

GitLab Repository
Open-source, Sandia-led Lightweight Distributed Metric Service (LDMS)

Publications

Omar Aaziz, Benjamin Allan, James Brandt, Jeanine Cook, Karen Devine, James Elliott, Ann Gentile, Simon Hammond, Brian Kelley, Lena Lopatina, Stan Moore, Stephen Olivier, Kevin Pedretti, David Poliakoff, Roger Pawlowski, Phillip Regier, Mark Schmitz, Benjamin Schwaller, Vanessa Surjadidjaja, Matthew Swan, Nick Tucker, Tom Tucker, Courtenay Vaughan, Sara Walton, (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability https://doi.org/10.2172/1819812 Publication ID: 75582

Benjamin Schwaller, (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability (Final) https://doi.org/10.2172/1822583 Publication ID: 75905

Ammar Elwazir, Abdel-Hameed Badawy, Omar Aaziz, Jeanine Cook, (2020). LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs https://doi.org/10.2172/1813665 Publication ID: 71944

Alexander Goponenko, Ramin Izadpanah, James Brandt, Damian Dechev, (2020). Towards workload-adaptive scheduling for HPC clusters Proceedings – IEEE International Conference on Cluster Computing, ICCC https://www.osti.gov/servlets/purl/1814414 Publication ID: 74584

Tom Tucker, Ann Gentile, James Brandt, (2020). Supporting Dynamic Event Monitoring in the Lightweight Distributed Metric Service (LDMS) https://www.osti.gov/servlets/purl/1812466 Publication ID: 74398

James Brandt, (2019). Lightweight Distributed Metric Service: Deployments Enhancements Roadmap and Activities https://www.osti.gov/servlets/purl/1645641 Publication ID: 69785

Ramin Izadpanah, Benjamin Allan, Damian Dechev, James Brandt, (2019). Production application performance data streaming for system monitoring ACM Transactions on Modeling and Performance Evaluation of Computing Systems https://doi.org/10.1145/3319498 Publication ID: 70049

James Brandt, Ann Gentile, M. Showerman, J. Enos, J. Fullop, G. Bauer, (2016). Large-scale persistent numerical data source monitoring system experiences Proceedings – 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016 https://doi.org/10.1109/IPDPSW.2016.188 Publication ID: 48803

Benjamin Allan, Sophia Lefantzi, Edward Walsh, Jeffry Ogden, Nathan Gauntt, (2016). Lightweight Distributed Metric Service: Production analytics overview for LDMS v2 https://www.osti.gov/servlets/purl/1369518 Publication ID: 50930

Benjamin Allan, (2016). Lightweight Distributed Metric Service: Production overview and development status https://www.osti.gov/servlets/purl/1368706 Publication ID: 50289

Steven Feldman, Deli Zhang, Damian Dechev, James Brandt, (2015). Extending LDMS to enable performance monitoring in multi-core applications Proceedings – IEEE International Conference on Cluster Computing, ICCC https://www.osti.gov/servlets/purl/1325699 Publication ID: 45520

James Brandt, William Collins, Ann Gentile, Michael Martinez II, Susan Mcree, Daniel Sands, Allan Yaklin, (2015). Uncovering Bottlenecks in Data Transfer from a Filesystem to HPSS using the Lightweight Distributed Metric Service (LDMS) https://www.osti.gov/servlets/purl/1325529 Publication ID: 45513

James Brandt, Ann Gentile, (2014). Lightweight Distributed Metric Service (LDMS) https://www.osti.gov/servlets/purl/1496673 Publication ID: 37573

Anthony Agelastos, Benjamin Allan, James Brandt, Ann Gentile, Stephen Monk, Jeffry Ogden, Mahesh Rajan, Joel Stevenson, (2014). The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications https://doi.org/10.1109/SC.2014.18 Publication ID: 40490

Anthony Agelastos, Benjamin Allan, James Brandt, Paul Cassella, Jeremy Enos, Joshi Fullop, Ann Gentile, Stephen Monk, Nichamon Naksinehaboon, Jeffry Ogden, Mahesh Rajan, Michael Showerman, Joel Stevenson, Narate Taerat, Thomas Tucker, (2014). The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications International Conference for High Performance Computing, Networking, Storage and Analysis, SC https://doi.org/10.1109/SC.2014.18 Publication ID: 39574

James Brandt, Ann Gentile, (2013). Lightweight Distributed Metric Service (LDMS): Run-time Resource Utilization Monitoring https://www.osti.gov/servlets/purl/1106397 Publication ID: 34992

James Brandt, Ann Gentile, (2013). LDMS: Lightweight Distributed Metric Service for HPC Monitoring https://www.osti.gov/biblio/1063385 Publication ID: 32144

Ann Gentile, James Brandt, (2012). Lightweight Distributed Metric Service https://www.osti.gov/servlets/purl/1648269 Publication ID: 30760

Ann Gentile, James Brandt, (2012). Copy of Lightweight Distributed Metric Service https://www.osti.gov/servlets/purl/1686350 Publication ID: 30761
