Lightweight Distributed Metric Service (LDMS)

The open-source, Sandia-led Lightweight Distributed Metric Service (LDMS) is state-of-the-art monitoring software that provides continuous, run-time insight into application performance in conjunction with system conditions. “Always-on” monitoring on our HPC systems provides a wealth of information that can be used in statistical and ML-based analysis to discover causes of performance degradation and to inform resource management and scheduling decisions. LDMS was designed from the ground up for extreme scalability and lightweight data collection. These properties make it possible to resolve data features on timescales short enough to reveal cause and effect, and to drive automated, run-time feedback and response that improves the performance and efficiency of current and future HPC systems. In addition, data-driven insights from LDMS can influence next-generation architectural design.

LDMS is an R&D 100 award winner and is deployed on extreme-scale HPC systems throughout the world. Advances in LDMS and collaborative R&D directions are discussed at the LDMS Users Group Conference, LDMSCON.

Gitlab Repository
Open-source, Sandia-led Lightweight Distributed Metric Service (LDMS)

Publications

  • Aaziz, O., Allan, B., Brandt, J., Cook, J., Devine, K., Elliott, J., Gentile, A., Hammond, S.D., Kelley, B., Lopatina, L., Moore, S., Olivier, S.L., Pedretti, K., Poliakoff, D., Pawlowski, R., Regier, P., Schmitz, M., Schwaller, B., Surjadidjaja, V., … Walton, S. (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability. https://doi.org/10.2172/1819812 Publication ID: 75582
  • Schwaller, B. (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability (Final). https://doi.org/10.2172/1822583 Publication ID: 75905
  • Elwazir, A., Badawy, A., Aaziz, O., & Cook, J. (2020). LDMS-GPU: Lightweight Distributed Metric Service (LDMS) for NVIDIA GPGPUs. https://doi.org/10.2172/1813665 Publication ID: 71944
  • Goponenko, A.V., Izadpanah, R., Brandt, J., & Dechev, D. (2020). Towards workload-adaptive scheduling for HPC clusters [Conference Poster]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85096231795&origin=inward Publication ID: 74584
  • Tucker, T., Gentile, A., & Brandt, J. (2020). Supporting Dynamic Event Monitoring in the Lightweight Distributed Metric Service (LDMS) [Conference Poster]. https://www.osti.gov/biblio/1812466 Publication ID: 74398
  • Brandt, J. (2019). Lightweight Distributed Metric Service: Deployments Enhancements Roadmap and Activities [Presentation]. https://www.osti.gov/biblio/1645641 Publication ID: 69785
  • Izadpanah, R., Allan, B., Dechev, D., & Brandt, J. (2019). Production application performance data streaming for system monitoring. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 4(2). https://doi.org/10.1145/3319498 Publication ID: 70049
  • Brandt, J., Gentile, A., Showerman, M., Enos, J., Fullop, J., & Bauer, G. (2016). Large-scale persistent numerical data source monitoring system experiences [Conference Poster]. Proceedings – 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016. https://doi.org/10.1109/IPDPSW.2016.188 Publication ID: 48803
  • Allan, B., Lefantzi, S., Walsh, E., Ogden, J., & Gauntt, N. (2016). Lightweight Distributed Metric Service: Production analytics overview for LDMS v2 [Presentation]. https://www.osti.gov/biblio/1369518 Publication ID: 50930
  • Allan, B. (2016). Lightweight Distributed Metric Service: Production overview and development status [Conference Poster]. https://www.osti.gov/biblio/1368706 Publication ID: 50289
  • Feldman, S., Zhang, D., Dechev, D., & Brandt, J. (2015). Extending LDMS to enable performance monitoring in multi-core applications [Conference Poster]. Proceedings – IEEE International Conference on Cluster Computing, ICCC. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=84959259300&origin=inward Publication ID: 45520
  • Brandt, J., Collins, W., Gentile, A., Martinez II, M., Mcree, S., Sands, D., & Yaklin, A. (2015). Uncovering Bottlenecks in Data Transfer from a Filesystem to HPSS using the Lightweight Distributed Metric Service (LDMS) [Conference Poster]. https://www.osti.gov/biblio/1325529 Publication ID: 45513
  • Brandt, J., & Gentile, A. (2014). Lightweight Distributed Metric Service (LDMS) [Presentation]. https://www.osti.gov/biblio/1496673 Publication ID: 37573
  • Agelastos, A., Allan, B., Brandt, J., Gentile, A., Monk, S., Ogden, J., Rajan, M., & Stevenson, J. (2014). The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications [Conference]. https://doi.org/10.1109/SC.2014.18 Publication ID: 40490
  • Agelastos, A., Allan, B., Brandt, J., Cassella, P., Enos, J., Fullop, J., Gentile, A., Monk, S., Naksinehaboon, N., Ogden, J., Rajan, M., Showerman, M., Stevenson, J., Taerat, N., & Tucker, T. (2014). The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications [Conference Poster]. International Conference for High Performance Computing, Networking, Storage and Analysis, SC. https://doi.org/10.1109/SC.2014.18 Publication ID: 39574
  • Brandt, J., & Gentile, A. (2013). Lightweight Distributed Metric Service (LDMS): Run-time Resource Utilization Monitoring [Conference]. https://www.osti.gov/biblio/1106397 Publication ID: 34992
  • Brandt, J., & Gentile, A. (2013). LDMS: Lightweight Distributed Metric Service for HPC Monitoring [Conference]. https://www.osti.gov/biblio/1063385 Publication ID: 32144
  • Gentile, A., & Brandt, J. (2012). Lightweight Distributed Metric Service [Presentation]. https://www.osti.gov/biblio/1648269 Publication ID: 30760
  • Gentile, A., & Brandt, J. (2012). Copy of Lightweight Distributed Metric Service [Presentation]. https://www.osti.gov/biblio/1686350 Publication ID: 30761