Walton, S.P., Allan, B., & Brandt, J.M. (2023). LDMS Version 4.3+ Basics Tutorial [Conference Presentation]. https://doi.org/10.2172/2430977
Schwaller, B., Brandt, J.M., & Leung, V.J. (2023). Towards Practical Machine Learning Frameworks for Performance Diagnostics in Supercomputers [Conference Paper]. https://www.osti.gov/biblio/2431456
Brandt, J.M., & Gentile, A.C. (2022). AppSysFusion: CoMingling of appropriate data to drive Codesign of Applications, HPC Platforms, and Monitoring, Analysis, and Feedback Infrastructure [Conference Presentation]. https://doi.org/10.2172/2006042
Brandt, J.M. (2022). Darshan I/O Runtime Monitoring [Conference Presentation]. https://doi.org/10.2172/2006147
Goponenko, A., Lamar, K., Peterson, C., Allan, B., Brandt, J.M., & Dechev, D. (2022). Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling [Conference Presentation]. https://doi.org/10.2172/2005924
Brandt, J.M. (2022). Moving Towards Autonomous HPC Facilities [Conference Presentation]. https://doi.org/10.2172/2004563
Brandt, J.M., Showerman, M., Roman, E., Greenseid, J., & Tucker, T. (2022). Fallout: A Monitoring Infrastructure Supporting Informed System Acceptance [Conference Presentation]. https://doi.org/10.2172/2004564
Aksar, B., Sencan, E., Schwaller, B., Aaziz, O., Kulis, B., Coskun, A.K., Leung, V.J., & Brandt, J.M. (2022). ALBADross: Active Learning Based Anomaly Diagnosis for Production HPC Systems [Conference Proceeding]. https://doi.org/10.1109/CLUSTER51413.2022.00048
Goponenko, A.V., Lamar, K., Peterson, C., Allan, B., Brandt, J.M., & Dechev, D. (2022). Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling [Conference Paper]. Proceedings - Symposium on Computer Architecture and High Performance Computing. https://doi.org/10.1109/SBAC-PAD55451.2022.00035
Brandt, J.M., Gentile, A.C., Walton, S.P., Allan, B., & Tucker, T. (2021). LDMS Version 4.3 Tutorial Part 1: Basics [Conference Presentation]. https://doi.org/10.2172/1899500
Brandt, J.M., Gentile, A.C., & Tucker, T. (2021). LDMS Version 4.3.8 Advanced Tutorial: Part 2 [Conference Presentation]. https://doi.org/10.2172/1898478
Brandt, J.M., Gentile, A.C., & Tucker, T. (2021). LDMS Version 4.3.8 Advanced Tutorial: Part 1 [Conference Presentation]. https://doi.org/10.2172/1898488
Aaziz, O., Allan, B., Brandt, J.M., Cook, J., Devine, K., Elliott, J., Gentile, A.C., Hammond, S., Kelley, B., Lopatina, L., Moore, S.G., Olivier, S.L., Bachman, W.B., Poliakoff, D., Pawlowski, R., Regier, P., Schmitz, M.E., Schwaller, B., Surjadidjaja, V., … Walton, S.P. (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability. https://doi.org/10.2172/1819812
Brandt, J.M., Cook, J., Aaziz, O., Allan, B., Devine, K., Bachman, W.B., Gentile, A.C., Hammond, S., Kelley, B., Lopatina, L., Moore, S.G., Olivier, S.L., Bachman, W.B., Poliakoff, D., Pawlowski, R., Regier, P., Schmitz, M.E., Schwaller, B., Surjadidjaja, V., … Walton, S.P. (2021). Integrated System and Application Continuous Performance Monitoring and Analysis Capability [Presentation]. https://www.osti.gov/biblio/1886175
Aksar, B., Zhang, Y., Ates, E., Aaziz, O., Schwaller, B., Brandt, J.M., Leung, V.J., Egele, M., & Coskun, A.K. (2021). E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems [Conference Presentation]. https://doi.org/10.2172/1891960
Costa, E., Patel, T., Schwaller, B., Brandt, J.M., & Tiwari, D. (2021). Lessons From Examining Repetitive Job Behavior and I/O Performance Variability on a Production HPC System [Conference Paper]. https://www.osti.gov/biblio/1884199
Gentile, A.C., & Brandt, J.M. (2021). Integrating Systems Operations into CoDesign [Conference Presentation]. https://doi.org/10.2172/1877538
Brandt, J.M. (2021). Integrating System State and Application Performance Monitoring: Network Contention Impact [Conference Presentation]. https://doi.org/10.2172/1884471
Aksar, B., Zhang, Y., Ates, E., Aaziz, O., Schwaller, B., Brandt, J.M., Leung, V.J., Egele, M., & Coskun, A.K. (2021). E2EWatch: End-to-end Anomaly Diagnosis Framework for Production HPC Systems [Conference Paper]. https://doi.org/10.1007/978-3-030-85665-6_5
Gentile, A.C., Brandt, J.M., Cook, J., Hammond, S., Poliakoff, D., Schwaller, B., Surjadidjaja, V., & Tucker, T.O. (2021). Enabling Application and System Data Fusion [Conference Presentation]. https://doi.org/10.2172/1863505
Brandt, J.M., Enos, J., Gentile, A.C., & Kramer, W. (2021). Including Operations Analytics & Communication In Next Generation CoDesign [Conference Presentation]. https://doi.org/10.2172/1856310
Aksar, B., Zhang, Y., Ates, E., Schwaller, B., Aaziz, O., Leung, V.J., Brandt, J.M., Egele, M., & Coskun, A.K. (2021). Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems [Conference Proceeding]. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). https://doi.org/10.1007/978-3-030-78713-4_11
Lamar, K., Goponenko, A., Peterson, C., Allan, B., Brandt, J.M., & Dechev, D. (2021). Backfilling HPC Jobs with a Multimodal-Aware Predictor [Conference Paper]. Proceedings - IEEE International Conference on Cluster Computing, ICCC. https://doi.org/10.1109/Cluster48925.2021.00093
Zhang, Y., Aksar, B., Aaziz, O., Schwaller, B., Brandt, J.M., Leung, V.J., Egele, M., & Coskun, A.K. (2021). Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation [Conference Presentation]. 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021. https://doi.org/10.2172/1888952
Zhang, Y., Aksar, B., Aaziz, O., Schwaller, B., Brandt, J.M., Leung, V.J., Egele, M., & Coskun, A.K. (2021). Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation [Conference Proceeding]. 2021 IEEE High Performance Extreme Computing Conference, HPEC 2021. https://doi.org/10.1109/HPEC49654.2021.9622783
Schwaller, B., & Brandt, J.M. (2020). CCE MMAI FY20-21 Accomplishment and Planned Activities [Presentation]. https://www.osti.gov/biblio/1836907
Brandt, J.M., Goponenko, A.V., Dechev, D., & Izadpanah, R. (2020). Towards workload-adaptive scheduling for HPC clusters [Conference Poster]. https://www.osti.gov/biblio/1814414
Tucker, T., Gentile, A.C., & Brandt, J.M. (2020). Supporting Dynamic Event Monitoring in the Lightweight Distributed Metric Service (LDMS) [Conference Poster]. https://www.osti.gov/biblio/1812466
Schwaller, B., Allan, B., Brandt, J.M., Tucker, T., & Tucker, N. (2020). HPC System Data Pipeline to Enable Meaningful Insights through Analytic-Driven Visualizations [Conference Poster]. https://www.osti.gov/biblio/1814415
Brandt, J.M. (2020). HPC Monitoring and Analysis at Sandia National Laboratories [Conference Poster]. https://www.osti.gov/biblio/1765307
Aaziz, O., Allan, B., Brandt, J.M., Cook, J., Devine, K., Bachman, W.B., Gentile, A.C., Olivier, S.L., Bachman, W.B., & Tucker, T. (2020). Attributing Performance Variation from Integrated Application and System Data [Conference Poster]. https://www.osti.gov/biblio/1765520
Brandt, J.M. (2019). Improving HPC Productivity Through Monitoring Analysis and Feedback [Presentation]. https://www.osti.gov/biblio/1646158
Aksar, B., Schwaller, B., Aaziz, O., Ates, E., Brandt, J.M., Coskun, A., Egele, M., & Leung, V.J. (2019). AD for Machine Learning Approach to Understanding HPC Application Performance Variation Poster [Conference Poster]. https://www.osti.gov/biblio/1642788
Schwaller, B., Aksar, B., Aaziz, O., Ates, E., Brandt, J.M., Coskun, A., Egele, M., & Leung, V.J. (2019). A Machine Learning Approach to Understanding HPC Application Performance Variation [Conference Poster]. https://www.osti.gov/biblio/1642784
Gauntt, N.E., Davis, K., Repik, J.J., Brandt, J.M., Gentile, A.C., & Hammond, S. (2019). Design Installation and Operation of the Vortex ART Platform. https://doi.org/10.2172/1562796
Brandt, J.M. (2019). HPC Monitoring & Analysis + Power 9 Specifics [Presentation]. https://www.osti.gov/biblio/1645803
Jha, S., Patke, A., Brandt, J.M., Gentile, A.C., Showerman, M., Roman, E., Kalbarczyk, Z.T., Kramer, B., & Iyer, R.K. (2019). A study of network congestion in two supercomputing high-speed interconnects [Conference Poster]. Proceedings - 2019 IEEE Symposium on High-Performance Interconnects, HOTI 2019. https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85076149891&origin=inward
Brandt, J.M. (2019). Lightweight Distributed Metric Service: Deployments Enhancements Roadmap and Activities [Presentation]. https://www.osti.gov/biblio/1645641
Ates, E., Tuncer, O., Turk, A., Leung, V.J., Brandt, J.M., Egele, M., & Coskun, A.K. (2019). Taxonomist: Application Detection through Rich Monitoring Data [Presentation]. https://doi.org/10.1007/978-3-319-96983-1_7
Brandt, J.M., Brown, C.J., Bachman, W.B., Gentile, A.C., Greenseid, J., Kramer, W., Langer, P., Rashid, A., Rhem, K., & Showerman, M. (2019). Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System (Final Version) [Conference Poster]. https://www.osti.gov/biblio/1640116
Brandt, J.M., Brown, C.J., Bachman, W.B., Gentile, A.C., Greenseid, J., Kramer, W., Langer, P., Rashid, A., Rhem, K., & Showerman, M. (2019). Exploring New Monitoring and Analysis Capabilities on Cray's Software Preview System [Conference Poster]. https://www.osti.gov/biblio/1639961
Tuncer, O., Ates, E., Zhang, Y., Turk, A., Brandt, J.M., Leung, V.J., Egele, M., & Coskun, A.K. (2019). Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning. IEEE Transactions on Parallel and Distributed Systems, 30(4), pp. 883-896. https://doi.org/10.1109/TPDS.2018.2870403
Izadpanah, R., Allan, B., Dechev, D., & Brandt, J.M. (2019). Production application performance data streaming for system monitoring. ACM Transactions on Modeling and Performance Evaluation of Computing Systems, 4(2). https://doi.org/10.1145/3319498
Kramer, B., Bauer, G., Bode, B., Showerman, M., Enos, J., Saxton, A., Jha, S., Kalbarczyk, Z., Iyer, R., Brandt, J.M., & Gentile, A.C. (2019). Holistic Measurement Driven System Assessment [Presentation]. https://www.osti.gov/biblio/1592279
Gentile, A.C., & Brandt, J.M. (2018). Application and System Performance Metrics [Presentation]. https://www.osti.gov/biblio/1594278
Ahlgren, V., Andersson, S., Brandt, J.M., Cardo, N., Chunduri, S., Enos, J., Fields, P., Gentile, A.C., Gerber, R., Gienger, M., Greenseid, J., Greiner, A., Hadri, B., He, Y., Hoppe, D., Kaila, U., Kelly, K., Klein, M., Kristiansen, A., … Williams, J. (2018). Large-Scale System Monitoring Experiences and Recommendations [Conference Poster]. https://doi.org/10.1109/CLUSTER.2018.00069
Izadpanah, R., Naksinehaboon, N., Brandt, J.M., Gentile, A.C., & Dechev, D. (2018). Integrating low-latency analysis into HPC system monitoring [Conference Poster]. ACM International Conference Proceeding Series. https://doi.org/10.1145/3225058.3225086
Brandt, J.M., Tucker, T., & Gentile, A.C. (2018). OVIS Update 08/24/18 [Presentation]. https://www.osti.gov/biblio/1583057
Jha, S., Brandt, J.M., Gentile, A.C., Kalbarczyk, Z., & Iyer, R. (2018). Characterizing Supercomputer Traffic Networks Through Link-Level Analysis [Conference Poster]. https://doi.org/10.1109/CLUSTER.2018.00072