Publications

81 Results

Search results

Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence: Tri-Lab Level II Milestone (8554)

Oldfield, Ron; Allan, Benjamin A.; Doutriaux, Charles; Lewis, Katherine; Ahrens, James; Sims, Benjamin; Sweeney, Christine; Banesh, Divya; Wofford, Quincy

A robust data-management infrastructure is a key enabler for National Security Enterprise (NSE) capabilities in artificial intelligence and machine learning. This document describes efforts from a team of researchers at Sandia National Laboratories, Los Alamos National Laboratory, and Lawrence Livermore National Laboratory to complete ASC Level II milestone #8854 “Assessment of Data-Management Infrastructure Needs for Production Use of Advanced Machine Learning and Artificial Intelligence.”

Metrics for Packing Efficiency and Fairness of HPC Cluster Batch Job Scheduling

Proceedings - Symposium on Computer Architecture and High Performance Computing

Goponenko, Alexander V.; Lamar, Kenneth; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Development of job scheduling algorithms, which directly influence the performance of High-Performance Computing (HPC) clusters, is hindered because popular scheduling quality metrics, such as Bounded Slowdown, correlate poorly with global scheduling objectives that include job packing efficiency and fairness. This report proposes Area Weighted Response Time, a metric that offers an unbiased representation of job packing efficiency, and presents a class of new metrics, Priority Weighted Specific Response Time, that assess both packing efficiency and fairness of schedules. Simulations of scheduling real workload traces, analyzed with these metrics and with conventional metrics, demonstrate that although Bounded Slowdown can be readily improved by modifying the standard First Come First Served backfilling algorithm and by using existing techniques of estimating job runtime, these improvements are accompanied by significant degradation of job packing efficiency and fairness. In contrast, improving job packing efficiency and fairness over the standard backfilling algorithm, which is designed to target those objectives, is difficult. It requires further algorithm development and more accurate runtime estimation techniques that reduce the frequency of underpredictions.
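
For context, Bounded Slowdown is commonly defined in the scheduling literature as a job's response time divided by its run time, with a lower bound on the run time so that very short jobs do not dominate, and a floor of 1. The sketch below follows that common definition; the 10-second threshold, the function names, and the sample jobs are assumptions for illustration, and the metrics proposed in the paper are not reproduced here.

```python
def bounded_slowdown(wait_s, run_s, tau_s=10.0):
    """Common literature definition; tau_s is an assumed threshold in seconds."""
    return max((wait_s + run_s) / max(run_s, tau_s), 1.0)


def mean_bounded_slowdown(jobs, tau_s=10.0):
    """Average over (wait_s, run_s) pairs; `jobs` must be non-empty."""
    return sum(bounded_slowdown(w, r, tau_s) for w, r in jobs) / len(jobs)


# Hypothetical jobs: a 1-second job and a 1-hour job, each waiting 5 minutes.
short_job = bounded_slowdown(300.0, 1.0)     # (300 + 1) / 10
long_job = bounded_slowdown(300.0, 3600.0)   # (300 + 3600) / 3600
```

Note that this per-job value does not depend on how many nodes a job occupies, which is one way to see why it need not track packing efficiency.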

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Brandt, James M.; Cook, Jeanine; Aaziz, Omar R.; Allan, Benjamin A.; Devine, Karen; Foulk, James W.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Tom; Tucker, Nick; Vaughan, Courtenay T.; Walton, Sara P.

Abstract not provided.

Integrated System and Application Continuous Performance Monitoring and Analysis Capability

Aaziz, Omar R.; Allan, Benjamin A.; Brandt, James M.; Cook, Jeanine; Devine, Karen; Elliott, James E.; Gentile, Ann C.; Hammond, Simon; Kelley, Brian M.; Lopatina, Lena; Moore, Stan G.; Olivier, Stephen L.; Foulk, James W.; Poliakoff, David; Pawlowski, Roger; Regier, Phillip; Schmitz, Mark E.; Schwaller, Benjamin; Surjadidjaja, Vanessa; Swan, Matthew S.; Tucker, Nick; Tucker, Thomas; Vaughan, Courtenay T.; Walton, Sara P.

Scientific applications run on high-performance computing (HPC) systems are critical for many national security missions within Sandia and the NNSA complex. However, these applications often face performance degradation and even failures that are challenging to diagnose. To provide unprecedented insight into these issues, the HPC Development, HPC Systems, Computational Science, and Plasma Theory & Simulation departments at Sandia crafted and completed their FY21 ASC Level 2 milestone entitled "Integrated System and Application Continuous Performance Monitoring and Analysis Capability." The milestone created a novel integrated HPC system and application monitoring and analysis capability by extending Sandia's Kokkos application portability framework, Lightweight Distributed Metric Service (LDMS) monitoring tool, and scalable storage, analysis, and visualization pipeline. The extensions to Kokkos and LDMS enable collection and storage of application data during run time, as it is generated, with negligible overhead. This data is combined with HPC system data within the extended analysis pipeline to present relevant visualizations of derived system and application metrics that can be viewed at run time or post run. This new capability was evaluated using several week-long, 290-node runs of Sandia's ElectroMagnetic Plasma In Realistic Environments (EMPIRE) modeling and design tool and resulted in 1 TB of application data and 50 TB of system data. EMPIRE developers remarked that this capability was incredibly helpful for quickly assessing application health and performance alongside system state. In short, this milestone work built the foundation for an expansive HPC system and application data collection, storage, analysis, visualization, and feedback framework that will increase the total scientific output of Sandia's HPC users.

Backfilling HPC Jobs with a Multimodal-Aware Predictor

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Lamar, Kenneth; Goponenko, Alexander; Peterson, Christina; Allan, Benjamin A.; Brandt, James M.; Dechev, Damian

Job scheduling aims to minimize the turnaround time of submitted jobs while catering to the resource constraints of High Performance Computing (HPC) systems. The challenge with scheduling is that it must honor job requirements and priorities while actual job run times are unknown. Although approaches have been proposed that use classification techniques or machine learning to predict job run times for scheduling purposes, these approaches do not provide a technique for reducing underprediction, which has a negative impact on scheduling quality. A common cause of underprediction is that the distribution of the duration for a job class is multimodal, causing the average job duration to fall below the expected duration of longer jobs. In this work, we propose the Top Percent predictor, which uses a hierarchical classification scheme to provide better accuracy for job run time predictions than the user-requested time. Our predictor addresses multimodal job distributions by making a prediction that is higher than a specified percentage of the observed job run times. We integrate the Top Percent predictor into scheduling algorithms and evaluate the performance using schedule quality metrics found in the literature. To accommodate the user policies of HPC systems, we propose priority metrics that account for job flow time, job resource requirements, and job priority. The experiments demonstrate that the Top Percent predictor outperforms the related approaches when evaluated using our proposed priority metrics.
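
The idea of predicting a run time that is higher than a specified percentage of observed run times can be sketched with a nearest-rank percentile. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name, the flat `job_class` key (standing in for the hierarchical classification scheme), the fallback to the requested time, and the cap at the requested time are all hypothetical.

```python
import math
from collections import defaultdict


class TopPercentSketch:
    """Hypothetical percentile-style run time predictor (illustration only)."""

    def __init__(self, percent=0.9):
        if not 0.0 < percent <= 1.0:
            raise ValueError("percent must be in (0, 1]")
        self.percent = percent
        self.history = defaultdict(list)  # job_class -> observed run times (s)

    def observe(self, job_class, runtime_s):
        self.history[job_class].append(runtime_s)

    def predict(self, job_class, requested_s):
        runs = sorted(self.history.get(job_class, []))
        if not runs:
            # Assumption: with no history, fall back to the user request.
            return requested_s
        # Nearest-rank percentile: the smallest observed run time that is
        # greater than or equal to `percent` of the observations.
        k = math.ceil(self.percent * len(runs)) - 1
        # Assumption: a prediction never exceeds the user-requested limit.
        return min(runs[k], requested_s)


# Hypothetical bimodal class: six 1-minute runs and four 1-hour runs.
predictor = TopPercentSketch(percent=0.9)
for t in [60.0] * 6 + [3600.0] * 4:
    predictor.observe("user42/solver", t)
mean_runtime = sum(predictor.history["user42/solver"]) / 10
prediction = predictor.predict("user42/solver", requested_s=7200.0)
```

In this bimodal example the mean falls between the two modes and would underpredict every long run, while the percentile-based value lands on the longer mode.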

LDMS Monitoring of EDR InfiniBand Networks

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Allan, Benjamin A.; Aguilar, Michael J.; Schwaller, Benjamin; Langer, Steven

We introduce a new HPC system high-speed network fabric production monitoring tool, the ibnet sampler plugin for LDMS version 4. Large-scale testing of this tool is our work in progress. When deployed appropriately, the ibnet sampler plugin can provide extensive counter data, at frequencies up to 1 Hz. This allows the LDMS monitoring system to be useful for tracking the impact of new network features on production systems. We present preliminary results concerning reliability, performance impact, and usability of the sampler.

Figures of merit for production HPC

Allan, Benjamin A.

This report summarizes a set of figures of merit of interest in monitoring the hardware and hardware usage in a Sandia high performance computing (HPC) center. These figures are computable from high frequency monitoring data and other non-metric data and may aid administrators and customer support personnel in their decision processes. The figures are derived from interviews of the HPC center staff. The figures are in many cases simplistic data reductions, but they are our initial targets in creating dashboards that turn voluminous monitoring data into actionable information. Because simplistic reductions may obscure as well as reveal the situation under study, we also document the necessary 'drill-down' and 'exploration' views needed to make the data better understood quickly. These figures of merit may be compared to dashboarding tools documented by other HPC centers.

Two Weeks In The Life of Skybridge

Allan, Benjamin A.

This report documents a new large public data set for researchers studying the behavior of large commodity high-performance computing systems. Such large data sets are typically confined within institutions and access to them is limited to institutional partners. We provide it to promote HPC research more widely. The data set provides a two week time series of performance data collected once per minute using the Lightweight Distributed Metric Service from the system Skybridge at Sandia National Laboratories and the corresponding job-level accounting information. General system log information is not provided.

Production application performance data streaming for system monitoring

ACM Transactions on Modeling and Performance Evaluation of Computing Systems

Izadpanah, Ramin; Allan, Benjamin A.; Dechev, Damian; Brandt, James M.

In this article, we present an approach to streaming collection of application performance data. Practical application performance tuning and troubleshooting in production high-performance computing (HPC) environments requires an understanding of how applications interact with the platform, including (but not limited to) parallel programming libraries such as Message Passing Interface (MPI). Several profiling and tracing tools exist that collect heavy runtime data traces either in memory (released only at application exit) or on a file system (imposing an I/O load that may interfere with the performance being measured). Although these approaches are beneficial in development stages and post-run analysis, a systemwide and low-overhead method is required to monitor deployed applications continuously. This method must be able to collect information at both the application and system levels to yield a complete performance picture. In our approach, an application profiler collects application event counters. A sampler uses an efficient inter-process communication method to periodically extract the application counters and stream them into an infrastructure for performance data collection. We implement a tool-set based on our approach and integrate it with the Lightweight Distributed Metric Service (LDMS) system, a monitoring system used on large-scale computational platforms. LDMS provides the infrastructure to create and gather streams of performance data in a low overhead manner. We demonstrate our approach using applications implemented with MPI, as it is one of the most common standards for the development of large-scale scientific applications. We utilize our tool-set to study the impact of our approach on an open source HPC application, Nalu. Our tool-set enables us to efficiently identify patterns in the behavior of the application without source-level knowledge. 
We leverage LDMS to collect system-level performance data and explore the correlation between the system and application events. We also demonstrate how our tool-set can help detect anomalies with low latency. We run tests on two different architectures: a system enabled with Intel Xeon Phi and another system equipped with Intel Xeon processors. Our overhead study shows our method imposes at most 0.5% CPU usage overhead on the application in realistic deployment scenarios.

Continuous whole-system monitoring toward rapid understanding of production HPC applications and systems

Parallel Computing

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Gentile, Ann C.; Lefantzi, Sophia; Monk, Stephen T.; Ogden, Jeffry B.; Rajan, Mahesh; Stevenson, Joel O.

A detailed understanding of HPC applications’ resource needs and their complex interactions with each other and HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behaviors of codes under the potentially wide spectrum of actual production conditions and because typical monitoring tools do not capture system resource usage information with high enough fidelity to gain sufficient insight into application performance and demands. In this paper we present both system and application profiling results based on data obtained through synchronized system-wide monitoring on a production HPC cluster at Sandia National Laboratories (SNL). We demonstrate analytic and visualization techniques that we are using to characterize application and system resource usage under production conditions for better understanding of application resource needs. Our goals are to improve application performance (through understanding application-to-resource mapping and system throughput) and to ensure that future system capabilities match their intended workloads.

Emerging techniques for field device security

IEEE Security and Privacy

Mulder, John; Schwartz, Moses; Chavez, Adrian R.; Allan, Benjamin A.

Industrial control systems (ICSs) rely on embedded devices to control essential processes. State-of-the-art security solutions can't detect attacks on these devices at the hardware or firmware level. To improve ICS cybersecurity, defensive measures should focus on inspectability, trustworthiness, and diversity.

The Lightweight Distributed Metric Service: A Scalable Infrastructure for Continuous Monitoring of Large Scale Computing Systems and Applications

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Agelastos, Anthony M.; Allan, Benjamin A.; Brandt, James M.; Cassella, Paul; Enos, Jeremy; Fullop, Joshi; Gentile, Ann C.; Monk, Stephen T.; Naksinehaboon, Nichamon; Ogden, Jeffry B.; Rajan, Mahesh; Showerman, Michael; Stevenson, Joel O.; Taerat, Narate; Tucker, Thomas O.

Understanding how resources of High Performance Compute platforms are utilized by applications both individually and as a composite is key to application and platform performance. Typical system monitoring tools do not provide sufficient fidelity while application profiling tools do not capture the complex interplay between applications competing for shared resources. To gain new insights, monitoring tools must run continuously, system wide, at frequencies appropriate to the metrics of interest while having minimal impact on application performance. We introduce the Lightweight Distributed Metric Service for scalable, lightweight monitoring of large scale computing systems and applications. We describe issues and constraints guiding deployment in Sandia National Laboratories' capacity computing environment and on the National Center for Supercomputing Applications' Blue Waters platform including motivations, metrics of choice, and requirements relating to the scale and specialized nature of Blue Waters. We address monitoring overhead and impact on application performance and provide illustrative profiling results.

Lightweight performance data collectors 2.0 with Eiger support

Allan, Benjamin A.

We report on the use and design of a portable, extensible performance data collection tool motivated by modeling needs of the high performance computing systems co-design community. The lightweight performance data collectors with Eiger support is intended to be a tailorable tool, not a shrink-wrapped library product, as profiling needs vary widely. A single code markup scheme is reported which, based on compilation flags, can send performance data from parallel applications to CSV files, to an Eiger mysql database, or (in a non-database environment) to flat files for later merging and loading on a host with mysql available. The tool supports C, C++, and Fortran applications.

Optimization of CPAPR for x64 multicore

Allan, Benjamin A.

I report the progress to date of my work on scaling the CPAPR algorithm and necessary supporting code to enable processing large (gigabyte to 100 gigabyte) data sets and benchmarking the same. Where possible, I also report background information possibly of relevance in future modifications of the code. The results include: minor repairs and additions to the TTB library for portability, algorithmic improvements relevant to both serial and multithreaded implementations, algorithmic improvements taking advantage of multithreading hardware, and support library additions (binary IO routines) needed for efficiently and reproducibly benchmarking the algorithms. For this optimization work, no large scale data sets are available. Therefore, scalability of data synthesis algorithms is addressed as well.

The theory of diversity and redundancy in information system security : LDRD final report

Mayo, Jackson R.; Armstrong, Robert C.; Allan, Benjamin A.; Walker, Andrea M.

The goal of this research was to explore first principles associated with mixing of diverse implementations in a redundant fashion to increase the security and/or reliability of information systems. Inspired by basic results in computer science on the undecidable behavior of programs and by previous work on fault tolerance in hardware and software, we have investigated the problem and solution space for addressing potentially unknown and unknowable vulnerabilities via ensembles of implementations. We have obtained theoretical results on the degree of security and reliability benefits from particular diverse system designs, and mapped promising approaches for generating and measuring diversity. We have also empirically studied some vulnerabilities in common implementations of the Linux operating system and demonstrated the potential for diversity to mitigate these vulnerabilities. Our results provide foundational insights for further research on diversity and redundancy approaches for information systems.

Parallel computing in enterprise modeling

Heath, Zach; Shneider, Max S.; Vanderveen, Keith; Allan, Benjamin A.; Ray, Jaideep

This report presents the results of our efforts to apply high-performance computing to entity-based simulations with a multi-use plugin for parallel computing. We use the term 'Entity-based simulation' to describe a class of simulation which includes both discrete event simulation and agent based simulation. What simulations of this class share, and what differs from more traditional models, is that the result sought is emergent from a large number of contributing entities. Logistic, economic and social simulations are members of this class where things or people are organized or self-organize to produce a solution. Entity-based problems never have an a priori ergodic principle that will greatly simplify calculations. Because the results of entity-based simulations can only be realized at scale, scalable computing is de rigueur for large problems. Having said that, the absence of a spatial organizing principle makes the decomposition of the problem onto processors problematic. In addition, practitioners in this domain commonly use the Java programming language, which presents its own problems in a high-performance setting. The plugin we have developed, called the Parallel Particle Data Model, overcomes both of these obstacles and is now being used by two Sandia frameworks: the Decision Analysis Center, and the Seldon social simulation facility. While the ability to engage U.S.-sized problems is now available to the Decision Analysis Center, this plugin is central to the success of Seldon. Because Seldon relies on computationally intensive cognitive sub-models, this work is necessary to achieve the scale necessary for realistic results. With the recent upheavals in the financial markets, and the inscrutability of terrorist activity, this simulation domain will likely need a capability with ever greater fidelity. High-performance computing will play an important part in enabling that greater fidelity.

Ccaffeine framework : composing and debugging applications interactively and running them statically

Allan, Benjamin A.; Armstrong, Robert C.

Ccaffeine is a Common Component Architecture (CCA) framework devoted to high-performance computing. In this note we give an overview of the system features of Ccaffeine and CCA that support component-based HPC application development. Object-oriented, single-threaded and lightweight, Ccaffeine is designed to get completely out of the way of the running application after it has been composed from components. Ccaffeine is one of the few frameworks, CCA or otherwise, that can compose and run applications on a parallel machine interactively and then automatically generate a static, possibly self-tuning, executable for production runs. Users can experiment with and debug applications interactively, improving their productivity. When the application is ready, a script is automatically generated, parsed and turned into a static executable for production runs. Within this static executable, dynamic replacement of components can be performed by self-tuning applications.
