Publications

Results 1–25 of 110
Skip to search filters

Counterfactual Explanations for Multivariate Time Series

2021 International Conference on Applied Artificial Intelligence, ICAPAI 2021

Ates, Emre; Aksar, Burak; Leung, Vitus J.; Coskun, Ayse K.

Multivariate time series are used in many science and engineering domains, including health-care, astronomy, and high-performance computing. A recent trend is to use machine learning (ML) to process this complex data and these ML-based frameworks are starting to play a critical role for a variety of applications. However, barriers such as user distrust or difficulty of debugging need to be overcome to enable widespread adoption of such frameworks in production systems. To address this challenge, we propose a novel explainability technique, CoMTE, that provides counterfactual explanations for supervised machine learning frameworks on multivariate time series data. Using various machine learning frameworks and data sets, we compare CoMTE with several state-of-the-art explainability methods and show that we outperform existing methods in comprehensibility and robustness. We also show how CoMTE can be used to debug machine learning frameworks and gain a better understanding of the underlying multivariate time series data.

More Details

Proctor: A Semi-Supervised Performance Anomaly Diagnosis Framework for Production HPC Systems

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Aksar, Burak; Zhang, Yijia; Ates, Emre; Schwaller, Benjamin S.; Aaziz, Omar R.; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Performance variation diagnosis in High-Performance Computing (HPC) systems is a challenging problem due to the size and complexity of the systems. Application performance variation leads to premature termination of jobs, decreased energy efficiency, or wasted computing resources. Manual root-cause analysis of performance variation based on system telemetry has become an increasingly time-intensive process as it relies on human experts and the size of telemetry data has grown. Recent methods use supervised machine learning models to automatically diagnose previously encountered performance anomalies in compute nodes. However, supervised machine learning models require large labeled data sets for training. This labeled data requirement is restrictive for many real-world application domains, including HPC systems, because collecting labeled data is challenging and time-consuming, especially considering anomalies that sparsely occur. This paper proposes a novel semi-supervised framework that diagnoses previously encountered performance anomalies in HPC systems using a limited number of labeled data points, which is more suitable for production system deployment. Our framework first learns performance anomalies’ characteristics by using historical telemetry data in an unsupervised fashion. In the following process, we leverage supervised classifiers to identify anomaly types. While most semi-supervised approaches do not typically use anomalous samples, our framework takes advantage of a few labeled anomalous samples to classify anomaly types. We evaluate our framework on a production HPC system and on a testbed HPC cluster. We show that our proposed framework achieves 60% F1-score on average, outperforming state-of-the-art supervised methods by 11%, and maintains an average 0.06% anomaly miss rate.

More Details

Using Monitoring Data to Improve HPC Performance via Network-Data-Driven Allocation

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Zhang, Yijia; Aksar, Burak; Aaziz, Omar R.; Schwaller, Benjamin S.; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

On high-performance computing (HPC) systems, job allocation strategies control the placement of a job among available nodes. As the placement changes a job's communication performance, allocation can significantly affects execution times of many HPC applications. Existing allocation strategies typically make decisions based on resource limit, network topology, communication patterns, etc. However, system network performance at runtime is seldom consulted in allocation, even though it significantly affects job execution times.In this work, we demonstrate using monitoring data to improve HPC systems' performance by proposing a NetworkData-Driven (NeDD) job allocation framework, which monitors the network performance of an HPC system at runtime and allocates resources based on both network performance and job characteristics. NeDD characterizes system network performance by collecting the network traffic statistics on each router link, and it characterizes a job's sensitivity to network congestion by collecting Message Passing Interface (MPI) statistics. During allocation, NeDD pairs network-sensitive (network-insensitive) jobs with nodes whose parent routers have low (high) network traffic. Through experiments on a large HPC system, we demonstrate that NeDD reduces the execution time of parallel applications by 11% on average and up to 34%.

More Details

Online Diagnosis of Performance Variation in HPC Systems Using Machine Learning

IEEE Transactions on Parallel and Distributed Systems

Tuncer, Ozan; Ates, Emre; Zhang, Yijia; Turk, Ata; Brandt, James M.; Leung, Vitus J.; Egele, Manuel; Coskun, Ayse K.

As the size and complexity of high performance computing (HPC) systems grow in line with advancements in hardware and software technology, HPC systems increasingly suffer from performance variations due to shared resource contention as well as software-and hardware-related problems. Such performance variations can lead to failures and inefficiencies, which impact the cost and resilience of HPC systems. To minimize the impact of performance variations, one must quickly and accurately detect and diagnose the anomalies that cause the variations and take mitigating actions. However, it is difficult to identify anomalies based on the voluminous, high-dimensional, and noisy data collected by system monitoring infrastructures. This paper presents a novel machine learning based framework to automatically diagnose performance anomalies at runtime. Our framework leverages historical resource usage data to extract signatures of previously-observed anomalies. We first convert collected time series data into easy-to-compute statistical features. We then identify the features that are required to detect anomalies, and extract the signatures of these anomalies. At runtime, we use these signatures to diagnose anomalies with negligible overhead. We evaluate our framework using experiments on a real-world HPC supercomputer and demonstrate that our approach successfully identifies 98 percent of injected anomalies and consistently outperforms existing anomaly diagnosis techniques.

More Details

Statistical models of dengue fever

Communications in Computer and Information Science

Link, Hamilton E.; Richter, Samuel N.; Leung, Vitus J.; Brost, Randolph B.; Phillips, Cynthia A.; Staid, Andrea S.

We use Bayesian data analysis to predict dengue fever outbreaks and quantify the link between outbreaks and meteorological precursors tied to the breeding conditions of vector mosquitos. We use Hamiltonian Monte Carlo sampling to estimate a seasonal Gaussian process modeling infection rate, and aperiodic basis coefficients for the rate of an “outbreak level” of infection beyond seasonal trends across two separate regions. We use this outbreak level to estimate an autoregressive moving average (ARMA) model from which we extrapolate a forecast. We show that the resulting model has useful forecasting power in the 6–8 week range. The forecasts are not significantly more accurate with the inclusion of meteorological covariates than with infection trends alone.

More Details

Adverse Event Prediction Using Graph-Augmented Temporal Analysis: Final Report

Brost, Randolph B.; Carrier, Erin E.; Carroll, Michelle C.; Groth, Katrina M.; Kegelmeyer, William P.; Leung, Vitus J.; Link, Hamilton E.; Patterson, Andrew J.; Phillips, Cynthia A.; Richter, Samuel N.; Robinson, David G.; Staid, Andrea S.; Woodbridge, Diane M.-K.

This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to de- velop a method for analyzing multiple time-series data streams to identify precursors provid- ing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments, and gathers the abstracts and references for the publication sub- missions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.

More Details

Level-spread: A new job allocation policy for dragonfly networks

Proceedings - 2018 IEEE 32nd International Parallel and Distributed Processing Symposium, IPDPS 2018

Zhang, Yijia; Tuncer, Ozan; Kaplan, Fulya; Olcoz, Katzalin; Leung, Vitus J.; Coskun, Ayse K.

The dragonfly network topology has attracted attention in recent years owing to its high radix and constant diameter. However, the influence of job allocation on communication time in dragonfly networks is not fully understood. Recent studies have shown that random allocation is better at balancing the network traffic, while compact allocation is better at harnessing the locality in dragonfly groups. Based on these observations, this paper introduces a novel allocation policy called Level-Spread for dragonfly networks. This policy spreads jobs within the smallest network level that a given job can fit in at the time of its allocation. In this way, it simultaneously harnesses node adjacency and balances link congestion. To evaluate the performance of Level-Spread, we run packet-level network simulations using a diverse set of application communication patterns, job sizes, and communication intensities. We also explore the impact of network properties such as the number of groups, number of routers per group, machine utilization level, and global link bandwidth. Level-Spread reduces the communication overhead by 16% on average (and up to 71%) compared to the state-of-The-Art allocation policies.

More Details

Taxonomist: Application Detection Through Rich Monitoring Data

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Ates, Emre; Tuncer, Ozan; Turk, Ata; Leung, Vitus J.; Brandt, James M.; Egele, Manuel; Coskun, Ayse K.

Modern supercomputers are shared among thousands of users running a variety of applications. Knowing which applications are running in the system can bring substantial benefits: knowledge of applications that intensively use shared resources can aid scheduling; unwanted applications such as cryptocurrency mining or password cracking can be blocked; system architects can make design decisions based on system usage. However, identifying applications on supercomputers is challenging because applications are executed using esoteric scripts along with binaries that are compiled and named by users. This paper introduces a novel technique to identify applications running on supercomputers. Our technique, Taxonomist, is based on the empirical evidence that applications have different and characteristic resource utilization patterns. Taxonomist uses machine learning to classify known applications and also detect unknown applications. We test our technique with a variety of benchmarks and cryptocurrency miners, and also with applications that users of a production supercomputer ran during a 6 month period. We show that our technique achieves nearly perfect classification for this challenging data set.

More Details

Trends in Data Locality Abstractions for HPC Systems

IEEE Transactions on Parallel and Distributed Systems

Unat, Didem; Dubey, Anshu; Hoefler, Torsten; Shalf, John B.; Abraham, Mark; Bianco, Mauro; Chamberlain, Bradford L.; Cledat, Romain; Edwards, Harold C.; Finkel, Hal; Fuerlinger, Karl; Hannig, Frank; Jeannot, Emmanuel; Kamil, Amir; Keasler, Jeff; Kelly, Paul H.J.; Leung, Vitus J.; Ltaief, Hatem; Maruyama, Naoya; Newburn, Chris J.; Pericas, Miquel

More Details
Results 1–25 of 110
Results 1–25 of 110