Publications

Results 1–100 of 182

Defender Policy Evaluation and Resource Allocation With MITRE ATT&CK Evaluations Data

IEEE Transactions on Dependable and Secure Computing

Outkin, Alexander V.; Schulz, Patricia V.; Schulz, Timothy; Tarman, Thomas D.; Pinar, Ali P.

Protecting against multi-step attacks of uncertain start times and duration forces defenders into an indefinite, always ongoing, resource-intensive response. To allocate resources effectively, the defender must analyze and respond to an uncertain stream of potentially undetected, multiple multi-step attacks and take measures of attack and response intensity over time into account. Such a response requires estimating overall attack success metrics and evaluating the effect of defender strategies and actions associated with specific attack steps on those overall metrics. We present GPLADD, a novel game-theoretic approach to attack metric estimation, and demonstrate it on attack data derived from MITRE's ATT&CK Framework and other sources. In GPLADD, the time to complete attack steps is explicit; the attack dynamics emerge from the attack graph and the attacker-defender capabilities and strategies, and therefore reflect the 'physics' of attacks. The time the attacker takes to complete an attack step is drawn from a probability distribution determined by attacker and defender strategies and capabilities. This makes time a physical constraint on attack success parameters and enables comparing different defender resource allocation strategies across different attacks. We solve for attack success metrics by approximating attacker-defender games as discrete-time Markov chains and show how to evaluate the return on detection investments associated with different attack steps. We apply GPLADD to MITRE's APT3 data from the ATT&CK Framework and show that there are substantial and unintuitive differences in estimated real-world vendor performance against a simplified APT3 attack. We focus on metrics that reflect attack difficulty versus the attacker's ability to remain hidden in the system after gaining control. This enables practical defender optimization and resource allocation against multi-step attacks.
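
The discrete-time Markov chain approximation described above can be illustrated with a small, purely hypothetical sketch (the states, step counts, and probabilities below are made up for illustration and are not the GPLADD model): each time step the attacker either completes the current attack step, is detected, or stays put, and the absorbing-state probabilities give the attack success and detection metrics.

import numpy as np

# Minimal illustrative sketch (not the GPLADD model): a multi-step attack as an
# absorbing discrete-time Markov chain. State i = "attacker has completed i steps";
# absorbing states are "success" (all steps done) and "detected".
p = [0.30, 0.20, 0.25]   # hypothetical per-tick completion probability of each step
d = [0.05, 0.10, 0.08]   # hypothetical per-tick detection probability at each step

n = len(p)
states = n + 2           # 0..n-1 in progress, n = success, n+1 = detected
T = np.zeros((states, states))
for i in range(n):
    T[i, i + 1 if i + 1 < n else n] = p[i]      # advance to the next step (or succeed)
    T[i, n + 1] = d[i]                          # get detected
    T[i, i] = 1.0 - p[i] - d[i]                 # stay on the current step
T[n, n] = T[n + 1, n + 1] = 1.0                 # absorbing states

# Probability of success and detection by a given time horizon.
dist = np.zeros(states)
dist[0] = 1.0
for t in range(200):
    dist = dist @ T
print("P(success by t=200) ~", round(dist[n], 3),
      " P(detected by t=200) ~", round(dist[n + 1], 3))

Changing a single d[i] and re-running gives a crude picture of how a detection investment at one attack step moves the overall success metric, which is the kind of comparison the abstract describes.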

Experimental Validation of a Command and Control Traffic Detection Model

IEEE Transactions on Dependable and Secure Computing

Vugrin, Eric; Hanson, Seth T.; Cruz, Gerardo J.; Glatter, Casey; Tarman, Thomas D.; Pinar, Ali P.

Network intrusion detection systems (NIDS) are commonly used to detect malware communications, including command-and-control (C2) traffic from botnets. NIDS performance assessments have been studied for decades, but mathematical modeling has rarely been used to explore NIDS performance. This paper details a mathematical model that describes a NIDS performing packet inspection and its detection of malware's C2 traffic. The paper further describes an emulation testbed and a set of cyber experiments that used the testbed to validate the model. These experiments included a commonly used NIDS (Snort) and traffic with contents from a pervasive malware (Emotet). Results are presented for two scenarios: a nominal scenario and a “stressed” scenario in which the NIDS cannot process all incoming packets. Model and experiment results match well, with model estimates mostly falling within 95% confidence intervals on the experiment means. Model results were produced 70-3000 times faster than the experimental results. Consequently, the model's predictive capability could potentially be used to support decisions about NIDS configuration and effectiveness that require high-confidence results, quantification of uncertainty, and exploration of large parameter spaces. Furthermore, the experiments provide an example of how emulation testbeds can be used to validate cyber models that include stochastic variability.

Neuromorphic Graph Algorithms

Parekh, Ojas D.; Wang, Yipu; Ho, Yang; Phillips, Cynthia A.; Pinar, Ali P.; Aimone, James B.; Severa, William M.

Graph algorithms enable myriad large-scale applications including cybersecurity, social network analysis, resource allocation, and routing. The scalability of current graph algorithm implementations on conventional computing architectures are hampered by the demise of Moore’s law. We present a theoretical framework for designing and assessing the performance of graph algorithms executing in networks of spiking artificial neurons. Although spiking neural networks (SNNs) are capable of general-purpose computation, few algorithmic results with rigorous asymptotic performance analysis are known. SNNs are exceptionally well-motivated practically, as neuromorphic computing systems with 100 million spiking neurons are available, and systems with a billion neurons are anticipated in the next few years. Beyond massive parallelism and scalability, neuromorphic computing systems offer energy consumption orders of magnitude lower than conventional high-performance computing systems. We employ our framework to design and analyze new spiking algorithms for shortest path and dynamic programming problems. Our neuromorphic algorithms are message-passing algorithms relying critically on data movement for computation. For fair and rigorous comparison with conventional algorithms and architectures, which is challenging but paramount, we develop new models of data-movement in conventional computing architectures. This allows us to prove polynomial-factor advantages, even when we assume a SNN consisting of a simple grid-like network of neurons. To the best of our knowledge, this is one of the first examples of a rigorous asymptotic computational advantage for neuromorphic computing.
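
As a rough illustration of the message-passing idea (not the paper's algorithm, analysis, or hardware model), single-source shortest paths can be read off from spike times when each neuron fires once and each synapse imposes a delay equal to its edge weight. The sketch below simulates that wavefront in discrete time on a small hypothetical graph with integer weights.

# Illustrative sketch: spike-time shortest paths with integer edge weights.
# Each "neuron" fires once; a spike sent along an edge of weight w arrives w
# time steps later, so the first arrival time at a node equals its distance
# from the source. This is a toy simulation, not the paper's neuromorphic
# algorithm or its data-movement model.
from collections import defaultdict

edges = {  # hypothetical directed graph: node -> [(neighbor, integer weight), ...]
    "s": [("a", 2), ("b", 5)],
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 1)],
    "c": [],
}

fired = {}                      # node -> time of first spike (= shortest distance)
pending = defaultdict(list)     # arrival time -> nodes receiving a spike
pending[0].append("s")

t = 0
horizon = sum(w for outs in edges.values() for _, w in outs) + 1
while t <= horizon:
    for node in pending.pop(t, []):
        if node in fired:
            continue            # neuron already fired; later spikes are ignored
        fired[node] = t
        for nbr, w in edges.get(node, []):
            pending[t + w].append(nbr)   # schedule delayed spike delivery
    t += 1

print(fired)   # {'s': 0, 'a': 2, 'b': 3, 'c': 4}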

Science & Engineering of Cyber Security by Uncertainty Quantification and Rigorous Experimentation (SECURE) HANDBOOK

Pinar, Ali P.; Tarman, Thomas D.; Swiler, Laura P.; Gearhart, Jared L.; Hart, Derek; Vugrin, Eric; Cruz, Gerardo J.; Arguello, Bryan; Geraci, Gianluca; Debusschere, Bert; Hanson, Seth T.; Outkin, Alexander V.; Thorpe, Jamie E.; Hart, William E.; Sahakian, Meghan A.; Gabert, Kasimir G.; Glatter, Casey; Johnson, Emma S.; Punla-Green, She'Ifa S.

Abstract not provided.

Science and Engineering of Cybersecurity by Uncertainty quantification and Rigorous Experimentation (SECURE) (Final Report)

Pinar, Ali P.; Tarman, Thomas D.; Swiler, Laura P.; Gearhart, Jared L.; Hart, Derek; Vugrin, Eric; Cruz, Gerardo J.; Arguello, Bryan; Geraci, Gianluca; Debusschere, Bert; Hanson, Seth T.; Outkin, Alexander V.; Thorpe, Jamie E.; Hart, William E.; Sahakian, Meghan A.; Gabert, Kasimir G.; Glatter, Casey; Johnson, Emma S.; Punla-Green, She'Ifa S.

This report summarizes the activities performed as part of the Science and Engineering of Cybersecurity by Uncertainty quantification and Rigorous Experimentation (SECURE) Grand Challenge LDRD project. We provide an overview of the research done in this project, including work on cyber emulation, uncertainty quantification, and optimization. We present examples of integrated analyses performed on two case studies: a network scanning/detection study and a malware command and control study. We highlight the importance of experimental workflows and list references of papers and presentations developed under this project. We outline lessons learned and suggestions for future work.

A Unifying Framework to Identify Dense Subgraphs on Streams: Graph Nuclei to Hypergraph Cores

WSDM 2021 - Proceedings of the 14th ACM International Conference on Web Search and Data Mining

Gabert, Kasimir G.; Pinar, Ali P.; Catalyurek, Umit V.

Finding dense regions of graphs is fundamental in graph mining. We focus on the computation of dense hierarchies and regions with graph nuclei, a generalization of k-cores and trusses. Static computation of nuclei, namely through variants of 'peeling', is easy to understand and implement. However, many practically important graphs undergo continuous change. Dynamic algorithms, maintaining nucleus computations on dynamic graph streams, are nuanced and require significant effort to port between nuclei, e.g., from k-cores to trusses. We propose a unifying framework to maintain nuclei in dynamic graph streams. First, we show no dynamic algorithm can asymptotically beat re-computation, highlighting the need to experimentally understand variability. Next, we prove equivalence between k-cores on a special hypergraph and nuclei. Our algorithm splits the problem into maintaining the special hypergraph and maintaining k-cores on it. We implement our algorithm and experimentally demonstrate improvements of up to 10^8x over re-computation. We show algorithmic improvements on k-cores apply to trusses and outperform truss-specific implementations.

Provable advantages for graph algorithms in spiking neural networks

Annual ACM Symposium on Parallelism in Algorithms and Architectures

Aimone, James B.; Ho, Yang; Parekh, Ojas D.; Phillips, Cynthia A.; Pinar, Ali P.; Severa, William M.; Wang, Yipu

We present a theoretical framework for designing and assessing the performance of algorithms executing in networks consisting of spiking artificial neurons. Although spiking neural networks (SNNs) are capable of general-purpose computation, few algorithmic results with rigorous asymptotic performance analysis are known. SNNs are exceptionally well-motivated practically, as neuromorphic computing systems with 100 million spiking neurons are available, and systems with a billion neurons are anticipated in the next few years. Beyond massive parallelism and scalability, neuromorphic computing systems offer energy consumption orders of magnitude lower than conventional high-performance computing systems. We employ our framework to design and analyze neuromorphic graph algorithms, focusing on shortest path problems. Our neuromorphic algorithms are message-passing algorithms relying critically on data movement for computation, and we develop data-movement lower bounds for conventional algorithms. A fair and rigorous comparison with conventional algorithms and architectures is challenging but paramount. We prove a polynomial-factor advantage even when we assume an SNN consisting of a simple grid-like network of neurons. To the best of our knowledge, this is one of the first examples of a provable asymptotic computational advantage for neuromorphic computing.

Shared-Memory Scalable k-Core Maintenance on Dynamic Graphs and Hypergraphs

2021 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2021 - In conjunction with IEEE IPDPS 2021

Gabert, Kasimir G.; Pinar, Ali P.; Catalyurek, Umit V.

Computing k-cores on graphs is an important graph mining target, as it provides an efficient means of identifying a graph's dense and cohesive regions. Computing k-cores on hypergraphs has seen recent interest, as many datasets naturally produce hypergraphs. Maintaining k-cores as the underlying data changes is important as graphs are large, growing, and continuously modified. In many practical applications, the graph updates are bursty, with both periods of significant activity and periods of relative calm. Existing maintenance algorithms fail to handle large bursts, and prior parallel approaches on both graphs and hypergraphs fail to scale as available cores increase. We address these problems by presenting two parallel and scalable fully-dynamic batch algorithms for maintaining k-cores on both graphs and hypergraphs. Both algorithms take advantage of the connection between k-cores and h-indices. One algorithm is well suited for large batches and the other for small. We provide the first algorithms that experimentally demonstrate scalability as the number of threads increases while sustaining high change rates in graphs and hypergraphs.
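
The k-core/h-index connection mentioned above has a simple static form: start each vertex at its degree and repeatedly replace its value with the h-index of its neighbors' values; the fixed point is the core number. The sketch below only illustrates that connection on a toy graph, under my own naming, and is not the paper's parallel, fully-dynamic batch algorithm.

# Sketch of the h-index/k-core connection: iterate the h-index of neighbor
# values starting from vertex degrees; the fixed point gives core numbers.

def h_index(values):
    """Largest h such that at least h of the values are >= h."""
    values = sorted(values, reverse=True)
    h = 0
    for i, v in enumerate(values, start=1):
        if v >= i:
            h = i
    return h

def core_numbers(adj):
    """adj: dict vertex -> set of neighbors (undirected graph)."""
    core = {v: len(nbrs) for v, nbrs in adj.items()}   # initialize with degrees
    changed = True
    while changed:
        changed = False
        for v, nbrs in adj.items():
            new = h_index([core[u] for u in nbrs])
            if new != core[v]:
                core[v] = new
                changed = True
    return core

# Toy example: a triangle with a pendant vertex.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(core_numbers(adj))   # {1: 2, 2: 2, 3: 2, 4: 1}

Because each vertex only reads its neighbors' values, this style of iteration parallelizes naturally, which is the property the dynamic batch algorithms in the paper build on.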

Defender Policy Evaluation and Resource Allocation against MITRE ATT&CK Data and Evaluations

Outkin, Alexander V.; Schulz, Patricia V.; Schulz, Timothy; Tarman, Thomas D.; Pinar, Ali P.

Protecting against multi-step attacks of uncertain duration and timing forces defenders into an indefinite, always ongoing, resource-intensive response. To effectively allocate resources, a defender must be able to analyze multi-step attacks under the assumption of constantly allocating resources against an uncertain stream of potentially undetected attacks. To achieve this goal, we present a novel methodology that applies a game-theoretic approach to the attack, attacker, and defender data derived from MITRE's ATT&CK® Framework. The time to complete attack steps is drawn from a probability distribution determined by attacker and defender strategies and capabilities. This constrains attack success parameters and enables comparing different defender resource allocation strategies. By approximating attacker-defender games as Markov processes, we represent the attacker-defender interaction, estimate the attack success parameters, determine the effects of attacker and defender strategies, and maximize opportunities for defender strategy improvements against an uncertain stream of attacks. This novel representation and analysis of multi-step attacks enables defender policy optimization and resource allocation, which we illustrate using the APT3 data from MITRE's ATT&CK® Framework.

The Multiple Instance Learning Gaussian Process Probit Model

Proceedings of Machine Learning Research

Wang, Fulton; Pinar, Ali P.

In the Multiple Instance Learning (MIL) scenario, the training data consists of instances grouped into bags. Bag labels specify whether each bag contains at least one positive instance, but instance labels are not observed. Recently, Haußmann et al. [10] tackled the MIL instance label prediction task by introducing the Multiple Instance Learning Gaussian Process Logistic (MIL-GP-Logistic) model, an adaptation of the Gaussian Process Logistic Classification model that inherits its uncertainty quantification and flexibility. Notably, they give a fast mean-field variational inference procedure. However, due to their use of the logit link, they do not maximize the variational inference ELBO objective directly, but rather a lower bound on it. This approximation, as we show, hurts predictive performance. In this work, we propose the Multiple Instance Learning Gaussian Process Probit (MIL-GP-Probit) model, an adaptation of the Gaussian Process Probit Classification model to solve the MIL instance label prediction problem. Leveraging the analytical tractability of the probit link, we give a variational inference procedure based on variable augmentation that maximizes the ELBO objective directly. Applying it, we show MIL-GP-Probit is better calibrated than MIL-GP-Logistic on all 20 datasets of the benchmark 20 Newsgroups dataset collection, and achieves higher AUC than MIL-GP-Logistic on an additional 51 out of 59 datasets. Finally, we show how the probit formulation enables principled bag label predictions and a Gibbs sampling scheme. This is the first exact inference scheme for any Bayesian model for the MIL scenario.
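
For context, the standard MIL assumption ties instance and bag labels together (a bag is positive iff at least one of its instances is); under a probit link on a latent function with a Gaussian process prior, one common way to write the resulting bag likelihood is the sketch below. The symbols are illustrative, and this is a generic formulation of the MIL-with-probit idea rather than the paper's exact variational construction.

\[
p(y_{bj} = 1 \mid f) = \Phi\!\left(f(x_{bj})\right), \qquad
p(Y_b = 1 \mid f) = 1 - \prod_{j \in b} \left(1 - \Phi\!\left(f(x_{bj})\right)\right), \qquad
f \sim \mathcal{GP}(0, k),
\]

where \(x_{bj}\) is instance \(j\) of bag \(b\), \(y_{bj}\) its unobserved instance label, \(Y_b\) the observed bag label, and \(\Phi\) the standard normal CDF; the analytical tractability of \(\Phi\) under Gaussian augmentation is what makes direct ELBO maximization and Gibbs sampling feasible.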

Active Betweenness Cardinality: Algorithms and Applications

Ozkaya, Yusuf; Sariyuce, A.E.; Catalyurek, Umit V.; Pinar, Ali P.

Centrality rankings such as degree, closeness, betweenness, Katz, and PageRank are commonly used to identify critical nodes in a graph. These methods are based on two assumptions that restrict their wider applicability. First, they assume the exact topology of the network is available. Second, they do not take the activity over the network into account and rely only on its topology. However, in many applications, the network is autonomous, vast, and distributed, and it is hard to collect the exact topology. At the same time, the underlying pairwise activity between node pairs is not uniform, and node criticality strongly depends on the activity on the underlying network. In this paper, we propose active betweenness cardinality as a new measure, in which node criticality is based not on the static structure but on the activity of the network. We show how this metric can be computed efficiently using only local information for a given node and how we can find the most critical nodes starting from only a few nodes. We also show how this metric can be used to monitor a network and identify failed nodes. We present experimental results demonstrating how failed nodes can be identified by measuring the active betweenness cardinality of a few nodes in the system.

Developing an Active Learning algorithm for learning Bayesian classifiers under the Multiple Instance Learning scenario

Wang, Fulton; Pinar, Ali P.

In the Multiple Instance Learning scenario, the training data consists of instances grouped into bags, and each bag is labelled with whether it is positive, i.e., contains at least one positive instance. First, Active Learning, in which additional labels can be iteratively requested, has the potential to allow more accurate classifiers to be learned with fewer labels. Active Learning has been applied to Multiple Instance Learning in two settings: when bag labels of unlabelled bags can be requested, and when instance labels within bags known to be positive can be requested. Second, Bayesian Active Learning methods have the potential to learn accurate classifiers with few labels, because they explicitly track the classifier uncertainty and can thus address its knowledge gaps. Yet, no Bayesian Active Learning method exists for the Multiple Instance Learning scenario. In this work, we develop the first such method. We develop a Bayesian classifier for the Multiple Instance Learning scenario, show how it can be efficiently used for Bayesian Active Learning, and perform experiments assessing its performance. While its performance exceeds that when no Active Learning is used, it is sometimes better and sometimes worse than the naive baseline of uncertainty sampling, depending on the situation. This suggests future work: building more customizable Bayesian Active Learning methods for the Multiple Instance Learning scenario, customizable to whether bag or instance label accuracy is targeted and to the labeling budget.

Cyber threat modeling and validation: Port scanning and detection

ACM International Conference Proceeding Series

Vugrin, Eric; Cruz, Gerardo J.; Reedy, Christian; Tarman, Thomas D.; Pinar, Ali P.

Port scanning is a commonly applied technique in the discovery phase of cyber attacks. As such, defending against it has long been the subject of many research and modeling efforts. Though modeling efforts can search large parameter spaces to find effective defensive parameter settings, confidence in modeling results can be hampered by limited or omitted validation efforts. In this paper, we introduce a novel mathematical model that describes port scanning progress by an attacker and intrusion detection by a defender. The paper further describes a set of emulation experiments that we conducted with a virtual testbed and used to validate the model. Results are presented for two scanning strategies: a slow, stealthy approach and a fast, loud approach. Estimates from the model fall within 95% confidence intervals on the means estimated from the experiments. Consequently, the model's predictive capability provides confidence in its use for evaluation and development of defensive strategies against port scanning.

Time Series Dimension Reduction for Surrogate Models of Port Scanning Cyber Emulations

Foulk, James W.; Swiler, Laura P.; Pinar, Ali P.

Surrogate model development is a key resource in the scientific modeling community for providing computational expedience when simulating complex systems without great loss of fidelity. The initial step in developing a surrogate model is identifying the primary governing components of the system. Principal component analysis (PCA) is a widely used data science technique that provides inspection of such driving factors when the objective for modeling is to capture the greatest sources of variance inherent to a dataset. Although an efficient linear dimension reduction tool, PCA makes the fundamental assumption that the data is continuous and normally distributed, and thus it provides ideal performance when these conditions are met. In the case where cyber emulations provide realizations of a port scanning scenario, the data to be modeled follows a discrete time series function composed of monotonically increasing piecewise-constant steps. The sources of variance are related to the timing and magnitude of these steps. Therefore, we consider using XPCA, an extension of PCA for continuous and discrete random variates. This report documents the trade-offs between the PCA and XPCA linear dimension reduction algorithms for the purpose of identifying the key components of greatest variance in our time series data. These components will ultimately provide the basis for future surrogate models of port scanning cyber emulations.

A scalable graph generation algorithm to sample over a given shell distribution

Proceedings - 2020 IEEE 34th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2020

Ozkaya, M.Y.; Balin, M.F.; Pinar, Ali P.; Catalyurek, Umit V.

Graphs are commonly used to model the relationships between various entities. These graphs can be enormously large, and thus scalable graph analysis has been the subject of many research efforts. To enable scalable analytics, many researchers have focused on generating realistic graphs that support controlled experiments for understanding how algorithms perform under changing graph features. Significant progress has been made on scalable graph generators that preserve some important graph properties (e.g., degree distribution, clustering coefficients). In this paper, we study how to sample a graph from the space of graphs with a given shell distribution. The shell distribution is related to the k-core, which is the largest subgraph where each vertex is connected to at least k other vertices. A k-shell is the subset of vertices that are in the k-core but not the (k+1)-core, and the shell distribution comprises the sizes of these shells. Core decompositions are widely used to extract information from graphs and to assist other computations. We present a scalable shared- and distributed-memory graph generator that, given a shell decomposition, generates a random graph that conforms to it. Our extensive experimental results show the efficiency and scalability of our methods. Our algorithm generates 2^33 vertices and 2^37 edges in less than 50 seconds on 384 cores.
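
The shell distribution defined above is exactly what a standard peeling computation produces as a by-product: repeatedly remove a minimum-degree vertex and record the running threshold. The sketch below illustrates the definition on a toy graph (it only extracts the shell-size distribution that a generator like the paper's would take as input; it is not the scalable generation algorithm itself).

# Illustrative peeling sketch: compute core numbers and the shell-size
# distribution of a small undirected graph.
from collections import Counter
import heapq

def shell_distribution(adj):
    """adj: dict vertex -> set of neighbors. Returns (core_number, shell_sizes)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue                       # stale heap entry
        k = max(k, d)                      # peeling threshold never decreases
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core, Counter(core.values())   # shell s = vertices with core number s

adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
core, shells = shell_distribution(adj)
print(core)    # {4: 1, 1: 2, 2: 2, 3: 2}
print(shells)  # Counter({2: 3, 1: 1}): one vertex in shell 1, three in shell 2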

Residual core maximization: An efficient algorithm for maximizing the size of the k-core

Proceedings of the 2020 SIAM International Conference on Data Mining, SDM 2020

Laishram, Ricky; Sariyuce, Ahmet E.; Eliassi-Rad, Tina; Pinar, Ali P.; Soundarajan, Sucheta

In many online social networking platforms, the participation of an individual is motivated by the participation of others. If an individual chooses to leave a platform, this may produce a cascade in which that person’s friends then choose to leave, causing their friends to leave, and so on. In some cases, it may be possible to incentivize key individuals to stay active within the network, thus preventing such a cascade. This problem is modeled using the anchored k-core of a network, which, for a network G and set of anchor nodes A, is the maximal subgraph of G in which every node has a total of at least k neighbors between the subgraph and anchors. In this work, we propose Residual Core Maximization (RCM), a novel algorithm for finding b anchor nodes so that the size of the anchored k-core is maximized. We perform a comprehensive experimental evaluation on numerous real-world networks and compare RCM to various baselines. We observe that RCM is more effective and efficient than the state-of-the-art methods: on average, RCM produces anchored k-cores that are 1.65 times larger than those produced by the baseline algorithm, and is approximately 500 times faster on average.
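
For a fixed anchor set, the anchored k-core defined above can be computed with ordinary peeling in which anchors are never removed. The small sketch below illustrates the definition on a hypothetical graph; choosing the b anchors that maximize the result, which is the hard part RCM addresses, is not shown.

# Illustrative sketch of the anchored k-core definition: given anchors A, keep
# deleting non-anchor vertices with fewer than k remaining neighbors until none
# are left. This evaluates a candidate anchor set; it is not the RCM algorithm
# for choosing the b anchors.

def anchored_k_core(adj, k, anchors):
    """adj: dict vertex -> set of neighbors; anchors: vertices kept unconditionally."""
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if v in anchors:
                continue
            if sum(1 for u in adj[v] if u in alive) < k:
                alive.remove(v)
                changed = True
    return alive

# Toy example: a 4-cycle with a short path hanging off it.
adj = {1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3, 5}, 5: {4, 6}, 6: {5}}
print(anchored_k_core(adj, k=2, anchors=set()))   # {1, 2, 3, 4}: the plain 2-core
print(anchored_k_core(adj, k=2, anchors={6}))     # {1, 2, 3, 4, 5, 6}: anchoring 6 keeps 5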

SECURE: An Evidence-based Approach to Cyber Experimentation

Proceedings - 2019 Resilience Week, RWS 2019

Pinar, Ali P.; Benz, Zachary O.; Castillo, Andrea; Hart, William E.; Swiler, Laura P.; Tarman, Thomas D.

Securing cyber systems is of paramount importance, but rigorous, evidence-based techniques to support decision makers for high-consequence decisions have been missing. The need for bringing rigor into cybersecurity is well-recognized, but little progress has been made over the last decades. We introduce a new project, SECURE, that aims to bring more rigor into cyber experimentation. The core idea is to follow the footsteps of computational science and engineering and expand similar capabilities to support rigorous cyber experimentation. In this paper, we review the cyber experimentation process, present the research areas that underlie our effort, discuss the underlying research challenges, and report on our progress to date. This paper is based on work in progress, and we expect to have more complete results for the conference.

RetSynth: Determining all optimal and sub-optimal synthetic pathways that facilitate synthesis of target compounds in chassis organisms

BMC Bioinformatics

Whitmore, Leanne S.; Nguyen, Bernard; Pinar, Ali P.; George, Anthe G.; Hudson, Corey M.

Background: The efficient biological production of industrially and economically important compounds is a challenging problem. Brute-force determination of the optimal pathways to efficient production of a target chemical in a chassis organism is computationally intractable. Many current methods provide a single solution to this problem, but fail to provide all optimal pathways, optional sub-optimal solutions or hybrid biological/non-biological solutions. Results: Here we present RetSynth, software with a novel algorithm for determining all optimal biological pathways given a starting biological chassis and target chemical. By dynamically selecting constraints, the number of potential pathways scales by the number of fully independent pathways and not by the number of overall reactions or size of the metabolic network. This feature allows all optimal pathways to be determined for a large number of chemicals and for a large corpus of potential chassis organisms. Additionally, this software contains other features including the ability to collect data from metabolic repositories, perform flux balance analysis, and to view optimal pathways identified by our algorithm using a built-in visualization module. This software also identifies sub-optimal pathways and allows incorporation of non-biological chemical reactions, which may be performed after metabolic production of precursor molecules. Conclusions: The novel algorithm designed for RetSynth streamlines an arduous and complex process in metabolic engineering. Our stand-alone software allows the identification of candidate optimal and additional sub-optimal pathways, and provides the user with necessary ranking criteria such as target yield to decide which route to select for target production. Furthermore, the ability to incorporate non-biological reactions into the final steps allows determination of pathways to production for targets that cannot be solely produced biologically. With this comprehensive suite of features RetSynth exceeds any open-source software or webservice currently available for identifying optimal pathways for target production.

An Example of Counter-Adversarial Community Detection Analysis

Kegelmeyer, William P.; Wendt, Jeremy; Pinar, Ali P.

Community detection is often used to understand the nature of a network. However, there may exist an adversarial member of the network who wishes to evade that understanding. We analyze one such specific situation, quantifying the efficacy of certain attacks against a particular analytic use of community detection and providing a preliminary assessment of a possible defense.

Chance-constrained economic dispatch with renewable energy and storage

Computational Optimization and Applications

Safta, Cosmin; Cheng, Jianqiang; Najm, Habib N.; Pinar, Ali P.; Chen, Richard L.Y.; Watson, Jean-Paul

Increasing penetration levels of renewables have transformed how power systems are operated. High levels of uncertainty in production make it increasingly difficult to guarantee operational feasibility; instead, constraints may only be satisfied with high probability. We present a chance-constrained economic dispatch model that efficiently integrates energy storage and high renewable penetration to satisfy renewable portfolio requirements. Specifically, we require that wind energy contribute at least a prespecified proportion of the total demand and that the scheduled wind energy is deliverable with high probability. We develop an approximate partial sample average approximation (PSAA) framework to enable efficient solution of large-scale chance-constrained economic dispatch problems. Computational experiments on the IEEE-24 bus system show that the proposed PSAA approach is more accurate, closer to the prescribed satisfaction tolerance, and approximately 100 times faster than standard sample average approximation. Finally, the improved efficiency of our PSAA approach enables solution of a larger WECC-240 test system in minutes.
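
A compact way to see the structure of such a model is the generic chance-constrained form below. The symbols are a schematic sketch, not the paper's exact notation or constraint set.

\[
\min_{x}\; c^{\top} x
\quad \text{s.t.} \quad
x \in \mathcal{X}, \qquad
\mathbb{P}_{\xi}\!\left[\, g(x,\xi) \le 0 \,\right] \ge 1 - \epsilon,
\]

where \(x\) collects dispatch and storage decisions, \(\xi\) is the uncertain wind output, \(g(x,\xi)\le 0\) encodes the requirement that the scheduled wind energy is deliverable, and \(\epsilon\) is the allowed violation probability. Sample average approximation replaces \(\mathbb{P}_{\xi}\) with an empirical average over scenarios; the PSAA idea is to do this only partially, which is where the reported speedups come from.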

Unsupervised Learning Through Randomized Algorithms for High-Volume High-Velocity Data (ULTRA-HV)

Pinar, Ali P.; Kolda, Tamara G.; Carlberg, Kevin T.; Ballard, Grey; Mahoney, Michael

Through long-term investments in computing, algorithms, facilities, and instrumentation, DOE is an established leader in massive-scale, high-fidelity simulations, as well as science-leading experimentation. In both cases, DOE is generating more data than it can analyze, and the problem is intensifying quickly. The need for advanced algorithms that can automatically convert the abundance of data into a wealth of useful information by discovering hidden structures is well recognized. Such efforts, however, are hindered by the massive volume of the data and its high velocity. Here, the challenge is developing unsupervised learning methods to discover hidden structure in high-volume, high-velocity data.

Exploiting Social Media Sensor Networks through Novel Data Fusion Techniques

Kouri, Tina; Pinar, Ali P.

Unprecedented amounts of data are continuously being generated by sensors (“hard” data) and by humans (“soft” data), and this data needs to be exploited to its full potential. The first step in exploiting this data is to determine how the hard and soft data are related to each other. In this project we fuse hard and soft data, using the attributes of each (e.g., time and space), to gain more information about interesting events. Next, we attempt to use social networking textual data to predict the present (i.e., predict that an interesting event is occurring and details about the event) using data mining, machine learning, natural language processing, and text analysis techniques.

Accurate Characterization of Real Networks from Inaccurate Measurements

Pinar, Ali P.

Our nation's dependence on information networks makes it vital to anticipate disruptions or find weaknesses in these networks. But networks like the Internet are vast and distributed, and there is no mechanism to completely collect their structure. We are restricted to specific data collection schemes (like traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and direct inferences from them will be wrong. But these data collection mechanisms have limited flexibility and cannot be easily modified. Moreover, in many applications there are limits on how much data can be collected. How do we make accurate inferences of network properties with biased and limited measurements? The general problem this report deals with is how to work with incompletely observed networks. We will present several different approaches to this problem. First we will present an approach to estimate the degree distribution of a graph by sampling only a small portion of the vertices. This algorithm provides provably accurate results with sublinear samples. An alternative approach is to enhance the available information by selectively collecting new data, probing for the neighbors of a vertex or for the presence of individual edges. A different setting for working with incomplete networks arises when we have full access to local information but do not have any global version of the graph. Can we still identify critical nodes in such a graph? We present an approach to identify such nodes efficiently. Finally, how can we put these ideas together to identify the structure of a network? We present an approach that can complement the existing approaches for network mapping. We start with an estimate of network structure based on existing network mapping methods. Then we find a critical router in the network and use the traffic through it to selectively collect new data to enhance our prediction.
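
As a toy illustration of the first approach, estimating a degree distribution from a small uniform vertex sample can be sketched as follows. Uniform vertex access and an exact degree oracle are assumed here for simplicity; the report's estimator, its sublinear-sample guarantees, and its bias corrections are more involved.

# Toy sketch: estimate the fraction of vertices at each degree from a uniform
# sample of vertices instead of reading the whole graph. The function names
# and the synthetic degree model are hypothetical.
import random
from collections import Counter

def estimated_degree_distribution(degree_of, num_vertices, sample_size, seed=0):
    rng = random.Random(seed)
    sample = [rng.randrange(num_vertices) for _ in range(sample_size)]
    counts = Counter(degree_of(v) for v in sample)
    return {d: c / sample_size for d, c in sorted(counts.items())}

# Hypothetical graph: vertex v has degree 1 + (v % 5), so each degree 1..5
# should hold roughly 20% of the vertices.
print(estimated_degree_distribution(lambda v: 1 + (v % 5),
                                    num_vertices=1_000_000, sample_size=2_000))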

Understanding the Hierarchy of Dense Subgraphs in Stationary and Temporally Varying Setting

Sariyuce, Ahmet E.; Pinar, Ali P.

Graphs are widely used to model relationships in a wide variety of domains such as sociology, bioinformatics, infrastructure, and the WWW, to name a few. One of the key observations is that while real-world graphs are often globally sparse, they are locally dense. In other words, the average degree is often quite small (say at most 10 in a million-vertex graph), but vertex neighborhoods are often dense. Finding dense subgraphs is a critical aspect of graph mining. It has been used for finding communities and spam link farms in web graphs, graph visualization, real-time story identification, DNA motif detection in biological networks, finding correlated genes, epilepsy prediction, finding price value motifs in financial data, graph compression, distance query indexing, and increasing the throughput of social networking site servers. However, most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, current dense subgraph finding algorithms usually optimize some objective and only find a few such subgraphs without providing any structural relations, whereas the goal is rarely to find the "true optimum," but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. In this project, we first aim to devise algorithms and provide three implementations with visualizations to find the hierarchy between dense subgraphs, and then to understand the structure of the hierarchy to gain more insight into the hidden patterns in real-world networks. Another important aspect of graph analysis is the temporal nature of networks. Networks evolve over time, and in many applications data arrives at a high velocity, so it is important to design algorithms that can process data efficiently. We report three main results toward identifying dense structures in large evolving graphs. First, we show how the hierarchical connectedness structure can be maintained efficiently, where connectedness is defined by increasing levels of connectivity strength. Next, we show how dense structures can be identified in bipartite graphs without building projection graphs. Finally, we present a new method for peeling algorithms. This new approach avoids the sequential nature of peeling algorithms and is amenable to parallelization, which is crucial for processing high-velocity data.

Efficient Uncertainty Quantification in Stochastic Economic Dispatch

IEEE Transactions on Power Systems

Safta, Cosmin; Chen, Richard L.Y.; Najm, Habib N.; Pinar, Ali P.; Watson, Jean-Paul

Stochastic economic dispatch models address uncertainties in forecasts of renewable generation output by considering a finite number of realizations drawn from a stochastic process model, typically via Monte Carlo sampling. Accurate evaluations of expectations or higher order moments for quantities of interest, e.g., generating cost, can require a prohibitively large number of samples. We propose an alternative to Monte Carlo sampling based on polynomial chaos expansions. These representations enable efficient and accurate propagation of uncertainties in model parameters, using sparse quadrature methods. We also use Karhunen-Loève expansions for efficient representation of uncertain renewable energy generation that follows geographical and temporal correlations derived from historical data at each wind farm. Considering expected production cost, we demonstrate that the proposed approach can yield several orders of magnitude reduction in computational cost for solving stochastic economic dispatch relative to Monte Carlo sampling, for a given target error threshold.
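
Schematically, the polynomial chaos surrogate referred to above replaces repeated Monte Carlo evaluation of a quantity of interest with a spectral expansion in the uncertain inputs. The formulas below are a generic sketch of the technique, not the paper's specific construction or notation.

\[
Q(\xi) \;\approx\; \sum_{k=0}^{P} c_k\, \Psi_k(\xi),
\qquad
c_k \;=\; \frac{1}{\langle \Psi_k^2 \rangle} \sum_{q} w_q\, Q(\xi_q)\, \Psi_k(\xi_q),
\]

where the \(\Psi_k\) are polynomials orthogonal with respect to the density of the uncertain inputs \(\xi\) (here, coordinates from a Karhunen-Loève representation of the wind process), and the coefficients \(c_k\) come from a sparse quadrature rule with nodes \(\xi_q\) and weights \(w_q\). Moments of the quantity of interest then follow directly from the coefficients, e.g. \(\mathbb{E}[Q] \approx c_0\) and \(\operatorname{Var}[Q] \approx \sum_{k \ge 1} c_k^2 \langle \Psi_k^2 \rangle\), which is why a handful of quadrature evaluations can replace a large Monte Carlo ensemble.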

Bounded-Degree Connected Approximations of Stochastic Networks

IEEE Transactions on Molecular, Biological, and Multi-Scale Communications

Pinar, Ali P.; Quinn, Christopher J.; Kiyavash, Negar

We propose a method to find optimal sparse connected approximations for large complex networks of nodes interacting over time. Optimality is measured by Kullback-Leibler divergence. The sparsity is controlled by the user through specifying the in-degrees. The approximations have spanning tree subgraphs, enabling them to depict flow through a network. They can also depict feedback. The approximations can capture both significant statistical dependencies in, and preserve salient topological features of, the full (unknown) network. The proposed method can be applied to networks without restrictions on the dynamics beyond stationarity. We analyze computational and sample complexity. We demonstrate their efficacy using a large-scale simulation of the turtle visual cortex and experimental data from the primate motor cortex.

Escape: Efficiently counting all 5-vertex subgraphs

26th International World Wide Web Conference, WWW 2017

Pinar, Ali P.; Seshadhri, C.; Vishal, Vaidyanathan

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adopted to count any small pattern in a graph and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones and using counts of smaller patterns to get larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them have fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns. This algorithm is typically ten times faster than state-of-the-art 4-vertex counters.
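
The degree-orientation trick mentioned above is easiest to see on the triangle-counting special case: orient each edge from its lower-degree endpoint to its higher-degree endpoint and intersect out-neighborhoods, so every triangle is found exactly once and high-degree hubs keep short out-lists. The sketch below illustrates that orientation idea only, with tie-breaking by vertex id chosen here for concreteness; it is not the full ESCAPE framework.

# Minimal sketch of degree-ordered triangle counting. Every triangle has exactly
# one vertex with two outgoing oriented edges, so intersecting out-neighborhoods
# counts each triangle once.

def count_triangles(adj):
    """adj: dict vertex -> set of neighbors (undirected, no self-loops)."""
    rank = lambda v: (len(adj[v]), v)            # degree order with id tie-break
    out = {v: {u for u in nbrs if rank(u) > rank(v)} for v, nbrs in adj.items()}
    triangles = 0
    for v in adj:
        for u in out[v]:
            triangles += len(out[v] & out[u])    # w with v->w and u->w closes a triangle
    return triangles

# Toy example: the complete graph K4 contains 4 triangles.
k4 = {v: {u for u in range(4) if u != v} for v in range(4)}
print(count_triangles(k4))   # 4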

MaxReach: Reducing network incompleteness through node probes

Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016

Soundarajan, Sucheta; Eliassi-Rad, Tina; Gallagher, Brian; Pinar, Ali P.

Real-world network datasets are often incomplete. Subsequently, any analysis on such networks is likely to produce skewed results. We examine the following problem: given an incomplete network, which b nodes should be probed to bring as many new nodes as possible into the observed network? For instance, consider someone who has observed a portion (say 1%) of the Twitter network. How should she use a limited budget to reduce the incompleteness of the network? In this work, we propose a novel algorithm, called MAXREACH, which uses a budget b to increase the number of nodes in the observed network. Our experiments, across a range of datasets and conditions, demonstrate the efficacy of MAXREACH.

Trigger Detection for Adaptive Scientific Workflows Using Percentile Sampling

SIAM Journal on Scientific Computing (Online)

Pinar, Ali P.; Bennett, Janine C.; Salloum, Maher; Bhagatwala, Ankit; Chen, Jacqueline H.; Seshadhri, C.

The increasing complexity of both scientific simulations and high-performance computing system architectures is driving the need for adaptive workflows, in which the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and with high fidelity for post-analysis; however, we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase, and adapt the workflow accordingly. In this paper, we present a methodology for detecting triggers and demonstrate its use in the context of direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation and is thus prohibitively expensive. To overcome this computational bottleneck, we propose a quantile sampling approach. Our sampling-based algorithm comes with provable error/confidence bounds as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, and thus our proposed sampling algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition and reactivity controlled compression ignition simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used to make dynamic workflow decisions regarding data storage and mesh resolution in future combustion simulations.
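
The sampling idea is simple to sketch: estimate a high quantile of the indicator field from a uniform random sample whose size depends only on the desired accuracy and confidence, and fire the trigger when the estimate crosses a threshold. The snippet below illustrates that idea with synthetic data and a Hoeffding-style sample-size bound; it is not the CEMA-based indicator or the S3D workflow, and the threshold and quantile are made up.

# Illustrative quantile-sampling trigger. The sample size needed to estimate a
# quantile to within +/- eps (in rank) with confidence 1 - delta does not depend
# on the number of grid points, which is what makes the approach scale.
import math
import random

def sample_quantile(values, q, eps=0.01, delta=1e-3, seed=0):
    rng = random.Random(seed)
    n = math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))   # independent of len(values)
    sample = sorted(rng.choice(values) for _ in range(n))
    return sample[min(int(q * n), n - 1)]

def trigger_fired(field_values, q=0.99, threshold=0.8):
    """Fire when the estimated q-th quantile of the indicator exceeds the threshold."""
    return sample_quantile(field_values, q) > threshold

# Synthetic "simulation field": mostly quiescent, then with a small energetic region.
rng1, rng2 = random.Random(1), random.Random(2)
quiet = [rng1.random() * 0.5 for _ in range(1_000_000)]
hot = quiet[:]
hot[:20_000] = [0.9 + 0.1 * rng2.random() for _ in range(20_000)]
print(trigger_fired(quiet), trigger_fired(hot))   # False True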

Sparse approximations of directed information graphs

IEEE International Symposium on Information Theory - Proceedings

Quinn, Christopher J.; Pinar, Ali P.; Gao, Jing; Su, Lu

Given a network of agents interacting over time, which few interactions best characterize the dynamics of the whole network? We propose an algorithm that finds the optimal sparse approximation of a network. The user controls the level of sparsity by specifying the total number of edges. The networks are represented using directed information graphs, a graphical model that depicts causal influences between agents in a network. Goodness of approximation is measured with Kullback-Leibler divergence. The algorithm finds the best approximation with no assumptions on the topology or the class of the joint distribution.

Measuring and Modeling Bipartite Graphs with Community Structure

Kolda, Tamara G.; Pinar, Ali P.; Aksoy, Sinan

Network science is a powerful tool for analyzing complex systems in fields ranging from sociology to engineering to biology. This paper is focused on generative models of bipartite graphs, also known as two-way graphs. We propose two generative models that can be easily tuned to reproduce the characteristics of real-world networks, not just qualitatively but quantitatively. The measurements we consider are the degree distributions and the bipartite clustering coefficient, which we refer to as the metamorphosis coefficient. We define edge, node, and degreewise metamorphosis coefficients, enabling a more detailed understanding of the bipartite community structure. Our proposed bipartite Chung-Lu model is able to reproduce real-world degree distributions, and our proposed bipartite “BTER” model reproduces both the degree distributions and the degreewise metamorphosis coefficients. We demonstrate the effectiveness of these models on several real-world data sets.
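
A bipartite Chung-Lu construction can be sketched in a few lines: draw each edge's left endpoint with probability proportional to the target left degree and its right endpoint proportional to the target right degree, so expected degrees match the input sequences. The code below is a simple illustrative variant with hypothetical degree sequences; the paper's efficient implementation and the BTER community layer are not shown.

# Illustrative bipartite Chung-Lu generator: sample m edges, choosing each
# endpoint with probability proportional to its side's target degree. Expected
# degrees then match the targets (repeated edges are possible in this toy version).
import random
from collections import Counter

def bipartite_chung_lu(left_degrees, right_degrees, seed=0):
    assert sum(left_degrees) == sum(right_degrees), "degree sums must match"
    rng = random.Random(seed)
    m = sum(left_degrees)
    edges = [(rng.choices(range(len(left_degrees)), weights=left_degrees)[0],
              rng.choices(range(len(right_degrees)), weights=right_degrees)[0])
             for _ in range(m)]
    return edges

left = [5, 3, 2]        # hypothetical target degrees for left-side nodes
right = [4, 3, 2, 1]    # hypothetical target degrees for right-side nodes
edges = bipartite_chung_lu(left, right)
print(Counter(u for u, _ in edges))   # realized left degrees, expected near [5, 3, 2]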

Enabling adaptive scientific workflows via trigger detection

Proceedings of ISAV 2015: 1st International Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Salloum, Maher; Bennett, Janine C.; Pinar, Ali P.; Bhagatwala, Ankit; Chen, Jacqueline H.

Next generation architectures necessitate a shift away from traditional workflows in which the simulation state is saved at prescribed frequencies for post-processing analysis. While the need to shift to in situ workflows has been acknowledged for some time, much of the current research is focused on static workflows, where the analysis that would have been done as a post-process is performed concurrently with the simulation at user-prescribed frequencies. Recently, research efforts are striving to enable adaptive workflows, in which the frequency, composition, and execution of computational and data manipulation steps dynamically depend on the state of the simulation. Adapting the workflow to the state of simulation in such a data-driven fashion puts extremely strict efficiency requirements on the analysis capabilities that are used to identify the transitions in the workflow. In this paper we build upon earlier work on trigger detection using sublinear techniques to drive adaptive workflows. Here we propose a methodology to detect the time when sudden heat release occurs in simulations of turbulent combustion. Our proposed method provides an alternative metric that can be used along with our former metric to increase the robustness of trigger detection. We show the effectiveness of our metric empirically for predicting heat release for two use cases.

Final Report: Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale

Bennett, Janine C.; Pinar, Ali P.; Seshadhri, C.; Thompson, David; Salloum, Maher; Bhagatwala, Ankit; Chen, Jacqueline H.

Post-Moore's law scaling is creating a disruptive shift in simulation workflows, as saving the entirety of raw data to persistent storage becomes expensive. We are moving away from a post-process centric data analysis paradigm towards a concurrent analysis framework, in which raw simulation data is processed as it is computed. Algorithms must adapt to machines with extreme concurrency, low communication bandwidth, and high memory latency, while operating within the time constraints prescribed by the simulation. Furthermore, input parameters are often data dependent and cannot always be prescribed. The study of sublinear algorithms is a recent development in theoretical computer science and discrete mathematics that has significant potential to provide solutions for these challenges. The approaches of sublinear algorithms address the fundamental mathematical problem of understanding global features of a data set using limited resources. These theoretical ideas align with practical challenges of in-situ and in-transit computation where vast amounts of data must be processed under severe communication and memory constraints. This report details key advancements made in applying sublinear algorithms in-situ to identify features of interest and to enable adaptive workflows over the course of a three year LDRD. Prior to this LDRD, there was no precedent in applying sublinear techniques to large-scale, physics based simulations. This project has definitively demonstrated their efficacy at mitigating high performance computing challenges and highlighted the rich potential for follow-on research opportunities in this space.

Toward using surrogates to accelerate solution of stochastic electricity grid operations problems

2014 North American Power Symposium, NAPS 2014

Safta, Cosmin; Chen, Richard L.Y.; Najm, Habib N.; Pinar, Ali P.; Watson, Jean-Paul

Stochastic unit commitment models typically handle uncertainties in forecast demand by considering a finite number of realizations from a stochastic process model for loads. Accurate evaluations of expectations or higher moments for the quantities of interest require a prohibitively large number of model evaluations. In this paper we propose an alternative approach based on using surrogate models valid over the range of the forecast uncertainty. We consider surrogate models based on Polynomial Chaos expansions, constructed using sparse quadrature methods. Considering expected generation cost, we demonstrate that the approach can lead to several orders of magnitude reduction in computational cost relative to using Monte Carlo sampling on the original model, for a given target error threshold.

Finding Hierarchical and Overlapping Dense Subgraphs using Nucleus Decompositions

Comandur, Seshadhri; Pinar, Ali P.; Sariyuce, Ahmet E.; Catalyurek, Umit

Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum", but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine a hierarchical structure among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any hierarchy. It is also not clear how to account for overlaps in dense substructures. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-trusses. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
