Publications

Results 51–100 of 182

Unsupervised Learning Through Randomized Algorithms for High-Volume High-Velocity Data (ULTRA-HV)

Pinar, Ali P.; Kolda, Tamara G.; Carlberg, Kevin T.; Ballard, Grey; Mahoney, Michael

Through long-term investments in computing, algorithms, facilities, and instrumentation, DOE is an established leader in massive-scale, high-fidelity simulations, as well as science-leading experimentation. In both cases, DOE is generating more data than it can analyze, and the problem is intensifying quickly. The need for advanced algorithms that can automatically convert the abundance of data into a wealth of useful information by discovering hidden structures is well recognized. Such efforts, however, are hindered by the massive volume of the data and its high velocity. Here, the challenge is developing unsupervised learning methods to discover hidden structure in high-volume, high-velocity data.

Exploiting Social Media Sensor Networks through Novel Data Fusion Techniques

Kouri, Tina; Pinar, Ali P.

Unprecedented amounts of data are continuously being generated by sensors (“hard” data) and by humans (“soft” data), and this data needs to be exploited to its full potential. The first step in exploiting this data is to determine how the hard and soft data are related to each other. In this project we fuse hard and soft data, using the attributes of each (e.g., time and space), to gain more information about interesting events. Next, we attempt to use social networking textual data to predict the present (i.e., to detect that an interesting event is occurring and infer details about the event) using data mining, machine learning, natural language processing, and text analysis techniques.

Accurate Characterization of Real Networks from Inaccurate Measurements

Pinar, Ali P.

Our nation's dependence on information networks makes it vital to anticipate disruptions and find weaknesses in these networks. But networks like the Internet are vast and distributed, and there is no mechanism to collect their structure completely. We are restricted to specific data collection schemes (like traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and direct inferences from them will be wrong. But these data collection mechanisms have limited flexibility and cannot be easily modified. Moreover, in many applications there are limits on how much data can be collected. How do we make accurate inferences of network properties with biased and limited measurements? The general problem this report deals with is how to work with incompletely observed networks. We present several different approaches to this problem. First, we present an approach to estimate the degree distribution of a graph by sampling only a small portion of the vertices; this algorithm provides provably accurate results with sublinear samples. An alternative approach is to enhance the information in the observed network by selectively collecting new information, probing for the neighbors of a vertex or for the presence of individual edges. A different setting for working with incomplete networks arises when we have full access to local information but do not have any global view of the graph. Can we still identify critical nodes in such a graph? We present an approach to identify such nodes efficiently. Finally, how can we put these ideas together to identify the structure of a network? We present an approach that complements existing approaches for network mapping: we start with an estimate of network structure based on existing network mapping methods, then find a critical router in the network and use the traffic through this router to selectively collect new data that enhances our prediction.
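The degree-distribution estimator mentioned above samples only a small fraction of the vertices. As a rough illustration of the idea (not the report's provably accurate algorithm, which takes more care to control error), a naive uniform-vertex-sampling sketch might look as follows; the function name and toy star graph are hypothetical:

```python
import random
from collections import Counter

def estimate_degree_distribution(adj, num_samples, seed=0):
    """Estimate the fraction of vertices having each degree by
    sampling vertices uniformly at random.  adj: vertex -> neighbor set."""
    rng = random.Random(seed)
    vertices = list(adj)
    counts = Counter(len(adj[rng.choice(vertices)]) for _ in range(num_samples))
    return {d: c / num_samples for d, c in counts.items()}

# Toy graph: a star with one hub (degree 4) and four leaves (degree 1).
adj = {0: {1, 2, 3, 4}, 1: {0}, 2: {0}, 3: {0}, 4: {0}}
est = estimate_degree_distribution(adj, 1000)
```

The key property is that the number of samples required for a target accuracy does not grow with the number of vertices, which is what makes a sublinear guarantee possible.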

Understanding the Hierarchy of Dense Subgraphs in Stationary and Temporally Varying Setting

Sariyuce, Ahmet E.; Pinar, Ali P.

Graphs are widely used to model relationships in a variety of domains such as sociology, bioinformatics, infrastructure, and the WWW, to name a few. One of the key observations is that while real-world graphs are often globally sparse, they are locally dense. In other words, the average degree is often quite small (say at most 10 in a million-vertex graph), but vertex neighborhoods are often dense. Finding dense subgraphs is a critical aspect of graph mining. It has been used for finding communities and spam link farms in web graphs, graph visualization, real-time story identification, DNA motif detection in biological networks, finding correlated genes, epilepsy prediction, finding price value motifs in financial data, graph compression, distance query indexing, and increasing the throughput of social networking site servers. However, most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, current dense subgraph finding algorithms usually optimize some objective and only find a few such subgraphs without providing any structural relations, whereas the goal is rarely to find the "true optimum," but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. In this project, we first aim to devise algorithms, and provide three implementations with visualizations, to find the hierarchy between dense subgraphs, and then to understand the structure of the hierarchy to gain more insight into the hidden patterns in real-world networks. Another important aspect of graph analysis is the temporal nature of networks. Networks evolve over time, and in many applications data arrives at high velocity, so it is important to design algorithms that can process data efficiently. We report three main results towards identifying dense structures in large evolving graphs. First, we show how the hierarchical connectedness structure can be maintained efficiently, where connectedness is defined by increasing levels of connectivity strength. Next, we show how dense structures can be identified in bipartite graphs without building projection graphs. Finally, we present a new framework for peeling algorithms; this approach avoids the sequential nature of peeling and is amenable to parallelization, which is crucial for processing high-velocity data.
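For context, the sequential peeling that this framework seeks to avoid looks like the classic k-core computation below: vertices are removed one at a time in order of current degree, which is inherently serial. This is a textbook sketch, not the project's parallel algorithm:

```python
def core_numbers(adj):
    """Classic sequential peeling for k-core decomposition: repeatedly
    remove a vertex of minimum remaining degree; the running maximum of
    the degrees at removal time is each vertex's core number."""
    deg = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        v = min(alive, key=deg.get)   # the serial bottleneck: one vertex at a time
        k = max(k, deg[v])
        core[v] = k
        alive.remove(v)
        for u in adj[v]:
            if u in alive:
                deg[u] -= 1
    return core

# A triangle {0, 1, 2} with a pendant vertex 3 attached to vertex 2.
cores = core_numbers({0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}})
```

Because each removal depends on all earlier removals, this loop parallelizes poorly, which is precisely the limitation that motivates the new peeling framework described above.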

Efficient Uncertainty Quantification in Stochastic Economic Dispatch

IEEE Transactions on Power Systems

Safta, Cosmin; Chen, Richard L.Y.; Najm, Habib N.; Pinar, Ali P.; Watson, Jean-Paul

Stochastic economic dispatch models address uncertainties in forecasts of renewable generation output by considering a finite number of realizations drawn from a stochastic process model, typically via Monte Carlo sampling. Accurate evaluations of expectations or higher order moments for quantities of interest, e.g., generating cost, can require a prohibitively large number of samples. We propose an alternative to Monte Carlo sampling based on polynomial chaos expansions. These representations enable efficient and accurate propagation of uncertainties in model parameters, using sparse quadrature methods. We also use Karhunen-Loève expansions for efficient representation of uncertain renewable energy generation that follows geographical and temporal correlations derived from historical data at each wind farm. Considering expected production cost, we demonstrate that the proposed approach can yield several orders of magnitude reduction in computational cost for solving stochastic economic dispatch relative to Monte Carlo sampling, for a given target error threshold.
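The savings come from replacing random sampling with deterministic quadrature when propagating a smooth response through a Gaussian uncertainty. A minimal one-dimensional sketch of that idea (the quadratic cost in a standardized forecast-error variable is a hypothetical stand-in, not the paper's dispatch model): a 3-point Gauss-Hermite rule recovers the expected cost exactly, where Monte Carlo needs many evaluations.

```python
import math
import random

# Probabilists' 3-point Gauss-Hermite rule: integrates polynomials of
# degree <= 5 exactly against the standard normal density.
nodes = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
weights = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]

def cost(x):
    """Hypothetical quadratic dispatch cost in the standardized
    wind-forecast error x (an illustrative stand-in)."""
    return 100.0 + 5.0 * x + 2.0 * x * x

# Quadrature: three model evaluations give the exact expectation (102).
quad_mean = sum(w * cost(x) for w, x in zip(weights, nodes))

# Monte Carlo needs many model evaluations for comparable accuracy.
rng = random.Random(0)
mc_mean = sum(cost(rng.gauss(0.0, 1.0)) for _ in range(100000)) / 100000
```

In higher dimensions, sparse quadrature grids keep the number of model evaluations manageable, which is where the orders-of-magnitude savings reported above come from.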

Bounded-Degree Connected Approximations of Stochastic Networks

IEEE Transactions on Molecular, Biological, and Multi-Scale Communications

Pinar, Ali P.; Quinn, Christopher J.; Kiyavash, Negar

We propose a method to find optimal sparse connected approximations for large complex networks of nodes interacting over time. Optimality is measured by Kullback-Leibler divergence. The sparsity is controlled by the user through specifying the in-degrees. The approximations have spanning tree subgraphs, enabling them to depict flow through a network. They can also depict feedback. The approximations can capture both significant statistical dependencies in, and preserve salient topological features of, the full (unknown) network. The proposed method can be applied to networks without restrictions on the dynamics beyond stationarity. We analyze computational and sample complexity. We demonstrate their efficacy using a large-scale simulation of the turtle visual cortex and experimental data from the primate motor cortex.

Escape: Efficiently counting all 5-vertex subgraphs

26th International World Wide Web Conference, WWW 2017

Pinar, Ali P.; Seshadhri, C.; Vishal, Vaidyanathan

Counting the frequency of small subgraphs is a fundamental technique in network analysis across various domains, most notably in bioinformatics and social networks. The special case of triangle counting has received much attention. Getting results for 4-vertex or 5-vertex patterns is highly challenging, and there are few practical results known that can scale to massive sizes. We introduce an algorithmic framework that can be adapted to count any small pattern in a graph, and apply this framework to compute exact counts for all 5-vertex subgraphs. Our framework is built on cutting a pattern into smaller ones and using counts of the smaller patterns to obtain the larger counts. Furthermore, we exploit degree orientations of the graph to reduce runtimes even further. These methods avoid the combinatorial explosion that typical subgraph counting algorithms face. We prove that it suffices to enumerate only four specific subgraphs (three of them with fewer than 5 vertices) to exactly count all 5-vertex patterns. We perform extensive empirical experiments on a variety of real-world graphs. We are able to compute counts of graphs with tens of millions of edges in minutes on a commodity machine. To the best of our knowledge, this is the first practical algorithm for 5-vertex pattern counting that runs at this scale. A stepping stone to our main algorithm is a fast method for counting all 4-vertex patterns; this algorithm is typically ten times faster than state-of-the-art 4-vertex counters.
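The degree-orientation idea is easiest to see in the triangle case: orient every edge from its lower-degree endpoint to its higher-degree endpoint, and each triangle then has exactly one vertex pointing to the other two. A small sketch of that building block (illustrative only, not the full ESCAPE 5-vertex counter):

```python
from collections import defaultdict
from itertools import combinations

def count_triangles(edges):
    """Count triangles by orienting each edge from its lower-degree
    endpoint to its higher-degree endpoint (ties broken by vertex id),
    then checking pairs of out-neighbors.  The orientation is acyclic,
    so every triangle is counted at exactly one vertex."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    rank = lambda x: (deg[x], x)          # total order on vertices
    out = defaultdict(set)
    for u, v in edges:
        a, b = (u, v) if rank(u) < rank(v) else (v, u)
        out[a].add(b)
    count = 0
    for u in list(out):
        for v, w in combinations(out[u], 2):
            # combinations over a set yields arbitrary order, so check both.
            if w in out[v] or v in out[w]:
                count += 1
    return count

# K4 has exactly four triangles.
tri = count_triangles([(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)])
```

Orienting by degree keeps out-neighborhoods small on skewed real-world graphs, which is what tames the combinatorial explosion as the patterns grow.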

MaxReach: Reducing network incompleteness through node probes

Proceedings of the 2016 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2016

Soundarajan, Sucheta; Eliassi-Rad, Tina; Gallagher, Brian; Pinar, Ali P.

Real-world network datasets are often incomplete. Subsequently, any analysis on such networks is likely to produce skewed results. We examine the following problem: given an incomplete network, which b nodes should be probed to bring as many new nodes as possible into the observed network? For instance, consider someone who has observed a portion (say 1%) of the Twitter network. How should she use a limited budget to reduce the incompleteness of the network? In this work, we propose a novel algorithm, called MAXREACH, which uses a budget b to increase the number of nodes in the observed network. Our experiments, across a range of datasets and conditions, demonstrate the efficacy of MAXREACH.

Trigger Detection for Adaptive Scientific Workflows Using Percentile Sampling

SIAM Journal on Scientific Computing (Online)

Pinar, Ali P.; Bennett, Janine C.; Salloum, Maher; Bhagatwala, Ankit; Chen, Jacqueline H.; Seshadhri, C.

The increasing complexity of both scientific simulations and high-performance computing system architectures is driving the need for adaptive workflows, in which the composition and execution of computational and data manipulation steps dynamically depend on the evolutionary state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and high fidelity for postanalysis; however, we cannot afford to retain the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation will be entering a critical phase, and adapt the workflow accordingly. In this paper, we present a methodology for detecting triggers and demonstrate its use in the context of direct numerical simulations of turbulent combustion using S3D. We show that chemical explosive mode analysis (CEMA) can be used to devise a noise-tolerant indicator for rapid increase in heat release. However, exhaustive computation of CEMA values dominates the total simulation cost, and thus is prohibitively expensive. To overcome this computational bottleneck, we propose a quantile sampling approach. Our sampling-based algorithm comes with provable error/confidence bounds as a function of the number of samples. Most importantly, the number of samples is independent of the problem size, and thus our proposed sampling algorithm offers perfect scalability. Our experiments on homogeneous charge compression ignition and reactivity controlled compression ignition simulations show that the proposed method can detect rapid increases in heat release, and its computational overhead is negligible. Our results will be used to make dynamic workflow decisions regarding data storage and mesh resolution in future combustion simulations.
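The sampling idea can be sketched in a few lines: estimate a high quantile of the indicator field from a uniform random sample whose size depends only on the desired error and confidence, not on the mesh size. The function name and stand-in data below are hypothetical, not the S3D implementation:

```python
import random

def sampled_quantile(values, q, num_samples, seed=0):
    """Estimate the q-quantile of `values` from a uniform random sample.
    The sample size needed for a target error/confidence is independent
    of len(values), which is what makes the approach scale."""
    rng = random.Random(seed)
    sample = sorted(rng.choice(values) for _ in range(num_samples))
    idx = min(int(q * num_samples), num_samples - 1)
    return sample[idx]

# Stand-in for per-grid-point indicator values (e.g., CEMA-style scores).
values = list(range(100000))
est = sampled_quantile(values, 0.99, 2000)   # close to the true 99th percentile
```

A trigger then fires when the sampled quantile crosses a threshold, so only a fixed-size sample of grid points needs the expensive indicator computation each step.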

Sparse approximations of directed information graphs

IEEE International Symposium on Information Theory - Proceedings

Quinn, Christopher J.; Pinar, Ali P.; Gao, Jing; Su, Lu

Given a network of agents interacting over time, which few interactions best characterize the dynamics of the whole network? We propose an algorithm that finds the optimal sparse approximation of a network. The user controls the level of sparsity by specifying the total number of edges. The networks are represented using directed information graphs, a graphical model that depicts causal influences between agents in a network. Goodness of approximation is measured with Kullback-Leibler divergence. The algorithm finds the best approximation with no assumptions on the topology or the class of the joint distribution.

Measuring and Modeling Bipartite Graphs with Community Structure

Kolda, Tamara G.; Pinar, Ali P.; Aksoy, Sinan

Network science is a powerful tool for analyzing complex systems in fields ranging from sociology to engineering to biology. This paper is focused on generative models of bipartite graphs, also known as two-way graphs. We propose two generative models that can be easily tuned to reproduce the characteristics of real-world networks, not just qualitatively but quantitatively. The measurements we consider are the degree distributions and the bipartite clustering coefficient, which we refer to as the metamorphosis coefficient. We define edge, node, and degreewise metamorphosis coefficients, enabling a more detailed understanding of the bipartite community structure. Our proposed bipartite Chung-Lu model is able to reproduce real-world degree distributions, and our proposed bipartite “BTER” model reproduces both the degree distributions and the degreewise metamorphosis coefficients. We demonstrate the effectiveness of these models on several real-world data sets.
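The bipartite Chung-Lu construction can be sketched directly from its definition: each left-right pair is connected independently with probability proportional to the product of the desired degrees, so expected degrees match the inputs. A naive O(|L|·|R|) illustration (real generators use faster sampling; the function name is hypothetical):

```python
import random

def bipartite_chung_lu(left_deg, right_deg, seed=0):
    """Bipartite Chung-Lu sketch: connect left vertex i and right vertex j
    independently with probability d_i * e_j / m (capped at 1), where m is
    the total desired edge count, so expected degrees match the inputs."""
    assert sum(left_deg) == sum(right_deg)
    m = sum(left_deg)
    rng = random.Random(seed)
    edges = set()
    for i, di in enumerate(left_deg):
        for j, ej in enumerate(right_deg):
            if rng.random() < min(1.0, di * ej / m):
                edges.add((i, j))
    return edges

# Two left vertices of desired degree 2; four right vertices of degree 1.
edges = bipartite_chung_lu([2, 2], [1, 1, 1, 1])
```

This matches degree distributions in expectation only; reproducing the metamorphosis coefficients as well requires the community-aware BTER-style construction described above.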

Enabling adaptive scientific workflows via trigger detection

Proceedings of ISAV 2015: 1st International Workshop on In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization, Held in conjunction with SC 2015: The International Conference for High Performance Computing, Networking, Storage and Analysis

Salloum, Maher; Bennett, Janine C.; Pinar, Ali P.; Bhagatwala, Ankit; Chen, Jacqueline H.

Next generation architectures necessitate a shift away from traditional workflows in which the simulation state is saved at prescribed frequencies for post-processing analysis. While the need to shift to in situ workflows has been acknowledged for some time, much of the current research is focused on static workflows, where the analysis that would have been done as a post-process is performed concurrently with the simulation at user-prescribed frequencies. More recently, research efforts have been striving to enable adaptive workflows, in which the frequency, composition, and execution of computational and data manipulation steps dynamically depend on the state of the simulation. Adapting the workflow to the state of the simulation in such a data-driven fashion puts extremely strict efficiency requirements on the analysis capabilities that are used to identify the transitions in the workflow. In this paper we build upon earlier work on trigger detection using sublinear techniques to drive adaptive workflows. Here we propose a methodology to detect the time when sudden heat release occurs in simulations of turbulent combustion. Our proposed method provides an alternative metric that can be used along with our earlier metric to increase the robustness of trigger detection. We show the effectiveness of our metric empirically for predicting heat release for two use cases.

Final Report: Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale

Bennett, Janine C.; Pinar, Ali P.; Seshadhri, C.; Thompson, David; Salloum, Maher; Bhagatwala, Ankit; Chen, Jacqueline H.

Post-Moore's law scaling is creating a disruptive shift in simulation workflows, as saving the entirety of raw data to persistent storage becomes expensive. We are moving away from a post-process centric data analysis paradigm towards a concurrent analysis framework, in which raw simulation data is processed as it is computed. Algorithms must adapt to machines with extreme concurrency, low communication bandwidth, and high memory latency, while operating within the time constraints prescribed by the simulation. Furthermore, input parameters are often data dependent and cannot always be prescribed. The study of sublinear algorithms is a recent development in theoretical computer science and discrete mathematics that has significant potential to provide solutions for these challenges. The approaches of sublinear algorithms address the fundamental mathematical problem of understanding global features of a data set using limited resources. These theoretical ideas align with practical challenges of in-situ and in-transit computation where vast amounts of data must be processed under severe communication and memory constraints. This report details key advancements made in applying sublinear algorithms in-situ to identify features of interest and to enable adaptive workflows over the course of a three year LDRD. Prior to this LDRD, there was no precedent in applying sublinear techniques to large-scale, physics based simulations. This project has definitively demonstrated their efficacy at mitigating high performance computing challenges and highlighted the rich potential for follow-on research opportunities in this space.

Toward using surrogates to accelerate solution of stochastic electricity grid operations problems

2014 North American Power Symposium, NAPS 2014

Safta, Cosmin; Chen, Richard L.Y.; Najm, Habib N.; Pinar, Ali P.; Watson, Jean-Paul

Stochastic unit commitment models typically handle uncertainties in forecast demand by considering a finite number of realizations from a stochastic process model for loads. Accurate evaluations of expectations or higher moments for the quantities of interest require a prohibitively large number of model evaluations. In this paper we propose an alternative approach based on using surrogate models valid over the range of the forecast uncertainty. We consider surrogate models based on Polynomial Chaos expansions, constructed using sparse quadrature methods. Considering expected generation cost, we demonstrate that the approach can lead to several orders of magnitude reduction in computational cost relative to using Monte Carlo sampling on the original model, for a given target error threshold.

Finding Hierarchical and Overlapping Dense Subgraphs using Nucleus Decompositions

Comandur, Seshadhri; Pinar, Ali P.; Sariyuce, Ahmet E.; Catalyurek, Umit

Finding dense substructures in a graph is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, k-densest subgraph) are NP-hard. Furthermore, the goal is rarely to find the "true optimum," but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine a hierarchical structure among them. Current dense subgraph finding algorithms usually optimize some objective, and only find a few such subgraphs without providing any hierarchy. It is also not clear how to account for overlaps in dense substructures. We define the nucleus decomposition of a graph, which represents the graph as a forest of nuclei. Each nucleus is a subgraph where smaller cliques are present in many larger cliques. The forest of nuclei is a hierarchy by containment, where the edge density increases as we proceed towards leaf nuclei. Sibling nuclei can have limited intersections, which allows for discovery of overlapping dense subgraphs. With the right parameters, the nucleus decomposition generalizes the classic notions of k-cores and k-trusses. We give provably efficient algorithms for nucleus decompositions, and empirically evaluate their behavior in a variety of real graphs. The tree of nuclei consistently gives a global, hierarchical snapshot of dense substructures, and outputs dense subgraphs of higher quality than other state-of-the-art solutions. Our algorithm can process graphs with tens of millions of edges in less than an hour.
