Challenges in Data Mining and Optimization
Abstract not provided.
Abstract not provided.
Abstract not provided.
Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent cache-oblivious B-tree and a history-independent external-memory skip list. One of the main contributions is a data structure we build along the way: a history-independent packed-memory array (PMA). The PMA supports efficient range queries, one of the most important operations for answering database queries. Our HI PMA matches the asymptotic bounds of prior non-HI packed-memory arrays and sparse tables. Specifically, a PMA maintains a dynamic set of elements in sorted order in a linear-sized array. Inserts and deletes take an amortized O(log^2 N) element moves with high probability. Simple experiments with our implementation of HI PMAs corroborate our theoretical analysis. Comparisons to regular PMAs give preliminary indications that the practical cost of adding history independence is not too large. Our HI cache-oblivious B-tree bounds match those of prior non-HI cache-oblivious B-trees. Searches take O(log_B N) I/Os; inserts and deletes take O((log^2 N)/B + log_B N) amortized I/Os with high probability; and range queries returning k elements take O(log_B N + k/B) I/Os. Our HI external-memory skip list achieves optimal bounds with high probability, analogous to in-memory skip lists: O(log_B N) I/Os for point queries and amortized O(log_B N) I/Os for inserts/deletes. Range queries returning k elements run in O(log_B N + k/B) I/Os. In contrast, the best possible high-probability bound for inserting into the folklore B-skip list, which promotes elements with probability 1/B, is just Θ(log N) I/Os. This is no better than the bound one gets from running an in-memory skip list in external memory.
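To make the packed-memory-array idea concrete, here is a minimal, illustrative sketch (not the paper's construction): keep the keys in sorted order in an array with gaps, and re-spread or grow the array when it gets too dense. The class name, density threshold, and rebalance strategy are invented for this example, and the sketch is not history independent; the paper's HI PMA additionally randomizes the layout so that the bit representation depends only on the current set of keys.

```python
# A deliberately simplified packed-memory-array (PMA) sketch.
# NOT history independent and NOT the paper's construction; it only
# illustrates the basic idea: sorted keys in an array with gaps,
# re-spread (or grown) when the array gets too dense.
import bisect

class SimplePMA:                         # name invented for this example
    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # array with gaps
        self.n = 0                       # number of stored keys

    def _items(self):
        return [x for x in self.slots if x is not None]

    @staticmethod
    def _spread(items, capacity):
        """Lay out `items` (sorted) evenly across `capacity` slots."""
        slots = [None] * capacity
        for i, x in enumerate(items):
            slots[(i * capacity) // len(items)] = x
        return slots

    def insert(self, key):
        items = self._items()
        bisect.insort(items, key)        # logical sorted insert
        self.n += 1
        capacity = len(self.slots)
        if self.n > (3 * capacity) // 4: # illustrative density threshold
            capacity *= 2                # grow, then re-spread everything
        # A real PMA re-spreads only the smallest enclosing window whose
        # density is out of range; that is where the amortized
        # O(log^2 N) element-move bound comes from.
        self.slots = self._spread(items, capacity)

    def range_query(self, lo, hi):
        """One left-to-right scan; gaps are simply skipped."""
        return [x for x in self.slots if x is not None and lo <= x <= hi]

pma = SimplePMA()
for k in [42, 7, 19, 3, 88, 55]:
    pma.insert(k)
print(pma.range_query(10, 60))           # -> [19, 42, 55]
```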
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
ACM International Conference Proceeding Series
In recent work we quantified the anticipated performance boost when a sorting algorithm is modified to leverage user-addressable "near-memory," which we call scratchpad. This architectural feature is expected in the Intel Knights Landing processors that will be used in DOE's next large-scale supercomputer. This paper expands our analytical study of the scratchpad to consider k-means clustering, a classical data-analysis technique that is ubiquitous in the literature and in practice. We present new theoretical results using the model introduced in [13], which measures memory transfers and assumes that computations are memory-bound. Our theoretical results indicate that scratchpad-aware versions of k-means clustering can expect performance boosts for high-dimensional instances with relatively few cluster centers. These constraints may limit the practical impact of scratchpad for k-means acceleration, so we discuss their origins and practical implications. We corroborate our theory with experimental runs on a system instrumented to mimic one with scratchpad memory. We also contribute a semi-formalization of the computational properties that are necessary and sufficient to predict a performance boost from scratchpad-aware variants of algorithms. We have observed and studied these properties in the context of sorting, and now clustering. We conclude with some thoughts on the application of these properties to new areas. Specifically, we believe that dense linear algebra has similar properties to k-means, while sparse linear algebra and FFT computations are more similar to sorting. The sparse operations are more common in scientific computing, so we expect scratchpad to have significant impact in that area.
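As a point of reference for the access pattern the analysis reasons about, here is a plain Lloyd's-iteration sketch of k-means (illustrative only, not the paper's scratchpad-aware implementation; all names are assumptions for the example). Each pass streams every point once, while the small k x d table of centers is reused for every point, which is why few centers and high dimension is the favorable regime for a scratchpad.

```python
# A plain Lloyd's-iteration sketch of k-means (illustrative only; not
# the paper's scratchpad-aware code).  Note the access pattern: each
# iteration streams all N points once, while the small k x d array of
# centers is reused for every point -- that is the working set a
# scratchpad could hold.
import numpy as np

def lloyd_iteration(points, centers):
    """One k-means step: assign points to nearest centers, then move
    each center to the mean of its assigned points."""
    # Squared distance from every point to every center: shape (N, k).
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    new_centers = centers.copy()
    for j in range(centers.shape[0]):
        members = points[assign == j]
        if len(members):
            new_centers[j] = members.mean(axis=0)
    return new_centers, assign

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))                    # high-dimensional points
C = X[rng.choice(len(X), size=4, replace=False)]   # few centers (k = 4)
for _ in range(10):
    C, labels = lloyd_iteration(X, C)
```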
Statistical Analysis and Data Mining
Geospatial semantic graphs provide a robust foundation for representing and analyzing remote sensor data. In particular, they support a variety of pattern search operations that capture the spatial and temporal relationships among the objects and events in the data. However, in the presence of large data corpora, even a carefully constructed search query may return a large number of unintended matches. This work considers the problem of calculating a quality score for each match to the query, given that the underlying data are uncertain. We present a preliminary evaluation of three methods for determining both match quality scores and associated uncertainty bounds, illustrated in the context of an example based on overhead imagery data.
Abstract not provided.
This project evaluates the effectiveness of moving target defense (MTD) techniques using a new game we have designed, called PLADD, inspired by the game FlipIt [28]. PLADD extends FlipIt by incorporating what we believe are key MTD concepts. We have analyzed PLADD and proven the existence of a defender strategy that pushes a rational attacker out of the game, demonstrated how limited the strategies available to an attacker are in PLADD, and derived analytic expressions for the expected utility of the game’s players in multiple game variants. We have created an algorithm for finding a defender’s optimal PLADD strategy. We show that in the special case of achieving deterrence in PLADD, MTD is not always cost effective and that its optimal deployment may shift abruptly from not using MTD at all to using it as aggressively as possible. We believe our effort provides basic, fundamental insights into the use of MTD, but conclude that a truly practical analysis requires model selection and calibration based on real scenarios and empirical data. We propose several avenues for further inquiry, including (1) agents with adaptive capabilities more reflective of real world adversaries, (2) the presence of multiple, heterogeneous adversaries, (3) computational game theory-based approaches such as coevolution to allow scaling to the real world beyond the limitations of analytical analysis and classical game theory, (4) mapping the game to real-world scenarios, (5) taking player risk into account when designing a strategy (in addition to expected payoff), (6) improving our understanding of the dynamic nature of MTD-inspired games by using a martingale representation, defensive forecasting, and techniques from signal processing, and (7) using adversarial games to develop inherently resilient cyber systems.
This report summarizes preliminary research into uncertainty quantification for pattern analytics within the context of the Pattern Analytics to Support High-Performance Exploitation and Reasoning (PANTHER) project. The primary focus of PANTHER was to make large quantities of remote sensing data searchable by analysts. The work described in this report adds nuance to both the initial data preparation steps and the search process. Search queries are transformed from "does the specified pattern exist in the data?" to "how certain is the system that the returned results match the query?" We show example results for both data processing and search, and discuss a number of possible improvements for each.
Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
We present a new distributed model for graph computations motivated by limited information sharing. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the two entities can share only a polylogarithmic-size subgraph; 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. We believe this model captures realistic constraints on cooperating autonomous data centers. We have results in both models for s-t connectivity, one of the simplest graph problems that requires global information in the worst case. In the limited-sharing model, our results exploit social network structure. Standard communication complexity gives polynomial lower bounds on s-t connectivity for general graphs. However, if the graph for each data center has a giant component and these giant components intersect, then we can overcome this lower bound, computing s-t connectivity while exchanging O(log^2 n) bits for a constant number of data centers. We can also test the assumption that the giant components overlap using O(log^2 n) bits provided the (unknown) overlap is sufficiently large. The second result is in the low-trust model. We give a secure multi-party computation (MPC) algorithm that 1) does not make cryptographic assumptions when there are 3 or more entities, and 2) is efficient, especially when compared to the usual garbled-circuit approach. The entities learn only the yes/no answer. No party learns anything about the others' graphs, not even node names. This algorithm does not require any special graph structure. This secure MPC result for s-t connectivity is one of the first that involves a few parties computing on large inputs, instead of many parties computing on a few local values.
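The following toy two-party sketch illustrates only the limited-sharing intuition under the abstract's assumptions (each graph has a giant component and the giant components overlap). It is not the paper's protocol: the function names, sample size, and use of networkx are assumptions for the example; it exchanges raw node identifiers rather than O(log^2 n) carefully chosen bits; and it says nothing about the low-trust MPC variant.

```python
# Toy illustration of the limited-sharing idea (see the caveats above).
import random
import networkx as nx

def giant_component(G):
    """Largest connected component, as a set of nodes."""
    return max(nx.connected_components(G), key=len)

def st_connected_union(G1, G2, s, t, sample_size=32):
    """Returns True in the 'easy' case the abstract describes: the giant
    components overlap and s and t each reach a giant component locally.
    Returns None when this naive test is inconclusive (it never proves
    disconnection)."""
    gc1, gc2 = giant_component(G1), giant_component(G2)

    # Each side reveals only a small random sample of its giant
    # component; a nonempty intersection of the samples is evidence
    # that the giant components overlap.
    s1 = set(random.sample(sorted(gc1), min(sample_size, len(gc1))))
    s2 = set(random.sample(sorted(gc2), min(sample_size, len(gc2))))
    if not (s1 & s2):
        return None

    def reaches_a_giant(v):
        in1 = v in G1 and nx.has_path(G1, v, next(iter(gc1)))
        in2 = v in G2 and nx.has_path(G2, v, next(iter(gc2)))
        return in1 or in2

    # Overlapping giant components merge in the union graph, so if s and
    # t each reach one of them, s and t are connected overall.
    return True if (reaches_a_giant(s) and reaches_a_giant(t)) else None
```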
This report summarizes the work performed under the project Next-Generation Algorithms for Assessing Infrastructure Vulnerability and Optimizing System Resilience. The goal of the project was to improve mathematical programming-based optimization technology for infrastructure protection. In general, the owner of a network wishes to design a network that can perform well when certain transportation channels are inhibited (e.g., destroyed) by an adversary. These are typically bi-level problems where the owner designs a system, an adversary optimally attacks it, and then the owner can recover by optimally using the remaining network. This project funded three years of Deon Burchett's graduate research. Deon's graduate advisor, Professor Jean-Philippe Richard, and his Sandia advisors, Richard Chen and Cynthia Phillips, supported Deon on other funds or volunteer time. This report is, therefore, essentially a replication of the Ph.D. dissertation it funded [12] in a format required for project documentation. The thesis included some general polyhedral research. This is the study of the structure of the feasible region of mathematical programs, such as integer programs. For example, an integer program optimizes a linear objective function subject to linear constraints and (nonlinear) integrality constraints on the variables. The feasible region without the integrality constraints is a convex polyhedron. Careful study of additional valid constraints can significantly improve computational performance. Here is the abstract from the dissertation: We perform a polyhedral study of a multi-commodity generalization of variable upper bound flow models. In particular, we establish some relations between facets of single- and multi-commodity models. We then introduce a new family of inequalities, which generalizes traditional flow cover inequalities to the multi-commodity context. We present encouraging numerical results. We also consider the directed edge-failure resilient network design problem (DRNDP). This problem entails the design of a directed multi-commodity flow network that is capable of fulfilling a specified percentage of demands in the event that any G arcs are destroyed, where G is a constant parameter. We present a formulation of DRNDP and solve it in a branch-column-cut framework. We present computational results.
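For readers unfamiliar with the terminology, a generic integer program (purely illustrative, not a model from the dissertation) has the form below; dropping the integrality requirement leaves a linear program whose feasible region is a convex polyhedron, and valid inequalities such as flow covers tighten that relaxation.

```latex
\begin{align*}
\min_{x}\ \ & c^{\top} x \\
\text{s.t.}\ \ & A x \le b, \\
               & x \ge 0,\ \ x \in \mathbb{Z}^{n}
               \quad\text{(omit $x \in \mathbb{Z}^{n}$ for the LP relaxation)}
\end{align*}
```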
We present new algorithms for a distributed model for graph computations motivated by limited information sharing, which we first discussed in [20]. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the two entities can share only a polylogarithmic-size subgraph; 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. We believe this model captures realistic constraints on cooperating autonomous data centers. We have algorithms for s-t connectivity in both models. We also give an algorithm in the low-communication model for finding a planted clique. This is an anomaly-detection problem: finding a subgraph that is larger and denser than expected. For both of the low-communication algorithms, we exploit structural properties of social networks to prove performance bounds better than what is possible for general graphs. For s-t connectivity, we use known properties. For planted clique, we propose a new property: a bounded number of triangles per node. This property is based upon evidence from the social science literature. We found that classic examples of social networks do not have the bounded-triangles property. This is because many social networks contain elements that are non-human, such as accounts for a business, or other automated accounts. We describe some initial attempts to distinguish human nodes from automated nodes in social networks based only on topological properties.
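As an illustration of how the proposed bounded-triangles property could be checked on a concrete graph, here is a small sketch; the function names and the threshold value are invented for the example and are not taken from the paper.

```python
# Small sketch for checking the proposed bounded-triangles property.
# The threshold and names are illustrative, not values from the paper.
from itertools import combinations

def triangles_per_node(adj):
    """adj: dict node -> set of neighbors (simple undirected graph).
    Returns node -> number of triangles containing that node."""
    count = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        # Every triangle through v appears exactly once as a pair of
        # v's neighbors that are themselves adjacent.
        for u, w in combinations(nbrs, 2):
            if w in adj[u]:
                count[v] += 1
    return count

def nodes_violating_bound(adj, threshold=100):
    """Nodes with more triangles than the (illustrative) bound --
    candidates for non-human/automated accounts per the discussion."""
    return [v for v, c in triangles_per_node(adj).items() if c > threshold]

# Toy example: in a 4-clique every node lies in 3 triangles.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(triangles_per_node(adj))   # {0: 3, 1: 3, 2: 3, 3: 3}
```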
Abstract not provided.
Abstract not provided.
Abstract not provided.
Abstract not provided.
Internet Mathematics
Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in real-world graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly superlinear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length 2 and checks whether each such path closes into a triangle. A good heuristic is to enumerate only those paths of length 2 in which the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavy-tailed degree distributions, a defining feature of real-world graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the ℓ_{4/3}-norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows nontrivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected linear-time performance for degree sequences following a power law with exponent α ≥ 7/3, and nontrivial speedup whenever α ∈ (2, 3).
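A short sketch of the heuristic as it is stated in the abstract: use a vertex as the middle of a length-2 path only when it has the (weakly) lowest degree among the three vertices, then check whether the path closes into a triangle. The tie-breaking rule and function names here are illustrative choices, not taken from the paper.

```python
# Sketch of the low-degree-middle-vertex heuristic described above.
# Tie-breaking by node id is an illustrative choice.
from itertools import combinations

def list_triangles(adj):
    """adj: dict node -> set of neighbors (simple undirected graph)."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    rank = lambda v: (deg[v], v)      # order by degree, then by id
    triangles = set()
    for v, nbrs in adj.items():
        # Use v as the middle of a length-2 path only when it is the
        # lowest-ranked of the three vertices; this is what yields the
        # speedup on heavy-tailed degree distributions.
        higher = [u for u in nbrs if rank(u) > rank(v)]
        for u, w in combinations(higher, 2):
            if w in adj[u]:           # u-w edge closes the triangle
                triangles.add(tuple(sorted((v, u, w))))
    return triangles

# Toy example: edges 0-1, 0-2, 0-3, 1-2, 2-3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(sorted(list_triangles(adj)))    # [(0, 1, 2), (0, 2, 3)]
```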
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
We study several natural instances of the geometric hitting set problem for input consisting of sets of line segments (and rays, lines) having a small number of distinct slopes. These problems model path monitoring (e.g., on road networks) using the fewest sensors (the "hitting points"). We give approximation algorithms for cases including (i) lines of 3 slopes in the plane, (ii) vertical lines and horizontal segments, (iii) pairs of horizontal/vertical segments. We give hardness and hardness-of-approximation results for these problems. We prove that the hitting set problem for vertical lines and horizontal rays is polynomially solvable.
Abstract not provided.