Some past and future supercomputer nodes incorporate High- Bandwidth Memory (HBM). Compared to standard DRAM, HBM has similar latency, higher bandwidth and lower capacity. In this paper, we evaluate algorithms for managing High- Bandwidth Memory automatically. Previous work suggests that, in the worst case, performance is extremely sensitive to the policy for managing the channel to DRAM. Prior theory shows that a priority-based scheme (where there is a static strict priority-order among p threads for channel access) is O(1)-competitive, but FIFO is not, and in the worst case is ?(p) competitive. Following this theoretical guidance would be a disruptive change for vendors, who currently use FIFO variants in their DRAMcontroller hardware. Our goal is to determine theoretically and empirically whether we can justify recommending investment in priority-based DRAM controller hardware. In order to experiment with DRAM channel protocols, we chose a theoretical model, validated it against real hardware, and implemented a basic simulator. We corroborated the previous theoretical results for the model, conducted a parameter sweep while running our simulator on address traces from memory bandwidth-bound codes (GNU sort and TACO sparse matrix-vector product), and designed better channel-access algorithms.
Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω (N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
Graph algorithms enable myriad large-scale applications including cybersecurity, social network analysis, resource allocation, and routing. The scalability of current graph algorithm implementations on conventional computing architectures are hampered by the demise of Moore’s law. We present a theoretical framework for designing and assessing the performance of graph algorithms executing in networks of spiking artificial neurons. Although spiking neural networks (SNNs) are capable of general-purpose computation, few algorithmic results with rigorous asymptotic performance analysis are known. SNNs are exceptionally well-motivated practically, as neuromorphic computing systems with 100 million spiking neurons are available, and systems with a billion neurons are anticipated in the next few years. Beyond massive parallelism and scalability, neuromorphic computing systems offer energy consumption orders of magnitude lower than conventional high-performance computing systems. We employ our framework to design and analyze new spiking algorithms for shortest path and dynamic programming problems. Our neuromorphic algorithms are message-passing algorithms relying critically on data movement for computation. For fair and rigorous comparison with conventional algorithms and architectures, which is challenging but paramount, we develop new models of data-movement in conventional computing architectures. This allows us to prove polynomial-factor advantages, even when we assume a SNN consisting of a simple grid-like network of neurons. To the best of our knowledge, this is one of the first examples of a rigorous asymptotic computational advantage for neuromorphic computing.
In this project we developed and validated algorithms for privacy-preserving linear regression using a new variant of Secure Multiparty Computation (MPC) we call "Hybrid MPC" (hMPC). Our variant is intended to support low-power, unreliable networks of sensors with low-communication, fault-tolerant algorithms. In hMPC we do not share training data, even via secret sharing. Thus, agents are responsible for protecting their own local data. Only the machine learning (ML) model is protected with information-theoretic security guarantees against honest-but-curious agents. There are three primary advantages to this approach: (1) after setup, hMPC supports a communication-efficient matrix multiplication primitive, (2) organizations prevented by policy or technology from sharing any of their data can participate as agents in hMPC, and (3) large numbers of low-power agents can participate in hMPC. We have also created an open-source software library named "Cicada" to support hMPC applications with fault-tolerance. The fault-tolerance is important in our applications because the agents are vulnerable to failure or capture. We have demonstrated this capability at Sandia's Autonomy New Mexico laboratory through a simple machine-learning exercise with Raspberry Pi devices capturing and classifying images while flying on four drones.
We present a theoretical framework for designing and assessing the performance of algorithms executing in networks consisting of spiking artificial neurons. Although spiking neural networks (SNNs) are capable of general-purpose computation, few algorithmic results with rigorous asymptotic performance analysis are known. SNNs are exceptionally well-motivated practically, as neuromorphic computing systems with 100 million spiking neurons are available, and systems with a billion neurons are anticipated in the next few years. Beyond massive parallelism and scalability, neuromorphic computing systems offer energy consumption orders of magnitude lower than conventional high-performance computing systems. We employ our framework to design and analyze neuromorphic graph algorithms, focusing on shortest path problems. Our neuromorphic algorithms are message-passing algorithms relying critically on data movement for computation, and we develop data-movement lower bounds for conventional algorithms. A fair and rigorous comparison with conventional algorithms and architectures is challenging but paramount. We prove a polynomial-factor advantage even when we assume an SNN consisting of a simple grid-like network of neurons. To the best of our knowledge, this is one of the first examples of a provable asymptotic computational advantage for neuromorphic computing.
ROL-PEBBL is a C++, MPI-based parallel code for mixed-integer PDE-constrained optimization (MIPDECO). In these problems we wish to optimize (control, design, etc.) physical systems, which must obey the laws of physics, when some of the decision variables must take integer values. ROL-PEBBL combines a code to efficiently search over integer choices (PEBBL = Parallel Enumeration Branch-and-Bound Library) and a code for efficient nonlinear optimization, including PDE-constrained optimization (ROL = Rapid Optimization Library). In this report, we summarize the design of ROL-PEBBL and initial applications/results. For an artificial source-inversion problem, finding sources of pollution on a grid from sparse samples, ROL-PEBBLs solution for the nest grid gave the best optimization guarantee for any general solver that gives both a solution and a quality guarantee.
This report summarizes the work performed under the project "Linear Programming in Strongly Polynomial Time." Linear programming (LP) is a classic combinatorial optimization problem heavily used directly and as an enabling subroutine in integer programming (IP). Specifically IP is the same as LP except that some solution variables must take integer values (e.g. to represent yes/no decisions). Together LP and IP have many applications in resource allocation including general logistics, and infrastructure design and vulnerability analysis. The project was motivated by the PI's recent success developing methods to efficiently sample Voronoi vertices (essentially finding nearest neighbors in high-dimensional point sets) in arbitrary dimension. His method seems applicable to exploring the high-dimensional convex feasible space of an LP problem. Although the project did not provably find a strongly-polynomial algorithm, it explored multiple algorithm classes. The new medial simplex algorithms may still lead to solvers with improved provable complexity. We describe medial simplex algorithms and some relevant structural/complexity results. We also designed a novel parallel LP algorithm based on our geometric insights and implemented it in the Spoke-LP code. A major part of the computational step is many independent vector dot products. Our parallel algorithm distributes the problem constraints across processors. Current commercial and high-quality free LP solvers require all problem details to fit onto a single processor or multicore. Our new algorithm might enable the solution of problems too large for any current LP solvers. We describe our new algorithm, give preliminary proof-of-concept experiments, and describe a new generator for arbitrarily large LP instances.
Graphs are a widely used abstraction for representing a variety of important real-world problems including emulating cyber networks for situational awareness, or studying social networks to understand human interactions or pandemic spread. Communication data is often converted into graphs to help understand social and technical patterns in the underlying communication data. However, prior to this project, little work had been performed analyzing how best to develop graphs from such data. Thus, many critical, national security problems were being performed against graph representations of questionable quality. Herein, we describe our analyses that were precursors to our final statistically grounded technique for creating static graph snapshots from a stream of communication events. The first analyzes the statistical distribution properties of a variety of real-world communication datasets generally fit best by Pareto, log normal, and extreme value distributions. The second derives graph properties that can be estimated given the expected statistical distribution for communication events and the communication interval to be viewed node observability, edge observability, and expected accuracy of node degree. Unfortunately, as that final process is under review for publication, we can't publish it here at this time.
Given an input stream of size N, a †-heavy hitter is an item that occurs at least † N times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = † N-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ∼100K observations per second.
We study a trajectory analysis problem we call the Trajectory Capture Problem (TCP), in which, for a given input set T of trajectories in the plane, and an integer k-2, we seek to compute a set of k points ("portals") to maximize the total weight of all subtrajectories of T between pairs of portals. This problem naturally arises in trajectory analysis and summarization. We show that the TCP is NP-hard (even in very special cases) and give some first approximation results. Our main focus is on attacking the TCP with practical algorithm-engineering approaches, including integer linear programming (to solve instances to provable optimality) and local search methods. We study the integrality gap arising from such approaches. We analyze our methods on different classes of data, including benchmark instances that we generate. Our goal is to understand the best performing heuristics, based on both solution time and solution quality. We demonstrate that we are able to compute provably optimal solutions for real-world instances. 2012 ACM Subject Classification Theory of computation ! Design and analysis of algorithms.
Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of teras-cale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdos-Renyi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence, and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.
In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form.
A previously obscure area of cryptography known as Secure Multiparty Computation (MPC) is enjoying increased attention in the field of privacy-preserving machine learning (ML), because ML models implemented using MPC can be uniquely resistant to capture or reverse engineering by an adversary. In particular, an adversary who captures a share of a distributed MPC model provably cannot recover the model itself, nor data evaluated by the model, even by observing the model in operation. We report on our small project to survey current MPC software and judge its practicality for fielding mission-relevant distributed machine learning models.
This report summarizes the work performed under the project “Quantifying Uncertainty in Emulations.” Emulation can be used to model real-world systems, typically using virtualization to run the real software on virtualized hardware. Emulations are increasingly used to answer mission-oriented questions, but how well they represent the real-world systems is still an open area of research. The goal of the project was to quantify where and how emulations differ from the real world. To do so, we ran a representative workload on both, and collected and compared metrics to identify differences. We aimed to capture behaviors, rather than performance, differences as the latter is more well-understood in the literature.
Network designers, planners, and security professionals increasingly rely on large-scale testbeds based on virtualization to emulate networks and make decisions about real-world deployments. However, there has been limited research on how well these virtual testbeds match their physical counterparts. Specifically, does the virtualization that these testbeds depend on actually capture real-world behaviors sufficiently well to support decisions?As a first step, we perform simple experiments on both physical and virtual testbeds to begin to understand where and how the testbeds differ. We set up a web service on one host and run ApacheBench against this service from a different host, instrumenting each system during these tests.We define an initial repeatable methodology (algorithm) to quantitatively compare physical and virtual testbeds. Specifically we compare the testbeds at three levels of abstraction: application, operating system (OS) and network. For the application level, we use the ApacheBench results. For OS behavior, we compare patterns of system call orderings using Markov chains. This provides a unique visual representation of the workload and OS behavior in our testbeds. We also drill down into read-system-call behaviors and show how at one level both systems are deterministic and identical, but as we move up in abstractions that consistency declines. Finally, we use packet captures to compare network behaviors and performance. We reconstruct flows and compare per-flow and per-experiment statistics.From these comparisons, we find that the behavior of the workload in the testbeds is similar but that the underlying processes to support it do vary. The low-level network behavior can vary quite widely in packetization depending on the virtual network driver. While these differences can be important, and knowing about them will help experiment designers, the core application and OS behaviors still represent similar processes.
We use Bayesian data analysis to predict dengue fever outbreaks and quantify the link between outbreaks and meteorological precursors tied to the breeding conditions of vector mosquitos. We use Hamiltonian Monte Carlo sampling to estimate a seasonal Gaussian process modeling infection rate, and aperiodic basis coefficients for the rate of an “outbreak level” of infection beyond seasonal trends across two separate regions. We use this outbreak level to estimate an autoregressive moving average (ARMA) model from which we extrapolate a forecast. We show that the resulting model has useful forecasting power in the 6–8 week range. The forecasts are not significantly more accurate with the inclusion of meteorological covariates than with infection trends alone.
Virtual testbeds are a core component of cyber experimentation as they allow for fast and relatively inexpensive modeling of computer systems. Unlike simulations, virtual testbeds run real software on virtual hardware which allows them to capture unknown or complex behaviors. However, virtualization is known to increase latency and decrease throughput. Could these and other artifacts from virtualization undermine the experiments that we wish to run? For the past three years, we have attempted to quantify where and how virtual testbeds differ from their physical counterparts to address this concern. While performance differences have been widely studied, we aim to uncover behavioral differences. We have run over 10,000 experiments and processed over half a petabyte of data. Complete details of our methodology and our experimental results from applying that methodology are published in previous work. In this paper, we describe our lessons learned in the process of constructing and instrumenting both physical and virtual testbeds and analyzing the results from each.
We seek scalable benchmarks for entity resolution problems. Solutions to these problems range from trivial approaches such as string sorting to sophisticated methods such as statistical relational learning. The theoretical and practical complexity of these approaches varies widely, so one of the primary purposes of a benchmark will be to quantify the trade-off between solution quality and runtime. We are motivated by the ubiquitous nature of entity resolution as a fundamental problem faced by any organization that ingests large amounts of noisy text data. A benchmark is typically a rigid specification that provides an objective measure usable for ranking implementations of an algorithm. For example the Top500 and HPCG500 bench- marks rank supercomputers based on their performance of dense and sparse linear algebra problems (respectively). These two benchmarks require participants to report FLOPS counts attainable on various machines. Our purpose is slightly different. Whereas the supercomputing benchmarks mentioned above hold algorithms constant and aim to rank machines, we are primarily interested in ranking algorithms. As mentioned above, entity resolution problems can be approached in completely different ways. We believe that users of our benchmarks must decide what sort of procedure to run before comparing implementations and architectures. Eventually, we also wish to provide a mechanism for ranking machines while holding algorithmic approach constant . Our primary contributions are parallel algorithms for computing solution quality mea- sures per entity. We find in some real datasets that many entities are quite easy to resolve while others are difficult, with a heavy skew toward the former case. Therefore, measures such as global confusion matrices, F measures, etc. do not meet our benchmarking needs. We design methods for computing solution quality at the granularity of a single entity in order to know when proposed solutions do well in difficult situations (perhaps justifying extra computational), or struggling in easy situations. We report on progress toward a viable benchmark for comparing entity resolution algo- rithms. Our work is incomplete, but we have designed and prototyped several algorithms to help evalute the solution quality of competing approaches to these problems. We envision a benchmark in which the objective measure is a ratio of solution quality to runtime.
This report summarizes the work performed under the Sandia LDRD project "Adverse Event Prediction Using Graph-Augmented Temporal Analysis." The goal of the project was to develop a method for analyzing multiple time-series data streams to identify precursors providing advance warning of the potential occurrence of events of interest. The proposed approach combined temporal analysis of each data stream with reasoning about relationships between data streams using a geospatial-temporal semantic graph. This class of problems is relevant to several important topics of national interest. In the course of this work we developed new temporal analysis techniques, including temporal analysis using Markov Chain Monte Carlo techniques, temporal shift algorithms to refine forecasts, and a version of Ripley's K-function extended to support temporal precursor identification. This report summarizes the project's major accomplishments, and gathers the abstracts and references for the publication sub-missions and reports that were prepared as part of this work. We then describe work in progress that is not yet ready for publication.
This report summarizes the work performed under the project "Advanced Data Structures for Improved Cyber Resilience and Awareness in Untrusted Environments." The goal of the project was to design, analyze, and test new data structures for cybersecurity applications. We had two major thrusts: 1) using new/improved write-optimized data structures and/or algorithms to better man- age and analyze high-speed massive-volume cyberstreams, and 2) adding security features to data structures at minimum cost. Write optimization allows data structures to better use secondary memory to store and search a larger amount of useful information. Secondary memory is large compared to main memory, but data movement to and from secondary memory must be carefully managed to run quickly enough to keep up with fast streams. The first thrust included managing cyberstreams in parallel, both multi-threaded and distributed, and improving the benchmarking infrastructure for testing new streaming data structures. We considered both (near) real-time discovery of particular patterns, and improved logging for improved forensics. We considered two kinds of security-feature problem. The first was high-performance history- independent external-memory data structures. These provide certain protections to data if a disk is stolen. We also prove some trade-offs between speed and security in this setting. The second data-security problem is more secure data look-up in secret-shared data bases. This report summarizes the project's major accomplishments, with the background to under- stand these accomplishments. It gathers the abstracts and references for the six refereed publications that have appeared as part of this work. We summarize several accomplishments that will be submitted for publication. We then archive one piece of partial work that is not likely to be published in the near future: validation of history-independent data structure implementations.
We study several natural instances of the geometric hitting set problem for input consisting of sets of line segments (and rays, lines) having a small number of distinct slopes. These problems model path monitoring (e.g., on road networks) using the fewest sensors (the “hitting points”). We give approximation algorithms for cases including (i) lines of 3 slopes in the plane, (ii) vertical lines and horizontal segments, (iii) pairs of horizontal/vertical segments. We give hardness and hardness of approximation results for these problems. We prove that the hitting set problem for vertical lines and horizontal rays is polynomially solvable.
We consider underwater multi-modal wireless sensor networks (UWSNs) suitable for applications on submarine surveillance and monitoring, where nodes offload data to a mobile autonomous underwater vehicle (AUV) via optical technology, and coordinate using acoustic communication. Sensed data are associated with a value, decaying in time. In this scenario, we address the problem of finding the path of the AUV so that the Value of Information (VoI) of the data delivered to a sink on the surface is maximized. We define a Greedy and Adaptive AUV Path-finding (GAAP) heuristic that drives the AUV to collect data from nodes depending on the VoI of their data. For benchmarking the performance of AUV path-finding heuristics, we define an integer linear programming (ILP) formulation that accurately models the considered scenario, deriving a path that drives the AUV to collect and deliver data with the maximum VoI. In our experiments GAAP consistently delivers more than 80 percent of the theoretical maximum VoI determined by the ILP model. We also compare the performance of GAAP with that of other strategies for driving the AUV among sensing nodes, namely, random paths, TSP-based paths and a 'lawn mower'-like strategy. Our results show that GAAP always outperforms every other heuristic in terms of delivered VoI, also obtaining higher energy efficiency.
Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems
Bender, Michael A.; Farach-Colton, Martin; Johnson, Rob; Mauras, Simon; Mayer, Tyler; Phillips, Cynthia A.; Xu, Helen
The skip list is an elegant dictionary data structure that is commonly deployed in RAM. A skip list with N elements supports searches, inserts, and deletes in O (log N) operations with high probability (w.h.p.) and range queries returning K elements in O(log N + K) operations w.h.p. A seemingly natural way to generalize the skip list to external memory with block size B is to "promote" with probability 1/B, rather than 1/2. However, there are practical and theoretical obstacles to getting the skip list to retain its efficient performance, space bounds, and high-probability guarantees. We give an external-memory skip list that achieves write-optimized bounds. That is, for 0 < ϵ < 1, range queries take O(logBϵ N + K/B) I/Os w.h.p. and insertions and deletions take O((logBϵ N)/B1-ϵ) amortized I/Os w.h.p. Our write-optimized skip list inherits the virtue of simplicity from RAM skip lists. Moreover, it matches or beats the asymptotic bounds of prior write-optimized data structures such as Bϵ trees or LSM trees. These data structures are deployed in high-performance databases and file systems. The main technical challenge in proving our bounds comes from the fact that there are so few levels in the skip list, an aspect of the data structure that is essential to getting strong external-memory bounds. We use extremal-graph coloring to show that it is possible to decompose paths in the skip list into uncorrelated groups, regardless of the insertion/deletion pattern. Thus, we achieve our bounds by averaging over these uncorrelated paths rather than by averaging over uncorrelated levels, as in the standard skip list.
A challenge in computer architecture is that processors often cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. With this motivation and the realization that traditional architectures (with all DRAM reachable only via bus) are insufficient to feed groups of modern processing units, vendors have introduced a variety of non-DDR 3D memory technologies (Hybrid Memory Cube (HMC),Wide I/O 2, High Bandwidth Memory (HBM)). These offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. We will call these solutions “near-memory,” and if user-addressable, “scratchpad.” High-performance systems on the market now offer two levels of main memory: near-memory on package and traditional DRAM further away. In the near term we expect the latencies near-memory and DRAM to be similar. Thus, it is natural to think of near-memory as another module on the DRAM level of the memory hierarchy. Vendors are expected to offer modes in which the near memory is used as cache, but we believe that this will be inefficient. In this paper, we explore the design space for a user-controlled multi-level main memory. Our work identifies situations in which rewriting application kernels can provide significant performance gains when using near-memory. We present algorithms designed for two-level main memory, using divide-and-conquer to partition computations and streaming to exploit data locality. We consider algorithms for the fundamental application of sorting and for the data analysis kernel k-means. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories’ SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance improvements for our sorting algorithm. In contrast, the k-means algorithm is generally CPU bound and does not improve when using near-memory except under extreme conditions. These conditions require large instances that rule out SST simulation, but we demonstrate improvements by running on a customized machine with high and low bandwidth memory. These case studies in co-design serve as positive and cautionary templates, respectively, for the major task of optimizing the computational kernels of many fundamental applications for two-level main memory systems.
We present history-independent alternatives to a B-tree, the primary indexing data structure used in databases. A data structure is history independent (HI) if it is impossible to deduce any information by examining the bit representation of the data structure that is not already available through the API. We show how to build a history-independent cache-oblivious B-tree and a history-independent external-memory skip list. One of the main contributions is a data structure we build on the way - a history-independent packed-memory array (PMA). The PMA supports efficient range queries, one of the most important operations for answering database queries. Our HI PMA matches the asymptotic bounds of prior non-HI packed-memory arrays and sparse tables. Specifically, a PMA maintains a dynamic set of elements in sorted order in a linearsized array. Inserts and deletes take an amortized O(log2 N) element moves with high probability. Simple experiments with our implementation of HI PMAs corroborate our theoretical analysis. Comparisons to regular PMAs give preliminary indications that the practical cost of adding history-independence is not too large. Our HI cache-oblivious B-tree bounds match those of prior non-HI cache-oblivious B-trees. Searches take O(logB N) I/Os; inserts and deletes take O(log2N/B + logB N) amortized I/Os with high probability; and range queries returning k elements take O(logB N + k/B) I/Os. Our HI external-memory skip list achieves optimal bounds with high probability, analogous to in-memory skip lists: O(logB N) I/Os for point queries and amortized O(logB N) I/Os for inserts/deletes. Range queries returning k elements run in O(logB N + k/B) I/Os. In contrast, the best possible high-probability bounds for inserting into the folklore B-skip list, which promotes elements with probability 1/B, is just Θ(log N) I/Os. This is no better than the bounds one gets from running an inmemory skip list in external memory.
In recent work we quantified the anticipated performance boost when a sorting algorithm is modified to leverage user- Addressable "near-memory," which we call scratchpad. This architectural feature is expected in the Intel Knight's Land- ing processors that will be used in DOE's next large-scale supercomputer. This paper expands our analytical study of the scratch- pad to consider k-means clustering, a classical data-analysis technique that is ubiquitous in the literature and in prac- Tice. We present new theoretical results using the model introduced in [13], which measures memory transfers and assumes that computations are memory-bound. Our the- oretical results indicate that scratchpad-aware versions of k-means clustering can expect performance boosts for high- dimensional instances with relatively few cluster centers. These constraints may limit the practical impact of scratch- pad for k-means acceleration, so we discuss their origins and practical implications. We corroborate our theory with ex- perimental runs on a system instrumented to mimic one with scratchpad memory. We also contribute a semi-formalization of the computa- Tional properties that are necessary and sufficient to predict a performance boost from scratchpad-aware variants of al- gorithms. We have observed and studied these properties in the context of sorting, and now clustering. We conclude with some thoughts on the application of these properties to new areas. Specifically, we believe that dense linear algebra has similar properties to k-means, while sparse linear algebra and FFT computations are more sim-ilar to sorting. The sparse operations are more common in scientific computing, so we expect scratchpad to have signif- icant impact in that area.
Geospatial semantic graphs provide a robust foundation for representing and analyzing remote sensor data. In particular, they support a variety of pattern search operations that capture the spatial and temporal relationships among the objects and events in the data. However, in the presence of large data corpora, even a carefully constructed search query may return a large number of unintended matches. This work considers the problem of calculating a quality score for each match to the query, given that the underlying data are uncertain. We present a preliminary evaluation of three methods for determining both match quality scores and associated uncertainty bounds, illustrated in the context of an example based on overhead imagery data.
A fundamental challenge for supercomputer architecture is that processors cannot be fed data from DRAM as fast as CPUs can consume it. Therefore, many applications are memory-bandwidth bound. As the number of cores per chip increases, and traditional DDR DRAM speeds stagnate, the problem is only getting worse. A variety of non-DDR 3D memory technologies (Wide I/O 2, HBM) offer higher bandwidth and lower power by stacking DRAM chips on the processor or nearby on a silicon interposer. However, such a packaging scheme cannot contain sufficient memory capacity for a node. It seems likely that future systems will require at least two levels of main memory: high-bandwidth, low-power memory near the processor and low-bandwidth high-capacity memory further away. This near memory will probably not have significantly faster latency than the far memory. This, combined with the large size of the near memory (multiple GB) and power constraints, may make it difficult to treat it as a standard cache. In this paper, we explore some of the design space for a user-controlled multi-level main memory. We present algorithms designed for the heterogeneous bandwidth, using streaming to exploit data locality. We consider algorithms for the fundamental application of sorting. Our algorithms asymptotically reduce memory-block transfers under certain architectural parameter settings. We use and extend Sandia National Laboratories' SST simulation capability to demonstrate the relationship between increased bandwidth and improved algorithmic performance. Memory access counts from simulations corroborate predicted performance. This co-design effort suggests implementing two-level main memory systems may improve memory performance in fundamental applications.
This project evaluates the effectiveness of moving target defense (MTD) techniques using a new game we have designed, called PLADD, inspired by the game FlipIt [28]. PLADD extends FlipIt by incorporating what we believe are key MTD concepts. We have analyzed PLADD and proven the existence of a defender strategy that pushes a rational attacker out of the game, demonstrated how limited the strategies available to an attacker are in PLADD, and derived analytic expressions for the expected utility of the game’s players in multiple game variants. We have created an algorithm for finding a defender’s optimal PLADD strategy. We show that in the special case of achieving deterrence in PLADD, MTD is not always cost effective and that its optimal deployment may shift abruptly from not using MTD at all to using it as aggressively as possible. We believe our effort provides basic, fundamental insights into the use of MTD, but conclude that a truly practical analysis requires model selection and calibration based on real scenarios and empirical data. We propose several avenues for further inquiry, including (1) agents with adaptive capabilities more reflective of real world adversaries, (2) the presence of multiple, heterogeneous adversaries, (3) computational game theory-based approaches such as coevolution to allow scaling to the real world beyond the limitations of analytical analysis and classical game theory, (4) mapping the game to real-world scenarios, (5) taking player risk into account when designing a strategy (in addition to expected payoff), (6) improving our understanding of the dynamic nature of MTD-inspired games by using a martingale representation, defensive forecasting, and techniques from signal processing, and (7) using adversarial games to develop inherently resilient cyber systems.
This report summarizes preliminary research into uncertainty quantification for pattern ana- lytics within the context of the Pattern Analytics to Support High-Performance Exploitation and Reasoning (PANTHER) project. The primary focus of PANTHER was to make large quantities of remote sensing data searchable by analysts. The work described in this re- port adds nuance to both the initial data preparation steps and the search process. Search queries are transformed from does the specified pattern exist in the data? to how certain is the system that the returned results match the query? We show example results for both data processing and search, and discuss a number of possible improvements for each.
We present a new distributed model for graph computations motivated by limited information sharing. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the two entities can share only a poly logarithmic size subgraph, 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. We believe this model captures realistic constraints on cooperating autonomous data centres' have results for both models for s-t connectivity, one of the simplest graph problems that requires global information in the worst case. In the limited-sharing model, our results exploit social network structure. Standard communication complexity gives polynomial lower bounds on s-t connectivity for general graphs. However, if the graph for each data centre has a giant component and these giant components intersect, then we can overcome this lower bound, computing-t connectivity while exchanging O(log 2 n) bits for a constant number of data centers. We can also test the assumption that the giant components overlap using O(log 2 n) bits provided the (unknown) overlap is sufficiently large. The second result is in the low trust model. We give a secure multi-party computation (MPC) algorithm that 1) does not make cryptographic assumptions when there are 3 or more entities, and 2) is efficient, especially when compared to the usual garbled circuit approach. The entities learn only the yes/no answer. No party learns anything about the others' graph, not even node names. This algorithm does not require any special graph structure. This secure MPC result for s-t connectivity is one of the first that involves a few parties computing on large inputs, instead of many parties computing on a few local values.
This report summarizes the work performed under the project project Next-Generation Algorithms for Assessing Infrastructure Vulnerability and Optimizing System Resilience. The goal of the project was to improve mathematical programming-based optimization technology for infrastructure protection. In general, the owner of a network wishes to design a network a network that can perform well when certain transportation channels are inhibited (e.g. destroyed) by an adversary. These are typically bi-level problems where the owner designs a system, an adversary optimally attacks it, and then the owner can recover by optimally using the remaining network. This project funded three years of Deon Burchett's graduate research. Deon's graduate advisor, Professor Jean-Philippe Richard, and his Sandia advisors, Richard Chen and Cynthia Phillips, supported Deon on other funds or volunteer time. This report is, therefore. essentially a replication of the Ph.D. dissertation it funded in a format required for project documentation. The thesis had some general polyhedral research. This is the study of the structure of the feasible region of mathematical programs, such as integer programs. For example, an integer program optimizes a linear objective function subject to linear constraints, and (nonlinear) integrality constraints on the variables. The feasible region without the integrality constraints is a convex polygon. Careful study of additional valid constraints can significantly improve computational performance. Here is the abstract from the dissertation: We perform a polyhedral study of a multi-commodity generalization of variable upper bound flow models. In particular, we establish some relations between facets of single- and multi- commodity models. We then introduce a new family of inequalities, which generalizes traditional flow cover inequalities to the multi-commodity context. We present encouraging numerical results. We also consider the directed edge-failure resilient network design problem (DRNDP). This problem entails the design of a directed multi-commodity flow network that is capable of fulfilling a specified percentage of demands in the event that any G arcs are destroyed, where G is a constant parameter. We present a formulation of DRNDP and solve it in a branch-column-cut framework. We present computational results.
We present new algorithms for a distributed model for graph computations motivated by limited information sharing we first discussed. Two or more independent entities have collected large social graphs. They wish to compute the result of running graph algorithms on the entire set of relationships. Because the information is sensitive or economically valuable, they do not wish to simply combine the information in a single location. We consider two models for computing the solution to graph algorithms in this setting: 1) limited-sharing: the two entities can share only a polylogarithmic size subgraph; 2) low-trust: the entities must not reveal any information beyond the query answer, assuming they are all honest but curious. We believe this model captures realistic constraints on cooperating autonomous data centers. We have algorithms in both setting for s - t connectivity in both models. We also give an algorithm in the low-communication model for finding a planted clique. This is an anomaly-detection problem, finding a subgraph that is larger and denser than expected. For both the low- communication algorithms, we exploit structural properties of social networks to prove performance bounds better than what is possible for general graphs. For s - t connectivity, we use known properties. For planted clique, we propose a new property: bounded number of triangles per node. This property is based upon evidence from the social science literature. We found that classic examples of social networks do not have the bounded-triangles property. This is because many social networks contain elements that are non-human, such as accounts for a business, or other automated accounts. We describe some initial attempts to distinguish human nodes from automated nodes in social networks based only on topological properties.
Listing all triangles is a fundamental graph operation. Triangles can have important interpretations in real-world graphs, especially social and other interaction networks. Despite the lack of provably efficient (linear, or slightly super linear) worst-case algorithms for this problem, practitioners run simple, efficient heuristics to find all triangles in graphs with millions of vertices. How are these heuristics exploiting the structure of these special graphs to provide major speedups in running time? We study one of the most prevalent algorithms used by practitioners. A trivial algorithm enumerates all paths of length 2, and checks if each such path is incident to a triangle. A good heuristic is to enumerate only those paths of length 2 in which the middle vertex has the lowest degree. It is easily implemented and is empirically known to give remarkable speedups over the trivial algorithm. We study the behavior of this algorithm over graphs with heavy-tailed degree distributions, a defining feature of real-world graphs. The erased configuration model (ECM) efficiently generates a graph with asymptotically (almost) any desired degree sequence. We show that the expected running time of this algorithm over the distribution of graphs created by the ECM is controlled by the l4/3-norm of the degree sequence. Norms of the degree sequence are a measure of the heaviness of the tail, and it is precisely this feature that allows non trivial speedups of simple triangle enumeration algorithms. As a corollary of our main theorem, we prove expected linear-time performance for degree sequences following a power law with exponent α ≥ 7/3, and non trivial speedup whenever α ∈ (2, 3).
We study several natural instances of the geometric hitting set problem for input consisting of sets of line segments (and rays, lines) having a small number of distinct slopes. These problems model path monitoring (e.g., on road networks) using the fewest sensors (the “hitting points”). We give approximation algorithms for cases including (i) lines of 3 slopes in the plane, (ii) vertical lines and horizontal segments, (iii) pairs of horizontal/vertical segments. We give hardness and hardness of approximation results for these problems. We prove that the hitting set problem for vertical lines and horizontal rays is polynomially solvable.
The Water Security Toolkit (WST) is a suite of open source software tools that can be used by water utilities to create response strategies to reduce the impact of contamination in a water distribution network . WST includes hydraulic and water quality modeling software , optimizati on methodologies , and visualization tools to identify: (1) sensor locations to detect contamination, (2) locations in the network in which the contamination was introduced, (3) hydrants to remove contaminated water from the distribution system, (4) locations in the network to inject decontamination agents to inactivate, remove, or destroy contaminants, (5) locations in the network to take grab sample s to help identify the source of contamination and (6) valves to close in order to isolate contaminate d areas of the network. This user manual describes the different components of WST , along w ith examples and case studies. License Notice The Water Security Toolkit (WST) v.1.2 Copyright c 2012 Sandia Corporation. Under the terms of Contract DE-AC04-94AL85000, there is a non-exclusive license for use of this work by or on behalf of the U.S. government. This software is distributed under the Revised BSD License (see below). In addition, WST leverages a variety of third-party software packages, which have separate licensing policies: Acro Revised BSD License argparse Python Software Foundation License Boost Boost Software License Coopr Revised BSD License Coverage BSD License Distribute Python Software Foundation License / Zope Public License EPANET Public Domain EPANET-ERD Revised BSD License EPANET-MSX GNU Lesser General Public License (LGPL) v.3 gcovr Revised BSD License GRASP AT&T Commercial License for noncommercial use; includes randomsample and sideconstraints executable files LZMA SDK Public Domain nose GNU Lesser General Public License (LGPL) v.2.1 ordereddict MIT License pip MIT License PLY BSD License PyEPANET Revised BSD License Pyro MIT License PyUtilib Revised BSD License PyYAML MIT License runpy2 Python Software Foundation License setuptools Python Software Foundation License / Zope Public License six MIT License TinyXML zlib License unittest2 BSD License Utilib Revised BSD License virtualenv MIT License Vol Common Public License vpykit Revised BSD License Additionally, some precompiled WST binary distributions might bundle other third-party executables files: Coliny Revised BSD License (part of Acro project) Dakota GNU Lesser General Public License (LGPL) v.2.1 PICO Revised BSD License (part of Acro project) i Revised BSD License Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Sandia National Laboratories nor Sandia Corporation nor the names of its con- tributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IM- PLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL SANDIA CORPORATION BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUD- ING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. ii Acknowledgements This work was supported by the U.S. Environmental Protection Agency through its Office of Research and Development (Interagency Agreement # DW8992192801). The material in this document has been subject to technical and policy review by the U.S. EPA, and approved for publication. The views expressed by individual authors, however, are their own, and do not necessarily reflect those of the U.S. Environmental Protection Agency. Mention of trade names, products, or services does not convey official U.S. EPA approval, endorsement, or recommendation. The Water Security Toolkit is an extension of the Threat Ensemble Vulnerability Assessment-Sensor Place- ment Optimization Tool (TEVA-SPOT), which was also developed with funding from the U.S. Environ- mental Protection Agency through its Office of Research and Development (Interagency Agreement # DW8992192801). The authors acknowledge the following individuals for their contributions to the devel- opment of TEVA-SPOT: Jonathan Berry (Sandia National Laboratories), Erik Boman (Sandia National Laboratories), Lee Ann Riesen (Sandia National Laboratories), James Uber (University of Cincinnati), and Jean-Paul Watson (Sandia National Laboratories). iii Acronyms ATUS American Time-Use Survey BLAS Basic linear algebra sub-routines CFU Colony-forming unit CVAR Conditional value at risk CWS Contamination warning system EA Evolutionary algorithm EDS Event detection system EPA U.S. Environmental Protection Agency EC Extent of Contamination ERD EPANET results database file GLPK GNU Linear Programming Kit GRASP Greedy randomized adaptive sampling process HEX Hexadecimal HTML HyperText markup language INP EPANET input file LP Linear program MC Mass consumed MILP Mixed integer linear program MIP Mixed integer program MSX Multi-species extension for EPANET NFD Number of failed detections NS Number of sensors NZD Non-zero demand PD Population dosed PE Population exposed PK Population killed TAI Threat assessment input file TCE Tailed-conditioned expectation TD Time to detection TEC Timed extent of contamination TEVA Threat ensemble vulnerability assessment TSB Tryptic soy broth TSG Threat scenario generation file TSI Threat simulation input file VAR Value at risk VC Volume consumed WST Water Security Toolkit YML YAML configuration file format for WST iv Symbols Notation Definition Example { , } set brackets { 1,2,3 } means a set containing the values 1,2, and 3. [?] is an element of s [?] S means that s is an element of the set S . [?] for all s = 1 [?] s [?] S means that the statement s = 1 is true for all s in set S . P summation P n i =1 s i means s 1 + s 2 + * * * + s n . \ set minus S \ T means the set that contains all those elements of S that are not in set T . %7C given %7C is used to define conditional probability. P ( s %7C t ) means the prob- ability of s occurring given that t occurs. %7C ... %7C cardinality Cardinality of a set is the number of elements of the set. If set S = { 2,4,6 } , then %7C S %7C = 3. v
This report summarizes the work performed under the project (3z(BStatitically significant relational data mining.(3y (BThe goal of the project was to add more statistical rigor to the fairly ad hoc area of data mining on graphs. Our goal was to develop better algorithms and better ways to evaluate algorithm quality. We concetrated on algorithms for community detection, approximate pattern matching, and graph similarity measures. Approximate pattern matching involves finding an instance of a relatively small pattern, expressed with tolerance, in a large graph of data observed with uncertainty. This report gathers the abstracts and references for the eight refereed publications that have appeared as part of this work. We then archive three pieces of research that have not yet been published. The first is theoretical and experimental evidence that a popular statistical measure for comparison of community assignments favors over-resolved communities over approximations to a ground truth. The second are statistically motivated methods for measuring the quality of an approximate match of a small pattern in a large graph. The third is a new probabilistic random graph model. Statisticians favor these models for graph analysis. The new local structure graph model overcomes some of the issues with popular models such as exponential random graph models and latent variable models.
Proc. of 2nd Int. Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, BigMine 2013 - Held in Conj. with SIGKDD 2013 Conf.
We present an algorithm to maintain the connected components of a graph that arrives as an infinite stream of edges. We formalize the algorithm on X-stream, a new parallel theoretical computational model for infinite streams. Connectivity-related queries, including component spanning trees, are supported with some latency, returning the state of the graph at the time of the query. Because an infinite stream may eventually exceed the storage limits of any number of finite-memory processors, we assume an aging command or daemon where "uninteresting" edges are removed when the system nears capacity. Following an aging command the system will block queries until its data structures are repaired, but edges will continue to be accepted from the stream, never dropped. The algorithm will not fail unless a model-specific constant fraction of the aggregate memory across all processors is full. In normal operation, it will not fail unless aggregate memory is completely full. Unlike previous theoretical streaming models designed for finite graphs that assume a single shared memory machine or require arbitrary-size intemediate files, X-stream distributes a graph over a ring network of finite-memory processors. Though the model is synchronous and reminiscent of systolic algorithms, our implementation uses an asynchronous message-passing system. We argue the correctness of our X-stream connected components algorithm, and give preliminary experimental results on synthetic and real graph streams.
Decision makers increasingly rely on large-scale computational models to simulate and analyze complex man-made systems. For example, computational models of national infrastructures are being used to inform government policy, assess economic and national security risks, evaluate infrastructure interdependencies, and plan for the growth and evolution of infrastructure capabilities. A major challenge for decision makers is the analysis of national-scale models that are composed of interacting systems: effective integration of system models is difficult, there are many parameters to analyze in these systems, and fundamental modeling uncertainties complicate analysis. This project is developing optimization methods to effectively represent and analyze large-scale heterogeneous system of systems (HSoS) models, which have emerged as a promising approach for describing such complex man-made systems. These optimization methods enable decision makers to predict future system behavior, manage system risk, assess tradeoffs between system criteria, and identify critical modeling uncertainties.
World Environmental and Water Resources Congress 2011: Bearing Knowledge for Sustainability - Proceedings of the 2011 World Environmental and Water Resources Congress
We consider the problem of placing a limited number of sensors in a municipal water distribution network to minimize the impact over a given suite of contamination incidents. In its simplest form, the sensor placement problem is a p-median problem that has structure extremely amenable to exact and heuristic solution methods. We describe the solution of real-world instances using integer programming or local search or a Lagrangian method. The Lagrangian method is necessary for solution of large problems on small PCs. We summarize a number of other heuristic methods for effectively addressing issues such as sensor failures, tuning sensors based on local water quality variability, and problem size/approximation quality tradeoffs. These algorithms are incorporated into the TEVA-SPOT toolkit, a software suite that the US Environmental Protection Agency has used and is using to design contamination warning systems for US municipal water systems.
In a (future) quantum computer a single logical quantum bit (qubit) will be made of multiple physical qubits. These extra physical qubits implement mandatory extensive error checking. The efficiency of error correction will fundamentally influence the performance of a future quantum computer, both in latency/speed and in error threshold (the worst error tolerated for an individual gate). Executing this quantum error correction requires scheduling the individual operations subject to architectural constraints. Since our last talk on this subject, a team of researchers at Sandia National Labortories has designed a logical qubit architecture that considers all relevant architectural issues including layout, the effects of supporting classical electronics, and the types of gates that the underlying physical qubit implementation supports most naturally. This is a two-dimensional system where 2-qubit operations occur locally, so there is no need to calculate more complex qubit/information transportation. Using integer programming, we found a schedule of qubit operations that obeys the hardware constraints, implements the local-check code in the native gate set, and minimizes qubit idle periods. Even with an optimal schedule, however, parallel Monte Carlo simulation shows that there is no finite error probability for the native gates such that the error-correction system would be benecial. However, by adding dynamic decoupling, a series of timed pulses that can reverse some errors, we found that there may be a threshold. Thus finding optimal schedules for increasingly-refined scheduling problems has proven critical for the overall design of the logical qubit system. We describe the evolving scheduling problems and the ideas behind the integer programming-based solution methods. This talk assumes no prior knowledge of quantum computing.
We consider the problem of placing sensors in a municipal water network when we can choose both the location of sensors and the sensitivity and specificity of the contamination warning system. Sensor stations in a municipal water distribution network continuously send sensor output information to a centralized computing facility, and event detection systems at the control center determine when to signal an anomaly worthy of response. Although most sensor placement research has assumed perfect anomaly detection, signal analysis software has parameters that control the tradeoff between false alarms and false negatives. We describe a nonlinear sensor placement formulation, which we heuristically optimize with a linear approximation that can be solved as a mixed-integer linear program. We report the results of initial experiments on a real network and discuss tradeoffs between early detection of contamination incidents, and control of false alarms.
Large relational datasets such as national-scale social networks and power grids present different computational challenges than do physical simulations. Sandia's distributed-memory supercomputers are well suited for solving problems concerning the latter, but not the former. The reason is that problems such as pattern recognition and knowledge discovery on large networks are dominated by memory latency and not by computation. Furthermore, most memory requests in these applications are very small, and when the datasets are large, most requests miss the cache. The result is extremely low utilization. We are unlikely to be able to grow out of this problem with conventional architectures. As the power density of microprocessors has approached that of a nuclear reactor in the past two years, we have seen a leveling of Moores Law. Building larger and larger microprocessor-based supercomputers is not a solution for informatics and network infrastructure problems since the additional processors are utilized to only a tiny fraction of their capacity. An alternative solution is to use the paradigm of massive multithreading with a large shared memory. There is only one instance of this paradigm today: the Cray MTA-2. The proposal team has unique experience with and access to this machine. The XMT, which is now being delivered, is a Red Storm machine with up to 8192 multithreaded 'Threadstorm' processors and 128 TB of shared memory. For many years, the XMT will be the only way to address very large graph problems efficiently, and future generations of supercomputers will include multithreaded processors. Roughly 10 MTA processor can process a simple short paths problem in the time taken by the Gordon Bell Prize-nominated distributed memory code on 32,000 processors of Blue Gene/Light. We have developed algorithms and open-source software for the XMT, and have modified that software to run some of these algorithms on other multithreaded platforms such as the Sun Niagara and Opteron multi-core chips.
Communities of vertices within a giant network such as the World-Wide-Web are likely to be vastly smaller than the network itself. However, Fortunato and Barthelemy have proved that modularity maximization algorithms for community detection may fail to resolve communities with fewer than {radical} L/2 edges, where L is the number of edges in the entire network. This resolution limit leads modularity maximization algorithms to have notoriously poor accuracy on many real networks. Fortunato and Barthelemy's argument can be extended to networks with weighted edges as well, and we derive this corollary argument. We conclude that weighted modularity algorithms may fail to resolve communities with fewer than {radical} W{epsilon}/2 total edge weight, where W is the total edge weight in the network and {epsilon} is the maximum weight of an inter-community edge. If {epsilon} is small, then small communities can be resolved. Given a weighted or unweighted network, we describe how to derive new edge weights in order to achieve a low {epsilon}, we modify the 'CNM' community detection algorithm to maximize weighted modularity, and show that the resulting algorithm has greatly improved accuracy. In experiments with an emerging community standard benchmark, we find that our simple CNM variant is competitive with the most accurate community detection methods yet proposed.
In this paper, we introduce EXACT, the EXperimental Algorithmics Computational Toolkit. EXACT is a software framework for describing, controlling, and analyzing computer experiments. It provides the experimentalist with convenient software tools to ease and organize the entire experimental process, including the description of factors and levels, the design of experiments, the control of experimental runs, the archiving of results, and analysis of results. As a case study for EXACT, we describe its interaction with FAST, the Sandia Framework for Agile Software Testing. EXACT and FAST now manage the nightly testing of several large software projects at Sandia. We also discuss EXACT's advanced features, which include a driver module that controls complex experiments such as comparisons of parallel algorithms. Copyright 2007 ACM.
Discrete models of large, complex systems like national infrastructures and complex logistics frameworks naturally incorporate many modeling uncertainties. Consequently, there is a clear need for optimization techniques that can robustly account for risks associated with modeling uncertainties. This report summarizes the progress of the Late-Start LDRD 'Robust Analysis of Largescale Combinatorial Applications'. This project developed new heuristics for solving robust optimization models, and developed new robust optimization models for describing uncertainty scenarios.
We present a series of related robust optimization models for placing sensors in municipal water networks to detect contaminants that are maliciously or accidentally injected. We formulate sensor placement problems as mixed-integer programs, for which the objective coefficients are not known with certainty. We consider a restricted absolute robustness criteria that is motivated by natural restrictions on the uncertain data, and we define three robust optimization models that differ in how the coefficients in the objective vary. Under one set of assumptions there exists a sensor placement that is optimal for all admissible realizations of the coefficients. Under other assumptions, we can apply sorting to solve each worst-case realization efficiently, or we can apply duality to integrate the worst-case outcome and have one integer program. The most difficult case is where the objective parameters are bilinear, and we prove its complexity is NP-hard even under simplifying assumptions. We consider a relaxation that provides an approximation, giving an overall guarantee of near-optimality when used with branch-and-bound search. We present preliminary computational experiments that illustrate the computational complexity of solving these robust formulations on sensor placement applications.
We consider the accuracy of predictions made by integer programming (IP) models of sensor placement for water security applications. We have recently shown that IP models can be used to find optimal sensor placements for a variety of different performance criteria (e.g. minimize health impacts and minimize time to detection). However, these models make a variety of simplifying assumptions that might bias the final solution. We show that our IP modeling assumptions are similar to models developed for other sensor placement methodologies, and thus IP models should give similar predictions. However, this discussion highlights that there are significant differences in how temporal effects are modeled for sensor placement. We describe how these modeling assumptions can impact sensor placements.
In recent years, several integer programming models have been proposed to place sensors in municipal water networks in order to detect intentional or accidental contamination. Although these initial models assumed that it is equally costly to place a sensor at any place in the network, there clearly are practical cost constraints that would impact a sensor placement decision. Such constraints include not only labor costs but also the general accessibility of a sensor placement location. In this paper, we extend our integer program to explicitly model the cost of sensor placement. We partition network locations into groups of varying placement cost, and we consider the public health impacts of contamination events under varying budget constraints. Thus our models permit cost/benefit analyses for differing sensor placement designs. As a control for our optimization experiments, we compare the set of sensor locations selected by the optimization models to a set of manually-selected sensor locations.
This report summarizes the research and development performed from October 2002 to September 2004 at Sandia National Laboratories under the Laboratory-Directed Research and Development (LDRD) project ''Massively-Parallel Linear Programming''. We developed a linear programming (LP) solver designed to use a large number of processors. LP is the optimization of a linear objective function subject to linear constraints. Companies and universities have expended huge efforts over decades to produce fast, stable serial LP solvers. Previous parallel codes run on shared-memory systems and have little or no distribution of the constraint matrix. We have seen no reports of general LP solver runs on large numbers of processors. Our parallel LP code is based on an efficient serial implementation of Mehrotra's interior-point predictor-corrector algorithm (PCx). The computational core of this algorithm is the assembly and solution of a sparse linear system. We have substantially rewritten the PCx code and based it on Trilinos, the parallel linear algebra library developed at Sandia. Our interior-point method can use either direct or iterative solvers for the linear system. To achieve a good parallel data distribution of the constraint matrix, we use a (pre-release) version of a hypergraph partitioner from the Zoltan partitioning library. We describe the design and implementation of our new LP solver called parPCx and give preliminary computational results. We summarize a number of issues related to efficient parallel solution of LPs with interior-point methods including data distribution, numerical stability, and solving the core linear system using both direct and iterative methods. We describe a number of applications of LP specific to US Department of Energy mission areas and we summarize our efforts to integrate parPCx (and parallel LP solvers in general) into Sandia's massively-parallel integer programming solver PICO (Parallel Interger and Combinatorial Optimizer). We conclude with directions for long-term future algorithmic research and for near-term development that could improve the performance of parPCx.
We present a model for optimizing the placement of sensors in municipal water networks to detect maliciously injected contaminants. An optimal sensor configuration minimizes the expected fraction of the population at risk. We formulate this problem as a mixed-integer program, which can be solved with generally available solvers. We find optimal sensor placements for three test networks with synthetic risk and population data. Our experiments illustrate that this formulation can be solved relatively quickly and that the predicted sensor configuration is relatively insensitive to uncertainties in the data used for prediction.
We give processor-allocation algorithms for grid architectures, where the objective is to select processors from a set of available processors to minimize the average number of communication hops. The associated clustering problem is as follows: Given n points in R{sup d}, find a size-k subset with minimum average pairwise L{sub 1} distance.We present a natural approximation algorithm and show that it is a 7/4-approximation for 2D grids. In d dimensions, the approximation guarantee is 2 - 1/2d, which is tight. We also give a polynomial-time approximation scheme (PTAS) for constant dimension d and report on experimental results.
The Computational Plant or Cplant is a commodity-based distributed-memory supercomputer under development at Sandia National Laboratories. Distributed-memory supercomputers run many parallel programs simultaneously. Users submit their programs to a job queue. When a job is scheduled to run, it is assigned to a set of available processors. Job runtime depends not only on the number of processors but also on the particular set of processors assigned to it. Jobs should be allocated to localized clusters of processors to minimize communication costs and to avoid bandwidth contention caused by overlapping jobs. This report introduces new allocation strategies and performance metrics based on space-filling curves and one dimensional allocation strategies. These algorithms are general and simple. Preliminary simulations and Cplant experiments indicate that both space-filling curves and one-dimensional packing improve processor locality compared to the sorted free list strategy previously used on Cplant. These new allocation strategies are implemented in Release 2.0 of the Cplant System Software that was phased into the Cplant systems at Sandia by May 2002. Experimental results then demonstrated that the average number of communication hops between the processors allocated to a job strongly correlates with the job's completion time. This report also gives processor-allocation algorithms for minimizing the average number of communication hops between the assigned processors for grid architectures. The associated clustering problem is as follows: Given n points in {Re}d, find k points that minimize their average pairwise L{sub 1} distance. Exact and approximate algorithms are given for these optimization problems. One of these algorithms has been implemented on Cplant and will be included in Cplant System Software, Version 2.1, to be released. In more preliminary work, we suggest improvements to the scheduler separate from the allocator.
We present a model for optimizing the placement of sensors in municipal water networks to detect maliciously-injected contaminants. An optimal sensor configuration minimizes the expected fraction of the population at risk. We formulate this problem as an integer program, which can be solved with generally available IP solvers. We find optimal sensor placements for three real networks with synthetic risk and population data. Our experiments illustrate that this formulation can be solved relatively quickly, and that the predicted sensor configuration is relatively insensitive to uncertainties in the data used for prediction.
We consider two fundamental problems in dynamic scheduling: scheduling to meet deadlines in a preemptive multiprocessor setting, and scheduling to provide good response time in a number of scheduling environments. When viewed from the perspective of traditional worst-case analysis, no good on-line algorithms exist for these problems, and for some variants no good off-line algorithms exist unless T = NP. We study these problems using a relaxed notion of competitive analysis, introduced by Kalyanasundaram and Pruhs, in which the on-line algorithm is allowed more resources than the optimal off-line algorithm to which it is compared. Using this approach, we establish that several well-known on-line algorithms, that have poor performance from an absolute worst-case perspective, are optimal for the problems in question when allowed moderately more resources. For optimization of average flow time, these are the first results of any sort, for any NP-hard version of the problem, that indicate that it might be possible to design good approximation algorithms.
This report describes the design of PICO, a C++ framework for implementing general parallel branch-and-bound algorithms. The PICO framework provides a mechanism for the efficient implementation of a wide range of branch-and-bound methods on an equally wide range of parallel computing platforms. We first discuss the basic architecture of PICO, including the application class hierarchy and the package's serial and parallel layers. We next describe the design of the serial layer, and its central notion of manipulating subproblem states. Then, we discuss the design of the parallel layer, which includes flexible processor clustering and communication rates, various load balancing mechanisms, and a non-preemptive task scheduler running on each processor. We describe the application of the package to a branch-and-bound method for mixed integer programming, along with computational results on the ASCI Red massively parallel computer. Finally we describe the application of the branch-and-bound mixed-integer programming code to a resource constrained project scheduling problem for Pantex.
In the ℓ-phylogeny problem, one wishes to construct an evolutionary tree for a set of species represented by characters, in which each state of each character induces no more than ℓ connected components. We consider the fixed-topology version of this problem for fixed-topologies of arbitrary degree. This version of the problem is known to be NP-comPlete for ℓ ≥ 3 even for degree-3 trees in which no state labels more than ℓ+ 1 leaves (and therefore there is a trivial ℓ + 1 phytogeny). We give a 2-approximation algorithm for all ℓ ≥ 3 for arbitrary input topologies and we give an optimal approximation algorithm that constructs a 4-phylogeny when a 3-phylogeny exists. Dynamic programming techniques, which are typically used in fixed-topology problems, cannot be applied to ℓ-phylogeny problems. Our 2-approximation algorithm is the first application of linear programming to approximation algorithms for phylogeny problems. We extend our results to a related problem in which characters are polymorphic.
The authors propose a general technique called solution decomposition to devise approximation algorithms with provable performance guarantees. The technique is applicable to a large class of combinatorial optimization problems that can be formulated as integer linear programs. Two key ingredients of the technique involve finding a decomposition of a fractional solution into a convex combination of feasible integral solutions and devising generic approximation algorithms based on calls to such decompositions as oracles. The technique is closely related to randomized rounding. The method yields as corollaries unified solutions to a number of well studied problems and it provides the first approximation algorithms with provable guarantees for a number of new problems. The particular results obtained in this paper include the following: (1) The authors demonstrate how the technique can be used to provide more understanding of previous results and new algorithms for classical problems such as Multicriteria Spanning Trees, and Suitcase Packing. (2) They show how the ideas can be extended to apply to multicriteria optimization problems, in which they wish to minimize a certain objective function subject to one or more budget constraints. As corollaries they obtain first non-trivial multicriteria approximation algorithms for problems including the k-Hurdle and the Network Inhibition problems.
The Department of Energy (DOE) national laboratories have one of the longest and most consistent histories of supercomputer use. The authors summarize the architecture of DOE`s new supercomputers that are being built for the Accelerated Strategic Computing Initiative (ASCI). The authors then argue that in the near future scaled-down versions of these supercomputers with petaflop-per-weekend capabilities could become widely available to hundreds of research and engineering departments. The availability of such computational resources will allow simulation of physical phenomena to become a full-fledged third branch of scientific exploration, along with theory and experimentation. They describe the ASCI and other supercomputer applications at Sandia National Laboratories, and discuss which lessons learned from Sandia`s long history of supercomputing can be applied in this new setting.
This report describes the research performed under the laboratory-Directed Research and Development (LDRD) grant {open_quotes}A new approach to protein function and structure prediction{close_quotes}, funded FY94-6. We describe the goals of the research, motivate and list our improvements to the state of the art in multiple sequence alignment and phylogeny (evolutionary tree) construction, but leave technical details to the six publications resulting from this work. At least three algorithms for phylogeny construction or tree consensus have been implemented and used by researchers outside of Sandia.
There has been much recent algorithmic work on the problem of reconstructing the evolutionary history of biological species. Computer virus specialists are interested in finding the evolutionary history of computer viruses--a virus is often written using code fragments from one or more other viruses, which are its immediate ancestors. A phylogeny for a collection of computer viruses is a directed acyclic graph whose nodes are the viruses and whose edges map ancestors to descendants and satisfy the property that each code fragment is ``invented`` only once. To provide a simple explanation for the data, we consider the problem of constructing such a phylogeny with a minimal number of edges. In general, this optimization problem cannot be solved in quasi-polynomial time unless NQP=QP; we present positive and negative results for associated approximated problems. When tree solutions exist, they can be constructed and randomly sampled in polynomial time.
A natural and basic problem in scheduling theory is to provide good average quality of service to a stream of jobs that arrive over time. In this paper we consider the problem of scheduling n jobs that are released over time in order to minimize the average completion time of the set of jobs. In contrast to the problem of minimizing average completion time when all jobs are available at time 0, all the problems that we consider are NP-hard, and essentially nothing was known about constructing good approximations in polynomial time. We give the first constant-factor approximation algorithms for several variants of the single and parallel machine model. Many of the algorithms are based on interesting algorithmic and structural relationships between preemptive and nonpreemptive schedules and linear programming relaxations of both. Many of the algorithms generalize to the minimization of average weighted completion time as well.
Inferring phylogenetic trees is a fundamental problem in computational-biology. We present a new objective criterion, the phylogenetic number, for evaluating evolutionary trees for species defined by biomolecular sequences or other qualitative characters. The phylogenetic number of a tree T is the maximum number of times that any given character state arises in T. By contrast, the classical parsimonycriterion measures the total number of times that different character states arise in T. We consider the following related problems: finding the tree with minimum phylogenetic number, and computing the phylogenetic number of a given topology in which only the leaves are labeled by species. When the number of states is bounded (as is the case for biomolecular sequence characters), we can solve the second problem in polynomial time. We can also compute a fixed-topology 2-phylogeny (when one exists) for an arbitrary number of states. This algorithm can be used to further distinguish trees that are equal under parsimony. We also consider a number of other related problems.