Triangle counting is a fundamental building block in graph algorithms. In this article, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy-hitters is extensively studied in the database literature.We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity).Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω (N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes).We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead.We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.

In this project we developed and validated algorithms for privacy-preserving linear regression using a new variant of Secure Multiparty Computation (MPC) we call "Hybrid MPC" (hMPC). Our variant is intended to support low-power, unreliable networks of sensors with low-communication, fault-tolerant algorithms. In hMPC we do not share training data, even via secret sharing. Thus, agents are responsible for protecting their own local data. Only the machine learning (ML) model is protected with information-theoretic security guarantees against honest-but-curious agents. There are three primary advantages to this approach: (1) after setup, hMPC supports a communication-efficient matrix multiplication primitive, (2) organizations prevented by policy or technology from sharing any of their data can participate as agents in hMPC, and (3) large numbers of low-power agents can participate in hMPC. We have also created an open-source software library named "Cicada" to support hMPC applications with fault-tolerance. The fault-tolerance is important in our applications because the agents are vulnerable to failure or capture. We have demonstrated this capability at Sandia's Autonomy New Mexico laboratory through a simple machine-learning exercise with Raspberry Pi devices capturing and classifying images while flying on four drones.

Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and diversity of hardware platforms at supercomputing centers has motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.

Given an input stream of size N, a †-heavy hitter is an item that occurs at least † N times in S. The problem of finding heavy-hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = † N-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, and with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false-positives requires large space (ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O-bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ∼100K observations per second.

A key problem in social network analysis is to identify nonhuman interactions. State-of-the-art bot-detection systems like Botometer train machine-learning models on user-specific data. Unfortunately, these methods do not work on data sets in which only topological information is available. In this paper, we propose a new, purely topological approach. Our method removes edges that connect nodes exhibiting strong evidence of non-human activity from publicly available electronic-social-network datasets, including, for example, those in the Stanford Network Analysis Project repository (SNAP). Our methodology is inspired by classic work in evolutionary psychology by Dunbar that posits upper bounds on the total strength of the set of social connections in which a single human can be engaged. We model edge strength with Easley and Kleinberg's topological estimate; label nodes as “violators” if the sum of these edge strengths exceeds a Dunbar-inspired bound; and then remove the violator-to-violator edges. We run our algorithm on multiple social networks and show that our Dunbar-inspired bound appears to hold for social networks, but not for nonsocial networks. Our cleaning process classifies 0.04% of the nodes of the Twitter-2010 followers graph as violators, and we find that more than 80% of these violator nodes have Botometer scores of 0.5 or greater. Furthermore, after we remove the roughly 15 million violator-violator edges from the 1.2-billion-edge Twitter-2010 follower graph, 34% of the violator nodes experience a factor-of-two decrease in PageRank. PageRank is a key component of many graph algorithms such as node/edge ranking and graph sparsification. Thus, this artificial inflation would bias algorithmic output, and result in some incorrect decisions based on this output.

Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of teras-cale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdos-Renyi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence, and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.

In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form.

We seek scalable benchmarks for entity resolution problems. Solutions to these problems range from trivial approaches such as string sorting to sophisticated methods such as statis- tical relational learning. The theoretical and practical complexity of these approaches varies widely, so one of the primary purposes of a benchmark will be to quantify the trade-off between solution quality and runtime. We are motivated by the ubiquitous nature of entity resolution as a fundamental problem faced by any organization that ingests large amounts of noisy text data. A benchmark is typically a rigid specification that provides an objective measure usable for ranking implementations of an algorithm. For example the Top500 and HPCG500 bench- marks rank supercomputers based on their performance of dense and sparse linear algebra problems (respectively). These two benchmarks require participants to report FLOPS counts attainable on various machines. Our purpose is slightly different. Whereas the supercomputing benchmarks mentioned above hold algorithms constant and aim to rank machines, we are primarily interested in ranking algorithms. As mentioned above, entity resolution problems can be approached in completely different ways. We believe that users of our benchmarks must decide what sort of procedure to run before comparing implementations and architectures. Eventually, we also wish to provide a mechanism for ranking machines while holding algorithmic approach constant . Our primary contributions are parallel algorithms for computing solution quality mea- sures per entity. We find in some real datasets that many entities are quite easy to resolve while others are difficult, with a heavy skew toward the former case. Therefore, measures such as global confusion matrices, F measures, etc. do not meet our benchmarking needs. We design methods for computing solution quality at the granularity of a single entity in order to know when proposed solutions do well in difficult situations (perhaps justifying extra computational), or struggling in easy situations. We report on progress toward a viable benchmark for comparing entity resolution algo- rithms. Our work is incomplete, but we have designed and prototyped several algorithms to help evalute the solution quality of competing approaches to these problems. We envision a benchmark in which the objective measure is a ratio of solution quality to runtime.

Triangle counting is a representative graph analysis algorithm with several applications. It is also one of the three benchmarks used in the IEEE HPEC Graph Challenge. Triangle counting can be expressed as a graph algorithm and in a linear algebra formulation using sparse matrix-matrix multiplication (SpGEMM). The linear algebra formulation using the SpGEMM algorithm in the Kokkoskernels library was one of the fastest implementations for the triangle counting problem in last year's Graph Challenge. This paper improves upon that work by developing an SpGEMM implementation that relies on a highly efficient, work-stealing, multithreaded runtime. We demonstrate that this new implementation results in improving the triangle counting runtime up to 5× to 12× on different architectures. This new implementation also breaks the 10 9 barrier for the rate measure on a single node for the triangle counting problem. We also compare the linear algebra formulation with a traditional graph based formulation. The linear algebra implementation is up to 2.96× to 7× faster on different architectures compared to the graph based implementation. Furthermore, we present analysis of the scaling of the triangle counting implementation as the graph sizes increase using both synthetic and real graphs from the graph challenge data set.