Publications

Automatic HBM Management: Models and Algorithms

Annual ACM Symposium on Parallelism in Algorithms and Architectures

Delayo, Daniel R.; Zhang, Kenny; Agrawal, Kunal; Bender, Michael A.; Berry, Jonathan W.; Das, Rathish; Moseley, Benjamin; Phillips, Cynthia A.

Some past and future supercomputer nodes incorporate High-Bandwidth Memory (HBM). Compared to standard DRAM, HBM has similar latency, higher bandwidth, and lower capacity. In this paper, we evaluate algorithms for managing High-Bandwidth Memory automatically. Previous work suggests that, in the worst case, performance is extremely sensitive to the policy for managing the channel to DRAM. Prior theory shows that a priority-based scheme (where there is a static strict priority order among p threads for channel access) is O(1)-competitive, but FIFO is not, and in the worst case is Ω(p)-competitive. Following this theoretical guidance would be a disruptive change for vendors, who currently use FIFO variants in their DRAM-controller hardware. Our goal is to determine theoretically and empirically whether we can justify recommending investment in priority-based DRAM-controller hardware. In order to experiment with DRAM channel protocols, we chose a theoretical model, validated it against real hardware, and implemented a basic simulator. We corroborated the previous theoretical results for the model, conducted a parameter sweep while running our simulator on address traces from memory bandwidth-bound codes (GNU sort and TACO sparse matrix-vector product), and designed better channel-access algorithms.
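
To make the FIFO-versus-priority contrast concrete, the following is a minimal, hypothetical Python sketch (not the paper's simulator or model): p threads issue unit-time requests on a single DRAM channel, and we compare service in FIFO arrival order against a static strict priority order.

```python
from collections import deque

def simulate(requests, policy):
    """Serve unit-time DRAM-channel requests from p threads.

    requests: per-thread request counts.
    policy:   "fifo" serves requests in (round-robin) arrival order;
              "priority" always serves the lowest-numbered thread
              that still has pending requests.
    Returns each thread's completion time.
    """
    p = len(requests)
    remaining = list(requests)
    finish = [0] * p
    order = deque()
    if policy == "fifo":
        # Threads issue requests concurrently, modeled as round-robin arrival.
        for r in range(max(requests)):
            for t in range(p):
                if r < requests[t]:
                    order.append(t)
    time = 0
    while any(remaining):
        if policy == "fifo":
            t = order.popleft()
        else:  # static strict priority: lowest thread id goes first
            t = next(i for i in range(p) if remaining[i] > 0)
        time += 1
        remaining[t] -= 1
        if remaining[t] == 0:
            finish[t] = time
    return finish

# Four threads with three channel requests each.
print(simulate([3, 3, 3, 3], "fifo"))      # [9, 10, 11, 12]: every thread finishes late
print(simulate([3, 3, 3, 3], "priority"))  # [3, 6, 9, 12]: high-priority threads finish early
```

Under FIFO every thread's completion is delayed by all of the others, which is the intuition behind the Ω(p) lower bound; under a static priority order the highest-priority thread is never delayed.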

A Block-Based Triangle Counting Algorithm on Heterogeneous Environments

IEEE Transactions on Parallel and Distributed Systems

Yasar, Abdurrahman; Rajamanickam, Sivasankaran R.; Berry, Jonathan W.; Catalyurek, Umit V.

Triangle counting is a fundamental building block in graph algorithms. In this article, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.
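
The task decomposition can be illustrated with a small, simplified Python sketch (illustrative only, not the paper's tiled implementation): vertices are partitioned into blocks, and each triangle u < v < w is counted by exactly one block-triple task, so a task touches only a few edge blocks and can be streamed to a device independently.

```python
from itertools import combinations_with_replacement

def count_triangles_blocked(edges, n, block_size):
    """Count triangles via independent block-triple tasks.

    edges: set of pairs (u, v) with u < v (upper-triangular adjacency).
    Each triangle u < v < w is counted once, by the task that owns the
    block triple (block(u), block(v), block(w)).
    """
    def block(v):
        return v // block_size

    nblocks = (n + block_size - 1) // block_size

    # Bucket edges by the block pair of their endpoints.
    by_blocks = {}
    for (u, v) in edges:
        by_blocks.setdefault((block(u), block(v)), []).append((u, v))

    total = 0
    # Each (bi, bj, bk) task needs only three edge blocks: a small streamable unit.
    for bi, bj, bk in combinations_with_replacement(range(nblocks), 3):
        for (u, v) in by_blocks.get((bi, bj), []):
            for (x, w) in by_blocks.get((bj, bk), []):
                if x == v and (u, w) in edges:
                    total += 1
    return total

# A 4-cycle with one chord contains two triangles.
print(count_triangles_blocked({(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)},
                              n=4, block_size=2))  # -> 2
```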

Timely Reporting of Heavy Hitters Using External Memory

ACM Transactions on Database Systems

Singh, Shikha; Pandey, Prashant; Bender, Michael A.; Berry, Jonathan W.; Farach-Colton, Martin; Johnson, Rob; Kroeger, Thomas M.; Phillips, Cynthia A.

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence it becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable tradeoff between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ≈100K observations per second.
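
To pin down the reporting requirement, here is a minimal in-RAM Python baseline (not the paper's external-memory structures): it reports each item exactly once, immediately after its ⌈φN⌉-th occurrence, at the cost of exact per-item counts, which is the kind of bookkeeping the paper makes affordable by moving it to external memory.

```python
import math
from collections import defaultdict

def timely_event_detection(stream, n, phi):
    """Report each item exactly once, immediately after its ceil(phi*n)-th
    occurrence: the TED condition with zero reporting delay.

    This exact in-RAM counter illustrates the problem statement only; it
    needs a word of memory per distinct item, which is what the paper's
    external-memory structures avoid keeping in RAM.
    """
    threshold = math.ceil(phi * n)
    counts = defaultdict(int)
    reported = set()
    for i, item in enumerate(stream):
        counts[item] += 1
        if counts[item] >= threshold and item not in reported:
            reported.add(item)
            yield i, item   # reported the moment the threshold is crossed

stream = ["a", "b", "a", "c", "a", "b", "a"]
print(list(timely_event_detection(stream, n=len(stream), phi=0.5)))
# -> [(6, 'a')]: 'a' is reported at its 4th occurrence (ceil(0.5 * 7) = 4)
```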

Adapting Secure Multiparty Computation to Support Machine Learning in Radio Frequency Sensor Networks

Berry, Jonathan W.; Ganti, Anand G.; Goss, Kenneth G.; Mayer, Carolyn D.; Onunkwo, Uzoma O.; Phillips, Cynthia A.; Saia, Jared; Shead, Timothy M.

In this project we developed and validated algorithms for privacy-preserving linear regression using a new variant of Secure Multiparty Computation (MPC) we call "Hybrid MPC" (hMPC). Our variant is intended to support low-power, unreliable networks of sensors with low-communication, fault-tolerant algorithms. In hMPC we do not share training data, even via secret sharing. Thus, agents are responsible for protecting their own local data. Only the machine learning (ML) model is protected with information-theoretic security guarantees against honest-but-curious agents. There are three primary advantages to this approach: (1) after setup, hMPC supports a communication-efficient matrix multiplication primitive, (2) organizations prevented by policy or technology from sharing any of their data can participate as agents in hMPC, and (3) large numbers of low-power agents can participate in hMPC. We have also created an open-source software library named "Cicada" to support hMPC applications with fault-tolerance. The fault-tolerance is important in our applications because the agents are vulnerable to failure or capture. We have demonstrated this capability at Sandia's Autonomy New Mexico laboratory through a simple machine-learning exercise with Raspberry Pi devices capturing and classifying images while flying on four drones.
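
The core idea of keeping training data local while protecting only the model can be sketched with plain additive secret sharing; the toy Python example below is illustrative and does not use or reflect the Cicada API.

```python
import secrets

PRIME = 2**61 - 1  # field modulus for additive sharing (illustrative choice)

def share(value, n_agents):
    """Split an integer model parameter into n additive shares mod PRIME.
    Any set of fewer than n shares is uniformly random and independent of
    the value, so capturing a few agents reveals nothing about the model."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_agents - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares

def reconstruct(shares):
    """All n shares together recover the parameter."""
    return sum(shares) % PRIME

# Each agent keeps its own training data local; only the (fixed-point
# encoded) model weight is ever shared.
weight = 123456
shares = share(weight, n_agents=4)
assert reconstruct(shares) == weight
print(shares)
```

Shares of two parameters can be added locally with no communication, which is the standard building block behind communication-efficient linear operations; multiplications require additional machinery, such as the matrix-multiplication primitive described above.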

Performance Portability of an SpMV Kernel Across Scientific Computing and Data Science Applications

2021 IEEE High Performance Extreme Computing Conference, HPEC 2021

Olivier, Stephen L.; Ellingwood, Nathan D.; Berry, Jonathan W.; Dunlavy, Daniel D.

Both the data science and scientific computing communities are embracing GPU acceleration for their most demanding workloads. For scientific computing applications, the massive volume of code and the diversity of hardware platforms at supercomputing centers have motivated a strong effort toward performance portability. This property of a program, denoting its ability to perform well on multiple architectures and varied datasets, is heavily dependent on the choice of parallel programming model and which features of the programming model are used. In this paper, we evaluate performance portability in the context of a data science workload in contrast to a scientific computing workload, evaluating the same sparse matrix kernel on both. Among our implementations of the kernel in different performance-portable programming models, we find that many struggle to consistently achieve performance improvements using the GPU compared to simple one-line OpenMP parallelization on high-end multicore CPUs. We show one that does, and its performance approaches and sometimes even matches that of vendor-provided GPU math libraries.
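
For reference, the kernel being compared is sparse matrix-vector multiply (SpMV). A minimal Python sketch of the CSR computation follows; the paper's versions are written in performance-portable parallel programming models and OpenMP, and the row loop is where the one-line OpenMP parallelization mentioned above is applied.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in Compressed Sparse Row form.

    values, col_idx, row_ptr are the standard CSR arrays. The outer loop
    over rows carries no dependences, so it is the natural target for
    parallelization (e.g., an OpenMP 'parallel for' in a CPU baseline).
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

# A = [[1, 0, 2],
#      [0, 3, 0]]  times  x = [1, 1, 1]  ->  [3, 3]
print(spmv_csr([1.0, 2.0, 3.0], [0, 2, 1], [0, 2, 3], [1.0, 1.0, 1.0]))
```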

A Block-Based Triangle Counting Algorithm on Heterogeneous Environments

Yasar, Abdurrahman; Rajamanickam, Sivasankaran R.; Berry, Jonathan W.; Catalyurek, Umit V.

Triangle counting is a fundamental building block in graph algorithms. In this paper, we propose a block-based triangle counting algorithm to reduce data movement during both sequential and parallel execution. Our block-based formulation makes the algorithm naturally suitable for heterogeneous architectures. The problem of partitioning the adjacency matrix of a graph is well-studied. Our task decomposition goes one step further: it partitions the set of triangles in the graph. By streaming these small tasks to compute resources, we can solve problems that do not fit on a device. We demonstrate the effectiveness of our approach by providing an implementation on a compute node with multiple sockets, cores and GPUs. The current state-of-the-art in triangle enumeration processes the Friendster graph in 2.1 seconds, not including data copy time between CPU and GPU. Using that metric, our approach is 20 percent faster. When copy times are included, our algorithm takes 3.2 seconds. This is 5.6 times faster than the fastest published CPU-only time.

Timely Reporting of Heavy Hitters using External Memory

Proceedings of the ACM SIGMOD International Conference on Management of Data

Pandey, Prashant; Singh, Shikha; Bender, Michael A.; Berry, Jonathan W.; Farach-Colton, Martin; Johnson, Rob; Kroeger, Thomas M.; Phillips, Cynthia A.

Given an input stream S of size N, a φ-heavy hitter is an item that occurs at least φN times in S. The problem of finding heavy hitters is extensively studied in the database literature. We study a real-time heavy-hitters variant in which an element must be reported shortly after we see its T = φN-th occurrence (and hence becomes a heavy hitter). We call this the Timely Event Detection (TED) Problem. The TED problem models the needs of many real-world monitoring systems, which demand accurate (i.e., no false negatives) and timely reporting of all events from large, high-speed streams, with a low reporting threshold (high sensitivity). Like the classic heavy-hitters problem, solving the TED problem without false positives requires large space (Ω(N) words). Thus in-RAM heavy-hitters algorithms typically sacrifice accuracy (i.e., allow false positives), sensitivity, or timeliness (i.e., use multiple passes). We show how to adapt heavy-hitters algorithms to external memory to solve the TED problem on large high-speed streams while guaranteeing accuracy, sensitivity, and timeliness. Our data structures are limited only by I/O bandwidth (not latency) and support a tunable trade-off between reporting delay and I/O overhead. With a small bounded reporting delay, our algorithms incur only a logarithmic I/O overhead. We implement and validate our data structures empirically using the Firehose streaming benchmark. Multi-threaded versions of our structures can scale to process 11M observations per second before becoming CPU bound. In comparison, a naive adaptation of the standard heavy-hitters algorithm to external memory would be limited by the storage device's random I/O throughput, i.e., ∼100K observations per second.

Making social networks more human: A topological approach

Statistical Analysis and Data Mining

Berry, Jonathan W.

A key problem in social network analysis is to identify nonhuman interactions. State-of-the-art bot-detection systems like Botometer train machine-learning models on user-specific data. Unfortunately, these methods do not work on data sets in which only topological information is available. In this paper, we propose a new, purely topological approach. Our method removes edges that connect nodes exhibiting strong evidence of non-human activity from publicly available electronic-social-network datasets, including, for example, those in the Stanford Network Analysis Project repository (SNAP). Our methodology is inspired by classic work in evolutionary psychology by Dunbar that posits upper bounds on the total strength of the set of social connections in which a single human can be engaged. We model edge strength with Easley and Kleinberg's topological estimate; label nodes as “violators” if the sum of these edge strengths exceeds a Dunbar-inspired bound; and then remove the violator-to-violator edges. We run our algorithm on multiple social networks and show that our Dunbar-inspired bound appears to hold for social networks, but not for nonsocial networks. Our cleaning process classifies 0.04% of the nodes of the Twitter-2010 followers graph as violators, and we find that more than 80% of these violator nodes have Botometer scores of 0.5 or greater. Furthermore, after we remove the roughly 15 million violator-violator edges from the 1.2-billion-edge Twitter-2010 follower graph, 34% of the violator nodes experience a factor-of-two decrease in PageRank. PageRank is a key component of many graph algorithms such as node/edge ranking and graph sparsification. Thus, this artificial inflation would bias algorithmic output, and result in some incorrect decisions based on this output.
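
A simplified Python sketch of the cleaning procedure follows; it is illustrative only. Edge strength is modeled here by Easley and Kleinberg's neighborhood-overlap estimate, and the Dunbar-inspired strength bound is left as a placeholder parameter rather than the value derived in the paper.

```python
def clean_social_graph(adj, strength_bound):
    """Label violators and drop violator-to-violator edges.

    adj: dict mapping each node to a set of neighbors (undirected graph).
    The strength of edge (u, v) is the neighborhood overlap
        |N(u) & N(v)| / |(N(u) | N(v)) - {u, v}|.
    A node whose edge strengths sum to more than the Dunbar-inspired
    bound is labeled a violator.
    """
    def overlap(u, v):
        shared = adj[u] & adj[v]
        union = (adj[u] | adj[v]) - {u, v}
        return len(shared) / len(union) if union else 0.0

    strength = {u: sum(overlap(u, v) for v in adj[u]) for u in adj}
    violators = {u for u, s in strength.items() if s > strength_bound}

    # Only violator-to-violator edges are removed, so ordinary followers
    # of a flagged account keep their connections.
    cleaned = {u: {v for v in adj[u]
                   if not (u in violators and v in violators)}
               for u in adj}
    return violators, cleaned

# Tiny example: in a 4-clique every node exceeds this toy bound,
# so every node is a violator and all edges are dropped.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(clean_social_graph(adj, strength_bound=1.0))
```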

Scalable generation of graphs for benchmarking HPC community-detection algorithms

International Conference for High Performance Computing, Networking, Storage and Analysis, SC

Slota, George M.; Berry, Jonathan W.; Hammond, Simon D.; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran R.

Community detection in graphs is a canonical social network analysis method. We consider the problem of generating suites of terascale synthetic social networks to compare the solution quality of parallel community-detection methods. The standard method, based on the graph generator of Lancichinetti, Fortunato, and Radicchi (LFR), has been used extensively for modest-scale graphs, but has inherent scalability limitations. We provide an alternative, based on the scalable Block Two-Level Erdős-Rényi (BTER) graph generator, that enables HPC-scale evaluation of solution quality in the style of LFR. Our approach varies community coherence and retains other important properties. Our methods can scale real-world networks, e.g., to create a version of the Friendster network that is 512 times larger. With BTER's inherent scalability, we can generate a 15-terabyte graph (4.6B vertices, 925B edges) in just over one minute. We demonstrate our capability by showing that a label-propagation community-detection algorithm can be strong-scaled with negligible solution-quality loss.
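
For context, label propagation, the community-detection algorithm whose scaling is evaluated, is simple to state. The sequential toy Python version below only fixes the algorithm; the paper's runs use distributed parallel implementations.

```python
import random

def label_propagation(adj, max_iters=20, seed=0):
    """Toy sequential label propagation.

    adj: dict mapping each vertex to a set of neighbors.
    Every vertex starts in its own community and repeatedly adopts the
    most frequent label among its neighbors; the final label classes
    are the detected communities.
    """
    rng = random.Random(seed)
    labels = {v: v for v in adj}
    for _ in range(max_iters):
        changed = False
        order = list(adj)
        rng.shuffle(order)          # randomized update order each sweep
        for v in order:
            if not adj[v]:
                continue
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best = max(counts, key=counts.get)
            if best != labels[v]:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels

# Two triangles joined by a single bridge edge.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
print(label_propagation(adj))
```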

Multi-Level Memory Algorithmics for Large, Sparse Problems

Berry, Jonathan W.; Butcher, Neil; Catalyurek, Umit; Kogge, Peter; Lin, Paul; Olivier, Stephen L.; Phillips, Cynthia A.; Rajamanickam, Sivasankaran R.; Slota, George M.; Voskuilen, Gwendolyn R.; Yasar, Abdurrahman; Young, Jeffrey G.

In this report, we abstract eleven papers published during the project and describe preliminary unpublished results that warrant follow-up work. The topic is multi-level memory algorithmics, or how to effectively use multiple layers of main memory. Modern compute nodes all have this feature in some form.

Information-Theoretically Secure Distributed Machine Learning

Shead, Timothy M.; Berry, Jonathan W.; Phillips, Cynthia A.; Saia, Jared

A previously obscure area of cryptography known as Secure Multiparty Computation (MPC) is enjoying increased attention in the field of privacy-preserving machine learning (ML), because ML models implemented using MPC can be uniquely resistant to capture or reverse engineering by an adversary. In particular, an adversary who captures a share of a distributed MPC model provably cannot recover the model itself, nor data evaluated by the model, even by observing the model in operation. We report on our small project to survey current MPC software and judge its practicality for fielding mission-relevant distributed machine learning models.

Linear algebra-based triangle counting via fine-grained tasking on heterogeneous environments (Update on Static Graph Challenge)

2019 IEEE High Performance Extreme Computing Conference, HPEC 2019

Yasar, Abdurrahman Y.; Rajamanickam, Sivasankaran R.; Berry, Jonathan W.; Acer, Seher A.; Wolf, Michael W.; Young, Jeffrey G.; Catalyurek, Umit V.

Triangle counting is a representative graph problem that shows the challenges of improving graph algorithm performance using algorithmic techniques and adapting graph algorithms to new architectures. In this paper, we describe an update to the linear-algebraic formulation of the triangle counting problem. Our new approach relies on fine-grained tasking based on a tile layout. We adapt this task-based algorithm to heterogeneous architectures (CPUs and GPUs) for up to a 10.8x speedup over the previous year's Graph Challenge submission. This implementation also results in the fastest kernel times known at time of publication for real-world graphs like Twitter (3.7 seconds) and Friendster (1.8 seconds) on GPU accelerators when the graph is GPU resident. This is a 1.7x and 1.2x improvement, respectively, over the previous state-of-the-art triangle counting on GPUs. We also improved end-to-end execution time by overlapping computation and communication of the graph to the GPUs. Our implementation achieves the fastest end-to-end execution times due to these very low overhead costs.
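
To make the linear-algebraic formulation concrete: with L the strictly lower-triangular part of the adjacency matrix, the triangle count equals the sum of the entries of L·L masked elementwise by L. The dense NumPy sketch below only fixes that formula; the paper's kernel evaluates it with sparse, tiled, fine-grained tasks.

```python
import numpy as np

def triangles_masked(A):
    """Triangle count as sum(L * (L @ L)), where L is the strictly
    lower-triangular part of the symmetric 0/1 adjacency matrix A and
    '*' is the elementwise (Hadamard) mask. Each triangle with vertices
    i > k > j is counted exactly once."""
    L = np.tril(A, k=-1)
    return int(np.sum(L * (L @ L)))

# One triangle {0, 1, 2} plus a pendant vertex 3.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])
print(triangles_masked(A))  # -> 1
```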
