Scalable Generation of Graphs for Benchmarking HPC Community-Detection Algorithms
We seek scalable benchmarks for entity resolution problems. Solutions to these problems range from trivial approaches such as string sorting to sophisticated methods such as statistical relational learning. The theoretical and practical complexity of these approaches varies widely, so one of the primary purposes of a benchmark will be to quantify the trade-off between solution quality and runtime. We are motivated by the ubiquitous nature of entity resolution as a fundamental problem faced by any organization that ingests large amounts of noisy text data.

A benchmark is typically a rigid specification that provides an objective measure usable for ranking implementations of an algorithm. For example, the Top500 and HPCG benchmarks rank supercomputers based on their performance on dense and sparse linear algebra problems, respectively. These two benchmarks require participants to report the FLOPS attainable on various machines. Our purpose is slightly different: whereas the supercomputing benchmarks mentioned above hold algorithms constant and aim to rank machines, we are primarily interested in ranking algorithms. As mentioned above, entity resolution problems can be approached in completely different ways. We believe that users of our benchmarks must decide what sort of procedure to run before comparing implementations and architectures. Eventually, we also wish to provide a mechanism for ranking machines while holding the algorithmic approach constant.

Our primary contributions are parallel algorithms for computing solution quality measures per entity. We find in some real datasets that many entities are quite easy to resolve while others are difficult, with a heavy skew toward the former case. Therefore, measures such as global confusion matrices, F measures, etc., do not meet our benchmarking needs. We design methods for computing solution quality at the granularity of a single entity in order to know when proposed solutions do well in difficult situations (perhaps justifying extra computational effort) or struggle in easy situations.

We report on progress toward a viable benchmark for comparing entity resolution algorithms. Our work is incomplete, but we have designed and prototyped several algorithms to help evaluate the solution quality of competing approaches to these problems. We envision a benchmark in which the objective measure is a ratio of solution quality to runtime.
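As an illustration of what a per-entity quality measure can look like, the sketch below computes pairwise precision, recall, and F1 restricted to the records of each ground-truth entity. This is a hypothetical serial Python rendering, not the parallel algorithms described above; the function name `per_entity_f1` and the input format (dicts mapping record id to cluster id, with every record present in both mappings) are assumptions made for the example.

```python
from collections import defaultdict

def per_entity_f1(truth, pred):
    """Per-entity pairwise precision/recall/F1 (illustrative sketch).

    truth, pred: dicts mapping record id -> cluster id. Assumes every
    record in `truth` also appears in `pred`.
    Returns a dict mapping each true entity id -> (precision, recall, f1).
    """
    true_clusters = defaultdict(set)
    pred_clusters = defaultdict(set)
    for r, c in truth.items():
        true_clusters[c].add(r)
    for r, c in pred.items():
        pred_clusters[c].add(r)

    scores = {}
    for entity, members in true_clusters.items():
        # Ground-truth pairs within this entity (the "relevant" pairs).
        true_pairs = {(a, b) for a in members for b in members if a < b}
        # Pairs the proposed solution links to this entity's records.
        pred_pairs = set()
        for a in members:
            cluster = pred_clusters[pred[a]]
            pred_pairs |= {(a, b) if a < b else (b, a)
                           for b in cluster if b != a}
        tp = len(true_pairs & pred_pairs)
        precision = tp / len(pred_pairs) if pred_pairs else 1.0
        recall = tp / len(true_pairs) if true_pairs else 1.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[entity] = (precision, recall, f1)
    return scores

# Example: an over-merged prediction scores well on recall per entity
# but poorly on precision for the singleton entity e2.
truth = {"r1": "e1", "r2": "e1", "r3": "e2"}
pred = {"r1": "c1", "r2": "c1", "r3": "c1"}
print(per_entity_f1(truth, pred))
```

Because scores are computed independently per entity, a parallel version can partition entities across threads or ranks; the skewed easy/hard distribution noted above is exactly what such per-entity scores expose.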
This report summarizes the work performed under the project "Advanced Data Structures for Improved Cyber Resilience and Awareness in Untrusted Environments." The goal of the project was to design, analyze, and test new data structures for cybersecurity applications. We had two major thrusts: 1) using new or improved write-optimized data structures and algorithms to better manage and analyze high-speed, massive-volume cyberstreams, and 2) adding security features to data structures at minimum cost.

Write optimization allows data structures to better use secondary memory to store and search a larger amount of useful information. Secondary memory is large compared to main memory, but data movement to and from secondary memory must be carefully managed to run quickly enough to keep up with fast streams. The first thrust included managing cyberstreams in parallel, both multi-threaded and distributed, and improving the benchmarking infrastructure for testing new streaming data structures. We considered both (near) real-time discovery of particular patterns and improved logging for better forensics.

We considered two kinds of security-feature problems. The first was high-performance history-independent external-memory data structures, which provide certain protections to data if a disk is stolen. We also prove some trade-offs between speed and security in this setting. The second data-security problem is more secure data lookup in secret-shared databases.

This report summarizes the project's major accomplishments, along with the background needed to understand them. It gathers the abstracts and references for the six refereed publications that have appeared as part of this work. We summarize several accomplishments that will be submitted for publication. We then archive one piece of partial work that is not likely to be published in the near future: validation of history-independent data structure implementations.
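To make the write-optimization idea concrete, here is a minimal, hypothetical sketch in the spirit of LSM-trees and B^epsilon-trees: inserts are absorbed into a small in-memory buffer and periodically merged into a sorted on-disk run with one sequential pass, trading a little lookup work for far fewer random writes. The class name, buffer policy, and in-memory "disk" are invented for illustration and are not the project's actual data structures.

```python
import bisect

class BufferedMap:
    """Toy write-optimized key/value map (illustrative only).

    Writes land in a small in-memory buffer; when the buffer fills,
    one sequential merge rebuilds the sorted run. This mimics how
    write-optimized structures amortize secondary-memory traffic.
    """
    def __init__(self, buffer_limit=4):
        self.buffer = {}        # recent writes, cheap to absorb
        self.run = []           # sorted (key, value) pairs standing in for disk
        self.buffer_limit = buffer_limit

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.buffer_limit:
            self._flush()

    def _flush(self):
        # One sequential merge replaces many random writes.
        merged = dict(self.run)
        merged.update(self.buffer)
        self.run = sorted(merged.items())
        self.buffer.clear()

    def lookup(self, key):
        # Check the buffer first, then binary-search the sorted run.
        if key in self.buffer:
            return self.buffer[key]
        i = bisect.bisect_left(self.run, (key,))
        if i < len(self.run) and self.run[i][0] == key:
            return self.run[i][1]
        return None
```

The key trade-off the sketch exhibits is the one discussed above: lookups must consult both the buffer and the run, but insert-heavy streams touch "disk" only during infrequent sequential flushes.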
2017 IEEE High Performance Extreme Computing Conference, HPEC 2017
Triangle counting serves as a key building block for a set of important graph algorithms in network science. In this paper, we address the IEEE HPEC Static Graph Challenge problem of triangle counting, focusing on obtaining the best parallel performance on a single multicore node. Our implementation uses a linear algebra-based approach to triangle counting that has grown out of work related to our miniTri data analytics miniapplication [1] and our efforts to pose graph algorithms in the language of linear algebra. We leverage KokkosKernels to implement this approach efficiently on multicore architectures. Our performance results are competitive with the fastest known graph traversal-based approaches and are significantly faster than the Graph Challenge reference implementations, up to 670,000 times faster than the C++ reference and 10,000 times faster than the Python reference on a single Intel Haswell node.
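For readers unfamiliar with the linear-algebra formulation, the toy sketch below counts triangles directly from the adjacency matrix using the identity triangles = trace(A^3)/6, which holds for simple undirected graphs because each triangle contributes a closed walk starting at each of its three vertices in each of two directions. This dense NumPy version only illustrates the general approach; it is not the paper's KokkosKernels-based sparse implementation.

```python
import numpy as np

def triangle_count(A):
    """Count triangles as trace(A^3) / 6.

    A must be a symmetric 0/1 adjacency matrix with a zero diagonal.
    Dense and O(n^3): for exposition only, not for large graphs.
    """
    A = np.asarray(A)
    return int(np.trace(A @ A @ A) // 6)

# Example: a 4-cycle 0-1-2-3-0 plus the chord 1-2 has two triangles,
# {0,1,2} and {1,2,3}.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]])
print(triangle_count(A))  # -> 2
```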