Publications

Results 1–25 of 52

NgramPPM: Compression Analytics without Compression

Bauer, Travis L.

Arithmetic Coding (AC) using Prediction by Partial Matching (PPM) is a compression algorithm that can also be used as a machine learning algorithm. This paper describes a new algorithm, NGram PPM, which retains the predictive power of AC/PPM at a fraction of the computational cost. Unlike other compression-based analytics, it is also amenable to a vector space interpretation, enabling integration with traditional machine learning algorithms. AC/PPM is reviewed, including its application to machine learning; NGram PPM is then described, and test results comparing it to AC/PPM are presented.
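
The general idea of compression as classification can be sketched with an off-the-shelf compressor. The example below uses zlib's DEFLATE purely as a stand-in (the paper's AC/PPM and NGram PPM implementations are not shown, and the corpora and labels are invented for illustration): a document is assigned to the class whose training corpus yields the smallest compression overhead.

```python
import zlib

def compressed_size(text):
    """Length of text after DEFLATE compression."""
    return len(zlib.compress(text.encode("utf-8")))

def classify(doc, class_corpora):
    """Assign doc to the class whose corpus it compresses best against.

    Score = C(corpus + doc) - C(corpus): the extra bytes needed to
    encode doc given the corpus as context. Smaller means a better fit.
    """
    scores = {
        label: compressed_size(corpus + doc) - compressed_size(corpus)
        for label, corpus in class_corpora.items()
    }
    return min(scores, key=scores.get)

corpora = {
    "weather": "rain snow sunny cloudy storm wind forecast temperature " * 20,
    "sports": "goal team score match player coach league season win " * 20,
}
print(classify("the forecast calls for rain and storm winds", corpora))
```

The trick is that a dictionary-based compressor reuses substrings it has already seen, so a document that shares vocabulary with a corpus adds very little to the compressed size.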

Rapid Response Data Science for COVID-19

Bandlow, Alisa B.; Bauer, Travis L.; Crossno, Patricia J.; Garcia, Rudy J.; Astuto Gribble, Lisa A.; Hernandez, Patricia M.; Martin, Shawn; McClain, Jonathan T.; Patrizi, Laura P.

This report describes the results of a seven-day effort to help subject matter experts address a problem related to COVID-19. In the course of this effort, we analyzed the 29K documents provided as part of the White House's call to action, applying a variety of natural language processing techniques and compression-based analytics, in combination with visualization techniques and assessment with subject matter experts, to pursue answers to a specific question. In this paper, we describe the algorithms, the software, the study performed, and the availability of the software developed during the effort.

Compression Analytics for Classification and Anomaly Detection Within Network Communication

IEEE Transactions on Information Forensics and Security

Ting, Christina T.; Field, Richard V.; Fisher, Andrew N.; Bauer, Travis L.

The flexibility of network communication within Internet protocols is fundamental to network function, yet this same flexibility permits the possibility of malicious use. In particular, malicious behavior can masquerade as benign traffic, thus evading systems designed to catch misuse of network resources. However, perfect imitation of benign traffic is difficult, meaning that small unintentional deviations from normal can occur. Identifying these deviations requires that the defenders know what features reveal malicious behavior. Herein, we present an application of compression-based analytics to network communication that can reduce the need for defenders to know a priori what features they need to examine. Motivating the approach is the idea that compression relies on the ability to discover and make use of predictable elements in information, thereby highlighting any deviations between expected and received content. We introduce a so-called 'slice compression' score to identify malicious or anomalous communication in two ways. First, we apply normalized compression distances to classification problems and discuss methods for reducing the noise by excising application content (as opposed to protocol features) using slice compression. Second, we present a new technique for anomaly detection, referred to as slice compression for anomaly detection. A diverse collection of datasets are analyzed to illustrate the efficacy of the proposed approaches. While our focus is network communication, other types of data are also considered to illustrate the generality of the method.
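
The normalized compression distance mentioned above has a standard form, NCD(x, y) = (C(xy) − min(C(x), C(y))) / max(C(x), C(y)), where C(·) is compressed length. A minimal sketch with zlib as the compressor follows (the paper's slice-compression scoring is not reproduced, and the example traffic strings are invented):

```python
import zlib

def C(x):
    """Compressed length of a byte string."""
    return len(zlib.compress(x, 9))

def ncd(x, y):
    """Normalized compression distance: near 0 for closely related
    inputs, near 1 for unrelated inputs."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

a = b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n" * 10
b2 = b"GET /index.html HTTP/1.1\r\nHost: example.org\r\n" * 10
c = bytes(range(256)) * 4  # unrelated, hard-to-compress bytes

print(ncd(a, b2))  # small: near-identical traffic
print(ncd(a, c))   # large: unrelated content
```

Because the score is normalized and parameter-free apart from the choice of compressor, it needs no hand-picked features, which is the property the paper exploits.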

Generalized Boundary Detection Using Compression-based Analytics

ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

Ting, Christina T.; Field, Richard V.; Quach, Tu-Thach Q.; Bauer, Travis L.

We present a new method for boundary detection within sequential data using compression-based analytics. Our approach is to approximate the information distance between two adjacent sliding windows within the sequence. Large values in the distance metric are indicative of boundary locations. A new algorithm is developed, referred to as sliding information distance (SLID), that provides a fast, accurate, and robust approximation to the normalized information distance. A modified smoothed z-score algorithm is used to locate peaks in the distance metric, indicating boundary locations. A variety of data sources are considered, including text and audio, to demonstrate the efficacy of our approach.
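
The windowed scheme can be sketched as follows. Note this is an illustration only: it uses a zlib-based NCD in place of SLID's faster approximation, a simple argmax in place of the paper's smoothed z-score peak detector, and an invented two-regime byte string as test data.

```python
import zlib

def C(x):
    """Compressed length of a byte string."""
    return len(zlib.compress(x, 9))

def ncd(x, y):
    """Normalized compression distance between two byte strings."""
    cx, cy = C(x), C(y)
    return (C(x + y) - min(cx, cy)) / max(cx, cy)

def boundary_scores(data, window=64, step=16):
    """Score each offset by the NCD between the windows on either side.

    A large score means the adjacent windows compress poorly together,
    i.e. the underlying source has likely changed: a candidate boundary.
    """
    scores = []
    for i in range(0, len(data) - 2 * window + 1, step):
        left = data[i : i + window]
        right = data[i + window : i + 2 * window]
        scores.append((i + window, ncd(left, right)))
    return scores

data = b"aaaa " * 40 + b"zqxw " * 40  # two regimes; boundary at offset 200
pos, score = max(boundary_scores(data), key=lambda t: t[1])
print(pos)  # peaks near the regime change
```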

Temporal anomaly detection in social media

Proceedings of the 2017 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, ASONAM 2017

Skryzalin, Jacek S.; Field, Richard V.; Fisher, Andrew N.; Bauer, Travis L.

In this work, we approach topic tracking and meme trending in social media with a temporal focus; rather than analyzing topics, we aim to identify time periods whose content differs significantly from normal. We detail two approaches. The first is an information-theoretic analysis of the distributions of terms emitted during each time period. In the second, we cluster the documents from each time period and analyze the tightness of each clustering. We also discuss a method of combining the scores created by each technique, and we provide ample empirical analysis of our methodology on various Twitter datasets.
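
The first, information-theoretic approach can be sketched by comparing each time period's term distribution against the pooled distribution of the remaining periods. Jensen-Shannon divergence is used here as an illustrative choice of distance, and the toy periods are invented; the paper's exact measure may differ.

```python
from collections import Counter
from math import log2

def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two term-count maps."""
    tp, tq = sum(p.values()), sum(q.values())
    d = 0.0
    for w in set(p) | set(q):
        pw, qw = p[w] / tp, q[w] / tq
        mw = (pw + qw) / 2
        if pw:
            d += 0.5 * pw * log2(pw / mw)
        if qw:
            d += 0.5 * qw * log2(qw / mw)
    return d

periods = [
    "weather traffic coffee news weather traffic",
    "weather traffic coffee news weather coffee",
    "outage breach outage breach outage incident",  # anomalous period
]
counts = [Counter(t.split()) for t in periods]
pooled = sum(counts, Counter())
# Score each period against the pooled counts of the *other* periods.
scores = [jsd(c, pooled - c) for c in counts]
print(scores.index(max(scores)))  # → 2
```

A period whose vocabulary diverges sharply from the rest of the stream scores near the maximum of 1 bit, flagging it as anomalous.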

Footprint of Sandia's August 15 2016 Informal Idea Exploration Session on "Towards an Engineering and Applied Science of Research"

Tsao, Jeffrey Y.; Fleming Lindsley, Elizabeth S.; Heffelfinger, Grant S.; Narayanamurti, Venkatesh N.; Schneider, Rick S.; Starkweather, Lynne M.; Ting, Christina T.; Yajima, Rieko Y.; Bauer, Travis L.; Coltrin, Michael E.; Guy, Donald W.; Jones, Wendell J.; Mareda, John F.; Nenoff, T.M.; Turnley, Jessica G.

On August 15, 2016, Sandia hosted a visit by Professor Venkatesh Narayanamurti (Benjamin Peirce Research Professor of Technology and Public Policy at Harvard, Board Member of the Belfer Center for Science and International Affairs, former Dean of the School of Engineering and Applied Science at Harvard, former Dean of Engineering at UC Santa Barbara, and former Vice President of Division 1000 at Sandia). During the visit, a small, informal, all-day idea exploration session on "Towards an Engineering and Applied Science of Research" was conducted. This document is a brief synopsis or "footprint" of the presentations and discussions at this Idea Exploration Session. Its intent is to stimulate further discussion about pathways Sandia can take to improve its research practices.

Accessibility, adaptability, and extendibility: Dealing with the small data problem

Advances in Intelligent Systems and Computing

Bauer, Travis L.; Garcia, Daniel G.

An underserved niche exists for data mining tools in complex analytical environments. We propose three attributes of analytical tool development that facilitate rapid operationalization of new tools into complex, dynamic environments: accessibility, adaptability, and extendibility. Accessibility we define as the ability to load data into an analytical system quickly and seamlessly. Adaptability we define as the ability to apply a tool rapidly to new, unanticipated use cases. Extendibility we define as the ability to create new functionality "in the field" where it is being used and, if needed, harden that new functionality into a new, more permanent user interface. Distributed "big data" systems generally do not optimize for these attributes, creating an underserved niche for new analytical tools. In this paper we define the problem, examine the three attributes, and describe the architecture of an example system called Citrus that we have built and use, and that is especially focused on these three attributes.

Information Theoretic Measures for Visual Analytics: The Silver Ticket?: A Summary of a 2016 Exploratory Express LDRD Idea and Research Activity

McNamara, Laura A.; Bauer, Travis L.; Haass, Michael J.; Matzen, Laura E.

In the context of text-based analysis workflows, we propose that an effective analytic tool facilitates triage by a) enabling users to identify and set aside irrelevant content (i.e., reducing the complexity of information in a dataset) and b) helping them develop a working mental model of which items are most relevant to the question at hand. This LDRD-funded research developed a dataset that enables the team to evaluate normalized compression distance (NCD) as a task-, user-, and context-insensitive measure of categorization outcomes (Shannon entropy is reduced as order is imposed). Effective analytic tools help people impose order, reducing complexity in measurable ways. Our concept and research were documented in a paper accepted to the ACM conference Beyond Time and Error: Novel Methods in Information Visualization Evaluation, part of the IEEE VisWeek Conference, Baltimore, MD, October 16-21, 2016. The paper is included as an appendix to this report.

Inferring Organizational Structure from Behavior

Bauer, Travis L.; Brounstein, Tom R.

In this project, we researched new techniques for detecting hidden networks of individuals and institutions by introducing the use of temporal correlations among behaviors, leveraging both information sources and metadata. We validated the algorithms using the Wikipedia edit history. The rapid increase in crowd-sourced applications like Wikipedia is providing a rich set of data with both a record of behaviors and a set of direct interactions among individuals. Data sets with network ground truth are needed to develop and validate models, before applying them to national security settings where content and meta-data alone are available.
