Statistical Properties of Compression Analytics
Arithmetic Coding (AC) using Prediction by Partial Matching (PPM) is a compression algorithm that can be used as a machine learning algorithm. This paper describes a new algorithm, NGram PPM, which has all the predictive power of AC/PPM at a fraction of the computational cost. Unlike compression-based analytics, it is also amenable to a vector space interpretation, enabling integration with traditional machine learning algorithms. AC/PPM is reviewed, including its application to machine learning. NGram PPM is then described, and test results comparing it to AC/PPM are presented.
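To make the compression-to-learning connection concrete, the sketch below is a minimal illustration, not the paper's NGram PPM algorithm: a character n-gram model with add-one smoothing estimates the code length, in bits, that a PPM-style compressor trained on one class would assign a document, and classification picks the label whose model compresses the document best. The NGramScorer name, the order parameter, and the 256-symbol alphabet are illustrative assumptions.

    import math
    from collections import defaultdict

    class NGramScorer:
        """Character n-gram model; code_length approximates how many
        bits a PPM-style compressor trained on one class would need."""
        def __init__(self, order=3):
            self.order = order
            self.counts = defaultdict(lambda: defaultdict(int))

        def train(self, text):
            pad = " " * self.order + text
            for i in range(self.order, len(pad)):
                self.counts[pad[i - self.order:i]][pad[i]] += 1

        def code_length(self, text):
            pad = " " * self.order + text
            bits = 0.0
            for i in range(self.order, len(pad)):
                ctx, ch = pad[i - self.order:i], pad[i]
                total = sum(self.counts[ctx].values())
                # Add-one smoothing over an assumed 256-symbol alphabet.
                bits -= math.log2((self.counts[ctx][ch] + 1) / (total + 256))
            return bits

    def classify(doc, models):
        """Assign doc the label whose model yields the shortest code."""
        return min(models, key=lambda label: models[label].code_length(doc))

Because the per-context counts double as a sparse feature vector indexed by n-gram, a model of this form also hints at the vector space interpretation described above.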
This report summarizes the goals and findings of eight research projects conducted under the Computing and Information Sciences (CIS) Research Foundation and related to the COVID-19 pandemic. The projects were all formulated in response to Sandia's call for proposals for rapid-response research with the potential to have a positive impact on the global health emergency. Six of the projects in the CIS portfolio focused on modeling various facets of disease spread, resource requirements, testing programs, and economic impact. The two remaining projects examined the use of web-crawlers and text analytics to allow rapid identification of articles relevant to specific technical questions, and categorization of the reliability of content. The portfolio has collectively produced methods and findings that are being applied by a range of state, regional, and national entities to support enhanced understanding and prediction of the pandemic's spread and its impacts.
Sandia National Laboratories currently has 27 COVID-related Laboratory Directed Research & Development (LDRD) projects focused on helping the nation during the pandemic. These LDRD projects cross many disciplines including bioscience, computing & information sciences, engineering science, materials science, nanodevices & microsystems, and radiation effects & high energy density science.
This report describes the results of a seven-day effort to assist subject matter experts in addressing a problem related to COVID-19. In the course of this effort, we analyzed the 29K documents provided as part of the White House's call to action. This involved applying a variety of natural language processing techniques and compression-based analytics, in combination with visualization techniques and assessment by subject matter experts, to pursue answers to a specific question. In this paper, we describe the algorithms, the software, the study performed, and the availability of the software developed during the effort.
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
We present a new method for boundary detection within sequential data using compression-based analytics. Our approach is to approximate the information distance between two adjacent sliding windows within the sequence. Large values in the distance metric are indicative of boundary locations. A new algorithm is developed, referred to as sliding information distance (SLID), that provides a fast, accurate, and robust approximation to the normalized information distance. A modified smoothed z-score algorithm is used to locate peaks in the distance metric, indicating boundary locations. A variety of data sources are considered, including text and audio, to demonstrate the efficacy of our approach.
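The sketch below illustrates the general approach using zlib as the compressor and a plain z-score threshold; SLID's distance approximation and the paper's modified smoothed z-score algorithm differ, and the window, step, and threshold values here are arbitrary assumptions.

    import statistics
    import zlib

    def clen(b):
        # Compressed length is a computable stand-in for Kolmogorov
        # complexity in information distance calculations.
        return len(zlib.compress(b, 9))

    def sliding_distance(seq, window=512, step=64):
        """Approximate information distance between adjacent windows;
        large values suggest a boundary between them."""
        scores = []
        for i in range(0, len(seq) - 2 * window + 1, step):
            x = seq[i:i + window]
            y = seq[i + window:i + 2 * window]
            cx, cy, cxy = clen(x), clen(y), clen(x + y)
            scores.append((i + window, (cxy - min(cx, cy)) / max(cx, cy)))
        return scores

    def boundaries(scores, k=3.0):
        """Flag positions more than k standard deviations above the
        mean distance (a crude stand-in for smoothed z-scores)."""
        vals = [s for _, s in scores]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals)
        return [pos for pos, s in scores if sd > 0 and (s - mu) / sd > k]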
IEEE Transactions on Information Forensics and Security
The flexibility of network communication within Internet protocols is fundamental to network function, yet this same flexibility permits the possibility of malicious use. In particular, malicious behavior can masquerade as benign traffic, thus evading systems designed to catch misuse of network resources. However, perfect imitation of benign traffic is difficult, meaning that small unintentional deviations from normal can occur. Identifying these deviations requires that the defenders know what features reveal malicious behavior. Herein, we present an application of compression-based analytics to network communication that can reduce the need for defenders to know a priori what features they need to examine. Motivating the approach is the idea that compression relies on the ability to discover and make use of predictable elements in information, thereby highlighting any deviations between expected and received content. We introduce a so-called 'slice compression' score to identify malicious or anomalous communication in two ways. First, we apply normalized compression distances to classification problems and discuss methods for reducing the noise by excising application content (as opposed to protocol features) using slice compression. Second, we present a new technique for anomaly detection, referred to as slice compression for anomaly detection. A diverse collection of datasets is analyzed to illustrate the efficacy of the proposed approaches. While our focus is network communication, other types of data are also considered to illustrate the generality of the method.
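For reference, the normalized compression distance used in the first approach has the standard form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) denotes compressed length. A minimal implementation follows, with zlib standing in for whatever compressor an analysis actually uses; the slice-compression excision of application content described above is specific to the paper and not reproduced here.

    import zlib

    def clen(b):
        return len(zlib.compress(b, 9))

    def ncd(x, y):
        """Near 0 for closely related byte strings, near 1 for
        unrelated ones."""
        cx, cy = clen(x), clen(y)
        return (clen(x + y) - min(cx, cy)) / max(cx, cy)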
On August 15, 2016, Sandia hosted a visit by Professor Venkatesh Narayanamurti (Benjamin Peirce Research Professor of Technology and Public Policy at Harvard, Board Member of the Belfer Center for Science and International Affairs, former Dean of the School of Engineering and Applied Science at Harvard, former Dean of Engineering at UC Santa Barbara, and former Vice President of Division 1000 at Sandia). During the visit, a small, informal, all-day idea exploration session on "Towards an Engineering and Applied Science of Research" was conducted. This document is a brief synopsis or "footprint" of the presentations and discussions at this Idea Exploration Session. Its intent is to stimulate further discussion about pathways Sandia can take to improve its research practices.
Advances in Intelligent Systems and Computing
An underserved niche exists for data mining tools in complex analytical environments. We propose three attributes of analytical tool development that facilitate rapid operationalization of new tools in complex, dynamic environments: accessibility, adaptability, and extendibility. We define accessibility as the ability to load data into an analytical system quickly and seamlessly; adaptability as the ability to apply a tool rapidly to new, unanticipated use cases; and extendibility as the ability to create new functionality “in the field” where the tool is being used and, if needed, harden that new functionality into a new, more permanent user interface. Distributed “big data” systems generally do not optimize for these attributes, creating an underserved niche for new analytical tools. In this paper we define the problem, examine the three attributes, and describe the architecture of an example system called Citrus that we have built, use, and have focused especially on these three attributes.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. We demonstrate methods to overcome this data-size dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.
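A crude off-the-shelf analogue of this classifier, not the paper's PPM/AC implementation, assigns a document to the label whose training corpus most reduces the document's incremental compressed size; truncating the corpora to a common length, as sketched below, is one simple way to blunt the data-size dependence noted above. The zlib compressor and the trim limit are illustrative assumptions.

    import zlib

    def clen(b):
        return len(zlib.compress(b, 9))

    def classify(doc, corpora, trim=100_000):
        """Pick the label whose (equal-length) training corpus best
        'explains' doc, i.e. minimizes the extra compressed bytes
        needed to encode doc after the corpus."""
        n = min(trim, *(len(c) for c in corpora.values()))
        def extra(corpus):
            head = corpus[:n]
            return clen(head + doc) - clen(head)
        return min(corpora, key=lambda label: extra(corpora[label]))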
ACM International Conference Proceeding Series
In this paper, we argue that information theoretic measures may provide a robust, broadly applicable, repeatable metric to assess how a system enables people to reduce high-dimensional data into topically relevant subsets of information. Explosive growth in electronic data necessitates the development of systems that balance automation with human cognitive engagement to facilitate pattern discovery, analysis and characterization, variously described as "cognitive augmentation" or "insight generation." However, operationalizing the concept of insight in any measurable way remains a difficult challenge for visualization researchers. The "golden ticket" of insight evaluation would be a precise, generalizable, repeatable, and ecologically valid metric that indicates the relative utility of a system in heightening cognitive performance or facilitating insights. Unfortunately, the golden ticket does not yet exist. In its place, we are exploring information theoretic measures derived from Shannon's ideas about information and entropy as a starting point for precise, repeatable, and generalizable approaches for evaluating analytic tools. We are specifically concerned with needle-in-haystack workflows that require interactive search, classification, and reduction of very large heterogeneous datasets into manageable, task-relevant subsets of information. We assert that systems aimed at facilitating pattern discovery, characterization and analysis - i.e., "insight" - must afford an efficient means of sorting the needles from the chaff; and simple compressibility measures provide a way of tracking changes in information content as people shape meaning from data.
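As one concrete instance of such a measure, a simple compression ratio can track how information density changes as an analyst reduces a dataset to a task-relevant subset. The sketch below uses zlib and is illustrative rather than a metric the paper proposes; comparing the ratio before and after a reduction step gives a repeatable, if coarse, indicator of how much redundancy the workflow has removed.

    import zlib

    def compressibility(data):
        """Compressed-to-original size ratio: low for highly redundant
        data, near (or above) 1 for high-entropy data."""
        return len(zlib.compress(data, 9)) / len(data) if data else 0.0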