Statistical Properties of Compression Analytics
Arithmetic Coding (AC) using Prediction by Partial Matching (PPM) is a compression algorithm that can be used as a machine learning algorithm. This paper describes a new algorithm, NGram PPM, which has all the predictive power of AC/PPM at a fraction of the computational cost. Unlike compression-based analytics, it is also amenable to a vector space interpretation, enabling integration with traditional machine learning algorithms. AC/PPM is reviewed, including its application to machine learning. NGram PPM is then described, and test results comparing it to AC/PPM are presented.
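To make the compression-to-learning connection concrete, the sketch below is a minimal illustration, not the paper's NGram PPM algorithm: a character n-gram model with add-one smoothing estimates the code length, in bits, that a PPM-style compressor trained on one class would assign a document, and classification picks the label whose model compresses the document best. The NGramScorer name, the order parameter, and the 256-symbol alphabet are illustrative assumptions.

    import math
    from collections import defaultdict

    class NGramScorer:
        """Character n-gram model; code_length approximates how many
        bits a PPM-style compressor trained on one class would need."""
        def __init__(self, order=3):
            self.order = order
            self.counts = defaultdict(lambda: defaultdict(int))

        def train(self, text):
            pad = " " * self.order + text
            for i in range(self.order, len(pad)):
                self.counts[pad[i - self.order:i]][pad[i]] += 1

        def code_length(self, text):
            pad = " " * self.order + text
            bits = 0.0
            for i in range(self.order, len(pad)):
                ctx, ch = pad[i - self.order:i], pad[i]
                total = sum(self.counts[ctx].values())
                # Add-one smoothing over an assumed 256-symbol alphabet.
                bits -= math.log2((self.counts[ctx][ch] + 1) / (total + 256))
            return bits

    def classify(doc, models):
        """Assign doc the label whose model yields the shortest code."""
        return min(models, key=lambda label: models[label].code_length(doc))

Because the per-context counts double as a sparse feature vector indexed by n-gram, a model of this form also hints at the vector space interpretation described above.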
This report summarizes the goals and findings of eight research projects conducted under the Computing and Information Sciences (CIS) Research Foundation and related to the COVID-19 pandemic. The projects were all formulated in response to Sandia's call for proposals for rapid-response research with the potential to have a positive impact on the global health emergency. Six of the projects in the CIS portfolio focused on modeling various facets of disease spread, resource requirements, testing programs, and economic impact. The two remaining projects examined the use of web-crawlers and text analytics to allow rapid identification of articles relevant to specific technical questions, and categorization of the reliability of content. The portfolio has collectively produced methods and findings that are being applied by a range of state, regional, and national entities to support enhanced understanding and prediction of the pandemic's spread and its impacts.
Sandia National Laboratories currently has 27 COVID-related Laboratory Directed Research & Development (LDRD) projects focused on helping the nation during the pandemic. These LDRD projects cross many disciplines including bioscience, computing & information sciences, engineering science, materials science, nanodevices & microsystems, and radiation effects & high energy density science.
This report describes the results of a seven-day effort to assist subject matter experts in addressing a problem related to COVID-19. In the course of this effort, we analyzed the 29K documents provided as part of the White House's call to action. This involved applying a variety of natural language processing techniques and compression-based analytics, in combination with visualization techniques and assessment by subject matter experts, to pursue answers to a specific question. In this paper, we describe the algorithms, the software, the study performed, and the availability of the software developed during the effort.
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
We present a new method for boundary detection within sequential data using compression-based analytics. Our approach is to approximate the information distance between two adjacent sliding windows within the sequence. Large values in the distance metric are indicative of boundary locations. A new algorithm is developed, referred to as sliding information distance (SLID), that provides a fast, accurate, and robust approximation to the normalized information distance. A modified smoothed z-score algorithm is used to locate peaks in the distance metric, indicating boundary locations. A variety of data sources are considered, including text and audio, to demonstrate the efficacy of our approach.
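The sketch below illustrates the general approach using zlib as the compressor and a plain z-score threshold; SLID's distance approximation and the paper's modified smoothed z-score algorithm differ, and the window, step, and threshold values here are arbitrary assumptions.

    import statistics
    import zlib

    def clen(b):
        # Compressed length is a computable stand-in for Kolmogorov
        # complexity in information distance calculations.
        return len(zlib.compress(b, 9))

    def sliding_distance(seq, window=512, step=64):
        """Approximate information distance between adjacent windows;
        large values suggest a boundary between them."""
        scores = []
        for i in range(0, len(seq) - 2 * window + 1, step):
            x = seq[i:i + window]
            y = seq[i + window:i + 2 * window]
            cx, cy, cxy = clen(x), clen(y), clen(x + y)
            scores.append((i + window, (cxy - min(cx, cy)) / max(cx, cy)))
        return scores

    def boundaries(scores, k=3.0):
        """Flag positions more than k standard deviations above the
        mean distance (a crude stand-in for smoothed z-scores)."""
        vals = [s for _, s in scores]
        mu, sd = statistics.mean(vals), statistics.pstdev(vals)
        return [pos for pos, s in scores if sd > 0 and (s - mu) / sd > k]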
IEEE Transactions on Information Forensics and Security
The flexibility of network communication within Internet protocols is fundamental to network function, yet this same flexibility permits the possibility of malicious use. In particular, malicious behavior can masquerade as benign traffic, thus evading systems designed to catch misuse of network resources. However, perfect imitation of benign traffic is difficult, meaning that small unintentional deviations from normal can occur. Identifying these deviations requires that the defenders know what features reveal malicious behavior. Herein, we present an application of compression-based analytics to network communication that can reduce the need for defenders to know a priori what features they need to examine. Motivating the approach is the idea that compression relies on the ability to discover and make use of predictable elements in information, thereby highlighting any deviations between expected and received content. We introduce a so-called 'slice compression' score to identify malicious or anomalous communication in two ways. First, we apply normalized compression distances to classification problems and discuss methods for reducing the noise by excising application content (as opposed to protocol features) using slice compression. Second, we present a new technique for anomaly detection, referred to as slice compression for anomaly detection. A diverse collection of datasets is analyzed to illustrate the efficacy of the proposed approaches. While our focus is network communication, other types of data are also considered to illustrate the generality of the method.
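For reference, the normalized compression distance used in the first approach has the standard form NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) denotes compressed length. A minimal implementation follows, with zlib standing in for whatever compressor an analysis actually uses; the slice-compression excision of application content described above is specific to the paper and not reproduced here.

    import zlib

    def clen(b):
        return len(zlib.compress(b, 9))

    def ncd(x, y):
        """Near 0 for closely related byte strings, near 1 for
        unrelated ones."""
        cx, cy = clen(x), clen(y)
        return (clen(x + y) - min(cx, cy)) / max(cx, cy)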
On August 15, 2016, Sandia hosted a visit by Professor Venkatesh Narayanamurti (Benjamin Peirce Research Professor of Technology and Public Policy at Harvard, Board Member of the Belfer Center for Science and International Affairs, former Dean of the School of Engineering and Applied Science at Harvard, former Dean of Engineering at UC Santa Barbara, and former Vice President of Division 1000 at Sandia). During the visit, a small, informal, all-day idea exploration session on "Towards an Engineering and Applied Science of Research" was conducted. This document is a brief synopsis or "footprint" of the presentations and discussions at this Idea Exploration Session. Its intent is to stimulate further discussion about pathways Sandia can take to improve its research practices.
Advances in Intelligent Systems and Computing
An underserved niche exists for data mining tools in complex analytical environments. We propose three attributes of analytical tool development that facilitate rapid operationalization of new tools in complex, dynamic environments: accessibility, adaptability, and extendibility. We define accessibility as the ability to load data into an analytical system quickly and seamlessly; adaptability as the ability to apply a tool rapidly to new, unanticipated use cases; and extendibility as the ability to create new functionality “in the field” where the tool is being used and, if needed, harden that new functionality into a new, more permanent user interface. Distributed “big data” systems generally do not optimize for these attributes, creating an underserved niche for new analytical tools. In this paper we define the problem, examine the three attributes, and describe the architecture of an example system called Citrus that we have built, use, and have focused especially on these three attributes.
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
In this work we extend compression-based algorithms for deception detection in text. In contrast to approaches that rely on theories for deception to identify feature sets, compression automatically identifies the most significant features. We consider two datasets that allow us to explore deception in opinion (content) and deception in identity (stylometry). Our first approach is to use unsupervised clustering based on a normalized compression distance (NCD) between documents. Our second approach is to use Prediction by Partial Matching (PPM) to train a classifier with conditional probabilities from labeled documents, followed by arithmetic coding (AC) to classify an unknown document based on which label gives the best compression. We find a significant dependence of the classifier on the relative volume of training data used to build the conditional probability distributions of the different labels. We demonstrate methods to overcome this data-size dependence when analytics, not information transfer, is the goal. Our results indicate that deceptive text contains structure statistically distinct from truthful text, and that this structure can be automatically detected using compression-based algorithms.
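A crude off-the-shelf analogue of this classifier, not the paper's PPM/AC implementation, assigns a document to the label whose training corpus most reduces the document's incremental compressed size; truncating the corpora to a common length, as sketched below, is one simple way to blunt the data-size dependence noted above. The zlib compressor and the trim limit are illustrative assumptions.

    import zlib

    def clen(b):
        return len(zlib.compress(b, 9))

    def classify(doc, corpora, trim=100_000):
        """Pick the label whose (equal-length) training corpus best
        'explains' doc, i.e. minimizes the extra compressed bytes
        needed to encode doc after the corpus."""
        n = min(trim, *(len(c) for c in corpora.values()))
        def extra(corpus):
            head = corpus[:n]
            return clen(head + doc) - clen(head)
        return min(corpora, key=lambda label: extra(corpora[label]))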
ACM International Conference Proceeding Series
In this paper, we argue that information theoretic measures may provide a robust, broadly applicable, repeatable metric to assess how a system enables people to reduce high-dimensional data into topically relevant subsets of information. Explosive growth in electronic data necessitates the development of systems that balance automation with human cognitive engagement to facilitate pattern discovery, analysis and characterization, variously described as "cognitive augmentation" or "insight generation." However, operationalizing the concept of insight in any measurable way remains a difficult challenge for visualization researchers. The "golden ticket" of insight evaluation would be a precise, generalizable, repeatable, and ecologically valid metric that indicates the relative utility of a system in heightening cognitive performance or facilitating insights. Unfortunately, the golden ticket does not yet exist. In its place, we are exploring information theoretic measures derived from Shannon's ideas about information and entropy as a starting point for precise, repeatable, and generalizable approaches for evaluating analytic tools. We are specifically concerned with needle-in-haystack workflows that require interactive search, classification, and reduction of very large heterogeneous datasets into manageable, task-relevant subsets of information. We assert that systems aimed at facilitating pattern discovery, characterization and analysis - i.e., "insight" - must afford an efficient means of sorting the needles from the chaff; and simple compressibility measures provide a way of tracking changes in information content as people shape meaning from data.
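As one concrete instance of such a measure, a simple compression ratio can track how information density changes as an analyst reduces a dataset to a task-relevant subset. The sketch below uses zlib and is illustrative rather than a metric the paper proposes; comparing the ratio before and after a reduction step gives a repeatable, if coarse, indicator of how much redundancy the workflow has removed.

    import zlib

    def compressibility(data):
        """Compressed-to-original size ratio: low for highly redundant
        data, near (or above) 1 for high-entropy data."""
        return len(zlib.compress(data, 9)) / len(data) if data else 0.0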