Publications

Results 51–75 of 105
Skip to search filters

Hybrid methods for cybersecurity analysis :

Davis, Warren L.; Dunlavy, Daniel D.

Early 2010 saw a signi cant change in adversarial techniques aimed at network intrusion: a shift from malware delivered via email attachments toward the use of hidden, embedded hyperlinks to initiate sequences of downloads and interactions with web sites and network servers containing malicious software. Enterprise security groups were well poised and experienced in defending the former attacks, but the new types of attacks were larger in number, more challenging to detect, dynamic in nature, and required the development of new technologies and analytic capabilities. The Hybrid LDRD project was aimed at delivering new capabilities in large-scale data modeling and analysis to enterprise security operators and analysts and understanding the challenges of detection and prevention of emerging cybersecurity threats. Leveraging previous LDRD research e orts and capabilities in large-scale relational data analysis, large-scale discrete data analysis and visualization, and streaming data analysis, new modeling and analysis capabilities were quickly brought to bear on the problems in email phishing and spear phishing attacks in the Sandia enterprise security operational groups at the onset of the Hybrid project. As part of this project, a software development and deployment framework was created within the security analyst work ow tool sets to facilitate the delivery and testing of new capabilities as they became available, and machine learning algorithms were developed to address the challenge of dynamic threats. Furthermore, researchers from the Hybrid project were embedded in the security analyst groups for almost a full year, engaged in daily operational activities and routines, creating an atmosphere of trust and collaboration between the researchers and security personnel. The Hybrid project has altered the way that research ideas can be incorporated into the production environments of Sandias enterprise security groups, reducing time to deployment from months and years to hours and days for the application of new modeling and analysis capabilities to emerging threats. The development and deployment framework has been generalized into the Hybrid Framework and incor- porated into several LDRD, WFO, and DOE/CSL projects and proposals. And most importantly, the Hybrid project has provided Sandia security analysts with new, scalable, extensible analytic capabilities that have resulted in alerts not detectable using their previous work ow tool sets.

More Details

Shifted power method for computing tensor eigenpairs

Kolda, Tamara G.; Dunlavy, Daniel D.

Recent work on eigenvalues and eigenvectors for tensors of order m {>=} 3 has been motivated by applications in blind source separation, magnetic resonance imaging, molecular conformation, and more. In this paper, we consider methods for computing real symmetric-tensor eigenpairs of the form Ax{sup m-1} = {lambda}x subject to {parallel}x{parallel} = 1, which is closely related to optimal rank-1 approximation of a symmetric tensor. Our contribution is a novel shifted symmetric higher-order power method (SS-HOPM), which we showis guaranteed to converge to a tensor eigenpair. SS-HOPM can be viewed as a generalization of the power iteration method for matrices or of the symmetric higher-order power method. Additionally, using fixed point analysis, we can characterize exactly which eigenpairs can and cannot be found by the method. Numerical examples are presented, including examples from an extension of the method to fnding complex eigenpairs.

More Details

ParaText : scalable solutions for processing and searching very large document collections : final LDRD report

Dunlavy, Daniel D.; Crossno, Patricia J.

This report is a summary of the accomplishments of the 'Scalable Solutions for Processing and Searching Very Large Document Collections' LDRD, which ran from FY08 through FY10. Our goal was to investigate scalable text analysis; specifically, methods for information retrieval and visualization that could scale to extremely large document collections. Towards that end, we designed, implemented, and demonstrated a scalable framework for text analysis - ParaText - as a major project deliverable. Further, we demonstrated the benefits of using visual analysis in text analysis algorithm development, improved performance of heterogeneous ensemble models in data classification problems, and the advantages of information theoretic methods in user analysis and interpretation in cross language information retrieval. The project involved 5 members of the technical staff and 3 summer interns (including one who worked two summers). It resulted in a total of 14 publications, 3 new software libraries (2 open source and 1 internal to Sandia), several new end-user software applications, and over 20 presentations. Several follow-on projects have already begun or will start in FY11, with additional projects currently in proposal.

More Details

LDRD final report : leveraging multi-way linkages on heterogeneous data

Dunlavy, Daniel D.; Kolda, Tamara G.

This report is a summary of the accomplishments of the 'Leveraging Multi-way Linkages on Heterogeneous Data' which ran from FY08 through FY10. The goal was to investigate scalable and robust methods for multi-way data analysis. We developed a new optimization-based method called CPOPT for fitting a particular type of tensor factorization to data; CPOPT was compared against existing methods and found to be more accurate than any faster method and faster than any equally accurate method. We extended this method to computing tensor factorizations for problems with incomplete data; our results show that you can recover scientifically meaningfully factorizations with large amounts of missing data (50% or more). The project has involved 5 members of the technical staff, 2 postdocs, and 1 summer intern. It has resulted in a total of 13 publications, 2 software releases, and over 30 presentations. Several follow-on projects have already begun, with more potential projects in development.

More Details

Scalable tensor factorizations with incomplete data

Kolda, Tamara G.; Dunlavy, Daniel D.

The problem of incomplete data - i.e., data with missing or unknown values - in multi-way arrays is ubiquitous in biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, communication networks, etc. We consider the problem of how to factorize data sets with missing values with the goal of capturing the underlying latent structure of the data and possibly reconstructing missing values (i.e., tensor completion). We focus on one of the most well-known tensor factorizations that captures multi-linear structure, CANDECOMP/PARAFAC (CP). In the presence of missing data, CP can be formulated as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) that uses a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factorize tensors with noise and up to 99% missing data. A unique aspect of our approach is that it scales to sparse large-scale data, e.g., 1000 x 1000 x 1000 with five million known entries (0.5% dense). We further demonstrate the usefulness of CP-WOPT on two real-world applications: a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes and the problem of modeling computer network traffic where data may be absent due to the expense of the data collection process.

More Details

ParaText : scalable text analysis and visualization

Dunlavy, Daniel D.

Automated analysis of unstructured text documents (e.g., web pages, newswire articles, research publications, business reports) is a key capability for solving important problems in areas including decision making, risk assessment, social network analysis, intelligence analysis, scholarly research and others. However, as data sizes continue to grow in these areas, scalable processing, modeling, and semantic analysis of text collections becomes essential. In this paper, we present the ParaText text analysis engine, a distributed memory software framework for processing, modeling, and analyzing collections of unstructured text documents. Results on several document collections using hundreds of processors are presented to illustrate the exibility, extensibility, and scalability of the the entire process of text modeling from raw data ingestion to application analysis.

More Details

Compressively sensed complex networks

Pinar, Ali P.; Dunlavy, Daniel D.

The aim of this project is to develop low dimension parametric (deterministic) models of complex networks, to use compressive sensing (CS) and multiscale analysis to do so and to exploit the structure of complex networks (some are self-similar under coarsening). CS provides a new way of sampling and reconstructing networks. The approach is based on multiresolution decomposition of the adjacency matrix and its efficient sampling. It requires preprocessing of the adjacency matrix to make it 'blocky' which is the biggest (combinatorial) algorithm challenge. Current CS reconstruction algorithm makes no use of the structure of a graph, its very general (and so not very efficient/customized). Other model-based CS techniques exist, but not yet adapted to networks. Obvious starting point for future work is to increase the efficiency of reconstruction.

More Details

Link prediction on evolving graphs using matrix and tensor factorizations

Kolda, Tamara G.; Dunlavy, Daniel D.

More Details

ParaText : scalable text modeling and analysis

Dunlavy, Daniel D.

Automated processing, modeling, and analysis of unstructured text (news documents, web content, journal articles, etc.) is a key task in many data analysis and decision making applications. As data sizes grow, scalability is essential for deep analysis. In many cases, documents are modeled as term or feature vectors and latent semantic analysis (LSA) is used to model latent, or hidden, relationships between documents and terms appearing in those documents. LSA supplies conceptual organization and analysis of document collections by modeling high-dimension feature vectors in many fewer dimensions. While past work on the scalability of LSA modeling has focused on the SVD, the goal of our work is to investigate the use of distributed memory architectures for the entire text analysis process, from data ingestion to semantic modeling and analysis. ParaText is a set of software components for distributed processing, modeling, and analysis of unstructured text. The ParaText source code is available under a BSD license, as an integral part of the Titan toolkit. ParaText components are chained-together into data-parallel pipelines that are replicated across processes on distributed-memory architectures. Individual components can be replaced or rewired to explore different computational strategies and implement new functionality. ParaText functionality can be embedded in applications on any platform using the native C++ API, Python, or Java. The ParaText MPI Process provides a 'generic' text analysis pipeline in a command-line executable that can be used for many serial and parallel analysis tasks. ParaText can also be deployed as a web service accessible via a RESTful (HTTP) API. In the web service configuration, any client can access the functionality provided by ParaText using commodity protocols ... from standard web browsers to custom clients written in any language.

More Details

Scalable tensor factorizations with missing data

Dunlavy, Daniel D.; Kolda, Tamara G.

The problem of missing data is ubiquitous in domains such as biomedical signal processing, network traffic analysis, bibliometrics, social network analysis, chemometrics, computer vision, and communication networks|all domains in which data collection is subject to occasional errors. Moreover, these data sets can be quite large and have more than two axes of variation, e.g., sender, receiver, time. Many applications in those domains aim to capture the underlying latent structure of the data; in other words, they need to factorize data sets with missing entries. If we cannot address the problem of missing data, many important data sets will be discarded or improperly analyzed. Therefore, we need a robust and scalable approach for factorizing multi-way arrays (i.e., tensors) in the presence of missing data. We focus on one of the most well-known tensor factorizations, CANDECOMP/PARAFAC (CP), and formulate the CP model as a weighted least squares problem that models only the known entries. We develop an algorithm called CP-WOPT (CP Weighted OPTimization) using a first-order optimization approach to solve the weighted least squares problem. Based on extensive numerical experiments, our algorithm is shown to successfully factor tensors with noise and up to 70% missing data. Moreover, our approach is significantly faster than the leading alternative and scales to larger problems. To show the real-world usefulness of CP-WOPT, we illustrate its applicability on a novel EEG (electroencephalogram) application where missing data is frequently encountered due to disconnections of electrodes.

More Details

Poblano v1.0 : a Matlab toolbox for gradient-based optimization

Dunlavy, Daniel D.; Kolda, Tamara G.

We present Poblano v1.0, a Matlab toolbox for solving gradient-based unconstrained optimization problems. Poblano implements three optimization methods (nonlinear conjugate gradients, limited-memory BFGS, and truncated Newton) that require only first order derivative information. In this paper, we describe the Poblano methods, provide numerous examples on how to use Poblano, and present results of Poblano used in solving problems from a standard test collection of unconstrained optimization problems.

More Details
Results 51–75 of 105
Results 51–75 of 105