Algorithms and models for infinite streams
CADA: Counter Adversarial Data Analytics
CAGA: Counter Adversarial Graph Analytics
Graph algorithms for autonomous data center
Interaction Design for Usability, Usefulness and Adoptability
Mathematical Analysis of Tensor Decompositions
Modeling Relational Data with Hypergraphs
Reduced Order Modeling
Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale
Network Discovery, Characterization, and Prediction (NGC)
Networks like the Internet are vast and distributed, and there is no mechanism for collecting their topology in full. We are restricted to specific data collection schemes (such as traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and that direct inferences from them will be wrong. But these data collection mechanisms have limited flexibility and cannot be modified. Moreover, in many applications there are limits on how much data can be collected.
How do we make provably accurate inferences of network properties from biased and limited measurements? In this project, we use techniques from sublinear graph sampling to address this problem. This emerging area of theoretical computer science deals with the following question: which graph properties can be provably measured using a small random (but possibly biased) sample of the network? Our aim is to understand which graph properties can be provably and accurately determined by sampling a graph using pre-specified mechanisms. Our methods provide a systematic, rigorous, and scientific approach to understanding the structure of critical infrastructure networks.
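To illustrate the flavor of the problem (a minimal sketch on a synthetic graph, not one of the project's algorithms): uniform node sampling yields an unbiased estimate of average degree, while traceroute-like measurement that reaches nodes by following edges is degree-biased and systematically overestimates it.

```python
import random

random.seed(0)

# Toy random graph (illustrative only): 500 nodes, 2000 random edges.
n = 500
adj = {v: set() for v in range(n)}
while sum(len(s) for s in adj.values()) // 2 < 2000:
    u, v = random.randrange(n), random.randrange(n)
    if u != v:
        adj[u].add(v)
        adj[v].add(u)

true_avg = sum(len(s) for s in adj.values()) / n   # exactly 2 * 2000 / 500 = 8.0

# Uniform node sampling: the mean degree of a small uniform sample is an
# unbiased estimator of the average degree.
def estimate_avg_degree(k):
    sample = random.sample(range(n), k)
    return sum(len(adj[v]) for v in sample) / k

# Averaging repeated small samples converges toward the true value.
estimates = [estimate_avg_degree(25) for _ in range(200)]
mean_est = sum(estimates) / len(estimates)

# By contrast, sampling a node as a random endpoint of a random edge (as
# edge-following probes effectively do) is degree-biased: high-degree nodes
# are seen more often, so the estimate is systematically too high.
edges = [(u, v) for u in adj for v in adj[u] if u < v]
biased = [len(adj[random.choice(random.choice(edges))]) for _ in range(5000)]
biased_mean = sum(biased) / len(biased)
```

The uniform-sample estimator lands near the true average degree, while the edge-biased estimator overshoots it; correcting for exactly this kind of bias is what the sublinear-sampling theory makes rigorous.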
Sandia brings a unique capability in human factors, organizational anthropology, and cognitive psychology to ensure that data science capabilities meet the real-world needs of national security analysts. We believe analytic systems should be:
- Usable. This describes whether a system is learnable, memorable, error-minimizing, and enjoyable for the people who are expected to use it.
- Useful. A useful process or technology enables people to accomplish relevant tasks and meaningful work.
- Adoptable. Adoptability is about the “fit” between the technology and the environment, and depends on both the intrinsic characteristics of the system and the social context in which it’s introduced. Adoptability is facilitated when people can try out, learn, communicate, adjust, and ultimately integrate a new process or technology into an existing context of work.
What makes any system usable, useful, and adoptable depends not only on the technology, but on the processes and relationships through which a new system is developed and delivered. Contextually sensitive, scenario-based, collaborative design research methods promote end-user satisfaction while generating valuable design knowledge for developers at all stages of a project, from algorithmic design through user-centric evaluation.
Sandia researchers draw on a range of contextual design methodologies, including activity theory (Bonnie Nardi); cognitive systems engineering/work analysis (developed by Jens Rasmussen and extended by Kim Vicente); situation awareness system design (Mica Endsley and Debra Jones); and a range of task analysis and workflow mapping techniques. To characterize a work environment as fully and correctly as possible, we often mix qualitative and quantitative techniques across disciplines, from inductive ethnographic description and contextual modeling, to quantitative experimental studies that examine visual perception and cognition in analytic workflows. We’re comfortable working at all stages of the design, development and delivery pipeline, from functional requirements elicitation to experimental software evaluation.
- McNamara, L.A. and Klein, L.E. 2016. “Context Sensitive Design and Human Interaction Principles for Usable, Useful and Adoptable Radars.” Proc SPIE 9289, Radar Sensor Technology XX, 12 May 2016.
- Cole, K.S.; Stevens-Adams, S.M.; Ganter, J.H.; McNamara, L.A. 2014. “Applying Cognitive Work Analysis to a Synthetic Aperture Radar System.” 313-324 in Engineering Psychology and Cognitive Ergonomics: Lecture Notes in Computer Science v. 8532.
- Matzen, L.E.; Haass, M.J.; McNamara, L.A. 2015. “Effects of Professional Visual Search Experience on Domain-General Visual Search Tasks.” 481-491 in Foundations of Augmented Cognition: Lecture Notes in Computer Science v. 9183, 8 July 2015.
Algorithms and models for infinite streams: For stream-monitoring applications, one wishes to archive enough data to effectively fill available storage while quickly answering queries about the data. Since any infinite stream will eventually overflow finite storage, the system must have a mechanism for removing the least interesting data when storage fills. In our model, this removal is a bulk deletion. Algorithms must be able to handle incoming stream elements while rebuilding data structures after such a bulk delete, and are judged by the speed at which they can handle the stream and by query-response latency. We have a result in this model for maintaining connected components of streaming graph edges on a limited-connection distributed-memory parallel architecture.
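A minimal sketch of this model (illustrative only, on a single process rather than the distributed-memory architecture described above): stream edges into a union-find structure, and when storage fills, bulk-delete the oldest edges and rebuild, since union-find supports insertion but not deletion.

```python
class UnionFind:
    """Incremental connected components: supports union/find, not deletion."""
    def __init__(self):
        self.parent = {}
    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

CAPACITY = 4      # toy storage limit, in edges
stored = []       # retained edges, oldest first
uf = UnionFind()

def ingest(edge):
    """Handle a stream edge; on overflow, bulk-delete and rebuild."""
    global stored, uf
    stored.append(edge)
    uf.union(*edge)
    if len(stored) > CAPACITY:
        stored = stored[len(stored) // 2:]   # bulk delete: drop the oldest half
        uf = UnionFind()                     # rebuild from the retained edges,
        for u, v in stored:                  # since union-find cannot delete
            uf.union(u, v)

for e in [(1, 2), (2, 3), (4, 5), (3, 4), (6, 7)]:
    ingest(e)

connected_35 = uf.find(3) == uf.find(5)   # survives the bulk delete
connected_13 = uf.find(1) == uf.find(3)   # lost: edges (1,2), (2,3) were deleted
```

Queries answered after the bulk delete reflect only the retained window of the stream, which is exactly the trade-off the model formalizes.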
A first publication appeared at the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining (2013).
The Algorithm Foundry (formerly Cognitive Foundry) is a Sandia-developed open source library for integrating advanced learning algorithms into any Java application. It contains support for basic linear algebra, optimization algorithms, statistical modeling, supervised learning, incremental supervised learning, and unsupervised learning.
Sandia makes critical use of data analytics in defense of national security. Our adversaries therefore seek to sap, even suborn, those analytics. Through understanding our methods, they seek to produce data which is evolving, incomplete, deceptive, and otherwise custom-designed to defeat our analysis. Further, we cannot prevent them from doing so. We live in a changed world, in which we frequently must depend on data over which our adversaries have unprecedented influence.
Accordingly, Sandia executed a brief inaugural effort, the "Counter Adversarial Data Analytics" (CADA) project, designed to develop and assess novel data analysis methods to counter that adversarial influence. Given the brief, 1.5-year duration of this effort, CADA focused specifically on machine learning, a data analytic specific enough to allow tractable analysis yet powerful and broadly useful enough to have important Sandia applications.
In the course of the CADA project we:
- demonstrated the counter-intuitive and alarming result that there exist label tampering attacks which are very effective while being nearly undetectable by the usual training data cross-validation tests,
- invented and thoroughly quantitatively investigated a novel method, "Ensembles of Outlier Measures" (EOM), for detecting and repairing tampered data,
- demonstrated the surprising generality of EOM as a data science principle by successfully applying it to an unsupervised problem very different from the one it was designed for, and
- developed a scheme for “quantified paranoia”, that is, a statistically principled mechanism for deciding whether a dataset as a whole has been tampered with, regardless of whether we can detect any individual bits of tampering with confidence.
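A toy sketch in the spirit of EOM (a hypothetical reconstruction on synthetic data, not the project's implementation): score each training point with several simple outlier measures, rank-normalize the measures so they are comparable, and average them; the highest-scoring points are candidates for label tampering.

```python
import random
import statistics

random.seed(1)

# Synthetic training set (hypothetical): 1-D features, two classes, with a few
# labels flipped by a simulated adversary.
data = [(random.gauss(0.0, 1.0), 0) for _ in range(50)] + \
       [(random.gauss(4.0, 1.0), 1) for _ in range(50)]
tampered = set(random.sample(range(len(data)), 5))
data = [(x, 1 - y) if i in tampered else (x, y) for i, (x, y) in enumerate(data)]

def class_distance_scores(data):
    """Outlier measure 1: distance from a point to its labeled class mean."""
    means = {c: statistics.mean(x for x, y in data if y == c) for c in (0, 1)}
    return [abs(x - means[y]) for x, y in data]

def knn_disagreement_scores(data, k=7):
    """Outlier measure 2: fraction of k nearest neighbors with a different label."""
    scores = []
    for i, (x, y) in enumerate(data):
        nbrs = sorted((abs(x - x2), y2) for j, (x2, y2) in enumerate(data) if j != i)
        scores.append(sum(1 for _, y2 in nbrs[:k] if y2 != y) / k)
    return scores

def rank_normalize(scores):
    """Map raw scores to [0, 1] by rank so different measures are comparable."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    for r, i in enumerate(order):
        ranks[i] = r / (len(scores) - 1)
    return ranks

# Ensemble of outlier measures: average the normalized scores and flag the
# highest-scoring points as suspected label tampering.
s1 = rank_normalize(class_distance_scores(data))
s2 = rank_normalize(knn_disagreement_scores(data))
ensemble = [(a + b) / 2 for a, b in zip(s1, s2)]
suspects = set(sorted(range(len(data)), key=ensemble.__getitem__, reverse=True)[:5])
```

On this well-separated synthetic set the ensemble's top suspects overlap heavily with the truly tampered points; the measures, the combination rule, and the data are all illustrative assumptions.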
- Counter Adversarial Data Analytics, Philip Kegelmeyer, Timothy M. Shead, Jonathan Crussell, Katie Rodhouse, Dave Robinson, Curtis Johnson, Dave Zage, Warren Davis, Jeremy Wendt, Justin “J.D.” Doak, Tiawna Cayton, Richard Colbaugh, Kristin Glass, Brian Jones, Jeff Shelburg, SAND Report SAND 2015-3711, Unlimited Release, May 6, 2015.
- Attacking DBSCAN for Fun and Profit, Jonathan Crussell and W. Philip Kegelmeyer, Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30, 2015
Sandia is currently in the middle of a three year Counter-Adversarial Graph Analytics (CAGA) project. CAGA's charter is to discover, study, and quantifiably characterize vulnerabilities in graph analysis methods induced by attack by informed, empowered adversaries, to develop remediations for those vulnerabilities, and to make the results available both in published literature and in practical, useful software.
CAGA's initial focus is on evasion attacks, graph manipulations where an adversary wishes to induce a specific and local analysis failure; for instance, for a node *not* to be linked to its natural community. We have invented and evaluated a number of such manipulation strategies and have invented a novel machine-learning-based defense against them. We have also separately developed a novel tool, Visualization of Incremental Community Evolution, to interactively investigate the slow change of community assignment as attack edges are added to a graph, and as remediation edges are removed.
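A toy illustration of an evasion attack (a hypothetical sketch using a simple majority-vote community assignment, not the project's attack strategies or defenses): the adversary adds edges from a target node toward the other community until the target is no longer linked to its natural community.

```python
# Two ground-truth communities in a toy graph (hypothetical data):
# A = {0, 1, 2}, B = {3, 4, 5}, with one bridge edge (2, 3).
community = {0: 'A', 1: 'A', 2: 'A', 3: 'B', 4: 'B', 5: 'B'}
edges = {(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)}

def neighbors(v, edges):
    return [b if a == v else a for a, b in edges if v in (a, b)]

def assigned_community(v, edges, community):
    """Assign v to the majority community among its neighbors."""
    votes = [community[u] for u in neighbors(v, edges)]
    return max(set(votes), key=votes.count)

target = 2
before = assigned_community(target, edges, community)   # neighbors 0, 1 (A), 3 (B)

# Evasion attack: two adversarial edges from the target into community B
# flip the majority vote, detaching the target from its natural community.
attack_edges = {(2, 4), (2, 5)}
after = assigned_community(target, edges | attack_edges, community)
```

Before the attack the target is assigned to community A; after two attack edges it is assigned to B. Real community-detection methods are more sophisticated, but the attack surface, adding a few well-chosen edges, is the same.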
The FEAST project is led by Tamara Kolda and sponsored by the DARPA Defense Sciences Office (DSO) in the Mathematics Focus Area. Analysis of graphs arising in social, communication, and cyber networks is vitally important for national security. Research in this domain, however, is still relatively immature, often relying on un-validated hypotheses and focusing on a limited set of observables.
The purpose of this project has been to close critical gaps in existing knowledge by providing: salient observables that are efficient to compute and that can tie local structures to global characteristics of a graph, rigorously validated models of large-scale evolving networks, efficient sampling algorithms with provable error bounds that enable situational awareness for networks of billions of nodes, and techniques that enable analysis of multi-faceted networks by eliciting structure from complex network data.
As part of this work, we analyzed fundamental problems with an existing model called the Stochastic Kronecker Graph (SKG) [Seshadhri, Pinar, Kolda, J. ACM 2013], which was scalable but did not produce realistic graphs. To fix this problem, we developed a new scalable graph model called the Block Two-Level Erdős–Rényi (BTER) model [Seshadhri, Kolda, Pinar, Phys. Rev. E 2012; Kolda, Pinar, Plantenga, Seshadhri, SIAM J. Scientific Computing 2014] that captures both connectivity and cohesion in graphs and easily scales to billions of nodes and edges. We also developed a new sampling-based method for counting triangles, which won the best-paper prize at the 2013 SIAM Conference on Data Mining [Seshadhri, Pinar, Kolda, SDM’13], and a streaming method for counting triangles, which won the best-paper prize at the 2013 ACM SIGKDD Conference [Jha, Seshadhri, Pinar, KDD’13], as well as a method for counting four-vertex patterns [Jha, Seshadhri, Pinar, WWW’15]. A paper on finding dense subgraphs was runner-up for the best-paper prize at WWW’15 [Sariyuce, Seshadhri, Pinar, Catalyurek, WWW’15]. The project has so far produced six journal papers and 10 conference papers.
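The idea behind the sampling-based triangle-counting result can be sketched as follows (an illustrative reimplementation of wedge sampling on a synthetic graph, not the published code): a wedge is a length-2 path centered at a node; each triangle closes exactly three wedges, so the triangle count equals the closed-wedge fraction times the total wedge count, divided by three.

```python
import random
from itertools import combinations
from math import comb

random.seed(0)

# Toy random graph (illustrative data): 60 nodes, roughly 300 edges.
n = 60
adj = {v: set() for v in range(n)}
for _ in range(300):
    u, v = random.randrange(n), random.randrange(n)
    if u != v:
        adj[u].add(v)
        adj[v].add(u)

# Exact triangle count by brute-force enumeration (for comparison only).
exact = sum(1 for a, b, c in combinations(range(n), 3)
            if b in adj[a] and c in adj[a] and c in adj[b])

# Wedge sampling: pick a wedge center with probability proportional to its
# wedge count C(d, 2), then two distinct neighbors uniformly at random.
nodes = [v for v in range(n) if len(adj[v]) >= 2]
weights = [comb(len(adj[v]), 2) for v in nodes]
total_wedges = sum(weights)

samples, closed = 20000, 0
for v in random.choices(nodes, weights=weights, k=samples):
    u, w = random.sample(sorted(adj[v]), 2)
    if w in adj[u]:
        closed += 1

# Each triangle closes exactly 3 wedges.
estimate = (closed / samples) * total_wedges / 3
```

The estimate touches only sampled wedges rather than all triples, which is what lets this style of algorithm scale to graphs far too large to enumerate.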
Graph algorithms for autonomous data centers: When two or more data centers owned by separate entities collect similar types of (massive) data, it can be to their advantage to answer queries on the combined data. However, the data centers do not wish to simply give their data to the other entities: gathering that data costs time and money, and sharing may also be inhibited by bandwidth constraints, regulations, or competition. One example of this setting is two companies considering a merger that would like to determine whether their shared information is sufficiently valuable, without exchanging proprietary information unnecessarily. Another example is law enforcement and government agencies that wish to share data to “connect the dots” in counterterrorism activities but are prevented from direct sharing by law. We have modeled this problem with two different forms of minimum sharing: small open sharing and secure multiparty (SMP) sharing. In the SMP model, all entities learn the answer to the query, but in the honest-but-curious model each reveals no other information to the others. We have results for point-to-point connectivity in distributed graph settings; in the SMP setting, our protocol does not reveal node names. A first publication for this work appeared at the IEEE International Symposium on Parallel and Distributed Processing in 2015.
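A toy sketch of the small-open-sharing idea (a hypothetical protocol for illustration, not the published one): each data center keeps its edge list private, and the parties repeatedly exchange only per-node component labels until the labels stabilize, after which point-to-point connectivity queries can be answered without any edges ever leaving their owner.

```python
# Each data center holds a private edge list over a shared node-ID space.
edges_a = [(1, 2), (3, 4)]        # data center A's private edges
edges_b = [(2, 3), (5, 6)]        # data center B's private edges
nodes = {v for e in edges_a + edges_b for v in e}

labels = {v: v for v in nodes}    # shared state: a component label per node ID

def merge_locally(labels, private_edges):
    """One party lowers labels along its own edges; only labels leave the site."""
    labels = dict(labels)
    changed = True
    while changed:
        changed = False
        for u, v in private_edges:
            low = min(labels[u], labels[v])
            if labels[u] != low or labels[v] != low:
                labels[u] = labels[v] = low
                changed = True
    return labels

# Alternate rounds between the parties until no label changes.
while True:
    new = merge_locally(merge_locally(labels, edges_a), edges_b)
    if new == labels:
        break
    labels = new

connected_1_4 = labels[1] == labels[4]   # true, via the cross-site path 1-2-3-4
connected_1_5 = labels[1] == labels[5]   # false: 5 is in a separate component
```

The labels do leak component structure, which is why this corresponds to the small-open-sharing model; the SMP variant described above additionally hides node identities, at the cost of cryptographic machinery omitted here.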
The project on “Mathematical Analysis of Tensor Decompositions” is led by Tamara Kolda and sponsored by the DOE Office of Science Advanced Scientific Computing Research (ASCR) Applied Mathematics Program.
A tensor is a multidimensional or N-way array. Tensor decompositions are the higher-order analogue of matrix decompositions, and advances in the analysis and algorithms of tensor decompositions enable researchers in a variety of domains to tackle previously intractable problems. Our ultimate goal is to tackle "big N" engineering and science problems by providing effective tools for higher-order modeling and compression on large multidimensional data and multidimensional operators. Of particular interest to the DOE are compressed representations of expensive operators used in physics simulations, quantum information systems, stochastic partial differential equations, co-design, etc. Our goal is to overcome impediments, especially computational barriers, to applying tensor decompositions to real-world problems.
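As a concrete example (standard notation, not specific to this project), the CANDECOMP/PARAFAC (CP) decomposition expresses a three-way tensor as a sum of R rank-one outer products, the higher-order analogue of a truncated matrix SVD:

```latex
\mathcal{X} \approx \sum_{r=1}^{R} \lambda_r \, a_r \circ b_r \circ c_r,
\qquad \text{i.e.} \qquad
x_{ijk} \approx \sum_{r=1}^{R} \lambda_r \, a_{ir} \, b_{jr} \, c_{kr}.
```

For an $I \times J \times K$ tensor, storing the factors requires only about $R(I + J + K)$ numbers rather than $IJK$, which is the sense in which a low-rank decomposition compresses large multidimensional data and operators.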
We will focus specifically on decompositions for symmetric and/or sparse data and the application of tensor decompositions to problems of DOE interest. Building models is only part of the solution. We will complement this work with research into the selection of appropriate models, fitting functions, and parameters as well as the sensitivity of the resulting models. This fundamental work will advance DOE's remarkable simulation and modeling abilities by enabling more sophisticated and compressed data and operator surrogates. The work has contributed new methods and algorithms to Sandia’s Tensor Toolbox for MATLAB, a package for working with tensors, especially sparse tensors prevalent in many data mining applications.
Many applications in the data sciences require the analysis of very large datasets, identifying important groups, structure, and trends in the data. Relational data, which describes the relationships between objects (often relationships between more than two objects), is commonly used in these analyses.
Graphs, in which complex, non-pairwise relationships are translated into multiple pairwise relationships (edges), are common representations of relational data and have frequently been used to analyze it. Graph analysis, however, often suffers from information loss because those complex, non-pairwise relationships are represented by collections of pairwise edges. Relationships among multiple objects may be better modeled by hypergraphs, generalizations of graphs in which edges (hyperedges) are arbitrary sets of one or more vertices (vertices still represent objects).
At Sandia National Laboratories, we have been researching how hypergraphs can be used to better model relational data. In the context of spectral clustering, we have shown that there are significant advantages to modeling relational data with hypergraphs rather than graphs. We have demonstrated that hypergraph-based algorithms can yield significant computational improvements in both floating-point operation counts and runtime. In general, we also found that the clustering quality of the hypergraph-based models was better than that of the graph models (producing results more similar to our ground truth). In future work, we plan to develop better theoretical models and extend this work to other data science algorithms, such as eigencentrality and information propagation.
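The modeling difference can be made concrete with a small sketch (hypothetical data; the counts are illustrative, not from the paper below): clique-expanding hyperedges into pairwise edges both loses the identity of each multi-way relationship and can inflate the number of stored entries.

```python
from itertools import combinations

# Hypothetical relational data: each hyperedge is the author set of one paper.
hyperedges = [
    {"ann", "bob", "cho", "dee"},   # one genuine 4-way relationship
    {"bob", "cho", "eve"},
    {"ann", "eve"},
]

# Graph model: clique expansion replaces each hyperedge with all of its
# pairwise edges, so a 4-way relationship becomes 6 indistinguishable pairs.
pairwise = {frozenset(p) for h in hyperedges for p in combinations(sorted(h), 2)}

# Storage proxies: nonzeros of the vertex-hyperedge incidence structure
# versus nonzeros of the (symmetric) clique-expanded adjacency structure.
hyperedge_entries = sum(len(h) for h in hyperedges)
pairwise_entries = 2 * len(pairwise)
```

Here the hypergraph incidence structure stores 9 entries while the clique-expanded adjacency stores 18, and the expansion can no longer distinguish one four-way relationship from several pairwise ones; the gap grows quadratically with hyperedge size, which is one source of the computational advantages noted above.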
Additional information on this work of modeling relational data and hypergraphs can be found in the following paper:
- M.M. Wolf, A.M. Klinvex and D.M. Dunlavy, "Advantages to modeling relational data using hypergraphs versus graphs," 2016 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2016, pp. 1-7.
Physics-based simulation has become indispensable in national-security applications ranging from weapons design to critical-infrastructure resilience. However, as simulation is playing an increasingly important role in analysis, design, and decision making, greater demands are being placed on model fidelity. This additional fidelity leads to extreme-scale nonlinear dynamical-system models whose simulation can consume weeks on thousands of computing cores. Thus, a ‘computational barrier’ arises that precludes truly high-fidelity models from being used in important real-time applications (e.g., control) and many-query scenarios (e.g., uncertainty quantification).
Sandia’s effort in reduced-order modeling aims to break this computational barrier by leveraging simulation data generated to construct low-dimensional approximations of these high-fidelity models. These techniques—which combine concepts from machine learning, computational mechanics, and high-performance computing—lead to orders-of-magnitude speedups, preserve important dynamical-system properties, and are equipped with statistical error models. These techniques are currently being used to simulate the response of national-security systems in normal fluid and abnormal thermal environments.
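In broad strokes (a generic projection-based sketch in standard notation, not the specific formulations of the papers below): given full-order dynamics with state $x \in \mathbb{R}^N$ and a low-dimensional basis $V \in \mathbb{R}^{N \times k}$, $k \ll N$, computed from simulation data, a Galerkin reduced-order model evolves only the $k$ reduced coordinates:

```latex
\dot{x}(t) = f\bigl(x(t)\bigr),
\qquad
x(t) \approx V \hat{x}(t),
\qquad
\dot{\hat{x}}(t) = V^{\mathsf{T}} f\bigl(V \hat{x}(t)\bigr),
\quad \hat{x} \in \mathbb{R}^{k}.
```

The reduced state has only $k$ degrees of freedom; evaluating the nonlinear term efficiently requires further approximation (hyper-reduction), and together these are what make orders-of-magnitude speedups possible.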
- K. Carlberg, Y. Choi, and S. Sargsyan. "Conservative model reduction for finite-volume models," arXiv e-Print, 1711.11550 (2017)
- K. Carlberg, M. Barone, and H. Antil. “Galerkin v. least-squares Petrov–Galerkin projection in nonlinear model reduction,” Journal of Computational Physics, Vol. 330, p. 693–734 (2017)
- M. Drohmann and K. Carlberg. “The ROMES method for statistical modeling of reduced-order-model error,” SIAM/ASA Journal on Uncertainty Quantification, Vol. 3, No. 1, p.116–145 (2015)
Post-Moore's-law scaling is increasingly pushing data analysis and other traditionally post-processing tasks in situ, onto the primary compute platform, due to I/O constraints. As a result, data analysis techniques must adapt to machines with extreme concurrency, low communication bandwidth, and high memory latency, while operating within the time constraints prescribed by the simulation. We have used sublinear algorithms for in-situ analysis of scientific data sets and for enabling adaptive workflows, in which the composition and execution of computational and data manipulation steps depend dynamically on the evolving state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and high fidelity for post-analysis; however, we cannot afford the same frequency for the full simulation because of the high cost of data movement. We can instead look for triggers, indicators that the simulation is entering a critical phase, and adapt the workflow accordingly.
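A minimal sketch of the trigger idea (synthetic signal and a made-up detection rule, purely illustrative): examine only a sparse subsample of the stream and fire when a new sample departs sharply from the sampled history, signaling the workflow to switch to high-frequency output.

```python
import math

# Synthetic simulation signal (illustrative): a quiet oscillation for 400
# steps, then the onset of a critical phase with a sharp level change.
stream = [0.1 * math.sin(0.1 * t) for t in range(400)] + \
         [1.5 + 0.1 * math.sin(0.1 * t) for t in range(400, 500)]

def trigger(stream, stride=10, window=8, threshold=0.5):
    """Sublinear trigger: inspect only every `stride`-th value and fire when
    the newest sample jumps away from the mean of the sampled history."""
    history = []
    for i in range(0, len(stream), stride):
        x = stream[i]
        if len(history) >= window and \
                abs(x - sum(history[-window:]) / window) > threshold:
            return i   # signal the workflow to switch to high-frequency output
        history.append(x)
    return None

fired_at = trigger(stream)
```

With these illustrative parameters the trigger inspects one value in ten and fires at the first sampled step of the critical phase, so the workflow pays full data-movement cost only where the fidelity is needed.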
Sandia Report: Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale
Networks engaged in weapons proliferation, terrorism, cyber attacks, clandestine resale of dual-use imports, arms and drug smuggling, and other illicit activities are major threats to national security. These adversarial networks in turn rely on legitimate and illegitimate secondary networks for financial, supply chain, communication, recruiting, and fund-raising activities. Complexity, dynamism, resilience and adaptability make adversarial networks extremely difficult to identify and disrupt. Often the only way an individual may be detected is through the networks they use, and the arrest of an individual may not remove the underlying threat if the networks remain intact. In short, our real adversaries are networks.
The Solution: Research and Develop Analysis Capabilities
Our goal, then, is to research and develop analysis capabilities that address adversarial networks. The full title of the project, “Network Discovery, Characterization, and Prediction,” conveys the scope and challenges involved. The discovery of adversarial networks is immensely difficult in its own right. A network may only reveal itself by the union of its parts. Individual relationships and activities may appear completely benign in isolation. Data relevant to network discovery may come from communications, financial transactions, human intelligence reports, shipment records, cyber events or many other sources. It may be geographically or temporally dispersed. Thus, very large and heterogeneous data collections must be analyzed collectively to detect networks.

The characterization of networks requires methods for identifying likely relationships that are not captured in the data. The structure of a network conveys information about its purpose and the roles of its component individuals, organizations and activities. It can reveal command and control structure and critical components. Structure can also suggest likely evolution and intent, allowing prediction of the possible future shapes of the network.
In sum, we are creating at Sandia, in support of the nation, the unique capability to answer currently unanswerable questions.
NGC Project Goals
- Build upon considerable existing Sandia capabilities in scalable computing and advanced analysis algorithms
- Understand and elicit the needs of the intelligence community
- Do basic research on uncertainty in the intelligence domain
- Research and evaluate novel analysis algorithms
- Implement that research to address those needs to create a flexible, interactive capability for intelligence analysis on large datasets
NGC Project Team
The NGC project team includes research mathematicians, developers, experts in user elicitation, and end-user intelligence analysts, and so has all the needed talent to span the full LDRD spectrum—from Discover through Create to Prove.
NGC Project Funding
The NGC project was funded by Sandia’s Laboratory Directed Research & Development (LDRD) program as a Grand Challenge, from FY08 through FY10.