


Algorithms and models for infinite streams

CADA: Counter Adversarial Data Analytics

CAGA: Counter Adversarial Graph Analytics

FEAST

Graph algorithms for autonomous data centers

Interaction Design for Usability, Usefulness and Adoptability

Mathematical Analysis of Tensor Decompositions

Modeling Relational Data with Hypergraphs

Reduced Order Modeling

Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale

Network Discovery, Characterization, and Prediction (NGC)

Networks like the Internet are vast and distributed, and there is no mechanism for completely collecting their topology. We are restricted to specific data collection schemes (such as traceroute samples from router interfaces) that examine tiny portions of such a network. It has been empirically documented and theoretically proven that these measurements have significant biases, and that direct inferences from them will be wrong. Yet these data collection mechanisms have limited flexibility and cannot be modified. Moreover, in many applications there are limits on how much data can be collected.
How do we make provably accurate inferences of network properties from biased and limited measurements? In this project, we use techniques from sublinear graph sampling to address this problem. This emerging area of theoretical computer science deals with the following question: what graph properties can be provably measured using a small random (but possibly biased) sample of the network? Our aim is to understand which graph properties can be provably and accurately determined by sampling a graph using prespecified mechanisms. Our methods provide a systematic, rigorous, and scientific approach to understanding the structure of critical infrastructure networks.
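As a toy illustration of the sampling idea (not the project's algorithms), the following sketch estimates one global property, average degree, by querying only a few hundred vertices of a large graph; the graph, sample size, and accuracy check are all hypothetical:

```python
import random

# Toy illustration of sublinear sampling (hypothetical graph and sample
# size): estimate a graph's average degree by querying the degrees of a
# small uniform random vertex sample instead of scanning every vertex.
random.seed(0)

n = 10_000
# Build a sparse random graph as an adjacency list.
adj = {v: set() for v in range(n)}
for _ in range(3 * n):
    u, v = random.randrange(n), random.randrange(n)
    if u != v:
        adj[u].add(v)
        adj[v].add(u)

true_avg = sum(len(nbrs) for nbrs in adj.values()) / n

# Sublinear estimate: look at only a few hundred vertices.
sample = random.sample(range(n), 400)
est_avg = sum(len(adj[v]) for v in sample) / len(sample)

rel_err = abs(est_avg - true_avg) / true_avg
```

Uniform vertex sampling works well on this near-regular toy graph; for heavy-tailed real networks, the biased-sampling analysis described above is exactly what is needed to correct such naive estimators.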
Sandia brings a unique capability in human factors, organizational anthropology, and cognitive psychology to ensure that data science capabilities meet the real-world needs of national security analysts. We believe analytic systems should be:
- Usable. This describes whether a system is learnable, memorable, error-minimizing, and enjoyable for the people who are expected to use it.
- Useful. A useful process or technology enables people to accomplish relevant tasks and meaningful work.
- Adoptable. Adoptability is about the “fit” between the technology and the environment, and depends on both the intrinsic characteristics of the system and the social context in which it’s introduced. Adoptability is facilitated when people can try out, learn, communicate, adjust, and ultimately integrate a new process or technology into an existing context of work.
What makes any system usable, useful, and adoptable depends not only on the technology, but on the processes and relationships through which a new system is developed and delivered. Contextually sensitive, scenario-based, collaborative design research methods promote end-user satisfaction while generating valuable design knowledge for developers at all stages of a project, from algorithmic design all the way through user-centric evaluation.
Sandia researchers draw on a range of contextual design methodologies, including activity theory (Bonnie Nardi); cognitive systems engineering/work analysis (developed by Jens Rasmussen and extended by Kim Vicente); situation awareness system design (Mica Endsley and Debra Jones); and a range of task analysis and workflow mapping techniques. To characterize a work environment as fully and correctly as possible, we often mix qualitative and quantitative techniques across disciplines, from inductive ethnographic description and contextual modeling, to quantitative experimental studies that examine visual perception and cognition in analytic workflows. We’re comfortable working at all stages of the design, development and delivery pipeline, from functional requirements elicitation to experimental software evaluation.
References:
- McNamara, L.A. and Klein, L.E. 2016. “Context Sensitive Design and Human Interaction Principles for Usable, Useful and Adoptable Radars.” Proc. SPIE 9829, Radar Sensor Technology XX, 12 May 2016.
- Cole, K.S.; Stevens-Adams, S.M.; Ganter, J.H.; McNamara, L.A. 2014. “Applying Cognitive Work Analysis to a Synthetic Aperture Radar System.” Pp. 313-324 in Engineering Psychology and Cognitive Ergonomics, Lecture Notes in Computer Science v. 8532.
- Matzen, L.E.; Haass, M.J.; McNamara, L.A. 2015. “Effects of Professional Visual Search Experience on Domain-General Visual Search Tasks.” Pp. 481-491 in Foundations of Augmented Cognition, Lecture Notes in Computer Science v. 9183, 8 July 2015.
Algorithms and models for infinite streams: For streaming monitoring applications, one wishes to archive enough data to effectively fill available storage while quickly answering queries about the data. Since any infinite stream will eventually overflow finite storage, the system must have a mechanism for removing the least interesting data when storage fills. In our model, this removal is a bulk deletion. Algorithms must be able to handle incoming stream elements while rebuilding data structures after such a bulk delete. Algorithms are judged by the speed at which they can handle the stream and by query response latency. We have a result in this model for maintaining connected components of streaming graph edges using a limited-connection, distributed-memory parallel architecture.
A first publication appeared at the 2nd International Workshop on Big Data, Streams and Heterogeneous Source Mining (2013).
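A minimal serial sketch of the underlying task, maintaining connected components over an edge stream with union-find; the actual result concerns a limited-connection, distributed-memory parallel architecture with bulk deletion and rebuild, which this toy does not model:

```python
# Toy sketch: connected components over a stream of edges via union-find.
# After a bulk delete, such a structure would be rebuilt from the retained
# edges, since union-find does not support incremental deletion.
class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def components(edge_stream):
    """Process each arriving edge; queries can be answered at any time."""
    uf = UnionFind()
    for u, v in edge_stream:
        uf.union(u, v)
    return uf

uf = components([(1, 2), (2, 3), (4, 5)])
same = uf.find(1) == uf.find(3)   # 1 and 3 are connected
diff = uf.find(1) != uf.find(4)   # 1 and 4 are not
```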
The Algorithm Foundry (formerly Cognitive Foundry) is a Sandia-developed open-source library for integrating advanced learning algorithms into any Java application. It contains support for basic linear algebra, optimization algorithms, statistical modeling, supervised learning, incremental supervised learning, and unsupervised learning.
For more information, visit our forum (http://www.cognitivefoundry.org/) or GitHub repository (https://github.com/algorithmfoundry/Foundry).
Sandia makes critical use of data analytics in defense of national security. Our adversaries therefore seek to sap, even suborn, those analytics. By understanding our methods, they seek to produce data that is evolving, incomplete, deceptive, and otherwise custom-designed to defeat our analysis. Further, we cannot prevent them from doing so. We live in a changed world, in which we frequently must depend on data over which our adversaries have unprecedented influence.
Accordingly, Sandia executed a brief inaugural effort, the "Counter Adversarial Data Analytics" (CADA) project, designed to develop and assess novel data analysis methods to counter that adversarial influence. Given the brief, 1.5-year duration of this effort, CADA focused specifically on machine learning, a data analytic specific enough to allow tractable analysis yet powerful and broadly useful enough to have important Sandia applications.
In the course of the CADA project we:
- demonstrated the counterintuitive and alarming result that there exist label-tampering attacks which are very effective while being nearly undetectable by the usual training-data cross-validation tests,
- invented and thoroughly quantitatively investigated a novel method, "Ensembles of Outlier Measures" (EOM), for detecting and repairing tampered data,
- demonstrated the surprising generality of EOM as a data science principle by successfully applying it to an unsupervised problem very different from the one it was designed for, and
- developed a scheme for “quantified paranoia”, that is, a statistically principled mechanism for deciding whether a dataset as a whole has been tampered with, regardless of whether we can detect any individual bits of tampering with confidence.
- Counter Adversarial Data Analytics, Philip Kegelmeyer, Timothy M. Shead, Jonathan Crussell, Katie Rodhouse, Dave Robinson, Curtis Johnson, Dave Zage, Warren Davis, Jeremy Wendt, Justin “J.D.” Doak, Tiawna Cayton, Richard Colbaugh, Kristin Glass, Brian Jones, Jeff Shelburg, SAND Report SAND2015-3711, Unlimited Release, May 6, 2015.
- Attacking DBSCAN for Fun and Profit, Jonathan Crussell and W. Philip Kegelmeyer, Proceedings of the 2015 SIAM International Conference on Data Mining, Vancouver, BC, Canada, April 30, 2015.
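The ensemble-of-outlier-measures idea can be illustrated with a toy sketch (this is not the project's implementation): several simple outlier scores, here k-nearest-neighbor label disagreement under two hypothetical distance measures, are averaged so that flipped training labels rise to the top of the ranking:

```python
import random, math

# Toy sketch inspired by the EOM idea: combine multiple outlier scores to
# surface training points whose labels may have been tampered with.
# The data, scores, and attack here are all hypothetical.
random.seed(1)

pts, labels = [], []
for cls, (cx, cy) in enumerate([(0.0, 0.0), (6.0, 6.0)]):
    for _ in range(50):
        pts.append((cx + random.gauss(0, 1), cy + random.gauss(0, 1)))
        labels.append(cls)

tampered = [3, 57]                 # flip two labels to simulate an attack
for i in tampered:
    labels[i] = 1 - labels[i]

def knn_disagreement(dist, k=7):
    """Outlier score: fraction of k nearest neighbors with another label."""
    scores = []
    for i, p in enumerate(pts):
        nbrs = sorted((j for j in range(len(pts)) if j != i),
                      key=lambda j: dist(p, pts[j]))[:k]
        scores.append(sum(labels[j] != labels[i] for j in nbrs) / k)
    return scores

l2 = knn_disagreement(lambda a, b: math.dist(a, b))
l1 = knn_disagreement(lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]))

# Ensemble step: average the individual measures, then rank by the mean.
ensemble = [(a + b) / 2 for a, b in zip(l1, l2)]
suspects = sorted(range(len(pts)), key=lambda i: -ensemble[i])[:2]
```

On this well-separated toy data, the two tampered points score highest; the real research question, addressed by EOM, is how far such ensembles remain reliable on messy, adversarially crafted data.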
Sandia is currently in the middle of a three-year Counter-Adversarial Graph Analytics (CAGA) project. CAGA's charter is to discover, study, and quantifiably characterize vulnerabilities in graph analysis methods induced by attacks from informed, empowered adversaries; to develop remediations for those vulnerabilities; and to make the results available both in the published literature and in practical, useful software.
CAGA's initial focus is on evasion attacks: graph manipulations through which an adversary induces a specific, local analysis failure; for instance, causing a node *not* to be linked to its natural community. We have invented and evaluated a number of such manipulation strategies and have invented a novel machine-learning-based defense against them. We have also separately developed a novel tool, Visualization of Incremental Community Evolution, to interactively investigate the gradual change of community assignment as attack edges are added to a graph and as remediation edges are removed.
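A toy sketch of an evasion attack, using an entirely hypothetical graph and a deliberately simple majority-vote notion of community membership (these are not CAGA's methods): adding a handful of attack edges flips the target node's assignment.

```python
# Two dense communities, A and B, joined by one natural bridge edge.
A = set(range(0, 8))
B = set(range(8, 16))
edges = {(u, v) for u in A for v in A if u < v}
edges |= {(u, v) for u in B for v in B if u < v}
edges.add((7, 8))                 # the bridge

def assign(node, edges):
    """Toy detector: put a node in the community holding most neighbors."""
    nbrs = ({v for (u, v) in edges if u == node} |
            {u for (u, v) in edges if v == node})
    in_a = len(nbrs & (A - {node}))
    in_b = len(nbrs & (B - {node}))
    return 'A' if in_a >= in_b else 'B'

before = assign(7, edges)          # node 7 naturally belongs to A

# Evasion attack: connect node 7 to most of community B.
attacked = edges | {(7, v) for v in range(9, 16)}
after = assign(7, attacked)        # node 7 now appears to belong to B
```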
The FEAST project is led by Tamara Kolda and sponsored by the DARPA Defense Sciences Office (DSO) in the Mathematics Focus Area. Analysis of graphs arising in social, communication, and cyber networks is vitally important for national security. Research in this domain, however, is still relatively immature, often relying on unvalidated hypotheses and focusing on a limited set of observables.
The purpose of this project has been to close critical gaps in existing knowledge by providing: salient observables that are efficient to compute and that can tie local structures to global characteristics of a graph, rigorously validated models of largescale evolving networks, efficient sampling algorithms with provable error bounds that enable situational awareness for networks of billions of nodes, and techniques that enable analysis of multifaceted networks by eliciting structure from complex network data.
As part of this work, we analyzed fundamental problems with an existing model called the Stochastic Kronecker Graph (SKG) [Seshadhri, Pinar, Kolda, J. ACM 2013], which was scalable but did not produce realistic graphs. To fix this problem, we developed a new scalable graph model called the Block Two-Level Erdős–Rényi (BTER) model [Seshadhri, Kolda, Pinar, Phys. Rev. E 2012; Kolda, Pinar, Plantenga, Seshadhri, SIAM J. Scientific Computing 2014] that captures both connectivity and cohesion in graphs and easily scales to billions of nodes and edges. We also developed a new sampling-based method for counting triangles, which won the best-paper prize at the 2013 SIAM Conference on Data Mining [Seshadhri, Pinar, Kolda, SDM’13], as well as a streaming method for counting triangles, which won the best-paper prize at the 2013 ACM SIGKDD conference [Jha, Seshadhri, Pinar, KDD’13], and another streaming method for counting four-vertex patterns [Jha, Seshadhri, Pinar, WWW’15]. A paper on finding dense subgraphs was runner-up for the best-paper prize at WWW’15 [Sariyuce, Seshadhri, Pinar, Catalyurek, WWW’15]. The project has so far produced six journal papers and 10 conference papers.
http://www.sandia.gov/~tgkolda/feastpack/
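The flavor of the sampling-based triangle counting mentioned above can be sketched with a toy wedge-sampling estimator; the graph, sample size, and error tolerance here are hypothetical, and this is not the paper's code:

```python
import random
from itertools import combinations

# Toy wedge sampling: a wedge is a length-2 path (a, c, b); the fraction
# of wedges that are "closed" (a adjacent to b), times the total wedge
# count, gives 3x the triangle count. Graph and sample size hypothetical.
random.seed(3)

n = 200
adj = {v: set() for v in range(n)}
for _ in range(1200):
    u, v = random.randrange(n), random.randrange(n)
    if u != v:
        adj[u].add(v)
        adj[v].add(u)

deg = {v: len(adj[v]) for v in adj}
wedges_at = {v: deg[v] * (deg[v] - 1) // 2 for v in adj}
total_wedges = sum(wedges_at.values())

# Sample a wedge: pick a center proportional to its wedge count, then two
# distinct neighbors; the wedge is closed iff its endpoints are adjacent.
centers = random.choices(list(adj), weights=[wedges_at[v] for v in adj],
                         k=4000)
closed = 0
for c in centers:
    a, b = random.sample(sorted(adj[c]), 2)
    closed += b in adj[a]

est_triangles = (closed / len(centers)) * total_wedges / 3

# Exact count for comparison (each triangle contains 3 closed wedges).
exact = sum(b in adj[a]
            for c in adj for a, b in combinations(sorted(adj[c]), 2)) // 3
```

The sample size controls the variance of the estimate independently of the graph size, which is what makes such estimators practical for graphs with billions of edges.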
Graph algorithms for autonomous data centers: When two or more data centers owned by separate entities collect similar types of (massive) data, it can be to their advantage to answer queries on the combined data. However, the data centers do not wish to simply give their data to the other entities: gathering that data cost time and money, and sharing could be inhibited by bandwidth constraints, regulations, or competition. One example of this setting is companies considering a merger who would like to determine whether their shared information is sufficiently valuable, without unnecessarily exchanging company-proprietary information. Another example is law enforcement and government agencies who wish to share data to “connect the dots” in counterterrorism activities but are prevented from direct sharing by law. We have modeled this problem with two different forms of minimum sharing: small open sharing, and secure multiparty sharing (SMP). In the SMP model, all entities learn the answer to the query, but each reveals no other information to the others under the honest-but-curious model. We have results for point-to-point connectivity in distributed graph settings; in the SMP setting, our protocol does not reveal node names. A first publication for this work appeared at the IEEE International Symposium on Parallel and Distributed Processing in 2015.
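The minimum-sharing idea can be illustrated with a toy sketch in which two parties decide s-t connectivity by exchanging only reachability facts about agreed boundary nodes; this is purely illustrative and omits both the cryptographic protections and the multi-round exchanges a real secure multiparty protocol would require:

```python
from collections import deque

def reachable(adj, sources):
    """Standard BFS from a set of source nodes over a directed graph."""
    seen, q = set(sources), deque(sources)
    while q:
        u = q.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                q.append(v)
    return seen

# Hypothetical private subgraphs: party 1 holds s, party 2 holds t,
# and 'b1' is a boundary node both parties agree to discuss.
adj1 = {'s': ['a'], 'a': ['b1']}
adj2 = {'b1': ['c'], 'c': ['t']}
boundary = {'b1'}

# Party 1 reveals only which boundary nodes s can reach; party 2 then
# checks whether t is reachable from those, revealing nothing else.
from1 = reachable(adj1, {'s'}) & boundary
connected = 't' in reachable(adj2, from1)
```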
The project on “Mathematical Analysis of Tensor Decompositions” is led by Tamara Kolda and sponsored by the DOE Office of Science Advanced Scientific Computing Research (ASCR) Applied Mathematics Program.
A tensor is a multidimensional or N-way array. Tensor decompositions are the higher-order analogue of matrix decompositions, and advances in the analysis and algorithms of tensor decompositions enable researchers in a variety of domains to tackle previously intractable problems. Our ultimate goal is to tackle "big N" engineering and science problems by providing effective tools for higher-order modeling and compression of large multidimensional data and multidimensional operators. Of particular interest to the DOE are compressed representations of expensive operators used in physics simulations, quantum information systems, stochastic partial differential equations, co-design, etc. Our goal is to overcome impediments, especially computational barriers, to applying tensor decompositions to real-world problems.
We will focus specifically on decompositions for symmetric and/or sparse data and the application of tensor decompositions to problems of DOE interest. Building models is only part of the solution. We will complement this work with research into the selection of appropriate models, fitting functions, and parameters as well as the sensitivity of the resulting models. This fundamental work will advance DOE's remarkable simulation and modeling abilities by enabling more sophisticated and compressed data and operator surrogates. The work has contributed new methods and algorithms to Sandia’s Tensor Toolbox for MATLAB, a package for working with tensors, especially sparse tensors prevalent in many data mining applications.
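As a minimal illustration of what a tensor decomposition computes, the following pure-Python sketch fits a rank-1 CP (canonical polyadic) model to a small 3-way tensor by alternating least squares; real tools such as the Tensor Toolbox handle general ranks, sparsity, and scale:

```python
import random

# Rank-1 CP sketch: approximate T[i][j][k] by a[i] * b[j] * c[k], cycling
# through closed-form least-squares updates of each factor in turn.
# The tensor here is synthetic (exactly rank 1), so the fit is exact.
random.seed(4)

I, J, K = 4, 5, 6
a_true = [random.uniform(0.5, 1.5) for _ in range(I)]
b_true = [random.uniform(0.5, 1.5) for _ in range(J)]
c_true = [random.uniform(0.5, 1.5) for _ in range(K)]
T = [[[a_true[i] * b_true[j] * c_true[k] for k in range(K)]
      for j in range(J)] for i in range(I)]

a, b, c = [1.0] * I, [1.0] * J, [1.0] * K
for _ in range(25):
    # Each factor's least-squares update has a closed form at rank 1.
    nb, nc = sum(x * x for x in b), sum(x * x for x in c)
    a = [sum(T[i][j][k] * b[j] * c[k] for j in range(J) for k in range(K))
         / (nb * nc) for i in range(I)]
    na, nc = sum(x * x for x in a), sum(x * x for x in c)
    b = [sum(T[i][j][k] * a[i] * c[k] for i in range(I) for k in range(K))
         / (na * nc) for j in range(J)]
    na, nb = sum(x * x for x in a), sum(x * x for x in b)
    c = [sum(T[i][j][k] * a[i] * b[j] for i in range(I) for j in range(J))
         / (na * nb) for k in range(K)]

err = max(abs(T[i][j][k] - a[i] * b[j] * c[k])
          for i in range(I) for j in range(J) for k in range(K))
```

The compression angle is visible even here: the I*J*K = 120 tensor entries are represented by I + J + K = 15 factor values, and the same principle scales to the operator surrogates described above.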
Many applications in the data sciences require the analysis of very large datasets, identifying important groups, structure, and trends in the data. Relational data, which describes the relationships between objects (often relationships between more than two objects), is commonly used in these analyses.
Graphs are common representations of relational data, in which complex, non-pairwise relationships are translated into multiple pairwise relationships (edges), and they have frequently been used for analyzing relational data. Graph analysis, however, often suffers from information loss because complex, non-pairwise relationships are represented by collections of pairwise edges. Relationships among multiple objects may be better modeled by hypergraphs, generalizations of graphs in which edges (or hyperedges) are general sets of one or more vertices (vertices still represent objects).
At Sandia National Laboratories, we have been researching how hypergraphs can be used to better model relational data. In the context of spectral clustering, we have shown that there are significant advantages to modeling relational data with hypergraphs rather than graphs. We have demonstrated that hypergraph-based algorithms can lead to significant computational improvements in terms of floating-point operations and runtime. In general, we also found that the clustering quality of the hypergraph-based models was better than that of the graph models (producing results more similar to our ground truth). In future work, we plan to develop better theoretical models and extend this work to other data science algorithms, such as eigencentrality and information propagation.
Additional information on this work of modeling relational data and hypergraphs can be found in the following paper:
- M.M. Wolf, A.M. Klinvex and D.M. Dunlavy, "Advantages to modeling relational data using hypergraphs versus graphs," 2016 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, 2016, pp. 1-7.
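The information loss discussed above can be seen in a tiny example: two different hypergraphs, one with a single 3-way relationship and one with three separate pairwise relationships, collapse to the same graph under clique expansion (the hyperedges are hypothetical):

```python
from itertools import combinations

# Clique expansion: replace each hyperedge by all pairwise edges among
# its vertices. Distinct hypergraphs can map to the same graph, which is
# exactly the information loss a hypergraph model avoids.
def clique_expansion(hyperedges):
    return {frozenset(p)
            for e in hyperedges for p in combinations(sorted(e), 2)}

h1 = [{1, 2, 3}]                   # one genuine 3-way relationship
h2 = [{1, 2}, {1, 3}, {2, 3}]      # three independent pairwise ones

same_graph = clique_expansion(h1) == clique_expansion(h2)
```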
Physics-based simulation has become indispensable in national-security applications ranging from weapons design to critical-infrastructure resilience. However, as simulation plays an increasingly important role in analysis, design, and decision making, greater demands are being placed on model fidelity. This additional fidelity leads to extreme-scale nonlinear dynamical-system models whose simulation can consume weeks on thousands of computing cores. Thus, a ‘computational barrier’ arises that precludes truly high-fidelity models from being used in important real-time applications (e.g., control) and many-query scenarios (e.g., uncertainty quantification).
Sandia’s effort in reduced-order modeling aims to break this computational barrier by leveraging simulation data to construct low-dimensional approximations of these high-fidelity models. These techniques—which combine concepts from machine learning, computational mechanics, and high-performance computing—lead to orders-of-magnitude speedups, preserve important dynamical-system properties, and are equipped with statistical error models. These techniques are currently being used to simulate the response of national-security systems in normal fluid and abnormal thermal environments.
Related publications:
- K. Carlberg, Y. Choi, and S. Sargsyan. "Conservative model reduction for finite-volume models," arXiv e-print, 1711.11550 (2017).
- K. Carlberg, M. Barone, and H. Antil. “Galerkin v. least-squares Petrov–Galerkin projection in nonlinear model reduction,” Journal of Computational Physics, Vol. 330, p. 693–734 (2017).
- M. Drohmann and K. Carlberg. “The ROMES method for statistical modeling of reduced-order-model error,” SIAM/ASA Journal on Uncertainty Quantification, Vol. 3, No. 1, p. 116–145 (2015).
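A hedged sketch of the projection idea behind these methods: a hypothetical linear full-order model x' = Ax whose dynamics happen to live in a two-dimensional subspace, so a Galerkin-projected reduced model with two unknowns reproduces the 30-dimensional simulation (real reduced-order models build the basis from simulation snapshots and handle nonlinear dynamics):

```python
import random, math

# Galerkin projection sketch: with orthonormal basis V, approximate
# x = V z and evolve z' = (V^T A V) z. Here A is constructed so its
# dynamics lie exactly in span(V), making the reduced model exact.
random.seed(5)
n, r, dt, steps = 30, 2, 0.01, 200

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v)))
            for i in range(len(M))]

# Orthonormal basis vectors V[0], V[1] via Gram-Schmidt.
V = []
for _ in range(r):
    w = [random.gauss(0, 1) for _ in range(n)]
    for u in V:
        d = sum(a * b for a, b in zip(w, u))
        w = [a - d * b for a, b in zip(w, u)]
    nw = math.sqrt(sum(a * a for a in w))
    V.append([a / nw for a in w])

S = [[-1.0, 0.5], [-0.5, -2.0]]    # stable 2x2 reduced dynamics
# Full operator A = V S V^T, so V^T A V = S.
A = [[sum(V[p][i] * S[p][q] * V[q][j]
          for p in range(r) for q in range(r))
      for j in range(n)] for i in range(n)]

z0 = [1.0, -1.0]
x0 = [sum(V[p][i] * z0[p] for p in range(r)) for i in range(n)]

# Full-order simulation: n equations, explicit Euler.
xf = x0[:]
for _ in range(steps):
    ax = matvec(A, xf)
    xf = [a + dt * b for a, b in zip(xf, ax)]

# Reduced-order simulation: only r = 2 equations, z' = S z.
z = z0[:]
for _ in range(steps):
    sz = matvec(S, z)
    z = [a + dt * b for a, b in zip(z, sz)]
xr = [sum(V[p][i] * z[p] for p in range(r)) for i in range(n)]

err = max(abs(a - b) for a, b in zip(xf, xr))
```

Each reduced timestep costs O(r^2) instead of O(n^2), which is where the orders-of-magnitude speedups come from when n is in the millions and r stays small.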
Post-Moore's-law scaling is increasingly pushing data analysis and other traditional post-processing tasks in situ, onto the primary compute platform, due to I/O constraints. As a result, data analysis techniques must adapt to machines with extreme concurrency, low communication bandwidth, and high memory latency, while operating within the time constraints prescribed by the simulation. We have used sublinear algorithms for in-situ analyses of scientific data sets and for enabling adaptive workflows, in which the composition and execution of computational and data manipulation steps depend dynamically on the evolving state of the simulation itself. Consider, for example, the frequency of data storage. Critical phases of the simulation should be captured with high frequency and high fidelity for post-analysis; however, we cannot afford the same frequency for the full simulation due to the high cost of data movement. We can instead look for triggers, indicators that the simulation is about to enter a critical phase, and adapt the workflow accordingly.
Sandia Report: Sublinear Algorithms for In-situ and In-transit Data Analysis at Exascale
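The trigger idea can be sketched as follows, with a hypothetical indicator signal and thresholds: output is sparse by default and becomes dense only while the indicator flags a critical phase.

```python
import math

# Toy storage trigger: a cheap per-timestep indicator (here a synthetic
# signal spiking around t = 50) raises the output frequency only during
# the critical phase. Signal, thresholds, and strides are hypothetical.
def indicator(t):
    """Stand-in for an inexpensive in-situ statistic."""
    return math.exp(-((t - 50) ** 2) / 20.0)

saved = []
for t in range(100):
    # Dense output (every 2 steps) when triggered, sparse otherwise.
    stride = 2 if indicator(t) > 0.1 else 25
    if t % stride == 0:
        saved.append(t)
```

Most saved timesteps cluster around the spike at t = 50, while the quiet phases of the run contribute only occasional checkpoints.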
Networks engaged in weapons proliferation, terrorism, cyber attacks, clandestine resale of dual-use imports, arms and drug smuggling, and other illicit activities are major threats to national security. These adversarial networks in turn rely on legitimate and illegitimate secondary networks for financial, supply chain, communication, recruiting, and fundraising activities. Complexity, dynamism, resilience and adaptability make adversarial networks extremely difficult to identify and disrupt. Often the only way an individual may be detected is through the networks they use, and the arrest of an individual may not remove the underlying threat if the networks remain intact. In short, our real adversaries are networks.
The Solution: Research and Develop Analysis Capabilities
Our goal, then, is to research and develop analysis capabilities that address adversarial networks. The full title of the project, “Network Discovery, Characterization, and Prediction,” conveys the scope and challenges involved. The discovery of adversarial networks is immensely difficult in its own right. A network may only reveal itself by the union of its parts. Individual relationships and activities may appear completely benign in isolation. Data relevant to network discovery may come from communications, financial transactions, human intelligence reports, shipment records, cyber events or many other sources. It may be geographically or temporally dispersed. Thus, very large and heterogeneous data collections must be analyzed collectively to detect networks. The characterization of networks requires methods for identifying likely relationships that are not captured in the data. The structure of a network conveys information about its purpose and the roles of its component individuals, organizations and activities. It can reveal command and control structure and critical components. Structure can also suggest likely evolution and intent, allowing prediction of the possible future shapes of the network.
In sum, we are creating at Sandia, in support of the nation, the unique capability to answer currently unanswerable questions.
NGC Project Goals
- Build upon considerable existing Sandia capabilities in scalable computing and advanced analysis algorithms
- Understand and elicit the needs of the intelligence community
- Do basic research on uncertainty in the intelligence domain
- Research and evaluate novel analysis algorithms
- Implement that research to address those needs to create a flexible, interactive capability for intelligence analysis on large datasets
NGC Project Team
The NGC project team includes research mathematicians, developers, experts in user elicitation, and enduser intelligence analysts, and so has all the needed talent to span the full LDRD spectrum—from Discover through Create to Prove.
NGC Project Funding
The NGC project was funded by Sandia’s Laboratory Directed Research & Development (LDRD) program as a Grand Challenge. The project was funded from FY08 through FY10.