Publications

25 Results

Software Verification Toolkit (SVT): Survey on Available Software Verification Tools and Future Direction

Davis, Nickolas A.; Berger, Taylor E.; McDonald, Arthur A.; Ingram, Joey; Foster, James D.; Sanchez, Katherine A.

Writing software is difficult. Writing complex, well-designed, well-tested, and functionally correct software is incredibly difficult. An entire field of study is devoted to the validation and verification of software to address this problem, and in this paper we analyze the landscape of currently available third-party software verification tools. We have divided our analysis into three subsections: formal methods, static analysis, and test generation. Formal verification is the most complex method for validating software correctness, but also the most thorough, as it establishes the mathematical validity of the source code. Static analysis generally relies on abstract syntax tree traversal to find faults such as memory leaks or stack overflows. Automatic test generation is similar in implementation to static analysis, but goes further by verifying the boundedness of function inputs and outputs against annotated or parsed criteria. The crux of this report is to analyze and describe the software tools that implement these techniques to validate and verify software. Pros and cons related to installation, utilization, and capabilities of the frameworks are described, and reproducible examples are provided with a focus on usability. The initial survey concluded that the most notable tools are Z3, Isabelle/HOL, and TLA+ for formal verification, and Infer, Frama-C, and SonarQube for static analysis. With these tools in mind, a final conjecture describes future avenues for utilizing these tools to develop a verification framework to assist in validating existing software at Sandia National Laboratories.
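The abstract's characterization of static analysis as abstract-syntax-tree traversal can be illustrated in a few lines. This toy checker, written with Python's stdlib `ast` module, flags bare `except:` handlers; the rule and the `find_bare_excepts` name are illustrative inventions, not taken from the paper or from tools such as Infer or Frama-C:

```python
import ast

def find_bare_excepts(source: str) -> list:
    """Walk the AST of `source` and return the line numbers of bare
    `except:` handlers -- a pattern commonly flagged by static analyzers
    because it silently swallows all exceptions."""
    tree = ast.parse(source)
    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ExceptHandler) and node.type is None:
            findings.append(node.lineno)
    return findings

sample = """
try:
    risky()
except:
    pass
"""
print(find_bare_excepts(sample))  # -> [4]
```

Real static analyzers combine many such AST rules with data-flow analysis, but the traversal skeleton is the same.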

Mind the Gap: On Bridging the Semantic Gap between Machine Learning and Malware Analysis

AISec 2020 - Proceedings of the 13th ACM Workshop on Artificial Intelligence and Security

Smith, Michael R.; Johnson, Nicholas T.; Ingram, Joey; Carbajal, Armida J.; Haus, Bridget I.; Domschot, Eva; Ramyaa, Ramyaa; Lamb, Christopher L.; Verzi, Stephen J.; Kegelmeyer, William P.

Machine learning (ML) techniques are being used to detect increasing amounts of malware and variants. Despite successful applications of ML, we hypothesize that the full potential of ML is not realized in malware analysis (MA) due to a semantic gap between the ML and MA communities, as demonstrated in the data that is used. Due in part to the available data, ML has primarily focused on detection whereas MA is also interested in identifying behaviors. We review existing open-source malware datasets used in ML and find a lack of behavioral information that could facilitate stronger impact by ML in MA. As a first step in bridging this gap, we label existing data with behavioral information using open-source MA reports: 1) altering the analysis from identifying malware to identifying behaviors, 2) aligning ML better with MA, and 3) allowing ML models to generalize to novel malware in a zero/few-shot learning manner. We classify the behavior of a malware family not seen during training using transfer learning from a state-of-the-art model for malware family classification and achieve 57%-84% accuracy on behavioral identification but fail to outperform the baseline set by a majority class predictor. This highlights opportunities for improvement on this task related to the data representation, the need for malware specific ML techniques, and a larger training set of malware samples labeled with behaviors.
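The majority-class predictor that the transfer-learned model failed to beat is a standard sanity baseline; a minimal sketch of computing it (the behavior labels here are invented examples, not the paper's taxonomy):

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training label --
    the floor any behavior classifier must beat to add value."""
    majority = Counter(train_labels).most_common(1)[0][0]
    correct = sum(1 for y in test_labels if y == majority)
    return correct / len(test_labels)

train = ["persistence", "persistence", "exfiltration", "persistence"]
test = ["persistence", "exfiltration", "persistence", "persistence"]
print(majority_baseline_accuracy(train, test))  # -> 0.75
```

When behavior labels are heavily imbalanced, this baseline can be high, which is why a 57%-84% accurate model can still fail to outperform it.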

Tracking Cyber Adversaries with Adaptive Indicators of Compromise

Proceedings - 2017 International Conference on Computational Science and Computational Intelligence, CSCI 2017

Doak, Justin E.; Ingram, Joey; Mulder, Samuel A.; Naegle, John H.; Cox, Jonathan A.; Aimone, James B.; Dixon, Kevin R.; James, Conrad D.; Follett, David R.

A forensics investigation after a breach often uncovers network and host indicators of compromise (IOCs) that can be deployed to sensors to allow early detection of the adversary in the future. Over time, the adversary will change tactics, techniques, and procedures (TTPs), which will also change the data generated. If the IOCs are not kept up-to-date with the adversary's new TTPs, the adversary will no longer be detected once all of the IOCs become invalid. Tracking the Known (TTK) is the problem of keeping IOCs, in this case regular expressions (regexes), up-to-date with a dynamic adversary. Our framework solves the TTK problem in an automated, cyclic fashion to bracket a previously discovered adversary. This tracking is accomplished through a data-driven approach of self-adapting a given model based on its own detection capabilities. In our initial experiments, we found that the true positive rate (TPR) of the adaptive solution degrades much less significantly over time than the naïve solution, suggesting that self-updating the model allows the continued detection of positives (i.e., adversaries). The cost for this performance is in the false positive rate (FPR), which increases over time for the adaptive solution, but remains constant for the naïve solution. However, the difference in overall detection performance, as measured by the area under the curve (AUC), between the two methods is negligible. This result suggests that self-updating the model over time should be done in practice to continue to detect known, evolving adversaries.
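The core idea of a regex IOC that updates itself as the adversary's TTPs drift can be sketched with Python's `re` module. The `AdaptiveIOC` class and the sample domains below are hypothetical illustrations; the paper's framework adapts the model from its own detection output in an automated cycle, not via this simple append rule:

```python
import re

class AdaptiveIOC:
    """Toy self-updating indicator: a regex alternation over known
    malicious tokens, extended when a confirmed-malicious sample
    slips past the current pattern."""

    def __init__(self, patterns):
        self.patterns = list(patterns)

    def matches(self, sample: str) -> bool:
        # Alternation over all known tokens.
        return re.search("|".join(self.patterns), sample) is not None

    def adapt(self, confirmed_sample: str):
        # If a confirmed-malicious sample evaded detection,
        # fold its literal form into the indicator.
        if not self.matches(confirmed_sample):
            self.patterns.append(re.escape(confirmed_sample))

ioc = AdaptiveIOC([r"evil\.example\.com"])
print(ioc.matches("GET evil.example.com/payload"))  # True
ioc.adapt("evil-cdn.example.net")                   # adversary changed TTPs
print(ioc.matches("POST evil-cdn.example.net/a"))   # True after update
```

Appending literals like this keeps the TPR up as the adversary evolves, at the cost of a broader pattern that can raise the FPR over time, which mirrors the trade-off the abstract reports.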

Temporal Cyber Attack Detection

Ingram, Joey; Draelos, Timothy J.; Sahakian, Meghan A.; Doak, Justin E.

Rigorous characterization of the performance and generalization ability of cyber defense systems is extremely difficult, making it hard to gauge uncertainty, and thus, confidence. This difficulty largely stems from a lack of labeled attack data that fully explores the potential adversarial space. Currently, performance of cyber defense systems is typically evaluated in a qualitative manner by manually inspecting the results of the system on live data and adjusting as needed. Additionally, machine learning has shown promise in deriving models that automatically learn indicators of compromise that are more robust than analyst-derived detectors. However, to generate these models, most algorithms require large amounts of labeled data (i.e., examples of attacks). Algorithms that do not require annotated data to derive models are similarly at a disadvantage, because labeled data is still necessary when evaluating performance. In this work, we explore the use of temporal generative models to learn cyber attack graph representations and automatically generate data for experimentation and evaluation. Training and evaluating cyber systems and machine learning models requires significant, annotated data, which is typically collected and labeled by hand for one-off experiments. Automatically generating such data helps derive/evaluate detection models and ensures reproducibility of results. Experimentally, we demonstrate the efficacy of generative sequence analysis techniques on learning the structure of attack graphs, based on a realistic example. These derived models can then be used to generate more data. Additionally, we provide a roadmap for future research efforts in this area.
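A first-order Markov chain is a far simpler stand-in for the temporal generative models the abstract describes, but it shows the same fit-then-generate workflow on attack-step sequences (the step names are invented for illustration):

```python
import random
from collections import defaultdict

def fit_transitions(sequences):
    """Estimate first-order transition counts from observed
    attack-step sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def generate(counts, start, steps, rng):
    """Sample a synthetic sequence by walking the learned transitions."""
    seq, state = [start], start
    for _ in range(steps):
        nxt = rng.choices(list(counts[state]),
                          weights=list(counts[state].values()))[0]
        seq.append(nxt)
        state = nxt
    return seq

observed = [["recon", "exploit", "install", "exfil"],
            ["recon", "exploit", "exfil"]]
counts = fit_transitions(observed)
synthetic = generate(counts, "recon", 2, random.Random(0))
print(synthetic)  # e.g. ['recon', 'exploit', ...]
```

The generated sequences can then serve as labeled attack data for training and evaluating detectors, which is the reproducibility benefit the abstract emphasizes.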

Statistical Techniques For Real-time Anomaly Detection Using Spark Over Multi-source VMware Performance Data

Sandia journal manuscript; not yet accepted for publication

Solaimani, Mohiuddin S.; Iftekhar, Mohammed I.; Khan, Latifur K.; Thuraisingham, Bhavani T.; Ingram, Joey

Anomaly detection refers to the identification of an irregular or unusual pattern which deviates from what is standard, normal, or expected. Such deviated patterns typically correspond to samples of interest and are assigned different labels in different domains, such as outliers, anomalies, exceptions, or malware. Detecting anomalies in fast, voluminous streams of data is a formidable challenge. This paper presents a novel, generic, real-time distributed anomaly detection framework for heterogeneous streaming data where anomalies appear as a group. We have developed a distributed statistical approach to build a model and later use it to detect anomalies. As a case study, we investigate group anomaly detection for a VMware-based cloud data center, which maintains a large number of virtual machines (VMs). We have built our framework using Apache Spark to get higher throughput and lower data processing time on streaming data. We have developed a window-based statistical anomaly detection technique to detect anomalies that appear sporadically. We then relaxed this constraint, with higher accuracy, by implementing a cluster-based technique to detect sporadic and continuous anomalies. We conclude that our cluster-based technique outperforms other statistical techniques with higher accuracy and lower processing time.
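A window-based statistical detector of the kind the abstract describes can be sketched in a few lines. This single-machine toy (the threshold rule and sample values are my own; it omits the Spark distribution and the cluster-based refinement) flags points far from the trailing window's mean:

```python
from collections import deque
from statistics import mean, stdev

def window_anomalies(stream, window=5, k=3.0):
    """Flag indices whose value is more than k standard deviations
    from the mean of the trailing window of observations."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(stream):
        if len(history) == window:
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(x - mu) > k * sigma:
                flagged.append(i)
        history.append(x)
    return flagged

# Hypothetical per-VM CPU-load samples with one spike.
cpu_load = [10, 11, 9, 10, 12, 11, 10, 95, 10, 11]
print(window_anomalies(cpu_load))  # -> [7]
```

Note that once the spike enters the window it inflates the standard deviation and masks nearby points, one reason a cluster-based technique can improve on a pure windowed threshold.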

Streaming malware classification in the presence of concept drift and class imbalance

Proceedings - 2013 12th International Conference on Machine Learning and Applications, ICMLA 2013

Kegelmeyer, William P.; Chiang, Ken C.; Ingram, Joey

Malware, or malicious software, is capable of performing any action or command that can be expressed in code and is typically used for illicit activities, such as e-mail spamming, corporate espionage, and identity theft. Most organizations rely on anti-virus software to identify malware, which typically utilizes signatures that can only identify previously-seen malware instances. We consider the detection of malware executables that are downloaded in streaming network data as a supervised machine learning problem. Using malware data collected over multiple years, we characterize the effect of concept drift and class imbalance on batch and streaming decision tree ensembles. In particular, we illustrate a surprising vulnerability generated by precisely the aspect of streaming methods that seemed most likely to help them, when compared to batch methods. © 2013 IEEE.
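Streaming classifiers like those studied here are typically evaluated prequentially (test-then-train): each arriving sample is first predicted, then its true label is revealed for learning. A toy majority-label predictor with a sliding window illustrates the loop and how a drift point hurts accuracy until the window catches up (the paper uses decision tree ensembles, not this stand-in):

```python
from collections import Counter

def prequential_accuracy(stream, window=100):
    """Test-then-train evaluation with a windowed majority-label
    predictor; the window lets the model track concept drift."""
    recent, correct = [], 0
    for label in stream:
        if recent:
            pred = Counter(recent).most_common(1)[0][0]
            correct += (pred == label)
        recent.append(label)
        recent = recent[-window:]   # forget old concept
    return correct / (len(stream) - 1)

# Benign-dominated traffic, then drift toward malware.
stream = ["benign"] * 6 + ["malware"] * 4
print(round(prequential_accuracy(stream, window=3), 2))  # -> 0.78
```

With severe class imbalance, the window rarely contains minority-class (malware) samples, which is one way imbalance and drift interact badly for streaming learners.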

GPU accelerated microarray data analysis using random matrix theory

Proceedings - 2011 IEEE International Conference on High Performance Computing and Communications (HPCC 2011), including the FTDCS 2011, UIC 2011, and ATC 2011 workshops

Ingram, Joey; Zhu, Mengxia

Recent advances in high-throughput genomic technology, such as microarrays, usually produce vast amounts of gene expression data under many experimental conditions. Analyzing such data is often difficult due to the colossal data size and the intensive computing involved. In addition, many existing analysis tools often require the inference of experienced analysts and subjective judgments. In this paper, we developed a parallel approach based on Random Matrix Theory (RMT) to generate transcription networks using Graphics Processing Units (GPUs). Recently, GPUs have been redesigned into a more unified architecture, which has allowed them to be used more readily in general purpose computing. This architectural advancement has resulted in GPUs becoming easily programmable parallel processors with performance that is vastly superior to CPUs. Our GPU-based approach makes automated microarray data analysis faster, more accurate, and noise resistant without engaging remote high performance computing facilities, such as a cluster or supercomputer. The implementation moves some computationally intensive tasks, such as the calculations of Pearson correlation coefficients, tridiagonal reduction, back transformation of eigenvectors, and orthogonal rotation, to the GPU. Experimental results on real microarray datasets show that our GPU implementation runs faster than a CPU version using highly optimized LAPACK routines. The runtime speedup gets higher as the number of genes and sample points in a microarray dataset increases. © 2011 IEEE.
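Pearson correlation between expression profiles is one of the steps the paper offloads to the GPU; a plain-Python version shows the per-pair computation that the GPU parallelizes across all gene pairs (the expression values below are invented):

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles.
    Building a transcription network requires this for every gene pair,
    which is why the all-pairs computation suits GPU parallelism."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

gene_a = [2.1, 3.5, 4.0, 5.2]   # expression across four conditions
gene_b = [1.0, 2.2, 2.9, 4.1]   # rises with gene_a
print(round(pearson(gene_a, gene_b), 3))  # -> 0.997
```

For m genes this step is O(m^2) pairwise computations, each independent, so the runtime speedup from parallel hardware naturally grows with the number of genes and sample points, consistent with the scaling the abstract reports.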
