Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors
This report focuses on the two primary goals set forth in Sandia's TAFI effort, referred to here under the name Kebab. The first goal is to overlay a trajectory onto a large database of historical trajectories, all with sampling rates very different from that of the original track. We demonstrate a fast method to accomplish this, even for databases holding over a million tracks. The second goal is to demonstrate that these matched historical trajectories can then be used to make predictions about unknown qualities associated with the original trajectory. As part of this work, we also examine the problem of defining the qualities of a trajectory in a reproducible way.
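The report does not specify Kebab's matching algorithm, so the following is only a minimal sketch of one plausible approach: resample every track to a common number of points so that differing sampling rates become comparable, then index the flattened tracks with a KD-tree for fast nearest-neighbor lookup across a large database. All names (resample, build_index, match) and the resampling length are hypothetical.

```python
# Illustrative sketch only: the report does not describe Kebab's method.
# Tracks with different sampling rates are resampled to a fixed length,
# flattened into feature vectors, and indexed with a KD-tree.
import numpy as np
from scipy.spatial import cKDTree

N_POINTS = 32  # hypothetical common resampling length

def resample(track: np.ndarray, n: int = N_POINTS) -> np.ndarray:
    """Linearly resample a (T, 2) track to n evenly spaced points."""
    t_old = np.linspace(0.0, 1.0, len(track))
    t_new = np.linspace(0.0, 1.0, n)
    return np.column_stack([np.interp(t_new, t_old, track[:, d])
                            for d in range(track.shape[1])])

def build_index(tracks: list) -> cKDTree:
    """Flatten each resampled track into one feature vector and index it."""
    feats = np.stack([resample(t).ravel() for t in tracks])
    return cKDTree(feats)

def match(index: cKDTree, query: np.ndarray, k: int = 5):
    """Return distances and indices of the k most similar historical tracks."""
    return index.query(resample(query).ravel(), k=k)

# Example: index 1,000 random-walk tracks of varying length, then query.
rng = np.random.default_rng(0)
tracks = [rng.normal(size=(rng.integers(10, 200), 2)).cumsum(axis=0)
          for _ in range(1000)]
dist, idx = match(build_index(tracks), tracks[0])
print(idx)  # idx[0] should be 0, the query itself
```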
Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021
We describe efforts to generate synthetic malware samples with specified behaviors that can then be used to train a machine learning (ML) algorithm to detect behaviors in malware. The idea behind detecting behaviors is that many malware variants share a set of core behaviors, so the ability to detect those behaviors should improve the detection of novel malware. Empirically, however, the multi-label task of detecting behaviors is significantly more difficult than malware classification, achieving only 84% average accuracy across all behaviors, as opposed to the greater-than-95% multi-class or binary accuracy reported in many malware detection studies. One difficulty in identifying behaviors is that, while there are ample malware samples, most data sources do not include behavioral labels, so there is generally insufficient training data for behavior identification. Inspired by the success of generative models in improving image processing techniques, we examine and extend (1) a conditional variational auto-encoder and (2) a flow-based generative model for malware generation with behavior labels. Initial experiments indicate that synthetic data can capture behavioral information and increase the recall of behaviors in novel malware from 32% to 45% without increasing false positives, and to 52% with increased false positives.
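The abstract names a conditional variational auto-encoder as one of the two generative models. The PyTorch sketch below shows the general shape of a CVAE conditioned on a multi-hot behavior-label vector, not the paper's actual architecture; the feature, label, and latent dimensions and all layer sizes are assumptions.

```python
# Minimal conditional VAE sketch (generic, not the paper's architecture).
# x is a fixed-length malware feature vector; y is a multi-hot behavior
# label vector that conditions both the encoder and the decoder.
import torch
import torch.nn as nn

FEAT_DIM, LABEL_DIM, LATENT_DIM = 256, 12, 32  # assumed sizes

class CVAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(FEAT_DIM + LABEL_DIM, 128), nn.ReLU())
        self.mu = nn.Linear(128, LATENT_DIM)
        self.logvar = nn.Linear(128, LATENT_DIM)
        self.dec = nn.Sequential(
            nn.Linear(LATENT_DIM + LABEL_DIM, 128), nn.ReLU(),
            nn.Linear(128, FEAT_DIM), nn.Sigmoid())

    def forward(self, x, y):
        h = self.enc(torch.cat([x, y], dim=1))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return self.dec(torch.cat([z, y], dim=1)), mu, logvar

def loss_fn(x_hat, x, mu, logvar):
    """Standard VAE training loss: reconstruction + KL divergence."""
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kld

# Sampling synthetic features for a chosen behavior combination:
model = CVAE()
y = torch.zeros(1, LABEL_DIM); y[0, 3] = 1.0     # request behavior #3
z = torch.randn(1, LATENT_DIM)
synthetic = model.dec(torch.cat([z, y], dim=1))  # (1, FEAT_DIM) sample
```

Conditioning the decoder on y is what lets the trained model be asked for samples exhibiting a specific behavior combination, which is the property the synthetic training data needs.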
The well-known vulnerability of deep neural networks to adversarial samples has led to a rapid cycle of increasingly sophisticated attack algorithms and proposed defenses. While most contemporary defenses have been shown to be vulnerable to carefully configured attacks, methods based on gradient regularization and out-of-distribution detection have recently attracted much interest by demonstrating higher resilience to a broad range of attack algorithms. However, no study has yet investigated the effect of combining these techniques. In this paper, we consider the effect of Jacobian matrix regularization on the detectability of adversarial samples on the CIFAR-10 image benchmark dataset. We find that regularization has a significant effect on detectability and, in some cases, can make an undetectable attack on a baseline model detectable. In addition, we give evidence that regularization may mitigate the known weaknesses of detectors to high-confidence adversarial samples. The defenses we consider here are highly generalizable, and we believe they will be useful in further investigations into transferring machine learning robustness to other data domains.
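As an illustration of the kind of Jacobian matrix regularization discussed, the sketch below penalizes the squared Frobenius norm of the logit-input Jacobian, computed with one backward pass per class, which is feasible for CIFAR-10's ten classes. This is a generic formulation rather than necessarily the paper's exact regularizer; the model and the weight lam are placeholders.

```python
# Generic Jacobian regularization sketch: add lam * ||d logits / d x||_F^2
# to the classification loss. One backward pass per output class.
import torch
import torch.nn as nn

def jacobian_penalty(model: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Squared Frobenius norm of the logit-input Jacobian, mean over batch."""
    x = x.clone().requires_grad_(True)
    logits = model(x)                        # shape (B, C)
    penalty = x.new_zeros(())
    for c in range(logits.shape[1]):         # one backward pass per class
        grad, = torch.autograd.grad(logits[:, c].sum(), x, create_graph=True)
        penalty = penalty + grad.pow(2).sum()
    return penalty / x.shape[0]

def regularized_loss(model, x, y, lam=0.01):
    ce = nn.functional.cross_entropy(model(x), y)
    return ce + lam * jacobian_penalty(model, x)

# Example on a toy CIFAR-10-shaped batch with a placeholder linear model:
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
loss = regularized_loss(model, x, y)
loss.backward()  # gradients flow through the penalty via create_graph=True
```

Shrinking the input-output Jacobian flattens the model's local linearization, which is the mechanism by which gradient regularization is thought to blunt small-perturbation attacks.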