Publications

7 Results

Search results

MalGen: Malware Generation with Specific Behaviors to Improve Machine Learning-based Detectors

Smith, Michael R.; Carbajal, Armida J.; Domschot, Eva D.; Johnson, Nicholas T.; Goyal, Akul A.; Lamb, Christopher L.; Lubars, Joseph L.; Kegelmeyer, William P.; Krishnakumar, Raga K.; Quynn, Sophie Q.; Ramyaa, Ramyaa; Verzi, Stephen J.; Zhou, Xin Z.

In recent years, infections and damage caused by malware have increased at exponential rates. At the same time, machine learning (ML) techniques have shown tremendous promise in many domains, often outperforming human efforts by learning from large amounts of data. Results in the open literature suggest that ML can provide similar results for malware detection, achieving greater than 99% classification accuracy [49]. However, the same detection rates have not been achieved in deployed settings. Malware is distinct from many other domains in which ML has shown success in that (1) it purposefully tries to hide, leading to noisy labels, and (2) its behavior is often similar to that of benign software, differing only in intent, among other complicating factors. This report details the reasons for the difficulty of detecting novel malware with ML methods and offers solutions to improve the detection of novel malware.

Malware Generation with Specific Behaviors to Improve Machine Learning-based Detection

Proceedings - 2021 IEEE International Conference on Big Data, Big Data 2021

Laros, James H.; Verzi, Stephen J.; Johnson, Nicholas T.; Khanna, Kanad K.; Zhou, Xin Z.; Quynn, Sophie Q.; Krishnakumar, Raga K.

We describe efforts to generate synthetic malware samples with specified behaviors, which can then be used to train a machine learning (ML) algorithm to detect behaviors in malware. The idea behind detecting behaviors is that a set of core behaviors is shared by many malware variants, so being able to detect those behaviors will improve the detection of novel malware. Empirically, however, the multi-label task of detecting behaviors is significantly more difficult than malware classification, achieving on average only 84% accuracy across all behaviors, as opposed to the greater than 95% multi-class or binary accuracy reported in many malware detection studies. One of the difficulties in identifying behaviors is that, while malware samples are ample, most data sources do not include behavioral labels, so there is generally insufficient training data for behavior identification. Inspired by the success of generative models in improving image processing techniques, we examine and extend 1) a conditional variational auto-encoder and 2) a flow-based generative model for malware generation with behavior labels. Initial experiments indicate that synthetic data is able to capture behavioral information and increase the recall of behaviors in novel malware from 32% to 45% without increasing false positives, and to 52% with increased false positives.
