Publications

Results 1–25 of 51

BinSimDB: Benchmark Dataset Construction for Fine-Grained Binary Code Similarity Analysis

Lecture Notes of the Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering, LNICST

Zuo, Fei; Tompkins, Cody; Zeng, Qiang; Luo, Lannan; Choe, Yung R.; Rhee, Junghwan

Binary Code Similarity Analysis (BCSA) has a wide spectrum of applications, including plagiarism detection, vulnerability discovery, and malware analysis, thus drawing significant attention from the security community. However, conventional techniques often struggle to balance accuracy and scalability. To overcome these problems, a surge of deep learning-based work has recently been proposed. Unfortunately, many researchers still find it extremely difficult to conduct relevant studies or extend existing approaches. First, prior work typically relies on proprietary benchmarks without making the entire dataset publicly accessible. Consequently, large-scale, well-labeled datasets for binary code similarity analysis remain scarce. Moreover, previous work has primarily focused on comparison at the function level rather than exploring finer granularities. We therefore argue that the lack of a fine-grained dataset for BCSA leaves a critical gap in current research. To address these challenges, we construct BinSimDB, a benchmark dataset for fine-grained binary code similarity analysis that contains equivalent pairs of smaller binary code snippets, such as basic blocks. Specifically, we propose the BMerge and BPair algorithms to bridge the discrepancies between two binary code snippets caused by different optimization levels or platforms. Furthermore, we empirically study the properties of our dataset and evaluate its effectiveness for BCSA research. The experimental results demonstrate that BinSimDB significantly improves the performance of binary code similarity comparison.
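
The pairing idea can be illustrated with a small sketch. The BMerge and BPair algorithms are not reproduced in this listing, so the snippet below is a hypothetical stand-in: it builds one labeled pair of basic blocks compiled at different optimization levels and scores them with a simple mnemonic-overlap similarity, where a real BCSA model would use learned embeddings instead.

    # Hypothetical sketch only: not the BMerge/BPair algorithms.
    def normalize(block):
        """Keep only instruction mnemonics so register and immediate
        differences across optimization levels do not dominate."""
        return [insn.split()[0] for insn in block]

    def similarity(a, b):
        """Set overlap over mnemonics; a learned model would replace this."""
        sa, sb = set(a), set(b)
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

    # Made-up snippets of the same source block built at -O0 and -O2.
    block_o0 = ["mov eax, [ebp-4]", "add eax, 1", "mov [ebp-4], eax", "jmp .L2"]
    block_o2 = ["add eax, 1", "jmp .L2"]

    # One fine-grained dataset entry: (snippet_a, snippet_b, label),
    # where label 1 marks semantically equivalent blocks.
    pair = (normalize(block_o0), normalize(block_o2), 1)
    print(similarity(pair[0], pair[1]), "label:", pair[2])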

ProvSec: Open Cybersecurity System Provenance Analysis Benchmark Dataset with Labels

International Journal of Networked and Distributed Computing

Shrestha, Madhukar; Kim, Yonghyun; Oh, Jeehyun; Rhee, Junghwan (John); Choe, Yung R.; Zuo, Fei; Park, Myungah; Qian, Gang

System provenance forensic analysis has been studied by a large body of research. This area needs fine-grained data, such as system calls along with their event fields, to track dependencies among events. While prior work has proposed security datasets, we found that a dataset offering realistic attacks and the detail needed for high-quality provenance tracking is lacking. We created a new dataset of eleven vulnerable cases for system forensic analysis. It includes the full details of system calls, including syscall parameters. Realistic attack scenarios with real software vulnerabilities and exploits are used. For each case, we created two sets of benign and adversary scenarios that are manually labeled for supervised machine-learning analysis. In addition, we present an algorithm to improve data quality in system provenance forensic analysis. We demonstrate the details of the dataset events and the dependency analysis of our dataset cases.
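
As a rough illustration of the dependency analysis such a dataset supports, the sketch below builds read/write dependencies from a few made-up syscall events and backtracks from a suspicious object; the event fields and records are assumptions for illustration, not the ProvSec schema.

    # Event fields are hypothetical, not the ProvSec schema.
    from collections import defaultdict

    # Each event: (timestamp, pid, syscall, target object)
    events = [
        (1, 101, "read",  "/tmp/exploit.py"),
        (2, 101, "write", "/etc/passwd"),
        (3, 202, "read",  "/etc/passwd"),
    ]

    reads, deps = defaultdict(set), defaultdict(set)
    for ts, pid, syscall, obj in sorted(events):
        if syscall == "read":
            reads[pid].add(obj)
        elif syscall == "write":
            # The written object depends on everything the process has read.
            deps[obj] |= reads[pid]

    def backtrack(obj, seen=None):
        """Backward dependency analysis from a suspicious object."""
        if seen is None:
            seen = set()
        for src in deps.get(obj, ()):
            if src not in seen:
                seen.add(src)
                backtrack(src, seen)
        return seen

    print(backtrack("/etc/passwd"))  # -> {'/tmp/exploit.py'}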

ProvSec: Cybersecurity System Provenance Analysis Benchmark Dataset

Proceedings - 2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications, SERA 2023

Shrestha, Madhukar; Kim, Yonghyun; Oh, Jeehyun; Rhee, Junghwan; Choe, Yung R.; Zuo, Fei; Park, Myungah; Qian, Gang

System provenance forensic analysis has been studied by a large body of research. This area needs fine-grained data, such as system calls along with their event fields, to track dependencies among events. While prior work has proposed security datasets, we found that a dataset offering realistic attacks and the detail needed for provenance tracking is lacking. We created a new dataset of eleven vulnerable cases for system forensic analysis. It includes the full details of system calls, including syscall parameters. Realistic attack scenarios with real software vulnerabilities and exploits are used. We also created two sets of benign and adversary scenarios that are manually labeled for supervised machine-learning analysis. We demonstrate the details of the dataset events and the dependency analysis.

Toward the analysis of embedded firmware through automated re-hosting

RAID 2019 Proceedings - 22nd International Symposium on Research in Attacks, Intrusions and Defenses

Gustafson, Eric D.; Muench, Marius; Spensky, Chad; Redini, Nilo; Machiry, Aravind; Fratantonio, Yanick; Francillon, Aurelien; Balzarotti, Davide; Choe, Yung R.; Kruegel, Christopher; Vigna, Giovanni

The recent paradigm shift introduced by the Internet of Things (IoT) has brought embedded systems into focus as a target for both security analysts and malicious adversaries. Typified by their lack of standardized hardware, diverse software, and opaque functionality, IoT devices present unique challenges to security analysts due to the tight coupling between their firmware and the hardware for which it was designed. In order to take advantage of modern program analysis techniques, such as fuzzing or symbolic execution, with any kind of scale or depth, analysts must have the ability to execute firmware code in emulated (or virtualized) environments. However, these emulation environments are rarely available and are cumbersome to create through manual reverse engineering, greatly limiting the analysis of binary firmware. In this work, we explore the problem of firmware re-hosting, the process by which firmware is migrated from its original hardware environment into a virtualized one. We show that an approach capable of creating virtual, interactive environments in an automated manner is a necessity to enable firmware analysis at scale. We present the first proof-of-concept system aiming to achieve this goal, called PRETENDER, which uses observations of the interactions between the original hardware and the firmware to automatically create models of peripherals, and allows for the execution of the firmware in a fully-emulated environment. Unlike previous approaches, these models are interactive, stateful, and transferable, meaning they are designed to allow the program to receive and process new input, a requirement of many analyses. We demonstrate our approach on multiple hardware platforms and firmware samples, and show that the models are flexible enough to allow for virtualized code execution, the exploration of new code paths, and the identification of security vulnerabilities.
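
The record-and-replay idea behind automated peripheral modeling can be sketched with a toy memory-mapped register model; the class, address, and polling loop below are illustrative assumptions, not the PRETENDER implementation, whose models are interactive and stateful rather than pure replays.

    # Toy record-and-replay register model; not the PRETENDER implementation.
    class RecordedPeripheral:
        """Replays memory-mapped register reads observed on real hardware so
        the firmware can run in an emulator without the device present."""

        def __init__(self):
            self.read_trace = {}  # addr -> values observed on hardware
            self.read_pos = {}    # addr -> replay cursor

        def record_read(self, addr, value):
            self.read_trace.setdefault(addr, []).append(value)

        def emulated_read(self, addr):
            trace = self.read_trace.get(addr, [0])
            i = self.read_pos.get(addr, 0)
            # Repeat the final value once the recorded trace is exhausted.
            self.read_pos[addr] = min(i + 1, len(trace) - 1)
            return trace[i]

    # Recording phase (on hardware): a status register toggles to "ready".
    STATUS_REG = 0x4000C018  # hypothetical MMIO address
    dev = RecordedPeripheral()
    for value in (0x00, 0x00, 0x01):
        dev.record_read(STATUS_REG, value)

    # Replay phase (in the emulator): the firmware polls until the bit is set.
    while dev.emulated_read(STATUS_REG) != 0x01:
        pass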

Physically Unclonable Digital ID

Proceedings - 2015 IEEE 3rd International Conference on Mobile Services, MS 2015

Choi, Sung N.; Zage, David J.; Choe, Yung R.; Wasilow, Brent

The Center for Strategic and International Studies estimates the annual cost of cyber crime to be more than $400 billion. Most notable are the recent digital identity thefts that compromised millions of accounts. These attacks emphasize the security problems of using clonable static information. One possible solution is the use of a physical device known as a Physically Unclonable Function (PUF). PUFs can be used to create encryption keys, generate random numbers, or authenticate devices. While the concept shows promise, current PUF implementations are inherently problematic: they exhibit inconsistent behavior, are expensive, are susceptible to modeling attacks, and are permanent. Therefore, we propose a new solution by which an unclonable, dynamic digital identity is created between two communication endpoints such as mobile devices. This Physically Unclonable Digital ID (PUDID) is created by injecting a data scrambling PUF device at the data origin point that corresponds to a unique and matching descrambler/hardware authentication at the receiving end. This device is designed using macroscopic, intentional anomalies, making it inexpensive to produce. PUDID is resistant to cryptanalysis due to the separation of the challenge response pair and a series of hash functions. PUDID is also unique in that by combining the PUF device identity with a dynamic human identity, we can create true two-factor authentication. We also propose an alternative solution that eliminates the need for a PUF mechanism altogether by combining tamper resistant capabilities with a series of hash functions. This tamper resistant device, referred to as a Quasi-PUDID (Q-PUDID), modifies input data, using a black-box mechanism, in an unpredictable way. By mimicking PUF attributes, Q-PUDID is able to avoid traditional PUF challenges, thereby providing high-performing physical identity assurance with or without a low-performing PUF mechanism. Three different application scenarios with mobile devices for PUDID and Q-PUDID have been analyzed to show their unique advantages over traditional PUFs and outline the potential for placement in a host of applications.
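
A minimal sketch of the chained-hash construction described for Q-PUDID follows; the secret handling, round count, and verification step are assumptions made for illustration only, not the proposed device design.

    # Hash-chain sketch; secret handling and round count are assumptions.
    import hashlib
    import os

    DEVICE_SECRET = os.urandom(32)  # stands in for the tamper-resistant black box

    def scramble(challenge: bytes, rounds: int = 3) -> bytes:
        """Chain several keyed hash rounds so the challenge-to-response mapping
        is unpredictable without access to the device secret."""
        digest = challenge
        for _ in range(rounds):
            digest = hashlib.sha256(DEVICE_SECRET + digest).digest()
        return digest

    # A verifier holding the matching secret recomputes and compares.
    challenge = os.urandom(16)
    response = scramble(challenge)
    assert response == scramble(challenge)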

Finding bugs in source code using commonly available development metadata

8th Workshop on Cyber Security Experimentation and Test, CSET 2015

Cook, Devin; Choe, Yung R.; Hamilton, John A.

Developers and security analysts have long used static analysis to examine programs for defects and vulnerabilities. Generally, a static analysis tool is run on the source code for a given program, flagging areas of code that need to be further inspected by a human analyst. These tools tend to work fairly well; every year they find many important bugs. They are all the more impressive considering that they examine only the source code, which may be very complex. Now consider the amount of data available that these tools do not analyze. Many additional sources of information would prove useful for finding bugs in code, such as the history of bug reports, the history of all changes to the code, and information about committers. By leveraging all this additional data, it is possible to find more bugs with less user interaction, as well as to track useful metrics such as the number and type of defects injected by each committer. This paper provides a method for leveraging development metadata to find bugs that would otherwise be difficult to find using standard static analysis tools. We showcase two case studies that demonstrate the approach on both large and small software projects, uncovering new vulnerabilities in the cpython and Roundup open source projects.
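
As a rough sketch of metadata-driven bug hunting, the snippet below ranks files by how often they are touched in commits whose messages look like bug fixes; the commit records are made up, and in practice they would be mined from the project's version-control history rather than hard-coded.

    # Made-up commit records; real input would be mined from version control.
    import re
    from collections import Counter

    commits = [
        ("Fix buffer overflow in parser",           ["parser.c", "util.c"]),
        ("Add unit tests for scheduler",            ["sched_test.py"]),
        ("fix: null pointer dereference in parser", ["parser.c"]),
        ("Refactor logging",                        ["log.c"]),
    ]

    BUGFIX = re.compile(r"\b(fix|bug|cve|vuln)", re.IGNORECASE)

    hotspots = Counter()
    for message, files in commits:
        if BUGFIX.search(message):
            hotspots.update(files)  # files touched by fixes accumulate risk

    # Files with the most historical fixes are candidates for closer review
    # or more aggressive static analysis.
    for path, count in hotspots.most_common():
        print(f"{path}: {count} bug-fix commits")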
