The ever increasing need to ensure that code is reliably, efficiently and safely constructed has fueled the evolution of popular static binary code analysis tools. In identifying potential coding flaws in binaries, tools such as IDA Pro are used to disassemble the binaries into an opcode/assembly language format in support of manual static code analysis. Because of the highly manual and resource intensive nature involved with analyzing large binaries, the probability of overlooking potential coding irregularities and inefficiencies is quite high. In this paper, a light-weight, unsupervised data flow methodology is described which uses highly-correlated data flow graph (CDFGs) to identify coding irregularities such that analysis time and required computing resources are minimized. Such analysis accuracy and efficiency gains are achieved by using a combination of graph analysis and unsupervised machine learning techniques which allows an analyst to focus on the most statistically significant flow patterns while performing binary static code analysis.
Over the past few decades, software has become ubiquitous as it has been integrated into nearly every aspect of society, including household appliances, consumer electronics, industrial control systems, public utilities, government operations, and military systems. Consequently, many critical national security questions can no longer be answered convincingly without understanding software, including its purpose, its capabilities, its flaws, its communication, or how it processes and stores data. As software continues to become larger, more complex, and more widespread, our ability to answer important mission questions and reason about software in a timely way is falling behind. Today, to achieve such understanding of third-party software, we rely predominantly on the ability of reverse engineering experts to manually answer each particular mission question for every software system of interest. This approach often requires heroic human effort that nevertheless fails to meet current mission needs and will never scale to meet future needs. The result is an emerging crisis: a massive and expanding gap between the national security need to answer mission questions about software and our ability to do so. Sandia National Laboratories has established the Rapid Analysis of Mission Software Systems (RAMSeS) effort, a collaborative long-term effort aimed at dramatically improving our nation’s ability to answer mission questions about third-party software by growing an ecosystem of tools that augment the human reverse engineer through automation, interoperability, and reuse. Focusing on static analysis of binary programs, we are attempting to identify reusable software analysis components that advance our ability to reason about software, to automate useful aspects of the software analysis process, and to integrate new methodologies and capabilities into a working ecosystem of tools and experts. We aim to integrate existing tools where possible, adapt tools when modest modifications will enable them to interoperate, and implement missing capability when necessary. Although we do hope to automate a growing set of analysis tasks, we will approach this goal incrementally by assisting the human in an ever-widening range of tasks.
Vulnerability analysts protecting software lack adequate tools for understanding data flow in binaries. We present a case study in which we used human factors methods to develop a taxonomy for understanding data flow and the visual representations needed to support decision making for binary vulnerability analysis. Using an iterative process, we refined and evaluated the taxonomy by generating three different data flow visualizations for small binaries, trained an analyst to use these visualizations, and tested the utility of the visualizations for answering data flow questions. Throughout the process and with minimal training, analysts were able to use the visualizations to understand data flow related to security assessment. Our results indicate that the data flow taxonomy is promising as a mechanism for improving analyst understanding of data flow in binaries and for supporting efficient decision making during analysis.
National security missions require understanding third-party software binaries, a key element of which is reasoning about how data flows through a program. However, vulnerability analysts protecting software lack adequate tools for understanding data flow in binaries. To reduce the human time burden for these analysts, we used human factors methods in a rolling discovery process to derive user-centric visual representation requirements. We encountered three main challenges: analysis projects span weeks, analysis goals significantly affect approaches and required knowledge, and analyst tools, techniques, conventions, and prioritization are based on personal preference. To address these challenges, we initially focused our human factors methods on an attack surface characterization task. We generalized our results using a two-stage modified sorting task, creating requirements for a data flow visualization. We implemented these requirements partially in manual static visualizations, which we informally evaluated, and partially in automatically generated interactive visualizations, which have yet to be integrated into workflows for evaluation. Our observations and results indicate that 1) this data flow visualization has the potential to enable novel code navigation, information presentation, and information sharing, and 2) it is an excellent time to pursue research applying human factors methods to binary analysis workflows.