Reverse engineering (RE) analysts struggle to answer critical questions about the safety of binary code accurately and promptly, and their supporting program analysis tools are sometimes simply wrong. These tools must approximate in order to provide any information at all, which introduces uncertainty into their results, and those uncertainties chain from analysis to analysis. We hypothesize that exposing the sources, impacts, and control of uncertainty to human binary analysts will allow them to approach their hardest problems with high-powered analytic techniques that they know when to trust. Combining expertise in binary analysis algorithms, human cognition, uncertainty quantification, verification and validation, and visualization, we pursue research that should benefit binary software analysis efforts across the board. We find a strong analogy between RE and exploratory data analysis (EDA); we begin to characterize the sources and types of uncertainty found in RE practice (both in the process and in supporting analyses); we explore a domain-specific focus on uncertainty in pointer analysis, showing that more precise models do help analysts answer small information flow questions faster and more accurately; and we test a general population with domain-general Sudoku problems, showing that adding "knobs" to an analysis does not significantly slow performance. This document describes our explorations of uncertainty in binary analysis.
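To illustrate what a precision "knob" on a pointer analysis can look like, the following is a minimal Python sketch of a flow-insensitive points-to analysis in which field sensitivity is a toggle. It is entirely illustrative, written under our own assumptions; it is not the tooling or study material described above.

```python
# Sketch of a points-to "knob": the same worklist analysis run
# field-insensitively (one abstract cell per object) or field-sensitively
# (one cell per object/offset pair). All names here are illustrative.

def points_to(statements, field_sensitive=True):
    """statements: tuples over abstract pointers:
       ('alloc', p, obj)      p = &obj
       ('copy',  p, q)        p = q
       ('field', p, q, off)   p = &q->off  (address of a field)
    Returns a map from pointer to a set of abstract memory cells."""
    pts = {}

    def cell(obj, off):
        # The knob: collapse all offsets when field-insensitive.
        return (obj, off) if field_sensitive else (obj, '*')

    changed = True
    while changed:  # iterate to a fixed point
        changed = False
        for stmt in statements:
            if stmt[0] == 'alloc':
                _, p, obj = stmt
                new = {cell(obj, 0)}
            elif stmt[0] == 'copy':
                _, p, q = stmt
                new = set(pts.get(q, ()))
            else:  # 'field'
                _, p, q, off = stmt
                new = {cell(obj, off) for (obj, _) in pts.get(q, ())}
            if not new <= pts.get(p, set()):
                pts.setdefault(p, set()).update(new)
                changed = True
    return pts

prog = [('alloc', 'q', 'S'),        # q = &S
        ('field', 'p1', 'q', 8),    # p1 = &q->f8
        ('field', 'p2', 'q', 16)]   # p2 = &q->f16
precise = points_to(prog, field_sensitive=True)   # p1, p2: distinct cells
coarse = points_to(prog, field_sensitive=False)   # p1, p2: may alias
```

With the knob on, pointers derived from distinct struct fields resolve to distinct cells, so a would-be information flow between them can be ruled out; with the knob off, the analysis conservatively reports that they may alias.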
Over the past few decades, software has become ubiquitous as it has been integrated into nearly every aspect of society, including household appliances, consumer electronics, industrial control systems, public utilities, government operations, and military systems. Consequently, many critical national security questions can no longer be answered convincingly without understanding software: its purpose, its capabilities, its flaws, how it communicates, and how it processes and stores data. As software continues to grow larger, more complex, and more widespread, our ability to answer important mission questions and reason about software in a timely way is falling behind. Today, to achieve such understanding of third-party software, we rely predominantly on reverse engineering experts manually answering each particular mission question for every software system of interest. This approach often requires heroic human effort that nevertheless fails to meet current mission needs and will never scale to meet future ones. The result is an emerging crisis: a massive and expanding gap between the national security need to answer mission questions about software and our ability to do so. Sandia National Laboratories has established Rapid Analysis of Mission Software Systems (RAMSeS), a collaborative long-term effort aimed at dramatically improving our nation's ability to answer mission questions about third-party software by growing an ecosystem of tools that augment the human reverse engineer through automation, interoperability, and reuse. Focusing on static analysis of binary programs, we are attempting to identify reusable software analysis components that advance our ability to reason about software, to automate useful aspects of the software analysis process, and to integrate new methodologies and capabilities into a working ecosystem of tools and experts. We aim to integrate existing tools where possible, adapt tools when modest modifications will enable them to interoperate, and implement missing capabilities when necessary. Although we hope to automate a growing set of analysis tasks, we will approach this goal incrementally by assisting the human in an ever-widening range of tasks.
Software is becoming increasingly important in nearly every aspect of global society, and therefore in nearly every aspect of national security as well. While recent years have seen major advances in formally proving properties of program source code during development, such approaches remain in the minority among development teams, and the vast majority of code in this software explosion is produced without such proofs. In these cases, the source code must be analyzed to establish whether the properties of interest hold. Because of the volume of software being produced, automated approaches to software analysis are necessary to meet the need. However, this software boom is not occurring in just one language. A wide range of languages are of interest in national security spaces, including well-known languages such as C, C++, Python, Java, and JavaScript, and recent years have produced many new ones, including Nim (2008), Go (2009), Rust (2010), Dart (2011), Kotlin (2011), Elixir (2011), Red (2011), Julia (2012), TypeScript (2012), Swift (2014), Hack (2014), Crystal (2014), Ballerina (2017), and more. Historically, automated software analyses have been implemented as tools that intermingle the analysis question at hand with target-language dependencies throughout their code, making reuse of components for different analysis questions or different target languages impractical. This project explores how mission-relevant, static software analyses can be designed and constructed in a language-independent fashion, dramatically increasing the reusability of software analysis investments.
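As a concrete illustration of that decoupling, the following Python sketch writes one analysis against a tiny shared intermediate representation (IR), so that per-language frontends only need to lower source programs into the IR. The IR, names, and analysis here are invented for illustration; they are not this project's actual design.

```python
# Sketch: a language-independent analysis over a minimal shared IR.
from dataclasses import dataclass

@dataclass(frozen=True)
class Call:
    """Shared IR: just enough structure for this one example analysis."""
    caller: str
    callee: str

def reaches_dangerous(ir, entry, dangerous):
    """Is any function in `dangerous` reachable from `entry` in the
    call graph? The analysis sees only the IR, never source syntax."""
    graph = {}
    for call in ir:
        graph.setdefault(call.caller, set()).add(call.callee)
    seen, work = set(), [entry]
    while work:
        fn = work.pop()
        if fn in dangerous:
            return True
        if fn in seen:
            continue
        seen.add(fn)
        work.extend(graph.get(fn, ()))
    return False

# Any per-language frontend (C, Go, Rust, ...) only needs to emit
# [Call, ...]; the analysis above is reused unchanged across languages.
ir = [Call('main', 'parse'), Call('parse', 'exec_cmd')]
print(reaches_dangerous(ir, 'main', {'exec_cmd'}))  # True
```

Supporting a new language then means writing one more frontend that emits the shared IR, not reimplementing every analysis question.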
Vulnerability analysts protecting software lack adequate tools for understanding data flow in binaries. We present a case study in which we used human factors methods to develop a taxonomy for understanding data flow and the visual representations needed to support decision making in binary vulnerability analysis. Using an iterative process, we refined and evaluated the taxonomy by generating three different data flow visualizations for small binaries, training an analyst to use these visualizations, and testing the utility of the visualizations for answering data flow questions. Throughout the process, and with minimal training, analysts were able to use the visualizations to understand data flow related to security assessment. Our results indicate that the data flow taxonomy is a promising mechanism for improving analyst understanding of data flow in binaries and for supporting efficient decision making during analysis.
National security missions require understanding third-party software binaries, a key element of which is reasoning about how data flows through a program. However, vulnerability analysts protecting software lack adequate tools for understanding data flow in binaries. To reduce the human time burden for these analysts, we used human factors methods in a rolling discovery process to derive user-centric visual representation requirements. We encountered three main challenges: analysis projects span weeks; analysis goals significantly affect approaches and required knowledge; and analysts' tools, techniques, conventions, and prioritization are based on personal preference. To address these challenges, we initially focused our human factors methods on an attack surface characterization task. We generalized our results using a two-stage modified sorting task, creating requirements for a data flow visualization. We implemented these requirements partially in manual static visualizations, which we informally evaluated, and partially in automatically generated interactive visualizations, which have yet to be integrated into workflows for evaluation. Our observations and results indicate that 1) this data flow visualization has the potential to enable novel code navigation, information presentation, and information sharing, and 2) it is an excellent time to pursue research applying human factors methods to binary analysis workflows.
Understanding the data structures employed by a program is important for reverse engineering activities and can improve the results of automated software analysis techniques. In a compiled binary, accesses to the data structure fields and array indices defined in the source program are replaced by raw pointer arithmetic. We present a representation that captures the essential details of how a program accesses memory regions, which we call a Memory Access Graph (MAG), and a static analysis for automatically extracting this information from a program binary. The extraction analysis is straightforward and does not require sophisticated integer or pointer analysis. MAGs are readily understood by reverse engineers, who are generally able to infer the data structure definition corresponding to a MAG. We briefly discuss automatic extraction of structure definitions, outlining some of the difficulties in doing so.
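To make the idea concrete, here is a toy Python sketch of building a MAG-like summary from recovered address arithmetic and reading a struct-like layout off it. The representation and names are invented for illustration under our own assumptions; they are not the paper's formal MAG definition.

```python
# Toy sketch of the Memory Access Graph idea: record, per base region,
# the constant offsets and array strides through which the program
# touches memory. Illustrative only; the actual MAG may differ.
from collections import defaultdict

def build_mag(accesses):
    """accesses: (base, offset, stride) triples recovered from address
    arithmetic, e.g. *(p + 8) -> ('p', 8, 0) and
    *(p + 16 + 4*i) -> ('p', 16, 4).
    Returns {base: sorted list of (offset, stride) access edges}."""
    mag = defaultdict(set)
    for base, offset, stride in accesses:
        mag[base].add((offset, stride))
    return {base: sorted(edges) for base, edges in mag.items()}

def sketch_struct(edges):
    """Guess a struct-like layout from one region's access edges:
    constant offsets become scalar fields, strided edges become arrays."""
    fields = []
    for offset, stride in edges:
        kind = f'array of {stride}-byte elements' if stride else 'scalar'
        fields.append(f'  offset {offset}: {kind}')
    return 'struct-like region:\n' + '\n'.join(fields)

# The access pattern *(p+0), *(p+8), *(p+16+4*i) suggests two scalar
# fields followed by an array of 4-byte elements at offset 16.
mag = build_mag([('p', 0, 0), ('p', 8, 0), ('p', 16, 4)])
print(sketch_struct(mag['p']))
```

Reading a layout off such a summary mirrors the kind of inference the abstract describes reverse engineers making when they perceive a data structure definition in a MAG.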