The ground truth program used simulations as test beds for social science research methods. The simulations had known ground truth and were capable of producing large amounts of data. This allowed research teams to run experiments and ask questions of these simulations similar to social scientists studying real-world systems, and enabled robust evaluation of their causal inference, prediction, and prescription capabilities. We tested three hypotheses about research effectiveness using data from the ground truth program, specifically looking at the influence of complexity, causal understanding, and data collection on performance. We found some evidence that system complexity and causal understanding influenced research performance, but no evidence that data availability contributed. The ground truth program may be the first robust coupling of simulation test beds with an experimental framework capable of teasing out factors that determine the success of social science research.

Measures of simulation model complexity generally focus on outputs; we propose measuring the complexity of a model’s causal structure to gain insight into its fundamental character. This article introduces tools for measuring causal complexity. First, we introduce a method for developing a model’s causal structure diagram, which characterises the causal interactions present in the code. Causal structure diagrams facilitate comparison of simulation models, including those from different paradigms. Next, we develop metrics for evaluating a model’s causal complexity using its causal structure diagram. We discuss cyclomatic complexity as a measure of the intricacy of causal structure and introduce two new metrics that incorporate the concept of feedback, a fundamental component of causal structure. The first new metric introduced here is feedback density, a measure of the cycle-based interconnectedness of causal structure. The second metric combines cyclomatic complexity and feedback density into a comprehensive causal complexity measure. Finally, we demonstrate these complexity metrics on simulation models from multiple paradigms and discuss potential uses and interpretations. These tools enable direct comparison of models across paradigms and provide a mechanism for measuring and discussing complexity based on a model’s fundamental assumptions and design.

The causal structure of a simulation is a major determinant of both its character and behavior, yet most methods we use to compare simulations focus only on simulation outputs. We introduce a method that combines graphical representation with information theoretic metrics to quantitatively compare the causal structures of models. The method applies to agent-based simulations as well as system dynamics models and facilitates comparison within and between types. Comparing models based on their causal structures can illuminate differences in assumptions made by the models, allowing modelers to (1) better situate their models in the context of existing work, including highlighting novelty, (2) explicitly compare conceptual theory and assumptions to simulated theory and assumptions, and (3) investigate potential causal drivers of divergent behavior between models. We demonstrate the method by comparing two epidemiology models at different levels of aggregation.

Social systems are uniquely complex and difficult to study, but understanding them is vital to solving the world’s problems. The Ground Truth program developed a new way of testing the research methods that attempt to understand and leverage the Human Domain and its associated complexities. The program developed simulations of social systems as virtual world test beds. Not only were these simulations able to produce data on future states of the system under various circumstances and scenarios, but their causal ground truth was also explicitly known. Research teams studied these virtual worlds, facilitating deep validation of causal inference, prediction, and prescription methods. The Ground Truth program model provides a way to test and validate research methods to an extent previously impossible, and to study the intricacies and interactions of different components of research.

This paper examines the variability of predicted responses when multiple stress-strain curves (reflecting variability from replicate material tests) are propagated through a finite element model of a ductile steel can being slowly crushed. Over 140 response quantities of interest (including displacements, stresses, strains, and calculated measures of material damage) are tracked in the simulations. Each response quantity’s behavior varies according to the particular stress-strain curves used for the materials in the model. We desire to estimate response variability when only a few stress-strain curve samples are available from material testing. Propagation of just a few samples will usually result in significantly underestimated response uncertainty relative to propagation of a much larger population that adequately samples the presiding random-function source. A simple classical statistical method, Tolerance Intervals, is tested for effectively treating sparse stress-strain curve data. The method is found to perform well on the highly nonlinear input-to-output response mappings and non-standard response distributions in the can-crush problem. The results and discussion in this paper support a proposition that the method will apply similarly well for other sparsely sampled random variable or function data, whether from experiments or models. Finally, the simple Tolerance Interval method is also demonstrated to be very economical.

When very few samples of a random quantity are available from a source distribution of unknown shape, it is usually not possible to accurately infer the exact distribution from which the data samples come. Under-estimation of important quantities such as response variance and failure probabilities can result. For many engineering purposes, including design and risk analysis, we attempt to avoid under-estimation with a strategy to conservatively estimate (bound) these types of quantities -- without being overly conservative -- when only a few samples of a random quantity are available from model predictions or replicate experiments. This report examines a class of related sparse-data uncertainty representation and inference approaches that are relatively simple, inexpensive, and effective. Tradeoffs between the methods' conservatism, reliability, and risk versus number of data samples (cost) are quantified with multi-attribute metrics use d to assess method performance for conservative estimation of two representative quantities: central 95% of response; and 10^{-4} probability of exceeding a response threshold in a tail of the distribution. Each method's performance is characterized with 10,000 random trials on a large number of diverse and challenging distributions. The best method and number of samples to use in a given circumstance depends on the uncertainty quantity to be estimated, the PDF character, and the desired reliability of bounding the true value. On the basis of this large data base and study, a strategy is proposed for selecting the method and number of samples for attaining reasonable credibility levels in bounding these types of quantities when sparse samples of random variables or functions are available from experiments or simulations.

This report describes findings from the culminating experiment of the LDRD project entitled, "Analyst-to-Analyst Variability in Simulation-Based Prediction". For this experiment, volunteer participants solving a given test problem in engineering and statistics were interviewed at different points in their solution process. These interviews are used to trace differing solutions to differing solution processes, and differing processes to differences in reasoning, assumptions, and judgments. The issue that the experiment was designed to illuminate -- our paucity of understanding of the ways in which humans themselves have an impact on predictions derived from complex computational simulations -- is a challenging and open one. Although solution of the test problem by analyst participants in this experiment has taken much more time than originally anticipated, and is continuing past the end of this LDRD, this project has provided a rare opportunity to explore analyst-to-analyst variability in significant depth, from which we derive evidence-based insights to guide further explorations in this important area.

We introduce a novel technique, POF-Darts, to estimate the Probability Of Failure based on random disk-packing in the uncertain parameter space. POF-Darts uses hyperplane sampling to explore the unexplored part of the uncertain space. We use the function evaluation at a sample point to determine whether it belongs to failure or non-failure regions, and surround it with a protection sphere region to avoid clustering. We decompose the domain into Voronoi cells around the function evaluations as seeds and choose the radius of the protection sphere depending on the local Lipschitz continuity. As sampling proceeds, regions uncovered with spheres will shrink, improving the estimation accuracy. After exhausting the function evaluation budget, we build a surrogate model using the function evaluations associated with the sample points and estimate the probability of failure by exhaustive sampling of that surrogate. In comparison to other similar methods, our algorithm has the advantages of decoupling the sampling step from the surrogate construction one, the ability to reach target POF values with fewer samples, and the capability of estimating the number and locations of disconnected failure regions, not just the POF value. We present various examples to demonstrate the efficiency of our novel approach.

This SAND report summarizes our work on the Sandia National Laboratory LDRD project titled "Efficient Probability of Failure Calculations for QMU using Computational Geometry" which was project #165617 and proposal #13-0144. This report merely summarizes our work. Those interested in the technical details are encouraged to read the full published results, and contact the report authors for the status of the software and follow-on projects.

This report discusses the treatment of uncertainties stemming from relatively few samples of random quantities. The importance of this topic extends beyond experimental data uncertainty to situations involving uncertainty in model calibration, validation, and prediction. With very sparse data samples it is not practical to have a goal of accurately estimating the underlying probability density function (PDF). Rather, a pragmatic goal is that the uncertainty representation should be conservative so as to bound a specified percentile range of the actual PDF, say the range between 0.025 and .975 percentiles, with reasonable reliability. A second, opposing objective is that the representation not be overly conservative; that it minimally over-estimate the desired percentile range of the actual PDF. The presence of the two opposing objectives makes the sparse-data uncertainty representation problem interesting and difficult. In this report, five uncertainty representation techniques are characterized for their performance on twenty-one test problems (over thousands of trials for each problem) according to these two opposing objectives and other performance measures. Two of the methods, statistical Tolerance Intervals and a kernel density approach specifically developed for handling sparse data, exhibit significantly better overall performance than the others.