Deep learning (DL) models have enjoyed increased attention in recent years because of their powerful predictive capabilities. Despite many successes, however, standard deep learning methods lack uncertainty quantification (UQ). The development of methods for producing UQ from DL models is an active area of research, but little attention has been given to the quality of the UQ those methods produce. To deploy DL models in high-consequence applications, high-quality UQ is necessary. This report details the research and development conducted as part of a Laboratory Directed Research and Development (LDRD) project at Sandia National Laboratories. The focus of this project is to develop a framework of methods and metrics for the principled assessment of UQ quality in DL models. This report presents an overview of UQ quality assessment in traditional statistical modeling and describes why this approach is difficult to apply in DL contexts. An assessment using relatively simple simulated data is presented to demonstrate that UQ quality can differ greatly between DL models trained on the same data. A method for simulating image data that can then be used for UQ quality assessment is described, as is a general method for simulating realistic data for the purpose of assessing a model's UQ quality. A Bayesian uncertainty framework for understanding uncertainty and existing metrics is described. Research that came out of collaborations with two university partners is discussed, along with a software toolkit currently being developed to implement the UQ quality assessment framework and serve as a general guide to incorporating UQ into DL applications.
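As a rough illustration of the kind of simulated data that supports this style of UQ quality assessment, the sketch below generates a two-class data set from a fully known class-probability function; the logistic form, feature dimension, and sample size are illustrative assumptions, not the report's actual simulation method.

```python
# Minimal sketch (not the report's simulation code): generate a two-class
# data set with a known, fully specified class-probability function so that
# a model's uncertainty estimates can later be checked against ground truth.
# All names and parameter choices here are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def true_class_probability(x):
    """Known P(y=1 | x) used to generate labels; a simple logistic surface."""
    return 1.0 / (1.0 + np.exp(-(2.0 * x[:, 0] - 1.5 * x[:, 1])))

n = 5000
X = rng.normal(size=(n, 2))          # two informative features
p = true_class_probability(X)        # ground-truth probabilities
y = rng.binomial(1, p)               # labels drawn from the known distribution

# Because p is known exactly, a model's predicted probabilities or credible
# intervals for P(y=1 | x) can be compared directly against p.
```

Knowing the generating distribution exactly is what makes this kind of simulation useful for UQ assessment: the "right answer" for the predictive uncertainty is available, which is rarely true for real data.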
The Machine Learning for Correlated Intelligence Laboratory Directed Research & Development (LDRD) Project explored competing a variety of machine learning (ML) classification techniques against a known, open-source dataset through a rapid and automated algorithm research and development (R&D) infrastructure. This approach relied heavily on creating an infrastructure that provides a pipeline for automatic target recognition (ATR) ML algorithm competition. Results are presented for nine ML classifiers run against a primary dataset using the pipeline infrastructure developed for this project. New approaches to feature set extraction are also presented and discussed.
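The sketch below is a minimal, hypothetical illustration of the classifier-competition idea, assuming a scikit-learn workflow; the stand-in dataset and the three candidate models are placeholders, not the project's nine classifiers or its ATR pipeline.

```python
# Minimal sketch of automated classifier competition on an open dataset.
# The dataset and candidate models are stand-ins chosen for illustration.
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # placeholder open-source dataset

candidates = {
    "logistic_regression": LogisticRegression(max_iter=5000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(),
}

# Score every candidate with the same cross-validation protocol so the
# comparison is automated and repeatable.
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```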
Traditional deep learning (DL) models are powerful classifiers, but many approaches do not provide uncertainties for their estimates. Uncertainty quantification (UQ) methods for DL models have received increased attention in the literature due to their usefulness in decision making, particularly for high-consequence decisions. However, little research has been done on how to evaluate the quality of such methods. We use the statistical metrics of frequentist interval coverage and interval width to evaluate the quality of credible intervals, and expected calibration error to evaluate predicted classification confidence. These metrics are evaluated on Bayesian neural networks (BNN) fit using Markov chain Monte Carlo (MCMC) and variational inference (VI), bootstrapped neural networks (NN), Deep Ensembles (DE), and Monte Carlo (MC) dropout. We apply these UQ methods for DL to a hyperspectral image target detection problem and show the inconsistency of the different methods' results and the necessity of a UQ quality metric. To reconcile these differences and choose a UQ method that appropriately quantifies the uncertainty, we create a simulated data set with a fully parameterized probability distribution for a two-class classification problem. The gold-standard MCMC performs best overall, and the bootstrapped NN is a close second, requiring the same computational expense as DE. Through this comparison, we demonstrate that, for a given data set, different models can produce uncertainty estimates of markedly different quality. This in turn points to a great need for principled assessment methods of UQ quality in DL applications.
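As a hedged sketch of the three metrics named above, the functions below compute frequentist coverage and mean width of central credible intervals from posterior samples, and expected calibration error from point predictions; the sample-array layout and binning scheme are assumptions for illustration, not the paper's implementation.

```python
# Minimal sketch (an assumption, not the paper's code) of interval coverage,
# interval width, and expected calibration error. `prob_samples` would come
# from MCMC draws, VI samples, ensemble/bootstrap members, or MC-dropout passes.
import numpy as np

def coverage_and_width(prob_samples, true_prob, level=0.95):
    """Frequentist coverage and mean width of central credible intervals.

    prob_samples: (n_samples, n_points) posterior draws of P(y=1 | x)
    true_prob:    (n_points,) ground-truth probabilities (known in simulation)
    """
    alpha = 1.0 - level
    lo = np.quantile(prob_samples, alpha / 2, axis=0)
    hi = np.quantile(prob_samples, 1.0 - alpha / 2, axis=0)
    coverage = np.mean((true_prob >= lo) & (true_prob <= hi))
    width = np.mean(hi - lo)
    return coverage, width

def expected_calibration_error(pred_prob, y, n_bins=10):
    """ECE: |accuracy - confidence| averaged over equal-width confidence bins."""
    conf = np.maximum(pred_prob, 1.0 - pred_prob)   # confidence in predicted class
    pred = (pred_prob >= 0.5).astype(int)
    correct = (pred == y).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```

A well-calibrated 95% credible interval should cover the true probability roughly 95% of the time while remaining as narrow as possible, and a well-calibrated classifier should have ECE near zero; comparing these quantities across the UQ methods is what the assessment above relies on.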