SAND Report

Toward Uncertainty Quantification for Supervised Classification

Darling, Michael C.; Stracuzzi, David J.

Our goal is to develop a general theoretical basis for quantifying uncertainty in supervised machine learning models. Current accuracy-based validation metrics indicate how well a classifier performs on a data set as a whole; however, they do not indicate how reliably a model predicts individual samples. We quantify uncertainty by constructing probability distributions over the predictions made by an ensemble of classifiers. This report details our initial investigations into uncertainty quantification for supervised machine learning. We apply an uncertainty analysis to the problem of malicious website detection. Machine learning models can be trained to find suspicious characteristics in the text of a website's Uniform Resource Locator (URL). However, given the vast number of URLs and the ever-changing tactics of malicious actors, it will always be possible to find sets of websites that are outliers with respect to a model's hypothesis. We therefore seek to understand a model's per-sample reliability when classifying URL data.

Acknowledgements

This work was funded by the Sandia National Laboratories Laboratory Directed Research and Development (LDRD) program.
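The general idea of ensemble-based, per-sample uncertainty can be illustrated with a short sketch: fit an ensemble, collect each member's predicted probability for a given sample, and summarize the spread of that empirical distribution. The bagged decision-tree ensemble and the mean/variance summary below are illustrative assumptions, not the report's implementation.

```python
# Minimal sketch (not the report's method): per-sample uncertainty from the
# spread of predictions across an ensemble of classifiers.
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

def per_sample_uncertainty(X_train, y_train, X_test, n_members=100):
    """Return, for each test sample, the mean predicted probability of the
    positive class and the variance of that probability across ensemble members."""
    ensemble = BaggingClassifier(DecisionTreeClassifier(), n_estimators=n_members,
                                 random_state=0)
    ensemble.fit(X_train, y_train)

    # Each member's predicted probability for the positive class gives an
    # empirical distribution of predictions for every test sample.
    member_probs = np.stack(
        [est.predict_proba(X_test)[:, 1] for est in ensemble.estimators_], axis=0
    )
    return member_probs.mean(axis=0), member_probs.var(axis=0)
```

In this sketch, a high variance for a sample (e.g., a URL whose features fall far from the training data) flags a prediction as unreliable even when the ensemble's overall accuracy is high.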