In this work, we introduce a family of novel activation functions for deep neural networks that approximate n-ary, or n-argument, probabilistic logic. Logic has long been used to encode complex relationships between claims that are either true or false, so these activation functions provide a step towards models that can encode information efficiently. Unfortunately, typical feedforward networks with elementwise activation functions cannot capture certain relationships succinctly, such as the exclusive disjunction (p xor q) and the conditioned disjunction (if c then p else q). Our n-ary activation functions address this challenge by approximating belief functions (probabilistic Boolean logic) with logit representations of probability, and our experiments demonstrate the ability to learn arbitrary logical ground truths in a single layer. Further, by representing belief tables in a basis that associates the number of nonzero parameters with the effective arity of each belief function, we forge a concrete relationship between logical complexity and sparsity, thus opening new optimization approaches to suppress logical complexity during training. We provide a computationally efficient PyTorch implementation and test our activation functions against other logic-approximating activation functions on both traditional machine learning tasks and the reproduction of known logical relationships.
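To make the idea concrete, the following minimal PyTorch sketch shows one way an n-ary logic-approximating activation could be expressed: logit inputs are mapped to probabilities, combined through a learned belief table over all 2^n truth assignments, and returned as a logit. The module name and parameterization here are illustrative assumptions, not the implementation described above (in particular, it does not use the sparsity-revealing basis).

import itertools
import torch
import torch.nn as nn

class NaryLogicActivation(nn.Module):
    """Hypothetical sketch of a learned n-ary probabilistic-logic gate on logits."""

    def __init__(self, arity: int):
        super().__init__()
        self.arity = arity
        # One belief-table entry (stored as a logit) per truth assignment.
        self.table_logits = nn.Parameter(torch.zeros(2 ** arity))
        # All truth assignments of the n arguments, shape (2**n, n).
        assignments = torch.tensor(
            list(itertools.product([0.0, 1.0], repeat=arity)))
        self.register_buffer("assignments", assignments)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., arity) logits of the n argument propositions.
        p = torch.sigmoid(x)
        # Probability of each joint truth assignment, assuming independent arguments.
        probs = torch.where(self.assignments.bool(), p.unsqueeze(-2),
                            1.0 - p.unsqueeze(-2))        # (..., 2**n, n)
        joint = probs.prod(dim=-1)                        # (..., 2**n)
        # Belief that the output proposition holds, returned as a logit.
        p_out = (joint * torch.sigmoid(self.table_logits)).sum(dim=-1)
        return torch.logit(p_out.clamp(1e-6, 1 - 1e-6))

With arity 2, driving the table entries toward 1 for the assignments (0,1) and (1,0) and toward 0 for (0,0) and (1,1) recovers the exclusive disjunction that elementwise activations struggle to express.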
In this paper, we address the problem of convergence of the sequential variational inference filter (VIF) through the application of a robust variational objective and an H∞-norm based correction for a linear Gaussian system. As the dimension of the state or parameter space grows, performing the full Kalman update with the dense covariance matrix of a large-scale system requires storage and computation that quickly become impractical. The VIF approach, based on mean-field Gaussian variational inference, reduces this burden by approximating the covariance, usually with a diagonal covariance matrix. The challenge is to retain convergence and correct for biases introduced by the sequential VIF steps. We desire a framework that improves feasibility while still maintaining reasonable proximity to the optimal Kalman filter as data is assimilated. To accomplish this goal, an H∞-norm based optimization perturbs the VIF covariance matrix to improve robustness. This yields a novel VIF-H∞ recursion that employs consecutive variational inference and H∞-based optimization steps. We explore the development of this method and investigate a numerical example to illustrate the effectiveness of the proposed filter.
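The flavor of a single diagonal-covariance update can be sketched in NumPy as below. The sketch assumes a linear Gaussian observation y = H x + noise with noise covariance R, keeps only the diagonal of the posterior precision as the mean-field variational variances, and uses a simple variance-inflation factor as a placeholder for the H∞-based correction, whose exact form is not reproduced here; the function name and the factor gamma are assumptions for illustration, and the mean is solved densely only for clarity.

import numpy as np

def vif_hinf_step(m, s, H, R, y, gamma=1.1):
    """Illustrative diagonal-covariance variational update (not the paper's recursion).

    m, s  : prior mean and prior diagonal variances of the state x.
    gamma : variance-inflation factor standing in for the H-infinity correction.
    """
    R_inv = np.linalg.inv(R)
    # Posterior precision of the exact Gaussian posterior.
    Lam = np.diag(1.0 / s) + H.T @ R_inv @ H
    # Posterior mean (dense solve shown only for clarity of the sketch).
    m_post = np.linalg.solve(Lam, m / s + H.T @ R_inv @ y)
    # Mean-field Gaussian VI keeps only the diagonal of the precision.
    s_post = 1.0 / np.diag(Lam)
    # Robustness step: inflate the variational variances before the next cycle.
    return m_post, gamma * s_post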
Our primary aim in this work is to understand how to efficiently obtain reliable uncertainty quantification in automatic learning algorithms with limited training datasets. Standard approaches rely on cross-validation to tune hyperparameters. Unfortunately, when our datasets are too small, holdout datasets become unreliable, albeit unbiased, measures of prediction quality due to the lack of adequate sample size. We should not place confidence in holdout estimators under conditions wherein the sample variance is both large and unknown. More poignantly, our training experiments on limited data (Duersch and Catanach, 2021) show that even if we could improve estimator quality under these conditions, the typical training trajectory may never even encounter generalizable models.
This work examines how we may cast machine learning within a complete Bayesian framework to quantify and suppress explanatory complexity from first principles. Our investigation into both the philosophy and mathematics of rational belief leads us to emphasize the critical role of Bayesian inference in learning well-justified predictions within a rigorous and complete extended logic. The Bayesian framework allows us to coherently account for evidence in the learned plausibility of potential explanations. As an extended logic, the Bayesian paradigm regards probability as a notion of degrees of truth. In order to satisfy critical properties of probability as a coherent measure, as well as maintain consistency with binary propositional logic, we arrive at Bayes' Theorem as the only justifiable mechanism to update our beliefs to account for empirical evidence. Yet, in the machine learning paradigm, where explanations are unconstrained algorithmic abstractions, we arrive at a critical challenge: Bayesian inference requires prior belief. Conventional approaches fail to yield a consistent framework in which we could compare prior plausibility among the infinities of potential choices in learning architectures. The difficulty of articulating well-justified prior belief over abstract models is the provenance of memorization in traditional machine learning training practices. This becomes exceptionally problematic in the context of limited datasets, when we wish to learn justifiable predictions from only a small amount of data.
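For reference, the update rule in question is Bayes' Theorem, written here with \theta denoting a candidate explanation and D the observed evidence:

p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)}, \qquad p(D) = \int p(D \mid \theta)\, p(\theta)\, d\theta .

The difficulty identified above is that the prior p(\theta) must be articulated over an unconstrained space of algorithmic abstractions, which conventional approaches do not support in a consistent way.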
Rank-revealing matrix decompositions provide an essential tool in spectral analysis of matrices, including the Singular Value Decomposition (SVD) and related low-rank approximation techniques. QR with Column Pivoting (QRCP) is usually suitable for these purposes, but it can be much slower than the unpivoted QR algorithm. For large matrices, the difference in performance is due to increased communication between the processor and slow memory, which QRCP needs in order to choose pivots during decomposition. Our main algorithm, Randomized QR with Column Pivoting (RQRCP), uses randomized projection to make pivot decisions from a much smaller sample matrix, which we can construct to reside in a faster level of memory than the original matrix. This technique may be understood as trading vastly reduced communication for a controlled increase in uncertainty during the decision process. For rank-revealing purposes, the selection mechanism in RQRCP produces results of the same quality as the standard algorithm, but with performance near that of unpivoted QR (often an order of magnitude faster for large matrices). We also propose two formulas that facilitate further performance improvements. The first efficiently updates sample matrices to avoid computing new randomized projections. The second avoids large trailing updates during the decomposition in truncated low-rank approximations. Our truncated version of RQRCP also provides a key initial step in our truncated SVD approximation, TUXV. These advances open up a new performance domain for large matrix factorizations that will support efficient problem-solving techniques for challenging applications in science, engineering, and data analysis.
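As a rough illustration of the sampling idea (not the blocked algorithm or its sample-update and trailing-update formulas), the sketch below compresses the rows of A with a Gaussian projection and lets ordinary column-pivoted QR on the small sample choose the pivots; the function name and oversampling parameter are assumptions for the example.

import numpy as np
from scipy.linalg import qr

def rqrcp_pivots(A, k, oversample=8, rng=None):
    """Choose k column pivots for A from a small randomized sample (illustrative)."""
    rng = np.random.default_rng(rng)
    m, n = A.shape
    ell = min(m, k + oversample)
    Omega = rng.standard_normal((ell, m))   # Gaussian sketching matrix
    B = Omega @ A                           # small sample matrix, ell x n
    # Column-pivoted QR on the sample is cheap and supplies the pivot order.
    _, _, piv = qr(B, mode="economic", pivoting=True)
    return piv[:k]

# The selected columns can then seed a truncated factorization of A itself, e.g.
# Q, R = np.linalg.qr(A[:, rqrcp_pivots(A, k)])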
Information theory provides a mathematical foundation to measure uncertainty in belief. Belief is represented by a probability distribution that captures our understanding of an outcome's plausibility. Information measures based on Shannon's concept of entropy include realization information, Kullback-Leibler divergence, Lindley's measure of the information in an experiment, cross entropy, and mutual information. We derive a general theory of information from first principles that accounts for evolving belief and recovers all of these measures. Rather than simply gauging uncertainty, information is understood in this theory to measure change in belief. We may then regard entropy as the information we expect to gain upon realization of a discrete latent random variable. This theory of information is compatible with the Bayesian paradigm, in which rational belief is updated as evidence becomes available. Furthermore, this theory admits novel measures of information with well-defined properties, which we explore in both analysis and experiment. This view of information illuminates the study of machine learning by allowing us to quantify the information captured by a predictive model and distinguish it from the residual information contained in training data. We gain related insights regarding feature selection, anomaly detection, and novel Bayesian approaches.
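A small numerical illustration of the change-in-belief reading for two of these measures: when belief p over a discrete variable collapses to a single outcome i, the information realized is the Kullback-Leibler divergence from p to that point mass, -log p_i, and its expectation under p is the Shannon entropy. The helper names below are assumptions for the example.

import numpy as np

def kl_divergence(p, q):
    """D(p || q): information gained when belief moves from q to p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def entropy(p):
    """Expected information gained upon realization: E_p[-log p_i] = -sum p log p."""
    p = np.asarray(p, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(p[mask])))

# A fair coin carries log(2) ~ 0.693 nats of expected realization information.
print(entropy([0.5, 0.5]), kl_divergence([0.9, 0.1], [0.5, 0.5]))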
Stochastic optimization is a fundamental field of research for machine learning. Stochastic gradient descent (SGD) and related methods provide a feasible means to train complicated prediction models over large datasets. SGD, however, does not explicitly address the problem of overfitting, which can lead to predictions that perform poorly on new data. This difference between loss performance on unseen testing data versus that on training data defines the generalization gap of a model. We introduce a new computational kernel called Stochastic Hessian Projection (SHP) that uses a maximum likelihood framework to simultaneously estimate the gradient noise covariance and the local curvature of the loss function. Our analysis illustrates that these quantities affect the evolution of parameter uncertainty and therefore generalizability. We show how these computations allow us to predict the generalization gap without requiring holdout data. Explicitly assessing this metric for generalizability during training may improve machine learning predictions when data is scarce and understanding prediction variability is critical.
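The two ingredients named here can be illustrated with standard PyTorch autograd machinery. The sketch below estimates a diagonal gradient-noise covariance from per-minibatch gradients and a directional curvature via a Hessian-vector product; it is only a hedged illustration of those quantities, not the SHP kernel or its maximum likelihood estimator, and the function and argument names are assumptions.

import torch

def noise_and_curvature(loss_fn, params, batches, direction):
    """Illustrative estimates of gradient-noise covariance (diagonal) and curvature.

    loss_fn   : callable mapping one minibatch to a scalar loss.
    params    : list of parameter tensors with requires_grad=True.
    direction : flat unit vector of length total parameter count.
    """
    # Diagonal gradient-noise covariance from the spread of minibatch gradients.
    grads = []
    for batch in batches:
        g = torch.autograd.grad(loss_fn(batch), params)
        grads.append(torch.cat([gi.reshape(-1) for gi in g]))
    G = torch.stack(grads)                         # (num_batches, num_params)
    noise_cov_diag = G.var(dim=0, unbiased=True)

    # Local curvature along `direction` via a Hessian-vector product on one batch.
    g = torch.autograd.grad(loss_fn(batches[0]), params, create_graph=True)
    flat_g = torch.cat([gi.reshape(-1) for gi in g])
    hv = torch.autograd.grad(flat_g @ direction, params)
    curvature = direction @ torch.cat([h.reshape(-1) for h in hv])
    return noise_cov_diag, curvature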