This report summarizes the current statistical analysis capability of OVIS and how it works in conjunction with the OVIS data readers and interpolators. It also documents how to extend these capabilities. OVIS is a tool for parallel statistical analysis of sensor data to improve system reliability. Parallelism is achieved using a distributed data model: many sensors on similar components (metaphorically sheep) insert measurements into a series of databases on computers reserved for analyzing the measurements (metaphorically shepherds). Each shepherd node then processes the sheep data stored locally and the results are aggregated across all shepherds. OVIS uses the Visualization Tool Kit (VTK) statistics algorithm class hierarchy to perform analysis of each process's data but avoids VTK's model aggregation stage which uses the Message Passing Interface (MPI); this is because if a single process in an MPI job fails, the entire job will fail. Instead, OVIS uses asynchronous database replication to aggregate statistical models. OVIS has several additional features beyond those present in VTK that, first, accommodate its particular data format and, second, improve the memory and speed of the statistical analyses. First, because many statistical algorithms are multivariate in nature and sensor data is typically univariate, interpolation of data is required to provide simultaneous observations of metrics. Note that in this report, we will refer to a single value obtained from a sensor as a measurement while a collection of multiple sensor values simultaneously present in the system is an observation. A base class for interpolation is provided that abstracts the operation of converting multiple sensor measurements into simultaneous observations. A concrete implementation is provided that performs piecewise constant temporal interpolation of multiple metrics across a single component. Secondly, because calculations may summarize data too large to fit in memory OVIS analyses batches of observations at a time and aggregates these intermediate intra-process models as it goes before storing the final model for inter-process aggregation via database replication. This reduces the memory footprint of the analysis, interpolation, and the database client and server query processing. This also interleaves processing with the disk I/O required to fetch data from the database - also improving speed. This report documents how OVIS performs analyses and how to create additional analysis components that fetch measurements from the database, perform interpolation, or perform operations on streamed observations (such as model updates or assessments). The rest of this section outlines the OVIS analysis algorithm and is followed by sections specific to each subtask. Note that we are limiting our discussion for now to the creation of a model from a set of measurements, and not including the assessment of observations using a model. The same framework can be used for assessment but that use case is not detailed in this report.
Statistical analysis is typically used to reduce the dimensionality of and infer meaning from data. A key challenge of any statistical analysis package aimed at large-scale, distributed data is to address the orthogonal issues of parallel scalability and numerical stability. Many statistical techniques, e.g., descriptive statistics or principal component analysis, are based on moments and co-moments and, using robust online update formulas, can be computed in an embarrassingly parallel manner, amenable to a map-reduce style implementation. In this paper we focus on contingency tables, through which numerous derived statistics such as joint and marginal probability, point-wise mutual information, information entropy, and {chi}{sup 2} independence statistics can be directly obtained. However, contingency tables can become large as data size increases, requiring a correspondingly large amount of communication between processors. This potential increase in communication prevents optimal parallel speedup and is the main difference with moment-based statistics (which we discussed in [1]) where the amount of inter-processor communication is independent of data size. Here we present the design trade-offs which we made to implement the computation of contingency tables in parallel. We also study the parallel speedup and scalability properties of our open source implementation. In particular, we observe optimal speed-up and scalability when the contingency statistics are used in their appropriate context, namely, when the data input is not quasi-diffuse.
This report summarizes existing statistical engines in VTK/Titan and presents both the serial and parallel k-means statistics engines. It is a sequel to [PT08], [BPRT09], and [PT09] which studied the parallel descriptive, correlative, multi-correlative, principal component analysis, and contingency engines. The ease of use of the new parallel k-means engine is illustrated by the means of C++ code snippets and algorithm verification is provided. This report justifies the design of the statistics engines with parallel scalability in mind, and provides scalability and speed-up analysis results for the k-means engine.
This report summarizes existing statistical engines in VTK/Titan and presents the recently parallelized contingency statistics engine. It is a sequel to [PT08] and [BPRT09] which studied the parallel descriptive, correlative, multi-correlative, and principal component analysis engines. The ease of use of this new parallel engines is illustrated by the means of C++ code snippets. Furthermore, this report justifies the design of these engines with parallel scalability in mind; however, the very nature of contingency tables prevent this new engine from exhibiting optimal parallel speed-up as the aforementioned engines do. This report therefore discusses the design trade-offs we made and study performance with up to 200 processors.
The purpose of this project was to investigate the use of Bayesian methods for the estimation of the reliability of complex systems. The goals were to find methods for dealing with continuous data, rather than simple pass/fail data; to avoid assumptions of specific probability distributions, especially Gaussian, or normal, distributions; to compute not only an estimate of the reliability of the system, but also a measure of the confidence in that estimate; to develop procedures to address time-dependent or aging aspects in such systems, and to use these models and results to derive optimal testing strategies. The system is assumed to be a system of systems, i.e., a system with discrete components that are themselves systems. Furthermore, the system is 'engineered' in the sense that each node is designed to do something and that we have a mathematical description of that process. In the time-dependent case, the assumption is that we have a general, nonlinear, time-dependent function describing the process. The major results of the project are described in this report. In summary, we developed a sophisticated mathematical framework based on modern probability theory and Bayesian analysis. This framework encompasses all aspects of epistemic uncertainty and easily incorporates steady-state and time-dependent systems. Based on Markov chain, Monte Carlo methods, we devised a computational strategy for general probability density estimation in the steady-state case. This enabled us to compute a distribution of the reliability from which many questions, including confidence, could be addressed. We then extended this to the time domain and implemented procedures to estimate the reliability over time, including the use of the method to predict the reliability at a future time. Finally, we used certain aspects of Bayesian decision analysis to create a novel method for determining an optimal testing strategy, e.g., we can estimate the 'best' location to take the next test to minimize the risk of making a wrong decision about the fitness of a system. We conclude this report by proposing additional fruitful areas of research.