Publications Search

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events, so that notification can be made or action taken. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Computing (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions that the OVIS tool, when applied to compute cluster monitoring, will be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization.
For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has been shown to be correlated with future failure, can be discovered and assigned to shorter-duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g., checkpointing) at regular intervals unnecessarily.

TYPE SAND Report YEAR 2009

DOI OSTI

Methodologies for advance warning of compute cluster problems via statistical analysis : a case study

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2009

OSTI

Resource Monitoring and Management with OVIS to Enable HPC in Cloud Computing Environments

Brandt, James M.; Gentile, Ann C.; Mayo, Jackson R.; Pebay, Philippe P.; Roe, Diana C.; Thompson, David; Wong, Matthew H.

Abstract not provided.

TYPE Conference YEAR 2008

OSTI

FCLib: The Feature Characterization Library

Gentile, Ann C.; Kegelmeyer, William P.; Ulmer, Craig

The Feature Characterization Library (FCLib) is a software library that simplifies the process of interrogating, analyzing, and understanding complex data sets generated by finite element applications. This document provides an overview of the library, a description of both the design philosophy and implementation of the library, and examples of how the library can be utilized to extract understanding from raw datasets.

TYPE SAND Report YEAR 2008

DOI OSTI

Using Emulation and Simulation to Understand the Large-scale Behavior of the Internet

Adalsteinsson, Helgi; Armstrong, Robert C.; Chiang, Ken; Gentile, Ann C.; Lloyd, Levi; Minnich, Ronald G.; Vanderveen, Keith; Vanrandwyk, Jamie; Rudish, Donald W.

We report on the work done in the late-start LDRD "Using Emulation and Simulation to Understand the Large-Scale Behavior of the Internet." We describe the creation of a research platform that emulates many thousands of machines to be used for the study of large-scale Internet behavior. We describe a simple proof-of-concept attack we performed in this environment. We describe the successful capture of a Storm bot and, from the study of the bot and a further literature search, establish the large-scale aspects we seek to understand via emulation of Storm on our research platform in possible follow-on work. Finally, we discuss possible future work.

TYPE SAND Report YEAR 2008

DOI OSTI

OVIS-2: A Robust Distributed Architecture for Scalable RAS

Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.; Thompson, David; Pebay, Philippe P.; Debusschere, Bert J.; Mayo, Jackson R.

Abstract not provided.

TYPE Conference YEAR 2008

OSTI

OVIS-2: A Robust Distributed Architecture for Scalable RAS

Wong, Matthew H.; Thompson, David; Pebay, Philippe P.; Mayo, Jackson R.; Gentile, Ann C.; Debusschere, Bert J.; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Brandt, James M.; Gentile, Ann C.; Pebay, Philippe P.; Thompson, David; Wong, Matthew H.; Debusschere, Bert J.; Mayo, Jackson R.

Abstract not provided.

TYPE Conference YEAR 2007

OSTI

OVIS reliably monitors computers using novel parallel calculations

Brandt, James M.; Gentile, Ann C.; Pebay, Philippe P.; Thompson, David; Wong, Matthew H.; Jolly, James

Abstract not provided.

TYPE Report YEAR 2007

OSTI

Monitoring computational clusters with OVIS

Pebay, Philippe P.; Brandt, James M.; Gentile, Ann C.; Wong, Matthew H.

Traditional cluster monitoring approaches consider nodes in singleton, using manufacturer-specified extreme limits as thresholds for failure "prediction". We have developed a tool, OVIS, for monitoring and analysis of large computational platforms which, instead, uses a statistical approach to characterize single device behaviors from those of a large number of statistically similar devices. Baseline capabilities of OVIS include the visual display of deterministic information about state variables (e.g., temperature, CPU utilization, fan speed) and their aggregate statistics. Visual consideration of the cluster as a comparative ensemble, rather than as singleton nodes, is an easy and useful method for tuning cluster configuration and determining effects of real-time changes.

TYPE SAND Report YEAR 2006

DOI OSTI

OVIS: A Tool for Intelligent Real-time Monitoring of Computational Clusters

Gentile, Ann C.; Wong, Matthew H.; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

OVIS: A Tool for Intelligent Real-time Monitoring of Computational Clusters

Gentile, Ann C.; Wong, Matthew H.; Brandt, James M.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Extracting Information from Data: Ease Data Analysis Development with FCLib

Gentile, Ann C.; Kegelmeyer, William P.

Abstract not provided.

TYPE Conference YEAR 2006

OSTI

Meaningful statistical analysis of large computational clusters

Gentile, Ann C.; Marzouk, Youssef M.; Pebay, Philippe P.

Effective monitoring of large computational clusters demands the analysis of a vast amount of raw data from a large number of machines. The fundamental interactions of the system are not, however, well-defined, making it difficult to draw meaningful conclusions from this data, even if one were able to handle and process it efficiently. In this paper we show that computational clusters, because they are composed of a large number of identical machines, behave in a statistically meaningful fashion. We can therefore employ normal statistical methods to derive information about individual systems and their environment and to detect problems sooner than with traditional mechanisms. We discuss design details necessary to use these methods on a large system in a timely and low-impact fashion.

TYPE SAND Report YEAR 2005

DOI OSTI

Meaningful statistical analysis of large computational clusters

Brandt, James M.; Marzouk, Youssef M.; Pebay, Philippe P.; Gentile, Ann C.

Abstract not provided.

TYPE Conference YEAR 2005

OSTI
