Publications

Publications / SAND Report

Flexible and Scalable Data Fusion using Proactive Schemaless Information Services

Widener, Patrick W.

Exascale data environments are fast approaching, driven by diverse structured and unstructured data such as system and application telemetry streams, open-source information capture, and on-demand simulation output. Storage costs having plummeted, the question is now one of converting vast stores of data to actionable information. Complicating this problem are the low degrees of awareness across domain boundaries about what potentially useful data may exist, and write-once- read-never issues (data generation/collection rates outpacing data analysis and integration rates). Increasingly, technologists and researchers need to correlate previously unrelated data sources and artifacts to produce fused data views for domain-specific purposes. New tools and approaches for creating such views from vast amounts of data are vitally important to maintaining research and operational momentum. We propose to research and develop tools and services to assist in the creation, refinement, discovery and reuse of fused data views over large, diverse collections of heterogeneously structured data. We innovate in the following ways. First, we enable and encourage end-users to introduce customized index methods selected for local benefit rather than for global interaction (flexible multi-indexing). We envision rich combinations of such views on application data: views that span backing stores with different semantics, that introduce analytic methods of indexing, and that define multiple views on individual data items. We specifically decline to build a big fused database of everything providing a centralized index over all data, or to export a rigid schema to all comers as in federated query approaches. Second, we proactively advertise these application-specific views so that they may be programmatically reused and extended (data proactivity). Through this mechanism, both changes in state (new data in existing view collected) and changes in structure (new or derived view exists) are made known. Lastly, we embrace found data heterogeneity by coupling multi-indexing to backing stores with appropriate semantics (as opposed to a single store or schema).