Ohio State University Department of Computer Science and Engineering Automatic Data Virtualization - Supporting XML based abstractions on HDF5 Datasets.

Swarup Kumar Sahoo, Gagan Agrawal

Roadmap
Motivation
Introduction
System Overview
XQuery, Low and High Level schema and HDF5 storage
Compiler Analysis and Algorithm
Experiment
Summary and Future Work

Motivation
Emergence of grid-based data repositories
  – Can enable sharing of data
Emergence of applications that process large datasets
  – Complicated by complex and specialized storage formats
Need for easily portable applications
  – Compatibility with web/grid services

Data Virtualization
[Figure labels: dataset, Data Service, Data Virtualization (an abstract view of data)]
By Global Grid Forum’s DAIS working group:
A Data Virtualization describes an abstract view of data.
A Data Service implements the mechanism to access and process data through the Data Virtualization.

Introduction: Automatic Data Virtualization
Goal: Enable automatic creation of efficient data services
  – Support a high-level or abstract view of data
  – Data is stored in low-level format
Application development:
  – Assume a high-level or virtual view
Application execution:
  – On actual low-level layout

Overview of Our Automatic Data Virtualization Work
Previous work on XML-based virtualization
  – Techniques for XQuery compilation (Li and Agrawal, ICS 2003, DBPL 2003)
  – Supporting XML-based high-level abstraction on flat-file datasets (LCPC 2003, XIME-P 2004)
Relational table/SQL-based implementation
  – Supporting SQL Select and Where (HPDC 2004)
  – Supporting SQL-3 aggregations (LCPC 2004)

XML-based Virtualization
[Figure labels: TEXT, …, NetCDF, RDBMS, HDF5, XML, XQuery, ???]

Challenges and Contributions
Challenges
  – Compiler generates efficient data processing code
    » Uses the information about the low-level layout and the mapping between virtual and low-level layout
  – Challenge in compilation
    » High level to low level
    » To ensure high locality in processing of large datasets
Contributions of this paper
  – An improved data-centric transformation algorithm
  – An implementation specific to HDF5 as the low-level format

System Overview
[Figure labels: High level XML Schema, Low level XML Schema, Mapping Schema, XQuery Source Code, Compiler, Generated Code, HDF5 Library, Processor and Disk]

XQuery and HDF5
High-level declarative languages ease application development
  – XQuery is a high-level language for processing XML datasets
  – Derived from database, declarative, and functional languages
HDF5:
  – Hierarchical Data Format
  – Widely used in scientific communities
  – A case study with a format which has optimized access libraries

Use of XML Schemas
High-level schema
  – XML is used to provide a virtual view of the dataset
Low-level schema
  – Reflects actual physical layout in HDF5
Mapping schema
  – Describes mapping between each element of high-level schema and low-level schema

Oil Reservoir Simulation
Support cost-effective oil production
Simulations on a 3-D grid
17 variables and cell locations in 3-D grid at each time step
Computation of bypassed regions
  – Expression to determine if a cell is bypassed for a time-step
  – Within a spatial region and range of time steps
  – Grid cells that are bypassed for every time-step in the range
Oil reservoir management

High-Level Schema

High-Level XQuery Code of Oil Reservoir Management
  unordered(
    for $i in ($x1 to $x2)
    for $j in ($y1 to $y2)
    for $k in ($z1 to $z2)
    let $p := document("OilRes.xml")/data
    where ($p/x = $i) and ($p/y = $j) and ($p/z = $k)
          and ($p/time >= $tmin) and ($p/time <= $tmax)
    return {$i, $j, $k} { analyze($p) }
  )
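To make the semantics of the query concrete, here is a minimal Python sketch of a direct, untransformed reading of it. Several things are assumptions rather than anything given on the slides: records are plain dictionaries, `analyze` is taken to be the "bypassed for every time step" reduction described on the Oil Reservoir Simulation slide, and the predicate `is_bypassed` with its `soil` field and threshold is purely hypothetical.

```python
def is_bypassed(rec):
    # Hypothetical predicate; the slides do not give the real expression.
    return rec["soil"] > 0.7


def bypassed_cells(records, x_range, y_range, z_range, tmin, tmax):
    """Direct reading of the query: one full scan of `records` per (i, j, k)."""
    result = {}
    for i in range(x_range[0], x_range[1] + 1):
        for j in range(y_range[0], y_range[1] + 1):
            for k in range(z_range[0], z_range[1] + 1):
                # The `let`/`where` pair: select matching records by scanning everything.
                p = [r for r in records
                     if r["x"] == i and r["y"] == j and r["z"] == k
                     and tmin <= r["time"] <= tmax]
                # Assumed meaning of analyze($p): bypassed at every selected time step.
                result[(i, j, k)] = bool(p) and all(is_bypassed(r) for r in p)
    return result
```

Each iteration of the innermost loop rescans the whole dataset, which is the cost the compiler analysis in the following slides sets out to remove.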

Low-Level Schema
[Schema listing; XML markup not preserved in the transcript. Surviving fragments: integer 1 [1], float 1 [x], ...]

Mapping Schema
  //high/data/velocity → //low/info/data/velocity
  //high/data/time → //low/info/data/time
  //high/data/mom → //low/info/data/mom [index(//low/info/data/velocity, 1)]
  //high/data/x → //low/coord/x [index(//low/info/data/velocity, 1)]
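One way to picture the mapping schema is as a lookup table from high-level paths to low-level targets. The sketch below is only an illustration: the `LowLevelTarget` type and `resolve` helper are hypothetical names, and reading `index(path, 1)` as "indexed along dimension 1 of the dataset at `path`" is an assumption, since the slide does not define the `index` construct.

```python
from typing import NamedTuple, Optional


class LowLevelTarget(NamedTuple):
    path: str                        # low-level (HDF5-side) path
    index_of: Optional[str] = None   # dataset whose index is reused (assumed meaning)
    dimension: Optional[int] = None  # which dimension of that dataset (assumed meaning)


# The four pairs shown on the slide, transcribed as data.
MAPPING = {
    "//high/data/velocity": LowLevelTarget("//low/info/data/velocity"),
    "//high/data/time": LowLevelTarget("//low/info/data/time"),
    "//high/data/mom": LowLevelTarget(
        "//low/info/data/mom", "//low/info/data/velocity", 1),
    "//high/data/x": LowLevelTarget(
        "//low/coord/x", "//low/info/data/velocity", 1),
}


def resolve(high_path: str) -> LowLevelTarget:
    """Look up where a high-level element lives in the low-level layout."""
    return MAPPING[high_path]
```

A compiler pass could consult such a table whenever the query touches a high-level element such as `$p/x`.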

Compiler Analysis
Problem with direct translation:
  – Each let expression involves a complete scan over the dataset
  – So the final code will need several passes over the data
Solution:
  – Apply data-centric transformations so that each portion of the HDF5 dataset is read only once

Naïve Strategy
[Figure labels: Dataset, Output]
Requires 3 scans

Data Centric Strategy
[Figure labels: Datasets, Output]
Requires just one scan

Data Centric Transformation
Overall idea in data-centric transformation
  – Iterate over each data element in actual storage
  – Find out the iterations of the original loop in which it is accessed
  – Execute the computation corresponding to those iterations
Previous work
  – Pingali et al.: blocking
  – Ferreira and Agrawal: data-parallel Java on disk-resident datasets
  – Li and Agrawal: XQuery, invert getData functions
Our contribution:
  – Use low-level and mapping schema
  – Extend the idea when multiple datasets need to be accessed
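The difference between the two strategies can be sketched in a few lines of Python. This is a toy illustration, not the compiler's output: the dataset is a list of `(key, value)` pairs, the reduction is a plain sum, and the function names are made up for the example.

```python
def naive(dataset, keys):
    """Loop-centric: every iteration scans the full dataset."""
    scans = 0
    out = {}
    for key in keys:
        scans += 1  # one complete pass per loop iteration
        out[key] = sum(v for k, v in dataset if k == key)
    return out, scans


def data_centric(dataset, keys):
    """Data-centric: a single pass; each element finds its own iteration."""
    out = {key: 0 for key in keys}
    for k, v in dataset:   # the only scan
        if k in out:       # which loop iteration accesses this element?
            out[k] += v    # run that iteration's share of the reduction
    return out, 1
```

With three loop iterations the first version reads the data three times and the second reads it once, while both produce the same output.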

Data Centric Transformation
Mapping function $T$: Iteration space → High-level data
Mapping function $C$: High-level data → Low-level data
Mapping function $M = C \cdot T$: Iteration space → Low-level data
Our goal is to compute $M^{-1}$.
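A small worked example may help here. The sketch below assumes a layout that the slides do not specify: a single flat, row-major array ordered as `[time][x][y][z]`, with hypothetical dimensions. Under that assumption, `M` sends an iteration `(i, j, k)` to every offset it touches (one per time step), and `M_inv` recovers the iteration from an offset by repeated division.

```python
# Hypothetical dimensions: time steps and grid extent in x, y, z.
NT, NX, NY, NZ = 2, 3, 4, 5


def M(i, j, k):
    """Iteration (i, j, k) -> all low-level offsets it accesses (assumed layout)."""
    return [((t * NX + i) * NY + j) * NZ + k for t in range(NT)]


def M_inv(offset):
    """Low-level offset -> the iteration (i, j, k) that accesses it."""
    rest, k = divmod(offset, NZ)
    rest, j = divmod(rest, NY)
    _t, i = divmod(rest, NX)
    return (i, j, k)
```

Because one iteration touches several elements, `M` is one-to-many, and `M_inv` is what lets the generated code walk the storage in order and still know which output element to update.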

Data Centric Transformation
Choose one dataset as the base dataset $S_1$ from the $n$ datasets to be accessed.
Apply $M_1^{-1}$ to compute the set of iterations.
The expression $M_i \cdot M_1^{-1}$ gives the portion of dataset $S_i$ that needs to be accessed along with $S_1$.
The choice of base dataset might impact data locality.
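The composition $M_i \cdot M_1^{-1}$ can be illustrated with two datasets under assumed layouts: a base dataset stored as a flat row-major `[time][x][y][z]` array, and a second dataset that is a 1-D coordinate array indexed by `x`. The dimensions and function names below are hypothetical.

```python
# Hypothetical dimensions of the base dataset S1.
NT, NX, NY, NZ = 2, 3, 4, 5


def M1_inv(offset):
    """Offset in base dataset S1 -> iteration (i, j, k), assuming [time][x][y][z]."""
    rest, k = divmod(offset, NZ)
    rest, j = divmod(rest, NY)
    _t, i = divmod(rest, NX)
    return (i, j, k)


def M2(i, j, k):
    """Iteration -> index into S2, assumed to be a 1-D array of x coordinates."""
    return i


def companion_indices(chunk_offsets):
    """M2 composed with M1_inv: the part of S2 needed alongside a chunk of S1."""
    return sorted({M2(*M1_inv(o)) for o in chunk_offsets})
```

For example, `companion_indices(range(0, 40))` reports which entries of the coordinate array must be available while the first 40 elements of the base dataset are processed.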

Choice of Base Dataset
Min-IO-Volume strategy
  – Minimize repeated access to any dataset
Min-Seek-Time strategy
  – Minimize any discontinuity in access

Template for Generated Code
  Generated_Query {
    Create an abstract iteration space using the source code.
    Allocate and initialize an array of output elements corresponding to the iteration space.
    For k = 1, …, NO_OF_CHUNKS {
      Read the k-th chunk of dataset $S_1$ using HDF5 functions and the structural tree.
      Foreach of the other datasets $S_2, …, S_n$, access the required chunk of the dataset.
      Foreach data element in the chunks of data {
        Compute the iteration instance.
        Apply the reduction computation and update the output.
      }
    }
  }
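The template can be sketched as runnable Python under simplifying assumptions. Two in-memory lists stand in for HDF5 datasets, list slicing stands in for an HDF5 chunk read, both datasets share an assumed flat `[time][x][y][z]` layout, and the reduction is a logical AND over time steps. The function name and parameters are hypothetical and this is not the code the compiler actually emits.

```python
def run_generated_query(s1, s2, dims, chunk_len, is_bypassed):
    nt, nx, ny, nz = dims
    # Abstract iteration space: one output element per (i, j, k).
    output = {(i, j, k): True
              for i in range(nx) for j in range(ny) for k in range(nz)}
    for start in range(0, len(s1), chunk_len):
        chunk1 = s1[start:start + chunk_len]  # stand-in for reading a chunk of S1
        chunk2 = s2[start:start + chunk_len]  # matching portion of the other dataset
        for pos, (a, b) in enumerate(zip(chunk1, chunk2)):
            # Compute the iteration instance from the element's storage offset.
            rest, k = divmod(start + pos, nz)
            rest, j = divmod(rest, ny)
            _t, i = divmod(rest, nx)
            # Apply the reduction and update the output element.
            output[(i, j, k)] = output[(i, j, k)] and is_bypassed(a, b)
    return output
```

Every element is read exactly once regardless of `chunk_len`; the chunk size only changes how the single pass is partitioned.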

Experiment
200*200*200 grid with 10 time steps (1.28 GB)
50*50*50 storage chunk size

Experiment
50*50*50 grid with 200 time steps (400 MB)
25*25*25 storage chunk size
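As a quick sanity check on the two configurations, the arithmetic below derives the bytes stored per grid cell per time step and the number of storage chunks per time step. It assumes decimal units (1 GB = 10^9 bytes, 1 MB = 10^6 bytes) and cubic grids and chunks; the helper name is made up for this example.

```python
def per_step_stats(grid, chunk, time_steps, total_bytes):
    """Return (bytes per cell per time step, storage chunks per time step)."""
    cells = grid ** 3 * time_steps
    chunks_per_step = (grid // chunk) ** 3
    return total_bytes / cells, chunks_per_step


large = per_step_stats(200, 50, 10, 1_280_000_000)  # first configuration
small = per_step_stats(50, 25, 200, 400_000_000)    # second configuration
```

Under these assumptions both configurations work out to the same number of bytes per cell per time step, and they differ in how many storage chunks make up one time step (64 versus 8).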

Key Observations
Overall minimum execution time
  – Min-IO-Volume strategy when the read chunk size matches the storage chunk size
Execution time
  – Very sensitive to read chunk size in the Min-IO-Volume strategy
  – Not sensitive to read chunk size in the Min-Seek-Time strategy, due to buffering of storage chunks

Summary
Compiler techniques
  – Support high-level abstractions on complex low-level data formats
  – Enable use of the same source code across a variety of data formats
  – Perform data-centric transformations automatically
  – Experimental results show that a minor change in strategy can affect performance significantly
Future work
  – Cost models to guide strategy and chunk size selection
  – Compare performance with manual implementations
  – Parallelize data processing
  – Extend applicability of the algorithm to a more general class of queries
