Ohio State University Department of Computer Science and Engineering An Approach for Automatic Data Virtualization Li Weng, Gagan Agrawal et al.

Motivating Applications
Data-driven applications from science, engineering, and biomedicine:
  – Oil Reservoir Management
  – Water Contamination Studies
  – Cancer Studies using MRI
  – Telepathology with Digitized Slides
  – Satellite Data Processing
  – Virtual Microscope
  – …
[Figures: Magnetic Resonance Imaging; Oil Reservoir Management]

Opportunity and Issues
Emergence of grid-based data repositories
  – Can enable sharing of data in an unprecedented way
Access mechanisms for remote repositories
  – Complex low-level formats make accessing and processing of data difficult
Main desired functionality
  – Ability to select and download a subset of data

Current Approaches
Databases
  – Relational model using SQL
  – Properties of transactions: Atomicity, Consistency, Isolation, Durability
  – Good! But is it too heavyweight for read-mostly scientific data?
Manual implementation based on low-level datasets
  – Need detailed understanding of low-level formats
HDF5, NetCDF, etc.
  – No single established standard
BinX, BFD, DFDL
  – Machine-readable descriptions, but the application is dependent on a specific layout

Data Virtualization
[Diagram labels: an abstract view of data; dataset; Data Service; Data Virtualization]
Definitions from the Global Grid Forum’s DAIS working group:
  – A Data Virtualization describes an abstract view of data.
  – A Data Service implements the mechanism to access and process data through the Data Virtualization.

Our Approach: Automatic Data Virtualization
Automatically create data services
  – A new application of compiler technology
A meta-data descriptor describes the layout of data on a repository
An abstract view is exposed to the users
This paper:
  – Relational table view
  – Subsetting specified through SQL SELECT statements with WHERE clauses

Outline
Introduction
  – Motivation
  – System overview
System design and algorithm
  – Design a meta-data descriptor
  – Automatic data virtualization using our meta-data descriptor
Experimental results
Related work
Conclusions and future work

System Overview
[Diagram: query form] SELECT FROM WHERE …. AND Filter( );

STORM Runtime System
A middleware to support data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system.
Services:
  – Query service
  – Data source service
  – Indexing service
  – Filtering service
  – Partition generation service
  – Data mover service

Scientific Datasets
Large volume
  – Gigabyte, Terabyte, Petabyte, …
  – Stored as binary/character flat files with highly repetitive structure
Distributed datasets
  – Generated/collected by scientific simulations or instruments
Multi-dimensional datasets
  – Spatial and/or temporal coordinates as subsetting index attributes
  – Filtering attributes

Design a Meta-data Description Language
Requirements
  – Specify the relationship of a dataset to the virtual dataset schema
  – Describe the dataset physical layout within a file
  – Describe the dataset distribution on nodes of one or more clusters
  – Specify the subsetting index attributes
  – Easy to use for data repository administrators and also convenient for our code generation

Design Overview
  – Dataset Schema Description Component
  – Dataset Storage Description Component
  – Dataset Layout Description Component

An Example
Oil Reservoir Management
  – The dataset comprises several simulations (realizations) on the same grid.
  – For each realization and each grid point, a number of attributes are stored.
  – The dataset is stored on a 4-node cluster.

Component I: Dataset Schema Description
    [IPARS]            // {* Dataset schema name *}
    REL = short int    // {* Data type definition *}
    TIME = int
    X = float
    Y = float
    Z = float
    SOIL = float
    SGAS = float

Component II: Dataset Storage Description
    [IparsData]                   // {* Dataset name *}
    DatasetDescription = IPARS    // {* Dataset schema for IparsData *}
    DIR[0] = osu0/ipars
    DIR[1] = osu1/ipars
    DIR[2] = osu2/ipars
    DIR[3] = osu3/ipars
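As an illustration of how a tool might consume Component I, here is a minimal Python sketch that parses the schema description and derives a per-attribute byte size. The parser, the function names, and the mapping from type names to `struct` format codes (2-byte short int, 4-byte int and float) are assumptions made for this example, not the authors' implementation.

```python
import struct

# Assumed mapping from descriptor type names to struct format codes.
TYPE_CODES = {"short int": "h", "int": "i", "float": "f"}

SCHEMA_TEXT = """
[IPARS]            // {* Dataset schema name *}
REL = short int    // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
"""

def parse_schema(text):
    """Return (schema_name, {attribute: type_name}) from a schema description."""
    name, attrs = None, {}
    for raw in text.splitlines():
        line = raw.split("//")[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1]
        else:
            attr, type_name = (part.strip() for part in line.split("=", 1))
            attrs[attr] = type_name
    return name, attrs

def attribute_size(type_name):
    """Size in bytes of one value, using struct's standard (non-native) sizes."""
    return struct.calcsize("<" + TYPE_CODES[type_name])

schema_name, schema = parse_schema(SCHEMA_TEXT)
sizes = {attr: attribute_size(t) for attr, t in schema.items()}
```

Under these assumptions, `sizes` gives the number of bytes each attribute occupies in a flat file, which is the kind of information an offset computation would need.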

Data Layout Description Component
[Diagram: tree with "Dataset Root" at the top; children dataset 1, dataset 2, dataset 3; leaves Data1 to Data6]

    DATASET "ROOT" {
        DATATYPE { … }
        DATAINDEX { … }
        DATA { DATASET dataset1 DATASET dataset2 DATASET dataset3 }
        DATASET "dataset1" {
            DATATYPE { … }
            DATASPACE { … }
            DATA { data1 data2 data3 }
        }
        DATASET "dataset2" {
            DATATYPE { … }
            DATASPACE { … }
            DATA { data4 }
        }
        DATASET "dataset3" { …. }
    }

An Example
Oil Reservoir Management
  – Use the LOOP keyword to capture the repetitive structure within a file.
  – The grid has 4 partitions (0~3).
  – "IparsData" comprises "ipars1" and "ipars2". "ipars1" describes the data files that store the spatial coordinates; "ipars2" describes the data files that store the other attributes.

Component III: Dataset Layout Description
    DATASET "IparsData" {              // {* Name for Dataset *}
        DATATYPE { IPARS }             // {* Schema for Dataset *}
        DATAINDEX { REL TIME }
        DATA { DATASET ipars1 DATASET ipars2 }
        DATASET "ipars1" {
            DATASPACE {
                LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
                    X Y Z
                }
            }
            DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
        }                              // {* end of DATASET "ipars1" *}
        DATASET "ipars2" {
            DATASPACE {
                LOOP TIME 1:500:1 {
                    LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 {
                        SOIL SGAS
                    }
                }
            }
            DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
        }                              // {* end of DATASET "ipars2" *}
    }
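To make the DATA and LOOP clauses concrete, the following Python sketch expands the file-name patterns and computes how many bytes each file would hold. It assumes that a range `lo:hi:step` includes both bounds (consistent with partitions 0~3 written as `0:3:1`), that floats are 4 bytes, and that values are packed with no header or padding. All function names are hypothetical.

```python
from itertools import product

DIRS = ["osu0/ipars", "osu1/ipars", "osu2/ipars", "osu3/ipars"]  # from Component II
FLOAT_BYTES = 4  # assumed size of one float value

def inclusive(lo, hi, step=1):
    """Expand a lo:hi:step range, assuming both bounds are inclusive."""
    return range(lo, hi + 1, step)

def coords_files():
    """Paths named by $DIR[$DIRID]/COORDS with $DIRID = 0:3:1."""
    return [f"{DIRS[d]}/COORDS" for d in inclusive(0, 3)]

def data_files():
    """Paths named by $DIR[$DIRID]/DATA$REL with $REL = 0:3:1 and $DIRID = 0:3:1."""
    return [f"{DIRS[d]}/DATA{r}"
            for d, r in product(inclusive(0, 3), inclusive(0, 3))]

def grid_points(dirid):
    """Grid indices covered by LOOP GRID for one partition."""
    return inclusive(dirid * 100 + 1, (dirid + 1) * 100)

def expected_bytes(dirid, attrs_per_point, time_steps=1):
    """Bytes implied by the nested loops, assuming tightly packed floats."""
    return time_steps * len(grid_points(dirid)) * attrs_per_point * FLOAT_BYTES
```

With these assumptions, each COORDS file holds `expected_bytes(k, 3)` bytes (X, Y, Z for 100 grid points) and each DATA file holds `expected_bytes(k, 2, 500)` bytes (SOIL and SGAS for 100 grid points over 500 time steps).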

Automatic Virtualization Using Meta-data
Aligned file chunks:
    {num_rows, {File_1, Offset_1, Num_Bytes_1}, {File_2, Offset_2, Num_Bytes_2}, ……, {File_m, Offset_m, Num_Bytes_m}}
Our tool parses the meta-data descriptor and generates function code. At run time, the query provides the parameters used to invoke the generated functions, which create Aligned File Chunks.
[Diagram: tree with "Dataset Root" at the top; children dataset 1, dataset 2, dataset 3; leaves Data1 to Data6]
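One way to read the aligned file chunk structure is as a row count plus one byte range per file, where the i-th record of every range belongs to the same output row. The Python sketch below illustrates that reading. The class names, the in-memory `files` dictionary standing in for real file reads, and the little-endian float formats are all assumptions made for the example.

```python
import struct
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    file: str
    offset: int
    num_bytes: int

@dataclass(frozen=True)
class AlignedFileChunk:
    num_rows: int
    segments: tuple  # one Segment per file in the group

def rows(chunk, files, formats):
    """Yield one tuple per row by stitching the i-th record of every segment.

    files maps a file name to its bytes (a stand-in for reading from disk);
    formats[j] is the struct format of one record in segment j.
    """
    for i in range(chunk.num_rows):
        row = ()
        for seg, fmt in zip(chunk.segments, formats):
            record_bytes = struct.calcsize(fmt)
            start = seg.offset + i * record_bytes
            row += struct.unpack_from(fmt, files[seg.file], start)
        yield row

# Tiny hypothetical example: two grid points, coordinates in one file,
# SOIL and SGAS in another.
example_files = {
    "COORDS": struct.pack("<6f", 1, 2, 3, 4, 5, 6),
    "DATA0": struct.pack("<4f", 0.5, 0.25, 0.75, 0.125),
}
example_chunk = AlignedFileChunk(
    num_rows=2,
    segments=(Segment("COORDS", 0, 24), Segment("DATA0", 0, 16)),
)
example_rows = list(rows(example_chunk, example_files, ("<3f", "<2f")))
```

Each resulting row combines X, Y, Z from the first segment with SOIL and SGAS from the second, which is how a relational table view can be assembled from separately stored attributes.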

Compiler Analysis
[Diagram labels: Meta-data descriptor; Compiler Analysis; Create AFC; Process AFC; Index & Extraction function code]

    Data_Extract {
        Find_File_Groups()
        Process_File_Groups()
    }

    Find_File_Groups {
        Let S be the set of files that match against the query
        Classify files in S by the set of attributes they have
        Let S_1, …, S_m be the m sets
        T = Ø
        foreach {s_1, …, s_m}, s_i ∈ S_i {    {* cartesian product between S_1, …, S_m *}
            If the values of implicit attributes are not inconsistent {
                T = T ∪ {s_1, …, s_m}
            }
        }
        Output T
    }

    Process_File_Groups {
        foreach {s_1, …, s_m} ∈ T {
            Find_Aligned_File_Chunks()
            Supply implicit attributes for each file chunk
            foreach Aligned File Chunk {
                Check against index
                Compute offset and length
                Output the aligned file chunk
            }
        }
    }
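The Find_File_Groups step can be sketched in Python as follows. The record format (a dict with `path`, `attrs`, and `implicit` keys) and the helper `example_files` are hypothetical. The example files follow the layout descriptor, with one COORDS file per directory and one DATA file per directory and REL value.

```python
from itertools import product

def consistent(group):
    """True if no implicit attribute takes two different values in the group."""
    seen = {}
    for f in group:
        for key, value in f["implicit"].items():
            if seen.setdefault(key, value) != value:
                return False
    return True

def find_file_groups(matching_files):
    """Classify files by attribute set, then keep consistent combinations."""
    classes = {}
    for f in matching_files:
        classes.setdefault(frozenset(f["attrs"]), []).append(f)
    # Cartesian product across the classes, filtered by implicit attributes.
    return [group for group in product(*classes.values()) if consistent(group)]

def example_files(rels=(0, 1)):
    """Hypothetical file records that survive a query on REL in {0, 1}."""
    files = []
    for dirid in range(4):
        files.append({"path": f"osu{dirid}/ipars/COORDS",
                      "attrs": ("X", "Y", "Z"),
                      "implicit": {"DIRID": dirid}})
        for rel in rels:
            files.append({"path": f"osu{dirid}/ipars/DATA{rel}",
                          "attrs": ("SOIL", "SGAS"),
                          "implicit": {"DIRID": dirid, "REL": rel}})
    return files

groups = find_file_groups(example_files())
```

Here the Cartesian product has 4 × 8 = 32 candidate pairs, and the consistency check on DIRID keeps the 8 pairs whose files come from the same partition.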

An Example
Consider a query that selects the subset with REL values of 0 and 1 and TIME from 1 to 100, evaluated against the "IparsData" layout description (Component III) given earlier.
  – Exclude DATA2, DATA3
  – Exclude COORD2, COORD3
  – Decide eight file groups, for k = 0, 1, 2, 3:
      DIR[k]/{COORD0, DATA0}
      DIR[k]/{COORD1, DATA1}
  – Create 100 Aligned File Chunks for each file group
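The count of 100 chunks per file group can be checked with a short Python sketch that builds one aligned file chunk per selected time step. It assumes 4-byte floats, tightly packed files, a coordinates file that stores X, Y, Z once per grid point, and a data file that stores SOIL and SGAS for 100 grid points per time step. The function name and tuple layout are hypothetical, and the file names are passed in rather than derived.

```python
GRID_POINTS = 100   # grid points per partition, from the LOOP GRID range
FLOAT_BYTES = 4     # assumed size of one float value

def chunks_for_group(coord_file, data_file, time_lo, time_hi):
    """Build (num_rows, coord_segment, data_segment) for each time step.

    Each segment is (file, offset, num_bytes). The coordinate segment is the
    same for every time step; the data segment advances by one time step.
    """
    coord_bytes = GRID_POINTS * 3 * FLOAT_BYTES   # X, Y, Z
    step_bytes = GRID_POINTS * 2 * FLOAT_BYTES    # SOIL, SGAS
    chunks = []
    for t in range(time_lo, time_hi + 1):
        data_offset = (t - 1) * step_bytes        # TIME starts at 1
        chunks.append((GRID_POINTS,
                       (coord_file, 0, coord_bytes),
                       (data_file, data_offset, step_bytes)))
    return chunks

chunks = chunks_for_group("osu0/ipars/COORD0", "osu0/ipars/DATA0", 1, 100)
total_rows = 8 * sum(c[0] for c in chunks)  # eight file groups in the example
```

Under these assumptions each file group yields 100 chunks of 100 rows, so the eight groups together produce 800 chunks covering 80,000 rows.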

Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet. Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks.
Three sets of experiments:
  1. Code generation ability
  2. Scalability evaluation
  3. Comparison with hand-written code

Test the Ability of Our Code Generation Tool
  – Layout 0: original layout from the application collaborators
  – Layout 1: all data stored as a table in a single file
  – Layout 2: all data in a single file, with each attribute stored as an array
  – Layout 3: Layout 1 split into multiple files based on the value of the time step
  – Layout 4: like Layout 3, but each attribute is stored as an array in each data file
  – Layout 5: data stored in 7 files, where the first file holds the spatial coordinates and the other attributes are divided among 6 files
  – Layout 6: like Layout 5, but each attribute is stored as an array in each data file

Test the Ability of Our Code Generation Tool
Dataset: Oil Reservoir Management
  – The performance difference relative to Layout 0 is within 4% to 10%.
  – The tool correctly and efficiently handles a variety of different layouts for the same data.

Evaluate the Scalability of Our Tool
  – Scale the number of nodes hosting the oil reservoir management dataset.
  – Extract a subset of interest of size 1.3 GB.
  – The execution times scale almost linearly.
  – The performance difference varies between 5% and 34%, with an average difference of 16%.

Comparison with Hand-Written Code
Oil reservoir management dataset stored on 16 nodes
  – Performance difference is within 17%, with an average difference of 14%.
Satellite data processing dataset stored on a single node
  – Performance difference is within 4%.

Related Work
Describing data on the Grid
  – BinX and Binary Format Description
  – HDF5
Parallel / distributed databases
  – Data cube
  – Magda on top of MySQL
  – Oracle’s external tables
  – OPeNDAP
  – SRS

Conclusions and Future Work
An automatic approach to support data virtualization for large distributed scientific datasets in low-level formats.
  – Design of a meta-data description language
  – Compiler-based strategy to generate extractor code automatically
  – The dataset can be stored in the format in which it is generated; no effort is needed to load it into a database system.
  – Experimental evaluation demonstrates the efficacy and efficiency of our tool.
Future work
  – Experimental studies on more real data-driven and interactive applications, with larger scientific datasets, in distributed and heterogeneous computing environments
  – Extend computation capability and flexibility by supporting user-defined aggregates
  – Integration of multiple datasets in the grid computing environment

Comparison with an Existing Database (PostgreSQL)
  – 6 GB of data for satellite data processing. The total storage required after loading the data into PostgreSQL is 18 GB.
  – Indexes created in PostgreSQL on both the spatial coordinates and S1.
  – No special performance tuning applied for the experiment.

    No.  Description
    1    SELECT * FROM TITAN;
    2    SELECT * FROM TITAN WHERE X>=0 AND X<=… AND Y>=0 AND Y<=… AND Z>=0 AND Z<=100;
    3    SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000;
    4    SELECT * FROM TITAN WHERE S1 < 0.01;
    5    SELECT * FROM TITAN WHERE S1 < 0.5;
