Automatic and Efficient Data Virtualization System on Scientific Datasets
Li Weng

Outline
- Introduction
  - Motivation
  - Contributions
  - Overall system framework
- System design, algorithms, and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis in the data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work on partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions

Data Grids
- Datasets
  - Large volume: gigabytes, terabytes, petabytes
  - Distributed, generated/collected by scientific simulations or instruments
  - Multi-dimensional: dimension attributes and measure attributes
- Data-intensive applications
  - Data specification, data organization, data extraction, data movement, data analysis, data visualization

Motivating Applications
(Figures: digitized microscopy image analysis; oil reservoir management)
Data-driven applications from science, engineering, and biomedicine:
- Oil Reservoir Management
- Water Contamination Studies
- Cancer Studies using MRI
- Telepathology with Digitized Slides
- Satellite Data Processing
- Virtual Microscope
- ...

Two Challenges
Given large dataset sizes, the geographic distribution of users and resources, and complex analyses, we concentrate on two critical challenges:
- Low-level and specialized data formats
- Various query types and an increasing number of clients

Contributions
- Data virtualization system
  - Realizing data virtualization through automatically generated data services (HPDC 2004)
  - Supporting complex data analysis processing through SQL-3 queries and aggregations (LCPC 2004)
- Replica selection module
  - Designing new techniques for efficient execution of data analysis queries using partial replicas (CCGRID 2005)
  - Generalizing the functionality of the replica selection module through two significant extensions (ICPP 2006)
- Efficient execution of multiple queries
  - Exploring the performance optimization potential of multiple queries (under submission)

Automatic Data Virtualization System (HPDC 2004)
Query template:

SELECT <Data Elements>
FROM <Dataset Name>
WHERE <Expression> AND Filter(<Data Element>);

Example over the IPARS dataset:

SELECT *
FROM IPARS
WHERE REL in (0,6,26,27) AND TIME>1000 AND TIME<1100
  AND SOIL>0.7 AND SPEED(OILVX,OILVY,OILVZ)<30.0;

Data Analysis in Data Virtualization System (LCPC 2004)

Replica Selection Module (CCGRID 2005, ICPP 2006)

Outline (current section: Automatic data virtualization system)

Automatic Data Virtualization System
- Provide an abstract view of the data: dataset -> data virtualization -> data service
- Design a meta-data descriptor
- Perform automatic data virtualization using the meta-data descriptor

Design of the Meta-data Description Language
- Dataset Schema Description component
- Dataset Storage Description component
- Dataset Layout Description component

An Example: Oil Reservoir Management
- The dataset comprises several simulations (realizations) on the same grid.
- For each realization and each grid point, a number of attributes are stored.
- The dataset is stored on a 4-node cluster.

Component I: Dataset Schema Description

[IPARS]            // {* Dataset schema name *}
REL = short int    // {* Data type definition *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float

Component II: Dataset Storage Description

[IparsData]                  //{* Dataset name *}
DatasetDescription = IPARS   //{* Dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars

An Example: Oil Reservoir Management (continued)
Component III: Dataset Layout Description

DATASET "IparsData" {            //{* Name for Dataset *}
  DATATYPE { IPARS }             //{* Schema for Dataset *}
  DATAINDEX { REL TIME }
  DATA { DATASET ipars1 DATASET ipars2 }
}
DATASET "ipars1" {
  DATASPACE {
    LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { X Y Z }
  }
  DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
} // {* end of DATASET "ipars1" *}
DATASET "ipars2" {
  DATASPACE {
    LOOP TIME 1:500:1 {
      LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { SOIL SGAS }
    }
  }
  DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
} //{* end of DATASET "ipars2" *}

- The LOOP keyword captures the repetitive structure within a file.
- The grid has 4 partitions (0-3).
- "IparsData" comprises "ipars1" and "ipars2": "ipars1" describes the data files storing the spatial coordinates, and "ipars2" specifies the data files storing the other attributes.

Compiler Analysis: Index and Extraction Function Code
The compiler parses the meta-data descriptor and generates the function code that creates and processes Aligned File Chunks (AFCs):

Data_Extract {
  Find_File_Groups()
  Process_File_Groups()
}

Find_File_Groups {
  Let S be the set of files that match against the query
  Classify the files in S by the set of attributes they have
  Let S1, ..., Sm be the resulting m sets
  T = Ø
  foreach {s1, ..., sm}, si ∈ Si {   {* cartesian product between S1, ..., Sm *}
    if the values of the implicit attributes are consistent
      T = T ∪ {s1, ..., sm}
  }
  Output T
}

Process_File_Groups {
  foreach {s1, ..., sm} ∈ T {
    Find_Aligned_File_Chunks()
    Supply the implicit attributes for each file chunk
    foreach aligned file chunk {
      Check against the index
      Compute offset and length
      Output the aligned file chunk
    }
  }
}
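To make Find_File_Groups concrete, here is a minimal Python sketch; the File record, its fields, and implicit_attrs_consistent are illustrative assumptions (the actual tool generates C/C++ code from the descriptor):

from dataclasses import dataclass
from itertools import product
from typing import Dict, FrozenSet, List

@dataclass
class File:                       # hypothetical file-metadata record
    attributes: FrozenSet[str]    # explicit attributes stored in the file
    implicit: Dict[str, int]      # implicit attributes, e.g. {"REL": 2}

def implicit_attrs_consistent(combo) -> bool:
    # A combination is usable only if its files agree on shared implicit values.
    merged: Dict[str, int] = {}
    for f in combo:
        for k, v in f.implicit.items():
            if merged.setdefault(k, v) != v:
                return False
    return True

def find_file_groups(matching_files: List[File]):
    # Classify the files that match the query by attribute set (S1, ..., Sm).
    classes: Dict[FrozenSet[str], List[File]] = {}
    for f in matching_files:
        classes.setdefault(f.attributes, []).append(f)
    # Cartesian product between the classes; keep consistent combinations.
    groups = []
    for combo in product(*classes.values()):
        if implicit_attrs_consistent(combo):
            groups.append(combo)
    return groups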

Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
- Four sets of experiments:
  - Ability test
  - Scalability test
  - Comparison with hand-written codes
  - Comparison with an existing database (PostgreSQL)

Testing the Ability of Our Code Generation Tool
- Layout 0: original layout from the application collaborators
- Layout 1: all data stored as a table in a single file
- Layout 2: all data in a single file, with each attribute stored as an array
- Layout 3: Layout 1 split into multiple files based on the value of the time step
- Layout 4: like Layout 3, but each attribute stored as an array in each data file
- Layout 5: data stored in 7 files, where the first file holds the spatial coordinates and the other attributes are divided among the other 6 files
- Layout 6: like Layout 5, but each attribute stored as an array in each file

Testing the Ability of Our Code Generation Tool
Oil reservoir management (12 GB):
- The performance difference relative to Layout 0 is within 4%-10%.
- The tool correctly and efficiently handles a variety of different layouts for the same data.

Evaluating the Scalability of Our Tool
- Scale the number of nodes hosting the oil reservoir management dataset; extract a subset of interest of 1.3 GB.
- The execution times scale almost linearly.
- The performance difference varies between 5% and 34%, with an average difference of 16%.

Data Analysis in the Data Virtualization System
- Express the query and the data analysis declaratively against a virtual relational table view
- Generate and optimize a data aggregation service for the desired processing
- Employ a data partitioning strategy in parallel and distributed configurations

Oil Reservoir Management (IPARS)

SELECT X, Y, Z, ipars_bypass_sum(IPARS)
FROM IPARS
WHERE REL in (0,5,10) AND TIME >= 1000 AND TIME <= 1200
GROUP BY X, Y, Z
HAVING ipars_bypass_sum(OIL) > 0;

CREATE AGGREGATE ipars_bypass_sum (
  BASETYPE = IPARS,
  SFUNC = ipars_func,
  STYPE = int,
  INITCOND = '1'
);

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE
    WHEN $2.soil > 0.7 AND
         |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz) < 30.0
    THEN $1 & 1
    ELSE 0
  END;
' LANGUAGE SQL;

Compiler Analysis and Code Generation
Transform the canonical query into two pipelined sub-queries.

Data extraction service:

TempDataset = SELECT <all attributes>
              FROM <Dataset Name>
              WHERE <Expression>;

Data aggregation service:

SELECT <attribute list>, <AGG_name(Dataset Name)>
FROM TempDataset
GROUP BY <group-by attribute list>;

Generate Data Aggregation Service
1. Aggregate function analysis
- Projection push-down extracts only the data needed for a particular query and its aggregation:

TempDataset = SELECT <useful attributes>
              FROM <Dataset Name>
              WHERE <Expression>;

- For the IPARS application, only 7 of the 22 attributes are needed for the considered query; the volume of data to be retrieved and communicated is reduced by 66%.
- For the TITAN application, 5 of the 8 attributes are needed, a reduction of 38%.

Generate Data Aggregation Service
2. Aggregate function decomposition
- The first step performs the computations applied to each tuple; the second step updates the aggregate status variable.
- Replace the largest expression with TempAttr; for IPARS, this further reduces the number of attributes from 7 to 4.

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE WHEN $2.TempAttr THEN $1 & 1 ELSE 0 END;
' LANGUAGE SQL;

Generate Data Aggregation Service
- Partition the subset of interest on the values of the group-by attributes when additional client nodes are available as computing units.
- Construct a hash table using the values of the group-by attributes as the hash key, and translate the SQL-3 aggregate function into imperative C/C++ code.
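As an illustration of the generated service, the sketch below mirrors the decomposed ipars_bypass_sum in Python (the real system emits imperative C/C++); the dict-based tuple layout and attribute names are assumptions:

import math

def ipars_bypass_aggregate(tuples):
    table = {}  # hash table: (X, Y, Z) group-by key -> aggregate status
    for t in tuples:
        key = (t["X"], t["Y"], t["Z"])
        status = table.get(key, 1)  # INITCOND = '1'
        # Step 1: per-tuple computation (the decomposed TempAttr).
        speed = math.sqrt(t["OILX"] ** 2 + t["OILY"] ** 2 + t["OILZ"] ** 2)
        temp_attr = t["SOIL"] > 0.7 and speed < 30.0
        # Step 2: update the aggregate status variable (bitwise AND).
        table[key] = (status & 1) if temp_attr else 0
    # HAVING ipars_bypass_sum(...) > 0 keeps only groups that stayed non-zero.
    return {k: v for k, v in table.items() if v > 0}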

Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
- Scalability test varying the number of nodes hosting the data and performing the computations
- Performance test as the amount of data to be processed increases
- Comparison with hand-written code
- The impact of the aggregation decomposition

Experimental Results for IPARS
Scaling the number of nodes:
- Extract a subset of interest of 640 MB by scanning 1.9 GB of data.
- The performance difference varies between 6% and 20%, with an average difference of 14%; aggregate decomposition reduces the difference to between 1% and 10%.
Scaling the dataset volume:
- Using 8 data source nodes and 8 client nodes, the execution time stays proportional to the amount of data to be retrieved and processed.

Experimental Results for TITAN
Scaling the number of nodes:
- Extract a subset of interest of 228 MB by scanning 456 MB of data.
- The performance difference is 17%; aggregate decomposition reduces it to 6%.
Scaling the dataset volume:
- Using 8 data source nodes and 8 client nodes, the execution time stays proportional to the amount of data to be retrieved and processed.

Outline (current section: Replica selection module)

Problem
- The requirements of efficient access and high-performance processing
- The challenge of various query types and an increasing number of clients
- Harnessing an optimization technique: partial replication

Our Approach: Using Partial Replicas
- How to assemble queried data efficiently from replicas and the original dataset:
  - Computing a goodness value
  - A replica selection algorithm comprising a greedy strategy and one extension
- The replica selection module is coupled tightly with our prior work on supporting SQL SELECT queries over scientific datasets in a cluster environment.

Partial Replicas Considered
- A replica information file describes the replicas created by users.
- Space-partitioned partial replicas: contain all data attributes of a hot portion of the original dataset.
- Hot range: a group of representative queries identifies the portions of the dataset to be replicated.
- Chunking: flexible chunk shapes and sizes are allowed; they affect the data read cost.
- Dimension order: chunks can be laid out following different dimension sequences; this affects the data seek cost.

Computing the Goodness Value

goodness = useful_data_per_chunk / cost_per_chunk

- A chunk is an atomic or logical unit; a partial replica consists of full chunks and partial chunks.
- cost_per_chunk = t_read * n_read + t_seek
  - t_read: average read time for a disk page
  - n_read: number of pages fetched
  - t_seek: average seek time
- A fragment is an intermediate unit between a replica and its chunks: a group of full or partial chunks in one replica having the same goodness value.

goodness_per_fragment = useful_data_per_fragment / cost_per_fragment
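A minimal Python sketch of this computation, under the assumption that a fragment is given as a list of (useful_bytes, pages_fetched) pairs:

def chunk_cost(pages_fetched, t_read, t_seek):
    # cost_per_chunk = t_read * n_read + t_seek
    return t_read * pages_fetched + t_seek

def fragment_goodness(chunks, t_read, t_seek):
    # chunks: [(useful_bytes, pages_fetched), ...] for one fragment
    useful = sum(u for u, _ in chunks)
    cost = sum(chunk_cost(p, t_read, t_seek) for _, p in chunks)
    return useful / cost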

An Example: Query and Intersecting Replicas
(Figure: a query region intersecting two partial replicas)
- Replica 1: 3 full chunks and 2 partial chunks, grouped into 3 fragments
- Composite Replica 2: 10 full chunks, forming 1 fragment

Greedy Strategy
- Q: an issued query; R: the partial replicas; D: the original dataset
- F: all fragments intersecting the query boundary
- Fmax: the fragment with the maximum goodness value in F
- S: the ordered list of candidate fragments, in decreasing order of goodness value

Calculate the fragment set F
while F is not empty:
  Append Fmax to S; remove Fmax from F
  if an overlap with Fmax exists in F:
    Subtract the overlap and re-compute the goodness values
Add D if needed
Output S

The runtime complexity is O(m^2), where m is the number of fragments intersecting the query boundary.
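To make the flowchart concrete, here is a hedged Python sketch of the greedy loop; the Fragment interface (goodness(), overlaps(), subtract()) is a hypothetical simplification, not the system's actual API:

def greedy_select(fragments, original_dataset):
    S = []                           # ordered candidate list
    F = list(fragments)              # fragments intersecting the query
    while F:
        f_max = max(F, key=lambda f: f.goodness())
        S.append(f_max)
        F.remove(f_max)
        for f in F:
            if f.overlaps(f_max):
                f.subtract(f_max)    # drop the overlap; goodness is
                                     # re-computed on the next goodness() call
    S.append(original_dataset)       # add D for any range no replica covers
    return S                         # O(m^2) for m fragments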

One Extension
- S: the ordered list of candidate fragments, in decreasing order of goodness value
- Fi: a fragment in S; C: a chunk in Fi
- r: the union of the ranges covered by the filtered areas of the other fragments

if redundant I/O exists:
  foreach Fi ∈ S, from the head of S:
    foreach chunk C in Fi:
      if C ∈ r:
        Drop C from Fi
        Modify the other fragments in S to retrieve C
Output the recommended fragments

The runtime complexity is O(n^2), where n is the number of chunks intersecting the query boundary.

Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
- Scalability test with increasing data size
- Performance test as the number of nodes hosting the dataset is varied
- Demonstrating the robustness of the proposed algorithm

12 GB IPARS data

Query #1:

SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11
  and Y>=0 and Y<=28 and Z>=0 and Z<=28;

Using replica Set #1 from the previous table:
- Our algorithm chose replicas {0,1,2,4} out of the 6 replicas in Set #1.
- The query filters 83% of the retrieved data when using the original dataset only, but needs to filter only about 25% of the retrieved data in the presence of the Set #1 replicas.

Query #3:

SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=15
  and Y>=0 and Y<=63 and Z>=0 and Z<=63;

Using replica Set #1 from the previous table:
- Our algorithm extension detects the redundant I/O in the candidate replicas for this query; the final recommendation is to avoid using the replicas.

Replica Selection Module: Two Extensions
- Combined use of space- and attribute-partitioned replicas
  - Design a dynamic programming algorithm for selecting the best set of attribute-partitioned replicas.
  - Implement a new greedy strategy, based on a new cost model, for recommending a combination of replicas.
- Uneven partitioning and processing of aggregations
  - Queries of the form:

SELECT <Attributes> (or SELECT Aggregate(<Attributes>))
FROM Dataset
WHERE <Predicate Expression>

  - Extend the cost model and the replica selection algorithm to address queries with aggregations while replicas are unevenly stored across storage units.

Extension I – General Structure of our Algorithm

Dynamic Programming Algorithm
- R: a group of attribute-partitioned replicas; R': the optimal combination (output)
- l: the number of attributes referred to in Q; M[1..l]: the referred attribute list

foreach j from 1 to l:
  Calculate Cost(j,j)
foreach interval length k from 2 to l:
  foreach u from 1 to l-k+1:
    v = u+k-1
    if some replica r1 contains exactly M[u..v]:
      Calculate Cost(u,v) from r1; Loc(u,v).s = -1; Loc(u,v).r = r1
    else if some replica r2 contains M[u..v]:
      Calculate Cost(u,v) from r2; Loc(u,v).s = -1; Loc(u,v).r = r2
    else:
      Cost(u,v) = ∞
    Find q_min = min over p of Cost(u,p) + Cost(p+1,v)
    if q_min < Cost(u,v):
      Cost(u,v) = q_min; Loc(u,v).s = p; Loc(u,v).r = -1
Output R' by tracing Loc(1,l)

The runtime complexity is O(l^3).
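A hedged Python sketch of this interval dynamic program; Replica.covers() and replica_cost() are assumed interfaces, and attributes are indexed 1..l:

import math

def select_attribute_replicas(l, replicas, replica_cost):
    INF = math.inf
    cost = [[INF] * (l + 1) for _ in range(l + 1)]
    choice = [[None] * (l + 1) for _ in range(l + 1)]
    # Direct covers: a single replica serving the interval M[u..v].
    for u in range(1, l + 1):
        for v in range(u, l + 1):
            for r in replicas:
                if r.covers(u, v):
                    c = replica_cost(r, u, v)
                    if c < cost[u][v]:
                        cost[u][v], choice[u][v] = c, ("replica", r)
    # Combine sub-intervals: split M[u..v] at attribute p.
    for k in range(2, l + 1):
        for u in range(1, l - k + 2):
            v = u + k - 1
            for p in range(u, v):
                c = cost[u][p] + cost[p + 1][v]
                if c < cost[u][v]:
                    cost[u][v], choice[u][v] = c, ("split", p)
    return cost[1][l], choice        # the split loops alone are O(l^3)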

Extension II: Uneven Partitioning and Aggregation Operations

Computing the goodness value:

Goodness(F) = (Σ_{p∈P} cost_p(CurLoad) + Σ_{p∈P} data_p(F)) / max_{p∈P} (cost_p(CurLoad) + cost_p(F))

- P: all available storage nodes
- CurLoad: the current workload across all storage nodes due to previously chosen candidate replicas

cost_fragment = t_read*n_read + t_seek*n_seek + t_filter*n_filter + t_agg*n_agg + t_trans*n_trans

- t_filter: average filtering time for a tuple; n_filter: total number of tuples in all chunks
- t_agg: average aggregate computation time for a tuple; n_agg: total number of useful tuples
- t_trans: network transfer time for one unit of data; n_trans: the amount of data after the aggregate operation
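A small sketch of this goodness formula; cur_load, frag_cost, and frag_data are assumed inputs, not the module's actual interface:

def workload_goodness(fragment, nodes, cur_load, frag_cost, frag_data):
    # Numerator: work already placed plus this fragment's useful data.
    total = sum(cur_load[p] for p in nodes) + \
            sum(frag_data(fragment, p) for p in nodes)
    # Denominator: the bottleneck node after adding this fragment.
    bottleneck = max(cur_load[p] + frag_cost(fragment, p) for p in nodes)
    return total / bottleneck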

Workload-Aware Greedy Strategy
- Q: an issued query; D: the original dataset
- F: the interesting fragment set, i.e., all fragments intersecting the query boundary
- Fmax: the fragment with the maximum goodness value in F
- S: the ordered list of candidate fragments, in decreasing order of goodness value

foreach Fi in F:
  if no overlap with F - {Fi} exists:
    Append Fi to S; remove Fi from F
while F is not empty:
  Calculate the current goodness value for each Fi in F
  Append Fmax to S; remove Fmax from F
  if an overlap with Fmax exists in F:
    Subtract the overlap
Add D if needed
Output S

Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
- Performance evaluation of the combination of space-partitioned and attribute-partitioned replicas, and of the benefit of attribute-partitioned replicas
- Scalability test with an increasing number of nodes hosting the dataset
- Performance test with varying query sizes
- Performance evaluation of aggregate queries with unevenly partitioned replicas

120 GB IPARS data

Query:

SELECT attrlist from IPARS where RID in [0,1] and TIME in [1000,1399]
  and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;

- attr+space part: the combined use of all replicas
- space part: only the space-partitioned replicas are used
(A run-time optimization)

Aggregate Queries with Unevenly Partitioned Replicas: 5 GB TITAN data

Aggregate Queries with Unevenly Partitioned Replicas: 120 GB IPARS data

- Alg: solution by the proposed algorithm
- Alg+Ref: solution after the refinement step
- Solution-1 & Solution-2: two manually created solutions

Outline (current section: Efficient execution of multiple queries on scientific datasets)

Motivating Application: Mouse Placenta Data Analysis
- Analyzing digital microscopic images and studying phenotype change.
- Querying an irregular polygonal region: five adjacent query regions approximate the boundary of the mouse placenta.
- Two overlapping regions are of interest due to the density of red cells.
- 160 GB of data in total.

Problem
- Characteristics of scientific datasets and applications:
  - Large, distributed, multidimensional data
  - A large number of I/O retrieval operations
- Two scenarios of interest:
  - An irregular sub-region of a multi-dimensional data space
  - Multiple different exploratory queries over overlapping regions

Our Approach
- Build on our previous work on performance optimization using partial replicas.
- Propose a cost model incorporating the effect of data locality.
- Design a greedy algorithm using the cost model.
- Implement three important sub-procedures for generating execution plans.

Computing the Goodness Value
Exploit two sources of chunk reuse across different queries: temporal locality and spatial locality.

goodness_per_chunk = useful_data_per_chunk / cost_per_chunk
cost_chunk = t_read*n_read + t_seek + t_filter*n_filter + t_split*n_split

- t_split: average comparison time for judging which query range a tuple belongs to
- n_split: the number of useful tuples if the chunk exhibits locality, or 0 if it does not
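A minimal sketch of this cost model in Python, with the t_*/n_* terms passed in as parameters (an illustrative assumption):

def multiquery_chunk_cost(pages, tuples_in_chunk, useful_tuples,
                          has_locality, t_read, t_seek, t_filter, t_split):
    cost = t_read * pages + t_seek + t_filter * tuples_in_chunk
    if has_locality:
        # A chunk shared by several query ranges pays a split cost: each
        # useful tuple is compared against the ranges to route it to the
        # right query's result.
        cost += t_split * useful_tuples
    return cost

def chunk_goodness(useful_bytes, cost):
    return useful_bytes / cost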

One Example: Using Partial Replicas to Answer Multiple Queries {Q1, Q2, Q3}
- 4 chunks show temporal locality; 2 chunks show spatial locality.
- The queries are coalesced and aggregated into a global query space.

Generating Execution Plans
- Q: multiple queries; R: the partial replica set; D: the original dataset
- F: the interesting fragment set; F': the interesting fragments with calculated goodness values (output)

Detecting interesting fragments:

Calculate the global query range for the multiple queries
Find the interesting fragment set F for the global query range
foreach Fi in F:
  Identify whether Fi has locality
  Tuple(Fi) = 0; Cost(Fi) = 0
  foreach chunk C in Fi:
    Factor in the cost of the split operation
    Tuple(Fi) = Tuple(Fi) + Tuple(C)
    Cost(Fi) = Cost(Fi) + Cost(C)
  Goodness(Fi) = Tuple(Fi) / Cost(Fi)
Output F'

Further sub-procedures:
- Divide the single output list of candidate fragments into multiple lists, one per query.
- Generate and index memory-stored replicas, avoiding the buffering of duplicate data attributes.

Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has two AMD Opteron 2411 MHz CPUs, 8 GB main memory, and two 250 GB SATA disks.
- Performance improvement using the proposed approach
- Scalability test with an increasing number of nodes hosting the dataset
- Performance test with varying query sizes

160 GB mouse placenta data


- Scalability is affected by unbalanced filtering and splitting operations.
- As the number of nodes increases, the seek operation starts to dominate the I/O cost.

Related Research
- Data virtualization through data services
  - Databases
  - Data description on the Grid: HDF5, DFDL, BinX & BFD
  - Manual implementations over low-level datasets
- Supporting aggregations for data analysis
  - Parallelization of SQL-based aggregations and reductions
  - Reduction research in parallelizing compilers
- Performance optimization on large datasets
  - Replication research: availability and reliability; file-level and dataset-level replication and replica management; exact replicated copies located near or at the best clients
  - Indexing techniques
  - Caching techniques
  - Query transformation for finding common sub-expressions at the inter- and intra-query level
  - Memory aggregation and cooperative cache management to speed up query execution

Conclusions
- We designed and implemented an automatic data virtualization system and a replica selection module that provide a lightweight layer over large, distributed scientific datasets.
- The complexity of manipulating scientific datasets and processing them efficiently is shifted from users to the underlying system.
- Our experimental results demonstrate the efficiency of the system, the performance improvement from partial replication, and good scalability under parallel configurations.

Automatic Virtualization Using Meta-data
An aligned file chunk has the form:

{num_rows, {File_1, Offset_1, Num_Bytes_1}, {File_2, Offset_2, Num_Bytes_2}, ..., {File_m, Offset_m, Num_Bytes_m}}

- Our tool parses the meta-data descriptor and generates the function code.
- At run time, the query provides parameters that invoke the generated functions to create the aligned file chunks.
(Figure: a dataset tree with root "Dataset", children dataset 1-3, and leaves Data1-Data6.)
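A hedged sketch of this record as a Python dataclass; the class and its read() helper are illustrative, not the generated code itself:

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedFileChunk:
    num_rows: int
    extents: List[Tuple[str, int, int]]  # (file, offset, num_bytes) per file

    def read(self) -> List[bytes]:
        # Fetch every extent; together they hold the same num_rows tuples,
        # each file contributing a different subset of the attributes.
        parts = []
        for path, offset, num_bytes in self.extents:
            with open(path, "rb") as f:
                f.seek(offset)
                parts.append(f.read(num_bytes))
        return parts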

Comparison with Hand-Written Codes
- Oil reservoir management dataset stored on 16 nodes: the performance difference is within 17%, with an average difference of 14%.
- Satellite data processing dataset stored on a single node: the performance difference is within 4%.

Comparison with an Existing Database (PostgreSQL)

No.  Query
1    SELECT * FROM TITAN;
2    SELECT * FROM TITAN WHERE X>=0 AND X<=10000 AND Y>=0 AND Y<=10000 AND Z>=0 AND Z<=100;
3    SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000;
4    SELECT * FROM TITAN WHERE S1 < 0.01;
5    SELECT * FROM TITAN WHERE S1 < 0.5;

- 6 GB of satellite data; the total storage required after loading the data into PostgreSQL is 18 GB.
- Indexes were created on both the spatial coordinates and S1 in PostgreSQL.
- No special performance tuning was applied for the experiment.

Query #2:

SELECT * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11
  and Y>=0 and Y<=31 and Z>=0 and Z<=31;

Using replica Set #1 from the previous table:
- Our algorithm chose replicas {0,1,2,4} out of the 6 replicas in Set #1.
- Up to 4 nodes, query execution time scales linearly; with 8 nodes, the execution time is not halved because the seek cost dominates the total I/O overhead.

Query #4:

SELECT * from IPARS where TIME>=1000 and TIME<=1199;

                          Original   Set #2   Set #3
Execution time (seconds)  101.2      120.4    80.8
Data processed (MB)       1206       1210
Number of seeks           217        1012     23

An accurate cost model must take into account both the seek cost and the read cost.


Query:

SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11
  and Y>=0 and Y<=28 and Z>=0 and Z<=28;

- Our algorithm chose replicas {1,3,4,6} out of all the replicas in Table #1.
- The query filters 83% of the retrieved data when using the original dataset only, but needs to filter only about 50% of the retrieved data in the presence of replicas.

More data is retrieved from disk, and correspondingly more filtering operations are needed, in Plan (c) than in Plan (a).