Automatic and Efficient Data Virtualization System on Scientific Datasets Li Weng
Outline
- Introduction: Motivation, Contributions, Overall system framework
- System design, algorithms, and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis in the data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work on partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions
2/4/2019
Data Grids
- Datasets
  - Large volume: gigabytes, terabytes, petabytes
  - Distributed datasets, generated/collected by scientific simulations or instruments
  - Multi-dimensional datasets: dimension attributes, measure attributes
- Data-intensive applications
  - Data specification, data organization, data extraction, data movement, data analysis, data visualization
Motivating Applications
Data-driven applications from science, engineering, and biomedicine:
- Oil reservoir management
- Water contamination studies
- Cancer studies using MRI
- Telepathology with digitized slides
- Satellite data processing
- Virtual microscope (digitized microscopy image analysis)
- ...
Two Challenges
Given large dataset sizes, the geographic distribution of users and resources, and complex analysis, we concentrated on two critical challenges:
- Low-level and specialized data formats
- Various query types and an increasing number of clients
Contributions
- Data virtualization system
  - Realizing data virtualization through automatically generated data services (HPDC 2004)
  - Supporting complex data analysis by SQL-3 queries and aggregations (LCPC 2004)
- Replica selection module
  - Designing new techniques for efficient execution of data analysis queries using partial replicas (CCGRID 2005)
  - Generalizing the functionality of the replica selection module along two significant extensions (ICPP 2006)
- Efficient execution of multiple queries
  - Exploring the performance optimization potential of multiple queries (under submission)
Automatic Data Virtualization System (HPDC2004)

Query template:
SELECT <Data Elements>
FROM <Dataset Name>
WHERE <Expression> AND Filter(<Data Element>);

Example (IPARS):
SELECT *
FROM IPARS
WHERE REL in (0,6,26,27) AND TIME>1000 AND TIME<1100
  AND SOIL>0.7 AND SPEED(OILVX,OILVY,OILVZ)<30.0;
Data Analysis in Data Virtualization System (LCPC2004)
Replica Selection Module (CCGRID2005, ICPP2006)
Outline
- Introduction: Motivation, Contributions, Overall system framework
- System design, algorithms, and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis in the data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work on partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions
Automatic Data Virtualization System
- Provide an abstract view of data: dataset -> data virtualization -> data service
- Design a meta-data descriptor
- Perform automatic data virtualization using the meta-data descriptor
Design a Meta-data Description Language
- Dataset schema description component
- Dataset storage description component
- Dataset layout description component
An Example: Oil Reservoir Management
- The dataset comprises several realizations of a simulation on the same grid.
- For each realization and each grid point, a number of attributes are stored.
- The dataset is stored on a 4-node cluster.

Component I: Dataset Schema Description
[IPARS]              // {* dataset schema name *}
REL  = short int     // {* data type definitions *}
TIME = int
X    = float
Y    = float
Z    = float
SOIL = float
SGAS = float

Component II: Dataset Storage Description
[IparsData]                  // {* dataset name *}
DatasetDescription = IPARS   // {* dataset schema for IparsData *}
DIR[0] = osu0/ipars
DIR[1] = osu1/ipars
DIR[2] = osu2/ipars
DIR[3] = osu3/ipars
An Example: Oil Reservoir Management (continued)
- The LOOP keyword captures the repetitive structure within a file.
- The grid has 4 partitions (0-3).
- "IparsData" comprises "ipars1" and "ipars2": "ipars1" describes the data files storing the spatial coordinates; "ipars2" specifies the data files storing the other attributes.

Component III: Dataset Layout Description
DATASET "IparsData" {            //{* name for dataset *}
  DATATYPE { IPARS }             //{* schema for dataset *}
  DATAINDEX { REL TIME }
  DATA { DATASET ipars1 DATASET ipars2 }

  DATASET "ipars1" {
    DATASPACE {
      LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { X Y Z }
    }
    DATA { $DIR[$DIRID]/COORDS $DIRID = 0:3:1 }
  } //{* end of DATASET "ipars1" *}

  DATASET "ipars2" {
    DATASPACE {
      LOOP TIME 1:500:1 {
        LOOP GRID ($DIRID*100+1):(($DIRID+1)*100):1 { SOIL SGAS }
      }
    }
    DATA { $DIR[$DIRID]/DATA$REL $REL = 0:3:1 $DIRID = 0:3:1 }
  } //{* end of DATASET "ipars2" *}
}
Compiler Analysis
The compiler parses the meta-data descriptor and generates index and extraction function code:

Data_Extract {
  Find_File_Groups()
  Process_File_Groups()
}

Find_File_Groups {
  Let S be the set of files that match against the query
  Classify files in S by the set of attributes they have
  Let S1, ..., Sm be the m sets
  T = Ø
  foreach {s1, ..., sm}, si ∈ Si {   //{* cartesian product between S1, ..., Sm *}
    if the values of implicit attributes are not inconsistent
      T = T ∪ {s1, ..., sm}
  }
  Output T
}

Process_File_Groups {
  foreach {s1, ..., sm} ∈ T {
    Find_Aligned_File_Chunks()
    Supply implicit attributes for each file chunk
    foreach aligned file chunk {
      Check against index
      Compute offset and length
      Output the aligned file chunk
    }
  }
}
Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933MHz CPU, 512 MB main memory, and three 100GB IDE disks.
- Four sets of experiments:
  - Ability test
  - Scalability test
  - Comparison with hand-written codes
  - Comparison with an existing database (PostgreSQL)
Test the Ability of Our Code Generation Tool
- Layout 0: original layout from the application collaborators
- Layout 1: all data stored as a table in a single file
- Layout 2: all data in a single file, with each attribute stored as an array
- Layout 3: layout 1 split into multiple files based on the value of the time step
- Layout 4: like layout 3, but each attribute stored as an array in each data file
- Layout 5: data stored in 7 files, where the first file holds the spatial coordinates and the other attributes are divided into 6 files
- Layout 6: like layout 5, but each attribute stored as an array in each data file
Test the Ability of Our Code Generation Tool
- Oil reservoir management (12GB)
- The performance difference relative to Layout 0 is within 4%-10%.
- The tool correctly and efficiently handles a variety of different layouts for the same data.
Evaluate the Scalability of Our Tool
- Scale the number of nodes hosting the oil reservoir management dataset.
- Extract a subset of interest of size 1.3GB.
- The execution times scale almost linearly.
- The performance difference varies between 5% and 34%, with an average difference of 16%.
Data Analysis in Data Virtualization System
- Express the query and the data analysis declaratively on a virtual relational table view.
- Generate and optimize a data aggregation service for the desired processing.
- Employ a data partitioning strategy in parallel and distributed configurations.
Oil Reservoir Management (IPARS)

SELECT X, Y, Z, ipars_bypass_sum(IPARS)
FROM IPARS
WHERE REL in (0,5,10) AND TIME >= 1000 AND TIME <= 1200
GROUP BY X, Y, Z
HAVING ipars_bypass_sum(OIL) > 0;

CREATE AGGREGATE ipars_bypass_sum (
  BASETYPE = IPARS,
  SFUNC    = ipars_func,
  STYPE    = int,
  INITCOND = '1'
);

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE
    WHEN $2.soil > 0.7 AND
         |/($2.oilx * $2.oilx + $2.oily * $2.oily + $2.oilz * $2.oilz) < 30.0
    THEN $1 & 1
    ELSE 0
  END;
' LANGUAGE SQL;
Compiler Analysis and Code Generation
Transform the canonical query into two pipelined sub-queries:

Data extraction service:
TempDataset = SELECT <all attributes>
              FROM <Dataset Name>
              WHERE <Expression>;

Data aggregation service:
SELECT <attribute list>, <AGG_name(Dataset Name)>
FROM TempDataset
GROUP BY <group-by attribute list>;
Generate Data Aggregation Service
1. Aggregate function analysis
- Projection push-down extracts only the data needed for a particular query and its aggregation:
  TempDataset = SELECT <useful attributes>
                FROM <Dataset Name>
                WHERE <Expression>;
- For the IPARS application, only 7 of the 22 attributes are actually needed for the considered query; the volume of data retrieved and communicated is reduced by 66%.
- For the TITAN application, 5 of the 8 attributes are needed, and the reduction is 38%.
Generate Data Aggregation Service
2. Aggregate function decomposition
- The first step applies the computations to each tuple; the second step updates the aggregate status variable.
- Replace the largest expression with TempAttr:

CREATE FUNCTION ipars_func(int, IPARS) RETURNS int AS '
  SELECT CASE WHEN $2.TempAttr THEN $1 & 1 ELSE 0 END;
' LANGUAGE SQL;

- For IPARS, the number of attributes is reduced further from 7 to 4.
Generate Data Aggregation Service
- Partition the subset of interest based on the values of the group-by attributes if more client nodes are provided as computing units.
- Construct a hash table using the values of the group-by attributes as the hash key, and translate the SQL-3 aggregate function into imperative C/C++ code.
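The system generates imperative C/C++ for the aggregation; the hash-table idea can be sketched in Python. The rows and the bypass condition below are simplified, hypothetical stand-ins for the IPARS tuples and ipars_func:

```python
def hash_group_aggregate(tuples, key_of, step, init):
    """Group-by via a hash table keyed on the group-by attributes;
    `step` is the imperative form of the SQL-3 state function."""
    table = {}
    for t in tuples:
        k = key_of(t)
        table[k] = step(table.get(k, init), t)
    return table

# Hypothetical IPARS-like rows: (X, Y, Z, SOIL).
rows = [(0, 0, 0, 0.8), (0, 0, 0, 0.5), (1, 0, 0, 0.9)]
# The bypass flag stays 1 only while every tuple at the grid point passes.
step = lambda acc, t: acc & (1 if t[3] > 0.7 else 0)
out = hash_group_aggregate(rows, lambda t: t[:3], step, 1)
print(out)  # {(0, 0, 0): 0, (1, 0, 0): 1}
```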
Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933MHz CPU, 512 MB main memory, and three 100GB IDE disks.
- Scalability test when varying the number of nodes hosting data and performing the computations
- Performance test when the amount of data to be processed is increased
- Comparison with hand-written code
- The impact of the aggregation decomposition
Experimental Results for IPARS
Scaling the number of nodes:
- Extract a subset of interest of size 640MB by scanning 1.9GB of data.
- The performance difference varies between 6% and 20%, with an average difference of 14%.
- The aggregate decomposition reduces the difference to between 1% and 10%.
Scaling the volume of data:
- Use 8 data source nodes and 8 client nodes.
- The execution time stays proportional to the amount of data to be retrieved and processed.
Experimental Results for TITAN
Scaling the number of nodes:
- Extract a subset of interest of size 228MB by scanning 456MB of data.
- The performance difference is 17%; the aggregate decomposition reduces it to 6%.
Scaling the volume of data:
- Use 8 data source nodes and 8 client nodes.
- The execution time stays proportional to the amount of data to be retrieved and processed.
Outline
- Introduction: Motivation, Contributions, Overall system framework
- System design, algorithms, and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis in the data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work on partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions
Problem
- The requirements of efficient access and high-performance processing
- The challenge of various query types and an increasing number of clients
- Harnessing an optimization technique: partial replication
Our Approach: Using Partial Replicas
- How to assemble the queried data efficiently from replicas and the original dataset:
  - Compute a goodness value for each candidate.
  - Apply a replica selection algorithm comprising a greedy strategy and one extension.
- The replica selection module is coupled tightly with our prior work on supporting SQL SELECT queries on scientific datasets in a cluster environment.
Partial Replicas Considered
- A replica information file describes the replicas created by users.
- Space-partitioned partial replicas contain all data attributes of a hot portion of the original dataset.
- Hot range: use a group of representative queries to identify the portions of the dataset to be replicated.
- Chunking: allow flexible chunk shapes and sizes; affects the data read cost.
- Dimension order: lay out chunks following different dimension sequences; affects the data seek cost.
Computing Goodness Value
goodness_per-chunk = useful-data_per-chunk / cost_per-chunk
- Chunk: an atomic or logical unit; a partial replica has full chunks and partial chunks.
- cost_per-chunk = t_read * n_read + t_seek
  - t_read: average read time for a disk page
  - n_read: number of pages fetched
  - t_seek: average seek time
- Fragment: an intermediate unit between a replica and its chunks; a group of full or partial chunks in a replica having the same goodness value.
  goodness_per-fragment = useful-data_per-fragment / cost_per-fragment
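As a numeric illustration of the goodness formula (the timing constants below are hypothetical, not measured values from this work):

```python
T_READ, T_SEEK = 0.0002, 0.009  # hypothetical per-page read time and seek time (s)

def chunk_goodness(useful_bytes, pages_fetched):
    """goodness_per-chunk = useful data / (t_read * n_read + t_seek)."""
    return useful_bytes / (T_READ * pages_fetched + T_SEEK)

def fragment_goodness(chunks):
    """A fragment groups chunks with equal goodness; its goodness is
    total useful data over total cost."""
    useful = sum(u for u, p in chunks)
    cost = sum(T_READ * p + T_SEEK for u, p in chunks)
    return useful / cost

full, partial = (4096, 1), (1024, 1)  # (useful bytes, pages fetched)
print(chunk_goodness(*full) > chunk_goodness(*partial))  # True: full chunks rank higher
```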
An Example: Query and Intersecting Replicas
- One replica intersects the query with 3 full chunks and 2 partial chunks, forming 3 fragments.
- Composite replica 2 intersects with 10 full chunks, forming 1 fragment.
Greedy Strategy
Input: Q, R, D
- Q: an issued query
- R: the partial replicas
- D: the original dataset
- F: all fragments intersecting the query boundary
- Fmax: the fragment with the maximum goodness value in F
- S: the ordered list of candidate fragments, in decreasing order of goodness value

1. Calculate the fragment set F.
2. While F is not empty:
   a. Append Fmax to S and remove Fmax from F.
   b. If an overlap with Fmax exists in F, subtract the overlap and re-compute the goodness values.
3. Add D if needed, and output S.

The runtime complexity is O(m^2), where m is the number of fragments intersecting the query boundary.
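A one-dimensional, integer-cell sketch of the greedy strategy; real fragments are multi-dimensional chunk groups, and the fragments and costs below are invented for illustration:

```python
def greedy_select(fragments, query_cells):
    """Pick the fragment with the highest goodness (uncovered useful
    cells / cost), subtract the overlap from the rest, and repeat
    until no fragment contributes anything useful."""
    remaining = dict(fragments)  # name -> (cells, cost)
    covered, order = set(), []
    while remaining:
        def goodness(item):
            cells, cost = item[1]
            return len((cells & query_cells) - covered) / cost
        best = max(remaining.items(), key=goodness)
        if goodness(best) == 0:
            break  # the original dataset covers whatever is left
        name, (cells, cost) = best
        order.append(name)
        covered |= cells & query_cells  # overlap is subtracted implicitly
        del remaining[name]
    return order, covered

query = set(range(0, 10))
frags = {"r1": (set(range(0, 6)), 2.0),   # hypothetical fragments
         "r2": (set(range(4, 10)), 3.0),
         "r3": (set(range(0, 3)), 1.0)}
order, covered = greedy_select(frags.items(), query)
print(order)  # ['r1', 'r2']; r3 becomes useless once r1 is chosen
```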
One Extension
Input: S
- S: the ordered list of candidate fragments, in decreasing order of goodness value
- Fi: a fragment in S
- C: a chunk in Fi
- r: the union range contained by the filtered areas of other fragments

If redundant I/O exists:
  For each Fi ∈ S, from the head of S:
    For each chunk C in Fi:
      If C ∈ r, drop C from Fi and modify the other fragments in S to retrieve C.
Output the recommended fragments.

The runtime complexity is O(n^2), where n is the number of chunks intersecting the query boundary.
Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933MHz CPU, 512 MB main memory, and three 100GB IDE disks.
- Scalability test when increasing the data size
- Performance test when the number of nodes hosting the dataset is varied
- Demonstrating the robustness of the proposed algorithm
12GB IPARS data
Query #1:
SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11
  and Y>=0 and Y<=28 and Z>=0 and Z<=28;
(Replica Set #1 in the previous table)
- Our algorithm has chosen replicas {0,1,2,4} out of the 6 replicas in Set #1.
- The query filters 83% of the retrieved data when using the original dataset only; it needs to filter only about 25% of the retrieved data in the presence of the Set #1 replicas.
Query #3:
SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=15
  and Y>=0 and Y<=63 and Z>=0 and Z<=63;
(Replica Set #1 in the previous table)
- Our algorithm extension detects the redundant I/O in the candidate replicas for this query.
- The final recommendation is to avoid using the replicas.
Replica Selection Module – Two Extensions
- Combined use of space- and attribute-partitioned replicas
  - Design a dynamic programming algorithm for selecting the best set of attribute-partitioned replicas.
  - Implement a new greedy strategy, based on a new cost model, for recommending a combination of replicas.
- Uneven partitioning and processing of aggregations
  SELECT <Attributes> (or SELECT Aggregate(<Attributes>))
  FROM Dataset
  WHERE <Predicate Expression>
  - Extend the cost model and replica selection algorithm to address queries with aggregations while replicas are unevenly stored across storage units.
Extension I – General Structure of our Algorithm
Dynamic Programming Algorithm
Input: R
- R: a group of attribute-partitioned replicas
- R': the optimal combination (output)
- l: the number of referenced attributes in Q
- M[1..l]: the referenced attribute list

1. Calculate the cost Cost[j,j] for each single attribute.
2. For each interval length k from 2 to l, and each u from 1 to l-k+1 (v = u+k-1):
   - If a replica r1 contains exactly M[u..v], set Cost[u..v] from r1 (Loc[u..v].s = -1, Loc[u..v].r = r1).
   - Else if a replica r2 contains M[u..v] among other attributes, set Cost[u..v] from r2 (Loc[u..v].s = -1, Loc[u..v].r = r2).
   - Otherwise set Cost[u..v] = ∞; then find the split point p minimizing Cost[u..p] + Cost[p+1..v] and, if that is smaller, record Cost[u..v] = Cost[u..p] + Cost[p+1..v], Loc[u..v].s = p, Loc[u..v].r = -1.
3. Output R' by tracing back from Loc[1..l].

The runtime complexity is O(l^3).
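The interval dynamic programming can be sketched as follows. The replicas and their costs are hypothetical, the per-replica cost is simplified to a constant, and the real algorithm additionally records Loc entries to reconstruct the chosen combination:

```python
def best_cover(attrs, replicas, cost_of):
    """O(l^3) interval DP: cover attrs[u..v] either with a single
    replica containing all of them, or by splitting at some p."""
    l = len(attrs)
    INF = float("inf")
    C = [[INF] * l for _ in range(l)]
    for k in range(1, l + 1):                 # interval length
        for u in range(l - k + 1):
            v = u + k - 1
            need = set(attrs[u:v + 1])
            for r, have in replicas.items():  # one replica covers M[u..v]
                if need <= have:
                    C[u][v] = min(C[u][v], cost_of[r])
            for p in range(u, v):             # or split at p
                C[u][v] = min(C[u][v], C[u][p] + C[p + 1][v])
    return C[0][l - 1]

# Hypothetical attribute-partitioned replicas and retrieval costs.
replicas = {"rXYZ": {"X", "Y", "Z"}, "rSOIL": {"SOIL"},
            "rALL": {"X", "Y", "Z", "SOIL"}}
cost_of = {"rXYZ": 3, "rSOIL": 1, "rALL": 10}
print(best_cover(["X", "Y", "Z", "SOIL"], replicas, cost_of))  # 4
```

Splitting into rXYZ plus rSOIL (cost 4) beats reading the wide replica rALL (cost 10).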
Extension II – Uneven Partitioning and Aggregation Operations
Computing goodness value:
Goodness(F) = (Σ_{p∈P} cost_p(CurLoad) + Σ_{p∈P} data_p(F)) / max_{p∈P} (cost_p(CurLoad) + cost_p(F))
- P: all available storage nodes
- CurLoad: the current workload across all storage nodes due to previously chosen candidate replicas

cost_fragment = t_read*n_read + t_seek*n_seek + t_filter*n_filter + t_agg*n_agg + t_trans*n_trans
- t_filter: average filtering time for a tuple
- n_filter: total number of tuples in all chunks
- t_agg: average aggregate computation time for a tuple
- n_agg: total number of useful tuples
- t_trans: network transfer time for one unit of data
- n_trans: the amount of data after the aggregate operation
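A sketch of the workload-aware goodness computation; the node loads and fragment costs are invented numbers, and the fragment's data is taken as one total rather than a per-node sum:

```python
def workload_goodness(frag_data, frag_cost, cur_load):
    """Goodness(F): total work (current load plus F's useful data) over
    the maximum per-node finish time once F's per-node costs are added."""
    numerator = sum(cur_load.values()) + frag_data
    finish = max(cur_load[p] + frag_cost.get(p, 0.0) for p in cur_load)
    return numerator / finish

load = {"node_a": 5.0, "node_b": 1.0}  # hypothetical current workload
g_busy = workload_goodness(4.0, {"node_a": 2.0}, load)  # lands on the loaded node
g_idle = workload_goodness(4.0, {"node_b": 2.0}, load)  # lands on the idle node
print(g_idle > g_busy)  # True: fragments on lightly loaded nodes rank higher
```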
Workload-Aware Greedy Strategy
Input: Q, F, D
- Q: an issued query
- F: the set of fragments intersecting the query boundary
- D: the original dataset
- Fmax: the fragment with the maximum goodness value in F
- S: the ordered list of candidate fragments, in decreasing order of goodness value

1. For each Fi in F: if no overlap with F-{Fi} exists, append Fi to S directly.
2. While F is not empty:
   a. Calculate the current goodness value for each Fi in F.
   b. Append Fmax to S and remove Fmax from F.
   c. If an overlap with Fmax exists in F, subtract the overlap.
3. Add D if needed, and output S.
Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has a PIII 933MHz CPU, 512 MB main memory, and three 100GB IDE disks.
- Performance evaluation of the combination of space-partitioned and attribute-partitioned replicas, and the benefit of attribute-partitioned replicas
- Scalability test when increasing the number of nodes hosting the dataset
- Performance test when query sizes are varied
- Performance evaluation for aggregate queries with unevenly partitioned replicas
120GB IPARS data
SELECT attrlist from IPARS where RID in [0,1] and TIME in [1000,1399]
  and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;
Legend:
- attr+space part: the combined use of all replicas
- space part: only the space-partitioned replicas are used
A run-time optimization.
Aggregate Queries with Unevenly Partitioned Replicas
5GB TITAN data
Aggregate Queries with Unevenly Partitioned Replicas
120GB IPARS data
Legend:
- Alg: solution by the proposed algorithm
- Alg+Ref: solution after the refinement step
- Solution-1 & 2: two manually created solutions
Outline
- Introduction: Motivation, Contributions, Overall system framework
- System design, algorithms, and experimental evaluation
  - Automatic data virtualization system
    - Data virtualization through data services over scientific datasets
    - Data analysis in the data virtualization system
  - Replica selection module
    - Performance optimization using partial replicas
    - Generalizing the work on partial replication optimization
  - Efficient execution of multiple queries on scientific datasets
- Related research
- Conclusions
Motivating Application: Mouse Placenta Data Analysis
- Analyzing digital microscopic images and studying phenotype changes.
- Querying an irregular polygonal region: five adjacent query regions approximate the boundary of the mouse placenta.
- Two overlapping regions are of interest due to the density of red cells.
- 160GB of data in total.
Problem
Characteristics of scientific datasets and applications:
- Large volumes of distributed multi-dimensional data
- Large amounts of I/O retrieval operations
Two scenarios of interest:
- An irregular sub-region of a multi-dimensional data space
- Multiple different exploratory queries over overlapping regions
Our Approach
Building on our previous work on performance optimization using partial replicas:
- Propose a cost model incorporating the effect of data locality.
- Design a greedy algorithm using the cost model.
- Implement three important sub-procedures for generating execution plans.
Computing Goodness Value
Exploit two sources of chunk reuse across different queries:
- Temporal locality
- Spatial locality

goodness_per-chunk = useful-data_per-chunk / cost_per-chunk
cost_chunk = t_read*n_read + t_seek + t_filter*n_filter + t_split*n_split
- t_split: average comparison time for judging which query range a tuple belongs to
- n_split: the number of useful tuples if the chunk exhibits locality, or 0 if it does not
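The split term can be illustrated with a small cost comparison; all timing constants below are hypothetical:

```python
def chunk_cost(n_pages, n_tuples, n_useful, shared):
    """cost_chunk = t_read*n_read + t_seek + t_filter*n_filter + t_split*n_split.
    n_split equals the number of useful tuples only when the chunk is
    shared across queries (temporal or spatial locality), otherwise 0."""
    T_READ, T_SEEK, T_FILTER, T_SPLIT = 2e-4, 9e-3, 1e-6, 5e-7
    cost = T_READ * n_pages + T_SEEK + T_FILTER * n_tuples
    if shared:
        cost += T_SPLIT * n_useful  # route each useful tuple to its query
    return cost

# Reading a chunk shared by two queries once (paying the split pass)
# beats reading it separately for each query.
once = chunk_cost(10, 5000, 3000, shared=True)
twice = 2 * chunk_cost(10, 5000, 3000, shared=False)
print(once < twice)  # True
```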
One Example – Using Partial Replicas to Answer Multiple Queries {Q1, Q2, Q3}
- 4 chunks show temporal locality; 2 chunks show spatial locality.
- The queries are coalesced and aggregated into a global query space.
Detecting Interesting Fragments
Input: Q, R, D
- Q: the multiple queries
- R: the partial replica set
- D: the original dataset
- F: the interesting fragment set
- F': the output set of interesting fragments with calculated goodness values

1. Calculate the global query range for the multiple queries and find the interesting fragment set F for that range.
2. For each Fi in F:
   - Identify whether Fi has locality; initialize Tuple(Fi) = 0, Cost(Fi) = 0.
   - For each chunk C in Fi, factor in the cost of the split operation: Tuple(Fi) += Tuple(C), Cost(Fi) += Cost(C).
   - Goodness(Fi) = Tuple(Fi) / Cost(Fi).
3. Output F'.

Generating execution plans:
- Divide the single output list of candidate fragments into multiple lists for the respective queries.
- Generate and index memory-stored replicas; avoid buffering duplicate data attributes.
Experimental Setup & Design
- A Linux cluster connected via switched Fast Ethernet; each node has two AMD Opteron 2411MHz CPUs, 8GB main memory, and two 250GB SATA disks.
- Performance improvement using the proposed approach
- Scalability test when increasing the number of nodes hosting the dataset
- Performance test when query sizes are varied
160GB Mouse Data
The scalability is affected by unbalanced filtering and splitting operations; as the number of nodes increases, the seek operations start to dominate the I/O cost.
Related Research
- Data virtualization through data services
  - Databases
  - Data description on the Grid: HDF5, DFDL, BinX & BFD
  - Manual implementations based on low-level datasets
- Support for aggregations in data analysis
  - Parallelization of SQL-based aggregations and reductions
  - Reduction research in parallelizing compilers
- Performance optimization on large datasets
  - Replication research: availability and reliability; file-level and dataset-level replication and replica management; exact replicated copies located near or at the best clients
  - Indexing techniques
  - Caching techniques
  - Query transformation for finding common sub-expressions at the inter- and intra-query level
  - Memory aggregation and cooperative cache management to speed up query execution
Conclusions
- We have designed and implemented an automatic data virtualization system and a replica selection module that provide a lightweight layer over large distributed scientific datasets.
- The complexity of manipulating scientific datasets and processing them efficiently is shielded from users by the underlying system.
- Our experimental results demonstrate the efficiency of the system, the performance improvement from partial replication, and good scalability in parallel configurations.
Automatic Virtualization Using Meta-data
Aligned file chunks:
{num_rows, {File1, Offset1, Num_Bytes1}, {File2, Offset2, Num_Bytes2}, ..., {Filem, Offsetm, Num_Bytesm}}
- Our tool parses the meta-data descriptor and generates function code.
- At run time, the query provides parameters to invoke the generated functions to create aligned file chunks.
[Figure: dataset hierarchy with a root, datasets 1-3, and data files Data1-Data6]
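A minimal sketch of the aligned file chunk record; the file paths below are hypothetical:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AlignedFileChunk:
    """The same num_rows rows of every attribute, located by one
    (file, offset, num_bytes) extent per participating file."""
    num_rows: int
    extents: List[Tuple[str, int, int]]

    def total_bytes(self) -> int:
        return sum(nb for _, _, nb in self.extents)

afc = AlignedFileChunk(num_rows=100,
                       extents=[("osu0/ipars/COORDS", 0, 1200),
                                ("osu0/ipars/DATA0", 4096, 800)])
print(afc.total_bytes())  # 2000
```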
Comparison with Hand-Written Codes
- Oil reservoir management dataset stored on 16 nodes: the performance difference is within 17%, with an average difference of 14%.
- Satellite data processing dataset stored on a single node: the performance difference is within 4%.
Comparison with an Existing Database (PostgreSQL)
No.  Query
1    SELECT * FROM TITAN;
2    SELECT * FROM TITAN WHERE X>=0 AND X<=10000 AND Y>=0 AND Y<=10000 AND Z>=0 AND Z<=100;
3    SELECT * FROM TITAN WHERE DISTANCE(X,Y,Z) < 1000;
4    SELECT * FROM TITAN WHERE S1 < 0.01;
5    SELECT * FROM TITAN WHERE S1 < 0.5;

- 6GB of data for satellite data processing; the total storage required after loading the data into PostgreSQL is 18GB.
- Indexes were created on both the spatial coordinates and S1 in PostgreSQL.
- No special performance tuning was applied for the experiment.
Query #2:
SELECT * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11
  and Y>=0 and Y<=31 and Z>=0 and Z<=31;
(Replica Set #1 in the previous table)
- Our algorithm has chosen replicas {0,1,2,4} out of the 6 replicas in Set #1.
- Up to 4 nodes, the query execution time scales linearly; because the seek cost dominates the total I/O overhead, the execution time is not halved when using 8 nodes.
Query #4:
SELECT * from IPARS where TIME>=1000 and TIME<=1199;

                        Original   Set #2   Set #3
Execution time (s)      101.2      120.4    80.8
Data processed (MB)     1206       1210
Number of seeks         217        1012     23

An accurate cost model should take into account both the seek cost and the read cost.
Query:
SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11
  and Y>=0 and Y<=28 and Z>=0 and Z<=28;
- Our algorithm has chosen replicas {1,3,4,6} out of all the replicas in Table #1.
- The query filters 83% of the retrieved data when using the original dataset only; it needs to filter only about 50% of the retrieved data in the presence of replicas.
More data is retrieved from disk, and correspondingly more filtering operations are needed, in Plan (c) than in Plan (a).