Ohio State University Department of Computer Science and Engineering Servicing Range Queries on Multidimensional Datasets with Partial Replicas Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz

CCGRID 2005

Outline
Introduction
–Motivation
–Partial Replicas Considered
–System overview
Query execution and algorithm design
–Computing goodness value
–Replica selection algorithm
Experimental results
Related work
Conclusions

Motivating Applications
Data-driven applications from science, engineering, and biomedicine:
–Oil Reservoir Management
–Water Contamination Studies
–Cancer Studies using MRI
–Telepathology with Digitized Slides
–Satellite Data Processing
–Virtual Microscope
[Figures: Magnetic Resonance Imaging; Oil Reservoir Management]
We use Oil Reservoir Management as a case study.

Motivation
The combination of large dataset sizes, geographic distribution of users and resources, and complex analysis creates a need for efficient access and high-performance processing.
To achieve good performance across varied query types and growing client demand, we apply partial replication as an optimization technique.
In a distributed environment, efficiently assembling the data a query requires from the replicas and the original dataset is an interesting challenge.

Partial Replicas Considered
A replica information file describes the replicas created by users.
–Hot range
»Use a group of representative queries to identify the portions of the dataset to be replicated.
–Chunking
»Allows flexible chunk shapes and sizes.
»Affects data read cost.
–Dimension order
»Lays out chunks following different dimension sequences.
»Affects data seek cost.
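To illustrate why dimension order affects seek cost, here is a minimal sketch. It assumes chunks are stored contiguously in a linearized order, and that each maximal run of consecutive chunks touched by a range query costs one seek. The function `count_seeks` and its parameters are hypothetical names for illustration and are not part of the presented system.

```python
from itertools import product


def count_seeks(shape, order, lo, hi):
    """Count contiguous runs of chunks touched by a range query.

    shape: number of chunks along each dimension
    order: dimensions listed from slowest- to fastest-varying in the layout
    lo, hi: inclusive chunk-index bounds of the query along each dimension
    Assumption: one seek per maximal run of consecutive chunk offsets.
    """
    # Stride of each dimension in the linearized layout.
    stride = {}
    step = 1
    for dim in reversed(order):
        stride[dim] = step
        step *= shape[dim]

    dims = range(len(shape))
    offsets = sorted(
        sum(idx[d] * stride[d] for d in dims)
        for idx in product(*(range(lo[d], hi[d] + 1) for d in dims))
    )
    # A new seek is needed whenever the next chunk is not adjacent on disk.
    return 1 + sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + 1)


# A 4x4 grid of chunks; the query spans all of dimension 0 at index 0 of dimension 1.
seeks_dim1_fastest = count_seeks((4, 4), (0, 1), (0, 0), (3, 0))
seeks_dim0_fastest = count_seeks((4, 4), (1, 0), (0, 0), (3, 0))
```

With dimension 1 varying fastest the four requested chunks are scattered, while with dimension 0 varying fastest they are adjacent, so the same query needs fewer seeks on the second layout.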

Partial Replicas Considered
To maximize I/O parallelism, users need to partition each chunk of a replica across all available data source nodes.
After the hot ranges of the dataset have been re-organized, re-distributed, and re-ordered, there is no longer a one-to-one mapping between data chunks in the original dataset and those in the replicas.

System Overview
The Replica Selection Module is coupled tightly with our prior work on supporting SQL Select queries on scientific datasets in a cluster environment.

STORM Runtime System
A middleware that supports data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system.
Services
–Query service
–Data source service
–Indexing service
–Filtering service
–Partition generation service
–Data mover service

Computing Goodness Value
goodness = useful data per chunk / cost per chunk
–Full chunks and partial chunks of a partial replica
–Chunk retrieval cost
Cost = k_1 * C_read-operation + k_2 * C_seek-operation
–k_1: average read time for a page
–C_read-operation: number of pages fetched
–k_2: average seek time
–C_seek-operation: number of seeks
Fragment
–An intermediate unit between a replica and its chunks
–A group of full or partial chunks in a replica that have the same goodness value
–goodness = useful data per fragment / cost per fragment
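The cost and goodness formulas above can be sketched as follows. The function names, the dictionary layout, and the numeric values of k_1 and k_2 are illustrative assumptions, not measurements from the presentation.

```python
import math


def chunk_cost(pages_read, seeks, k1, k2):
    """Cost = k1 * (pages fetched) + k2 * (number of seeks)."""
    return k1 * pages_read + k2 * seeks


def chunk_goodness(useful_data, pages_read, seeks, k1, k2):
    """Useful data delivered per unit of retrieval cost for one chunk."""
    return useful_data / chunk_cost(pages_read, seeks, k1, k2)


def fragment_goodness(chunks, k1, k2):
    """Total useful data over total cost for a group of chunks."""
    useful = sum(c["useful"] for c in chunks)
    cost = sum(chunk_cost(c["pages"], c["seeks"], k1, k2) for c in chunks)
    return useful / cost


# Hypothetical constants: 0.5 time units per page read, 8.0 per seek.
K1, K2 = 0.5, 8.0

# A full chunk delivers everything it reads; a partial chunk costs the
# same to read but only part of it falls inside the query.
full = chunk_goodness(useful_data=400, pages_read=100, seeks=2, k1=K1, k2=K2)
partial = chunk_goodness(useful_data=100, pages_read=100, seeks=2, k1=K1, k2=K2)

# Chunks with equal goodness form a fragment with that same goodness.
frag = fragment_goodness(
    [{"useful": 400, "pages": 100, "seeks": 2}] * 2, K1, K2
)
```

In this sketch a partial chunk scores lower than a full chunk of the same size because the read and seek cost is paid for the whole chunk while only part of the data is useful.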

An Example – Query and Intersecting Replicas
Replica 1
–3 full chunks and 2 partial chunks
–3 fragments
Replica 2
–10 full chunks
–1 fragment

Replica Selection Algorithm
–Greedy strategy
»Q: an issued query
»R: the partial replicas
»D: the original dataset
»F: all fragments intersecting the query boundary
»F_max: the fragment with the maximum goodness value in F
»S: the ordered list of candidate fragments, in decreasing order of their goodness value
–Flowchart steps
»Input: Q, R, D.
»Calculate the fragment set F.
»While F is not null: append F_max to S and remove F_max from F; if an overlap with F_max exists in F, subtract the overlap and re-compute the goodness value.
»When F is null, add D if needed.
»Output: S.
–The runtime complexity is O(m^2), where m is the number of fragments intersecting the query boundary.
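The greedy strategy can be sketched as below. This is a simplified illustration under stated assumptions: regions are modeled as sets of grid cells, each fragment's retrieval cost is treated as fixed when its overlap is subtracted, and the names `Fragment` and `greedy_select` are hypothetical rather than taken from the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    name: str
    cells: set    # query cells this fragment can still usefully supply
    cost: float   # estimated retrieval cost (assumed fixed in this sketch)

    @property
    def goodness(self) -> float:
        return len(self.cells) / self.cost


def greedy_select(query_cells, fragments):
    """Return (S, leftover): chosen fragments in pick order, and the
    query cells that must still be read from the original dataset D."""
    query_cells = set(query_cells)
    # Copy the fragments, clipped to the query, so inputs are not mutated.
    F = [Fragment(f.name, set(f.cells) & query_cells, f.cost) for f in fragments]
    F = [f for f in F if f.cells]
    S = []
    leftover = set(query_cells)
    while F:
        f_max = max(F, key=lambda f: f.goodness)
        F.remove(f_max)
        S.append(f_max)
        leftover -= f_max.cells
        # Subtract the overlap with f_max; goodness is recomputed on access.
        for f in F:
            f.cells -= f_max.cells
        F = [f for f in F if f.cells]
    return S, leftover


frags = [
    Fragment("A", set(range(0, 6)), 2.0),
    Fragment("B", set(range(4, 8)), 2.0),
    Fragment("C", {6, 7, 8}, 2.0),
]
S, leftover = greedy_select(range(10), frags)
picked = [f.name for f in S]
```

In this example fragment B is never chosen: once A is picked, B's remaining useful cells are a subset of C's, so C wins the next round and B is left with nothing to contribute. Cell 9 is not covered by any fragment and falls to the original dataset. Each round scans all remaining fragments, which matches the quadratic bound in the number of fragments.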

An Example – 4 Fragments from 2 Replicas
Assume Fragment 4 has the maximum goodness value.
The candidate fragment set is { 1, 2 (with overlap), 3, 4, D }.

Replica Selection Algorithm
–Extension to the greedy algorithm
»S: the ordered list of candidate fragments, in decreasing order of their goodness value
»F_i: a fragment in S
»C: a chunk in F_i
»r: the union range contained by the filtered areas of other fragments
–Flowchart steps
»Input: S.
»If redundant I/O exists: for each F_i ∈ S, starting from the head of S, and for each chunk C in F_i, if C ∈ r, drop C from F_i and modify other fragments in S to retrieve C.
»Output: the recommended fragments.
–The runtime complexity is O(n^2), where n is the number of chunks intersecting the query boundary.
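One possible reading of this extension is sketched below. It assumes each chunk records the cells it reads from disk and the cells it keeps after filtering, so that a chunk's filtered area is the difference of the two. The data layout and the name `remove_redundant_io` are hypothetical, and this naive version recomputes r for every chunk, so it does not attempt to meet the stated O(n^2) bound.

```python
def remove_redundant_io(S):
    """Drop chunks whose useful cells are already fetched, and currently
    filtered out, by chunks of other fragments.

    S: list of fragments in decreasing goodness order; each fragment is a
    list of chunk dicts with 'read' (cells fetched) and 'useful' (cells kept).
    The chunk dicts are updated in place.
    """
    for i, frag in enumerate(S):
        for chunk in list(frag):
            others = [c for j, g in enumerate(S) if j != i for c in g]
            # r: union of the filtered areas of the other fragments.
            r = set()
            for c in others:
                r |= c["read"] - c["useful"]
            if chunk["useful"] and chunk["useful"] <= r:
                frag.remove(chunk)
                # Let the other fragments keep these cells instead of
                # filtering them out.
                pending = set(chunk["useful"])
                for c in others:
                    take = pending & (c["read"] - c["useful"])
                    c["useful"] |= take
                    pending -= take
    return [f for f in S if f]


# Fragment 4 (head of S) owns cells 0-5; Fragment 2 reads cells 4-7 but
# filters 4-5 because they overlap with Fragment 4.
fragment4 = [
    {"read": {0, 1, 2, 3}, "useful": {0, 1, 2, 3}},
    {"read": {4, 5}, "useful": {4, 5}},
]
fragment2 = [{"read": {4, 5, 6, 7}, "useful": {6, 7}}]
recommended = remove_redundant_io([fragment4, fragment2])
```

Here the second chunk of Fragment 4 is dropped because Fragment 2 reads those cells anyway, which saves one chunk read and removes the filtering Fragment 2 would otherwise perform.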

An Example – Recommended Chunks
Final recommendation
–The overlap region has been deleted from Fragment 4 and is retrieved from Fragment 2 instead.
–The result is fewer I/O operations and less filtering computation.

Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet.
Each node has a PIII 933 MHz CPU, 512 MB of main memory, and three 100 GB IDE disks.
Experiments:
–Scalability test as the data size increases
–Performance test as the number of nodes hosting the dataset is varied
–Robustness of the proposed algorithm

Query #1
SELECT * from IPARS where TIME>=1000 and TIME<=[...] and X>=0 and X<=11 and Y>=0 and Y<=[...] and Z>=0 and Z<=28;
Set #1 in the previous table 1.
Our algorithm has chosen {0,1,2,4} out of 6 replicas in Set #1.
The query filters 83% of the retrieved data when using the original dataset only; however, it needs to filter only about 25% of the retrieved data in the presence of the replicas in Set #1.

Query #2
SELECT * from IPARS where TIME>=1000 and TIME<=[...] and X>=0 and X<=11 and Y>=0 and Y<=[...] and Z>=0 and Z<=31;
Set #1 in the previous table 1.
Our algorithm has chosen {0,1,2,4} out of 6 replicas in Set #1.
Up to 4 nodes, query execution time scales linearly.
Because seek cost dominates the total I/O overhead, execution time is not reduced by half when using 8 nodes.

Query #3
SELECT * from IPARS where TIME>=1000 and TIME<=[...] and X>=0 and X<=15 and Y>=0 and Y<=[...] and Z>=0 and Z<=63;
Set #1 in the previous table 1.
The extension to our algorithm detects the redundant I/O in the candidate replicas for this query.
The final recommendation is to avoid using replicas.

Query #4
SELECT * from IPARS where TIME>=1000 and TIME<=1199;
Results for Original, Set #2, and Set #3:
–Execution time (seconds): Original 101.2, Set #2 120.4, Set #3 80.8
–Data processed (MB): 1206 1210
–Number of seeks: 217101223
An accurate cost model should take into account both the seek cost and the read cost.

Related Work
Parallel file systems and I/O libraries
–Supporting regular strided access to uniformly distributed datasets
File-level and dataset-level replication and replica management
–Exact replica copies
–Availability and reliability
Data caching
–Remote memory
–Cooperative caches
–Active semantic cache

Conclusions
We have investigated a compiler-runtime approach for executing range queries in a distributed environment that employs partial replication.
We have proposed a cost metric and an algorithm to select the set of replicas, and possibly the original dataset, to answer a given query efficiently.
Experimental results demonstrate the efficacy, scalability, and robustness of our algorithm.
