Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz


Using Space and Attribute Partitioned Partial Replicas for Data Subsetting and Aggregation Queries
Li Weng, Umit Catalyurek, Tahsin Kurc, Gagan Agrawal, Joel Saltz

Motivation: Data-Driven Science
Examples: oil reservoir management, magnetic resonance imaging.
Data-driven applications from science, engineering, and biomedicine involve:
- Large spatio-temporal datasets
- Several attributes at each point
11/11/2018 ICPP2006

Replication of Scientific Datasets
- A variety of queries run on the same dataset
- Each requires a different spatio-temporal region and subset of attributes
- No single chunking and indexing strategy can be optimal for all of them
- Replication: create multiple copies, each with a different chunking and indexing scheme
- Drawback: large storage overhead

Partial Replication
Can we get the benefits of replication without the large overheads?
- Not all attributes are accessed uniformly
- Not all spatio-temporal regions are accessed with uniform probability
Partial replication: each replica stores
- only a subset of attributes (attribute partitioned), and/or
- only a rectilinear spatio-temporal region (space partitioned)
Challenge: no single partial replica may be able to answer the query. Can we choose and combine partial replicas to optimize query processing?

Prior Work (CCGrid 05)
- Query planning with partial replicas: cost models and a greedy selection algorithm
- Considered only space partitioned replicas and SELECT SQL queries
- Implemented as an extension to the Automatic Data Virtualization System (HPDC 04)

Contributions
- Support the combined use of space and attribute partitioned partial replicas
- A dynamic programming algorithm for selecting the best set of attribute partitioned replicas
- A new greedy strategy for recommending a combination of replicas
- Extension of the replica selection algorithm to queries with aggregations, where replicas may be unevenly stored across storage units

System Overview
The Replica Selection Module is tightly coupled with our prior work on supporting SQL SELECT queries over scientific datasets in a cluster environment.

Outline
- Introduction: motivation, contributions, system overview
- Query execution and algorithm design: uniformly partitioned chunks and select queries; uneven partitioning and aggregation operations
- Experimental results
- Related work
- Conclusions

Uniformly Partitioned Chunks and Select Queries
Computing the goodness value:
  goodness = useful data_per-chunk / cost_per-chunk
- Chunk: an atomic unit in space partitioned replicas, or a logical unit in attribute partitioned replicas
- A partial replica contains full chunks and partial chunks
- cost_per-chunk = t_read * n_read + t_seek, where
  - t_read: average read time for a disk page
  - n_read: number of pages fetched
  - t_seek: average seek time
- Fragment: an intermediate unit between a replica and its chunks; a group of full or partial chunks in a replica that share the same goodness value
  goodness_per-fragment = useful data_per-fragment / cost_per-fragment
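The goodness formulas above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the `t_read` and `t_seek` defaults are made-up device constants.

```python
def chunk_cost(n_read, t_read=0.1e-3, t_seek=9.0e-3):
    """cost_per-chunk = t_read * n_read + t_seek."""
    return t_read * n_read + t_seek

def chunk_goodness(useful_bytes, n_read, t_read=0.1e-3, t_seek=9.0e-3):
    """goodness = useful data per chunk / cost per chunk."""
    return useful_bytes / chunk_cost(n_read, t_read, t_seek)

def fragment_goodness(chunks):
    """A fragment groups chunks with the same goodness value; its
    goodness is total useful data over total retrieval cost.
    chunks: list of (useful_bytes, n_read) pairs."""
    total_useful = sum(useful for useful, _ in chunks)
    total_cost = sum(chunk_cost(pages) for _, pages in chunks)
    return total_useful / total_cost
```

Note how a full chunk (every fetched page useful) scores higher than a partial chunk that pays the same I/O cost for less useful data, which is what drives the selection algorithms later in the talk.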

An Example – Query and Intersecting Replicas
[Figure: a query intersecting two replicas. Replica 1 contributes 3 full chunks and 2 partial chunks, forming 3 fragments; composite Replica 2 contributes 10 full chunks, forming 1 fragment.]

General Structure of Replica Selection Algorithm

Dynamic Programming Algorithm
Input: R, a group of attribute-partitioned replicas, and an issued query Q whose referred attribute list is M[1..l] (l attributes). Output: R', the optimal combination.
- Calculate Cost[j..j] for each single attribute M[j]
- Foreach interval length k from 2 to l, foreach start u from 1 to l-k+1, with v = u+k-1:
  - If some replica r in R contains M[u..v], set Cost[u..v] to that replica's cost and record Loc[u..v]->s = -1, Loc[u..v]->r = r; otherwise Cost[u..v] = ∞
  - Find the split point p minimizing q = Cost[u..p] + Cost[p+1..v]; if q improves on Cost[u..v], set Cost[u..v] = q, Loc[u..v]->s = p, Loc[u..v]->r = -1
- Output the combination R' recorded by Loc[1..l]
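The flow above is an interval dynamic program over the query's attribute list. A minimal sketch, under the assumption that each candidate replica serves one contiguous run of attributes; `replica_cost` is a hypothetical input giving the cheapest replica for each run:

```python
import math

def select_replicas(l, replica_cost):
    """Pick the cheapest combination of attribute-partitioned replicas
    covering the referred attribute list M[1..l].
    replica_cost[(u, v)]: cost of the cheapest replica containing the
    contiguous run M[u..v] (1-based, inclusive); absent if none exists.
    Returns (total_cost, list of (u, v) runs, one replica per run)."""
    INF = math.inf
    cost, cut = {}, {}
    for length in range(1, l + 1):
        for u in range(1, l - length + 2):
            v = u + length - 1
            # Case 1: a single replica serves the whole run M[u..v]
            best, best_cut = replica_cost.get((u, v), INF), None
            # Case 2: split at p and combine two cheaper sub-solutions
            for p in range(u, v):
                q = cost[(u, p)] + cost[(p + 1, v)]
                if q < best:
                    best, best_cut = q, p
            cost[(u, v)], cut[(u, v)] = best, best_cut

    def runs(u, v):
        p = cut[(u, v)]
        return [(u, v)] if p is None else runs(u, p) + runs(p + 1, v)

    return cost[(1, l)], runs(1, l)
```

For example, with replicas covering M[1..2] (cost 5), M[3..3] (cost 2), and M[1..3] (cost 10), the DP prefers the split {M[1..2], M[3..3]} at total cost 7 over the single wide replica.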

Greedy Strategy
Input: Q, an issued query; R, the partial replicas; D, the original dataset.
Let F be the set of all fragments intersecting the query boundary, Fmax the fragment with the maximum goodness value in F, and S the ordered list of candidate fragments in decreasing order of goodness.
- Calculate the fragment set F
- While F is not empty:
  - Append Fmax to S and remove it from F
  - If any fragment remaining in F overlaps Fmax, subtract the overlap and re-compute its goodness value
- Add the original dataset D if part of the query is still uncovered
- Output S
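A toy sketch of this greedy loop. It models each fragment as a set of chunk ids plus a retrieval cost; the names and the set-based overlap subtraction are illustrative simplifications of the spatial overlap handling, not the paper's data structures.

```python
def greedy_select(fragments, query_chunks):
    """fragments: {name: (set_of_chunk_ids, cost)}.
    Repeatedly take the fragment with the best goodness
    (useful chunks / cost) over the still-uncovered part of the
    query, subtract its coverage, and re-evaluate the rest."""
    remaining = dict(fragments)
    uncovered = set(query_chunks)
    plan = []
    while uncovered and remaining:
        # goodness is re-computed against what is still uncovered,
        # which implicitly subtracts overlap with earlier picks
        name, (chunks, cost) = max(
            remaining.items(),
            key=lambda kv: len(kv[1][0] & uncovered) / kv[1][1])
        if not (chunks & uncovered):
            break  # no remaining fragment helps
        plan.append(name)
        uncovered -= chunks
        del remaining[name]
    if uncovered:
        plan.append("D")  # original dataset covers the leftover region
    return plan
```

A cheap fragment covering many needed chunks is taken first; the original dataset D is appended only when the chosen fragments leave part of the query uncovered.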

Uneven Partitioning and Aggregation Operations
Computing the goodness value:
  goodness(F) = Σ_{p∈P} data_p(F) / max_{p∈P} (cost_p(CurLoad) + cost_p(F))
- P: all available storage nodes
- CurLoad: current workload across the storage nodes due to previously chosen candidate replicas
cost_fragment = t_read*n_read + t_seek*n_seek + t_filter*n_filter + t_aggr*n_aggr + t_trans*n_trans
- t_filter: average filtering time for a tuple; n_filter: total number of tuples in all chunks
- t_aggr: average aggregate computation time for a tuple; n_aggr: number of useful tuples
- t_trans: network transfer time for one unit of data; n_trans: amount of data after the aggregate operation
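The load-aware goodness metric can be sketched as below: useful data summed over nodes, divided by the completion time of the most loaded node (its current load plus this fragment's cost). The per-node inputs are hypothetical estimates, not values from the paper.

```python
def load_aware_goodness(data_per_node, cost_per_node, cur_load):
    """data_per_node[p]: useful bytes of fragment F stored on node p.
    cost_per_node[p]: estimated cost of reading F's part on node p.
    cur_load[p]: cost already assigned to p by earlier candidates."""
    total_data = sum(data_per_node.values())
    # the slowest (most loaded) node bounds the fragment's finish time
    finish = max(cur_load.get(p, 0.0) + c for p, c in cost_per_node.items())
    return total_data / finish
```

This makes the same fragment look worse when its data sits on a node that earlier choices have already loaded, steering the greedy strategy toward balanced plans.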

Workload-Aware Greedy Strategy
Input: Q, an issued query; F, the fragments intersecting the query boundary; D, the original dataset. S is the ordered list of candidate fragments.
- Foreach Fi in F: if Fi does not overlap any fragment in F - {Fi}, append Fi to S directly and remove it from F
- While F is not empty:
  - Calculate the current goodness value of each fragment in F (its per-node costs depend on the workload already assigned by earlier choices)
  - Append Fmax, the fragment with the maximum goodness value, to S and remove it from F
  - If any fragment remaining in F overlaps Fmax, subtract the overlap
- Add the original dataset D if needed
- Output S

Outline
- Introduction: motivation, contributions, system overview
- Query execution and algorithm design: uniformly partitioned chunks and select queries; uneven partitioning and aggregation operations
- Experimental results
- Related work
- Conclusions

Experimental Setup & Design
Setup: a Linux cluster connected via switched Fast Ethernet; each node has a PIII 933 MHz CPU, 512 MB main memory, and three 100 GB IDE disks.
Experiments:
- Performance of the combination of space-partitioned and attribute-partitioned replicas, and the benefit of attribute-partitioned replicas
- Scalability when increasing the number of nodes hosting the dataset
- Performance when query sizes are varied
- Performance of aggregate queries with unevenly partitioned replicas


# Query
SELECT attrlist from IPARS where RID in [0,1] and TIME in [1000,1399] and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;

Chart legend: attr+space part: the combined use of all replicas; space part: only the space-partitioned replicas.
A run-time optimization

# Query
SELECT * from IPARS where TIME>=1000 and TIME<=1599 and X>=0 and X<=11 and Y>=0 and Y<=31 and Z>=0 and Z<=31;
Up to 4 nodes, query execution time scales linearly. Because seek cost dominates the total I/O overhead, execution time is not halved when going from 4 to 8 nodes.

# Query
SELECT * from IPARS where TIME>=1000 and TIME<=TIMEVAL and X>=0 and X<=11 and Y>=0 and Y<=28 and Z>=0 and Z<=28;
Our algorithm chose replicas {1,3,4,6} from Table 1. The query filters 83% of the retrieved data when using only the original dataset, but needs to filter only about 50% of the retrieved data in the presence of replicas.

Aggregate Queries with Unevenly Partitioned Replicas


- Alg: solution by the proposed algorithm
- Alg+Ref: solution after the refinement step
- Solution-1 & Solution-2: two manually created solutions

Related Work
- Replication research: exact copies of portions of data; data availability and reliability; multi-disk systems with replicated data
- Data caching techniques: using aggregate memory and cooperative caches; management and replacement of replicas
- Our previous work on performance optimization using space partitioned replicas

Conclusions
- The proposed cost models can estimate execution time trends.
- The greedy strategy, combined with the dynamic programming algorithm, chooses a good set of candidate replicas that reduces query execution time.
- Our implementation shows good scalability.
- When data transfer bandwidth is the limiting factor, a combination of space and attribute partitioned replicas should be preferred.