Selectivity-Based Partitioning Alkis Polyzotis UC Santa Cruz.

Presentation transcript:

Selectivity-Based Partitioning Alkis Polyzotis UC Santa Cruz

Query Optimization
- Integral component of declarative query processing
- Key problem: join ordering
- Most important (and most complex!) module of a DBMS
[Figure: Parser, Optimizer, Execution Engine pipeline; the query R1 ⋈ R2 ⋈ R3 ⋈ R4 is evaluated with the join order ((R2 ⋈ R3) ⋈ R1) ⋈ R4]

“Monolithic” Query Optimization
- Output: a single join order based on join selectivities between tables
- Plan: (P ⋈ E) ⋈ D

Partition-Based Query Optimization
- Output: multiple join orders based on selectivities between fragments of tables
- Plan: ((P ⋈ D2) ⋈ E) ∪ ((E ⋈ D1) ⋈ P)
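The divide-and-union plan above can be illustrated with a small Python sketch. The tuple contents, the attribute layout P(p_id, k1), D(k1, k2), E(k2, e_id), and the helper `chain_join` are all made up for illustration and do not come from the talk. The sketch only checks that splitting the pivot D into disjoint fragments, joining each fragment in its own order, and taking the union gives the same answer as a single monolithic plan.

```python
def chain_join(left, right):
    """Join two tuple lists on: last field of left == first field of right."""
    return [l + r[1:] for l in left for r in right if l[-1] == r[0]]

# Hypothetical toy data: P(p_id, k1), D(k1, k2), E(k2, e_id).
P = [(1, "a"), (2, "a"), (3, "b")]
D = [("a", "x"), ("b", "y"), ("b", "x")]
E = [("x", 10), ("y", 20), ("y", 30)]

# Monolithic plan: one join order for all of D.
monolithic = chain_join(chain_join(P, D), E)

# Partitioned plan: split the pivot D into two disjoint fragments,
# evaluate each fragment with its own join order, then union the results.
D1, D2 = D[:1], D[1:]
plan_d2 = chain_join(chain_join(P, D2), E)   # (P join D2) join E
plan_d1 = chain_join(P, chain_join(D1, E))   # (E join D1) join P
partitioned = plan_d2 + plan_d1              # fragments are disjoint, so + is a union

same_answer = sorted(partitioned) == sorted(monolithic)
```

Both plans return the same tuples; what differs between them is the size of the intermediate results, which is what the partitioning tries to reduce.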

Selectivity-Based Partitioning
- Divide-and-Union paradigm
- Optimization problem and analysis
- Partitioning algorithm
- Experimental results

Roadmap
- Preliminaries
- Problem Definition
- Partitioning Algorithm
  - Optimal Splits
  - Iterative Partitioning
- Experimental Results
- Conclusions

Data and Query Model
- Chain-join queries
  - Example: R1 ⋈ R2 ⋈ R3 ⋈ R4
  - Relations may have optional selections
- Relation → frequency matrix
- Left-deep evaluation plans
  - Example: R3 ⋈ R2 ⋈ R4 ⋈ R1

Problem Definition
- Given: query Q, maximum partition count N
- Goal: find a partitioning of Q into n ≤ N partitions that minimizes query cost
- On-the-fly partitioning vs. off-line partitioning
- Difficult optimization problem!
  - Determine the pivot relation
  - Determine the number of partitions
  - Compute a partitioning of the pivot
  - Determine the orderings of partitioned plans
[Figure: query R1 ⋈ R2 ⋈ R3 ⋈ R4 with the pivot R2 split into fragments R21 and R22, each fragment evaluated with its own join order]

Query Cost Function
- One possibility: optimizer’s cost model
  - Accurate cost estimation
  - Solution depends on low-level system details
  - Difficult to gain intuitions
- Our approach: query cost = number of intermediate result tuples
  - Simple function that admits analysis
  - Sound connections to realistic cost models (Cluet and Moerkotte, ICDT’95)
  - Cost(R3 ⋈ R2 ⋈ R4 ⋈ R1) = |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
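As a rough illustration of this cost function, the Python sketch below computes the cost of a left-deep order over a chain query from sparse frequency matrices. The representation (a dict mapping a pair of join values to a tuple count), the function names `segment_size` and `plan_cost`, and the toy data are assumptions made for this example, not the talk's implementation. As in the formula above, base relations and the final result are not counted.

```python
from collections import defaultdict

def segment_size(freqs, lo, hi):
    """Number of tuples in R_lo join ... join R_hi of a chain query.

    freqs[i] maps (left_value, right_value) -> tuple count for relation i.
    """
    # counts[v]: number of partial join results whose rightmost join value is v
    counts = defaultdict(int)
    for (_, right), f in freqs[lo].items():
        counts[right] += f
    for i in range(lo + 1, hi + 1):
        nxt = defaultdict(int)
        for (left, right), f in freqs[i].items():
            if left in counts:
                nxt[right] += counts[left] * f
        counts = nxt
    return sum(counts.values())

def plan_cost(freqs, order):
    """Sum of intermediate result sizes of a left-deep order (no cross products)."""
    lo = hi = order[0]
    cost = 0
    for step, r in enumerate(order[1:], start=1):
        if r == lo - 1:
            lo = r
        elif r == hi + 1:
            hi = r
        else:
            raise ValueError("order would require a cross product")
        if step < len(order) - 1:      # the final result is not an intermediate
            cost += segment_size(freqs, lo, hi)
    return cost

# Hypothetical frequency matrices for R1 join R2 join R3 join R4.
freqs = {
    1: {(0, "a"): 2},
    2: {("a", "x"): 1, ("a", "y"): 3},
    3: {("x", "p"): 1, ("y", "p"): 1},
    4: {("p", 0): 5},
}
cost_a = plan_cost(freqs, [3, 2, 4, 1])   # |R3 join R2| + |R3 join R2 join R4|
cost_b = plan_cost(freqs, [1, 2, 3, 4])   # |R1 join R2| + |R1 join R2 join R3|
```

On this toy data the two orders cost 24 and 16 intermediate tuples respectively, which shows how the join order alone changes the cost.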

Partitioning Algorithm - Overview
- State space: partitioned join orders
- Partitioning algorithm:
  - Explore a set of states
  - Compute optimal partitioning for each state
  - Return global optimum
- Our approach: order joins, then partition
- Another possibility: partition, then order joins

Distributing Tuples
- Goal: distribute tuples to minimize cost
- Optimal distribution depends on:
  - Frequency matrices of other relations
  - Position (m,l)

Optimal Split Theorem
- Distribute each value (m,l) independently
- Place (m,l) in the partition that minimizes g(L,T,m,l)
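The theorem suggests a very simple assignment loop, sketched below in Python. The transcript does not define g, so the sketch takes it as a caller-supplied function; the name `optimal_split`, the way a plan is represented as a (leading, trailing) pair, and the toy cost table are all hypothetical.

```python
def optimal_split(cells, plans, g):
    """Assign each frequency-matrix cell (m, l) of the pivot to one partition.

    cells: iterable of (m, l) positions of the pivot's frequency matrix
    plans: one (leading_order, trailing_order) pair per partition
    g:     caller-supplied function g(L, T, m, l); its definition is NOT given
           in the transcript, so it is treated as a black box here
    """
    assignment = {}
    for (m, l) in cells:
        # Each cell is placed independently in the partition with the smallest g.
        best = min(range(len(plans)),
                   key=lambda p: g(plans[p][0], plans[p][1], m, l))
        assignment[(m, l)] = best
    return assignment

# Toy stand-in for g: a lookup table indexed by plan and cell (made-up numbers).
plans = [(("R1",), ("R3", "R4")), (("R3",), ("R1", "R4"))]
toy_costs = {
    (plans[0], (0, 0)): 5, (plans[1], (0, 0)): 2,
    (plans[0], (0, 1)): 1, (plans[1], (0, 1)): 7,
}

def toy_g(L, T, m, l):
    return toy_costs[((L, T), (m, l))]

split = optimal_split([(0, 0), (0, 1)], plans, toy_g)
```

Because each cell is handled on its own, the loop takes one evaluation of g per cell and partition, with no search over combinations of cells.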

Search Algorithm
- Exhaustive search is impractical
  - [ Pivot, Leading orders, Trailing orders ]
- Search heuristics:
  - Tighter search space: [ Pivot, Optimal Leading orders ]
  - Iterative Partitioning
  - Guided search by using lower bounds on cost of partitions

Encoding of State Space
- State: [ Pivot, Optimal leading orders ]
- Transition: insert relation in a leading order

Iterative Partitioning
- Key idea: (Partition, Optimize)+
  - Compute optimal split for leading/trailing orders
  - Optimize trailing orders for the current split
- Theorem: query cost can only decrease
- Idea extended to more detailed cost models
[Figure: leading and trailing join orders around the pivot fragments R21 and R22]
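The (Partition, Optimize)+ loop can be sketched as an alternating procedure in Python. The three callables `split_fn`, `optimize_fn`, and `cost_fn` are hypothetical placeholders for the optimal split, the trailing-order optimization, and the cost function. The stopping rule (stop once the cost no longer strictly decreases) is an assumption of this sketch, motivated by the theorem that the cost can only decrease.

```python
def iterative_partitioning(orders, split_fn, optimize_fn, cost_fn, max_rounds=100):
    """Alternate between splitting the pivot and re-optimizing join orders.

    split_fn(orders)           -> split of the pivot for these orders
    optimize_fn(orders, split) -> orders re-optimized for this split
    cost_fn(orders, split)     -> query cost
    All three are caller-supplied; none is specified in the transcript.
    """
    split = split_fn(orders)
    best = cost_fn(orders, split)
    for _ in range(max_rounds):
        new_orders = optimize_fn(orders, split)      # Optimize step
        new_split = split_fn(new_orders)             # Partition step
        new_cost = cost_fn(new_orders, new_split)
        if new_cost >= best:                         # no further improvement
            break
        orders, split, best = new_orders, new_split, new_cost
    return orders, split, best

# Toy stand-ins (plain integers instead of real orders and splits).
result = iterative_partitioning(
    3,
    split_fn=lambda x: x,
    optimize_fn=lambda x, y: max(x - 1, 0),
    cost_fn=lambda x, y: x + y,
)
```

With the toy stand-ins the cost drops from 6 to 4, 2, and 0 before the loop stops.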

Search Algorithm
- Initial states: single-relation leading orders
- Search process:
  - Compute partitions with Iterative Partitioning (IP)
  - Open more states with the transition function
- Transitions are guided by a lower bound on the cost function
  - The same lower bound can also prune states
- Stopping criteria:
  - Search space is exhausted
  - Time budget is exhausted
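One way to read this search process is as a best-first search ordered by the lower bound, sketched below in Python. The callbacks `expand`, `evaluate`, and `lower_bound` are hypothetical stand-ins for the transition function, the Iterative Partitioning step, and the lower bound from the slides. The sketch assumes that `lower_bound(state)` never exceeds the cost of the state or of any state reachable from it.

```python
import heapq
import itertools
import time

def guided_search(initial_states, expand, evaluate, lower_bound, time_budget_s):
    """Best-first search with lower-bound pruning and a time budget."""
    deadline = time.monotonic() + time_budget_s
    tie = itertools.count()                      # tie-breaker for equal bounds
    frontier = [(lower_bound(s), next(tie), s) for s in initial_states]
    heapq.heapify(frontier)
    best_state, best_cost = None, float("inf")
    # Stop when the search space or the time budget is exhausted.
    while frontier and time.monotonic() < deadline:
        bound, _, state = heapq.heappop(frontier)
        if bound >= best_cost:
            continue                             # prune: cannot beat the best
        cost = evaluate(state)                   # e.g. run iterative partitioning
        if cost < best_cost:
            best_state, best_cost = state, cost
        for nxt in expand(state):                # open more states
            b = lower_bound(nxt)
            if b < best_cost:
                heapq.heappush(frontier, (b, next(tie), nxt))
    return best_state, best_cost

# Toy problem: states are tuples of relation ids, costs are made up.
relations = (1, 2, 3)
expand = lambda s: [s + (r,) for r in relations if r not in s]
evaluate = lambda s: 10 - len(s) + s[0]
lower_bound = lambda s: 7 + s[0]                 # valid bound for this toy cost

best_state, best_cost = guided_search([(r,) for r in relations],
                                      expand, evaluate, lower_bound, 5.0)
```

In the toy run, the states that start with relation 2 or 3 are never evaluated, because their lower bounds already exceed the best cost found.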

System Integration
- Monolithic: Parser, Optimizer, Execution Engine
- Partition-based: Parser, Optimizer, Execution Engine, plus a Partitioner

Effect of Skew
- Synthetic data

Execution Time
- Synthetic data (Skew=1.5)

Varying Time Budget
- Synthetic data (Skew=1.5)

Results on Real-Life Data
- SwissProt

Conclusions
- Monolithic optimization → missed opportunities
- Selectivity-Based Partitioning
  - Divide & Union approach
  - Multiple join orders per query
  - Join selectivity between relation fragments
- Partitioning Algorithm
  - Iterative Partitioning
- Experimental Results
  - Significant reduction of intermediate results

Future Work
- Extension to multiple pivots
- Partition-then-order optimization
- Efficient execution of partitioned plans
- Off-line workload-aware partitioning

Thank you!

Partitioning Model
- General case: multi-relation partitioning
- Our approach: single-relation partitioning
[Figure: query R1 ⋈ R2 ⋈ R3 ⋈ R4; single-relation partitioning splits R2 into R21 and R22, while multi-relation partitioning also splits R3 into R31 and R32]