Selectivity-Based Partitioning Alkis Polyzotis UC Santa Cruz
Query Optimization Integral component of declarative query processing Key problem: join ordering Most important (and most complex!) module of a DBMS Pipeline: Parser → Optimizer → Execution Engine Example: R1 ⋈ R2 ⋈ R3 ⋈ R4 optimized to ((R2 ⋈ R3) ⋈ R1) ⋈ R4
“Monolithic” Query Optimization Output: a single join order based on join selectivities between tables Plan: (P ⋈ E) ⋈ D
Partition-Based Query Optimization Output: multiple join orders based on selectivities between fragments of tables Plan: ((P ⋈ D2) ⋈ E) ∪ ((E ⋈ D1) ⋈ P)
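The divide-and-union idea can be checked on a toy example. The sketch below is illustrative only: the relations R1, R2, R3, their attribute names, and the helper `natural_join` are assumptions made for this example, not part of the talk. It splits one relation into two disjoint fragments, evaluates each fragment with a different join order, and confirms that the union of the two plans returns the same tuples as a single monolithic plan.

```python
from collections import Counter

def natural_join(left, right):
    """Join two lists of dict rows on every attribute name they share."""
    out = []
    for l in left:
        for r in right:
            shared = l.keys() & r.keys()
            if all(l[k] == r[k] for k in shared):
                out.append({**l, **r})
    return out

def as_multiset(rows):
    """Order-insensitive view of a result, so plans can be compared."""
    return Counter(tuple(sorted(row.items())) for row in rows)

# Hypothetical chain R1(a, b) - R2(b, c) - R3(c, d).
R1 = [{"a": 1, "b": 10}, {"a": 2, "b": 10}, {"a": 3, "b": 20}]
R2 = [{"b": 10, "c": 100}, {"b": 20, "c": 200}, {"b": 20, "c": 100}]
R3 = [{"c": 100, "d": 7}, {"c": 200, "d": 8}, {"c": 200, "d": 9}]

# Divide: split R2 into two disjoint fragments.
R21 = [row for row in R2 if row["b"] == 10]
R22 = [row for row in R2 if row["b"] != 10]

# Monolithic plan: one join order for the whole query.
monolithic = natural_join(natural_join(R1, R2), R3)

# Partitioned plan: a different join order per fragment, then union.
partitioned = (natural_join(natural_join(R1, R21), R3)
               + natural_join(natural_join(R3, R22), R1))

same_result = as_multiset(monolithic) == as_multiset(partitioned)
print(same_result)
```

The equality holds because a join distributes over a disjoint union of one of its inputs, which is what allows each fragment to receive its own join order.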
Selectivity-Based Partitioning Divide-and-Union paradigm Optimization problem and analysis Partitioning algorithm Experimental results
Roadmap Preliminaries Problem Definition Partitioning Algorithm Optimal Splits Iterative Partitioning Experimental Results Conclusions
Data and Query Model Chain-join queries Example: R1 ⋈ R2 ⋈ R3 ⋈ R4 Relations may have optional selections Each relation is modeled by a frequency matrix Left-deep evaluation plans Example: ((R3 ⋈ R2) ⋈ R4) ⋈ R1
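A small sketch of how a frequency matrix can stand in for a relation in a chain join. It assumes (this is not stated on the slide) that each relation is reduced to its two join attributes, so that an entry (m, l) counts the tuples carrying left join value m and right join value l. The names `frequency_matrix` and `join_size` and the toy data are hypothetical.

```python
from collections import Counter

def frequency_matrix(rows):
    """Count tuples per (left join value, right join value) pair."""
    return Counter(rows)

def join_size(freq_left, freq_right):
    """Join size when the right value of one relation meets the left value of the next."""
    return sum(
        count_l * count_r
        for (_, right_value), count_l in freq_left.items()
        for (left_value, _), count_r in freq_right.items()
        if right_value == left_value
    )

# Toy relations, already projected onto their join attributes.
R2 = [(10, 100), (20, 200), (20, 100)]
R3 = [(100, 7), (200, 8), (200, 9)]

F2, F3 = frequency_matrix(R2), frequency_matrix(R3)
computed = join_size(F2, F3)

# Cross-check against a direct nested-loop count.
brute_force = sum(1 for (_, c) in R2 for (c2, _) in R3 if c == c2)
print(computed, brute_force)
```

With the matrices stored densely, the same quantity is the sum of all entries of the matrix product of the two frequency matrices.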
Problem Definition Given: query Q, maximum partition count N Goal: find a partitioning of Q into n ≤ N partitions that minimizes query cost On-the-fly partitioning vs. off-line partitioning Difficult optimization problem! Determine the pivot relation Determine the number of partitions Compute a partitioning of the pivot Determine the orderings of partitioned plans Example (figure): chain query R1 ⋈ R2 ⋈ R3 ⋈ R4 with pivot R2 split into fragments R21 and R22, each evaluated with its own join order
Query Cost Function One possibility: optimizer’s cost model Accurate cost estimation Solution depends on low-level system details Difficult to gain intuitions Our approach: query cost = number of intermediate results Simple function that admits analysis Sound connections to realistic cost models (Cluet and Moerkotte, ICDT’95) Cost(R3 ⋈ R2 ⋈ R4 ⋈ R1) = |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
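The cost function on this slide can be sketched directly: walk a left-deep plan and add up the size of every intermediate join, leaving out the final result. The toy relations, attribute names, and the helpers `natural_join` and `intermediate_result_cost` below are assumptions for illustration.

```python
def natural_join(left, right):
    """Join two lists of dict rows on every attribute name they share."""
    return [
        {**l, **r}
        for l in left
        for r in right
        if all(l[k] == r[k] for k in l.keys() & r.keys())
    ]

def intermediate_result_cost(plan):
    """Sum of intermediate result sizes of a left-deep plan (final result excluded)."""
    current = plan[0]
    cost = 0
    for relation in plan[1:-1]:
        current = natural_join(current, relation)
        cost += len(current)
    return cost

# Hypothetical chain R1(a, b) - R2(b, c) - R3(c, d) - R4(d, e).
R1 = [{"a": 1, "b": 1}, {"a": 2, "b": 1}]
R2 = [{"b": 1, "c": 1}, {"b": 1, "c": 2}, {"b": 2, "c": 2}]
R3 = [{"c": 1, "d": 1}, {"c": 2, "d": 1}, {"c": 2, "d": 2}]
R4 = [{"d": 1, "e": 1}]

# Two left-deep orders for the same query.
cost_a = intermediate_result_cost([R3, R2, R4, R1])  # |R3 ⋈ R2| + |R3 ⋈ R2 ⋈ R4|
cost_b = intermediate_result_cost([R3, R4, R2, R1])  # |R3 ⋈ R4| + |R3 ⋈ R4 ⋈ R2|
print(cost_a, cost_b)
```

Both orders produce the same final answer, but the second applies the selective join with R4 earlier and so materializes fewer intermediate tuples.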
Partitioning Algorithm - Overview State space: partitioned join orders Partitioning algorithm: Explore a set of states Compute optimal partitioning for each state Return global optimum Our approach: order joins then partition Another possibility: partition then order joins
Distributing Tuples Goal: Distribute tuples to minimize cost Optimal distribution depends on: Frequency matrices of other relations Position (m,l)
Optimal Split Theorem Distribute each value (m,l) independently Place (m,l) in partition that minimizes g(L,T,m,l)
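The theorem suggests a simple procedure: visit each value (m, l) on its own and place it in whichever partition gives the smallest g. The sketch below assumes g is supplied by the caller. The function `toy_g`, the fan-out tables, and the representation of L and T as tuples of relation names are stand-ins invented for this example and are not the definition of g used in the talk.

```python
def optimal_split(values, partitions, g):
    """Assign each (m, l) value to the index of the partition that minimizes g."""
    assignment = {}
    for (m, l) in values:
        costs = [g(L, T, m, l) for (L, T) in partitions]
        assignment[(m, l)] = costs.index(min(costs))  # ties go to the first partition
    return assignment

# Hypothetical fan-outs: how many neighbor tuples match each pivot join value.
fanout_R1 = {10: 5, 20: 1}     # matches in R1 for left join value m
fanout_R3 = {100: 1, 200: 4}   # matches in R3 for right join value l

def toy_g(L, T, m, l):
    """Stand-in cost: fan-out of the first relation joined with value (m, l)."""
    first = (L + T)[0]
    return fanout_R1[m] if first == "R1" else fanout_R3[l]

# Two partitions with opposite (hypothetical) leading/trailing orders.
partitions = [(("R1",), ("R3",)), (("R3",), ("R1",))]
values = [(10, 100), (10, 200), (20, 100), (20, 200)]

assignment = optimal_split(values, partitions, toy_g)
print(assignment)
```

Because each value is handled independently, the split takes one pass over the values and needs no search over combinations of assignments.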
Search Algorithm Exhaustive search is impractical [ Pivot, Leading orders, Trailing orders ] Search heuristics: Tighter search space: [ Pivot, Optimal Leading orders ] Iterative Partitioning Guided search by using lower bounds on cost of partitions
Encoding of State Space State: [ Pivot, Optimal leading orders ] Transition: insert relation in a leading order
Iterative Partitioning Key idea: (Partition, Optimize)+ Compute optimal split for leading/trailing orders Optimize trailing orders for the current split Theorem: query cost can only decrease Idea extended to more detailed cost models Example (figure): pivot R2 split into fragments R21 and R22, each with its own leading and trailing orders over R1, R3, R4, R5
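The (Partition, Optimize)+ loop can be sketched as an alternating minimization. This is a simplified illustration under stated assumptions, not the algorithm from the talk: each partition carries a single order, the cost function `g(order, value)` is a hypothetical stand-in, and the loop stops as soon as the total cost no longer drops.

```python
def split_step(values, orders, g):
    """Partition: send each value to the partition whose current order is cheapest."""
    return {v: min(range(len(orders)), key=lambda k: g(orders[k], v)) for v in values}

def optimize_step(values, assignment, orders, candidates, g):
    """Optimize: pick, per partition, the cheapest order for the values it holds."""
    new_orders = []
    for k, current in enumerate(orders):
        mine = [v for v in values if assignment[v] == k]
        options = list(candidates) + [current]  # keeping the current order prevents a cost increase
        new_orders.append(min(options, key=lambda o: sum(g(o, v) for v in mine)))
    return new_orders

def iterative_partitioning(values, orders, candidates, g, max_rounds=20):
    """Alternate the two steps (max_rounds >= 1) until the cost stops decreasing."""
    cost = float("inf")
    for _ in range(max_rounds):
        assignment = split_step(values, orders, g)
        orders = optimize_step(values, assignment, orders, candidates, g)
        new_cost = sum(g(orders[assignment[v]], v) for v in values)
        if new_cost >= cost:
            break
        cost = new_cost
    return assignment, orders, new_cost

# Hypothetical fan-outs of the pivot's neighbors, keyed by join value.
fanout = {"R1": {10: 5, 20: 1}, "R3": {100: 1, 200: 4}}

def toy_g(order, value):
    """Stand-in cost: fan-out of the first relation joined with the pivot value."""
    m, l = value
    return fanout["R1"][m] if order[0] == "R1" else fanout["R3"][l]

values = [(10, 100), (10, 200), (20, 100), (20, 200)]
candidates = [("R1", "R3"), ("R3", "R1")]
start = [("R1", "R3"), ("R1", "R3")]  # both partitions begin with the same order

assignment, orders, final_cost = iterative_partitioning(values, start, candidates, toy_g)
print(assignment, orders, final_cost)
```

In this sketch neither step can raise the total cost: the split step picks the cheapest partition for every value under the current orders, and the optimize step always keeps the current order among its options.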
Search Algorithm Initial states: single-relation leading orders Search process: Compute partitions with IP Open more states with transition function Transitions are guided by lower bound on cost function Same lower bound can also prune states Stopping criteria: Search space is exhausted Time budget is exhausted
System Integration Monolithic: Parser → Optimizer → Execution Engine Partition-based: Parser → Optimizer → Execution Engine, plus a Partitioner component
Effect of Skew Synthetic Data
Execution Time Synthetic Data (Skew=1.5)
Varying Time Budget Synthetic Data (Skew=1.5)
Results on Real-Life Data SwissProt
Conclusions Monolithic optimization Missed opportunities Selectivity-Based Partitioning Divide & Union approach Multiple join orders per query Join selectivity between relation fragments Partitioning Algorithm Iterative Partitioning Experimental Results Significant reduction of intermediate results
Future Work Extension to multiple pivots Partition-then-order optimization Efficient execution of partitioned plans Off-line workload-aware partitioning
Thank you!
Partitioning Model General case: Multi-relation partitioning Our approach: Single-relation partitioning Example (figure): chain query R1 ⋈ R2 ⋈ R3 ⋈ R4 with R2 split into fragments R21, R22 and R3 split into fragments R31, R32