Published by Lionel Sharp. Modified over 9 years ago.
1
An Adaptive Query Execution Engine for Data Integration Zachary Ives, Daniela Florescu, Marc Friedman, Alon Levy, Daniel S. Weld University of Washington Slides by Peng Li, Modified by Rachel Pottinger Modified by Ben Zhu, Yidan Liu Presenter: Ben Zhu Discussion Leader: Yidan Liu
2
Outline Motivations for Tukwila Tukwila Architecture Interleaving of planning and execution Adaptive Query Operators Dynamic Collector Double Pipelined Join Summary
3
Why data integration? The goal is to provide a uniform query interface to a multitude of data sources. The key advantage of data integration is that it frees users from having to: locate the sources relevant to their query; interact with each source independently; manually combine the data from the different sources.
4
The main challenges in the design of data integration systems (DISs): query reformulation; the construction of wrapper programs; query optimizers and efficient query execution engines.
5
The need for adaptivity Absence of statistics Unpredictable data arrival characteristics Overlap and redundancy among sources Optimizing the time to the initial answers to the query Network bandwidth generally constrains the data sources to be “small”
6
Tukwila Architecture
7
Interleaving of planning and execution Novel features of Tukwila: The optimizer can create a partial plan if essential statistics are missing or uncertain. The optimizer generates both annotated operator trees and appropriate event-condition-action rules. The optimizer conserves the state of its search space when it calls the execution engine, so it is able to resume optimization in an incremental fashion.
8
Interleaving of planning and execution – Query plans Operators in Tukwila are organized into pipelined units called fragments. A plan includes a partially-ordered set of fragments and a set of global rules. A fragment consists of a fully pipelined tree of physical operators and a set of local rules. At the end of a fragment, results are materialized, and the rest of the plan can be re-optimized or rescheduled.
9
Interleaving of planning and execution - Rules Rules are the key mechanism for implementing several kinds of adaptive behavior in Tukwila. Re-optimization: if the optimizer’s cardinality estimate for the fragment’s result is significantly different from the actual size, re-invoke the optimizer. Contingent planning: check properties of the result to select the next fragment. Adaptive operators: set the policy for memory overflow resolution and collectors. Rescheduling: reschedule if a source times out.
10
Interleaving of planning and execution - Rule format General form: when event if condition then actions. Example: when closed(frag1) if card(join1) > 2*est_card(join1) then replan. An event triggers a rule, causing it to check its condition. If the condition is true, the rule fires, executing the actions.
11
Discussion 1 Pick one of the following motivating situations for Tukwila: absence of statistics; unpredictable data arrival characteristics; overlap and redundancy among sources; optimizing the time to initial answers. Q1: Can you give some examples where the chosen topic matters? Q2: If you were a member of the Tukwila team, what rules or policies would you use to deal with the problem? Discussion: form 4 groups (3 to 4 people per group, one group per topic) and discuss Q1 and Q2 for one topic (5 to 7 minutes).
12
Interleaving of planning and execution – Query Execution Executes a query plan (basic function). Gathers statistics about each operation. Handles exception conditions or re-invokes the optimizer.
13
Interleaving of planning and execution – Event Handling Interprets rules attached to the query execution plan. The execution system may generate events at any time, which are fed into an event queue. For each event, the event handler uses a hash table to find all matching rules in the active set. For each active rule, it evaluates the conditions and, if they are satisfied, all of the rule’s actions are executed.
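The event handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Tukwila's implementation: the class names `Rule` and `EventHandler`, the use of plain strings as event names, and the statistics dictionary passed with each event are all hypothetical. The example rule mirrors the one on the rule format slide.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical 'when event if condition then actions' rule."""
    event: str
    condition: Callable[[dict], bool]
    actions: list


class EventHandler:
    """Hypothetical handler: an event queue plus a hash table of active rules."""

    def __init__(self):
        self.active = defaultdict(list)  # hash table: event name -> rules
        self.queue = deque()             # events raised by the execution system

    def add_rule(self, rule):
        self.active[rule.event].append(rule)

    def raise_event(self, event, stats):
        self.queue.append((event, stats))

    def run(self):
        # Drain the queue; for each event, look up matching rules,
        # check each condition, and execute all actions of rules that fire.
        while self.queue:
            event, stats = self.queue.popleft()
            for rule in self.active.get(event, []):
                if rule.condition(stats):
                    for action in rule.actions:
                        action(stats)


log = []
handler = EventHandler()
handler.add_rule(Rule(
    event="closed(frag1)",
    condition=lambda s: s["card(join1)"] > 2 * s["est_card(join1)"],
    actions=[lambda s: log.append("replan")],
))
# First event: actual cardinality is 5x the estimate, so the rule fires.
handler.raise_event("closed(frag1)", {"card(join1)": 500, "est_card(join1)": 100})
# Second event: estimate was close enough, so nothing happens.
handler.raise_event("closed(frag1)", {"card(join1)": 150, "est_card(join1)": 100})
handler.run()
```

Keying the rule table by event name keeps the lookup per event cheap, which is the point of the hash table mentioned on the slide.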
14
Adaptive Query Operators - Dynamic Collectors Performs a union over a large number of overlapping sources Standard union operators can’t handle errors or decide to ignore slow mirror data sources Tukwila treats the task as a primitive operator so that it can provide guidance about the order in which data sources should be accessed A key distinguishing aspect: the flexibility to contact only some of the sources
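A minimal sketch of the collector idea, assuming a simple fallback policy: sources are grouped into mirror sets, mirrors are tried in order, and the remaining mirrors of a group are skipped once one answers. The function `collect`, the exception `SourceError`, and the callable-per-source interface are hypothetical names chosen for illustration; Tukwila's actual collector is driven by its rule-based policies.

```python
class SourceError(Exception):
    """Raised by a (hypothetical) wrapper when a source fails or times out."""


def collect(groups):
    """Union over groups of mirrored sources.

    groups: list of lists; each inner list holds zero-argument callables
    that return the tuples of one mirror, in order of preference.
    """
    seen = set()
    results = []
    for mirrors in groups:
        for fetch in mirrors:
            try:
                tuples = list(fetch())
            except SourceError:
                continue          # this mirror failed: fall back to the next one
            for t in tuples:
                if t not in seen:  # sources overlap, so drop duplicates
                    seen.add(t)
                    results.append(t)
            break                 # one mirror answered: do not contact the rest
    return results


contacted = []

def slow_primary():
    contacted.append("primary")
    raise SourceError("timeout")

def mirror():
    contacted.append("mirror")
    return [("a", 1), ("b", 2)]

def backup_mirror():
    contacted.append("backup")
    return [("a", 1), ("b", 2)]

def other_source():
    contacted.append("other")
    return [("b", 2), ("c", 3)]

answer = collect([[slow_primary, mirror, backup_mirror], [other_source]])
```

In this run the backup mirror is never contacted, which illustrates the flexibility to contact only some of the sources.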
15
Adaptive Query Operators Double Pipelined Join Issues with conventional joins: Sort-merge joins and indexed joins cannot be pipelined, because they require an initial sorting or indexing step. Nested loops joins and hash joins follow an asymmetric execution model. For nested loops joins, we must wait for the entire inner table to be transmitted before pipelining begins. For hash joins, we must load the entire inner relation into a hash table before we can pipeline.
16
Adaptive Query Operators Double Pipelined Join Challenges in the context of data integration Optimizer may not know the relative size of each relation The time to first tuple is important, so it may be better to use the larger data source as the inner relation if it sends data faster The time to first tuple is extended by the hash join’s non-pipelined behavior when it is reading the inner relation
17
Adaptive Query Operators Double Pipelined Hash Join Symmetric and incremental Produces tuples almost immediately and masks slow data source transmission rates Data-driven in behavior: each join relation sends tuples through the join operator as quickly as possible. At any time, all of the data encountered so far has been joined and the resulting tuples have already been output Trade-off is that it MUST hold hash tables for BOTH relations in memory
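The symmetric behavior can be illustrated with a small single-threaded sketch. It assumes both inputs fit in memory and that arrivals are presented as one interleaved stream of `(side, tuple)` pairs; the function name `double_pipelined_join` and that stream encoding are illustrative choices, not part of Tukwila.

```python
from collections import defaultdict


def double_pipelined_join(arrivals, key=lambda t: t[0]):
    """Yield (left, right) pairs as soon as both matching tuples have arrived.

    arrivals: iterable of (side, tuple) with side "L" or "R", in arrival order.
    """
    tables = {"L": defaultdict(list), "R": defaultdict(list)}
    for side, tup in arrivals:
        other = "R" if side == "L" else "L"
        k = key(tup)
        # Insert into this side's hash table first...
        tables[side][k].append(tup)
        # ...then probe the other side's table and emit matches immediately.
        for match in tables[other].get(k, []):
            yield (tup, match) if side == "L" else (match, tup)


arrivals = [
    ("L", (1, "a")),
    ("R", (1, "x")),   # matches (1, "a") right away
    ("R", (2, "y")),
    ("L", (2, "b")),   # matches (2, "y")
    ("L", (1, "c")),   # matches (1, "x")
]
joined = list(double_pipelined_join(arrivals))
```

Because every arriving tuple is both inserted and probed, all data seen so far has been joined at any point in time, at the cost of keeping both hash tables in memory.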
18
Adaptive Query Operators Double Pipelined Hash Join Two problems occur. First, it follows a data-driven, bottom-up execution model, while Tukwila is top-down and iterator-based. Multithreading addresses this: the join consists of separate threads for output, left child and right child, and as each child reads tuples, it places them into a small tuple transfer queue. Second, it requires enough memory to hold both join relations.
19
Adaptive Query Operators Double Pipelined Hash Join Handling memory overflow: take some portion of the hash table and swap it to disk. Incremental left flush: when the system runs out of memory, switch to a strategy of reading only tuples from the right-side relation and, as necessary, flush a bucket from the left-side relation’s hash table; this gradually degrades into hybrid hash, flushing buckets lazily. Incremental symmetric flush: pick a bucket to flush to disk and flush that bucket from both sources. Incremental left flush will perform fewer disk I/Os. Incremental symmetric flush may have reduced latency since both relations continue to be processed in parallel.
20
Discussion 2 When this paper was written, perhaps it was okay to claim that “the sizes of most data integration queries are expected to be only moderately large”. But does this hold today, especially with the coming era of “cloud computing”? Specifically: 1. Do double pipelined hash joins seem efficient enough for today’s data? 2. Would you use double pipelined hash joins in non-data integration applications?
21
Summary Challenges for data integration and motivations for Tukwila General Tukwila architecture Interleaving of planning and execution Dynamic collector operator Double pipelined hash join
22
Eddies: Continuously Adaptive Query Processing Ron Avnur, Joseph M. Hellerstein University of California, Berkeley Previously presented by Hongrae Lee Modified by Ben Zhu, Yidan Liu Presenter: Ben Zhu Discussion Leader: Yidan Liu
23
Outline Introduction Reorderability of plans Rivers and Eddies Routing tuples in Eddies Summary
24
Static Query Processing Traditional query processing scheme: 1. Optimizing a query 2. Executing a static query plan. This traditional scheme is not appropriate for large-scale, widely-distributed information resources or massively parallel database systems!
25
New Requirements Increased complexity in large-scale system Hardware and workload Data User interface We want query execution plans To be reoptimized regularly during query processing To allow system to adapt dynamically to fluctuations in computing resources, data characteristics, and user preferences
26
Run-Time Fluctuations Three properties vary during query processing: the costs of operators; their selectivities; the rates at which tuples arrive from the inputs. The first and third issues commonly occur in wide area environments; the second commonly arises due to correlations between predicates and the order of tuple delivery. These issues may become more common in cluster (shared-nothing) systems.
27
Discussion 1 "we favor adaptability over best-case performance" 1. Does this seem reasonable? In this case? In general? 2. Is adaptivity needed only when best-case performance cannot be achieved, or could it also be a general strategy in regular query processing? Do you think it is good or bad to apply it in traditional query processing? Please give reasons or examples to support your opinions.
28
Eddy
29
Two Challenges for This Scheme How can we reorder operators? Reorderability of plans How should we route tuples? Routing tuples in Eddies
30
Reorderability of Plans Synchronization Barriers One task waits for other tasks to be finished. Barriers limit concurrency, and hence performance. It is desirable to minimize the overhead of synchronization barriers in a dynamic performance environment. Two issues affect the overhead: the frequency of barriers, and the gap between arrival times of the two inputs at the barrier.
31
Reorderability of Plans Moments of Symmetry A moment of symmetry is a barrier at which the order of the inputs to a join can be changed without modifying any state in the join. Moments of symmetry allow reordering of the inputs to a single binary operator. The combination of commutativity and moments of symmetry allows for very aggressive reordering of a plan tree.
33
Join Algorithms and Reordering Constraints on reordering Unindexed join input is ordered before the indexed input Preserving the ordered inputs Some join algorithms work only for equijoins Join algorithms in Eddy We favor join algorithms with Frequent moments of symmetry Adaptive or nonexistent barriers Minimal ordering constraints Rules out hybrid hash join, merge joins, and nested loops joins – Choice: Ripple Join Frequently-symmetric versions of traditional iteration, hashing and indexing schemes Favors adaptivity over best-case performance
34
Ripple Joins Ripple joins have moments of symmetry at each “corner” of a rectangular ripple, are designed to allow changing rates for each input, and offer attractive adaptivity features at a modest overhead in performance and memory footprint. (Figure: block, index, and hash ripple joins.)
35
Rivers and Eddies River A shared-nothing parallel query processing framework Pre-optimization A heuristic pre-optimizer must choose how to initially pair off relations into joins An eddy in the River Is implemented via a module in a river Encapsulates the scheduling of its participating operators Explicitly merges multiple unary and binary operators into a single n-ary operator within a query plan A tuple is associated with a tuple descriptor containing a vector of Ready and Done bits
36
Routing Tuples in Eddies An eddy module Directs the flow of tuples from the inputs through the various operators to the output Provides the flexibility to allow each tuple to be routed individually through the operators The routing policy determines the efficiency
37
Naïve eddy Tuples enter the eddy with low priority and, when returned to the eddy from an operator, are given high priority. Tuples flow completely through the eddy before new tuples enter, which prevents the eddy from being ‘clogged’ with new tuples. Edges in a River dataflow graph (DFG) are fixed-size queues, which produce back-pressure as in a fluid flow: production along the input to any edge is limited by the rate of consumption at the output. As a result, tuples are routed to the low-cost operator first. The policy is cost-aware but selectivity-unaware.
38
Learning Selectivity: Lottery Scheduling The goal is to track both consumption (determined by cost) and production (determined by cost and selectivity). Lottery scheduling: maintain ‘tickets’ for each operator. An operator’s chance of receiving the tuple is proportional to its count of tickets. The eddy can track (learn) an ordering of the operators that gives good overall efficiency. (Figure: credit and debit of tickets between the eddy and an operator.)
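A small sketch of lottery-based routing, under simplifying assumptions: every operator is a selection predicate, tuples pass through the eddy one at a time, and tickets follow a credit/debit convention in which an operator gains a ticket when the eddy hands it a tuple and loses one when it returns the tuple. The function `run_eddy`, the operator names, and the starting balance of one ticket are hypothetical choices for illustration.

```python
import random


def run_eddy(tuples, operators, seed=0):
    """Route each tuple through all operators, choosing the next one by lottery.

    operators: dict mapping a name to a predicate; a tuple is dropped as soon
    as any predicate rejects it.
    """
    rng = random.Random(seed)
    tickets = {name: 1 for name in operators}  # assumed starting balance
    output = []
    for tup in tuples:
        done = set()   # plays the role of the tuple's Done bits
        alive = True
        while alive and len(done) < len(operators):
            ready = [n for n in operators if n not in done]
            # Lottery: chance of receiving the tuple is proportional to tickets.
            name = rng.choices(ready, weights=[tickets[n] for n in ready])[0]
            tickets[name] += 1          # credit: operator consumed a tuple
            if operators[name](tup):
                tickets[name] -= 1      # debit: operator returned the tuple
                done.add(name)
            else:
                alive = False           # tuple dropped, the credit is kept
        if alive:
            output.append(tup)
    return output, tickets


operators = {
    "pass_all": lambda x: True,         # never drops a tuple
    "even_only": lambda x: x % 2 == 0,  # drops every odd tuple
}
result, tickets = run_eddy(range(10), operators)
```

Operators that drop many tuples keep their credits and so win more lotteries, which tends to route tuples to selective operators first. The output does not depend on the routing order, only the amount of work does.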
39
Responding to Dynamic Fluctuations With tickets accumulated over the whole run (a point scheme), eddies cannot react adaptively over time to changes in performance and data characteristics. Use a window scheme instead of a point scheme. Banked tickets are used for running a lottery. Escrow tickets are used for measuring efficiency during the window. At the beginning of the window, the value of the escrow account replaces the banked account, and the escrow account is reset. This ensures that operators “re-prove themselves” each window.
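The banked and escrow accounts can be sketched as follows. The class name `WindowedTickets`, the window length measured in routed tuples, and the `+1` added to lottery weights (so that an operator with zero banked tickets can still be chosen) are assumptions made for this illustration.

```python
class WindowedTickets:
    """Hypothetical two-account ticket store with a fixed-size window."""

    def __init__(self, operators, window):
        self.banked = {name: 0 for name in operators}  # used to run lotteries
        self.escrow = {name: 0 for name in operators}  # measured this window
        self.window = window
        self.routed = 0

    def credit(self, name):
        self.escrow[name] += 1

    def debit(self, name):
        self.escrow[name] -= 1

    def tick(self):
        """Call once per routed tuple; rolls the window over when it is full."""
        self.routed += 1
        if self.routed % self.window == 0:
            # Escrow replaces banked, then escrow is reset.
            self.banked = dict(self.escrow)
            self.escrow = {name: 0 for name in self.banked}

    def weights(self):
        # Assumption: +1 keeps every operator eligible in the lottery.
        return {name: max(t, 0) + 1 for name, t in self.banked.items()}


wt = WindowedTickets(["a", "b"], window=3)
for _ in range(3):          # window 1: operator "a" earns all the tickets
    wt.credit("a")
    wt.tick()
after_first = dict(wt.banked)
for _ in range(3):          # window 2: behavior changes, "b" earns them
    wt.credit("b")
    wt.tick()
after_second = dict(wt.banked)
```

After the second window, operator "a" has lost its earlier advantage, which is the sense in which operators must re-prove themselves each window.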
40
Some Experimental Results
41
Discussion 2 If you were to design an adaptive query processor, what could be the possible tradeoffs you need to balance? Which would you rather use: Tukwila or Eddies? Why?
42
Summary Eddies are a query processing mechanism that allows fine-grained, adaptive, online optimization, and are beneficial in unpredictable query processing environments. Challenges: to develop eddy ‘ticket’ policies that can be formally proved to converge quickly; to attack the remaining static aspects; to harness the parallelism and adaptivity available to us in rivers; to explore the application of eddies and rivers to the generic space of dataflow programming.
43
Thank you