Download presentation
Presentation is loading. Please wait.
1
MiniCon Reformulation & Adaptive Re-Optimization Zachary G. Ives University of Pennsylvania CIS 650 – Database & Information Systems February 23, 2005
2
2 Administrivia Next reading assignment: Urhan & Franklin – Query Scrambling Ives et al. – Adaptive Data Partitioning Compare the different approaches One-page proposal of your project scope, goals, and means of assessing success/failure due next Monday, Feb. 28 th
3
3 Today’s Trivia Question
4
4 Buckets, Rev. 2: The MiniCon Algorithm A “much smarter” bucket algorithm: In many cases, we don’t need to perform the cross- product of all items in all buckets Eliminates the need for the containment check This – and the Chase & Backchase strategy of Tannen et al – are the two methods most used in virtual data integration today
5
5 Minicon Descriptions (MCDs) Basically, a modification to the bucket approach “head homomorphism” – defines what variables must be equated Variable-substituted version of the subgoals Mapping of variable names Info about what’s covered Property 1: If a variable occurs in the head of a query, then there must be a corresponding variable in the head of the MCD view If a variable participates in a join predicate in the query, then it must be in the head of the view
6
6 MCD Construction For each subgoal of the query For each subgoal of each view Choose the least restrictive head homomorphism to match the subgoal of the query If we can find a way of mapping the variables, then add MCD for each possible “maximal” extension of the mapping that satisfies Property 1
7
7 MCDs for Our Example 5star(i) show(i, t, y, g), rating(i, 5, s) TVguide(t,y,g,r) show(i, t, y, g), rating(i, r, “TVGuide”) movieInfo(i,t,y,g) show(i, t, y, g) critics(i,r,s) rating(i, r, s) goodMovies(t,y) show(i, t, 1997, “drama”), rating(i, 5, s) good98(t,y) show(i, t, 1998, “drama”), rating(i, 5, s) viewh.h.mappinggoals sat. 5star(i)iiiiiiii2 TVguide(t,y,g,r)t t, y y, g gt t, y y, g g, r r1,2 movieInfo(i,t,y,g)i i, t t, y y, g g 1 critics(i,r,s)i i, r r, s s 2 goodMovies(t,y)t t,y y 1,2 good98(t,y)t t,y y 1,2 q(t) :- show(i, t, y, g), rating(i, r, s), r = 5
8
8 Combining MCDs Now look for ways of combining pairwise disjoint subsets of the goals Greatly reduces the number of candidates! Also proven to be correct without the use of a containment check Variations need to be made for: Constants in general (I sneaked those in) “Semi-interval” predicates (x <= c) Note that full-blown inequality predicates are co-NP-hard in the size of the data, so they don’t work
9
9 MiniCon Performance, Many Rewritings
10
10 Larger Query, Fewer Rewritings
11
11 MiniCon and LAV Summary The state-of-the-art for AQUV in the relational world of data integration It’s been extended to support “conjunctive XQuery” as well Scales to large numbers of views, which we need in LAV data integration A similar approach: Chase & Backchase by Tannen et al. Slightly more general in some ways – but: Produces equivalent rewritings, not maximally contained ones Not always polynomial in the size of the data
12
12 Motivations for Adaptive Query Processing Many domains where cost-based query optimization fails: Complex queries in traditional databases: estimation error grows exponentially with # joins [IC91] Querying over the Internet: unpredictable access rates, delays Querying external data sources: limited information available about properties of this source Monitor real-world conditions, adapt processing strategy in response
13
13 Can We Get RDBMS-Level Optimization for Data Integration, without Statistics? Multiple remote sources Described and mapped “loosely” Data changes frequently Generally, would like to support same kinds of queries as in a local setting Data Integration System Mediated Schema Remote, Autonomous Data Sources Schema Mappings Source Catalog Query Results
14
14 What Are the Sources of Inefficiency? Delays – we “stall” in waiting for I/O We’ll talk about this on Monday Bad estimation of intermediate result sizes The focus of the Kabra and DeWitt paper No info about source cardinalities The focus of the eddies paper – and Monday’s paper The latter two are closely related Major challenges: Trading off information acquisition (exploration) vs. use (exploitation) Extrapolating performance based on what you’ve seen so far
15
15 Kabra and DeWitt Goal: “minimal update” to a traditional optimizer in order to compensate for bad decisions General approach: Break the query plan into stages Instrument it Allow for re-invocation of optimizer if it’s going awry
16
16 Elements of Mid-Query Re- Optimization Annotated Query Execution Plans Annotate plan with estimates of size Runtime Collection of Statistics Statistics collectors embedded in execution tree Keep overhead down Dynamic Resource Re-allocation Reallocate memory to individual operations Query Plan Modification May wish to re-optimize the remainder of query
17
17 Annotated Query Plans We save at each point in the tree the expected: Sizes and cardinalities Selectivities of predicates Estimates of number of groups to be aggregated
18
18 Statistics Collectors Add into tree Must be collectable in a single pass Will only help with portions of query “beyond” the current pipeline
19
19 Resource Re-Allocation Based on improved estimates, we can modify the memory allocated to each operation Results: less I/O, better performance Only for operations that have not yet begun executing, i.e., not in the pipeline
20
20 Plan Modification Only re-optimize part not begun Suspend query, save intermediate in temp file Create new plan for remainder, treating temp as an input
21
21 Re-Optimization When to re-optimize: Calculate time current should take (using gathered stats) Only consider re-optimization if: Our original estimate was off by at least some factor 2 and if T opt, estimated < 1 T cur-plan,improved where 1 5% and cost of optimization depends on number of operators, esp. joins Only modify the plan if the new estimate, including the cost of writing the temp file, is better
22
22 Low-Overhead Statistics Want to find “most effective” statistics Don’t want to gather statistics for “simple” queries Want to limit effect of algorithm to maximum overhead ratio, Factors: Probability of inaccuracy Fraction of query affected How do we know this without having stats?
23
23 Inaccuracy Potentials The following heuristics are used: Inaccuracy potential = low, medium, high Lower if we have more information on table value distribution 1+max of inputs for multiple-input selection Always high for user-defined methods Always high for non-equijoins For most other operators, same as worst of inputs
24
24 More Heuristics Check fraction of query affected Check how many other operators use the same statistic The winner: Higher inaccuracy potentials first Then, if a tie, the one affecting the larger portion of the plan
25
25 Implementation On top of Paradise (parallel database that supports ADTs, built on OO framework) Using System-R optimizer New SCIA (Stat Collector Insertion Algorithm) and Dynamic Re-Optimization modules
26
26 It Works! Results are 5% worse for simple queries, much better for complex queries Of course, we would not really collect statistics on simple queries Data skew made a slight difference - both normal and re- optimized queries performed slightly better
27
27 Pros and Cons Provides significant potential for improvement without adding much overhead Biased towards exploitation, with very limited information-gathering A great way to retrofit an existing system In SIGMOD04, IBM had a paper that did this in DB2 But fairly limited to traditional DB context Relies on us knowing the (rough) cardinalities of the sources Query plans aren’t pipelined, meaning: If the pipeline is broken too infrequently, MQRO may not help If the pipeline is broken too frequently, time-to-first-answer is slow
28
28 The Opposite Extreme: Eddies The basic idea: Query processing consists of sending tuples through a series of operators Why not treat it like a routing problem? Rely on “back-pressure” (i.e., queue overflow) to tell us where to send tuples Part of the ongoing Telegraph project at Berkeley Large-scale federated, shared-nothing data stream engine Variations in data transfer rates Little knowledge of data sources
29
29 Telegraph Architecture Simple “pre-optimizer” to generate initial plan Creates operators, e.g.: Select: predicate (sourceA) Join: predicate (sourceA, sourceB) (No support for aggregation, union, etc.) Chooses implementations Select using index, join using hash join Goal: dataflow-driven scheduling of operations Tuple comes into system Adaptively routed through operators in “eddies” May be combined, discarded, etc.
30
30 Can’t Always Re-order Arbitrarily Need “moment of symmetry” Some operators have scheduling dependency e.g. nested loops join: for each tuple in left table for each tuple in right table If tuples meet predicate, output result Index joins, pipelined hash joins always symmetric Sometimes have order restrictions e.g. request tuple from one source, ship to another
31
31 The Eddy Represents set of possible orderings of a subplan Each tuple may “flow” through a different ordering (which may be constrained) N-ary module consisting of query operators Basic unit of adaptivity Subplan with select, project, join operators U
32
32 Example Join Subplan Alternatives R2R1 R2.x = R1.x R3 R1.x = R3.x R2R3 R2.x = R3.x R1 R1.x = R3.x R2R3 R2.x = R3.xR1 R1.x = R3.x R1R2 R2.x = R3.xR3 R1.x = R3.x … Join(R3 R1.x = R2.x, Join R1.x = R3.x (R1, R3))
33
33 Naïve Eddy Given tuple, route to operator that’s ready Analogous to fluid dynamics Adjusts to operator costs Ignores operator selectivity (reduction of input)
34
34 Adapting to Variable-Cost Selection
35
35 But Selectivity is Ignored
36
36 Lottery-Based Eddy Need to favor more selective operators Ticket given per tuple input, returned per output Lottery scheduling based on number of tickets Now handles both selectivity and cost
37
37 Enhancing Adaptivity: Sliding Window Tickets were for entire query Weighted sliding window approach “Escrow” tickets during a window “Banked” tickets from a previous window
38
38 What about Delays? Problems here: Don’t know when join buffers vs. “discards” tuples SR (SLOW) T
39
39 Eddy Pros and Cons Mechanism for adaptively re-routing queries Makes optimizer’s task simpler Can do nearly as well as well-optimized plan in some cases Handles variable costs, variable selectivities But doesn’t really handle joins very well – attempts to address in follow-up work: STeMs – break a join into separate data structures; requires re-computation at each step STAIRs – create intermediate state and shuffle it back and forth
40
40 Other Areas Where Things Can be Improved The work of pre-optimization or re-optimization Choose good initial/next query plan Pick operator implementations, access methods STeMs fix this Handle arbitrary operators Aggregation, outer join, sorting, … Next time we’ll see techniques to address this Distribute work Distributed eddies (by DeWitt and students)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.