Cost-based Query Scrambling for Initial Delays
Tolga Urhan, Michael J. Franklin, Laurent Amsaleg
Introduction:
• Problem: Remote data access (e.g., from the Web) during query processing introduces unpredictable delays.
• Traditional query scrambling is based on simple heuristics and is susceptible to problems from bad scrambling decisions.
• Three different approaches to using query optimization for scrambling are proposed.
• As a result, more intelligent scrambling decisions are made.
Query Scrambling Overview:
• Goal: hide the delays encountered when obtaining data from remote sources by performing other useful work.
• Consists of two phases:
  – Rescheduling phase
  – Operator Synthesis phase
• The objective function of the optimization is based on either
  – total work, or
  – response time
Goals of this paper:
• Focus on the problem of initial delay:
  – Delay in receiving the first tuple from a particular remote source.
  – Due to difficulty in establishing a connection to the remote source, heavy load at the remote source, or a large amount of work that the remote source needs to perform (lack of global optimization).
• Investigate the use of both response time-based and total work-based optimization for query scrambling.
Cost-based Query Scrambling:
• Assumptions:
  – The query execution environment consists of a query site and remote data sources.
  – Processing work occurs at both the query site and the remote sites.
  – Example of a complex query tree:
[Figure: query tree over relations A, B, C, D, and E; labels: Query Result, Communication Link, Select, Join]
Iterative Process of Query Scrambling
• (1) Rescheduling: the execution plan of a query is dynamically rescheduled when a delay is detected.
• (2) Operator Synthesis: new operators can be created when there are no other operators that can execute.
• E.g., stalls in getting A tuples:
  – (1) retrieve B tuples, execute D join E
  – (2) execute a new join between B and (D join E) join C
Cost-based Rescheduling:
• Identify runnable subtrees: subtrees made up entirely of nonblocked operators.
• A runnable subtree can be scheduled out of order by inserting a materialization operator at its root.
• Materialization operators issue Open, Next, and Close calls to the root of the subtree and save the results in a temporary relation.
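The materialization operator described above can be sketched in a few lines. This is only an illustration under stated assumptions: the class names Scan and Materialize, the use of None as the end-of-stream signal, and the in-memory list standing in for an on-disk temporary relation are all hypothetical and do not come from the paper.

```python
class Scan:
    """Hypothetical leaf operator over an in-memory list of tuples."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.pos = None

    def open(self):
        self.pos = 0

    def next(self):
        # Return the next tuple, or None when the input is exhausted.
        if self.pos >= len(self.rows):
            return None
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Materialize:
    """Drains a runnable subtree and saves its output as a temporary relation."""

    def __init__(self, child):
        self.child = child

    def run(self):
        # Issue Open, Next, and Close calls to the root of the subtree.
        self.child.open()
        temp = []  # assumption: a list stands in for a temporary relation on disk
        while True:
            row = self.child.next()
            if row is None:
                break
            temp.append(row)
        self.child.close()
        # The temporary relation can later be scanned in place of the subtree.
        return Scan(temp)
```

Once the delayed data arrives, the rest of the plan would read from the returned scan instead of re-executing the subtree.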
Cost-based Rescheduling:
• Selection of runnable subtrees to execute:
  – Traditional way: the “maximal” one.
  – Maximal efficiency: efficiency = (P - MR)/(P + MW)
    P: cost of executing the subtree
    MW: cost of writing the materialized temporary result to disk
    MR: cost of reading the temporary result from disk
    efficiency: improvement in response time per unit of scrambling execution
  – Rescheduling iterations repeat until the delayed data has arrived.
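As a worked illustration of the efficiency metric, the sketch below scores candidate subtrees with (P - MR)/(P + MW) and picks the highest-scoring one. The Subtree class, the function names, and the cost numbers are made up for the example. Skipping subtrees with non-positive efficiency is an added assumption, not something the slides state.

```python
from dataclasses import dataclass


@dataclass
class Subtree:
    name: str
    p: float   # P: cost of executing the subtree
    mw: float  # MW: cost of writing the materialized result to disk
    mr: float  # MR: cost of reading the temporary result from disk


def efficiency(s):
    # Improvement in response time (P - MR) per unit of scrambling work (P + MW).
    return (s.p - s.mr) / (s.p + s.mw)


def pick_subtree(candidates):
    """Return the runnable subtree with maximal efficiency, or None."""
    best = max(candidates, key=efficiency, default=None)
    # Assumption: a subtree whose result costs at least as much to read
    # back as to recompute is not worth materializing.
    if best is None or efficiency(best) <= 0:
        return None
    return best


candidates = [
    Subtree("D join E", p=10.0, mw=2.0, mr=2.0),  # (10 - 2) / (10 + 2)
    Subtree("select B", p=4.0, mw=3.0, mr=3.0),   # (4 - 3) / (4 + 3)
]
chosen = pick_subtree(candidates)
```

With these numbers, "D join E" scores 8/12 and "select B" scores 1/7, so the join is scheduled first.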
Cost-based Operator Synthesis:
• The second phase starts when no more progress can be made in phase 1.
• Three optimization strategies:
  – Pair
  – IN (Include Delayed)
  – ED (Estimated Delay)
Pair: total work-based optimizer
• Constructs a query plan containing only a single join of two unblocked relations.
• Analyzes each pair of unblocked relations that share a join predicate.
• Chooses the join with the least total cost to execute.
• Materializes the result of the join to disk.
• Avoids Cartesian products and joins whose results take longer to read from disk than to compute from scratch.
Pair continued:
• At the end of each join, checks whether the delayed data has arrived. If it has not, performs another iteration.
• If no qualified joins exist, waits for the delayed data to arrive.
• Reconstruction phase:
  – When all blocked relations become available, a single query tree must be constructed.
  – This is necessary because the Pair policy works only on pairs of relations and does not maintain a complete query plan.
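A single Pair iteration can be sketched as a filter followed by a minimum-cost choice. The JoinCandidate class, its field names, and the costs are hypothetical. Using the join's compute cost as its "total cost" is a simplifying assumption for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JoinCandidate:
    left: str
    right: str
    has_predicate: bool  # False would make this a Cartesian product
    compute_cost: float  # assumed cost of executing the join
    read_cost: float     # assumed cost of reading the materialized result


def choose_pair_join(candidates, unblocked):
    """Pick the cheapest qualified join between two unblocked relations."""
    qualified = [
        c for c in candidates
        if c.left in unblocked and c.right in unblocked  # both inputs available
        and c.has_predicate                              # no Cartesian products
        and c.read_cost <= c.compute_cost                # reading must not cost more than computing
    ]
    # None means no qualified join exists: wait for the delayed data.
    return min(qualified, key=lambda c: c.compute_cost, default=None)


unblocked = {"B", "C", "D", "E"}  # A is the delayed relation
candidates = [
    JoinCandidate("A", "B", True, 1.0, 0.5),   # rejected: A is blocked
    JoinCandidate("B", "C", True, 5.0, 2.0),
    JoinCandidate("D", "E", True, 3.0, 1.0),
    JoinCandidate("B", "D", False, 1.0, 0.5),  # rejected: Cartesian product
    JoinCandidate("C", "E", True, 2.0, 4.0),   # rejected: slower to read than to compute
]
best = choose_pair_join(candidates, unblocked)
```

Here the D-E join is chosen. In a fuller version its materialized result would be added to the set of unblocked relations before the next iteration.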
IN (Include Delayed): response time-based optimizer
• Each iteration generates a complete alternative plan.
• Chooses a very long delay duration (relative to response time) to postpone any access to the delayed data.
• Chooses the plan with the greatest benefit (potential improvement in response time) whose risk (duration of the optimization step) can be overlapped with the expected delay duration.
IN continued:
• Uses a risk/benefit knob (Rbknob) to prevent the optimizer from choosing high-risk plans for relatively small potential gains over low-risk plans.
• Rbknob: the ratio of the amount of benefit the optimizer is willing to give up for a given savings in risk.
• Increasing Rbknob leads to more conservative plans.
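The slides do not give the formula by which Rbknob enters plan selection. One reading consistent with the definition above is a linear trade-off: the optimizer gives up a benefit difference of at most Rbknob units per unit of risk saved. The sketch below implements that assumed scoring rule. The Plan class and the numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    benefit: float  # potential improvement in response time
    risk: float     # time that must be overlapped with the delay


def choose_plan(plans, rbknob):
    # Assumed scoring rule: maximize benefit - rbknob * risk. Under it, a
    # lower-risk plan wins whenever the benefit it gives up is smaller than
    # rbknob times the risk it saves.
    return max(plans, key=lambda p: p.benefit - rbknob * p.risk)


plans = [
    Plan("aggressive", benefit=10.0, risk=8.0),
    Plan("conservative", benefit=6.0, risk=2.0),
]
low_knob = choose_plan(plans, rbknob=0.5)   # scores: 6.0 vs 5.0
high_knob = choose_plan(plans, rbknob=1.0)  # scores: 2.0 vs 4.0
```

Raising the knob from 0.5 to 1.0 flips the choice from the aggressive plan to the conservative one, matching the slide's statement that a larger Rbknob yields more conservative plans.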
ED (Estimated Delay): response time-based optimizer
• Similar to IN, except that it uses relatively short delay estimates (relative to the response time of the non-delayed plan).
• Delay estimates are successively increased when necessary to make more progress.
• Motivation: use low-risk plans when delays are short, and high-risk/high-payoff plans for longer delays.
ED continued:
• Execution steps:
  – Start by picking an estimated delay value equal to 25% of the original query response time.
  – Repeat iterations until progress is too small.
  – Increase the delay value to 50% of the response time.
  – Increase it to 100% of the response time if progress is still too small.
• For short delays: scrambling more useful
• For long delays: more aggressive.
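The escalation schedule in the execution steps above can be sketched as a nested loop. The callbacks scramble_step and data_arrived, the min_progress threshold, and the returned log are hypothetical stand-ins. The slides do not say how progress is measured.

```python
ED_FRACTIONS = (0.25, 0.50, 1.00)  # delay estimates as fractions of response time


def run_ed(response_time, scramble_step, data_arrived, min_progress):
    """Run scrambling iterations, escalating the delay estimate as needed.

    scramble_step(estimate) performs one iteration and returns its progress.
    data_arrived() reports whether the delayed data is now available.
    Returns a log of (estimate, progress) pairs.
    """
    log = []
    for fraction in ED_FRACTIONS:
        estimate = fraction * response_time
        while not data_arrived():
            progress = scramble_step(estimate)
            log.append((estimate, progress))
            if progress < min_progress:
                break  # progress too small: move to the next, larger estimate
        if data_arrived():
            break  # stop scrambling once the delayed data shows up
    return log


# Simulated run: the data never arrives and progress shrinks at each level.
progress_values = iter([5.0, 0.5, 4.0, 0.2, 3.0, 0.1])
log = run_ed(
    response_time=100.0,
    scramble_step=lambda estimate: next(progress_values),
    data_arrived=lambda: False,
    min_progress=1.0,
)
```

In this simulated run the estimate steps through 25, 50, and 100 time units, with two iterations at each level before progress falls below the threshold.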
Experimental Setup:
• Two-phase randomized query optimizer
• Workload based on queries from the TPC-D benchmark
• Single query site, six remote data source sites
• Experimental methodology: plot the duration of the initial delay of a remote source against the response time achieved using scrambling
One case study:
Lessons learned:
• 1. With sufficient memory, all cost-based approaches (Pair, IN, ED) can effectively hide initial delays.
  – When delayed relations are encountered early in query execution, a delay as long as the normal response time can be absorbed.
  – The heuristic-based algorithm performs worse than the original query without scrambling unless there is a substantial amount of extra memory for scrambling.
Lessons learned:
• 2. Cost-based scrambling: there is a tradeoff between conservative and aggressive approaches.
  – Conservative: safer for short delays.
  – Aggressive: bigger savings for long delays.
  – The amount of delay that can be hidden is limited by the normal response time of the query.
  – As the delay increases beyond the normal response time, the benefit of scrambling as a percentage of total execution time decreases.
  – So, maybe more conservative plans?
Lessons learned:
• 3. As the memory available for scrambling is reduced, scrambling plans become more expensive.
  – A longer delay duration is needed for scrambling to pay off.
  – Good predictions of delay duration are needed to make good scrambling decisions.
Lessons learned:
• 4. The aggressiveness of the IN and ED policies can be adjusted using Rbknob.
  – Give up potential gains for long delays in order to reduce risks for short delays.
  – This tradeoff is useful in the absence of accurate predictions of delay duration.
Lessons learned:
• 5. Pair (the total work-based optimizer) may perform unnecessary work.
  – It lacks a global view of the scrambled plan.
  – It is unable to pick a slightly sub-optimal plan in order to generate interesting orders.
  – Therefore, response time should be used as the objective function to generate a complete and reasonable scrambled plan.
Discussion:
• How to tune Rbknob?
• Query scrambling can be very dangerous when the delay duration is short. How to reduce the risks?
• Cross products might be okay sometimes?
• How realistic are the scenarios studied?
• Query scrambling treats remote sites as black boxes. Additional processing at data source sites?
• Nonblocking join algorithm instead of hash join?
Discussion continued:
• Better delay duration estimates? (using probes)
• Quality of decisions is limited by that of the optimizer.
• Replicas complicate problems?
• Query scrambling decisions should take selectivity, size of intermediate results, and importance of operators into consideration.
• The paper addresses only the problem of initial delay and ignores the bandwidth problem.
Discussion continued:
• Checking for arrival of delayed data during a scrambling step?
Related Work:
• Approaches that do not adapt to changes in the system parameters during query execution:
• Volcano optimizer:
  – Introduces choose-plan operators into a query plan to compensate for the lack of information about system parameters at compile time.
• Y. Ioannidis et al.:
  – Generate multiple alternative plans and choose among them when the query is initialized.
Related Work:
• Rdb/VMS
  – Starts multiple different executions of the same logical operator, chooses the best one, and terminates the others.
• MIND heterogeneous database project
  – Divides a query into subqueries, sends each subquery to a site for execution, and composes the results incrementally.
  – Resembles Pair.
• Reordering left-deep join trees into bushy join trees
• Mermaid, InterViso
Conclusion:
• Three different approaches (Pair, IN, ED) are investigated for making intelligent query scrambling decisions.
• In general, the use of a response time optimizer has the advantage of being able to construct complete query execution plans that include access to delayed data.
• There are fundamental trade-offs between risk aversion and the ability to hide large delays in ED and IN.