
1 Adaptive Query Processing in Data Stream Systems. Paper by Shivnath Babu, Kamesh Munagala, Rajeev Motwani, and Jennifer Widom (Stanford Stream Data Manager, Stanford University), and Itaru Nishizawa (Hitachi, Ltd.)

2 Data Streams
Continuous, unbounded, rapid, time-varying streams of data elements. They occur in a variety of modern applications:
- Network monitoring and intrusion detection
- Sensor networks
- Telecom call records
- Financial applications
- Web logs and click-streams
- Manufacturing processes

3 Example Continuous Queries
- Web: Amazon's best sellers over the last hour.
- Network intrusion detection: track HTTP packets with destination address matching a prefix in a given table and content matching "*\.ida".
- Finance: monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes.

4 Traditional Query Optimization
- Optimizer: finds the "best" query plan to process a given query.
- Statistics Manager: periodically collects statistics (e.g., table sizes, histograms); the optimizer tells it which statistics are required, and it returns estimated statistics.
- Executor: runs the chosen plan to completion.

5 Optimizing Continuous Queries is Different
Continuous queries are long-running, and stream characteristics can change over time:
- Data properties: selectivities, correlations
- Arrival properties: bursts, delays
System conditions can also change over time.
⇒ The performance of a fixed plan can change significantly over time.
⇒ Adaptive processing: find the best plan for current conditions.

6 Traditional Optimization → Adaptive Optimization
Traditional: the Optimizer finds the "best" query plan using estimated statistics from the Statistics Manager (which periodically collects statistics, e.g., table sizes and histograms, as required); the Executor runs the chosen plan to completion.
Adaptive: a Profiler monitors current stream and system characteristics; a Reoptimizer ensures that the plan is efficient for current characteristics and issues decisions to adapt; the Executor executes the current plan. The Profiler and Reoptimizer are combined in part for efficiency.

7 Preliminaries
Let query Q process input stream I, applying the conjunction of n commutative filters F_1, F_2, …, F_n. Each filter F_i takes a stream tuple e as input and returns either true or false. If F_i returns false for tuple e, we say that F_i drops e. A tuple is emitted in the continuous query result if and only if all n filters return true. A plan for executing Q consists of an ordering P = F_f(1), F_f(2), …, F_f(n), where f is the mapping from positions in the filter ordering to the indexes of the filters at those positions. When a tuple e is processed by P, F_f(1) is evaluated first. If it returns false (e is dropped by F_f(1)), then e is not processed further. Otherwise, F_f(2) is evaluated on e, and so on.
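The short-circuit evaluation just described can be sketched as follows (the predicates and names are illustrative stand-ins, not from the paper):

```python
def process_tuple(e, ordering):
    """Evaluate filters in order; stop at the first filter that drops e.

    ordering: a list of predicates standing in for F_f(1), ..., F_f(n).
    Returns True iff e survives every filter (i.e., e is emitted)."""
    for f in ordering:
        if not f(e):
            return False  # e is dropped; later filters are never probed
    return True

# Hypothetical filters, for demonstration only
is_even = lambda e: e % 2 == 0
small = lambda e: e < 10
```

For example, process_tuple(3, [is_even, small]) stops at the first filter, so the second predicate is never evaluated.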

8 Preliminaries – cont'd
At any time, the cost of an ordering O is the expected time to process an incoming tuple of I to completion (either emitted or dropped) using O. Consider O = F_f(1), F_f(2), …, F_f(n). d(i|j) is the conditional probability that F_f(i) drops a tuple e from input stream I, given that e was not dropped by any of F_f(1), F_f(2), …, F_f(j). The unconditional probability that F_f(i) drops an I-tuple is d(i|0). t_i is the expected time for F_i to process one tuple.

9 Preliminaries – cont'd
Given this notation, the per-tuple cost of O = F_f(1), F_f(2), …, F_f(n) can be formalized as:

cost(O) = Σ_{i=1..n} D_i · t_f(i), where D_1 = 1 and D_i = Π_{j=1..i-1} (1 − d(j|j−1))

Notice that D_i is the fraction of tuples that is left for operator F_f(i) to process. The goal is to maintain filter orderings that minimize this cost at any point in time.
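The cost formula can be evaluated directly from the conditional drop probabilities; a minimal sketch (0-indexed positions, function and variable names are ours):

```python
def expected_cost(d_cond, t):
    """Expected per-tuple cost of a filter ordering.

    d_cond[i] = d(i|i-1): conditional drop probability of the filter at
    position i, given survival of all earlier positions (0-indexed here).
    t[i]: expected per-tuple processing time of the filter at position i."""
    cost, D = 0.0, 1.0            # D_1 = 1: every tuple reaches the first filter
    for d_i, t_i in zip(d_cond, t):
        cost += D * t_i           # a fraction D_i of tuples pays t_f(i)
        D *= (1.0 - d_i)          # D_{i+1} = D_i * (1 - d(i|i-1))
    return cost
```

Placing a highly selective filter first lowers the cost: with uniform times, expected_cost([0.8, 0.2], [1.0, 1.0]) is smaller than expected_cost([0.2, 0.8], [1.0, 1.0]).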

10 Example
In the figure, a sequence of tuples is arriving on stream I: 1, 2, 1, 4, … We have four filters F1–F4, each represented as a set of values, such that F_i drops a tuple e if and only if F_i does not contain e.
- Note that all of the incoming tuples except e = 1 are dropped by some filter.
- For O1 = F1, F2, F3, F4, the total number of probes for the eight I-tuples shown is 20. (For example, e = 2 requires three probes, of F1, F2, and F3, before it is dropped by F3.)
- The corresponding number for O2 = F3, F2, F4, F1 is 18.
- O3 = F3, F1, F2, F4 is optimal for this example at 16 probes.
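Since the figure with the actual filter sets is not reproduced in this transcript, here is a probe-counting sketch with made-up filter sets; it illustrates why the ordering changes the probe count, but does not reproduce the paper's numbers:

```python
def count_probes(ordering, stream):
    """Total number of filter probes needed to process the whole stream.

    Each filter is modeled as a set: a filter F drops e iff e not in F."""
    probes = 0
    for e in stream:
        for F in ordering:
            probes += 1
            if e not in F:   # F drops e; the remaining filters are skipped
                break
    return probes

# Hypothetical filter sets (NOT the paper's example)
F1, F2, F3, F4 = {1, 2, 3}, {1, 2}, {1}, {1, 4}
stream = [1, 2, 1, 4]
```

Here count_probes([F1, F2, F3, F4], stream) gives 12 probes, while putting the most selective filter F3 first, count_probes([F3, F1, F2, F4], stream), needs only 10.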

11 Greedy Algorithm
Assume for the moment uniform times t_i for all filters. A greedy approach to filter ordering proceeds as follows:
1. Choose the filter F_i with the highest unconditional drop probability d(i|0) as F_f(1).
2. Choose the filter F_j with the highest conditional drop probability d(j|1) as F_f(2).
3. Choose the filter F_k with the highest conditional drop probability d(k|2) as F_f(3).
4. And so on.
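The steps above can be sketched over a sample of drop outcomes, such as the boolean profile tuples the profiler collects later in the talk (the function and representation are our illustration, not the paper's code):

```python
def greedy_order(drops, times=None):
    """Static Greedy over sampled drop outcomes.

    drops[t][k] is 1 iff filter F_k unconditionally drops sample tuple t.
    times[k] is t_k; uniform times are assumed when omitted.
    Returns filter indexes in greedy order."""
    n = len(drops[0])
    times = times or [1.0] * n
    surviving = list(range(len(drops)))   # sample tuples not yet dropped
    remaining = list(range(n))            # filters not yet placed
    order = []
    while remaining:
        # pick the filter with the highest conditional drop-rate / cost ratio
        best = max(remaining,
                   key=lambda k: sum(drops[t][k] for t in surviving) / times[k])
        order.append(best)
        surviving = [t for t in surviving if not drops[t][best]]
        remaining.remove(best)
    return order
```

Each iteration re-ranks the remaining filters only over the tuples that survived the filters already placed, which is exactly how conditional drop probabilities d(·|i) enter the ordering.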

12 Greedy Invariant
To factor in varying filter times t_i, replace d(i|0) in step 1 with d(i|0)/t_i, d(j|1) in step 2 with d(j|1)/t_j, and so on. We refer to this ordering algorithm as Static Greedy, or simply Greedy. Greedy maintains the following Greedy Invariant (GI):

d(i|i−1)/t_f(i) ≥ d(j|i−1)/t_f(j) for all 1 ≤ i < j ≤ n

13 So far - Pipelined Filters: Stable Statistics
- Assume statistics are not changing.
- Order filters by decreasing unconditional drop-rate/cost [prev. work].
- Correlations ⇒ NP-hard.
- Greedy algorithm: use conditional selectivities.
  - F_f(1) has the maximum drop-rate/cost ratio.
  - F_f(2) has the maximum drop-rate/cost ratio for tuples not dropped by F_f(1).
  - And so on.

14 Adaptive Version of Greedy
Greedy gives strong guarantees:
- A 4-approximation, the best polynomial-time approximation possible
- For arbitrary (correlated) characteristics
- Usually optimal in experiments
Challenge: an online algorithm with fast adaptivity to the Greedy ordering and low run-time overhead.
⇒ A-Greedy: Adaptive Greedy

15 A-Greedy
- Profiler: maintains conditional filter selectivities and costs over recent tuples.
- Reoptimizer: ensures that the filter ordering is Greedy for current statistics; it tells the Profiler which statistics are required, receives estimated statistics, and issues changes in the filter ordering to the Executor.
- Executor: processes tuples with the current filter ordering.
The Profiler and Reoptimizer are combined in part for efficiency.

16 A-Greedy Profiler
For n filters, the total number of conditional selectivities is n·2^(n−1), so it is clearly impractical for the profiler to maintain online estimates of all of them. Fortunately, to check whether a given ordering satisfies the GI, we need to check only (n + 2)(n − 1)/2 = O(n²) selectivities. Once a GI violation has occurred, finding a new ordering that satisfies the GI may require O(n²) new selectivities in the worst case. The new set of required selectivities depends on the new input characteristics, so it cannot be predicted in advance.

17 Profiler cont'd
The profiler maintains a profile of tuples dropped in the recent past. The profile is a sliding window of profile tuples created by sampling tuples from input stream I that get dropped during filter processing. A profile tuple contains n boolean attributes b_1, …, b_n corresponding to filters F_1, …, F_n. When a tuple e ∈ I is dropped during processing, e is profiled with some probability p, called the drop-profiling probability. If e is chosen for profiling, processing of e continues artificially to determine whether any of the remaining filters unconditionally drop e.

18 Profiler cont'd
The profiler then logs a tuple with attribute b_i = 1 if F_i drops e and b_i = 0 otherwise, 1 ≤ i ≤ n. The profile is maintained as a sliding window so that older input data does not contribute to statistics used by the reoptimizer. A sliding window of processing-time samples is also maintained to calculate the average processing time a_i for each filter F_i.
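Drop-profiling as described on the last two slides can be sketched as follows (the sampling helper, its name, and the value of p are illustrative assumptions):

```python
import random

def maybe_profile(e, filters, p=0.01, rng=random.random):
    """Process e through filters; on a drop, profile e with probability p.

    Returns (emitted, profile): profile is the bit vector b_1..b_n, where
    b_i = 1 iff F_i unconditionally drops e, or None if e was emitted or
    the drop was not sampled for profiling."""
    for f in filters:
        if not f(e):                       # e is dropped here
            if rng() >= p:
                return False, None         # drop not sampled
            # continue "artificially" through ALL filters to fill the profile
            return False, [0 if g(e) else 1 for g in filters]
    return True, None
```

Setting p = 1.0 makes the sampling deterministic, which is convenient for testing; in a running system p stays small so profiling overhead is bounded.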

19 A-Greedy Reoptimizer
The reoptimizer's job is to maintain an ordering O such that O satisfies the GI for statistics estimated from the tuples in the current profile window. The view maintained over the profile window is an n × n upper-triangular matrix V[i, j], 1 ≤ i ≤ j ≤ n, so we call it the matrix view. The n columns of V correspond in order to the n filters in O; that is, the filter corresponding to column c is F_f(c).

20 Reoptimizer cont'd
Entries in the ith row of V represent the conditional selectivities of filters F_f(i), F_f(i+1), …, F_f(n) for tuples that are not dropped by F_f(1), F_f(2), …, F_f(i−1). Specifically, V[i, j] is the number of tuples in the profile window that were dropped by F_f(j) among tuples that were not dropped by F_f(1), F_f(2), …, F_f(i−1). Notice that V[i, j] is proportional to d(j|i−1).
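The matrix view can be materialized from the profile window; a sketch under our own naming, where order plays the role of the mapping f (0-indexed):

```python
def build_matrix_view(window, order):
    """Compute the matrix view V from a window of profile bit vectors.

    window: list of bit vectors b, where b[k] = 1 iff filter F_k
    unconditionally drops the profiled tuple.
    order[i]: the filter index at ordering position i (0-indexed).
    V[i][j] counts tuples dropped by the filter at position j among
    tuples NOT dropped by the filters at positions 0..i-1."""
    n = len(order)
    V = [[0] * n for _ in range(n)]
    for b in window:
        for i in range(n):
            for j in range(i, n):
                if b[order[j]]:
                    V[i][j] += 1
            if b[order[i]]:
                break   # dropped at position i: contributes to no lower row
    return V
```

The break implements the conditioning: a tuple dropped at position i never reaches rows below i, which is what makes V[i][j] proportional to the conditional drop probability.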

21 Updating V on an insert to the profile window (the illustrating figure is not reproduced in this transcript)

22 Violation of GI
The reoptimizer maintains the ordering O such that the matrix view for O always satisfies the condition:
V[i, i]/a_f(i) ≥ V[i, j]/a_f(j), 1 ≤ i ≤ j ≤ n
Suppose an update to the matrix view or to a processing-time estimate causes the opposite to hold for some j > i:
V[i, i]/a_f(i) < V[i, j]/a_f(j)
Then a GI violation has occurred at position i.

23 Detecting a violation
An update to V or to an a_i can cause a GI violation at position i either because it reduces V[i, i]/a_f(i), or because it increases some V[i, j]/a_f(j), j > i.

24 Correcting a violation
After a violation at position i, we may need to reevaluate the filters at positions > i because their conditional selectivities may have changed. The adaptive ordering can thrash if both sides of the inequality are almost equal for some pair of filters. To avoid thrashing, a thrashing-avoidance parameter β is introduced into the violation check:
V[i, i]/a_f(i) < β · V[i, j]/a_f(j), 1 ≤ i < j ≤ n
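Putting the last three slides together, the GI check with the thrashing-avoidance parameter can be sketched as below; the default β is our assumption (a value slightly below 1 relaxes the check, so near-ties do not trigger reordering):

```python
def find_gi_violation(V, a, order, beta=0.9):
    """Return the first position pair (i, j) where the beta-relaxed GI
    fails, or None if the ordering still satisfies the invariant.

    V: matrix view; a[k]: average processing time of filter F_k;
    order[i]: filter index at position i (0-indexed); beta in (0, 1].
    The GI holds when V[i][i]/a_f(i) >= V[i][j]/a_f(j) for all j > i;
    a violation is flagged only when it exceeds the beta margin."""
    n = len(order)
    for i in range(n):
        lhs = V[i][i] / a[order[i]]
        for j in range(i + 1, n):
            if lhs < beta * (V[i][j] / a[order[j]]):
                return i, j
    return None
```

With β = 1 this is the exact GI check of slide 22; shrinking β trades reordering responsiveness for stability.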

25 Tradeoffs
Suppose changes are infrequent; then slower adaptivity is acceptable, and we want the best plans at very low run-time overhead. There is a three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties.

