
Adaptive Query Processing in Data Stream Systems
Paper written by Shivnath Babu, Kamesh Munagala, Rajeev Motwani, and Jennifer Widom (Stanford University), and Itaru Nishizawa (Hitachi, Ltd.)

Data Streams
- Continuous, unbounded, rapid, time-varying streams of data elements
- Occur in a variety of modern applications:
  - Network monitoring and intrusion detection
  - Sensor networks
  - Telecom call records
  - Financial applications
  - Web logs and click-streams
  - Manufacturing processes

Example Continuous Queries
- Web: Amazon's best sellers over the last hour
- Network intrusion detection: track HTTP packets with destination address matching a prefix in a given table and content matching "*\.ida"
- Finance: monitor NASDAQ stocks between $20 and $200 that have moved down more than 2% in the last 20 minutes

Traditional Query Optimization
- Optimizer: finds the "best" query plan to process a given query
- Executor: runs the chosen plan to completion
- Statistics Manager: periodically collects statistics (e.g., table sizes, histograms); the optimizer requests the statistics it needs and receives estimates

Optimizing Continuous Queries is Different
- Continuous queries are long-running
- Stream characteristics can change over time
  - Data properties: selectivities, correlations
  - Arrival properties: bursts, delays
- System conditions can change over time
⇒ Performance of a fixed plan can change significantly over time
⇒ Adaptive processing: find the best plan for current conditions

Traditional Optimization ⇒ Adaptive Optimization
In the traditional architecture, the Optimizer finds the "best" plan using estimated statistics from the Statistics Manager, and the Executor runs the chosen plan to completion. In the adaptive architecture:
- Profiler: monitors current stream and system characteristics
- Reoptimizer: ensures that the plan is efficient for current characteristics, issuing decisions to adapt
- Executor: executes the current plan
The profiler and reoptimizer are combined in part for efficiency.

Preliminaries
Let query Q process input stream I, applying the conjunction of n commutative filters F_1, F_2, …, F_n. Each filter F_i takes a stream tuple e as input and returns either true or false. If F_i returns false for tuple e, we say that F_i drops e. A tuple is emitted in the continuous query result if and only if all n filters return true.
A plan for executing Q consists of an ordering P = F_f(1), F_f(2), …, F_f(n), where f is the mapping from positions in the filter ordering to the indexes of the filters at those positions. When a tuple e is processed by P, F_f(1) is evaluated first. If it returns false (e is dropped by F_f(1)), then e is not processed further. Otherwise, F_f(2) is evaluated on e, and so on.

Preliminaries (cont'd)
At any time, the cost of an ordering O is the expected time to process an incoming tuple of I to completion (either emitted or dropped) using O. Consider O = F_f(1), F_f(2), …, F_f(n). Let d(i|j) be the conditional probability that F_f(i) drops a tuple e from input stream I, given that e was not dropped by any of F_f(1), F_f(2), …, F_f(j). The unconditional probability that F_f(i) drops an I tuple is d(i|0). Let t_i be the expected time for F_i to process one tuple.

Preliminaries (cont'd)
With this notation, the per-tuple cost of O = F_f(1), F_f(2), …, F_f(n) can be formalized as:

cost(O) = Σ_{i=1..n} D_i · t_f(i),  where D_1 = 1 and D_i = Π_{j=1..i−1} (1 − d(j|j−1))

Here D_i is the fraction of tuples that is left for operator F_f(i) to process. The goal is to maintain filter orderings that minimize this cost at any point in time.
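As a sanity check on the cost formula, the sketch below evaluates cost(O) position by position: each surviving fraction D_i pays the time of the filter at position i. The drop probabilities and times are hypothetical, chosen only to illustrate the formula.

```python
def ordering_cost(drop_probs, times):
    """Expected per-tuple cost of an ordering O = F_f(1), ..., F_f(n).

    drop_probs[i] = d(i+1 | i): conditional probability that the filter
    at position i drops a tuple that survived all earlier positions.
    times[i] = expected processing time of the filter at position i.
    """
    cost, surviving = 0.0, 1.0   # D_1 = 1: every tuple reaches position 1
    for d, t in zip(drop_probs, times):
        cost += surviving * t    # all surviving tuples pay this filter's time
        surviving *= (1.0 - d)   # D_{i+1}: fraction passed on to next position
    return cost

# Hypothetical statistics: same three filters, two different orderings.
print(ordering_cost([0.8, 0.5, 0.1], [1.0, 1.0, 1.0]))  # selective filter first: 1.3
print(ordering_cost([0.1, 0.5, 0.8], [1.0, 1.0, 1.0]))  # selective filter last: 2.35
```

Placing the highly selective filter first shrinks every later D_i, which is exactly the effect the ordering algorithms below try to exploit.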

Example
In this example (the accompanying figure is omitted here), a sequence of tuples arrives on stream I: 1, 2, 1, 4, … We have four filters F1–F4, such that F_i drops a tuple e if and only if F_i does not contain e.
- Note that all of the incoming tuples except e = 1 are dropped by some filter.
- For O1 = F1, F2, F3, F4, the total number of probes for the eight I tuples shown is 20. (For example, e = 2 requires three probes, F1, F2, and F3, before it is dropped by F3.)
- The corresponding number for O2 = F3, F2, F4, F1 is 18.
- O3 = F3, F1, F2, F4 is optimal for this example at 16 probes.
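The slide's probe counts (20, 18, 16) depend on the specific filter sets shown in the omitted figure, so this sketch uses hypothetical sets to illustrate the same counting; the exact totals differ, but the point, that putting a selective filter first reduces total probes, is the same.

```python
def probes(ordering, stream):
    """Count total filter probes: each tuple is probed by successive
    filters (modeled as membership sets) until one drops it or all pass."""
    total = 0
    for e in stream:
        for f in ordering:
            total += 1
            if e not in f:   # F_i drops e iff F_i does not contain e
                break
    return total

# Hypothetical filter sets (NOT the slide's figure):
F1, F2, F3, F4 = {1, 2, 3}, {1, 2}, {1}, {1, 4}
stream = [1, 2, 1, 4, 1, 5, 1, 6]  # eight tuples; only e = 1 passes all filters

print(probes([F1, F2, F3, F4], stream))  # 22 probes
print(probes([F3, F1, F2, F4], stream))  # 20 probes: selective F3 first
```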

Greedy Algorithm
Assume for the moment uniform times t_i for all filters. A greedy approach to filter ordering proceeds as follows:
1. Choose the filter F_i with the highest unconditional drop probability d(i|0) as F_f(1).
2. Choose the filter F_j with the highest conditional drop probability d(j|1) as F_f(2).
3. Choose the filter F_k with the highest conditional drop probability d(k|2) as F_f(3).
4. And so on.

Greedy Invariant
To factor in varying filter times t_i, replace d(i|0) in step 1 with d(i|0)/t_i, d(j|1) in step 2 with d(j|1)/t_j, and so on. We refer to this ordering algorithm as Static Greedy, or simply Greedy. Greedy maintains the following Greedy Invariant (GI):

d(i|i−1)/t_f(i) ≥ d(j|i−1)/t_f(j), for all 1 ≤ i < j ≤ n
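The greedy steps above can be sketched directly, estimating each conditional drop probability empirically over the tuples that survive the filters chosen so far. The filter names, predicates, and sample here are hypothetical illustrations, not values from the paper.

```python
def greedy_order(filters, sample, times):
    """Static Greedy sketch: repeatedly pick the remaining filter with the
    highest conditional drop-rate/cost ratio over the surviving sample.

    filters: dict name -> predicate (True means the filter keeps the tuple)
    sample:  list of representative tuples used to estimate selectivities
    times:   dict name -> expected per-tuple processing time
    """
    remaining = dict(filters)
    surviving = list(sample)
    order = []
    while remaining:
        def ratio(name):
            # d(i | chosen)/t_i, estimated on tuples surviving chosen filters
            if not surviving:
                return 0.0
            drops = sum(1 for e in surviving if not remaining[name](e))
            return (drops / len(surviving)) / times[name]
        best = max(remaining, key=ratio)
        order.append(best)
        pred = remaining.pop(best)
        surviving = [e for e in surviving if pred(e)]  # condition on survivors
    return order

# Hypothetical correlated filters over a small sample:
filters = {"F1": lambda e: e % 2 == 0, "F2": lambda e: e < 4, "F3": lambda e: e > 0}
times = {"F1": 1.0, "F2": 1.0, "F3": 1.0}
print(greedy_order(filters, list(range(10)), times))  # ['F2', 'F1', 'F3']
```

Note how F3 looks useless unconditionally (it drops only e = 0) yet its rank is decided again at each step on the surviving tuples, which is exactly why Greedy handles correlated filters better than a one-shot unconditional ordering.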

So Far: Pipelined Filters with Stable Statistics
- Assume statistics are not changing
- Order filters by decreasing unconditional drop-rate/cost [prev. work]
- Correlations ⇒ NP-hard
- Greedy algorithm: use conditional selectivities
  - F_f(1) has maximum drop-rate/cost
  - F_f(2) has maximum drop-rate/cost ratio for tuples not dropped by F_f(1)
  - And so on

Adaptive Version of Greedy
- Greedy gives strong guarantees
  - 4-approximation, the best polynomial-time approximation possible
  - Holds for arbitrary (correlated) characteristics
  - Usually optimal in experiments
- Challenge:
  - Online algorithm
  - Fast adaptivity to the Greedy ordering
  - Low run-time overhead
⇒ A-Greedy: Adaptive Greedy

A-Greedy
- Profiler: maintains conditional filter selectivities and costs over recent tuples
- Reoptimizer: ensures that the filter ordering is Greedy for current statistics, issuing changes to the filter ordering
- Executor: processes tuples with the current filter ordering
The profiler and reoptimizer are combined in part for efficiency.

A-Greedy Profiler
For n filters, the total number of conditional selectivities is n·2^(n−1), so it is clearly impractical for the profiler to maintain online estimates of all of them. Fortunately, to check whether a given ordering satisfies the GI, we need to check only (n + 2)(n − 1)/2 = O(n^2) selectivities. Once a GI violation has occurred, finding a new ordering that satisfies the GI may require O(n^2) new selectivities in the worst case. The new set of required selectivities depends on the new input characteristics, so it cannot be predicted in advance.

Profiler (cont'd)
The profiler maintains a profile of tuples dropped in the recent past. The profile is a sliding window of profile tuples created by sampling tuples from input stream I that get dropped during filter processing. A profile tuple contains n boolean attributes b_1, …, b_n corresponding to filters F_1, …, F_n. When a tuple e ∈ I is dropped during processing, e is profiled with some probability p, called the drop-profiling probability. If e is chosen for profiling, processing of e continues artificially to determine whether any of the remaining filters would also drop e unconditionally.

Profiler (cont'd)
The profiler then logs a tuple with attribute b_i = 1 if F_i drops e and b_i = 0 otherwise, 1 ≤ i ≤ n. The profile is maintained as a sliding window so that older input data does not contribute to the statistics used by the reoptimizer. A sliding window of processing-time samples is also maintained to calculate the average processing time a_i for each filter F_i.
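The profiling step above can be sketched as follows. The class name, window size, and drop-profiling probability p are illustrative choices, not values fixed by the paper.

```python
import random
from collections import deque

class DropProfiler:
    """Sketch of the A-Greedy drop profiler: when a tuple is dropped,
    with probability p keep evaluating the remaining filters artificially
    and log one boolean profile tuple b_1..b_n (b_i = 1 iff F_i drops e).
    The deque's maxlen gives the sliding window over recent drops."""

    def __init__(self, filters, p=0.1, window=1000):
        self.filters = filters                 # list of predicates; True = keep
        self.p = p                             # drop-profiling probability
        self.window = deque(maxlen=window)     # sliding window of profile tuples

    def on_drop(self, e):
        if random.random() < self.p:
            # b_i records whether each filter *unconditionally* drops e
            self.window.append([0 if keep(e) else 1 for keep in self.filters])

# Deterministic usage (p = 1.0 profiles every dropped tuple):
prof = DropProfiler([lambda e: e % 2 == 0, lambda e: e < 5], p=1.0, window=10)
prof.on_drop(7)   # dropped by both filters -> profile tuple [1, 1]
prof.on_drop(3)   # dropped by F_1 only    -> profile tuple [1, 0]
```

Bounding the window with `maxlen` is what lets older input stop influencing the reoptimizer's statistics, as the slide describes.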

A-Greedy Reoptimizer
The reoptimizer's job is to maintain an ordering O such that O satisfies the GI for statistics estimated from the tuples in the current profile window. The view maintained over the profile window is an n × n upper triangular matrix V[i, j], 1 ≤ i ≤ j ≤ n, called the matrix view. The n columns of V correspond in order to the n filters in O; that is, the filter corresponding to column c is F_f(c).

Reoptimizer (cont'd)
Entries in the ith row of V represent the conditional selectivities of filters F_f(i), F_f(i+1), …, F_f(n) for tuples that are not dropped by F_f(1), F_f(2), …, F_f(i−1). Specifically, V[i, j] is the number of tuples in the profile window that were dropped by F_f(j) among tuples that were not dropped by F_f(1), F_f(2), …, F_f(i−1). Notice that V[i, j] is proportional to d(j|i−1).
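A minimal sketch of maintaining this view on an insert, using 0-indexed positions for convenience: each profile tuple contributes to row i only if no filter at an earlier position drops it, and then increments every column whose filter drops it.

```python
def update_view(V, profile_tuple, f):
    """Update the upper-triangular matrix view V for one new profile tuple.

    V[i][j] counts profile tuples dropped by the filter at ordering
    position j among tuples not dropped by the filters at positions < i.
    f maps ordering positions to filter indexes (0-indexed sketch);
    profile_tuple[k] = 1 iff filter F_{k+1} unconditionally drops the tuple.
    """
    n = len(profile_tuple)
    for i in range(n):
        # the tuple belongs in row i only if it survives positions < i
        if any(profile_tuple[f[r]] for r in range(i)):
            break
        for j in range(i, n):
            if profile_tuple[f[j]]:
                V[i][j] += 1

# Usage with the identity ordering over three filters:
V = [[0] * 3 for _ in range(3)]
update_view(V, [1, 0, 1], [0, 1, 2])
update_view(V, [0, 1, 1], [0, 1, 2])
print(V)  # [[1, 1, 2], [0, 1, 1], [0, 0, 0]]
```

Deleting the oldest profile tuple when the window slides would be the symmetric operation, decrementing the same entries.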

Updating V on an Insert to the Profile Window

Violation of GI
The reoptimizer maintains the ordering O such that the matrix view for O always satisfies the condition:

V[i, i]/a_f(i) ≥ V[i, j]/a_f(j), for all 1 ≤ i ≤ j ≤ n

Suppose an update to the matrix view or to a processing-time estimate causes the following to hold for some pair i < j:

V[i, i]/a_f(i) < V[i, j]/a_f(j)

Then a GI violation has occurred at position i.

Detecting a Violation
An update to V or to an a_i can cause a GI violation at position i either because it reduces V[i, i]/a_f(i), or because it increases some V[i, j]/a_f(j), j > i.

Correcting a Violation
After a violation at position i, we may need to reevaluate the filters at positions > i because their conditional selectivities may have changed. The adaptive ordering can thrash if both sides of the invariant are almost equal for some pair of filters. To avoid thrashing, a thrashing-avoidance parameter β is introduced into the violation condition:

V[i, i]/a_f(i) < β · V[i, j]/a_f(j)
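A sketch of the damped violation check. Here β is taken slightly below 1 (an assumption of this sketch) so that a later filter must beat the current one's drop-rate/cost by a clear margin before a reorder is triggered.

```python
def gi_violation(V, a, f, beta=0.9):
    """Scan the matrix view for a Greedy Invariant violation.

    V: upper-triangular matrix view (0-indexed sketch); a[k] = average
    processing time of filter F_{k+1}; f maps ordering positions to
    filter indexes; beta < 1 damps reordering to avoid thrashing.
    Returns the first violated position, or None."""
    n = len(V)
    for i in range(n):
        for j in range(i + 1, n):
            if V[i][i] / a[f[i]] < beta * V[i][j] / a[f[j]]:
                return i  # GI violated at position i
    return None  # ordering still satisfies the (damped) invariant

# Usage with hypothetical counts for two filters of equal cost:
print(gi_violation([[10, 20], [0, 5]], [1.0, 1.0], [0, 1]))  # 0 (violation)
print(gi_violation([[20, 10], [0, 5]], [1.0, 1.0], [0, 1]))  # None
```

With β closer to 1 the reoptimizer reacts faster but risks thrashing when two filters have nearly equal drop-rate/cost, which is exactly the tradeoff the next slide discusses.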

Tradeoffs
- Suppose changes are infrequent
  - Slower adaptivity is okay
  - Want the best plans at very low run-time overhead
- Three-way tradeoff among speed of adaptivity, run-time overhead, and convergence properties