Stream Data: Operator Ordering, Query Optimization, Query Index

Query Optimization: Operator Ordering Problem
Assumption
– A query consists of a set of commutative filters.
– Each filter either drops or selects (passes) a tuple.
Overall processing cost can vary widely across different filter orders.
Example
– Filter O1 drops 1, 3, 5.
– Filter O2 drops 2, 4, 6.
– Let the input stream be 2, 4, 6.
– The operator order O2, O1 is cheaper than O1, O2, because O2 drops every tuple before O1 is invoked.
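The example can be checked with a minimal sketch that counts filter evaluations for both orders. The cost model (one unit per filter invocation) and the names `count_invocations`, `o1`, and `o2` are assumptions made for illustration, not part of the slides.

```python
def count_invocations(stream, filters):
    """Count filter evaluations when a tuple stops at the first filter that drops it."""
    calls = 0
    for value in stream:
        for drops in filters:
            calls += 1
            if drops(value):
                break
    return calls


def o1(value):
    # Filter O1 drops 1, 3, 5.
    return value in {1, 3, 5}


def o2(value):
    # Filter O2 drops 2, 4, 6.
    return value in {2, 4, 6}


stream = [2, 4, 6]
cost_o1_o2 = count_invocations(stream, [o1, o2])
cost_o2_o1 = count_invocations(stream, [o2, o1])
print(cost_o1_o2, cost_o2_o1)  # 6 3
```

With O1 first, every tuple passes O1 and is only dropped by O2 (two evaluations per tuple); with O2 first, each tuple is dropped after a single evaluation.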

Operator Ordering
Operator ordering
– Choose an efficient order.
– The optimal order changes over time.
Eddy [4]
– Tuple routing technique.
– An operator that drops many tuples is given high priority.
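The routing idea can be sketched as a router that tries the filter with the highest observed drop rate first. This is a deliberate simplification under stated assumptions: it is not the lottery-based routing policy of Eddies [4], and the class name `DropRateRouter` and its smoothing constants are hypothetical.

```python
class DropRateRouter:
    """Route each tuple through the filters, highest observed drop rate first."""

    def __init__(self, filters):
        self.filters = filters
        # Start every filter at an assumed 1-in-2 drop rate to avoid division by zero.
        self.seen = [2] * len(filters)
        self.dropped = [1] * len(filters)

    def order(self):
        rate = lambda i: self.dropped[i] / self.seen[i]
        return sorted(range(len(self.filters)), key=rate, reverse=True)

    def route(self, value):
        """Return True if the tuple survives all filters."""
        for i in self.order():
            self.seen[i] += 1
            if self.filters[i](value):
                self.dropped[i] += 1
                return False
        return True


router = DropRateRouter([lambda v: v in {1, 3, 5}, lambda v: v in {2, 4, 6}])
survivors = [v for v in [2, 4, 6, 7] if router.route(v)]
print(survivors, router.order())
```

Because the statistics are updated on every tuple, the order adapts when the stream's characteristics change, which is the point of the slide's "optimal order changes over time".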

Operator Ordering: A-Greedy [9]
Query cost C
– $d(i|j)$ denotes the conditional probability that the $i$-th operator $O_{f(i)}$ will drop a tuple $e$, given that $e$ was not dropped by any of the operators $O_{f(1)}, O_{f(2)}, \ldots, O_{f(j)}$.
– $t_i$ represents the expected time for $O_{f(i)}$ to process one tuple.
Goal: minimize C
– A greedy heuristic rearranges the operator order so that it satisfies the greedy invariant of [9].
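As a hedged reconstruction based on the A-Greedy paper [9] rather than on the slide itself, the query cost and the greedy ordering rule are commonly written as follows. The symbols $n$ (number of operators) and $D_i$ (fraction of tuples that reach the $i$-th operator) are notation introduced here, and $d(i|0)$ is taken to be the unconditional drop probability.

```latex
C = \sum_{i=1}^{n} t_i D_i, \qquad
D_i =
\begin{cases}
  1 & i = 1 \\
  \prod_{j=1}^{i-1} \bigl(1 - d(j \mid j-1)\bigr) & i > 1
\end{cases}

\frac{d(i \mid i-1)}{t_i} \;\ge\; \frac{d(j \mid i-1)}{t_j},
\qquad 1 \le i \le j \le n
```

Read informally: at every position, the chosen operator has the best ratio of conditional drop probability to processing time among the operators that have not yet been placed.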

Operator Ordering: A-Greedy
Profiler
– Profiling is used to obtain the conditional selectivities d(i|j).
– A tuple e that is dropped during processing is selected for profiling with probability p.
– The profiler then artificially applies e to all operators and generates a profile tuple whose attribute b_i is 1 if O_i drops e, and 0 otherwise.
Reoptimizer
– Maintains the operator order.
– Maintains a matrix view over the profile tuples.
– Example: the first row shows that O4 drops the most tuples; the second row reports, for the tuples not dropped by O4, the numbers of tuples dropped by O1, O3, and O2.
(Figure: profile matrix view)
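A minimal sketch of how an order could be derived from a window of profile tuples: repeatedly pick the operator that drops the most of the still-surviving profile tuples per unit of processing time, then discard the tuples it drops. This is a static recomputation for illustration, not the incremental matrix-view maintenance of A-Greedy; `greedy_order`, the sample data, and the unit processing times are assumptions.

```python
def greedy_order(profile_tuples, times):
    """Return operator indices ordered by conditional drops per unit time."""
    remaining = list(range(len(times)))
    live = list(profile_tuples)          # profile tuples not yet dropped
    order = []
    while remaining:
        best = max(remaining, key=lambda i: sum(p[i] for p in live) / times[i])
        order.append(best)
        remaining.remove(best)
        # Keep only the tuples that the chosen operator does not drop.
        live = [p for p in live if not p[best]]
    return order


# Each row is one profile tuple; column i is 1 if operator O(i+1) drops the tuple.
profile_tuples = [
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
order = greedy_order(profile_tuples, times=[1.0, 1.0, 1.0, 1.0])
print([f"O{i + 1}" for i in order])  # ['O4', 'O1', 'O2', 'O3']
```

In this made-up window O4 drops four of five tuples, so it goes first; among the tuples that survive O4, only O1 drops anything, so it goes second.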

Operator Ordering: Problem of A-Greedy
– Profiling overhead.
– A normal tuple stops at the first operator that drops it, but a profiled tuple is applied to all operators.
– In other words, when 10% of the input is profiled, system overhead increases by more than 10%.
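A back-of-the-envelope sketch of why the overhead exceeds the profiling fraction. The cost model is an assumption: every operator evaluation costs one unit, a normal tuple is evaluated by `avg_operators_per_tuple` operators before it is dropped or output, and a profiled tuple is additionally evaluated by all operators.

```python
def profiling_overhead(profile_fraction, num_operators, avg_operators_per_tuple):
    """Relative extra work caused by profiling, under the unit-cost model above."""
    extra_per_tuple = profile_fraction * num_operators
    return extra_per_tuple / avg_operators_per_tuple


# Hypothetical numbers: 4 operators, tuples normally visit 2 of them on average.
overhead = profiling_overhead(0.10, num_operators=4, avg_operators_per_tuple=2.0)
print(f"{overhead:.0%}")
```

Whenever tuples are usually dropped before reaching the last operator, the average number of evaluations is below the number of operators, so the relative overhead is larger than the profiled fraction.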

Load Shedding [8]
– Push-based data sources
– High and unpredictable data rates
Problem
– Load > capacity
– Load shedding: eliminate excess load by dropping data

Aurora
(Figure: Aurora operator network, including Slide and Tumble boxes, delivering results to applications, each with its own QoS specification)

QoS: Aurora
QoS specifies the "utility" of imperfect query results
– Delay-based (specifies the utility of late results)
– Delivery-based, value-based (specify the utility of partial results)
QoS influences scheduling, storage management, and load shedding.

Load Shedding: Aurora
Two load shedding techniques:
– Random tuple drops: add a DROP box to the network (DROP is a special case of FILTER), positioned to affect queries with tolerant delivery-based QoS requirements.
– Semantic load shedding: FILTER out values with low utility (according to value-based QoS).
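The two techniques can be contrasted in a short sketch. It is an illustration only, not Aurora's implementation: `random_drop`, `semantic_drop`, the sample readings, and the utility function (larger reading means higher utility) are all assumptions.

```python
import random


def random_drop(stream, drop_fraction, rng):
    """DROP box: discard each tuple independently with probability drop_fraction."""
    return [t for t in stream if rng.random() >= drop_fraction]


def semantic_drop(stream, utility, min_utility):
    """Semantic shedding: a FILTER that keeps only tuples with enough utility."""
    return [t for t in stream if utility(t) >= min_utility]


readings = [3, 18, 25, 7, 31, 12]
kept_random = random_drop(readings, 0.5, random.Random(0))
kept_semantic = semantic_drop(readings, utility=lambda v: v, min_utility=15)
print(kept_random, kept_semantic)
```

Random drops lose tuples regardless of content, while the semantic variant keeps the tuples the value-based QoS considers most useful.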

Load Shedding: Aurora
(Figure: load coefficient)

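Load Shedding: Aurora
Best location of the Drop operator
– Maximize cycle gain, minimize utility loss.
– Cycle gain: processor cycles gained for each percentage of tuples dropped, G(x) = R * (x * L - D), where R is the input rate, L is the load coefficient, and D is the cost of the drop operator in cycles per tuple.
– Loss/gain ratio: the smaller, the better.
(Figure: a drop box that drops x% of an input with rate R and load coefficient L at a cost of D cycles/tuple; loss-tolerant graph)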
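A small sketch of the gain formula and of picking a drop location by loss/gain ratio. The candidate locations, their rates, load coefficients, and utility losses are invented numbers, and `x` is taken as a fraction in [0, 1]; only the formula G(x) = R * (x * L - D) comes from the slide.

```python
import math


def cycle_gain(rate, load_coeff, drop_cost, x):
    """G(x) = R * (x * L - D); no drop box means no gain."""
    if x <= 0:
        return 0.0
    return rate * (x * load_coeff - drop_cost)


# Hypothetical candidate locations for a drop box.
candidates = {
    "after_source": {"rate": 100.0, "load_coeff": 10.0, "utility_loss": 0.30},
    "before_join": {"rate": 40.0, "load_coeff": 8.0, "utility_loss": 0.10},
}
DROP_COST = 1.0   # D, cycles per tuple (assumed)
X = 0.25          # fraction of tuples dropped (assumed)


def loss_gain_ratio(candidate):
    gain = cycle_gain(candidate["rate"], candidate["load_coeff"], DROP_COST, X)
    return candidate["utility_loss"] / gain if gain > 0 else math.inf


best = min(candidates, key=lambda name: loss_gain_ratio(candidates[name]))
print(best)
```

Here the location right after the source loses more utility but frees far more cycles, so its loss/gain ratio is the smaller of the two.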

Load Shedding [26]
Load shedding must decide where, when, and how much to shed.
– Where: [8]. How much: [26], particularly in multi-query environments.
Example: two queries, Qa and Qb
– Data size = 24 tuples, processing cost per tuple = c
– Overall cost = 24 * 2 * c = 48c
– System capacity = 30c
Goal: minimize $G = \sum_p \frac{1 - r_p}{r_p} f_p$, where $r_p$ is the fraction of tuples processed for query $Q_p$ and $f_p$ is the actual frequency of tuples in its result.
Assume $f_a = 1$, $f_b = 4$.
– Plan 1 (uniform): $r_a = r_b = 15/24$, so G = 3
– Plan 2 (proportional): $f_b / f_a = 4$, so the budget of 30 is split 6:24; $r_a = 6/24$, $r_b = 24/24$, so G = 3
– Plan 3 (optimal): $r_a = 10/24$, $r_b = 20/24$, so G = 2.2
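The three plans can be verified with exact rational arithmetic. The function name `total_error` is an assumption; the rates and frequencies are the ones in the example, with the uniform plan read as 15 of 24 tuples per query.

```python
from fractions import Fraction


def total_error(rates, freqs):
    """G = sum over queries of ((1 - r) / r) * f."""
    return sum((1 - r) / r * f for r, f in zip(rates, freqs))


freqs = [1, 4]  # f_a, f_b
plans = {
    "uniform": [Fraction(15, 24), Fraction(15, 24)],
    "proportional": [Fraction(6, 24), Fraction(24, 24)],
    "optimal": [Fraction(10, 24), Fraction(20, 24)],
}
errors = {name: total_error(rates, freqs) for name, rates in plans.items()}
for name, g in errors.items():
    print(name, float(g))
```

Each plan spends the same budget of 30 tuple evaluations (15 + 15, 6 + 24, 10 + 20); only the split between the two queries changes the total error.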

Load Shedding [26]
Estimating $f_p$
– Let $b_i = 1$ if tuple $t_i$ is a query result; otherwise $b_i = 0$.
– $f_p = \sum_i b_i$
– Each tuple $t_i$ is processed with probability $r_p$ and discarded with probability $1 - r_p$.
– Let $X_i = b_i / r_p$ with probability $r_p$ and $X_i = 0$ with probability $1 - r_p$.
– Estimate $\hat{f}_p = \sum_i X_i$; then $E(\hat{f}_p) = E(\sum_i X_i) = \sum_i b_i = f_p$.
– $Var(\hat{f}_p) = \frac{1 - r_p}{r_p} f_p$
– The variance is used as the average error $e_p$.
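The unbiasedness and variance claims can be checked exactly for a small input by summing the per-tuple moments. The sample bit vector and the rate 1/4 are arbitrary assumptions, and the sum of variances relies on tuples being sampled independently.

```python
from fractions import Fraction


def estimator_moments(bits, rate):
    """Exact mean and variance of sum(X_i), X_i = b_i / rate with probability rate, else 0."""
    mean = Fraction(0)
    variance = Fraction(0)
    for b in bits:
        x = Fraction(b) / rate
        e_x = rate * x            # E[X_i] = b_i
        e_x2 = rate * x * x       # E[X_i^2] = b_i^2 / rate
        mean += e_x
        variance += e_x2 - e_x ** 2
    return mean, variance


bits = [1, 0, 1, 1, 0, 1]         # b_i for six tuples (made-up data)
rate = Fraction(1, 4)
f_p = sum(bits)
mean, variance = estimator_moments(bits, rate)
print(mean, variance)  # 4 12
```

The mean equals the true frequency 4, and the variance equals ((1 - 1/4) / (1/4)) * 4 = 12, matching the formula on the slide.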

Load Shedding [26]
– Let $S$ be a set of queries, $|S| = N$
– Error vector $E = [e_1, \ldots, e_N]$
– Importance of queries $V = [v_1, \ldots, v_N]$
– Resource cost $C = [c_1, \ldots, c_N]$
– Processing ratio $r = [r_1, \ldots, r_N]$
– Total resource limitation = $L$
– Data size = $W$
Goal
– Constraint: $r \cdot C = \sum_i r_i c_i \le L / W$
– Minimize $G = E \cdot V = \sum_i e_i v_i$
– Apply $e_i = \frac{1 - r_i}{r_i} f_i$
– $G = -\sum_i f_i v_i + G_1$, where $G_1 = \sum_j \frac{f_j v_j}{r_j}$
– To minimize $G$, it suffices to minimize $G_1$.
– This is a non-linear programming problem (separable and convex resource allocation).
– A sorting-based solution takes $O(N \log N)$.
– The paper suggests an $O(N)$ algorithm.
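One way to solve this allocation problem is the Lagrangian solution $r_i \propto \sqrt{f_i v_i / c_i}$ with repeated clamping of ratios that would exceed 1. This sketch is an assumption-laden illustration, not the $O(N)$ algorithm of the paper; `optimal_rates` is a hypothetical name, weights stand for $f_i v_i$ and must be positive, and `budget` stands for $L / W$.

```python
import math


def optimal_rates(weights, costs, budget):
    """Minimize sum(w_i / r_i) subject to sum(r_i * c_i) <= budget and 0 < r_i <= 1."""
    rates = [None] * len(weights)
    free = set(range(len(weights)))
    while free:
        denom = sum(math.sqrt(weights[i] * costs[i]) for i in free)
        trial = {i: budget * math.sqrt(weights[i] / costs[i]) / denom for i in free}
        capped = [i for i in free if trial[i] >= 1.0]
        if not capped:
            for i in free:
                rates[i] = trial[i]
            break
        # Fully process the capped queries and redistribute the leftover budget.
        for i in capped:
            rates[i] = 1.0
            budget -= costs[i]
            free.remove(i)
    return rates


# Two-query example: f = [1, 4], v = [1, 1], unit costs, budget 30/24.
rates = optimal_rates([1, 4], [1, 1], 30 / 24)
print(rates)
```

For the two-query example this reproduces the optimal plan $r_a = 10/24$, $r_b = 20/24$.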

Query Index
– Invoking every query whenever data arrives is costly, which motivates a query index.
Property of stream data
– Locality
– Example: the temperature in the near future will be similar to the current temperature.
– Some or all queries will be reused in the near future.

Query Index
– The number of registered queries is huge.
– Finding the queries that must evaluate an incoming stream item is an overhead.
– IBS (Interval Binary Search Tree)
– R-Tree: a multi-dimensional data access method. The range conditions of queries overlap, so many nodes must be traversed because of the large amount of overlap among query conditions.

Query Index: IBS [10]
– Uses balanced binary search trees as query indexes.
– When a data item arrives, the balanced binary search trees and a hash table are probed with the tuple's value.
– Not well suited to general range queries with two bounds: each condition is indexed in a separate binary tree, which produces unnecessary partial results.
Query conditions
– q1: R.a ≥ 1 and R.a < 10
– q2: R.a > 5
– q3: R.a > 7
– q4: R.a = 4
– q5: R.a = 6
(Figure: group filter for R.a, with one index per operator <, >, =, !=)
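A minimal sketch of a group filter for a single attribute: sorted lists searched by binary search stand in for the balanced trees, and a dictionary stands in for the hash table. It is not the implementation from [10]; the class `GroupFilter` is hypothetical, and the garbled parts of the query list are filled by assumption (q1's lower bound is taken as inclusive, q5 as R.a = 6).

```python
import bisect
from collections import defaultdict


class GroupFilter:
    """One sorted index per inequality operator plus a hash table for equality."""

    def __init__(self):
        self.consts = {">": [], ">=": [], "<": []}
        self.qids = {">": [], ">=": [], "<": []}
        self.equal = defaultdict(list)
        self.needed = defaultdict(int)   # predicates registered per query

    def add(self, qid, op, const):
        self.needed[qid] += 1
        if op == "=":
            self.equal[const].append(qid)
            return
        pos = bisect.bisect_left(self.consts[op], const)
        self.consts[op].insert(pos, const)
        self.qids[op].insert(pos, qid)

    def probe(self, value):
        hits = defaultdict(int)          # partial results: satisfied predicates per query
        k = bisect.bisect_left(self.consts[">"], value)
        for q in self.qids[">"][:k]:     # constants strictly below value
            hits[q] += 1
        k = bisect.bisect_right(self.consts[">="], value)
        for q in self.qids[">="][:k]:    # constants at or below value
            hits[q] += 1
        k = bisect.bisect_right(self.consts["<"], value)
        for q in self.qids["<"][k:]:     # constants strictly above value
            hits[q] += 1
        for q in self.equal.get(value, []):
            hits[q] += 1
        # A query matches only if all of its predicates were satisfied.
        return {q for q, n in hits.items() if n == self.needed[q]}


gf = GroupFilter()
for qid, op, const in [("q1", ">=", 1), ("q1", "<", 10), ("q2", ">", 5),
                       ("q3", ">", 7), ("q4", "=", 4), ("q5", "=", 6)]:
    gf.add(qid, op, const)
print(sorted(gf.probe(6)), sorted(gf.probe(12)))
```

Probing with 12 shows the partial-result problem: q1's lower-bound predicate is reported as satisfied and has to be discarded afterwards because its upper bound fails.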

Query Index: Query Processing Based on Spatial Join [26]
– Query: represented as a region
– Data: represented as a point
– Batch mode: accumulate arriving data elements and then process the continuous queries, so a set of data is represented as a region
– Uses spatial indexes for the data set and the queries

Query Index
– A set of data is a region and a query is a region, so processing computes their overlap relationships.
– [26] uses the corner transformation: an n-dimensional object becomes a 2n-dimensional point.
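The corner transformation can be sketched as concatenating the lower and upper corners of an n-dimensional box into one 2n-dimensional point; overlap between two boxes then becomes a set of coordinate comparisons between their points. The function names and sample intervals are assumptions, and this shows only the geometric idea, not the indexing scheme of [26].

```python
def corner_transform(low, high):
    """Map an n-dimensional box [low, high] to a 2n-dimensional point."""
    return tuple(low) + tuple(high)


def overlaps(a, b):
    """Two boxes overlap iff, in every dimension, each starts before the other ends."""
    n = len(a) // 2
    return all(a[k] <= b[n + k] and b[k] <= a[n + k] for k in range(n))


query = corner_transform([1], [10])      # query region 1 <= R.a <= 10
batch = corner_transform([4], [12])      # a batch of data spanning 4..12
far = corner_transform([11], [15])       # a batch spanning 11..15
print(overlaps(query, batch), overlaps(query, far))  # True False
```

A two-dimensional box such as [0, 2] x [0, 2] becomes the four-dimensional point (0, 0, 2, 2) under the same function.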

Query Index: BMQ-Index [11]
– The DMR list is a list of nodes $DN_i$.
– $DN_i = \langle DR_i, +DQSet, -DQSet \rangle$
– $DR_i$ is a matching region $(b_{i-1}, b_i)$.
– +DQSet is the set of queries whose lower bound $l_k = b_{i-1}$.
– -DQSet is the set of queries whose upper bound $u_k = b_{i-1}$.
– A stream table keeps the most recently accessed $DN_i$.
Query conditions
– q1: R.a ≥ 1 and R.a < 10
– q2: R.a > 5
– q3: R.a > 7
– q4: R.a = 4
– q5: R.a = 6
(Figure: DMR list DN1 to DN6 with delta query sets {+q1}, {+q2}, {+q3}, {-q1}, {-q2, -q3}, and a stream table pointing at the most recently accessed node)

Query Index
– QSet(t) is the set of queries matching the data value $v_t$.
– Let $v_t$ be in $DN_j$ and $v_{t+1}$ be in $DN_h$, i.e., $b_{j-1} \le v_t < b_j$ and $b_{h-1} \le v_{t+1} < b_h$.
– Then QSet(t+1) is obtained from QSet(t) by applying the +DQSets and -DQSets of the nodes crossed between $DN_j$ and $DN_h$.
Example: $v_t = 4.5$ and QSet(t) = {q1}; if $v_{t+1} = 12$:
– Union of the +DQSets = {q2, q3}
– Union of the -DQSets = {q1}
– Thus QSet(t+1) = {q2, q3}
(Figure: DMR list DN1 to DN6 with its stream table)
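The incremental update can be sketched for the example queries q1 to q3. The boundaries 1, 5, 7, 10 are inferred from the query conditions, nodes are numbered from 0 here, and q2 and q3 are never removed because their upper bound is infinite. Handling of a downward move by undoing the crossed nodes is an assumption consistent with the example, not a transcription of [11].

```python
import bisect

BOUNDS = [1, 5, 7, 10]                            # b_1 .. b_4; node i covers [b_i, b_{i+1})
PLUS = [set(), {"q1"}, {"q2"}, {"q3"}, set()]     # +DQSet per node
MINUS = [set(), set(), set(), set(), {"q1"}]      # -DQSet per node


def node_of(value):
    """Index of the node whose region contains value (b_{i-1} <= value < b_i)."""
    return bisect.bisect_right(BOUNDS, value)


def next_qset(qset, j, h):
    """Derive QSet(t+1) from QSet(t) when moving from node j to node h."""
    result = set(qset)
    if h > j:
        for i in range(j + 1, h + 1):   # moving up: apply each crossed node
            result |= PLUS[i]
            result -= MINUS[i]
    else:
        for i in range(j, h, -1):       # moving down: undo each crossed node
            result -= PLUS[i]
            result |= MINUS[i]
    return result


qset_t = {"q1"}                                   # v_t = 4.5
qset_next = next_qset(qset_t, node_of(4.5), node_of(12))
print(sorted(qset_next))  # ['q2', 'q3']
```

Only the nodes between the old and the new position are touched, which is cheap when consecutive values are close (locality) and degrades toward a linear scan when they are far apart.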

Query Index: Problems of the BMQ-Index
– If the forthcoming data is quite different from the current data, many DMR nodes must be retrieved, much like a linear search.
– Supports only (l, u) style conditions; q4 and q5 are not registered.
– Does not work correctly on boundary conditions.
Example: let $v_t = 5.5$ and QSet(t) = {q1, q2}. If $v_{t+1} = 5$, then QSet(t+1) is also {q1, q2}, but the actual query set of $v_{t+1}$ is {q1}.
(Figure: DMR list DN1 to DN6 with its stream table)
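The boundary problem can be reproduced by comparing a region lookup with direct predicate evaluation. The per-region query sets and boundaries are derived from q1 to q3 under the assumption that a value is assigned to a region by $b_{i-1} \le v < b_i$ and that q1's lower bound is inclusive; all names are hypothetical.

```python
import bisect

BOUNDS = [1, 5, 7, 10]
# Queries considered active in each region [b_{i-1}, b_i) by the index.
ACTIVE = [set(), {"q1"}, {"q1", "q2"}, {"q1", "q2", "q3"}, {"q2", "q3"}]
PREDICATES = {
    "q1": lambda a: 1 <= a < 10,
    "q2": lambda a: a > 5,
    "q3": lambda a: a > 7,
}


def indexed(value):
    """Query set reported by the region lookup."""
    return ACTIVE[bisect.bisect_right(BOUNDS, value)]


def exact(value):
    """Query set obtained by evaluating every predicate directly."""
    return {q for q, pred in PREDICATES.items() if pred(value)}


print(sorted(indexed(5)), sorted(exact(5)))  # ['q1', 'q2'] ['q1']
```

The value 5 falls into the region that starts at 5, so the lookup reports q2 even though q2 requires R.a > 5; strictly inside a region, such as at 5.5, the two results agree.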

References
[1] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik, "Monitoring streams–a new class of data management applications", In Proc. 28th Intl. Conf. on Very Large Data Bases, Aug.
[2] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, J. Widom, "Stream: The stanford stream data manager", IEEE Data Engineering Bulletin, Vol 26, No 1.
[3] J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, M. A. Shah, "Adaptive query processing: Technology in evolution", IEEE Data Engineering Bulletin, Vol 23, No 2, pp. 7-18.
[4] R. Avnur, J. M. Hellerstein, "Eddies: Continuously adaptive query processing", In Proceedings of ACM SIGMOD Conference.
[5] Brian Babcock et al., "Chain: Operator scheduling for Memory minimization in Data Stream Systems", ACM SIGMOD.
[6] Don Carney et al., "Operator Scheduling in a Data Stream Manager", VLDB 2003.
[7] B. Pielech, "Adaptive scheduling algorithm selection in a streaming query system", Master thesis, Worcester Polytechnic Institute.
[8] N. Tatbul, U. Çetintemel, S. Zdonik, M. Cherniack, M. Stonebraker, "Load shedding in a data stream manager", VLDB.
[9] S. Babu, R. Motwani, K. Munagala, I. Nishizawa, J. Widom, "Adaptive ordering of pipelined stream filters", In Proceedings of ACM SIGMOD Conference, pp. 407–418, 2004.
[10] S. Madden, M. A. Shah, J. M. Hellerstein, V. Raman, "Continuously adaptive continuous queries over streams", In Proceedings of ACM SIGMOD Conference.
[11] Jinwon Lee, Seungwoo Kang, Youngki Lee, SangJeong Lee, and Junehwa Song, "BMQ-Processor: A High-Performance Border Crossing Event Detection Framework for Large-scale Monitoring Applications", IEEE Transactions on Knowledge and Data Engineering (TKDE), Vol. 21, No. 2, February 2009.

References (continued)
[12] S. Madden et al., "TAG: Aggregation Service for Ad-Hoc Sensor Networks", OSDI, 2002.
[13] N. Shrivastava et al., "Medians and Beyond: New Aggregation Techniques for Sensor Networks", ACM SenSys 2004.
[14] N. Trigoni et al., "Multi-Query Optimization for Sensor Networks", DCOSS 2005.
[15] N. Trigoni et al., "Routing and Processing Multiple Aggregate Queries in Sensor Networks", ACM SenSys.
[16] A. Deshpande et al., "Model-Driven Data Acquisition in Sensor Networks", VLDB.
[17] D. Chu et al., "Approximate Data Collection in Sensor Networks using Probabilistic Models", ICDE, 2006.
[18] D. Tulone et al., "PAQ: Time Series Forecasting For Approximate Query Answering In Sensor Networks", European Conf. Wireless Sensor Networks, 2006.
[19] A. Deligiannakis et al., "Compressing Historical Information in Sensor Networks", ACM SIGMOD 2004.
[20] A. Jain et al., "Adaptive Stream Resource Management Using Kalman Filters", ACM SIGMOD 2004.
[21] X. Yang et al., "In-Network Execution of Monitoring Queries in Sensor Networks", ACM SIGMOD.
[22] M. Stern et al., "Towards Efficient Processing of General-Purpose Joins in Sensor Networks", ICDE.
[23] A. Pandit et al., "Communication-Efficient Implementation of Range-Joins in Sensor Networks", International Conference on Database Systems for Advanced Applications (DASFAA), 2006.
[24] H. Yu et al., "In-Network Join Processing for Sensor Networks", APWeb.
[25] A. Coman et al., "On Join Location in Sensor Networks", MDM.
[26] H.S. Lin, J.G. Lee, M.J. Lee, K.Y. Whang, I.Y. Song, "Continuous Query Processing in Data Streams Using Duality of Data and Queries", ACM SIGMOD.
[27] B. Mozafari, C. Zaniolo, "Optimal Load Shedding with Aggregates and Mining Queries", ICDE.