Evaluating Window Joins over Unbounded Streams Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, Univ. of Wisconsin-Madison.

Evaluating Window Joins over Unbounded Streams Jaewoo Kang Jeffrey F. Naughton Stratis D. Viglas {jaewoo, naughton, Univ. of Wisconsin-Madison ICDE’03 Bangalore, India

Outline of the talk: Introduction: Continuous Queries over Unbounded Streams; Measuring the Cost of Sliding Window Joins; On Maximizing the Efficiency of Processing Joins; Summary.

Sliding Windows: Handling internal state is a big challenge. Two ways to produce approximate answers: sliding windows (toss out expired tuples) and synopses (resort to reduced answer precision).

A Simple Sliding Window Query: On arrival of a new tuple to window A: (1) scan window B and propagate matching tuples; (2) insert the new tuple into window A; (3) invalidate all expired tuples in window A. [Figure: windows A and B, labeled with λa, λb, λaTa, and λbTb.]
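The three steps above can be sketched as a small symmetric nested-loop join over two time-based windows. This is a minimal illustration, not the authors' implementation: the class name `WindowJoin`, the `(timestamp, key, payload)` tuple layout, and the equality-on-key join predicate are all assumptions. It also expires stale tuples from the opposite window before probing, which the slide does not list as a step.

```python
from collections import deque


class WindowJoin:
    """Symmetric nested-loop join over two time-based sliding windows (sketch)."""

    def __init__(self, span_a, span_b):
        # span_x: how long (in time units) a tuple stays valid in window x.
        self.span = {"A": span_a, "B": span_b}
        # Each window holds (timestamp, key, payload) in arrival order.
        self.window = {"A": deque(), "B": deque()}

    def _expire(self, side, now):
        # Arrival order means the oldest tuples sit at the front.
        win = self.window[side]
        while win and win[0][0] <= now - self.span[side]:
            win.popleft()

    def on_arrival(self, side, ts, key, payload):
        """Process one tuple; timestamps are assumed to be non-decreasing."""
        other = "B" if side == "A" else "A"
        # Assumption (not on the slide): drop stale tuples before probing.
        self._expire(other, ts)
        # Step 1: scan the opposite window and collect matching tuples.
        matches = []
        for _, other_key, other_payload in self.window[other]:
            if other_key == key:
                pair = (payload, other_payload) if side == "A" else (other_payload, payload)
                matches.append(pair)
        # Step 2: insert the new tuple into its own window.
        self.window[side].append((ts, key, payload))
        # Step 3: invalidate expired tuples in its own window.
        self._expire(side, ts)
        return matches
```

Each result pair is ordered as (A payload, B payload) regardless of which stream triggered the probe.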

Some interesting questions: How should we measure the efficiency of a sliding window join evaluation strategy? Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?

Interesting questions (Cont’d) How should we allocate computing resources between the two windows to maximize join efficiency? If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

Cost Model: A unit-time basis cost model: the aggregate cost of processing the tuples that arrive at each window in one time unit.

Cost Model (Cont’d): The cost formula can be divided into two independent groups, one for each input stream. Thus, join algorithms can be evaluated for each join direction independently.

Cost of One-way NLJ: P(D) is the cost of accessing one tuple in data structure D during a search operation; I(D) is the cost of accessing one tuple in data structure D during an update operation. The cost is the total number of tuples processed in a time unit multiplied by the tuple access cost.

Cost of One-way HJ: |B| is the number of hash buckets in window B; B/|B| is the number of tuples in a hash bucket. Hash buckets are implemented to preserve tuple arrival order, which avoids invalidation overhead.
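One way to read the arrival-order remark: if every bucket keeps its tuples in arrival order, expired tuples always sit at the front of the bucket and can be dropped during a probe without a separate invalidation scan. The sketch below illustrates that reading only; the class `HashWindow`, its methods, and the lazy expire-on-probe policy are assumptions rather than details given on the slide.

```python
from collections import deque


class HashWindow:
    """Hash-partitioned sliding window whose buckets preserve arrival order (sketch)."""

    def __init__(self, num_buckets, span):
        self.span = span  # time units a tuple stays valid
        self.buckets = [deque() for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, ts, key, payload):
        # Appending keeps each bucket sorted by arrival time.
        self._bucket(key).append((ts, key, payload))

    def probe(self, key, now):
        bucket = self._bucket(key)
        # Expired tuples can only be at the front, so popping stops at
        # the first live tuple; no full scan for invalidation is needed.
        while bucket and bucket[0][0] <= now - self.span:
            bucket.popleft()
        return [payload for _, k, payload in bucket if k == key]
```

With |B| buckets and B tuples in the window, a probe touches about B/|B| tuples, which is the quantity the slide defines.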

Cost of One-way T-tree INLJ: N is the size of a T-tree node (number of tuples); B/N is the total number of nodes in a T-tree.

Implementation. Implemented: four join algorithms (NLJ, HJ, BJ, and TJ), an asymmetric join operator, and a stream emulator. System: Java HotSpot VM 1.4; AMD Athlon XP, 1533 MHz, 1 GB memory; Windows XP Professional.

Fitting Parameters in the Model: Process 60 seconds' worth of tuples without intermittent delays, at 20 different points with increasing workload rates. Then equate the measured values with the cost formula and solve the equation. Hash bucket size = 10 and T-tree node size = 100 were used. P(N) = 3×10^-4, P(H) = 5.5×10^-4, P(BT) = 2.6×10^-4, P(TT) = 2.6×10^-4; I(N) = 1×10^-4, I(H) = 7.8×10^-4, I(BT) = 2.6×10^-4, I(TT) = 2.7×10^-4.
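The slide does not say how the equations were solved. One plausible approach, sketched here under stated assumptions, is a least-squares fit of the linear model cost = s·P + u·I, where s and u are the numbers of tuples accessed during search and update operations in a run. The function `fit_access_costs` and the run layout are hypothetical, and the demonstration data are synthetic, generated from the slide's P(N) and I(N) values rather than measured.

```python
def fit_access_costs(runs):
    """Least-squares fit of (P, I) in cost = s * P + u * I.

    runs: iterable of (search_accesses, update_accesses, measured_cost).
    The runs must vary s and u independently; if they are proportional
    in every run, P and I cannot be separated and the fit is singular.
    """
    sum_ss = sum_uu = sum_su = sum_sc = sum_uc = 0.0
    for s, u, cost in runs:
        sum_ss += s * s
        sum_uu += u * u
        sum_su += s * u
        sum_sc += s * cost
        sum_uc += u * cost
    # Solve the 2x2 normal equations by Cramer's rule.
    det = sum_ss * sum_uu - sum_su * sum_su
    if det == 0:
        raise ValueError("search and update counts are collinear")
    p = (sum_sc * sum_uu - sum_su * sum_uc) / det
    i = (sum_ss * sum_uc - sum_su * sum_sc) / det
    return p, i


# Synthetic example: 20 runs generated from known costs (not measurements).
TRUE_P, TRUE_I = 3e-4, 1e-4
synthetic_runs = [
    (500 * n, 40 * (21 - n), 500 * n * TRUE_P + 40 * (21 - n) * TRUE_I)
    for n in range(1, 21)
]
fitted_p, fitted_i = fit_access_costs(synthetic_runs)
```

On noise-free synthetic data the fit recovers the generating values up to floating-point error; with real timings the residuals would indicate how well the linear model holds.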

Interesting questions: How should we measure the efficiency of a sliding window join evaluation strategy? Can a sliding window join algorithm take advantage of asymmetries in the two input stream speeds?

Taking Advantage of Asymmetry There are cases where an asymmetric combination of join algorithms outperforms symmetric counterparts! E.g. for some A, B
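A small sketch of why an asymmetric choice can win, restricted to the nested-loop (N) and hash (H) structures. The cost form is an assumption reconstructed for illustration, not a formula from the slides: the cost attributed to a window is (probe rate) × (tuples scanned per probe) × P plus 2 × (update rate) × I, with the factor 2 standing for one insert and one invalidation per tuple. It does reproduce the 106× and 18× ratios quoted on the TN-TH crossover slide. The per-tuple costs and the example configuration (A=9500, B=500, λa=2, λb=998) come from the slides; all function names are hypothetical, and the result tuple is ordered (structure on window A, structure on window B), which is this sketch's own convention.

```python
# Fitted per-tuple access costs from the parameter-fitting slide.
PROBE = {"N": 3e-4, "H": 5.5e-4}
UPDATE = {"N": 1e-4, "H": 7.8e-4}
BUCKET_SIZE = 10  # tuples per hash bucket, as used in the experiments


def window_cost(alg, window_size, probe_rate, update_rate):
    """ASSUMED per-time-unit cost attributable to one window."""
    # A nested-loop probe scans the whole window; a hash probe scans one bucket.
    scanned = window_size if alg == "N" else BUCKET_SIZE
    return probe_rate * scanned * PROBE[alg] + 2 * update_rate * UPDATE[alg]


def join_cost(alg_a, alg_b, size_a, size_b, rate_a, rate_b):
    # Window A is probed by stream B tuples and updated by stream A tuples,
    # and vice versa, so the two terms can be minimized independently.
    return (window_cost(alg_a, size_a, rate_b, rate_a)
            + window_cost(alg_b, size_b, rate_a, rate_b))


def best_combination(size_a, size_b, rate_a, rate_b, algs=("N", "H")):
    pairs = [(a, b) for a in algs for b in algs]
    return min(pairs, key=lambda ab: join_cost(ab[0], ab[1], size_a, size_b, rate_a, rate_b))


# Configuration taken from the join-performance slide.
choice = best_combination(9500, 500, 2, 998)
```

Under these assumptions the best choice for this skewed configuration is a hash structure on the large, heavily probed window A and a plain list on the small, heavily updated window B, which beats both symmetric combinations.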

Join Cost Estimation using Cost Model: Size of window A = 5000; size of window B = 5000. Five winning combinations: TN, TH, HH, HT, NT.

Measured Join Cost (CPU Time): A=5000, B=5000. Memory utilization: HJ (h=10) consumed 5% more than TJ (n=100). The same five winners: TN, TH, HH, HT, NT. The cost model prediction was accurate for both overall shape and crossover points. What if we increase window A and decrease window B (e.g. A=7000, B=3000 as opposed to the current 5000:5000)?

Cross-over Point TN-TH: The TN-TH crossover depends only on the size of window B. TN-TH = (B=500), meaning TNJ will outperform THJ when stream B is more than 106 times faster than stream A. TN-TH = (B=100), 18 times.
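The two ratios can be reproduced with a short worked calculation. The cost form below is an assumption reconstructed for illustration, not a formula shown on the slides: the cost attributed to window B is the probing cost of stream A tuples plus two update operations (insert and invalidate) per stream B tuple. The parameter values P(N), P(H), I(N), I(H) and the bucket size h = B/|B| = 10 are the fitted values reported earlier in the talk.

```latex
\begin{align*}
C_B^{\mathrm{N}} &= \lambda_a\, B\, P(N) + 2\lambda_b\, I(N) \\
C_B^{\mathrm{H}} &= \lambda_a\, h\, P(H) + 2\lambda_b\, I(H), \qquad h = B/|B| = 10 \\
C_B^{\mathrm{N}} < C_B^{\mathrm{H}}
  &\iff \frac{\lambda_b}{\lambda_a} > \frac{B\,P(N) - h\,P(H)}{2\,\bigl(I(H) - I(N)\bigr)} \\
B = 500:&\quad \frac{0.15 - 0.0055}{2\,(7.8 - 1)\times 10^{-4}} = \frac{0.1445}{0.00136} \approx 106.3 \\
B = 100:&\quad \frac{0.03 - 0.0055}{0.00136} = \frac{0.0245}{0.00136} \approx 18.0
\end{align*}
```

Only B appears on the right-hand side, which is consistent with the statement that this crossover depends only on the size of window B.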

Cross-over Point TH-HH: The TH-HH crossover depends only on the size of window A. If the size of window A increases, the crossover point TH-HH moves to the left, and vice versa.

Join Performance: two configurations, A=9500, B=500, λa=2, λb=998 and A=7000, B=3000, λa=800, λb=200.

Interesting questions (Cont’d) How should we allocate computing resources between the two windows to maximize join efficiency? If memory is the bottleneck, how should we allocate memory between the two windows for the two inputs?

Resource Allocation & Join Performance: Focus on cases where system resources are insufficient to fully support queries and workloads. Input streams are simply too fast to keep up with. The join operator is expensive to evaluate, so its service rate is lower than the input rates. System memory cannot hold both windows.

Resource Allocation & Join Performance (Cont’d): Approximate answers may be acceptable, e.g. for a query involving an aggregate (such as an average) over a join. The question is how to maximize the accuracy of the approximate answers given the limited resources. We use the insight that larger samples produce better answers, so the goal is to maximize the number of join result tuples. Care must be taken to ensure that the result produced is statistically comparable to a random sample of the full join result.

Limited Computing Resources λa=800, λb=200 A=100, B=200  =0.01, μ=100 Window Join Output Rate : w/ Effective Rates =

Limited Memory Resources λa=10, λb=50 M=1000,  =0.005 Window Join Output Rate =

Limited Memory & Computing Resources: μ=10, M=100  =0.01. The best performers are groups that allocate maximum computing resources to one stream and maximum memory to the other.
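The allocation results can be explored with a small sketch. It rests on several assumptions that the transcript does not confirm: the unnamed values 0.01 and 0.005 are taken to be join selectivities, window sizes are fixed tuple counts, and the join output rate is selectivity × (λa·B + λb·A), meaning each arriving tuple matches that fraction of the opposite window. The function names and the grid search are hypothetical.

```python
def output_rate(selectivity, rate_a, rate_b, win_a, win_b):
    """ASSUMED join output rate: each A tuple probes window B and vice versa."""
    return selectivity * (rate_a * win_b + rate_b * win_a)


def best_memory_split(selectivity, rate_a, rate_b, memory, step=100):
    """Try splits of `memory` tuples between window A and window B."""
    candidates = [(a, memory - a) for a in range(0, memory + 1, step)]
    return max(candidates,
               key=lambda ab: output_rate(selectivity, rate_a, rate_b, ab[0], ab[1]))


def best_cpu_split(selectivity, win_a, win_b, service_rate, rate_a, rate_b, step=10):
    """Try splits of the service rate into effective rates for streams A and B."""
    candidates = [
        (ea, service_rate - ea)
        for ea in range(0, service_rate + 1, step)
        # An effective rate cannot exceed the stream's arrival rate.
        if ea <= rate_a and service_rate - ea <= rate_b
    ]
    return max(candidates,
               key=lambda e: output_rate(selectivity, e[0], e[1], win_a, win_b))


# Parameters from the limited-memory slide: rates 10 and 50, M=1000, 0.005.
memory_choice = best_memory_split(0.005, 10, 50, 1000)
# Parameters from the limited-computing slide: A=100, B=200, service rate 100.
cpu_choice = best_cpu_split(0.01, 100, 200, 100, 800, 200)
```

Because the assumed output rate is linear in each allocation, the optimum always lies at an extreme: all memory goes to the window probed by the faster stream, and all processing capacity goes to the stream that probes the larger window. Whether such an extreme allocation still yields a statistically representative sample is a separate concern that this sketch does not address.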

Summary: Introduced a unit-time basis cost model and experimentally validated it. Extended the traditional join framework to include asymmetric combinations of join algorithms. Investigated resource allocation strategies for improving the accuracy of approximate answers. Developed a powerful optimization framework for sliding window join queries by addressing these issues in a unified manner.