Ripple Joins for Online Aggregation, by Peter J. Haas and Joseph M. Hellerstein, published in June 1999. Presented by Ronda Hilton.

Overview This paper shows how to join several tables and compute SUM, COUNT, or AVG aggregates with GROUP BY clauses while showing approximate results immediately. The approximate result and its confidence interval are computed from the first few tuples retrieved, and a GUI display is updated with closer approximations as the join adds more tuples.

Ripple joins compared to our previous topics
- General research area: algorithms
- another approximation algorithm
- online processing
- not maintaining a sample set
- aggregate queries: joins and group-by
- requires random retrieval
- uses probabilistic calculations to determine the quality of the approximate result
- not optimizing
- implemented as middleware on the DBMS

Traditional Hash Join
Stores the smaller relation in memory. Given two relations R and S with a common attribute, match up the tuples that have the same value of that attribute.
Example:
    SELECT R.roomnumber, COUNT(S.homeroom)
    FROM Rooms R JOIN Student S ON R.roomnumber = S.homeroom
    GROUP BY R.roomnumber
Pseudocode:
    for each tuple r in R:
        add r to the in-memory hashtable under hash(r.roomnumber)
        if the hashtable has filled up memory:
            for every tuple s in S:
                if hash(s.homeroom) is found in the hashtable:
                    add the matching tuple r and tuple s to the output
            reset the hashtable
    finally, scan S against the remaining hashtable and add the resulting join tuples to the output
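As a rough illustration of the build-then-probe idea on this slide, here is a minimal Python sketch. It assumes the smaller relation fits entirely in memory (so the "hashtable has filled up memory" branch is never needed); the function name `hash_join_count` and the sample data are hypothetical and only mirror the Rooms/Student example.

```python
def hash_join_count(rooms, students):
    """Count students per room with a simple build-then-probe hash join.

    rooms:    iterable of room numbers (relation R, assumed to fit in memory).
    students: iterable of (name, homeroom) pairs (relation S, scanned once).
    """
    # Build phase: hash every room number of R into an in-memory table.
    counts = {room: 0 for room in rooms}
    # Probe phase: scan S and look each homeroom up in the table.
    for _name, homeroom in students:
        if homeroom in counts:
            counts[homeroom] += 1
    # Inner-join semantics: rooms without any student produce no output row.
    return {room: n for room, n in counts.items() if n > 0}


if __name__ == "__main__":
    rooms = [101, 102, 103]
    students = [("Ann", 101), ("Bob", 101), ("Cy", 103), ("Di", 999)]
    print(hash_join_count(rooms, students))
```

Note that nothing is returned until all of S has been scanned, which is the blocking behaviour that ripple join is meant to avoid.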

What's different about ripple join? A traditional hash join blocks until the entire query output is finished. Ripple join reports approximate results after each sampling step and allows user intervention. A traditional join scans an entire table in its inner loop; ripple join instead expands the sample set incrementally.

The most important difference The tuples are processed in random order.

Pipelining
- In pipelining join algorithms, as the join progresses, more and more information gets added to the result.
- In ripple joins, each new tuple gets joined with all previously-seen tuples of the other operand(s).
- The relative rates of the two (or more) operands are dynamically adjusted.
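The "each new tuple gets joined with all previously-seen tuples" step can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes two non-empty in-memory lists of join-key values, a fixed rate of one tuple per relation per step (no dynamic adjustment), and a scaled-up running COUNT estimate. The function name `ripple_join_count` and its parameters are hypothetical.

```python
import random


def ripple_join_count(r_keys, s_keys, seed=0):
    """Yield (step, estimated join COUNT) after each sampling step.

    r_keys, s_keys: join-key values of relations R and S (assumed non-empty).
    Tuples are visited in random order, one from each relation per step.
    """
    rng = random.Random(seed)
    r_order, s_order = list(r_keys), list(s_keys)
    if not r_order or not s_order:
        return
    rng.shuffle(r_order)
    rng.shuffle(s_order)

    seen_r, seen_s = [], []
    matches = 0
    for step in range(max(len(r_order), len(s_order))):
        if step < len(r_order):
            r = r_order[step]
            # Join the new R tuple with every S tuple seen so far.
            matches += sum(1 for s in seen_s if s == r)
            seen_r.append(r)
        if step < len(s_order):
            s = s_order[step]
            # Join the new S tuple with every R tuple seen so far
            # (including the R tuple added in this same step).
            matches += sum(1 for r in seen_r if r == s)
            seen_s.append(s)
        # Scale the matches found in the sampled rectangle up to the full join.
        scale = (len(r_order) * len(s_order)) / (len(seen_r) * len(seen_s))
        yield step + 1, matches * scale


if __name__ == "__main__":
    for step, estimate in ripple_join_count([1, 2, 2, 3], [2, 2, 3, 4]):
        print(step, estimate)
```

Once both relations are exhausted the scale factor is 1, so the last estimate equals the exact join count.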

Worst-case scenario Ripple join reduces to a nested loop join.

The relations do not have to be of roughly equal size. Aspect ratio: how many tuples are retrieved from each base relation per sampling step, e.g. $\beta_1 = 1$, $\beta_2 = 3$, … Ripple join adjusts the aspect ratio according to the sizes of the base relations.
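To make the aspect ratio concrete, the sketch below lists how many tuples of each relation have been seen after every sampling step when 1 tuple of R and 3 tuples of S are retrieved per step. It assumes a fixed aspect ratio for the whole run (the adaptive adjustment is not modelled), and the name `rectangular_steps` is hypothetical.

```python
def rectangular_steps(n_r, n_s, beta_r=1, beta_s=3):
    """Yield (tuples of R seen, tuples of S seen) after each sampling step.

    n_r, n_s:       sizes of the two base relations.
    beta_r, beta_s: tuples retrieved per step from R and S (the aspect ratio).
    """
    if beta_r < 1 or beta_s < 1:
        raise ValueError("each relation must advance by at least one tuple per step")
    seen_r = seen_s = 0
    while seen_r < n_r or seen_s < n_s:
        # Grow the sampled rectangle by beta_r rows and beta_s columns,
        # stopping at the edge of each relation.
        seen_r = min(n_r, seen_r + beta_r)
        seen_s = min(n_s, seen_s + beta_s)
        yield seen_r, seen_s


if __name__ == "__main__":
    print(list(rectangular_steps(3, 7)))  # [(1, 3), (2, 6), (3, 7)]
```

With these values the sampled region is a rectangle three times as wide as it is tall until S runs out.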

Rectangular version

What can the end user control?
- How many groups continue to process. Any one group can be stopped. All other groups will continue to process (faster).
- The speed of the query selection process. What happens to make the process faster? More tuples are skipped in the aggregation, so the approximation will be less accurate, and the confidence interval will be wider.
The end user controls the trade-off between speed and accuracy.

GUI, 1999

Confidence interval A running confidence interval displays how close this answer is to the final result. This could be calculated in many ways. The authors present an example calculation built on extending the Central Limit Theorem.

Central Limit Theorem $\hat{\mu}_n$, the average of the $n$ values in the sample, is an estimator for the true mean $\mu$; it is a random quantity. CLT: for large $n$ (e.g. after joining 30 tuples), $\hat{\mu}_n$ has approximately a normal distribution with mean $\mu$ and variance $\sigma^2 / n$.

Random variable Z Shift and scale $\hat{\mu}_n$ to get a "standardized" random variable $Z = (\hat{\mu}_n - \mu) / (\sigma / \sqrt{n})$. For large $n$, $Z$ has approximately a standard normal distribution. There are a lot of ways to compute the $z_p$ values.

"Interval" column on the GUI The authors use $\hat{\sigma}_n^2$ as an estimator for the true variance $\sigma^2$, giving $\epsilon_n = z_p \hat{\sigma}_n / \sqrt{n}$. This is the quantity displayed as the half-width of the confidence interval.

Why call this "Ripple Join"? 1. The algorithm seems to ripple out from a corner of the join. 2. Acronym: "Rectangles of Increasing Perimeter Length".

Variants of ripple join
- Block ripple join
- Index ripple join
- Hash ripple join

Performance

Further publications
- Eddies: Continuously Adaptive Query Processing, by Ron Avnur and Joseph M. Hellerstein, SIGMOD 2000, Dallas
- Confidence Bounds for Sampling-Based GROUP BY Estimates, by Fei Xu, Christopher Jermaine, and Alin Dobra, ACM Trans. Database Syst. 33, 3 (Aug. 2008)
- Wavelet synopsis for hierarchical range queries with workloads, by Sudipto Guha, Hyoungmin Park, and Kyuseok Shim, VLDB Journal (2008) 17:1079–1099

Questions?