Ripple Joins for Online Aggregation
Peter J. Haas, Joseph M. Hellerstein
Joseph, Z.M. – CSE, UTA, 2/16/2006
Ripple Joins: Introduction
- Follow-up to the earlier work on online aggregation.
- Extends online aggregation to a family of join algorithms.
- Allows online aggregation to be used on multi-table queries.
Ripple Joins: Introduction
- Targets queries of the form:
  SELECT op(expression) FROM R1, R2, …, RK WHERE predicate GROUP BY columns;
- Running estimates are calculated from the statistical properties of the data seen so far.
- The user can control how frequently the estimate is updated.
Ripple Join vs. Online Nested Loop
- Problems with the online nested-loop join:
  - If one table is large, there is a long wait between updates.
  - The confidence interval may not narrow enough.
- Ripple join avoids scanning a complete relation between updates.
Ripple Join: Operation
- Assume a ripple join of relations R and S.
- Select a random tuple r from R.
- Join r with the previously selected S tuples.
- Select a random tuple s from S.
- Join s with the previously selected R tuples (including r).
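The steps above can be sketched in Python as follows. This is a minimal in-memory illustration, not the authors' implementation: the function name `square_ripple_join`, the `predicate` callback, and the use of a seeded shuffle to stand in for random sampling without replacement are all assumptions made for the example.

```python
import random

def square_ripple_join(R, S, predicate, seed=0):
    """Yield (n, matches) after each sampling step of a square ripple join."""
    rng = random.Random(seed)
    # A random permutation stands in for sampling without replacement.
    r_order = rng.sample(R, len(R))
    s_order = rng.sample(S, len(S))
    seen_r, seen_s, matches = [], [], []
    for n in range(min(len(R), len(S))):
        r = r_order[n]
        # The new R tuple joins with every S tuple seen so far.
        matches.extend((r, s) for s in seen_s if predicate(r, s))
        seen_r.append(r)
        s = s_order[n]
        # The new S tuple joins with every R tuple seen so far, including r.
        matches.extend((x, s) for x in seen_r if predicate(x, s))
        seen_s.append(s)
        # After step n + 1, every pair in an (n + 1) x (n + 1) square was checked once.
        yield n + 1, list(matches)
```

Each yielded snapshot is a point at which a running estimate could be refreshed. When both inputs have the same size, the final snapshot contains the complete join result.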
Ripple Join: Square Two-Table Join
[Figure, four frames, N = 1 through N = 4: grid with S on one axis; X marks the tuple pairs joined so far, with the marked region growing at each sampling step.]
Ripple Join: Operation
- This resembles a nested-loop join, but alternates between sampling from and scanning each relation.
- Aspect ratios can be non-unitary:
  - Select more samples from one table than from the other.
  - This leads to a rectangular ripple.
  - Configurable by the user.
Enhanced Ripple Join Iterator: Rectangular
- Requires special handling by the iterator to ensure that the ripple grows correctly.
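A rectangular ripple can be sketched by drawing a fixed number of new tuples from each relation per step. The parameter names `beta_r` and `beta_s` (tuples drawn per step from R and S) and the function name are assumptions for illustration, and the inputs are shuffled in memory rather than read through a real iterator interface.

```python
import random

def rectangular_ripple_join(R, S, predicate, beta_r=1, beta_s=2, seed=0):
    """Yield (|seen R|, |seen S|, matches) after each rectangular sampling step."""
    rng = random.Random(seed)
    r_order = rng.sample(R, len(R))
    s_order = rng.sample(S, len(S))
    seen_r, seen_s, matches = [], [], []
    # Stop when either relation cannot supply a full batch.
    steps = min(len(R) // beta_r, len(S) // beta_s)
    for n in range(steps):
        new_r = r_order[n * beta_r:(n + 1) * beta_r]
        new_s = s_order[n * beta_s:(n + 1) * beta_s]
        # New R tuples against the S tuples from earlier steps.
        matches.extend((r, s) for r in new_r for s in seen_s if predicate(r, s))
        seen_r.extend(new_r)
        # New S tuples against all R tuples seen so far, old and new,
        # so that no pair is skipped and none is counted twice.
        matches.extend((r, s) for s in new_s for r in seen_r if predicate(r, s))
        seen_s.extend(new_s)
        yield len(seen_r), len(seen_s), list(matches)
```

With `beta_r=1` and `beta_s=2` the sampled region is twice as long along S as along R at every step, which is the "rectangular ripple" the slide refers to.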
Pipelined Ripple Join
- Can easily be pipelined for multiple binary joins.
- Cannot do three-table joins as two binary ripple joins.
- The authors recommend additional steps to handle building such K-dimensional hyperrectangles.
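Block Ripple Join
- Takes disk blocks of R and S in turn, rather than single tuples.
- Read a disk block of R and join it against the S tuples seen so far.
- Evict the block from memory.
- Read a block of S and join it against the R tuples seen so far.
- Exactly the same growth pattern as the tuple-at-a-time ripple join, except that each new layer is a block thick.
- Saves I/O because each rescan of the previously seen tuples serves a whole block instead of a single tuple.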
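The I/O argument can be illustrated with a small sketch that counts how many passes are made over previously seen tuples. The `rescans` counter is a hypothetical stand-in for I/O cost, the function name is invented for this example, and the inputs are assumed to be already in random order.

```python
def block_ripple_join(R, S, predicate, block_size):
    """Return (matches, rescans) for a square block ripple join.

    R and S are assumed to be pre-shuffled lists; rescans counts passes
    over the previously seen tuples of the other relation.
    """
    seen_r, seen_s, matches = [], [], []
    rescans = 0
    steps = min(len(R), len(S)) // block_size
    for n in range(steps):
        lo, hi = n * block_size, (n + 1) * block_size
        r_block, s_block = R[lo:hi], S[lo:hi]
        # One pass over the old S tuples serves the whole R block.
        rescans += 1
        matches.extend((r, s) for s in seen_s for r in r_block if predicate(r, s))
        seen_r.extend(r_block)
        # One pass over all R tuples seen so far serves the whole S block.
        rescans += 1
        matches.extend((r, s) for r in seen_r for s in s_block if predicate(r, s))
        seen_s.extend(s_block)
    return matches, rescans
```

For two 8-tuple inputs, a block size of 1 makes 16 passes while a block size of 4 makes 4, and both produce the same matches.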
Further Variations of Ripple Joins
- Index ripple join: identical to the index-enhanced nested-loop join.
- Hash ripple join: usable only for equijoins.
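A hash ripple join for an equijoin can be sketched with one in-memory hash table per relation, so that each new tuple probes the other side's table instead of rescanning it. The function and parameter names (`hash_ripple_join`, `key_r`, `key_s`) are assumptions for this example, and the inputs are assumed to arrive in random order.

```python
from collections import defaultdict

def hash_ripple_join(R, S, key_r, key_s):
    """Yield the matches found so far after each square sampling step."""
    hash_r, hash_s = defaultdict(list), defaultdict(list)
    matches = []
    for r, s in zip(R, S):
        # Probe the S table with the new R tuple, then remember r.
        matches.extend((r, t) for t in hash_s.get(key_r(r), ()))
        hash_r[key_r(r)].append(r)
        # Probe the R table (which now includes r) with the new S tuple.
        matches.extend((t, s) for t in hash_r.get(key_s(s), ()))
        hash_s[key_s(s)].append(s)
        yield list(matches)
```

The restriction to equijoins follows from the lookup itself: a hash probe can only find tuples whose join key is equal to the probing key.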
Statistics
- As with online aggregation, ripple joins support continuously updated running estimates.
- The estimator is unbiased and consistent.
- The running average is biased but consistent.
- Capable of giving tight confidence intervals.
- The variance can also be calculated.
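The point estimates can be sketched by scaling what has been seen in the sampled rectangle up to the full cross product. This is a simplified illustration only: the function name and arguments are assumptions, and the variance and confidence-interval formulas are deliberately left out because they are more involved than a short sketch can cover.

```python
def running_estimates(values, n_r, n_s, size_r, size_s):
    """Scale sample results from an n_r x n_s rectangle up to size_r x size_s.

    values holds the expression value of each joined pair that satisfied
    the predicate so far. Returns (sum estimate, count estimate, avg estimate).
    """
    # Fraction of the cross product examined is (n_r * n_s) / (size_r * size_s).
    scale = (size_r * size_s) / (n_r * n_s)
    est_sum = scale * sum(values)
    est_count = scale * len(values)
    # The average is a ratio of two estimates, which is why it can be
    # biased even though the sum and count estimates are not.
    est_avg = est_sum / est_count if est_count else None
    return est_sum, est_count, est_avg
```

For example, with two matching pairs of values 2.0 and 4.0 found in a 2 x 2 sample of a 10 x 20 cross product, the scale factor is 50, giving a sum estimate of 300.0, a count estimate of 100.0, and an average estimate of 3.0.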
Optimization and Design
- Aspect ratios can be chosen.
- Animation speed: the rate at which rectangles are swept out.
- The aim is to maximize the rate of updates.
- Make the confidence interval narrow as fast as possible.
Conclusion
- Gives users visible progress as the query homes in on the average.
- Useful UI enhancement.
- Reaches a reasonable answer up to two orders of magnitude faster than conventional offline techniques.
- Sublinear confidence interval guarantee.
- Prototypes in Informix and IBM DB2.
References
- Haas & Hellerstein, “Ripple Joins for Online Aggregation” (SIGMOD ’99)
- Haas & Hellerstein, “Online Query Processing: A Tutorial”
- Elmasri & Navathe, “Fundamentals of Database Systems”