Download presentation
Presentation is loading. Please wait.
Published byTobias George Modified over 9 years ago
2
Online Aggregation Joe Hellerstein UC Berkeley
3
Online Aggregation: Motivation Select AVG(grade) from ENROLL; A “fancy” interface: + Query Results AVG 3.262574342
4
A Better Approach Don’t process in batch! Online aggregation:
5
Isn’t This Just Sampling? Yes –we need values to arrive in random order –“confidence intervals” similar to sampling work … and No! –stopping condition set on the fly! –statistical techniques are more sophisticated –can handle GROUP BY w/o a priori knowledge...
6
Grouping Select AVG(grade) from ENROLL GROUP BY major;
7
Requirements Usability –Continuous output non-blocking query plans –time/precision control –fairness/partiality Performance –time to accuracy –time to completion –pacing
8
Fairness/Partiality Speed controls!
9
Applications Large-scale Data Analysis –Online drill-down/roll-up/CUBE –Visualization tools –Data mining –Distributed requests Generally: –Any long-running activity should be visualized and controllable on line. –Accuracy rarely critical for non-lookups
10
A Naïve Approach Select onln_avg(grade) from ENROLL; Can do it in Illustra/Informix today! But... –No grouping –Can’t meet performance & usability needs: no guarantee of continuous output –optimized for completion time! no guarantee of fairness (or control over partiality) no control over pacing
11
Random Access to Data Heap Scan –OK if clustering uncorrelated to agg & grouping attrs Index Scan – can scan an index on attrs uncorrelated to agg or grouping (or index on random()) Sampling: – could introduce new sampling access methods (e.g. Olken’s work)
12
Group By & Distinct Fair, Non-Blocking Group By/Distinct –Can’t sort! Must use hash-based techniques sorting blocks sorting is unfair –Hybrid hashing! especially for duplicate elimination. –“Hybrid Cache” even better.
13
Index Striding For fair Group By: – want random tuple from Group 1, random tuple from Group 2,... –Idea: Index gives tuples from a single group. –Sol’n: one access method opens many cursors in index, one per group. Fetch round-robin. –Can control speed by weighting the schedule –Gives fairness/partiality, info/speed match! (Next step: “heap stride”)
14
Join Algorithms Non-Blocking Joins –no sorting! –merge join OK, but watch “interesting orders”! –hybrid hash not great –symmetric “pipeline” hash [Wilschut/Apers 91] –nested loops always good, can be too slow –optimization...?
15
Query Optimization 2 components in cost function: –dead time (t d ): time spent doing “invisible” work -- tax this at a high rate! –output time (t o ): time spent producing output –this will do a lot automagically! User control vs. performance? –can use competition (e.g., a la Rdb) “interesting” (i.e. bad!) ordering
16
Extended Aggregate Functions Aggs need to return tuples in the middle of processing –add a 4th “current estimate” function open, iterate, estimate, close –sometimes like close function (AVG), sometimes not (COUNT) –aggregation code needs to spit this out to the front-end
17
API Current API uses built-in methods –button-press = query invocation e.g., select StopGroup( val ); High overhead, need to know internals. Very flexible. Easy to code! –Estimates returned as tuples no distinction between estimates and final answer OK? –A general API for running status?
18
Pacing Inter-tuple speed is critical!!
19
Statistical Issues Confidence Intervals for SQL aggs –given an estimate, probability p that we’re within of the right answer 3 types of estimates Conservative (Hoeffding’s inequality) Large-Sample (Central Limit Theorems) Deterministic Previous work + new results from Peter Haas
20
Initial Implementation prototype in PostgreSQL (a/k/a Postgres95, a/k/a PG-Lite) –aggs with running output –hash-based grouping, dup elimination –index striding –optimizer tweaks –“API” hacks and a simple UI
21
Pacing study Select AVG(grade), Interval(0.99) From ENROLL; Normal plan
22
Access Methods, Big Group Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; College L: 925K/1.5 Mtuples
23
Access Methods, Small Group Surprise!!! Cost of API call? Select AVG(grade), Interval(0.99) From ENROLL GROUP BY college; College S: 15K/1.5 Mtuples
24
Future Work Better UI –simple example: histograms w/error bars –online data visualization (Tioga DataSplash) data viz = “graphical” aggregate sampled points, or sampled wavelet coefficients? great for panning, zooming Nested Queries –Tie-ins with AI’s “anytime algorithms”
25
Future Work II Control w/o Indices –Heap Striding = do multiple scans optimization 1: fast scans can pass hints to slow optimization 2: “piggyback” scans via buffering big payoff for complex queries Checkpointing/continuation –also continuous data streams
26
Future Work III Sample Tracking –important for financial auditing, statistical quality control –cached views, RID-lists, logical views More stats: –non-standard aggs –simultaneous confidence intervals –allow for slop in cardinality estimation
27
Summary DSS/OLAP users very demanding! –“speed of thought” performance required –only choices: precomputation or estimation –for estimation, need to provide control and steady, useful feedback puts user in the driver’s seat takes some performance burden away from system –Requires significant backend support but well worth the effort runs the “impossible” queries
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.