CONTROL: Continuous Output and Navigation Technology with Refinement On-Line
CONTROL group, UC Berkeley: Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie
Batch vs. On-Line Processing
Batch Processing
  – Gives 100% accurate answers, but users must wait for the entire query to finish...
On-Line Processing
  – Gives progressively refining answers as the query runs!
  – Allows users to control processing.
Applications of On-Line Processing
  – Large, ad-hoc queries in domains where approximate answers are acceptable (“big picture”)
Demo Outline
On-Line Aggregation
  – Refining estimates
      Statistics give confidence
  – User Control
      The user can speed up the processing of certain groups
      The user can stop the processing at any time
On-Line Visualization
  – Displays an approximation of an image based on data while the data is being fetched
      Shows the estimated density and distribution of the data
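The "refining estimates" and "stop at any time" points can be sketched as a running average that reports a confidence interval as rows arrive. This is a minimal illustration, not the CONTROL implementation: the function name `online_avg`, the `report_every` parameter, and the use of a normal-approximation (CLT) interval are assumptions made here, and the sketch only makes sense if rows arrive in random order.

```python
import math
import random

def online_avg(values, z=1.96, report_every=1000):
    """Yield (rows_seen, running_mean, half_width) while rows stream in.

    Assumes `values` arrives in random order, so the rows seen so far
    behave like a random sample. half_width is a normal-approximation
    interval (z=1.96 is roughly 95%); this is an illustrative choice.
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in values:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)          # Welford's running variance update
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(variance / n)

# Hypothetical data standing in for a column being aggregated.
random.seed(0)
data = [random.gauss(50.0, 10.0) for _ in range(20000)]

estimates = []
for rows_seen, mean, half_width in online_avg(data):
    estimates.append((rows_seen, mean, half_width))
    # A user could `break` here as soon as the interval is tight enough.
```

Because the estimator is a generator, the consumer decides when to stop, which mirrors the user control described on the slide.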
On-Line Agg.: Query Processing
New Access Methods
  – Randomly delivered data
  – Index Striding
      We can take advantage of B-Trees to access the groups
  – Heap Striding
      More generally, on-line permutation
Non-blocking Join Algorithms
  – Ripple Join Family
      RIPL = Rectangles of Increasing Perimeter Length
      Join progressively larger samples of two tables
Access Methods for On-Line Agg.
Heap Stride (On-Line Permutation)
  – Reorder tuples on the fly to get a fair sample
      Heap File:          AAABABACDCDAAA...
      Fair Sample Output: ABCDABCDABCD...
Index Stride
  – Round-robin through the groups to get a fair sample
      Works with an index on the grouping column
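The round-robin idea behind index striding can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `index_stride` is hypothetical, and an in-memory dictionary stands in for the index on the grouping column, whereas a real system would walk the index without first reading every row.

```python
from collections import defaultdict

def index_stride(rows, key):
    """Yield rows round-robin across groups, one row per group per pass.

    The dict below is a stand-in for an index on the grouping column
    (an assumption for this sketch); groups are visited in order of
    first appearance and dropped once they run out of rows.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    cursors = {g: iter(members) for g, members in groups.items()}
    while cursors:
        for g in list(cursors):           # copy keys so we can delete safely
            try:
                yield next(cursors[g])
            except StopIteration:
                del cursors[g]            # group exhausted

heap_file = "AAABABACDCDAAA"              # each character is a row's group
fair_output = "".join(index_stride(heap_file, key=lambda row: row))
print(fair_output)                        # prints ABCDABCDAAAAAA
```

Small groups such as B, C, and D receive rows as early as the large group A, so their estimates start refining immediately. Visiting a group more than once per pass would be one way to let a user speed up that group.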
Multi-Table On-Line Aggregation
Progressively refining join: Ripple Join
  – Ever-larger rectangles in R × S
  – Comes in naive, block, and hash flavors
[Figure: Traditional join vs. Ripple join, each drawn over axes R and S]
Benefits:
  – Samples from both relations simultaneously
  – Gives better statistical confidence much faster
  – Intimate relationship between delivery and estimation
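A naive "square" ripple join can be sketched as follows: each step draws one new tuple from each input and joins it against everything already seen from the other side, so the explored region of the cross product grows as an ever-larger rectangle. The function name `ripple_join` and the list-based inputs are assumptions of this sketch; it presumes both inputs are already in random order and omits the scaling needed to turn partial results into aggregate estimates.

```python
def ripple_join(R, S, predicate):
    """Naive square ripple join over two lists, yielding matches early.

    At step k the new R tuple is joined with all S tuples seen so far,
    then the new S tuple is joined with all R tuples seen so far
    (including the one just added). Every (r, s) pair is tested exactly
    once, so the final result equals a nested-loops join.
    """
    seen_r, seen_s = [], []
    for k in range(max(len(R), len(S))):
        if k < len(R):
            r = R[k]
            for s in seen_s:
                if predicate(r, s):
                    yield r, s
            seen_r.append(r)
        if k < len(S):
            s = S[k]
            for r in seen_r:
                if predicate(r, s):
                    yield r, s
            seen_s.append(s)

# Hypothetical single-column relations joined on equality.
R = [1, 2, 3, 4]
S = [2, 3, 3, 5]
ripple_result = list(ripple_join(R, S, lambda r, s: r == s))
nested_result = [(r, s) for r in R for s in S if r == s]
```

Matches are produced in a different order than a nested-loops join, but the complete result is the same; the block and hash flavors named on the slide change how the already-seen tuples are scanned and are not shown here.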
On-Line Aggregation User Interface
[Screenshot, labeled regions: User Controls; Graph of Estimates w/ Confidence Intervals; Estimates for Each Group]
On-Line Visualization: CLOUDS
CLOUDS displays an approximation of an image based on data while the data is being fetched
[Figure panels: Conventional Algorithm; CLOUDS Algorithm; CLOUDS (with Index)]
Note that CLOUDS predicts the high density of cities in the Midwest
Quantifying the benefit of CLOUDS
CLOUDS gives a better approximate image faster than the conventional algorithm
[Plot: Error vs. Time (seconds); series: Conventional, CLOUDS]