CONTROL Overview CONTROL group Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie, UC Berkeley.

Presentation transcript:

CONTROL Overview CONTROL group Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie, UC Berkeley Peter Haas, IBM Almaden

Context (wild assertions) Value from information –The pressing problem in CS (?) (!!) “Point” querying and data management is a solved problem –at least for traditional data (business data, documents) “Big picture” analysis still hard

Data Analysis Complex: people using many tools –SQL Aggregation (Decision Support Sys, OLAP) –AI-style WYGIWIGY systems (e.g. Data Mining, IR) Both are Black Boxes –Users must iterate to get what they want –batch processing (big picture = big wait) We are failing important users! –Decision support is for decision-makers! –Black box is the world’s worst UI

Black Box Begone! Black boxes are bad –cannot be observed while running –cannot be controlled while running These tools can be very slow –exacerbates previous problems Thesis: –there will always be slow computer programs, usually data-intensive –fundamental issue: looking into the box...

Crystal Balls Allow users to observe processing –as opposed to “lucite watches” Allow users to predict future Ideally, allow users to change future –online control of processing The CONTROL Project: –online delivery, estimation, and control for data-intensive processes

Performance Regime for CONTROL Online performance: –Maximize 1st derivative of the “mirth index” [Figure: plot over time, vertical axis up to 100%, with two curves labeled CONTROL and Traditional]

Examples Online Aggregation –Informix Dynamic Server Enhanced by UCB students with Control algorithms Lots of algorithmics, many fussy end-to-end system issues [Avnur, Hellerstein, Raman DMKD ’00] –IBM has ongoing project to do this in DB2 –IBM buys Informix (4/01) Online Visualization –Visual enumeration & aggregation Interactive data cleaning & analysis –Potter’s Wheel ABC –Online “enumeration” and discrepancy detection

Example: Online Aggregation SELECT AVG(gpa) FROM students GROUP BY college

Example: Online Data Visualization In Tioga DataSplash

Visual Transformation Shot

Scalable Spreadsheets

Decision-Support in DBMSs Aggregation queries –compute a set of qualifying records –partition the set into groups –compute aggregation functions on the groups –e.g.: Select college, AVG(grade) From ENROLL Group By college;

Interactive Decision Support? Precomputation –the typical “OLAP” approach (a.k.a. Data Cubes) –doesn’t scale, no ad hoc analysis –blindingly fast when it works Sampling –makes real people nervous? –no ad hoc precision: the sample is drawn in advance, so stats requirements can’t vary –per-query granularity only

Online Aggregation Think “progressive” sampling –a la images in a web browser –good estimates quickly, improve over time Shift in performance goals –online mirth index Shift in the science –UI emphasis drives system design –leads to different data delivery, result estimation –motivates online control
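A minimal sketch of the "good estimates quickly, improve over time" idea, assuming rows arrive in random order and using a large-sample (CLT) interval. The names `RunningStats`, `online_avg`, and `random_order` are illustrative and do not come from the CONTROL code; the 95% level is also an assumption.

```python
import math
import random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for an approximate 95% interval (assumed level)


class RunningStats:
    """Welford's online mean/variance for one group."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self):
        """CLT half-width z * s / sqrt(n); infinite until two values are seen."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return Z_95 * math.sqrt(variance / self.n)


def random_order(rows, seed=0):
    """Shuffle rows so that every prefix behaves like a random sample."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    return rows


def online_avg(rows, report_every=1000):
    """Yield {group: (estimate, half_width, n)} snapshots while scanning rows."""
    stats = defaultdict(RunningStats)
    for i, (group, value) in enumerate(rows, start=1):
        stats[group].add(value)
        if i % report_every == 0:
            yield {g: (s.mean, s.half_width(), s.n) for g, s in stats.items()}
```

Each snapshot is what a "fancy" interface would redraw: the estimate per group plus an interval that narrows roughly as one over the square root of the tuples seen for that group.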

Not everything can be CONTROLed “needle in haystack” scenarios –the nemesis of any sampling approach –e.g. highly selective queries, MIN, MAX, MEDIAN not useless, though –unlike presampling, users can get some info (e.g. max-so-far) we advocate a mixed approach –explore the big picture with online processing –when you drill down to the needles, or want full precision, go batch-style –can do both in parallel

New Techniques Online Reordering –gives control of group delivery rates –applicable outside the RDBMS setting Ripple Join family of join algorithms –comes in naïve, block & hash Statistical estimators & confidence intervals –for single-table & multi-table queries –for AVG, SUM, COUNT, STDEV –Leave it to Peter Visual estimators & analysis

Online Reordering users perceive data being processed over time –prioritize processing for “interesting” tuples –interest based on user-specified preferences reorder dataflow so that interesting tuples go first encapsulate reordering as pipelined dataflow operator

Context: an application of reordering online aggregation –for SQL aggregate queries, give gradually improving estimates –with confidence intervals –allow users to speed up estimate refinement for groups of interest –prioritize for processing at a per-group granularity SELECT AVG(gpa) FROM students GROUP BY college

Framework for Online Reordering want no delay in processing ⇒ in general, reordering can only be best-effort typically process/consume slower than produce –exploit throughput difference to reorder two aspects –mechanism for best-effort reordering –reordering policy [Figure: produce → reorder → process pipeline; the stream acddbadb... becomes abcdabc.. under user interest f(t); the produce side is labeled network xfer.]

Juggle mechanism for reordering two threads -- prefetch from input -- spool/enrich from auxiliary side disk juggle data between buffer and side disk –keep buffer full of “interesting” items –getNext chooses best item currently in the buffer getNext, enrich/spool decisions -- based on reordering policy side disk management –hash index, populated in a way that postpones random I/O [Figure: produce → prefetch → buffer → getNext → process/consume, with spool and enrich moving items between buffer and side disk]
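The prefetch/spool/enrich/getNext cycle can be sketched as a single-threaded simulation. This is a simplification under stated assumptions: the slide describes two threads and a hash-indexed side disk, while here an in-memory dictionary stands in for the disk, and the producer's speed advantage is modeled by prefetching several items per `get_next` call. The class name `Juggle`, the `priority` callback, and `prefetch_per_call` are hypothetical.

```python
from collections import defaultdict, deque


class Juggle:
    """Best-effort reordering between a fast producer and a slow consumer."""

    def __init__(self, source, capacity, priority):
        self.source = iter(source)           # yields (group, payload) tuples
        self.capacity = capacity             # max items held in the buffer
        self.priority = priority             # hypothetical: group -> interest
        self.buffer = []
        self.side_disk = defaultdict(deque)  # stand-in for the side disk
        self.exhausted = False

    def _prefetch(self):
        """Read one input item; spool the least interesting item on overflow."""
        if self.exhausted:
            return False
        try:
            item = next(self.source)
        except StopIteration:
            self.exhausted = True
            return False
        self.buffer.append(item)
        if len(self.buffer) > self.capacity:
            victim = min(self.buffer, key=lambda it: self.priority(it[0]))
            self.buffer.remove(victim)
            self.side_disk[victim[0]].append(victim)
        return True

    def _enrich(self):
        """Move the most interesting spooled item back into a free slot."""
        groups = [g for g, q in self.side_disk.items() if q]
        if not groups or len(self.buffer) >= self.capacity:
            return False
        best = max(groups, key=self.priority)
        self.buffer.append(self.side_disk[best].popleft())
        return True

    def get_next(self, prefetch_per_call=2):
        """Return the most interesting buffered item, or None when drained."""
        for _ in range(prefetch_per_call):   # producer outpaces consumer
            self._prefetch()
        if self.exhausted:                   # input done: refill from side disk
            while self._enrich():
                pass
        if not self.buffer:
            return None
        best = max(self.buffer, key=lambda it: self.priority(it[0]))
        self.buffer.remove(best)
        return best
```

Every input item is delivered exactly once; only the order changes. The reordering is best-effort because `get_next` can choose only among items that have already been produced.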

Reordering policies quality of feedback for a prefix $t_{\pi_1} t_{\pi_2} \ldots t_{\pi_k}$: QOF( UP($t_{\pi_1}$), UP($t_{\pi_2}$), …, UP($t_{\pi_k}$) ), UP = user preference –determined by application goodness of reordering: dQOF/dt implication for juggle mechanism –process gets item from buffer that increases QOF the most –juggle tries to maintain buffer with such items [Figure: QOF vs. time] GOAL: “good” permutation of items $t_1 \ldots t_n$ to $t_{\pi_1} \ldots t_{\pi_n}$

QOF in Online Aggregation avg weighted confidence interval preference acts as weight on confidence interval (Recall from Central Limit Theorem that sample mean’s confidence interval half-width is proportional to $\sigma/\sqrt{n}$. Conservative (Hoeffding) confidence intervals also have a $\sqrt{n}$ in the denominator. So…) QOF = $-\sum_i UP_i/\sqrt{n_i}$ (negated so that larger is better), $n_i$ = number of tuples processed from group $i$ ⇒ process pulls items from group with max $UP_i/(n_i\sqrt{n_i})$ ⇒ desired ratio of group $i$ tuples on buffer = $UP_i^{2/3}/\sum_j UP_j^{2/3}$ –juggle tries to maintain this by enrich/spool
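A small sketch of this policy, reading the slide's QOF as the negated preference-weighted sum of $1/\sqrt{n_i}$ terms (the sign is an assumption, chosen so that larger QOF is better). Processing one more tuple from group $i$ then improves QOF by about $UP_i/(2 n_i^{3/2})$, so the greedy choice is the group with the largest $UP_i/(n_i\sqrt{n_i})$. Equalizing that gain across groups gives $n_i \propto UP_i^{2/3}$, which is where the buffer ratio comes from. The function names are illustrative.

```python
import math


def next_group(up, n):
    """Pick the group whose next tuple improves QOF the most.

    up: {group: user preference}, n: {group: tuples processed so far}.
    Groups with no tuples yet are treated as infinitely valuable.
    """
    def gain(g):
        if n[g] == 0:
            return math.inf
        return up[g] / (n[g] * math.sqrt(n[g]))

    return max(up, key=gain)


def buffer_shares(up):
    """Desired fraction of buffer slots per group: UP_i^(2/3) / sum_j UP_j^(2/3)."""
    weights = {g: p ** (2.0 / 3.0) for g, p in up.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}
```

For example, with preferences 8 and 1, the shares come out to roughly 0.8 and 0.2: a group that is eight times as interesting gets four times the tuples, not eight.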

Other QOF functions rate of processing (for a group) $\propto$ preference –QOF = $-\sum_i (n_i - n\,UP_i)^2$ (variance from ideal proportions), with $n = \sum_i n_i$ the total number of tuples processed ⇒ process pulls items from group with max $(n\,UP_i - n_i)$ ⇒ desired ratio of group $i$ tuples in buffer = $UP_i$
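The rate-proportional policy can be sketched the same way. The assumptions here are that preferences are normalized to sum to 1 and that every group always has a tuple available; `next_group_by_rate` is an illustrative name.

```python
def next_group_by_rate(up, n):
    """Pick the group furthest below its ideal share n_total * UP_i.

    up: {group: preference}, assumed normalized so the values sum to 1.
    n:  {group: tuples processed so far}.
    """
    total = sum(n.values())
    return max(up, key=lambda g: total * up[g] - n[g])


def simulate(up, steps):
    """Run the policy for a number of steps and return per-group counts."""
    n = {g: 0 for g in up}
    for _ in range(steps):
        n[next_group_by_rate(up, n)] += 1
    return n
```

Under these assumptions the per-group counts stay within a small constant of the ideal proportions at every step, which is the behavior the "variance from ideal proportions" objective asks for.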

Results: Reordering in Online Aggregation implemented in Informix UDO server experiments with modified TPC-D queries questions: –how much throughput difference is needed for reordering? –can reordering handle skewed data? one stress test: skew, very small proc. cost –index-only join –5 orderpriorities, zipf distribution [Figure: query plan with scan, juggle, index, process, and consume stages] SELECT AVG(o_totalprice), o_orderpriority FROM order WHERE exists ( SELECT * FROM lineitem WHERE l_orderkey = o_orderkey) GROUP BY o_orderpriority

Performance results 3 times faster for interesting groups 2% completion time overhead [Figures: # tuples processed vs. time, and confidence interval vs. time, for groups A, C, and E]

Ripple Joins Good confidence intervals for joins of samples –vs. samples of joins! –Requires “Cross-Product CLT” Progressively refining join: –ever-larger rectangles in $R \times S$ –we can update confidence intervals at “corners” –comes in loop, index and hash flavors Benefits: –sample from both relations simultaneously –“animation rate”: Goal for the next “corner”, determines an optimization problem based on observations so far Old-fashioned systems are one extreme –adaptively tune “aspect ratio” for next “corner” sample from higher-variance relation faster –intimate relationship between delivery and estimation [Figure: sampling patterns over R and S, Traditional vs. Ripple] Haas & Hellerstein, SIGMOD 99
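The "ever-larger rectangles" idea can be sketched with a square, nested-loop ripple join that estimates a join COUNT. The sketch assumes a fixed 1:1 aspect ratio, inputs of equal length already in random order, and it omits the confidence intervals, which need the cross-product CLT machinery the slide refers to. `ripple_join_count` and `pred` are illustrative names.

```python
def ripple_join_count(R, S, pred):
    """Yield (k, estimated join size) after each corner of a square ripple join.

    R, S: lists of tuples in random order, assumed to have equal length.
    pred: join predicate taking one tuple from R and one from S.
    """
    seen_r, seen_s = [], []
    matches = 0
    scale = len(R) * len(S)
    for k, (r, s) in enumerate(zip(R, S), start=1):
        # New R tuple against every S tuple seen so far.
        matches += sum(1 for s_old in seen_s if pred(r, s_old))
        # Every R tuple seen so far against the new S tuple.
        matches += sum(1 for r_old in seen_r if pred(r_old, s))
        # The new corner pair itself.
        matches += 1 if pred(r, s) else 0
        seen_r.append(r)
        seen_s.append(s)
        # The k x k rectangle covers k*k of the |R|*|S| cross-product cells.
        yield k, matches * scale / (k * k)
```

Each pair in the cross product is examined exactly once, so when the rectangle covers all of both inputs the estimate equals the exact join size. A traditional nested-loop join corresponds to the extreme aspect ratio where one relation is fully scanned for each tuple of the other.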

Aspect Ratios Consider an extreme example: In general, to get to the next corner: –Need a cost model parameterized by relation Different for block and hash –“Benefit”: change in confidence interval –An online linear optimization problem Arguments about estimates converging quickly, stabilizing…

Fussy Implementation Details How to implement as an iterator? Issues: –Need cursors on all inputs (as usual) –Need to maintain aspect ratios –Need to maintain current “inner” & cursor I.e. the relation currently being scanned –Need to know current sampling step To know how far to scan current “inner” –Need to know “starter” for next step Determines length of scan (see pic), end of sampling step And pass that role along at EOF

Ensuring Aspect Ratios

Ripple Join Performance Too lazy to fetch graphs, but… –Typical orders of magnitude benefit vs. batch…

CONTROL Lessons Dream about UIs, work on systems –User needs drive systems design! Systems and statistics intertwine –“what unlike things must meet and mate” (“Art”, Herman Melville) Sloppy, adaptive systems a promising direction

Questions Where else do these lessons apply? –Outside of data analysis, manipulation Systems people think a lot about interfaces (APIs)… –Encapsulation, narrow interfaces … –In the CONTROL regime, how do you design these APIs and build systems? Ubiquitous computing: –Is it about portable computing and point access/delivery? –Or sensors/actuators, dataflow, big-picture queries?

More? CONTROL: –Overview: IEEE Computer, 8/99 Telegraph:

Backup slides The following slides may be used to answer questions...

Sampling Much is known here –Olken’s thesis –DB Sampling literature –more recent work by Peter Haas Progressive random sampling –can use a randomized access method (watch dups!) –can maintain file in random order –can verify statistically that values are independent of order as stored

Estimators & Confidence Intervals Conservative Confidence Intervals –Extensions of Hoeffding’s inequality –Appropriate early on, give wide intervals Large-Sample Confidence Intervals –Use Central Limit Theorem –Appropriate after “a while” (~dozens of tuples) –linear memory consumption –tight bounds Deterministic Intervals –only useful in “the endgame”
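To compare the first two interval types, here is a sketch of their half-widths for a sample mean. It assumes the basic two-sided Hoeffding bound for values known to lie in [lo, hi], not the extensions the slide mentions, and a normal-quantile CLT interval using the sample standard deviation. The function names are illustrative.

```python
import math
from statistics import NormalDist


def hoeffding_half_width(n, lo, hi, delta):
    """Conservative half-width: holds with probability >= 1 - delta for any n.

    Requires only that every value lies in [lo, hi].
    """
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def clt_half_width(n, sample_std, delta):
    """Large-sample half-width z * s / sqrt(n); approximate, needs enough tuples."""
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)
    return z * sample_std / math.sqrt(n)
```

For 100 grades in [0, 4] with a sample standard deviation of 0.5 and delta = 0.05, the Hoeffding half-width is about 0.54 and the CLT half-width about 0.10. Both shrink as one over the square root of n, which is why the conservative interval is wide early on and the large-sample interval is the tight one.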