A Crystal Ball for Data-Intensive Processing CONTROL group Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali.


A Crystal Ball for Data-Intensive Processing CONTROL group Joe Hellerstein, Ron Avnur, Christian Hidber, Bruce Lo, Chris Olston, Vijayshankar Raman, Tali Roth, Kirk Wylie, UC Berkeley Peter Haas, IBM Almaden

Context (wild assertions)
Value from information
  – The pressing problem in CS (?) (!!)
  – (in 1998, is CS about computation, or information? If the latter, what are the hard problems?)
“Point” querying and data management is a solved problem
  – at least for traditional data (business data, documents)
“Big picture” analysis still hard

Data Analysis
Complex: people using many tools
  – SQL Aggregation (Decision Support Sys, OLAP)
  – AI-style WYGIWIGY systems (e.g. “Data Mining”)
Both are Black Boxes
  – Users must iterate to get what they want
  – batch processing (big picture = big wait)
We are failing important users!
  – Decision support is for decision-makers!
  – Black box is the world’s worst UI

Black Box Begone!
Black boxes are bad
  – cannot be observed while running
  – cannot be controlled while running
These tools can be very slow
  – exacerbates previous problems
Thesis:
  – there will always be slow computer programs, usually data-intensive
  – fundamental issue: looking into the box...

Crystal Balls
Allow users to observe processing
  – as opposed to “lucite watches”
Allow users to predict future
Ideally, allow users to change future
  – online control of processing
The CONTROL Project:
  – online delivery, estimation, and control for data-intensive processes

*Online Aggregation
  – in collaboration with Informix & IBM
  – DBMS emphasis, but insights for other contexts
*Online Data Visualization
  – in Tioga Datasplash
Online Data Mining
UI widgets for large data sets

Decision-Support in DBMSs
Aggregation queries
  – compute a set of qualifying records
  – partition the set into groups
  – compute aggregation functions on the groups
  – e.g.:
      Select college, AVG(grade)
      From ENROLL
      Group By college;
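As a point of comparison for the online approach discussed later, here is a minimal Python sketch of how the example query above could be evaluated batch-style. The function name `group_avg` and the sample rows are illustrative assumptions, not taken from the slides. Note that no answer is available until the whole input has been scanned, which is the "big picture = big wait" problem.

```python
from collections import defaultdict

def group_avg(rows):
    """Batch evaluation of: Select college, AVG(grade) From ENROLL Group By college.

    rows is an iterable of (college, grade) pairs (a hypothetical ENROLL table).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for college, grade in rows:      # partition the records into groups
        sums[college] += grade       # accumulate per-group state
        counts[college] += 1
    # The result only exists after the full scan has finished.
    return {college: sums[college] / counts[college] for college in sums}

# Made-up sample data for illustration only.
enroll = [("Engineering", 3.0), ("Letters", 3.5),
          ("Engineering", 4.0), ("Letters", 2.5)]
result = group_avg(enroll)
```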

Interactive Decision Support?
Precomputation
  – the typical OLAP approach (think Essbase, Stanford)
  – doesn’t scale, no ad hoc analysis
  – blindingly fast when it works
Sampling
  – makes real people nervous?
  – no ad hoc precision
      sample in advance
      can’t vary stats requirements
  – per-query granularity only

Online Aggregation
Think “progressive” sampling
  – a la images in a web browser
  – good estimates quickly, improve over time
Shift in performance goals
  – traditional “performance”: time to completion
  – our performance: time to “acceptable” accuracy
Shift in the science
  – UI emphasis drives system design
  – leads to different data delivery, result estimation
  – motivates online control
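The "good estimates quickly, improve over time" idea can be sketched in a few lines of Python. This is a simplified illustration, not the CONTROL implementation: it assumes the data can be visited in random order (emulated here with a shuffle), and the name `online_avg` and the `report_every` parameter are hypothetical.

```python
import random

def online_avg(values, seed=0, report_every=100):
    """Yield (tuples_seen, running_average) while scanning in random order."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)               # stand-in for random-order data delivery
    total = 0.0
    for n, v in enumerate(order, start=1):
        total += v
        # Report a refreshed estimate periodically and once at the very end.
        if n % report_every == 0 or n == len(order):
            yield n, total / n

# Made-up grades for illustration only.
grades = [2.0, 3.0, 4.0] * 200
estimates = list(online_avg(grades, report_every=100))
```

A user interface could consume this generator and stop it as soon as the estimate looks "acceptable", which is the shift in performance goals described above.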

Not everything can be CONTROLed
“needle in haystack” scenarios
  – the nemesis of any sampling approach
  – e.g. highly selective queries, MIN, MAX, MEDIAN
not useless, though
  – unlike presampling, users can get some info (e.g. max-so-far)
we advocate a mixed approach
  – explore the big picture with online processing
  – when you drill down to the needles, or want full precision, go batch-style
  – can do both in parallel

Things I Do
GiST: Generalized Search Tree
  – extensible index for objects & methods
  – concurrency/recovery
  – indexability theory (w/Papadimitriou, etc.)
  – analysis/debugging toolkit (amdb)
  – selectivity estimation for new types
CONTROL
  – Continuous feedback and control for long jobs
      online aggregation (OLAP)
      data visualization
      data mining
      GUI widgets
  – database + UI + stats

Online Aggregation Demo

New technologies
Online Reordering
  – gives control of group delivery rates
  – applicable outside the RDBMS setting
Ripple Join family of join algorithms
  – comes in naïve, block & hash
Statistical estimators & confidence intervals
  – for single-table & multi-table queries
  – for AVG, SUM, COUNT, STDEV
  – Leave it to Peter
Visual estimators & analysis

Reordering For Online Aggregation
Fairness across groups?
  – want random tuple from Group 1, random tuple from Group 2, …
Speed-up, Slow-down, Stop
  – opposite of fairness: partiality
Idea: only deliver interesting data
  – client specifies a weighting on groups
  – maps to a
  – we should deliver items to

Online Reordering
Performance:
  – Effective when Process or Consume > Produce
  – Zero-overhead, responsive to user changes
  – Index-assisted version too
Other applications
  – Scalable spreadsheets: scroll, jump
  – Batch processing! sloppy ordering
[Figure: pipeline with stages labeled Produce, Reorder, Process, Consume over groups A, B, C, D; the stream AABABCADCA... is reordered to ABCDABCDABCD...]
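To make the reordering idea concrete, here is a toy Python sketch of a buffer that sits between a producer and a consumer and hands out items according to per-group weights. It is an assumption-laden illustration, not the algorithm from the CONTROL papers: the class `ReorderBuffer`, its methods, and the "least delivered relative to weight" rule are all hypothetical choices made for this example.

```python
from collections import defaultdict, deque

class ReorderBuffer:
    """Toy reorder buffer: deliver buffered items in proportion to group weights."""

    def __init__(self, weights):
        self.weights = dict(weights)          # group -> user-chosen weight
        self.queues = defaultdict(deque)      # group -> buffered items
        self.delivered = defaultdict(int)     # group -> items handed out so far

    def produce(self, group, item):
        self.queues[group].append(item)

    def set_weight(self, group, weight):
        # Speed-up, slow-down, or stop (weight 0) a group while running.
        self.weights[group] = weight

    def consume(self):
        # Only groups with buffered items and a positive weight are eligible.
        eligible = [g for g, q in self.queues.items()
                    if q and self.weights.get(g, 0) > 0]
        if not eligible:
            return None
        # Pick the group that is furthest behind relative to its weight.
        g = min(eligible, key=lambda g: (self.delivered[g] / self.weights[g], g))
        self.delivered[g] += 1
        return g, self.queues[g].popleft()

# Feed the skewed input stream from the slide and read it back with equal weights.
buf = ReorderBuffer({"A": 1, "B": 1, "C": 1, "D": 1})
for i, group in enumerate("AABABCADCA"):
    buf.produce(group, i)
order = ""
while (out := buf.consume()) is not None:
    order += out[0]
```

With equal weights the buffered items come out round-robin for as long as every group still has items, which mirrors the AABABCADCA... to ABCDABCDABCD... picture on the slide.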

Ripple Joins
Progressively Refining join:
  – (k·n rows of R) ⋈ (l·n rows of S), increasing n
      ever-larger rectangles in R × S
  – comes in naive, block, and hash flavors
Benefits:
  – sample from both relations simultaneously
  – sample from higher-variance relation faster (auto-tune)
  – intimate relationship between delivery and estimation
[Figure: two panels, “Traditional” and “Ripple”, each with axes R and S]
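The "ever-larger rectangles" idea can be illustrated with a small Python sketch of a square ripple join (aspect ratio k = l = 1) that maintains a running estimate of the join size. This is a simplified sketch under stated assumptions: both inputs are assumed to be already in random order, the function name `ripple_join_count` is hypothetical, and the scale-up estimator shown is the plain one for COUNT, without confidence intervals.

```python
def ripple_join_count(R, S, pred):
    """Yield (n, estimated number of joining pairs) after each ripple step.

    Step n extends the explored n-by-n square of R x S by one new tuple from
    each input, so every pair inside the square is examined exactly once.
    """
    seen_r, seen_s = [], []
    matches = 0
    steps = min(len(R), len(S))
    for n in range(1, steps + 1):
        r, s = R[n - 1], S[n - 1]
        # New R tuple against all previously seen S tuples.
        matches += sum(1 for s_old in seen_s if pred(r, s_old))
        # New S tuple against all previously seen R tuples.
        matches += sum(1 for r_old in seen_r if pred(r_old, s))
        # The new corner pair.
        matches += 1 if pred(r, s) else 0
        seen_r.append(r)
        seen_s.append(s)
        # Scale the matches in the n-by-n square up to the full cross product.
        yield n, matches * (len(R) * len(S)) / (n * n)

# Made-up inputs for illustration; the predicate is an equijoin.
R = [1, 2, 3, 4]
S = [2, 4, 4, 1]
steps = list(ripple_join_count(R, S, lambda r, s: r == s))
```

When both inputs have the same length, the final step covers all of R × S and the estimate equals the exact join size.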

CLOUDS
Online visualization
  – the big picture as a picture!
  – plot points as they arrive
  – layer “clouds” to compensate for expected error
  – how to segment picture?
      v1: grid into squares (quad tree)
      v2: image segmentation techniques?
Tie-ins w/previous algorithms
  – delivery techniques for online agg appear beneficial for online viz. Proof?

CLOUDS demo

Future CONTROL research
push the online query processing work
  – e.g. query optimization, parallelism, middleware
push the online viz work
  – empirical or mathematical assessments of goodness, both in delivery and estimation
widget toolkit for massive datasets
  – Java toolkit (GADGETS)  spreadsheet
data mining
  – online association rules (CARMA)
  – what is CONTROL data “mining”?

Performance wakeup call!
Traditional benchmarks (e.g. TPC):
  – cost/speed
Automobile analogy
  – Ford vs. Mercedes
  – better: f(cost, speed, quality)
CONTROL is cheap!
[Figure: plot with axes labeled “quality” (up to 100%) and “$”]

Lessons
Dream about UIs, work on systems
Systems, UIs and statistics intertwine
“what unlike things must meet and mate” – Art, Herman Melville

Status
Things will soon be under CONTROL
  – online agg in Postgres, Informix/MetaCube
  – joint work with IBM Almaden, possible integration into DB2
  – In-house: CLOUDS, CARMA, Spreadsheets
More?
  – IEEE Computer ‘99, Database Programming & Design 8/98, DE Bulletin 9/97
  – Ripple Join: SIGMOD 99, Juggle: VLDB 99
  – SIGMOD ‘97, SSDBM ‘97

Backup slides The following slides may be used to answer questions...

Sampling
Much is known here
  – Olken’s thesis
  – DB Sampling literature
  – more recent work by Peter Haas
Progressive random sampling
  – can use a randomized access method (watch dups!)
  – can maintain file in random order
  – can verify statistically that values are independent of order as stored

Estimators & Confidence Intervals
Conservative Confidence Intervals
  – Extensions of Hoeffding’s inequality
  – Appropriate early on, give wide intervals
Large-Sample Confidence Intervals
  – Use Central Limit Theorem
  – Appropriate after “a while” (~dozens of tuples)
  – linear memory consumption
  – tight bounds
Deterministic Intervals
  – only useful in “the endgame”
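The contrast between conservative and large-sample intervals can be seen in a short Python sketch. It uses the basic Hoeffding inequality (the slides refer to extensions of it, which are not reproduced here) and a textbook CLT interval, both assuming independent random samples. The function names and the sample data are illustrative assumptions.

```python
import math
from statistics import NormalDist, stdev

def hoeffding_half_width(n, lo, hi, confidence=0.95):
    """Conservative half-width for the mean of n samples bounded in [lo, hi].

    From P(|mean - mu| >= eps) <= 2 * exp(-2 * n * eps**2 / (hi - lo)**2).
    """
    delta = 1.0 - confidence
    return (hi - lo) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))

def clt_half_width(sample, confidence=0.95):
    """Large-sample half-width: z * s / sqrt(n), via the Central Limit Theorem."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return z * stdev(sample) / math.sqrt(len(sample))

# Made-up grades on a 0 to 4 scale, for illustration only.
sample = [2.0, 3.0, 4.0] * 20
conservative = hoeffding_half_width(len(sample), lo=0.0, hi=4.0)
large_sample = clt_half_width(sample)
```

On this sample the Hoeffding interval is several times wider than the CLT interval, which matches the slide: conservative bounds are safe early on but wide, while large-sample bounds are tight once enough tuples have been seen.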