Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.

Slides:



Advertisements
Similar presentations
Youre Smarter than a Database Overcoming the optimizers bad cardinality estimates.
Advertisements

Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Lindsey Bleimes Charlie Garrod Adam Meyerson
Extending Q-Grams to Estimate Selectivity of String Matching with Low Edit Distance [1] Pirooz Chubak May 22, 2008.
Database Management Systems, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 12, Part A.
Eiman Elnahrawy WSNA’03 Cleaning and Querying Noisy Sensors Eiman Elnahrawy and Badri Nath Rutgers University WSNA September 2003 This work was supported.
UNIT-2 Data Preprocessing LectureTopic ********************************************** Lecture-13Why preprocess the data? Lecture-14Data cleaning Lecture-15Data.
A Non-Blocking Join Achieving Higher Early Result Rate with Statistical Guarantees Shimin Chen* Phillip B. Gibbons* Suman Nath + *Intel Labs Pittsburgh.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Robust query processing Goetz Graefe, Christian König, Harumi Kuno, Volker Markl, Kai-Uwe Sattler Dagstuhl – September 2010.
Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
CS4432: Database Systems II
Using the Optimizer to Generate an Effective Regression Suite: A First Step Murali M. Krishna Presented by Harumi Kuno HP.
Fast Algorithms For Hierarchical Range Histogram Constructions
STHoles: A Multidimensional Workload-Aware Histogram Nicolas Bruno* Columbia University Luis Gravano* Columbia University Surajit Chaudhuri Microsoft Research.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Written By Surajit Chaudhuri, Gautam Das, Vivek Marasayya (Microsoft Research, Washington) Presented By Melissa J Fernandes.
Approximate Queries on Very Large Data UC Berkeley Sameer Agarwal Joint work with Ariel Kleiner, Henry Milner, Barzan Mozafari, Ameet Talwalkar, Michael.
Optimal Workload-Based Weighted Wavelet Synopsis
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries By : Surajid Chaudhuri Gautam Das Vivek Narasayya Presented by :Sayed.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
1 Primitives for Workload Summarization and Implications for SQL Prasanna Ganesan* Stanford University Surajit Chaudhuri Vivek Narasayya Microsoft Research.
Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.
Ph.D. DefenceUniversity of Alberta1 Approximation Algorithms for Frequency Related Query Processing on Streaming Data Presented by Fan Deng Supervisor:
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Ch2 Data Preprocessing part2 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
1 A Bayesian Method for Guessing the Extreme Values in a Data Set Mingxi Wu, Chris Jermaine University of Florida September 2007.
Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach Wenjie Zhang, Xuemin Lin The University of New South Wales & NICTA Ming Hua,
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
The Ohio State University Efficient and Effective Sampling Methods for Aggregation Queries on the Hidden Web Fan Wang Gagan Agrawal Presented By: Venu.
A Novel Approach for Approximate Aggregations Over Arrays SSDBM 2015 June 29 th, San Diego, California 1 Yi Wang, Yu Su, Gagan Agrawal The Ohio State University.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented by Sushanth.
Histograms for Selectivity Estimation
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Presented By Anirban Maiti Chandrashekar Vijayarenu
A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Presented By: Vivek Tanneeru.
SqlExam1Review.ppt EXAM - 1. SQL stands for -- Structured Query Language Putting a manual database on a computer ensures? Data is more current Data is.
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
CPSC 404, Laks V.S. Lakshmanan1 Evaluation of Relational Operations – Join Chapter 14 Ramakrishnan and Gehrke (Section 14.4)
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
1 A Robust, Optimization-Based Approach for Approximate Answering of Aggregate Queries Surajit Chaudhuri Gautam Das Vivek Narasayya Proceedings of the.
By: Peter J. Haas and Joseph M. Hellerstein published in June 1999 : Presented By: Sthuti Kripanidhi 9/28/20101 CSE Data Exploration.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
1 Introduction to Database Systems, CS420 SQL Views and Indexes.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
AQAX: Approximate Query Answering for XML Josh Spiegel, M. Pontikakis, S. Budalakoti, N. Polyzotis Univ. of California Santa Cruz.
ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.
Dense-Region Based Compact Data Cube
BlinkDB.
BlinkDB.
A paper on Join Synopses for Approximate Query Answering
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Overcoming Limitations of Sampling for Aggregation Queries
ICICLES: Self-tuning Samples for Approximate Query Answering
Presented by: Mariam John CSE /14/2006
Presentation transcript:

Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research

Why Approximation is Useful Large data warehouses Gigabytes to terabytes of data Data analysis applications Decision support Data Mining Query characteristics: Access large fraction of database Seek to identify general patterns / trends Absolute precision unnecessary $89,000 after 5 secs vs. $89, after 2 hrs

Two Phases of Approximate Query Processing (AQP) 1.Offline pre-processing of the database  E.g. generate histograms or random samples  OK to use considerable space and time (hours) 2.Runtime query processing  Query answers must be fast (seconds)  Only time to access small amount of data  E.g. extrapolate from random sample

AQP Example ProductAmount CPU Disk1 2 Monitor1 ProductAmount CPU1 2 3 Disk2 Sales SalesSample SELECT SUM(Amount) FROM Sales WHERE Product = 'CPU' Exact Answer: = 11 Approx. Answer: (1+2+3)*2= 12

Non-uniform Sampling “Biased” samples often more accurate than uniform samples All data records are not created equal Frequently queried values Extreme high and low values Uncommon values Optimal bias differs from query to query Past work: carefully select biased sample to give good answers for many queries

Related Work Non-sampling-based approaches Online Aggregation Hellerstein, Haas, and Wang 97 Histograms Ioannidis and Poosala 99 Wavelets Chakrabarti, Garofalakis, Rastogi, and Shim 00 Sampling-based approaches AQUA project Acharya, Gibbons, and Poosala 99 Congressional Acharya, Gibbons, and Poosala 00 Self-Tuning Ganti, Lee, and Ramakrishnan 00 Outliers Chaudhuri, Das, Datar, Motwani, and Narasayya 01 Workload Chaudhuri, Das, and Narasayya 01

Dynamic Sample Selection DATA SAMPLE Standard Sampling DATA SAMPLE Dynamic Sample Selection

Construct many differently-biased samples For each query, use the best sample and ignore the others How to pick a good set of samples? Given a query, what’s the best sample? Improved accuracy, no change to query time Query time is the scarce resource OK to use extra pre-processing, disk space

Small vs. Large Groups Consider group-by aggregation queries. E.g. Total sales of CPUs in each state E.g. Avg sale price for each product in each state Number of records per group may vary widely Problem: Rare values are under-represented in uniform sample “California” much more common than “Alaska” “Alaska” only appears a few times in the sample Approximate answer for “Alaska” likely to be bad In a group-by query, small groups are hard

Small Group Sampling Large Groups: Use Uniform Random Sample Well-represented in sample Good quality of approximation Main idea: Treat small and large groups differently

Small Group Sampling Small Groups: Use Original Data Contain few records, by definition Thus can be scanned very quickly Main idea: Treat small and large groups differently

Small Group Sampling Small groups are query-dependent Depend on grouping attributes Depend on selection predicates How do we know which rows to scan to find the small groups? Main idea: Treat small and large groups differently

Finding the Small Groups Heuristic idea: Most small groups in most queries have a rare value for at least one grouping attribute Small group in this query  rare value in entire DB Not always true (snowblower sales in California) Summary of Small Group Sampling: Identify rare values during pre-processing Store rows with rare values in a different (small) table for each attribute: the small groups tables At query time, scan small groups table for each grouping attribute

Pre-Processing Steps Create a table sample_all containing a uniform random sample of all data For each attribute A in the schema: Identify rare values for attribute A Create a table smGrps_A containing all records with rare A values Size of smGrps_A table limited by threshold (2:1 ratio between sample_all and smGrps) smGrps_B smGrps_D smGrps_C smGrps_A sample_all

Pre-Processing Steps Augment rows in sample_all, smGrps_* with table membership information Some rows may be added to multiple tables One extra bitmask column: which small group tables contain this row? Used to avoid double-counting during query processing smGrps_B smGrps_D smGrps_C smGrps_A sample_all DATA

smGrps_B smGrps_A Answering Queries Using Small Group Sampling Values of attribute A Values of attribute B CommonRare Common Rare sample_all

Query Answering Example Run query on small group table for each grouping attribute Run scaled query on sample_all Combine answers SELECT A,B,COUNT(*) FROM FACT_TBL WHERE C=10 GROUP BY A,B SELECT A,B,COUNT(*) as cnt FROM smGrps_A WHERE C=10 GROUP BY A,B UNION ALL SELECT A,B,100 * COUNT(*) as cnt FROM sample_all WHERE C=10 AND bitmask & 3 = 0 GROUP BY A,B SELECT A,B,COUNT(*) as cnt FROM smGrps_B WHERE C=10 AND bitmask & 1 = 0 GROUP BY A,B UNION ALL

Experimental Setup Two data sources Skewed version of TPC-H benchmark database Real-world database: 1 month of product sales Randomly generated queries Compared different AQP methods Small Group, Uniform, Basic Congress Each allowed to query same number of rows Evaluating approximate answers Average relative error in approximate answer across groups Number of groups absent from approximate answer (not present in sample)

Relative Error – TPC-H

Groups Missed – TPC-H

Relative Error – Sales Data

Groups Missed – Sales Data

Summary Dynamic Sample Selection Gain accuracy at the cost of disk space. Non-uniform samples are good, but different ones are good for different queries. Build lots of different non-uniform samples. For each query, pick the best sample. Small Group Sampling Treat large and small groups differently. Uniform sampling works well for large groups. Small groups are cheap to scan in their entirety.