ICICLES: Self-tuning Samples for Approximate Query Answering

ICICLES: Self-tuning Samples for Approximate Query Answering Haidong Wang 02/13/2007

Outline: Introduction and background; intuition and basic idea of icicles; icicle maintenance; estimators for aggregate queries; performance evaluation; conclusion and my comments.

Introduction: Analyzing the data in data warehouses is useful for decision support. OLAP systems provide interactive response times for aggregate queries. Approximate query answering (AQUA) systems are being developed; they tolerate approximate answers in exchange for faster response times.

Introduction: Approaches to answering queries approximately include sampling-based, histogram-based, clustering, probabilistic, and wavelet-based methods.

Uniform Random Sample: The Sales relation and S_sales, a 50% uniform sample of it (scale factor 2).
Sales:
Branch  State  Sales
1       CA     80K
2       TX     42K
3              40K
4
5              75K
6              48K
7              55K
8              38K
9
10             41K
S_sales (50% sample):
Branch  State  Sales
2       TX     42K
4       CA
6              48K
8              38K
10             41K
Query on the sample, scaled by the scale factor: SELECT SUM(sales) * 2 AS cnt FROM s_sales WHERE state = 'TX'
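The scale-up step in the query above can be sketched in Python. This is only an illustration: the helper name `estimate_sum` and the sample rows are hypothetical (the sales figures are invented, not taken from the slide), and the rows are assumed to be (branch, state, sales) tuples held in memory.

```python
def estimate_sum(sample_rows, predicate, sampling_fraction):
    """Estimate SUM(sales) over the full relation from a uniform sample.

    Each sampled tuple stands for 1 / sampling_fraction tuples of the
    base relation, so the sum over the sample is multiplied by that factor.
    """
    scale_factor = 1.0 / sampling_fraction
    matching = sum(
        sales for (_branch, state, sales) in sample_rows if predicate(state)
    )
    return matching * scale_factor


# Hypothetical 50% sample; the values are invented for illustration.
sample = [(2, "TX", 42_000), (4, "CA", 10_000), (6, "TX", 48_000)]
tx_estimate = estimate_sum(sample, lambda state: state == "TX", sampling_fraction=0.5)
```

With a 50% sample the scale factor is 2, which matches the multiplication by 2 in the SQL query.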

Why Icicles? In practice, queries may follow a predictable pattern. A static sampling strategy treats all tuples uniformly, and so wastes memory on tuples that are rarely needed. The space of the sample relation is better utilized if it holds more tuples from the actual result sets. Example: a Walmart manager for Asia, whose queries select location = 'Asia' and year >= 2000.

Intuition of Icicles: The probability that a tuple is selected into the sample is proportional to the frequency with which it is required to answer queries (for example, tuples with location = 'Asia' and year >= 2000). We need a dynamic algorithm that tunes the sample to the most recent knowledge of the workload.

What Is an Icicle? An icicle for a relation R is a uniform random sample of a multiset of tuples L, where L is the union of R and all sets of tuples that were required to answer queries in the workload.

What Is an Icicle? [Figure labels: uniform random sample; R; icicle; R(Q1), R(Q2), R(Q3)]
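The definition of an icicle above can be illustrated with a small Python sketch. It assumes tuples are represented by integer ids, that each query contributes one extra copy of every tuple it needed, and that the sample is drawn without replacement over the copies in L; the function names and the toy workload are hypothetical.

```python
import random
from collections import Counter


def build_multiset(relation, query_results):
    """Return L: the tuples of R plus one copy per query that needed them."""
    multiset = list(relation)
    for result in query_results:
        multiset.extend(result)
    return multiset


def draw_icicle(multiset, size, seed=0):
    """Draw a uniform random sample over the copies in L (duplicates possible)."""
    rng = random.Random(seed)
    return rng.sample(multiset, size)


relation = list(range(10))                # R: ten tuples, ids 0..9
query_results = [[7, 8, 9], [8, 9], [9]]  # R(Q1), R(Q2), R(Q3), made up
L = build_multiset(relation, query_results)
frequencies = Counter(L)                  # copies of each tuple in L
icicle = draw_icicle(L, size=5)
```

Tuple 9 appears four times in L and tuple 0 only once, so tuple 9 is four times as likely to be picked for any given slot of the icicle.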

Icicle Maintenance: The icicle is maintained with the reservoir sampling algorithm. Each update only needs to access the new block of data. [Figure labels: uniform random sample; R; icicle; R(Q1), R(Q2), R(Q3)]
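A minimal Python sketch of incremental reservoir sampling follows. It uses the classic scheme often called Algorithm R, which may differ in detail from the variant used for icicles; the class name `Reservoir` and the example blocks are assumptions made for illustration.

```python
import random


class Reservoir:
    """Fixed-size uniform sample over everything passed to extend() so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0  # number of tuples consumed so far
        self._rng = random.Random(seed)

    def extend(self, block):
        """Fold a new block of tuples into the sample without rereading old data."""
        for item in block:
            self.seen += 1
            if len(self.items) < self.capacity:
                self.items.append(item)
            else:
                # Keep the new item with probability capacity / seen,
                # overwriting a uniformly chosen slot.
                slot = self._rng.randrange(self.seen)
                if slot < self.capacity:
                    self.items[slot] = item


reservoir = Reservoir(capacity=4)
reservoir.extend(range(10))  # initial pass over R
reservoir.extend([7, 8, 9])  # later: only the new block, e.g. R(Q1)
```

The second call touches only the three new tuples, which is the property the slide points out: earlier data never has to be scanned again.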

Icicle Maintenance Example: SELECT average(*) FROM widget-tuners WHERE date.month = 'April'

Icicle Maintenance: Maintaining the frequency relation. Keep the frequency relation in main memory. Delay writing updated frequencies to disk until the magnitude of the change crosses a threshold.
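The deferred-update idea can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: a dictionary stands in for the on-disk frequency relation, and the "magnitude of the change" is assumed to be the total number of pending increments.

```python
class FrequencyRelation:
    """Per-tuple usage counts with writes batched until a threshold is hit."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.on_disk = {}   # stand-in for the persisted frequency relation
        self.pending = {}   # in-memory changes not yet written out
        self.flushes = 0    # number of simulated disk writes

    def record_use(self, tuple_id, count=1):
        """Note that a tuple was needed by a query; flush if enough has changed."""
        self.pending[tuple_id] = self.pending.get(tuple_id, 0) + count
        if sum(self.pending.values()) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Apply all pending changes to the persisted copy in one batch."""
        for tuple_id, delta in self.pending.items():
            self.on_disk[tuple_id] = self.on_disk.get(tuple_id, 0) + delta
        self.pending.clear()
        self.flushes += 1

    def frequency(self, tuple_id):
        """Current frequency, counting both persisted and pending updates."""
        return self.on_disk.get(tuple_id, 0) + self.pending.get(tuple_id, 0)


freq = FrequencyRelation(flush_threshold=3)
freq.record_use("t1")
freq.record_use("t2")
flushes_before = freq.flushes  # still 0: only two pending increments
freq.record_use("t1")          # third increment reaches the threshold
```

Reads go through `frequency`, which combines both copies, so estimators see up-to-date counts even while writes are being held back.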

Estimators for Aggregate Queries: Traditional estimators can't be used because of selection bias and duplicates in the icicle. Example: count. Maintain a set of frequencies, one per tuple in the relation.

Estimators for Aggregate Queries: Avg is the average over the distinct tuples in the sample that satisfy the query; it doesn't require the frequency attribute. Count is the sum of the expected contributions of all tuples in the icicle that satisfy the query. Sum is Avg * Count.
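One plausible reading of these estimators is sketched below in Python. The slide does not give the formula for a tuple's expected contribution, so the weight used here, |L| / (n * f(t)) for a tuple t with frequency f(t) in an icicle of size n, is an assumption chosen so that the copies of each base tuple contribute one in expectation. All names and data are hypothetical.

```python
def estimate_avg(icicle, predicate):
    """Average over the distinct sampled tuples that satisfy the predicate."""
    distinct = {tid: value for (tid, value) in icicle if predicate(value)}
    if not distinct:
        return None
    return sum(distinct.values()) / len(distinct)


def estimate_count(icicle, predicate, frequency, multiset_size):
    """Sum of assumed per-copy contributions |L| / (n * f(t))."""
    n = len(icicle)
    return sum(
        multiset_size / (n * frequency[tid])
        for (tid, value) in icicle
        if predicate(value)
    )


def estimate_sum(icicle, predicate, frequency, multiset_size):
    """Sum = Avg * Count, as stated on the slide."""
    avg = estimate_avg(icicle, predicate)
    if avg is None:
        return 0.0
    return avg * estimate_count(icicle, predicate, frequency, multiset_size)


# Toy icicle of (tuple id, value) pairs; tuple "a" was sampled twice.
icicle = [("a", 10.0), ("a", 10.0), ("b", 30.0), ("c", 5.0)]
frequency = {"a": 4, "b": 2, "c": 1}  # copies of each tuple in L (made up)
multiset_size = 16                    # |L| (made up)


def at_least_ten(value):
    return value >= 10


avg = estimate_avg(icicle, at_least_ten)
count = estimate_count(icicle, at_least_ten, frequency, multiset_size)
total = estimate_sum(icicle, at_least_ten, frequency, multiset_size)
```

Dividing by the frequency is what corrects for the selection bias: frequently used tuples are over-represented in the icicle, so each of their copies counts for less.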

Performance Evaluation: Plot definitions. Static sample: a uniform random sample of the relation. Icicle: the icicle as it evolves with the workload. Icicle-complete: the tuned icicle run again on the same workload.

Conclusion: Icicles are better than static sampling when the workload focuses on relatively small subsets of the relation. When the workload is randomized, an icicle is at least as good as static sampling. Icicles adapt quickly to a changing workload.

My Comment: An icicle is a trade-off between accuracy and cost. Icicles work well under certain restrictions. How should a workload be defined? It has to consist of exact queries (not approximate ones). The typical scenario for an analyst using icicles: first run a batch of approximate queries, then run one or more exact queries.

My Conclusion: Icicles are useful when the following hold: the workload focuses on relatively small subsets of the relation; the application calls for highly accurate approximate answers; and exact queries are available (the more the better).

Thanks. Questions?