ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339.

Slides:



Advertisements
Similar presentations
Raghavendra Madala. Introduction Icicles Icicle Maintenance Icicle-Based Estimators Quality Guarantee Performance Evaluation Conclusion 2 ICICLES: Self-tuning.
Advertisements

Dynamic Sample Selection for Approximate Query Processing Brian Babcock Stanford University Surajit Chaudhuri Microsoft Research Gautam Das Microsoft Research.
Overcoming Limitations of Sampling for Agrregation Queries Surajit ChaudhuriMicrosoft Research Gautam DasMicrosoft Research Mayur DatarStanford University.
Kaushik Chakrabarti(Univ Of Illinois) Minos Garofalakis(Bell Labs) Rajeev Rastogi(Bell Labs) Kyuseok Shim(KAIST and AITrc) Presented at 26 th VLDB Conference,
CS4432: Database Systems II
Counting Distinct Objects over Sliding Windows Presented by: Muhammad Aamir Cheema Joint work with Wenjie Zhang, Ying Zhang and Xuemin Lin University of.
Probabilistic Skyline Operator over Sliding Windows Wenjie Zhang University of New South Wales & NICTA, Australia Joint work: Xuemin Lin, Ying Zhang, Wei.
Effective Keyword Based Selection of Relational Databases Bei Yu, Guoliang Li, Karen Sollins, Anthony K.H Tung.
Fast Algorithms For Hierarchical Range Histogram Constructions
Introduction to Histograms Presented By: Laukik Chitnis
Fast Incremental Maintenance of Approximate histograms : Phillip B. Gibbons (Intel Research Pittsburgh) Yossi Matias (Tel Aviv University) Viswanath Poosala.
Linked Bernoulli Synopses Sampling Along Foreign Keys Rainer Gemulla, Philipp Rösch, Wolfgang Lehner Technische Universität Dresden Faculty of Computer.
Brian Babcock Surajit Chaudhuri Gautam Das at the 2003 ACM SIGMOD International Conference By Shashank Kamble Gnanoba.
Harikrishnan Karunakaran Sulabha Balan CSE  Introduction  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Modeling and Analysis of Random Walk Search Algorithms in P2P Networks Nabhendra Bisnik, Alhussein Abouzeid ECSE, Rensselaer Polytechnic Institute.
Optimal Workload-Based Weighted Wavelet Synopsis
New Sampling-Based Summary Statistics for Improving Approximate Query Answers P. B. Gibbons and Y. Matias (ACM SIGMOD 1998) Rongfang Li Feb 2007.
Probabilistic Aggregation in Distributed Networks Ling Huang, Ben Zhao, Anthony Joseph and John Kubiatowicz {hling, ravenben, adj,
Graph-Based Synopses for Relational Selectivity Estimation Joshua Spiegel and Neoklis Polyzotis University of California, Santa Cruz.
ACM GIS An Interactive Framework for Raster Data Spatial Joins Wan Bae (Computer Science, University of Denver) Petr Vojtěchovský (Mathematics,
Presented by Ozgur D. Sahin. Outline Introduction Neighborhood Functions ANF Algorithm Modifications Experimental Results Data Mining using ANF Conclusions.
Proof Sketches: Verifiable In-Network Aggregation Minos Garofalakis Yahoo! Research, UC Berkeley, Intel Research Berkeley
1 Wavelet synopses with Error Guarantees Minos Garofalakis Phillip B. Gibbons Information Sciences Research Center Bell Labs, Lucent Technologies Murray.
Adaptive Sampling  Based on a hot-list algorithm by Gibbons and Matias (SIGMOD 1998)  Sample elements from the input set Frequently occurring elements.
On the Accuracy of Modal Parameters Identified from Exponentially Windowed, Noise Contaminated Impulse Responses for a System with a Large Range of Decay.
Hashed Samples Selectivity Estimators for Set Similarity Selection Queries.
Query Optimization Allison Griffin. Importance of Optimization Time is money Queries are faster Helps everyone who uses the server Solution to speed lies.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP-BY QUERIES Swarup Acharya Phillip Gibbons Viswanath Poosala ( Information Sciences Research Center,
Ripple Joins for Online Aggregation by Peter J. Haas and Joseph M. Hellerstein published in June 1999 presented by Ronda Hilton.
Join Synopses for Approximate Query Answering Swarup Achrya Philip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented by Bhushan Pachpande.
Join Synopses for Approximate Query Answering Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy.
Swarup Acharya Phillip B. Gibbons Viswanath Poosala Sridhar Ramaswamy Presented By Vinay Hoskere.
DAQ: A New Paradigm for Approximate Query Processing Navneet Potti Jignesh Patel VLDB 2015.
Data Streams Part 3: Approximate Query Evaluation Reynold Cheng 23 rd July, 2002.
February 14, 2006CS DB Exploration 1 Congressional Samples for Approximate Answering of Group-By Queries Swarup Acharya Phillip B. Gibbons Viswanath.
Histograms for Selectivity Estimation
Join Synopses for Approximate Query Answering Swarup Acharya, Philip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy By Vladimir Gamaley.
New Sampling-Based Summary Statistics for Improving Approximate Query Answers Yinghui Wang
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data ACM EuroSys 2013 (Best Paper Award)
Presented By Anirban Maiti Chandrashekar Vijayarenu
Robust Estimation With Sampling and Approximate Pre-Aggregation Author: Christopher Jermaine Presented by: Bill Eberle.
1 Approximate XML Query Answers Presenter: Hongyu Guo Authors: N. polyzotis, M. Garofalakis, Y. Ioannidis.
CONGRESSIONAL SAMPLES FOR APPROXIMATE ANSWERING OF GROUP BY QUERIES Swaroop Acharya,Philip B Gibbons, VishwanathPoosala By Agasthya Padisala Anusha Reddy.
HASE: A Hybrid Approach to Selectivity Estimation for Conjunctive Queries Xiaohui Yu University of Toronto Joint work with Nick Koudas.
Adaptive Ordering of Pipelined Stream Filters Babu, Motwani, Munagala, Nishizawa, and Widom SIGMOD 2004 Jun 13-18, 2004 presented by Joshua Lee Mingzhu.
University of Texas at Arlington Presented By Srikanth Vadada Fall CSE rd Sep 2010 Dynamic Sample Selection for Approximate Query Processing.
Written By: Presented By: Swarup Acharya,Amr Elkhatib Phillip B. Gibbons, Viswanath Poosala, Sridhar Ramaswamy Join Synopses for Approximate Query Answering.
BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data Authored by Sameer Agarwal, et. al. Presented by Atul Sandur.
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies 병렬 분산 컴퓨팅 연구실 석사 1 학기 김남희.
Dense-Region Based Compact Data Cube
RankSQL: Query Algebra and Optimization for Relational Top-k Queries
A paper on Join Synopses for Approximate Query Answering
Anthony Okorodudu CSE ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan.
Methodology – Physical Database Design for Relational Databases
Online Subpath Profiling
Bolin Ding Silu Huang* Surajit Chaudhuri Kaushik Chakrabarti Chi Wang
Probabilistic Data Management
Query Sampling in DB2.
Overcoming Limitations of Sampling for Aggregation Queries
ICICLES: Self-tuning Samples for Approximate Query Answering
A Framework for Automatic Resource and Accuracy Management in A Cloud Environment Smita Vijayakumar.
Spatial Online Sampling and Aggregation
Load Shedding Techniques for Data Stream Systems
AQUA: Approximate Query Answering
Random Sampling over Joins Revisited
Query Sampling in DB2.
Farzaneh Mirzazadeh Fall 2007
Arithmetic Mean This represents the most probable value of the measured variable. The more readings you take, the more accurate result you will get.
Smita Vijayakumar Qian Zhu Gagan Agrawal
Presentation transcript:

ICICLES: Self-tuning Samples for Approximate Query Answering By Venkatesh Ganti, Mong Li Lee, and Raghu Ramakrishnan Shruti P. Gopinath CSE 6339

Outline  Introduction and background  Icicles  Icicle Maintenance  Icicle-Based Estimators  Quality & Performance  Conclusion

Background Data Warehouses –huge information collection and management systems OLAP-provide interactive response times to aggregate queries AQUA- Approximate query answering systems are being developed

Introduction Various approaches to answering approximate queries Sampling-based Histogram-based Probabilistic-based Wavelet-based Clustering-based

Sampling based techniques Uniform Random Sampling ◦ All tuples are deemed equally important ◦ Leads to locality in data access ◦ Waste of precious real estate ◦ Join of random samples of base relations may not be a random sample of the join of the base relations. This is basis for Join Synopsis by Gibbons

Why Icicles? Queries follow a predictable pattern Class of samples capture data locality of aggregate queries on foreign key joins Sample relation space better utilized if more samples from actual result set are present Dynamic algorithm that changes the sample to suit the queries being executed in the workload

Icicles Is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)

Icicle Maintenance Icicle Maintenance We maintain an icicle such that at all times the probability of a tuple’s presence in the icicle is proportional to its “importance” (frequency) in answering of a tuple in a workload. R Icicle R(Q1) R(Q2) R(Q3) Uniform random sample

2006/2/7 ICICLES: Self-tuning Samples for Approximate Query Answering9 Icicle Maintenance Example

2006/2/7 ICICLES: Self-tuning Samples for Approximate Query Answering10 Icicle Maintenance

Algorithm is efficient due to  Uniform Random Sample of L ensures that tuple’s selection in its icicle is proportional to it’s frequency  Incremental maintenance of icicle requires only the segment of R that satisfies the new query from the workload  Reservoir Sampling Algorithm Icicle Maintenance Algorithm

In spite of unified sampling being used the result is a biased sample Traditional estimation of scaling up the count from a sample N of k by a factor of N/k is way off the mark Frequency Relation maintained over all tuples in relation Different Estimation mechanisms for Average, Count and Sum Icicle-Based Estimators

Average : Average taken over set of distinct sample tuples that satisfy the query predicate of the average query is a pretty good estimate of the average Count : First compute an estimate of the number of tuples in R that contribute one tuple t in the icicle. Sum of Expected Contributions of all tuples in the sample that satisfy the given query Estimators

Estimators Sum : Estimate is given by the product of the average and the count estimates Thus E[A(A,R)]

Frequency Attribute added to the Relation Starting Frequency set to 1 for all tuples Incremented each time tuple is used to answer a query Frequencies of relevant tuples updated only when icicle updated with new query Maintaining Frequency Relation

When queries exhibit data locality then icicle is constituted of more tuples from frequently accessed subsets of the relation Accuracy improves with increase in number of tuples used to compute it Class consisting of queries ‘focused’ with respect to workload will obtain more accurate approximate answers from the icicle Quality Guarantees

Performance Evaluation SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice) FROM LI, C, O, S, N, R WHERE C_Custkey=O_Custkey AND O_Orderkey=LI_Orderkey AND LI_Suppkey=S_Suppkey AND C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND R Name = [region] AND O Orderdate >= Date[startdate] AND O Orderdate <= SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice) FROM LICOS-icicle, N, R WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey AND R Name = [region] AND O Orderdate >= Date[startdate] AND O Orderdate <= Q workload : Template for generating workloads Template for obtaining approximate answers

Performance Evaluation Plots definition: Static sample: Uniform random sample on the relation Icicle: Icicle evolves with the workload Icicle-complete The tuned icicle again on the same workload

Performance Evaluation

Rapid decrease in relative error of query answers from icicles with queries focused on a set of core tuples Icicle plot shows a convergence to the Icicle-Complete plot Quick Convergence of Icicle plot towards Icicle-Complete means Icicle adapts fast Improvement due to usage of icicles is not significant At worse, can be concluded that icicles are as good as the static samples Observations

Icicles provide class of samples that adapt according to the characteristics of the workload Icicle is useful when the following is true: ◦ workload focuses on relatively small subsets in relation ◦ Calls for high accuracy of approximate answer ◦ Has to have exact query (the more the better) Icicle is a trade-off between accuracy and cost Conclusion

Questions When are icicles best suited to be used? What are the disadvantages of Icicles? More???

V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference S Acharya, PB Gibbons, V Poosala, S Ramaswamy Join synopses for approximate query answering. ACM SIGMOD Record 1999 References

THANK YOU!