Harikrishnan Karunakaran, Sulabha Balan. CSE 6339. Introduction • Icicles • Icicle Maintenance • Icicle-Based Estimators • Quality & Performance • Conclusion.

Presentation transcript:

Harikrishnan Karunakaran Sulabha Balan CSE 6339

• Introduction
• Icicles
• Icicle Maintenance
• Icicle-Based Estimators
• Quality & Performance
• Conclusion

• Analysis of data in data warehouses is useful for decision support
• Users of decision support systems want interactive systems (OLAP: Online Analytical Processing)
• Approximate Query Answering (AQUA) systems were developed to reduce response times to desirable levels
• These systems suit applications that are tolerant of approximate results

• Various approaches
  • Sampling-based
  • Histogram-based
  • Clustering
  • Probabilistic
  • Wavelet-based

Sales

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales (50% sample)

Branch  State  Sales
2       TX     42K
4       CA     42K
6       CA     48K
8       TX     38K
10      CA     41K

Query on the sample, with the aggregate multiplied by the scale factor 2:

SELECT SUM(sales) * 2 AS cnt FROM s_sales WHERE state = 'TX'
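The scaled estimate on this slide can be checked by hand. The sketch below is only an illustration: it hard-codes the ten tuples and the 50% sample shown above, and the names `SALES`, `S_SALES` and `sum_sales` are made up for this example.

```python
# Sales relation from the slide: (branch, state, sales in thousands).
SALES = [
    (1, "CA", 80), (2, "TX", 42), (3, "CA", 40), (4, "CA", 42), (5, "TX", 75),
    (6, "CA", 48), (7, "TX", 55), (8, "TX", 38), (9, "CA", 40), (10, "CA", 41),
]

# The 50% sample S_sales shown on the slide (branches 2, 4, 6, 8, 10).
S_SALES = [row for row in SALES if row[0] % 2 == 0]

# Scale factor = |Sales| / |S_sales| = 10 / 5 = 2.0
SCALE_FACTOR = len(SALES) / len(S_SALES)


def sum_sales(rows, state):
    """Sum the sales column over the rows of one state."""
    return sum(sales for _, st, sales in rows if st == state)


exact_tx = sum_sales(SALES, "TX")                       # 42 + 75 + 55 + 38
estimate_tx = sum_sales(S_SALES, "TX") * SCALE_FACTOR   # (42 + 38) * 2
relative_error = abs(estimate_tx - exact_tx) / exact_tx

print(exact_tx, estimate_tx, round(relative_error, 3))
```

With this particular sample the estimate is 160K against an exact answer of 210K, because only two of the four Texas tuples were sampled.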

Sample relation for an aggregation query workload regarding Texas branches

Sales

Branch  State  Sales
1       CA     80K
2       TX     42K
3       CA     40K
4       CA     42K
5       TX     75K
6       CA     48K
7       TX     55K
8       TX     38K
9       CA     40K
10      CA     41K

S_sales

Branch  State  Sales
2       TX     42K
4       CA     42K
5       TX     75K
7       TX     55K
8       TX     38K

• All tuples in a uniform random sample are treated as equally important for answering queries
• The sample needs to be tuned to contain the tuples that are more relevant to answering the queries in a workload
• This calls for a dynamic algorithm that changes the sample to suit the queries being executed in the workload

• Join synopsis: the join of a uniform random sample of a fact table with a set of accompanying dimension tables

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey
  AND LI_Suppkey = S_Suppkey AND C_Nationkey = N_Nationkey
  AND N_Regionkey = R_Regionkey AND R_Name = 'North America'
  AND O_Orderdate AND O_Orderdate ;

• Any aggregate query on the fact table can be answered approximately using exactly one of a small number of synopses
• A uniform random sample of the relation wastes memory
• OLAP queries exhibit locality in their data access

• Icicles are a class of samples that capture the data locality of aggregate queries over foreign key joins
• They identify the focus of a query workload and sample accordingly
• An icicle is a uniform random sample of a multiset of tuples L, which is the union of R and all sets of tuples that were required to answer queries in the workload (an extension of R)
• An icicle is therefore a non-uniform sample of the original relation R
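The definition above can be made concrete with a small sketch. It assumes tuples are identified by integer keys and that the workload is given as a list of query result sets; `build_multiset`, `draw_icicle` and the three example queries are hypothetical names and data, not taken from the paper.

```python
import random
from collections import Counter


def build_multiset(relation, query_results):
    """Return L: one copy of every tuple in R plus one copy per query that used it."""
    multiset = list(relation)
    for result in query_results:
        multiset.extend(result)
    return multiset


def draw_icicle(multiset, size, seed=0):
    """Uniform random sample (without replacement) of the multiset L."""
    return random.Random(seed).sample(multiset, size)


R = list(range(1, 11))                            # tuple keys 1..10
workload = [[2, 5, 7, 8], [2, 5, 7, 8], [5, 7]]   # hypothetical results of three queries
L = build_multiset(R, workload)
frequency = Counter(L)                            # copies of each tuple in L
icicle = draw_icicle(L, size=5)

print(len(L), frequency[5], frequency[1], icicle)
```

Tuple 5 has four copies in L and tuple 1 has one, so a uniform draw from L is four times as likely to pick tuple 5 on any single draw. This is the sense in which a uniform sample of L is a non-uniform sample of R. The same tuple may appear in the icicle more than once.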

The maintenance algorithm is efficient because:
• A uniform random sample of L ensures that a tuple's selection into the icicle is proportional to its frequency
• Incremental maintenance of the icicle requires only the segment of R that satisfies the new query from the workload
• It is based on the reservoir sampling algorithm
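As a rough illustration of the reservoir sampling idea named above, the sketch below keeps a fixed-size uniform sample over a stream and then feeds it only the tuples selected by a new query. It is the generic textbook reservoir scheme, not necessarily the paper's exact maintenance procedure; the class name `Reservoir` and the example query result are assumptions.

```python
import random


class Reservoir:
    """Fixed-size uniform sample over a stream (classic reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # stream elements offered so far
        self._rng = random.Random(seed)

    def offer(self, item):
        """Offer one stream element to the reservoir."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)        # reservoir not yet full
            return
        # Keep the new item with probability capacity / seen,
        # evicting a uniformly chosen resident item.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def extend(self, items):
        for item in items:
            self.offer(item)


icicle = Reservoir(capacity=5)
icicle.extend(range(1, 11))     # initial pass over R (tuple keys 1..10)
icicle.extend([2, 5, 7, 8])     # only the tuples selected by a new, hypothetical query

print(icicle.seen, icicle.items)
```

Only the four tuples that satisfy the new query are touched in the second step, which is the point of the incremental-maintenance bullet: the rest of R does not need to be rescanned.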

SELECT average(*) FROM widget-tuners WHERE date.month = 'April'

• Although uniform sampling is used, the result is a biased sample of the relation
• A frequency relation is maintained over all tuples in the relation
• Different estimation mechanisms are used for Average, Count and Sum

• Average: the average taken over the set of distinct sample tuples that satisfy the query predicate is a good estimate of the true average
• Count: the sum of the expected contributions of all tuples in the sample that satisfy the given query
• Sum: the estimate is given by the product of the average and count estimates
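The three estimators can be sketched as follows. The slide does not spell out the "expected contribution" of a tuple, so this sketch assumes one plausible weighting: a sampled copy of a tuple with frequency f, in a sample of n tuples drawn from a multiset of size |L|, contributes |L| / (n · f). That formula, and every name and number in the example, is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SampledTuple:
    key: int          # tuple identifier in R
    state: str
    sales: float
    frequency: int    # number of copies of this tuple in the multiset L


def estimate_average(sample, predicate):
    """Average over the distinct sample tuples that satisfy the predicate."""
    distinct = {t.key: t for t in sample if predicate(t)}
    return sum(t.sales for t in distinct.values()) / len(distinct)


def estimate_count(sample, predicate, multiset_size):
    """Sum of per-copy contributions |L| / (n * frequency) (assumed weighting)."""
    n = len(sample)
    return sum(multiset_size / (n * t.frequency) for t in sample if predicate(t))


def estimate_sum(sample, predicate, multiset_size):
    """Product of the average and count estimates."""
    return estimate_average(sample, predicate) * estimate_count(
        sample, predicate, multiset_size
    )


# Hypothetical icicle of n = 4 copies drawn from a multiset of size |L| = 20.
a = SampledTuple(key=2, state="TX", sales=42.0, frequency=5)
b = SampledTuple(key=8, state="TX", sales=38.0, frequency=2)
c = SampledTuple(key=4, state="CA", sales=42.0, frequency=1)
sample = [a, a, b, c]


def is_tx(t):
    return t.state == "TX"


avg_tx = estimate_average(sample, is_tx)
count_tx = estimate_count(sample, is_tx, multiset_size=20)
sum_tx = estimate_sum(sample, is_tx, multiset_size=20)
print(avg_tx, count_tx, sum_tx)
```

Dividing by the frequency offsets the over-representation of frequently used tuples in the icicle. A query that matches no sample tuple would need separate handling, since the average is undefined in that case.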

• A frequency attribute is added to the relation
• The starting frequency is set to 1 for all tuples
• A tuple's frequency is incremented each time the tuple is used to answer a query
• Frequencies of the relevant tuples are updated only when the icicle is updated with a new query
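A minimal sketch of this bookkeeping, assuming tuples are identified by integer keys and the frequency attribute is held in an in-memory counter rather than as a column of the relation; the function names and the two example queries are hypothetical.

```python
from collections import Counter


def initial_frequencies(relation_keys):
    """Every tuple of the relation starts with frequency 1."""
    return Counter({key: 1 for key in relation_keys})


def record_query(frequencies, used_keys):
    """Increment the frequency of each tuple used to answer a query."""
    frequencies.update(used_keys)


freq = initial_frequencies(range(1, 11))   # tuple keys 1..10
record_query(freq, [2, 5, 7, 8])           # hypothetical query over four tuples
record_query(freq, [5, 7])                 # a second, narrower query

print(freq[5], freq[2], freq[1], sum(freq.values()))
```

The total of all frequencies equals the size of the multiset L, which is the quantity a frequency-weighted count estimate would need.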

• When queries exhibit data locality, the icicle contains more tuples from the frequently accessed subsets of the relation
• The accuracy of an approximate answer improves as the number of sample tuples used to compute it increases
• Queries that are 'focused' with respect to the workload therefore obtain more accurate approximate answers from the icicle

Q workload: template for generating workloads

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LI, C, O, S, N, R
WHERE C_Custkey = O_Custkey AND O_Orderkey = LI_Orderkey
  AND LI_Suppkey = S_Suppkey AND C_Nationkey = N_Nationkey
  AND N_Regionkey = R_Regionkey AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <=

Template for obtaining approximate answers

SELECT COUNT(*), AVG(LI_Extendedprice), SUM(LI_Extendedprice)
FROM LICOS-icicle, N, R
WHERE C_Nationkey = N_Nationkey AND N_Regionkey = R_Regionkey
  AND R_Name = [region]
  AND O_Orderdate >= Date[startdate] AND O_Orderdate <=

• The error plots compare:
  • A static uniform random sample on the join synopsis
  • The icicle as it evolves with the workload
  • Icicle-Complete, which is formed after the entire workload has been executed once

Mixed Workload

• The relative error of query answers from icicles decreases rapidly when queries are focused on a set of core tuples
• The Icicle plot converges to the Icicle-Complete plot
• Quick convergence of the Icicle plot towards Icicle-Complete means that the icicle adapts fast

• The improvement due to the use of icicles is not significant
• It can be concluded that icicles are at worst as good as the static samples

• Icicles provide a class of samples that adapt to the characteristics of the workload
• An icicle is never worse than a static sample
• An icicle focuses on relatively small subsets of the relation

• There are no significant gains in the case of a uniform workload
• There is a trade-off between accuracy and cost
• Icicles are restricted to scenarios where the queries in the workload tend to be increasingly focused

• V. Ganti, M. Lee, and R. Ramakrishnan. ICICLES: Self-tuning Samples for Approximate Query Answering. VLDB Conference.
• S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join Synopses for Approximate Query Answering. ACM SIGMOD Record, 1999.

Thank You Questions?