Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. Parallel and Distributed Computing Lab, first-semester master's student 김남희.

Presentation transcript:

Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
Parallel and Distributed Computing Lab, first-semester master's student 김남희

Contents
- Introduction
- Approximating the Size of the Cube
  - An Analytical Algorithm
  - A Sampling-Based Algorithm
  - An Algorithm Based on Probabilistic Counting
  - The Probabilistic Counting Algorithm
  - Approximating the Size of the Cube

Contents
- Evaluating the Accuracy of the Estimates
- Extensions to the PCSA-Based Algorithm
  - Estimating sub-cube sizes
  - Incremental estimation
  - Estimation after data removal
- Conclusion

Introduction
- Virtually all OLAP products resort to some degree of precomputation of these aggregates.
  - The more that is precomputed, the faster queries can be answered.
  - The cost is storage space.
- The problem: estimating how much storage will be required if all possible combinations of dimensions and their hierarchies are precomputed.
- A cube with hierarchies requires more storage than one without.

Introduction
- Three strategies
  - An analytical algorithm
  - A sampling-based algorithm
  - A probabilistic counting algorithm

An Analytical Algorithm (1/2)
- Assumption: the data is uniformly distributed.
  - If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is $n - n\left(1 - \frac{1}{n}\right)^{r}$.
  - Using this result, the size of the group-by on any subset of attributes can be estimated.
  - Notation: $h_i$: size of the hierarchy of dimension $i$; $k$: number of dimensions
  - Total number of group-bys: $\prod_{i=1}^{k} (h_i + 1)$
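A minimal sketch of how the analytical estimate could be computed, assuming the standard expected-distinct result n - n(1 - 1/n)^r for r uniform draws from n values. The input layout (`level_counts`, one list of distinct-value counts per dimension) and both function names are hypothetical, chosen only for illustration. Each dimension is assumed to contribute its hierarchy levels plus one rolled-up "ALL" level with a single value.

```python
import math
from itertools import product


def expected_distinct(n, r):
    """Expected number of distinct values when r draws are made uniformly from n values.

    Evaluates n - n * (1 - 1/n)**r in a numerically stable way.
    """
    if r == 0 or n == 0:
        return 0.0
    if n == 1:
        return 1.0
    return -n * math.expm1(r * math.log1p(-1.0 / n))


def analytical_cube_size(level_counts, num_tuples):
    """Sum the expected sizes of all group-bys under the uniformity assumption.

    level_counts[i] lists the number of distinct values at each hierarchy
    level of dimension i (hypothetical input layout).
    """
    # Each dimension is either at one of its levels or rolled up to ALL (1 value).
    choices = [list(levels) + [1] for levels in level_counts]
    total = 0.0
    for combo in product(*choices):
        # A group-by has at most prod(combo) possible cells; num_tuples rows fall into them.
        total += expected_distinct(math.prod(combo), num_tuples)
    return total


if __name__ == "__main__":
    # Two dimensions, 100 distinct values each, no hierarchies, 50,000 tuples.
    print(analytical_cube_size([[100], [100]], 50_000))
```

The only statistics this needs are distinct-value counts and the number of tuples, which is why the approach is fast; any skew in the real data is ignored.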

An Analytical Algorithm (2/2)
- Advantage: simple and fast
- Disadvantages
  - It tends to overestimate the size of the cube.
  - It requires counts of distinct values.

A Sampling-Based Algorithm (1/2)
- Basic idea
  - Take a random subset of the database.
  - Compute the cube on that subset.
  - Scale up this estimate by the ratio of the data size to the sample size.
- Notation: D: database; s: sample; |D|: size of the database; |s|: size of the sample; CUBE(s): size of the cube computed on the sample s
- Estimated size of the cube computed on the entire database D: $\mathrm{CUBE}(s) \cdot \frac{|D|}{|s|}$
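The three steps above can be sketched as follows. This is an illustration under simplifying assumptions, not the presenters' implementation: rows are plain tuples, there are no hierarchies (one group-by per subset of dimensions), and the function names and the fixed seed are hypothetical.

```python
import random
from itertools import combinations


def cube_size(rows):
    """Number of tuples in the full cube of `rows`: one group-by per subset of dimensions."""
    if not rows:
        return 0
    k = len(rows[0])
    total = 0
    for j in range(k + 1):
        for dims in combinations(range(k), j):
            # Size of one group-by = number of distinct projections onto `dims`.
            total += len({tuple(row[d] for d in dims) for row in rows})
    return total


def sampled_cube_estimate(rows, sample_size, seed=0):
    """Compute the cube on a random sample and scale by |D| / |s|."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    return cube_size(sample) * len(rows) / len(sample)


if __name__ == "__main__":
    data = [(i % 10, i % 7) for i in range(1000)]
    print("exact:", cube_size(data))
    print("estimate from a 10% sample:", sampled_cube_estimate(data, 100))
```

On this toy data most group-by tuples are duplicated many times in the full table, so linear scaling of the sample's cube size overshoots the exact size.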

A Sampling-Based Algorithm (2/2)
- Advantage: simple
- Disadvantage: the estimate is biased in the case of projection.
  - It does not account for duplicates.

The Probabilistic Counting Algorithm (1/3)
- A probabilistic algorithm for counting the number of distinct elements in a multiset.
- Algorithm: [listing shown as an image on the original slide]
  - The estimate formed by this algorithm will typically be within a factor of 2 of the actual size.
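The algorithm listing did not survive in this transcript. As a stand-in, here is a minimal sketch of basic probabilistic counting in the style of Flajolet and Martin, which matches the BITMAP and leftmost-zero vocabulary used on the following slide. The hash function (SHA-256 truncated to 32 bits), the 32-bit bitmap length, and all function names are assumptions; the constant 0.77351 is the usual Flajolet-Martin correction factor.

```python
import hashlib

PHI = 0.77351   # Flajolet-Martin correction constant
LENGTH = 32     # bits in the BITMAP (assumed)


def hash32(value):
    """Deterministic 32-bit hash of any value with a stable repr (assumed hash choice)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def rho(x):
    """Index of the least-significant 1 bit of x (capped at LENGTH - 1 when x == 0)."""
    if x == 0:
        return LENGTH - 1
    return (x & -x).bit_length() - 1


def leftmost_zero(bitmap):
    """R: position of the first 0 bit, counting from bit 0."""
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def distinct_estimate(items):
    """Estimate the number of distinct items as 2**R / PHI."""
    bitmap = 0
    for item in items:
        # Duplicates hash to the same bit, so they never change the bitmap.
        bitmap |= 1 << rho(hash32(item))
    return 2 ** leftmost_zero(bitmap) / PHI


if __name__ == "__main__":
    print(distinct_estimate(i % 1000 for i in range(10_000)))
```

Because R is an integer, a single bitmap can only produce estimates that are powers of two divided by the constant, which is consistent with the coarse factor-of-2 accuracy stated on the slide.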

The Probabilistic Counting Algorithm (2/3)
- The simplest way to improve the accuracy of the estimate is to use a set H of m hash functions and compute m different BITMAP vectors.
  - $R$: position of the leftmost zero in the BITMAP
  - $R_i$: the value of $R$ obtained from hash function $i$
  - Average: $A = \frac{1}{m} \sum_{i=1}^{m} R_i$, giving the estimate $2^{A} / \varphi$ with $\varphi \approx 0.77351$

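The Probabilistic Counting Algorithm (3/3)
- Stochastic averaging
  - Estimated # of distinct values: $\frac{m}{\varphi} \, 2^{A}$, where $A = \frac{1}{m} \sum_{i=1}^{m} R_i$ is the average leftmost-zero position over the m bitmaps and $\varphi \approx 0.77351$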
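A minimal sketch of probabilistic counting with stochastic averaging (PCSA), as it is usually described: a single hash value is split so that `h mod m` selects one of the m bitmaps and `h div m` supplies the bit to set. The hash choice (SHA-256 truncated to 64 bits), the defaults m = 64 and 32-bit bitmaps, and the class layout are assumptions made for illustration.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def hash64(value):
    """Deterministic 64-bit hash of any value with a stable repr (assumed hash choice)."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class PCSA:
    def __init__(self, m=64, length=32):
        self.m = m                # number of bitmaps
        self.length = length      # bits per bitmap
        self.bitmaps = [0] * m

    def add(self, item):
        h = hash64(item)
        which, rest = h % self.m, h // self.m
        # Position of the least-significant 1 bit of `rest`, capped to the bitmap length.
        bit = (rest & -rest).bit_length() - 1 if rest else self.length - 1
        self.bitmaps[which] |= 1 << min(bit, self.length - 1)

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while bitmap & (1 << r):   # R_i: leftmost zero of bitmap i
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


if __name__ == "__main__":
    sketch = PCSA()
    for i in range(100_000):
        sketch.add(i % 20_000)   # 20,000 distinct values, each seen 5 times
    print(sketch.estimate())
```

Only one hash is computed per element regardless of m, which is the practical advantage over evaluating m separate hash functions. This plain estimator is known to overestimate when the number of distinct values is small relative to m.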

Approximating the Size of the Cube (1/2)
- Algorithm: uses the probabilistic counting algorithm to estimate the number of tuples that result from computing the cube on the base data. [listing shown as an image on the original slide]
  - C: a combination of hierarchies
  - bitset(C, BM, b): sets the b-th bit of the BM-th bitmap for the hierarchy combination C
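The original listing is missing from the transcript, so the following is only a sketch of how such a one-pass estimator might look, assuming a PCSA-style bitmap set per hierarchy combination. The representation of hierarchies as lists of mapping functions, the hash choice, and the constants M and LENGTH are all hypothetical; the line marked `bitset(C, BM, b)` corresponds to the operation named on the slide.

```python
import hashlib
from itertools import product

PHI = 0.77351   # Flajolet-Martin correction constant
M = 64          # bitmaps per group-by (assumed)
LENGTH = 32     # bits per bitmap (assumed)


def hash64(value):
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def leftmost_zero(bitmap):
    r = 0
    while bitmap & (1 << r):
        r += 1
    return r


def estimate_cube_size(rows, hierarchies):
    """One scan over `rows`; hierarchies[i] lists functions that map a base
    value of dimension i to its value at each hierarchy level."""
    # Every dimension may sit at any of its levels or be rolled up to ALL.
    choices = [list(levels) + [lambda v: "ALL"] for levels in hierarchies]
    combos = list(product(*choices))
    bitmaps = [[0] * M for _ in combos]
    for row in rows:
        for c, combo in enumerate(combos):
            key = tuple(f(v) for f, v in zip(combo, row))
            h = hash64(key)
            bm, rest = h % M, h // M
            b = (rest & -rest).bit_length() - 1 if rest else LENGTH - 1
            bitmaps[c][bm] |= 1 << min(b, LENGTH - 1)   # bitset(C, BM, b)
    # Cube size = sum of the per-group-by distinct-count estimates.
    return sum(
        M / PHI * 2 ** (sum(leftmost_zero(x) for x in group) / M)
        for group in bitmaps
    )


if __name__ == "__main__":
    # Hypothetical example: a day dimension with a month level, plus a product id.
    rows = [(i % 360, i % 500) for i in range(20_000)]
    hierarchies = [[lambda d: d, lambda d: d // 30], [lambda p: p]]
    print(estimate_cube_size(rows, hierarchies))
```

Very small group-bys (for example the fully rolled-up one, which has a single tuple) are overestimated by this plain estimator, so the sketch is most meaningful when the large group-bys dominate the total.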

Approximating the Size of the Cube (2/2)
- Lemma
  - The error in the sum of two estimates is ≤ the error in a single estimate.
- Advantage
  - This algorithm guarantees an error bound on its estimate.
- Disadvantage
  - This comes at the cost of a complete scan of the base data table; however, even this scan is much cheaper than actually computing the cube.

Evaluating the Accuracy of the Estimates
- Scheme 1

Evaluating the Accuracy of the Estimates
- Scheme 2

Evaluating the Accuracy of the Estimates
- Scheme 3
  - D0, D1: dimensions
  - Each dimension has 100 unique values.
  - The database consists of 50,000 tuples.
  - There is no hierarchy on either dimension.

Extensions to the PCSA-Based Algorithm
- Estimating sub-cube sizes
- Incremental estimation
  - The addition of new data changes the sizes of some of the group-bys.
  - This change can be estimated by updating the bitmaps used by the previous estimation.
  - To estimate the cube size, the bitmaps corresponding to every combination of group-bys have to be stored.
  - Notation: |C|: # of group-bys; L: length of each bitmap; m: # of bitmaps per group-by
  - Storage needed for the bitmaps: $|C| \cdot L \cdot m$
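A small worked example of the storage formula |C| * L * m. The hierarchy sizes and the values L = 32 and m = 64 are made-up numbers for illustration, and the group-by count assumes each dimension contributes its hierarchy levels plus one rolled-up level.

```python
from math import prod


def num_group_bys(hierarchy_sizes):
    """|C|: each dimension is at one of its h levels or rolled up, so (h + 1) choices."""
    return prod(h + 1 for h in hierarchy_sizes)


def bitmap_storage_bits(group_bys, bitmap_length, bitmaps_per_group_by):
    """Storage for the retained bitmaps: |C| * L * m, in bits."""
    return group_bys * bitmap_length * bitmaps_per_group_by


if __name__ == "__main__":
    c = num_group_bys([3, 2, 1])            # 4 * 3 * 2 = 24 group-bys
    bits = bitmap_storage_bits(c, 32, 64)   # 24 * 32 * 64 = 49152 bits
    print(c, bits, bits // 8)               # prints "24 49152 6144"
```

With these hypothetical numbers the retained state is about 6 KB, independent of the number of tuples in the base table.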

Extensions to the PCSA-Based Algorithm
- Estimation after data removal
  - For each bitmap, the # of "hits" for each bit has to be stored.
  - Notation: |C|: # of group-bys in the cube; L: length of each count array; m: # of count arrays per group-by; I: size of an integer
  - Storage needed for the count arrays: $|C| \cdot L \cdot m \cdot I$
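A sketch of the count-array idea for a single group-by, assuming a PCSA-style layout: every bit is replaced by a hit counter, a bit is treated as set while its counter is positive, and removing a tuple decrements the counter it originally incremented. The class name, hash choice, and defaults are hypothetical.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


def hash64(value):
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class CountingPCSA:
    def __init__(self, m=64, length=32):
        self.m = m
        self.length = length
        # One count array of `length` integers per bitmap.
        self.counts = [[0] * length for _ in range(m)]

    def _slot(self, item):
        h = hash64(item)
        which, rest = h % self.m, h // self.m
        bit = (rest & -rest).bit_length() - 1 if rest else self.length - 1
        return which, min(bit, self.length - 1)

    def add(self, item):
        """Call once per inserted tuple, duplicates included."""
        which, bit = self._slot(item)
        self.counts[which][bit] += 1

    def remove(self, item):
        """Call once per deleted tuple; the tuple must have been added before."""
        which, bit = self._slot(item)
        if self.counts[which][bit] == 0:
            raise ValueError("item was never added")
        self.counts[which][bit] -= 1

    def estimate(self):
        total = 0
        for row in self.counts:
            r = 0
            while r < self.length and row[r] > 0:   # a bit is set while its hit count > 0
                r += 1
            total += r
        return self.m / PHI * 2 ** (total / self.m)


if __name__ == "__main__":
    sketch = CountingPCSA()
    for i in range(5_000):
        sketch.add(i)
    before = sketch.estimate()
    for i in range(5_000, 10_000):
        sketch.add(i)
    for i in range(5_000, 10_000):
        sketch.remove(i)
    print(before == sketch.estimate())
```

The extra factor I in the storage formula is visible here: each former bit now occupies a full integer.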

Conclusion
- Three strategies to estimate the blowup
  - Algorithm based on sampling
    - Overestimates the size of the cube
    - Strongly dependent on the # of duplicates
  - Algorithm based on assuming the data is uniformly distributed
    - Works well if the data is uniformly distributed
    - Becomes inaccurate as the skew in the data increases
    - The analytical estimate was more accurate than the sampling-based estimate for widely varying skew in the data.

Conclusion
  - Probabilistic counting algorithm
    - Performs very well under various degrees of skew, always giving an estimate with a bounded error
    - Provides a more reliable, accurate, and predictable estimate than the other algorithms