Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies. Parallel and Distributed Computing Laboratory, first-semester master's student, 김남희.

1 Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
Parallel and Distributed Computing Laboratory, first-semester master's student, 김남희

2 Contents
- Introduction
- Approximating the Size of the Cube
  - An Analytical Algorithm
  - A Sampling-Based Algorithm
  - An Algorithm Based on Probabilistic Counting
    - The Probabilistic Counting Algorithm
    - Approximating the Size of the Cube

3 Contents
- Evaluating the Accuracy of the Estimates
- Extensions to the PCSA-Based Algorithm
  - Estimating sub-cube sizes
  - Incremental estimation
  - Estimation after data removal
- Conclusion

4 Introduction
- Virtually all OLAP products resort to some degree of precomputation of aggregates.
  - The more that is precomputed, the faster queries can be answered.
  - This raises a storage-space problem.
- The problem: estimating how much storage will be required if all possible combinations of dimensions and their hierarchies are precomputed.
- A cube with hierarchies requires more storage than one without.

5 Introduction
- Three strategies
  - An analytical algorithm
  - A sampling-based algorithm
  - A probabilistic counting algorithm

6 An Analytical Algorithm (1/2)
Assumption: the data is uniformly distributed.
- If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is [formula missing in transcript].
- Using this result, the size of the group-by on any subset of the attributes can be estimated.
- [symbol missing]: size of the hierarchy on a dimension; [symbol missing]: number of dimensions; total number of group-bys: [formula missing in transcript].
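The formulas on this slide did not survive extraction. As a hedged sketch, the standard result for drawing r elements uniformly at random (with replacement) from n values is an expected n - n(1 - 1/n)^r distinct elements. The sketch below applies it to one group-by, and also counts group-bys under the assumption that each dimension either contributes one of its h_i hierarchy levels or is aggregated away, giving a product of (h_i + 1) terms. The function names and the symbols h_i are illustrative and do not come from the slides.

```python
from math import prod


def expected_distinct(n: int, r: int) -> float:
    """Expected number of distinct values when r elements are drawn
    uniformly at random, with replacement, from a set of n elements."""
    return n - n * (1.0 - 1.0 / n) ** r


def estimated_group_by_size(distinct_counts, num_tuples: int) -> float:
    """Estimate the size of one group-by under the uniformity assumption.

    distinct_counts holds the number of distinct values of each attribute
    in the group-by; their product is the number of possible combinations.
    """
    n = prod(distinct_counts)
    return expected_distinct(n, num_tuples)


def num_group_bys(hierarchy_sizes) -> int:
    """Assumed counting rule: dimension i offers h_i hierarchy levels plus
    the option of being aggregated away, hence (h_i + 1) choices."""
    return prod(h + 1 for h in hierarchy_sizes)


# Two dimensions with 100 distinct values each, 50,000 tuples, no hierarchy
# (hierarchy size 1 per dimension), as in Scheme 3 of the evaluation slides.
print(estimated_group_by_size([100, 100], 50_000))
print(num_group_bys([1, 1]))
```

Note that the estimate needs the distinct-value count of every attribute as input, which matches the disadvantage listed on the next slide.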

7 An Analytical Algorithm (2/2)
- Advantage: simple and fast.
- Disadvantages
  - It tends to overestimate the size of the cube.
  - It requires counts of distinct values.

8 A Sampling-Based Algorithm (1/2)
- Basic idea
  - Take a random subset of the database.
  - Compute the cube on that subset.
  - Scale up this estimate by the ratio of the data size to the sample size.
- Notation: D: database; s: sample; |D|: size of the database; |s|: size of the sample; CUBE(s): size of the cube computed on the sample s.
- Estimated size of the cube computed on the entire database D: CUBE(s) × |D| / |s|.
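The three steps above can be sketched as follows. This is a minimal illustration, assuming a cube without hierarchies whose size is the total number of distinct tuples over all subsets of the dimensions; `cube_size` and `sampled_cube_estimate` are hypothetical helper names.

```python
import random
from itertools import combinations


def cube_size(rows, num_dims: int) -> int:
    """Total number of tuples in the full cube (no hierarchies): the sum,
    over every subset of dimensions, of the distinct projected tuples."""
    total = 0
    for k in range(num_dims + 1):
        for dims in combinations(range(num_dims), k):
            total += len({tuple(row[d] for d in dims) for row in rows})
    return total


def sampled_cube_estimate(rows, num_dims: int, sample_size: int, seed: int = 0) -> float:
    """Compute the cube on a random sample and scale by |D| / |s|."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    return cube_size(sample, num_dims) * len(rows) / len(sample)


# A database made entirely of duplicates: the true cube has 4 tuples,
# but linear scaling of a 10-row sample multiplies that by 10.
rows = [(0, 0)] * 100
print(cube_size(rows, 2))                  # 4
print(sampled_cube_estimate(rows, 2, 10))  # 40.0
```

The all-duplicates example is deliberately extreme; it shows why linear scaling overestimates when the data contains many duplicates.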

9 A Sampling-Based Algorithm (2/2)
- Advantage: simple.
- Disadvantage: the estimate is biased in the case of projection.
  - It does not account for duplicates.

10 The Probabilistic Counting Algorithm (1/3)
- Probabilistic counting: estimating the number of distinct elements in a multiset.
- Algorithm: [not preserved in transcript]
  - The estimate produced by the algorithm will typically be within a factor of 2 of the actual size.
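The algorithm listing on this slide was lost in extraction. As a hedged sketch of Flajolet-Martin-style probabilistic counting with a single bitmap: hash each element, set the bitmap bit at the position of the lowest 1 bit of the hash, and estimate the distinct count as 2^R / φ, where R is the position of the first zero bit and φ ≈ 0.77351. The choice of SHA-256 as the hash and the 32-bit width are assumptions made here for illustration only.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis
BITS = 32      # assumed hash width for this sketch


def hash_value(item) -> int:
    """Map an item to a BITS-bit integer (SHA-256 is an assumption)."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[: BITS // 8], "big")


def rho(y: int) -> int:
    """Position of the least significant 1 bit of y (BITS if y == 0)."""
    return BITS if y == 0 else (y & -y).bit_length() - 1


def fm_estimate(items) -> float:
    """Single-bitmap probabilistic count of the distinct items."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << rho(hash_value(item))
    # R is the position of the first zero bit in the bitmap.
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return 2 ** r / PHI


data = list(range(1000)) * 3  # 1000 distinct values, each repeated 3 times
print(fm_estimate(data))
```

Because duplicates hash to the same bit, repeating elements never changes the bitmap, which is what makes the method suitable for counting distinct group-by tuples.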

11 The Probabilistic Counting Algorithm (2/3)
- The simplest way to improve the accuracy of the estimate is to use a set H of m hash functions and compute m different BITMAP vectors.
  - R: position of the leftmost zero in the BITMAP
  - [symbol missing]: the value of R obtained from hash function i
  - Average: [formula missing in transcript]

12 The Probabilistic Counting Algorithm (3/3)
- Stochastic averaging
- Number of distinct values: [formula missing in transcript]
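The estimator formula on this slide was lost. A hedged sketch of probabilistic counting with stochastic averaging (PCSA) as it is usually stated: one hash function is used, its low-order part selects one of m bitmaps, the remaining bits set a bit in that bitmap, and the estimate is (m / φ) · 2^(ΣR_i / m). The hash function, the 64-bit width, and m = 64 are assumptions of this sketch.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def hash64(item) -> int:
    """Map an item to a 64-bit integer (SHA-256 is an assumption)."""
    digest = hashlib.sha256(repr(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rho(y: int, width: int = 64) -> int:
    """Position of the least significant 1 bit of y (width if y == 0)."""
    return width if y == 0 else (y & -y).bit_length() - 1


def leftmost_zero(bitmap: int) -> int:
    """Position R of the first zero bit, counting from bit 0."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r


def pcsa_estimate(items, m: int = 64) -> float:
    """Estimate the number of distinct items using m bitmaps."""
    bitmaps = [0] * m
    for item in items:
        h = hash64(item)
        # h % m picks the bitmap; the remaining bits pick the bit to set.
        bitmaps[h % m] |= 1 << rho(h // m)
    total_r = sum(leftmost_zero(b) for b in bitmaps)
    return (m / PHI) * 2 ** (total_r / m)


print(pcsa_estimate(range(10_000)))
```

Using one hash and splitting its bits costs one hash computation per element, instead of the m computations needed when m independent hash functions are averaged. The estimator is known to be biased upward when the number of distinct values is small relative to m.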

13 Approximating the Size of the Cube (1/2)
- Algorithm: use the probabilistic counting algorithm to estimate the number of tuples that result from computing the cube on the base data. [algorithm listing not preserved in transcript]
- C: hierarchy
- bitset(C, BM, b): sets bit b of the BM-th bitmap for the combination of hierarchies C.
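The listing on this slide is missing, so the following is only a sketch of how the pieces might fit together: one scan of the base data, one set of m bitmaps per group-by, and the cube estimate as the sum of the per-group-by PCSA estimates. Representing each hierarchy level as a Python function from a base value to its value at that level is an assumption of this sketch, as are all helper names.

```python
import hashlib
from itertools import product

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def _hash(key) -> int:
    digest = hashlib.sha256(repr(key).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def _rho(y: int, width: int = 64) -> int:
    return width if y == 0 else (y & -y).bit_length() - 1


def _leftmost_zero(bitmap: int) -> int:
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r


def estimate_cube_size(rows, hierarchies, m: int = 64):
    """Estimate the cube size in a single scan of rows.

    hierarchies[d] lists the level functions of dimension d (assumed
    representation). None stands for "dimension aggregated away".
    Returns (total estimate, list of per-group-by estimates).
    """
    level_choices = [[None] + list(levels) for levels in hierarchies]
    group_bys = list(product(*level_choices))
    bitmaps = [[0] * m for _ in group_bys]
    for row in rows:
        for g, levels in enumerate(group_bys):
            # Project the row onto this combination of hierarchy levels.
            key = tuple(f(v) for f, v in zip(levels, row) if f is not None)
            h = _hash(key)
            # Plays the role of bitset(C, BM, b) on the slide.
            bitmaps[g][h % m] |= 1 << _rho(h // m)
    per_group_by = [
        (m / PHI) * 2 ** (sum(_leftmost_zero(b) for b in bms) / m)
        for bms in bitmaps
    ]
    return sum(per_group_by), per_group_by


def identity(v):
    return v


def tens(v):
    return v // 10  # a coarser, hypothetical hierarchy level


rows = [(i, i % 50) for i in range(5000)]
total, per_group_by = estimate_cube_size(rows, [[identity, tens], [identity]])
print(len(per_group_by), total)
```

Group-bys with very few distinct tuples (for example, the one that aggregates every dimension away) are overestimated by plain PCSA; this sketch makes no attempt to correct for that.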

14 Approximating the Size of the Cube (2/2)
- Lemma
  - The error in the sum of two estimates is [relation symbol missing in transcript] the error in a single estimate.
- Advantage
  - The algorithm guarantees an error bound on its estimate.
- Disadvantage
  - The guarantee comes at the cost of a complete scan of the base data table; however, even this scan is much cheaper than actually computing the cube.

15 Evaluating the Accuracy of the Estimates
- Scheme 1 [figure not preserved in transcript]

16 Evaluating the Accuracy of the Estimates
- Scheme 2 [figure not preserved in transcript]

17 Evaluating the Accuracy of the Estimates
- Scheme 3
  - D0, D1: the two dimensions.
  - Each dimension has 100 unique values.
  - The database consists of 50,000 tuples.
  - There is no hierarchy on either dimension.

18 Extensions to the PCSA-Based Algorithm
- Estimating sub-cube sizes
- Incremental estimation
  - Adding new data changes the sizes of some of the group-bys.
  - This change can be estimated by updating the bitmaps used by the previous estimation.
  - To estimate the cube size, the bitmaps corresponding to every combination of group-bys have to be stored.
  - |C|: number of group-bys; L: length of each bitmap; m: number of bitmaps per group-by.
  - Storage needed for the bitmaps: |C| * L * m
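A small sketch of the two points above: the storage arithmetic |C| * L * m (in bits, one bit per bitmap position), and an incremental update in which bitmaps built from a new batch of data are OR-ed into the stored ones. Treating the update as a bitwise OR of per-batch bitmaps is an assumption of this sketch; it is equivalent to setting the new tuples' bits directly in the stored bitmaps. The function names are illustrative.

```python
def bitmap_storage_bits(num_group_bys: int, bitmap_length: int, bitmaps_per_group_by: int) -> int:
    """Storage for the stored bitmaps, in bits: |C| * L * m."""
    return num_group_bys * bitmap_length * bitmaps_per_group_by


def merge_bitmaps(stored, new_batch):
    """Fold the bitmaps of a new batch into the stored bitmaps.

    Both arguments are equal-length lists of integers, one per bitmap;
    a bit is set in the result if it is set in either input.
    """
    if len(stored) != len(new_batch):
        raise ValueError("bitmap lists must have the same length")
    return [old | new for old, new in zip(stored, new_batch)]


# 6 group-bys, 32-bit bitmaps, 64 bitmaps per group-by.
print(bitmap_storage_bits(6, 32, 64))                      # 12288
print(merge_bitmaps([0b001, 0b010], [0b100, 0b010]))       # [5, 2]
```

Since bits are only ever set, the stored bitmaps can absorb insertions without rescanning the old data.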

19 Extensions to the PCSA-Based Algorithm
- Estimation after data removal
  - For each bitmap, the number of "hits" for each bit has to be stored.
  - |C|: number of group-bys in the cube; L: length of each count array; m: number of count arrays per group-by; I: size of an integer.
  - Storage needed for the count arrays: |C| * L * m * I
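A hedged sketch of the count-array idea for a single group-by: each bit of each bitmap is replaced by a hit counter, so removing a tuple decrements the counter and a position counts as "set" only while its counter is positive. The class name, hash choice, and sizes are assumptions of this sketch, and it assumes that only previously added tuples are ever removed.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


class CountingPCSA:
    """PCSA with per-bit hit counters, for one group-by (illustrative)."""

    def __init__(self, m: int = 64, length: int = 64):
        self.m = m
        self.length = length
        self.counts = [[0] * length for _ in range(m)]

    def _slot(self, item):
        digest = hashlib.sha256(repr(item).encode()).digest()
        h = int.from_bytes(digest[:8], "big")
        rest = h // self.m
        bit = self.length - 1 if rest == 0 else (rest & -rest).bit_length() - 1
        return h % self.m, min(bit, self.length - 1)

    def add(self, item) -> None:
        i, b = self._slot(item)
        self.counts[i][b] += 1

    def remove(self, item) -> None:
        i, b = self._slot(item)
        if self.counts[i][b] == 0:
            raise ValueError("item was never added")
        self.counts[i][b] -= 1

    def estimate(self) -> float:
        total_r = 0
        for array in self.counts:
            r = 0
            while r < self.length and array[r] > 0:
                r += 1
            total_r += r
        return (self.m / PHI) * 2 ** (total_r / self.m)


def count_array_storage_bits(num_group_bys: int, length: int, arrays_per_group_by: int, int_bits: int) -> int:
    """Storage for the count arrays, in bits: |C| * L * m * I."""
    return num_group_bys * length * arrays_per_group_by * int_bits


sketch = CountingPCSA()
for v in range(5000):
    sketch.add(v)
before = sketch.estimate()
for v in range(5000, 6000):
    sketch.add(v)
for v in range(5000, 6000):
    sketch.remove(v)
print(before == sketch.estimate())  # True: removal restores the counters
```

The counters cost a factor of I more storage than plain bitmaps, which is the trade-off the slide's formula expresses.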

20 Conclusion
- Three strategies to estimate the blowup
  - Algorithm based on sampling
    - Overestimates the size of the cube.
    - Strongly dependent on the number of duplicates.
  - Algorithm based on assuming the data is uniformly distributed
    - Works well if the data is uniformly distributed.
    - Becomes inaccurate as the skew in the data increases.
    - The analytical estimate was more accurate than the sampling-based estimate for widely varying skew in the data.

21 Conclusion
  - Probabilistic counting algorithm
    - Performs very well under various degrees of skew, always giving an estimate with a bounded error.
    - Provides a more reliable, accurate, and predictable estimate than the other algorithms.

