Published by Dylan O’Brien. Modified over 8 years ago.
1
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
Parallel and Distributed Computing Laboratory, first-semester master's student, 김남희
2
Contents
- Introduction
- Approximating the size of the cube
  - An analytical algorithm
  - A sampling-based algorithm
  - An algorithm based on probabilistic counting
    - The probabilistic counting algorithm
    - Approximating the size of the cube
3
Contents
- Evaluating the accuracy of the estimates
- Extension to the PCSA-based algorithm
  - Estimating sub-cube sizes
  - Incremental estimation
  - Estimation after data removal
- Conclusion
4
Introduction
- Virtually all OLAP products resort to some degree of precomputation of these aggregates.
  - The more that is precomputed, the faster queries can be answered.
  - This raises a storage-space problem.
- The problem: estimating how much storage will be required if all possible combinations of dimensions and their hierarchies are precomputed.
- A cube with hierarchies requires more storage than one without.
5
Introduction
- Three strategies
  - An analytical algorithm
  - A sampling-based algorithm
  - A probabilistic counting algorithm
6
An Analytical Algorithm (1/2)
- Assumes the data is uniformly distributed.
- If r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is $n - n\left(1 - \frac{1}{n}\right)^{r}$.
- Using this result, the size of the group-by on any subset of the attributes can be estimated.
- $h_i$: size of the hierarchy of dimension $i$; $k$: number of dimensions; the total number of group-bys: $\prod_{i=1}^{k} (h_i + 1)$.
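The analytical estimate can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the `level_cardinalities` input (one list of distinct-value counts per dimension, one entry per hierarchy level) are assumptions made for the example, and each dimension gets an extra level of cardinality 1 standing for "aggregated away".

```python
from itertools import product
from math import prod


def expected_distinct(r, n):
    """Expected number of distinct elements when r elements are drawn
    uniformly at random from a set of n elements."""
    return n - n * (1.0 - 1.0 / n) ** r


def num_group_bys(level_cardinalities):
    """Each dimension contributes one of its hierarchy levels or 'ALL'."""
    return prod(len(levels) + 1 for levels in level_cardinalities)


def analytical_cube_size(num_tuples, level_cardinalities):
    """Sum the expected group-by sizes over every combination of levels.

    Assumption for this sketch: the number of possible cells of a group-by
    is the product of the distinct-value counts of the chosen levels.
    """
    choices = [levels + [1] for levels in level_cardinalities]  # 1 = 'ALL'
    total = 0.0
    for combo in product(*choices):
        total += expected_distinct(num_tuples, prod(combo))
    return total


if __name__ == "__main__":
    # Two dimensions with 100 distinct values each, no hierarchies,
    # 50,000 tuples.
    print(analytical_cube_size(50_000, [[100], [100]]))
```

Only the distinct-value counts and the tuple count are needed, which is why the method is fast; it never looks at the data itself.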
7
An Analytical Algorithm (2/2)
- Advantages: simple, fast
- Disadvantages
  - Tends to overestimate the size of the cube.
  - Requires counts of the distinct values.
8
A Sampling-Based Algorithm (1/2)
- Basic idea
  - Take a random subset of the database.
  - Compute the cube on that subset.
  - Scale up this estimate by the ratio of the data size to the sample size.
- Notation: D: database; s: sample; |D|: size of the database; |s|: size of the sample; CUBE(s): size of the cube computed on the sample s.
- The estimated size of the cube computed on the entire database D: $\mathrm{CUBE}(s) \cdot \frac{|D|}{|s|}$
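A minimal sketch of the sampling idea, assuming rows are tuples with one column per dimension and no hierarchies. The helper names `cube_size` and `sampling_estimate` are hypothetical and chosen only for this example.

```python
import random
from itertools import combinations


def cube_size(rows):
    """Exact cube size: total number of distinct tuples over all group-bys
    (every subset of the columns, including the empty 'ALL' group-by)."""
    if not rows:
        return 0
    num_dims = len(rows[0])
    total = 0
    for k in range(num_dims + 1):
        for cols in combinations(range(num_dims), k):
            total += len({tuple(row[c] for c in cols) for row in rows})
    return total


def sampling_estimate(rows, sample_size, seed=0):
    """Compute the cube on a random sample and scale by |D| / |s|."""
    sample = random.Random(seed).sample(rows, sample_size)
    return cube_size(sample) * len(rows) / sample_size


if __name__ == "__main__":
    rows = [(0, 0)] * 1000            # 1,000 copies of the same tuple
    print(cube_size(rows))            # exact size of the cube
    print(sampling_estimate(rows, 100))
```

With heavily duplicated data the scaled-up figure is far above the exact cube size, because linear scaling treats every sampled tuple as if it stood for new distinct tuples.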
9
A Sampling-Based Algorithm (2/2)
- Advantage: simple
- Disadvantage: biased in the case of projection; duplicates are not taken into account.
10
The Probabilistic Counting Algorithm (1/3)
- A probabilistic algorithm for counting the number of distinct elements in a multiset.
- Algorithm
- The estimate formed from the above will typically be within a factor of 2 of the actual size.
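A minimal sketch of basic probabilistic counting with a single bitmap, in the style of Flajolet and Martin. The hash function (truncated SHA-256 of `repr(item)`), the bitmap length of 32, and all function names are assumptions for this example; the correction constant 0.77351 is the one used in Flajolet and Martin's analysis.

```python
import hashlib

PHI = 0.77351          # Flajolet-Martin correction constant
BITMAP_LENGTH = 32     # assumed bitmap length for this sketch


def hash64(item):
    """Deterministic 64-bit hash (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rho(value):
    """Index of the least significant 1 bit, capped at the bitmap length."""
    if value == 0:
        return BITMAP_LENGTH - 1
    return min((value & -value).bit_length() - 1, BITMAP_LENGTH - 1)


def leftmost_zero(bitmap):
    """R: position of the first 0 bit, counting from bit 0."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r


def probabilistic_count(items):
    """Estimate the number of distinct items as 2^R / PHI."""
    bitmap = 0
    for item in items:
        bitmap |= 1 << rho(hash64(item))
    return 2 ** leftmost_zero(bitmap) / PHI


if __name__ == "__main__":
    print(probabilistic_count(range(1000)))
```

Because setting a bit twice changes nothing, duplicates have no effect on the estimate, which is what makes the method suitable for counting distinct values.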
11
The Probabilistic Counting Algorithm (2/3)
- The simplest way to improve the accuracy of the estimate is to use a set H of m hashing functions and compute m different BITMAP vectors.
- R: position of the leftmost zero in the BITMAP
- $R_i$: value of R obtained from hashing function $i$
- Average: $A = \frac{R_1 + R_2 + \cdots + R_m}{m}$
12
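The Probabilistic Counting Algorithm (3/3)
- Stochastic averaging
- Number of distinct values: $\frac{m}{\varphi}\, 2^{(R_1 + \cdots + R_m)/m}$, where $R_j$ is the position of the leftmost zero in the $j$-th of the $m$ bitmaps and $\varphi \approx 0.77351$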
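A minimal sketch of probabilistic counting with stochastic averaging (PCSA): one hash function is used, part of the hash value selects one of m bitmaps and the rest sets a bit in it. The hash function, the bitmap length, the default m = 64, and the function names are assumptions for this example.

```python
import hashlib

PHI = 0.77351          # Flajolet-Martin correction constant
BITMAP_LENGTH = 32     # assumed bitmap length for this sketch


def hash64(item):
    """Deterministic 64-bit hash (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rho(value):
    """Index of the least significant 1 bit, capped at the bitmap length."""
    if value == 0:
        return BITMAP_LENGTH - 1
    return min((value & -value).bit_length() - 1, BITMAP_LENGTH - 1)


def leftmost_zero(bitmap):
    """R: position of the first 0 bit, counting from bit 0."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r


def pcsa_estimate(items, m=64):
    """Estimate the number of distinct items with m bitmaps."""
    bitmaps = [0] * m
    for item in items:
        h = hash64(item)
        index, rest = h % m, h // m      # pick a bitmap, then a bit
        bitmaps[index] |= 1 << rho(rest)
    mean_r = sum(leftmost_zero(b) for b in bitmaps) / m
    return m / PHI * 2 ** mean_r


if __name__ == "__main__":
    print(pcsa_estimate(range(10_000)))
```

Each item costs a single hash computation regardless of m, which is the practical advantage over keeping m independent hash functions.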
13
Approximating the Size of the Cube (1/2)
- Algorithm: uses the probabilistic counting algorithm to estimate the number of tuples resulting from computing the cube on the base data.
- C: hierarchy
- bitset(C, BM, b): sets the b-th bit of the BM-th bitmap for the combination of hierarchy C.
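One way the cube-size estimate could be put together is sketched below: a single scan of the base data, with one set of PCSA bitmaps per group-by, and the per-group-by estimates summed at the end. The data layout (rows as tuples, each dimension described by the column indexes of its hierarchy levels), the hash function, and all names are assumptions for this example rather than the presented algorithm verbatim.

```python
import hashlib
from itertools import product

PHI = 0.77351          # Flajolet-Martin correction constant
BITMAP_LENGTH = 32     # assumed bitmap length for this sketch


def hash64(item):
    """Deterministic 64-bit hash (assumption: truncated SHA-256)."""
    digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rho(value):
    """Index of the least significant 1 bit, capped at the bitmap length."""
    if value == 0:
        return BITMAP_LENGTH - 1
    return min((value & -value).bit_length() - 1, BITMAP_LENGTH - 1)


def leftmost_zero(bitmap):
    """Position of the first 0 bit, counting from bit 0."""
    r = 0
    while (bitmap >> r) & 1:
        r += 1
    return r


def estimate_cube_size(rows, dimensions, m=64):
    """dimensions: per dimension, the column indexes of its hierarchy levels.
    None marks a dimension that is aggregated away in a group-by."""
    choices = [levels + [None] for levels in dimensions]
    group_bys = list(product(*choices))
    bitmaps = {c: [0] * m for c in group_bys}
    for row in rows:                       # single scan of the base data
        for c in group_bys:
            key = tuple(row[col] for col in c if col is not None)
            h = hash64((c, key))
            bitmaps[c][h % m] |= 1 << rho(h // m)
    total = 0.0
    for c in group_bys:
        mean_r = sum(leftmost_zero(b) for b in bitmaps[c]) / m
        total += m / PHI * 2 ** mean_r     # PCSA estimate for this group-by
    return total


if __name__ == "__main__":
    rows = [(i % 100, i // 100) for i in range(10_000)]
    print(estimate_cube_size(rows, [[0], [1]]))
```

Keeping separate bitmaps per group-by is also what makes sub-cube estimates possible, since any subset of the per-group-by estimates can be summed.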
14
Approximating the Size of the Cube (2/2)
- Lemma: the relative error in the sum of two estimates is no greater than the relative error in a single estimate.
- Advantage: the algorithm guarantees an error bound on its estimate.
- Disadvantage: this comes at the cost of a complete scan of the base data table; however, even this scan is much cheaper than actually computing the cube.
15
Evaluating the Accuracy of the Estimates
- Scheme 1
16
Evaluating the Accuracy of the Estimates
- Scheme 2
17
Evaluating the Accuracy of the Estimates
- Scheme 3
  - D0, D1: dimensions
  - Each dimension has 100 unique values.
  - The database consists of 50,000 tuples.
  - There is no hierarchy on either dimension.
18
Extension to the PCSA-Based Algorithm
- Estimating sub-cube sizes
- Incremental estimation
  - The addition of new data changes the sizes of some of the group-bys.
  - This change can be estimated by updating the bitmaps used by the previous estimation.
  - To estimate the cube size, the bitmaps corresponding to every combination of group-bys have to be stored.
  - |C|: number of group-bys; L: length of each bitmap; m: number of bitmaps per group-by
  - Storage needed for the bitmaps: |C| * L * m bits
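Incremental estimation can be sketched for a single group-by as follows: the bitmaps are kept between runs and new tuples are simply added to them. The class name `PCSASketch`, the hash function, and the default sizes are assumptions for this example; the storage helper just evaluates |C| * L * m.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


class PCSASketch:
    """Stored bitmaps for one group-by (hypothetical helper class)."""

    def __init__(self, m=64, length=32):
        self.m = m
        self.length = length
        self.bitmaps = [0] * m

    def add(self, item):
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index, rest = h % self.m, h // self.m
        # Index of the least significant 1 bit, capped at the bitmap length.
        bit = (rest & -rest).bit_length() - 1 if rest else self.length - 1
        self.bitmaps[index] |= 1 << min(bit, self.length - 1)

    def estimate(self):
        total_r = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:   # position of the first 0 bit
                r += 1
            total_r += r
        return self.m / PHI * 2 ** (total_r / self.m)


def bitmap_storage_bits(num_group_bys, length, m):
    """Storage for all bitmaps: |C| * L * m bits."""
    return num_group_bys * length * m


if __name__ == "__main__":
    sketch = PCSASketch()
    for x in range(5_000):             # initial load
        sketch.add(x)
    print(sketch.estimate())
    for x in range(5_000, 10_000):     # new data arrives later
        sketch.add(x)
    print(sketch.estimate())
    print(bitmap_storage_bits(4, 32, 64))
```

Updating stored bitmaps yields exactly the same bitmaps as rescanning all the data, so the incremental estimate loses nothing compared with a full recomputation.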
19
Extension to the PCSA-Based Algorithm
- Estimation after data removal
  - For each bitmap, the number of "hits" for each bit has to be stored.
  - |C|: number of group-bys in the cube; L: length of each count array; m: number of count arrays per group-by; I: size of an integer
  - Storage needed for the count arrays: |C| * L * m * I
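Supporting removal can be sketched by replacing each bit with a hit counter: a bit counts as set while its counter is positive. The class name `CountingPCSASketch`, the hash function, and the default sizes are assumptions for this example, and `remove` assumes the tuple was added earlier.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin correction constant


class CountingPCSASketch:
    """Count arrays for one group-by (hypothetical helper class)."""

    def __init__(self, m=64, length=32):
        self.m = m
        self.length = length
        self.counts = [[0] * length for _ in range(m)]

    def _locate(self, item):
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index, rest = h % self.m, h // self.m
        # Index of the least significant 1 bit, capped at the array length.
        bit = (rest & -rest).bit_length() - 1 if rest else self.length - 1
        return index, min(bit, self.length - 1)

    def add(self, item):
        index, bit = self._locate(item)
        self.counts[index][bit] += 1

    def remove(self, item):
        # Assumption: the item was previously added to this sketch.
        index, bit = self._locate(item)
        self.counts[index][bit] -= 1

    def estimate(self):
        total_r = 0
        for row in self.counts:
            r = 0
            while r < self.length and row[r] > 0:  # first bit with no hits
                r += 1
            total_r += r
        return self.m / PHI * 2 ** (total_r / self.m)


def count_array_storage_bits(num_group_bys, length, m, int_bits):
    """Storage for all count arrays: |C| * L * m * I."""
    return num_group_bys * length * m * int_bits


if __name__ == "__main__":
    sketch = CountingPCSASketch()
    for x in range(10_000):
        sketch.add(x)
    for x in range(5_000, 10_000):
        sketch.remove(x)
    print(sketch.estimate())
    print(count_array_storage_bits(4, 32, 64, 32))
```

The price of supporting deletions is the factor I in storage: every bit of the plain bitmaps becomes a full integer.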
20
Conclusion
- Three strategies to estimate the blowup
  - Algorithm based on sampling
    - Overestimates the size of the cube.
    - Strongly dependent on the number of duplicates.
  - Algorithm based on assuming the data is uniformly distributed
    - Works well if the data is uniformly distributed.
    - Becomes inaccurate as the skew in the data increases.
    - The analytical estimate was more accurate than the sampling-based estimate for widely varying skew in the data.
21
Conclusion
- Probabilistic counting algorithm
  - Performs very well under various degrees of skew, always giving an estimate with a bounded error.
  - Provides a more reliable, accurate, and predictable estimate than the other algorithms.