Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP (00-03-30, Parallel and Distributed Computing Laboratory; Eun-Jung Lee (이은정), first-semester master's student)


1 Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP
Parallel and Distributed Computing Laboratory, first-semester master's student, Eun-Jung Lee (이은정), eunjung@mm.ewha.ac.kr

2 Contents
- Introduction
- Cubing Relational Tables
- Storage Explosion and the Cube
- Three Strategies
- Comparing the Algorithms
- An Object-Relational ADT
- Conclusions

3 Introduction
- Key demand of OLAP applications: queries must be answered quickly
- Goal of the research: to exploit the structure of the multidimensional model to provide extremely high performance for queries

4 Introduction
- Group-by (in SQL terms): the ability to aggregate simultaneously across many sets of dimensions
- "Cube" operator (by Gray): computes aggregates over all subsets of the dimensions specified in the "cube" operation
- Precomputation of aggregates -> speeds up multidimensional data analysis

5 Introduction
- The "computing the cube" problem
  - to precompute some or all of the cube
  - what and how much to precompute -> difficult to decide
- 3 strategies
  - assuming uniformly distributed data
  - a simple sampling-based algorithm
  - a probabilistic counting algorithm
- Implementing a multidimensional array ADT in Paradise, an object-relational DBMS

6 Cubing Relational Tables
- "Cube" operator (by Gray)
  - an n-dimensional generalization of group-by
  - formalizes simultaneous aggregation and expresses it in SQL:
    CUBE [DISTINCT | ALL] <attribute list> BY <aggregate list>
  - cuboid: each such group-by aggregate
  - base cuboid: the group-by aggregate over all the attributes in the attribute list
  - for n attributes there are $2^n$ cuboids

7 Cubing Relational Tables
- Example
  - CUBE Product, Year, Customer BY SUM(sales)
    -> computes the sales aggregate cuboids on all 8 subsets of the set {Product, Year, Customer}
- Key challenge
  - to understand how the cuboids in this collection are related to each other
  - to exploit these relationships to minimize I/O
    -> explores a class of sorting-based methods; in the experiments they always performed better
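The relationship between cuboids can be illustrated with a small sketch: any cuboid can be computed from a "parent" cuboid that has one more attribute, so only the base cuboid needs the original data. This is a hash-based illustration, not the sorting-based methods the slide refers to; the function names, the sample data, and the "pick the smallest parent" rule are assumptions made for this example.

```python
from collections import defaultdict
from itertools import combinations


def aggregate(parent_rows, parent_dims, keep):
    """Roll a cuboid up to the attributes in `keep` by summing the measure."""
    idx = [parent_dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, total in parent_rows.items():
        out[tuple(key[i] for i in idx)] += total
    return dict(out)


def cube(base, dims):
    """Compute all 2^n cuboids, each from its smallest already-computed parent."""
    cuboids = {tuple(dims): base}
    for size in range(len(dims) - 1, -1, -1):
        for keep in combinations(dims, size):
            # Candidate parents: cuboids with exactly one extra attribute.
            parents = [p for p in cuboids
                       if len(p) == size + 1 and set(keep) <= set(p)]
            parent = min(parents, key=lambda p: len(cuboids[p]))
            cuboids[keep] = aggregate(cuboids[parent], list(parent), keep)
    return cuboids


# Hypothetical base cuboid: (Product, Year, Customer) -> SUM(sales)
base = {
    ("p1", 1999, "c1"): 10,
    ("p1", 2000, "c1"): 5,
    ("p2", 1999, "c2"): 7,
    ("p2", 2000, "c1"): 3,
}
cuboids = cube(base, ["Product", "Year", "Customer"])
```

Choosing the smallest parent is one simple way to read less data per cuboid; the sorting-based methods additionally share sort orders between cuboids, which this sketch does not attempt.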

8 Storage Explosion and the Cube
- Virtually all OLAP products resort to some degree of precomputation of these aggregates
  -> the more that is precomputed, the faster queries are answered
- The problem of estimating how much storage…
  - the full precomputation problem -> the cube framework
- Example: (), (ProductId), (StoreId), (ProductId, StoreId) -> 4 group-bys
  -> (StoreId):
     select StoreId, SUM(Quantity) from sales group by StoreId;
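To make the storage question concrete, the sketch below counts how many tuples each of the four group-bys would hold for a tiny sales table. The rows and the helper name `group_by_sizes` are made up for illustration; only the schema (ProductId, StoreId, Quantity) comes from the slide.

```python
from itertools import combinations


def group_by_sizes(rows, dims):
    """Return {group-by attributes: number of result tuples} for every subset."""
    sizes = {}
    for r in range(len(dims) + 1):
        for subset in combinations(range(len(dims)), r):
            # One result tuple per distinct combination of grouping values.
            groups = {tuple(row[i] for i in subset) for row in rows}
            sizes[tuple(dims[i] for i in subset)] = len(groups)
    return sizes


# Hypothetical sales rows: (ProductId, StoreId, Quantity)
sales = [(1, "A", 5), (1, "B", 2), (2, "A", 4), (2, "A", 1)]
sizes = group_by_sizes(sales, ["ProductId", "StoreId"])
total_tuples = sum(sizes.values())
```

Here the fully precomputed cube holds more tuples than the four-row base table, which is the blowup the slide is concerned with.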

9 Figure (a) (1/3) X: sales

10 Figure (b) (2/3) X: sales

11 Figure (c) (3/3) X: sales

12 Table

13 Storage Explosion and the Cube
- The storage requirement of a cube with hierarchies is much worse than that of a cube without hierarchies
  Ex. Figure (a)
  -> without hierarchies: 34 tuples
  -> with hierarchies: 73 tuples
  -> even for a small database and a small number of dimensions
  -> the sizes of the cubes for the databases are very different
- An example of the range of blowup that can occur
  Ex. Figure (b) vs. Figure (c) -> see the results in the Table

14 Three Strategies (1/3)
- Assuming uniformly distributed data
  -> if r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is $n - n(1 - 1/n)^r$; this lets us estimate the size of the group-by on any subset of the attributes
  -> cube size estimate
     $h_i$: size of the hierarchy on dimension i; k: number of dimensions
     total number of group-bys: $\prod_{i=1}^{k} (h_i + 1)$
  -> overestimates the size of the cube and requires the count of distinct values, but is simple and fast
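A minimal sketch of this estimate, assuming no hierarchies and using the standard result that r uniform random draws (with replacement) from n values yield n - n(1 - 1/n)^r distinct values in expectation. For a group-by, n is taken as the product of the distinct-value counts of its attributes and r as the number of rows. The function names and the input figures are illustrative assumptions.

```python
from itertools import combinations
from math import prod


def expected_distinct(n, r):
    """Expected number of distinct values after r uniform draws from n values."""
    return n - n * (1 - 1 / n) ** r


def estimate_cube_size(cardinalities, num_rows):
    """Sum the expected sizes of all 2^k group-bys (no hierarchies assumed)."""
    total = 0.0
    for size in range(len(cardinalities) + 1):
        for subset in combinations(cardinalities, size):
            # The number of possible groups is the product of the cardinalities;
            # prod(()) is 1, which covers the empty group-by ().
            total += expected_distinct(prod(subset), num_rows)
    return total


# Hypothetical input: two attributes with 2 distinct values each, 2 rows.
estimate = estimate_cube_size([2, 2], num_rows=2)
```

Only the distinct-value count of each attribute and the row count are needed, which is why the approach is fast; the price is that real data is rarely uniform.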

15 Three Strategies (2/3)
- Simple sampling-based algorithm
  -> take a random subset of the database and compute the cube on that subset
  -> scale the result by the ratio of the data size to the sample size
  -> |D|: the size of the database; |s|: the sample size
  -> CUBE(s): the size of the cube computed on the sample s
  -> the size of the cube on the entire database D is approximated by $\mathrm{CUBE}(s) \cdot |D| / |s|$
  -> this simple biased estimator produces surprisingly good estimates
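The sampling estimator can be sketched as follows: compute the cube size on a random sample and scale it by |D| / |s|. The helper names, the fixed seed, and the synthetic rows are assumptions for this illustration, and the cube size is computed by brute force, which is only reasonable for a small sample.

```python
import random
from itertools import combinations


def cube_size(rows, num_dims):
    """Total number of tuples over all 2^k group-bys of the first num_dims columns."""
    total = 0
    for r in range(num_dims + 1):
        for subset in combinations(range(num_dims), r):
            total += len({tuple(row[i] for i in subset) for row in rows})
    return total


def scale_up(sample_cube_size, db_size, sample_size):
    """Linear scaling: CUBE(s) * |D| / |s|."""
    return sample_cube_size * db_size / sample_size


def estimate_cube_size_by_sampling(rows, num_dims, sample_size, seed=0):
    """Estimate the cube size of `rows` from a random sample without replacement."""
    sample = random.Random(seed).sample(rows, sample_size)
    return scale_up(cube_size(sample, num_dims), len(rows), sample_size)


# Hypothetical two-dimensional data set with 100 rows.
rows = [(i % 5, i % 7) for i in range(100)]
estimate = estimate_cube_size_by_sampling(rows, num_dims=2, sample_size=20)
```

Linear scaling ignores the fact that a larger database mostly adds duplicates of groups already seen, which is one way to see why the estimator is biased.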

16 Three Strategies (3/3)
- Probabilistic counting algorithm
  -> by Flajolet & Martin
  -> counts the number of distinct elements in a multiset
  -> by estimating the number of distinct elements in a particular grouping of the data, we can estimate the number of tuples in that grouping
  -> needs a single pass through the database and only a fixed amount of additional storage
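A rough sketch of Flajolet-Martin style probabilistic counting with stochastic averaging over several bitmaps. The hash function (truncated SHA-1 of `repr`), the number of bitmaps, and the function names are choices made for this example; the constant 0.77351 is the correction factor from Flajolet and Martin's analysis. The estimate is known to be unreliable when there are only a few distinct values per bitmap.

```python
import hashlib

PHI = 0.77351  # correction constant from Flajolet and Martin's analysis


def _hash64(value):
    """Deterministic 64-bit hash of any value with a stable repr()."""
    digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def estimate_distinct(items, num_bitmaps=64):
    """Estimate the number of distinct items in one pass with fixed storage."""
    bitmaps = [0] * num_bitmaps
    for item in items:
        h = _hash64(item)
        bucket, rest = h % num_bitmaps, h // num_bitmaps
        # Position of the lowest 1-bit; bit p is reached with probability 2^-(p+1).
        rho = (rest & -rest).bit_length() - 1 if rest else 58
        bitmaps[bucket] |= 1 << rho
    total_r = 0
    for bitmap in bitmaps:
        r = 0
        while bitmap & (1 << r):  # index of the lowest 0-bit in this bitmap
            r += 1
        total_r += r
    return num_bitmaps / PHI * 2 ** (total_r / num_bitmaps)


# Estimated size of the group-by (ProductId, StoreId) on hypothetical rows:
rows = [(i % 300, i % 70, 1) for i in range(100000)]
group_by_estimate = estimate_distinct((p, s) for p, s, _ in rows)
```

Because each duplicate sets a bit that is already set, repeated values do not change the estimate, and the storage stays at `num_bitmaps` integers no matter how large the input is.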

17 Comparing the Algorithms
- Sampling-based algorithm
  -> depends on the number of duplicates present in the database
- Assuming the data is uniformly distributed
  -> as the skew in the data increases, the estimate becomes inaccurate
- Probabilistic counting algorithm
  -> performs very well under various degrees of skew
  -> gives a more reliable, accurate, and predictable estimate
- For a reasonably quick and accurate estimate of the size of the cube -> probabilistic counting algorithm

18 An Object-Relational ADT
- Storage structure
  (1) relational table, e.g. (I, J, K, D)
  (2) array, stored in row-major or column-major order
- The advantage of MOLAP
  -> dense arrays are stored more compactly in an array
  -> an array lookup is a simple arithmetic operation
- The advantage of ROLAP
  -> sparse data sets are stored more compactly in tables
  -> gains everything a standard SQL database provides (scalability to very large data sets)
Array index: integer
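The "simple arithmetic" behind an array lookup can be sketched for row-major order: the position of cell (i, j, k) in a flat array of shape (I, J, K) is (i * J + j) * K + k. The function names and sample rows below are hypothetical, and integer dimension indices starting at 0 are assumed.

```python
def row_major_offset(index, shape):
    """Flat position of a multidimensional index in row-major order."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for dimension of size {n}")
        offset = offset * n + i
    return offset


def table_to_array(rows, shape, empty=0):
    """Load (i, j, k, ..., d) rows into a flat row-major array of measures."""
    size = 1
    for n in shape:
        size *= n
    cells = [empty] * size
    for *index, value in rows:
        cells[row_major_offset(index, shape)] = value
    return cells


# Hypothetical (I, J, K, D) rows loaded into a 2 x 3 x 4 array.
rows = [(0, 0, 0, 7), (1, 2, 3, 9)]
cells = table_to_array(rows, shape=(2, 3, 4))
```

The table stores four values per non-empty cell, while the array stores one value per cell whether it is empty or not, so the array wins when most cells are occupied and the table wins when the data is sparse.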

19 Using Paradise
- To implement an array ADT (MOLAP style)
- To implement bitmap indices
- To propose a query evaluation algorithm
  - an example of a high-performance ROLAP system
- Both run on the same code base -> reduces the number of factors that differ
  Ex. the same concurrency control and recovery system

20 Conclusions
- Consider the problem of computing the "cube" over data stored in arrays rather than in tables
- Start with a data set in a table -> convert it to an array -> "cube" the array -> store the result back in tables
  -> faster than cubing the table directly -> very efficient
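The table -> array -> cube -> table pipeline can be sketched for two dimensions. This toy version reserves one extra slot per dimension for the ALL totals and assumes strictly positive measures, so that a zero cell can be treated as empty; it illustrates the data flow only and is not the algorithm evaluated in the talk. All names and data are hypothetical.

```python
ALL = "ALL"


def cube_via_array(rows, shape):
    """Cube (i, j, d) rows through a dense array and return the result as rows."""
    size_i, size_j = shape
    # 1. Table -> array, with one extra slot per dimension for the ALL totals.
    arr = [[0] * (size_j + 1) for _ in range(size_i + 1)]
    for i, j, d in rows:
        arr[i][j] += d
    # 2. Cube the array: aggregate along j, then along i (including the ALL column).
    for i in range(size_i):
        arr[i][size_j] = sum(arr[i][:size_j])
    for j in range(size_j + 1):
        arr[size_i][j] = sum(arr[i][j] for i in range(size_i))
    # 3. Array -> table, skipping empty cells.
    result = []
    for i in range(size_i + 1):
        for j in range(size_j + 1):
            if arr[i][j] != 0:
                result.append((ALL if i == size_i else i,
                               ALL if j == size_j else j,
                               arr[i][j]))
    return result


# Hypothetical (I, J, D) table with I and J ranging over {0, 1}.
cube_rows = cube_via_array([(0, 0, 5), (0, 1, 2), (1, 0, 4)], shape=(2, 2))
```

Every aggregate here is computed by positional arithmetic on the array, with no sorting or hashing of key values, which hints at why cubing the array can beat cubing the table directly.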

