Published by Roderick Bond. Modified over 8 years ago.
1
00-03-30, Parallel and Distributed Computing Lab
Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP
Parallel and Distributed Computing Lab, master's program, 1st semester: 이은정 (eunjung@mm.ewha.ac.kr)
2
Contents
- Introduction
- Cubing Relational Tables
- Storage Explosion and the Cube
- Three Strategies
- Comparing the Algorithms
- An Object-Relational ADT
- Conclusions
3
Introduction
- Key demand of OLAP applications: queries must be answered quickly
- Goal of the research: exploit the structure of the multidimensional model to provide extremely high performance for queries
4
Introduction
- Group-by (in SQL terms): the ability to aggregate simultaneously across many sets of dimensions
- "Cube" operator (by Gray): computes aggregates over all subsets of the dimensions specified in the "cube" operation
- Precomputation of aggregates speeds up multidimensional data analysis
5
Introduction
- The "computing the cube" problem
  - precompute some or all of the cube
  - deciding what and how much to precompute is difficult
- Three strategies
  - assumption of uniformly distributed data
  - simple sampling-based algorithm
  - probabilistic counting algorithm
- Implementing a multidimensional array ADT in Paradise, an object-relational DBMS
6
Cubing Relational Tables
- "Cube" operator (by Gray)
  - a generalization of group-by to n dimensions
  - formalizes simultaneous aggregation and expresses it in SQL: CUBE [DISTINCT | ALL] BY
  - cuboid: each such group-by aggregate
  - base cuboid: the group-by aggregate over all the attributes
  - for n attributes there are $2^n$ cuboids
7
Cubing Relational Tables
- Example: CUBE Product, Year, Customer BY SUM(sales)
  - computes the sales aggregate cuboids on all 8 subsets of the set {Product, Year, Customer}
- Key challenge
  - understand how the cuboids in this collection are related to each other
  - exploit these relationships to minimize I/O
  - approach: explore a class of sorting-based methods (in the experiments: always perform better)
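The CUBE example can be illustrated with a minimal Python sketch. It is a naive version that rescans the rows once per cuboid, so it is not the sorting-based method the slide refers to; the function name `cube` and the sample rows are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def cube(rows, dims, measure):
    """Naive cube: one SUM group-by per subset of dims (rescans rows each time)."""
    result = {}
    for size in range(len(dims) + 1):
        for attrs in combinations(dims, size):
            cuboid = defaultdict(int)
            for row in rows:
                key = tuple(row[a] for a in attrs)
                cuboid[key] += row[measure]
            result[attrs] = dict(cuboid)
    return result


# Hypothetical sample data, not taken from the slides.
rows = [
    {"Product": "pen", "Year": 1999, "Customer": "A", "sales": 10},
    {"Product": "pen", "Year": 2000, "Customer": "B", "sales": 5},
    {"Product": "ink", "Year": 2000, "Customer": "A", "sales": 7},
]
cuboids = cube(rows, ("Product", "Year", "Customer"), "sales")
print(len(cuboids))            # 2**3 = 8 cuboids
print(cuboids[()])             # the empty group-by: grand total
print(cuboids[("Product",)])   # sales per product
```

The key `("Product", "Year", "Customer")` holds the base cuboid; every other cuboid could in principle be derived from it instead of from the raw rows, which is the relationship the key challenge asks to exploit.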
8
Storage Explosion and the Cube
- Virtually all OLAP products resort to some degree of precomputation of these aggregates
  - the more precomputation is done, the faster queries are answered
- The problem of estimating how much storage…
  - the full precomputation problem in the cube framework
- Example: 4 group-bys: (), (ProductId), (StoreId), (ProductId, StoreId)
  - (StoreId): select StoreId, SUM(Quantity) from sales group by StoreId;
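As a rough illustration of the storage question, the four group-bys can be run against an in-memory SQLite table and their result rows counted. This is only a sketch: the table contents are hypothetical, and the three queries other than the (StoreId) one are written by analogy with it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "create table sales (ProductId integer, StoreId integer, Quantity integer)"
)
# Hypothetical rows, not taken from the slides.
conn.executemany(
    "insert into sales values (?, ?, ?)",
    [(1, 1, 3), (1, 2, 4), (2, 1, 1), (2, 1, 6)],
)

group_bys = {
    "()": "select SUM(Quantity) from sales",
    "(ProductId)": "select ProductId, SUM(Quantity) from sales group by ProductId",
    "(StoreId)": "select StoreId, SUM(Quantity) from sales group by StoreId",
    "(ProductId, StoreId)": (
        "select ProductId, StoreId, SUM(Quantity) from sales "
        "group by ProductId, StoreId"
    ),
}
results = {name: conn.execute(sql).fetchall() for name, sql in group_bys.items()}

# Storage needed for full precomputation = total tuples over all group-bys.
stored_tuples = sum(len(rows) for rows in results.values())
for name, rows in results.items():
    print(name, len(rows))
print("total tuples:", stored_tuples)
```

Here four base rows already produce more precomputed tuples than the table itself holds, which is the effect the slide calls storage explosion.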
9
Figure (a) (1/3). X: sales
10
Figure (b) (2/3). X: sales
11
Figure (c) (3/3). X: sales
12
Table
13
Storage Explosion and the Cube
- The storage requirement of a cube with hierarchies is much worse than that of a cube without hierarchies
  - Ex. Figure (a): 34 tuples without hierarchies, 73 tuples with hierarchies
  - even for a small database and a small number of dimensions, the sizes of the cubes are very different
- Example of the range of blowup that can occur
  - Ex. Figure (b) vs. Figure (c): results in the Table
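The effect of a hierarchy on cube size can be sketched in Python. The data below is hypothetical and does not reproduce the 34 and 73 tuples of Figure (a); the function `cube_size` and the StoreId → City hierarchy are assumptions for illustration. Each dimension contributes either one of its hierarchy levels or nothing (aggregated away) to a group-by.

```python
from itertools import product


def cube_size(rows, dim_levels):
    """Total tuples over all group-bys.

    dim_levels holds, per dimension, the column names of its hierarchy levels.
    A group-by picks one level per dimension, or None to aggregate it away.
    """
    choices = [levels + [None] for levels in dim_levels]
    total = 0
    for combo in product(*choices):
        cols = [c for c in combo if c is not None]
        total += len({tuple(row[c] for c in cols) for row in rows})
    return total


# Hypothetical rows; City is an assumed coarser level above StoreId.
rows = [
    {"ProductId": 1, "StoreId": 1, "City": "Seoul"},
    {"ProductId": 1, "StoreId": 2, "City": "Seoul"},
    {"ProductId": 2, "StoreId": 3, "City": "Busan"},
]
flat = cube_size(rows, [["ProductId"], ["StoreId"]])
hier = cube_size(rows, [["ProductId"], ["StoreId", "City"]])
print(flat, hier)  # the hierarchical cube adds the City-level group-bys
```

Adding one level to one dimension raises the number of group-bys from 4 to 6, and every extra group-by adds its own tuples to the stored cube.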
14
Three Strategies (1/3)
- Assumption of uniformly distributed data
  - if r elements are chosen uniformly and at random from a set of n elements, the expected number of distinct elements obtained is $n - n\left(1 - \frac{1}{n}\right)^{r}$
  - that is, the size of the group-by on any subset of the attributes can be estimated
  - the cube size estimate uses the size of the hierarchy on dimension i, the number of dimensions k, and the total number of group-bys
  - overestimates the size of the cube and requires counts of distinct values, but is simple and fast
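A small Python sketch of the uniform-distribution estimate follows. The expected-distinct formula is the standard result for drawing r times with replacement from n values. The helper names are hypothetical, and the count of group-bys assumes that "size of a hierarchy" means the number of levels h_i on dimension i, so that each dimension offers h_i + 1 choices (one level, or aggregated away).

```python
import math


def expected_distinct(n, r):
    """Expected number of distinct values when drawing r times from n values."""
    return n - n * (1 - 1 / n) ** r


def estimate_group_by_size(cardinalities, num_tuples):
    """Uniform-assumption size of a group-by over attributes with the given
    distinct-value counts, for a table of num_tuples rows."""
    n = math.prod(cardinalities)  # number of possible key combinations
    return expected_distinct(n, num_tuples)


def num_group_bys(hierarchy_sizes):
    """Assumed reading: dimension i with h_i levels offers h_i + 1 choices."""
    return math.prod(h + 1 for h in hierarchy_sizes)


print(estimate_group_by_size([10, 10], 50))  # at most min(100, 50)
print(num_group_bys([1, 1, 1]))              # no hierarchies: 2**3 group-bys
```

The estimate needs only the distinct-value count of each attribute, which is why the slide describes it as simple and fast.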
15
Three Strategies (2/3)
- Simple sampling-based algorithm
  - take a random subset of the database and compute the cube on that subset
  - scale the result by the ratio of the data size to the sample size
  - |D|: the size of the database; |s|: the sample size; CUBE(s): the size of the cube computed on s
  - the size of the cube on the entire database D is approximated by $\mathrm{CUBE}(s) \cdot \frac{|D|}{|s|}$
  - this simple biased estimator produces surprisingly good estimates
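The sampling-based estimate can be sketched as follows. This assumes rows are tuples of dimension values and that cube size means the total number of tuples over all group-bys; the function names, the random test data, and the fixed seeds are illustrative choices, not part of the original slides.

```python
import random
from itertools import combinations


def cube_size(rows, num_dims):
    """Total number of tuples over all 2**num_dims group-bys."""
    total = 0
    for size in range(num_dims + 1):
        for cols in combinations(range(num_dims), size):
            total += len({tuple(row[c] for c in cols) for row in rows})
    return total


def estimate_cube_size(rows, num_dims, sample_size, seed=0):
    """Scale the cube size of a random sample s by |D| / |s|."""
    rng = random.Random(seed)
    sample = rng.sample(rows, sample_size)
    return cube_size(sample, num_dims) * len(rows) / sample_size


# Hypothetical database: 2000 rows over three dimensions of 20 values each.
rng = random.Random(1)
database = [
    (rng.randrange(20), rng.randrange(20), rng.randrange(20))
    for _ in range(2000)
]
estimate = estimate_cube_size(database, 3, sample_size=200)
exact = cube_size(database, 3)
print(estimate, exact)
```

Because duplicates found in the sample are scaled linearly, how far the estimate drifts from the exact size depends on how many duplicates the data contains.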
16
Three Strategies (3/3)
- Probabilistic counting algorithm (by Flajolet & Martin)
  - counts the number of distinct elements in a multiset
  - by estimating the number of distinct elements in a particular grouping of the data, we can estimate the number of tuples in that grouping
  - needs a single pass through the database, using only a fixed amount of additional storage
17
Comparing the Algorithms
- Sampling-based algorithm
  - depends on the number of duplicates present in the database
- Assuming the data is uniformly distributed
  - as the skew in the data increases, the estimate becomes inaccurate
- Probabilistic counting algorithm
  - performs very well under various degrees of skew
  - gives a more reliable, accurate, and predictable estimate
- For a reasonably quick and accurate estimate of the size of the cube: the probabilistic counting algorithm
18
An Object-Relational ADT
- Storage structures
  - (1) relational table, ex. (I, J, K, D)
  - (2) store the data in an array, in row-major or column-major order
  - array index: integer
- Advantages of MOLAP
  - dense data is stored more compactly in arrays
  - an array lookup is a simple arithmetic operation
- Advantages of ROLAP
  - sparse data sets are stored more compactly in tables
  - gets everything a standard SQL database brings (scalability to very large data sets)
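The claim that an array lookup is simple arithmetic can be made concrete with a row-major offset computation. This is a sketch only: the array shape, the sample (I, J, K, D) rows, and the function name are assumptions.

```python
def row_major_offset(index, shape):
    """Flat position of a cell in a row-major array of the given shape."""
    offset = 0
    for i, n in zip(index, shape):
        if not 0 <= i < n:
            raise IndexError(f"index {i} out of range for extent {n}")
        offset = offset * n + i
    return offset


shape = (2, 3, 4)             # extents of dimensions I, J, K
cells = [0.0] * (2 * 3 * 4)   # dense storage: one slot per possible cell

# Hypothetical relational rows (I, J, K, D), loaded into the array.
table = [(0, 1, 2, 5.0), (1, 2, 3, 7.5)]
for i, j, k, d in table:
    cells[row_major_offset((i, j, k), shape)] = d

print(row_major_offset((0, 1, 2), shape))        # (0*3 + 1)*4 + 2 = 6
print(cells[row_major_offset((1, 2, 3), shape)])
```

The table stores three index columns per value but only for cells that exist, while the array stores no indices but reserves a slot for every cell; that is the dense versus sparse trade-off listed above.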
19
Using Paradise
- To implement an array ADT (MOLAP style)
- To implement bit-map indices
- To propose a "query evaluation algorithm": an example of a high-performance ROLAP system
- Running both on the same code reduces the number of factors that affect performance
  - Ex. same concurrency control and recovery system
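For readers unfamiliar with bit-map indices, here is a generic Python sketch of the idea. It is not Paradise's implementation; the column data and function names are hypothetical, and Python integers stand in for bit vectors.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose bit i is set
    when row i holds that value."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Row ids whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical columns of a four-row table.
product = ["pen", "ink", "pen", "pad"]
year = [1999, 2000, 2000, 2000]

product_index = build_bitmap_index(product)
year_index = build_bitmap_index(year)

# Rows where Product = 'pen' AND Year = 2000: one bitwise AND.
hits = matching_rows(product_index["pen"] & year_index[2000])
print(hits)  # [2]
```

Selections on several dimensions reduce to bitwise operations on the per-value bitmaps before any row is fetched.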
20
Conclusions
- Considered the problem of computing the "cube" over data stored in arrays rather than in tables
- Starting with a data set in a table: convert it to an array, "cube" the array, and store the result back to tables
  - faster than cubing the table directly
  - very efficient
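The table → array → cube → table pipeline can be sketched for the two-dimensional case. This is a toy illustration under stated assumptions: rows are (I, J, D) with small integer indices, the measure is summed, zero aggregates are treated as empty cells, and all function names are hypothetical. It does not measure or reproduce the performance claim.

```python
def table_to_array(rows, shape):
    """Load (I, J, D) rows into a dense 2-D array, summing duplicates."""
    n_i, n_j = shape
    array = [[0] * n_j for _ in range(n_i)]
    for i, j, d in rows:
        array[i][j] += d
    return array


def cube_array(array):
    """Aggregate the array along each dimension and overall."""
    by_i = [sum(row) for row in array]
    by_j = [sum(col) for col in zip(*array)]
    return by_i, by_j, sum(by_i)


def cube_to_tables(array, by_i, by_j, total):
    """Store each cuboid back as relational rows (assumes nonzero measures,
    so a zero aggregate means the cell was empty)."""
    return {
        ("I", "J"): [
            (i, j, v) for i, row in enumerate(array)
            for j, v in enumerate(row) if v
        ],
        ("I",): [(i, v) for i, v in enumerate(by_i) if v],
        ("J",): [(j, v) for j, v in enumerate(by_j) if v],
        (): [(total,)],
    }


# Hypothetical input table.
rows = [(0, 0, 3), (0, 2, 4), (1, 0, 1), (1, 0, 6)]
array = table_to_array(rows, (2, 3))
tables = cube_to_tables(array, *cube_array(array))
print(tables[("J",)])
```

Once the data sits in the array, every aggregate is a sweep over contiguous cells addressed by position, with no key comparisons or sorting.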