Efficient Methods for Data Cube Computation and Data Generalization

Name: Efficient Methods for Data Cube Computation and Data Generalization
Uploaded: 2017-07-11T04:53:40+00:00
Duration: PTM21S16
Channel: Reynold Barber
Description: Efficient Methods for Data Cube Computation and Data Generalization

Efficient Methods for Data Cube Computation and Data Generalization
Chapter 4 (4.1) April 23, 2017

Data Generalization It is a process of abstracting conceptual level knowledge from large set of task-relevant data. Two types of analysis : Descriptive data mining: Describes data in a concise manner, highlighting interesting general properties. Supports interest. Predictive data mining: constructs a model and attempts to predict behavior of new data. ( classification, regression….) April 23, 2017

A Data Cube (MOLAP) Fast on-line analytical processing takes minimum time if aggregates for all the cuboids are precomputed. Pre-computation of the full cube requires excessive amount of memory and depends on number of dimensions and cardinality of dimensions. For many cells in a cuboid the measure value is zero and cells are of little or no interest. Cuboids are often sparse. April 23, 2017

Partial Materialization
Precomputation of some of the cuboids in advance leads to fast response time and avoids redundant computations during on-line analytical processing. Data cube materialization/ pre-computation No materialization: Don’t precompute any of the non-base cuboid. Leads to multidimensional aggregation on the fly and is slow. Full materialization: Precompute all the cubes. Running queries will be very fast. Requires huge memory. Partial Materialization: Selectively compute a proper subset of the cuboids, which contains only those cells that satisfy some user specified criterion. April 23, 2017

Outline Types of cells : Base cell, aggregate cell, cell relationship
Types of Cubes : Full cube, Iceberg Cube, Closed Cube, Shell Cube Efficient Computation of Data Cubes Multiway Array Aggregation BUC Star Cubing April 23, 2017

A Data Cube: sales Product 10 11 12 3 1 9 6 7 8 5 47 48 44 Branch 13 8
Branch cuboid Base cuboid I1 I2 I3 I4 I5 I6 All 10 11 12 3 1 9 6 7 8 5 47 48 44 New York Branch Chicago Toronto Vancouver 13 8 10 5 6 3 45 All 46 37 36 22 29 14 184 Product cuboid Aggregate cell Base cell Apex Cuboid April 23, 2017

- Types of cells Types of cells
Base cell: a cell which belongs to a base cuboid Aggregate cell: a cell which belongs to a non-base cuboid Each aggregate dimension is indicated by a “*” Ancestor-descendent relationship between cells: dimensions are (branch, product, year) 1-D cell c1 = (New York, *, *, 2000) is an ancestor of a 2-D cell c2 = (New York, I1, *, 400) and a 3-D cell c3 = (New York, I1, 2013, 111). c3 is a descendent of c1 and c2; In an n-D data cube an i-D cell a=(a1,a2,…an,measure_a) is an ancestor of a j-D cell b=(b1,b2,…bn,measure_b) if 1) i<j and 2) for 1≤m≤n am=bm whenever am 3) if j=i+1 a is called parent of b or b is a child of a April 23, 2017

- Types of cubes Full cube: All cells and cuboids are materialized. All possible combination of dimensions and values. or Iceberg cube: Partial materialization. Materializing only the cells in a cuboid whose measure value is above the minimum threshold. count(*) >= min support Iceberg Condition Closed cube: No ancestor cell is created if its measure is equal to that of its descendent cell. Shell cube: Only cuboids with limited number of dimensions are created. April 23, 2017

Two base cells {(a1,a2,….a100):10, (a1,a2,b1,…b100):10}
How many sub-patterns for first base cell Total number of aggregate cells is Ignore all of the aggregate cells that can be obtained by replacing some constants by “*” while keeping the same measure value. Only 3 really offer new information. {(a1,a2,….a100):10, (a1,a2,b1,…b100):10, (a1,a2,*…,*):20} April 23, 2017

can be derived from the closed cell
Example Which are the closed cells? Similarly we can also get can be derived from the closed cell April 23, 2017

Iceberg Cube, Closed Cube & Cube Shell
Is iceberg cube good enough? 2 base cells: {(a1, a2, a , a100):10, (a1, a2, b3, , b100):10} How many cells will the iceberg cube have if having count(*) >= 10? Hint: A huge but tricky number! Close cube: Closed cell c: if there exists no cell d, s.t. d is a descendant of c, and d has the same measure value as c. Closed cube: a cube consisting of only closed cells What is the closed cube of the above base cuboid? Hint: only 3 cells Cube Shell Precompute only the cuboids involving a small # of dimensions, e.g., 3 More dimension combinations will need to be computed on the fly If the two base cells were: {(a1, a2, a , a100):10, (b1, b2, b3, , b100):10}, the total # of non-base cells should be 2 * 2^{100} – 3. But for {(a1, a2, a , a100):10, (a1, a2, b3, , b100):10}, the total # of non-base cells should be 2 * 2^{100} – 6. For (A1, A2, … A10), how many combinations to compute?

Outline Types of cells Types of Cubes
Efficient Computation of Data Cubes Multiway Array Aggregation BUC Star Cubing April 23, 2017

- Efficient Computation of Data Cubes
Preliminary cube computation tricks Computing full/iceberg cubes: 2 methodologies Top-Down: Multi-Way array aggregation Bottom-Up: Bottom-up computation: BUC Star-Cubing: Integrates top-down and bottom-up April 23, 2017

-- Preliminary Cube Computation Tricks
Sorting, hashing, and grouping operations are applied to the dimension attributes in order to reorder and cluster related tuples. (ROLAP) Aggregates may be computed from previously computed aggregates, rather than from the base fact table Cache-results: accumulating results of already computed cuboid to reduce disk I/Os. Higher-level aggregates are computed from lower-level aggregates rather than base facts. Smallest-child: computing a cuboid from the smallest, previously computed cuboid. Cbranch C{ branch, year}, C{branch, item} Amortize-scans: computing as many as possible cuboids at the same time to amortize disk reads Share-sorts: sharing sorting costs cross multiple cuboids when sort-based method is used Share-partitions: sharing the partitioning cost across multiple cuboids when hash-based algorithms are used April 23, 2017

-- Multi-Way Array Aggregation …
Used for MOLAP and full cube computation Array-based “bottom-up” algorithm Using multi-dimensional chunks Simultaneous aggregation on multiple dimensions Intermediate aggregate values are re-used for computing ancestor cuboids Cannot do Apriori pruning: No iceberg optimization April 23, 2017

… -- Multi-way Array Aggregation …
Partition arrays into chunks (a small subcube which fits in memory). Compressed sparse array addressing: (chunk_id, offset) Compute aggregates in “multiway” by visiting cube cells in the order which minimizes the # of times to visit each cell, and reduces memory access and storage cost. A B 29 30 31 32 1 2 3 4 5 9 13 14 15 16 64 63 62 61 48 47 46 45 a1 a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2 a3 C 44 28 56 40 24 52 36 20 60 What is the best traversing order to do multi-way aggregation? April 23, 2017

B 29 30 31 32 1 2 3 4 5 9 13 14 15 16 64 63 62 61 48 47 46 45 a1 a0 c3 c2 c1 c 0 b3 b2 b1 b0 a2 a3 C 44 28 56 40 24 52 36 20 60 B April 23, 2017

C c3 61 62 63 64 c2 45 46 47 48 c1 29 30 31 32 c 0 B b3 13 14 15 16 60 44 B 28 b2 9 56 40 24 b1 5 52 36 20 b0 1 2 3 4 a0 a1 a2 a3 A AB requires longest scan, i.e scanning of 49th chunk April 23, 2017

Assume the sizes of dimension, A, B, and C are 40, 400, 4000 respectively. Therefore AB is the smallest and AC is the largest 2-D planes If chunks are scanned as 1, 2, 3, … then 156,000 memory units are needed (40*400+40* *1000) If chunks are scanned as 1, 17, 33, 49, 5, 21,37 …then 1,641,000 memory units are needed (aggregation ordering AB-AC-BC). Chunk memory units needed are (400* * *10*100) April 23, 2017

All A B C AB AC BC ABC Needs 156,000 Memory units Needs 1,641,000 Memory units April 23, 2017

… -- Multi-way Array Aggregation
Method: the planes should be sorted and computed according to their size in ascending order Idea: keep the smallest plane in the main memory, fetch and compute only one chunk at a time for the largest plane Limitation of the method: computing well only for a small number of dimensions If there are a large number of dimensions, “top-down” computation and iceberg cube computation methods can be explored April 23, 2017

-- Bottom-Up Computation (BUC) …
Bottom-up cube computation (Note: top-down in our view!) Divides dimensions into partitions and facilitates iceberg pruning If a partition does not satisfy min_sup, its descendants can be pruned If minsup = 1 Þ compute full CUBE! No simultaneous aggregation April 23, 2017

BUC: Partitioning Usually, entire data set can’t fit in main memory Sort distinct values partition into blocks that fit Continue processing Optimizations Partitioning External Sorting, Hashing, Counting Sort Ordering dimensions to encourage pruning Cardinality, Skew, Correlation Higher the cardinality-smaller the partitions-greater pruning opportunity Collapsing duplicates Can’t do holistic aggregates anymore! Ideally the dimension with most discriminative, higher cardinality and having less skew is processed first. 23

--- BUC: Example (Having count(*) > 5) …
Toronto 3 1 1 New York 2 5 1 8 9 1 I1 8 I2 I3 Q1 Q2 Q3 Q4 New-York Toronto 3 1 2 8 11 I1 5 1 8 9 I1 I2 I2 I3 I3 Q1 Q2 Q3 Q4 Q1 Q2 Q3 Q4 April 23, 2017

… --- BUC: Example (Having count(*) > 5)
All 77 All 1 Q 7 5 2 B P Q 20 5 23 29 Q1 Q2 Q3 Q4 6 3 4 P,B Q,B Q,P Q,P I1 8 2 1 3 9 10 12 16 I2 B,P,Q I3 Q1 Q2 Q3 Q4 April 23, 2017

Till Now Facilitates a-priori pruning.
During partitioning, each partition’s count is compared with min sup. The recursion stops if the count does not satisfy min sup. Aggregates simultaneously on multiple dimensions. Multiple cuboids can be computed simultaneously in one pass. Dynamic structure with simultaneous aggregation. April 23, 2017

Summary Data Cube Materialization Data Cube Computation Methods
Full Materialization Partial Materialization: iceberg cubes, shell fragments Data Cube Computation Methods Multiway array aggregation BUC for computing iceberg cubes Next Class Star Cubing Shell Fragments for Fast High-Dimensional OLAP Exploration and Discovery in Multidimensional Databases April 23, 2017

Efficient Methods for Data Cube Computation and Data Generalization

Similar presentations

Presentation on theme: "Efficient Methods for Data Cube Computation and Data Generalization"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Efficient Methods for Data Cube Computation and Data Generalization

Similar presentations

Presentation on theme: "Efficient Methods for Data Cube Computation and Data Generalization"— Presentation transcript:

Similar presentations

About project

Feedback