Efficient Methods for Data Cube Computation and Data Generalization

Presentation transcript:

Efficient Methods for Data Cube Computation and Data Generalization Chapter 4 (4.1) April 23, 2017

Data Generalization
A process of abstracting concept-level knowledge from a large set of task-relevant data.
Two types of analysis:
  Descriptive data mining: describes data concisely, highlighting interesting general properties. Supports interest.
  Predictive data mining: constructs a model and attempts to predict the behavior of new data (classification, regression, ...).

A Data Cube (MOLAP)
On-line analytical processing is fastest when the aggregates for all cuboids are precomputed.
Precomputing the full cube requires an excessive amount of storage; how much depends on the number of dimensions and on the cardinality of each dimension.
For many cells in a cuboid the measure value is zero, and such cells are of little or no interest. Cuboids are often sparse.
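
To make the growth concrete, here is a minimal sketch that counts cuboids and bounds the number of cells in a full cube. It assumes dimensions without concept hierarchies, so n dimensions give 2^n cuboids and each dimension contributes its cardinality plus one extra value for "*". The function names and the sample cardinalities are illustrative, not taken from the slides.

```python
from math import prod

def num_cuboids(n_dims: int) -> int:
    """Number of cuboids (group-bys) for n dimensions without hierarchies."""
    return 2 ** n_dims

def max_cells(cardinalities) -> int:
    """Upper bound on cells in the full cube: each dimension may take
    any of its values or the aggregate marker '*'."""
    return prod(c + 1 for c in cardinalities)

# Illustrative 2-D case: 4 branches and 6 products.
print(num_cuboids(2))
print(max_cells([4, 6]))
```

The bound is reached only when every combination of values occurs in the data; sparse cubes contain far fewer non-empty cells.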

Partial Materialization
Precomputing some of the cuboids in advance gives fast response times and avoids redundant computation during on-line analytical processing.
Data cube materialization / pre-computation:
  No materialization: don’t precompute any of the non-base cuboids. Leads to multidimensional aggregation on the fly, which is slow.
  Full materialization: precompute all the cuboids. Queries run very fast, but the storage required is huge.
  Partial materialization: selectively compute a proper subset of the cuboids, or a subset of the cube that contains only those cells that satisfy a user-specified criterion.

Outline
Types of cells: base cell, aggregate cell, cell relationship
Types of cubes: full cube, iceberg cube, closed cube, shell cube
Efficient Computation of Data Cubes
  Multiway Array Aggregation
  BUC
  Star Cubing

A Data Cube: sales
[Figure: 2-D sales cube with dimensions Product (I1 to I6, All) and Branch (New York, Chicago, Toronto, Vancouver, All). Labels mark the base cuboid, the Branch cuboid, the Product cuboid, and the apex cuboid, and point out a base cell and an aggregate cell.]

- Types of cells
Base cell: a cell that belongs to the base cuboid.
Aggregate cell: a cell that belongs to a non-base cuboid. Each aggregated dimension is indicated by a “*”.
Ancestor-descendant relationship between cells, with dimensions (branch, product, year) and the measure as the last entry:
  The 1-D cell c1 = (New York, *, *, 2000) is an ancestor of the 2-D cell c2 = (New York, I1, *, 400) and of the 3-D cell c3 = (New York, I1, 2013, 111).
  c3 is a descendant of c1 and c2.
In an n-D data cube, an i-D cell a = (a1, a2, …, an, measure_a) is an ancestor of a j-D cell b = (b1, b2, …, bn, measure_b) if
  1) i < j, and
  2) for 1 ≤ m ≤ n, am = bm whenever am ≠ *.
If, in addition, j = i + 1, then a is called a parent of b, and b a child of a.
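
The ancestor and parent conditions can be written down directly. The sketch below is illustrative: cells are tuples of dimension values with the measure left out, "*" marks an aggregated dimension, and the function names are hypothetical.

```python
STAR = "*"

def cell_dims(cell):
    """Number of non-aggregated dimensions (the i in 'i-D cell')."""
    return sum(v != STAR for v in cell)

def is_ancestor(a, b):
    """True if a is an ancestor of b: a has fewer fixed dimensions,
    and every fixed value in a matches b."""
    if len(a) != len(b) or cell_dims(a) >= cell_dims(b):
        return False
    return all(x == y for x, y in zip(a, b) if x != STAR)

def is_parent(a, b):
    """A parent is an ancestor with exactly one fewer fixed dimension."""
    return is_ancestor(a, b) and cell_dims(b) == cell_dims(a) + 1

# The cells from the slide, over (branch, product, year).
c1 = ("New York", STAR, STAR)
c2 = ("New York", "I1", STAR)
c3 = ("New York", "I1", 2013)

print(is_ancestor(c1, c2), is_ancestor(c1, c3), is_parent(c1, c3))
```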

- Types of cubes
Full cube: all cells and cuboids are materialized, i.e., all possible combinations of dimensions and values.
Iceberg cube: partial materialization. Only the cells in a cuboid whose measure value is above the minimum threshold are materialized. Iceberg condition: count(*) >= min support.
Closed cube: no ancestor cell is created if its measure is equal to that of a descendant cell.
Shell cube: only cuboids with a limited number of dimensions are created.
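
To illustrate the iceberg condition only, here is a deliberately naive sketch: it counts every cell of the full cube and then filters by minimum support. This is not one of the efficient algorithms covered in the rest of the deck, and the function name and sample rows are made up for the example.

```python
from collections import Counter
from itertools import combinations

def iceberg_cube(rows, min_sup):
    """Return {cell: count} for every cell with count(*) >= min_sup.
    rows: list of equal-length tuples of dimension values."""
    n = len(rows[0])
    counts = Counter()
    for row in rows:
        # Each row contributes to every cell obtained by replacing
        # any subset of its values with '*'.
        for k in range(n + 1):
            for keep in combinations(range(n), k):
                cell = tuple(row[i] if i in keep else "*" for i in range(n))
                counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c >= min_sup}

rows = [
    ("New York", "I1"),
    ("New York", "I1"),
    ("New York", "I2"),
    ("Toronto", "I1"),
]
cube = iceberg_cube(rows, min_sup=2)
for cell, count in sorted(cube.items()):
    print(cell, count)
```

With min_sup = 1 the same function returns the full cube, which is why computing everything first and filtering afterwards does not scale.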

Two base cells: {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}
How many sub-patterns does the first base cell have? 2^{100} – 1 aggregate cells.
The total number of aggregate cells is 2 * 2^{100} – 6.
Ignore all aggregate cells that can be obtained by replacing some constants with “*” while keeping the same measure value.
Only 3 cells really offer new information: {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10, (a1, a2, *, …, *):20}
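
These counts can be checked by brute force on a scaled-down version of the example. The sketch below assumes 5 dimensions instead of 100 (so the expected number of aggregate cells is 2 * 2^5 – 6); the helper names are hypothetical, and the quadratic closedness test is only suitable for tiny inputs.

```python
from itertools import product

def cube_cells(base):
    """base: {tuple: count}. Return {cell: count} for every cell of the full cube."""
    cells = {}
    for row, cnt in base.items():
        for mask in product([False, True], repeat=len(row)):
            cell = tuple("*" if starred else v for v, starred in zip(row, mask))
            cells[cell] = cells.get(cell, 0) + cnt
    return cells

def is_descendant(d, c):
    """True if d is a strict specialization of c (d fills in some of c's '*')."""
    return d != c and all(x == "*" or x == y for x, y in zip(c, d))

def closed_cells(cells):
    """Keep a cell only if no descendant has the same measure value."""
    return {
        c: m for c, m in cells.items()
        if not any(md == m and is_descendant(d, c) for d, md in cells.items())
    }

n = 5  # scaled down from the slide's 100 dimensions
base = {
    tuple(f"a{i}" for i in range(1, n + 1)): 10,
    ("a1", "a2") + tuple(f"b{i}" for i in range(3, n + 1)): 10,
}
cells = cube_cells(base)
aggregate = len(cells) - len(base)
closed = closed_cells(cells)
print(aggregate, len(closed))
```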

Example: which are the closed cells?
… can be derived from the closed cell …
Similarly, … can be derived from the closed cell …

Iceberg Cube, Closed Cube & Cube Shell
Is the iceberg cube good enough?
  2 base cells: {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}
  How many cells will the iceberg cube have if having count(*) >= 10? Hint: a huge but tricky number!
Closed cube:
  Closed cell c: there exists no cell d such that d is a descendant of c and d has the same measure value as c.
  Closed cube: a cube consisting of only closed cells.
  What is the closed cube of the above base cuboid? Hint: only 3 cells.
Cube shell:
  Precompute only the cuboids involving a small # of dimensions, e.g., 3.
  More dimension combinations will need to be computed on the fly.
  For (A1, A2, …, A10), how many combinations to compute?
If the two base cells were {(a1, a2, a3, …, a100):10, (b1, b2, b3, …, b100):10}, the total # of non-base cells would be 2 * 2^{100} – 3.
But for {(a1, a2, a3, …, a100):10, (a1, a2, b3, …, b100):10}, the total # of non-base cells is 2 * 2^{100} – 6.
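
One reading of the cube-shell question is "how many cuboids have between 1 and 3 dimensions when there are 10 dimensions in total?". The sketch below counts them under that assumption, leaving out the apex cuboid; the function name is hypothetical.

```python
from math import comb

def shell_cuboids(n_dims: int, max_dims: int) -> int:
    """Number of cuboids that involve between 1 and max_dims of the n_dims dimensions."""
    return sum(comb(n_dims, k) for k in range(1, max_dims + 1))

print(shell_cuboids(10, 3))   # cuboids in a 3-D shell over 10 dimensions
print(2 ** 10)                # cuboids in the full cube, for comparison
```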

Outline
Types of cells
Types of cubes
Efficient Computation of Data Cubes
  Multiway Array Aggregation
  BUC
  Star Cubing

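- Efficient Computation of Data Cubes
Preliminary cube computation tricks
Computing full/iceberg cubes: 2 methodologies and one hybrid
  Multiway array aggregation: computes from the base cuboid toward the apex cuboid
  BUC (Bottom-Up Computation): computes from the apex cuboid toward the base cuboid
  Star-Cubing: integrates the two directions
Note: whether a direction is called “top-down” or “bottom-up” depends on how the lattice is drawn. The later slides call multiway array aggregation “bottom-up” and note that BUC is top-down in that view.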

-- Preliminary Cube Computation Tricks
Sorting, hashing, and grouping operations are applied to the dimension attributes to reorder and cluster related tuples (ROLAP).
Aggregates may be computed from previously computed aggregates rather than from the base fact table:
  Cache-results: reuse the results of an already computed cuboid to reduce disk I/O; higher-level aggregates are computed from lower-level aggregates rather than from base facts.
  Smallest-child: compute a cuboid from the smallest previously computed cuboid; e.g., C{branch} can be computed from C{branch, year} or C{branch, item}, whichever is smaller.
  Amortize-scans: compute as many cuboids as possible at the same time to amortize disk reads.
  Share-sorts: share sorting costs across multiple cuboids when a sort-based method is used.
  Share-partitions: share the partitioning cost across multiple cuboids when hash-based algorithms are used.
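
The smallest-child trick can be sketched as follows. This is an illustration under simple assumptions: cuboids are in-memory dictionaries from value tuples to a summed measure, "smallest" means fewest cells, and the data and helper names are invented for the example.

```python
from collections import defaultdict

def aggregate(cuboid, dims, keep):
    """Roll a cuboid {tuple: measure} defined over `dims` up to the dimensions in `keep`."""
    idx = [dims.index(d) for d in keep]
    out = defaultdict(int)
    for key, measure in cuboid.items():
        out[tuple(key[i] for i in idx)] += measure
    return dict(out)

def smallest_parent(parents):
    """parents: {dims_tuple: cuboid}. Return the (dims, cuboid) pair with the fewest cells."""
    return min(parents.items(), key=lambda kv: len(kv[1]))

base_dims = ("branch", "year", "item")
base = {
    ("NY", 2012, "I1"): 2, ("NY", 2012, "I2"): 5,
    ("NY", 2013, "I1"): 1, ("NY", 2013, "I2"): 4,
    ("TO", 2012, "I1"): 1, ("TO", 2013, "I1"): 3,
}

# Two previously computed cuboids that could both produce C{branch}.
parents = {
    ("branch", "year"): aggregate(base, base_dims, ("branch", "year")),
    ("branch", "item"): aggregate(base, base_dims, ("branch", "item")),
}
dims, parent = smallest_parent(parents)
by_branch = aggregate(parent, dims, ("branch",))
print(dims, by_branch)
```

The result is the same whichever parent is used; choosing the smaller one only reduces the number of cells that have to be read.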

-- Multi-Way Array Aggregation …
Used for MOLAP and full cube computation
Array-based “bottom-up” algorithm
Uses multi-dimensional chunks
Simultaneous aggregation on multiple dimensions
Intermediate aggregate values are re-used for computing ancestor cuboids
Cannot do Apriori pruning: no iceberg optimization
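
The two central ideas, simultaneous aggregation in a single scan and re-use of intermediate aggregates, can be sketched for three dimensions as below. The sketch leaves out chunking and compressed array addressing, stores cuboids as dictionaries, and uses invented names and data.

```python
from collections import defaultdict

def multiway_aggregate(base):
    """base: {(a, b, c): measure}. Compute every cuboid of a 3-D cube."""
    ab, ac, bc = defaultdict(int), defaultdict(int), defaultdict(int)
    # A single scan of the base cuboid updates all three 2-D cuboids at once.
    for (a, b, c), m in base.items():
        ab[a, b] += m
        ac[a, c] += m
        bc[b, c] += m
    # The 1-D cuboids re-use the 2-D results instead of rescanning the base.
    a_, b_, c_ = defaultdict(int), defaultdict(int), defaultdict(int)
    for (a, b), m in ab.items():
        a_[a] += m
        b_[b] += m
    for (a, c), m in ac.items():
        c_[c] += m
    return {
        "AB": dict(ab), "AC": dict(ac), "BC": dict(bc),
        "A": dict(a_), "B": dict(b_), "C": dict(c_),
        "All": sum(a_.values()),
    }

base = {(0, 0, 0): 1, (0, 1, 0): 2, (1, 0, 1): 3, (1, 1, 1): 4}
cube = multiway_aggregate(base)
print(cube["AC"], cube["B"], cube["All"])
```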

… -- Multi-way Array Aggregation …
Partition arrays into chunks (a chunk is a small subcube that fits in memory).
Compressed sparse array addressing: (chunk_id, offset)
Compute aggregates in “multiway” by visiting cube cells in the order that minimizes the # of times each cell is visited and reduces memory access and storage cost.
[Figure: 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into 64 chunks numbered 1 to 64.]
What is the best traversing order to do multi-way aggregation?

… -- Multi-way Array Aggregation …
[Figure: the same 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into chunks 1 to 64.]

… -- Multi-way Array Aggregation …
[Figure: the same 3-D array with dimensions A (a0 to a3), B (b0 to b3), and C (c0 to c3), partitioned into chunks 1 to 64.]
With chunks scanned in the order 1, 2, 3, …, 64, AB requires the longest scan: the first AB chunk is complete only after the 49th chunk has been scanned.

… -- Multi-way Array Aggregation …
Assume the sizes of dimensions A, B, and C are 40, 400, and 4000, respectively, and that each dimension is split into 4 equal partitions, so the chunk sides are 10, 100, and 1000. Then AB is the smallest 2-D plane and BC is the largest.
If chunks are scanned as 1, 2, 3, …, then 156,000 memory units are needed: 40*400 (whole AB plane) + 40*1000 (one row of the AC plane) + 100*1000 (one chunk of the BC plane).
If chunks are scanned as 1, 17, 33, 49, 5, 21, 37, … (aggregation ordering AB-AC-BC), then 1,641,000 memory units are needed: 400*4000 (whole BC plane) + 40*1000 (one row of the AC plane) + 10*100 (one chunk of the AB plane).
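
The arithmetic for both scan orders follows one pattern, sketched below for the 3-D case. The sketch assumes each dimension is split into 4 equal partitions (chunk sides 10, 100, 1000, which is what the slide's numbers imply) and describes a scan order by listing the dimensions from fastest-varying to slowest-varying; the function name is hypothetical.

```python
def multiway_memory(sizes, chunks, scan_order):
    """Minimum memory units for the three 2-D planes of a 3-D array.

    sizes, chunks: dicts of dimension size and chunk side length.
    scan_order: dimensions from fastest-varying to slowest-varying.
    """
    fastest, middle, slowest = scan_order
    # Plane without the fastest dimension: finished one chunk at a time.
    one_chunk = chunks[middle] * chunks[slowest]
    # Plane without the middle dimension: one row of chunks must be held.
    one_row = sizes[fastest] * chunks[slowest]
    # Plane without the slowest dimension: the whole plane must be held.
    whole_plane = sizes[fastest] * sizes[middle]
    return whole_plane + one_row + one_chunk

sizes = {"A": 40, "B": 400, "C": 4000}
chunks = {"A": 10, "B": 100, "C": 1000}

print(multiway_memory(sizes, chunks, ("A", "B", "C")))  # chunk order 1, 2, 3, ...
print(multiway_memory(sizes, chunks, ("C", "B", "A")))  # chunk order 1, 17, 33, 49, 5, ...
```

Making the smallest dimension vary fastest keeps the whole-plane term on the smallest plane (AB), which is where the saving comes from.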

… -- Multi-way Array Aggregation …
[Figure: cuboid lattice (All; A, B, C; AB, AC, BC; ABC) shown for the two scan orders. One ordering needs 156,000 memory units; the other needs 1,641,000 memory units.]

… -- Multi-way Array Aggregation
Method: the planes should be sorted and computed according to their size in ascending order.
Idea: keep the smallest plane in main memory; fetch and compute only one chunk at a time for the largest plane.
Limitation: the method works well only for a small number of dimensions.
If there are many dimensions, “top-down” computation and iceberg cube computation methods can be explored.

-- Bottom-Up Computation (BUC) …
Bottom-up cube computation (Note: top-down in our view!)
Divides dimensions into partitions and facilitates iceberg pruning
  If a partition does not satisfy min_sup, its descendants can be pruned
  If min_sup = 1 ⇒ compute the full CUBE!
No simultaneous aggregation

BUC: Partitioning
Usually, the entire data set can’t fit in main memory.
Sort on distinct values, partition into blocks that fit, and continue processing.
Optimizations:
  Partitioning: external sorting, hashing, counting sort
  Ordering dimensions to encourage pruning: cardinality, skew, correlation
    The higher the cardinality, the smaller the partitions and the greater the pruning opportunity.
  Collapsing duplicates
    Can’t do holistic aggregates anymore!
Ideally, the most discriminating dimension, with higher cardinality and less skew, is processed first.
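
The recursive partition-and-prune structure of BUC can be sketched as below. This is a simplified in-memory version with count as the only measure: it does no external partitioning, processes dimensions in the order given rather than reordering them by cardinality or skew, and uses invented names and data.

```python
from collections import defaultdict

def buc(rows, n_dims, min_sup):
    """Return {cell: count} for all cells with count >= min_sup."""
    out = {}

    def recurse(part, cell, start):
        # Output the aggregate for the current partition.
        out[tuple(cell)] = len(part)
        for d in range(start, n_dims):
            # Partition the current rows on dimension d.
            groups = defaultdict(list)
            for r in part:
                groups[r[d]].append(r)
            for value, sub in groups.items():
                # Iceberg pruning: a partition below min_sup is dropped,
                # so none of its descendant cells are ever computed.
                if len(sub) >= min_sup:
                    cell[d] = value
                    recurse(sub, cell, d + 1)
            cell[d] = "*"

    if len(rows) >= min_sup:
        recurse(rows, ["*"] * n_dims, 0)
    return out

rows = [
    ("New York", "I1"),
    ("New York", "I1"),
    ("New York", "I2"),
    ("Toronto", "I1"),
]
result = buc(rows, n_dims=2, min_sup=2)
for cell, count in sorted(result.items()):
    print(cell, count)
```

In this run the Toronto partition has a count of 1, so the recursion never descends into it and no (Toronto, ...) cell is produced.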

--- BUC: Example (Having count(*) > 5) …
[Figure: counts by branch (New York, Toronto), item (I1, I2, I3), and quarter (Q1 to Q4), shown together and then split by branch.]

… --- BUC: Example (Having count(*) > 5)
[Figure: BUC processing tree over the cuboids All; B, P, Q; (P,B), (Q,B), (Q,P); (B,P,Q), annotated with counts, next to an item (I1 to I3) by quarter (Q1 to Q4) table.]

Till Now
BUC: facilitates Apriori pruning. During partitioning, each partition’s count is compared with min_sup. The recursion stops if the count does not satisfy min_sup.
Multiway array aggregation: aggregates simultaneously on multiple dimensions. Multiple cuboids can be computed simultaneously in one pass.
Dynamic structure with simultaneous aggregation.

Summary
Data Cube Materialization
  Full materialization
  Partial materialization: iceberg cubes, shell fragments
Data Cube Computation Methods
  Multiway array aggregation
  BUC for computing iceberg cubes
Next Class
  Star Cubing
  Shell Fragments for Fast High-Dimensional OLAP
  Exploration and Discovery in Multidimensional Databases