1 Multi-way Algorithm for Cube Computation CPS Notes 8
2 First Programming Project
- Individual project, 15 points in final grade
- Sales(customer_id, item_id, item_group, item_price, purchase_date)
  - Will be provided as a file during the demo and for generating performance numbers for the project report
- Task 1: 5 points
  - Interface to enter MIN_SUPPORT (% of customers)
  - Find frequent itemsets (sets of item_id's) using Apriori
- Task 2: 5 points (Section 5.5 in the textbook)
  - Interface to enter two constraint types (e.g., SUM(item_price) op const)
  - Use the constraints in Apriori as effectively as possible; study and demonstrate the performance improvement
- Task 3: 5 points
  - Extension of your choice. Examples include (i) association rules, (ii) complex constraints, (iii) sequential patterns, (iv) variants of Apriori, (v) FP-growth
3 File Format
- 10,123,3,54,4/4/2008
- 10,12,4,101,4/5/2008
- 14,123,3,54,8/4/2008
- …
- Caveats:
  - Customer vs. item
  - Three datasets: Toy, Medium, and Large
  - Comma-separated file, one purchase per line, no header
  - Integers for simplicity
  - Note the date format
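A minimal sketch of reading this file format, assuming the fields appear in the order of the Sales schema and that dates are month/day/year (the slide only says to note the date format, so `date_format` is an assumption you may need to change). The names `Purchase`, `read_purchases`, and `baskets_by_customer` are illustrative, not part of the project specification.

```python
import csv
from collections import defaultdict
from datetime import datetime
from typing import NamedTuple


class Purchase(NamedTuple):
    customer_id: int
    item_id: int
    item_group: int
    item_price: int
    purchase_date: datetime


def read_purchases(lines, date_format="%m/%d/%Y"):
    """Parse comma-separated purchase lines (no header, one purchase per line).

    date_format is an ASSUMPTION (month/day/year); adjust if the data differs.
    """
    purchases = []
    for row in csv.reader(lines):
        if not row:  # skip blank lines
            continue
        cust, item, group, price, date = row
        purchases.append(
            Purchase(int(cust), int(item), int(group), int(price),
                     datetime.strptime(date.strip(), date_format))
        )
    return purchases


def baskets_by_customer(purchases):
    """Group item_ids per customer, since MIN_SUPPORT is a % of customers."""
    baskets = defaultdict(set)
    for p in purchases:
        baskets[p.customer_id].add(p.item_id)
    return dict(baskets)


sample = ["10,123,3,54,4/4/2008", "10,12,4,101,4/5/2008", "14,123,3,54,8/4/2008"]
purchases = read_purchases(sample)
baskets = baskets_by_customer(purchases)
```

For a real file, pass an open file object instead of the `sample` list, for example `read_purchases(open(path, newline=""))`.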
4 First Programming Project: Milestones
- Feb 3: Project announced
- Feb 17: Mid-project report due
  - Describe progress and planned extensions
  - Describe detailed algorithms for all three tasks
- Feb 17: Sample data file will be provided for generating performance results for the project report
- March 2: Submit code, README file to run the code, code documentation, and final project report
- March 2-4: Project demos (random assignment)
- March 6: Spring break. Second project announced
5 Finalized Grading Criteria for Class
- Homeworks: 15 points
- Programming projects: 40 points
- Midterm: 20 points
  - Note: Midterm is on Feb 19 (Thu) in class
- Final: 25 points
6 ROLAP Server
- Relational OLAP Server
- Special indices, tuning; schema is “denormalized”
[Figure: relational DBMS, ROLAP server, tools, utilities]
7 MOLAP Server
- Multi-Dimensional OLAP Server
- Could also sit on a relational DBMS
[Figure: multi-dimensional server, M.D. tools, utilities; Sales array with labels Product, City, Date, milk, soda, eggs, soap, A, B]
8 MOLAP
[Figure: data cube with dimensions Date (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (TV, VCR, PC, sum), and Country (U.S.A, Canada, Mexico, sum); callout: "Total annual sales of TV in U.S.A."]
9 MOLAP
[Figure: 3-D array with dimensions A (a0..a3), B (b0..b3), and C (c0..c3)]
10 Challenges in MOLAP
- Storing large arrays for efficient access
  - Row-major, column-major
  - Chunking
  - Compressing sparse arrays
- Creating array data from data in tables
- Efficient techniques for cube computation
These topics are discussed in the paper assigned for reading.
11 ROLAP vs. MOLAP
- What do the authors say?
- What can you do in MOLAP that you cannot do in ROLAP?
- Can the algorithm in this paper be used in ROLAP?
12 Array Storage
- Chunks
- Compression
  - Chunk-offset compression vs. LZW
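A small sketch of the idea behind chunk-offset compression, assuming each chunk is stored as a flat list of cells and that only non-empty cells are kept as (offset within chunk, value) pairs. Using `None` to mark an empty cell and the function names below are assumptions made for illustration.

```python
def compress_chunk(cells):
    """Keep only valid cells of a sparse chunk as (offset_in_chunk, value) pairs.

    ASSUMPTION: an empty cell is represented by None.
    """
    return [(offset, value) for offset, value in enumerate(cells) if value is not None]


def decompress_chunk(pairs, chunk_size):
    """Rebuild the dense chunk from its (offset, value) pairs."""
    cells = [None] * chunk_size
    for offset, value in pairs:
        cells[offset] = value
    return cells


dense = [None, None, 7, None, None, None, 3, None]  # sparse chunk with 8 cells
packed = compress_chunk(dense)                       # [(2, 7), (6, 3)]
```

The packed form pays off only when the chunk is sparse: each stored cell costs an offset plus a value, so a mostly full chunk is better left dense.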
13 Loading Arrays from Tables
- The easy case: the array fits in memory
- Else:
  - Partitions
14 Loading Arrays from Tables
- Suppose there are 1000 chunks and 10 chunks can fit in memory. The partition size is 10 chunks.
[Figure: table split into partitions of chunks; labels: chunks, 100, Table]
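The arithmetic of the example can be sketched as follows. This assumes that consecutive chunk numbers are grouped into one partition and that a caller-supplied function maps a table row to its chunk number; both are illustrative assumptions, and the paper's loading scheme may differ in detail.

```python
from collections import defaultdict


def num_partitions(total_chunks, chunks_in_memory):
    """Number of partitions when each partition holds as many chunks as fit in memory."""
    return -(-total_chunks // chunks_in_memory)  # integer ceiling division


def assign_to_partitions(rows, chunk_of, chunks_per_partition):
    """Route each table row to the partition that owns its chunk.

    chunk_of is a HYPOTHETICAL callback: row -> chunk number (0-based).
    """
    partitions = defaultdict(list)
    for row in rows:
        partitions[chunk_of(row) // chunks_per_partition].append(row)
    return dict(partitions)


print(num_partitions(1000, 10))  # 100

# Toy usage: the first field of each row is taken to be its chunk number.
rows = [(0, "r1"), (9, "r2"), (10, "r3"), (999, "r4")]
parts = assign_to_partitions(rows, chunk_of=lambda r: r[0], chunks_per_partition=10)
```

Each partition can then be loaded into memory on its own and written out chunk by chunk.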
15 Basic Array Cubing Algo
- First find the minimum spanning tree
  - Hierarchy of aggregates
- Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
  - One pass through the array in the right order
Let us look at some basics first.
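One way to picture the spanning tree over the hierarchy of aggregates is sketched below. It assumes that the "best" parent of a group-by is the one with the fewest cells, estimated as the product of its dimension sizes; that reading of "best" and the example sizes (40, 400, 4000) are assumptions made for illustration.

```python
from itertools import combinations
from math import prod


def min_size_spanning_tree(sizes):
    """Map every group-by (tuple of dimension names) to its smallest parent.

    ASSUMPTION: the cost of a parent is its cell count (product of dimension sizes).
    """
    dims = sorted(sizes)
    parent = {}
    for k in range(len(dims) - 1, -1, -1):
        for child in combinations(dims, k):
            # A parent has exactly one more dimension than the child.
            candidates = [tuple(sorted(set(child) | {d})) for d in dims if d not in child]
            parent[child] = min(candidates, key=lambda g: prod(sizes[d] for d in g))
    return parent


tree = min_size_spanning_tree({"A": 40, "B": 400, "C": 4000})
for child, par in sorted(tree.items(), key=lambda kv: (-len(kv[0]), kv[0])):
    print(child or "ALL", "<-", par)
```

With these sizes, A and B are both computed from AB (16,000 cells) rather than from the larger AC or BC, and C is computed from AC.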
16 Chunked 3D Array
- Dimension order CBA
[Figure: chunked 3-D array with dimensions A (a0..a3), B (b0..b3), and C (c0..c3)]
17 “a0b0” chunk
[Figure: read order a0b0c0 c1 c2 c3, a0b1c0 c1 c2 c3, a0b2c0 c1 c2 c3, a0b3c0 c1 c2 c3, …; chunks a0b0c0..c3 (marked x) are being read; result panels a0 × c0..c3 and b0..b3 × c0..c3]
18 a0b1 chunk
[Figure: same read order; chunks a0b1c0..c3 (marked y) are being read. Done with a0b0]
19 a0b2 chunk
[Figure: same read order; chunks a0b2c0..c3 (marked z) are being read. Done with a0b1]
20 Table Visualization
[Figure: same read order; chunks a0b3c0..c3 (marked u) are being read. Done with a0b2]
21 Table Visualization
[Figure: read order a1b0c0 c1 c2 c3, a1b1c0 c1 c2 c3, a1b2c0 c1 c2 c3, a1b3c0 c1 c2 c3, …; chunks a1b0c0..c3 are being read. Done with a0b3. Done with a0c*]
22 a3b3 chunk (last one)
[Figure: read order a3b0c0 c1 c2 c3, a3b1c0 c1 c2 c3, a3b2c0 c1 c2 c3, a3b3c0 c1 c2 c3; Finish. Done with a0b3. Done with a0c*. Done with b*c*]
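The walk through the chunks can be imitated with a short sketch. It assumes a dense cube stored as nested lists `cube[a][b][c]`, cubic chunks, and a scan in which C varies fastest, then B, then A, as in the slides. All three 2-D aggregates are updated during the same pass; the function and variable names are illustrative.

```python
from itertools import product


def scan_and_aggregate(cube, n_chunks, chunk_len):
    """One pass over the chunks (C fastest, then B, then A), summing into AB, AC, BC."""
    ab, ac, bc = {}, {}, {}
    chunk_order = []
    # product() varies its last index fastest, so chunks come out as
    # a0b0c0, a0b0c1, ..., a0b1c0, ... like the read order on the slides.
    for ca, cb, cc in product(range(n_chunks), repeat=3):
        chunk_order.append((ca, cb, cc))
        for da, db, dc in product(range(chunk_len), repeat=3):
            a = ca * chunk_len + da
            b = cb * chunk_len + db
            c = cc * chunk_len + dc
            value = cube[a][b][c]
            ab[a, b] = ab.get((a, b), 0) + value  # aggregate over C
            ac[a, c] = ac.get((a, c), 0) + value  # aggregate over B
            bc[b, c] = bc.get((b, c), 0) + value  # aggregate over A
    return ab, ac, bc, chunk_order


n = 4  # 2 chunks per dimension, 2 cells per chunk side
cube = [[[1 for _ in range(n)] for _ in range(n)] for _ in range(n)]
ab, ac, bc, order = scan_and_aggregate(cube, n_chunks=2, chunk_len=2)
```

In this scan order an AB cell receives its last contribution as soon as the C chunks for its (a, b) chunk have been read, an AC row is finished only after all B chunks for that A chunk, and BC cells stay open until the very end. This sketch keeps every partial result in memory for simplicity; it does not model writing finished planes out early.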
23 Memory Used
- A: 40 distinct values
- B: 400 distinct values
- C: 4000 distinct values
- Dimension order: CBA
- Plane AB: need 1 chunk (10 * 100 * 1 = 1,000)
- Plane AC: need 4 chunks (10 * 1000 * 4 = 40,000)
- Plane BC: need 16 chunks (100 * 1000 * 16 = 1,600,000)
- Total memory: 1,641,000
24 Memory Used
- A: 40 distinct values
- B: 400 distinct values
- C: 4000 distinct values
- Dimension order: ABC
- Plane BC: need 1 chunk (1000 * 100 * 1 = 100,000)
- Plane AC: need 4 chunks (1000 * 10 * 4 = 40,000)
- Plane AB: need 16 chunks (100 * 10 * 16 = 16,000)
- Total memory: 156,000
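The two totals can be reproduced with a short calculation. The sketch assumes chunk sides of 10, 100, and 1000 (implied by the products on the slides), that the first letter of a dimension order varies fastest, and the following counting rule: a plane must be held at full size along every dimension in the leading run of the read order that it keeps, and at chunk size along its other dimensions. The function names are illustrative.

```python
from itertools import permutations
from math import prod


def plane_memory(group_by, read_order, dim_size, chunk_size):
    """Cells that must stay in memory for one plane while scanning in read_order."""
    # Count the leading dimensions of the read order that the plane keeps.
    p = 0
    for d in read_order:
        if d not in group_by:
            break
        p += 1
    full = set(read_order[:p])
    return prod(dim_size[d] if d in full else chunk_size[d] for d in group_by)


def total_memory(read_order, dim_size, chunk_size):
    """Sum over all planes obtained by dropping exactly one dimension."""
    dims = list(read_order)
    planes = [[d for d in dims if d != dropped] for dropped in dims]
    return sum(plane_memory(g, read_order, dim_size, chunk_size) for g in planes)


DIM = {"A": 40, "B": 400, "C": 4000}
CHUNK = {"A": 10, "B": 100, "C": 1000}  # ASSUMPTION: chunk sides implied by the slides

print(total_memory("CBA", DIM, CHUNK))  # 1641000
print(total_memory("ABC", DIM, CHUNK))  # 156000

# Try every dimension order and keep the cheapest one.
best = min(permutations("ABC"), key=lambda o: total_memory(o, DIM, CHUNK))
print("".join(best))  # ABC
```

Among the six possible orders for this example, scanning the smallest dimension fastest needs the least memory.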
25 Basic Array Cubing Algo
- First find the minimum spanning tree
  - Hierarchy of aggregates
- Compute each (k-1)-dimensional aggregate from its best k-dimensional aggregate
  - One pass through the array in the right order
- What are the advantages and disadvantages of this algorithm?
26 Multi-way Array Cubing Algo
- What is the main idea?
- Rule 1 on Page 163
- Minimum memory spanning tree
  - Figure 2
  - Figures 3 and 4
- Theorem 1
- Basic idea of the multi-pass algorithm
  - Tradeoff between memory usage and number of passes
27
D1 D2 D3M
A3 B2 C120
A7 B2 C110
A13 B1 C1230
A2 B2 C110
A3 B7 C1240
A15 B7 C120
A6 B1 C1210
A13 B2 C120
A1 B11 C1100
A1 B11 C150
A13 B2 C130
A3 B11 C1210
A13 B7 C140
A10 B1 C150
A3 B1 C1210