1
Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance

Ge Yang, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
2
Motivation
- A lot of effort has gone into developing cluster computing tools targeting scientific applications
- There is an emerging class of commercial applications that are well suited for cluster environments:
  - OnLine Analytical Processing (OLAP)
  - Data Mining
- Can we successfully use cluster tools developed for scientific applications on commercial applications?
3
Overview
- Focus on:
  - Data cube construction, which is an OLAP problem
    - Both compute and data intensive
    - Frequently used in data warehouses
  - Use of the Active Data Repository (ADR), developed for scientific data-intensive applications
- Questions:
  - Are new algorithms or variations of existing algorithms required?
  - Implementation experience?
  - Performance?
4
Outline
- Data cube construction
  - Problem definition
  - Challenges
- Active Data Repository (ADR)
- Scalable data cube construction algorithms targeting ADR
- Implementation experience
- Performance evaluation
- Summary
5
Data Cube Construction
- Context: data warehouses
  - Frequently store (possibly sparse) multidimensional datasets
  - Example: sales information for a chain of stores; time, item, and location can be the three dimensions
  - Frequently asked queries: aggregate along one or more dimensions
- Data cube construction: perform all aggregations in advance to facilitate rapid response to all queries
- For the original n-dimensional array, construct the C(n, m) arrays of m dimensions, for each 0 <= m <= n
6
Data Cube Construction
- Example: consider the original 3-dimensional array ABC (sketched in numpy below); the data cube comprises:
  - 3 two-dimensional arrays: AB, BC, AC
  - 3 one-dimensional arrays: A, B, and C
  - A scalar value: all
- Some observations:
  - Large input size: data warehouses can hold a lot of data
  - The total amount of output can be quite large
  - A lot of computation is involved
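As a concrete illustration, here is a minimal numpy sketch of these aggregations; the array sizes are illustrative assumptions, not from the talk:

```python
import numpy as np

ABC = np.random.rand(4, 8, 16)   # dimensions A, B, C with |A| <= |B| <= |C|

# The three 2-D aggregates: sum out the missing dimension
AB = ABC.sum(axis=2)             # aggregate over C
AC = ABC.sum(axis=1)             # aggregate over B
BC = ABC.sum(axis=0)             # aggregate over A

# The three 1-D aggregates, each taken from a smaller parent
A = AB.sum(axis=1)
B = AB.sum(axis=0)
C = AC.sum(axis=0)

# The scalar "all"
all_ = A.sum()
```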
7
Lattice for Data Cube Construction
- Options for computing the different output arrays can be represented by a lattice
- If A is the shortest dimension and C is the largest, the arrows represent the minimal spanning tree of the lattice
- AB is considered the smallest parent of A and B (a small helper sketching this choice follows)
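A hypothetical helper that picks each aggregate's smallest parent by comparing parent volumes; the dimension sizes and function names below are illustrative assumptions:

```python
# Pick the "smallest parent" for each node of the lattice, i.e. the
# parent with the fewest elements, given per-dimension sizes.
sizes = {'A': 4, 'B': 8, 'C': 16}

def volume(dims):
    v = 1
    for d in dims:
        v *= sizes[d]
    return v

def smallest_parent(child):
    # A parent has exactly one more dimension than the child
    absent = [d for d in sizes if d not in child]
    parents = [''.join(sorted(child + d)) for d in absent]
    return min(parents, key=volume)

for child in ['AB', 'AC', 'BC', 'A', 'B', 'C']:
    print(child, '<-', smallest_parent(child))
```

With these sizes the helper prints AB as the smallest parent of both A and B, and AC as the smallest parent of C, matching the spanning tree described above.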
8
Active Data Repository
- Developed at the University of Maryland (Chang, Kurc, Sussman, Saltz)
- Targeted scientific data-intensive applications
- Execution model (a schematic follows):
  - Divide the output dataset(s) into tiles; allocate one tile at a time
  - Fetch the input dataset one chunk at a time to compute the tile
  - Decide on a plan or schedule for fetching the chunks that contribute to a tile
- Operations involved in computing an output element must be associative and commutative
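A schematic of this execution model; this is not the actual ADR C++ API, and every name below is an illustrative placeholder:

```python
# Schematic of ADR's tile/chunk execution loop.
def process_dataset(output_tiles, plan_chunks, local_reduce):
    for tile in output_tiles:            # one output tile in memory at a time
        tile.allocate()
        for chunk in plan_chunks(tile):  # planned fetch schedule for this tile
            local_reduce(tile, chunk)    # must be associative and commutative
        tile.write_back()
```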
9
Goals in Algorithm Design
- Must use smallest parents / the minimal spanning tree
- Maximal cache and memory reuse: perform all computations associated with an input chunk before it is discarded from memory
- Minimize interprocessor communication volume
- Minimize the amount of memory that needs to be allocated across the tiles
- Fit into ADR's computation model
10
Approach
- Currently consider data cube construction starting from a three-dimensional array only
- Partition and tile along a single dimension only
- If the sizes along the dimensions A, B, and C are |A|, |B|, and |C|, assume that |A| <= |B| <= |C| (no loss of generality)
11
Partitioning and Tiling
- Always partition along the dimension C
  - Minimizes communication volume: if |A| <= |B| <= |C|, then |A||B| <= |A||C| <= |B||C|
- Let the size of the dimension C on each processor be |C'|
- Three separate cases for tiling (see the case-selection sketch below):
  - Case I: |A| <= |B| <= |C'|
  - Case II: |A| <= |C'| <= |B|
  - Case III: |C'| <= |A| <= |B|
- Focus on the first and second cases; the third is almost identical to the second
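A small sketch of the case selection, assuming dimension C is partitioned evenly across `nprocs` processors; the function name and parameters are illustrative:

```python
def tiling_case(size_a, size_b, size_c, nprocs):
    size_c_local = size_c // nprocs      # |C'| = |C| / p
    if size_a <= size_b <= size_c_local:
        return "Case I: tile along C"
    elif size_a <= size_c_local <= size_b:
        return "Case II: tile along B"
    else:                                # |C'| <= |A| <= |B|
        return "Case III: handled almost identically to Case II"
```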
12
First Case
- Tile along the dimension C on each processor
- Hold AB in memory through the processing of all tiles
- AC and BC are allocated separately for each tile
13
Algorithm for Case I

    Allocate AB
    Foreach tile:
        Allocate AC and BC
        Foreach input chunk to be read:
            Update AB, AC, and BC
        Compute C from AC
        Write-back AC, BC, and C
        If last tile:
            Perform global reduction to obtain AB
            If proc_id == 0:
                Compute A and B from AB
                Compute all from A
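A single-node numpy simulation of the algorithm above, assuming the tile count divides |C| evenly; the parallel global reduction and write-back are omitted:

```python
import numpy as np

def data_cube_case1(ABC, num_tiles, chunk):
    nA, nB, nC = ABC.shape
    tile_len = nC // num_tiles
    AB = np.zeros((nA, nB))                 # held in memory across all tiles
    C = np.zeros(nC)
    for t in range(num_tiles):
        lo, hi = t * tile_len, (t + 1) * tile_len
        AC = np.zeros((nA, tile_len))       # allocated per tile
        BC = np.zeros((nB, tile_len))
        for c0 in range(lo, hi, chunk):     # one input chunk at a time
            c1 = min(c0 + chunk, hi)
            blk = ABC[:, :, c0:c1]
            AB += blk.sum(axis=2)           # update AB, AC, and BC
            AC[:, c0 - lo:c1 - lo] += blk.sum(axis=1)
            BC[:, c0 - lo:c1 - lo] += blk.sum(axis=0)
        C[lo:hi] = AC.sum(axis=0)           # compute C's slice from AC
        # write-back of AC, BC, and C would happen here
    A = AB.sum(axis=1)                      # after the last tile
    B = AB.sum(axis=0)
    return AB, A, B, C, A.sum()             # A.sum() is the scalar "all"
```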
14
Properties of the Algorithm
- All arrays are computed from their smallest parents
- Maximal cache and memory reuse
- Minimal interprocessor communication volume among all single-dimensional partitions
- The portion of the output arrays that must be kept in main memory for the entire computation is the minimum over all single-dimensional tiling possibilities
15
Second Case
- Tile along the dimension B
- Hold AC in main memory for the entire computation
16
Algorithm for Case II

    Allocate AC and A
    Foreach tile:
        Allocate AB and BC
        Foreach input chunk to be read:
            Update AB, AC, and BC
        Perform global reduction to obtain final AB
        If proc_id == 0:
            Compute B from AB
            Update A using AB
        Write-back AB, BC, and B
        If last tile:
            Finish AC
            Compute C from AC
            If proc_id == 0:
                Finish A
                Compute all from A
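A matching single-node sketch of Case II, assuming the tile count divides |B| evenly; the per-tile chunked reads are collapsed into one slice, and the global reduction and write-back are again omitted:

```python
import numpy as np

def data_cube_case2(ABC, num_tiles):
    nA, nB, nC = ABC.shape
    tile_len = nB // num_tiles
    AC = np.zeros((nA, nC))                 # held in memory across all tiles
    A = np.zeros(nA)
    B = np.zeros(nB)
    for t in range(num_tiles):
        lo, hi = t * tile_len, (t + 1) * tile_len
        blk = ABC[:, lo:hi, :]              # this tile's input, one slice
        AB = blk.sum(axis=2)                # per-tile slice of AB
        BC = blk.sum(axis=0)                # per-tile slice of BC
        AC += blk.sum(axis=1)               # accumulated across tiles
        B[lo:hi] = AB.sum(axis=0)           # compute B's slice from AB
        A += AB.sum(axis=1)                 # update A using AB
        # write-back of AB, BC, and B would happen here
    C = AC.sum(axis=0)                      # after the last tile: finish AC
    return AC, A, B, C, A.sum()             # A.sum() is the scalar "all"
```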
17
Implementation Experience Using ADR
- Had to supply (see the skeleton below):
  - A local reduction function: processing for each chunk
  - A global reduction function: run after local reduction on each tile
  - A finalize function: run after processing all tiles
  - A specification of the desired tiling
- ADR's runtime support offered:
  - Fetching of the input chunks corresponding to each tile
  - Scheduling of asynchronous operations
  - Details of interprocessor communication
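An illustrative skeleton of the user-supplied pieces; ADR itself is a C++ runtime, so these Python names merely mirror the roles listed above:

```python
class DataCubeOperator:
    def local_reduce(self, tile, chunk):
        # Per-chunk processing: update AB, AC, and BC from this chunk.
        ...

    def global_reduce(self, tile):
        # Interprocessor combination after local reduction on a tile.
        ...

    def finalize(self):
        # Run once after all tiles: derive A, B, C, and "all".
        ...
```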
18
Experimental Evaluation
- Goals:
  - Speedups on sparse and dense datasets
  - Scaling of performance with respect to dataset sizes
  - Scaling of performance with respect to the number of tiles
  - Evaluating the impact of sparsity
- Experimental platform:
  - 8 250 MHz Ultra-II processors
  - 1 GB of main memory on each
  - Myrinet interconnect
19
Scaling Input Datasets: Dense Arrays
- Almost linear speedups up to 8 processors
- Performance per element scales linearly with dataset size
20
Scaling Dataset Sizes: Sparse Datasets
- 25% sparsity level
- Slightly lower speedups than for dense datasets: higher communication-to-computation ratio
- Execution time stays proportional to the amount of computation
21
Increasing the Number of Tiles
- 2 nodes
- Fixed amount of computation per tile
- Execution time stays proportional to the amount of computation
22
Impact of Sparsity
- Same number of non-zero elements in each dataset
- Good speedups in all cases
- Some reduction in sequential performance as sparsity increases, particularly for the 1% case
23
Summary
- Considered data cube construction on clusters
- Used a runtime system developed for scientific data-intensive applications
- New algorithms combine tiling and interprocessor communication
- Observations:
  - Code writing was simplified by the use of the runtime system
  - High speedups
  - Performance scales well as dataset sizes increase