1
Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance

Ge Yang, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University
2
Motivation
- A lot of effort has gone into developing cluster computing tools targeting scientific applications
- There is an emerging class of commercial applications that are well suited for cluster environments:
  - OnLine Analytical Processing (OLAP)
  - Data Mining
- Can we successfully use cluster tools developed for scientific applications on commercial applications?
3
Overview
- Focus on:
  - Data cube construction, which is an OLAP problem
    - Both compute and data intensive
    - Frequently used in data warehouses
  - Use of the Active Data Repository (ADR), developed for scientific data-intensive applications
- Questions:
  - Are new algorithms or variations of existing algorithms required?
  - Implementation experience?
  - Performance?
4
Outline
- Data cube construction
  - Problem definition
  - Challenges
- Active Data Repository (ADR)
- Scalable data cube construction algorithms targeting ADR
- Implementation experience
- Performance evaluation
- Summary
5
Data Cube Construction
- Context: data warehouses
  - Frequently store (possibly sparse) multidimensional datasets
  - Example: sales information for a chain of stores; time, item, and location can be the three dimensions
  - Frequently asked queries: aggregate along one or more dimensions
- Data cube construction: perform all aggregations in advance to facilitate rapid response to all queries
- For the original n-dimensional array, construct the C(n, m) arrays of m dimensions, for each 0 <= m <= n
6
Data Cube Construction
- Example: consider the original 3-dimensional array ABC (sketched in numpy below); the data cube comprises:
  - 3 two-dimensional arrays: AB, BC, AC
  - 3 one-dimensional arrays: A, B, and C
  - A scalar value: all
- Some observations:
  - Large input size: data warehouses can hold a lot of data
  - The total amount of output can be quite large
  - A lot of computation is involved
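As a concrete illustration, here is a minimal numpy sketch of these aggregations; the array sizes are illustrative assumptions, not from the talk:

```python
import numpy as np

ABC = np.random.rand(4, 8, 16)   # dimensions A, B, C with |A| <= |B| <= |C|

# The three 2-D aggregates: sum out the missing dimension
AB = ABC.sum(axis=2)             # aggregate over C
AC = ABC.sum(axis=1)             # aggregate over B
BC = ABC.sum(axis=0)             # aggregate over A

# The three 1-D aggregates, each taken from a smaller parent
A = AB.sum(axis=1)
B = AB.sum(axis=0)
C = AC.sum(axis=0)

# The scalar "all"
all_ = A.sum()
```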
7
Lattice for Data Cube Construction
- Options for computing the different output arrays can be represented by a lattice
- If A is the shortest dimension and C is the largest, the arrows represent the minimal spanning tree of the lattice
- AB is considered the smallest parent of A and B (a small helper sketching this choice follows)
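A hypothetical helper that picks each aggregate's smallest parent by comparing parent volumes; the dimension sizes and function names below are illustrative assumptions:

```python
# Pick the "smallest parent" for each node of the lattice, i.e. the
# parent with the fewest elements, given per-dimension sizes.
sizes = {'A': 4, 'B': 8, 'C': 16}

def volume(dims):
    v = 1
    for d in dims:
        v *= sizes[d]
    return v

def smallest_parent(child):
    # A parent has exactly one more dimension than the child
    absent = [d for d in sizes if d not in child]
    parents = [''.join(sorted(child + d)) for d in absent]
    return min(parents, key=volume)

for child in ['AB', 'AC', 'BC', 'A', 'B', 'C']:
    print(child, '<-', smallest_parent(child))
```

With these sizes the helper prints AB as the smallest parent of both A and B, and AC as the smallest parent of C, matching the spanning tree described above.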
8
Active Data Repository
- Developed at the University of Maryland (Chang, Kurc, Sussman, Saltz)
- Targeted scientific data-intensive applications
- Execution model (a schematic follows):
  - Divide the output dataset(s) into tiles; allocate one tile at a time
  - Fetch the input dataset one chunk at a time to compute the tile
  - Decide on a plan or schedule for fetching the chunks that contribute to a tile
- Operations involved in computing an output element must be associative and commutative
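A schematic of this execution model; this is not the actual ADR C++ API, and every name below is an illustrative placeholder:

```python
# Schematic of ADR's tile/chunk execution loop.
def process_dataset(output_tiles, plan_chunks, local_reduce):
    for tile in output_tiles:            # one output tile in memory at a time
        tile.allocate()
        for chunk in plan_chunks(tile):  # planned fetch schedule for this tile
            local_reduce(tile, chunk)    # must be associative and commutative
        tile.write_back()
```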
9
Goals in Algorithm Design
- Must use smallest parents / the minimal spanning tree
- Maximal cache and memory reuse: perform all computations associated with an input chunk before it is discarded from memory
- Minimize interprocessor communication volume
- Minimize the amount of memory that needs to be allocated across the tiles
- Fit into ADR's computation model
10
Approach
- Currently consider data cube construction starting from a three-dimensional array only
- Partition and tile along a single dimension only
- If the sizes along the dimensions A, B, and C are |A|, |B|, and |C|, assume that |A| <= |B| <= |C| (no loss of generality)
11
Partitioning and Tiling
- Always partition along the dimension C
  - Minimizes communication volume: if |A| <= |B| <= |C|, then |A||B| <= |A||C| <= |B||C|
- Let the size of the dimension C on each processor be |C'|
- Three separate cases for tiling (see the case-selection sketch below):
  - Case I: |A| <= |B| <= |C'|
  - Case II: |A| <= |C'| <= |B|
  - Case III: |C'| <= |A| <= |B|
- Focus on the first and second cases; the third is almost identical to the second
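A small sketch of the case selection, assuming dimension C is partitioned evenly across `nprocs` processors; the function name and parameters are illustrative:

```python
def tiling_case(size_a, size_b, size_c, nprocs):
    size_c_local = size_c // nprocs      # |C'| = |C| / p
    if size_a <= size_b <= size_c_local:
        return "Case I: tile along C"
    elif size_a <= size_c_local <= size_b:
        return "Case II: tile along B"
    else:                                # |C'| <= |A| <= |B|
        return "Case III: handled almost identically to Case II"
```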
12
First Case
- Tile along the dimension C on each processor
- Hold AB in memory through the processing of all tiles
- AC and BC are allocated separately for each tile
13
Algorithm for Case I

    Allocate AB
    Foreach tile:
        Allocate AC and BC
        Foreach input chunk to be read:
            Update AB, AC, and BC
        Compute C from AC
        Write-back AC, BC, and C
        If last tile:
            Perform global reduction to obtain AB
            If proc_id == 0:
                Compute A and B from AB
                Compute all from A
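A single-node numpy simulation of the algorithm above, assuming the tile count divides |C| evenly; the parallel global reduction and write-back are omitted:

```python
import numpy as np

def data_cube_case1(ABC, num_tiles, chunk):
    nA, nB, nC = ABC.shape
    tile_len = nC // num_tiles
    AB = np.zeros((nA, nB))                 # held in memory across all tiles
    C = np.zeros(nC)
    for t in range(num_tiles):
        lo, hi = t * tile_len, (t + 1) * tile_len
        AC = np.zeros((nA, tile_len))       # allocated per tile
        BC = np.zeros((nB, tile_len))
        for c0 in range(lo, hi, chunk):     # one input chunk at a time
            c1 = min(c0 + chunk, hi)
            blk = ABC[:, :, c0:c1]
            AB += blk.sum(axis=2)           # update AB, AC, and BC
            AC[:, c0 - lo:c1 - lo] += blk.sum(axis=1)
            BC[:, c0 - lo:c1 - lo] += blk.sum(axis=0)
        C[lo:hi] = AC.sum(axis=0)           # compute C's slice from AC
        # write-back of AC, BC, and C would happen here
    A = AB.sum(axis=1)                      # after the last tile
    B = AB.sum(axis=0)
    return AB, A, B, C, A.sum()             # A.sum() is the scalar "all"
```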
14
Properties of the Algorithm
- All arrays are computed from their smallest parents
- Maximal cache and memory reuse
- Minimal interprocessor communication volume among all single-dimensional partitions
- The portion of the output arrays that must be kept in main memory for the entire computation is the minimum over all single-dimensional tiling possibilities
15
Second Case
- Tile along the dimension B
- Hold AC in main memory for the entire computation
16
Algorithm for Case II

    Allocate AC and A
    Foreach tile:
        Allocate AB and BC
        Foreach input chunk to be read:
            Update AB, AC, and BC
        Perform global reduction to obtain final AB
        If proc_id == 0:
            Compute B from AB
            Update A using AB
        Write-back AB, BC, and B
        If last tile:
            Finish AC
            Compute C from AC
            If proc_id == 0:
                Finish A
                Compute all from A
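A matching single-node sketch of Case II, assuming the tile count divides |B| evenly; the per-tile chunked reads are collapsed into one slice, and the global reduction and write-back are again omitted:

```python
import numpy as np

def data_cube_case2(ABC, num_tiles):
    nA, nB, nC = ABC.shape
    tile_len = nB // num_tiles
    AC = np.zeros((nA, nC))                 # held in memory across all tiles
    A = np.zeros(nA)
    B = np.zeros(nB)
    for t in range(num_tiles):
        lo, hi = t * tile_len, (t + 1) * tile_len
        blk = ABC[:, lo:hi, :]              # this tile's input, one slice
        AB = blk.sum(axis=2)                # per-tile slice of AB
        BC = blk.sum(axis=0)                # per-tile slice of BC
        AC += blk.sum(axis=1)               # accumulated across tiles
        B[lo:hi] = AB.sum(axis=0)           # compute B's slice from AB
        A += AB.sum(axis=1)                 # update A using AB
        # write-back of AB, BC, and B would happen here
    C = AC.sum(axis=0)                      # after the last tile: finish AC
    return AC, A, B, C, A.sum()             # A.sum() is the scalar "all"
```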
17
Implementation Experience Using ADR
- Had to supply (see the skeleton below):
  - A local reduction function: processing for each chunk
  - A global reduction function: run after local reduction on each tile
  - A finalize function: run after processing all tiles
  - A specification of the desired tiling
- ADR's runtime support offered:
  - Fetching of the input chunks corresponding to each tile
  - Scheduling of asynchronous operations
  - Details of interprocessor communication
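An illustrative skeleton of the user-supplied pieces; ADR itself is a C++ runtime, so these Python names merely mirror the roles listed above:

```python
class DataCubeOperator:
    def local_reduce(self, tile, chunk):
        # Per-chunk processing: update AB, AC, and BC from this chunk.
        ...

    def global_reduce(self, tile):
        # Interprocessor combination after local reduction on a tile.
        ...

    def finalize(self):
        # Run once after all tiles: derive A, B, C, and "all".
        ...
```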
18
Experimental Evaluation
- Goals:
  - Speedups on sparse and dense datasets
  - Scaling of performance with respect to dataset sizes
  - Scaling of performance with respect to the number of tiles
  - Evaluating the impact of sparsity
- Experimental platform:
  - 8 250 MHz Ultra-II processors
  - 1 GB of main memory on each
  - Myrinet interconnect
19
Scaling Input Datasets: Dense Arrays
- Almost linear speedups up to 8 processors
- Performance per element scales linearly with dataset size
20
Scaling Dataset Sizes: Sparse Datasets
- 25% sparsity level
- Slightly lower speedups than for dense datasets: higher communication-to-computation ratio
- Execution time stays proportional to the amount of computation
21
Increasing the Number of Tiles
- 2 nodes
- Fixed amount of computation per tile
- Execution time stays proportional to the amount of computation
22
Impact of Sparsity
- Same number of non-zero elements in each dataset
- Good speedups in all cases
- Some reduction in sequential performance as sparsity increases, particularly for the 1% case
23
Summary
- Considered data cube construction on clusters
- Used a runtime system developed for scientific data-intensive applications
- New algorithms combine tiling and interprocessor communication
- Observations:
  - Code writing was simplified by the use of the runtime system
  - High speedups
  - Performance scales well as dataset sizes increase