Efficient OLAP Query Processing for Distributed Data Warehouses


1 Efficient OLAP Query Processing for Distributed Data Warehouses
Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark
Michael H. Böhlen, NDB, Aalborg University, Denmark
Theodore Johnson, AT&T Labs-Research, USA
Laks V. S. Lakshmanan, University of British Columbia, Canada
Divesh Srivastava, AT&T Labs-Research, USA
Michael O. Akinde, EDBT’2002 -- March 24-28, Prague

2 Motivation
• Analysis of network data
  • Collect, correlate, and analyze data across the network
  • Huge amounts of decentrally collected data
• Complex OLAP operations
  • Performed using ad-hoc Perl scripts
• Experiment: OLAP technology
  • Pro: Improves specification, performance
  • Con: Expensive (or data loss) when centralized
• Existing centralized OLAP tools are inadequate

3 Solution: Distributed Data Warehouse
• Local DW at each collection point (e.g., router)
• Compute queries across multiple DWs
• A technology is needed for distributed processing of complex OLAP queries
[Figure: architecture diagram with the labels Source, DW, Query Coordinator, and Application; annotations: "DWs close to the data collection points" and "Network Administrators & Control Systems"]

4 Complex OLAP Queries
• Examples:
  • Network usage: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum?
  • Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol
• Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

5 Skalla
• Translates OLAP queries expressed in an extended algebra into distributed query evaluation plans
• Salient features:
  • Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations, etc.)
  • Only partial results are shipped between the sites and the coordinator, never subsets of the detail data
  • No site-to-site communication

6 Extended Algebra: GMDJ
• Algebraic OLAP operator [Chatziantoniou et al., 2001]
• Salient feature: Splits grouping and aggregation
• MD(B, R, l, θ)
  • B is the base-values table (the “groups”)
  • R is the detail table (fact data)
  • l is the list of aggregate functions
  • θ: possibly complex condition over B and R describing what fact data is to be aggregated
• Result: The table B extended with the aggregates in l
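The operator definition on this slide can be illustrated with a small in-memory sketch. This is not Skalla's implementation: the function name `gmdj`, the dictionary-based row format, and the sample addresses are assumptions made only for illustration. It shows the one property the slide states, namely that the result is the base-values table B extended with the aggregates, so a group with no matching detail rows still appears.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]

def gmdj(base: List[Row],
         detail: List[Row],
         aggregates: Dict[str, Callable[[List[Row]], object]],
         theta: Callable[[Row, Row], bool]) -> List[Row]:
    """Illustrative MD(B, R, l, theta): extend each base row with aggregates.

    For every row b in the base-values table, collect the detail rows r
    for which theta(b, r) holds and apply each aggregate function to them.
    """
    result = []
    for b in base:
        matching = [r for r in detail if theta(b, r)]
        extended = dict(b)  # copy, so the input table is not modified
        for name, fn in aggregates.items():
            extended[name] = fn(matching)
        result.append(extended)
    return result

# Hypothetical sample data: three groups, three detail rows.
ipt = [{"key": "10.0.0.1"}, {"key": "10.0.0.2"}, {"key": "10.0.0.3"}]
flows = [
    {"key": "10.0.0.1", "bytes": 100},
    {"key": "10.0.0.1", "bytes": 50},
    {"key": "10.0.0.2", "bytes": 70},
]

out = gmdj(ipt, flows, {"Cnt1": len}, lambda b, r: b["key"] == r["key"])
```

In this sketch the group `10.0.0.3` has no flows and still appears in `out`, with a count of zero, because grouping (B) and aggregation (over R) are specified separately.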

7 Extended Algebra: Example
• For each IP address, what fraction of the total number of flows is due to web traffic?
    MD( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)),
        Flows, (Cnt2), (IPT.key = Flows.key and Flows.Source = WEB) )
• Result of inner GMDJ: (IPT, Cnt1)
• Result of outer GMDJ: (IPT, Cnt1, Cnt2)
• Sequences of GMDJs instead of multiple aggregate-join expressions
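The nested expression on this slide can be traced on toy data. The sketch below is only an illustration under stated assumptions: the helper `gmdj` is a hypothetical count-only stand-in for the operator, the keys `A` and `B` and the `FTP` source value are invented sample data, and the final division is added here to answer the "fraction" question, since the slide shows only the two counts.

```python
def gmdj(base, detail, name, theta):
    """Count-only stand-in for the GMDJ: add column `name` to every base row,
    holding the number of detail rows that satisfy theta for that row."""
    return [{**b, name: sum(1 for r in detail if theta(b, r))} for b in base]

# Hypothetical sample data.
ipt = [{"key": "A"}, {"key": "B"}]
flows = [
    {"key": "A", "Source": "WEB"},
    {"key": "A", "Source": "FTP"},
    {"key": "A", "Source": "WEB"},
    {"key": "B", "Source": "FTP"},
]

# Inner GMDJ: total number of flows per IP address -> (IPT, Cnt1).
inner = gmdj(ipt, flows, "Cnt1", lambda b, r: b["key"] == r["key"])

# Outer GMDJ: number of web flows per IP address -> (IPT, Cnt1, Cnt2).
outer = gmdj(
    inner, flows, "Cnt2",
    lambda b, r: b["key"] == r["key"] and r["Source"] == "WEB",
)

# Fraction of flows due to web traffic, skipping groups without any flows.
fractions = {row["key"]: row["Cnt2"] / row["Cnt1"]
             for row in outer if row["Cnt1"] > 0}
```

The output of the inner call is used as the base-values table of the outer call, which is how a sequence of GMDJs replaces several aggregate-join expressions.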

8 Skalla Architecture & Evaluation
• Skalla evaluation rounds:
  • Computation of GMDJ at the local DWs
  • Synchronize sub-results at coordinator
[Figure: Skalla architecture. Application; Coordinator (Query Engine, Mediator), which computes distributed query plans and synchronizes sub-results; three Site Wrappers, each of which administers local GMDJ queries; DW; Source]

9 Skalla Evaluation: Example
• For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline for sites S1 and S2, the coordinator, and the DW: Build Groups, Distribute, Compute Aggregates, Synchronize, Distribute, Compute Aggregates, Synchronize, result]

10 Skalla Evaluation: Features
• Each round of computation in the distributed query evaluation computes a single GMDJ.
• Features of the evaluation:
  • The semantics of the query plans ensure that the amount of data shipped by the algorithm depends on the number of groups and aggregate functions, and is independent of the size of the fact relation in the database.
  • The algorithm permits a wide variety of optimizations
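One evaluation round can be sketched as follows, assuming a count aggregate, which can be merged by simple addition. The function names `local_round` and `synchronize`, the site contents, and the key names are hypothetical and are not taken from Skalla. The point of the sketch is that each site ships one partial value per group, so the shipped volume follows the number of groups and not the number of local fact rows.

```python
def local_round(base_keys, local_flows, predicate):
    """Site side: compute one partial count per group over the local facts."""
    return {
        k: sum(1 for r in local_flows if r["key"] == k and predicate(r))
        for k in base_keys
    }

def synchronize(partials):
    """Coordinator side: merge per-site partial counts by adding them."""
    merged = {}
    for partial in partials:
        for key, cnt in partial.items():
            merged[key] = merged.get(key, 0) + cnt
    return merged

# Hypothetical base-values table and two sites with different fact volumes.
base_keys = ["A", "B", "C"]
site1 = [{"key": "A", "Source": "WEB"}, {"key": "B", "Source": "FTP"}]
site2 = [
    {"key": "A", "Source": "FTP"},
    {"key": "A", "Source": "WEB"},
    {"key": "C", "Source": "WEB"},
]

# Round for Cnt1: every flow counts.
partials = [local_round(base_keys, s, lambda r: True) for s in (site1, site2)]
cnt1 = synchronize(partials)

# Each site ships exactly one entry per group, whatever its fact volume.
shipped_sizes = [len(p) for p in partials]
```

Aggregates that cannot be merged by addition (an average, for example) would need to be decomposed into mergeable parts first; that step is outside this sketch.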

11 Optimizations: Group Reduction
• During processing, we only ship data that has actually been changed
• Example:
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each local DW receives a base-values table containing all source data
  • Coordinator has a copy of base-values table
  • Local DWs ship back only those tuples that have actually been changed
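A minimal sketch of the idea, assuming a count aggregate whose initial value is zero: a site omits every group whose count it did not change, and the coordinator fills the gaps from its own copy of the base-values table. The names `local_round_reduced` and `synchronize` and the sample data are hypothetical, not Skalla's API.

```python
def local_round_reduced(base_keys, local_flows):
    """Site side: ship only the groups whose count changed from the initial 0."""
    partial = {}
    for k in base_keys:
        cnt = sum(1 for r in local_flows if r["key"] == k)
        if cnt != 0:  # unchanged groups are not shipped
            partial[k] = cnt
    return partial

def synchronize(base_keys, partials):
    """Coordinator side: start from its own copy of the groups, then add
    whatever the sites shipped."""
    merged = {k: 0 for k in base_keys}
    for partial in partials:
        for key, cnt in partial.items():
            merged[key] += cnt
    return merged

# Hypothetical data: four groups, each site only sees flows for two of them.
base_keys = ["A", "B", "C", "D"]
site1 = [{"key": "A"}, {"key": "B"}]
site2 = [{"key": "A"}, {"key": "A"}, {"key": "C"}]

partials = [local_round_reduced(base_keys, s) for s in (site1, site2)]
cnt1 = synchronize(base_keys, partials)
shipped_sizes = [len(p) for p in partials]
```

Here each site ships two entries instead of four, and the merged result still covers all four groups.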

12 Group Reduction: Example
• For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline for sites S1 and S2, the coordinator, and the DW: Build Groups, Distribute, Compute Aggregates, Synchronize, Distribute, Compute Aggregates, Synchronize, result]

13 Optimizations: Synchronization Reduction
• It is possible to detect cases where no synchronization is required between passes.
• Example:
  • DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW
  • Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  • Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the system storing the flows of its autonomous system
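The example can be sketched as follows. When every group key occurs at exactly one site, each site can compute both passes (Cnt1 and Cnt2) locally and ship a single result, so the intermediate synchronization is unnecessary. How Skalla detects this case is not described on the slide; the data-scanning check `keys_disjoint` below is purely an illustrative assumption, as are the other names and the sample data.

```python
def keys_disjoint(site_details):
    """Illustrative check: True if no group key occurs at more than one site."""
    seen = set()
    for detail in site_details:
        keys = {r["key"] for r in detail}
        if seen & keys:
            return False
        seen |= keys
    return True

def evaluate_both_passes_locally(base_keys, local_flows):
    """Site side: compute (Cnt1, Cnt2) for every group stored at this site."""
    out = {}
    for k in base_keys:
        rows = [r for r in local_flows if r["key"] == k]
        if rows:
            cnt1 = len(rows)
            cnt2 = sum(1 for r in rows if r["Source"] == "WEB")
            out[k] = (cnt1, cnt2)
    return out

# Hypothetical data: keys A and B live only at site 1, key C only at site 2.
base_keys = ["A", "B", "C"]
site1 = [
    {"key": "A", "Source": "WEB"},
    {"key": "A", "Source": "FTP"},
    {"key": "B", "Source": "FTP"},
]
site2 = [{"key": "C", "Source": "WEB"}, {"key": "C", "Source": "WEB"}]

can_skip_sync = keys_disjoint([site1, site2])

# With disjoint keys, the final synchronization is a plain union.
result = {}
for site in (site1, site2):
    result.update(evaluate_both_passes_locally(base_keys, site))

fractions = {k: cnt2 / cnt1 for k, (cnt1, cnt2) in result.items()}
```

If the keys were not disjoint, the `update` call would overwrite partial counts instead of adding them, which is why the sketch is only valid when the partitioning condition holds.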

14 Synch Reduction: Example (1)
• For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline for sites S1 and S2, the coordinator, and the DW: Build Groups, Distribute, Compute Aggregates, Synchronize, Distribute, Compute Aggregates, Synchronize, result]

15 Synch Reduction: Example (2)
• For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline for sites S1 and S2, the coordinator, and the DW: Build Groups, Distribute, Compute Aggregates, Synchronize, result]

16 Experiment: Number of Sites (GR)

17 Experiment: Number of Sites (SR)

18 Experiments: Size of Database

19 Experiments: Cost Breakdown

20 Conclusions
• We develop a framework for evaluating complex OLAP queries on a distributed data warehouse
• Efficient query plans that minimize data transfer over the network
• Further work:
  • Additional developments of the architecture
  • Cost-based query optimization

