Efficient OLAP Query Processing for Distributed Data Warehouses

Michael O. Akinde, SMHI, Sweden & NDB, Aalborg University, Denmark
Michael H. Böhlen, NDB, Aalborg University, Denmark
Theodore Johnson, AT&T Labs-Research, USA
Laks V. S. Lakshmanan, University of British Columbia, Canada
Divesh Srivastava, AT&T Labs-Research, USA

Michael O. Akinde EDBT’ March 24-28, Prague

Motivation
- Analysis of network data
  - Collect, correlate, and analyze data across the network
  - Huge amounts of decentrally collected data
- Complex OLAP operations
  - Performed using ad-hoc Perl scripts
- Experiment: OLAP technology
  - Pro: Improves specification, performance
  - Con: Expensive (or data loss) when centralized
- Existing centralized OLAP tools are inadequate

Solution: Distributed Data Warehouse
- Local DW at each collection point (e.g., router)
- Compute queries across multiple DWs
- A technology is needed for distributed processing of complex OLAP queries
[Figure: architecture diagram. Labels: Source; DW (DWs close to the data collection points); Coordinator; Query; Application (Network Administrators & Control Systems)]

Complex OLAP Queries
- Examples:
  - Network usage: For each IP address, what fraction of the total number of flows is due to web traffic?
  - Principal components: On an hourly basis, what fraction of the total traffic is from IP subnets whose total hourly traffic is within 10% of the maximum?
  - Pattern identification: Break down all flows recorded on US election day by all possible combinations of source AS, destination AS, and protocol
- Diverse OLAP queries, involving pivots, correlations, and multiple levels of aggregation

Skalla
- Translates OLAP queries expressed in an extended algebra into distributed query evaluation plans
- Salient features:
  - Efficiently handles a significant variety of complex OLAP queries (incl. pivots, correlations, etc.)
  - Only partial results are shipped between the sites and the coordinator, never subsets of the detail data
  - No site-to-site communication

Extended Algebra: GMDJ
- Algebraic OLAP operator [Chatziantoniou et al., 2001]
- Salient feature: Splits grouping and aggregation
- MD(B, R, l, θ)
  - B is the base-values table (the “groups”)
  - R is the detail table (fact data)
  - l is the list of aggregate functions
  - θ is a possibly complex condition over B and R describing what fact data is to be aggregated
- Result: The table B extended with the aggregates in l
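
The definition of MD(B, R, l, θ) above can be sketched in a few lines of Python. This is only an illustration of the operator's semantics, assuming tables are lists of dictionaries; the function name `gmdj`, the column names, and the sample rows are hypothetical and do not come from the slides or from Skalla.

```python
def gmdj(base, detail, aggregates, theta):
    """Sketch of MD(B, R, l, theta).

    base       -- B, the base-values table (one row per group)
    detail     -- R, the detail (fact) table
    aggregates -- l, a mapping from output column name to aggregate function
    theta      -- condition over a base row and a detail row
    Returns B extended with one column per aggregate in l.
    """
    result = []
    for b in base:
        # Grouping is given by B; theta decides which facts feed the aggregates.
        matching = [r for r in detail if theta(b, r)]
        extended = dict(b)
        for name, fn in aggregates.items():
            extended[name] = fn(matching)
        result.append(extended)
    return result


# Hypothetical sample data.
ipt = [{"key": "10.0.0.1"}, {"key": "10.0.0.2"}]
flows = [
    {"key": "10.0.0.1", "bytes": 100},
    {"key": "10.0.0.1", "bytes": 50},
    {"key": "10.0.0.2", "bytes": 70},
]

out = gmdj(
    ipt,
    flows,
    {"cnt": len, "total_bytes": lambda rows: sum(r["bytes"] for r in rows)},
    lambda b, r: b["key"] == r["key"],
)
```

Because the groups come from B rather than from a GROUP BY over R, a group with no matching facts still appears in the result, with the aggregate evaluated over an empty list.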

Extended Algebra: Example
- For each IP address, what fraction of the total number of flows is due to web traffic?

    MD( MD(IPT, Flows, (Cnt1), (IPT.key = Flows.key)),
        Flows, (Cnt2),
        (IPT.key = Flows.key and Flows.Source = WEB) )

- Result of inner GMDJ: (IPT, Cnt1)
- Result of outer GMDJ: (IPT, Cnt1, Cnt2)
- Sequences of GMDJs instead of multiple aggregate-join expressions
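
A minimal Python sketch of this nested expression follows. It assumes a count-only GMDJ helper (`count_gmdj`, a hypothetical name), hypothetical sample flows, and a final division Cnt2 / Cnt1 to obtain the fraction; the slide gives only the two GMDJs, so the division step is an assumption about how the answer is read off.

```python
def count_gmdj(base, detail, name, theta):
    """Count-only GMDJ: add column `name` = number of detail rows satisfying theta."""
    return [{**b, name: sum(1 for r in detail if theta(b, r))} for b in base]


# Hypothetical sample data.
ipt = [{"key": "10.0.0.1"}, {"key": "10.0.0.2"}]
flows = [
    {"key": "10.0.0.1", "source": "WEB"},
    {"key": "10.0.0.1", "source": "FTP"},
    {"key": "10.0.0.2", "source": "WEB"},
    {"key": "10.0.0.1", "source": "WEB"},
]

# Inner GMDJ: all flows per IP address -> (IPT, Cnt1).
inner = count_gmdj(ipt, flows, "cnt1", lambda b, r: b["key"] == r["key"])

# Outer GMDJ: web flows per IP address -> (IPT, Cnt1, Cnt2).
outer = count_gmdj(
    inner,
    flows,
    "cnt2",
    lambda b, r: b["key"] == r["key"] and r["source"] == "WEB",
)

# Assumed final step: fraction of flows due to web traffic, skipping empty groups.
fractions = {row["key"]: row["cnt2"] / row["cnt1"] for row in outer if row["cnt1"]}
```

The outer call takes the result of the inner call as its base-values table, which is what lets a sequence of GMDJs replace several aggregate-join expressions.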

Skalla Architecture & Evaluation
- Skalla evaluation rounds:
  - Computation of GMDJ at the local DWs
  - Synchronize sub-results at coordinator
[Figure: Skalla architecture diagram. Labels: Application; Coordinator (Query Engine, Mediator); Site Wrapper (three instances); DW; Source. Annotations: "Administers local GMDJ queries"; "Computes distributed query plans"; "Synchronizes sub-results"]

Skalla Evaluation: Example
- For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline between the coordinator and the site DWs S1 and S2. Steps: Build Groups; Distribute; Compute Aggregates; Synchronize; Distribute; Compute Aggregates; Synchronize; result]

Skalla Evaluation: Features
- Each round of computation in the distributed query evaluation computes a single GMDJ.
- Features of the evaluation:
  - The semantics of the query plans ensure that the amount of data shipped by the algorithm depends on the number of groups and aggregate functions and is independent of the size of the fact relation in the database.
  - The algorithm permits a wide variety of optimizations
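
The round structure can be sketched as follows. This is an assumption-laden illustration, not Skalla's implementation: it handles only COUNT aggregates, merges sub-results at the coordinator by summation (which is valid for counts but would differ for other aggregates), and all function and variable names are hypothetical.

```python
def local_round(base, local_flows, name, theta):
    """At a site: evaluate one count GMDJ against the local fact fragment only."""
    return [{**b, name: sum(1 for r in local_flows if theta(b, r))} for b in base]


def synchronize(base, partials, name):
    """At the coordinator: merge per-site sub-results by summing counts per group."""
    totals = {b["key"]: 0 for b in base}
    for partial in partials:
        for row in partial:
            totals[row["key"]] += row[name]
    return [{**b, name: totals[b["key"]]} for b in base]


def evaluate(base, sites, rounds):
    """One round per GMDJ: distribute B, compute locally, synchronize."""
    for name, theta in rounds:
        partials = [local_round(base, flows, name, theta) for flows in sites]
        base = synchronize(base, partials, name)
    return base


# Hypothetical fact fragments held by two sites.
s1 = [{"key": "a", "source": "WEB"}, {"key": "b", "source": "FTP"}]
s2 = [{"key": "a", "source": "FTP"}, {"key": "a", "source": "WEB"}]

rounds = [
    ("cnt1", lambda b, r: b["key"] == r["key"]),
    ("cnt2", lambda b, r: b["key"] == r["key"] and r["source"] == "WEB"),
]
result = evaluate([{"key": "a"}, {"key": "b"}], [s1, s2], rounds)
```

Each site returns at most one row per group per round, so the volume exchanged grows with the number of groups and aggregates, not with the number of flow records held at the sites.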

Optimizations: Group Reduction
- During processing, we only ship data that has actually been changed
- Example:
  - Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  - Each local DW receives a base-values table containing all source data
  - Coordinator has a copy of the base-values table
  - Local DWs ship back only those tuples that have actually been changed
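
A rough sketch of the idea, assuming COUNT aggregates whose initial value is 0, so that "changed" means a non-zero local count. The function names and sample data are hypothetical, and the real system may detect changed tuples differently.

```python
def local_round_reduced(base, local_flows, name, theta):
    """At a site: return only the groups whose count moved away from the initial 0."""
    shipped = []
    for b in base:
        count = sum(1 for r in local_flows if theta(b, r))
        if count != 0:
            shipped.append({"key": b["key"], name: count})
    return shipped


def synchronize(base, partials, name):
    """At the coordinator: start from its own copy of B and apply the shipped changes."""
    totals = {b["key"]: 0 for b in base}
    for partial in partials:
        for row in partial:
            totals[row["key"]] += row[name]
    return [{**b, name: totals[b["key"]]} for b in base]


# Hypothetical data: three groups, but each site only sees flows for some of them.
base = [{"key": "a"}, {"key": "b"}, {"key": "c"}]
s1 = [{"key": "a"}, {"key": "a"}]
s2 = [{"key": "b"}, {"key": "c"}]


def same_key(b, r):
    return b["key"] == r["key"]


from_s1 = local_round_reduced(base, s1, "cnt1", same_key)
from_s2 = local_round_reduced(base, s2, "cnt1", same_key)
merged = synchronize(base, [from_s1, from_s2], "cnt1")
```

Here S1 ships one row instead of three, and the coordinator still reconstructs the full table because it holds its own copy of the base values.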

Group Reduction: Example
- For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline between the coordinator and the site DWs S1 and S2. Steps: Build Groups; Distribute; Compute Aggregates; Synchronize; Distribute; Compute Aggregates; Synchronize; result]

Optimizations: Synchronization Reduction
- It is possible to detect cases where no synchronization is required between passes.
- Example:
  - DW data: All the flows of an autonomous system are always registered (stored) at a particular local DW
  - Query: For each IP address, what fraction of the total number of flows is due to web traffic?
  - Each IP address belongs to a particular autonomous system; i.e., all data for a particular IP address is located at the local DW storing the flows of its autonomous system
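
The example can be sketched as below. The sketch assumes the coordinator knows which site owns each group (the hypothetical `owner` mapping stands in for the knowledge about autonomous systems) and that each site receives only its own groups; both are illustrative assumptions rather than details given in the slides.

```python
def count_gmdj(base, flows, name, theta):
    """Count-only GMDJ over a single fact fragment."""
    return [{**b, name: sum(1 for r in flows if theta(b, r))} for b in base]


def evaluate_without_intermediate_sync(base, sites, owner, rounds):
    """Each site runs the whole GMDJ sequence locally; results are combined once."""
    result = []
    for site_id, local_flows in sites.items():
        # All facts for these groups live at this site, so local counts are final.
        local_base = [b for b in base if owner[b["key"]] == site_id]
        for name, theta in rounds:
            local_base = count_gmdj(local_base, local_flows, name, theta)
        result.extend(local_base)
    return sorted(result, key=lambda row: row["key"])


# Hypothetical data: group "a" is stored only at S1, group "b" only at S2.
sites = {
    "S1": [{"key": "a", "source": "WEB"}, {"key": "a", "source": "FTP"}],
    "S2": [{"key": "b", "source": "WEB"}],
}
owner = {"a": "S1", "b": "S2"}
rounds = [
    ("cnt1", lambda b, r: b["key"] == r["key"]),
    ("cnt2", lambda b, r: b["key"] == r["key"] and r["source"] == "WEB"),
]
result = evaluate_without_intermediate_sync(
    [{"key": "a"}, {"key": "b"}], sites, owner, rounds
)
```

Since no group's facts are split across sites, the second GMDJ can start from the local result of the first, and only one synchronization at the end is needed.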

Synch Reduction: Example (1)
- For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline between the coordinator and the site DWs S1 and S2. Steps: Build Groups; Distribute; Compute Aggregates; Synchronize; Distribute; Compute Aggregates; Synchronize; result]

Synch Reduction: Example (2)
- For each IP address, what fraction of the total number of flows is due to web traffic?
[Figure: evaluation timeline between the coordinator and the site DWs S1 and S2. Steps: Build Groups; Distribute; Compute Aggregates; Synchronize; result]

Experiment: Number of Sites (GR)

Experiment: Number of Sites (SR)

Experiments: Size of Database

Experiments: Cost Breakdown

Conclusions
- We develop a framework for evaluating complex OLAP queries on a distributed data warehouse
- Efficient query plans that minimize data transfer over the network
- Further work:
  - Additional developments of the architecture
  - Cost-based query optimization