Dependable Technologies for Critical Systems Copyright Critical Software S.A. 1998-2003 All Rights Reserved. Handling big dimensions in distributed data.

Presentation transcript:

Handling big dimensions in distributed data warehouses using the DWS technique
Marco Costa
DEI – CISUC – University of Coimbra
Critical Software S.A.
Dependable Technologies for Critical Systems. Copyright Critical Software S.A. All Rights Reserved.

Agenda
- Introduction
- The DWS technique
  - Description
  - Problems with big dimensions
- The Selective Loading technique
- Experimental Results
- Conclusions

Critical Software Inc.
Company Profile
- International software engineering company.
- Founded in 1998, offices in Portugal, US, UK.
- Entrepreneurial and independent SME.
- Staff of 100: software engineers, MScs, PhDs.
Figures
- Turnover of US 6M (2004).
- International market represents +70%.
- Profitable since foundation (EBIT = 17%, 2003).
Quality, R&D
- ISO 9001:2000 Tick-IT certified (only in Iberia).
- ISO / CMM level 3
- R&D focused, patents submitted
[Picture: headquarters, Portugal]

Introduction
- Companies produce and store more and more data.
- Data warehouses have large and continuously growing volumes of data to process.
- High performance in query execution is crucial to enable interactivity in the OLAP process.
- Typically, that performance is achieved through very expensive hardware platforms (e.g. high-end servers).

Introduction
- Parallel processing has been explored as one of the solutions to support large DWs
  - Intra-query parallelism
  - Distributed DW
    - For geographical reasons
    - For performance
- Load balancing of data
- Query execution
- Reduce communication between nodes

The DWS Technique
- Distribution of a DW through a cluster of "low cost computers"
  - Data partitioning technique
  - Query re-write and parallel execution technique
  - Approximate query answering
- Shared-nothing architecture – Federated
- Conceived specifically for data warehouses implemented with the star-schema model
- High scalability
- Near-linear speed-up for data aggregation queries

The DWS Technique
Data partitioning / data placement
- All nodes have the same data model
- Dimension tables are replicated
- Fact tables are distributed through all nodes in a uniform way
  - Row by row
  - Random
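
The placement rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (`partition_round_robin`, `partition_random`, `replicate_dimension`) and the in-memory list representation are hypothetical and are not taken from the DWS implementation.

```python
import random
from typing import Any, List, Sequence


def partition_round_robin(fact_rows: Sequence[Any], n_nodes: int) -> List[List[Any]]:
    """Row-by-row placement: row i goes to node i mod n_nodes."""
    nodes: List[List[Any]] = [[] for _ in range(n_nodes)]
    for i, row in enumerate(fact_rows):
        nodes[i % n_nodes].append(row)
    return nodes


def partition_random(fact_rows: Sequence[Any], n_nodes: int, seed: int = 0) -> List[List[Any]]:
    """Random placement: each row goes to a uniformly chosen node."""
    rng = random.Random(seed)  # seeded only to make the sketch repeatable
    nodes: List[List[Any]] = [[] for _ in range(n_nodes)]
    for row in fact_rows:
        nodes[rng.randrange(n_nodes)].append(row)
    return nodes


def replicate_dimension(dimension_rows: Sequence[Any], n_nodes: int) -> List[List[Any]]:
    """Dimension tables are copied in full to every node."""
    return [list(dimension_rows) for _ in range(n_nodes)]
```

With row-by-row placement the node sizes differ by at most one row, which is what gives every node roughly the same amount of fact data to scan.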

The DWS Technique
Data partitioning / data placement
[Figure: row-by-row example]

The DWS Technique
Query re-write
- Partition the queries into steps:
  - Partial Query (independently executed in each node)
  - Merge Query
- Some queries might require more than one step
- Execution tree optimizer: determines which steps need to be executed independently and which can be included in the upper query

The DWS Technique
Query Re-write (example for 2 nodes)
A typical data aggregation query:

    select t.calendar_month_desc "Month",
           c.cust_city "City",
           p.prod_category "Category",
           avg(s.quantity_sold) "Quantity",
           avg(s.amount_sold) "Amount"
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category

Slide annotations: Dimensions; Facts (aggregated)

The DWS Technique
Query Re-write (example for 2 nodes)
Partial Query sent to all nodes:

    create table dws as
    select t.calendar_month_desc calendar_month_desc,
           c.cust_city cust_city,
           p.prod_category prod_category,
           sum(s.quantity_sold) as dws1_sum,
           count(s.quantity_sold) as dws1_count,
           sum(s.amount_sold) as dws2_sum,
           count(s.amount_sold) as dws2_count
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category

Slide annotation: Collect partial aggregations

The DW-SP Technology
Query Re-write (example for 2 nodes)
Merge Query: merge the partial results. It runs over the gathered table dws_finalmerge_:

    select calendar_month_desc "month",
           cust_city "city",
           prod_category "category",
           sum(dws1_sum) / sum(dws1_count) "quantity",
           sum(dws2_sum) / sum(dws2_count) "amount"
    from dws_finalmerge_
    group by calendar_month_desc, cust_city, prod_category

Gathering step that builds dws_finalmerge_ (the per-node table names after each "from" are missing in the transcript):

    create table dws_finalmerge_ as
      (select * from ...
       union all
       select * from ...

Slide annotations: Gather partial results; Build final results; Merge aggregations
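
The reason avg() is rewritten as sum() and count() is that averaging per-node averages gives the wrong answer when nodes hold different numbers of rows per group. The sketch below illustrates the partial/merge idea with Python's standard sqlite3 module. It is an assumption-laden toy, not the DWS engine: the simplified `sales(city, quantity)` table, the helper names, and the use of in-memory SQLite databases as stand-ins for nodes are all hypothetical. Only the `dws1_sum`, `dws1_count` and `dws_finalmerge` names echo the slides.

```python
import sqlite3
from typing import List, Tuple

# Partial query: each node returns sums and counts, never averages.
PARTIAL_SQL = """
    SELECT city, SUM(quantity) AS dws1_sum, COUNT(quantity) AS dws1_count
    FROM sales
    GROUP BY city
"""

# Merge query: the global average is total sum divided by total count.
MERGE_SQL = """
    SELECT city, SUM(dws1_sum) * 1.0 / SUM(dws1_count) AS quantity
    FROM dws_finalmerge
    GROUP BY city
    ORDER BY city
"""


def make_node(rows: List[Tuple[str, int]]) -> sqlite3.Connection:
    """Create an in-memory database standing in for one cluster node."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE sales (city TEXT, quantity INTEGER)")
    con.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return con


def run_distributed_avg(node_rows: List[List[Tuple[str, int]]]) -> List[Tuple[str, float]]:
    """Run the partial query on every node, gather the results, then merge."""
    merger = sqlite3.connect(":memory:")
    merger.execute(
        "CREATE TABLE dws_finalmerge (city TEXT, dws1_sum INTEGER, dws1_count INTEGER)"
    )
    for rows in node_rows:
        node = make_node(rows)
        partial = node.execute(PARTIAL_SQL).fetchall()
        # Appending every node's rows plays the role of the UNION ALL gather step.
        merger.executemany("INSERT INTO dws_finalmerge VALUES (?, ?, ?)", partial)
        node.close()
    result = merger.execute(MERGE_SQL).fetchall()
    merger.close()
    return result
```

For example, with node 1 holding `("Coimbra", 1), ("Coimbra", 2)` and node 2 holding `("Coimbra", 9)`, the merged average is 12 / 3 = 4.0, whereas averaging the two per-node averages (1.5 and 9) would give 5.25.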

The DWS Technique
Achievements
- Optimal data load balance
- Optimal workload balance
  - For each query, each node processes the same amount of data as all the others, mostly within its local data
- Low communication between nodes
- High scalability
  - Near-linear speed-up
  - Near-linear scale-up
- Tested with the APB-1 benchmark (OLAP Council) and 10 nodes

The DWS Technique
The problem
- Replication of dimension tables is not typically a problem (dimension tables represent 5% to 10% of the data)
- Businesses with big dimensions cannot apply DWS
- Businesses that have big dimensions have high potential (e.g. airlines, telecoms, e-business)

The Selective Load Technique
- Selectively load the dimension tables
  - Typical OLAP queries aggregate facts according to restrictions applied to dimensions
  - The join between facts and dimensions only needs the dimension rows that exist in both tables
- Do not replicate the big dimension tables
- Load only the necessary rows to each node
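
A minimal sketch of the selective load idea: for each node, keep only the dimension rows whose key appears in that node's fact partition. The function name `selective_load` and the dict-based row representation are hypothetical choices made for illustration, not part of the original work.

```python
from typing import Any, Dict, Hashable, List


def selective_load(
    local_facts: List[Dict[str, Any]],
    dimension: Dict[Hashable, Any],
    foreign_key: str,
) -> Dict[Hashable, Any]:
    """Return the subset of the dimension referenced by this node's facts."""
    # Distinct dimension keys referenced by the local fact rows.
    needed = {fact[foreign_key] for fact in local_facts}
    # Keep only those dimension rows; everything else is not loaded.
    return {key: dimension[key] for key in needed if key in dimension}
```

A node whose facts reference orders 1 and 3 would load only those two rows of a four-row Orders dimension, so the join between local facts and the local dimension subset still finds every matching row.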

The Selective Load Technique
Selectively load the dimension tables
[Figure: example of a node of a DWS cluster]

The Selective Load Technique
- High reduction of the number of rows to load to each node
- Big dimensions
  - High number of rows (absolute size)
  - Significant percentage of the number of rows in fact tables
  - Produce sparse models (e.g. passengers of an airline)
  - Rows in the dimension table are related to a low number of facts
- The worst-case scenario is having as many dimension rows as facts in each node

The Selective Load Technique
Dimension browsing queries?
- There is no complete version of the big dimension table
  - The union of all selective load partitions of the dimension table does not give a complete version of the dimension table
  - Dimension rows with no facts won't be loaded at all
- Apply the DWS data partitioning algorithm to the big dimension
  - Create a partitioned version of the dimension table distributed through all nodes
  - Enables dimension queries to benefit from DWS speed-up and scale-up
  - Dimension browsing queries that target the big dimension will be executed in parallel by all nodes
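
The partitioned copy of the big dimension can be illustrated the same way as fact partitioning: every dimension row, with or without facts, lands on exactly one node, and a browsing query is answered by merging per-node partial results. The sketch below assumes row-by-row placement and a simple "count rows per attribute value" browsing query. The names `partition_dimension` and `browse_count_by` are hypothetical.

```python
from collections import Counter
from typing import Any, Dict, List


def partition_dimension(dimension_rows: List[Dict[str, Any]], n_nodes: int) -> List[List[Dict[str, Any]]]:
    """Distribute all dimension rows row by row, regardless of whether they have facts."""
    parts: List[List[Dict[str, Any]]] = [[] for _ in range(n_nodes)]
    for i, row in enumerate(dimension_rows):
        parts[i % n_nodes].append(row)
    return parts


def browse_count_by(parts: List[List[Dict[str, Any]]], attribute: str) -> Dict[Any, int]:
    """Count rows per attribute value: a partial count per node, merged by addition."""
    total: Counter = Counter()
    for part in parts:  # each iteration stands in for one node working on its partition
        total.update(Counter(row[attribute] for row in part))
    return dict(total)
```

Because the partitions are disjoint and together cover the whole dimension, the merged counts match what a single complete copy of the table would return.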

Experimental Results
- Experiments with TPC-H
  - Facts: Lineitem
  - Big dimension: Orders
  - Dimensions: Customer, Supplier, Region, Nation, Part
- Scenarios
  - Single Node: centralized DB for reference
  - DWS (5, 10, 20): DWS with replication of dimensions for 5, 10 and 20 nodes
  - DWS_SL (5, 10, 20): DWS with selective load of the big dimension for 5, 10 and 20 nodes

Experimental Results
Storage per node
- Replication of the big dimension has a high impact
- Selective load significantly reduces the data volume

Table size (MB); digits lost in the transcript are shown as [...]

Scenario      LineItem   Orders       Orders_dist   Total
Single Node   3576,25    1573,[...]                 [...],82
DWS_5          715,25    1573,[...]                 [...],81
DWS_SL_5       715,25     557,97      314,71        1587,93
DWS_10         357,63    1573,[...]                 [...],19
DWS_SL_10      357,63     312,10      157,36         827,09
DWS_20         178,81    1573,[...]                 [...],38
DWS_SL_20      178,81     157,01       78,68         414,51

Experimental Results
Performance
- DWS speed-up is nonexistent due to the replication of the big dimension
- DWS_SL speed-up is near linear

Conclusions
- DWS is a technique to distribute data warehouses through a cluster of (low cost) computers, with near-linear speed-up and scale-up for star schema models and aggregation queries
- The current work enables the use of the DWS technique for star schema models with large dimensions, with linear speed-up and scale-up
- It also enables dimension browsing queries to experience the advantages of parallel execution in a DWS system

Questions and Contacts
Marco Costa, Henrique Madeira
Critical Software, S.A.
Parque Industrial de Taveiro, Lote Coimbra, PORTUGAL
Tel, Fax
Critical Software Inc.
111 North Market Street, Suite 670
San Jose, California, USA
Tel. +1(408), Fax: +1(408)