
Dependable Technologies for Critical Systems Copyright Critical Software S.A. 1998-2003 All Rights Reserved. Handling big dimensions in distributed data.


1 Dependable Technologies for Critical Systems
  Copyright Critical Software S.A. 1998-2003 All Rights Reserved.
  Handling big dimensions in distributed data warehouses using the DWS technique
  Marco Costa
  DEI – CISUC – University of Coimbra
  Critical Software S.A.

2 Agenda
  Introduction
  The DWS technique
    Description
    Problems with big dimensions
  The Selective Loading technique
  Experimental Results
  Conclusions

3 Critical Software Inc.
  Company Profile
    International software engineering company.
    Founded in 1998, offices in Portugal, US, UK.
    Entrepreneurial and independent SME.
    Staff of 100: software engineers, MSc's, PhD's.
  Figures
    Turnover of US$ 6M (2004).
    International market represents +70%.
    Profitable since foundation (EBIT = 17%, 2003).
  Quality, R&D
    ISO 9001:2000 Tick-IT certified (only in Iberia).
    ISO 15504 / CMM level 3
    R&D focused, patents submitted
  Headquarters, Portugal

4 Introduction
  Companies produce and store more and more data
  Data warehouses have large and continuously growing volumes of data to process
  High performance in query execution is crucial to enable interactivity in the OLAP process
  Typically this performance is achieved through very expensive hardware platforms (e.g. high-end servers)

5 Introduction
  Parallel processing has been explored as one of the solutions to support large DWs
    Intra-query parallelism
    Distributed DW
      For geographical reasons
      For performance
        Load balancing of data
        Query execution
        Reduce communication between nodes

6 The DWS Technique
  Distribution of a DW through a cluster of “low cost computers”
    Data partition technique
    Query re-write and parallel execution technique
    Approximated query answering
  Shared-nothing architecture – Federated
  Conceived specifically for data warehouses implemented with the star-schema model
  High scalability
  Near linear speed-up for data aggregation queries

7 The DWS Technique
  Data partitioning / data placement
    All nodes have the same data model
    Dimension tables are replicated
    Fact tables are distributed through all nodes in a uniform way
      Row by row
      Random

8 The DWS Technique
  Data partitioning / data placement
    Row by row example
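The placement rules above can be sketched as follows. This is a minimal illustration, assuming that "row by row" means round-robin assignment by row position, which the transcript does not spell out. The function names, the toy sales and customers rows, and the three-node cluster are hypothetical and not taken from the original deck.

```python
import random


def partition_facts(fact_rows, n_nodes, strategy="round_robin", seed=0):
    """Split fact rows across n_nodes and return one list of rows per node."""
    nodes = [[] for _ in range(n_nodes)]
    rng = random.Random(seed)  # fixed seed keeps the random placement repeatable
    for position, row in enumerate(fact_rows):
        if strategy == "round_robin":
            target = position % n_nodes      # assumed meaning of "row by row"
        elif strategy == "random":
            target = rng.randrange(n_nodes)  # uniform random placement
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        nodes[target].append(row)
    return nodes


def replicate_dimension(dim_rows, n_nodes):
    """Give every node its own full copy of a dimension table."""
    return [list(dim_rows) for _ in range(n_nodes)]


# Hypothetical toy data: 10 sales facts referencing 4 customers.
facts = [{"sale_id": i, "cust_id": i % 4, "amount_sold": 10 * i} for i in range(10)]
customers = [{"cust_id": c, "cust_city": f"city{c}"} for c in range(4)]

fact_parts = partition_facts(facts, 3)
dim_copies = replicate_dimension(customers, 3)
print([len(part) for part in fact_parts])  # [4, 3, 3]
```

With round-robin placement the per-node fact counts differ by at most one row, while every node holds all four customer rows.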

9 The DWS Technique
  Query re-write
    Partition the queries in steps:
      Partial Query (independently executed in each node)
      Merge Query
    Some queries might require more than one step
    Execution tree optimizer – determines which steps need to be executed independently and which can be included in the upper query

10 The DWS Technique
  Query Re-write (example for 2 nodes)
  A typical data aggregation query:

    select t.calendar_month_desc "Month",      -- dimensions
           c.cust_city "City",
           p.prod_category "Category",
           avg(s.quantity_sold) "Quantity",    -- facts (aggregated)
           avg(s.amount_sold) "Amount"
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category

11 The DWS Technique
  Query Re-write (example for 2 nodes)
  Partial Query sent to all nodes:

    create table dws110517101718101 as
    select t.calendar_month_desc calendar_month_desc,
           c.cust_city cust_city,
           p.prod_category prod_category,
           sum(s.quantity_sold) as dws1_sum,       -- collect partial aggregations
           count(s.quantity_sold) as dws1_count,
           sum(s.amount_sold) as dws2_sum,
           count(s.amount_sold) as dws2_count
    from sales s, customers c, times t, products p
    where s.time_id = t.time_id
      and s.cust_id = c.cust_id
      and s.prod_id = p.prod_id
      and t.calendar_year = 2000
    group by t.calendar_month_desc, c.cust_city, p.prod_category

12 The DWS Technique
  Query Re-write (example for 2 nodes)
  Merge Query – merge the partial results:

    -- gather partial results
    create table dws_finalmerge_ as
      (select * from dws110517154329101@node1
       union all
       select * from dws110517154329101@node2)

    -- build final results (merge aggregations)
    select calendar_month_desc "month",
           cust_city "city",
           prod_category "category",
           sum(dws1_sum) / sum(dws1_count) "quantity",
           sum(dws2_sum) / sum(dws2_count) "amount"
    from dws_finalmerge_
    group by calendar_month_desc, cust_city, prod_category
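The partial/merge rewrite can be checked end to end on a small scale. The sketch below is only an illustration under stated assumptions: it uses Python's sqlite3 with one in-memory database per node instead of the Oracle database links (@node1, @node2) shown on the slide, a reduced schema with only sales and times, and made-up rows. The table name dws_finalmerge and the column names follow the slides; everything else is hypothetical.

```python
import sqlite3

# Hypothetical toy data: (time_id, calendar_month_desc, calendar_year)
TIMES = [(1, "2000-01", 2000), (2, "2000-02", 2000), (3, "1999-12", 1999)]
# Hypothetical fact rows: (time_id, quantity_sold)
SALES = [(1, 10), (1, 20), (1, 60), (2, 5), (2, 7), (3, 100), (1, 30)]


def make_node(sales_rows):
    """Create one node: a replicated dimension plus its share of the facts."""
    db = sqlite3.connect(":memory:")
    db.execute("create table times (time_id integer, "
               "calendar_month_desc text, calendar_year integer)")
    db.execute("create table sales (time_id integer, quantity_sold integer)")
    db.executemany("insert into times values (?, ?, ?)", TIMES)
    db.executemany("insert into sales values (?, ?)", sales_rows)
    return db


# Partial query: avg() is replaced by sum() and count(), as on the slide.
PARTIAL = """
select t.calendar_month_desc,
       sum(s.quantity_sold) as dws1_sum,
       count(s.quantity_sold) as dws1_count
from sales s, times t
where s.time_id = t.time_id and t.calendar_year = 2000
group by t.calendar_month_desc
"""

# Merge query: rebuild the average from the gathered sums and counts.
MERGE = """
select calendar_month_desc, sum(dws1_sum) * 1.0 / sum(dws1_count)
from dws_finalmerge
group by calendar_month_desc
order by calendar_month_desc
"""

# Reference query on a single centralized database.
CENTRAL = """
select t.calendar_month_desc, avg(s.quantity_sold)
from sales s, times t
where s.time_id = t.time_id and t.calendar_year = 2000
group by t.calendar_month_desc
order by t.calendar_month_desc
"""

# Two nodes, facts split row by row (even positions / odd positions).
nodes = [make_node(SALES[0::2]), make_node(SALES[1::2])]

# Gather the partial results on a coordinator, then run the merge query.
coordinator = sqlite3.connect(":memory:")
coordinator.execute("create table dws_finalmerge (calendar_month_desc text, "
                    "dws1_sum integer, dws1_count integer)")
for node in nodes:
    coordinator.executemany("insert into dws_finalmerge values (?, ?, ?)",
                            node.execute(PARTIAL).fetchall())

distributed = coordinator.execute(MERGE).fetchall()
centralized = make_node(SALES).execute(CENTRAL).fetchall()
print(distributed)   # [('2000-01', 30.0), ('2000-02', 6.0)]
print(centralized)   # same rows as the distributed result
```

Carrying the sum and count separately matters: for 2000-01 the two nodes hold averages of about 33.3 and 20, and averaging those would give about 26.7 instead of the correct 30.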

13 The DWS Technique
  Achievements
    Optimal data load balance
    Optimal work load balance
      For each query, each node processes the same amount of data as all the others, mostly within its local data
    Low communication between nodes
    High scalability
      Near linear speed-up
      Near linear scale-up
    Tested with the APB-1 benchmark (OLAP Council) and 10 nodes

14 The DWS Technique
  The problem
    Replication of dimension tables is not typically a problem (dimension tables represent 5% to 10% of the data)
    Businesses with big dimensions cannot apply DWS
    The businesses that have big dimensions have high potential (e.g. airlines, telecoms, e-business)

15 The Selective Load Technique
  Selectively load the dimension tables
    Typical OLAP queries aggregate facts according to restrictions applied to dimensions
    The join between facts and dimensions only needs the dimension rows that exist in both tables
    Do not replicate the big dimension tables
    Load only the necessary rows to each node

16 The Selective Load Technique
  Selectively load the dimension tables
    Example: Node of a DWS cluster
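Since the slide's figure is not part of this transcript, here is a small sketch of what selective loading amounts to: each node receives only the dimension rows whose key appears in its local facts. The function name selective_load, the toy orders and lineitems rows, and the two-node row-by-row split are illustrative assumptions, not the authors' implementation.

```python
def selective_load(local_facts, dimension_rows, key):
    """Return only the dimension rows referenced by this node's facts."""
    needed = {fact[key] for fact in local_facts}
    return [row for row in dimension_rows if row[key] in needed]


# Hypothetical big dimension: orders 1..6 (order 6 has no facts at all).
orders = [{"order_id": o, "customer": f"cust{o}"} for o in range(1, 7)]

# Hypothetical facts: (order_id, quantity) pairs.
lineitems = [
    {"order_id": o, "quantity": q}
    for o, q in [(1, 5), (1, 3), (2, 7), (3, 1), (4, 2), (5, 9), (5, 4), (2, 8)]
]

# Facts are split row by row over two nodes; the dimension is NOT replicated.
n_nodes = 2
fact_parts = [lineitems[i::n_nodes] for i in range(n_nodes)]
dim_parts = [selective_load(part, orders, "order_id") for part in fact_parts]

for node_id, part in enumerate(dim_parts):
    print(node_id, [row["order_id"] for row in part])
# 0 [1, 2, 4, 5]
# 1 [1, 2, 3, 5]
```

Each node ends up with four of the six order rows, every local fact still finds its dimension row, and order 6 is loaded nowhere because no fact references it.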

17 The Selective Load Technique
  High reduction of the number of rows to load to each node
  Big dimensions
    High number of rows (absolute size)
    Significant percentage of the number of rows in fact tables
    Produce sparse models (e.g. passengers of an airline)
      Rows in the dimension table are related to a low number of facts
  Worst scenario is having as many dimension rows as facts in each node

18 The Selective Load Technique
  Dimension browsing queries?
    There is no complete version of the big dimension table
      The union of all selective load partitions of the dimension table does not give a complete version of the dimension table
      Dimension rows with no facts won't be loaded at all
    Apply the DWS data partitioning algorithm to the big dimension
      Create a partitioned version of the dimension table distributed through all nodes
      Enables dimension queries to benefit from DWS speed-up and scale-up
      Dimension browsing queries targeting the big dimension will be executed in parallel by all nodes
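A short sketch of why browsing needs the extra partitioned copy. It assumes six order keys, a hypothetical pair of selectively loaded partitions, and a row-by-row split for the partitioned version (called orders_dist here, after the Orders_dist column in the results table); none of the data comes from the deck.

```python
orders = [1, 2, 3, 4, 5, 6]               # hypothetical keys; order 6 has no facts
selective = [[1, 2, 4, 5], [1, 2, 3, 5]]  # hypothetical selective-load partitions

# The selective partitions overlap and still miss fact-less rows.
union = set().union(*selective)
missing = sorted(set(orders) - union)
duplicated = sorted(set(selective[0]) & set(selective[1]))
print(missing)     # [6]
print(duplicated)  # [1, 2, 5]

# Partitioned copy of the dimension: complete and disjoint across nodes.
n_nodes = 2
orders_dist = [orders[i::n_nodes] for i in range(n_nodes)]
print(orders_dist)  # [[1, 3, 5], [2, 4, 6]]

# A browsing query ("how many orders have a key above 2?") runs as a
# partial count on each node, followed by a merge of the partial counts.
partial_counts = [sum(1 for key in part if key > 2) for part in orders_dist]
total = sum(partial_counts)
print(partial_counts, total)  # [2, 2] 4
```

Running the same count over the selective partitions would count orders 1, 2 and 5 twice and never see order 6, which is why the browsing queries go to the partitioned copy.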

19 Experimental Results
  Experiments with TPC-H
    Facts: Lineitem
    Big dimension: Orders
    Dimensions: Customer, Supplier, Region, Nation, Part
  Scenarios
    Single Node – centralized DB for reference
    DWS (5, 10, 20) – DWS with replication of dimensions for 5, 10 and 20 nodes
    DWS_SL (5, 10, 20) – DWS with selective load of the big dimension for 5, 10 and 20 nodes

20 Experimental Results
  Storage per node
    Replication of the big dimension has a high impact
    Selective load significantly reduces the data volume

  Table size (MB)
                    LineItem      Orders Orders_dist       Total
    Single Node      3576.25     1573.56                 5149.82
    DWS_5             715.25     1573.56                 2288.81
    DWS_SL_5          715.25      557.97      314.71     1587.93
    DWS_10            357.63     1573.56                 1931.19
    DWS_SL_10         357.63      312.10      157.36      827.09
    DWS_20            178.81     1573.56                 1752.38
    DWS_SL_20         178.81      157.01       78.68      414.51
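The size of the saving can be read off the Total column. The short calculation below uses only the per-node totals from the table above; the dictionary layout and variable names are illustrative.

```python
# Per-node totals in MB, copied from the storage table (Total column).
totals_mb = {
    5: {"DWS": 2288.81, "DWS_SL": 1587.93},
    10: {"DWS": 1931.19, "DWS_SL": 827.09},
    20: {"DWS": 1752.38, "DWS_SL": 414.51},
}

reductions = {}
for n_nodes, total in totals_mb.items():
    # Share of per-node storage saved by selective load versus replication.
    reductions[n_nodes] = 100 * (1 - total["DWS_SL"] / total["DWS"])
    print(f"{n_nodes} nodes: {reductions[n_nodes]:.1f}% less storage per node")
```

The saving grows with the cluster size, from roughly 31% at 5 nodes to about 57% at 10 and 76% at 20, because the replicated Orders table stays at 1573.56 MB per node while the selectively loaded portion shrinks.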

21 Experimental Results
  Performance
    DWS speed-up is nonexistent due to the replication of the big dimension
    DWS_SL speed-up is near linear

22 Conclusions
  DWS is a technique to distribute data warehouses through a cluster of (low cost) computers, with near linear speed-up and scale-up for star schema models and aggregation queries
  The current work enables the use of the DWS technique for star schema models with large dimensions, with linear speed-up and scale-up
  It also enables dimension browsing queries to experience the advantages of parallel execution in a DWS system

23 Questions and Contacts
  Marco Costa, mcosta@criticalsoftware.com
  Henrique Madeira, henrique@dei.uc.pt
  Critical Software, S.A.
    Parque Industrial de Taveiro, Lote 48
    3045-504 Coimbra, PORTUGAL
    Tel. +351 239989100, Fax: +351 239989119
  Critical Software Inc.
    111 North Market Street, Suite 670
    San Jose, California, USA, 95113
    Tel. +1(408)9711231, Fax: +1(408)3513330

