Anindya Datta Debra VanderMeer Krithi Ramamritham Presented by –


Parallel Star Join + DataIndexes: Efficient Query Processing in Data Warehousing and OLAP
Anindya Datta, Debra VanderMeer, Krithi Ramamritham
Presented by Ashutosh Joshi

Motivation
OLAP involves efficient retrieval of data from data warehouses for decision-support purposes.
Data warehouses are extremely large, and queries against them are computationally expensive.
A DataIndex is a storage structure that serves as both index and data.
Parallel Star Join (PSJ) is an efficient algorithm for performing star joins in parallel.

The Road Map
A physical design principle for exploiting parallelism.
The Parallel Star Join algorithm.
Experimental results.

The Star Schema
Dimension tables: PART, CUSTOMER, SUPPLIER, TIME. Fact table: SALES.
Each attribute is listed with its size; the last number for each table is the row size given on the slide.
PART (200,000 rows): PartKey 4, Name 55, Mfgr 25, Brand 10, Type 25, Size 4, Others... 41. Row size 164.
CUSTOMER (150,000 rows): CustKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, MktSegment 10, Comment 117. Row size 269.
SUPPLIER (10,000 rows): SuppKey 4, Name 25, Address 40, Nation 25, Region 25, Phone 15, AcctBal 8, Comment 101. Row size 243.
TIME (2,557 rows): TimeKey 2, Alpha 10, Year 4, Month 4, Week 4, Day 4. Row size 28.
SALES (6,000,000 rows): PartKey 4, SuppKey 4, CustKey 4, Quantity 8, ExtPrice 8, Discount 8, Tax 8, RetFlag 1, Status 1, ShipDate 2, CommitDate 2, ReceiptDate 2, ShipInstruct 25, ShipMode 10, Comment 44. Row size 137.

A Physical Design Principle
DataIndexes serve as both index and data.
They are based on vertical partitioning of tables.
There are two types: the Projection Index (PI) and the Join Index (JI).

Projection Index
Figure: a base table with columns CustKey, Qty, ExtPrice and Discount, with columns stored separately as PIs.

Join Index
Figure, before: a base dimension table (Name, Address, CustKey) with rows (N1, A1, CK1), (N2, A2, CK2), (N3, A3, CK3), and a base fact table (CustKey, Tax, ExtPrice, Discount) with rows (CK1, T1, E1, D1), (CK2, T2, E2, D2), (CK3, T3, E3, D3), (CK3, T4, E4, D4).
Figure, after: the dimension columns Name, Address and CustKey are stored as PIs; the fact table's CustKey column is stored as a JI of RIDs (RID1, RID2, RID3); the fact columns Tax, ExtPrice and Discount are stored as PIs.

The Principle
Each foreign-key column in the fact table is stored as a Join Index (JI).
All remaining columns, in both the dimension tables and the fact table, are stored as Projection Indexes (PIs).
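The principle can be illustrated with a minimal Python sketch. It assumes that a RID is simply a row position in the dimension table and that each PI is a plain in-memory list; the function names are hypothetical and do not come from the presentation.

```python
# Sketch only: vertical partitioning of a tiny star schema into DataIndexes.
# Assumptions: a RID is a 0-based row position; function names are hypothetical.

def build_projection_indexes(rows, columns):
    """Store each listed column as its own array (one PI per column)."""
    return {col: [row[col] for row in rows] for col in columns}

def build_join_index(fact_rows, fk, dim_rows, pk):
    """Replace each foreign-key value with the RID of the matching dimension row."""
    rid_of = {row[pk]: rid for rid, row in enumerate(dim_rows)}
    return [rid_of[row[fk]] for row in fact_rows]

customer = [
    {"CustKey": "CK1", "Name": "N1", "Address": "A1"},
    {"CustKey": "CK2", "Name": "N2", "Address": "A2"},
    {"CustKey": "CK3", "Name": "N3", "Address": "A3"},
]
sales = [
    {"CustKey": "CK1", "Tax": "T1", "ExtPrice": "E1", "Discount": "D1"},
    {"CustKey": "CK2", "Tax": "T2", "ExtPrice": "E2", "Discount": "D2"},
    {"CustKey": "CK3", "Tax": "T3", "ExtPrice": "E3", "Discount": "D3"},
    {"CustKey": "CK3", "Tax": "T4", "ExtPrice": "E4", "Discount": "D4"},
]

customer_pis = build_projection_indexes(customer, ["Name", "Address", "CustKey"])
sales_pis = build_projection_indexes(sales, ["Tax", "ExtPrice", "Discount"])
sales_custkey_ji = build_join_index(sales, "CustKey", customer, "CustKey")
print(sales_custkey_ji)
```

With this layout the fact table no longer stores CustKey values at all; the JI entry for a fact row points directly at the matching customer row.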

Parallel Star Join
Data placement strategy:
It is based on a shared-nothing architecture with N processors.
Assume a d-dimensional data warehouse.
Partition the N processors into d+1 groups.
Assign to each group j the dimension table Dj and Jj, the corresponding fact-table join index.
Assign the metric PIs to group d+1.

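Processor Group Partitioning
The number of processors in a group is governed by the size of dimension table Dj.
Size of the jth processor group: (formula not preserved in the transcript)
Size of the metric group: (formula not preserved in the transcript)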

Physical Data Placement
Horizontally partition JIs across all processors.
Replicate PIs on all processors.
Use a round-robin strategy for partitioning JIs.
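Round-robin partitioning of a JI can be sketched as follows. This is an illustration under the assumption that each fragment keeps the original fact RID next to each entry so that results can later be put back in position; the function name is hypothetical.

```python
# Sketch only: round-robin horizontal partitioning of a join index.
# Assumption: each fragment entry is a (fact_rid, dim_rid) pair.

def round_robin_partition(values, n_processors):
    """Send entry i to processor i mod n_processors."""
    fragments = [[] for _ in range(n_processors)]
    for rid, value in enumerate(values):
        fragments[rid % n_processors].append((rid, value))
    return fragments

ji = [0, 1, 2, 2, 0, 1]          # dimension RID for each fact row
fragments = round_robin_partition(ji, 3)
for p, fragment in enumerate(fragments):
    print(p, fragment)
```

Round-robin placement gives every processor in the group nearly the same number of JI entries regardless of how the key values are distributed.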

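The Parallel Star Join Algorithm
A general k-dimensional star join query:
SELECT AdP, AmP FROM F, D1, ..., Dk WHERE Pjoin AND Pselect
Here F is the fact table, D1, ..., Dk are the dimension tables, AdP and AmP are the projected dimension and metric attributes, and Pjoin and Pselect are the join and selection predicates.
The algorithm has three phases:
Local rowset generation.
Global rowset synthesis.
Output preparation.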

Local Rowset Generation
Load the PI fragment.
Figure: processors (Pc, P1, P2) each hold a PI fragment; applying the predicate Qty > 10 to a PI fragment produces a rowset fragment.
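A minimal sketch of this step, assuming a rowset fragment is a list of 0/1 flags with one flag per entry of the PI fragment. The quantity values are illustrative and the function name is hypothetical.

```python
# Sketch only: evaluate a selection predicate on one PI fragment.
# Assumption: a rowset fragment is a list of 0/1 flags, one per PI entry.

def evaluate_predicate(pi_fragment, predicate):
    """Return a flag per entry: 1 if the value satisfies the predicate, else 0."""
    return [1 if predicate(value) else 0 for value in pi_fragment]

qty_fragment = [25, 5, 7, 15, 1]   # illustrative Qty values
rowset_fragment = evaluate_predicate(qty_fragment, lambda qty: qty > 10)
print(rowset_fragment)
```

Only the column named in the predicate is read, which is the benefit of storing each column as its own PI.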

Local Rowset Generation (contd)
Merge the dimension rowset fragments.
Distribute the dimension rowset.
Figure: rowset fragments from processors P1 to P4 are combined with OR into Rdim,i.
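The merge can be sketched as a positional OR. The sketch assumes each processor reports (RID, flag) pairs for the dimension rows it evaluated; the function name and sample data are hypothetical.

```python
# Sketch only: merge per-processor rowset fragments into one dimension rowset.
# Assumption: each fragment is a list of (dim_rid, flag) pairs.

def merge_fragments(fragments, n_rows):
    """OR every reported flag into a full-length rowset."""
    merged = [0] * n_rows
    for fragment in fragments:
        for rid, flag in fragment:
            merged[rid] |= flag
    return merged

fragments = [
    [(0, 1), (2, 0)],   # reported by one processor
    [(1, 0), (3, 1)],   # reported by another processor
]
r_dim = merge_fragments(fragments, 4)
print(r_dim)
```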

Local Rowset Generation (contd)
Load the JI fragment.
Merge the partial fact rowsets.
Figure: the dimension rowset Rdim,i and the join index JIi (RID1, RID2, RID3) together produce the local fact rowset Rfact,i.
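One way to picture this step: each processor looks up, for every JI entry in its fragment, whether the referenced dimension row was selected, and the partial results are then placed back by fact RID. The sketch assumes JI fragments hold (fact_rid, dim_rid) pairs; all names are hypothetical.

```python
# Sketch only: derive a local fact rowset from a dimension rowset and a JI.
# Assumption: a JI fragment is a list of (fact_rid, dim_rid) pairs.

def partial_fact_rowset(ji_fragment, r_dim):
    """A fact row qualifies if the dimension row it points to qualifies."""
    return [(fact_rid, r_dim[dim_rid]) for fact_rid, dim_rid in ji_fragment]

def merge_partial_rowsets(partials, n_fact_rows):
    """Place each partial result at its fact RID to form the local fact rowset."""
    r_fact = [0] * n_fact_rows
    for partial in partials:
        for fact_rid, flag in partial:
            r_fact[fact_rid] = flag
    return r_fact

r_dim = [1, 0, 1]                                   # selected dimension rows
ji_fragments = [[(0, 0), (2, 2)], [(1, 1), (3, 2)]]  # round-robin JI fragments
partials = [partial_fact_rowset(f, r_dim) for f in ji_fragments]
r_fact = merge_partial_rowsets(partials, 4)
print(r_fact)
```

Because the JI already holds dimension RIDs, the join reduces to an array lookup per fact row.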

Global Rowset Synthesis
Merge the local fact rowsets.
Distribute the global rowset to the groups participating in the output phase.
Figure: the local fact rowsets (Rfact,1, Rfact,2, ...) from groups G1 to G4 are combined with AND into Rglobal.
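The AND merge can be sketched in a few lines, assuming every local fact rowset is a 0/1 list of the same length (one flag per fact row). The function name is hypothetical.

```python
# Sketch only: combine local fact rowsets into the global rowset with AND.
# Assumption: all rowsets have one 0/1 flag per fact row, in the same order.

def synthesize_global_rowset(local_rowsets):
    """A fact row survives only if every dimension group selected it."""
    return [int(all(flags)) for flags in zip(*local_rowsets)]

r_fact_1 = [1, 0, 1, 1]   # from one dimension group
r_fact_2 = [1, 1, 0, 1]   # from another dimension group
r_global = synthesize_global_rowset([r_fact_1, r_fact_2])
print(r_global)
```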

Output Preparation
Distribute the global rowset to individual processors.
Load the PI columns necessary for output.
Merge the output.
Figure: Rglobal, the join index JIi (RID1, RID2, RID3) and the projection index PIi (CustKey: CK1, CK2, CK3, CK4) together yield the output (CK1, CK2).
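A single-process sketch of output preparation, assuming the global rowset is a 0/1 list over fact rows, the JI maps fact rows to dimension RIDs, and PIs are plain lists. The function name and sample values are hypothetical.

```python
# Sketch only: assemble output rows from the global rowset, a JI and PIs.
# Assumptions: r_global is a 0/1 list over fact rows; ji[fact_rid] is a dimension RID.

def prepare_output(r_global, fact_pis, ji, dim_pis):
    """For each selected fact row, read the needed fact and dimension PI values."""
    output = []
    for fact_rid, selected in enumerate(r_global):
        if not selected:
            continue
        row = {col: values[fact_rid] for col, values in fact_pis.items()}
        dim_rid = ji[fact_rid]
        row.update({col: values[dim_rid] for col, values in dim_pis.items()})
        output.append(row)
    return output

r_global = [1, 0, 0, 1]
fact_pis = {"ExtPrice": ["E1", "E2", "E3", "E4"]}
ji = [0, 1, 2, 2]
dim_pis = {"CustKey": ["CK1", "CK2", "CK3"]}
result = prepare_output(r_global, fact_pis, ji, dim_pis)
print(result)
```

Only the PI columns that appear in the query's select list need to be loaded at this stage.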

Performance Comparison
The PSJ algorithm was compared with the Bitmapped Join Index (BJI) algorithm and the pipelined hash join (HASH) algorithm.
Two performance metrics were used:
Response Time in Block Accesses (RTBA).
Aggregate Data Transmission (ADT).

Scalability Experiments
The curves rise as the scale factor and the number of processors increase.
PSJ cost is much lower than BJI and HASH costs.
At large memory sizes, PSJ approaches "near-perfect" scalability.

Scalability Experiments (contd)
Transmission costs for PSJ and BJI are the same.
Both curves exhibit imperfect scalability.
HASH has substantially higher transmission costs than PSJ.

Conclusion
DataIndexes are a physical design strategy that provides efficient partitioning of the schema.
The Parallel Star Join algorithm provides a means of performing star joins in parallel.
PSJ performs better than the BJI and HASH algorithms in terms of I/O and transmission costs.