Data Parallelism. Shourie Boddupalli.

Presentation transcript:


Data Parallelism
Data parallelism is a form of parallelization that distributes computation across multiple processors in a parallel computing environment. A data-parallel framework is very attractive for large-scale data processing, since it enables an application to easily process a huge amount of data on commodity machines.
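The idea can be sketched in a few lines of Python (a toy illustration of my own, not from the slides: `word_count` and the thread pool stand in for the identical per-partition computation and the commodity worker machines a real framework would distribute over):

```python
from concurrent.futures import ThreadPoolExecutor

def word_count(partition):
    # the same computation is applied independently to each partition
    counts = {}
    for word in partition:
        counts[word] = counts.get(word, 0) + 1
    return counts

def parallel_word_count(words, n_workers=4):
    # split the input into n_workers disjoint partitions
    partitions = [words[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(word_count, partitions))
    # merge the partial results into one answer
    total = {}
    for part in partials:
        for word, c in part.items():
            total[word] = total.get(word, 0) + c
    return total

print(parallel_word_count(["a", "b", "a", "c", "a", "b"]))
```

On a cluster the partitions would live on different machines; the structure (partition, compute, merge) is the same.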

Data Warehouse
A data warehouse is an online repository for decision-support applications that must answer business queries in a short time. Where can data parallelism be used in a warehouse?
- Star schema
- Star-join queries

Approaches to Process a Star-Join
Data-parallel frameworks (e.g., Hive, CloudBase):
- No need for up-to-date hardware & software
- Fault tolerance provided while hiding complexity
However, for join-query processing their computational efficiency is still in a premature state.

Warehouse Example

Example Query
SELECT D_YEAR, S_NATION, P_CATEGORY
FROM DATE, CUSTOMER, SUPPLIER, PART, LINEORDER
WHERE LO_ORDERDATE = D_DATEKEY
  AND LO_CUSTKEY = C_CUSTKEY
  AND LO_SUPPKEY = S_SUPPKEY
  AND LO_PARTKEY = P_PARTKEY
  AND C_REGION = 'AMERICA'
GROUP BY D_YEAR, S_NATION, P_CATEGORY;
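To make the query concrete, here is a toy in-memory evaluation in Python (all rows are made-up sample data of my own; the column names follow the star schema of the query above). Each dimension joins to the fact table through its foreign key, the WHERE clause restricts CUSTOMER, and GROUP BY collects the distinct (year, nation, category) groups:

```python
# Toy star schema: each dimension table maps its primary key to a row,
# and LINEORDER is the fact table holding the foreign keys.
DATE      = {19940101: {"D_YEAR": 1994}}
CUSTOMER  = {1: {"C_REGION": "AMERICA"}, 2: {"C_REGION": "ASIA"}}
SUPPLIER  = {10: {"S_NATION": "CANADA"}}
PART      = {100: {"P_CATEGORY": "MFGR#12"}}
LINEORDER = [
    {"LO_ORDERDATE": 19940101, "LO_CUSTKEY": 1, "LO_SUPPKEY": 10, "LO_PARTKEY": 100},
    {"LO_ORDERDATE": 19940101, "LO_CUSTKEY": 2, "LO_SUPPKEY": 10, "LO_PARTKEY": 100},
]

def star_join():
    groups = set()
    for lo in LINEORDER:                     # scan the fact table
        if CUSTOMER[lo["LO_CUSTKEY"]]["C_REGION"] != "AMERICA":
            continue                         # WHERE restriction on a dimension
        groups.add((DATE[lo["LO_ORDERDATE"]]["D_YEAR"],
                    SUPPLIER[lo["LO_SUPPKEY"]]["S_NATION"],
                    PART[lo["LO_PARTKEY"]]["P_CATEGORY"]))
    return groups                            # the GROUP BY groups

print(star_join())   # {(1994, 'CANADA', 'MFGR#12')}
```

The second fact row is filtered out because its customer is in ASIA, leaving a single group.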

Execution plan for the query

Scatter-Gather-Merge
As its name indicates, this algorithm has three phases: Scatter, Gather, and Merge.
Key-manipulation technique: the basic idea is to join the fact table with its n dimension tables within these three computational phases.

Example of Database and Star-Join Query

Contd.
During the scatter phase:
1) If the input is a tuple of the fact table, the tuple is transformed into n key-value pairs, one per foreign key.
2) If the input is a tuple of a dimension table, the tuple is transformed into a single new key-value pair.
The Gather phase aggregates the pairs according to their key; the Merge phase produces the final results of the star-join query.

Algorithm
Algorithm 1 (Key-manipulation algorithm of Scatter-Gather-Merge)

Scatter(r)
Input: r is a record.
1: if (r is a record of the fact table F) then
2:   for each fki do
3:     Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
4:     Store ((fki, i), (fk1, fk2, ..., fkn, rF)).
5:   end for
6: end if
7: if (r is a record of dimension table Di) then
8:   Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
9:   Store and Distribute ((pki, i), rDi).
10: end if

Gather(k, v)
Input: k is a key (join key). v is a set of records that have the same join key.
1: Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, ..., fkn, rF)).
2: Make an output ((fk1, fk2, ..., fkn), (rDi, rF)).
3: Store and Distribute ((fk1, fk2, ..., fkn), (rDi, rF)).

Merge(k, v)
Input: k is a key (fk1, fk2, ..., fkn). v is a set of records that have the key.
1: Aggregate every record with all ((fk1, fk2, ..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
2: Make an output ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
3: Store ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)). // final output
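The key manipulation above can be transcribed almost line-for-line into Python (a sketch under my own record encoding, not the authors' implementation: a fact record is ("F", fks, rF) and a dimension record is ("D", i, pk, rD)):

```python
from collections import defaultdict

def scatter(record):
    """Emit the key-value pairs of the scatter phase for one record."""
    pairs = []
    if record[0] == "F":                       # fact-table record
        _, fks, rF = record
        for i, fk in enumerate(fks, start=1):  # one pair per foreign key
            pairs.append(((fk, i), ("F", fks, rF)))
    else:                                      # dimension-table record
        _, i, pk, rD = record
        pairs.append(((pk, i), ("D", rD)))
    return pairs

def gather(key, values):
    """Match dimension records with fact records sharing the join key."""
    dims  = [v[1] for v in values if v[0] == "D"]
    facts = [(v[1], v[2]) for v in values if v[0] == "F"]
    return [(tuple(fks), (rD, rF)) for rD in dims for (fks, rF) in facts]

def merge(key, values):
    """Aggregate the n partial join results for one fact tuple."""
    rDs = [rD for (rD, _) in values]
    rF  = values[0][1]
    return (key, (tuple(rDs), rF))

# toy run: one fact row joining two dimension rows
records = [("F", [7, 9], "fact-payload"),
           ("D", 1, 7, "dim1-payload"),
           ("D", 2, 9, "dim2-payload")]
grouped = defaultdict(list)                    # stands in for shuffle by key
for r in records:
    for k, v in scatter(r):
        grouped[k].append(v)
partial = defaultdict(list)
for k, vs in grouped.items():
    for k2, v2 in gather(k, vs):
        partial[k2].append(v2)
final = [merge(k, vs) for k, vs in partial.items()]
print(final)
```

The fact tuple is emitted once per foreign key, joined against each dimension in Gather, and the n partial results are recombined in Merge.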

Notation Used
Di has the primary key PKi, which is associated with the foreign key FKi of F, where i is the dimension identification number of Di.
Each tuple of Di is (pki, rDi), where pki is the value of the primary key PKi and rDi is a vector that contains the other attribute values.
Each tuple of F is (fk1, fk2, ..., fkn, rF), where fki is the value of the foreign key FKi and rF is a vector that contains the other attribute values.
The vector (fk1, fk2, ..., fkn) is unique in the fact table, or else rF contains the primary key.

IO Reduction Technique
With the key-manipulation technique, each fact tuple produces n intermediate results before the final output is generated, and this intermediate volume needs to be reduced. Bloom filters are introduced to reduce the number of intermediate results.
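For reference, a minimal Bloom filter looks like this (a simplified sketch of my own, not the exact filter used here): k hash functions set k bits per inserted key, and membership tests can give false positives but never false negatives, which is safe in this setting because a false positive only lets a useless fact tuple through.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)        # m-bit array, one byte per bit here

    def _positions(self, key):
        # derive k bit positions from k salted hashes of the key
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, key):
        for p in self._positions(key):
            self.bits[p] = 1

    def might_contain(self, key):
        # True if all k bits are set; may be a false positive
        return all(self.bits[p] for p in self._positions(key))

bf = BloomFilter()
for pk in [7, 9, 42]:          # primary keys of qualifying dimension tuples
    bf.add(pk)
print(bf.might_contain(7))     # True: the key was added
```

A fact tuple whose foreign key fails `might_contain` can be dropped before it ever produces intermediate pairs.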

Algorithm for IO Reduction
Algorithm 2 (Scatter-Gather-Merge algorithm with Bloom filters)

Filter-Construction(r)
Input: r is a record. BFi is a Bloom filter of Di.
1: if (r is a record of dimension table Di
2:     and r satisfies C_Di) then
3:   Store and Distribute r.
4:   Add pki to BFi.
5: end if

Scatter(r)
Input: r is a record.
1: if (r is a record of the fact table F) then
2:   for each fki do
3:     if fki is not contained by the corresponding BFi then return
4:     end if
5:   end for
6:   for each fki do
7:     Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
8:     Store and Distribute ((fki, i), (fk1, fk2, ..., fkn, rF)).
9:   end for
10: end if
11: if (r is a record of dimension table Di) then
12:   Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
13:   Store and Distribute ((pki, i), rDi).
14: end if

Map-Reduce based Scatter-Gather-Merge Algorithm
Three phases:
- Construction
- Scatter & Gather
- Merge

Contd.

Map-Reduce based Scatter-Gather-Merge Algorithm

// Job 1: construction
Map(k, v)
Input: k is a key. v is a record of each participating dimension table that the star-join query has restrictions on.
1: if (v is a record of dimension table Di
2:     and v satisfies C_Di) then
3:   Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
4:   Emit ((pki, i), rDi).
5: end if

Reduce(k, v)
Input: (k, v) is a filtered record of each dimension table. BF(i,j) is a Bloom filter of Di for the j-th Reduce process.
1: Emit ((pki, i), rDi).
2: Add pki to BF(i,j).

// Job 2: scatter & gather
Map(k, v) // scatter function
Input: k is a key. v is a record of the fact table and every participating dimension table.
1: if (v is a record of the fact table F) then
2:   for each fki do
3:     if fki is not contained by the corresponding BF(i,j) then return
4:     end if
5:   end for
6:   for each fki do
7:     Turn input tuple (fk1, fk2, ..., fkn, rF) into key-value pair ((fki, i), (fk1, fk2, ..., fkn, rF)).
8:     Emit ((fki, i), (fk1, fk2, ..., fkn, rF)).
9:   end for
10: end if
11: if (v is a record of dimension table Di) then
12:   if (there are restrictions on Di) then
13:     Emit ((pki, i), rDi).
14:   else
15:     Turn input tuple (pki, rDi) into key-value pair ((pki, i), rDi).
16:     Emit ((pki, i), rDi).
17:   end if
18: end if

Reduce(k, v) // gather function
Input: k is a key (join key). v is a set of records that have the same join key.
1: Match all ((pki, i), rDi) with all ((fki, i), (fk1, fk2, ..., fkn, rF)) where pki = fki.
2: Make an output ((fk1, fk2, ..., fkn), (rDi, rF)).
3: Emit ((fk1, fk2, ..., fkn), (rDi, rF)).

// Job 3: merge
Map(k, v)
Input: k is a key (fk1, fk2, ..., fkn) and v is a value (rDi, rF).
1: Emit ((fk1, fk2, ..., fkn), (rDi, rF)).

Reduce(k, v)
Input: k is a key (fk1, fk2, ..., fkn). v is a set of records that have the key.
1: Aggregate every record with all ((fk1, fk2, ..., fkn), (rDi, rF)) where 1 ≤ i ≤ n.
2: Make an output ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
3: Emit ((fk1, fk2, ..., fkn), (rD1, rD2, ..., rDn, rF)).
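The three jobs can be wired together with a tiny in-process MapReduce driver (a simulation of my own: map and reduce run sequentially in one process, and a plain Python set stands in for the per-Reduce Bloom filters BF(i,j)):

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # shuffle: group the emitted (key, value) pairs by key
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return out

bloom = defaultdict(set)                  # bloom[i] approximates BF(i,j)

# Job 1 (construction): filter dimension tuples, build the "Bloom filters"
def construct_map(rec):
    _, i, pk, rD = rec                    # ("D", i, pk, rD)
    if rD != "FILTERED":                  # stands in for the restriction C_Di
        yield (pk, i), rD

def construct_reduce(key, values):
    bloom[key[1]].add(key[0])
    for rD in values:
        yield key, rD

# Job 2 (scatter & gather)
def scatter_map(rec):
    if rec[0] == "F":                     # fact record: ("F", fks, rF)
        _, fks, rF = rec
        if all(fk in bloom[i] for i, fk in enumerate(fks, 1)):
            for i, fk in enumerate(fks, 1):
                yield (fk, i), ("F", fks, rF)
    else:                                 # filtered dimension pair from Job 1
        (pk, i), rD = rec
        yield (pk, i), ("D", rD)

def gather_reduce(key, values):
    dims  = [v[1] for v in values if v[0] == "D"]
    facts = [(v[1], v[2]) for v in values if v[0] == "F"]
    for rD in dims:
        for fks, rF in facts:
            yield tuple(fks), (rD, rF)

# Job 3 (merge)
def merge_map(rec):
    yield rec                             # identity map

def merge_reduce(key, values):
    rDs = tuple(rD for rD, _ in values)
    yield key, (rDs, values[0][1])

dims  = [("D", 1, 7, "d1-row"), ("D", 1, 8, "FILTERED"), ("D", 2, 9, "d2-row")]
facts = [("F", (7, 9), "match"), ("F", (8, 9), "pruned")]
job1 = map_reduce(dims, construct_map, construct_reduce)
job2 = map_reduce(job1 + facts, scatter_map, gather_reduce)
job3 = map_reduce(job2, merge_map, merge_reduce)
print(job3)
```

The fact tuple with foreign key 8 is pruned in the scatter map because 8 was filtered out during construction, so it never generates intermediate pairs.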

Experimental Results
The experiments show that query performance was better with the Scatter-Gather-Merge algorithm using Bloom filters than without them. The same result held even as the warehouse size increased.