Presentation by Dr. Greg Speegle CSI 5335, November 18, 2011.

Presentation transcript:

• Automatic parallelization technique
• Map function
  - Reads the input file in parallel
  - Outputs (key, value) pairs
• Reduce function
  - Input: all (key, value) pairs with the same key
  - Output: results
• Information Week: Hadoop skills in demand
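A minimal single-process sketch of the map and reduce steps above. The word-count job and every name here (`run_mapreduce`, `count_words_map`, `count_words_reduce`) are illustrative assumptions, not taken from the presentation or from Hadoop.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Tiny in-memory stand-in for a MapReduce job (hypothetical helper)."""
    groups = defaultdict(list)
    # Map phase: each input record may emit any number of (key, value) pairs.
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    # Reduce phase: each key is handled once, together with all of its values.
    return {key: reduce_fn(key, values) for key, values in groups.items()}


def count_words_map(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1


def count_words_reduce(word, counts):
    # All counts for the same word arrive together.
    return sum(counts)


word_counts = run_mapreduce(["a b a", "b c"], count_words_map, count_words_reduce)
```

In a real cluster the map calls run in parallel on different input splits and the grouping by key is done by the framework's shuffle; the sketch only mirrors the data flow.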

• Theta-join
  - Join on a non-equality predicate
  - Example: SELECT qid, hid FROM Heroes h, Quests q WHERE q.level <= h.level
• Nested block loop
  - For every block of r, read all of s
  - Always applicable
  - “Computes” the cross-product
• Hash join
  - Only examines tuples that can join
  - Cannot always be used (e.g., theta-join)
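The sketch below evaluates the example query with a nested loop, to show why this approach is always applicable and why it effectively computes the cross-product. It works tuple-at-a-time for brevity; the block variant changes how rows are read, not which pairs are compared. The sample rows are hypothetical; only the column names come from the query above.

```python
def nested_loop_theta_join(r_rows, s_rows, predicate):
    """Compare every r row with every s row; works for any predicate."""
    matches, comparisons = [], 0
    for r in r_rows:
        for s in s_rows:
            comparisons += 1  # one comparison per cell of the cross-product
            if predicate(r, s):
                matches.append((r, s))
    return matches, comparisons


# Hypothetical sample data.
heroes = [{"hid": 1, "level": 5}, {"hid": 2, "level": 2}]
quests = [
    {"qid": 10, "level": 1},
    {"qid": 11, "level": 3},
    {"qid": 12, "level": 9},
]

matches, comparisons = nested_loop_theta_join(
    heroes, quests, lambda h, q: q["level"] <= h["level"]
)
result = [(q["qid"], h["hid"]) for h, q in matches]
```

A hash join could not be used here because `<=` gives no single key value on which matching rows could be bucketed together.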

• MapReduce algorithm
  - “Computes” the cross-product
• Goals:
  - Each pair of tuples is matched at exactly one reducer
  - Minimal input to each reducer
  - Minimal output from each reducer
• “1-Bucket” refers to using no statistics about the data distribution

• Precompute regions of the cross-product S x T
  - Use the sizes of S (|S|) and T (|T|)
  - Regions are disjoint
  - Union of the regions covers the cross-product
  - Each region is assigned to a single reducer

|S|=8; |T|=8; #reducers=4. Rows are tuples in S; columns are tuples in T. The value in each cell is the region for that pair.
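The region matrix for this example can be sketched as follows. The row-major numbering of the regions (1 and 2 across the top, 3 and 4 across the bottom) and the name `region_of` are assumptions; the original figure may number the regions differently.

```python
SIZE_S, SIZE_T, SIDE = 8, 8, 4  # 8x8 cross-product split into 4x4 squares


def region_of(row, col):
    """Region id (1..4) for a 1-based (row, col) cell, numbered row-major."""
    regions_per_band = SIZE_T // SIDE
    return ((row - 1) // SIDE) * regions_per_band + (col - 1) // SIDE + 1


matrix = [
    [region_of(row, col) for col in range(1, SIZE_T + 1)]
    for row in range(1, SIZE_S + 1)
]

# Print the matrix: rows are S tuples, columns are T tuples.
for line in matrix:
    print(" ".join(str(region) for region in line))
```

Every cell gets exactly one region id, so the regions are disjoint and together cover the whole cross-product.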

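• Each row in S
  - Randomly assign a value x from 1 to size(S)
  - Output one pair for each region containing row x
  - Example: assume x=3. Output one pair for each region that intersects row 3
• Each row in T
  - Same, except the output pairs mark the row as coming from T
  - Example: assume x=3. Output one pair for each region that intersects column 3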
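A sketch of the map step for the 8x8, four-reducer example. It assumes 4x4 regions numbered row-major (1, 2 on top; 3, 4 below) and a (region, (origin, tuple)) output format; these choices and all function names are assumptions made for illustration.

```python
import random

SIDE = 4  # region side length, assuming |S| = |T| = 8 and 4 reducers


def regions_for_row(row, n_cols=8):
    """Regions that intersect a 1-based row of the cross-product matrix."""
    per_band = n_cols // SIDE
    band = (row - 1) // SIDE
    return [band * per_band + k + 1 for k in range(per_band)]


def regions_for_col(col, n_rows=8, n_cols=8):
    """Regions that intersect a 1-based column of the matrix."""
    per_band = n_cols // SIDE
    stripe = (col - 1) // SIDE
    return [b * per_band + stripe + 1 for b in range(n_rows // SIDE)]


def map_s(tuple_s, size_s=8, rng=random):
    x = rng.randint(1, size_s)  # random row number, inclusive bounds
    return [(region, ("S", tuple_s)) for region in regions_for_row(x)]


def map_t(tuple_t, size_t=8, rng=random):
    x = rng.randint(1, size_t)  # random column number, inclusive bounds
    return [(region, ("T", tuple_t)) for region in regions_for_col(x)]
```

Under this numbering, row 3 intersects regions 1 and 2, while column 3 intersects regions 1 and 3, so an S tuple and a T tuple that both draw x=3 meet only in region 1.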

• Each reducer joins all the S rows it receives with all the T rows it receives
• Can use any join algorithm appropriate for the join condition
• Output: cross-product, theta-join, or equi-join

• Random assignment of tuples
  - Since the actual row number is unknown, any row number works
• Some reducer will compare a tuple with any given tuple in the other table
• Therefore, every pair is compared (as in nested block loop join), and in only one reducer
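The correctness argument can be checked with a small simulation: whatever random row and column numbers are drawn, each (s, t) pair should be compared at exactly one reducer. This is a single-process sketch; the function name, the (band, stripe) region keys, and the sample data are assumptions.

```python
import random
from collections import defaultdict
from itertools import product


def one_bucket_theta(S, T, predicate, side, seed=0):
    rng = random.Random(seed)
    row_bands = -(-len(S) // side)  # ceiling division
    col_bands = -(-len(T) // side)
    reducers = defaultdict(lambda: ([], []))  # region -> (S rows, T rows)

    # Map: an S tuple goes to every region in its randomly chosen row band.
    for s in S:
        band = (rng.randint(1, len(S)) - 1) // side
        for stripe in range(col_bands):
            reducers[(band, stripe)][0].append(s)
    # Map: a T tuple goes to every region in its randomly chosen column stripe.
    for t in T:
        stripe = (rng.randint(1, len(T)) - 1) // side
        for band in range(row_bands):
            reducers[(band, stripe)][1].append(t)

    # Reduce: each region joins the S and T rows it received.
    compared, output = defaultdict(int), []
    for s_rows, t_rows in reducers.values():
        for s, t in product(s_rows, t_rows):
            compared[(s, t)] += 1
            if predicate(s, t):
                output.append((s, t))
    return output, compared


S = list(range(8))  # distinct values so each pair is identifiable
T = list(range(8))
output, compared = one_bucket_theta(S, T, lambda s, t: s <= t, side=4)
```

An S tuple lands in one row band and a T tuple in one column stripe, and exactly one region lies in both, which is why each pair is seen once regardless of the seed.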

• Basis for minimal input and minimal output
• Let |S| be the size of table S, |T| the size of table T, and r the number of reducers
• Optimal output per reducer: |S||T|/r
• Optimal input per reducer: sqrt(|S||T|/r) from each table
• Special case:
  - |S| = s*sqrt(|S||T|/r); |T| = t*sqrt(|S||T|/r)
  - Optimal: s*t squares with side length sqrt(|S||T|/r)

|S|=8; |T|=8; r=4; sqrt(|S||T|/r) =4; s=t=2
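The numbers in this example can be reproduced with a few lines of arithmetic. The function name `optimal_square_partition` is a hypothetical helper, and the sketch only applies to the special case where the side length divides both table sizes.

```python
import math


def optimal_square_partition(size_s, size_t, r):
    """Side length and square counts for the special (evenly divisible) case."""
    side = math.sqrt(size_s * size_t / r)
    s = size_s / side  # number of squares along the S axis
    t = size_t / side  # number of squares along the T axis
    return side, s, t


side, s, t = optimal_square_partition(8, 8, 4)
cells_per_reducer = side * side   # output bound: |S||T|/r
input_per_reducer = 2 * side      # sqrt(|S||T|/r) rows from each table
```

Here s*t equals r, so the squares tile the cross-product exactly and each reducer gets one square.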

• Optimal case is rare
• General case
  - s = floor(|S| / sqrt(|S||T|/r))
  - t = floor(|T| / sqrt(|S||T|/r))
  - Side length: floor((1 + 1/min(s,t)) * sqrt(|S||T|/r))
  - Note: the floor function is omitted from the paper
• Example: |S|=|T|=8; r=9
  - s = t = floor(8 / sqrt(64/9)) = 3
  - Side length = floor((1 + 1/3) * sqrt(64/9)) = 3
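A sketch of the general-case formulas in integer arithmetic, so that values such as floor(8/sqrt(64/9)) do not depend on floating-point rounding. It uses the identities |S|/sqrt(|S||T|/r) = sqrt(|S|r/|T|) and floor(sqrt(x)) = isqrt(floor(x)). The function name and the error raised when min(s, t) is 0 are assumptions; the slide does not cover that case.

```python
import math


def general_case(size_s, size_t, r):
    """Return (s, t, side) following the general-case formulas on the slide."""
    # floor(|S| / sqrt(|S||T|/r)) == floor(sqrt(|S| * r / |T|))
    s = math.isqrt(size_s * r // size_t)
    t = math.isqrt(size_t * r // size_s)
    m = min(s, t)
    if m == 0:
        # Assumption: very unbalanced inputs need separate handling.
        raise ValueError("min(s, t) is 0; formula does not apply")
    # floor((1 + 1/m) * sqrt(|S||T|/r)) == floor(sqrt((m+1)^2 |S||T| / (m^2 r)))
    side = math.isqrt((m + 1) ** 2 * size_s * size_t // (m * m * r))
    return s, t, side


s, t, side = general_case(8, 8, 9)
```

For |S|=|T|=8 and r=9 this reproduces the slide's values: s=t=3 and a side length of 3.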

Assumed partitioning. Note: 64/9 ≈ 7.11, so eight partitions with 7 cells and one with 8 would be better.
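The comparison can be made concrete by counting cells per reducer. The sketch assumes the pictured partitioning is a grid of side-3 squares clipped to the 8x8 matrix, which is a guess about the missing figure; it counts cells only and does not say what shape the balanced regions would take.

```python
def balanced_cell_counts(cells, r):
    """Most even split of `cells` cross-product cells over r reducers."""
    base, extra = divmod(cells, r)
    return [base + 1] * extra + [base] * (r - extra)


def square_grid_cell_counts(size_s, size_t, side):
    """Cells per region when side x side squares are clipped to the matrix."""
    def bands(n):
        return [min(side, n - start) for start in range(0, n, side)]
    return [h * w for h in bands(size_s) for w in bands(size_t)]


balanced = balanced_cell_counts(64, 9)         # one region of 8, eight of 7
squares = square_grid_cell_counts(8, 8, 3)     # bands of 3, 3, 2 on each axis
```

Under that assumption the largest square region holds 9 cells, while the balanced split never exceeds 8.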

• Map
  - Each row in S is output with its join value as the key
  - Each row in T is output with its join value as the key
• Reducer
  - Join all matching rows (same as 1-Bucket)
• Cannot be used for arbitrary theta-joins
• Subject to skew
• Great for foreign key joins with a uniform distribution
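A single-process sketch of this standard equi-join, assuming rows are keyed by their join value in the map step. The function name, the row layout, and the sample data are hypothetical; the per-key load is reported only to make the skew problem visible.

```python
from collections import defaultdict


def equi_join(S, T, key_s, key_t):
    groups = defaultdict(lambda: ([], []))  # join value -> (S rows, T rows)
    # Map: key every row by its join value.
    for row in S:
        groups[key_s(row)][0].append(row)
    for row in T:
        groups[key_t(row)][1].append(row)
    # Reduce: rows sharing a key are joined with each other.
    joined = [
        (s, t)
        for s_rows, t_rows in groups.values()
        for s in s_rows
        for t in t_rows
    ]
    # Input size per key; one very common key overloads one reducer.
    load = {key: len(s_rows) + len(t_rows) for key, (s_rows, t_rows) in groups.items()}
    return joined, load


# Hypothetical rows of the form (name, join_value).
S = [("s1", 1), ("s2", 1), ("s3", 2)]
T = [("t1", 1), ("t2", 3)]
joined, load = equi_join(S, T, key_s=lambda row: row[1], key_t=lambda row: row[1])
```

Rows with different join values never meet, which is why the approach cannot evaluate predicates such as `<=`.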

• Cloud data set
  - Information about cloud cover
  - 382 million records
  - 28.8 GB
  - Cloud-5-i is a 5 million record subset
• SELECT S.date, S.longitude, S.latitude FROM Cloud S, Cloud T WHERE S.date = T.date AND S.longitude = T.longitude AND ABS(S.latitude - T.latitude) <= 10
• SELECT S.latitude, T.latitude FROM Cloud-5-1 S, Cloud-5-2 T WHERE ABS(S.latitude - T.latitude) < 2

• MapReduce algorithm for arbitrary joins
• Always applicable
• Effective for large-scale data analysis
• Additional statistics provide better performance