Processing Theta-Joins using MapReduce


Processing Theta-Joins using MapReduce
Alper Okcan, Mirek Riedewald
Northeastern University, Boston, MA
SIGMOD 2011
Presented 12 April 2013, SNU IDB Lab., Hyesung Oh

Outline
- Introduction
- Optimization Goal
- Mapping Join Matrix Cells to Reducers
- 1-Bucket Theta
- Experiments
- Conclusion

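Introduction - 1
- Internet companies want to analyze terabytes of data, so parallel computation is essential
- Join
  - equi-join: matches tuples whose join attributes have exactly the same value
  - theta-join: matches tuples under an arbitrary comparison condition, e.g. a range of attribute values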

Introduction - 2 MapReduce overview

Introduction - 3
- MapReduce: (key, value) pairs; map and reduce jobs
- Good for equi-joins, but what about other types of joins?
- A reducer-centered cost model and a join model simplify creating and reasoning about theta-joins

Optimization Goal
- How to minimize job completion time
- Cost measures
  - max-reducer-input
  - max-reducer-output
- Problem classes
  - input-size dominated
  - output-size dominated
  - input-output balanced

Mapping Join Matrix Cells to Reducers
Figure: standard equi-join (left), random (center), and balanced (right)

Comparison of Reduce Allocation Methods
- Simple allocation
  - Minimizes the maximum input size of the reduce functions
  - Output size may be skewed
- Random allocation
  - Minimizes the maximum output size of the reduce functions
  - Input size may increase due to duplication
- Balanced allocation
  - Minimizes both the maximum input size and the maximum output size

1-Bucket Theta
- MapReduce algorithm that "computes" the cross-product
- Goals:
  - Each pair of tuples is matched at exactly one reducer
  - Minimal input to each reducer
  - Minimal output from each reducer
- "1-Bucket" means that no statistics about the data distribution are used

Algorithm
- Precompute regions of the cross-product S x T
  - Use the sizes of S (|S|) and T (|T|)
  - Regions are disjoint
  - The union of the regions covers the cross-product
  - Each region is assigned to a single reducer
- Figure: |S| = 8, |T| = 8, #reducers = 4; the join matrix is split into regions 1, 2, 3, 4
  - Rows are tuples s in S; columns are tuples t in T
  - The value of a cell is the region for the <s,t> pair
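
The region precomputation can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `region_matrix` and the grid parameters are hypothetical, and the layout (regions 1 and 2 in the top half, 3 and 4 in the bottom half) is an assumption chosen to agree with the mapper example in these slides.

```python
def region_matrix(size_s, size_t, grid_rows, grid_cols):
    """Label every cell of the |S| x |T| join matrix with a region id.

    Assumption: regions form a grid_rows x grid_cols grid of equal-sized
    rectangles, numbered row by row starting at 1.
    """
    height = -(-size_s // grid_rows)  # ceiling division: rows per region
    width = -(-size_t // grid_cols)   # ceiling division: columns per region
    return [
        [(i // height) * grid_cols + (j // width) + 1 for j in range(size_t)]
        for i in range(size_s)
    ]


# The slide's setting: |S| = 8, |T| = 8, 4 reducers arranged as a 2 x 2 grid.
matrix = region_matrix(8, 8, 2, 2)
for row in matrix:
    print(" ".join(str(region) for region in row))
```

Every cell receives exactly one label, so the regions are disjoint and together cover the cross-product.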

Algorithm: Mapper
- Each row in S
  - Randomly assign a value x from 1 to |S|
  - Output <region, row+'S'> for each region containing matrix row x
  - Example: assume x = 3. Output <1, row+'S'> and <2, row+'S'>
- Each row in T
  - Same, with x drawn from 1 to |T| and treated as a matrix column, except output <region, row+'T'>
  - Example: assume x = 3. Output <1, row+'T'> and <3, row+'T'>
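
A minimal sketch of the map step, assuming the 2 x 2 grid of regions from the |S| = 8, |T| = 8 example. The names `regions_for_row`, `regions_for_col`, and `map_tuple` are hypothetical helpers for illustration and are not taken from the paper or from any MapReduce API.

```python
import random


def regions_for_row(x, size_s, grid_rows, grid_cols):
    """Region ids crossed by matrix row x (1-based)."""
    height = -(-size_s // grid_rows)  # ceiling division
    band = (x - 1) // height
    return [band * grid_cols + c + 1 for c in range(grid_cols)]


def regions_for_col(y, size_t, grid_rows, grid_cols):
    """Region ids crossed by matrix column y (1-based)."""
    width = -(-size_t // grid_cols)  # ceiling division
    stripe = (y - 1) // width
    return [r * grid_cols + stripe + 1 for r in range(grid_rows)]


def map_tuple(row, origin, size_s, size_t, grid_rows, grid_cols, rng=random):
    """Emit (region, (row, origin)) pairs for one input tuple."""
    if origin == "S":
        x = rng.randint(1, size_s)  # random matrix row
        regions = regions_for_row(x, size_s, grid_rows, grid_cols)
    else:
        y = rng.randint(1, size_t)  # random matrix column
        regions = regions_for_col(y, size_t, grid_rows, grid_cols)
    return [(region, (row, origin)) for region in regions]


# The slide's example: x = 3 in an 8 x 8 matrix with 4 regions.
print(regions_for_row(3, 8, 2, 2))  # S tuple
print(regions_for_col(3, 8, 2, 2))  # T tuple
```

Each tuple is replicated to every region its random row or column crosses, which is where the input duplication comes from.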

Algorithm: Reducer
- Joins all S rows it receives with all T rows it receives
- Can use any join algorithm appropriate for the join condition
- Outputs a cross-product, theta-join, or equi-join result
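
A minimal sketch of the reduce step, assuming each value arrives as a `(row, origin)` pair with origin `'S'` or `'T'`. The function `reduce_region` is a hypothetical name, and the nested loop stands in for whichever local join algorithm suits the condition.

```python
def reduce_region(values, theta):
    """Join all S rows with all T rows received for one region.

    `values` is an iterable of (row, origin) pairs; `theta` is any
    two-argument predicate that decides whether a pair joins.
    """
    values = list(values)
    s_rows = [row for row, origin in values if origin == "S"]
    t_rows = [row for row, origin in values if origin == "T"]
    # Plain nested-loop join; any local join algorithm could be used here.
    return [(s, t) for s in s_rows for t in t_rows if theta(s, t)]


received = [(1, "S"), (5, "S"), (2, "T"), (9, "T")]
band_join = reduce_region(received, lambda s, t: abs(s - t) < 2)
cross_product = reduce_region(received, lambda s, t: True)
print(band_join)
print(len(cross_product))
```

Passing a predicate that always returns true yields the cross-product, and an equality predicate yields an equi-join.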

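Algorithm: Correctness
- Tuples are assigned to rows and columns at random
- Since the actual row number of a tuple is unknown, any row number works
- Whatever row an S tuple receives, that row crosses every column, so some reducer compares it with each tuple of the other table
- Therefore every pair is compared (as in a block nested-loop join) in exactly one reducer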
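
The exactly-once property can be checked empirically. The sketch below is an illustrative simulation under the same assumed grid layout, not a proof and not the authors' code; `simulate` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict
from itertools import product


def simulate(size_s, size_t, grid_rows, grid_cols, seed=0):
    """Count how many reducers see each (s, t) pair."""
    rng = random.Random(seed)
    height = -(-size_s // grid_rows)  # ceiling division
    width = -(-size_t // grid_cols)
    s_at = defaultdict(list)  # region id -> S tuple ids sent there
    t_at = defaultdict(list)  # region id -> T tuple ids sent there
    for s_id in range(size_s):
        band = (rng.randint(1, size_s) - 1) // height
        for c in range(grid_cols):
            s_at[band * grid_cols + c + 1].append(s_id)
    for t_id in range(size_t):
        stripe = (rng.randint(1, size_t) - 1) // width
        for r in range(grid_rows):
            t_at[r * grid_cols + stripe + 1].append(t_id)
    seen = defaultdict(int)
    for region in range(1, grid_rows * grid_cols + 1):
        for pair in product(s_at[region], t_at[region]):
            seen[pair] += 1
    return seen


seen = simulate(8, 8, 2, 2)
print(len(seen), set(seen.values()))
```

An S tuple lands in one horizontal band and a T tuple in one vertical stripe, and a band and a stripe intersect in exactly one region, regardless of the random choices.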

Optimal Partitioning
- Basis for minimal input and minimal output
- Let |S| and |T| be the sizes of tables S and T; r the number of reducers
- Optimal output per reducer: |S||T|/r
- Optimal input per reducer: sqrt(|S||T|/r) from each table
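
The reasoning behind these two values can be written out as a short derivation. The symbols c_S and c_T (the number of matrix rows and columns covered by one region) are introduced here for illustration and do not appear in the slides.

```latex
\begin{align*}
  \text{cells per region:}\quad
    & c_S \, c_T = \frac{|S|\,|T|}{r} \\
  \text{input per reducer:}\quad
    & c_S + c_T \;\ge\; 2\sqrt{c_S \, c_T}
      \;=\; 2\sqrt{\frac{|S|\,|T|}{r}} \\
  \text{equality holds when:}\quad
    & c_S = c_T = \sqrt{\frac{|S|\,|T|}{r}}
\end{align*}
```

The inequality is the arithmetic-geometric mean bound, so a square region minimizes reducer input for a fixed number of covered cells.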

Example
- |S| = 8; |T| = 8; r = 4
- sqrt(|S||T|/r) = 4
- s = t = 2

Near-Optimal Partitioning
- The optimal case is rare
- General case
  - t = floor(|T| / sqrt(|S||T|/r)), and s is defined the same way from |S|
  - Side length: floor((1 + 1/min(s,t)) * sqrt(|S||T|/r))
  - Note: the floor function is omitted in the paper
- Example: |S| = |T| = 8; r = 9
  - s = t = floor(8 / sqrt(64/9)) = 3
  - Side length = floor((1 + 1/3) * sqrt(64/9)) = 3
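
The formulas above can be evaluated with a short sketch. The function `near_optimal` is a hypothetical name; it rewrites each floor-of-square-root expression algebraically into integer arithmetic (an implementation choice made here to avoid floating-point rounding, not something stated in the slides).

```python
import math


def near_optimal(size_s, size_t, r):
    """Return (s, t, side) for the near-optimal partitioning formulas.

    Uses floor(|S| / sqrt(|S||T|/r)) == floor(sqrt(|S| * r / |T|)) and the
    analogous rewrite for the side length, so only integers are involved.
    """
    s = math.isqrt(size_s * r // size_t)
    t = math.isqrt(size_t * r // size_s)
    m = min(s, t)
    if m == 0:
        raise ValueError("too few reducers for these table sizes")
    # floor((1 + 1/m) * sqrt(|S||T|/r)) as an integer square root
    side = math.isqrt((m + 1) ** 2 * size_s * size_t // (m * m * r))
    return s, t, side


print(near_optimal(8, 8, 9))
```

For the slide's example with |S| = |T| = 8 and r = 9, this reproduces s = t = 3 and a side length of 3.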

Example: Near-Optimal Partitioning
- Figure: assumed partitioning
- Note: 64/9 = 7.111...
- Eight partitions with 7 cells and one with 8 is better

Experiments
- Cloud data set
  - Information about cloud cover
  - 382 million records
  - 28.8 GB
  - Cloud-5-i is a 5 million record subset
- Query 1:
    SELECT S.date, S.longitude, S.latitude
    FROM Cloud S, Cloud T
    WHERE S.date = T.date
      AND S.longitude = T.longitude
      AND ABS(S.latitude - T.latitude) <= 10
- Query 2:
    SELECT S.latitude, T.latitude
    FROM Cloud-5-1 S, Cloud-5-2 T
    WHERE ABS(S.latitude - T.latitude) < 2

Experimental Results

Conclusion
- MapReduce algorithm for arbitrary joins
- Always applicable
- Effective for large-scale data analysis
- Additional statistics provide better performance