Assignment Problems of Different-Sized Inputs in MapReduce. Foto N. Afrati 1, Shlomi Dolev 2, Ephraim Korach 2, Shantanu Sharma 2, and Jeffrey D. Ullman.

Presentation transcript:

Assignment Problems of Different-Sized Inputs in MapReduce
Foto N. Afrati (1), Shlomi Dolev (2), Ephraim Korach (2), Shantanu Sharma (2), and Jeffrey D. Ullman (3)
(1) National Technical University of Athens, Greece
(2) Ben-Gurion University of the Negev, Israel
(3) Stanford University, USA

Outline
– Introduction
– Problem Statement and Our Contribution
– All-to-All (A2A) Mapping Schema Problem
– Heuristics for A2A Mapping Schema Problem
– X-to-Y (X2Y) Mapping Schema Problem
– Heuristics for X2Y Mapping Schema Problem
– Conclusion

Introduction
Cluster computing
– Terabytes or petabytes of data cannot be processed easily on a single computer
– A cluster of computers is used instead
– Open question: how to mask failures, e.g., hardware failures
MapReduce is a programming model for parallel processing of large-scale data.

Introduction
MapReduce job
– Map phase: applies a user-defined Map function
– Reduce phase: applies a user-defined Reduce function
Mapper
– An application of the Map function to a single input
– Produces outputs in the form of ⟨key, value⟩ pairs
Reducer
– An application of the Reduce function to a single key and its associated list of values

[Figure: MapReduce job, Map phase and Reduce phase. A master process forks workers and assigns map tasks and reduce tasks. Map workers read the input data (Chunk 0, Chunk 1, Chunk 2) and write intermediate results locally; reduce workers read them remotely, sort them, and write Output File 0 and Output File 1. The Map phase applies a user-defined Map function; the Reduce phase applies a user-defined Reduce function.]

[Figure: Introduction, how MapReduce works. Mapper 1 to Mapper 4 read input 1 to input 4 and emit keys: input 1 → k1, k2; input 2 → k1, k2; input 3 → k3; input 4 → k2, k3. There is one reducer per key (k1, k2, k3). Notation: k_i is a key.]

[Figure: Introduction, MapReduce word-count example. Input sentences: "I like apple.", "Apple is fruit.", "I like banana." The mappers emit one pair per word, there is one reducer per word (I, like, apple, is, fruit, banana), and the outputs are (I, 2), (like, 2), (apple, 2), (is, 1), (fruit, 1), (banana, 1).]
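
The word-count example can be imitated on a single machine. The following is only a sketch, not the presenters' code: the names map_fn, reduce_fn, and word_count are hypothetical, and lowercasing every word (so that "Apple" and "apple" reach the same reducer) is an assumption made here to reproduce the counts on the slide.

```python
import re
from collections import defaultdict


def map_fn(sentence):
    """Mapper: emit one (word, 1) pair per word. Lowercasing is an assumption."""
    return [(word.lower(), 1) for word in re.findall(r"[A-Za-z]+", sentence)]


def reduce_fn(key, values):
    """Reducer: one call per key, with the list of values for that key."""
    return key, sum(values)


def word_count(sentences):
    # Shuffle step: group the mapper outputs by key.
    groups = defaultdict(list)
    for sentence in sentences:
        for key, value in map_fn(sentence):
            groups[key].append(value)
    return dict(reduce_fn(key, values) for key, values in groups.items())


counts = word_count(["I like apple.", "Apple is fruit.", "I like banana."])
print(counts)
```

Each key in counts plays the role of one reducer in the figure.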

[Figure: Introduction, inputs and outputs in our context. The same word-count example, with the values passed from the mappers to the reducers labelled "Inputs" and the results produced by the reducers labelled "Outputs".]

Reducer Capacity
– The values provided by each mapper have sizes (the input sizes)
– Reducer capacity: an upper bound on the sum of the sizes of the values that are assigned to a reducer
– Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run
– We consider two special matching problems

Mapping Schema
A mapping schema is an assignment of the set of inputs to some given reducers such that two constraints hold:
– Respect the reducer capacity: a reducer is assigned only inputs whose sizes sum to at most the reducer capacity
– Assignment of inputs: for every output, every two inputs corresponding to that output are assigned to at least one reducer in common
[Figure: three inputs M1 (1 GB), M2 (2 GB), and M3 (2 GB) assigned in several ways to reducers of capacity 4 GB.]
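
The two constraints can be checked mechanically. The sketch below assumes the all-to-all case (every pair of inputs corresponds to an output) and uses the sizes from the figure; the function name is_valid_a2a_schema and the representation of a schema as a list of sets are hypothetical choices made for illustration.

```python
from itertools import combinations


def is_valid_a2a_schema(sizes, reducers, q):
    """sizes: input name -> size; reducers: list of sets of input names; q: capacity."""
    # Constraint 1: no reducer receives more than q in total.
    for assigned in reducers:
        if sum(sizes[name] for name in assigned) > q:
            return False
    # Constraint 2: every two inputs share at least one reducer.
    for a, b in combinations(sizes, 2):
        if not any(a in assigned and b in assigned for assigned in reducers):
            return False
    return True


sizes = {"M1": 1, "M2": 2, "M3": 2}  # sizes in GB, as in the figure
q = 4                                # reducer capacity in GB

three_reducers = [{"M1", "M2"}, {"M1", "M3"}, {"M2", "M3"}]
one_reducer = [{"M1", "M2", "M3"}]   # total size 5 GB exceeds q
print(is_valid_a2a_schema(sizes, three_reducers, q))
print(is_valid_a2a_schema(sizes, one_reducer, q))
```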

State-of-the-Art
F. Afrati, A.D. Sarma, S. Salihoglu, and J.D. Ullman, “Upper and Lower Bounds on the Cost of a Map-Reduce Computation,” PVLDB,
– Unit input size
– Reducer size: the maximum number of inputs that a given reducer can have
– Mapping schema: respect the reducer capacity; assignment of inputs

Problem Statement
– The communication cost between the map and the reduce phases is a significant factor
– How can we reduce the communication cost?
  – Fewer reducers, and hence a smaller communication cost
  – How do we minimize the total number of reducers while respecting their limited capacity?
– Not an easy task
  – All-to-All mapping schema problem
  – X-to-Y mapping schema problem
[Figure: two ways to assign three inputs. With three reducers, the reducer for k1 receives inputs (1, 2), the reducer for k2 receives (1, 3), and the reducer for k3 receives (2, 3), so every input is sent to two reducers. With one reducer, the reducer for k1 receives (1, 2, 3), so every input is sent once. Notation: k_i is a key.]

Our Contribution
– Try to decrease the communication cost
– Two kinds of mapping schema problems:
  – All-to-All (A2A) mapping schema problem
  – X-to-Y (X2Y) mapping schema problem
– Heuristics for the mapping schema problems

A2A Mapping Schema Problem
– A set of inputs is given
– Each pair of inputs corresponds to one output
– Example: computing common friends
  – Lists of friends of m persons are given
  – Find the common friends of every pair of the given m persons
  – Every two friend lists must be assigned to at least one reducer in common

[Figure: A2A Mapping Schema Problem. Four mappers, one per friend, emit the friend lists fl1 to fl4, all with key k1. The single reducer for k1 receives (1, 2, 3, 4) and covers the pairs (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4). The reducer capacity is enough to hold all the friend lists together. Notation: k_i is a key; fl_i is the i-th friend list.]

[Figure: A2A Mapping Schema Problem. The four mappers emit fl1 with keys k1, k2; fl2 with keys k1, k2; fl3 with keys k1, k3; fl4 with keys k2, k3. The reducer for k1 receives (1, 2, 3), the reducer for k2 receives (1, 2, 4), and the reducer for k3 receives (3, 4); together they cover the pairs (1, 2), (1, 3), (2, 3), (2, 4), (1, 4), (3, 4). The reducer capacity is enough to hold only some of the friend lists together. Notation: k_i is a key; fl_i is the i-th friend list.]

A2A Mapping Schema Problem
Inputs to the problem
– A set of m inputs
– A size for each input (w_1, w_2, …, w_m)
– A set of reducers (r_1, r_2, …, r_z)
– A mapping from outputs to sets of inputs
– Identical reducer capacity q

A2A Mapping Schema Problem
What to do?
– Assign the given m inputs to the given number of reducers, without exceeding q, such that every input is paired with every other input in at least one reducer in common
– Polynomial-time solution for one and two reducers
– NP-hard for z > 2 reducers
[Figure: inputs M1 (1 GB), M2 (2 GB), and M3 (2 GB) and reducers of capacity 4 GB. In two of the attempted assignments some input cannot be placed ("Cannot assign M3", "Cannot assign M2, M3").]

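A2A Mapping Schema Problem
The slide relates the problem to the Partition problem (M.R. Garey and D.S. Johnson, "Computers and Intractability: A Guide to the Theory of NP-Completeness," 1979).
– Partition example: A = {3, 1, 1, 2, 2, 1} splits into A_1 = {1, 1, 1, 2} (1 + 1 + 1 + 2 = 5) and A_2 = {2, 3} (2 + 3 = 5)
[Figure: the inputs w_1, w_2, …, w_m, of total size s, are split into Subset 1 of I and Subset 2 of I, each of total size s/2; the construction is drawn for 3 reducers with q = s, and for z + 1 reducers with Subset 1 of I, Subset 2 of I, …, Subset z of I.]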
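
To make the Partition example concrete, here is a brute-force sketch that searches for a split into two halves of equal sum. It runs in exponential time and is meant only for tiny instances such as the one on the slide; find_partition is a hypothetical helper, not part of the presentation.

```python
from itertools import combinations


def find_partition(numbers):
    """Return two lists with equal sums that together use every number, or None."""
    total = sum(numbers)
    if total % 2:
        return None  # an odd total cannot be split evenly
    indices = range(len(numbers))
    for r in range(len(numbers) + 1):
        for chosen in combinations(indices, r):
            if sum(numbers[i] for i in chosen) == total // 2:
                first = [numbers[i] for i in chosen]
                second = [numbers[i] for i in indices if i not in chosen]
                return first, second
    return None


halves = find_partition([3, 1, 1, 2, 2, 1])
print(halves)
```

The split returned need not be the same one as on the slide; several splits with both sums equal to 5 exist.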

Heuristics for A2A Mapping Schema Problem
Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithm
– Pseudo-polynomial bin-packing algorithm *
– 2-step algorithms
– The selection of a prime number p
A fixed reducer capacity is given.
* D. R. Karger and J. Scott. Efficient algorithms for fixed-precision instances of bin packing and Euclidean TSP. In APPROX-RANDOM, pages 104–117, 2008.

Heuristics for A2A Mapping Schema Problem
Parameters for analysis:
– Per-input replication
– Replication rate, r
– Total number of reducers, r(m, q)
– Total communication cost, c

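Heuristics for A2A Mapping Schema Problem
s is the sum of all the input sizes; q is the reducer capacity.
[Figure: input w_1 is replicated to several reducers, each holding w_1 together with other inputs: (w_2, w_4), (w_3, w_m, w_5), …, (w_{m-1}, w_6). The other inputs have total size s − w_1, and each reducer holding w_1 has q − w_1 of capacity left, giving the ratio (s − w_1)/(q − w_1).]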
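
The ratio in the figure can be read as a counting bound on replication. The derivation below is a sketch under the assumptions m ≥ 2 and w_i < q; the symbol r_i for the number of reducers that receive input i is introduced here for illustration and does not appear on the slides. Input i must meet all the other inputs, of total size s − w_i, and every reducer that holds input i has only q − w_i of capacity left for them. Summing the sizes that are actually sent gives a bound on the communication cost c.

```latex
\[
  r_i \;\ge\; \left\lceil \frac{s - w_i}{q - w_i} \right\rceil ,
  \qquad
  c \;=\; \sum_{i=1}^{m} w_i \, r_i
    \;\ge\; \sum_{i=1}^{m} w_i \left\lceil \frac{s - w_i}{q - w_i} \right\rceil .
\]
```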

Heuristics for A2A Mapping Schema Problem
Case 1: All the input sizes are different
– Use First-Fit Decreasing (FFD)* to create x bins (S_1, S_2, …, S_x), each of size at most q/2 (this requires every input to be of size at most q/2)
– Use x(x−1)/2 reducers to pair each bin with every other bin
[Figure: the inputs w_1, w_2, w_3, …, w_m are packed into bins S_1, S_2, …, S_x; the reducers hold the bin pairs (S_1, S_2), (S_1, S_3), …, (S_1, S_x), (S_2, S_3), (S_2, S_4), …, (S_2, S_x), …, (S_{x-1}, S_x).]
* D.S. Johnson, Near-optimal bin-packing algorithms, Doctoral thesis, MIT, Cambridge,

Heuristics for A2A Mapping Schema Problem
– The bins S_1, S_2, …, S_x have size q/2
– Bins are at least half full (with FFD, all except possibly one), so each such bin holds inputs of total size at least q/4
– Any two bins can be placed together at one reducer, since their combined size is at most q
– s is the sum of all the input sizes; q is the reducer capacity
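
The bin-pairing heuristic can be sketched in a few lines. This is an illustrative implementation under the assumption that every input has size at most q/2; ffd_bins and a2a_by_bin_pairs are hypothetical names, and the input sizes are made up for the example.

```python
from itertools import combinations


def ffd_bins(sizes, bin_size):
    """First-Fit Decreasing: pack named inputs into bins of capacity bin_size."""
    bins = []  # each entry is [remaining capacity, list of input names]
    for name in sorted(sizes, key=sizes.get, reverse=True):
        if sizes[name] > bin_size:
            raise ValueError(f"input {name} does not fit in a bin")
        for entry in bins:
            if entry[0] >= sizes[name]:  # first bin with enough room
                entry[0] -= sizes[name]
                entry[1].append(name)
                break
        else:
            bins.append([bin_size - sizes[name], [name]])
    return [members for _, members in bins]


def a2a_by_bin_pairs(sizes, q):
    """Pack into bins of size q/2, then give every pair of bins its own reducer."""
    bins = ffd_bins(sizes, q / 2)
    if len(bins) == 1:
        return [set(bins[0])]
    return [set(a) | set(b) for a, b in combinations(bins, 2)]


sizes = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 1, "f": 1}  # made-up sizes
q = 8
reducers = a2a_by_bin_pairs(sizes, q)
print(reducers)
```

With these sizes FFD produces x = 3 bins, so the schema uses x(x−1)/2 = 3 reducers.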

Heuristics for A2A Mapping Schema Problem
Case 2: One input, i, is of size w_i, where q/2 < w_i < q
– Uses the bin-packing-based algorithm
– Make bins of size q − w_i holding all the inputs except input i, and assign each such bin together with input i to a reducer; this pairs input i with every other input
– Separately, build a solution for all the inputs except input i, using bins of size q/2
[Figure: reducers (w_i, S_1), (w_i, S_2), …, (w_i, S_x), where the bins S_1, S_2, …, S_x have size q − w_i; reducers (S'_1, S'_2), (S'_1, S'_3), …, (S'_{y-1}, S'_y), where the bins S'_1, S'_2, …, S'_y have size q/2.]

[Figure: Heuristics for A2A Mapping Schema Problem. Input w_1 is placed with (w_2, w_3, w_4), with (w_5, w_6, w_7), …, and with (w_{m-2}, w_{m-1}, w_m). Each reducer can hold at most k inputs, so a reducer that holds w_1 has room for k − 1 of the other m − 1 inputs.]

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1)
"4 [?] OPTIMUM" recursive algorithm, when k is odd and k > 2
– Divide the m inputs into two sets, A (of y inputs) and B (of x inputs)
– Make y − 1 groups, each holding y/2 pairs
– Assign each input from B to one of the groups
– Perform the same operation on set B
[Figure: for y = 4, 3 groups (Group 1, Group 2, Group 3) of 2 pairs each.]

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1); reducer capacity q = 3
"4 [?] OPTIMUM" recursive algorithm, when k > 2 is odd
– m = 15 inputs, each of size q/3, so k = 3
– Set A = {1, 2, …, 8}, Set B = {9, 10, …, 15}
– Divide the inputs of the set A into two groups with an equal number of inputs
– Assign each row of every group to a reducer, and perform the same method on the set B
[Figure: Group 1 to Group 7.]
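
For k = 3 the recursive idea can be sketched as follows. This is one reading of the slide, not the authors' code: the y − 1 groups are built with the classic round-robin (circle) method so that each group pairs up all of A, group g receives the g-th input of B, and the same routine is then applied to B. The names round_robin and a2a_k3 and the rule for choosing y are assumptions made for this sketch.

```python
from itertools import combinations


def round_robin(items):
    """Circle method: n - 1 rounds, each a perfect matching of n items (n even)."""
    items = list(items)
    n = len(items)
    rounds = []
    for _ in range(n - 1):
        rounds.append([(items[i], items[n - 1 - i]) for i in range(n // 2)])
        items = [items[0], items[-1]] + items[1:-1]  # keep first fixed, rotate rest
    return rounds


def a2a_k3(inputs):
    """Reducers of at most 3 equal-sized inputs covering every pair of inputs."""
    inputs = list(inputs)
    m = len(inputs)
    if m < 2:
        return []
    if m <= 3:
        return [set(inputs)]
    y = (m + 2) // 2          # smallest y with m - y <= y - 1 ...
    if y % 2:
        y += 1                # ... rounded up to an even number
    set_a, set_b = inputs[:y], inputs[y:]
    reducers = []
    for g, group in enumerate(round_robin(set_a)):
        extra = {set_b[g]} if g < len(set_b) else set()
        for a1, a2 in group:  # each row (pair) of the group becomes a reducer
            reducers.append({a1, a2} | extra)
    return reducers + a2a_k3(set_b)


reducers = a2a_k3(range(1, 16))  # m = 15, as in the example
print(len(reducers))
```

For m = 15 this split gives A = {1, …, 8} and B = {9, …, 15}, matching the sets on the slide.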

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1); reducer capacity q = 5
"4 [?] OPTIMUM" recursive algorithm, when k > 2 is odd
– m = 23 inputs, each of size q/5, so k = 5
– Set A = {1, 2, …, 16}, Set B = {17, 18, …, 23}
– Assign each row of every group to a reducer, and perform the same method on the set B
[Figure: Group 1 to Group 7.]

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1); reducer capacity q = 4
"2 [?] OPTIMUM" recursive algorithm, when k is even
– Make 2m/k subgroups, and then make 2m/k − 1 groups
– Example: 16 inputs, each of size q/4; divide the inputs into 8 subgroups of 2 inputs each: {1, 2}, {3, 4}, {5, 6}, {7, 8}, {9, 10}, {11, 12}, {13, 14}, {15, 16}
– Works similarly to the q/3 case
[Figure: Group 1 to Group 7, each listing the eight subgroups in a different arrangement.]

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1), when k is a prime number
– Extends the approach of AU'13
– AU'13 provides a solution when m = k^2, where k is a prime number
– Create k + 1 teams, with k players (reducers) in each team
AU'13: Foto N. Afrati, Jeffrey D. Ullman: Matching bounds for the all-pairs MapReduce problem. IDEAS 2013: 3-4.

Heuristics for A2A Mapping Schema Problem
Case 3: All the input sizes are identical (q/k, k > 1)
Example: m = 3^2 = 9 inputs, k = 3, q = 3
Team 0: {a1, a2, a3}, {a4, a5, a6}, {a7, a8, a9}
Team 1: {a1, a5, a9}, {a4, a8, a3}, {a7, a2, a6}
Team 2: {a1, a8, a6}, {a4, a2, a9}, {a7, a5, a3}
Team 3: {a1, a4, a7}, {a2, a5, a8}, {a3, a6, a9}
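
A team layout of this kind can be generated for any prime k by placing the k^2 inputs in a k × k matrix and taking its rows plus the "lines" of every slope modulo k. The sketch below is a standard construction offered as an illustration; whether it is exactly the AU'13 procedure is an assumption, and teams_for_prime_k is a hypothetical name. Inputs are numbered 1 to k^2 row by row, so input 5 stands for a5.

```python
from collections import Counter
from itertools import combinations


def teams_for_prime_k(k):
    """Return k + 1 teams of k reducers each; every reducer holds k inputs."""
    def cell(row, col):
        return row * k + col + 1  # inputs numbered 1..k*k, row by row

    # Team 0: the rows of the matrix.
    teams = [[{cell(r, c) for c in range(k)} for r in range(k)]]
    # Teams 1..k-1: lines of slope 1..k-1; team k: slope 0 (the columns).
    for slope in list(range(1, k)) + [0]:
        teams.append([
            {cell(r, (start + slope * r) % k) for r in range(k)}
            for start in range(k)
        ])
    return teams


teams = teams_for_prime_k(3)
for number, team in enumerate(teams):
    print("Team", number, [sorted(reducer) for reducer in team])
```

Because k is prime, two inputs in different rows lie on exactly one line, so every pair of inputs meets in exactly one reducer and the schema uses k(k + 1) reducers in total.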

[Figure: Heuristics for A2A Mapping Schema Problem. Case 3: all the input sizes are identical (q/k, k > 1), when k is a prime number, q = 3. A matrix of columns C1 to C9 and the 12 reducers built from it, each holding three columns.]

X2Y Mapping Schema Problem
– Two disjoint sets X and Y are given
– Each pair of elements ⟨x_i, y_j⟩ (where x_i ∈ X, y_j ∈ Y, ∀ i, j) of the sets X and Y corresponds to one output
– Example: skew join
  – Two relations X(A, B) and Y(B, C) are given, where many tuples have a common “b” value
  – Every tuple of X and every tuple of Y with an identical “b” value must be assigned to at least one reducer in common

[Figure: X2Y Mapping Schema Problem. Relation X (attributes A, B) holds the tuples X(1, 2), X(5, 2), X(9, 2); relation Y (attributes B, C) holds Y(2, 5), Y(2, 4), Y(2, 7). There is one mapper per tuple, and a single reducer for key = 2 receives all six tuples. The reducer capacity is enough to hold all the tuples whose b = 2 together.]

[Figure: X2Y Mapping Schema Problem. The mappers for X(1, 2), X(5, 2), X(9, 2), Y(2, 5), Y(2, 4), and Y(2, 7) send their tuples, tagged with keys k1, k2, and k3, to three reducers (for k1, k2, and k3). The reducer capacity is enough to hold only some of the tuples of both relations together.]

X2Y Mapping Schema Problem
Input to the problem
– Two sets X and Y of m and n inputs, respectively
– A size for each input
– A set of reducers (r_1, r_2, …, r_z)
– A mapping from outputs to sets of inputs
– Identical reducer capacity q

X2Y Mapping Schema Problem
What to do?
– Assign each input of the set X with each input of the set Y to at least one reducer in common, without exceeding q
– Polynomial for one reducer: can we assign all the inputs of the sets X and Y to a single reducer?
– NP-hard for z > 1 reducers

[Figure: X2Y Mapping Schema Problem, the case of z = 2 reducers. s is the sum of the input sizes of the set X, and q = w_1' + s/2. Each of the 2 reducers holds the set Y together with one subset of X of total size s/2 (Subset 1 of X, Subset 2 of X).]

Heuristics for X2Y Mapping Schema Problem
Based on
– First-Fit Decreasing (FFD) or Best-Fit Decreasing (BFD) bin-packing algorithm
A fixed reducer capacity is given.

Heuristics for X2Y Mapping Schema Problem
Case 1: All the input sizes in the sets X and Y are upper bounded by q/2
– Neither set contains an input of size greater than q/2
– Use FFD to create
  – u bins of size at most q/2 from the inputs of X
  – v bins of size at most q/2 from the inputs of Y
– Use uv reducers, one for each pair of an X bin and a Y bin
[Figure: reducers (u_1, v_1), (u_1, v_2), …, (u_1, v_v), (u_2, v_1), (u_2, v_2), …, (u_2, v_v), …, (u_u, v_1), (u_u, v_2), …, (u_u, v_v).]

Heuristics for X2Y Mapping Schema Problem
Case 2: The inputs of one of the sets are of size at most w, where q/2 < w < q
– Say the inputs of the set X are of size at most w
– Hence, the inputs of the set Y are of size at most q − w
– Use FFD to create
  – u bins of size at most w from the inputs of X
  – v bins of size at most q − w from the inputs of Y
– Use uv reducers, one for each pair of an X bin and a Y bin
[Figure: reducers (u_1, v_1), (u_1, v_2), …, (u_1, v_v), (u_2, v_1), (u_2, v_2), …, (u_2, v_v), …, (u_u, v_1), (u_u, v_2), …, (u_u, v_v).]
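
Both X2Y cases follow the same pattern: pack X and Y separately, then give every (X bin, Y bin) pair its own reducer. The sketch below illustrates this with made-up sizes; ffd_bins and x2y_schema are hypothetical names, and the parameter x_bin (q/2 for Case 1, w for Case 2) is an assumption introduced to cover both cases with one function.

```python
def ffd_bins(sizes, bin_size):
    """First-Fit Decreasing: pack named inputs into bins of capacity bin_size."""
    bins = []  # each entry is [remaining capacity, list of input names]
    for name in sorted(sizes, key=sizes.get, reverse=True):
        if sizes[name] > bin_size:
            raise ValueError(f"input {name} does not fit in a bin")
        for entry in bins:
            if entry[0] >= sizes[name]:
                entry[0] -= sizes[name]
                entry[1].append(name)
                break
        else:
            bins.append([bin_size - sizes[name], [name]])
    return [members for _, members in bins]


def x2y_schema(x_sizes, y_sizes, q, x_bin=None):
    """Return u*v reducers; X bins have size x_bin, Y bins have size q - x_bin."""
    if x_bin is None:
        x_bin = q / 2                      # Case 1: both sets use bins of q/2
    u_bins = ffd_bins(x_sizes, x_bin)
    v_bins = ffd_bins(y_sizes, q - x_bin)  # Case 2: Y gets the remaining q - w
    return [set(u) | set(v) for u in u_bins for v in v_bins]


q = 8
case1 = x2y_schema({"x1": 2, "x2": 2, "x3": 1}, {"y1": 3, "y2": 1, "y3": 1}, q)
case2 = x2y_schema({"x1": 6, "x2": 5}, {"y1": 2, "y2": 1, "y3": 1}, q, x_bin=6)
print(len(case1), len(case2))
```

In both calls FFD yields u = 2 and v = 2, so each schema uses uv = 4 reducers.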

Conclusion
– Reducer capacity
  – An important parameter to be considered in all MapReduce algorithms
  – The capacity is in terms of (not necessarily identical) auxiliary memory size, augmented and added to the index of the data item(s)
– Two assignment schemas of MapReduce are given
  – All-to-All (A2A) mapping schema problem
  – X-to-Y (X2Y) mapping schema problem
– Several heuristics for the A2A and X2Y mapping schema problems are provided

Foto Afrati 1, Shlomi Dolev 2, Ephraim Korach 3, Shantanu Sharma 2, and Jeffrey D. Ullman 4 1 School of Electrical and Computing Engineering, National Technical University of Athens, Greece 2 Department of Computer Science, Ben-Gurion University of the Negev, Israel 3 Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel 4 Department of Computer Science, Stanford University, USA Presentation is available at