Assignment Problems of Different-Sized Inputs in MapReduce

Assignment Problems of Different-Sized Inputs in MapReduce
Foto N. Afrati (1), Shlomi Dolev (2), Ephraim Korach (2), Shantanu Sharma (2), and Jeffrey D. Ullman (3)
(1) National Technical University of Athens, Greece
(2) Ben-Gurion University of the Negev, Israel
(3) Stanford University, USA

Outline
  Introduction
  Problem Statement and Our Contribution
  All-to-All (A2A) Mapping Schema Problem
  Heuristics for A2A Mapping Schema Problem
  X-to-Y (X2Y) Mapping Schema Problem
  Heuristics for X2Y Mapping Schema Problem
  Conclusion

Introduction
  Inputs and outputs
  Reducer capacity
  Mapping schema
  State-of-the-art

Introduction: inputs and outputs in our context
Figure: word counting with two mappers and one reducer per word.
Inputs: Mapper 1 reads "I like apple. Apple is fruit."; Mapper 2 reads "I like banana."
Mapper 1 emits (I, 1), (like, 1), (apple, 2), (is, 1), (fruit, 1); Mapper 2 emits (I, 1), (like, 1), (banana, 1).
Reducers: one each for I, like, apple, is, fruit, and banana.
Outputs: (I, 2), (like, 2), (apple, 2), (is, 1), (fruit, 1), (banana, 1).
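The word-count flow on this slide can be sketched in a few lines of Python. This is only an illustrative, single-process simulation: the function names `map_words` and `word_count` are made up here, words are lower-cased for grouping, and each mapper emits a 1 per occurrence instead of pre-summing "apple" to 2 as the figure does.

```python
from collections import defaultdict

def map_words(text):
    """Mapper (hypothetical helper): emit (word, 1) per word, ignoring case and full stops."""
    for word in text.replace(".", " ").split():
        yield word.lower(), 1

def word_count(documents):
    # Shuffle: group the emitted values by key; each key stands for one reducer.
    groups = defaultdict(list)
    for text in documents:
        for key, value in map_words(text):
            groups[key].append(value)
    # Reduce: each reducer sums the values it received for its key.
    return {key: sum(values) for key, values in groups.items()}

counts = word_count(["I like apple. Apple is fruit.", "I like banana."])
print(counts)
```

Here the inputs to a reducer are the values grouped under its key, and the output is one (word, count) pair per reducer.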

Reducer Capacity (q)
Values provided by each mapper have sizes (input sizes).
Machines have bounded memory.
Reducer capacity: an upper bound on the sum of the sizes of the values that are assigned to a reducer.
Example: the reducer capacity may be the size of the main memory of the processors on which the reducers run.
We consider two special matching problems.

Mapping Schema
A mapping schema is an assignment of the set of inputs to some given reducers that satisfies two constraints.
Respect the reducer capacity: a reducer is assigned only inputs whose sizes sum to at most the reducer capacity.
Assignment of inputs: for every output, its two corresponding inputs must be assigned to at least one reducer in common.
Figure: inputs M1 (1GB), M2 (2GB), and M3 (2GB) are assigned to reducers of capacity 4GB. All three together (5GB) exceed one reducer, so each input is copied to more than one reducer.
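The two constraints of a mapping schema can be checked mechanically. The sketch below assumes the all-pairs setting (every two inputs correspond to an output) and uses a hypothetical helper name, `is_valid_a2a_schema`; the sizes are the M1, M2, M3 values from the figure.

```python
from itertools import combinations

def is_valid_a2a_schema(sizes, reducers, q):
    """sizes: dict input name -> size; reducers: list of sets of input names."""
    # Constraint 1: no reducer receives more than q in total.
    if any(sum(sizes[i] for i in r) > q for r in reducers):
        return False
    # Constraint 2: every pair of inputs meets in at least one reducer.
    return all(any(a in r and b in r for r in reducers)
               for a, b in combinations(sizes, 2))

sizes = {"M1": 1, "M2": 2, "M3": 2}  # sizes in GB
q = 4                                # reducer capacity in GB

one_reducer = [{"M1", "M2", "M3"}]   # 5GB in one reducer: too large
three_reducers = [{"M1", "M2"}, {"M1", "M3"}, {"M2", "M3"}]

print(is_valid_a2a_schema(sizes, one_reducer, q))
print(is_valid_a2a_schema(sizes, three_reducers, q))
```

Sending one pair per reducer is always valid when every pair fits, but it copies each input many times; the rest of the talk is about doing better.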

State-of-the-Art
F. Afrati, A. D. Sarma, S. Salihoglu, and J. D. Ullman, "Upper and Lower Bounds on the Cost of a Map-Reduce Computation," PVLDB, 2013.
Unit input size.
Reducer size: the maximum number of inputs that a given reducer can have.
Mapping schema: respect the reducer capacity; assignment of inputs.

Problem Statement
Communication cost between the map and the reduce phases is a significant factor.
How can we reduce the communication cost? Use fewer reducers, and hence incur a smaller communication cost.
How do we minimize the total number of reducers while respecting their limited capacity? This is not an easy task.
Two problems are considered: the All-to-All mapping schema problem and the X-to-Y mapping schema problem.
Figure (notation: ki is a key): three mappers, one per input. In one assignment, input1 is sent with keys k1 and k2, input2 with k1 and k3, and input3 with k2 and k3, so three reducers are used: k1 (inputs 1, 2), k2 (inputs 1, 3), and k3 (inputs 2, 3). In the other assignment, all three inputs are sent with key k1 to a single reducer, k1 (inputs 1, 2, 3).

Our Contribution
Reducer capacity is an important parameter to be considered in MapReduce algorithms.
Inputs do not necessarily have identical sizes.
We try to decrease the communication cost.
Two types of mapping schema problems: the All-to-All (A2A) mapping schema problem and the X-to-Y (X2Y) mapping schema problem.
Lower and upper bounds on the communication cost.

A2A Mapping Schema Problem
A set of inputs is given, and each pair of inputs corresponds to one output.
Example: computing common friends.
  Lists of friends of m persons are given.
  Find the common friends of the given m persons.
  Every two friend lists must be assigned to a common reducer.

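A2A Mapping Schema Problem: example
Inputs: w1 = w2 = w3 = 0.20q, w4 = w5 = 0.19q, w6 = w7 = 0.18q.
One way: 3 reducers and optimum communication cost.
  Reducer 1: w1, w2, w3, w4, w7
  Reducer 2: w1, w2, w5, w6, w7
  Reducer 3: w3, w4, w5, w6, w7
Another way: group the inputs so that the size of each group is no more than q/2, then assign every pair of groups to one reducer. This gives 6 reducers and non-optimum communication cost.
  Groups: {w1, w2}, {w3, w4}, {w5, w6}, {w7}
  Reducer 1: w1, w2 | w3, w4
  Reducer 2: w1, w2 | w5, w6
  Reducer 3: w1, w2 | w7
  Reducer 4: w3, w4 | w5, w6
  Reducer 5: w3, w4 | w7
  Reducer 6: w5, w6 | w7
  0.22q is misused.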
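The two assignments in this example can be checked and compared numerically. The sketch below measures sizes in hundredths of q to avoid floating-point noise, and it assumes that communication cost means the total size of all input copies sent to reducers; the helper names are illustrative.

```python
from itertools import combinations

Q = 100  # reducer capacity q, in hundredths of q
W = {1: 20, 2: 20, 3: 20, 4: 19, 5: 19, 6: 18, 7: 18}  # w1..w7

def loads(reducers):
    """Total input size received by each reducer."""
    return [sum(W[i] for i in r) for r in reducers]

def covers_all_pairs(reducers):
    """True if every two inputs share at least one reducer."""
    return all(any(a in r and b in r for r in reducers)
               for a, b in combinations(W, 2))

# One way: three hand-built reducers.
first_way = [{1, 2, 3, 4, 7}, {1, 2, 5, 6, 7}, {3, 4, 5, 6, 7}]

# Another way: groups of size at most q/2, one reducer per pair of groups.
groups = [{1, 2}, {3, 4}, {5, 6}, {7}]
second_way = [a | b for a, b in combinations(groups, 2)]

for name, reducers in [("first", first_way), ("second", second_way)]:
    print(name, len(reducers), max(loads(reducers)) <= Q,
          covers_all_pairs(reducers), sum(loads(reducers)))
```

Under that cost assumption, the first assignment moves 2.86q of data in total and the second 4.02q, since the grouping scheme copies every input to three reducers.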

A2A Mapping Schema Problem: what to do?
Assign the given m inputs to the given number of reducers, without exceeding q, such that every input is coupled with every other input in at least one reducer in common.
Polynomial-time solution for one and two reducers.
NP-hard for z > 2 reducers, by reduction from the z-partition problem.

Heuristics for A2A Mapping Schema Problem
Two cases:
  All the input sizes are upper bounded by q/2.
  Exactly one input has size wi > q/2.

Heuristics for A2A Mapping Schema Problem
Case 1: the input sizes are different, and each is at most q/2.
Use a bin-packing algorithm to place the inputs w1, w2, ..., wm into x bins (S1, S2, ..., Sx), each of size at most q/2.
Use x(x-1)/2 reducers to assign each bin with each other bin.
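A minimal sketch of Case 1 follows. The slide only says "a bin-packing algorithm"; First-Fit Decreasing is used here as one common choice, not necessarily the one the authors use, and the function names are hypothetical. Sizes are in hundredths of q.

```python
from itertools import combinations

def first_fit_decreasing(sizes, capacity):
    """Pack input indices into bins of the given capacity (FFD heuristic)."""
    bins = []  # each bin: [remaining capacity, list of input indices]
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        if sizes[i] > capacity:
            raise ValueError(f"input {i} does not fit in a bin")
        for b in bins:
            if b[0] >= sizes[i]:
                b[0] -= sizes[i]
                b[1].append(i)
                break
        else:
            bins.append([capacity - sizes[i], [i]])
    return [members for _, members in bins]

def a2a_schema(sizes, q):
    """Bins of size at most q/2; one reducer per pair of bins."""
    bins = first_fit_decreasing(sizes, q / 2)
    if len(bins) == 1:
        return [bins[0]]  # everything fits in a single reducer
    return [a + b for a, b in combinations(bins, 2)]

sizes = [20, 20, 20, 19, 19, 18, 18]
reducers = a2a_schema(sizes, 100)
print(len(reducers), reducers)
```

Two bins of size at most q/2 always fit together in one reducer, and any two inputs either share a bin or sit in two bins that are paired somewhere, which is why x(x-1)/2 reducers suffice.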

Heuristics for A2A Mapping Schema Problem
Case 2: one input, i, is of size wi, where q/2 < wi < q. Hence, all the remaining inputs must be of size at most q - wi.
The heuristic builds on the bin-packing-based algorithm and has two steps.
Step 1: pack all the inputs except input i into x bins (S1, S2, ..., Sx) of size at most q - wi, and assign each bin together with input i to one reducer.
Step 2: solve the problem for all the inputs except input i: pack them into y bins (S'1, S'2, ..., S'y) of size at most q/2, and assign each pair of bins to one reducer.
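The two steps of Case 2 can be sketched as follows. As before, First-Fit Decreasing stands in for the unspecified bin-packing algorithm, the names `pack` and `a2a_with_one_big_input` are invented for this illustration, and sizes are in hundredths of q.

```python
from itertools import combinations

def pack(items, sizes, capacity):
    """First-Fit Decreasing over the given input indices."""
    bins = []  # each bin: [remaining capacity, list of input indices]
    for i in sorted(items, key=lambda i: -sizes[i]):
        if sizes[i] > capacity:
            raise ValueError(f"input {i} does not fit in a bin")
        for b in bins:
            if b[0] >= sizes[i]:
                b[0] -= sizes[i]
                b[1].append(i)
                break
        else:
            bins.append([capacity - sizes[i], [i]])
    return [members for _, members in bins]

def a2a_with_one_big_input(sizes, q):
    big = max(range(len(sizes)), key=lambda i: sizes[i])
    rest = [i for i in range(len(sizes)) if i != big]
    # Step 1: bins of size at most q - w_big, each sent with the big input.
    reducers = [[big] + b for b in pack(rest, sizes, q - sizes[big])]
    # Step 2: pair up the remaining inputs using bins of size at most q/2.
    small_bins = pack(rest, sizes, q / 2)
    if len(small_bins) == 1:
        reducers.append(small_bins[0])
    else:
        reducers += [a + b for a, b in combinations(small_bins, 2)]
    return reducers

sizes = [60, 20, 15, 15, 10]  # input 0 is larger than q/2
reducers = a2a_with_one_big_input(sizes, 100)
print(reducers)
```

Step 1 pairs the large input with every other input, and Step 2 pairs the other inputs among themselves, so together the reducers cover all pairs.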

X2Y Mapping Schema Problem
Two disjoint sets X and Y are given.
Each pair of elements (xi, yj), where xi ∈ X and yj ∈ Y for all i, j, corresponds to one output.
Example: skew join.
  Two relations X(A, B) and Y(B, C) are given, where many tuples have a common "b" value.
  Every pair of tuples from X and Y with an identical "b" value must be assigned to at least one reducer in common.

X2Y Mapping Schema Problem: example
Inputs of set X: w1 = w2 = 0.25q, w3 = w4 = 0.24q, w5 = w6 = 0.23q, w7 = w8 = 0.22q, w9 = w10 = 0.21q, w11 = w12 = 0.20q.
Inputs of set Y: w'1 = w'2 = 0.25q, w'3 = w'4 = 0.24q.
One way: group the inputs so that the size of each group is no more than q/2; 12 reducers.
  X groups: {w1, w2}, {w3, w4}, {w5, w6}, {w7, w8}, {w9, w10}, {w11, w12}
  Y groups: {w'1, w'2}, {w'3, w'4}
  Each X group is assigned with each Y group to one reducer.
Another way: make groups by taking three inputs from X; 16 reducers.
  X groups: {w1, w2, w3}, {w4, w5, w6}, {w7, w8, w9}, {w10, w11, w12}
  Each X group is assigned with each single input of Y (w'1, w'2, w'3, w'4) to one reducer.
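Both groupings in this example can be verified with a short script. Sizes are in hundredths of q, the helper names are illustrative, and communication cost is again assumed to be the total size of all input copies sent to reducers.

```python
from itertools import product

Q = 100  # reducer capacity q, in hundredths of q
X = {f"w{i}": s for i, s in
     zip(range(1, 13), [25, 25, 24, 24, 23, 23, 22, 22, 21, 21, 20, 20])}
Y = {f"w'{i}": s for i, s in zip(range(1, 5), [25, 25, 24, 24])}
SIZE = {**X, **Y}

def chunk(names, k):
    """Split names into consecutive groups of k."""
    names = list(names)
    return [names[i:i + k] for i in range(0, len(names), k)]

def schema(x_groups, y_groups):
    """One reducer per (X group, Y group) pair."""
    return [set(xg) | set(yg) for xg, yg in product(x_groups, y_groups)]

def check(reducers):
    fits = all(sum(SIZE[n] for n in r) <= Q for r in reducers)
    covered = all(any(x in r and y in r for r in reducers)
                  for x, y in product(X, Y))
    cost = sum(SIZE[n] for r in reducers for n in r)
    return fits, covered, cost

one_way = schema(chunk(X, 2), chunk(Y, 2))      # 6 x 2 reducers
another_way = schema(chunk(X, 3), chunk(Y, 1))  # 4 x 4 reducers
print(len(one_way), check(one_way))
print(len(another_way), check(another_way))
```

Under the stated cost assumption, the 12-reducer assignment moves 11.28q of data and the 16-reducer assignment moves 14.72q.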

X2Y Mapping Schema Problem: what to do?
Assign each input of the set X with each input of the set Y to at least one reducer in common, without exceeding q.
Polynomial-time solution for one reducer: can we assign all the inputs of the sets X and Y to a single reducer?
NP-hard for z > 1 reducers, by reduction from the z-partition problem.

Heuristics for X2Y Mapping Schema Problem
Based on a bin-packing algorithm.
Inputs of one of the sets are of size at most w, where q/2 < w < q.
  Suppose the inputs of the set X are of sizes at most w.
  Hence, the inputs of the set Y are of sizes at most q - w.
Use a bin-packing algorithm to create u bins of size at most w from the inputs of X and v bins of size at most q - w from the inputs of Y.
Assign each of the u bins of X with each of the v bins of Y, using uv reducers.
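The X2Y heuristic can be sketched in the same style as the A2A one. First-Fit Decreasing is an assumed stand-in for the bin-packing algorithm, the bound w is passed in explicitly, and `x2y_schema` is a hypothetical name; sizes are in hundredths of q.

```python
from itertools import product

def first_fit_decreasing(sizes, capacity):
    """Pack input indices into bins of the given capacity (FFD heuristic)."""
    bins = []  # each bin: [remaining capacity, list of input indices]
    for i in sorted(range(len(sizes)), key=lambda i: -sizes[i]):
        if sizes[i] > capacity:
            raise ValueError(f"input {i} does not fit in a bin")
        for b in bins:
            if b[0] >= sizes[i]:
                b[0] -= sizes[i]
                b[1].append(i)
                break
        else:
            bins.append([capacity - sizes[i], [i]])
    return [members for _, members in bins]

def x2y_schema(x_sizes, y_sizes, q, w):
    """Return reducers as (X bin, Y bin) pairs of input indices."""
    x_bins = first_fit_decreasing(x_sizes, w)      # u bins of size <= w
    y_bins = first_fit_decreasing(y_sizes, q - w)  # v bins of size <= q - w
    return list(product(x_bins, y_bins))           # u * v reducers

x_sizes = [60, 30, 30, 20]
y_sizes = [20, 20, 10, 10, 5]
reducers = x2y_schema(x_sizes, y_sizes, q=100, w=60)
print(len(reducers), reducers)
```

An X bin and a Y bin together hold at most w + (q - w) = q, so every reducer respects the capacity, and every (x, y) pair meets at the reducer formed by their two bins.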

Conclusion
Reducer capacity is an important parameter to be considered in MapReduce algorithms.
Inputs do not necessarily have identical sizes.
Reducer capacity bounds the sum of the sizes of the inputs assigned to a reducer.
Two assignment schemas for MapReduce are given: the All-to-All (A2A) mapping schema problem and the X-to-Y (X2Y) mapping schema problem.
Lower and upper bounds on the communication cost.

Presentation is available at http://www.cs.bgu.ac.il/~sharmas/publication.html
Foto Afrati (1), Shlomi Dolev (2), Ephraim Korach (3), Shantanu Sharma (2), and Jeffrey D. Ullman (4)
(1) School of Electrical and Computing Engineering, National Technical University of Athens, Greece, afrati@softlab.ece.ntua.gr
(2) Department of Computer Science, Ben-Gurion University of the Negev, Israel, {dolev,sharmas}@cs.bgu.ac.il
(3) Department of Industrial Engineering and Management, Ben-Gurion University of the Negev, Israel, korach@bgu.ac.il
(4) Department of Computer Science, Stanford University, USA, ullman@cs.stanford.edu