Towards an Understanding of the Limits of Map-Reduce Computation

Foto Afrati — National Technical University of Athens
Anish Das Sarma — Google Research
Semih Salihoglu — Stanford University
Jeff Ullman — Stanford University

Tradeoff Between Per-Reducer-Memory and Communication Cost

Drug-interaction example: every pair of drugs must be examined together, so the map phase emits one reduce key per pair of drugs, carrying both drugs' patient data as values:

key                 | values
{drug 1, drug 2}    | Patients 1, Patients 2
{drug 1, drug 3}    | Patients 1, Patients 3
...                 | ...
{drug 1, drug n}    | Patients 1, Patients n
...                 | ...
{drug n-1, drug n}  | Patients n-1, Patients n

q = per-reducer-memory cost; r = communication cost. With 6500 drugs, 6500 * 6499 > 40M reduce keys.
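A minimal Python sketch of this naive map phase (the in-memory `patients` dict and the function name are illustrative assumptions, not from the talk):

```python
from itertools import combinations

def map_drug_pairs(patients):
    """Naive map phase for the drug-interaction example.

    patients: dict mapping drug id -> list of patient records
    (a hypothetical in-memory stand-in for the real input files).
    Emits one key-value pair per unordered pair of drugs, so each
    drug's patient list is replicated (n - 1) times.
    """
    for i, j in combinations(sorted(patients), 2):
        yield (i, j), (patients[i], patients[j])

demo = {"d1": ["p1", "p7"], "d2": ["p3"], "d3": ["p2", "p5"]}
for key, values in map_drug_pairs(demo):
    print(key, values)

# The slide counts ordered pairs: 6500 * 6499 = 42,243,500 (> 40M) reduce
# keys; with unordered pairs as here it is half that. Either way, the
# communication cost is about (n - 1) times the input size, which is why
# trading per-reducer memory against communication matters.
n = 6500
print(n * (n - 1))  # 42243500
```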

Possible Per-Reducer-Memory/Communication Cost Tradeoffs

[Plot: communication cost r (1TB, 10TB, 100TB) against per-reducer memory q, with q marked at 1GB (EC2 small inst.), 4GB (EC2 med inst.), 7GB (EC2 large inst.), and 60GB (EC2 x-large inst.).]

Similarity Join

Example (1). Input: R(A, B) with Domain(B) = [1, 10]. Compute all pairs of tuples (t, u) such that |t[B] - u[B]| ≤ 1.

Input:
A  | B
a1 | 5
a2 | 2
a3 | 6
a4 | 2
a5 | 7

Output: ((a1, 5), (a3, 6)), ((a2, 2), (a4, 2)), ((a3, 6), (a5, 7))

Hashing Algorithm [ADMPU ICDE '12]

Example (2). Split Domain(B) into k ranges of values, one per reducer (k reducers). Here k = 2: Reducer 1 covers [1, 5] and gets (a1, 5), (a2, 2), (a4, 2); Reducer 2 covers [6, 10] and gets (a3, 6), (a5, 7). Tuples on a range boundary are replicated to the neighboring reducer, so (a1, 5), with t.B = 5, is also sent to Reducer 2.

Per-reducer-memory cost q = 3, communication cost r = 6.

Example (3). k = 5: Reducer 1 covers [1, 2], Reducer 2 covers [3, 4], Reducer 3 covers [5, 6], Reducer 4 covers [7, 8], Reducer 5 covers [9, 10]. Replicate a tuple if t.B = 2, 4, 6, or 8.

Per-reducer-memory cost q = 2, communication cost r = 8.
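A small simulation of this range-partitioning scheme (a sketch assuming the five-tuple input above; the function name and equal-width ranges are my assumptions) reproduces both trade-off points:

```python
def simulate(tuples, k, lo=1, hi=10):
    """Range-partition Domain(B) = [lo, hi] into k equal ranges, one reducer each.

    A tuple on the upper boundary of a range (other than the last) is
    replicated to the next reducer: a pair with |t.B - u.B| <= 1 either
    falls in one range or straddles a boundary, where the replica makes
    the two tuples meet.
    Returns (q, r) = (max inputs at any reducer, total key-value pairs sent).
    """
    width = (hi - lo + 1) // k
    reducers = [[] for _ in range(k)]
    for t in tuples:
        b = t[1]
        i = (b - lo) // width          # home reducer for this tuple
        reducers[i].append(t)
        if i + 1 < k and b == lo + (i + 1) * width - 1:
            reducers[i + 1].append(t)  # boundary replica
    q = max(len(rd) for rd in reducers)
    r = sum(len(rd) for rd in reducers)
    return q, r

R = [("a1", 5), ("a2", 2), ("a3", 6), ("a4", 2), ("a5", 7)]
print(simulate(R, k=2))  # (3, 6): fewer reducers, more memory per reducer
print(simulate(R, k=5))  # (2, 8): less memory per reducer, more communication
```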

Same Tradeoff in Other Algorithms

- Finding subgraphs ([SV] WWW '11, [AFU] Tech Report '12)
- Computing minimum spanning trees ([KSV] SODA '10)
- Other similarity joins:
  - Set-similarity joins ([VCL] SIGMOD '10)
  - Hamming distance ([ADMPU] ICDE '12, and later in the talk)

Our Goals

A general framework for studying the memory/communication tradeoff, applicable to a variety of problems.

Question 1: What is the minimum communication for any MapReduce algorithm, if each reducer uses at most q memory?

Question 2: Are there algorithms that achieve this lower bound?

Remainder of Talk

- Input-Output Model
- Mapping Schemas & Replication Rate
- Hamming Distance 1
- Other Results

Input-Output Model

Input data elements I: {i1, i2, …, in}
Output elements O: {o1, o2, …, om}
Dependency = provenance: each output depends on a specific set of inputs, from which it can be computed.

Example 1: R(A, B) ⋈ S(B, C)

|Domain(A)| = 10, |Domain(B)| = 20, |Domain(C)| = 40.

Inputs: all possible tuples of R(A, B), from (a1, b1) to (a10, b20), and of S(B, C), from (b1, c1) to (b20, c40): 10*20 + 20*40 = 1000 input elements.

Outputs: all possible join results, from (a1, b1, c1) to (a10, b20, c40): 10*20*40 = 8000 output elements.

Example 2: Finding Triangles

Graphs G(V, E) of n vertices {v1, …, vn}.

Inputs: the possible edges (v1, v2), (v1, v3), …, (vn-1, vn): n-choose-2 input data elements.

Outputs: the possible triangles (v1, v2, v3), …, (vn-2, vn-1, vn): n-choose-3 output elements.

Mapping Schema & Replication Rate

p reducers: {R1, R2, …, Rp}; q = max # inputs sent to any reducer Ri.

Def (Mapping Schema): an assignment M of each input in I to one or more of {R1, R2, …, Rp} such that each Ri receives at most qi ≤ q inputs, and every output is covered by some reducer (i.e., some reducer receives every input that the output depends on).

Def (Replication Rate): r = (Σi qi) / |I|.

q captures the memory cost; r captures the communication cost.
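To make the definitions concrete, here is a small checker (mine, not the authors'): given each output's dependency set and a proposed mapping schema, it verifies coverage and reports q and r:

```python
def check_schema(inputs, outputs, assignment):
    """Verify a mapping schema and report (q, r).

    inputs:     list of input element ids
    outputs:    list of sets, one per output, each the set of inputs
                that output depends on
    assignment: dict mapping each input id -> set of reducer ids
                (an input may be replicated to several reducers)
    """
    # q_i = number of inputs each reducer receives
    loads = {}
    for i in inputs:
        for red in assignment[i]:
            loads[red] = loads.get(red, 0) + 1
    # every output must have all the inputs it depends on at some reducer
    for dep in outputs:
        if not set.intersection(*[assignment[i] for i in dep]):
            raise ValueError(f"output depending on {dep} is not covered")
    q = max(loads.values())
    r = sum(loads.values()) / len(inputs)
    return q, r

# Toy instance: 4 inputs, outputs = all pairs, four reducers.
inputs = ["i1", "i2", "i3", "i4"]
outputs = [{a, b} for a in inputs for b in inputs if a < b]
assignment = {"i1": {0, 1}, "i2": {0, 2}, "i3": {0, 3}, "i4": {1, 2, 3}}
print(check_schema(inputs, outputs, assignment))  # (3, 2.25)
```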

Our Questions Again

Question 1: What is the minimum replication rate of any mapping schema, as a function of q (the maximum # inputs sent to any reducer)?

Question 2: Are there mapping schemas that match this lower bound?

Hamming Distance = 1

Inputs: the 2^b bit strings of length b (00…00, 00…01, 00…10, …, 11…10, 11…11); |I| = 2^b.

Outputs: all pairs of strings at Hamming distance 1; |O| = b * 2^b / 2.

Each input contributes to b outputs; each output depends on exactly 2 inputs.

Lower Bound on Replication Rate (HD = 1)

The key is an upper bound g(q) on the max # outputs a reducer can cover with ≤ q inputs.

Claim: g(q) ≤ (q/2) log2(q) (proof by induction on b).

All outputs must be covered: Σi g(qi) ≥ |O|, i.e., Σi (qi/2) log2(qi) ≥ b * 2^b / 2.

Recall r = (Σi qi) / 2^b. Since log2(qi) ≤ log2(q), this gives r ≥ b / log2(q).
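The algebra of the last two steps, written out (a sketch using the slide's quantities and the claimed bound g(q) = (q/2) log2 q):

```latex
% Reducer i receives q_i <= q inputs and covers at most g(q_i) outputs.
% Covering all |O| = b 2^b / 2 outputs:
\sum_i \frac{q_i}{2}\log_2 q_i \;\ge\; \frac{b\,2^b}{2}
% Since q_i <= q, we may replace \log_2 q_i by the larger \log_2 q:
\frac{\log_2 q}{2}\sum_i q_i \;\ge\; \frac{b\,2^b}{2}
\quad\Longrightarrow\quad
\sum_i q_i \;\ge\; \frac{b\,2^b}{\log_2 q}
% Dividing by |I| = 2^b gives the replication-rate bound:
r \;=\; \frac{\sum_i q_i}{2^b} \;\ge\; \frac{b}{\log_2 q}
```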

Memory/Communication Cost Tradeoff (HD = 1)

r ≥ b / log2(q)

[Plot: replication rate r against q = max # inputs to each reducer. Endpoints: q = 2, r = b (one reducer for each output) and q = 2^b, r = 1 (all inputs to one reducer).]

How about other points?

Splitting Algorithm for HD 1 (q = 2^{b/2})

Use 2^{b/2} + 2^{b/2} reducers: a Prefix reducer for each bit string w of length b/2, receiving all inputs whose first b/2 bits equal w, and a Suffix reducer for each w, receiving all inputs whose last b/2 bits equal w. Two strings at Hamming distance 1 differ in exactly one bit: if that bit is in the suffix, they share a prefix and meet at a Prefix reducer; if it is in the prefix, they share a suffix and meet at a Suffix reducer.

r = 2, q = 2^{b/2}.
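A brute-force check of the splitting schema for a small even b (my sketch; it builds the prefix/suffix assignment and confirms every Hamming-distance-1 pair meets at some reducer):

```python
from itertools import product

def splitting_schema(b):
    """Assign each b-bit string to its prefix reducer and suffix reducer."""
    half = b // 2
    assignment = {}
    for bits in product("01", repeat=b):
        s = "".join(bits)
        assignment[s] = {("P", s[:half]), ("S", s[half:])}
    return assignment

b = 6  # assumes b even, so prefix and suffix reducers have equal loads
schema = splitting_schema(b)
# Every string is sent to exactly 2 reducers => r = 2.
assert all(len(reds) == 2 for reds in schema.values())
# Each reducer receives 2^(b/2) inputs => q = 2^(b/2).
loads = {}
for reds in schema.values():
    for red in reds:
        loads[red] = loads.get(red, 0) + 1
assert set(loads.values()) == {2 ** (b // 2)}
# Every pair at Hamming distance 1 shares a reducer (output is covered).
for s in schema:
    for i in range(b):
        t = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
        assert schema[s] & schema[t], (s, t)
print("all HD-1 pairs covered; q =", 2 ** (b // 2), "r = 2")
```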

Where We Stand for HD = 1

[Plot: r = replication rate against q = max # inputs to each reducer. Points achieved: one reducer for each output (q = 2, r = b), Splitting (q = 2^{b/2}, r = 2), all inputs to one reducer (q = 2^b, r = 1); Generalized Splitting covers the points in between.]

General Method for Using Our Framework

1. Represent problem P in terms of I, O, and dependencies.
2. Lower bound for r as a function of q:
   i. Upper bound g(q): max # outputs covered by q inputs.
   ii. All outputs must be covered: Σi g(qi) ≥ |O|.
   iii. Manipulate (ii) to get r = (Σi qi) / |I| as a function of q.
3. Demonstrate algorithms/mapping schemas that match the lower bound.
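As a worked instance of steps 2.i-2.iii for triangle finding (a sketch; it assumes the standard bound that a graph with q edges contains at most (2q)^{3/2}/6 triangles, playing the role of g(q)):

```latex
% Step 2.i: a reducer receiving q_i of the \binom{n}{2} input edges
% covers at most g(q_i) = (2 q_i)^{3/2} / 6 triangle outputs.
% Step 2.ii: covering all \binom{n}{3} \approx n^3 / 6 outputs:
\sum_i \frac{(2 q_i)^{3/2}}{6} \;\ge\; \frac{n^3}{6}
% Step 2.iii: bound (2 q_i)^{1/2} by (2q)^{1/2}:
\sqrt{2q}\,\sum_i q_i \;\ge\; \frac{n^3}{2}
\quad\Longrightarrow\quad
r \;=\; \frac{\sum_i q_i}{n^2/2} \;\ge\; \frac{n}{\sqrt{2q}}
```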

Other Results

Finding triangles in G(V, E) with n vertices:
- Lower bound: r ≥ n / √(2q)
- Matching algorithms

Multiway self-joins: R(A11, …, A1k) ⋈ R(A21, …, A2k) ⋈ … ⋈ R(At1, …, Atk), with k = # columns, n = |Ai|, joined t times on i columns:
- Lower bound & algorithms

Hamming distance ≤ d:
- Algorithms with r ≤ d + 1

Related Work

- Efficient Parallel Set-Similarity Joins Using MapReduce (Vernica, Carey, Li; SIGMOD '10)
- Processing Theta-Joins Using MapReduce (Okcan, Riedewald; SIGMOD '11)
- Fuzzy Joins Using MapReduce (Afrati, Das Sarma, Menestrina, Parameswaran, Ullman; ICDE '12)
- Optimizing Joins in a MapReduce Environment (Afrati, Ullman; EDBT '10)
- Counting Triangles and the Curse of the Last Reducer (Suri, Vassilvitskii; WWW '11)
- Enumerating Subgraph Instances Using MapReduce (Afrati, Fotakis, Ullman; tech report, 2011)
- A Model of Computation for MapReduce (Karloff, Suri, Vassilvitskii; SODA '10)

Future Work

- Derive lower bounds on replication rate, and match them with algorithms, for many different problems.
- Relate the structure of the input-output dependency graph to replication rate:
  - How does min-cut size relate to replication rate?
  - How does expansion rate relate to replication rate?