
CS 345D Semih Salihoglu (some slides are copied from Ilan Horn, Jeff Dean, and Utkarsh Srivastava’s presentations online) MapReduce System and Theory 1

Outline  System  MapReduce/Hadoop  Pig & Hive  Theory:  Model For Lower Bounding Communication Cost  Shares Algorithm for Joins on MR & Its Optimality 2

MapReduce History  2003: built at Google  2004: published in OSDI (Dean & Ghemawat)  2005: open-source version Hadoop  Since then: very influential in the DB community 4

Google's Problem in 2003: lots of data  Example: 20+ billion web pages x 20KB = 400+ terabytes  One computer can read a few tens of MB/sec from disk  ~four months to read the web  ~1,000 hard drives just to store the web  Even more to do something with the data:  process crawled documents  process web request logs  build inverted indices  construct graph representations of web documents 5
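A quick back-of-the-envelope check of the arithmetic above; the 40 MB/sec disk read rate assumed below is illustrative (the slide only implies tens of MB/sec), chosen so the four-month figure falls out:

# Rough sanity check of "four months to read the web on one machine".
pages = 20e9                 # 20+ billion web pages
page_size = 20e3             # 20 KB per page
total_bytes = pages * page_size              # ~400 TB
read_rate = 40e6             # assumed sustained disk read rate, bytes/sec

seconds = total_bytes / read_rate
days = seconds / (60 * 60 * 24)
print(f"total data: {total_bytes / 1e12:.0f} TB")
print(f"time to read on one machine: {days:.0f} days (~{days / 30:.1f} months)")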

Special-Purpose Solutions Before 2003  Spread work over many machines  Good news: the same problem with 1,000 machines takes < 3 hours 6

Problems with Special-Purpose Solutions  Bad news 1: lots of programming work  communication and coordination  work partitioning  status reporting  optimization  locality  Bad news 2: repeat for every problem you want to solve  Bad news 3: stuff breaks  One server may stay up three years (1,000 days)  If you have 10,000 servers, expect to lose 10 a day 7

What They Needed  A Distributed System: 1. Scalable 2. Fault-Tolerant 3. Easy to Program 4. Applicable to Many Problems 8

MapReduce Programming Model 9
[Diagram] Map stage: map() is applied to each input element and emits intermediate key-value pairs. The pairs are then grouped by reduce key, e.g. <r_k1, {r_v1, r_v2, r_v3}>, <r_k2, {r_v1, r_v2}>, …, <r_k5, {r_v1, r_v2}>. Reduce stage: reduce() is applied to each group and emits an output list (out_list1, out_list2, …, out_list5).

Example 1: Word Count 10
Input: documents, as <input_key, input_value> pairs. Output: e.g. a count for each word.

map(String input_key, String input_value):
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String reduce_key, Iterator values):
  EmitOutput(reduce_key + " " + values.length);

Example 1: Word Count 11
[Dataflow diagram] The intermediate pairs are grouped by reduce key, e.g. <"obama", {1}>, <"the", {1, 1}>, <"is", {1, 1, 1}>, and reduce() then emits each word with its count.
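A minimal, runnable Python sketch of the word-count job, simulating the map, group-by-reduce-key, and reduce stages in memory (the helper names and the in-memory shuffle are illustrative, not the Hadoop API):

from collections import defaultdict

def map_fn(input_key, input_value):
    # Emit an intermediate (word, 1) pair for each word in the document.
    for word in input_value.split():
        yield word, 1

def reduce_fn(reduce_key, values):
    # The reducer sees all values grouped under one key; here it just counts them.
    return f"{reduce_key} {len(values)}"

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                            # map stage
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)      # group by reduce key
    return [reduce_fn(k, vs) for k, vs in intermediate.items()]  # reduce stage

if __name__ == "__main__":
    docs = [("doc1", "the president is obama"), ("doc2", "is the election over is it")]
    for line in run_mapreduce(docs, map_fn, reduce_fn):
        print(line)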

Example 2: Binary Join R(A, B) ⋈ S(B, C) 12
Input: tuples tagged with their relation name, <'R', (a, b)> or <'S', (b, c)>. Output: the successfully joined tuples (a, b, c).

map(String relationName, Tuple t):
  Int b_val = (relationName == "R") ? t[1] : t[0]
  Int a_or_c_val = (relationName == "R") ? t[0] : t[1]
  EmitIntermediate(b_val, <relationName, a_or_c_val>);

reduce(Int b_j, Iterator<relationName, a_or_c_val> a_or_c_vals):
  int[] aVals = getAValues(a_or_c_vals);
  int[] cVals = getCValues(a_or_c_vals);
  foreach a_i, c_k in aVals, cVals => EmitOutput(a_i, b_j, c_k);

Example 2: Binary Join R(A, B) ⋈ S(B, C) 13
[Dataflow diagram] The tagged R and S tuples, e.g. <'R', (a, b)> and <'S', (b, c)>, are grouped by their B-value (the reduce key). A reducer such as the one for b_3 receives values from both relations and joins them; a reducer whose key appears in only one relation (e.g. b_2) produces no output.
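A corresponding Python sketch of the reduce-side join, again with the shuffle simulated in memory (the sample tuples and helper names are made up for illustration):

from collections import defaultdict

# Reduce-side join R(A, B) with S(B, C): map tags each tuple with its relation
# and keys it by the join attribute B; reduce pairs up the two sides.

def map_fn(relation_name, t):
    if relation_name == "R":        # t = (a, b)
        a, b = t
        yield b, ("R", a)
    else:                           # t = (b, c)
        b, c = t
        yield b, ("S", c)

def reduce_fn(b, tagged_vals):
    a_vals = [v for tag, v in tagged_vals if tag == "R"]
    c_vals = [v for tag, v in tagged_vals if tag == "S"]
    for a in a_vals:
        for c in c_vals:
            yield (a, b, c)

def run_join(r_tuples, s_tuples):
    groups = defaultdict(list)
    for rel, tuples in (("R", r_tuples), ("S", s_tuples)):
        for t in tuples:
            for key, val in map_fn(rel, t):
                groups[key].append(val)              # group by reduce key B
    return [out for b, vals in groups.items() for out in reduce_fn(b, vals)]

if __name__ == "__main__":
    R = [("a1", "b2"), ("a2", "b3"), ("a3", "b3")]
    S = [("b3", "c1"), ("b3", "c2"), ("b4", "c3")]   # b4 appears only in S: no output
    print(run_join(R, S))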

Programming Model Very Applicable 14  Example uses: distributed grep, web access log stats, distributed sort, web link-graph reversal, term-vector per host, inverted index construction, document clustering, statistical machine translation, machine learning, image processing, …  Can read and write many different data types  Applicable to many problems

MapReduce Execution 15
[Execution diagram] A master task assigns map and reduce tasks to worker machines. There are usually many more map tasks than machines, e.g. 200K map tasks, 5K reduce tasks, 2K machines.

Fault-Tolerance: Handled via re-execution  On worker failure:  Detect failure via periodic heartbeats  Re-execute completed and in-progress map tasks (completed map output lives on the failed worker's local disk)  Re-execute in-progress reduce tasks  Task completion is committed through the master  Master failure:  Much rarer  AFAIK MR/Hadoop do not handle master node failure 16

Other Features  Combiners  Status & Monitoring  Locality Optimization  Redundant Execution (for curse of last reducer) 17 Overall: Great execution environment for large-scale data

Outline  System  MapReduce/Hadoop  Pig & Hive  Theory:  Model For Lower Bounding Communication Cost  Shares Algorithm for Joins on MR & Its Optimality 18

MR Shortcoming 1: Workflows 19
Many queries/computations need multiple MR jobs
2-stage computation too rigid
Ex: Find the top 10 most visited pages in each category

Visits:
  User | Url        | Time
  Amy  | cnn.com    | 8:00
  Amy  | bbc.com    | 10:00
  Amy  | flickr.com | 10:05
  Fred | cnn.com    | 12:00

UrlInfo:
  Url        | Category | PageRank
  cnn.com    | News     | 0.9
  bbc.com    | News     | 0.8
  flickr.com | Photos   | 0.7
  espn.com   | Sports   | 0.9

Top 10 most visited pages in each category 20
Visits(User, Url, Time) -> MR Job 1: group by url + count -> UrlCount(Url, Count)
UrlCount + UrlInfo(Url, Category, PageRank) -> MR Job 2: join -> UrlCategoryCount(Url, Category, Count)
UrlCategoryCount -> MR Job 3: group by category + find top 10 -> TopTenUrlPerCategory(Url, Category, Count)

MR Shortcoming 2: API too low-level 21
The same workflow as above: Visits -> MR Job 1: group by url + count -> UrlCount -> MR Job 2: join with UrlInfo -> UrlCategoryCount -> MR Job 3: group by category + find top 10 -> TopTenUrlPerCategory
Common operations are coded by hand: join, selects, projection, aggregates, sorting, distinct
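To make the pain concrete, here is a rough in-memory Python simulation of hand-coding this three-job workflow (not Hadoop code; the record-tagging scheme and helper names are assumptions made for illustration, while the toy tables come from the earlier slide):

from collections import defaultdict

def mapreduce(records, map_fn, reduce_fn):
    # One simulated MR job: map, group by key, reduce.
    groups = defaultdict(list)
    for rec in records:
        for k, v in map_fn(rec):
            groups[k].append(v)
    out = []
    for k, vs in groups.items():
        out.extend(reduce_fn(k, vs))
    return out

# Job 1: UrlCount(url, count) from Visits(user, url, time)
def j1_map(visit):
    user, url, time = visit
    yield url, 1

def j1_reduce(url, ones):
    yield (url, len(ones))

# Job 2: join UrlCount with UrlInfo(url, category, pagerank) on url
def j2_map(rec):
    if rec[0] == "count":                        # ("count", (url, count))
        _, (url, count) = rec
        yield url, ("count", count)
    else:                                        # ("info", (url, category, pagerank))
        _, (url, category, _pagerank) = rec
        yield url, ("info", category)

def j2_reduce(url, vals):
    counts = [v for tag, v in vals if tag == "count"]
    categories = [v for tag, v in vals if tag == "info"]
    for count in counts:
        for category in categories:
            yield (url, category, count)

# Job 3: top 10 urls per category
def j3_map(rec):
    url, category, count = rec
    yield category, (count, url)

def j3_reduce(category, pairs):
    for count, url in sorted(pairs, reverse=True)[:10]:
        yield (url, category, count)

if __name__ == "__main__":
    visits = [("Amy", "cnn.com", "8:00"), ("Amy", "bbc.com", "10:00"),
              ("Amy", "flickr.com", "10:05"), ("Fred", "cnn.com", "12:00")]
    url_info = [("cnn.com", "News", 0.9), ("bbc.com", "News", 0.8),
                ("flickr.com", "Photos", 0.7), ("espn.com", "Sports", 0.9)]
    url_counts = mapreduce(visits, j1_map, j1_reduce)
    joined = mapreduce([("count", r) for r in url_counts] +
                       [("info", r) for r in url_info], j2_map, j2_reduce)
    top_urls = mapreduce(joined, j3_map, j3_reduce)
    print(top_urls)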

MapReduce Is Not The Ideal Programming API  Programmers are not used to maps and reduces  We want: joins/filters/groupBy/select * from  Solution: High-level languages/systems that compile to MR/Hadoop 22

High-level Language 1: Pig Latin 23  2008 SIGMOD: from Yahoo Research (Olston et al.)  Apache software; main teams now at Twitter & Hortonworks  Common ops as high-level language constructs, e.g. filter, group by, or join  Workflows as step-by-step procedural scripts  Compiles to Hadoop

Pig Latin Example 24
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
urlCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
urlCategoryCount = join urlCounts by url, urlInfo by url;
gCategories = group urlCategoryCount by category;
topUrls = foreach gCategories generate top(urlCounts,10);
store topUrls into '/data/topUrls';

Notes on the script:
Operates directly over files
Schemas are optional; can be assigned dynamically
User-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach

Pig Latin Execution 28
The script above compiles into three MR jobs: MR Job 1 (load visits, group by url, foreach count), MR Job 2 (load urlInfo, join), MR Job 3 (group by category, foreach top 10, store).

Pig Latin: Execution 29
The generated dataflow is the same as the hand-coded workflow:
Visits(User, Url, Time) -> MR Job 1: group by url + foreach -> UrlCount(Url, Count)
UrlCount + UrlInfo(Url, Category, PageRank) -> MR Job 2: join -> UrlCategoryCount(Url, Category, Count)
UrlCategoryCount -> MR Job 3: group by category + foreach -> TopTenUrlPerCategory(Url, Category, Count)

High-level Language 2: Hive 30  2009 VLDB: from Facebook (Thusoo et al.)  Apache software  Hive-QL: SQL-like declarative syntax, e.g. SELECT *, INSERT INTO, GROUP BY, SORT BY  Compiles to Hadoop

Hive Example 31
INSERT TABLE UrlCounts
  (SELECT url, count(*) AS count FROM Visits GROUP BY url)
INSERT TABLE UrlCategoryCount
  (SELECT url, count, category FROM UrlCounts JOIN UrlInfo ON (UrlCounts.url = UrlInfo.url))
SELECT category, topTen(*) FROM UrlCategoryCount GROUP BY category

Hive Architecture 32
[Architecture diagram] Query interfaces (command line, web, JDBC) feed queries to a compiler/query optimizer, which produces Hadoop jobs.

Hive Final Execution 33
The query above compiles into the same dataflow:
Visits(User, Url, Time) -> MR Job 1: select from / group by -> UrlCount(Url, Count)
UrlCount + UrlInfo(Url, Category, PageRank) -> MR Job 2: join -> UrlCategoryCount(Url, Category, Count)
UrlCategoryCount -> MR Job 3: select from / group by -> TopTenUrlPerCategory(Url, Category, Count)

Pig & Hive Adoption  Both Pig & Hive are very successful  Pig usage in 2009 at Yahoo: 40% of all Hadoop jobs  Hive usage: thousands of jobs, 15 TB/day of new data loaded

MapReduce Shortcoming 3  Iterative computations  Ex: graph algorithms, machine learning  Specialized MR-like or MR-based systems:  Graph processing: Pregel, Giraph, Stanford GPS  Machine learning: Apache Mahout  General iterative data processing systems: iMapReduce, HaLoop  **Spark from Berkeley** (now Apache Spark), published in HotCloud '10 [Zaharia et al.]

Outline  System  MapReduce/Hadoop  Pig & Hive  Theory:  Model For Lower Bounding Communication Cost  Shares Algorithm for Joins on MR & Its Optimality 36

Tradeoff Between Per-Reducer-Memory and Communication Cost 37
Motivating example: each reduce key is a pair of drugs and its values are those drugs' patient records, e.g. <(drug_1, drug_2), {Patients_1, Patients_2}>, <(drug_1, drug_3), {Patients_1, Patients_3}>, …, <(drug_n, drug_n-1), {Patients_n, Patients_n-1}>. With 6,500 drugs that is 6500 * 6499 > 40M reduce keys.
q = Per-Reducer-Memory Cost, r = Communication Cost
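A tiny sketch of where those reduce keys come from: the map stage replicates each drug's patient records once for every other drug, so the number of reduce keys, and the communication, grows quadratically (the toy data and pairing scheme below are assumptions for illustration; the 6500 * 6499 figure is from the slide):

# One reducer per drug pair: both drugs' patient records must meet there.
def map_drug(drug, patients, all_drugs):
    for other in all_drugs:
        if other != drug:
            key = tuple(sorted((drug, other)))       # reduce key = pair of drugs
            yield key, (drug, patients)

if __name__ == "__main__":
    data = {f"drug{i}": {f"p{i}a", f"p{i}b"} for i in range(1, 5)}   # 4 toy drugs
    emitted = [kv for d, ps in data.items() for kv in map_drug(d, ps, data)]
    print("reduce keys:", len({k for k, _ in emitted}))        # C(4,2) = 6 pairs
    print("records sent (communication):", len(emitted))       # each drug sent 3 times -> 12
    print("with 6500 drugs:", f"{6500 * 6499:,}", "ordered emissions")   # > 40M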

Similarity Join
Input: R(A, B), Domain(B) = [1, 10]. Compute all pairs of tuples <t, u> s.t. |t[B] - u[B]| ≤ 1.
Example (1) input:
  A  | B
  a1 | 5
  a2 | 2
  a3 | 6
  a4 | 2
  a5 | 7
Output: the pairs whose B-values differ by at most 1, here <(a1,5),(a3,6)>, <(a3,6),(a5,7)>, <(a2,2),(a4,2)>.

Hashing Algorithm [ADMPU ICDE '12], Example (2) 39
Split Domain(B) into p ranges of values => p reducers. Here p = 2: Reducer 1 covers B in [1, 5] and receives (a1,5), (a2,2), (a4,2); Reducer 2 covers B in [6, 10] and receives (a3,6), (a5,7). Tuples on the boundary are replicated (if t.B = 5, the tuple is also sent to Reducer 2), so (a1,5) goes to both.
Per-Reducer-Memory Cost = 3, Communication Cost = 6

Example (3) 40
p = 5 => Reducer 1 covers [1, 2], Reducer 2 covers [3, 4], Reducer 3 covers [5, 6], Reducer 4 covers [7, 8], Reducer 5 covers [9, 10]. Replicate a tuple if t.B = 2, 4, 6, or 8.
Per-Reducer-Memory Cost = 2, Communication Cost = 8
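A small Python sketch of this range-bucketing scheme; the bucket-width formula and replication rule below follow the two examples (and assume p evenly divides the domain). It reproduces the costs quoted above:

from collections import defaultdict

def map_tuple(t, p, domain_size=10):
    # Each reducer owns a contiguous range of B; tuples whose B-value sits on a
    # range's upper boundary are also sent to the next reducer.
    a, b = t
    width = domain_size // p
    bucket = (b - 1) // width
    yield bucket, t
    if b % width == 0 and bucket + 1 < p:
        yield bucket + 1, t

def costs(tuples, p):
    reducers = defaultdict(list)
    for t in tuples:
        for bucket, tup in map_tuple(t, p):
            reducers[bucket].append(tup)
    communication = sum(len(v) for v in reducers.values())
    per_reducer_memory = max(len(v) for v in reducers.values())
    return per_reducer_memory, communication

if __name__ == "__main__":
    R = [("a1", 5), ("a2", 2), ("a3", 6), ("a4", 2), ("a5", 7)]
    for p in (2, 5):
        q, comm = costs(R, p)
        print(f"p={p}: per-reducer memory = {q}, communication = {comm}")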

Same Tradeoff in Other Algorithms 41  Multiway joins ([AU] TKDE '11)  Finding subgraphs ([SV] WWW '11, [AFU] ICDE '13)  Computing Minimum Spanning Trees ([KSV] SODA '10)  Other similarity joins: Set similarity joins ([VCL] SIGMOD '10), Hamming distance ([ADMPU] ICDE '12, and later in the talk)

We Want 42  A general framework applicable to a variety of problems  Question 1: What is the minimum communication for any MR algorithm, if each reducer uses ≤ q memory?  Question 2: Are there algorithms that achieve this lower bound?

Next 43  Framework: Input-Output Model; Mapping Schemas & Replication Rate  Lower Bound for the Triangle Query  Shares Algorithm for the Triangle Query  Generalized Shares Algorithm

Framework: Input-Output Model 44  Input data elements I = {i_1, i_2, …, i_n}  Output elements O = {o_1, o_2, …, o_m}

Example 1: R(A, B) ⋈ S(B, C) 45  |Domain(A)| = |Domain(B)| = |Domain(C)| = n  Inputs: all tuples (a_i, b_j) of R(A, B) and (b_j, c_k) of S(B, C): n^2 + n^2 = 2n^2 possible inputs  Outputs: all triples (a_i, b_j, c_k): n^3 possible outputs

Example 2: R(A, B) ⋈ S(B, C) ⋈ T(C, A) 46  |Domain(A)| = |Domain(B)| = |Domain(C)| = n  Inputs: all tuples (a_i, b_j) of R(A, B), (b_j, c_k) of S(B, C), and (c_k, a_i) of T(C, A): n^2 + n^2 + n^2 = 3n^2 input elements  Outputs: all triples (a_i, b_j, c_k): n^3 output elements

Framework: Mapping Schema & Replication Rate 47
p reducers: {R_1, R_2, …, R_p}; q = max # inputs sent to any reducer R_i
Def (Mapping Schema): M : I -> {R_1, R_2, …, R_p} s.t.
  each R_i receives at most q_i ≤ q inputs, and
  every output is covered by some reducer
Def (Replication Rate): r = (Σ_{i=1..p} q_i) / |I|
q captures memory, r captures communication cost
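A quick worked check of the definition against the similarity-join example above (5 input tuples); the per-reducer input counts below are read off those two slides:

num_inputs = 5                                    # the 5 tuples of R

schemes = {
    "p=2 (ranges of width 5)": [3, 3],            # q_i per reducer, communication = 6
    "p=5 (ranges of width 2)": [2, 2, 2, 2, 0],   # communication = 8
}
for name, q_is in schemes.items():
    r = sum(q_is) / num_inputs                    # replication rate
    print(f"{name}: q = {max(q_is)}, r = {r:.1f}")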

Our Questions Again 48 Question 1: What is the minimum replication rate of any mapping schema as a function of q (maximum # inputs sent to any reducer)? Question 2: Are there mapping schemas that match this lower bound?

Triangle Query: R(A, B) ⋈ S(B, C) ⋈ T(C, A) 49  |Domain(A)| = |Domain(B)| = |Domain(C)| = n  3n^2 input elements (the tuples of R, S, and T); each input contributes to n outputs  n^3 outputs; each output depends on 3 inputs

Lower Bound on Replication Rate (Triangle Query) 50
Key is an upper bound g(q) on the max # outputs a reducer can cover with ≤ q inputs
Claim (proof by the AGM bound): g(q) ≤ (q/3)^{3/2}
All outputs must be covered: Σ_{i=1..p} g(q_i) ≥ n^3
Recall: r = (Σ_{i=1..p} q_i) / (3n^2)
Since q_i ≤ q, (q_i/3)^{3/2} ≤ (q/3)^{1/2} · (q_i/3); combining gives r ≥ 3^{1/2} · n / q^{1/2}

Memory/Communication Cost Tradeoff (Triangle Query) 51
[Plot] Replication rate r vs. q (max # inputs to each reducer): at one extreme one reducer for each output (q = 3, r = n); at the other all inputs to one reducer (q = 3n^2, r = 1). The Shares Algorithm traces the curve between these extremes.

52 Shares Algorithm for Triangles
  p = k^3 reducers, indexed r_{1,1,1} to r_{k,k,k}
  We say each attribute A, B, C has k "shares"
  h_A, h_B, and h_C are independent and perfect hash functions from a domain of size n to k buckets
  (a_i, b_j) in R(A, B) => send to the k reducers r_{h_A(a_i), h_B(b_j), *}
    E.g. if h_A(a_i) = 3 and h_B(b_j) = 4, send it to r_{3,4,1}, r_{3,4,2}, …, r_{3,4,k}
  (b_j, c_l) in S(B, C) => send to r_{*, h_B(b_j), h_C(c_l)}
  (c_l, a_i) in T(C, A) => send to r_{h_A(a_i), *, h_C(c_l)}
  Correct: the dependencies of output (a_i, b_j, c_l) meet at r_{h_A(a_i), h_B(b_j), h_C(c_l)}
    E.g. if additionally h_C(c_l) = 2, all three tuples are sent to r_{3,4,2}
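A compact Python sketch of the Shares mapping for triangles, with the shuffle simulated in memory (the hash function, relation encoding, and brute-force reducer are illustrative stand-ins, not an actual Hadoop job):

from collections import defaultdict
from itertools import product

def h(value, k):
    return hash(value) % k            # stand-in for an independent, well-spread hash

def map_shares(R, S, T, k):
    # Each tuple knows two of the three shares; it is replicated over the third.
    reducers = defaultdict(list)
    for a, b in R:
        for z in range(k):
            reducers[(h(a, k), h(b, k), z)].append(("R", (a, b)))
    for b, c in S:
        for x in range(k):
            reducers[(x, h(b, k), h(c, k))].append(("S", (b, c)))
    for c, a in T:
        for y in range(k):
            reducers[(h(a, k), y, h(c, k))].append(("T", (c, a)))
    return reducers

def reduce_triangles(tuples):
    Rs = [t for tag, t in tuples if tag == "R"]
    Ss = [t for tag, t in tuples if tag == "S"]
    Ts = [t for tag, t in tuples if tag == "T"]
    for (a, b), (b2, c), (c2, a2) in product(Rs, Ss, Ts):
        if b == b2 and c == c2 and a == a2:
            yield (a, b, c)

if __name__ == "__main__":
    R = [(1, 10), (2, 20)]
    S = [(10, 100), (20, 200)]
    T = [(100, 1), (200, 3)]          # only (1, 10, 100) closes a triangle
    k = 2                             # p = k^3 = 8 reducers
    reducers = map_shares(R, S, T, k)
    triangles = {t for tuples in reducers.values() for t in reduce_triangles(tuples)}
    print(triangles)
    print("communication:", sum(len(v) for v in reducers.values()))   # 6 tuples * k = 12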

Shares Algorithm for Triangles: Example 53
[Diagram] Let p = 27 (k = 3), and suppose h_A(a_1) = 2, h_B(b_1) = 1, h_C(c_1) = 3. Then (a_1, b_1) => r_{2,1,*}, (b_1, c_1) => r_{*,1,3}, (c_1, a_1) => r_{2,*,3}, so all three meet at reducer r_{2,1,3}.
Replication rate r = k = p^{1/3} and q = 3n^2/p^{2/3}

54 Shares Algorithm for Triangles
  Shares' replication rate: r = k = p^{1/3}, and q = 3n^2/p^{2/3}
  Lower bound: r ≥ 3^{1/2} · n / q^{1/2}
  Substituting q into the lower bound gives r ≥ p^{1/3}, so Shares matches it
  Special case 1: p = n^3, q = 3, r = n; equivalent to the trivial algorithm with one reducer for each output
  Special case 2: p = 1, q = 3n^2, r = 1; equivalent to the trivial serial algorithm
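A small numeric check of the substitution above, plugging Shares' q into the lower bound for a few arbitrary choices of n and p:

import math

def check(n, p):
    k = round(p ** (1 / 3))               # shares per attribute (p = k^3)
    q = 3 * n * n / p ** (2 / 3)          # max inputs per reducer under Shares
    lower_bound = math.sqrt(3) * n / math.sqrt(q)
    print(f"n={n}, p={p}: q={q:.0f}, lower bound r >= {lower_bound:.2f}, Shares r = {k}")

if __name__ == "__main__":
    check(n=1000, p=27)       # k = 3
    check(n=1000, p=1000)     # k = 10
    check(n=10, p=1000)       # special case 1: p = n^3, q = 3, r = n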

Other Lower Bound Results [Afrati et al., VLDB '13] 55  Hamming Distance 1  Multiway joins, e.g. R(A, B) ⋈ S(B, C) ⋈ T(C, A)  Matrix Multiplication

56 Generalized Shares ([AU] TKDE '11)
  Relations R_i, i = 1, …, m; let r_i = |R_i|
  Attributes A_j, j = 1, …, n
  Q = ⋈_i R_i
  Give each attribute A_j a "share" s_j
  p reducers, indexed by r_{1,1,…,1} to r_{s_1,s_2,…,s_n}
  Minimize total communication cost: Σ_i r_i · Π_{j : A_j not in R_i} s_j, subject to Π_j s_j = p

57 Example: Triangles  R(A, B), S(B, C), T(C, A); |R| = |S| = |T| = n^2  Total communication cost: min |R|·s_C + |S|·s_A + |T|·s_B s.t. s_A · s_B · s_C = p  Solution: s_A = s_B = s_C = p^{1/3} = k
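A brute-force sanity check (over integer share assignments only; the values of n and p are arbitrary) that the symmetric solution minimizes the communication cost for the triangle query:

from itertools import product

n, p = 100, 64
rel_size = n * n            # |R| = |S| = |T| = n^2

best = None
for s_a, s_b, s_c in product(range(1, p + 1), repeat=3):
    if s_a * s_b * s_c != p:
        continue
    cost = rel_size * s_c + rel_size * s_a + rel_size * s_b
    if best is None or cost < best[0]:
        best = (cost, (s_a, s_b, s_c))

print("best integer shares:", best[1], "with cost", best[0])
# For p = 64 this prints s_A = s_B = s_C = 4 = p^(1/3), cost = 3 * n^2 * 4.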

58 Shares is Optimal For Any Query  Generalized Shares' minimization is a geometric program  It always has a solution and is solvable in polynomial time (observed by Chris, and independently by Beame, Koutris, Suciu (BKS))  BKS proved that Shares' communication cost vs. per-reducer memory tradeoff is optimal for any query

59 Open MapReduce Theory Questions
  Shares' communication cost grows with p for most queries, e.g. the triangle communication cost is p^{1/3} · |I|; this is the best possible for one round (again in terms of per-reducer memory)
  Q1: Can we do better with multi-round algorithms?
    Are there 2-round algorithms with O(|I|) cost? The answer is no for general queries, but maybe for a class of queries?
    How about constant-round MR algorithms? Good work in PODS 2013 by Beame, Koutris, Suciu from UW
  Q2: How about instance-optimal algorithms?
  Q3: How can we guard computations against skew? (good work on arXiv by Beame, Koutris, Suciu)

60 References
  MapReduce: Simplified Data Processing on Large Clusters [Dean & Ghemawat, OSDI '04]
  Pig Latin: A Not-So-Foreign Language for Data Processing [Olston et al., SIGMOD '08]
  Hive: A Petabyte Scale Data Warehouse Using Hadoop [Thusoo et al., VLDB '09]
  Spark: Cluster Computing With Working Sets [Zaharia et al., HotCloud '10]
  Upper and Lower Bounds on the Cost of a Map-Reduce Computation [Afrati et al., VLDB '13]
  Optimizing Joins in a Map-Reduce Environment [Afrati & Ullman, TKDE '11]
  Parallel Evaluation of Conjunctive Queries [Koutris & Suciu, PODS '11]
  Communication Steps for Parallel Query Processing [Beame et al., PODS '13]
  Skew in Parallel Query Processing [Beame et al., arXiv]