CS 345D, Semih Salihoglu (some slides are copied from Ilan Horn, Jeff Dean, and Utkarsh Srivastava's presentations online)
MapReduce: System and Theory
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
MapReduce History
- 2003: built at Google
- 2004: published in OSDI (Dean & Ghemawat)
- 2005: open-source version, Hadoop
- 2005-2014: very influential in the DB community
Google's Problem in 2003: lots of data
- Example: 20+ billion web pages x 20KB = 400+ terabytes
- One computer can read 30-35 MB/sec from disk: ~four months to read the web
- ~1,000 hard drives just to store the web
- Even more work to do something with the data: process crawled documents, process web request logs, build inverted indices, construct graph representations of web documents
Special-Purpose Solutions Before 2003
- Spread the work over many machines
- Good news: with 1,000 machines, the same problem takes < 3 hours
Problems with Special-Purpose Solutions
- Bad news 1: lots of programming work: communication and coordination, work partitioning, status reporting, optimization, locality
- Bad news 2: repeat for every problem you want to solve
- Bad news 3: stuff breaks. One server may stay up three years (~1,000 days); if you have 10,000 servers, expect to lose 10 a day
What They Needed
A distributed system that is:
1. Scalable
2. Fault-Tolerant
3. Easy To Program
4. Applicable To Many Problems
MapReduce Programming Model (figure)
map() turns each input record into intermediate <reduce_key, value> pairs; the system groups the pairs by reduce key; reduce() consumes each group <r_k, {r_v_1, r_v_2, ...}> and emits an output list.
Example 1: Word Count
Input: documents. Output: a count per word, e.g. <word, count>.

map(String input_key, String input_value):
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String reduce_key, Iterator values):
  EmitOutput(reduce_key + " " + values.length);
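The pseudocode above maps directly to a few lines of Python. The sketch below adds a toy in-memory driver in place of the MapReduce runtime; the driver, function names, and sample documents are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict

def map_word_count(input_key, input_value):
    # Emit <word, 1> for every word in the document body.
    for word in input_value.split():
        yield (word, 1)

def reduce_word_count(reduce_key, values):
    # All values for a word are 1s, so the count is just their total.
    return (reduce_key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # Toy single-process driver: apply map, group by key, apply reduce.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(k, vs) for k, vs in groups.items()]

if __name__ == "__main__":
    docs = [("doc1", "the president is obama"), ("doc2", "the senate is in session")]
    print(run_mapreduce(docs, map_word_count, reduce_word_count))
    # e.g. [('the', 2), ('president', 1), ('is', 2), ('obama', 1), ...]
```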
Example 1: Word Count (figure)
After the map stage, intermediate pairs are grouped by reduce key, e.g. <"obama", {1}>, <"the", {1, 1}>, <"is", {1, 1, 1}>; the reducers then emit the counts.
Example 2: Binary Join R(A, B) ⋈ S(B, C)
Input: tuples tagged with their relation name, <'R', (a, b)> or <'S', (b, c)>. Output: the successful (joined) tuples.

map(String relationName, Tuple t):
  Int b_val      = (relationName == "R") ? t[1] : t[0]
  Int a_or_c_val = (relationName == "R") ? t[0] : t[1]
  EmitIntermediate(b_val, <relationName, a_or_c_val>);

reduce(Int b_j, Iterator<<relationName, a_or_c_val>> a_or_c_vals):
  int[] aVals = getAValues(a_or_c_vals);
  int[] cVals = getCValues(a_or_c_vals);
  foreach a_i, c_k in aVals, cVals:
    EmitOutput(a_i, b_j, c_k);
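A runnable sketch of the same reduce-side join in Python; the tagging scheme, tuple layout, and toy driver are illustrative assumptions rather than the slide's exact code.

```python
from collections import defaultdict

def map_join(relation_name, t):
    # R(A, B): key on t[1]; S(B, C): key on t[0]. Tag the value with its relation.
    if relation_name == "R":
        yield (t[1], ("R", t[0]))   # (b, ('R', a))
    else:
        yield (t[0], ("S", t[1]))   # (b, ('S', c))

def reduce_join(b, tagged_vals):
    a_vals = [v for tag, v in tagged_vals if tag == "R"]
    c_vals = [v for tag, v in tagged_vals if tag == "S"]
    # Cross product of the A-side and C-side values that share this B value.
    return [(a, b, c) for a in a_vals for c in c_vals]

def run_join(r_tuples, s_tuples):
    groups = defaultdict(list)
    for rel, tuples in (("R", r_tuples), ("S", s_tuples)):
        for t in tuples:
            for k, v in map_join(rel, t):
                groups[k].append(v)
    out = []
    for b, vals in groups.items():
        out.extend(reduce_join(b, vals))
    return out

if __name__ == "__main__":
    R = [("a1", "b3"), ("a2", "b3"), ("a3", "b2")]
    S = [("b3", "c1"), ("b3", "c2")]
    print(run_join(R, S))   # four (a, b3, c) tuples; b2 produces no output
```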
Example 2: Binary Join R(A, B) ⋈ S(B, C) (figure)
Tagged tuples such as <'R', (a_i, b_j)> and <'S', (b_j, c_k)> are grouped by their B value. A reduce key like b_3 that receives values from both R and S emits the joined tuples; a reduce key like b_2 that receives values from only one relation produces no output.
Programming Model Very Applicable
- distributed grep, web access log stats
- distributed sort, web link-graph reversal
- term-vector per host, inverted index construction
- document clustering, statistical machine translation
- machine learning, image processing, ...
Can read and write many different data types; applicable to many problems.
MapReduce Execution (figure)
A master task assigns map and reduce tasks to worker machines. There are usually many more map tasks than machines, e.g. 200K map tasks and 5K reduce tasks on 2K machines.
Fault-Tolerance: Handled via Re-execution
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion is committed through the master
Master failure is much rarer; AFAIK MR/Hadoop do not handle master node failure.
Other Features
- Combiners (a map-side pre-aggregation; see the sketch below)
- Status & monitoring
- Locality optimization
- Redundant execution (to mitigate the "curse of the last reducer", i.e. stragglers)
Overall: a great execution environment for large-scale data.
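A minimal sketch of the combiner idea, assuming the word-count example from earlier; the function names are illustrative, not Hadoop's API.

```python
from collections import Counter

def map_with_combiner(doc_id, text):
    # A plain map would emit <word, 1> once per occurrence; the combiner
    # (a map-side reduce) collapses duplicates so each map task ships at most
    # one <word, local_count> pair per distinct word before the shuffle.
    for word, local_count in Counter(text.split()).items():
        yield (word, local_count)

def reduce_counts(word, partial_counts):
    # The reducer now sums partial counts instead of a long list of 1s.
    return (word, sum(partial_counts))
```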
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
MR Shortcoming 1: Workflows
Many queries/computations need multiple MR jobs; the 2-stage computation is too rigid.
Example: find the top 10 most visited pages in each category.
Visits(User, Url, Time):
  Amy, cnn.com, 8:00
  Amy, bbc.com, 10:00
  Amy, flickr.com, 10:05
  Fred, cnn.com, 12:00
UrlInfo(Url, Category, PageRank):
  cnn.com, News, 0.9
  bbc.com, News, 0.8
  flickr.com, Photos, 0.7
  espn.com, Sports, 0.9
Top 10 most visited pages in each category
Inputs: Visits(User, Url, Time), UrlInfo(Url, Category, PageRank)
- MR Job 1: group by url + count => UrlCount(Url, Count)
- MR Job 2: join with UrlInfo => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + count => TopTenUrlPerCategory(Url, Category, Count)
MR Shortcoming 2: API too low-level
The same workflow, with common operations coded by hand: joins, selects, projections, aggregates, sorting, distinct.
- MR Job 1: group by url + count => UrlCount(Url, Count)
- MR Job 2: join with UrlInfo => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + find top 10 => TopTenUrlPerCategory(Url, Category, Count)
MapReduce Is Not The Ideal Programming API
- Programmers are not used to maps and reduces
- We want: joins, filters, group-by, select * from ...
- Solution: high-level languages/systems that compile to MR/Hadoop
High-level Language 1: Pig Latin
- SIGMOD 2008, from Yahoo! Research (Olston et al.)
- Apache software; main teams now at Twitter & Hortonworks
- Common ops (e.g. filter, group by, join) as high-level language constructs
- Workflow expressed as step-by-step procedural scripts
- Compiles to Hadoop
Pig Latin Example
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
urlCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
urlCategoryCount = join urlCounts by url, urlInfo by url;
gCategories = group urlCategoryCount by category;
topUrls = foreach gCategories generate top(urlCounts, 10);
store topUrls into '/data/topUrls';
Pig Latin Example (cont.)
The script operates directly over files (e.g. '/data/visits', '/data/urlInfo').
Pig Latin Example (cont.)
Schemas are optional and can be assigned dynamically, e.g. the as (user, url, time) clause of load.
Pig Latin Example (cont.)
User-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach.
Pig Latin Execution
The script compiles into three MR jobs: MR Job 1 (group by url + count), MR Job 2 (join), MR Job 3 (group by category + top).
Pig Latin: Execution
The compiled plan over Visits(User, Url, Time) and UrlInfo(Url, Category, PageRank):
- MR Job 1: group by url + foreach => UrlCount(Url, Count)
- MR Job 2: join => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + foreach => TopTenUrlPerCategory(Url, Category, Count)
High-level Language 2: Hive
- VLDB 2009, from Facebook (Thusoo et al.)
- Apache software
- Hive-QL: SQL-like declarative syntax, e.g. SELECT *, INSERT INTO, GROUP BY, SORT BY
- Compiles to Hadoop
Hive Example
INSERT TABLE UrlCounts
  (SELECT url, count(*) AS count FROM Visits GROUP BY url)
INSERT TABLE UrlCategoryCount
  (SELECT url, count, category FROM UrlCounts JOIN UrlInfo ON (UrlCounts.url = UrlInfo.url))
SELECT category, topTen(*) FROM UrlCategoryCount GROUP BY category
Hive Architecture (figure)
Query interfaces (command line, web, JDBC) feed a compiler/query optimizer, which produces Hadoop jobs.
Hive Final Execution
The statements above compile to the same job DAG over Visits(User, Url, Time) and UrlInfo(Url, Category, PageRank):
- MR Job 1: select-from-group-by => UrlCount(Url, Count)
- MR Job 2: join => UrlCategoryCount(Url, Category, Count)
- MR Job 3: select-from-group-by => TopTenUrlPerCategory(Url, Category, Count)
Pig & Hive Adoption
Both Pig & Hive are very successful.
- Pig usage in 2009 at Yahoo: 40% of all Hadoop jobs
- Hive usage: thousands of jobs, 15 TB/day of new data loaded
MapReduce Shortcoming 3: Iterative Computations
Examples: graph algorithms, machine learning.
- Specialized MR-like or MR-based systems: graph processing (Pregel, Giraph, Stanford GPS); machine learning (Apache Mahout)
- General iterative data processing systems: iMapReduce, HaLoop
- **Spark** from Berkeley (now Apache Spark), published in HotCloud '10 [Zaharia et al.]
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
Tradeoff Between Per-Reducer Memory and Communication Cost (figure)
Example: comparing drugs' patient records. Each reduce key is a pair of drugs and its values are the two drugs' patient sets, e.g. <(drug_1, drug_2), (Patients_1, Patients_2)>. With 6,500 drugs there are 6,500 x 6,499 > 40M reduce keys.
q = per-reducer memory cost, r = communication cost.
Similarity Join
Input: R(A, B), with Domain(B) = [1, 10]. Compute all pairs of tuples (t, u) such that |t[B] - u[B]| <= 1.
Example (1) input: (a1, 5), (a2, 2), (a3, 6), (a4, 2), (a5, 7).
Output: the pairs <(a1, 5), (a3, 6)>, <(a2, 2), (a4, 2)>, <(a3, 6), (a5, 7)>.
Hashing Algorithm [ADMPU ICDE '12]
Split Domain(B) into p ranges of values, one per reducer.
Example (2), p = 2: Reducer 1 covers [1, 5], Reducer 2 covers [6, 10]. Tuples on the boundary (t.B = 5) are replicated to both reducers, so (a1, 5) is sent to both; (a2, 2) and (a4, 2) go to Reducer 1; (a3, 6) and (a5, 7) go to Reducer 2.
Per-reducer memory cost = 3, communication cost = 6.
Example (3), p = 5: Reducer 1 covers [1, 2], Reducer 2 [3, 4], Reducer 3 [5, 6], Reducer 4 [7, 8], Reducer 5 [9, 10]. Tuples are replicated if t.B = 2, 4, 6, or 8, so (a2, 2) and (a4, 2) go to Reducers 1 and 2, (a3, 6) goes to Reducers 3 and 4, (a1, 5) goes to Reducer 3, and (a5, 7) goes to Reducer 4.
Per-reducer memory cost = 2, communication cost = 8. A code sketch of this map function follows.
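A minimal Python sketch of the map side of this hashing algorithm, reconstructed from the two examples above; the function name, the equal-width bucketing, and the boundary rule are assumptions for illustration.

```python
def map_similarity_join(tuple_, p, domain_size=10):
    """Assign (id, b) to range-partitioned reducers, replicating boundary tuples.
    Assumes p divides domain_size and B values are integers in [1, domain_size]."""
    _, b = tuple_
    width = domain_size // p                    # e.g. p=5 -> ranges of width 2
    reducer = (b - 1) // width                  # 0-based reducer index
    yield (reducer, tuple_)
    # If b is the last value of its range, the pair (b, b+1) spans two reducers,
    # so replicate the tuple to the next reducer as well.
    if b % width == 0 and reducer + 1 < p:
        yield (reducer + 1, tuple_)

# Example (2): p=2 -> (a1, 5) goes to reducers 0 and 1; memory 3 per reducer, cost 6.
# Example (3): p=5 -> tuples with B in {2, 4, 6, 8} are replicated; memory 2, cost 8.
```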
Same Tradeoff in Other Algorithms
- Multiway joins ([AU] TKDE '11)
- Finding subgraphs ([SV] WWW '11, [AFU] ICDE '13)
- Computing minimum spanning trees ([KSV] SODA '10)
- Other similarity joins: set similarity joins ([VCL] SIGMOD '10), Hamming distance ([ADMPU] ICDE '12 and later in the talk)
We Want
A general framework applicable to a variety of problems.
- Question 1: What is the minimum communication cost of any MR algorithm, if each reducer uses <= q memory?
- Question 2: Are there algorithms that achieve this lower bound?
Next
- Framework: input-output model; mapping schemas & replication rate
- Lower bound for the triangle query
- Shares algorithm for the triangle query
- Generalized Shares algorithm
Framework: Input-Output Model
- Input data elements I: {i_1, i_2, ..., i_n}
- Output elements O: {o_1, o_2, ..., o_m}
Example 1: R(A, B) ⋈ S(B, C)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- Inputs: all n^2 pairs (a_i, b_j) of R(A, B) and all n^2 pairs (b_j, c_k) of S(B, C), i.e. n^2 + n^2 = 2n^2 possible inputs
- Outputs: all n^3 triples (a_i, b_j, c_k)
Example 2: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- Inputs: the n^2 pairs of each of R(A, B), S(B, C), and T(C, A), i.e. n^2 + n^2 + n^2 = 3n^2 input elements
- Outputs: all n^3 triples (a_i, b_j, c_k)
Framework: Mapping Schema & Replication Rate
- p reducers: {R_1, R_2, ..., R_p}
- q: the maximum number of inputs sent to any reducer R_i
- Def (Mapping Schema): M : I -> {R_1, R_2, ..., R_p} such that each R_i receives at most q_i <= q inputs, and every output is covered by some reducer (i.e. some reducer receives all the inputs that output depends on)
- Def (Replication Rate): r = (q_1 + q_2 + ... + q_p) / |I|
q captures memory; r captures communication cost.
Our Questions Again
- Question 1: What is the minimum replication rate of any mapping schema as a function of q (the maximum number of inputs sent to any reducer)?
- Question 2: Are there mapping schemas that match this lower bound?
Triangle Query: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- 3n^2 input elements; each input contributes to n outputs (e.g. (a_i, b_j) in R appears in the outputs (a_i, b_j, c_1), ..., (a_i, b_j, c_n))
- n^3 outputs; each output depends on 3 inputs
Lower Bound on Replication Rate (Triangle Query)
The key is an upper bound g(q) on the number of outputs a reducer can cover with <= q inputs.
- Claim (proved via the AGM bound): a reducer with q_i inputs covers at most (q_i/3)^(3/2) outputs.
- All n^3 outputs must be covered: sum_i (q_i/3)^(3/2) >= n^3.
- Recall r = (sum_i q_i) / (3n^2). Since each q_i <= q, sum_i q_i * q^(1/2) >= sum_i q_i^(3/2) >= 3^(3/2) * n^3, and therefore r >= 3^(1/2) * n / q^(1/2).
Memory/Communication Cost Tradeoff (Triangle Query) (figure)
The x-axis is q, the max number of inputs per reducer; the y-axis is the replication rate r. At q = 3 (one reducer for each output) r = n; at q = 3n^2 (all inputs to one reducer) r = 1. The Shares algorithm traces the curve between these two extremes.
Shares Algorithm for Triangles
- p = k^3 reducers, indexed r_{1,1,1} to r_{k,k,k}
- Each attribute A, B, C gets k "shares"; h_A, h_B, h_C are independent, perfect hash functions from domains of size n to [1, k]
- (a_i, b_j) in R(A, B) is sent to the reducers r_{h_A(a_i), h_B(b_j), *}; e.g. if h_A(a_i) = 3 and h_B(b_j) = 4, it is sent to r_{3,4,1}, r_{3,4,2}, ..., r_{3,4,k}
- (b_j, c_l) in S(B, C) is sent to r_{*, h_B(b_j), h_C(c_l)}
- (c_l, a_i) in T(C, A) is sent to r_{h_A(a_i), *, h_C(c_l)}
- Correctness: the dependencies of output (a_i, b_j, c_l) meet at r_{h_A(a_i), h_B(b_j), h_C(c_l)}; e.g. if h_C(c_l) = 2, all three tuples above reach r_{3,4,2}
Shares Algorithm for Triangles (example figure)
With p = 27 (k = 3): if h_A(a_1) = 2, h_B(b_1) = 1, h_C(c_1) = 3, then (a_1, b_1) in R goes to r_{2,1,*}, (b_1, c_1) in S goes to r_{*,1,3}, and (c_1, a_1) in T goes to r_{2,*,3}; all three meet at r_{2,1,3}. Replication rate r = k = p^(1/3); per-reducer input q = 3n^2/p^(2/3). A code sketch follows.
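A compact Python sketch of the Shares mapping for triangles; the hash functions, reducer indexing, and local-join driver are illustrative stand-ins, not the paper's implementation. Each tuple is replicated k times along its missing attribute, so r = k = p^(1/3), and the three tuples of any triangle meet at exactly one reducer.

```python
import itertools
from collections import defaultdict

def shares_triangle_reducers(k, relation, tup):
    """Return the reducer indices (i, j, l) that a tuple is sent to."""
    h = lambda v: hash(v) % k          # stand-in for the 'perfect' hash functions
    if relation == "R":                # (a, b): fix h_A, h_B, replicate over C
        a, b = tup
        return [(h(a), h(b), l) for l in range(k)]
    if relation == "S":                # (b, c): replicate over A
        b, c = tup
        return [(i, h(b), h(c)) for i in range(k)]
    if relation == "T":                # (c, a): replicate over B
        c, a = tup
        return [(h(a), j, h(c)) for j in range(k)]

def count_triangles(R, S, T, k=3):
    buckets = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for rel, tuples in (("R", R), ("S", S), ("T", T)):
        for t in tuples:
            for idx in shares_triangle_reducers(k, rel, t):
                buckets[idx][rel].append(t)
    # Each reducer joins only its local input; every triangle is found exactly once,
    # at the reducer (h_A(a), h_B(b), h_C(c)).
    found = set()
    for parts in buckets.values():
        for (a, b), (b2, c), (c2, a2) in itertools.product(parts["R"], parts["S"], parts["T"]):
            if b == b2 and c == c2 and a == a2:
                found.add((a, b, c))
    return found

if __name__ == "__main__":
    R = [(1, 2)]; S = [(2, 3)]; T = [(3, 1)]       # one triangle (1, 2, 3)
    print(count_triangles(R, S, T, k=3))           # {(1, 2, 3)}
```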
Shares Algorithm for Triangles: Optimality
- Shares' replication rate: r = k = p^(1/3), with q = 3n^2/p^(2/3)
- Lower bound: r >= 3^(1/2) * n / q^(1/2)
- Substituting q = 3n^2/p^(2/3) into the lower bound gives r >= p^(1/3), so Shares matches it
- Special case 1: p = n^3, q = 3, r = n; equivalent to the trivial algorithm with one reducer for each output
- Special case 2: p = 1, q = 3n^2, r = 1; equivalent to the trivial serial algorithm
Other Lower Bound Results [Afrati et al., VLDB '13]
- Hamming distance 1
- Multiway joins: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
- Matrix multiplication
Generalized Shares ([AU] TKDE '11)
- Relations R_i, i = 1, ..., m, with r_i = |R_i|; attributes A_j, j = 1, ..., n; query Q = the natural join of the R_i
- Give each attribute A_j a "share" s_j; the p reducers are indexed r_{1,1,...,1} to r_{s_1, s_2, ..., s_n}
- A tuple of R_i is sent to every reducer that agrees with it on the hashed attributes of R_i, i.e. it is replicated once for every combination of the shares of the attributes R_i does not contain
- Minimize the total communication cost: sum_i r_i * prod_{A_j not in R_i} s_j, subject to prod_j s_j = p
Example: Triangles
R(A, B), S(B, C), T(C, A), with |R| = |S| = |T| = n^2.
Total communication cost: minimize |R|*s_C + |S|*s_A + |T|*s_B subject to s_A * s_B * s_C = p.
Solution: s_A = s_B = s_C = p^(1/3) = k.
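As a sanity check on this closed-form solution, the following Python sketch brute-forces integer share assignments for the triangle query; it is a stand-in for actually solving the geometric program, and the relation sizes and p are made-up inputs.

```python
import itertools

def best_triangle_shares(size_r, size_s, size_t, p):
    """Brute-force integer shares with s_A * s_B * s_C == p, minimizing the
    total communication cost |R|*s_C + |S|*s_A + |T|*s_B."""
    best_cost, best_shares = None, None
    for s_a, s_b, s_c in itertools.product(range(1, p + 1), repeat=3):
        if s_a * s_b * s_c != p:
            continue
        # Each relation is replicated along the share of the one attribute it lacks.
        cost = size_r * s_c + size_s * s_a + size_t * s_b
        if best_cost is None or cost < best_cost:
            best_cost, best_shares = cost, (s_a, s_b, s_c)
    return best_shares, best_cost

if __name__ == "__main__":
    n2 = 100            # |R| = |S| = |T| = n^2, a made-up size
    print(best_triangle_shares(n2, n2, n2, p=27))   # ((3, 3, 3), 900), i.e. s = p**(1/3)
```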
Shares Is Optimal For Any Query
- Generalized Shares solves a geometric program, which always has a solution and is solvable in polynomial time (observed by Chris, and independently by Beame, Koutris, and Suciu (BKS))
- BKS proved that Shares' communication cost vs. per-reducer memory tradeoff is optimal for any query
Open MapReduce Theory Questions
- Shares' communication cost grows with p for most queries; e.g. for triangles the communication cost is p^(1/3) * |I|, which is the best possible for one round (again, for a given per-reducer memory)
- Q1: Can we do better with multi-round algorithms? Are there 2-round algorithms with O(|I|) cost? The answer is no for general queries, but maybe for a class of queries. How about constant-round MR algorithms? Good work in PODS 2013 by Beame, Koutris, and Suciu from UW
- Q2: How about instance-optimal algorithms?
- Q3: How can we guard computations against skew? (good work on arXiv by Beame, Koutris, and Suciu)
References
- MapReduce: Simplified Data Processing on Large Clusters [Dean & Ghemawat, OSDI '04]
- Pig Latin: A Not-So-Foreign Language for Data Processing [Olston et al., SIGMOD '08]
- Hive: A Petabyte Scale Data Warehouse Using Hadoop [Thusoo et al., VLDB '09]
- Spark: Cluster Computing With Working Sets [Zaharia et al., HotCloud '10]
- Upper and Lower Bounds on the Cost of a Map-Reduce Computation [Afrati et al., VLDB '13]
- Optimizing Joins in a Map-Reduce Environment [Afrati et al., TKDE '10]
- Parallel Evaluation of Conjunctive Queries [Koutris & Suciu, PODS '11]
- Communication Steps for Parallel Query Processing [Beame et al., PODS '13]
- Skew in Parallel Query Processing [Beame et al., arXiv]