CS 345D, Semih Salihoglu (some slides are copied from Ilan Horn, Jeff Dean, and Utkarsh Srivastava's presentations online)
MapReduce: System and Theory
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
MapReduce History
- 2003: built at Google
- 2004: published in OSDI (Dean & Ghemawat)
- 2005: open-source version, Hadoop
- 2005-2014: very influential in the DB community
Google's Problem in 2003: lots of data
- Example: 20+ billion web pages x 20KB = 400+ terabytes
- One computer can read 30-35 MB/sec from disk: ~four months to read the web
- ~1,000 hard drives just to store the web
- Even more work to do something with the data: process crawled documents, process web request logs, build inverted indices, construct graph representations of web documents
Special-Purpose Solutions Before 2003
- Spread the work over many machines
- Good news: with 1,000 machines, the same problem takes < 3 hours
Problems with Special-Purpose Solutions
- Bad news 1: lots of programming work: communication and coordination, work partitioning, status reporting, optimization, locality
- Bad news 2: repeat for every problem you want to solve
- Bad news 3: stuff breaks. One server may stay up three years (~1,000 days); if you have 10,000 servers, expect to lose 10 a day
What They Needed
A distributed system that is:
1. Scalable
2. Fault-Tolerant
3. Easy To Program
4. Applicable To Many Problems
MapReduce Programming Model (figure)
map() turns each input record into intermediate <reduce_key, value> pairs; the system groups the pairs by reduce key; reduce() consumes each group <r_k, {r_v_1, r_v_2, ...}> and emits an output list.
Example 1: Word Count
Input: documents. Output: a count per word, e.g. <word, count>.

map(String input_key, String input_value):
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String reduce_key, Iterator values):
  EmitOutput(reduce_key + " " + values.length);
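The pseudocode above maps directly to a few lines of Python. The sketch below adds a toy in-memory driver in place of the MapReduce runtime; the driver, function names, and sample documents are illustrative assumptions, not part of any Hadoop API.

```python
from collections import defaultdict

def map_word_count(input_key, input_value):
    # Emit <word, 1> for every word in the document body.
    for word in input_value.split():
        yield (word, 1)

def reduce_word_count(reduce_key, values):
    # All values for a word are 1s, so the count is just their total.
    return (reduce_key, sum(values))

def run_mapreduce(records, mapper, reducer):
    # Toy single-process driver: apply map, group by key, apply reduce.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [reducer(k, vs) for k, vs in groups.items()]

if __name__ == "__main__":
    docs = [("doc1", "the president is obama"), ("doc2", "the senate is in session")]
    print(run_mapreduce(docs, map_word_count, reduce_word_count))
    # e.g. [('the', 2), ('president', 1), ('is', 2), ('obama', 1), ...]
```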
Example 1: Word Count (figure)
After the map stage, intermediate pairs are grouped by reduce key, e.g. <"obama", {1}>, <"the", {1, 1}>, <"is", {1, 1, 1}>; the reducers then emit the counts.
Example 2: Binary Join R(A, B) ⋈ S(B, C)
Input: tuples tagged with their relation name, <'R', (a, b)> or <'S', (b, c)>. Output: the successful (joined) tuples.

map(String relationName, Tuple t):
  Int b_val      = (relationName == "R") ? t[1] : t[0]
  Int a_or_c_val = (relationName == "R") ? t[0] : t[1]
  EmitIntermediate(b_val, <relationName, a_or_c_val>);

reduce(Int b_j, Iterator<<relationName, a_or_c_val>> a_or_c_vals):
  int[] aVals = getAValues(a_or_c_vals);
  int[] cVals = getCValues(a_or_c_vals);
  foreach a_i, c_k in aVals, cVals:
    EmitOutput(a_i, b_j, c_k);
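A runnable sketch of the same reduce-side join in Python; the tagging scheme, tuple layout, and toy driver are illustrative assumptions rather than the slide's exact code.

```python
from collections import defaultdict

def map_join(relation_name, t):
    # R(A, B): key on t[1]; S(B, C): key on t[0]. Tag the value with its relation.
    if relation_name == "R":
        yield (t[1], ("R", t[0]))   # (b, ('R', a))
    else:
        yield (t[0], ("S", t[1]))   # (b, ('S', c))

def reduce_join(b, tagged_vals):
    a_vals = [v for tag, v in tagged_vals if tag == "R"]
    c_vals = [v for tag, v in tagged_vals if tag == "S"]
    # Cross product of the A-side and C-side values that share this B value.
    return [(a, b, c) for a in a_vals for c in c_vals]

def run_join(r_tuples, s_tuples):
    groups = defaultdict(list)
    for rel, tuples in (("R", r_tuples), ("S", s_tuples)):
        for t in tuples:
            for k, v in map_join(rel, t):
                groups[k].append(v)
    out = []
    for b, vals in groups.items():
        out.extend(reduce_join(b, vals))
    return out

if __name__ == "__main__":
    R = [("a1", "b3"), ("a2", "b3"), ("a3", "b2")]
    S = [("b3", "c1"), ("b3", "c2")]
    print(run_join(R, S))   # four (a, b3, c) tuples; b2 produces no output
```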
Example 2: Binary Join R(A, B) ⋈ S(B, C) (figure)
Tagged tuples such as <'R', (a_i, b_j)> and <'S', (b_j, c_k)> are grouped by their B value. A reduce key like b_3 that receives values from both R and S emits the joined tuples; a reduce key like b_2 that receives values from only one relation produces no output.
Programming Model Very Applicable
- distributed grep, web access log stats
- distributed sort, web link-graph reversal
- term-vector per host, inverted index construction
- document clustering, statistical machine translation
- machine learning, image processing, ...
Can read and write many different data types; applicable to many problems.
MapReduce Execution (figure)
A master task assigns map and reduce tasks to worker machines. There are usually many more map tasks than machines, e.g. 200K map tasks and 5K reduce tasks on 2K machines.
Fault-Tolerance: Handled via Re-execution
On worker failure:
- Detect failure via periodic heartbeats
- Re-execute completed and in-progress map tasks
- Re-execute in-progress reduce tasks
- Task completion is committed through the master
Master failure is much rarer; AFAIK MR/Hadoop do not handle master node failure.
Other Features
- Combiners (a map-side pre-aggregation; see the sketch below)
- Status & monitoring
- Locality optimization
- Redundant execution (to mitigate the "curse of the last reducer", i.e. stragglers)
Overall: a great execution environment for large-scale data.
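A minimal sketch of the combiner idea, assuming the word-count example from earlier; the function names are illustrative, not Hadoop's API.

```python
from collections import Counter

def map_with_combiner(doc_id, text):
    # A plain map would emit <word, 1> once per occurrence; the combiner
    # (a map-side reduce) collapses duplicates so each map task ships at most
    # one <word, local_count> pair per distinct word before the shuffle.
    for word, local_count in Counter(text.split()).items():
        yield (word, local_count)

def reduce_counts(word, partial_counts):
    # The reducer now sums partial counts instead of a long list of 1s.
    return (word, sum(partial_counts))
```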
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
MR Shortcoming 1: Workflows
Many queries/computations need multiple MR jobs; the 2-stage computation is too rigid.
Example: find the top 10 most visited pages in each category.
Visits(User, Url, Time):
  Amy, cnn.com, 8:00
  Amy, bbc.com, 10:00
  Amy, flickr.com, 10:05
  Fred, cnn.com, 12:00
UrlInfo(Url, Category, PageRank):
  cnn.com, News, 0.9
  bbc.com, News, 0.8
  flickr.com, Photos, 0.7
  espn.com, Sports, 0.9
Top 10 most visited pages in each category
Inputs: Visits(User, Url, Time), UrlInfo(Url, Category, PageRank)
- MR Job 1: group by url + count => UrlCount(Url, Count)
- MR Job 2: join with UrlInfo => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + count => TopTenUrlPerCategory(Url, Category, Count)
MR Shortcoming 2: API too low-level
The same workflow, with common operations coded by hand: joins, selects, projections, aggregates, sorting, distinct.
- MR Job 1: group by url + count => UrlCount(Url, Count)
- MR Job 2: join with UrlInfo => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + find top 10 => TopTenUrlPerCategory(Url, Category, Count)
MapReduce Is Not The Ideal Programming API
- Programmers are not used to maps and reduces
- We want: joins, filters, group-by, select * from ...
- Solution: high-level languages/systems that compile to MR/Hadoop
High-level Language 1: Pig Latin
- SIGMOD 2008, from Yahoo! Research (Olston et al.)
- Apache software; main teams now at Twitter & Hortonworks
- Common ops (e.g. filter, group by, join) as high-level language constructs
- Workflow expressed as step-by-step procedural scripts
- Compiles to Hadoop
Pig Latin Example
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
urlCounts = foreach gVisits generate url, count(visits);
urlInfo = load '/data/urlInfo' as (url, category, pRank);
urlCategoryCount = join urlCounts by url, urlInfo by url;
gCategories = group urlCategoryCount by category;
topUrls = foreach gCategories generate top(urlCounts, 10);
store topUrls into '/data/topUrls';
Pig Latin Example (cont.)
The script operates directly over files (e.g. '/data/visits', '/data/urlInfo').
Pig Latin Example (cont.)
Schemas are optional and can be assigned dynamically, e.g. the as (user, url, time) clause of load.
Pig Latin Example (cont.)
User-defined functions (UDFs) can be used in every construct: load, store, group, filter, foreach.
Pig Latin Execution
The script compiles into three MR jobs: MR Job 1 (group by url + count), MR Job 2 (join), MR Job 3 (group by category + top).
Pig Latin: Execution
The compiled plan over Visits(User, Url, Time) and UrlInfo(Url, Category, PageRank):
- MR Job 1: group by url + foreach => UrlCount(Url, Count)
- MR Job 2: join => UrlCategoryCount(Url, Category, Count)
- MR Job 3: group by category + foreach => TopTenUrlPerCategory(Url, Category, Count)
High-level Language 2: Hive
- VLDB 2009, from Facebook (Thusoo et al.)
- Apache software
- Hive-QL: SQL-like declarative syntax, e.g. SELECT *, INSERT INTO, GROUP BY, SORT BY
- Compiles to Hadoop
Hive Example
INSERT TABLE UrlCounts
  (SELECT url, count(*) AS count FROM Visits GROUP BY url)
INSERT TABLE UrlCategoryCount
  (SELECT url, count, category FROM UrlCounts JOIN UrlInfo ON (UrlCounts.url = UrlInfo.url))
SELECT category, topTen(*) FROM UrlCategoryCount GROUP BY category
Hive Architecture (figure)
Query interfaces (command line, web, JDBC) feed a compiler/query optimizer, which produces Hadoop jobs.
Hive Final Execution
The statements above compile to the same job DAG over Visits(User, Url, Time) and UrlInfo(Url, Category, PageRank):
- MR Job 1: select-from-group-by => UrlCount(Url, Count)
- MR Job 2: join => UrlCategoryCount(Url, Category, Count)
- MR Job 3: select-from-group-by => TopTenUrlPerCategory(Url, Category, Count)
Pig & Hive Adoption
Both Pig & Hive are very successful.
- Pig usage in 2009 at Yahoo: 40% of all Hadoop jobs
- Hive usage: thousands of jobs, 15 TB/day of new data loaded
MapReduce Shortcoming 3: Iterative Computations
Examples: graph algorithms, machine learning.
- Specialized MR-like or MR-based systems: graph processing (Pregel, Giraph, Stanford GPS); machine learning (Apache Mahout)
- General iterative data processing systems: iMapReduce, HaLoop
- **Spark** from Berkeley (now Apache Spark), published in HotCloud '10 [Zaharia et al.]
Outline
- System: MapReduce/Hadoop; Pig & Hive
- Theory: Model for Lower Bounding Communication Cost; Shares Algorithm for Joins on MR & Its Optimality
Tradeoff Between Per-Reducer Memory and Communication Cost (figure)
Example: comparing drugs' patient records. Each reduce key is a pair of drugs and its values are the two drugs' patient sets, e.g. <(drug_1, drug_2), (Patients_1, Patients_2)>. With 6,500 drugs there are 6,500 x 6,499 > 40M reduce keys.
q = per-reducer memory cost, r = communication cost.
Similarity Join
Input: R(A, B), with Domain(B) = [1, 10]. Compute all pairs of tuples (t, u) such that |t[B] - u[B]| <= 1.
Example (1) input: (a1, 5), (a2, 2), (a3, 6), (a4, 2), (a5, 7).
Output: the pairs <(a1, 5), (a3, 6)>, <(a2, 2), (a4, 2)>, <(a3, 6), (a5, 7)>.
Hashing Algorithm [ADMPU ICDE '12]
Split Domain(B) into p ranges of values, one per reducer.
Example (2), p = 2: Reducer 1 covers [1, 5], Reducer 2 covers [6, 10]. Tuples on the boundary (t.B = 5) are replicated to both reducers, so (a1, 5) is sent to both; (a2, 2) and (a4, 2) go to Reducer 1; (a3, 6) and (a5, 7) go to Reducer 2.
Per-reducer memory cost = 3, communication cost = 6.
Example (3), p = 5: Reducer 1 covers [1, 2], Reducer 2 [3, 4], Reducer 3 [5, 6], Reducer 4 [7, 8], Reducer 5 [9, 10]. Tuples are replicated if t.B = 2, 4, 6, or 8, so (a2, 2) and (a4, 2) go to Reducers 1 and 2, (a3, 6) goes to Reducers 3 and 4, (a1, 5) goes to Reducer 3, and (a5, 7) goes to Reducer 4.
Per-reducer memory cost = 2, communication cost = 8. A code sketch of this map function follows.
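A minimal Python sketch of the map side of this hashing algorithm, reconstructed from the two examples above; the function name, the equal-width bucketing, and the boundary rule are assumptions for illustration.

```python
def map_similarity_join(tuple_, p, domain_size=10):
    """Assign (id, b) to range-partitioned reducers, replicating boundary tuples.
    Assumes p divides domain_size and B values are integers in [1, domain_size]."""
    _, b = tuple_
    width = domain_size // p                    # e.g. p=5 -> ranges of width 2
    reducer = (b - 1) // width                  # 0-based reducer index
    yield (reducer, tuple_)
    # If b is the last value of its range, the pair (b, b+1) spans two reducers,
    # so replicate the tuple to the next reducer as well.
    if b % width == 0 and reducer + 1 < p:
        yield (reducer + 1, tuple_)

# Example (2): p=2 -> (a1, 5) goes to reducers 0 and 1; memory 3 per reducer, cost 6.
# Example (3): p=5 -> tuples with B in {2, 4, 6, 8} are replicated; memory 2, cost 8.
```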
Same Tradeoff in Other Algorithms
- Multiway joins ([AU] TKDE '11)
- Finding subgraphs ([SV] WWW '11, [AFU] ICDE '13)
- Computing minimum spanning trees ([KSV] SODA '10)
- Other similarity joins: set similarity joins ([VCL] SIGMOD '10), Hamming distance ([ADMPU] ICDE '12 and later in the talk)
We Want
A general framework applicable to a variety of problems.
- Question 1: What is the minimum communication cost of any MR algorithm, if each reducer uses <= q memory?
- Question 2: Are there algorithms that achieve this lower bound?
Next
- Framework: input-output model; mapping schemas & replication rate
- Lower bound for the triangle query
- Shares algorithm for the triangle query
- Generalized Shares algorithm
Framework: Input-Output Model
- Input data elements I: {i_1, i_2, ..., i_n}
- Output elements O: {o_1, o_2, ..., o_m}
Example 1: R(A, B) ⋈ S(B, C)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- Inputs: all n^2 pairs (a_i, b_j) of R(A, B) and all n^2 pairs (b_j, c_k) of S(B, C), i.e. n^2 + n^2 = 2n^2 possible inputs
- Outputs: all n^3 triples (a_i, b_j, c_k)
Example 2: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- Inputs: the n^2 pairs of each of R(A, B), S(B, C), and T(C, A), i.e. n^2 + n^2 + n^2 = 3n^2 input elements
- Outputs: all n^3 triples (a_i, b_j, c_k)
Framework: Mapping Schema & Replication Rate
- p reducers: {R_1, R_2, ..., R_p}
- q: the maximum number of inputs sent to any reducer R_i
- Def (Mapping Schema): M : I -> {R_1, R_2, ..., R_p} such that each R_i receives at most q_i <= q inputs, and every output is covered by some reducer (i.e. some reducer receives all the inputs that output depends on)
- Def (Replication Rate): r = (q_1 + q_2 + ... + q_p) / |I|
q captures memory; r captures communication cost.
Our Questions Again
- Question 1: What is the minimum replication rate of any mapping schema as a function of q (the maximum number of inputs sent to any reducer)?
- Question 2: Are there mapping schemas that match this lower bound?
Triangle Query: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
With |Domain(A)| = |Domain(B)| = |Domain(C)| = n:
- 3n^2 input elements; each input contributes to n outputs (e.g. (a_i, b_j) in R appears in the outputs (a_i, b_j, c_1), ..., (a_i, b_j, c_n))
- n^3 outputs; each output depends on 3 inputs
Lower Bound on Replication Rate (Triangle Query)
The key is an upper bound g(q) on the number of outputs a reducer can cover with <= q inputs.
- Claim (proved via the AGM bound): a reducer with q_i inputs covers at most (q_i/3)^(3/2) outputs.
- All n^3 outputs must be covered: sum_i (q_i/3)^(3/2) >= n^3.
- Recall r = (sum_i q_i) / (3n^2). Since each q_i <= q, sum_i q_i * q^(1/2) >= sum_i q_i^(3/2) >= 3^(3/2) * n^3, and therefore r >= 3^(1/2) * n / q^(1/2).
Memory/Communication Cost Tradeoff (Triangle Query) (figure)
The x-axis is q, the max number of inputs per reducer; the y-axis is the replication rate r. At q = 3 (one reducer for each output) r = n; at q = 3n^2 (all inputs to one reducer) r = 1. The Shares algorithm traces the curve between these two extremes.
Shares Algorithm for Triangles
- p = k^3 reducers, indexed r_{1,1,1} to r_{k,k,k}
- Each attribute A, B, C gets k "shares"; h_A, h_B, h_C are independent, perfect hash functions from domains of size n to [1, k]
- (a_i, b_j) in R(A, B) is sent to the reducers r_{h_A(a_i), h_B(b_j), *}; e.g. if h_A(a_i) = 3 and h_B(b_j) = 4, it is sent to r_{3,4,1}, r_{3,4,2}, ..., r_{3,4,k}
- (b_j, c_l) in S(B, C) is sent to r_{*, h_B(b_j), h_C(c_l)}
- (c_l, a_i) in T(C, A) is sent to r_{h_A(a_i), *, h_C(c_l)}
- Correctness: the dependencies of output (a_i, b_j, c_l) meet at r_{h_A(a_i), h_B(b_j), h_C(c_l)}; e.g. if h_C(c_l) = 2, all three tuples above reach r_{3,4,2}
Shares Algorithm for Triangles (example figure)
With p = 27 (k = 3): if h_A(a_1) = 2, h_B(b_1) = 1, h_C(c_1) = 3, then (a_1, b_1) in R goes to r_{2,1,*}, (b_1, c_1) in S goes to r_{*,1,3}, and (c_1, a_1) in T goes to r_{2,*,3}; all three meet at r_{2,1,3}. Replication rate r = k = p^(1/3); per-reducer input q = 3n^2/p^(2/3). A code sketch follows.
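A compact Python sketch of the Shares mapping for triangles; the hash functions, reducer indexing, and local-join driver are illustrative stand-ins, not the paper's implementation. Each tuple is replicated k times along its missing attribute, so r = k = p^(1/3), and the three tuples of any triangle meet at exactly one reducer.

```python
import itertools
from collections import defaultdict

def shares_triangle_reducers(k, relation, tup):
    """Return the reducer indices (i, j, l) that a tuple is sent to."""
    h = lambda v: hash(v) % k          # stand-in for the 'perfect' hash functions
    if relation == "R":                # (a, b): fix h_A, h_B, replicate over C
        a, b = tup
        return [(h(a), h(b), l) for l in range(k)]
    if relation == "S":                # (b, c): replicate over A
        b, c = tup
        return [(i, h(b), h(c)) for i in range(k)]
    if relation == "T":                # (c, a): replicate over B
        c, a = tup
        return [(h(a), j, h(c)) for j in range(k)]

def count_triangles(R, S, T, k=3):
    buckets = defaultdict(lambda: {"R": [], "S": [], "T": []})
    for rel, tuples in (("R", R), ("S", S), ("T", T)):
        for t in tuples:
            for idx in shares_triangle_reducers(k, rel, t):
                buckets[idx][rel].append(t)
    # Each reducer joins only its local input; every triangle is found exactly once,
    # at the reducer (h_A(a), h_B(b), h_C(c)).
    found = set()
    for parts in buckets.values():
        for (a, b), (b2, c), (c2, a2) in itertools.product(parts["R"], parts["S"], parts["T"]):
            if b == b2 and c == c2 and a == a2:
                found.add((a, b, c))
    return found

if __name__ == "__main__":
    R = [(1, 2)]; S = [(2, 3)]; T = [(3, 1)]       # one triangle (1, 2, 3)
    print(count_triangles(R, S, T, k=3))           # {(1, 2, 3)}
```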
Shares Algorithm for Triangles: Optimality
- Shares' replication rate: r = k = p^(1/3), with q = 3n^2/p^(2/3)
- Lower bound: r >= 3^(1/2) * n / q^(1/2)
- Substituting q = 3n^2/p^(2/3) into the lower bound gives r >= p^(1/3), so Shares matches it
- Special case 1: p = n^3, q = 3, r = n; equivalent to the trivial algorithm with one reducer for each output
- Special case 2: p = 1, q = 3n^2, r = 1; equivalent to the trivial serial algorithm
Other Lower Bound Results [Afrati et al., VLDB '13]
- Hamming distance 1
- Multiway joins: R(A, B) ⋈ S(B, C) ⋈ T(C, A)
- Matrix multiplication
Generalized Shares ([AU] TKDE '11)
- Relations R_i, i = 1, ..., m, with r_i = |R_i|; attributes A_j, j = 1, ..., n; query Q = the natural join of the R_i
- Give each attribute A_j a "share" s_j; the p reducers are indexed r_{1,1,...,1} to r_{s_1, s_2, ..., s_n}
- A tuple of R_i is sent to every reducer that agrees with it on the hashed attributes of R_i, i.e. it is replicated once for every combination of the shares of the attributes R_i does not contain
- Minimize the total communication cost: sum_i r_i * prod_{A_j not in R_i} s_j, subject to prod_j s_j = p
Example: Triangles
R(A, B), S(B, C), T(C, A), with |R| = |S| = |T| = n^2.
Total communication cost: minimize |R|*s_C + |S|*s_A + |T|*s_B subject to s_A * s_B * s_C = p.
Solution: s_A = s_B = s_C = p^(1/3) = k.
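As a sanity check on this closed-form solution, the following Python sketch brute-forces integer share assignments for the triangle query; it is a stand-in for actually solving the geometric program, and the relation sizes and p are made-up inputs.

```python
import itertools

def best_triangle_shares(size_r, size_s, size_t, p):
    """Brute-force integer shares with s_A * s_B * s_C == p, minimizing the
    total communication cost |R|*s_C + |S|*s_A + |T|*s_B."""
    best_cost, best_shares = None, None
    for s_a, s_b, s_c in itertools.product(range(1, p + 1), repeat=3):
        if s_a * s_b * s_c != p:
            continue
        # Each relation is replicated along the share of the one attribute it lacks.
        cost = size_r * s_c + size_s * s_a + size_t * s_b
        if best_cost is None or cost < best_cost:
            best_cost, best_shares = cost, (s_a, s_b, s_c)
    return best_shares, best_cost

if __name__ == "__main__":
    n2 = 100            # |R| = |S| = |T| = n^2, a made-up size
    print(best_triangle_shares(n2, n2, n2, p=27))   # ((3, 3, 3), 900), i.e. s = p**(1/3)
```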
Shares Is Optimal For Any Query
- Generalized Shares solves a geometric program, which always has a solution and is solvable in polynomial time (observed by Chris, and independently by Beame, Koutris, and Suciu (BKS))
- BKS proved that Shares' communication cost vs. per-reducer memory tradeoff is optimal for any query
Open MapReduce Theory Questions
- Shares' communication cost grows with p for most queries; e.g. for triangles the communication cost is p^(1/3) * |I|, which is the best possible for one round (again, for a given per-reducer memory)
- Q1: Can we do better with multi-round algorithms? Are there 2-round algorithms with O(|I|) cost? The answer is no for general queries, but maybe for a class of queries. How about constant-round MR algorithms? Good work in PODS 2013 by Beame, Koutris, and Suciu from UW
- Q2: How about instance-optimal algorithms?
- Q3: How can we guard computations against skew? (good work on arXiv by Beame, Koutris, and Suciu)
References
- MapReduce: Simplified Data Processing on Large Clusters [Dean & Ghemawat, OSDI '04]
- Pig Latin: A Not-So-Foreign Language for Data Processing [Olston et al., SIGMOD '08]
- Hive: A Petabyte Scale Data Warehouse Using Hadoop [Thusoo et al., VLDB '09]
- Spark: Cluster Computing With Working Sets [Zaharia et al., HotCloud '10]
- Upper and Lower Bounds on the Cost of a Map-Reduce Computation [Afrati et al., VLDB '13]
- Optimizing Joins in a Map-Reduce Environment [Afrati et al., TKDE '10]
- Parallel Evaluation of Conjunctive Queries [Koutris & Suciu, PODS '11]
- Communication Steps for Parallel Query Processing [Beame et al., PODS '13]
- Skew in Parallel Query Processing [Beame et al., arXiv]