1 MAP-REDUCE: WIN -OR- EPIC WIN  CSC313: Advanced Programming Topics

2-3 Brief History of Google  BackRub: 1996  4 disk drives  24 GB total storage

4-5 Brief History of Google  Google: 1998  44 disk drives  366 GB total storage

6 Traditional Design Principles
 If the problem is big enough, a supercomputer can process it
 Uses desktop CPUs, just a lot more of them
 But it also provides huge bandwidth to memory
 Equivalent to many machines' bandwidth at once
 But supercomputers are VERY, VERY expensive
 Maintenance is also expensive once the machine is bought
 But you do get something: high quality == low downtime
 A safe, expensive solution to very large problems

7 Why Trade Money for Safety?

9-13 How Was Search Performed?
 User requests http://www.yahoo.com/search?p=pager
 DNS resolves www.yahoo.com to 209.191.122.70
 The request is then served from http://209.191.122.70/search?p=pager

14-17 Google's Big Insight
 Performing search is "embarrassingly parallel"
 No need for a supercomputer and all that expense
 Can instead do this using lots & lots of desktops
 Identical effective bandwidth & performance
 But the problem is that desktop machines are unreliable
 Budget for 2 replacements, since machines are cheap
 Just expect failure; software provides quality

18 Brief History of Google  Google: 2012  ?0,000 total servers  ??? PB total storage

19-23 How Is Search Performed Now?
 User requests http://209.85.148.100/search?q=android
 The request is handled by many machines working together:
 Spell Checker
 Ad Server
 Document Servers (TB)
 Index Servers (TB)

24 Google's Processing Model
 Buy cheap machines & prepare for the worst
 Machines are going to fail, but this is still the cheaper approach
 Important steps keep the whole system reliable
 Replicate data so that information losses are limited
 Move data freely so loads can always be rebalanced
 These decisions lead to many other benefits
 Scalability helped by the focus on balancing
 Search speed improved; performance much better
 Resources fully utilized, since search demand varies

25-26 Heterogeneous Processing
 By buying the cheapest computers, variance is high
 Programs must handle homogeneous & heterogeneous systems
 A centralized work queue helps with the different speeds (see the sketch below)
 This process also leads to a few small downsides:
 Space
 Power consumption
 Cooling costs
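Why a centralized queue handles mixed speeds: each machine pulls its next task the moment it is free, so fast machines simply end up doing more of the work. A minimal Python sketch of that idea (the task count and delays are made up for illustration; this is not Google's infrastructure):

import queue
import threading
import time

tasks = queue.Queue()
for t in range(20):
    tasks.put(t)                   # 20 hypothetical tasks

def worker(name, delay):
    # Each machine pulls its next task as soon as it is free, so a
    # fast machine naturally handles more of the queue.
    while True:
        try:
            t = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay)          # stand-in for machines of different speeds
        print(name, "finished task", t)

threads = [threading.Thread(target=worker, args=("fast", 0.01)),
           threading.Thread(target=worker, args=("slow", 0.05))]
for th in threads:
    th.start()
for th in threads:
    th.join()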

27 Complexity at Google

29-30 Google Abstractions
 Google File System
 Handles replication to provide scalability & durability
 BigTable
 Manages large structured data sets
 Chubby
 Gonna skip past that joke; distributed locking service
 MapReduce
 If the job fits the model, easy parallelism is possible without much work

31 Remember Google’s Problem

32-35 MapReduce Overview
 Programming model provides a good Façade
 Automatic parallelization & load balancing
 Network and disk I/O optimization
 Robust performance even if machines fail
 Idea came from 2 Lisp (functional) primitives (illustrated below)
 Map: process each entry in a list using some function
 Reduce: recombine the data using a given function
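The names come straight from functional programming. A quick illustration of the two primitives using Python's built-in map and functools.reduce (sample data is made up):

from functools import reduce

words = ["to", "be", "or", "not", "to", "be"]    # sample input

# Map: process each entry in a list using some function
lengths = list(map(len, words))                  # [2, 2, 2, 3, 2, 2]

# Reduce: recombine the data using a given function
total = reduce(lambda a, b: a + b, lengths, 0)   # 13

print(lengths, total)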

36-39 Typical MapReduce Problem
1. Read lots and lots of data (e.g., TBs)
2. Map: extract the important data from each entry in the input
3. Combine the Maps and sort entries by key (the shuffle)
4. Reduce: process each key's entries to get the result for that key
5. Output the final result & watch the money roll in
(A toy driver for steps 2-4 is sketched below.)
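A minimal, single-machine sketch of steps 2-4, assuming user-supplied mapfn and reducefn in the style of the pseudo-code on the next slides. This shows only the data flow; the real system adds parallelism, distribution, and fault tolerance:

from collections import defaultdict

def run_mapreduce(inputs, mapfn, reducefn):
    # Map phase: mapfn yields (key, value) pairs for each input record
    intermediate = defaultdict(list)
    for in_key, in_value in inputs:
        for key, value in mapfn(in_key, in_value):
            intermediate[key].append(value)    # shuffle: group values by key
    # Reduce phase: reducefn turns each key's value list into one result
    return {key: reducefn(key, values)
            for key, values in sorted(intermediate.items())}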

40 Pictorial View of MapReduce

41-42 Ex: Count Word Frequencies
 Map processes files separately & counts word frequencies in each
 Input: Key=URL, Value=text on page
 Map output: Key'=word, Value'=count (one entry per occurrence)

43-44 Ex: Count Word Frequencies
 In the shuffle step, Maps are combined & entries sorted by key: ("to","1") ("to","1") ("be","1") ("be","1") ("or","1") ("not","1")
 Reduce combines each key's results to compute the final output: ("to","2") ("be","2") ("or","1") ("not","1")

45 Word Frequency Pseudo-code

Map(String input_key, String input_values) {
  // input_key = URL, input_values = text on page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, "1");    // one "1" per occurrence of the word
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key = a word; intermediate_values = all the "1"s emitted for it
  int result = 0;
  foreach v in intermediate_values {
    result += ParseInt(v);
  }
  Emit(result);                  // total count for this word
}
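Plugged into the toy run_mapreduce driver sketched earlier, the same logic becomes runnable Python. The function names are my own, not Google's API:

def wc_map(url, text):
    # emit (word, 1) once per occurrence, as in the slide's Map
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    return sum(counts)           # total occurrences of this word

pages = [("url1", "to be or not to be")]
print(run_mapreduce(pages, wc_map, wc_reduce))
# -> {'be': 2, 'not': 1, 'or': 1, 'to': 2}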

46-47 Ex: Build Search Index
 Map processes files separately & records the words found on each
 Input: Key=URL, Value=text on page
 Map output: Key'=word, Value'=URL
 To get the search Map, Reduce combines each key's results
 Reduce output: Key=word, Value=URLs with word

48 Search Index Pseudo-code

Map(String input_key, String input_values) {
  // input_key = URL, input_values = text on page
  String[] words = input_values.split(" ");
  foreach w in words {
    EmitIntermediate(w, input_key);   // word -> URL it appears on
  }
}

Reduce(String key, Iterator intermediate_values) {
  // key = a word; collect every URL on which it appeared
  List result = new ArrayList();
  foreach v in intermediate_values {
    result.addLast(v);
  }
  Emit(result);                       // list of URLs for this word
}
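The same job with the toy driver; only the Map and Reduce functions change (again a sketch with made-up names, not real indexing infrastructure):

def index_map(url, text):
    for w in text.split():
        yield (w, url)               # word -> URL it appears on

def index_reduce(word, urls):
    return sorted(set(urls))         # de-duplicated list of URLs per word

pages = [("url1", "to be or not to be"),
         ("url2", "let it be")]
print(run_mapreduce(pages, index_map, index_reduce))
# -> {'be': ['url1', 'url2'], 'it': ['url2'], 'let': ['url2'], ...}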

49 Ex: Page Rank Computation  Google's algorithm for ranking pages' relevance

50-51 Ex: Page Rank Computation
 Map: Key=page & its current rank, Value=links on page; for each link on the page, emit Key'=the link, Value'=a share of the page's rank
 Reduce: sum (+) the shares arriving at each page to get its new rank, keeping the page's links for the next round
 Repeat the entire process (e.g., feed the Reduce results back into Map) until the page ranks stabilize (the sum of changes to the ranks drops below some threshold)
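A hedged sketch of one such iteration using the toy driver. The 0.85 damping factor and the rank-splitting scheme are the textbook PageRank formulation, assumed here rather than taken from the slides, and the two-page graph is invented:

def pr_map(url, value):
    rank, links = value
    yield (url, ("links", links))                  # carry the graph along
    for link in links:
        yield (link, ("rank", rank / len(links)))  # share rank among links

def pr_reduce(url, entries):
    links, total = [], 0.0
    for kind, payload in entries:
        if kind == "links":
            links = payload                        # recovered graph structure
        else:
            total += payload                       # incoming rank shares
    return (0.15 + 0.85 * total, links)            # damped sum = new rank

graph = {"a": (1.0, ["b"]), "b": (1.0, ["a"])}
for _ in range(20):                  # in practice: until ranks stop changing
    graph = run_mapreduce(graph.items(), pr_map, pr_reduce)
print({u: round(r, 3) for u, (r, _) in graph.items()})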

53 Advanced MapReduce Ideas
 How to implement? One master, many workers
 Split the input data into tasks, where each task's size is fixed
 The reduce phase will also be partitioned into tasks
 Dynamically assign tasks to workers during each step
 Tasks are assigned as needed & placed on an in-process list
 Once a worker completes a task, save the result & retire the task
 Assume a worker crashed if its task does not complete in time
 Move incomplete tasks back into the pool for reassignment (sketched below)
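A minimal sketch of the master's bookkeeping under these assumptions: a single process with simulated workers, where a short deadline stands in for real crash detection and the 30% "crash" rate is invented for the demo:

import random
import time

def run_master(tasks, deadline=0.5):
    pool = list(tasks)        # tasks waiting to be (re)assigned
    in_process = {}           # task -> time it was handed to a worker
    results = {}
    while pool or in_process:
        now = time.monotonic()
        # Crash detection: any task not finished by its deadline is
        # assumed lost and moved back into the pool for reassignment
        for task, started in list(in_process.items()):
            if now - started > deadline:
                del in_process[task]
                pool.append(task)
        if pool:
            task = pool.pop()
            in_process[task] = time.monotonic()
            # Simulated worker: 30% of the time it "crashes" and never
            # reports back; otherwise it completes and retires the task
            if random.random() > 0.3:
                results[task] = task * task    # stand-in for real work
                del in_process[task]
        time.sleep(0.01)
    return results

print(run_master(range(10)))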

54 Advanced MapReduce Ideas
