MAP-REDUCE: WIN -OR- EPIC WIN (CSC313: Advanced Programming Topics)
Brief History of Google
BackRub (1996): 4 disk drives, 24 GB of total storage.
Brief History of Google
Google (1998): 44 disk drives, 366 GB of total storage.
Traditional Design Principles
- If the problem is big enough, a supercomputer can handle it
- A supercomputer uses desktop CPUs, just a lot more of them
- It also provides huge bandwidth to memory, equivalent to many machines' bandwidth at once
- But supercomputers are very, very expensive, and maintenance is also expensive once the machine is bought
- You do get something for the money: high quality == low downtime
- In short, a safe but expensive solution to very large problems
Why Trade Money for Safety?
How Was Search Performed?
A request for http://www.yahoo.com/search?p=pager was resolved by DNS to a single address (http://209.191.122.70), and that one server answered the query http://209.191.122.70/search?p=pager.
Google's Big Insight
- Performing search is "embarrassingly parallel": there is no need for a supercomputer and all that expense
- It can instead be done using lots & lots of desktops, with identical effective bandwidth & performance
- The problem is that desktop machines are unreliable
- Budget for 2 replacements, since the machines are cheap
- Just expect failure; the software provides the quality
Brief History of Google
Google (2012): ?0,000 total servers, ??? PB of total storage.
How Is Search Performed Now?
A query such as http://209.85.148.100/search?q=android is fanned out across many specialized machines: a spell checker, an ad server, document servers (TBs of data), and index servers (TBs of data).
Google's Processing Model
- Buy cheap machines & prepare for the worst: machines are going to fail, but this is still the cheaper approach
- Important steps keep the whole system reliable: replicate data so that information losses are limited, and move data freely so that loads can always be rebalanced
- These decisions lead to many other benefits: the focus on balancing helps scalability, search speed and overall performance improve, and resources are utilized fully since search demand varies
Heterogeneous Processing
- Buying the cheapest computers means variance between machines is high, so programs must handle both homogeneous & heterogeneous systems
- A centralized work queue helps cope with the different machine speeds
- The approach also has a few small downsides: space, power consumption, and cooling costs
Complexity at Google
Google Abstractions
- Google File System: handles replication to provide scalability & durability
- BigTable: manages large structured data sets
- Chubby: a distributed locking service (we'll skip past that joke)
- MapReduce: if the job fits the model, easy parallelism is possible without much work
Remember Google’s Problem
MapReduce Overview
- The programming model provides a good Façade: automatic parallelization & load balancing, network and disk I/O optimization, and robust performance even if machines fail
- The idea came from 2 Lisp (functional) primitives:
  Map: process each entry in a list using some function
  Reduce: recombine the data using a given function
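To make the two primitives concrete, here is a minimal Java sketch (a hypothetical illustration, not something from the slides) using the standard stream API: map transforms every element of a list with some function, and reduce folds the mapped values together with another function.

    import java.util.List;

    public class MapReducePrimitives {
        public static void main(String[] args) {
            List<String> words = List.of("to", "be", "or", "not", "to", "be");

            // Map: process each entry in the list with some function (word -> its length).
            // Reduce: recombine the mapped values with a given function (summing them).
            int totalCharacters = words.stream()
                    .map(String::length)      // map step
                    .reduce(0, Integer::sum); // reduce step

            System.out.println("Total characters: " + totalCharacters); // prints 13
        }
    }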
Typical MapReduce Problem
1. Read lots and lots of data (e.g., TBs)
2. Map: extract the important data from each entry in the input
3. Shuffle: combine the Map outputs and sort the entries by key
4. Reduce: process each key's entries to get the result for that key
5. Output the final result & watch the money roll in
(A toy sketch of this pipeline follows below.)
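As a rough illustration of those five steps, here is a hypothetical single-machine Java harness (the class and method names are invented for this writeup): it maps every input record to key/value pairs, groups and sorts the pairs by key (the shuffle), and then reduces each key's values to one result.

    import java.util.*;
    import java.util.function.BiFunction;
    import java.util.function.Function;

    // Toy, in-memory sketch of the five MapReduce steps; the real system
    // distributes the map and reduce tasks across many worker machines.
    public class MiniMapReduce {

        public static <I, K extends Comparable<K>, V, R> Map<K, R> run(
                List<I> input,                              // 1. data already "read" into memory
                Function<I, List<Map.Entry<K, V>>> mapFn,   // 2. the Map function
                BiFunction<K, List<V>, R> reduceFn) {       // 4. the Reduce function

            // 2. Map: extract key/value pairs from every input record.
            // 3. Shuffle: combine the Map outputs and sort/group the entries by key.
            Map<K, List<V>> grouped = new TreeMap<>();
            for (I record : input) {
                for (Map.Entry<K, V> pair : mapFn.apply(record)) {
                    grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                           .add(pair.getValue());
                }
            }

            // 4. Reduce: process each key's entries to get that key's result.
            Map<K, R> output = new TreeMap<>();
            for (Map.Entry<K, List<V>> entry : grouped.entrySet()) {
                output.put(entry.getKey(), reduceFn.apply(entry.getKey(), entry.getValue()));
            }

            // 5. Return (output) the final result.
            return output;
        }
    }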
Pictorial View of MapReduce
Ex: Count Word Frequencies
Map processes each file separately and counts the word frequencies in it: the input records are (Key=URL, Value=text on page), and Map emits (Key'=word, Value'=count) pairs.
In the shuffle step, the Map outputs are combined & the entries are sorted by key, e.g. ("to","1"), ("to","1"), ("be","1"), ("be","1"), ("or","1"), ("not","1"). Reduce then combines each key's results to compute the final output: ("to","2"), ("be","2"), ("or","1"), ("not","1").
Word Frequency Pseudo-code

    // Map: called once per (URL, page text) record; emits (word, "1") for every word.
    Map(String input_key, String input_values) {
        String[] words = input_values.split(" ");
        foreach w in words {
            EmitIntermediate(w, "1");
        }
    }

    // Reduce: called once per word with an iterator over all of its "1" values;
    // sums them and emits the word's total count.
    Reduce(String key, Iterator intermediate_values) {
        int result = 0;
        foreach v in intermediate_values {
            result += ParseInt(v);
        }
        Emit(result);
    }
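For comparison, here is a hypothetical runnable Java version of the same job built on the toy MiniMapReduce harness sketched earlier (the URLs and class names are invented for illustration).

    import java.util.*;
    import java.util.function.BiFunction;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class WordFrequencyJob {
        public static void main(String[] args) {
            // Input records: (URL, text on page), as in the slides' example.
            List<Map.Entry<String, String>> pages = List.of(
                    Map.entry("http://example.com/a", "to be or not"),
                    Map.entry("http://example.com/b", "to be"));

            // Map: emit (word, 1) for every word on the page.
            Function<Map.Entry<String, String>, List<Map.Entry<String, Integer>>> mapFn =
                    page -> Arrays.stream(page.getValue().split(" "))
                                  .map(word -> Map.entry(word, 1))
                                  .collect(Collectors.toList());

            // Reduce: sum up each word's counts.
            BiFunction<String, List<Integer>, Integer> reduceFn =
                    (word, ones) -> ones.stream().mapToInt(Integer::intValue).sum();

            Map<String, Integer> counts = MiniMapReduce.run(pages, mapFn, reduceFn);
            System.out.println(counts); // {be=2, not=1, or=1, to=2}
        }
    }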
Ex: Build Search Index
Map processes each file separately and records the words found on it: the input is (Key=URL, Value=text on page), and Map emits (Key'=word, Value'=URL) pairs. To get the search index, Reduce combines each key's results into (Key=word, Value=URLs with that word).
Search Index Pseudo-code

    // Map: called once per (URL, page text) record; emits (word, URL) for every word.
    Map(String input_key, String input_values) {
        String[] words = input_values.split(" ");
        foreach w in words {
            EmitIntermediate(w, input_key);
        }
    }

    // Reduce: called once per word with an iterator over every URL containing it;
    // collects and emits the list of URLs.
    Reduce(String key, Iterator intermediate_values) {
        List result = new ArrayList();
        foreach v in intermediate_values {
            result.add(v);
        }
        Emit(result);
    }
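Only the two functions change relative to the word-count job; under the same assumptions, a hypothetical driver for the toy harness might look like this.

    import java.util.*;
    import java.util.function.BiFunction;
    import java.util.function.Function;
    import java.util.stream.Collectors;

    public class SearchIndexJob {
        public static void main(String[] args) {
            List<Map.Entry<String, String>> pages = List.of(
                    Map.entry("http://example.com/a", "to be or not"),
                    Map.entry("http://example.com/b", "to be"));

            // Map: emit (word, URL) for every word on the page.
            Function<Map.Entry<String, String>, List<Map.Entry<String, String>>> mapFn =
                    page -> Arrays.stream(page.getValue().split(" "))
                                  .map(word -> Map.entry(word, page.getKey()))
                                  .collect(Collectors.toList());

            // Reduce: collect the distinct URLs that contain each word.
            BiFunction<String, List<String>, List<String>> reduceFn =
                    (word, urls) -> new ArrayList<>(new LinkedHashSet<>(urls));

            Map<String, List<String>> index = MiniMapReduce.run(pages, mapFn, reduceFn);
            System.out.println(index.get("be")); // [http://example.com/a, http://example.com/b]
        }
    }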
Ex: PageRank Computation
Google's algorithm for ranking pages' relevance.
In the Map step, each input record is (Key=URL with its current rank, Value=links on page), and Map emits one (Key'=link on page, Value'=rank contribution) pair per outgoing link. In the Reduce step, the contributions arriving at each URL are summed (+) to give that page's new rank, while its list of links is carried along for the next round. Repeat the entire process (i.e., feed the Reduce results back into Map) until the page ranks stabilize, meaning the sum of changes to the ranks drops below some threshold.
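The slides leave the exact update to the diagram, so here is a hypothetical, simplified Java sketch of one iteration (it ignores the damping factor and dangling pages, and keeps the whole graph in memory rather than carrying the link lists through the shuffle):

    import java.util.*;

    // One simplified PageRank iteration expressed as a map step plus a reduce step.
    // "graph" maps each URL to the URLs it links to; "ranks" holds every URL's current rank.
    public class PageRankIteration {
        public static Map<String, Double> iterate(Map<String, List<String>> graph,
                                                  Map<String, Double> ranks) {
            // Map step: every page distributes its current rank evenly over its out-links.
            Map<String, List<Double>> contributions = new HashMap<>();
            for (Map.Entry<String, List<String>> page : graph.entrySet()) {
                double share = ranks.get(page.getKey()) / page.getValue().size();
                for (String link : page.getValue()) {
                    contributions.computeIfAbsent(link, k -> new ArrayList<>()).add(share);
                }
            }

            // Reduce step: each page's new rank is the sum of the contributions it received.
            Map<String, Double> newRanks = new HashMap<>();
            for (String url : graph.keySet()) {
                double sum = contributions.getOrDefault(url, List.of())
                                          .stream().mapToDouble(Double::doubleValue).sum();
                newRanks.put(url, sum);
            }
            return newRanks;
        }
        // The caller repeats iterate(...) until the total change in ranks falls
        // below some threshold, as the slide describes.
    }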
Advanced MapReduce Ideas
How is this implemented? One master, many workers (see the sketch below):
- Split the input data into tasks of fixed size; the reduce phase will also be partitioned into tasks
- Dynamically assign tasks to workers during each step: tasks are handed out as needed & placed on an in-process list
- Once a worker completes a task, save the result & retire the task
- If a task does not complete in time, assume the worker crashed and move the incomplete task back into the pool for reassignment
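A hypothetical sketch of the master's bookkeeping described above; the class, method names, and timeout policy are invented for illustration, and a real master also tracks which worker holds each task and where its output lives.

    import java.time.Duration;
    import java.time.Instant;
    import java.util.*;

    // Master-side work queue: a pool of pending tasks, an in-process list with start
    // times, and reassignment of tasks whose workers appear to have crashed.
    public class MasterWorkQueue {
        private final Deque<String> pending = new ArrayDeque<>();        // tasks not yet assigned
        private final Map<String, Instant> inProcess = new HashMap<>();  // task -> time assigned
        private final Map<String, String> completed = new HashMap<>();   // task -> saved result
        private final Duration timeout;

        public MasterWorkQueue(Collection<String> tasks, Duration timeout) {
            pending.addAll(tasks);
            this.timeout = timeout;
        }

        // A worker asks for work: hand out the next pending task and note when it started.
        public synchronized Optional<String> assignTask() {
            String task = pending.pollFirst();
            if (task != null) {
                inProcess.put(task, Instant.now());
            }
            return Optional.ofNullable(task);
        }

        // A worker reports success: save the result & retire the task.
        public synchronized void completeTask(String task, String result) {
            if (inProcess.remove(task) != null) {
                completed.put(task, result);
            }
        }

        // Called periodically by the master: any task in process longer than the timeout
        // is assumed lost (worker crashed) and moved back into the pool for reassignment.
        public synchronized void reassignStragglers() {
            Instant now = Instant.now();
            inProcess.entrySet().removeIf(entry -> {
                if (Duration.between(entry.getValue(), now).compareTo(timeout) > 0) {
                    pending.addLast(entry.getKey());
                    return true;
                }
                return false;
            });
        }

        public synchronized boolean done() {
            return pending.isEmpty() && inProcess.isEmpty();
        }
    }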