Published by Egbert Thornton. Modified over 9 years ago.
1
MapReduce Kristof Bamps Wouter Deroey
2
Outline Problem overview MapReduce o overview o implementation o refinements o conclusion
3
Problem overview Conceptually straightforward computations o e.g. find the most frequent search queries Large amount of data o billions of web pages/search queries Too much data for 1 computer to handle
4
Problem overview Typical solution: distribute the work over hundreds of machines Downsides: the programmer has to deal with o communication o recovery from machine failure o optimization o locality All of this has to be rewritten for each program
5
MapReduce Software framework developed and patented by Google to support distributed computing on large datasets on clusters of computers Features o parallelization o load balancing o recovery from machine failure o locality
6
Programming model Input: set of key/value pairs Output: set of key/value pairs Programmer specifies 2 functions: map and reduce Map: o takes an input pair o produces a set of intermediate key/value pairs MapReduce library groups together all intermediate values with the same key Reduce: o takes an intermediate key and the values for that key as input o merges the values to produce a possibly smaller set of values
7
MapReduce: Example (word count)
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
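The word-count pseudocode can be tried out on a single machine. The following is a minimal Python sketch, not the Google library: the names `map_fn`, `reduce_fn` and `run_mapreduce` are made up for illustration, and the in-memory dictionary stands in for the grouping step that the real library performs across machines.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(doc_name: str, contents: str) -> Iterator[tuple[str, int]]:
    """Emit an intermediate (word, 1) pair for every word in the document."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Merge all counts emitted for one word into a single total."""
    return sum(counts)


def run_mapreduce(inputs: dict[str, str]) -> dict[str, int]:
    """Sequential stand-in for the library: map, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for name, contents in inputs.items():
        for key, value in map_fn(name, contents):
            groups[key].append(value)  # group intermediate values by key
    return {key: reduce_fn(key, values) for key, values in groups.items()}
```

Calling `run_mapreduce({"doc1": "the cat the", "doc2": "cat"})` groups the four intermediate pairs under two keys before reducing them.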
8
MapReduce: Example Example uses at Google: o distributed grep o distributed sort o web access log stats o large scale graph computations o language model processing o many more
9
Implementation: execution
10
Master data structures For each map and reduce task: stores its state (idle, in-progress or completed) and the identity of the worker machine Stores locations and sizes of the intermediate file regions produced by completed map tasks Pushes this location information to workers with in-progress reduce tasks
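The master's bookkeeping could be modelled roughly as follows. This is a sketch under assumptions: the class and field names (`TaskInfo`, `MasterState`, `complete_map`) are hypothetical, and the three task states follow the description in the original MapReduce paper.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class TaskInfo:
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None  # identity of the worker machine, if assigned


@dataclass
class MasterState:
    map_tasks: dict[int, TaskInfo] = field(default_factory=dict)
    reduce_tasks: dict[int, TaskInfo] = field(default_factory=dict)
    # map task id -> (location, size) of each intermediate file region
    map_outputs: dict[int, list[tuple[str, int]]] = field(default_factory=dict)

    def complete_map(self, task_id: int, regions: list[tuple[str, int]]) -> list[int]:
        """Record a finished map task and return the reduce tasks to notify."""
        self.map_tasks[task_id].state = TaskState.COMPLETED
        self.map_outputs[task_id] = regions
        return [
            rid
            for rid, info in self.reduce_tasks.items()
            if info.state is TaskState.IN_PROGRESS
        ]
```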
11
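Fault tolerance Handling worker failures o master pings every worker periodically o in-progress tasks on a failed worker are rescheduled o completed map tasks on a failed worker are also rescheduled, because their output is stored on that worker's local disk o reduce tasks are notified of the re-execution Handling master failures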
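The rescheduling rule for worker failures can be sketched in a few lines. The `Task` record and `handle_worker_failure` function are illustrative names, not part of any real API. The sketch assumes, as in the original MapReduce paper, that completed reduce tasks are not re-run because their output already lives in the global file system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    kind: str                     # "map" or "reduce"
    state: str = "idle"           # "idle", "in-progress" or "completed"
    worker: Optional[str] = None  # machine the task was assigned to


def handle_worker_failure(tasks: list[Task], failed_worker: str) -> list[Task]:
    """Reset every task whose work was lost with the failed worker."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        # Completed map output sits on the failed machine's local disk.
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in-progress" or lost_map_output:
            task.state = "idle"
            task.worker = None
            rescheduled.append(task)
    return rescheduled
```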
12
Semantics in the presence of failures If the map and reduce functions are deterministic functions of their input values, MapReduce produces the same output as a non-faulting sequential execution Relies on atomic commits of map and reduce task outputs Each task writes its output to private temporary files Map task completed: sends a message to the master, which stores the filenames Reduce task completed: atomically renames its temporary file to the final filename
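The write-to-temporary-then-rename pattern for reduce output can be illustrated on a local file system. This sketch uses Python's `os.replace`, which renames atomically when source and destination are on the same file system. The function name `commit_reduce_output` is hypothetical, and a real deployment would rely on the rename operation of the distributed file system instead.

```python
import os
import tempfile


def commit_reduce_output(final_path: str, lines: list[str]) -> None:
    """Write output to a private temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.writelines(line + "\n" for line in lines)
        # Readers see either no final file or the complete one, never a partial one.
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

If the same reduce task runs twice (for example a primary and a re-execution), the second rename simply replaces the first file with identical contents, provided the reduce function is deterministic.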
13
Locality Made possible by using a distributed file system (e.g. GFS, HDFS,...) Master uses the location information of the input data to decide which task to schedule on which machine Greatly reduces network traffic
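A locality-aware scheduling decision might look like the sketch below. The function `pick_worker` and its arguments are illustrative assumptions: the master is assumed to know which hosts hold a replica of the input split and which workers are idle.

```python
from typing import Optional


def pick_worker(replica_hosts: set[str], idle_workers: list[str]) -> Optional[str]:
    """Prefer an idle worker that already stores a replica of the input split."""
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker  # input can be read from local disk
    # No local candidate: fall back to any idle worker (data crosses the network).
    return idle_workers[0] if idle_workers else None
```

A fuller scheduler could add an intermediate preference for workers on the same network switch as a replica, which this sketch leaves out.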
14
Backup tasks Possibility of stragglers o tasks that take an unusually long time to complete o can be caused by a bad hard disk, competition for bandwidth,... Solution: schedule backup tasks o backup executions of the remaining in-progress tasks o a task is completed when either the primary or the backup execution finishes
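The "first finisher wins" rule for backup tasks can be simulated with two threads. This is only a sketch: `run_with_backup` is a made-up helper, and the delays stand in for a slow and a healthy machine.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_with_backup(task, primary_delay: float, backup_delay: float):
    """Run a primary and a backup execution; return whichever finishes first."""

    def attempt(name: str, delay: float):
        time.sleep(delay)  # simulated slowness of the machine
        return name, task()

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(attempt, "primary", primary_delay),
            pool.submit(attempt, "backup", backup_delay),
        ]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # The task counts as completed as soon as one execution finishes.
        return next(iter(done)).result()
```

In this simulation the slower execution still runs to completion before the pool shuts down; its result is simply ignored.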
15
Refinements: partitioning function Users specify the number of reduce tasks (R) Data gets partitioned across these tasks using the intermediate key (default: hash(key) mod R) Possible to specify a custom partitioning function Example: the keys are URLs and we want all entries for 1 host in a single output file o e.g. partitioning function: hash(Hostname(urlkey)) mod R
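Both partitioning functions are easy to write down. The sketch below uses `zlib.crc32` as the hash, which is an arbitrary choice made here so that the result is stable across processes. The names `default_partition` and `host_partition` are illustrative.

```python
import zlib
from urllib.parse import urlparse


def default_partition(key: str, num_reducers: int) -> int:
    """hash(key) mod R, using a hash that is stable between runs."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def host_partition(url_key: str, num_reducers: int) -> int:
    """hash(Hostname(urlkey)) mod R: all URLs of one host go to one reducer."""
    hostname = urlparse(url_key).hostname or ""
    return default_partition(hostname, num_reducers)
```

With `host_partition`, `http://example.com/a` and `http://example.com/b?x=1` always map to the same reduce task, whereas `default_partition` would typically spread them.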
16
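Refinements: combiner function Significant repetition in the intermediate keys possible o e.g. WordCount: many ("the", 1) pairs Optional "combine" function: o partial merging before the data is sent over the network o typically same code as the reduce function o executed on the machine that does the map task o difference with reduce: its output is written to an intermediate file instead of the final output file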
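For word count, a combiner collapses repeated pairs on the map machine before anything is sent to the reducers. A small sketch, with `map_words` and `combine` as illustrative names:

```python
from collections import Counter


def map_words(contents: str) -> list[tuple[str, int]]:
    """Map step: one (word, 1) pair per occurrence."""
    return [(word, 1) for word in contents.split()]


def combine(pairs: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Partially merge the pairs produced by one map task."""
    partial = Counter()
    for word, count in pairs:
        partial[word] += count
    return sorted(partial.items())
```

Here `combine(map_words("the cat the the"))` shrinks four intermediate pairs to two, and the reduce function can still sum the partial counts unchanged.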
17
Refinements: input and output types MapReduce supports multiple input formats o e.g. "text" mode: treats each line as a key/value pair o each format knows how to split itself into ranges for separate map tasks Reader interface: o allows specification of a custom input type o input does not have to be text; users can implement a reader that reads from a database or another source Output types: similar to input types
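A "text" mode reader could be sketched as a generator of key/value pairs. The assumption here, taken from the original MapReduce paper rather than from the slide, is that the key is the byte offset of the line and the value is its contents. `text_reader` is a hypothetical name.

```python
import io
from typing import BinaryIO, Iterator


def text_reader(stream: BinaryIO) -> Iterator[tuple[int, str]]:
    """Yield (byte offset, line contents) for every line of the input."""
    offset = 0
    for raw in stream:
        yield offset, raw.decode("utf-8").rstrip("\n")
        offset += len(raw)  # advance by the raw length, newline included
```

Any other source, such as rows fetched from a database, can be plugged in the same way as long as it yields key/value pairs.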
18
Refinements: skipping bad records Sometimes there are bugs that cause a Map or Reduce task to crash on certain records o usually fixed by debugging, though this is not always feasible (e.g. the bug is in a third-party library) On crash: o worker sends a message to the master o message includes the sequence number of the argument (the record being processed) When the master sees more than 1 failure on the same record: o indicates that this record can be skipped when issuing the next re-execution
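The skip protocol can be simulated in one process. In this sketch, `run_task` plays the worker and returns the sequence number of the record it crashed on, while `master_loop` plays the master and counts failures per record. Both names, and the `max_attempts` limit, are assumptions made for illustration.

```python
from collections import Counter


def run_task(records, process, skip):
    """Worker: process records in order; on a crash report the sequence number."""
    results = []
    for seq, record in enumerate(records):
        if seq in skip:
            continue  # the master marked this record as skippable
        try:
            results.append(process(record))
        except Exception:
            return None, seq  # message to the master
    return results, None


def master_loop(records, process, max_attempts=10):
    """Master: re-execute the task, skipping records that failed more than once."""
    failures = Counter()
    skip = set()
    for _ in range(max_attempts):
        results, failed_seq = run_task(records, process, skip)
        if failed_seq is None:
            return results, skip
        failures[failed_seq] += 1
        if failures[failed_seq] > 1:
            skip.add(failed_seq)
    raise RuntimeError("task still failing after max_attempts executions")
```

With the records `["1", "x", "3"]` and `int` as the processing function, the second record fails twice and is then skipped on the third execution.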
19
Other refinements Local execution: o distributed MapReduce applications are difficult to debug o alternative implementation executes all the work sequentially on 1 machine User-defined counters
20
Conclusion MapReduce simplifies distributed large-scale computations Allows programmers to focus on the problem without worrying about details