MapReduce Kristof Bamps Wouter Deroey. Outline Problem overview MapReduce o overview o implementation o refinements o conclusion.

MapReduce Kristof Bamps Wouter Deroey

Outline
Problem overview
MapReduce
  o overview
  o implementation
  o refinements
  o conclusion

Problem overview
Conceptually straightforward computations
  o e.g. find the most frequent search queries
Large amount of data
  o billions of webpages/search queries
Too much data for 1 computer to handle

Problem overview
Typical solution: distribute the work over hundreds of machines
Downsides:
  o communication
  o recovering from machine failure
  o optimization
  o locality
This infrastructure has to be rewritten for each program

MapReduce
Software framework patented by Google to support distributed computing on large datasets on computer clusters
Features
  o parallelization
  o load balancing
  o recovery from machine failure
  o locality

Programming model
Input: set of key/value pairs
Output: set of key/value pairs
Programmer specifies 2 functions: map and reduce
Map:
  o takes an input pair
  o produces a set of intermediate key/value pairs
MapReduce library groups together all intermediate pairs with the same key
Reduce:
  o takes an intermediate key and the values for that key as input
  o merges the values to produce a possibly smaller set of values

MapReduce: Example (word counter)
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
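The word counter above can be tried out on one machine with a small Python sketch. This is only an illustration of the programming model, not Google's library: `run_mapreduce` is a hypothetical single-process driver that stands in for the grouping step, and `map_fn` / `reduce_fn` mirror the pseudocode.

```python
from collections import defaultdict

def map_fn(key, value):
    # key: document name, value: document contents
    for word in value.split():
        yield word, "1"

def reduce_fn(key, values):
    # key: a word, values: the counts emitted for that word
    return str(sum(int(v) for v in values))

def run_mapreduce(inputs):
    # Hypothetical driver: runs every map call, groups the intermediate
    # pairs by key, then runs reduce once per key.
    groups = defaultdict(list)
    for name, contents in inputs.items():
        for k, v in map_fn(name, contents):
            groups[k].append(v)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}

counts = run_mapreduce({"a.txt": "the cat sat", "b.txt": "the dog"})
print(counts)
```

In a real cluster the map calls, the grouping and the reduce calls run on different machines; the sketch only shows the data flow.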

MapReduce: Example
Example uses at Google:
  o distributed grep
  o distributed sort
  o web access log stats
  o large scale graph computations
  o language model processing
  o many more

Implementation: execution

Master data structures
For each map or reduce task: stores its state and the identity of the worker machine running it
For each completed map task: stores the locations and sizes of the intermediate file regions it produced
Pushes this information to workers with in-progress reduce tasks

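Fault tolerance
Handling worker failures
  o master pings every worker periodically
  o tasks in progress on a failed worker are rescheduled
  o completed map tasks on a failed worker are also rescheduled, because their output is stored on that machine's local disk
  o reduce tasks are notified of the re-execution
Handling master failures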
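A minimal sketch of the worker-failure rules, assuming the master keeps one record per task. The `Task` class, its field names and `handle_worker_failure` are hypothetical and chosen for illustration only. Completed reduce tasks are left alone here on the assumption that their output already sits in the global file system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: str = "idle"            # "idle", "in-progress" or "completed"
    worker: Optional[str] = None   # identity of the worker machine, if any

def handle_worker_failure(tasks, failed_worker):
    """Reset every task that has to be re-executed after a worker dies."""
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map_output = task.kind == "map" and task.state == "completed"
        if task.state == "in-progress" or lost_map_output:
            task.state = "idle"    # eligible for rescheduling
            task.worker = None

tasks = [
    Task("map", "completed", "w1"),
    Task("map", "in-progress", "w2"),
    Task("reduce", "in-progress", "w1"),
    Task("reduce", "completed", "w1"),
]
handle_worker_failure(tasks, "w1")
print([(t.kind, t.state) for t in tasks])
```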

Semantics in the presence of failures
If the map and reduce functions are deterministic functions of their input values, MapReduce produces the same output as a non-faulting sequential execution
Relies on atomic commits of map and reduce task outputs
Each task writes its output to private temporary files
Map task completed: sends a message to the master, which stores the filenames
Reduce task completed: atomically renames its temporary file to the final filename
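The "private temporary file, then rename" commit can be sketched on a local file system as follows. This is an assumption-laden illustration: `commit_output` is a hypothetical helper, and it relies on `os.replace`, which is atomic on POSIX systems when source and destination are on the same file system. A distributed file system would provide its own rename operation.

```python
import os
import tempfile

def commit_output(final_path, data):
    # Write to a private temporary file in the same directory, then
    # rename it to the final name in one atomic step.
    directory = os.path.dirname(os.path.abspath(final_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(data)
        os.replace(tmp_path, final_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

with tempfile.TemporaryDirectory() as d:
    out = os.path.join(d, "part-00000")
    commit_output(out, "the\t2\n")
    # A re-executed deterministic task writes the same data again.
    commit_output(out, "the\t2\n")
    with open(out) as f:
        result = f.read()
    leftovers = [n for n in os.listdir(d) if n.startswith(".tmp-")]
```

Readers of the final file see either the old complete version or the new complete version, never a half-written file.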

Locality
Made possible by using a distributed file system (e.g. GFS, HDFS, ...)
Master uses the location information of the input data to decide where to schedule each task
Greatly reduces network traffic

Backup tasks
Possibility of stragglers
  o tasks that take an unusually long time to complete
  o can be caused by a bad hard disk, competition for bandwidth, ...
Solution: schedule backup tasks
  o backup executions of the remaining in-progress tasks
  o a task is completed when either the primary or the backup execution finishes
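The "first finisher wins" rule can be illustrated with threads standing in for worker machines. This is a toy sketch: `run_task` and `run_with_backup` are hypothetical names, and the delays simulate a slow primary and a healthy backup.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def run_task(worker, delay):
    # Stand-in for real work: sleep, then report which execution finished.
    time.sleep(delay)
    return worker

def run_with_backup(primary_delay, backup_delay):
    pool = ThreadPoolExecutor(max_workers=2)
    futures = [
        pool.submit(run_task, "primary", primary_delay),
        pool.submit(run_task, "backup", backup_delay),
    ]
    # The task counts as completed as soon as either execution finishes.
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)  # do not wait for the straggler
    return next(iter(done)).result()

winner = run_with_backup(primary_delay=0.5, backup_delay=0.05)
print(winner)
```

Because map and reduce are assumed to be deterministic, it does not matter which of the two executions supplies the output.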

Refinements: partitioning function
Users specify the number of reduce tasks (R)
Data gets partitioned across these tasks using the intermediate key (e.g. hash(key) mod R)
Possible to specify a custom partitioning function
Example: input keys are URLs, and we want all entries for 1 host in a single output file
  o e.g. partitioning function: hash(Hostname(urlkey)) mod R
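Both partitioning functions can be sketched in a few lines of Python. The function names are hypothetical, and `zlib.crc32` is used as the hash only because Python's built-in `hash()` for strings changes between processes, which would send the same key to different partitions on different workers.

```python
import zlib
from urllib.parse import urlparse

def default_partition(key, R):
    # hash(key) mod R, with a hash that is stable across processes
    return zlib.crc32(key.encode("utf-8")) % R

def host_partition(url_key, R):
    # hash(Hostname(urlkey)) mod R: all URLs of one host share a partition
    host = urlparse(url_key).hostname or ""
    return default_partition(host, R)

R = 8
p1 = host_partition("http://example.com/index.html", R)
p2 = host_partition("http://example.com/about/team.html", R)
print(p1, p2)
```

The two URLs end up in the same partition because only the hostname is hashed.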

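Refinements: combiner function
Significant repetition in the intermediate keys is possible
  o e.g. WordCount: many ("the", 1) pairs
Optional "combine" function:
  o partial merging before the data is sent over the network
  o typically the same code as the reduce function
  o executed on the machine that performs the map task
  o difference with reduce: where the output goes (combiner output is written to an intermediate file, reduce output to the final output file)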
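The effect of a combiner on the word count example can be seen in a small sketch. `map_words` and `combine` are hypothetical helpers; the combiner does the same summing as the reduce function, but only over the output of one map task.

```python
from collections import Counter

def map_words(text):
    # One intermediate pair per word occurrence
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Partial merge on the map machine: sum the counts per word
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())

raw = map_words("the cat and the dog and the bird")
combined = combine(raw)
print(len(raw), "pairs before,", len(combined), "pairs after")
```

Fewer pairs have to cross the network, while the final counts produced by reduce stay the same.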

Refinements: input and output types
MapReduce supports multiple input formats
  o e.g. "text" mode: treats each line as a key/value pair
  o each format knows how to split itself into ranges for separate map tasks
Reader interface:
  o allows specification of a custom input type
  o input does not have to be text: users can implement a reader that takes records from a database or another source
Output types: similar to input types
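A reader for "text" mode might look like the sketch below. `text_reader` is a hypothetical name, and the choice of key (the offset of the line, counted in characters here) is an assumption based on how text mode is usually described for MapReduce.

```python
import io

def text_reader(stream):
    # Yields one (key, value) record per line:
    # key = offset of the line in the stream, value = line contents
    offset = 0
    for line in stream:
        yield offset, line.rstrip("\n")
        offset += len(line)

records = list(text_reader(io.StringIO("first line\nsecond\n")))
print(records)
```

A custom reader only has to yield key/value records in the same way, whatever the underlying source is.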

Refinements: skipping bad records
Sometimes bugs cause a Map or Reduce task to crash on certain records
  o usually fixed by debugging, though that is not always feasible
On crash:
  o worker sends a message to the master
  o message includes the sequence number of the record being processed
When the master sees more than 1 failure on the same record:
  o it indicates that this record can be skipped when issuing the next re-execution
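The skipping protocol can be simulated in a single process. In this sketch `run_with_skipping` is a hypothetical function that plays both roles: the `failures` counter is the master's bookkeeping, and each pass through the loop is one (re-)execution of the task.

```python
from collections import Counter

def run_with_skipping(records, map_fn, max_failures=1):
    failures = Counter()  # master side: sequence number -> crash count
    while True:
        # Records with more than max_failures crashes are skipped.
        skip = {seq for seq, n in failures.items() if n > max_failures}
        output = []
        try:
            for seq, record in enumerate(records):
                if seq in skip:
                    continue
                current = seq  # sequence number reported on a crash
                output.extend(map_fn(record))
            return output, sorted(skip)
        except Exception:
            failures[current] += 1  # "message to master", then re-execute

def times_ten(record):
    return [int(record) * 10]  # crashes on non-numeric records

output, skipped = run_with_skipping(["1", "2", "x", "4"], times_ten)
print(output, skipped)
```

The bad record crashes the task twice, after which the re-execution skips it and finishes.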

Other refinements
Local execution:
  o regular (distributed) MapReduce applications are difficult to debug
  o alternative implementation runs all the work sequentially on 1 machine
User-defined counters

Conclusion
MapReduce simplifies large-scale distributed computations
Allows programmers to focus on the problem without worrying about details such as parallelization, fault tolerance and locality
