Google's MapReduce
Connor Poske, Florida State University
Outline
Part I:
–History
–MapReduce architecture and features
–How it works
Part II:
–MapReduce programming model and example
Initial History
There is a demand for large-scale data processing.
The folks at Google discovered certain common themes when processing very large inputs:
–Multiple machines are needed
–There are usually 2 basic operations on the input data: 1) Map 2) Reduce
Map
Similar to the Lisp primitive: apply a single function to multiple inputs.
In the MapReduce model, the map function applies an operation to each pair of the form (input_key, input_value) and produces a set of intermediate key/value pairs:
Map(input_key, input_value) -> (output_key, intermediate_value) list
Reduce
Accepts the set of intermediate key/value pairs as input.
Applies a reduce operation to all values that share the same key:
Reduce(output_key, intermediate_value list) -> output list
Quick example
Pseudo-code that counts the number of occurrences of each word in a large collection of documents:

Map(String fileName, String fileContents)
  // fileName is the input key, fileContents is the input value
  for each word w in fileContents
    EmitIntermediate(w, "1")

Reduce(String word, Iterator values)
  // word: the input key; values: a list of counts
  int count = 0
  for each v in values
    count += ParseInt(v)
  Emit(AsString(count))
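The pseudo-code above can be made concrete as a minimal in-memory sketch: one process plays both roles, the driver groups intermediate pairs by key, and the function names (`Map`, `Reduce`, `WordCount`) are illustrative, not Google's actual library API.

```cpp
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using KV = std::pair<std::string, int>;

// Map: emit (word, 1) for every word in the document contents.
std::vector<KV> Map(const std::string& contents) {
  std::vector<KV> out;
  std::istringstream in(contents);
  std::string word;
  while (in >> word) out.push_back({word, 1});
  return out;
}

// Reduce: sum all counts emitted for a single word.
int Reduce(const std::vector<int>& values) {
  int count = 0;
  for (int v : values) count += v;
  return count;
}

// Driver: group the intermediate pairs by key, then reduce each group.
std::map<std::string, int> WordCount(const std::string& doc) {
  std::map<std::string, std::vector<int>> groups;
  for (const KV& kv : Map(doc)) groups[kv.first].push_back(kv.second);
  std::map<std::string, int> result;
  for (const auto& g : groups) result[g.first] = Reduce(g.second);
  return result;
}
```

The grouping step in the driver stands in for the shuffle that the real system performs across machines.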
The idea sounds good, but…
We can't forget the problems that arise from large-scale, multi-machine data processing:
–How do we parallelize everything?
–How do we balance the input load?
–How do we handle failures?
Enter the MapReduce model…
MapReduce
The MapReduce implementation is an abstraction that hides these complexities from the programmer.
The user defines the Map and Reduce functions.
The MapReduce implementation automatically distributes the data, then applies the user-defined functions to it.
The actual code is slightly more complex than the previous example.
MapReduce Architecture
User program with Map and Reduce functions.
Cluster of commodity PCs.
Upon execution, the cluster is divided into:
–A master worker
–Map workers
–Reduce workers
Execution Overview
1) Split up the input data and start the program on all machines.
2) The master assigns M map tasks and R reduce tasks to idle worker machines.
3) Map workers execute the Map function and buffer the results locally.
4) Periodically, the buffered data is written to local disk; the on-disk locations of the data are forwarded to the master. --Map phase complete--
5) Reduce workers use RPCs to read the intermediate data from the map machines, then sort it by key.
6) Each reduce worker iterates over the data and passes each unique key, along with its associated values, to the Reduce function.
7) The master wakes up the user program, and the MapReduce call returns.
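Steps 4 and 5 rely on partitioning: map output is split into R buckets by hashing the key, so every value for a given key lands at the same reduce task. A single-process sketch of that shuffle (R, `Partition`, and `Shuffle` are illustrative names, not the library's API):

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

const int R = 3;  // number of reduce tasks (illustrative)

// Hash-partition a key into one of R reduce buckets.
int Partition(const std::string& key) {
  return static_cast<int>(std::hash<std::string>{}(key) % R);
}

// Group intermediate (key, value) pairs into per-reduce-task buckets;
// within a bucket, values are grouped by key (the "sort by key" step).
std::vector<std::map<std::string, std::vector<int>>>
Shuffle(const std::vector<std::pair<std::string, int>>& intermediate) {
  std::vector<std::map<std::string, std::vector<int>>> buckets(R);
  for (const auto& kv : intermediate)
    buckets[Partition(kv.first)][kv.first].push_back(kv.second);
  return buckets;
}
```

Because the bucket is a pure function of the key, all occurrences of a key are guaranteed to reach one reduce task, which is what makes step 6 correct.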
Master worker
Stores state information about map and reduce tasks:
–Idle, in-progress, or completed
Stores the locations and sizes on disk of intermediate file regions on the map machines:
–Pushes this information incrementally to workers with in-progress reduce tasks
Displays the status of the entire operation via HTTP:
–Runs an internal HTTP server
–Displays progress, e.g. bytes of intermediate data, bytes of output, processing rates, etc.
Parallelization
Map() runs in parallel, creating different intermediate output from different input keys and values.
Reduce() runs in parallel, each worker handling a different key.
All data is processed independently by different worker machines.
The Reduce phase cannot begin until the Map phase is completely finished!
Load Balancing
The user defines a MapReduce "spec" object:
–MapReduceSpecification spec;
–spec.set_machines(2000);
–spec.set_map_megabytes(100);
–spec.set_reduce_megabytes(100);
That's it! The library automatically takes care of the rest.
Fault Tolerance
The master pings workers periodically:

switch (ping response)
  case idle:        assign a task if possible
  case in-progress: do nothing
  case completed:   reset to idle
  case no response: reassign the task
Fault Tolerance
What if a map task completes, but the machine fails before the intermediate data is retrieved via RPC?
–Re-execute the map task on an idle machine.
What if the intermediate data is partially read, but the machine fails before all reduce operations can complete?
What if the master fails…? PWNED — there is only one master, so the whole computation is aborted.
Fault Tolerance
Skipping bad records:
–An optional parameter changes the mode of execution.
–When enabled, the MapReduce library detects records that cause crashes and skips them.
Bottom line: MapReduce is very robust in its ability to recover from failure and handle errors.
Part II: Programming Model
The MapReduce library is extremely easy to use.
Using it involves setting up only a few parameters and defining the map() and reduce() functions:
–Define map() and reduce()
–Define and set parameters for the MapReduceInput object
–Define and set parameters for the MapReduceOutput object
–Write the main program
Map()

class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    // Parse each word and, for each word,
    // Emit(word, "1")
  }
};
REGISTER_MAPPER(WordCounter);
Reduce()

class Adder : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the same key
    // and add the values
  }
};
REGISTER_REDUCER(Adder);
Main()

int main(int argc, char** argv) {
  MapReduceSpecification spec;
  MapReduceInput* input;

  // Store the list of input files in "spec"
  // (start at 1 to skip the program name in argv[0])
  for (int i = 1; i < argc; ++i) {
    input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
Main()

  // Specify the output files
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  // Output file names:
  //   freq-00000-of-00100
  //   freq-00001-of-00100
  out->set_format("text");
  out->set_reducer_class("Adder");
Main()

  // Tuning parameters, then the actual MapReduce call
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  return 0;
}  // end main
Other possible uses
Distributed grep:
–Map emits a line if it matches a supplied pattern.
–Reduce simply copies the intermediate data to the output.
Count of URL access frequency:
–Map processes logs of web-page requests and emits (URL, 1).
–Reduce adds all values for each URL and emits (URL, count).
Inverted index:
–Map parses each document and emits a sequence of (word, document ID) pairs.
–Reduce accepts all pairs for a given word, sorts the list by document ID, and emits (word, list(document ID)).
Many more…
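The distributed-grep case shows how little the map function needs to do. A sketch of such a mapper's core logic (the name `GrepMap` and the substring match are assumptions; a real deployment would run this per input split, with Reduce as the identity):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Map for distributed grep: emit only the lines of one input split
// that contain the pattern. Reduce would just copy these to the output.
std::vector<std::string> GrepMap(const std::string& contents,
                                 const std::string& pattern) {
  std::vector<std::string> matches;
  std::istringstream in(contents);
  std::string line;
  while (std::getline(in, line))
    if (line.find(pattern) != std::string::npos) matches.push_back(line);
  return matches;
}
```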
Conclusion
MapReduce provides an easy-to-use, clean abstraction for large-scale data processing.
It is very robust in fault tolerance and error handling.
It can be used in many scenarios.
Restricting the programming model to the Map and Reduce paradigms makes it easy to parallelize computations and make them fault-tolerant.