1
MapReduce: Simplified Data Processing on Large Clusters
Google, Inc.
Presented by Prasad Raghavendra
2
Introduction
- A model for processing large data sets.
- Built around two user-supplied functions, Map and Reduce.
- Runs on a large cluster of machines.
- Many MapReduce programs are executed on Google's clusters every day.
3
Motivation
Very large data sets need to be processed.
- The whole Web, billions of pages
Lots of machines are available.
- Use them efficiently.
4
Processing of Large Data Sets For example: - Counting access frequency to URLs: Input: list(RequestURL) Output: list(RequestURL, total_number) - Distributed Grep - Distributed Sort
5
Programming model
Input & Output: each a set of key/value pairs
Programmer specifies two functions:
map (in_key, in_value) -> list(out_key, intermediate_value)
- Name comes from the map function in LISP
  Ex. (map 'list #'+ '(1 2 3) '(1 2 3)) => (2 4 6)
- Processes an input key/value pair
- Produces a set of intermediate pairs
map(document, content) {
  for each word in content
    emit(word, "1")
}
6
reduce (out_key, list(intermediate_value)) -> list(out_value)
- Name comes from the reduce function in LISP
  Ex. (reduce #'+ '(1 2 3 4 5)) => 15
- Combines all intermediate values for a particular key
- Produces a set of merged output values (usually just one)
reduce(word, values) {
  result = 0
  for each value in values
    result += parseInt(value)
  emitString(word, result)
}
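For readers less familiar with LISP, the two LISP examples above have close analogues in Python's built-in `map` and `functools.reduce`. This is only an illustration of where the names come from; it is not part of the MapReduce library itself.

```python
from functools import reduce
from operator import add

# Analogue of (map 'list #'+ '(1 2 3) '(1 2 3)): add two lists element-wise.
pairwise_sums = list(map(add, [1, 2, 3], [1, 2, 3]))

# Analogue of (reduce #'+ '(1 2 3 4 5)): fold the list into a single sum.
total = reduce(add, [1, 2, 3, 4, 5])

print(pairwise_sums)  # [2, 4, 6]
print(total)          # 15
```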
7
Example
The problem: count the number of occurrences of each word in a large collection of documents.
Page 1: the weather is good
Page 2: today is good
Page 3: good weather is good
8
Map output Worker 1: (the 1), (weather 1), (is 1), (good 1). Worker 2: (today 1), (is 1), (good 1). Worker 3: (good 1), (weather 1), (is 1), (good 1).
9
Reduce Input Worker 1: (the 1) Worker 2: (is 1), (is 1), (is 1) Worker 3: (weather 1), (weather 1) Worker 4: (today 1) Worker 5: (good 1),(good 1), (good 1), (good 1)
10
Reduce Output Worker 1: (the 1) Worker 2: (is 3) Worker 3: (weather 2) Worker 4: (today 1) Worker 5: (good 4)
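The three-page example can be replayed in a single process. The sketch below is a minimal in-memory simulation, assuming whitespace-separated words and integer counts; the names `map_fn`, `reduce_fn`, and `word_count` are illustrative and do not come from the slides or the MapReduce library.

```python
from collections import defaultdict

def map_fn(document, content):
    """Emit (word, 1) for every word in the document."""
    for word in content.split():
        yield word, 1

def reduce_fn(word, values):
    """Sum all counts emitted for one word."""
    return word, sum(values)

def word_count(pages):
    # Map phase: run map_fn over every (document, content) pair.
    intermediate = []
    for document, content in pages.items():
        intermediate.extend(map_fn(document, content))

    # Group phase: collect all values that share a key.
    grouped = defaultdict(list)
    for word, count in intermediate:
        grouped[word].append(count)

    # Reduce phase: one reduce_fn call per distinct word.
    return dict(reduce_fn(word, values) for word, values in grouped.items())

pages = {
    "Page 1": "the weather is good",
    "Page 2": "today is good",
    "Page 3": "good weather is good",
}
counts = word_count(pages)
print(counts)
```

The resulting counts match the Reduce Output slide: the 1, is 3, weather 2, today 1, good 4.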
12
Example 2
13
Implementation
15
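Flow of MapReduce Operation
- The MapReduce library in the user program splits the input files into M pieces (typically 16 to 64 MB each) and starts many copies of the program on the cluster.
- One of the copies is special: the master. The rest are workers, to which the master assigns map and reduce tasks.
- A worker that is assigned a map task parses key/value pairs out of its input split and passes each pair to the Map function. The intermediate pairs are buffered in memory.
- Periodically, the buffered pairs are written to local disk, partitioned into R regions, and their locations are passed back to the master.
- When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data from the map workers' local disks.
- The output of the Reduce function is appended to a final output file for that reduce partition.
- When all map tasks and reduce tasks have been completed, the master wakes up the user program.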
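The flow above can be sketched in one process to show how M map tasks feed R reduce tasks. Everything here is an assumption made for illustration: `run_mapreduce`, the round-robin split, the defaults M=3 and R=2, and the use of `zlib.crc32` as the hash. Lists stand in for local disks and output files, and there is no master, no RPC, and no parallelism.

```python
from collections import defaultdict
from zlib import crc32

def run_mapreduce(records, map_fn, reduce_fn, M=3, R=2):
    # Split the input into M pieces (here: round-robin over the records).
    splits = [records[i::M] for i in range(M)]

    # Each map task buffers its output in R regions, standing in for the
    # partitioned data a map worker writes to its local disk.
    map_outputs = []
    for split in splits:
        regions = [[] for _ in range(R)]
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                r = crc32(out_key.encode("utf-8")) % R
                regions[r].append((out_key, out_value))
        map_outputs.append(regions)

    # Reduce task r reads region r from every map task, groups the values
    # by key, and appends one result per key (in sorted key order) to its
    # own output "file".
    output_files = []
    for r in range(R):
        grouped = defaultdict(list)
        for regions in map_outputs:
            for out_key, out_value in regions[r]:
                grouped[out_key].append(out_value)
        output_files.append([reduce_fn(k, grouped[k]) for k in sorted(grouped)])
    return output_files

def map_fn(document, content):
    return [(word, 1) for word in content.split()]

def reduce_fn(word, values):
    return (word, sum(values))

records = [
    ("Page 1", "the weather is good"),
    ("Page 2", "today is good"),
    ("Page 3", "good weather is good"),
]
files = run_mapreduce(records, map_fn, reduce_fn)
merged = dict(pair for output_file in files for pair in output_file)
print(files)
```

Each word lands in exactly one of the R output lists, so concatenating them gives the complete word count.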
16
Problem: Stragglers
Often some machines are late in their replies - slow disk, overloaded, etc.
Approach:
- when only a few tasks are left to execute, start backup tasks
- a task completes when either the primary or the backup execution completes
Performance:
- without backup tasks, the sort benchmark takes 44% longer
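The backup-task idea can be illustrated with two threads racing on the same work. This is a toy sketch, not how the MapReduce scheduler is implemented: `run_with_backup` is a hypothetical helper, and the sleep delays are made-up stand-ins for a slow and a healthy machine.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_with_backup(task, primary_delay, backup_delay):
    """Run the same task twice and return the copy that finishes first."""
    def attempt(name, delay):
        time.sleep(delay)  # simulated machine speed (hypothetical)
        return name, task()

    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [
            pool.submit(attempt, "primary", primary_delay),
            pool.submit(attempt, "backup", backup_delay),
        ]
        # The task counts as completed as soon as either copy is done.
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Note: leaving the `with` block still waits for the slower copy;
        # this sketch lets it run to completion and ignores its result.
        return next(iter(done)).result()

winner, result = run_with_backup(
    lambda: sum(range(10)), primary_delay=0.3, backup_delay=0.01
)
print(winner, result)
```

With a straggling primary, the result comes from the backup copy, and the task's answer is the same either way because both copies run the same deterministic work.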
17
Partition Function
Defines which of the R reduce tasks processes which intermediate keys (key2 is the intermediate key)
- default: hash(key2) mod R
Other partition functions are useful:
- sort: prefix of k bytes of the line
- idea: use the known or sampled distribution of key2 to spread the processed keys evenly
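The two partitioning styles can be compared side by side. This is a sketch under stated assumptions: `zlib.crc32` stands in for the unspecified hash function, and `range_partition` with its `boundaries` list is a hypothetical way to express "prefix of k bytes" partitioning, where the boundaries would come from a known or sampled key distribution.

```python
from bisect import bisect_right
from zlib import crc32

def hash_partition(key, R):
    """Default style: hash(key) mod R. Spreads keys, ignores their order."""
    return crc32(key.encode("utf-8")) % R

def range_partition(key, boundaries, k=1):
    """Sort style: compare the first k characters with sorted split points.

    With R - 1 boundaries there are R partitions, and every key in
    partition r sorts before every key in partition r + 1.
    """
    return bisect_right(boundaries, key[:k])

words = ["good", "is", "the", "today", "weather"]
boundaries = ["h", "p"]  # hypothetical split points giving R = 3

by_hash = [hash_partition(w, 3) for w in words]
by_range = [range_partition(w, boundaries) for w in words]
print(by_hash)
print(by_range)
```

Range partitioning keeps the partitions globally ordered, which is why it suits a distributed sort: concatenating the R sorted outputs yields one sorted result.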
18
Combiner Function
Problem: intermediate results can be quite verbose, e.g., ("the", 1) could occur many times in the previous example
Approach: perform a local reduction on the map worker before writing intermediate results; typically the combiner is the same function as the reduce function
This reduces run time because less data is written to disk and sent across the network
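The effect of a combiner on word count can be seen on a single page. In this sketch the function names are illustrative, and `collections.Counter` plays the role of the local reduction; it relies on addition being associative and commutative, so pre-summing on the map side does not change the final totals.

```python
from collections import Counter

def map_without_combiner(document, content):
    """Emit one (word, 1) pair per occurrence."""
    return [(word, 1) for word in content.split()]

def map_with_combiner(document, content):
    """Pre-sum counts locally, then emit one pair per distinct word."""
    local_counts = Counter(content.split())
    return list(local_counts.items())

page = "good weather is good"
plain = map_without_combiner("Page 3", page)
combined = map_with_combiner("Page 3", page)

print(len(plain), plain)
print(len(combined), combined)
```

For Page 3 the map output shrinks from four pairs to three, because the two ("good", 1) pairs are merged into ("good", 2) before anything is written or sent.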
19
Performance
- Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records): about 150 seconds.
- Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark): 891 seconds for normal execution.
20
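Fault Tolerance
Crash of worker
- in-progress tasks and completed map tasks on that worker are redone: completed map output lives on the failed machine's local disk, while completed reduce output is already in the global file system
Crash of master
- crash of master process -> restart the process from a checkpoint
- crash of master machine -> unlikely; restart (redo) the computation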
21
Conclusion
MapReduce has proven to be a useful abstraction
- Easy to use
- A large variety of problems are easily expressible as MapReduce computations
- Greatly simplifies large-scale computations at Google
22
Questions?
23
Thank You