MapReduce: Simplified Data Processing on Large Clusters
Hongwei Wang, Sihuizi Jin, Yajing Zhang
2014.10.6
Outline
- Introduction
- Programming model
- Implementation
- Refinements
- Performance
- Conclusion
1. Introduction
What is MapReduce?
- Originated at Google [OSDI'04]
- A simple programming model with a functional flavor
- Designed for large-scale data processing
- Exploits large sets of commodity computers
- Executes the computation in a distributed manner
- Offers high availability
Motivation
- Huge demand for very large-scale data processing: the computations are conceptually straightforward, but the input data is large and distributed across thousands of machines
- The issues of how to parallelize the computation, distribute the data, and handle failures obscure the original simple computation with complex code to deal with them
Distributed Grep
[Diagram: very big data → split data → grep on each split → matches → cat → all matches]
Distributed Word Count
[Diagram: very big data → split data → count on each split → merge → merged counts]
Goal
Design a new abstraction that allows us to:
- express the simple computation we are trying to perform
- hide the messy details of parallelization, fault-tolerance, data distribution, and load balancing in a library
2. Programming Model
Map + Reduce
- Map: accepts an input key/value pair, emits intermediate key/value pairs
- Reduce: accepts an intermediate key and the list of values for that key, emits output key/value pairs
[Diagram: very big data → Map → partitioning function → Reduce → result]
A Simple Example
Counting words in a large set of documents:

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
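The same computation written out as a minimal, runnable sketch in Python. This is only a single-process simulation of the model: the names wc_map, wc_reduce, and run_mapreduce are illustrative, not part of the paper's C++ library, and the in-memory grouping step stands in for the distributed shuffle.

    # Single-process sketch of the word-count example (illustrative names).
    from collections import defaultdict

    def wc_map(doc_name, contents):
        # key: document name, value: document contents
        for word in contents.split():
            yield word, "1"

    def wc_reduce(word, counts):
        # key: a word, values: a list of counts
        yield str(sum(int(c) for c in counts))

    def run_mapreduce(inputs, map_fn, reduce_fn):
        # Group all intermediate values by intermediate key (the "shuffle"),
        # then hand each key and its value list to the reduce function.
        groups = defaultdict(list)
        for key, value in inputs:
            for ikey, ivalue in map_fn(key, value):
                groups[ikey].append(ivalue)
        return {k: list(reduce_fn(k, vs)) for k, vs in sorted(groups.items())}

    docs = [("doc1", "the quick brown fox"), ("doc2", "the lazy dog the end")]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # {'brown': ['1'], 'dog': ['1'], 'end': ['1'], 'fox': ['1'],
    #  'lazy': ['1'], 'quick': ['1'], 'the': ['3']}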
More Examples
- Distributed grep: Map emits a line if it matches the supplied pattern; Reduce is the identity function
- Count of URL access frequency: Map processes logs of page requests and emits <URL, 1>; Reduce adds together all values for the same URL and emits a <URL, total count> pair
- Distributed sort: Map extracts the key from each record and emits a <key, record> pair; Reduce is the identity function
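The latter two examples can be sketched in the same style as the word-count code above. The functions below mirror the paper's descriptions; the log-line layout and the CSV key position are made-up assumptions for the illustration.

    # Sketch of the URL-access-frequency and distributed-sort examples.
    # The log-line layout ("GET /path status") is an assumption for illustration.
    from collections import defaultdict

    def url_map(log_line):
        url = log_line.split()[1]       # emit <URL, 1> for each request
        yield url, 1

    def url_reduce(url, counts):
        yield url, sum(counts)          # emit <URL, total count>

    def sort_map(record):
        # Emit <key, record>; here the key is assumed to be the first CSV field.
        yield record.split(",")[0], record

    def sort_reduce(key, records):
        yield from records              # identity: the ordering guarantee sorts

    def group(pairs):
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        return sorted(groups.items())

    logs = ["GET /a 200", "GET /b 200", "GET /a 404"]
    for url, counts in group(p for line in logs for p in url_map(line)):
        print(list(url_reduce(url, counts)))   # [('/a', 2)] then [('/b', 1)]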
3. Implementation
Environment
The implementation depends on the environment:
- Dual-processor x86 machines with 2-4 GB of memory
- Commodity networking hardware, 100 Mb/s or 1 Gb/s at the machine level
- A cluster of hundreds or thousands of machines
- Storage provided by inexpensive IDE disks attached directly to the machines
Execution Overview
1. Input partitioning (M splits of typically 16-64 MB each); copies of the program are started on the cluster
2. Task assignment: the master assigns map or reduce tasks to workers
3. Map task: parse key/value pairs from the input split; produce intermediate key/value pairs with the user's Map function
Execution Overview (continued)
4. Partitioning of intermediate pairs (by a hash function, typically hash(key) mod R); their locations are forwarded to the reduce workers by the master
5. Reduce task: read the intermediate data from the map workers; sort it by intermediate key; group values by key
6. Reduce function: process each group passed to it by the reduce task
7. When all tasks have completed, the MapReduce call returns
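A sketch of steps 4 and 5 in a single process: partition intermediate pairs by a hash of the key modulo R, then sort each partition by key and group the values. R and the CRC32-based hash are arbitrary choices for the example; the real system writes R partition files per map task to local disk, and reduce workers fetch them remotely.

    # Sketch of the shuffle: partition by hash(key) mod R, then sort and group.
    import zlib
    from itertools import groupby
    from operator import itemgetter

    R = 3  # number of reduce tasks (arbitrary for the example)

    def partition(key):
        return zlib.crc32(key.encode()) % R

    intermediate = [("apple", "1"), ("banana", "1"), ("apple", "1"), ("cherry", "1")]

    # Map side: bucket each map task's output into R partitions (step 4).
    buckets = [[] for _ in range(R)]
    for key, value in intermediate:
        buckets[partition(key)].append((key, value))

    # Reduce side: sort each partition by key, group the values per key, and
    # call the user's reduce function once per distinct key (step 5).
    for r, bucket in enumerate(buckets):
        bucket.sort(key=itemgetter(0))
        for key, pairs in groupby(bucket, key=itemgetter(0)):
            values = [v for _, v in pairs]
            print(f"reduce task {r}: key={key!r} values={values}")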
Details of Map/Reduce Task
Master Data Structures
The master keeps several data structures:
- The state (idle, in-progress, or completed) of each map and reduce task
- The identity of the worker machine for each non-idle task
The master is also the conduit through which the locations of intermediate files are propagated from map tasks to reduce tasks
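A sketch of that bookkeeping; the paper describes what this state contains but not its layout, so the field names below are assumptions.

    # Sketch of the master's per-task state (field names are assumptions).
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional, Tuple

    @dataclass
    class Task:
        kind: str                      # "map" or "reduce"
        state: str = "idle"            # "idle", "in-progress", or "completed"
        worker: Optional[str] = None   # identity of the assigned worker machine

    @dataclass
    class Master:
        map_tasks: List[Task]
        reduce_tasks: List[Task]
        # For each completed map task: the locations and sizes of its R
        # intermediate files, pushed incrementally to in-progress reduce tasks.
        intermediate: Dict[int, List[Tuple[str, int]]] = field(default_factory=dict)

        def complete_map(self, i, locations):
            self.map_tasks[i].state = "completed"
            self.intermediate[i] = locations

    m = Master(map_tasks=[Task("map"), Task("map")], reduce_tasks=[Task("reduce")])
    m.complete_map(0, [("worker7:/local/mr-0-0", 1024)])
    print(m.map_tasks[0].state, m.intermediate)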
Fault Tolerance
Worker failure:
- The master pings every worker periodically
- Any worker that does not respond is considered "dead"
- Any task in progress on a dead worker is reset and re-executed, for both map and reduce tasks
- Completed map tasks are also reset, because their results are stored on the failed machine's local disk
Master failure:
- Abort the entire computation
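A sketch of that re-execution rule, using a plain-dict task representation assumed for the example:

    # Sketch of the worker-failure rule: reset the failed worker's tasks.
    def handle_dead_worker(worker_id, tasks):
        for t in tasks:
            if t["worker"] != worker_id:
                continue
            if t["kind"] == "map" and t["state"] in ("in-progress", "completed"):
                # Map output lives on the failed machine's local disk, so even
                # completed map tasks become eligible for re-execution.
                t["state"], t["worker"] = "idle", None
            elif t["kind"] == "reduce" and t["state"] == "in-progress":
                # Reduce output is written to the global file system, so only
                # in-progress reduce tasks are reset.
                t["state"], t["worker"] = "idle", None

    tasks = [
        {"kind": "map", "state": "completed", "worker": "w1"},
        {"kind": "reduce", "state": "in-progress", "worker": "w1"},
        {"kind": "reduce", "state": "completed", "worker": "w2"},
    ]
    handle_dead_worker("w1", tasks)
    print([t["state"] for t in tasks])   # ['idle', 'idle', 'completed']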
Locality
Master scheduling policy:
- Asks GFS for the locations of the replicas of the input file blocks
- Input is typically split into 64 MB pieces (the GFS block size)
- Map tasks are scheduled so that a replica of their input block is on the same or a nearby machine
Effect: most input data is read locally and consumes no network bandwidth
Task Granularity
Choice of M and R:
- Ideally, M and R should be much larger than the number of worker machines
- There are practical bounds: the master makes O(M + R) scheduling decisions and keeps O(M * R) state in memory
- In practice: M = 200,000 and R = 5,000, using 2,000 worker machines
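For those numbers, that works out to roughly O(M + R) = 205,000 scheduling decisions and O(M * R) = 200,000 * 5,000 = 10^9 map/reduce task pairs of state; the paper notes this state is only about one byte per pair, so it fits in roughly 1 GB of the master's memory.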
Backup Tasks
- Some "straggler" machines perform poorly and hold up the whole job
- Near the end of the computation, the master schedules redundant (backup) executions of the remaining in-progress tasks
- Whichever copy completes first "wins" (see the sketch below)
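A sketch of that policy; the completion threshold and the task layout are made up for the example.

    # Sketch of backup-task scheduling: near the end of the job, schedule a
    # second copy of each remaining in-progress task on an idle worker.
    def schedule_backups(tasks, idle_workers, completed_fraction, threshold=0.95):
        if completed_fraction < threshold:
            return []                        # only worth doing near the end
        backups = []
        for t in (t for t in tasks if t["state"] == "in-progress"):
            if not idle_workers:
                break
            backups.append((t["id"], idle_workers.pop()))
        return backups                       # whichever copy finishes first "wins"

    tasks = [{"id": 7, "state": "in-progress"}, {"id": 8, "state": "completed"}]
    print(schedule_backups(tasks, ["w9"], completed_fraction=0.97))   # [(7, 'w9')]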
4. Refinements
Refinements
An input reader:
- Supports reading input data in different formats
- Supports reading records from a database or from memory
An output writer:
- Supports producing output data in different formats
Refinements
A partitioning function:
- Intermediate data is partitioned across the R reduce tasks using a function of the intermediate key
- Default: hash(key) mod R
A combiner function:
- Does partial merging of map output before it is sent over the network
- Typically the same code is used for the combiner and the reduce function
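A sketch of both refinements. The hostname-based partitioner follows the example discussed in the paper (keeping all URLs from one host in the same output file); the CRC32 hash and the function names are assumptions.

    # Sketch of a user-defined partitioning function and a combiner.
    import zlib
    from collections import defaultdict
    from urllib.parse import urlparse

    R = 4

    def default_partition(key):
        return zlib.crc32(key.encode()) % R          # default: hash(key) mod R

    def host_partition(url_key):
        # Partition by hostname so all URLs from one host land in one output file.
        return zlib.crc32(urlparse(url_key).hostname.encode()) % R

    def combine(map_output):
        # Word-count style combiner: same logic as the reducer, applied locally
        # to one map task's output before it is sent over the network.
        partial = defaultdict(int)
        for word, count in map_output:
            partial[word] += count
        return list(partial.items())

    print(host_partition("http://a.example.com/x") == host_partition("http://a.example.com/y"))  # True
    print(combine([("the", 1), ("the", 1), ("fox", 1)]))   # [('the', 2), ('fox', 1)]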
Refinements
Ordering guarantees:
- Within a partition, intermediate key/value pairs are processed in increasing key order
- This yields a sorted output file per partition
Side-effects:
- Users may produce auxiliary files as additional outputs
- Write to a temporary file and atomically rename it once complete
Refinements
Skipping bad records:
- The map/reduce functions sometimes crash deterministically on particular records
- Fixing the bug is not always possible (e.g., it may be in a third-party library)
- On an error, the worker sends a report to the master
- If the master sees more than one failure on the same record, it tells workers to skip that record (see the sketch below)
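A sketch of that protocol in one process. The record sequence numbers, retry count, and all names here are assumptions; in the real system the failure report is sent from a signal handler and the whole task is re-executed rather than resumed.

    # Sketch of skipping bad records: workers report the offending record's
    # sequence number to the master; after more than one failure on the same
    # record, the master marks it to be skipped on re-execution.
    from collections import Counter

    failure_counts = Counter()     # master-side: (task_id, record_no) -> failures
    skip_set = set()               # records the master has decided to skip

    def report_failure(task_id, record_no):
        failure_counts[(task_id, record_no)] += 1
        if failure_counts[(task_id, record_no)] > 1:
            skip_set.add((task_id, record_no))

    def run_map_task(task_id, records, user_map):
        for record_no, record in enumerate(records):
            if (task_id, record_no) in skip_set:
                continue                      # skip records known to be bad
            try:
                yield from user_map(record)
            except Exception:
                report_failure(task_id, record_no)

    def bad_map(record):
        if record == "corrupt":
            raise ValueError("cannot parse")
        yield record, 1

    records = ["ok", "corrupt", "ok"]
    print(list(run_map_task(1, records, bad_map)))   # 1st attempt: failure reported
    print(list(run_map_task(1, records, bad_map)))   # 2nd attempt: record marked bad
    print(list(run_map_task(1, records, bad_map)))   # 3rd attempt: record skipped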
Refinements
Local execution:
- Debugging problems in a distributed system can be tricky
- An alternative implementation executes all the work on the local machine
- The computation can be limited to particular map tasks
Refinements
Status information:
- The master exports a set of status pages for human consumption
- Useful for diagnosing bugs
Counters:
- Count occurrences of various events
- Counter values are periodically propagated to the master and displayed on the status page
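A sketch of the counter facility; the class and function names are assumptions, and the uppercase-word count follows the example given in the paper.

    # Sketch of counters: user code increments named counters, the worker sends
    # a snapshot with each ping, and the master aggregates them for the status
    # page (the real master also de-duplicates counts from re-executed tasks).
    from collections import Counter

    class Counters:
        def __init__(self):
            self.values = Counter()
        def increment(self, name, amount=1):
            self.values[name] += amount
        def snapshot(self):
            return dict(self.values)          # piggybacked on the ping response

    master_totals = Counter()

    def master_receive_ping(worker_counters):
        master_totals.update(worker_counters)   # shown on the status page

    counters = Counters()

    def word_map(doc):
        for word in doc.split():
            if word.isupper():
                counters.increment("uppercase")
            yield word, 1

    list(word_map("MapReduce IS simple AND USEFUL"))
    master_receive_ping(counters.snapshot())
    print(dict(master_totals))   # {'uppercase': 3}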
Status monitor
5. Performance
Performance Boasts
Distributed grep:
- 10^10 100-byte records (~1 TB of data)
- A rare 3-character pattern, found in ~100,000 records
- ~1,800 workers
- 150 seconds from start to finish, including ~60 seconds of startup overhead
Performance Boasts
Distributed sort:
- Same records and workers as above
- About 50 lines of MapReduce user code
- 891 seconds, including startup overhead
- Compares well with the best reported result of 1057 seconds for the TeraSort benchmark
Performance Boasts
6. Conclusion
Conclusion
- MapReduce is easy to use
- A large variety of problems are easily expressible as MapReduce computations
- Google has developed an implementation of MapReduce that scales to large clusters of machines
Thank you!