1
By: Jeffrey Dean & Sanjay Ghemawat Presented by: Warunika Ranaweera Supervised by: Dr. Nalin Ranasinghe
2
MapReduce: Simplified Data Processing on Large Clusters In Proceedings of the 6th Symposium on Operating Systems Design and Implementation (OSDI' 04) Also appears in the Communications of the ACM (2008)
3
Ph.D. in Computer Science – University of Washington Google Fellow in Systems and Infrastructure Group ACM Fellow Research Areas: Distributed Systems and Parallel Computing
4
Ph.D. in Computer Science – Massachusetts Institute of Technology Google Fellow Research Areas: Distributed Systems and Parallel Computing
5
Calculate 30*50. Easy? Now try 30*50 + 31*51 + 32*52 + 33*53 + ... + 40*60. A little bit harder?
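(A one-line sketch, purely as my own aside, showing how trivially a machine handles the "harder" sum:)
    # Sum of i*j where (i, j) runs in lockstep from (30, 50) up to (40, 60).
    total = sum(i * (i + 20) for i in range(30, 41))
    print(total)  # 21285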
6
Simple computation, but a huge data set. A real-world example of a large computation: 20+ billion web pages x 20 KB per page. One computer reads 30-35 MB/s from disk. Nearly four months just to read the web.
7
Solution: parallelize the task across a distributed computing environment. The web-page problem is solved in about 3 hours with 1,000 machines.
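A rough back-of-the-envelope check of these figures (my own arithmetic, not from the slides): 20 x 10^9 pages x 20 KB ≈ 400 TB; at ~35 MB/s that is about 1.1 x 10^7 seconds ≈ 130 days, i.e. roughly four months on a single machine. Split evenly across 1,000 machines, the same read takes a little over 3 hours.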
8
Complexities in Distributed Computing:
o How to parallelize the computation?
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing
9
MapReduce: a platform to hide the messy details of distributed computing, namely parallelization, fault tolerance, data distribution, and load balancing. It is both a programming model and an implementation.
10
Example: Word count
Document: the quick brown fox / the fox ate the mouse
Mapped: (the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Reduced: (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
11
E.g.: Word count using MapReduce
Input: the quick brown fox / the fox ate the mouse
Map: (the, 1) (quick, 1) (brown, 1) (fox, 1) (the, 1) (fox, 1) (ate, 1) (the, 1) (mouse, 1)
Reduce / Output: (the, 3) (quick, 1) (brown, 1) (fox, 2) (ate, 1) (mouse, 1)
12
map(String key, String value):
    // key: document name; value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");
Input: a text file (document name and document contents). Output: one intermediate key/value pair per word, e.g. ("fox", "1").
13
reduce(String key, Iterator values):
    // key: a word; values: the list of counts output from Map
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));
Input: a word and its list of counts, e.g. ("fox", {"1", "1"}). Output: the accumulated count, e.g. ("fox", "2").
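To make the two functions above concrete, here is a minimal, runnable Python sketch of the whole word-count pipeline (my own illustration; the function names and the in-memory grouping step stand in for the real system's distributed shuffle):

    from collections import defaultdict

    def map_fn(doc_name, contents):
        # Emit an intermediate (word, 1) pair for every word in the document.
        for w in contents.split():
            yield (w, 1)

    def reduce_fn(word, counts):
        # Sum all counts emitted for this word.
        return (word, sum(counts))

    def map_reduce(documents):
        # "Shuffle" step: group intermediate values by key, as the framework would.
        grouped = defaultdict(list)
        for name, contents in documents.items():
            for key, value in map_fn(name, contents):
                grouped[key].append(value)
        return [reduce_fn(k, v) for k, v in grouped.items()]

    docs = {"d1": "the quick brown fox", "d2": "the fox ate the mouse"}
    print(map_reduce(docs))
    # [('the', 3), ('quick', 1), ('brown', 1), ('fox', 2), ('ate', 1), ('mouse', 1)]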
14
Reverse Web-Link Graph: several source web pages (Source 1 - Source 5) each link to a target page (my web page).
15
Reverse Web-Link Graph
Map: for each link found in a source page, emit (target, source), e.g. ("My Web", "Source 1"), ("Not My Web", "Source 2"), ("My Web", "Source 3"), ("My Web", "Source 4"), ("My Web", "Source 5")
Reduce: for each target, collect the list of source pages pointing to it, e.g. ("My Web", {"Source 1", "Source 3", ...})
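A similar self-contained sketch for the reverse web-link graph (again my own Python illustration; the tiny link graph and helper names are made up):

    from collections import defaultdict

    def map_fn(source_page, outgoing_links):
        # Emit (target, source) for every link found in the source page.
        for target in outgoing_links:
            yield (target, source_page)

    def reduce_fn(target, sources):
        # Collect every source that points at this target.
        return (target, sorted(sources))

    def reverse_links(link_graph):
        grouped = defaultdict(list)
        for source, targets in link_graph.items():
            for target, src in map_fn(source, targets):
                grouped[target].append(src)
        return [reduce_fn(t, s) for t, s in grouped.items()]

    links = {"Source 1": ["My Web"], "Source 2": ["Not My Web"], "Source 3": ["My Web"]}
    print(reverse_links(links))
    # [('My Web', ['Source 1', 'Source 3']), ('Not My Web', ['Source 2'])]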
16
Implementation: Execution Overview
Input layer: the input is divided into splits (Split 0 - Split 4). Intermediate files sit between the map layer and the reduce layer; output files form the output layer.
(1) The user program forks the master and the workers. (2) The master assigns map tasks and reduce tasks to workers. (3) A map worker reads its input split. (4) It writes intermediate files to local disk. (5) A reduce worker remotely reads those intermediate files. (6) It writes its output file (O/P File 0, O/P File 1).
17
Complexities in Distributed Computing, to be solved:
o How to parallelize the computation? -> Automatic parallelization using Map & Reduce
o Coordinate with other nodes
o Handling failures
o Preserve bandwidth
o Load balancing
18
Restricted programming model: the user specifies the Map & Reduce functions, and the same user-defined Map/Reduce instructions are shipped to 1000s of workers, each running over a different piece of the data.
19
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate with other nodes -> Coordinate nodes using a master node
o Handling failures
o Preserve bandwidth
o Load balancing
20
Master data structure: the master stores the task meta-data and pushes this information between map workers and reduce workers.
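As a rough sketch of the state this implies, the paper describes the master as tracking, for each task, its state and the worker it runs on, plus the locations of completed map output that get pushed to reduce workers; the Python field names below are my own assumptions:

    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class TaskState(Enum):
        IDLE = "idle"
        IN_PROGRESS = "in_progress"
        COMPLETED = "completed"

    @dataclass
    class MapTask:
        split: str                            # which input split this task reads
        state: TaskState = TaskState.IDLE
        worker: Optional[str] = None          # id of the worker machine it runs on
        # locations of intermediate files, forwarded to reduce workers when completed
        intermediate_files: List[str] = field(default_factory=list)

    @dataclass
    class ReduceTask:
        partition: int                        # which intermediate partition this task reduces
        state: TaskState = TaskState.IDLE
        worker: Optional[str] = None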
21
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Handling failures -> Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth
o Load balancing
22
No response from a worker?
If an in-progress Map or Reduce task: re-execute
If a completed Map task: re-execute (its output sat on the failed machine's local disk)
If a completed Reduce task: leave untouched (its output is already in the global file system)
Master failure (unlikely): restart
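A toy sketch of that decision logic (my own illustration; the function and return values are hypothetical, not the paper's API):

    def handle_worker_failure(task_kind, task_state):
        # task_kind: "map" or "reduce"; task_state: "in_progress" or "completed".
        if task_state == "in_progress":
            return "re-execute"      # any task still running on the dead worker is rescheduled
        if task_kind == "map":
            return "re-execute"      # completed map output lived on the lost machine's local disk
        return "leave untouched"     # completed reduce output is already in the global file system

    print(handle_worker_failure("map", "completed"))  # re-execute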
23
"Straggler": a machine that takes a long time to complete the last few tasks in the computation. Solution: redundant execution. Near the end of a phase, spawn backup copies of the remaining in-progress tasks; the copy that finishes first "wins".
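A much-simplified sketch of the backup-copy idea on a single machine (my own illustration using Python threads; in the real system the master schedules the backup on a different worker near the end of the phase):

    import concurrent.futures

    def run_with_backup(task, executor):
        # Launch the primary and a backup copy of the same task; whichever finishes first "wins".
        primary = executor.submit(task)
        backup = executor.submit(task)
        done, _ = concurrent.futures.wait(
            [primary, backup], return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        print(run_with_backup(lambda: sum(range(1000)), pool))  # 499500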
24
Complexities in Distributed Computing, solving...
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance (re-execution) & backup tasks
o Preserve bandwidth -> Saves bandwidth through locality
o Load balancing
25
The same data is stored on several different machines; if a task is scheduled on a machine that already holds its input data locally, it does not need to fetch the data from other nodes, which saves network bandwidth.
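A minimal sketch of locality-aware assignment (my own illustration; the scheduler and its arguments are assumptions):

    def assign_map_task(replica_workers, idle_workers):
        # Prefer an idle worker that already stores a replica of this task's input split.
        for worker in replica_workers:
            if worker in idle_workers:
                return worker
        # Otherwise fall back to any idle worker (the real scheduler also prefers
        # workers that are network-close to a replica).
        return next(iter(idle_workers), None)

    print(assign_map_task(["w3", "w7"], {"w1", "w7"}))  # w7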
26
Complexities in Distributed Computing, solved:
o Automatic parallelization using Map & Reduce
o Coordinate nodes using a master node
o Fault tolerance & backup tasks
o Saves bandwidth through locality
o Load balancing -> Load balancing through granularity
27
Fine-grained tasks: there are many more map tasks than machines, so one worker runs several tasks over time and idle workers are quickly assigned new work.
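For a sense of scale (figures recalled from the original OSDI paper, so treat them as approximate): typical jobs use around M = 200,000 map tasks and R = 5,000 reduce tasks on roughly 2,000 worker machines, i.e. on the order of 100 map tasks per worker, so a freed-up worker always has more work to pick up.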
28
Refinements: partitioning, combining, skipping bad records, local execution for debugging, counters.
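For example, the paper's default partitioner hashes the intermediate key into one of R reduce partitions, and users can supply their own; a small Python sketch of that idea (my own illustration):

    from urllib.parse import urlparse

    def default_partition(key, R):
        # Default partitioning: hash the intermediate key into one of R reduce buckets.
        return hash(key) % R

    def partition_by_host(url_key, R):
        # User-supplied alternative: keep all URLs from one host in the same output file.
        return hash(urlparse(url_key).netloc) % R

    print(partition_by_host("http://example.com/a", 5) == partition_by_host("http://example.com/b", 5))  # True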
29
Normal execution: 891 s. No backup tasks: 1283 s, a 44% increase in time; very long tail, with stragglers taking > 300 s to finish.
30
Normal execution: 891 s. With 200 worker processes killed: 933 s, only a 5% increase in time; quick failure recovery.
31
Clustering for Google News and Google Product Search
Google Maps: locating addresses, rendering map tiles
Google PageRank
Localized search
32
Apache Hadoop: MapReduce plus the Hadoop Distributed File System (HDFS). Used by Yahoo! Search, Facebook, Amazon, Twitter, and Google.
33
Higher-level languages/systems built on Hadoop: Pig and Hive. Amazon Elastic MapReduce: available to the general public for processing data in the cloud.
34
A large variety of problems can be expressed as Map & Reduce. The restricted programming model makes it easy to hide the details of distributed computing, achieving both scalability and programming efficiency.
35
GFS-style solution to master failure: shadow masters. Only meta-data passes through the master, so a new copy can be started from the last recorded state.
36
Programmer’s burden? “If we hadn’t had to deal with failures, if we had a perfectly reliable set of computers to run this on, we would probably never have implemented MapReduce” – Sanjay Ghemawat
37
Combiner