Published by Bruno Webb. Modified over 9 years ago.
1
MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat
2
Outline ◦Introduction ◦Programming Model ◦Implementation ◦Refinement ◦Performance ◦Related work ◦Conclusions
3
Introduction ◦What is the purpose? ◦The abstraction: Input Data → Map → Intermediate key/value pairs → Reduce → Output File
4
Programming model ◦Map ◦Reduce ◦Example
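The Map/Reduce pair can be sketched with the paper's word-count example. The tiny in-memory driver below stands in for the real distributed framework; function names (`map_fn`, `reduce_fn`, `map_reduce`) are illustrative, not from the paper.

```python
from collections import defaultdict

# Toy, in-memory sketch of the MapReduce word-count example.
# The user supplies map and reduce; the "framework" groups by key.

def map_fn(doc_name, contents):
    # Emit (word, 1) for every word in the document.
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Sum all counts emitted for this word.
    yield sum(counts)

def map_reduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for key, value in inputs:                 # map phase
        for k, v in map_fn(key, value):
            intermediate[k].append(v)
    return {k: next(reduce_fn(k, vs))          # reduce phase, grouped by key
            for k, vs in sorted(intermediate.items())}

result = map_reduce([("doc1", "the quick fox"), ("doc2", "the lazy dog")],
                    map_fn, reduce_fn)
print(result)  # {'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```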
5
Programming model ◦Real example: make an index
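The index example on this slide is the inverted index: map emits (word, document id) pairs and reduce produces the sorted posting list for each word. A minimal sketch, with hypothetical names:

```python
from collections import defaultdict

# Inverted index sketch: word -> sorted list of document ids containing it.

def map_index(doc_id, text):
    for word in set(text.split()):   # emit each (word, doc_id) pair once
        yield (word, doc_id)

def reduce_index(word, doc_ids):
    return sorted(doc_ids)           # sorted posting list for the word

docs = {1: "map reduce cluster", 2: "map failure", 3: "reduce cluster"}
grouped = defaultdict(list)
for doc_id, text in docs.items():    # group intermediate pairs by word
    for word, d in map_index(doc_id, text):
        grouped[word].append(d)
index = {w: reduce_index(w, ids) for w, ids in grouped.items()}
print(index["map"])      # [1, 2]
print(index["cluster"])  # [1, 3]
```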
6
Programming Model ◦More examples Distributed grep Count of URL Access Frequency Reverse Web-Link Graph Term-Vector per Host Inverted Index Distributed Sort
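Distributed grep, the first example in this list, fits the model with almost no code: map emits a line when it matches the pattern, and reduce is the identity, copying the intermediate data to the output. A sketch (names are illustrative):

```python
import re

# Distributed grep as map/reduce: map filters, reduce passes through.

def grep_map(filename, lines, pattern):
    for line in lines:
        if re.search(pattern, line):
            yield (line, None)       # emit matching lines as keys

def grep_reduce(line, _values):
    yield line                       # identity: copy matches to output

matches = [k for k, _ in grep_map("log.txt",
                                  ["error: disk full", "ok", "error: timeout"],
                                  r"error")]
print(matches)  # ['error: disk full', 'error: timeout']
```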
7
Implementation ◦Execution overview
8
Implementation ◦Master data structures ◦Fault tolerance Worker failure Master failure Semantics in the Presence of Failures ◦Locality ◦Task Granularity ◦Backup Tasks
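The worker-failure bullet boils down to: when a worker stops responding, the master marks its tasks idle and re-schedules them on another machine. The toy sketch below (all names hypothetical) shows only that re-execution loop; real MapReduce detects failure via missed pings rather than exceptions.

```python
# Toy re-execution sketch: every task fails on its first attempt,
# and the master's scheduling loop retries it until it completes.

class FlakyWorker:
    """Simulated worker that loses the first attempt at every task."""
    def __init__(self):
        self.seen = set()

    def execute(self, task):
        if task not in self.seen:
            self.seen.add(task)
            raise RuntimeError("worker lost")
        return f"done:{task}"

def schedule(tasks, worker, max_attempts=3):
    results = {}
    for task in tasks:
        for _ in range(max_attempts):
            try:
                results[task] = worker.execute(task)
                break                 # task completed
            except RuntimeError:
                continue              # master marks the task idle, re-schedules
    return results

results = schedule(["map-0", "map-1"], FlakyWorker())
print(results)  # {'map-0': 'done:map-0', 'map-1': 'done:map-1'}
```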
9
Refinements ◦Partitioning Function ◦Ordering Guarantees ◦Combiner Function ◦Input and Output Types ◦Side-effects ◦Skipping Bad Records ◦Local Execution ◦Status Information ◦Counters
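Two of these refinements are easy to sketch: the default partitioning function is hash(key) mod R, where R is the number of reduce tasks, and a combiner pre-aggregates map output on the worker before it crosses the network (legal for word count because addition is associative and commutative). Python sketch, with illustrative names:

```python
from collections import Counter

def partition(key, R):
    # Default partitioner: hash(key) mod R picks the reduce task.
    return hash(key) % R

def combine(pairs):
    # Combiner: merge (word, count) pairs emitted by one map task
    # locally, so fewer bytes are shuffled to the reducers.
    totals = Counter()
    for word, n in pairs:
        totals[word] += n
    return sorted(totals.items())

assert 0 <= partition("the", 4) < 4
print(combine([("the", 1), ("fox", 1), ("the", 1)]))
# [('fox', 1), ('the', 2)]
```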
10
Performance ◦Cluster Configuration 1800 machines Each with two 2 GHz Intel Xeon processors 4 GB memory 2 × 160 GB IDE disks Gigabit Ethernet Arranged in a two-level tree-shaped network
11
Performance ◦Grep Scans through 10^10 100-byte records Searches for a relatively rare three-character pattern (occurs in 92,337 records) Data transfer rate over time The entire computation takes approximately 150 s Peaks at over 30 GB/s 1764 workers assigned
12
Performance ◦Sort Sorts 10^10 100-byte records Modeled after the TeraSort benchmark Extracts a 10-byte sorting key
13
Performance ◦Sort Input rate is lower than for grep There is a delay Rates: input > shuffle > output Effect of backup tasks Effect of machine failures
14
Related Work ◦Restricted programming models ◦Parallel processing compared to Bulk Synchronous Programming and MPI primitives ◦Backup task mechanism compared to the Charlotte System ◦Sorting facility compared to NOW-Sort
15
Related Work ◦Sending data over distributed queues compared to River ◦Programming model compared to BAD-FS
16
Conclusion ◦What is the reason for the success of MapReduce? Easy to use Many problems are easily expressible Scales to large clusters ◦Lessons learned from this work Restricting the programming model makes parallelization easy Network bandwidth is a scarce resource Redundant execution masks slow machines and failures