MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat
Outline ◦Introduction ◦Programming Model ◦Implementation ◦Refinements ◦Performance ◦Related Work ◦Conclusions
Introduction ◦What is the purpose? ◦The abstraction: input data → Map → intermediate key/value pairs → Reduce → output files
Programming Model ◦Map ◦Reduce ◦Example
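The Map/Reduce pair on this slide can be illustrated with the word-count example from the paper. The following is a minimal single-process sketch, not the distributed implementation: `run_mapreduce` is a hypothetical driver written only for this illustration, and the two sample documents are made up.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Map: emit an intermediate (word, 1) pair for every word."""
    for word in contents.split():
        yield word, 1


def reduce_fn(word, counts):
    """Reduce: sum all counts emitted for one word."""
    return sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: group intermediate values by key,
    then call reduce once per key."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


# Made-up sample input: (document name, document contents)
docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
counts = run_mapreduce(docs, map_fn, reduce_fn)
```

The user writes only `map_fn` and `reduce_fn`; grouping by key is the job of the framework, which the driver stands in for here.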
Programming Model ◦Real example: building an index
Programming Model ◦More examples: distributed grep, count of URL access frequency, reverse web-link graph, term-vector per host, inverted index, distributed sort
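As one instance of the examples listed above, an inverted index can be sketched as follows: Map emits (word, document ID) pairs and Reduce collects the sorted list of document IDs per word. The function names `index_map`, `index_reduce`, and `build_index`, as well as the sample documents, are assumptions made for this illustration.

```python
from collections import defaultdict


def index_map(doc_id, contents):
    """Map: emit (word, doc_id) once per distinct word in the document."""
    for word in set(contents.split()):
        yield word, doc_id


def index_reduce(word, doc_ids):
    """Reduce: return the sorted, de-duplicated posting list for one word."""
    return sorted(set(doc_ids))


def build_index(docs):
    """Hypothetical in-memory driver standing in for the framework."""
    postings = defaultdict(list)
    for doc_id, contents in docs:
        for word, d in index_map(doc_id, contents):
            postings[word].append(d)
    return {w: index_reduce(w, ds) for w, ds in postings.items()}


# Made-up sample input: (document ID, document contents)
docs = [(1, "map reduce"), (2, "reduce output"), (3, "map output map")]
index = build_index(docs)
```

The other listed examples follow the same shape; only the emitted key and the reduce operation change.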
Implementation ◦Execution overview
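The execution flow can be approximated in a single process: the input is split into M pieces, each map task writes its output into R partitions, and each reduce task reads its partition from every map task, sorts by key, and reduces. This is only a sketch under simplifying assumptions (no master, no workers, no files, everything in memory); all function names are hypothetical, and CRC32 is used as the hash only because it is deterministic across runs.

```python
import zlib
from itertools import groupby
from operator import itemgetter


def partition(key, R):
    """Assign a key to one of R reduce partitions (assumed hash: CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % R


def run_map_task(split, map_fn, R):
    """One map task: apply map_fn to each record, bucket output by partition."""
    buckets = [[] for _ in range(R)]
    for record in split:
        for key, value in map_fn(record):
            buckets[partition(key, R)].append((key, value))
    return buckets


def run_reduce_task(r, all_map_outputs, reduce_fn):
    """One reduce task: gather partition r from every map task, sort, reduce."""
    pairs = [kv for buckets in all_map_outputs for kv in buckets[r]]
    pairs.sort(key=itemgetter(0))
    return [(key, reduce_fn(key, [v for _, v in group]))
            for key, group in groupby(pairs, key=itemgetter(0))]


def execute(records, map_fn, reduce_fn, M, R):
    """Split input into at most M contiguous pieces, run maps, then R reduces."""
    size = max(1, -(-len(records) // M))  # ceiling division
    splits = [records[i:i + size] for i in range(0, len(records), size)]
    map_outputs = [run_map_task(s, map_fn, R) for s in splits]
    return [run_reduce_task(r, map_outputs, reduce_fn) for r in range(R)]


def word_map(line):
    return [(w, 1) for w in line.split()]


def word_reduce(key, values):
    return sum(values)


lines = ["a b a", "b c", "a c c", "d"]
outputs = execute(lines, word_map, word_reduce, M=2, R=3)  # one list per reduce task
merged = dict(kv for part in outputs for kv in part)
```

Because every occurrence of a key hashes to the same partition, each key is reduced exactly once, and the job produces R separate outputs.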
Implementation ◦Master data structures ◦Fault tolerance: worker failure, master failure, semantics in the presence of failures ◦Locality ◦Task Granularity ◦Backup Tasks
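The master's bookkeeping and its reaction to a worker failure might be sketched as below. Following the paper, each task is idle, in-progress, or completed; when a worker fails, its in-progress tasks and its completed map tasks are reset to idle (map output lives on the failed machine's local disk), while completed reduce tasks are kept (their output is already in the global file system). The `Task` class and `handle_worker_failure` function are hypothetical names for this illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # identity of the assigned worker, if any


def handle_worker_failure(tasks, failed_worker):
    """Reset tasks that must be re-executed; return how many were reset."""
    rescheduled = 0
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_in_progress = task.state is State.IN_PROGRESS
        # Completed map output sits on the failed worker's local disk.
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if lost_in_progress or lost_map_output:
            task.state = State.IDLE
            task.worker = None
            rescheduled += 1
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w2"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
]
reset_count = handle_worker_failure(tasks, "w1")
```

In this made-up scenario, the completed map task and the in-progress reduce task on `w1` become eligible for rescheduling; the completed reduce task stays completed.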
Refinements ◦Partitioning Function ◦Ordering Guarantees ◦Combiner Function ◦Input and Output Types ◦Side-effects ◦Skipping Bad Records ◦Local Execution ◦Status Information ◦Counters
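Two of these refinements are easy to sketch. The paper's default partitioning function is hash(key) mod R, and a user-supplied one such as hash(Hostname(urlkey)) mod R sends all URLs from one host to the same output file. A combiner does partial merging of map output before it is sent over the network. The code below is an illustration under assumptions: the function names are hypothetical, CRC32 stands in for the unspecified hash, and the sample URLs are made up.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse


def default_partition(key, R):
    """Default partitioner: hash(key) mod R (CRC32 is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % R


def host_partition(url, R):
    """Custom partitioner: hash(Hostname(url)) mod R, so all URLs of one
    host land in the same reduce partition."""
    host = urlparse(url).hostname or ""
    return zlib.crc32(host.encode("utf-8")) % R


def combine(pairs):
    """Combiner: partially sum (word, count) pairs on the map side."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return sorted(totals.items())


R = 4
urls = ["http://example.com/a", "http://example.com/b"]
same_partition = host_partition(urls[0], R) == host_partition(urls[1], R)
combined = combine([("the", 1), ("a", 1), ("the", 1), ("the", 1)])
```

Here the combiner shrinks four intermediate pairs to two, which is the point of the refinement: less data crosses the network for associative and commutative reduce functions.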
Performance ◦Cluster Configuration: 1,800 machines; each with 2 GHz Intel Xeon processors, 4 GB memory, 2 × 160 GB IDE disks, and 1 Gbps Ethernet; arranged in a two-level tree-shaped switched network
Performance ◦Grep: scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (occurs in 92,337 records); data transfer rate over time; the entire computation takes approximately 150 s; peaks at over 30 GB/s with 1,764 workers assigned
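A quick back-of-the-envelope check puts these grep numbers in proportion. The calculation assumes the input size reported in the MapReduce paper (10^10 records of 100 bytes each) and treats the "over 30 GB/s" peak as exactly 30 GB/s, so the per-worker figure is a lower bound; the variable names are chosen for this illustration.

```python
records = 10**10          # input records (figure from the paper)
record_bytes = 100        # bytes per record (figure from the paper)
elapsed_s = 150           # approximate total run time, from the slide
peak_rate = 30e9          # bytes per second; the slide says "over 30 GB/s"
workers = 1764            # workers assigned at the peak, from the slide
matches = 92_337          # records containing the pattern, from the slide

total_bytes = records * record_bytes                 # about 1 TB of input
avg_rate_gb_s = total_bytes / elapsed_s / 1e9        # average over the whole run
peak_per_worker_mb_s = peak_rate / workers / 1e6     # peak scan rate per worker
match_fraction = matches / records                   # how rare the pattern is

print(f"input size:      {total_bytes / 1e12:.1f} TB")
print(f"average rate:    {avg_rate_gb_s:.2f} GB/s")
print(f"peak per worker: {peak_per_worker_mb_s:.1f} MB/s")
print(f"match fraction:  {match_fraction:.2e}")
```

The average rate comes out well below the peak because the 150 s total includes startup time before all workers are scanning.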
Performance ◦Sort: sorts 10^10 100-byte records; modeled after the TeraSort benchmark; the Map function extracts a 10-byte sorting key
Performance ◦Sort: input rate is lower than for grep; there is a delay between the end of shuffling and the start of output writes; the rate: input > shuffle > output; effect of backup tasks; machine failures
Related Work ◦Restricted programming models ◦Parallel processing: compared to Bulk Synchronous Programming & MPI primitives ◦Backup task mechanism: compared to the Charlotte System ◦Sorting facility: compared to NOW-Sort
Related Work ◦Sending data over distributed queues: compared to River ◦Programming model: compared to BAD-FS
Conclusions ◦What is the reason for the success of MapReduce? Easy to use; problems are easily expressible; scales to large clusters ◦Learned from this work: restricting the programming model; network bandwidth is a scarce resource; redundant execution