MapReduce: Simplified Data Processing on Large Clusters
Department of Computer Science, 김정수
Contents
1. MapReduce
2. Implementation
3. Refinements
4. Performance
5. Experience
6. Conclusions
References
1. MapReduce
1) What is MapReduce?
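As a minimal illustration of the programming model, the sketch below uses word counting, the canonical example from the MapReduce paper. The user writes only a map function and a reduce function. The helper `run_mapreduce` and the sample documents are hypothetical: it is a single-process stand-in for the distributed runtime, not the real library.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """Map: emit an intermediate (word, 1) pair for every word in one document."""
    for word in text.split():
        yield word.lower(), 1


def reduce_fn(word, counts):
    """Reduce: merge all counts emitted for the same word."""
    return word, sum(counts)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical single-process runtime: map, group by key, then reduce."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Keys are handed to reduce in sorted order.
    return dict(reduce_fn(k, v) for k, v in sorted(groups.items()))


docs = [("a.txt", "the quick fox"), ("b.txt", "the lazy dog the end")]
word_counts = run_mapreduce(docs, map_fn, reduce_fn)
print(word_counts)
```

In a real deployment the grouping step is a distributed shuffle across machines; here it is just a dictionary.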
1. MapReduce
2) Why is MapReduce needed?
- Parallelizing the computation
- Distributing the data
- Handling failures without burying the simple computation in complex code
- Running large-scale computations efficiently on a large cluster
2. Implementation
1) Execution Overview
2) Master Data Structures
3) Fault Tolerance
- Worker Failure
- Master Failure
- Semantics in the Presence of Failures
4) Locality
5) Task Granularity
6) Backup Tasks
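The master data structures and the worker-failure rule can be sketched as follows. In the paper, the master records a state (idle, in-progress, or completed) and a worker identity for each task. When a worker is lost, its in-progress tasks and its completed map tasks go back to idle, because map output lives on the failed machine's local disk. Completed reduce tasks are kept, because their output is already in the global file system. The class and function names below are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETED = "completed"


@dataclass
class Task:
    kind: str                       # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None    # worker the task is (or was) assigned to


def handle_worker_failure(tasks, failed_worker):
    """Reset every task that must be re-run after `failed_worker` is lost."""
    rescheduled = []
    for task in tasks:
        if task.worker != failed_worker:
            continue
        lost_map_output = task.kind == "map" and task.state is State.COMPLETED
        if task.state is State.IN_PROGRESS or lost_map_output:
            task.state, task.worker = State.IDLE, None
            rescheduled.append(task)
    return rescheduled


tasks = [
    Task("map", State.COMPLETED, "w1"),
    Task("map", State.IN_PROGRESS, "w2"),
    Task("reduce", State.COMPLETED, "w1"),
    Task("reduce", State.IN_PROGRESS, "w1"),
]
rescheduled = handle_worker_failure(tasks, "w1")
print([(t.kind, t.state.value) for t in tasks])
```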
3. Refinements
1) Partitioning Function
2) Ordering Guarantees
3) Combiner Function
4) Input and Output Types
5) Side-effects
6) Skipping Bad Records
7) Local Execution
8) Status Information
9) Counters
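Two of these refinements are easy to show in a few lines. The paper's default partitioning function is `hash(key) mod R`, with a variant that hashes only the hostname of a URL key so that all pages from one host land in the same output file. A combiner pre-merges map output before it is sent over the network. The sketch below is only an approximation: `stable_hash`, the value of `R`, and the sample URLs are assumptions made for illustration.

```python
import zlib
from collections import Counter
from urllib.parse import urlparse

R = 4  # number of reduce tasks (illustrative value)


def stable_hash(key: str) -> int:
    """Deterministic hash; Python's built-in hash() is salted per process."""
    return zlib.crc32(key.encode("utf-8"))


def default_partition(key: str, r: int = R) -> int:
    """Default partitioner: hash(key) mod R."""
    return stable_hash(key) % r


def host_partition(url_key: str, r: int = R) -> int:
    """Partition by hostname so all URLs of one host share a reduce task."""
    return stable_hash(urlparse(url_key).hostname or "") % r


def combine(pairs):
    """Combiner: partially sum (word, count) pairs on the map side."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


urls = ["http://example.com/a", "http://example.com/b"]
same_partition = host_partition(urls[0]) == host_partition(urls[1])
combined = combine([("the", 1), ("fox", 1), ("the", 1)])
print(same_partition, combined)
```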
4. Performance
1) Grep
- Searches for a rare three-character pattern
- Input: 10^10 100-byte records (figure as reported in the paper)
- Input is split into 64MB pieces
- M = 15000, R = 1
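The value M = 15000 follows roughly from the input size divided by the split size. The short calculation below assumes the paper's input of 10^10 records of 100 bytes each and treats 64MB as 64 MiB. Both are assumptions for this sketch, and the paper itself describes the split size as approximate.

```python
import math

RECORDS = 10**10            # assumption: record count taken from the paper
RECORD_BYTES = 100          # assumption: record size taken from the paper
SPLIT_BYTES = 64 * 2**20    # assumption: "64MB" read as 64 MiB

total_bytes = RECORDS * RECORD_BYTES              # about 1 TB of input
num_splits = math.ceil(total_bytes / SPLIT_BYTES)  # one map task per split
print(total_bytes, num_splits)
```

The result is a little under 15000, which is consistent with the rounded M given on the slide.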
4. Performance
2) Sort
- Under 50 lines of user code in total
- Approximately 1 TB of data
- Input is split into 64MB pieces
- M = 15000, R = 4000 (R as reported in the paper)
- The top graph shows the rate at which input is read.
- The middle graph shows the rate at which data is sent over the network to the reduce tasks.
- The bottom graph shows the rate at which sorted data is written to the final output files.
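The sort benchmark needs so little user code because the library does most of the work. As the paper describes it, the map function extracts a 10-byte key from each record, the reduce function is the identity, and a range partitioner plus the ordering guarantee produce globally sorted output. The sketch below imitates that in one process. The sample records, the split points, and `mapreduce_sort` are hypothetical.

```python
import bisect
from collections import defaultdict

KEY_LEN = 10  # the sort key is the leading 10 bytes of each record


def map_fn(record):
    """Map: emit (sort key, original record)."""
    yield record[:KEY_LEN], record


def identity_reduce(key, values):
    """Reduce: pass the records through unchanged."""
    yield from values


def range_partition(key, split_points):
    """Index of the key range that `key` falls into."""
    return bisect.bisect_right(split_points, key)


def mapreduce_sort(records, split_points):
    partitions = defaultdict(lambda: defaultdict(list))
    for record in records:
        for key, value in map_fn(record):
            partitions[range_partition(key, split_points)][key].append(value)
    output_files = []
    for p in range(len(split_points) + 1):
        out = []
        for key in sorted(partitions[p]):  # keys within a partition are processed in order
            out.extend(identity_reduce(key, partitions[p][key]))
        output_files.append(out)
    return output_files


records = ["pear000000-r1", "apple00000-r2", "zebra00000-r3", "mango00000-r4"]
output_files = mapreduce_sort(records, split_points=["h", "q"])
flattened = [rec for f in output_files for rec in f]
print(output_files)
```

Concatenating the output files in partition order gives the fully sorted data set, which is why range partitioning is used here instead of hashing.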
4. Performance
3) Effect of Backup Tasks
- With backup tasks disabled, 5 stragglers remain after almost all other tasks have finished.
- The run took 1283 seconds.
- That is a 44% increase in elapsed time over the normal run.
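A toy model shows why a handful of stragglers dominates the total time and why backup copies help. The job ends when its slowest task ends. With backups enabled, any task still running near the end gets a second copy on a healthy machine, and whichever copy finishes first wins. All numbers below are illustrative and are not the paper's measurements.

```python
NORMAL = 100  # seconds a healthy machine needs for one task (illustrative)


def job_time(durations, backup_start=None):
    """Completion time of a job, i.e. the finish time of its slowest task.

    If `backup_start` is set, each task still running at that time gets a
    backup copy on a healthy machine; the earlier of the two copies counts.
    """
    finish_times = []
    for d in durations:
        if backup_start is not None and d > backup_start:
            d = min(d, backup_start + NORMAL)
        finish_times.append(d)
    return max(finish_times)


durations = [100] * 95 + [400] * 5   # 95 normal tasks and 5 stragglers
without_backup = job_time(durations)
with_backup = job_time(durations, backup_start=110)
print(without_backup, with_backup)
```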
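4. Performance
4) Machine Failures
- 200 of 1746 workers were deliberately killed during the run (figures as reported in the paper).
- Negative values in the top graph (input rate) show completed map work that was lost and had to be re-executed.
- Total execution time was only 5% higher than in the normal run.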
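The two percentages in this section can be checked with simple arithmetic. The baseline of 891 seconds for the normal sort run and the 933 seconds for the run with killed workers are taken from the paper and do not appear on the slides, so treat them as assumptions here. The 1283 seconds is the no-backup-tasks figure reported on the slide.

```python
NORMAL_SECONDS = 891          # assumption: normal run, figure from the paper
NO_BACKUP_SECONDS = 1283      # run without backup tasks, from the slide
WITH_FAILURES_SECONDS = 933   # assumption: run with killed workers, from the paper


def overhead_percent(elapsed, baseline=NORMAL_SECONDS):
    """Extra elapsed time relative to the baseline, in percent."""
    return 100 * (elapsed - baseline) / baseline


no_backup_overhead = overhead_percent(NO_BACKUP_SECONDS)
failure_overhead = overhead_percent(WITH_FAILURES_SECONDS)
print(round(no_backup_overhead), round(failure_overhead))
```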
5. Experience
1) Benefits of using the MapReduce system
- Source code is simpler because MapReduce hides fault tolerance, distribution, and parallelization.
- MapReduce makes it easy to change the indexing process.
- MapReduce handles many problems automatically (machine failures, slow machines, etc.).
6. Conclusions
1) MapReduce can be used by programmers even if they have no experience with parallel and distributed systems.
2) A large variety of problems are easily expressible as MapReduce computations.
3) The authors of the paper developed an implementation of MapReduce that scales to large clusters comprising thousands of machines.
References
1. Mazdah's personal blog - Bigdata section
Thank you