MapReduce: Simplified Data Processing on Large Clusters Hongfei Yan School of EECS, Peking University 7/9/2009
What's MapReduce?
A parallel/distributed computing programming model
Dataflow: input → split → map → shuffle → reduce → output
Typical problem solved by MapReduce
- Read in data: records in key/value format
- Map: extract something from each record
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes one input key/value pair; emits intermediate key/value pairs
- Shuffle: exchange data so that all intermediate results with the same key are gathered on the same node
- Reduce: aggregate, summarize, filter, etc.
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for one key, computes over them, and emits the merged result (usually just one)
- Write out the results
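The map/shuffle/reduce signatures above can be sketched as a single-process word-count job. This is a hypothetical minimal illustration, not Google's implementation; the document names, `map_fn`, `shuffle`, and `reduce_fn` are illustrative choices.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    # in_key: document name, in_value: document text
    for word in in_value.split():
        yield (word, 1)                    # emit (out_key, intermediate_value)

def shuffle(intermediate_pairs):
    # Gather all values for the same key, as the framework would
    groups = defaultdict(list)
    for key, value in intermediate_pairs:
        groups[key].append(value)
    return groups

def reduce_fn(out_key, values):
    # Merge all values for one key into (usually) one output value
    yield (out_key, sum(values))

docs = {"d1": "the quick fox", "d2": "the lazy dog"}
pairs = [kv for name, text in docs.items() for kv in map_fn(name, text)]
result = dict(kv for key, vals in shuffle(pairs).items()
                 for kv in reduce_fn(key, vals))
print(result)   # "the" appears in both documents, so it maps to 2
```

In the real framework the three stages run on different machines; here they run in sequence only to make the dataflow concrete.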
MapReduce Framework
Shuffle Implementation
Partition and Sort Group
- Partition function: hash(key) % (number of reducers)
- Group function: sort by key
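The two functions above can be sketched as follows. This is an assumed illustration: `str_hash` is a stand-in deterministic hash (Python's built-in `hash` is randomized per run for strings), and `partition`/`group` mirror the defaults the slide describes.

```python
def str_hash(key):
    # Simple deterministic string hash, used in place of hash(key)
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

def partition(key, num_reducers):
    # Partition function: hash(key) % number of reducers
    return str_hash(key) % num_reducers

def group(pairs):
    # Group function: sort intermediate pairs by key so each reducer
    # sees all values for a given key contiguously
    return sorted(pairs, key=lambda kv: kv[0])

pairs = [("b", 2), ("a", 1), ("b", 3)]
num_reducers = 4
buckets = {}
for k, v in group(pairs):
    buckets.setdefault(partition(k, num_reducers), []).append((k, v))
# Every pair with the same key lands in the same reducer's bucket
```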
Model is Widely Applicable: MapReduce Programs in the Google Source Tree
Example uses:
- distributed grep
- distributed sort
- web link-graph reversal
- term-vector per host
- web access log stats
- inverted index construction
- document clustering
- machine learning
- statistical machine translation
- ...
Algorithms That Fit in MapReduce
Algorithms with MapReduce implementations reported in the literature:
- K-Means, EM, SVM, PCA, Linear Regression, Naïve Bayes, Logistic Regression, Neural Networks
- PageRank
- Word Co-occurrence Matrices, Pairwise Document Similarity
- Monte Carlo simulation
- ...
MapReduce Runtime System
Google MapReduce Architecture
A single master node and many workers
MapReduce Operation
- Input data is split into 64MB blocks
- Map tasks compute; intermediate results are stored locally
- The master sends intermediate data locations to the reduce workers
- The final output is written
- The master is informed of the result locations
Fault Tolerance
- Fault tolerance is achieved through re-execution
- Periodic heartbeats detect failures
- Re-execute completed as well as in-progress map tasks from a failed node. Why? (Completed map output lives on the failed machine's local disk, so it is lost with the machine.)
- Re-execute only the in-progress reduce tasks from a failed node
- Task completion is committed through the master
- Robust: once lost 1600 of 1800 machines, yet the job finished OK
- What about master failure?
Refinement: Redundant Execution
Slow workers significantly delay completion time:
- other jobs consuming resources on the machine
- bad disks with soft errors transfer data slowly
Solution: near the end of a phase, spawn backup tasks; whichever copy finishes first "wins". This dramatically shortens job completion time.
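The backup-task idea can be sketched with two concurrent copies of the same task, taking whichever finishes first. This is an assumed single-machine toy (the delays and the `task` function are invented for illustration); the real system schedules the duplicate on a different machine.

```python
import concurrent.futures
import time

def task(delay, label):
    # Stand-in for a map or reduce task; delay models machine speed
    time.sleep(delay)
    return label

with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    primary = pool.submit(task, 0.5, "primary")   # straggler on a slow machine
    backup = pool.submit(task, 0.05, "backup")    # backup copy spawned near end of phase
    done, _ = concurrent.futures.wait(
        [primary, backup],
        return_when=concurrent.futures.FIRST_COMPLETED)
    winner = done.pop().result()                  # first copy to finish "wins"
print(winner)
```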
Refinement: Locality Optimization
Master scheduling policy:
- asks GFS for the locations of the replicas of the input file's blocks
- map tasks are typically split into 64MB pieces (the GFS block size)
- map tasks are scheduled so that a replica of their GFS input block is on the same machine or the same rack
Effect: thousands of machines read input at local-disk speed; without this, rack switches would limit the read rate.
Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs. The best solution is to debug and fix, but that is not always possible (e.g., third-party source libraries).
On a segmentation fault, the worker's signal handler sends a UDP packet to the master, including the sequence number of the record being processed. If the master sees two failures for the same record, the next worker is told to skip that record.
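The skip policy can be sketched as follows. This is a hypothetical single-process stand-in: ordinary exceptions replace segmentation faults, and a dictionary replaces the UDP report to the master; `report_failure`, `should_skip`, and `fragile` are invented names.

```python
failure_counts = {}

def report_failure(seq_no):
    # Stand-in for the UDP packet the worker sends to the master
    failure_counts[seq_no] = failure_counts.get(seq_no, 0) + 1

def should_skip(seq_no):
    # Master's policy: skip after two failures on the same record
    return failure_counts.get(seq_no, 0) >= 2

def run_map(records, map_fn):
    outputs = []
    for seq_no, record in enumerate(records):
        if should_skip(seq_no):
            continue                   # master told this worker to skip the record
        try:
            outputs.append(map_fn(record))
        except Exception:
            report_failure(seq_no)     # task dies; failure reported by record number
    return outputs

def fragile(record):
    if record == "bad":
        raise ValueError("simulated crash")
    return record.upper()

records = ["ok", "bad", "fine"]
run_map(records, fragile)        # attempt 1: the "bad" record fails
run_map(records, fragile)        # attempt 2: the same record fails again
out = run_map(records, fragile)  # attempt 3: the bad record is skipped
```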
Other Refinements
- Compression of intermediate data
- Combiner: "combiner" functions can run on the same machine as a mapper, causing a mini-reduce phase before the real reduce phase, to save bandwidth
- Local execution for debugging/testing
- User-defined counters
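The combiner's bandwidth saving can be shown with a local pre-aggregation of one mapper's word-count output. This is an assumed sketch: `map_fn` and `combine` are illustrative names, and the combiner simply reuses the reducer's summing logic, as is typical for word count.

```python
from collections import defaultdict

def map_fn(text):
    # One mapper's raw intermediate output: (word, 1) per occurrence
    return [(word, 1) for word in text.split()]

def combine(pairs):
    # Mini-reduce run locally on the mapper's machine before the shuffle
    local = defaultdict(int)
    for key, value in pairs:
        local[key] += value
    return list(local.items())

raw = map_fn("to be or not to be")
combined = combine(raw)
# 6 raw pairs shrink to 4 combined pairs; repeated words like "to"
# cross the network once as ("to", 2) instead of twice as ("to", 1)
```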
Hadoop MapReduce Architecture
- Master/Worker model
- Load balancing by a polling mechanism
History of Hadoop
- Initial versions of what are now the Hadoop Distributed File System and MapReduce were implemented by Doug Cutting and Mike Cafarella
- December: Nutch ported to the new framework; Hadoop runs reliably on 20 nodes
- January: Doug Cutting joins Yahoo!
- February: the Apache Hadoop project officially started, to support standalone development of MapReduce and HDFS
- March: formation of the Yahoo! Hadoop team
- May: Yahoo! sets up a Hadoop research cluster
- April: sort benchmark run on 188 nodes in 47.9 hours
- May: sort benchmark run on 500 nodes in 42 hours (better hardware than the April benchmark)
- October: research cluster reaches 600 nodes
- December: sort times of 1.8 hrs on 20 nodes, 3.3 hrs on 100 nodes, 5.2 hrs on 500 nodes, 7.8 hrs on 900 nodes
- January: research cluster reaches 900 nodes
- April: research clusters grow to 2 clusters of 1000 nodes
- September: scaling Hadoop to 4000 nodes at Yahoo!
- April 2009: new release with many improvements, new features, bug fixes, and optimizations
Hadoop 0.18 Highlights
- Apache Hadoop 0.18 was released on 8/22
- 266 patches committed; 20% of patches from contributors outside of Yahoo!
- gridmix benchmark runs in ~45% of the time taken by Hadoop 0.15
- New in MapReduce: intermediate compression that just works, (single) reduce optimizations, archive tool
Summary
MapReduce is a simple, easy-to-use parallel programming model that greatly simplifies the implementation of large-scale data processing.
References and Resources
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
[2] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The Landscape of Parallel Computing Research: A View from Berkeley," UC Berkeley EECS technical report.
[3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks," SIGOPS Oper. Syst. Rev., vol. 41.
[4] The Hadoop Project.