MapReduce: Simplified Data Processing on Large Clusters Hongfei Yan School of EECS, Peking University 7/9/2009.

Presentation transcript:

MapReduce: Simplified Data Processing on Large Clusters Hongfei Yan School of EECS, Peking University 7/9/2009

What Is MapReduce?
A parallel/distributed computing programming model
Data flow (figure): input, split, shuffle, output

Typical problem solved by MapReduce
Read in the data: records in key/value pair format
Map: extract something from each record
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair
  Outputs intermediate key/value pairs
Shuffle: exchange and regroup the data
  Gathers the intermediate results that share a key onto the same node
Reduce: aggregate, summarize, filter, etc.
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for a given key and performs the computation
  Outputs the merged result (usually just one)
Output the results
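To make the map, shuffle and reduce signatures above concrete, here is a minimal single-process sketch of word counting. The driver `run_mapreduce` and the sample records are hypothetical names chosen for illustration; a real framework would spread the three phases across many machines.

```python
from collections import defaultdict

def map_fn(in_key, in_value):
    """Map: emit an (out_key, intermediate_value) pair for each word in one record."""
    return [(word, 1) for word in in_value.split()]

def reduce_fn(out_key, intermediate_values):
    """Reduce: merge all values for one key into a list of outputs (here just one)."""
    return [sum(intermediate_values)]

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map, then shuffle, then reduce."""
    # Map phase: process every input key/value pair.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(map_fn(in_key, in_value))
    # Shuffle phase: gather intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one call per distinct key.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

records = [("doc1", "the quick fox"), ("doc2", "the lazy dog the end")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

The input key (a document name here) is ignored by `map_fn`; only the record's text matters for word counting.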

MapReduce Framework

Shuffle Implementation

Partition and Sort/Group
Partition function: hash(key) % number of reducers
Group function: sort by key
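The partition and group functions can be sketched as follows. This is an illustration only: `zlib.crc32` stands in for the hash function as an assumption (Python's built-in `hash` on strings varies between runs), and `shuffle` is a hypothetical helper name.

```python
import zlib
from itertools import groupby

def partition(key, num_reducers):
    """Partition function: hash(key) % number of reducers (crc32 is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route each pair to a reducer bucket, then sort and group each bucket by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    grouped = []
    for bucket in buckets:
        # Group function: sort by key (stable, so value order is preserved per key).
        bucket.sort(key=lambda kv: kv[0])
        grouped.append([(k, [v for _, v in g])
                        for k, g in groupby(bucket, key=lambda kv: kv[0])])
    return grouped

pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4)]
grouped = shuffle(pairs, num_reducers=2)
```

Because the partition depends only on the key, every value for a given key lands in the same bucket, which is what lets a single reduce call see all of them.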

Model Is Widely Applicable
Figure: MapReduce programs in the Google source tree
Example uses:
  distributed grep
  distributed sort
  web link-graph reversal
  term-vector / host
  web access log stats
  inverted index construction
  document clustering
  machine learning
  statistical machine translation
  ...

Algorithms That Fit in MapReduce
Algorithms with implementations reported in the literature:
  K-Means, EM, SVM, PCA, Linear Regression, Naïve Bayes, Logistic Regression, Neural Network
  PageRank
  Word Co-occurrence Matrices, Pairwise Document Similarity
  Monte Carlo simulation
  ...

MapReduce Runtime System

Google MapReduce Architecture
Single master node
Many worker bees

MapReduce Operation
Initial data split into 64MB blocks
Computed, results locally stored
Master informed of result locations
Master (M) sends data locations to reduce (R) workers
Final output written
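As a rough illustration of the first step, the sketch below computes the block boundaries for an input of a given size. The function name `split_into_blocks` and the 200MB example are assumptions for illustration; only the 64MB block size comes from the slide.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB, as stated on the slide

def split_into_blocks(total_bytes, block_size=BLOCK_SIZE):
    """Return (offset, length) pairs covering the input; the last block may be short."""
    blocks = []
    offset = 0
    while offset < total_bytes:
        length = min(block_size, total_bytes - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

# Hypothetical 200MB input: three full blocks plus one 8MB remainder.
blocks = split_into_blocks(200 * 1024 * 1024)
```

Each block would typically become the input of one map task.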

Fault Tolerance
Fault tolerance is achieved through re-execution
  Periodic heartbeats detect failures
  Re-execute both the completed and the in-progress map tasks of the failed node. Why?
  Re-execute the in-progress reduce tasks of the failed node
  Task completion committed through master
Robust: once lost 1600 of 1800 machines and still finished OK
Master failure?
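The re-execution rule can be sketched as a small selection function. In the Dean and Ghemawat paper, the reason completed map tasks are redone is that their output sits on the failed machine's local disk, whereas completed reduce output is already in the global file system. The task dictionaries and the name `tasks_to_reexecute` below are hypothetical, chosen only to illustrate the rule.

```python
def tasks_to_reexecute(tasks, failed_worker):
    """Pick the task ids that must be rescheduled after a worker failure."""
    redo = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map" and task["state"] in ("completed", "in_progress"):
            # Map output lives on the failed worker's local disk, so it is lost.
            redo.append(task["id"])
        elif task["kind"] == "reduce" and task["state"] == "in_progress":
            # Completed reduce output is already stored in the global file system.
            redo.append(task["id"])
    return redo

tasks = [
    {"id": "m1", "kind": "map", "state": "completed", "worker": "w1"},
    {"id": "m2", "kind": "map", "state": "in_progress", "worker": "w1"},
    {"id": "r1", "kind": "reduce", "state": "completed", "worker": "w1"},
    {"id": "r2", "kind": "reduce", "state": "in_progress", "worker": "w1"},
    {"id": "m3", "kind": "map", "state": "completed", "worker": "w2"},
]
redo = tasks_to_reexecute(tasks, failed_worker="w1")
```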

Refinement: Redundant Execution
Slow workers significantly delay completion time
  Other jobs consuming resources on machine
  Bad disks w/ soft errors transfer data slowly
Solution: near end of phase, spawn backup tasks
  Whichever one finishes first "wins"
Dramatically shortens job completion time
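A back-of-the-envelope simulation shows why backup tasks help. It assumes, purely for illustration, that a backup copy is launched at a fixed time for every task still running and that each backup takes a fixed duration; the function name and all numbers are hypothetical.

```python
def job_completion_time(primary_durations, backup_start=None, backup_duration=None):
    """Job finishes when its slowest task finishes; backups race the primaries."""
    finish_times = []
    for duration in primary_durations:
        finish = duration
        if backup_start is not None and duration > backup_start:
            # The task is still running when backups are spawned:
            # whichever copy finishes first "wins".
            finish = min(duration, backup_start + backup_duration)
        finish_times.append(finish)
    return max(finish_times)

durations = [10, 12, 11, 60]          # one straggler among four tasks
without_backup = job_completion_time(durations)
with_backup = job_completion_time(durations, backup_start=15, backup_duration=10)
```

In this toy setting the straggler no longer dominates: the job ends when the backup copy finishes rather than when the slow primary does.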

Refinement: Locality Optimization
Master scheduling policy:
  Asks GFS for locations of replicas of input file blocks
  Map task inputs are typically split into 64MB pieces (the GFS block size)
  Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack
Effect
  Thousands of machines read input at local disk speed
  Without this, rack switches limit the read rate
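The scheduling preference (same machine, then same rack, then anywhere) can be sketched as below. The function `pick_worker`, the `rack_of` mapping and the worker names are hypothetical; the real master's data structures are not described in the slides.

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    """Choose an idle worker for a map task, preferring data locality."""
    # 1. Prefer a worker that holds a replica of the input block.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker, "local"
    # 2. Otherwise prefer a worker on the same rack as some replica.
    replica_racks = {rack_of[host] for host in replica_hosts}
    for worker in idle_workers:
        if rack_of[worker] in replica_racks:
            return worker, "rack"
    # 3. Fall back to any idle worker (input is read across racks).
    if idle_workers:
        return idle_workers[0], "remote"
    return None, None

rack_of = {"w1": "r1", "w2": "r1", "w3": "r2"}
choice_local = pick_worker(["w1"], ["w3", "w1"], rack_of)
choice_rack = pick_worker(["w1"], ["w3", "w2"], rack_of)
choice_remote = pick_worker(["w1"], ["w3"], rack_of)
```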

Refinement: Skipping Bad Records
Map/Reduce functions sometimes fail for particular inputs
  Best solution is to debug & fix
  Not always possible, e.g. with third-party source libraries
On segmentation fault:
  Send UDP packet to master from signal handler
  Include sequence number of record being processed
If master sees two failures for same record:
  Next worker is told to skip the record
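The master-side bookkeeping can be sketched as follows. This is a simplified single-process model: a Python exception stands in for the segmentation fault, and a direct method call stands in for the UDP packet sent from the signal handler. The class `Master` and the function `run_task` are hypothetical names.

```python
from collections import Counter

class Master:
    """Counts failures per record sequence number."""
    def __init__(self, max_failures=2):
        self.max_failures = max_failures
        self.failures = Counter()

    def report_failure(self, seq_no):
        # Stand-in for the UDP packet carrying the record's sequence number.
        self.failures[seq_no] += 1

    def should_skip(self, seq_no):
        # After two failures on the same record, the next worker skips it.
        return self.failures[seq_no] >= self.max_failures

def run_task(records, user_fn, master):
    """Re-execute the task until it completes, skipping records the master flags."""
    attempts = 0
    while True:
        attempts += 1
        output, skipped = [], []
        current = None
        try:
            for seq_no, record in enumerate(records):
                if master.should_skip(seq_no):
                    skipped.append(seq_no)
                    continue
                current = seq_no
                output.append(user_fn(record))
            return output, skipped, attempts
        except Exception:
            master.report_failure(current)

# The record 0 triggers a division by zero on every attempt.
output, skipped, attempts = run_task([1, 2, 0, 4], lambda x: 10 // x, Master())
```

The task fails twice on the bad record, and the third attempt skips it and completes.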

Other Refinements
Compression of intermediate data
Combiner
  "Combiner" functions can run on the same machine as a mapper
  Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
Local execution for debugging/testing
User-defined counters
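A combiner for word counting might look like the sketch below. It assumes the reduce operation is a sum, which is associative and commutative, so partial sums computed on the mapper's machine do not change the final result. The name `combine` and the sample data are hypothetical.

```python
from collections import defaultdict

def combine(mapper_output):
    """Mini-reduce on the mapper's machine: pre-sum the counts for each key."""
    partial = defaultdict(int)
    for key, value in mapper_output:
        partial[key] += value
    return sorted(partial.items())

mapper_output = [("the", 1), ("fox", 1), ("the", 1), ("the", 1)]
combined = combine(mapper_output)
print(len(mapper_output), "pairs before combining,", len(combined), "after")
```

Fewer pairs leave the mapper's machine, which is where the bandwidth saving comes from.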

Hadoop MapReduce Architecture
Master/Worker Model
Load-balancing by polling mechanism
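One way to see how polling balances load is the toy simulation below: a worker asks the master for another task whenever it becomes idle, so faster workers naturally take more tasks. The function `simulate_polling`, the worker speeds and the unit task cost are all assumptions for illustration, not Hadoop's actual scheduler.

```python
import heapq

def simulate_polling(num_tasks, worker_speeds, task_cost=1.0):
    """Return how many tasks each worker ends up running under pull-based assignment."""
    # Heap of (time at which the worker next polls, worker id).
    idle_at = [(0.0, worker) for worker in range(len(worker_speeds))]
    heapq.heapify(idle_at)
    assigned = [0] * len(worker_speeds)
    for _ in range(num_tasks):
        now, worker = heapq.heappop(idle_at)      # the next worker to poll
        assigned[worker] += 1
        finish = now + task_cost / worker_speeds[worker]
        heapq.heappush(idle_at, (finish, worker))
    return assigned

# Worker 0 is twice as fast as worker 1.
assigned = simulate_polling(num_tasks=9, worker_speeds=[2.0, 1.0])
```

No central estimate of worker speed is needed; the split follows from who polls more often.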

History of Hadoop
Initial versions of what is now Hadoop Distributed File System and Map-Reduce implemented by Doug Cutting & Mike Cafarella
December: Nutch ported to the new framework. Hadoop runs reliably on 20 nodes.
January: Doug Cutting joins Yahoo!
February: Apache Hadoop project officially started to support the standalone development of Map-Reduce and HDFS.
March: Formation of the Yahoo! Hadoop team
May: Yahoo! sets up a Hadoop research cluster
April: Sort benchmark run on 188 nodes in 47.9 hours
May: Sort benchmark run on 500 nodes in 42 hours (better hardware than April benchmark)
October: Research cluster reaches 600 nodes
December: Sort times: 20 nodes in 1.8 hrs, 100 nodes in 3.3 hrs, 500 nodes in 5.2 hrs, 900 nodes in 7.8 hrs
January: Research cluster reaches 900 nodes
April: Research clusters: 2 clusters of 1000 nodes
Sep: Scaling Hadoop to 4000 nodes at Yahoo!
April 2009: new release, with many improvements, new features, bug fixes and optimizations.

Hadoop 0.18 Highlights
Apache Hadoop 0.18 was released on 8/22
  Number of patches committed: 266
  Patches from contributors outside of Yahoo!: 20%
  Grid mix benchmark runs in ~45% of the time taken by Hadoop 0.15
New stuff in MapReduce
  Intermediate compression that just works
  (Single) reduce optimizations
  Archive tool

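Summary
MapReduce is a simple, easy-to-use parallel programming model that greatly simplifies the implementation of large-scale data processing problems.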

References and Resources
[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004.
[2] K. Asanovic, R. Bodik, B. Catanzaro, J. Gebis, P. Husbands, K. Keutzer, D. Patterson, W. Plishker, J. Shalf, S. Williams, and K. Yelick, "The landscape of parallel computing research: a view from Berkeley."
[3] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, "Dryad: distributed data-parallel programs from sequential building blocks," SIGOPS Oper. Syst. Rev., vol. 41.
[4] Hadoop, "The Hadoop Project."