MapReduce: Simplified Data Processing on Large Clusters

MapReduce: Simplified Data Processing on Large Clusters Jeffrey Dean and Sanjay Ghemawat Google, Inc. Presented by Saleh Alnaeli at Kent State University, Fall 2010, Advanced Database Systems course. Prof. Ruoming Jin.

Goals: Introduction; Programming Model; Implementation and Model Features; Refinements; Performance Evaluation; Conclusion.

What is MapReduce? A programming model, and an approach, for processing large data sets. The user writes Map and Reduce functions. It runs on a large cluster of commodity machines. Many real-world tasks are expressible in this model.

Programming Model Input and output are sets of key/value pairs. The programmer specifies two functions: map(in_key, in_value) -> list(out_key, intermediate_value) and reduce(out_key, list(intermediate_value)) -> list(out_value). Reduce combines all intermediate values for a particular key and produces a set of merged output values (usually just one).

Example 1: Count Words in Docs Input consists of (url, contents) pairs. map(key=url, val=contents): for each word w in contents, emit (w, "1"). reduce(key=word, values=uniq_counts): sum all the "1"s in the values list and emit the result (word, sum).

Example 1 (cont.) Given Doc1 = "Jin is doing well" and Doc2 = "Jin is active", the map phase emits (Jin, 1), (is, 1), (doing, 1), (well, 1) for Doc1 and (Jin, 1), (is, 1), (active, 1) for Doc2. After grouping by key, the reduce phase outputs (active, 1), (doing, 1), (is, 2), (Jin, 2), (well, 1).
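To make the example concrete, here is a minimal single-machine sketch of the same word-count job in Python. This is an illustration only, not Google's C++ library; map_fn, reduce_fn, and run_mapreduce are names invented for this sketch.

    from collections import defaultdict

    def map_fn(url, contents):
        # For each word w in contents, emit (w, 1).
        for word in contents.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Sum all the partial counts for one word.
        yield (word, sum(counts))

    def run_mapreduce(documents):
        # "Shuffle": group all intermediate values by intermediate key.
        groups = defaultdict(list)
        for url, contents in documents.items():
            for key, value in map_fn(url, contents):
                groups[key].append(value)
        # Reduce: merge the grouped values for each key.
        return dict(pair for key, values in groups.items()
                    for pair in reduce_fn(key, values))

    docs = {"doc1": "Jin is doing well", "doc2": "Jin is active"}
    print(run_mapreduce(docs))
    # {'Jin': 2, 'is': 2, 'doing': 1, 'well': 1, 'active': 1}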

The Model is Widely Applicable There are many MapReduce programs in the Google source tree. More examples: distributed grep (later), distributed sort (later), reverse web-link graph, term vector per host, URL access frequency, inverted index, and many more.

Implementation Many different implementations are possible; the right choice depends on the environment. The typical cluster in wide use at Google is a large cluster of commodity PCs connected by switched networks: hundreds to thousands of dual-processor x86 machines running Linux, with 2-4 GB of memory per machine, connected with commodity networking hardware and limited bisection bandwidth. Storage is on inexpensive local IDE disks, and GFS, a distributed file system, manages the data. Users submit jobs to a scheduling system (a job is a set of tasks that the scheduler maps onto the available machines within the cluster). MapReduce is implemented as a C++ library linked into user programs.

Execution Overview Map and reduce invocations are distributed across multiple machines as follows: partition the input key/value pairs into M chunks and run map() tasks on them in parallel; after the maps complete, merge all emitted values for each intermediate key, partition the space of intermediate keys into R pieces (R is chosen by the user), and run reduce() tasks in parallel (a sketch of the split and partition steps follows). If a map() or reduce() task fails, the fault-tolerance techniques described in the next slides are used.
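The paper's default intermediate partitioning is hash(key) mod R. A minimal sketch of the two steps, under simplifying assumptions: the round-robin split below stands in for the real byte-range splitting of input files, and CRC32 stands in for the hash so results are deterministic across processes.

    import zlib

    def split_input(pairs, M):
        # Partition the input key/value pairs into M chunks, one per map task.
        chunks = [[] for _ in range(M)]
        for i, pair in enumerate(pairs):
            chunks[i % M].append(pair)
        return chunks

    def partition(key, R):
        # Route an intermediate key to one of R reduce tasks.
        return zlib.crc32(str(key).encode("utf-8")) % R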

Execution Overview (cont.) The master pings every worker periodically (no response -> the worker is marked as failed). The master keeps several data structures for each map and reduce task: its state (idle, in-progress, or completed) and the identity of the worker machine executing it.

Execution (diagram) The map phase produces a set of intermediate key/value pairs; the runtime then merges all intermediate values associated with the same intermediate key and passes them to reduce.

Parallel Execution

Fault Tolerance Worker failure is handled through re-execution: the master detects failure via periodic heartbeats, re-executes completed and in-progress map tasks, and re-executes in-progress reduce tasks. Why re-execute even the completed map tasks? Because their output lives on the failed machine's local disk (completed reduce output is already in the global file system). Task completion is committed through the master. Master failure: it could be handled, but the current implementation doesn't, since failure of the single master is unlikely.
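A hedged sketch of the master-side bookkeeping just described; the paper specifies the behavior, not the code, so the class and method names here are hypothetical.

    import time

    class Master:
        def __init__(self, map_tasks, reduce_tasks, timeout=10.0):
            # For each map and reduce task, record (state, worker_id),
            # as described on the previous slide.
            self.state = {t: ("idle", None) for t in map_tasks + reduce_tasks}
            self.map_tasks = set(map_tasks)
            self.last_ping = {}
            self.timeout = timeout

        def heartbeat(self, worker_id):
            self.last_ping[worker_id] = time.time()

        def check_workers(self):
            now = time.time()
            for worker, last in self.last_ping.items():
                if now - last <= self.timeout:
                    continue  # worker responded recently
                for task, (state, owner) in self.state.items():
                    if owner != worker:
                        continue
                    # Reset in-progress tasks, and also *completed* map tasks:
                    # map output sits on the dead worker's local disk, while
                    # completed reduce output is safe in the global file system.
                    if state == "in-progress" or (state == "completed" and task in self.map_tasks):
                        self.state[task] = ("idle", None)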

Locality Master scheduling policy: ask GFS for the locations of the replicas of each input file block; input is typically split into 64 MB pieces (the GFS block size); schedule each map task so that a replica of its input block is on the same machine, or failing that the same rack. As a result, most tasks read their input data locally and consume no network bandwidth.

Backup Tasks One of the common causes that lengthen the total time taken for a MapReduce operation is a "straggler": a machine that takes an unusually long time to finish one of the last few tasks. The mechanism to alleviate this: when the operation is close to completion, the master schedules backup executions of the remaining in-progress tasks, and a task is marked completed whenever either the primary or the backup execution finishes. This significantly reduces the time to complete large MapReduce operations (by up to 40%).
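A toy illustration of the idea, not the actual scheduler: because map and reduce tasks are deterministic, running a duplicate is safe, and whichever copy finishes first can win.

    from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

    def run_with_backup(task_fn, executor):
        # Launch a primary and a backup copy of the same task; take the
        # result of whichever finishes first.
        primary = executor.submit(task_fn)
        backup = executor.submit(task_fn)
        done, pending = wait([primary, backup], return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best effort; an already-running copy is just ignored
        return next(iter(done)).result()

    with ThreadPoolExecutor(max_workers=4) as pool:
        result = run_with_backup(lambda: sum(range(1000)), pool)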

Refinements
- Partitioning function: the user specifies the number of reduce tasks/output files desired (R) and may supply a custom partitioning function over the intermediate keys.
- Combiner function: partial merging of intermediate data on the map side; useful for saving network bandwidth (see the sketch after this list).
- Different input/output types.
- Skipping bad records: after repeated failures on the same record, the next worker assigned the task is told to skip it.
- Local execution: an alternative implementation of the MapReduce library that sequentially executes all of the work for a MapReduce operation on the local machine, for debugging.
- Status info: progress of the computation and more.
- Counters: count occurrences of various events (e.g., the total number of words processed).
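A sketch of the combiner refinement applied to the earlier word-count example (assuming the map_fn/reduce_fn names from that hypothetical sketch):

    from collections import Counter

    def map_with_combiner(url, contents):
        # Combine locally: emit one (word, partial_count) pair per distinct
        # word in the document, instead of one (word, 1) pair per occurrence,
        # so far less data crosses the network to the reducers.
        for word, count in Counter(contents.split()).items():
            yield (word, count)

    # reduce_fn is unchanged: summing partial counts is the same as summing 1s.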

Performance We measure the performance of MapReduce on two computations running on a large cluster of machines: MR_Grep searches through approximately one terabyte of data looking for a particular pattern, and MR_Sort sorts approximately one terabyte of data.

Performance (cont.) Tests were run on a cluster with the following specifications:
Cluster: ~1800 machines
Processors: dual-processor 2 GHz Xeons with Hyper-Threading
Memory: 4 GB per machine
Hard disk: dual 160 GB IDE disks per machine
Network: Gigabit Ethernet per machine, with approximately 100 Gbps of aggregate bisection bandwidth

MR_Grep: data transfer rate over time. Scans 10 billion 100-byte records, searching for a rare 3-character pattern (which occurs in 92,337 records). The input is split into approximately 64 MB pieces (M = 15,000) and the entire output is placed in one file (R = 1). Startup overhead is significant for short jobs.

MR_Sort Backup tasks improve completion time noticeably, and the system handles machine failures relatively quickly.

Experience & Conclusions MapReduce sees more and more use at Google (see the paper). It has proven to be a useful abstraction that greatly simplifies large-scale computations at Google, and it is fun to use: the programmer focuses on the problem and lets the library deal with the messy details, with little need for parallelization expertise (the library relieves the user from dealing with low-level parallelization details).

Disadvantages It might be hard to express a problem in MapReduce: data parallelism is key, and you need to be able to break the problem up by data chunks. Google's MapReduce implementation is closed-source C++; Hadoop is an open-source, Java-based rewrite.

Related Work HadoopDB (an architectural hybrid of MapReduce and DBMS technologies for analytical workloads) [R4]: a hybrid system that takes the best features of both technologies; the prototype approaches parallel databases in performance and efficiency, yet still yields the scalability, fault tolerance, and flexibility of MapReduce-based systems. FREERIDE (Framework for Rapid Implementation of Datamining Engines) [R3]: middleware for rapid development of data-mining implementations on large SMPs and clusters of SMPs; it performs distributed-memory parallelization across the cluster and shared-memory parallelization within each node. River [R5]: processes communicate with each other by sending data over distributed queues; similar to MapReduce in its dynamic load balancing.

References
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. (Paper and slides.)
Dan Weld's course at U. Washington (tutorial and slides).
[R3] Ruoming Jin, Ge Yang, and Gagan Agrawal. Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance. 2004.
[R4] Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel Abadi, Avi Silberschatz, and Alexander Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads.
[R5] Remzi H. Arpaci-Dusseau, Eric Anderson, Noah Treuhaft, David E. Culler, Joseph M. Hellerstein, David Patterson, and Kathy Yelick. Cluster I/O with River: Making the Fast Case Common. In Proceedings of the Sixth Workshop on Input/Output in Parallel and Distributed Systems (IOPADS '99), pages 10-22, Atlanta, Georgia, May 1999.
Prof. Demmel, inst.eecs.berkeley.edu/~cs61c: http://gcu.googlecode.com/files/intermachine-parallelism-lecture.ppt

Thank you! Questions and Comments?