MapReduce for Data-Intensive Computing (some of the content is adapted from the original authors’ talk at OSDI ’04)


Motivation
Google needed to process web-scale data
– Data much larger than what fits on one machine
– Parallel processing needed to get results in a reasonable time
– Use cheap commodity machines to do the job

Requirements
The solution must
– Scale to 1000s of compute nodes
– Automatically handle faults
– Provide monitoring of jobs
– Be easy for programmers to use

MapReduce Programming Model
Input and output are sets of key/value pairs. The programmer provides two functions:
– map(K1, V1) -> list(K2, V2)
  Produces a list of intermediate key/value pairs for each input key/value pair
– reduce(K2, list(V2)) -> list(K3, V3)
  Produces a list of result values for all intermediate values associated with the same intermediate key

MapReduce Pipeline
Read from DFS → Map → Shuffle → Reduce → Write to DFS

MapReduce in Action
Map: (k1, v1) → list(k2, v2)
– Processes one input key/value pair
– Produces a set of intermediate key/value pairs
Reduce: (k2, list(v2)) → list(k3, v3)
– Combines intermediate values for one particular key
– Produces a set of merged output values (usually one)

MapReduce Architecture
[Diagram: MapReduce workers run on top of a Distributed File System, connected over the network and coordinated by the Job Tracker]

Software Components
Job Tracker (Master)
– Maintains cluster membership of workers
– Accepts MR jobs from clients and dispatches tasks to workers
– Monitors workers’ progress
– Restarts tasks in the event of failure
Task Tracker (Worker)
– Provides an environment to run a task
– Maintains and serves intermediate files between the Map and Reduce phases

MapReduce Parallelism
Intermediate keys are spread across reduce tasks by hash partitioning, so all values for the same key arrive at the same reducer
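A minimal sketch of how hash partitioning routes intermediate keys to reduce tasks (the `partition` helper and the choice of R are illustrative, not any framework's actual API):

```python
import zlib

R = 4  # number of reduce tasks (arbitrary choice for illustration)

def partition(key, num_reducers=R):
    # Use a stable hash (crc32) so routing is reproducible across runs;
    # Python's built-in hash() is salted per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# Every occurrence of the same key maps to the same reduce task,
# which is what guarantees reduce sees all values for a key together.
print(partition("line"), partition("line"))
```

Because the same key always hashes to the same reducer, the shuffle can proceed in parallel across all R reduce tasks with no coordination beyond the partition function.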

Example 1: Count Word Occurrences
Input: set of (document name, document contents)
Output: set of (word, count(word))

map(k1, v1):
  for each word w in v1
    emit(w, 1)

reduce(k2, v2_list):
  int result = 0
  for each v in v2_list
    result += v
  emit(k2, result)
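The pseudocode above can be sketched as a single-process Python simulation (the `run_mapreduce` driver below is a toy stand-in for the real distributed runtime, not part of any framework):

```python
from collections import defaultdict

def map_fn(doc_name, contents):
    # map(k1, v1) -> list(k2, v2): emit (word, 1) for each word
    for word in contents.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce(k2, list(v2)) -> list(k3, v3): sum the 1s for one word
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Toy driver: the "shuffle" groups intermediate pairs by key
    groups = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    out = {}
    for k2, values in sorted(groups.items()):
        for k3, v3 in reduce_fn(k2, values):
            out[k3] = v3
    return out

docs = [("l1", "this is a line"), ("l2", "this is another line"),
        ("l3", "another line"), ("l4", "yet another line")]
result = run_mapreduce(docs, map_fn, reduce_fn)
print(result)
# -> {'a': 1, 'another': 3, 'is': 2, 'line': 4, 'this': 2, 'yet': 1}
```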

Map Example 1: Count Word Occurrences
Input lines:
  "this is a line"
  "this is another line"
  "another line"
  "yet another line"
Map output (per input line):
  (this,1) (is,1) (a,1) (line,1)
  (this,1) (is,1) (another,1) (line,1)
  (another,1) (line,1)
  (yet,1) (another,1) (line,1)
Reduce output (after the shuffle groups pairs by word):
  (a,1) (another,3) (is,2) (line,4) (this,2) (yet,1)

Example 2: Joins
Input: rows of relation R, rows of relation S
Output: R join S on R.x = S.y

map(k1, v1):
  if (input == R)
    emit(v1.x, ["R", v1])
  else
    emit(v1.y, ["S", v1])

reduce(k2, v2_list):
  for r in v2_list where r[1] == "R"
    for s in v2_list where s[1] == "S"
      emit(1, result(r[2], s[2]))
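The same reduce-side join can be sketched in Python (the sample rows, attribute names, and the in-process shuffle are all hypothetical, chosen to match the slide's R.x = S.y condition):

```python
from collections import defaultdict

# Toy rows; attributes x and y match the slide's join condition R.x = S.y.
R_rows = [{"x": 1, "name": "alice"}, {"x": 2, "name": "bob"}]
S_rows = [{"y": 1, "dept": "eng"}, {"y": 1, "dept": "ops"}]

def map_fn(relation, row):
    # Tag each row with its relation so reduce can tell the sides apart
    if relation == "R":
        yield (row["x"], ("R", row))
    else:
        yield (row["y"], ("S", row))

def reduce_fn(key, tagged_rows):
    # Cross-product of the R-rows and S-rows that share the join key
    r_side = [row for tag, row in tagged_rows if tag == "R"]
    s_side = [row for tag, row in tagged_rows if tag == "S"]
    for r in r_side:
        for s in s_side:
            yield {**r, **s}

# In-process shuffle: group tagged rows by join key
groups = defaultdict(list)
for relation, rows in (("R", R_rows), ("S", S_rows)):
    for row in rows:
        for key, tagged in map_fn(relation, row):
            groups[key].append(tagged)

joined = [out for key, tagged in groups.items() for out in reduce_fn(key, tagged)]
print(joined)
# alice (x=1) matches both S rows with y=1; bob (x=2) matches none
```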

Other examples Distributed grep Inverted index construction Machine learning Distributed sort …

Fault-Tolerant Evaluation
Task fault tolerance is achieved through re-execution
Consumers read data only after it has been completely generated by the producer
– This property is important for isolating faults to one task
Task completion is committed through the Master
Master failure is not handled

Task Granularity and Pipelining
Fine-grained tasks: many more map tasks than machines
– Minimizes time for fault recovery
– Pipelines shuffling with map execution
– Better load balancing

Optimization: Combiners
Sometimes partial aggregation is possible on the map side
May cut down the amount of data transferred to the reducer (sometimes significantly)
combine(K2, list(V2)) -> (K2, list(V2))
For the word occurrence count example, Combine == Reduce

Map Example 1: Count Word Occurrences (With Combiners)
Input lines:
  "this is a line"
  "this is another line"
  "another line"
  "yet another line"
Combined map output (two map tasks, each combining its own pairs locally):
  Task 1 (lines 1–2): (a,1) (another,1) (is,2) (line,2) (this,2)
  Task 2 (lines 3–4): (another,2) (line,2) (yet,1)
Reduce output:
  (a,1) (another,3) (is,2) (line,4) (this,2) (yet,1)
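A small sketch of what the combiner buys you, assuming one map task whose split is the first two input lines (helper names are illustrative):

```python
from collections import Counter

def map_pairs(text):
    # Raw map output: one (word, 1) pair per word
    return [(w, 1) for w in text.split()]

def combine(pairs):
    # Map-side combiner; for word count it is the same as the reducer:
    # sum the counts for each word locally before the shuffle.
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return sorted(combined.items())

task_input = "this is a line this is another line"  # one map task's split
raw = map_pairs(task_input)
combined = combine(raw)
print(len(raw), len(combined))
# 8 pairs would be shuffled without the combiner, only 5 with it
```

The saving grows with repetition: for skewed keys (common words), the combiner collapses many pairs into one per key per map task.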

Optimization: Redundant Execution
Slow workers lengthen completion time
Slowness happens because of
– Other jobs consuming resources
– Bad disks, network problems, etc.
Solution: near the end of the job, spawn extra copies of long-running tasks
– Whichever copy finishes first wins; kill the rest
In Hadoop this is called “speculative execution”

Optimization: Locality
Task scheduling policy
– Ask the DFS for the locations of replicas of input file blocks
– Schedule map tasks so that their input blocks are machine-local or rack-local
Effect: tasks read data at local disk speeds
Without this, rack switches limit the data rate

Distributed Filesystem
Used as the “store” of data: MapReduce reads input from the DFS and writes output to the DFS
Provides a “shared disk” view to applications using local storage on shared-nothing hardware
Provides redundancy through replication to protect against node/disk failures

DFS Architecture (taken from Ghemawat et al.’s SOSP ’03 paper, “The Google File System”)
– A single Master (with backups) that tracks the DFS file-name-to-chunk mapping
– Several chunk servers that store chunks on local disks
– Chunk size ~64 MB or larger; chunks are replicated
– The Master is used only for chunk lookups; it does not participate in the transfer of data

Chunk Replication
Several replicas of each chunk (usually 3)
Replicas are usually spread across racks and data centers to maximize availability
The Master tracks the location of each replica of a chunk
When a chunk failure is detected, the Master automatically rebuilds a new replica to maintain the replication level
Chunk servers for new replicas are picked automatically based on utilization

Offline Reading
Original MapReduce paper
– “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, OSDI ’04
Original DFS paper
– “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, SOSP ’03
CACM papers on MapReduce and parallel DBMSs
– “MapReduce and Parallel DBMSs: Friends or Foes?”
– “MapReduce: A Flexible Data Processing Tool”