Lecture 3 – MapReduce: Implementation
CSE 490h – Introduction to Distributed Computing, Spring 2009
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Last Class
- Input Handling
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer

map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
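The same idea as a minimal Python sketch (my_map is our own name, not part of any MapReduce API):

    def my_map(f, lst):
        # Build a new list by applying f to each element; the input is untouched.
        return [f(x) for x in lst]

    print(my_map(lambda x: x * x, [1, 2, 3]))  # [1, 4, 9]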

Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is then combined with the next element of the list.
fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
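Again a Python sketch of the same signature (my_fold is our own name):

    def my_fold(f, x0, lst):
        # Thread an accumulator through the list, left to right.
        acc = x0
        for x in lst:
            acc = f(x, acc)  # f combines the element with the accumulator
        return acc

    print(my_fold(lambda x, acc: x + acc, 0, [1, 2, 3, 4]))  # 10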

Advantages of MapReduce
- Flexible for a wide range of problems
- Fault tolerant
- Scalable

Overview
- Hardware
- Task assignment
- Failure
- Non-determinism
- Optimizations

Commodity Hardware
Cheap hardware:
- 2 – 4 GB memory
- 100 megabit / sec network links
- x86 processors running Linux
Cheap hardware + lots of it = failure!

Master vs. Worker
Users submit jobs into a scheduling system
- Implement map and reduce
- Specify M map tasks and R reduce tasks
Many copies of the program are started
- One copy is the master
The master assigns map/reduce tasks to idle workers

Map Tasks
Input is broken up into 16 MB to 64 MB chunks
M map tasks processed in parallel

Reduce Tasks
R reduce tasks
Assigned by partitioning function
- Typically: hash(key) mod R
- Sometimes useful to customize
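A sketch of both in Python; the host-based partitioner follows the paper's URL example, though these function names are ours:

    from urllib.parse import urlparse

    def default_partition(key, R):
        # Spread keys roughly evenly across the R reduce tasks.
        return hash(key) % R

    def host_partition(url, R):
        # Custom partitioner: all URLs from one host land in the same
        # output file, because only the hostname is hashed.
        return hash(urlparse(url).hostname) % R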

Master Data Structures
For each map / reduce task, store its state and the identity of the worker machine
- State: Idle, In-Progress, Complete
For each completed map task, store the locations of its output (R locations)
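A minimal sketch of that bookkeeping (our own Python names, not Google's internal code):

    from dataclasses import dataclass, field

    @dataclass
    class TaskInfo:
        state: str = "idle"          # idle | in-progress | complete
        worker: str = ""             # machine the task is assigned to
        # For completed map tasks only: locations of the R region files.
        output_locations: list = field(default_factory=list)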

Worker with Map Tasks
Parses input data into key/value pairs
Applies map
Buffered pairs are written to local disk, partitioned into R regions
Locations of the output are eventually passed to the master
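Putting those steps together as a sketch (parse_records and write_local_region are hypothetical helpers):

    def run_map_task(chunk, map_fn, R):
        regions = [[] for _ in range(R)]           # one buffer per reduce task
        for key, value in parse_records(chunk):    # hypothetical input parser
            for k2, v2 in map_fn(key, value):      # apply the user's map
                regions[hash(k2) % R].append((k2, v2))
        # The real system flushes buffers to local disk periodically; this
        # sketch writes once and returns the R locations for the master.
        return [write_local_region(i, pairs) for i, pairs in enumerate(regions)]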

Worker with Reduce Tasks
Reads data from the map machines via RPC
- Sorts the data by key
Applies reduce
Output is appended to the final output file
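And the reduce side, under the same assumptions (fetch_via_rpc is hypothetical):

    from itertools import groupby
    from operator import itemgetter

    def run_reduce_task(locations, reduce_fn, output_path):
        pairs = []
        for loc in locations:                 # region files on map machines
            pairs.extend(fetch_via_rpc(loc))  # hypothetical RPC fetch
        pairs.sort(key=itemgetter(0))         # bring equal keys together
        with open(output_path, "a") as out:
            for key, group in groupby(pairs, key=itemgetter(0)):
                for result in reduce_fn(key, [v for _, v in group]):
                    out.write(f"{key}\t{result}\n")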

After Reduce
When all tasks are complete, the master wakes up the user program
Output is available in R output files, with names specified by the user

How Do You Pick M and R?
How many scheduling decisions?
- O(M + R)
How much state in the master's memory?
- O(M × R)
M: much larger than the number of machines
R: a small multiple of the number of machines
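For a sense of scale: the OSDI paper reports typical runs with M = 200,000 and R = 5,000 on 2,000 worker machines, and notes that the O(M × R) master state is cheap in practice, roughly one byte per map/reduce task pair.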

Failures & Issues Worker Failure Master Failure Stragglers Crashes, Etc

Worker Failure
The master pings each worker
- No response -> assume it has failed
Failed map tasks
- Both completed and in-progress tasks are reset to idle (map output lives on the failed machine's local disk)
Failed reduce tasks
- Only in-progress tasks are reset to idle (completed reduce output is already in the global file system)

Master Failure
You could write periodic checkpoints of the master's state
In practice: just abort the computation and let the user retry it

Stragglers (Causes)
Why do some tasks run slowly?
- Bad disk with correctable errors that slow reads to a crawl
- Too many other tasks on the same machine
- Processor caches disabled (e.g., by a machine-configuration bug)

Stragglers (Solutions)
Schedule backup executions of the remaining in-progress tasks when the operation is close to completion
A task is marked complete when either the primary or the backup execution completes

Crashes, Etc.
Causes:
- Bad records
- Bugs in third-party code
Solution: optionally skip over the offending records (the failing worker reports the record to the master, and re-executions skip it)

Non-Determinism
Deterministic: the distributed implementation produces the same result as a sequential execution
Non-deterministic: the map or reduce functions are non-deterministic

Non-Determinism
Guarantee: the output of a specific reduce task is equivalent to some sequential execution
But: the outputs of different reduce tasks may correspond to different sequential executions

Non-Determinism
There may be no single sequential execution that matches the full output
Why?
- Because reduce tasks R1 and R2 may have read outputs from different executions of the same map task M
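A concrete, hypothetical example of such a map function:

    import random

    def sample_map(key, value):
        # Emits a 1% random sample: two executions of the same map task
        # produce different output.
        if random.random() < 0.01:
            yield (key, value)

If this map task fails and is re-executed, R1 may already have fetched the first execution's output while R2 fetches the re-execution's, so the combined result matches no single sequential run.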

Advanced Stuff
- Input Types
- Combiner Function
- Counters

Input Types
May need to change how input is read
Implement the reader interface
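A sketch of such a reader in Python; the line-oriented, offset-keyed format follows the paper's default text mode, while the generator interface is our own assumption:

    def line_reader(path):
        # Each record is one line of text; the key is the line's byte
        # offset in the file.
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                yield offset, line.rstrip(b"\n")
                offset += len(line)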

Combiner
“Combiner” functions can run on the same machine as a mapper
Causes a mini-reduce phase before the real reduce phase, to save bandwidth
Under what conditions is it sound to use a combiner?

Combiner Function
Can only be used if the operation is commutative and associative
- Commutative: a + b + c = b + c + a
- Associative: (a × b) × c = a × (b × c)
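Word count is the standard example: addition is commutative and associative, so partial sums on the map machine are safe. A sketch (function names are ours):

    def map_fn(_, text):
        for word in text.split():
            yield word, 1

    def combine_fn(word, counts):          # runs on the mapper's machine
        yield word, sum(counts)            # partial count for this machine

    def reduce_fn(word, partial_counts):   # merges the partial counts
        yield word, sum(partial_counts)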

Counters
Global counter values are aggregated by the master
The master handles the issue of duplicate executions (counts from re-executed tasks are not double-counted)
Useful for sanity checking or debugging
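A sketch of how a counter looks to user code; get_counter and increment are assumed names modeled on the paper's GetCounter/Increment example:

    uppercase = get_counter("uppercase")   # hypothetical counter registry

    def map_fn(_, word):
        if word.isupper():
            uppercase.increment()          # value piggybacks on status pings
        yield word, 1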

Discussion Questions
1. Give an example of a MapReduce problem not listed in the reading. In your example, what are the map and reduce functions (including inputs and outputs)?
2. What part of the MapReduce implementation do you find most interesting? Why?
3. Give an example of a distributable problem that should not be solved with MapReduce. What are the limitations of MapReduce that make it ill-suited for your task?

Discussion Questions
1. Assume you have a corpus of webpages as input, where the key for each mapper is a URL and the value is the text of the page. How would you design a mapper and a reducer to construct an inverse graph of the web, that is, for each URL, output the list of web pages that point to it?
2. TF–IDF is a statistical value assigned to words in a document corpus that indicates the relative importance of each word. As part of computing it, the inverse document frequency of a word is found as the number of documents in the corpus divided by the number of documents containing the word. Given a corpus of documents, and given that you know how many documents are in the corpus, how would you use MapReduce to find this quantity for every word in the corpus simultaneously?