Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008


Lecture 2 – MapReduce: Theory and Implementation
CSE 490h – Introduction to Distributed Computing, Winter 2008
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Last Class
- How do I process lots of data?
  - Distribute the work
- Can I distribute the work?
  - Maybe… if it’s not dependent on other tasks
  - Example: Fibonacci

Last Class
- What problems can occur?
  - Large tasks
  - Unpredictable bugs
  - Machine failure
- How do we solve / avoid these?
  - Break up into small chunks?
  - Restart tasks?
  - Use known working solutions

MapReduce
- Concept from functional programming
- Implemented by Google
- Applied to a large number of problems

Functional Programming Review
Java:
    int fooA(String[] list) {
        return bar1(list) + bar2(list);
    }
    int fooB(String[] list) {
        return bar2(list) + bar1(list);
    }
Do they give the same result?

Functional Programming Review
Functional Programming:
    fun fooA(l: int list) = bar1(l) + bar2(l)
    fun fooB(l: int list) = bar2(l) + bar1(l)
Do they give the same result?

Functional Programming Review
- Operations do not modify data structures: they always create new ones
- Original data still exists in unmodified form

Functional Updates Do Not Modify Structures
    fun foo(x, lst) =
      let val lst' = rev lst
      in rev (x :: lst') end
    foo: 'a * 'a list -> 'a list
The foo() function above reverses a list, adds a new element to the front, and returns all of that, reversed, which appends an item. But it never modifies lst!

Functions Can Be Used As Arguments
    fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() will do it twice.
What is the type of this function?
    x: 'a
    f: 'a -> 'a
    DoDouble: ('a -> 'a) * 'a -> 'a

map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
    map f lst: ('a -> 'b) -> ('a list) -> ('b list)

map Implementation
This implementation moves left-to-right across the list, mapping elements one at a time… But does it need to?
    fun map f [] = []
      | map f (x::xs) = (f x) :: (map f xs)

Implicit Parallelism In map
- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If the order of application of f to elements in the list is commutative, we can reorder or parallelize execution
- This is the “secret” that MapReduce exploits
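As a rough illustration of this point, the sketch below applies a pure function to a list once sequentially and once through a thread pool, then checks that the results agree. The names `square` and `parallel_map` are made up for this example, and a thread pool is only one possible way to run the calls out of order; it is not how MapReduce itself schedules work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    """A pure function: the result depends only on the argument."""
    return x * x


def parallel_map(f, items, max_workers=4):
    """Apply f to every item using a pool of worker threads.

    Executor.map returns results in input order even though the
    individual calls may be executed in any order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(f, items))


numbers = [1, 2, 3, 4, 5]
sequential = [square(n) for n in numbers]
parallel = parallel_map(square, numbers)
print(sequential == parallel)
```

Because `square` has no side effects, the order in which the workers pick up elements cannot change the output.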

Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
    fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b

fold left vs. fold right
- Order of list elements can be significant
- Fold left moves left-to-right across the list
- Fold right moves from right-to-left
SML Implementation:
    fun foldl f a [] = a
      | foldl f a (x::xs) = foldl f (f(x, a)) xs
    fun foldr f a [] = a
      | foldr f a (x::xs) = f(x, (foldr f a xs))
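To see why the direction matters, here is a small Python sketch of the two folds, written iteratively but keeping the same `f(element, accumulator)` argument order as the SML definitions. The helper `prepend` is a hypothetical example function chosen because it is not commutative.

```python
def foldl(f, acc, lst):
    """Fold left: visit elements left-to-right, threading the accumulator."""
    for x in lst:
        acc = f(x, acc)
    return acc


def foldr(f, acc, lst):
    """Fold right: visit elements right-to-left, threading the accumulator."""
    for x in reversed(lst):
        acc = f(x, acc)
    return acc


def prepend(x, acc):
    """Put x at the front of the accumulated list (returns a new list)."""
    return [x] + acc


print(foldl(prepend, [], [1, 2, 3]))  # [3, 2, 1]
print(foldr(prepend, [], [1, 2, 3]))  # [1, 2, 3]
```

With a commutative, associative function such as addition, both folds would return the same value; with `prepend`, fold left reverses the list and fold right rebuilds it unchanged.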

Example
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
How can we implement this?

Example (Solved)
    fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
    fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
    fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
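The same three folds can be sketched in Python with `functools.reduce`, which is a left fold whose callback takes the accumulator first. The names `total`, `product` and `count` are chosen here only to avoid shadowing Python built-ins; they correspond to sum, mul and length above.

```python
from functools import reduce


def total(lst):
    """Sum of the elements, starting from 0."""
    return reduce(lambda acc, x: acc + x, lst, 0)


def product(lst):
    """Product of the elements, starting from 1."""
    return reduce(lambda acc, x: acc * x, lst, 1)


def count(lst):
    """Number of elements: ignore each value and add 1 per element."""
    return reduce(lambda acc, _x: acc + 1, lst, 0)


def foo(lst):
    return total(lst) + product(lst) + count(lst)


print(foo([1, 2, 3, 4]))  # 10 + 24 + 4 = 38
```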

Google MapReduce
- Input Handling
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer

Input Handling
- Divides up data into bite-size chunks
- Starts up tasks
- Assigns tasks to idle workers
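The chunking step can be pictured with a toy sketch like the one below. It assumes the input is an in-memory list of records and that chunks are measured in records; the function name `split_into_chunks` is hypothetical, and a real system would split files on storage rather than Python lists.

```python
def split_into_chunks(records, chunk_size):
    """Split a list of records into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size]
            for i in range(0, len(records), chunk_size)]


chunks = split_into_chunks(list(range(5)), 2)
print(chunks)  # [[0, 1], [2, 3], [4]]
```

Each chunk could then be handed to a separate map task, one per idle worker.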

Map
- Input: Key, Value pair
- Output: Key, Value pairs
- Example: Annual Rainfall Per City

Map (Example)
Example: Annual Rainfall Per City
    map(String key, String value):
      // key: date
      // value: weather info
      foreach (City c in value)
        EmitIntermediate(c, c.rainfall)
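A runnable approximation of this map step for the rainfall example might look as follows. The record layout (a date key with a list of `(city, rainfall)` readings as the value) and the name `map_rainfall` are assumptions made for illustration; the slides do not specify the format of the weather info.

```python
def map_rainfall(key, value):
    """Map step for the rainfall example.

    key:   a date string (unused, as in the slide's pseudocode)
    value: assumed to be a list of (city, rainfall) readings for that date
    Yields one intermediate (city, rainfall) pair per reading.
    """
    for city, rainfall in value:
        yield (city, rainfall)


pairs = list(map_rainfall("2008-01-07", [("Seattle", 4), ("Spokane", 1)]))
print(pairs)  # [('Seattle', 4), ('Spokane', 1)]
```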

Partition Function
- Allocates map output to particular reduces
- Input: key, number of reduces
- Output: index of desired reduce
- Typical: hash(key) % numberOfReduces
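A minimal sketch of the typical hash-and-modulo partitioner is shown below. It uses `zlib.crc32` as the hash, which is a choice made for this example: Python's built-in `hash()` on strings is randomized per process by default, so it would not send the same key to the same reduce across workers.

```python
import zlib


def partition(key, num_reduces):
    """Return the index of the reduce task responsible for this key.

    crc32 is deterministic across processes, so every worker agrees
    on where a given key goes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reduces


for city in ["Seattle", "Spokane", "Tacoma"]:
    print(city, partition(city, 4))
```

The property that matters is that all pairs with the same key land on the same reduce; how evenly keys spread depends on the hash and the key distribution.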

Comparison
- Sorts input for each reduce
- Example: Annual rainfall per city
  - Sorts rainfall data for each city
  - Seattle: {0, 0, 0, 1, 4, 7, 10, …}
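One way to picture this step is as grouping the intermediate pairs by key and sorting each key's values, as in the sketch below. The function name `group_and_sort` and the in-memory dictionary are assumptions for illustration; a real implementation sorts data that does not fit in memory.

```python
from collections import defaultdict


def group_and_sort(pairs):
    """Group (key, value) pairs by key and sort each key's values.

    Returns a dict whose keys are in sorted order and whose values
    are sorted lists, ready to hand to a reduce function.
    """
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sorted(values) for key, values in sorted(grouped.items())}


pairs = [("Seattle", 4), ("Spokane", 1), ("Seattle", 0), ("Seattle", 10)]
print(group_and_sort(pairs))  # {'Seattle': [0, 4, 10], 'Spokane': [1]}
```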

Reduce
- Input: Key, Sorted list of values
- Output: Single value
- Example: Annual rainfall per city

Reduce (Example)
Example: Annual rainfall per city
    reduce(String key, Iterator values):
      // key: city
      // values: rainfall measurements
      sum = 0, count = 0
      for each (v in values)
        sum += v
        count = count + 1
      Emit(sum / count)
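The reduce pseudocode translates fairly directly into Python. In the sketch below, the name `reduce_rainfall` is hypothetical, and returning the key together with the result is an assumption added so the output can be traced back to its city; like the pseudocode, it emits the mean of the values it receives and assumes each key arrives with at least one value.

```python
def reduce_rainfall(key, values):
    """Reduce step for the rainfall example.

    key:    a city name
    values: an iterable of rainfall measurements for that city
    Returns (city, mean of the measurements), mirroring Emit(sum / count).
    """
    total = 0
    count = 0
    for v in values:
        total += v
        count += 1
    return (key, total / count)


print(reduce_rainfall("Seattle", [0, 0, 0, 1, 4, 7, 10]))
```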

Output
- Writes the output to storage (GFS, etc.)

MapReduce for Google Local
- Intersections
- Rendering Tiles
- Finding nearest gas stations