Lecture 2 – MapReduce: Theory and Implementation CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of.

Lecture 2 – MapReduce: Theory and Implementation CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Last Class
- How do I process lots of data?
  - Distribute the work
- Can I distribute the work?
  - Maybe… if it's not dependent on other tasks
  - Example: Fibonacci (each term depends on the previous two)

Last Class
- What problems can occur?
  - Large tasks
  - Unpredictable bugs
  - Machine failure
- How do we solve / avoid these?
  - Break up into small chunks?
  - Restart tasks?
  - Use known working solutions

MapReduce
- Concept from functional programming
- Implemented by Google
- Applied to a large number of problems

Functional Programming Review
Java:
    int fooA(String[] list) {
        return bar1(list) + bar2(list);
    }
    int fooB(String[] list) {
        return bar2(list) + bar1(list);
    }
Do they give the same result?

Functional Programming Review
Functional Programming:
    fun fooA(l: int list) = bar1(l) + bar2(l)
    fun fooB(l: int list) = bar2(l) + bar1(l)
Do they give the same result?

Functional Programming Review
- Operations do not modify data structures: they always create new ones
- Original data still exists in unmodified form
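To make the question on the two preceding slides concrete, here is a small Java sketch. The bodies of bar1 and bar2 are invented for illustration (the slides do not define them); bar1 is given a side effect so that the order of the calls becomes visible.

```java
public class EvaluationOrder {
    // Hypothetical bar1: has a side effect, it overwrites the first element.
    static int bar1(String[] list) {
        list[0] = "changed";
        return list.length;
    }

    // Hypothetical bar2: reads the first element, so it can observe bar1's write.
    static int bar2(String[] list) {
        return list[0].length();
    }

    // Java evaluates the operands of + left to right, so bar1 runs first here.
    static int fooA(String[] list) {
        return bar1(list) + bar2(list);
    }

    // Here bar2 runs first and still sees the original first element.
    static int fooB(String[] list) {
        return bar2(list) + bar1(list);
    }

    public static void main(String[] args) {
        System.out.println(fooA(new String[] {"a", "b"})); // 2 + 7 = 9
        System.out.println(fooB(new String[] {"a", "b"})); // 1 + 2 = 3
    }
}
```

If neither function modified its argument, as in the functional version, both orders would give the same sum.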

Functional Updates Do Not Modify Structures
    fun foo(x, lst) =
      let val lst' = rev lst
      in rev (x :: lst')
      end
    foo: 'a * 'a list -> 'a list
The foo() function above reverses a list (rev is SML's built-in list reversal), adds a new element to the front, and returns all of that, reversed, which appends an item. But it never modifies lst!
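The same idea can be sketched in Java, assuming Java 9 or later for List.of. The class and helper names are made up for this example. The helper works on private copies, so the caller's list is never changed (List.of would in fact throw if anything tried).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class FunctionalAppend {
    // Returns a reversed copy; the argument is left untouched.
    static <T> List<T> reversed(List<T> lst) {
        List<T> copy = new ArrayList<>(lst);
        Collections.reverse(copy);
        return copy;
    }

    // Mirrors foo(x, lst): reverse, put x at the front, reverse again.
    static <T> List<T> foo(T x, List<T> lst) {
        List<T> tmp = reversed(lst); // fresh list, private to this call
        tmp.add(0, x);
        return reversed(tmp);
    }

    public static void main(String[] args) {
        List<Integer> original = List.of(1, 2, 3); // immutable input
        List<Integer> appended = foo(4, original);
        System.out.println(original); // [1, 2, 3]
        System.out.println(appended); // [1, 2, 3, 4]
    }
}
```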

Functions Can Be Used As Arguments
    fun DoDouble(f, x) = f (f x)
It does not matter what f does to its argument; DoDouble() will do it twice.
What is the type of this function?
    x: 'a
    f: 'a -> 'a
    DoDouble: ('a -> 'a) * 'a -> 'a
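A rough Java counterpart, using java.util.function.Function to pass f as an argument. The class and method names are illustrative only.

```java
import java.util.function.Function;

public class DoDoubleSketch {
    // Applies f twice: doDouble(f, x) = f(f(x)).
    static <T> T doDouble(Function<T, T> f, T x) {
        return f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        Integer n = doDouble(v -> v + 3, 10);
        String s = doDouble(v -> v + "!", "hi");
        System.out.println(n); // 16
        System.out.println(s); // hi!!
    }
}
```

The type parameter T plays the role of 'a: the argument, the result of f, and the final result must all have the same type.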

map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
    map f lst: ('a -> 'b) -> ('a list) -> ('b list)

map Implementation
    fun map f [] = []
      | map f (x::xs) = (f x) :: (map f xs)
This implementation moves left-to-right across the list, mapping elements one at a time… But does it need to?

Implicit Parallelism In map
- In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements
- If order of application of f to elements in list is commutative, we can reorder or parallelize execution
- This is the “secret” that MapReduce exploits
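One way to see this on a single machine is to compare a sequential map with one built on Java's parallel streams. This is only a sketch of the idea, not how MapReduce itself is implemented; the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

public class ParallelMapSketch {
    // Sequential map: walks the list left to right, like the SML version.
    static <A, B> List<B> map(Function<A, B> f, List<A> lst) {
        List<B> out = new ArrayList<>();
        for (A x : lst) {
            out.add(f.apply(x));
        }
        return out;
    }

    // Parallel map: elements may be processed in any order on several threads,
    // but the collected result keeps the order of the input list.
    static <A, B> List<B> parallelMap(Function<A, B> f, List<A> lst) {
        return lst.parallelStream().map(f).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4);
        List<Integer> seq = map(x -> x * x, input);
        List<Integer> par = parallelMap(x -> x * x, input);
        System.out.println(seq);             // [1, 4, 9, 16]
        System.out.println(seq.equals(par)); // true, because f has no side effects
    }
}
```

The two results are guaranteed to match only when f is free of side effects, which is exactly the condition the slide describes.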

Fold
Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list.
    fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b

fold left vs. fold right
- Order of list elements can be significant
- Fold left moves left-to-right across the list
- Fold right moves from right-to-left
SML Implementation:
    fun foldl f a [] = a
      | foldl f a (x::xs) = foldl f (f(x, a)) xs
    fun foldr f a [] = a
      | foldr f a (x::xs) = f(x, (foldr f a xs))
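The two folds can be sketched in Java with loops in place of recursion. The names are illustrative, and f takes (element, accumulator) in the same order as the SML code. String concatenation is used in the demonstration because it makes the direction of travel visible.

```java
import java.util.List;
import java.util.function.BiFunction;

public class FoldSketch {
    // foldl f a [x1, ..., xn] = f(xn, ... f(x2, f(x1, a)))
    static <A, B> B foldl(BiFunction<A, B, B> f, B a, List<A> lst) {
        B acc = a;
        for (A x : lst) {
            acc = f.apply(x, acc);
        }
        return acc;
    }

    // foldr f a [x1, ..., xn] = f(x1, f(x2, ... f(xn, a)))
    static <A, B> B foldr(BiFunction<A, B, B> f, B a, List<A> lst) {
        B acc = a;
        for (int i = lst.size() - 1; i >= 0; i--) {
            acc = f.apply(lst.get(i), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> lst = List.of("a", "b", "c");
        // Appending each element to the accumulator shows the visiting order.
        String left = foldl((x, acc) -> acc + x, "", lst);
        String right = foldr((x, acc) -> acc + x, "", lst);
        System.out.println(left);  // abc
        System.out.println(right); // cba
    }
}
```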

Example
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
How can we implement this?

Example (Solved)
    fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
    fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
    fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst
    fun foo(l: int list) = sum(l) + mul(l) + length(l)
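For comparison, the same three folds can be written with Java's Stream.reduce. This is an illustrative sketch with made-up names; note that Stream.reduce passes the accumulator first and the element second, the reverse of the SML code.

```java
import java.util.List;

public class FoldExamples {
    // Fold with + starting from 0.
    static int sum(List<Integer> lst) {
        return lst.stream().reduce(0, (a, x) -> a + x);
    }

    // Fold with * starting from 1.
    static int mul(List<Integer> lst) {
        return lst.stream().reduce(1, (a, x) -> a * x);
    }

    // Counts elements by ignoring each value and adding 1. The three-argument
    // reduce takes an extra combiner (Integer::sum) for merging partial counts.
    static int length(List<Integer> lst) {
        return lst.stream().reduce(0, (a, x) -> a + 1, Integer::sum);
    }

    static int foo(List<Integer> lst) {
        return sum(lst) + mul(lst) + length(lst);
    }

    public static void main(String[] args) {
        System.out.println(foo(List.of(1, 2, 3, 4))); // 10 + 24 + 4 = 38
    }
}
```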

Google MapReduce
- Input Handling
- Map Function
- Partition Function
- Compare Function
- Reduce Function
- Output Writer

Input Handling
- Divides up data into bite-size chunks
- Starts up tasks
- Assigns tasks to idle workers

Map
- Input: Key, Value pair
- Output: Key, Value pairs
- Example: Annual Rainfall Per City

Map (Example)
Example: Annual Rainfall Per City
    map(String key, String value):
        // key: date
        // value: weather info
        foreach (City c in value)
            EmitIntermediate(c, c.rainfall)

Partition Function
- Allocates map output to particular reduces
- Input: key, number of reduces
- Output: Index of desired reduce
- Typical: hash(key) % numberOfReduces
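A minimal Java sketch of the typical partition function, with an invented class name. It assumes String keys and uses String.hashCode() as the hash, which is only one possible choice.

```java
public class PartitionSketch {
    // Maps a key to a reduce index in the range [0, numReduces).
    // Math.floorMod is used because hashCode() can be negative in Java,
    // and the plain % operator would then produce a negative index.
    static int partition(String key, int numReduces) {
        return Math.floorMod(key.hashCode(), numReduces);
    }

    public static void main(String[] args) {
        for (String city : new String[] {"Seattle", "Phoenix", "Boston"}) {
            System.out.println(city + " -> reduce " + partition(city, 4));
        }
    }
}
```

Because the index depends only on the key, every value emitted for the same city ends up at the same reduce.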

Comparison
- Sorts input for each reduce
- Example: Annual rainfall per city
  - Sorts rainfall data for each city
  - Seattle: {0, 0, 0, 1, 4, 7, 10, …}

Reduce
- Input: Key, Sorted list of values
- Output: Single value
- Example: Annual rainfall per city

Reduce (Example)
Example: Annual rainfall per city
    reduce(String key, Iterator values):
        // key: city
        // values: rainfall measurements
        sum = 0, count = 0
        for each (v in values)
            sum += v
            count = count + 1
        Emit(sum / count)
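The whole pipeline can be imitated in memory on one machine to see how the pieces fit together. This is a sketch under several assumptions: Java 9 or later, an invented record format of "City:amount" readings separated by commas, and a single reduce (so no partitioning). It is not Google's implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class RainfallSketch {
    // Map step: one record (date, "City:amount,City:amount,...") becomes
    // a list of (city, rainfall) pairs. The record format is an assumption.
    static List<Map.Entry<String, Double>> map(String date, String value) {
        List<Map.Entry<String, Double>> out = new ArrayList<>();
        for (String reading : value.split(",")) {
            String[] parts = reading.split(":");
            out.add(Map.entry(parts[0], Double.parseDouble(parts[1])));
        }
        return out;
    }

    // Reduce step: averages the values collected for one city.
    static double reduce(String city, List<Double> values) {
        double sum = 0;
        int count = 0;
        for (double v : values) {
            sum += v;
            count = count + 1;
        }
        return sum / count;
    }

    // Driver: runs map over every record, groups values by key, sorts each
    // group (the comparison step), then calls reduce once per key.
    static Map<String, Double> run(Map<String, String> records) {
        Map<String, List<Double>> groups = new TreeMap<>();
        for (Map.Entry<String, String> rec : records.entrySet()) {
            for (Map.Entry<String, Double> kv : map(rec.getKey(), rec.getValue())) {
                groups.computeIfAbsent(kv.getKey(), k -> new ArrayList<>()).add(kv.getValue());
            }
        }
        Map<String, Double> result = new TreeMap<>();
        for (Map.Entry<String, List<Double>> g : groups.entrySet()) {
            Collections.sort(g.getValue());
            result.put(g.getKey(), reduce(g.getKey(), g.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> records = new TreeMap<>();
        records.put("2008-01-01", "Seattle:4,Phoenix:0");
        records.put("2008-01-02", "Seattle:10,Phoenix:0");
        System.out.println(run(records)); // {Phoenix=0.0, Seattle=7.0}
    }
}
```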

Output
- Writes the output to storage (GFS, etc.)

MapReduce for Google Local
- Intersections
- Rendering Tiles
- Finding nearest gas stations