Lecture 3 – MapReduce: Implementation CSE 490h – Introduction to Distributed Computing, Spring 2009 Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

Last Class Input Handling Map Function Partition Function Compare Function Reduce Function Output Writer

map (Functional Programming) Creates a new list by applying f to each element of the input list; returns output in order. map f lst: ('a->'b) -> ('a list) -> ('b list)
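A minimal Python sketch of the same idea, assuming a hypothetical helper named `map_list` (Python's built-in `map` behaves similarly but returns a lazy iterator):

```python
def map_list(f, lst):
    """Return a new list with f applied to each element, preserving order."""
    return [f(x) for x in lst]


# The input list is left unchanged; a new list is produced.
squares = map_list(lambda x: x * x, [1, 2, 3, 4])
lengths = map_list(len, ["map", "reduce"])
```

Because each call to f depends only on its own element, the calls can in principle run in any order or in parallel, which is what the distributed map phase exploits.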

Fold Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list. fold f x0 lst: ('a*'b->'b)->'b->('a list)->'b
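A small Python sketch of fold under the signature above, where f takes (element, accumulator) and returns the next accumulator. The name `fold` and the two sample calls are illustrative; `functools.reduce` offers similar behavior with the argument order swapped.

```python
def fold(f, x0, lst):
    """Walk the list left to right, threading an accumulator through f."""
    acc = x0
    for x in lst:
        acc = f(x, acc)  # f returns the next accumulator value
    return acc


total = fold(lambda x, acc: x + acc, 0, [1, 2, 3, 4])
# Prepending each element to the accumulator reverses the list.
reversed_list = fold(lambda x, acc: [x] + acc, [], [1, 2, 3])
```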

Advantages of MapReduce Flexible for a wide range of problems Fault tolerant Scalable

Overview Hardware Task assignment Failure Non-Determinism Optimizations

Commodity Hardware Cheap Hardware  2 – 4 GB memory  100 megabit / sec  x86 processors running Linux Cheap Hardware + Lots of It = Failure!

Master vs Worker Users submit jobs into scheduling system  Implement map and reduce  Specify M map tasks and R reducers Many copies of program started  One copy is the master Master assigns map/reduce tasks to idle workers

Map Tasks Input broken up into 16MB - 64MB chunks M map tasks processed in parallel

Reduce Tasks R reduce tasks Assigned by partitioning function  Typically: hash(key) mod R  Sometimes useful to customize
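The default and a customized partitioner might look like the sketch below. It assumes string keys and uses `zlib.crc32` as a stand-in hash, because Python's built-in `hash` for strings varies between processes; the function names are hypothetical.

```python
import zlib
from urllib.parse import urlparse


def partition(key: str, R: int) -> int:
    """Default partitioner: hash(key) mod R, using a process-stable hash."""
    return zlib.crc32(key.encode("utf-8")) % R


def partition_by_host(url: str, R: int) -> int:
    """Custom partitioner: all URLs from one host go to the same reduce task."""
    return partition(urlparse(url).netloc, R)
```

With `partition_by_host`, every page of one site ends up in the same output file, which is one reason customizing the partitioner can be useful.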

Master Data Structures For each map / reduce task, store state and identity of machine  State: Idle, In-Progress, Complete For each complete map task, store locations of output (R locations)
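One way to picture this bookkeeping is the sketch below. The class, field, and function names are assumptions made for illustration, not the actual implementation, and the output locations are made-up strings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETE = "complete"


@dataclass
class Task:
    kind: str                      # "map" or "reduce"
    state: State = State.IDLE
    worker: Optional[str] = None   # identity of the machine running the task
    output_locations: List[str] = field(default_factory=list)


def assign(task: Task, worker: str) -> None:
    """Hand an idle task to a worker."""
    task.state = State.IN_PROGRESS
    task.worker = worker


def complete_map(task: Task, locations: List[str], R: int) -> None:
    """Record the R intermediate output locations of a finished map task."""
    if len(locations) != R:
        raise ValueError("a map task reports one location per reduce region")
    task.state = State.COMPLETE
    task.output_locations = list(locations)


R = 2
t = Task(kind="map")
assign(t, "worker-7")
complete_map(t, ["worker-7:/tmp/m0-r0", "worker-7:/tmp/m0-r1"], R)
```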

Worker with Map Tasks Parses input data into key/value pairs Applies map Buffered pairs written to disk, partitioned into R regions Locations of output eventually passed to master
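The steps on this slide can be sketched as follows. In-memory lists stand in for the on-disk region files, `zlib.crc32` stands in for the partitioning hash, and `run_map_task` and `word_count_map` are hypothetical names.

```python
import zlib
from collections import defaultdict


def run_map_task(records, map_fn, R):
    """Apply map_fn to each input pair and bucket its output into R regions."""
    regions = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            region = zlib.crc32(out_key.encode("utf-8")) % R
            regions[region].append((out_key, out_value))
    # A real worker would write each region to local disk and report
    # the file locations to the master.
    return dict(regions)


def word_count_map(doc_name, text):
    """Example map function: emit (word, 1) for every word in the text."""
    for word in text.split():
        yield word, 1


regions = run_map_task([("doc1", "a b a")], word_count_map, R=3)
```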

Worker with Reduce Tasks Read data from map machines via RPC  Sorts data Applies reduce Output appended to final output file
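A rough sketch of the reduce side, assuming the intermediate pairs have already been fetched from the map machines (the RPC step is omitted) and using the hypothetical name `run_reduce_task`:

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(pairs, reduce_fn):
    """Sort intermediate pairs by key, group them, and apply reduce_fn per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        output.append((key, reduce_fn(key, values)))
    return output


result = run_reduce_task(
    [("b", 1), ("a", 1), ("b", 1)],
    lambda key, values: sum(values),
)
```

Sorting first is what brings all values for one key together, so reduce sees each key exactly once.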

After Reduce When all complete, master wakes up user program Output available in R output files, with names specified by user

How do you pick M and R How many scheduling decisions?  O(M+R) How much state in memory by master?  O(M*R) M: much larger than number of machines R: small multiple of number of machines
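A quick worked example of these bounds, using the figures reported in the original MapReduce paper (M = 200,000, R = 5,000, 2,000 workers) and the paper's approximation of about one byte of master state per map/reduce task pair:

```python
M = 200_000      # map tasks
R = 5_000        # reduce tasks
workers = 2_000  # worker machines

scheduling_decisions = M + R       # O(M + R)
state_pairs = M * R                # O(M * R)
approx_state_gb = state_pairs * 1 / 1e9  # assuming ~1 byte per pair
map_tasks_per_worker = M / workers
```

Here M is 100 times the number of machines, which helps load balancing and recovery, while R is only 2.5 times the number of machines, which keeps the number of output files manageable.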

Failures & Issues Worker Failure Master Failure Stragglers Crashes, Etc

Worker Failure Master pings worker  No response -> assumes failed Failed map tasks  Completed & In-Progress tasks set to idle Failed reduce tasks  In-Progress tasks set to idle
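The reset rules can be sketched as below, with tasks represented as plain dictionaries (an assumption for brevity). Completed map tasks are reset because their output lives on the failed machine's local disk; completed reduce output is already in the global file system, so it is kept.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset the failed worker's tasks to idle and return the reset task ids."""
    reset_ids = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        if task["kind"] == "map":
            must_reset = task["state"] in ("in-progress", "complete")
        else:  # reduce
            must_reset = task["state"] == "in-progress"
        if must_reset:
            task["state"] = "idle"
            task["worker"] = None
            reset_ids.append(task["id"])
    return reset_ids


tasks = [
    {"id": 0, "kind": "map", "state": "complete", "worker": "w1"},
    {"id": 1, "kind": "map", "state": "in-progress", "worker": "w2"},
    {"id": 2, "kind": "reduce", "state": "complete", "worker": "w1"},
    {"id": 3, "kind": "reduce", "state": "in-progress", "worker": "w1"},
]
reset = handle_worker_failure(tasks, "w1")
```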

Master Failure You could write checkpoints In practice: just let the user deal with it

Stragglers (Causes) Why?  Bad disk with frequent correctable errors  Too many other tasks scheduled on the machine  Caching disabled

Stragglers (Solutions) Re-schedule remaining tasks when operation is close to completion A task is complete when either primary or secondary task is complete
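A toy illustration of the "first copy to finish wins" rule, using threads in place of worker machines. The function names and the sleep-based delays are assumptions made for the demo.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_copy(name, delay):
    """Pretend to execute a task; delay simulates the machine's speed."""
    time.sleep(delay)
    return name


def run_with_backup(primary_delay, backup_delay):
    """Run a primary and a backup copy; return the name of the first finisher."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {
            pool.submit(run_copy, "primary", primary_delay),
            pool.submit(run_copy, "backup", backup_delay),
        }
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # In this sketch the pool still waits for the slower copy on exit;
        # a real master would simply ignore its result.
        return next(iter(done)).result()


winner = run_with_backup(primary_delay=0.2, backup_delay=0.01)
```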

Crashes, Etc Causes:  Bad Records  Bug in Third Party Code Solution: Skip over errors?
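One possible shape of the skip-over-errors idea is sketched below: the task is re-executed after a crash, and a record that has caused repeated crashes is skipped on the next attempt. The threshold `max_failures` and the function name are assumptions, and a Python exception stands in for a process crash.

```python
def run_map_with_skipping(records, map_fn, max_failures=2):
    """Retry the whole task after a crash, skipping records that keep failing."""
    failures = {}  # record sequence number -> crash count
    skip = set()
    while True:
        output = []
        crashed = False
        for seq, record in enumerate(records):
            if seq in skip:
                continue
            try:
                output.extend(map_fn(record))
            except Exception:
                failures[seq] = failures.get(seq, 0) + 1
                if failures[seq] >= max_failures:
                    skip.add(seq)
                crashed = True
                break  # the task died; start it over
        if not crashed:
            return output, sorted(skip)


doubled, skipped = run_map_with_skipping(
    ["1", "2", "x", "4"],
    lambda r: [int(r) * 2],
)
```

Skipping trades exactness for progress, so it suits jobs such as statistical analyses over large data sets where losing a few records is acceptable.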

Non-Determinism Deterministic = map and reduce are deterministic, so the distributed implementation produces the same result as a sequential execution Non-Deterministic = map or reduce is non-deterministic

Non-Determinism Guarantee: output for a specific reduce task is equivalent to some sequential operation But: output from different reduce tasks may correspond to different sequential operations

Non-Determinism There may be no sequential operation that matches the full output Why?  Because R1 and R2 may have read outputs from different executions of M

Advanced Stuff Input Types Combiner Function Counters

Input Types May need to change how input is read Implement reader interface

Combiner “Combiner” functions can run on same machine as a mapper Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth Under what conditions is it sound to use a combiner?

Combiner Function Can only be used if the reduce operation is commutative and associative  Commutative: a + b = b + a  Associative: (a × b) × c = a × (b × c)
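A short check of why these properties matter, comparing a sum (safe to combine) with a naive mean (not safe). The helper names and sample data are illustrative.

```python
def combine_then_reduce(mapper_outputs, op):
    """Apply op per mapper (the combiner), then again over the partial results."""
    partials = [op(values) for values in mapper_outputs]
    return op(partials)


def mean(values):
    return sum(values) / len(values)


mapper_outputs = [[1, 2, 3], [4, 5]]  # values for one key on two mappers
all_values = [v for values in mapper_outputs for v in values]

sum_matches = combine_then_reduce(mapper_outputs, sum) == sum(all_values)
mean_matches = combine_then_reduce(mapper_outputs, mean) == mean(all_values)
```

The sum agrees with the direct computation, but the mean of the per-mapper means (3.25) differs from the true mean (3.0). A mean can still use a combiner if it emits (sum, count) pairs and divides only in the final reduce.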

Counters Global Counter Master handles the issue of duplicate executions Useful for sanity checking or debugging
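A sketch of how a master might avoid double counting when the same task runs more than once (for example after a re-execution or a backup task). The report format, a list of (task id, counter dictionary) pairs, is an assumption.

```python
def aggregate_counters(reports):
    """Sum counter values, counting each task id only once."""
    seen_tasks = set()
    totals = {}
    for task_id, counters in reports:
        if task_id in seen_tasks:
            continue  # duplicate execution of an already counted task
        seen_tasks.add(task_id)
        for name, value in counters.items():
            totals[name] = totals.get(name, 0) + value
    return totals


totals = aggregate_counters([
    ("map-0", {"records": 10}),
    ("map-1", {"records": 7}),
    ("map-0", {"records": 10}),  # backup copy of map-0
])
```

A sanity check could then compare `totals["records"]` against the known number of input records.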

Discussion Questions 1. Give an example of a MapReduce problem not listed in the reading. In your example, what are the map and reduce functions (including inputs and outputs)? 2. What part of the MapReduce implementation do you find most interesting? Why? 3. Give an example of a distributable problem that should not be solved with MapReduce. What are the limitations of MapReduce that make it ill-suited for your task?

Discussion Questions 1. Assuming you had a corpus of webpages as input such that the key for each mapper is the URL and the value is the text of the page, how would you design a mapper and a reducer to construct an inverse graph of the web - that is, for each URL output the list of web pages that point to it? 2. TF–IDF is a statistical value assigned to words in a document corpus that indicates the relative importance of the word. As part of computing it, the Inverse Document Frequency of a word is found from: The number of documents in the corpus divided by the number of documents containing the word. Given a corpus of documents, and given that you know how many documents are in the corpus, how would you use map reduce to find this quantity for every word in the corpus simultaneously?
