Published by Meredith Blake. Modified over 8 years ago.
1
Lecture 3 – MapReduce: Implementation
CSE 490h – Introduction to Distributed Computing, Spring 2009
Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.
2
Last Class
Input Handling
Map Function
Partition Function
Compare Function
Reduce Function
Output Writer
3
map (Functional Programming)
Creates a new list by applying f to each element of the input list; returns output in order.
map f lst: ('a -> 'b) -> ('a list) -> ('b list)
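As a minimal illustration in Python (not part of the original slides), the signature above can be read as follows: f is applied to each element and the results keep the input order. The names map_list and square are hypothetical.

```python
def square(x: int) -> int:
    """Example element function f."""
    return x * x


def map_list(f, lst):
    """Return a new list with f applied to each element, in input order."""
    return [f(x) for x in lst]


squares = map_list(square, [1, 2, 3, 4])
print(squares)  # [1, 4, 9, 16]
```

Because each call to f is independent of the others, the calls could in principle run in parallel, which is the property MapReduce relies on.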
4
Fold
Moves across a list, applying f to each element plus an accumulator.
f returns the next accumulator value, which is combined with the next element of the list.
fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
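A small Python sketch of the same idea, assuming a left-to-right traversal. The helper name fold is hypothetical; functools.reduce is the standard-library equivalent, although its function takes the accumulator first.

```python
from functools import reduce


def fold(f, x0, lst):
    """Fold lst left to right; f takes (element, accumulator) as in the signature above."""
    acc = x0
    for x in lst:
        acc = f(x, acc)
    return acc


total = fold(lambda x, acc: x + acc, 0, [1, 2, 3, 4])

# functools.reduce calls its function as f(accumulator, element),
# so the two arguments are swapped relative to fold above.
same_total = reduce(lambda acc, x: acc + x, [1, 2, 3, 4], 0)

print(total, same_total)  # 10 10
```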
6
Advantages of MapReduce
Flexible for a wide range of problems
Fault tolerant
Scalable
7
Overview
Hardware
Task assignment
Failure
Non-Determinism
Optimizations
8
Commodity Hardware
Cheap hardware:
  2 – 4 GB memory
  100 megabit / sec networking
  x86 processors running Linux
Cheap hardware + lots of it = failure!
9
Master vs Worker
Users submit jobs into a scheduling system
  Implement map and reduce
  Specify M map tasks and R reduce tasks
Many copies of the program are started
One copy is the master
Master assigns map/reduce tasks to idle workers
10
Map Tasks
Input broken up into 16 MB to 64 MB chunks
M map tasks processed in parallel
11
Reduce Tasks
R reduce tasks
Keys assigned to reduce tasks by a partitioning function
Typically: hash(key) mod R
Sometimes useful to customize
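The partitioning rule above can be sketched in Python as follows. This is an illustration under assumptions, not the implementation from the lecture: the function names are hypothetical, and zlib.crc32 stands in for the hash so that every worker computes the same value (Python's built-in hash() is randomized per process for strings). The second function shows one possible customization, grouping URLs by host.

```python
import zlib
from urllib.parse import urlparse


def partition(key: str, num_reducers: int) -> int:
    """Default partitioner: hash(key) mod R, giving an index in [0, R)."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition_by_host(url: str, num_reducers: int) -> int:
    """Custom partitioner: all URLs from the same host go to the same reduce task."""
    host = urlparse(url).netloc
    return partition(host, num_reducers)


R = 4  # hypothetical number of reduce tasks
print(partition("apple", R))
print(partition_by_host("http://example.com/a", R)
      == partition_by_host("http://example.com/b/c", R))  # True
```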
12
Master Data Structures
For each map / reduce task, store the state and the identity of the worker machine
State: Idle, In-Progress, Complete
For each completed map task, store the locations of its output (R locations)
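One way to picture this bookkeeping is the Python sketch below. All class, field, and function names are hypothetical; only the three states and the R output locations per completed map task come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class TaskState(Enum):
    IDLE = "idle"
    IN_PROGRESS = "in-progress"
    COMPLETE = "complete"


@dataclass
class Task:
    task_id: int
    kind: str                                # "map" or "reduce"
    state: TaskState = TaskState.IDLE
    worker: Optional[str] = None             # identity of the assigned machine, if any
    # For a completed map task: one intermediate-output location per reduce region.
    output_locations: List[str] = field(default_factory=list)


def complete_map_task(task: Task, locations: List[str], num_reducers: int) -> None:
    """Record the R output locations a worker reports for a finished map task."""
    if len(locations) != num_reducers:
        raise ValueError("expected one location per reduce region")
    task.state = TaskState.COMPLETE
    task.output_locations = list(locations)
```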
13
Worker with Map Tasks
Parses input data into key/value pairs
Applies map
Buffered pairs written to local disk, partitioned into R regions
Locations of output eventually passed to master
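The map-side steps above might look like this in Python. It is a simplified sketch: the regions are kept in memory rather than written to disk, word_count_map is a hypothetical user map function, and zlib.crc32 is an assumed stand-in for the partitioning hash.

```python
import zlib
from collections import defaultdict


def word_count_map(doc_name, text):
    """Hypothetical user map function: emit (word, 1) for each word."""
    for word in text.split():
        yield word, 1


def run_map_task(records, map_fn, num_reducers):
    """Apply map_fn to each (key, value) record and bucket the output into R regions."""
    regions = defaultdict(list)  # region index -> list of (key, value) pairs
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            r = zlib.crc32(out_key.encode("utf-8")) % num_reducers
            regions[r].append((out_key, out_value))
    # A real worker would write each region to local disk and report the
    # file locations to the master; here the regions simply stay in memory.
    return dict(regions)


regions = run_map_task([("doc1", "a b a")], word_count_map, num_reducers=2)
```

Every pair with the same key lands in the same region, so a single reduce task later sees all values for that key.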
14
Worker with Reduce Tasks
Reads data from map machines via RPC
Sorts data by key
Applies reduce
Output appended to final output file
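A compact Python sketch of the reduce side, assuming the intermediate pairs have already been fetched from the map machines. The names run_reduce_task and sum_reduce are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def sum_reduce(key, values):
    """Hypothetical user reduce function: total the counts for one key."""
    return key, sum(values)


def run_reduce_task(pairs, reduce_fn):
    """Sort intermediate pairs by key, group equal keys, and apply reduce_fn per key."""
    ordered = sorted(pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        output.append(reduce_fn(key, [value for _, value in group]))
    return output


result = run_reduce_task([("b", 1), ("a", 1), ("b", 1)], sum_reduce)
print(result)  # [('a', 1), ('b', 2)]
```

The sort is what makes grouping possible: itertools.groupby only merges adjacent equal keys.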
15
After Reduce
When all tasks are complete, master wakes up the user program
Output available in R output files, with names specified by user
16
How do you pick M and R?
How many scheduling decisions? O(M+R)
How much state does the master keep in memory? O(M*R)
M: much larger than number of machines
R: small multiple of number of machines
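To make the two bounds concrete, here is a back-of-the-envelope calculation in Python. The task counts and the one byte of state per map/reduce pair are illustrative assumptions, not figures taken from these slides.

```python
def master_cost(m: int, r: int, bytes_per_pair: int = 1):
    """Return (scheduling decisions, bytes of master state) for M map and R reduce tasks."""
    decisions = m + r                      # one assignment per task: O(M + R)
    state_bytes = m * r * bytes_per_pair   # one entry per map/reduce pair: O(M * R)
    return decisions, state_bytes


decisions, state_bytes = master_cost(m=200_000, r=5_000)
print(decisions)          # 205000
print(state_bytes / 1e9)  # 1.0
```

Under these assumptions the master holds about 1 GB of state, which shows why M and R cannot grow without limit even though the scheduling work stays modest.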
17
Failures & Issues
Worker Failure
Master Failure
Stragglers
Crashes, Etc
18
Worker Failure
Master pings worker
No response -> assumes failed
Failed map tasks
  Completed & In-Progress tasks set to idle
Failed reduce tasks
  In-Progress tasks set to idle
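The reset rules above can be sketched as a small Python function. The dictionary layout and the function name are hypothetical. The asymmetry between map and reduce tasks follows from where output is stored: map output sits on the failed machine's local disk, while completed reduce output is already in the final output files.

```python
def handle_worker_failure(tasks, failed_worker):
    """Reset tasks that must be re-executed after failed_worker stops responding.

    Each task is a dict with keys "kind", "state", and "worker".
    Returns the list of tasks that were set back to idle.
    """
    rescheduled = []
    for task in tasks:
        if task["worker"] != failed_worker:
            continue
        # Completed map output is lost with the machine, so it is redone.
        # Completed reduce output survives and is left alone.
        lost = task["state"] == "in-progress" or (
            task["kind"] == "map" and task["state"] == "complete"
        )
        if lost:
            task["state"] = "idle"
            task["worker"] = None
            rescheduled.append(task)
    return rescheduled
```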
19
Master Failure
You could write checkpoints
In practice: just let the user deal with it
20
Stragglers (Causes)
Why?
Bad disk with correctable errors
Too many other tasks on the machine
No caching
21
Stragglers (Solutions)
Schedule secondary (backup) copies of the remaining in-progress tasks when the operation is close to completion
A task is complete when either the primary or the secondary task completes
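A rough Python sketch of this policy follows. The threshold parameter and both function names are hypothetical; the slides do not say how close to completion the job must be before backups are scheduled.

```python
def pick_backup_tasks(tasks, threshold=0.9):
    """Return ids of in-progress tasks to duplicate once the job is nearly done.

    tasks maps task id -> state ("idle", "in-progress", or "complete").
    threshold is an assumed tuning knob: the fraction of tasks that must be
    complete before backup copies are scheduled.
    """
    if not tasks:
        return []
    done = sum(1 for state in tasks.values() if state == "complete")
    if done / len(tasks) < threshold:
        return []
    return [tid for tid, state in tasks.items() if state == "in-progress"]


def mark_complete(tasks, task_id):
    """Accept the first completion report (primary or secondary); ignore later ones."""
    if tasks[task_id] == "complete":
        return False
    tasks[task_id] = "complete"
    return True
```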
22
Crashes, Etc
Causes:
  Bad records
  Bug in third-party code
Solution: skip over errors?
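One simple way to picture "skip over errors" is the Python sketch below. It is a local try/except simplification with hypothetical names and a hypothetical max_skipped limit, not the mechanism of any particular MapReduce system.

```python
def map_with_skipping(records, map_fn, max_skipped=10):
    """Apply map_fn to each record, skipping records that raise an exception.

    Returns (output, skipped_indexes). If more than max_skipped records fail,
    the last error is re-raised, since that suggests a bug rather than bad data.
    """
    output, skipped = [], []
    for index, record in enumerate(records):
        try:
            output.extend(map_fn(record))
        except Exception:
            skipped.append(index)
            if len(skipped) > max_skipped:
                raise
    return output, skipped


def parse_int(record):
    """Hypothetical map function that fails on malformed input."""
    return [int(record)]


out, skipped = map_with_skipping(["1", "x", "3"], parse_int)
print(out, skipped)  # [1, 3] [1]
```

Skipping is only acceptable when losing a few records does not matter, for example in statistical analysis over a large data set.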
23
Non-Determinism
Deterministic = distributed implementation produces the same result as sequential execution
Non-Deterministic = map or reduce is non-deterministic
24
Non-Determinism
Guarantee: output for a specific reduce task is equivalent to some sequential execution
But: output from different reduce tasks may correspond to different sequential executions
25
Non-Determinism
There may be no sequential execution that matches the full output
Why? Because R1 and R2 may have read the output of different executions of the same map task M
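A toy Python illustration of this situation, with made-up values: a non-deterministic map task M runs twice (for example after a re-execution), and each reduce task reads its partition from a different run.

```python
def read_outputs(run_for_r1, run_for_r2, executions):
    """R1 takes its partition from one execution of M, R2 from another."""
    return executions[run_for_r1]["r1"], executions[run_for_r2]["r2"]


# Two executions of the same map task M. Each emits one value into
# R1's partition and one into R2's partition.
executions = [
    {"r1": "x", "r2": "x"},  # execution 0 happened to choose "x"
    {"r1": "y", "r2": "y"},  # execution 1 happened to choose "y"
]

# Outputs a purely sequential run could have produced.
sequential_outputs = {(e["r1"], e["r2"]) for e in executions}

mixed = read_outputs(0, 1, executions)  # R1 read execution 0, R2 read execution 1
print(mixed)                            # ('x', 'y')
print(mixed in sequential_outputs)      # False
```

Each half of the mixed result matches some sequential run on its own, but the pair as a whole matches none.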
26
Advanced Stuff
Input Types
Combiner Function
Counters
27
Input Types
May need to change how input is read
Implement reader interface
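What such a reader interface could look like is sketched below in Python. The class and method names are hypothetical and do not correspond to any specific library; the point is only that each input type supplies its own way of turning raw input into key/value pairs.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List, Tuple


class Reader(ABC):
    """Hypothetical reader interface: turn raw input into (key, value) pairs."""

    @abstractmethod
    def read(self) -> Iterator[Tuple[str, str]]:
        ...


class LineReader(Reader):
    """Key is the line number, value is the line contents."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def read(self):
        for number, line in enumerate(self.lines):
            yield str(number), line


class TsvReader(Reader):
    """Key is the text before the first tab, value is the rest of the line."""

    def __init__(self, lines: List[str]):
        self.lines = lines

    def read(self):
        for line in self.lines:
            key, _, value = line.partition("\t")
            yield key, value
```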
28
Combiner
“Combiner” functions can run on the same machine as a mapper
Causes a mini-reduce phase to occur before the real reduce phase, to save bandwidth
Under what conditions is it sound to use a combiner?
29
Combiner Function
Can only be used if the reduce function is commutative and associative
Commutative: a + b = b + a
Associative: (a × b) × c = a × (b × c)
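The following Python sketch shows why these properties matter, using made-up data and hypothetical function names. Summing word counts gives the same answer with or without a combiner; averaging does not, because a mean of partial means is not the overall mean.

```python
from collections import Counter


def combine_counts(pairs):
    """Mini-reduce on the map machine: pre-sum the counts for each word."""
    combined = Counter()
    for word, count in pairs:
        combined[word] += count
    return list(combined.items())


def reduce_sum(pairs):
    """Final reduce: total the counts for each word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return dict(totals)


mapper_a = [("the", 1), ("the", 1), ("cat", 1)]
mapper_b = [("the", 1), ("cat", 1)]

without_combiner = reduce_sum(mapper_a + mapper_b)
with_combiner = reduce_sum(combine_counts(mapper_a) + combine_counts(mapper_b))
print(without_combiner == with_combiner)  # True

# Counter-example: combining partial means gives the wrong answer.
values_a, values_b = [1, 2, 3], [10]
true_mean = sum(values_a + values_b) / 4                      # 4.0
mean_of_means = (sum(values_a) / 3 + sum(values_b) / 1) / 2   # 6.0
print(true_mean, mean_of_means)  # 4.0 6.0
```

With the combiner, mapper_a sends two pairs over the network instead of three, which is the bandwidth saving the previous slide refers to.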
30
Counters
Global counter
Master handles the issue of duplicate executions
Useful for sanity checking or debugging
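One possible way for a master to avoid double counting is sketched below in Python. The class and method names are hypothetical; the idea is simply to accept counter values from each task at most once, so that re-executed or backup copies of a task do not inflate the totals.

```python
class CounterAggregator:
    """Master-side global counters that count each task at most once."""

    def __init__(self):
        self.totals = {}
        self.seen_tasks = set()

    def report(self, task_id, counters):
        """Merge counters from a completed task; ignore duplicate executions."""
        if task_id in self.seen_tasks:
            return False
        self.seen_tasks.add(task_id)
        for name, value in counters.items():
            self.totals[name] = self.totals.get(name, 0) + value
        return True


agg = CounterAggregator()
agg.report("map-1", {"words": 5})
agg.report("map-1", {"words": 5})  # duplicate execution of the same task: ignored
agg.report("map-2", {"words": 3})
print(agg.totals)  # {'words': 8}
```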
31
Discussion Questions
1. Give an example of a MapReduce problem not listed in the reading. In your example, what are the map and reduce functions (including inputs and outputs)?
2. What part of the MapReduce implementation do you find most interesting? Why?
3. Give an example of a distributable problem that should not be solved with MapReduce. What are the limitations of MapReduce that make it ill-suited for your task?
32
Discussion Questions
1. Assuming you had a corpus of webpages as input such that the key for each mapper is the URL and the value is the text of the page, how would you design a mapper and a reducer to construct an inverse graph of the web, that is, for each URL output the list of web pages that point to it?
2. TF–IDF is a statistical value assigned to words in a document corpus that indicates the relative importance of the word. As part of computing it, the Inverse Document Frequency of a word is found from the number of documents in the corpus divided by the number of documents containing the word. Given a corpus of documents, and given that you know how many documents are in the corpus, how would you use MapReduce to find this quantity for every word in the corpus simultaneously?