Introduction to Search Engines Technology
  CS Technion, Winter 2013
  Amit Gross
  Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa
  Map-Reduce
Problem Example
Solution Paradigm
  Describe the problem as a set of Map-Reduce tasks, a model borrowed from the functional programming paradigm.
  Map: data -> (key, value)*
    Document -> (token, '1')*
  Reduce: (key, List<value>) -> (key, value')
    (token, List<'1'>) -> (token, #repeats)
Word-count - example
  Input:
    D1 = The good the bad and the ugly
    D2 = As good as it gets and more
    D3 = Is it ugly and bad? It is, and more!
  Map: Text -> (term, '1'), with terms lowercased:
    D1: (the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1)
    D2: (as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1)
    D3: (is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)
Word-count - example
  Pairs grouped by key:
    (the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]); (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
  Reduce: (term, list<'1'>) -> (term, #occurrences)
    (the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)
Word-count - pseudo-code:
  Map(Document):
    terms[] <- parse(Document)
    for each t in terms:
      emit(t, '1')
  Reduce(term, list<'1'>):
    emit(term, sum(list))
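The pseudo-code above can be sketched as runnable Python. This is only a toy, single-process illustration: `run_job` is a hypothetical stand-in for the framework, and the tokenizer (lowercased runs of letters) is an assumption, since the slides do not say how `parse` works.

```python
import re
from collections import defaultdict

def map_doc(document):
    """Map: document text -> list of (term, 1) pairs."""
    # Assumption: a term is a lowercased run of ASCII letters.
    return [(term, 1) for term in re.findall(r"[a-z]+", document.lower())]

def reduce_term(term, counts):
    """Reduce: (term, [1, 1, ...]) -> (term, number of occurrences)."""
    return term, sum(counts)

def run_job(documents):
    """Hypothetical single-process driver standing in for the framework."""
    groups = defaultdict(list)          # term -> list of 1s
    for doc in documents:
        for term, one in map_doc(doc):
            groups[term].append(one)
    return dict(reduce_term(term, ones) for term, ones in groups.items())

docs = [
    "The good the bad and the ugly",
    "As good as it gets and more",
    "Is it ugly and bad? It is, and more!",
]
counts = run_job(docs)
print(counts)
```

On the three documents from the example, this yields the same counts as the Reduce output shown earlier, e.g. `the` maps to 3 and `and` maps to 4.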
Other examples:
  grep(Text, regex):
    Map(Text, regex) -> (line#, 1)   [for each matching line]
    Reduce(line#, [1]) -> line#
  Inverted-Index:
    Map(docId, Text) -> (term, docId)
    Reduce(term, list<docId>) -> (term, sorted(list<docId>))
  Reverse Web-Link-Graph:
    Map(Webpages) -> (target, source)   [for each link]
    Reduce(target, list<source>) -> (target, list<source>)
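Two of these examples, the inverted index and the reverse web-link graph, can be sketched in Python as follows. The driver `run_job`, the whitespace tokenizer, and the sample data are assumptions made for illustration. The index mapper also emits each term once per document, which the slide does not specify.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())

# Inverted index: (docId, text) -> (term, docId); reduce sorts the doc ids.
def index_map(record):
    doc_id, text = record
    # Assumption: emit each distinct term once per document.
    return [(term, doc_id) for term in set(text.lower().split())]

def index_reduce(term, doc_ids):
    return term, sorted(doc_ids)

# Reverse web-link graph: (source, outgoing links) -> (target, source).
def links_map(record):
    source, targets = record
    return [(target, source) for target in targets]

def links_reduce(target, sources):
    return target, list(sources)

docs = [(1, "good bad ugly"), (2, "good as it gets"), (3, "ugly and bad")]
index = run_job(docs, index_map, index_reduce)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
reverse = run_job(pages, links_map, links_reduce)
```

Here `index["good"]` lists the documents containing "good", and `reverse["c.html"]` lists the pages that link to `c.html`.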
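Data-flow
  [Figure: Input (text) -> Mapper -> (key,value) -> sort pairs by key -> create a list per key -> shuffle keys by hash value -> (key, list<value>) -> Reducer -> output]
  User supplied: Mapper, Reducer
  Framework: sort pairs by key, create a list per key, shuffle keys by hash value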
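The framework side of this data-flow can be sketched roughly as below. It is a simplification under stated assumptions: the function names are hypothetical, CRC32 is an arbitrary choice of stable hash, and the three framework steps are applied in a convenient order (partition by hash first, then sort and group within each partition), which a real framework may do differently.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Pick a reducer for a key using a stable hash (assumption: CRC32)."""
    # Python's built-in hash() of a str varies between processes, so avoid it.
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """(key, value) pairs -> one sorted list of (key, [values]) per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    result = []
    for bucket in buckets:
        bucket.sort(key=itemgetter(0))  # sort pairs by key (stable sort)
        result.append([
            (key, [value for _, value in group])  # create a list per key
            for key, group in groupby(bucket, key=itemgetter(0))
        ])
    return result

pairs = [("the", 1), ("good", 1), ("the", 1), ("bad", 1), ("and", 1), ("the", 1)]
reducer_inputs = shuffle(pairs, num_reducers=2)
```

Because the partition depends only on the key, all values for one key end up at the same reducer, which is what lets each reducer work independently.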
Example: MR job on 2 Machines
  [Figure: input splits (on DFS) -> map tasks M1, M2, M3, M4 -> shuffle -> reduce tasks R1, R2 -> output (on DFS)]
  Synchronous execution: every R starts computing only after all M's have completed
Storage
  Job input and output are stored on DFS
    Replicated, reliable storage
  Intermediate files reside on local disks
    Non-reliable
  Data is transferred from Mappers to Reducers over the network, as files, which is time consuming.
Combiners
  Often, the reducer performs a simple aggregation
    Sum, average (carried as a sum and a count), min/max, …
    Commutative and associative functions
  We can do some aggregation at the mapper side
    … and eliminate a lot of network traffic!
  Where can we use it in an example we have already seen?
    Word Count: the combiner is identical to the reducer
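A minimal sketch of the word-count combiner, assuming a whitespace tokenizer and hypothetical function names. It shows how the combiner shrinks the mapper's output before anything crosses the network, and that the reducer still works because it only sums whatever counts it receives.

```python
from collections import Counter

def map_doc(document):
    """Map: document text -> list of (term, 1) pairs."""
    return [(term, 1) for term in document.lower().split()]

def combine(pairs):
    """Combiner: sum the counts per term on the mapper's machine."""
    partial = Counter()
    for term, count in pairs:
        partial[term] += count
    return list(partial.items())

def reduce_term(term, counts):
    """Reduce: sums partial counts, so it accepts combined or raw pairs."""
    return term, sum(counts)

raw = map_doc("the good the bad and the ugly")
combined = combine(raw)
print(len(raw), "pairs before combining,", len(combined), "after")
```

For this document the mapper emits 7 pairs and the combiner forwards 5, one per distinct term. The saving grows with the number of repeated terms per input split.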
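Data-flow with combiner
  [Figure: Input (text) -> Mapper -> (key,value) -> Combiner -> (key,value') -> framework (sort pairs by key, create a list per key, shuffle keys by hash value) -> (key, list<value'>) -> Reducer -> output]
  User supplied: Mapper, Combiner, Reducer
  Framework: sort pairs by key, create a list per key, shuffle keys by hash value
  The combiner runs on the same machine as the mapper!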
Fault tolerance
Straggler Tasks
  [Figure: input (on DFS) -> M1, M2, M3, M4 -> R1, R2 -> output (on DFS); one task is marked as the straggler]
  Slowest task (straggler) affects the job latency
Speculative Execution
  Schedule a backup task if the original task takes too long to complete
    Same input(s), different output(s)
  Failed tasks and stragglers get the same treatment
  Let the fastest win
    After one task completes, kill all the clones
  Challenge: how can we tell a task is late?
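One possible answer to the challenge above, offered only as an illustrative heuristic and not as the method of any particular framework: treat a running task as late when its elapsed time exceeds some multiple of the median duration of tasks that have already finished. The function name, its parameters, and the 1.5 factor are all assumptions.

```python
from statistics import median

def find_stragglers(running, finished_durations, slowdown=1.5):
    """Return ids of running tasks that look late.

    running: dict mapping task id -> seconds elapsed so far
    finished_durations: durations (seconds) of already completed tasks
    slowdown: hypothetical tolerance factor over the median duration
    """
    if not finished_durations:
        return []  # nothing to compare against yet
    threshold = slowdown * median(finished_durations)
    return sorted(task for task, elapsed in running.items() if elapsed > threshold)

# Two map tasks have finished in about 32 s; M3 and M4 are still running.
late = find_stragglers({"M3": 40.0, "M4": 95.0}, [30.0, 34.0, 32.0])
print(late)
```

With a median of 32 s the threshold is 48 s, so only `M4` is flagged and would get a backup task. A purely time-based rule like this ignores input size and progress, which is part of why the question is a real challenge.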
Summary
  A simple paradigm for batch processing
    Data- and computation-intensive jobs
  Simplicity is key for scalability
  No silver bullet
    E.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations)