Introduction to Search Engines Technology, CS 236375, Technion, Winter 2013
Amit Gross
Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa

Map-Reduce

Problem Example

Solution Paradigm
Describe the problem as a set of Map-Reduce tasks, a model borrowed from the functional programming paradigm.
Map: data -> (key, value)*
  e.g. Document -> (token, '1')*
Reduce: (key, List<value>) -> (key, value')
  e.g. (token, List<'1'>) -> (token, #repeats)

Word-count - example
Input:
  D1 = The good the bad and the ugly
  D2 = As good as it gets and more
  D3 = Is it ugly and bad? It is, and more!
Map: Text -> (term, '1'), with terms lowercased:
  D1: (the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1)
  D2: (as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1)
  D3: (is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)

Word-count - example (continued)
Pairs grouped by key:
  (the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]); (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
Reduce: (term, list<'1'>) -> (term, #occurrences)
  (the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)

Word-count - pseudo-code:
  Map(Document):
    terms[] <- parse(Document)
    for each t in terms:
      emit(t, '1')
  Reduce(term, list<'1'>):
    emit(term, sum(list))
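The pseudo-code above can be tried out on a single machine. The following is a minimal sketch, not a real Map-Reduce framework: the names `map_doc` and `reduce_term` are made up for illustration, `parse` is assumed to mean "lowercase and split on non-letters", and the grouping step that a framework would perform is simulated with a dictionary.

```python
import re
from collections import defaultdict

def map_doc(document):
    """Map: emit (term, 1) for every term in the document."""
    # Assumption: parse() lowercases the text and keeps runs of letters.
    for term in re.findall(r"[a-z]+", document.lower()):
        yield term, 1

def reduce_term(term, ones):
    """Reduce: sum the list of 1s collected for a term."""
    return term, sum(ones)

docs = [
    "The good the bad and the ugly",
    "As good as it gets and more",
    "Is it ugly and bad? It is, and more!",
]

# Stand-in for the framework: collect all values emitted for each key.
groups = defaultdict(list)
for doc in docs:
    for term, one in map_doc(doc):
        groups[term].append(one)

counts = dict(reduce_term(term, ones) for term, ones in groups.items())
print(counts["the"], counts["and"], counts["gets"])  # prints: 3 4 1
```

Run on the three documents D1-D3, it reproduces the counts from the word-count example slide.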

Other examples:
grep(Text, regex):
  Map(Text, regex) -> (line#, 1) [for each matching line]
  Reduce(line#, [1]) -> line#
Inverted-Index:
  Map(docId, Text) -> (term, docId)
  Reduce(term, list<docId>) -> (term, sorted(list<docId>))
Reverse Web-Link-Graph:
  Map(Webpages) -> (target, source) [for each link]
  Reduce(target, list<source>) -> (target, list<source>)
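Two of these examples, the inverted index and the reverse web-link graph, can be sketched with one small single-process driver. Everything here is illustrative: `run_job` is a hypothetical stand-in for the framework, the record layouts `(doc_id, text)` and `(source, targets)` are assumptions, and the index mapper emits each term once per document, which is one possible choice.

```python
from collections import defaultdict

def run_job(records, mapper, reducer):
    """Hypothetical local driver: map every record, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in sorted(groups.items()))

# Inverted index: (docId, Text) -> (term, docId); reducer sorts the posting list.
def index_map(record):
    doc_id, text = record
    for term in set(text.lower().split()):  # one posting per term per document
        yield term, doc_id

def index_reduce(term, doc_ids):
    return term, sorted(doc_ids)

# Reverse web-link graph: emit (target, source) for each link.
def links_map(record):
    source, targets = record
    for target in targets:
        yield target, source

def links_reduce(target, sources):
    return target, sources

docs = [(2, "as good as it gets"), (1, "the good the bad")]
index = run_job(docs, index_map, index_reduce)

pages = [("a.html", ["b.html", "c.html"]), ("b.html", ["c.html"])]
backlinks = run_job(pages, links_map, links_reduce)

print(index["good"])        # [1, 2]
print(backlinks["c.html"])  # ['a.html', 'b.html']
```

Only the mapper and reducer change between the two jobs; the driver stays the same, which is the point of the paradigm.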

Data-flow
Input (text) -> Mapper -> (key,value) -> Sort pairs by key -> Create a list per key -> Shuffle keys by hash value -> (key,list<value>) -> Reducer -> output
Mapper and Reducer are user supplied; the steps in between are done by the framework.
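The framework steps in this data-flow (shuffle by hash, sort by key, one list per key) can be sketched as follows. This is an illustration under assumptions, not how any particular framework implements it: `partition`, `shuffle` and `group_sorted` are made-up names, and CRC32 is used as the hash only because Python's built-in `hash()` for strings changes between runs.

```python
import zlib
from itertools import groupby
from operator import itemgetter

def partition(key, num_reducers):
    """Pick a reducer for a key from a stable hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

def shuffle(pairs, num_reducers):
    """Route every (key, value) pair to the bucket of its reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

def group_sorted(bucket):
    """Sort pairs by key, then build one (key, list of values) entry per key."""
    ordered = sorted(bucket, key=itemgetter(0))
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]

# Mapper output for "the good the bad and the ugly".
pairs = [("the", 1), ("good", 1), ("the", 1), ("bad", 1),
         ("and", 1), ("the", 1), ("ugly", 1)]

buckets = shuffle(pairs, num_reducers=2)
reducer_inputs = [group_sorted(bucket) for bucket in buckets]
for reducer_id, items in enumerate(reducer_inputs):
    print(reducer_id, items)
```

Because the reducer is chosen from the key alone, all values of a key end up at the same reducer, no matter which mapper produced them.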

Example: MR job on 2 Machines
[Figure: input splits (on DFS) -> mappers M1, M2, M3, M4 -> shuffle -> reducers R1, R2 -> output (on DFS)]
Synchronous execution: every reducer starts computing only after all mappers have completed.
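The synchronous execution rule can be illustrated with a thread pool. This is only a sketch: two worker threads stand in for the two machines, the four strings in `splits` are made-up inputs, and the partition rule (first letter modulo the number of reducers) is a toy chosen to keep the example short.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_REDUCERS = 2

def map_task(split):
    """One mapper: emit (word, 1) for each word of its input split."""
    return [(word, 1) for word in split.split()]

def reduce_task(items):
    """One reducer: sum the values of every key it was assigned."""
    return {key: sum(values) for key, values in items}

splits = ["a b a", "b c", "a c c", "b"]  # inputs of M1..M4 (made up)

with ThreadPoolExecutor(max_workers=2) as pool:  # 2 "machines"
    # Barrier: list() returns only after every map task has completed.
    map_outputs = list(pool.map(map_task, splits))

    # Shuffle: toy partition rule, first letter modulo the reducer count.
    partitions = [defaultdict(list) for _ in range(NUM_REDUCERS)]
    for output in map_outputs:
        for key, value in output:
            partitions[ord(key[0]) % NUM_REDUCERS][key].append(value)

    # Reducers R1, R2 start only now, after the barrier above.
    results = list(pool.map(reduce_task,
                            [sorted(p.items()) for p in partitions]))

print(results)  # [{'b': 3}, {'a': 3, 'c': 3}]
```

The barrier is what makes a single slow mapper delay every reducer.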

Storage
Job input and output are stored on DFS
  Replicated, reliable storage
Intermediate files reside on local disks
  Non-reliable
Data is transferred from Mappers to Reducers over the network, as files, which is time consuming.

Combiners
Often, all the reducer does is simple aggregation
  Sum, average, min/max, …
  Commutative and associative functions
We can do some aggregation at the mapper side
  … and eliminate a lot of network traffic!
Where can we use it in an example we have already seen?
  Word Count: the combiner is identical to the reducer
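For word count, the effect of a combiner can be sketched in a few lines. The function names are illustrative, and the "network" is simply the list of pairs a mapper would have to ship; the combiner pre-sums counts per term on the mapper side, exactly as the reducer would.

```python
from collections import Counter

def map_doc(document):
    """Mapper: emit (term, 1) per term."""
    return [(term, 1) for term in document.lower().split()]

def combine(pairs):
    """Combiner: sum the counts per term locally, before the shuffle."""
    local = Counter()
    for term, count in pairs:
        local[term] += count
    return list(local.items())

def reduce_term(term, counts):
    """Reducer: the same sum, now over partial counts from all mappers."""
    return term, sum(counts)

raw = map_doc("the good the bad and the ugly")
combined = combine(raw)

print(len(raw), len(combined))  # prints: 7 5
print(dict(combined)["the"])    # prints: 3

# The reducer result is unchanged when partial sums arrive instead of 1s.
print(reduce_term("the", [3, 0]) == reduce_term("the", [1, 1, 1]))  # True
```

This works because addition is commutative and associative. An average cannot be combined this way directly; a common workaround is to ship (sum, count) pairs and divide only in the reducer.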

Data-flow with combiner
[Figure: the same data-flow, with a user-supplied Combiner stage added between the Mapper and the shuffle. The combiner takes the mapper's (key,value) pairs and emits (key,value').]
The combiner runs on the same machine as the mapper!

Fault tolerance

Straggler Tasks
[Figure: input (on DFS) -> mappers M1, M2, M3, M4 -> reducers R1, R2 -> output (on DFS), with one straggler task marked]
Slowest task (straggler) affects the job latency

Speculative Execution
Schedule a backup task if the original task takes too long to complete
  Same input(s), different output(s)
  Failed tasks and stragglers get the same treatment
Let the fastest win
  After one task completes, kill all the clones
Challenge: how can we tell a task is late?
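The idea can be sketched with `asyncio`, where cancelling a task plays the role of killing a clone. This is a toy model under stated assumptions: `run_attempt` simulates a task with a sleep, and lateness is decided by a fixed `timeout`, which sidesteps the challenge of detecting a late task rather than solving it.

```python
import asyncio

async def run_attempt(name, delay, split):
    """Simulated task attempt: 'works' for delay seconds, then returns."""
    await asyncio.sleep(delay)
    return name, sum(split)

async def speculative(split, primary_delay, backup_delay, timeout):
    """Run a task; if it is late, start a backup and let the fastest win."""
    primary = asyncio.create_task(run_attempt("primary", primary_delay, split))
    done, _ = await asyncio.wait({primary}, timeout=timeout)
    if done:
        return primary.result()  # finished in time, no backup needed

    # The primary is late: schedule a backup attempt on the same input.
    backup = asyncio.create_task(run_attempt("backup", backup_delay, split))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()  # kill the clone that lost the race
    return next(iter(done)).result()

# Straggling primary (0.5 s) versus a backup started after 0.1 s.
winner = asyncio.run(speculative([1, 2, 3], primary_delay=0.5,
                                 backup_delay=0.05, timeout=0.1))
print(winner)  # ('backup', 6)
```

Both attempts read the same input and compute the same result, so it does not matter which one wins; only the latency changes.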

Summary
A simple paradigm for batch processing
  Data- and computation-intensive jobs
Simplicity is key for scalability
No silver bullet
  E.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations)
