Introduction to Search Engines Technology CS 236375 Technion, Winter 2013 Amit Gross Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo!


Map-Reduce

Problem Example Count the number of occurrences of each term in a large collection of documents. The collection is too big for one machine – the computation must be distributed.

Solution Paradigm Describe the problem as a set of Map-Reduce tasks, following the functional programming paradigm.
Map: data -> (key, value)*
e.g., Document -> (token, '1')*
Reduce: (key, List<value>) -> (key, value')
e.g., (token, List<'1'>) -> (token, #repeats)

Word-count - example Input:
D1 = The good the bad and the ugly
D2 = As good as it gets and more
D3 = Is it ugly and bad? It is, and more!
Map: Text -> (term, '1')*, terms lowercased:
(the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1);
(as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1);
(is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)

Word-count - example After grouping by term:
(the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]); (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
Reduce: (term, List<1>) -> (term, #occurrences):
(the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)

Word-count – pseudo-code:
Map(Document):
  terms[] <- parse(Document)
  for each t in terms:
    emit(t, '1')
Reduce(term, List<value>):
  emit(term, sum(List<value>))
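The pseudo-code above can be sketched as a runnable single-machine simulation. This is not part of the original slides: the tokenizer, the `shuffle` helper, and the driver loop are illustrative assumptions; a real framework would run the phases on different machines.

```python
from collections import defaultdict

def map_phase(doc):
    """Map: emit (term, 1) for every token in the document."""
    cleaned = doc.lower().replace("?", "").replace("!", "").replace(",", "")
    for term in cleaned.split():
        yield (term, 1)

def shuffle(pairs):
    """Framework step: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(term, counts):
    """Reduce: sum the list of 1s to get the term's frequency."""
    return (term, sum(counts))

docs = ["The good the bad and the ugly",
        "As good as it gets and more",
        "Is it ugly and bad? It is, and more!"]

pairs = [pair for doc in docs for pair in map_phase(doc)]
result = dict(reduce_phase(t, c) for t, c in shuffle(pairs).items())
print(result["the"], result["and"], result["it"])  # 3 4 3
```

Running it on the D1–D3 example reproduces the counts from the previous slide: (the,3); (and,4); (it,3); and so on.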

Other examples:
grep(Text, regex):
  Map(Text, regex) -> (line#, 1) for each line matching regex
  Reduce(line#, [1]) -> line#
Inverted-Index:
  Map(docId, Text) -> (term, docId)* [for each term]
  Reduce(term, List<docId>) -> (term, sorted(List<docId>))
Reverse Web-Link-Graph:
  Map(Webpages) -> (target, source) [for each link]
  Reduce(target, List<source>) -> (target, List<source>)
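The inverted-index example above fits the same simulation style as the word-count sketch. This is an illustrative single-machine rendering, not the slides' code; the sample documents and helper names are made up.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Map: emit (term, docId) once per distinct term in the document."""
    for term in set(text.lower().split()):
        yield (term, doc_id)

def reduce_phase(term, doc_ids):
    """Reduce: produce the sorted posting list for the term."""
    return (term, sorted(doc_ids))

docs = {1: "the good the bad", 2: "good as it gets", 3: "is it bad"}

# Framework shuffle: group docIds by term.
postings = defaultdict(list)
for doc_id, text in docs.items():
    for term, d in map_phase(doc_id, text):
        postings[term].append(d)

index = dict(reduce_phase(t, ds) for t, ds in postings.items())
print(index["good"], index["bad"])  # [1, 2] [1, 3]
```

The reducer's sort is what turns the raw groups into the sorted posting lists an inverted index needs.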

Data-flow
Input (text) -> Mapper -> (key, value) pairs -> shuffle keys by hash value -> sort pairs by key, create a list per key -> (key, List<value>) -> Reducer -> output
The Mapper and Reducer are user supplied; the shuffle and sort steps are handled by the framework.

Example: MR job on 2 Machines
[Diagram: four map tasks (M1–M4) read input splits from DFS; their outputs are shuffled to two reduce tasks (R1, R2), which write the output to DFS.]
Synchronous execution: every R starts computing after all M's have completed.

Storage
Job input and output are stored on DFS – replicated, reliable storage.
Intermediate files reside on local disks – non-reliable.
Data is transferred from mappers to reducers over the network, as files – time consuming.

Combiners
Often, the reducer does simple aggregation: sum, average, min/max, … – commutative and associative functions.
We can do some of the aggregation at the mapper side, and eliminate a lot of network traffic!
Where can we use it in an example we have already seen? Word Count – the combiner is identical to the reducer.

Data-flow with combiner
Input (text) -> Mapper -> (key, value) -> Combiner -> (key, value') -> shuffle keys by hash value -> sort pairs by key, create a list per key -> (key, List<value'>) -> Reducer -> output
The Combiner runs on the same machine as the Mapper. Mapper, Combiner and Reducer are user supplied; the rest is the framework.
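The combiner data-flow can be sketched by pre-aggregating inside each map task before the shuffle. This sketch is not from the slides; it uses `Counter` as a stand-in combiner, and the quoted "network" is just the in-process shuffle dictionary.

```python
from collections import Counter, defaultdict

def map_with_combiner(doc):
    """Mapper + combiner: count terms locally, so the mapper emits one
    (term, partial_count) pair per distinct term instead of one
    (term, 1) pair per occurrence."""
    return Counter(doc.lower().split()).items()

docs = ["the good the bad and the ugly",
        "as good as it gets and more"]

# Shuffle: far fewer pairs cross the "network" than without a combiner.
groups = defaultdict(list)
for doc in docs:
    for term, partial in map_with_combiner(doc):
        groups[term].append(partial)

# Reducer: summing partial counts works because sum is
# commutative and associative.
result = {term: sum(partials) for term, partials in groups.items()}
print(result["the"], result["good"])  # 3 2
```

For the first document the plain mapper would emit seven pairs; with the combiner it emits only five, one per distinct term – the saving grows with document size.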

Fault tolerance

Straggler Tasks
[Diagram: map tasks M1–M4 read input from DFS and shuffle to reduce tasks R1, R2, which write output to DFS.]
The slowest task (straggler) affects the job latency.

Speculative Execution
Schedule a backup task if the original task takes too long to complete – same input(s), different output(s).
Failed tasks and stragglers get the same treatment.
Let the fastest win: after one task completes, kill all the clones.
Challenge: how can we tell a task is late?
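One common answer to the "how can we tell a task is late?" question – not spelled out in the slides – is a progress-rate heuristic: flag a task whose progress per second is well below the average of its peers. The function, the `slack` factor, and the sample numbers below are illustrative assumptions.

```python
def find_stragglers(tasks, slack=1.5):
    """Flag straggler candidates by progress rate.
    tasks: dict of task_id -> (progress in [0, 1], seconds running).
    A task is flagged if its progress rate, inflated by `slack`,
    is still below the mean rate of all running tasks."""
    rates = {tid: p / t for tid, (p, t) in tasks.items() if t > 0}
    mean_rate = sum(rates.values()) / len(rates)
    return [tid for tid, r in rates.items() if r * slack < mean_rate]

# Three map tasks, all running for 10s: M3 has barely progressed.
tasks = {"M1": (0.9, 10), "M2": (0.8, 10), "M3": (0.1, 10)}
print(find_stragglers(tasks))  # ['M3']
```

A scheduler would launch backup copies only for the flagged tasks, keeping speculative execution cheap while still cutting the job's tail latency.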

Summary
A simple paradigm for batch processing of data- and computation-intensive jobs.
Simplicity is key for scalability.
No silver bullet: e.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations).