The Map-Reduce Framework


The Map-Reduce Framework Compiled by Mark Silberstein, using slides from Dan Weld’s class at U. Washington, Yaniv Carmeli, and other sources.

Big Data (From Wikipedia) Big data is a broad term for data sets so large or complex that traditional data processing applications are inadequate […] Big data "size" is a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data.

Motivation Large-scale data processing: we want to process a huge, mostly distributed, database and derive “new” information from it. We want to use 1,000s of CPUs, but we don’t want the hassle of managing the distribution ourselves.

Distributed Computing Frameworks (1)
Hadoop: basic MapReduce; intermediate results are stored on disk. Now includes YARN and supports DAGs with Tez.
Spark: runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing.
Hive: SQL-like queries (HiveQL), which are implicitly converted into MapReduce, Tez, or Spark jobs.
Mahout: machine-learning algorithms, mostly on top of Hadoop.
Tez: a generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks for both batch and interactive use cases. Tez is being adopted by Hive, Pig, and other frameworks in the Hadoop ecosystem, and also by other commercial software (e.g. ETL tools), to replace Hadoop MapReduce as the underlying execution engine.
Pregel: a system for large-scale graph processing (Google).
Dryad: distributed data-parallel programs from sequential building blocks (Microsoft).
Dremel: interactive analysis of web-scale datasets (Google).
PowerDrill: processing a trillion cells per mouse click (Google).
Peregrine: a MapReduce framework for iterative and pipelined jobs.

Distributed Computing Frameworks (2) A distributed computing framework provides: automatic parallelization & distribution, fault tolerance, I/O scheduling, and monitoring & status updates. A MapReduce framework provides all of the above for any task that matches its specific, yet generic enough, pattern.

The Map-Reduce Paradigm You are given a phonebook: How many entries have the last name “Smith”? Create a list of all last names, together with the number of entries in which each appears. How many entries have the first name “John”? Create a list of all first names, together with the number of entries in which each appears. How can these tasks be distributed across many computers?

Map-Reduce Programming Model Programming model from Lisp (and other functional languages). Many problems can be phrased this way. Easy to distribute across nodes. Nice retry/failure semantics.

Map and Reduce in Lisp (Scheme)
(map fu list [list2 list3 …]), where fu is a unary operator
(reduce fb list [list2 list3 …]), where fb is a binary operator
(map square ‘(1 2 3 4)) evaluates (1² 2² 3² 4²) ⟹ (1 4 9 16)
(reduce + ‘(1 4 9 16)) evaluates (+ 16 (+ 9 (+ 4 1))) ⟹ 30
(reduce + (map square (map – L1 L2))) computes the sum of squared element-wise differences of L1 and L2
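The same composition can be written in Python. A minimal runnable sketch (the sample list values are illustrative, not from the slide):

from functools import reduce
import operator

squares = list(map(lambda x: x * x, [1, 2, 3, 4]))        # -> [1, 4, 9, 16]
print(reduce(operator.add, squares))                       # -> 30

L1, L2 = [5, 7, 9], [1, 2, 3]
diffs = map(operator.sub, L1, L2)                          # element-wise L1 - L2
print(reduce(operator.add, map(lambda d: d * d, diffs)))   # sum of squared differences -> 77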

Map-Reduce à la Google map(key, val) is run on each item in the input set and emits (new-key, new-val) pairs. reduce(new-key, list of new-vals) is run for each unique new-key emitted by map() and emits the final output. MapReduce: Simplified Data Processing on Large Clusters, Jeffrey Dean and Sanjay Ghemawat (Google, Inc.) https://www.usenix.org/legacy/event/osdi04/tech/full_papers/dean/dean.pdf

Semantics Assumptions on Map and Reduce operations Map has no side effects Reduce is Associative: (a # b) # c = a # (b # c) Commutative: a # b = b # a Idempotent: The same call with the same parameters will produce the same results

Count words in docs
Inputs: a (url, content) pair for each URL
Outputs: a (word, count) pair for each word that appears in the documents
map(key=url, val=content): for each word in content, emit a (word, “1”) pair
reduce(key=word, values=unique-counts): sum all the “1”s in the unique-counts list and emit a (word, sum) pair
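On a single machine this word count can be prototyped directly. The sketch below is illustrative only; run_mapreduce, map_fn and reduce_fn are hypothetical names, not a real MapReduce API, but the little driver does what the framework does at scale: group the mapper's output by key and feed each group to the reducer.

from collections import defaultdict

def map_fn(url, content):
    # map(key=url, val=content): emit a (word, 1) pair for each word
    for word in content.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # reduce(key=word, values=counts): sum the counts and emit (word, sum)
    yield (word, sum(counts))

def run_mapreduce(inputs, mapper, reducer):
    groups = defaultdict(list)
    for key, value in inputs:                 # map phase
        for k, v in mapper(key, value):
            groups[k].append(v)
    results = []
    for k in sorted(groups):                  # shuffle: group values by key, then reduce
        results.extend(reducer(k, groups[k]))
    return results

docs = [("doc1", "see bob throw"), ("doc2", "see david run")]
print(run_mapreduce(docs, map_fn, reduce_fn))
# [('bob', 1), ('david', 1), ('run', 1), ('see', 2), ('throw', 1)]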

Count Illustrated Doc 1: “see bob throw”; Doc 2: “see david run”. The map phase emits (see,1) (bob,1) (throw,1) for Doc 1 and (see,1) (david,1) (run,1) for Doc 2; after grouping by key, the reduce phase emits (bob,1) (david,1) (run,1) (see,2) (throw,1).

Grep
Inputs: a (file, regexp) pair for each file and regular expression
Outputs: (regexp, {file, #line}) for each regexp, covering all the lines that match the regular expression
map(key=file, val=regexp): for each line of the file that matches regexp, emit (regexp, {file-name, #line})
reduce(key=regexp, values=list of {file-name, #line}): emit (regexp, list of {file-name, #line})
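A corresponding Python sketch, assuming the regular expression is fixed for the whole job and passed to the mapper via a closure (a simplification; make_grep_mapper is a hypothetical helper). It plugs into the run_mapreduce sketch shown earlier.

import re

def make_grep_mapper(regexp):
    pattern = re.compile(regexp)
    def map_fn(file_name, contents):
        # Emit (regexp, (file-name, line-number)) for every matching line.
        for line_no, line in enumerate(contents.splitlines(), start=1):
            if pattern.search(line):
                yield (regexp, (file_name, line_no))
    return map_fn

def reduce_fn(regexp, locations):
    # The reducer simply collects all match locations for the pattern.
    yield (regexp, list(locations))

mapper = make_grep_mapper(r"bo.")
print(list(mapper("doc1", "see bob throw\nsee david run")))
# [('bo.', ('doc1', 1))]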

Reverse Web-Link Graph
Inputs: (source-URL, page-content) for each URL
Outputs: (URL, list of URLs that link to this URL) for each URL
map(key=source-URL, val=page-content): for each link in page-content, emit a (target-URL, source-URL) pair
reduce(key=target-URL, values=source-URLs): emit a (target-URL, list of source-URLs) pair
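A sketch of the reverse link graph. Extracting links with a bare href regular expression is an illustrative assumption; a real job would use a proper HTML parser.

import re

HREF = re.compile(r'href="([^"]+)"')

def map_fn(source_url, page_content):
    # Emit (target-URL, source-URL) for every outgoing link on the page.
    for target_url in HREF.findall(page_content):
        yield (target_url, source_url)

def reduce_fn(target_url, source_urls):
    # Every page linking to target_url ends up in the same reduce call.
    yield (target_url, list(source_urls))

page = ("http://a.example", '<a href="http://b.example">b</a> <a href="http://c.example">c</a>')
print(list(map_fn(*page)))
# [('http://b.example', 'http://a.example'), ('http://c.example', 'http://a.example')]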

Inverted Index
Inputs: (source-URL, page-content) for each URL
Outputs: (word, list of pages where it appears) for each word that appears in the documents
map(key=source-URL, val=page-content): for each word in page-content, emit a (word, source-URL) pair
reduce(key=word, values=source-URLs): emit a (word, list of source-URLs) pair
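A matching sketch of the inverted index. It is structurally identical to word count, except that the emitted value is the page's URL; whitespace tokenization is again a simplifying assumption.

def map_fn(source_url, page_content):
    # Emit a (word, source-URL) pair for every word on the page.
    for word in page_content.split():
        yield (word, source_url)

def reduce_fn(word, source_urls):
    # Deduplicate and collect every page in which the word appears.
    yield (word, sorted(set(source_urls)))

print(list(map_fn("doc1", "see bob throw")))
# [('see', 'doc1'), ('bob', 'doc1'), ('throw', 'doc1')]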

Map-Reduce Framework Input: a collection of keys and their values. A user-written Map function maps each input (k,v) to a set of intermediate key-value pairs. The framework sorts all key-value pairs by key and compiles one list of intermediate values for each key: (k, [v1,…,vn]). A user-written Reduce function then reduces each list of values to a single value (more or less) for that key.

Parallelization How is this distributed? Map: partition the input key/value pairs into chunks and run map() tasks in parallel. Shuffle: after all map()s are complete, consolidate all emitted values for each unique emitted key. Reduce: partition the space of output map keys and run reduce() in parallel. If map() or reduce() fails, re-execute!
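A single-machine sketch of this parallelization, using a multiprocessing pool to stand in for the cluster and the word-count example as the workload. The chunk size and number of workers are arbitrary illustrative choices, and map_chunk / reduce_group are hypothetical names.

from collections import defaultdict
from multiprocessing import Pool

def map_chunk(chunk):
    # Run the word-count mapper over one chunk of (url, content) inputs.
    pairs = []
    for url, content in chunk:
        for word in content.split():
            pairs.append((word, 1))
    return pairs

def reduce_group(item):
    # Reduce one (word, list-of-counts) group to (word, total).
    word, counts = item
    return (word, sum(counts))

if __name__ == "__main__":
    docs = [("doc1", "see bob throw"), ("doc2", "see david run")]
    chunks = [docs[i:i + 1] for i in range(len(docs))]          # partition the input
    with Pool(processes=2) as pool:
        mapped = pool.map(map_chunk, chunks)                    # map tasks in parallel
        groups = defaultdict(list)                              # shuffle: consolidate by key
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)
        print(sorted(pool.map(reduce_group, groups.items())))   # reduce in parallel
        # [('bob', 1), ('david', 1), ('run', 1), ('see', 2), ('throw', 1)]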

Shuffle [Diagram: the outputs of the map tasks are shuffled, i.e. partitioned by key, to the reduce tasks]

Another example In a credit card company:
Inputs: credit-card number -> purchase (cost, date, etc.) for each purchase
Outputs: the average purchase cost per month per card: (card, month, avg-cost)
map(key=card, val=purchase): emit ({card, purchase.month}, purchase.cost)
reduce(key={card, month}, values=costs): compute the average cost and emit (card, month, average)
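A sketch of the per-card, per-month average. Representing a purchase as a dict with 'month' and 'cost' fields is an assumption made for illustration; the slide does not fix a representation.

def map_fn(card, purchase):
    # Key by (card, month) so each reduce call sees one card-month's costs.
    yield ((card, purchase["month"]), purchase["cost"])

def reduce_fn(key, costs):
    card, month = key
    yield (card, month, sum(costs) / len(costs))

purchases = [("1234", {"month": "2016-01", "cost": 30.0}),
             ("1234", {"month": "2016-01", "cost": 10.0})]
pairs = [p for card, purchase in purchases for p in map_fn(card, purchase)]
print(list(reduce_fn(pairs[0][0], [cost for _, cost in pairs])))
# [('1234', '2016-01', 20.0)]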

Implementation A program forks a master process and many worker processes. Input is partitioned into some number of splits. Worker processes are assigned either to perform map on a split or reduce for some set of intermediate keys.

Distributed Execution Overview [Diagram: the user program forks a master and worker processes; the master assigns map and reduce tasks to workers; map workers read input splits (Split 0, 1, 2) and write intermediate results to local disk; reduce workers perform remote reads and sorting and write the output files (File 0, File 1)]

Distributed Execution (2) [Diagram: three map tasks feeding two reduce tasks]

Responsibility of the Master Assign map and reduce tasks to workers Check that no worker has died (because its processor failed) Communicate results of map to the reduce tasks

Locality The master program divides up tasks based on the location of the data: it tries to schedule a map task on the same machine as the physical file data, or at least on the same rack. Map task inputs are divided into 64-128 MB blocks (the same size as Google File System chunks).

Communication from Map to Reduce Select a number R of reduce tasks Divide the intermediate keys into R groups, e.g. by hashing Each map task creates, at its own processor, R files of intermediate key/value pairs, one for each reduce task
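A sketch of the usual hash partitioning of intermediate keys into R groups (the partition function name is illustrative). Because the hash is stable, every map task sends a given key to the same reduce task.

import hashlib

def partition(key, R):
    # Stable hash so all map tasks agree on which reduce task owns a key.
    digest = hashlib.md5(str(key).encode("utf-8")).hexdigest()
    return int(digest, 16) % R

R = 4
buckets = {r: [] for r in range(R)}
for key, value in [("see", 1), ("bob", 1), ("see", 1), ("run", 1)]:
    buckets[partition(key, R)].append((key, value))
# Each bucket would be written to a separate intermediate file, one per reduce task.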

Fault Tolerance Master detects worker failures Re-executes failed map tasks Re-executes failed reduce tasks Master notices particular input key/values cause crashes in map, and skips those values on re-execution. Effect: Can work around bugs in third-party libraries

Stragglers No reduce can start until the map phase is complete: a single slow disk controller can rate-limit the whole process. The master redundantly executes “slow” map tasks and uses the results of the first correctly-terminating replica.

Task Granularity & Pipelining Fine granularity tasks: #map tasks >> #machines Minimizes time for fault recovery Can pipeline shuffling with map execution Hides latency of I/O wait time Better dynamic load balancing Often use 200,000 map & 5000 reduce tasks Running on 2000 machines

Distributed Sort (1) Within each reduce partition, the framework guarantees that the reduce task processes the keys in sorted order. Note: no order is guaranteed over the values of each key. How do we implement a distributed sort of all the data?

Distributed Sort (2)
Inputs: (document-name, document-content) for each document
Outputs: a list of document names ordered by the first word in the document
map(key=document-name, val=document-content): emit a (word, document-name) pair, where word is the first word in the document
reduce(key=word, values=document-names): for each name in document-names, emit name
Something is missing... the shuffle phase must allow a distributed sort!
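One common way to fill the gap is range partitioning in the shuffle: keys are assigned to reduce tasks by key range, so concatenating the sorted outputs of reduce task 0, 1, …, R-1 yields globally sorted data. The sketch below illustrates that idea; the boundary values are arbitrary.

def range_partition(key, boundaries):
    # boundaries are split points, e.g. ["g", "p"] for R = 3 partitions:
    # keys < "g" go to partition 0, "g" <= key < "p" to 1, the rest to 2.
    for i, boundary in enumerate(boundaries):
        if key < boundary:
            return i
    return len(boundaries)

boundaries = ["g", "p"]
for word in ["apple", "zebra", "house", "goat"]:
    print(word, "->", range_partition(word, boundaries))
# apple -> 0, zebra -> 2, house -> 1, goat -> 1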

Other Refinements Compression of intermediate data: useful for saving network bandwidth. Combiner: perform the reduce on the local map results before the shuffle.
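A sketch of a combiner for the word-count example: the same summing logic as the reducer, applied to one map task's local output before anything crosses the network (the function name is illustrative).

from collections import defaultdict

def combine(local_pairs):
    # Pre-aggregate the (word, 1) pairs produced by a single map task.
    partial = defaultdict(int)
    for word, count in local_pairs:
        partial[word] += count
    return list(partial.items())

map_output = [("see", 1), ("bob", 1), ("see", 1)]
print(combine(map_output))
# [('see', 2), ('bob', 1)] -- only 2 pairs are shuffled instead of 3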

Yet another example Distributed single-source shortest path
Input: (u, list-of-neighbours) for each node in the graph, and s, the source node
Output: (v, minimal-distance) for each node
map(key=u, val={u-distance, neighbours}): for each v in neighbours, emit a (v, u-distance + 1) pair
reduce(key=u, values=distances): emit a (u, min(distances)) pair
One MapReduce job relaxes distances by one hop; the job is repeated until the distances stop changing.
“Graph Algorithms with MapReduce”: http://www.umiacs.umd.edu/~jimmylin/cloud-computing/Session5.ppt
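A single-iteration Python sketch, assuming unit edge weights and that each node carries its current best distance alongside its adjacency list (float('inf') for unreached nodes). Re-emitting a node's own distance is an added detail the slide omits, needed so that min() does not lose it.

import math

def map_fn(u, val):
    dist, neighbours = val
    yield (u, dist)                  # re-emit u's own distance so it is preserved
    if dist < math.inf:
        for v in neighbours:
            yield (v, dist + 1)      # relax each outgoing edge by one hop (unit weights)

def reduce_fn(u, distances):
    yield (u, min(distances))

# One iteration over a tiny graph with source "s":
graph = {"s": (0, ["a", "b"]), "a": (math.inf, ["b"]), "b": (math.inf, [])}
grouped = {}
for u, val in graph.items():
    for node, d in map_fn(u, val):
        grouped.setdefault(node, []).append(d)
print(dict(pair for node, ds in grouped.items() for pair in reduce_fn(node, ds)))
# {'s': 0, 'a': 1, 'b': 1} after one iteration; repeating propagates distances one more hop each time.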

Hadoop An open-source implementation of Map-Reduce in Java Download from: http://apache.org/hadoop More about Hadoop in the tutorials...