Presentation transcript:

Leroy Garcia

What is MapReduce?
 A patented programming model developed by Google, derived from LISP and other forms of functional programming
 Used for processing and generating large data sets
 Exploits large clusters of commodity computers
 Executes the computation in a distributed manner
 Easy to use: the messy details of parallelization and distribution are hidden from the programmer

Implementation at Google
 Machines with multiple processors
 Commodity networking hardware
 Clusters of hundreds or thousands of machines
 IDE disks used for storage
 Input data managed by the Google File System (GFS)
 Users submit jobs to a scheduling system

Introduction  How does MapReduce work?

Overview
 Programming Model
 Implementation
 Refinements
 Performance
 Related Topics
 Conclusion

Programming Model

 Map  Input: key/value pair  Key: ex. Document Name  Value: ex. Document Contents  Output:  Set of Intermediate key/values

Programming Model  Reduce  Input: Intermediate key, values  Key: ex. A Word  Values: Values  Output  List of Values or a Single Value

[Diagram: map tasks emit intermediate pairs; a partitioning function routes each key to one of the reduce tasks]

Execution
[Diagram: input key/value pairs (k1:v, k2:v, ...) flow through map tasks (M); a group-by-key step collects the intermediate pairs into groups such as k1:v,v,v,v, k2:v, k3:v,v, k4:v,v,v, k5:v; each group is consumed by a reduce task (R)]

Parallel Execution
[Diagram: the same pipeline split across machines — map tasks 1–3 run in parallel, each output is sorted and grouped, and a partition function assigns each key group to one of the reduce tasks]

The Map Step
[Diagram: map is applied to each input key/value pair, producing zero or more intermediate key/value pairs]

The Reduce Step
[Diagram: intermediate key/value pairs are grouped by key into key/value-list groups; reduce turns each group into output key/value pairs]

Word Count
[Diagram: map emits a count per document, e.g. {Girl,3}, {Boy,34}, ...; the pairs are grouped by key, and reduce sums each group to produce totals such as {Boy,85} and {Girl,43}]
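A toy single-process driver shows the whole flow — map, group by key, reduce — reusing the map_fn/reduce_fn sketched above. This only simulates the execution diagrams; a real MapReduce run distributes these phases across machines.

```python
from collections import defaultdict

def run_job(documents):
    """Run map, group-by-key, and reduce phases in one process."""
    # Map phase: apply map_fn to every (name, contents) pair.
    intermediate = defaultdict(list)
    for name, contents in documents.items():
        for key, value in map_fn(name, contents):
            intermediate[key].append(value)  # group by key as we go
    # Reduce phase: apply reduce_fn to each key group.
    return {key: reduce_fn(key, values) for key, values in intermediate.items()}

print(run_job({"doc1": "boy girl boy", "doc2": "girl boy"}))
# {'boy': 3, 'girl': 2}
```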

Examples (a sketch of the first one follows)
 Distributed Grep
 Count of URL Access Frequency
 Reverse Web-Link Graph
 Term-Vector per Host
 Inverted Index
 Distributed Sort
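Distributed grep fits the model with almost no code: map emits a line whenever it matches the pattern, and reduce is the identity function that copies the intermediate data to the output. A hedged Python sketch, reusing the conventions above (the pattern and names are illustrative):

```python
import re

PATTERN = re.compile(r"error")  # illustrative pattern, not from the slides

# map: (file name, file contents) -> (file name, matching line) pairs
def grep_map(key: str, value: str):
    for line in value.splitlines():
        if PATTERN.search(line):
            yield (key, line)

# reduce: identity — just pass the matching lines through to the output
def grep_reduce(key: str, values):
    return list(values)
```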

Practical Examples
 Large-scale PDF generation
 Artificial intelligence
 Statistical data analysis
 Geographical data

Large-Scale PDF Generation
The New York Times needed to generate PDF files for 11,000,000 articles from its archive, in the form of images scanned from the original paper. Each article is composed of numerous TIFF images which must be scaled and glued together. The code for generating a single PDF is relatively straightforward; the challenge is running it at scale.

Artificial Intelligence
 Compute statistics over many samples, relying on the Central Limit Theorem
 N voting nodes each cast a vote (map)
 Tally the votes and take action (reduce)

Statistical Analysis (charts from stockcharts.com)
 Statistical analysis of a current stock against historical data
 Each node (map) computes a similarity score and ROI
 Tally the votes (reduce) to generate an expected ROI and standard deviation

Geographical Data
 Large data sets including road, intersection, and feature data
 Problems that Google Maps has used MapReduce to solve:
 Locating roads connected to a given intersection
 Rendering map tiles
 Finding the nearest feature to a given address or location

Geographical Data: Worked Example
 Input: a graph describing the road network, with all gas stations marked
 Map: search a five-mile radius around each gas station and emit the distance to each node reached
 Sort: sort the emitted pairs by key (node)
 Reduce: for each node, emit the gas station (and path) with the shortest distance
 Output: a graph in which each node is marked with its nearest gas station
A sketch of the map and reduce steps appears below.
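A minimal Python sketch of those map and reduce steps, assuming a helper nodes_within_radius(...) that performs the radius search over the graph (that helper is hypothetical — the slides don't specify one):

```python
# map: (gas station id, graph) -> (node id, (distance, station id)) for nearby nodes
def station_map(station, graph):
    # nodes_within_radius is an assumed helper: yields (node, distance)
    # pairs for every node within five miles of the station.
    for node, distance in nodes_within_radius(graph, station, miles=5):
        yield (node, (distance, station))

# reduce: (node id, candidate (distance, station) pairs) -> nearest station
def nearest_reduce(node, candidates):
    distance, station = min(candidates)  # smallest distance wins
    return (node, station)
```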

Implementation

Map/Reduce Walkthrough
 Map (functional programming): apply a function to each element of a list
 Mapper: the node that applies the function to one element of the set
 Reduce (functional programming): iterate a function across a list, accumulating a result
 Reducer: the node that reduces across all the like-keyed elements
The functional-programming roots are visible in the sketch below.
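The LISP heritage mentioned earlier is easy to see in Python's own functional primitives. This block only illustrates the borrowed vocabulary, not how the distributed system itself works:

```python
from functools import reduce

words = ["boy", "girl", "boy"]

# map: apply a function to each element of a list
pairs = list(map(lambda w: (w, 1), words))   # [('boy', 1), ('girl', 1), ('boy', 1)]

# reduce: iterate a function across a list, accumulating a result
total = reduce(lambda acc, pair: acc + pair[1], pairs, 0)  # 3
```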

Execution Overview
1. The MapReduce library splits the input files.
2. It starts up copies of the program on the cluster. One copy becomes the master; the master assigns map and reduce responsibilities to the workers.
3. A map worker reads its split, parses key/value pairs out of the input data, and passes each pair to the user-defined Map function.
4. Buffered pairs are written to local disk, partitioned into regions by the partitioning function. The locations of these buffered pairs on the local disk are passed back to the master, which is responsible for forwarding them to the reduce workers.
5. The master gives the locations of the buffered pairs to a reduce worker, which reads them and sorts the intermediate keys.
6. The reduce worker iterates over the sorted intermediate data and, for each unique intermediate key, passes the key and the corresponding set of intermediate values to the user's Reduce function. The output of the Reduce function is appended to a final output file for this reduce partition.
7. When all map tasks and reduce tasks have been completed, the master wakes up the user program.

Distributed Execution Overview
[Diagram: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits 0–2 and write intermediate data to local disk; reduce workers perform remote reads and sorts, then write output files 0 and 1]

Fault Tolerance
 Worker failure
 Master failure
 Dealing with stragglers
 Locality
 Task granularity
 Skipping bad records

Worker Failure
[Diagram: the master periodically pings each worker. When worker B fails, its completed map task 2 is reset to idle and re-executed on worker C; when a reduce worker fails, its in-progress reduce task 1 is reset to idle and rescheduled on another worker]

Master Failure
[Diagram: the master writes periodic checkpoints (123, 124, 125). If the master fails, a new master is started from the last checkpoint (125)]

Dealing with Stragglers
 Straggler: a machine in the cluster that is running significantly slower than the rest
 When the job is close to completion, the master schedules backup copies of the remaining in-progress tasks on good machines; a task is marked complete as soon as either the original or the copy finishes
[Diagram: a straggler's map task and its backup copy on a good machine racing to the finish line]

Locality
 Network bandwidth is conserved by reading input data that is stored locally
 GFS divides files into 64 MB blocks
 It stores 3 copies of each block on different machines
 The master finds the replicas of the input data and schedules map tasks accordingly
 Map tasks are scheduled so that a replica of the input block is on the same machine, or at least on the same rack

Task Granularity
Many fine-grained tasks — far more tasks than machines:
 Minimizes time for fault recovery
 Can pipeline shuffling with map execution
 Better dynamic load balancing
 A common configuration: 200,000 map tasks and 5,000 reduce tasks on 2,000 machines

Refinements

Partitioning Function
Users of MapReduce specify the number of reduce tasks/output files (R) that they desire. Data is partitioned across these tasks using a partitioning function on the intermediate key; the default is hash(key) mod R. Users can supply a special partitioning function, e.g. hash(Hostname(urlkey)) mod R, so that all URLs from the same host end up in the same output file.
Ordering Guarantee
Within a partition, intermediate keys are processed in increasing key order, which generates sorted output per partition.
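A sketch of that hostname partitioner in Python, under the assumption that the intermediate keys are URL strings (the value of R and the hash choice are illustrative):

```python
from urllib.parse import urlparse

R = 8  # number of reduce tasks, chosen by the user

def default_partition(key: str) -> int:
    # Default scheme from the paper: hash(key) mod R.
    return hash(key) % R

def hostname_partition(url_key: str) -> int:
    # Special scheme: hash(Hostname(urlkey)) mod R, so every URL from
    # the same host lands in the same reduce partition / output file.
    return hash(urlparse(url_key).netloc) % R
```

In a real cluster the partitioner must be deterministic across machines, so a stable hash (e.g. one built on hashlib) would replace Python's built-in hash(), which is salted per process.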

Combiner Function (Optional)
Used by the map task when there is significant repetition in the intermediate keys it produces: the combiner partially merges the intermediate data on the map worker before it is sent over the network.
[Diagram: on a map worker, a text document passes through the map function, and the combiner function collapses repeated (Girls, 1) pairs into partial counts such as (Girls, 6)]
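For word count, the combiner can be the same logic as the reducer, because addition is associative and commutative. A sketch, continuing the earlier conventions:

```python
from collections import defaultdict

# Combiner: runs on the map worker, collapsing repeated keys before the
# intermediate data crosses the network.
def combine(pairs):
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())

combine([("Girls", 1), ("Girls", 1), ("Girls", 1)])  # [('Girls', 3)]
```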

Input and Output Types
 Input: the library supports reading data in several formats; a new input type can be supported with a simple implementation of a reader interface — e.g. records read from a database, or data structures mapped in memory
 Output: user code can likewise add support for new output types
A possible shape for such a reader interface is sketched below.
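The slides don't spell out the reader interface, so the following is only a plausible Python shape for it — an object that yields key/value records, which is all the map phase needs:

```python
from typing import Iterator, Tuple, Protocol

class Reader(Protocol):
    """Assumed reader interface: any input source that yields key/value pairs."""
    def records(self) -> Iterator[Tuple[str, str]]: ...

class LineFileReader:
    """Example input type: each line of a text file becomes one record."""
    def __init__(self, path: str):
        self.path = path

    def records(self) -> Iterator[Tuple[str, str]]:
        with open(self.path) as f:
            for offset, line in enumerate(f):
                yield (str(offset), line.rstrip("\n"))
```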

Skipping Bad Records
 Map/Reduce functions sometimes fail deterministically on particular inputs
 The best solution is to debug and fix the bug, but that is not always possible (e.g. the bug is in a third-party library)
 On a segmentation fault, the worker's signal handler sends a UDP packet to the master, including the sequence number of the record being processed
 If the master sees two failures for the same record, the next worker is told to skip that record
A simplified sketch of the skipping logic appears below.
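A much-simplified single-process sketch of that protocol, where a Python exception stands in for the segmentation-fault signal and local variables stand in for the master's bookkeeping (all names are illustrative):

```python
failures = {}   # master-side bookkeeping: record sequence number -> failure count
skip = set()    # records the master has told workers to skip

def run_map_with_skipping(records, map_fn):
    results = []
    for seq, (key, value) in enumerate(records):
        if seq in skip:
            continue                       # master said: skip this record
        try:
            results.extend(map_fn(key, value))
        except Exception:                  # stand-in for the segfault signal
            failures[seq] = failures.get(seq, 0) + 1
            if failures[seq] >= 2:         # two failures for the same record
                skip.add(seq)
            raise                          # task fails; it will be re-executed
    return results
```

In the real system the failing task dies and the master re-executes it on another worker; in this sketch the caller would catch the exception and simply call the function again.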

Status Pages

Performance
Tests run on a cluster of 1800 machines, each with:
 4 GB of memory
 Dual-processor 2 GHz Xeons with Hyper-Threading
 Dual 160 GB IDE disks
 Gigabit Ethernet per machine
 Bisection bandwidth of approximately 100 Gbps
Two benchmarks:
 MR_Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
 MR_Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)

MR_Grep
 The locality optimization helps: 1800 machines read 1 TB of data at a peak of ~31 GB/s
 Without it, rack switches would limit reads to 10 GB/s
[Figure: input scan rate over time]

MR_Sort
[Figure: data transfer rates over time for three executions — normal, with backup tasks disabled, and with 200 worker processes killed mid-run. Disabling backup tasks makes the job markedly slower because of stragglers; killing 200 processes adds only a small slowdown because failed tasks are re-executed]

Related Topics

Other Notable Implementations of MapReduce
 Hadoop – open-source implementation of MapReduce
 HDFS – the primary storage system used by Hadoop applications; it creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations
 Amazon Elastic Compute Cloud (EC2) – virtualized computing environment designed for use with other Amazon services (especially S3)
 Amazon Simple Storage Service (S3) – scalable, inexpensive internet storage which can store and retrieve any amount of data at any time from anywhere on the web
 – an asynchronous, decentralized system which aims to reduce scaling bottlenecks and single points of failure

Conclusion
 MapReduce has proven to be a useful abstraction
 It greatly simplifies large-scale computations at Google
 It handles machine failures transparently
 It allows users to focus on the problem, without having to deal with the complicated machinery behind the scenes

Questions?